## Inference in Deep Learning

Many new generative methods have been developed in recent years:

• denoising autoencoders
• generative stochastic networks
• variational autoencoders
• importance weighted autoencoders
• generative adversarial networks
• infusion training
• variational walkback
• stacked generative adversarial networks
• generative latent optimization
• deep learning through the use of non-equilibrium thermodynamics

# Deep Models

We can’t delve into the details of the older workhorse models, but let us summarize a few of them nevertheless.

A Boltzmann machine can be seen as a stochastic generalization of a Hopfield network. In its unrestricted form, Hebbian learning is often used to learn representations.

A restricted Boltzmann machine, or Harmonium, restricts a Boltzmann machine in the sense that the neurons have to form a bipartite graph. Neurons in one “group” are allowed connections to the other group, and the other way around, but they are not allowed to be connected to neurons in the same group. This restriction naturally, but not necessarily, leads to structures that resemble layers.
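The practical consequence of the bipartite restriction can be checked numerically. The sketch below assumes binary units and the usual energy $E(v,h) = -v^\top W h - b^\top v - c^\top h$ (the symbols `W`, `b`, `c` and the function names are illustrative choices, not taken from the text above). Because there are no hidden-to-hidden couplings, the conditional $p(h_j = 1 \mid v)$ obtained by brute-force enumeration should match a simple per-unit sigmoid.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """Energy of a joint configuration; only visible-hidden couplings W appear."""
    return -(v @ W @ h + b @ v + c @ h)

def hidden_conditional_bruteforce(v, W, b, c):
    """p(h_j = 1 | v), computed by enumerating every hidden configuration."""
    n_hidden = W.shape[1]
    configs = np.array(list(itertools.product([0.0, 1.0], repeat=n_hidden)))
    weights = np.array([np.exp(-energy(v, h, W, b, c)) for h in configs])
    weights /= weights.sum()          # normalize over hidden configurations
    return weights @ configs          # marginal probability that each h_j is 1

def hidden_conditional_factorized(v, W, c):
    """Same quantity, using the fact that hidden units are independent given v."""
    return 1.0 / (1.0 + np.exp(-(v @ W + c)))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # 4 visible units, 3 hidden units
b = rng.normal(size=4)
c = rng.normal(size=3)
v = np.array([1.0, 0.0, 1.0, 1.0])

print(hidden_conditional_bruteforce(v, W, b, c))
print(hidden_conditional_factorized(v, W, c))
```

The two printed vectors agree, which is what makes block-wise sampling of a whole group at once possible.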

Deep belief networks and deep Boltzmann machines have multiple (hidden) layers that are connected to each other in the restricted sense described above. These models are basically stacks of restricted Boltzmann machines, although this is only true in a handwaving manner. A deep belief network is not a true Boltzmann machine because its lower layers form a directed generative model. Salakhutdinov and Hinton (pdf) spell out the differences in detail.

# Markov Chain Monte Carlo (MCMC)

Restricted Boltzmann Machines, Deep Belief Networks, and Deep Boltzmann Machines were trained by MCMC methods. MCMC is used to estimate the gradient of the log-likelihood (see post on contrastive divergence). MCMC has particular difficulty in mixing between modes.
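As a rough illustration of how a short Markov chain enters the gradient, here is a minimal sketch of a contrastive-divergence estimate for a binary restricted Boltzmann machine. It assumes the standard form "data statistics minus chain statistics" for the weight gradient; the function name `cd_k_gradient` and its parameters are hypothetical, and only the weight gradient is returned to keep the example short.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_k_gradient(v0, W, b, c, k=1, rng=None):
    """Approximate d log p / dW with k steps of block Gibbs sampling.

    v0: (batch, n_visible) binary data, W: (n_visible, n_hidden),
    b: visible biases, c: hidden biases.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    ph0 = sigmoid(v0 @ W + c)                 # hidden probabilities given the data
    v, ph = v0, ph0
    for _ in range(k):
        h = (rng.random(ph.shape) < ph).astype(float)   # sample hidden units
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)   # sample visible units
        ph = sigmoid(v @ W + c)
    positive = v0.T @ ph0                     # statistics under the data
    negative = v.T @ ph                       # statistics after k chain steps
    return (positive - negative) / v0.shape[0]
```

Because the chain starts at the data and runs for only a few steps, it rarely crosses between distant modes, which is one way the mixing difficulty shows up in practice.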

# Autoencoder

An autoencoder has an input layer, one or more hidden layers, and an output layer. If the hidden layer has fewer nodes than the input layer, it is a dimension reduction technique. Given a particular input, the hidden layer represents only particular abstractions that are subsequently enriched so that the output corresponds to the original input. Another dimension reduction technique is, for example, principal component analysis, which has some additional constraints such as linearity of the nodes. Given its shape, an autoencoder can also be called a bottleneck or sandglass network.

Let the encoder be $F: X \rightarrow H$ and the decoder $G: H \rightarrow X$. Applying their composition to an individual $x$ gives the reconstruction $x' = (G \circ F)x$, and we can define the autoencoder as:

$$\{F, G\} = \arg \min_{F,G} L(x, x')$$

Here we choose an L2 norm for the reconstruction error: $L(x,x') = \| x-x' \|^2$.

An autoencoder is typically trained using a variant of backpropagation (conjugate gradient method, steepest descent). It is also possible to use so-called pre-training: train each pair of subsequent layers as a restricted Boltzmann machine and then use backpropagation for fine-tuning.
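To make the training loop concrete, here is a minimal sketch of a bottleneck autoencoder trained by plain steepest descent on the squared reconstruction error. It assumes a purely linear encoder and decoder (so the gradients can be written by hand) and synthetic data; the matrix names `We` and `Wd`, the learning rate, and the step count are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 5 dimensions that lie close to a 2-dimensional subspace.
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

n_in, n_hidden = 5, 2
We = 0.1 * rng.normal(size=(n_in, n_hidden))   # encoder F (linear)
Wd = 0.1 * rng.normal(size=(n_hidden, n_in))   # decoder G (linear)

def loss(X, We, Wd):
    """Mean squared reconstruction error ||x - x'||^2 over the data set."""
    return np.mean(np.sum((X - X @ We @ Wd) ** 2, axis=1))

initial = loss(X, We, Wd)
lr = 0.01
for _ in range(2000):
    H = X @ We                    # hidden code
    R = H @ Wd - X                # residual x' - x
    grad_Wd = 2 * H.T @ R / len(X)
    grad_We = 2 * X.T @ (R @ Wd.T) / len(X)
    We -= lr * grad_We
    Wd -= lr * grad_Wd
final = loss(X, We, Wd)

print(initial, final)
```

With linear units the learned code spans the same kind of subspace that principal component analysis would find; nonlinear activations are what let an autoencoder go beyond that.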

# Denoising Autoencoders

A denoising autoencoder (DAE) is a regular autoencoder with the input signal corrupted by noise on purpose: $\tilde{x} = B(x)$. This forces the autoencoder to be resilient against missing or corrupted values in the input.

The reconstruction error is again measured by $L(x,x') = \| x - x'\|^2$, but now $x'$ is formed from a distorted version of the original $x$, denoted by $\tilde{x}$, hence $x' = (G \circ F) \tilde{x}$.
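The only change with respect to a regular autoencoder is where the corruption enters. A minimal sketch, assuming masking noise for $B$ (one of several possible corruption processes) and treating `encode` and `decode` as hypothetical callables standing in for $F$ and $G$:

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """B(x): masking noise that sets each entry to zero with probability drop_prob."""
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

def denoising_loss(x, encode, decode, drop_prob, rng):
    """Squared error between the clean input and the reconstruction of its corrupted version."""
    x_tilde = corrupt(x, drop_prob, rng)
    x_prime = decode(encode(x_tilde))
    return np.sum((x - x_prime) ** 2)   # compared with the clean x, not with x_tilde
```

Note that the target of the reconstruction stays the clean $x$; comparing against $\tilde{x}$ instead would simply train an ordinary autoencoder on noisy data.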

Note that a denoising autoencoder can be seen as a stochastic transition operator from input space to input space. In other words, if some input is given, it will generate something “nearby” in some abstract sense. An autoencoder is typically started from or very close to the training data. The goal is to get an equilibrium distribution that contains all the modes. It is therefore important that the autoencoder mixes properly between the different modes, including modes that are “far” away.

# Variational Autoencoders

The post by Miriam Shiffman is a nice introduction to variational autoencoders. They were introduced by Kingma and Welling (2014) and Rezende et al. (2014). The main difference with a regular autoencoder is that the hidden representation $h$ is now a full-fledged random variable, often Gaussian.

A variational autoencoder can be seen as a (bottom-up) recognition model and a (top-down) generative model. The recognition model maps observations to latent variables. The generative model maps latent variables to observations. In an autoencoder setup the generated observations should be similar to the real observations that go into the recognition model. Both models are trained simultaneously. The latent variables are constrained in such a way that a representation is found that is approximately factorial.
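Two ingredients make this concrete: drawing the latent variable in a differentiable way, and the constraint that pulls it towards a factorial prior. The sketch below assumes a diagonal Gaussian recognition model with parameters `mu` and `log_var` and a standard normal prior $N(0, I)$; both are common choices but are assumptions here, as are the function names.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so z is a smooth function of mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])
log_var = np.zeros(2)
print(reparameterize(mu, log_var, rng))
print(kl_to_standard_normal(mu, log_var))
```

The KL term is zero only when the recognition model outputs exactly the prior; because the prior factorizes over dimensions, this penalty is what nudges the code towards an approximately factorial representation.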

# Helmholtz Machine

A Helmholtz machine is a probabilistic model similar to the variational autoencoder. It is trained by the so-called wake-sleep algorithm (similar to expectation-maximization).

# Importance weighted Autoencoders

The importance weighted autoencoder (Burda et al., 2015) is similar to the variational autoencoder, but it uses a tighter log-likelihood lower bound by applying importance weighting. The main difference is that the recognition model uses multiple samples (to approximate the posterior distribution over latent variables given the observations). In other words, the recognition model is run a few times and the suggested latent variables are combined to get a better estimate. The model gives more weight to the recognition model than the generative model.
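The effect of using several samples can be seen on a toy problem where the exact log-likelihood is known. The sketch below assumes a hypothetical model $z \sim N(0,1)$, $x \mid z \sim N(z,1)$, so that $p(x) = N(0, 2)$, and deliberately uses a poor recognition model $q(z \mid x) = N(0,1)$. The bound with $k$ samples is $E[\log \frac{1}{k} \sum_i p(x, z_i) / q(z_i \mid x)]$; with $k = 1$ it reduces to the usual variational lower bound.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def iw_bound(x, k, n_rep, rng, q_mean=0.0, q_var=1.0):
    """Monte Carlo estimate of the k-sample importance weighted bound."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal((n_rep, k))
    log_w = (log_normal(z, 0.0, 1.0)          # log p(z)
             + log_normal(x, z, 1.0)          # log p(x | z)
             - log_normal(z, q_mean, q_var))  # log q(z | x)
    m = log_w.max(axis=1, keepdims=True)      # log-sum-exp trick for stability
    log_mean_w = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
    return log_mean_w.mean()

rng = np.random.default_rng(0)
x = 1.5
exact = log_normal(x, 0.0, 2.0)               # true log p(x) for this toy model
elbo = iw_bound(x, k=1, n_rep=20000, rng=rng)
iw50 = iw_bound(x, k=50, n_rep=20000, rng=rng)
print(elbo, iw50, exact)
```

The single-sample bound sits well below the exact value, while the 50-sample bound is much closer to it, even though the recognition model itself was not improved.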

# Generative Adversarial Networks

Generative Adversarial Networks (Goodfellow et al., 2014) use two networks: a generative model $G$ and a discriminative model $D$. The generative model maps latent variables $z$ to data points $x' = G(z)$. The discriminator has to choose between true data $x$ and fake data $x'$, so $D(x)$ should have a large value and $D(x')$ a small value. With the generator fixed, the discriminator maximizes:

$$\max_D \; \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]$$

The generator in contrast maximizes:

$$\max_G \; \mathbb{E}_{z}[\log D(G(z))]$$

It is clearly visualized by Mark Chang’s slide.
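The opposing goals of the two networks can also be evaluated numerically. The sketch below assumes the commonly used objectives: $E[\log D(x)] + E[\log(1 - D(x'))]$ for the discriminator and the non-saturating $E[\log D(x')]$ for the generator. The inputs `d_real` and `d_fake` are hypothetical arrays of discriminator outputs in $(0, 1)$ rather than outputs of an actual network.

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    """E[log D(x)] + E[log(1 - D(x'))]; the discriminator wants this to be large."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def generator_objective(d_fake):
    """E[log D(x')]; the generator wants the discriminator to rate its samples as real."""
    return np.mean(np.log(d_fake))

# A confident discriminator: real data scored 0.9, generated data scored 0.1.
confident_real, confident_fake = np.full(10, 0.9), np.full(10, 0.1)
# A fooled discriminator: everything scored 0.5.
fooled = np.full(10, 0.5)

print(discriminator_objective(confident_real, confident_fake), generator_objective(confident_fake))
print(discriminator_objective(fooled, fooled), generator_objective(fooled))
```

When the discriminator is confident its objective is high and the generator's is low; when it is fooled, the situation reverses and the discriminator objective drops to $-\log 4$.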

An adversarial autoencoder (Makhzani et al., 2016) is an autoencoder that uses generative adversarial networks. The latent variables (the code) are matched with a prior distribution. This prior distribution can be anything. The autoencoder subsequently maps this to the data distribution.

Note that the use of the adversarial network is on the level of the hidden variables. The discriminator attempts to distinguish “true” from “fake” hidden variables.

This immediately raises the following question: can we generate fake data as well? If one discriminator has the goal of distinguishing true from fake hidden variables, the other can have the goal of distinguishing true from fake data. We should take provisions so that the former discriminator is not punished for a badly performing second discriminator.

# Deep Learning Through The Use Of Non-Equilibrium Thermodynamics

The non-equilibrium thermodynamics approach (Sohl-Dickstein et al., 2015) slowly destroys structure in a data distribution through a diffusion process. Then a reverse diffusion process is learned that restores the structure in the data.

Both processes are factorial Gaussians: the forward process, $p(x^{t}\mid x^{t-1})$, and the inverse process, $p(x^{t-1}\mid x^{t})$.

To have an exact inverse diffusion the chain requires thousands of small steps.
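The forward (destroying) half of this idea is easy to simulate. The sketch below assumes a Gaussian kernel of the form $x^{t} = \sqrt{1 - \beta}\, x^{t-1} + \sqrt{\beta}\, \epsilon$ with a small constant $\beta$; the value of $\beta$, the number of steps, and the two-cluster toy data are all illustrative assumptions. The learned reverse process is not shown.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Apply x^t = sqrt(1 - beta_t) * x^{t-1} + sqrt(beta_t) * noise for every step."""
    x = x0.copy()
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
# Structured data: two tight clusters around -2 and +2.
x0 = np.where(rng.random(5000) < 0.5, -2.0, 2.0) + 0.05 * rng.standard_normal(5000)
betas = np.full(1000, 0.01)     # many small steps
xT = forward_diffusion(x0, betas, rng)

print(x0.mean(), x0.std())
print(xT.mean(), xT.std())
```

After the full chain the two clusters are gone and the samples look like draws from a unit Gaussian. Each individual step changes the data only slightly, which is why the reverse step can also be modeled as a simple Gaussian.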

# Infusion Training

Infusion training (Bordes et al., 2017) learns a generative model as the transition operator of a Markov chain. When applied multiple times to unstructured random noise, the learned operator will denoise it into a sample that matches the target distribution.

# Variational Walkback

Variational Walkback (Goyal et al., 2017) learns a transition operator as a stochastic recurrent network. It learns those operators which can represent a nonequilibrium stationary distribution (also violating detailed balance) directly. The training objective is a variational one. The chain is allowed to “walk back” and revisit states that were quite “interesting” in the past.

Compared to MCMC we do not have detailed balance, nor an energy function. Incidentally, a detailed balance condition would imply a network with symmetric weights.
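What detailed balance means can be illustrated on a small discrete Markov chain. The sketch below checks the condition $\pi_i P_{ij} = \pi_j P_{ji}$ for two hypothetical three-state transition matrices that share the same (uniform) stationary distribution; the matrices and function names are made up for illustration and are not part of any of the cited models.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P with eigenvalue 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def satisfies_detailed_balance(P, tol=1e-10):
    """Check whether the probability flow pi_i * P_ij is symmetric in i and j."""
    pi = stationary_distribution(P)
    flow = pi[:, None] * P
    return bool(np.allclose(flow, flow.T, atol=tol))

# Symmetric random walk on a 3-cycle: reversible.
P_rev = np.array([[0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0]])
# Walk that prefers one direction: same stationary distribution, but with net circulation.
P_irr = np.array([[0.0, 0.9, 0.1],
                  [0.1, 0.0, 0.9],
                  [0.9, 0.1, 0.0]])

print(satisfies_detailed_balance(P_rev))
print(satisfies_detailed_balance(P_irr))
```

The second chain still has a perfectly valid stationary distribution; it simply reaches it with a persistent probability current, which is the kind of behavior a learned transition operator without symmetric weights is free to exhibit.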

# Nonparametric autoencoders

The latent variables in the standard variational autoencoder are Gaussian and fixed in number. The ideal hidden representation, however, might require a dynamic number of such latent variables. For example, if the neural network has only 8 latent variables in the MNIST task, it has to somehow represent 10 digits with these 8 variables.

To extend the hidden layer from a fixed to a variable number of nodes it is possible to use methods developed in the nonparametric Bayesian literature.

There have already been several developments:

• A stick-breaking variational autoencoder (Nalisnick and Smyth, 2017) where the latent variables are represented by a stick-breaking process (SB-VAE);
• A nested Chinese Restaurant Process as a prior on the latent variables (Goyal et al., 2017);
• An (ordinary) Gaussian mixture as a prior distribution on the latent variables (Dilokthanakul et al., 2017), but see this interesting blog post for a critical review (GMVAE);
• A deep latent Gaussian mixture model (Nalisnick et al., 2016) where a Gaussian mixture is used as the approximate posterior (DLGMM);
• Variational deep embedding uses (again) a mixture of Gaussians as a prior (Jiang et al., 2017) (VaDE);
• Variational autoencoded deep Gaussian Processes (Dai et al., 2016) uses a “chain” of Gaussian Processes to represent multiple layers of latent variables (VAE-DGP).
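To give a feel for how a nonparametric prior yields a variable number of effective components, here is a minimal sketch of the stick-breaking construction mentioned in the first item. It assumes the textbook form $\pi_k = v_k \prod_{j<k} (1 - v_j)$ with fractions drawn from a Beta$(1, \alpha)$ distribution and a finite truncation; how the fractions are parameterized inside a particular autoencoder is not covered here, and the names and the value of $\alpha$ are illustrative.

```python
import numpy as np

def stick_breaking_weights(v):
    """pi_k = v_k * prod_{j<k} (1 - v_j) for break fractions v_k in (0, 1)."""
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))  # stick left before each break
    return v * remaining

rng = np.random.default_rng(0)
alpha = 5.0
v = rng.beta(1.0, alpha, size=20)     # truncated at 20 breaks
pi = stick_breaking_weights(v)

print(np.round(pi, 3))
print(pi.sum())                       # below 1: the rest of the stick is left unbroken
```

The weights are not ordered, but they shrink on average as the remaining stick gets shorter, so only a handful of components carry noticeable mass and the truncation level matters less than it might seem.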

The problem with autoencoders is that they do not necessarily specify how the latent variables are to be used. For example, with InfoGAN (not yet explained) the mutual information between input and latent variables is maximized to make sure that the latter are actually used. This is useful to avoid the “uninformative latent code problem”, where latent features are not actually used in the training. However, with for example the information bottleneck approach, the mutual information between input and latent variables is minimized (under the constraint that the features still predict some labels). This is logical from the perspective of compression. These behaviors can all be seen as members of a so-called information-autoencoding family (Zhao et al., 2017).

It is interesting to study how nonparametric Bayesian methods fare with respect to this family and what role they fulfill in such a constrained optimization problem. Existing models use fixed values for the Lagrangian multipliers (the tradeoffs they make).