## Inference in Deep Learning

There are many, many new generative methods developed in the recent years.

• denoising autoencoders
• generative stochastic networks
• variational autoencoders
• importance weighted autoencoders
• generative adversarial networks
• infusion training
• variational walkback
• stacked generative adversarial networks
• generative latent optimization
• deep learning through the use of non-equilibrium thermodynamics

# Deep Models

We can’t delve into the details of those old workhorse models, but let us summarize a few of them nevertheless.

A Boltzmann machine can be seen as a stochastic generalization of a Hopfield network. In their unrestricted form Hebbian learning is often used to learn representations.

A restricted Boltzmann machine, or Harmonium, restricts a Boltzmann machine in the sense that the neurons have to form a bipartite graph. Neurons in one “group” are allowed connections to another group, and the other way around, but they are not allowed to be connected to neurons in the same group. This restriction naturally, but not necessarily leads to structures that resemble layers.

A deep belief network hand deep Boltzmann machines have multiple (hidden) layers that are each connected to each other in the restricted sense of above. These models are basically stacks of restricted Boltzmann machines. This is by the way only true in a handwaving manner. A deep belief network is not a true Boltzmann machine because its lower layers form a directed generative model. Salakhutdinov and Hinton (pdf) spell out the differences in detail.

# Markov Chain Monte Carlo (MCMC)

Restricted Boltzmann Machines, Deep Belief Networks, and Deep Boltzmann Machines were trained by MCMC methods. MCMC computes the gradient of the log-likelihood (see post on contrastive divergence). MCMC has particular difficulty in mixing between modes.

# Autoencoder

An autoencoder has an input layer, one or more hidden layers, and an output layer. If the hidden layer has fewer nodes than the input layer it is a dimension reduction technique. Given a particular input, the hidden layer represents only particular abstractions that are subsequently enriched so that the output corresponds to the original input. An other dimension reduction technique is for example principle component analysis which has some additional constraints such as linearity of the nodes. Given the shape an autoencoder can also be called a bottleneck or sandglass network.

If we represent the encoder $F: X \rightarrow H$ and the decoder $G: H \rightarrow X$. We apply the individual $x$ to the product as $x' = (G \circ F)x$, then we can define the autoencoder as:

Here we choose for an L2 norm for the reconstruction: $L(x,x') = \| x-x' \|^2$.

An autoencoder is typically trained using a variant of backpropagation (conjugate gradient method, steepest descent). It is possible to use so-called pre-training. Train each two subsequent layers as a restricted Boltzmann machine and use backpropagation for fine-tuning.

A nice blog post at Keras explains also some of the disadvantages of autoencoders, a very clarifying read!

# Denoising Autoencoders

A denoising autoencoder (DAE) is a regular autoencoder with the input signal corrupted by noice (on purpose: $\tilde{x} = B(x)$). This forces the autoencoder to be resilient against missing or corrupted values in the input.

The reconstruction error is again measured by $L(x,x') = \| x - x'\|^2$, but now $x'$ is formed by a distortion of the original $x$, denoted by $\tilde{x}$, hence $x' = (G \circ F) \tilde{x}$.

Note that a denoising autoencoder can be seen as a stochastic transition operator from input space to input space. In other words, if some input is given, it will generate something “nearby” in some abstract sense. An autoencoder is typically started from or very close to the training data. The goal is to get an equilibrium distribution that contains all the modes. It is henceforth important that the autoencoder mixes properly between the different modes, also modes that are “far” away.

# Variational Autoencoders

The post by Miriam Shiffman is a nice introduction to variational autoencoders. They have been designed by (Kingma and Welling, 2014) and (Rezende et al., 2014). The main difference is that $h$ is now a full-fledged random variable, often Gaussian.

A variational autoencoder can be seen as a (bottom-up) recognition model and a (top-down) generative model. The recognition model maps observations to latent variables. The generative model maps latent variables to observations. In an autoencoder setup the generated observations should be similar to the real observations that go into the recognition model. Both models are trained simultanously. The latent variables are constrained in such a way that a representation is found that is approximately factorial.

# Helmholtz Machine

A Helmholtz machine is a probabilistic model similar to the variational autoencoder. It is trained by the so-called sleep-wake algorithm (similar to expectation-maximization).

# Importance weighted Autoencoders

The importance weighted autoencoder (Burda et al., 2015) is similar to the variational autoencoder, but it uses a tighter loglikelihood lower bound through applying importance weighting. The main difference is that the recognition model uses multiple samples (to approximate the posterior distribution over latent variables given the observations). In order words, the recognition model is run a few times and the suggested latent variables are combined to get a better estimate. The model gives more weight to the recognition model than the generative model.

# Generative Adversarial Networks

Generative Adversarial Networks (Goodfellow et al., 2014) use two networks. A generative model $G$ and a discriminative model $D$. The generative model maps latent variables $z$ to data points $x'$. The discriminator has to make a choice between true data $x$ and fake data $x'$. Hereby should $D(x)$ have a large value and $D(x')$ have a small value. The discriminator maximizes (we fix the generator):

The generator in contrast maximizes:

It is clearly visualized by Mark Chang’s slide.

The distribution $p_z(z)$ is a arbitrary noise distribution. In other words, the generator morphs totally random stuff into meaningful $x$. It is like throwing darts randomly into a dart board and the generator folding the board into a hat. Similarly from pure random values we can draw point clouds that have elaborate structure.

The latent variables $z$ are totally random, however there is something else important here. If $z$ is a multidimensional random variable information across all dimensions can be used to construct $x' \leftarrow G(z)$. There is no information about $z$ if we would like to reason back from $x'$. This means that from a representation learning perspective the unconstrained use of $z$ leads to entangled use of it in $G$. InfoGAN introduces an additional mutual information term between a latent code $C$ and generated data $X$.

Adversarial Autoencoders (Makhzani et al., 2016) is an autoencoder that uses generative adversarial networks. The latent variables (the code) are matched with a prior distribution. This prior distribution can be anything. The autoencoder subsequently maps this to the data distribution.

Note that the use of the adversarial network is on the level of the hidden variables. The discriminator attempts to distinguish “true” from “fake” hidden variables.

This immediately rises the following question: Can we also generate fake data as well? If one discriminator has the goal to distinguish true from fake hidden variables, the other can have as goal to distinguish true from fake data. We should take provisions to not have the former discriminator punished by a bad performing second discriminator.

# Deep Learning Through The Use Of Non-Equilibrium Thermodynamics

Non-equilibrium Thermodynamics (Sohl-Dickstein et al., 2015) slowly destroys structure in a data distribution through a diffusion process. Then a reverse diffusion process is learned that restores the structure in the ata.

Both processes are factorial Gaussians, the forward process, $p(x^{t}\mid p(x^{t-1})$ and the inverse process, $p(x^{t-1}\mid p(x^t)$.

To have an exact inverse diffusion the chain requires thousands of small steps.

# Infusion Training

Infusion training (Bordes et al., 2017) learns a generative model as the transition operator of a Markov chain. When applied multiple times on unstructured random noise, infusion training will denoise it into a sample that matches the target distribution.

# Variational Walkback

Variational Walkback (Goyal et al., 2017) learns a transition operator as a stochastic recurrent network. It learns those operators which can represent a nonequilibrium stationary distribution (also violating detailed balance) directly. The training objective is a variational one. The chain is allowed to “walk back” and revisit states that were quite “interesting” in the past.

Compared to MCMC we do not have detailed balance, nor an energy function. A detailed balance condition would by the way mean a network with symmetric weights.

# Nonparametric autoencoders

The latent variables in the standard variational autoencoder are Gaussian and have a fixed quantity. The ideal hidden representation however might require a dynamic number of such latent variables. For example if the neural network has only 8 latent variables in the MNIST task it has to somehow represent 10 digits with these 8 variables.

To extend the hidden layer from a fixed to a variable number of nodes it is possible to use methods developed in the nonparametric Bayesian literature.

There have been already several developments:

• A stick-breaking variational autoencoder (Nalisnick and Smyth, 2017) where the latent variables are represented by a stick-breaking process (SB-VAE);
• A nested Chinese Restaurant Process as a prior on the latent variables (Goyal et al., 2017);
• An (ordinary) Gaussian mixture as a prior distribution on the latent variables (Dilokthanakul et al., 2017), but see this interesting blog post for a critical review (GMVAE);
• A deep latent Gaussian mixture model (Nalisnick et al, 2016) where a Gaussian mixture is used as the approximate posterior (DLGMM);
• Variational deep embedding uses (again) a mixture of Gaussians as a prior (Jiang et al., 2017) (VaDE);
• Variational autoencoded deep Gaussian Processes (Dai et al., 2016) uses a “chain” of Gaussian Processes to represent multiple layers of latent variables (VAE-DGP).

The problem with autoencoders is that they actually not necessarily say how the latent variables are to be used.

Inherently, without additional constraints the representation problem is ill-posed. Suppose for example that the generator is just a dictionary of images and that training will make the latent variables point to a particular index in this dictionary. In this way no deep structure has been uncovered by the network at all. It’s just pretty dumb pointing at what it has been seen before at the training and generalization can be expected to be pretty bad.

Especially when variational autoencoders are used in sequence modeling it becomes apparent that the latent code is generally not used. The variational lossy autoencoder introduces control over the latent code to successfully combine them with recurrent networks (Chen et al., 2017).

From an information-theoretic perspective the differences can be formulated in an extreme manner: maximization or minimization of mutual information. With InfoGAN (not explained in this blog post) mutual information between input and latent variables is maximized to make sure that the variables are all used. This is useful to avoid the “uninformative latent code problem”, where latent features are actually not used in the training. However, with for example the information bottleneck approach the mutual information between input and latent variables is minimized (under the constraint that the features still predict some labels). This is logically from the perspective of compression. This behavior can all be seen as a so-called information-autoencoding family (Zhao et al., 2017).

It is interesting to study how nonparametric Bayesian methods fair with respect to this family and what role they fulfill in such a constrained optimization problem. Existing models namely use fixed values for the Lagrangian multipliers (the tradeoffs they make).

# Regularization

Training deep networks has undergone several advances. One of the first innovations has been the layer by layer training.

## Dropout

Another key idea has been to randomly drop units including connections during training. This prevents overfitting. During training in this way a collection of differently thinned networks is used. At testing an unthinned network is used. This is called dropout (Srivastava et al., 2014).

## Stochastic gradient descent

Gradient descent or steepest descent is an iterative method where we take steps that depend on the slope (or more general, that depend on the gradient) with as purpose to end up at a minimum. To get (negative) gradients we need to have differential functions.

Stochastic gradient descent is a stochastic approximation to gradient descent. What is approximated is the true gradient. Adjusting the parameters $\theta$ it minimizes the following loss function:

Here $x_1, \ldots x_N$ is the training set. Stochastic gradient descent typically uses a mini-batch $x_1, \ldots, x_m$ of size $m$. The gradient is then approximated by:

$\frac{1}{m} {\partial l(x_i, \theta)}{\partial \theta}$.

## Batch normalization

The distribution of network activities change during training due to the fact that the network parameters change. This phenomenon is called internal covariate shift. It is possible to fix the distribution of the layer inputs $x$ as the training progresses. It is for example well-known that whitening the inputs (linear transforming them to zero means, unit variances and decorrelating them) makes a network converge faster. Batch normalization does not simply whiten each layer’s input, but makes two simplifications: (1) normalize each scalar feature independently, and (2) introduce scale and shift parameters to preserve nonlinearities. Batch normalization improved significantly on the ImageNet classification task (Ioffe and Szegedy, 2015).

## Residual learning

Making networks deeper and deeper counterintuitively increases the training and thus the test error. Consider for example an identity mapping, a network needs to learn to duplicate the input and the output. Emperical evidence shows however that learning the difference (zero) is easier for a network. This is called residual learning (He et al., 2015. At ImageNet such residual nets achieve 3.57% error on the test set. It is hence no surprise that the fourth edition of the Inception networks use residual learning (Szegedy et al., 2017).