Variational inference approximates the posterior distribution in probabilistic models. Given observed variables $x$ we would like to know the underlying phenomenon $z$, defined probabilistically as $p(z | x)$. Variational inference approximates $p(z|x)$ through a simpler distribution $q(z,v)$. The approximation is defined through a distance/divergence, often the Kullback-Leibler divergence:

It is interesting to see that this deterministic strategy does not require Monte Carlo updates. It can be seen as a deterministic optimization problem. However, it is definitely possible to solve this deterministic problem stochastically as well! We can formulate it as a stochastic optimization problem!

There are two main strategies:

• the reparametrization trick
• the log-derivate trick

The log-derivate trick is quite general but still suffers from high variance. Henceforth, so-called control variates have been introduced that reduce variance. We will spend quite a bit of time to clarify what a control variate is. The last section describes modern approaches that combine features from both strategies.

# The reparametrization trick

The reparametrization trick introduces auxiliary random variables that are stochastic such that the parameters to be optimized over are only occuring in deterministic functions. This is convenient because it can reduce variance and sometimes the derivatives of the probability density functions do not exist in closed-form (which means no autodifferentation). See the Inference in deep learning post.

# The log-derivative trick

The log-derivative trick is also called the score function method, REINFORCE, or black-box variational inference. The term black-box variational inference reveals that this trick is completely general. It can be applied to any model. For instance, models that have both continuous and discrete latent variables. The joint distribution does not need to be differentiable either.

It uses the following identity:

This identity is just obtained by differentiating using $\nabla \log x = \frac{1}{x}$ and applying the chain rule $\nabla \log f(x) = \frac{1}{f(x)} \nabla f(x)$. Let’s subsequently rewrite this identity as a product:

The expected costs we want to minimize:

We can use Leibniz’s integral rule (differentiation under the integral sign) to shift the differential operator into the integral. To recall the rule:

In our case:

Using the log identity:

Now we can use Monte Carlo to estimate:

Here $x_s \sim p_\phi(x)$ i.i.d. This is general estimator: $f_\theta(x)$ does not need to be differentiable or continuous with respect to $x$. Note that $\log p_\phi(x)$ needs to be differentiable with respect to $\phi$.

We should show that the variance is actually reduced… However, let us first explain something that you will find time after time. Namely the notion of control variates…

# Control variates

Let us estimate the expectation over a function $E_x[f(x)]$ given a function $f(x)$. The Monte Carlo estimator is of the form $E_x[f(x)] \approx \frac{1}{k} \sum_i f(x^i)$ with $x^i \sim p(x)$. We can introduce a control variate to reduce the variance:

The parameter $\eta$ can be chosen to minimize the variance, which turns out to be optimally:

More information can be found at Wikipedia. The final variance will be something along the lines:

Here $Var(f) = E[f^2] - E[f]^2$ and $Cov(f,g) = E[(f-E[f])(g-E[g])]$. So, how we can explain this best?

Assume we have to sum over the function $f(x) = 1/(1+x)$ with $% $, then if we sample uniformly random values between $0$ and $1$ we will have results between $1/(1+0)=1$ and $1/(1+1)=1/2$. We would like to transform this function in such way that these results are closer to each other. The values at $x=0$ should be going to the mean, and the values at $x=1$ as well. At wikipedia they give the example of the covariate $g(x) = 1 + x$ (this could have just been $g(x) = x$). By adding $x$ and subtracting the average (in this case $\int_0^1 (1+x) dx = 3/2$) we make the function flatter with picking $\eta=0.4773$, in other words we reduced the variance. We sample 100 values uniformly and demonstate in the following graph that the function using the covariate is indeed flatter.

Another covariate could be $g(x) = \log(x + 1)$. We then have to subtract the expectation of that function, namely $\int_0^1 \log(x+1) dx = \log(4)-1$. This function is even flatter and has an even smaller variance. You can see that in the graph above. We have picked a value for $\eta=0.72$. The covariate which would make the compound function completely flat would be $g(x) = 1/(2-x)$, which is $f(x)$ mirrored over the range from $x=[0,1]$. However, this would of course render the Monte Carlo sampling redundant, because we would need the expectation over $g(x)$ which is in this case just as hard as that over $f(x)$.

# Recent approaches (and combinations)

The log-derivative trick (or the score function estimator) still suffers from high variance. Common techniques to reduce variance is by introducing baselines. Examples of unbiased single sample gradient estimators, are NVIL (Mnih and Gregor, 2014) and MuProp (Gu et al., 2015). An example of an unbiased multisample case is VIMCO (Mnih and Rezende, 2016).

Examples of biased single sample gradient estimators, are Gumbel-Softmax (Jang et al., 2016) and Concrete relaxiations (Maddison et al., 2017), independent researchers coming to the same strategy. The family of concrete distributions (Maddison et al, 2017) has closed-form densities and a simple reparametrization. The concrete distributions can replace discrete distributions on training so all gradients can properly be calculated. During training the concrete distributions can be replaced by discrete distributions.

REBAR (Tucker et al., 2017) is a new approach that uses a novel control variate to make the Concrete relaxation approach unbiased again.

# References

## Machine Learning Done Bayesian

In the dark corners of the academic world there is a rampant fight between practitioners of deep learning and researchers of Bayesian methods. This polemic article testifies to this, although firmly establishing itself as anti-Bayesian.

There is not much you can have against Bayes’ rule, so the hate runs deeper than this. I think it stems from the very behavior of Bayesian researchers rewriting existing methods as approximations to Bayesian methods.

Ferenc Huszár, a machine learning researcher at Twitter describes some of these approximations.

• L1 regularization is just Maximum A Posteriori (MAP) estimation with sparsity inducing priors;
• Support vector machines are just the wrong way to train Gaussian processes;
• Herding is just Bayesian quadrature done slightly wrong;
• Dropout is just variational inference done slightly wrong;
• Stochastic gradient descent (SGD) is just variational inference (variational EM) done slightly wrong.

Do you have other approximations you can think of?

## Inference in Deep Learning

There are many, many new generative methods developed in the recent years.

• denoising autoencoders
• generative stochastic networks
• variational autoencoders
• importance weighted autoencoders
• infusion training
• variational walkback
• generative latent optimization
• deep learning through the use of non-equilibrium thermodynamics

# Deep Models

We can’t delve into the details of those old workhorse models, but let us summarize a few of them nevertheless.

A Boltzmann machine can be seen as a stochastic generalization of a Hopfield network. In their unrestricted form Hebbian learning is often used to learn representations.

Don't just read the excerpt. :-) Sit down and read for real! →

# Kullback-Leibler divergence

In contrastive divergence the Kullback-Leibler divergence (KL-divergence) between the data distribution and the model distribution is minimized (here we assume $x$ to be discrete):

Here $P_0(x)$ is the observed data distribution, $P(x\mid W)$ is the model distribution and $W$ are the model parameters. A divergence (wikipedia) is a fancy term for something that resembles a metric distance. It is not an actual metric because the divergence of $x$ given $y$ can be different (and often is different) from the divergence of $y$ given $x$. The Kullback-Leibler divergence $D_{KL}(P \mid \mid Q)$ exists only if $Q(\cdot) = 0$ implies $P(\cdot) = 0$.

Don't just read the excerpt. :-) Sit down and read for real! →

## Yoga 900 on Linux

The Yoga 900 is a beautiful machine that has a considerably long battery lifetime and can be folded such that it functions as a tablet. The Yoga arrived on Friday and the entire Crownstone team was enjoying how it came out of the box: it lifts up! If you’re creating your own hardware you suddenly appreciate how other people pay attention to packaging!

Don't just read the excerpt. :-) Sit down and read for real! →