## Attend, Infer, Repeat

A long, long time ago, at least by the standards of these fast-moving times in deep learning, namely two years ago (2016), there was a paper studying how we can teach neural networks to count.


This paper is titled “Attend, infer, repeat: Fast scene understanding with generative models” and the authors are Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa (github, nice, he does couchsurfing), David Szepesvari, Koray Kavukcuoglu, and Geoffrey Hinton: a team at DeepMind, based in London.

Counting has been a personal interest of mine. I find it very satisfying that bees, for example, can count landmarks, or at least have a capability that approximates counting fairly well. It is such an abstract concept, but a very rich one. Just take the fact that you can recognize yourself in the mirror (I hope). It is grounded in something that strongly believes there is only one of you, that you are pretty unique.

From a learning perspective, counting feels like mapping in autonomous robotics. The well-known chicken-and-egg problem of simultaneous localisation and mapping (SLAM) is that mapping and localisation are intertwined tasks: progress on one immediately influences the other. To map properly, it helps to have good odometry, so you can tell accurately how your location is changing. To localise properly, it helps to have a good map. In the beginning the robot is bad at both, but by learning (for example through expectation maximization) it performs both better and better.

Counting objects likewise benefits from being able to recognize objects properly. It also likely benefits from localizing objects. A child counts by pointing at the objects, and sometimes even verbalizes each object in the process. Of course a network might do all three things in different layers, but that would remove the chance for these layers to inform each other. If we introduce cross-connections manually, the network would not learn to decompose the task autonomously. Ideally the network learns the decomposition itself, so that we do not artificially limit the information transfer between those tasks.

The paper by Eslami et al. introduces several aspects that are important for a system like this:

• Learning latent spaces of variable dimensionality.
• An iterative process that attends to one object at a time. This also requires a stopping condition to stop counting.
• Complete end-to-end learning by amortized variational inference.

It is coined the AIR model by the authors: attend, infer, repeat.

## Learning latent spaces of variable dimensions

The scene representation has a fixed upper limit on the number of objects (a nice extension would be a nonparametric prior, such as a Dirichlet process). The number of objects is drawn from a prior $p_N(n)$, the scene model generates a variable-length latent vector $z \sim p_\theta(\cdot \mid n)$, and the data itself is generated from the latents through $x \sim p_\theta(\cdot \mid z)$. Summarized:

$$p_\theta(x) = \sum_{n} p_N(n) \int p_\theta(z \mid n) \, p_\theta(x \mid z) \, dz$$

with the prior decomposed as:

$$p_\theta(z \mid n) = \prod_{i=1}^{n} p_\theta(z^i)$$

The posterior is given by Bayes’ rule: prior times likelihood, divided by the evidence:

$$p_\theta(z, n \mid x) = \frac{p_\theta(x \mid z, n) \, p_\theta(z, n)}{p_\theta(x)}$$

Equivalently, with the prior decomposed:

$$p_\theta(z, n \mid x) = \frac{p_\theta(x \mid z, n) \, p_\theta(z \mid n) \, p_N(n)}{p_\theta(x)}$$

And with the evidence written out:

$$p_\theta(x) = \sum_{n} p_N(n) \int p_\theta(z \mid n) \, p_\theta(x \mid z) \, dz$$

We approximate the posterior variationally by a simpler distribution $q_\phi(z, n \mid x)$, using the Kullback-Leibler divergence:

$$D_{KL}\big(q_\phi(z, n \mid x) \,\|\, p_\theta(z, n \mid x)\big) = \sum_{n} \int q_\phi(z, n \mid x) \log \frac{q_\phi(z, n \mid x)}{p_\theta(z, n \mid x)} \, dz$$

The divergence is minimized by searching through the parameter space $\phi \in \Phi$.
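To make the generative side concrete, here is a minimal numpy sketch of the process above. The uniform count prior, the latent dimension, and the Gaussian-blob "renderer" are all invented stand-ins for $p_N(n)$, $p_\theta(z \mid n)$, and $p_\theta(x \mid z)$; the real model uses learned networks for these.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scene(max_objects=3, latent_dim=4, img_size=16):
    """Toy version of the generative process: sample a count n,
    then one latent vector per object, then render an image."""
    # n ~ p_N(n): here simply uniform over 0..max_objects
    n = int(rng.integers(0, max_objects + 1))
    # z ~ p_theta(. | n): one latent vector per object
    z = rng.normal(size=(n, latent_dim))
    # x ~ p_theta(. | z): a made-up renderer placing a Gaussian blob
    # at a position derived from the first two latent dimensions
    x = np.zeros((img_size, img_size))
    ys, xs = np.mgrid[0:img_size, 0:img_size]
    for z_i in z:
        cx, cy = (np.tanh(z_i[:2]) * 0.5 + 0.5) * (img_size - 1)
        x += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / 4.0)
    return n, z, x

n, z, x = sample_scene()
print(n, z.shape, x.shape)
```

Note that the feature vector $z$ has variable length: its first dimension is the sampled object count $n$.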

## An iterative process and a stopping condition

One difficulty arises because $n$ is itself a random variable. Computing the evidence requires evaluating:

$$p_\theta(x \mid n) = \int p_\theta(z \mid n) \, p_\theta(x \mid z) \, dz$$

for all values of $n = 1 \ldots N$.

Now it is suggested to represent $n$ through a latent vector $z_{present}$ that consists of $n$ ones followed by a zero (and hence has size $n + 1$). So we have $q_\phi(z, z_{present} \mid x)$ rather than $q_\phi(z, n \mid x)$. The posterior then has the following form:

$$q_\phi(z, z_{present} \mid x) = q_\phi(z_{present}^{n+1} = 0 \mid z^{1:n}, x) \prod_{i=1}^{n} q_\phi(z^i, z_{present}^i = 1 \mid z^{1:i-1}, x)$$

The first term describes the stopping condition: if $z_{present}$ becomes $0$, there are no more objects to detect. The second term conditions on the previously explained objects. We do not want to describe the same object twice!
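The encoding of $n$ and the resulting stopping condition can be sketched in a few lines. This is a toy sketch: the fixed `p_continue` stands in for the learned probability $q_\phi(z_{present}^i = 1 \mid x, z^{1:i-1})$, and the latent dimension is made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_count(n):
    """Represent the count n as n ones followed by a terminating zero."""
    return [1] * n + [0]

def infer_objects(max_steps=10, p_continue=0.6):
    """Toy inference loop: at every step sample z_present; a zero means
    'no more objects present', which is the stopping condition. The fixed
    p_continue stands in for the learned q(z_present^i = 1 | x, z^{1:i-1})."""
    z_present, latents = [], []
    for _ in range(max_steps):
        present = int(rng.random() < p_continue)
        z_present.append(present)
        if present == 0:                        # stop: nothing more to describe
            break
        latents.append(rng.normal(size=3))      # describe the attended object
    return z_present, latents

print(encode_count(3))    # [1, 1, 1, 0]
z_present, latents = infer_objects()
print(z_present, len(latents))
```

The number of latent vectors produced always equals the number of ones in $z_{present}$, which is exactly the inferred object count.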

## A variational implementation

To optimize for $\theta$ and $\phi$ we use the negative free energy $\mathcal{L}$, better known as the evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi) = E_{q_\phi(z, n \mid x)}\left[\log \frac{p_\theta(x, z, n)}{q_\phi(z, n \mid x)}\right]$$

It is guaranteed to be at most $\log p_\theta(x)$, so it can be used to approximate the latter by pushing it up as far as possible.

We now have to calculate both $\frac{\partial}{\partial\theta} \mathcal{L}$ and $\frac{\partial}{\partial\phi} \mathcal{L}$ to perform gradient ascent.

The estimate of the latter term is quite involved. First, let $\omega^i$ denote all parameters at time step $i$, i.e. $(z_{present}^i, z^i)$. Then we map $x$ to $\omega^i$ through a recurrent function $(\omega^i, h^i) = R_\phi(x, h^{i-1})$, where $R_\phi$ is a recurrent neural network with hidden state $h^i$. The gradient obeys the chain rule:

$$\frac{\partial \mathcal{L}}{\partial \phi} = \sum_i \frac{\partial \mathcal{L}}{\partial \omega^i} \frac{\partial \omega^i}{\partial \phi}$$
Now, we have to calculate $\frac{\partial \mathcal{L}}{\partial \omega^i}$. Remember that $\omega^i$ can contain both continuous and discrete variables. For continuous variables the reparametrization trick is applied; for discrete variables a likelihood-ratio estimator is used. The latter can have high variance, which is reduced using structured neural baselines.

## Discussion

What do we learn from this?

• We have to come up with a particular representation of the number of objects. Using this representation we not only inform the network that it has to count, but also that the count has to be used as a stopping condition. It very much looks like a handcrafted architecture.
• There is apparently no satisfying black-box approach to calculating the gradients. We have to manually specify which strategy is used for which type of parameter, and for discrete variables we even have to come up with ways to reduce the variance of the estimator.

If we used this architecture, would we be surprised that the network learns to count? No, I don’t think so. We pretty much hardcoded it into the architecture.

An interesting observation by the authors concerns generalization. When the model is trained on images with up to two digits in a multi-MNIST task, it does not generalize to three digits. Likewise, if it is trained on images with zero, one, or three digits, it cannot handle images with two digits. To address this, an architectural change feeds the recurrent network with differences from the input, $(\omega^i, h^i) = R_\phi(x^i - x, h^{i-1})$. The authors coin this the DAIR model, rather than just the AIR model.

The authors compare the system with the Deep Recurrent Attentive Writer (DRAW) architecture. The latter exhibits good performance on the same counting task. Where it falls short is when the task of counting zero, one, or two digits is followed by another task using those digits: (a) summing the two digits, or (b) determining whether the digits are in ascending order. Here the AIR model outperforms DRAW.

## Research direction

One of the things that is interesting in the neuroscientific literature is the concept of subitizing. It might, or might not, be the case that counting up to four is faster than counting beyond four. Above four there seems to be a sequential process like the one described in this blog post; some scientists think there is a different pathway that allows a more instantaneous response if there are only a few objects.

The paper titled “Subitizing with Variational Autoencoders” by Rijnder Wever (github) and Tom Runia from the University of Amsterdam describes subitizing as an emergent phenomenon in an ordinary autoencoder. A supervised classifier is trained on top of this unsupervised autoencoder. It is not entirely clear to me that the latent representation indeed somehow disentangles object identification from the number of objects.

Variational inference approximates the posterior distribution in probabilistic models. Given observed variables $x$ we would like to know the underlying phenomenon $z$, defined probabilistically as $p(z \mid x)$. Variational inference approximates $p(z \mid x)$ through a simpler distribution $q(z; v)$ with variational parameters $v$. The approximation is defined through a distance/divergence, often the Kullback-Leibler divergence:

$$v^* = \arg\min_{v} D_{KL}\big(q(z; v) \,\|\, p(z \mid x)\big) = \arg\min_{v} E_{q(z; v)}\left[\log \frac{q(z; v)}{p(z \mid x)}\right]$$

It is interesting that this strategy does not require Monte Carlo updates: it can be seen as a deterministic optimization problem. However, it is definitely possible to formulate and solve it as a stochastic optimization problem as well!

There are two main strategies:

• the reparametrization trick
• the log-derivative trick

The log-derivative trick is quite general but suffers from high variance. Hence, so-called control variates have been introduced to reduce that variance. We will spend quite a bit of time clarifying what a control variate is. The last section describes modern approaches that combine features from both strategies.

# The reparametrization trick

The reparametrization trick introduces an auxiliary random variable so that the parameters to be optimized over occur only in deterministic functions. This is convenient because it can reduce variance, and because sometimes the derivatives of the probability density functions do not exist in closed form (which means no autodifferentiation). See the Inference in deep learning post.
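A toy example (not from the paper): estimating $\frac{\partial}{\partial \mu} E_{x \sim \mathcal{N}(\mu, \sigma^2)}[x^2]$, which is $2\mu$ analytically, with a reparametrized Monte Carlo estimator. The values of $\mu$, $\sigma$, and the sample size are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Goal: estimate d/dmu of E_{x ~ N(mu, sigma^2)}[x^2], which is 2*mu.
# Reparametrize x = mu + sigma * eps with eps ~ N(0, 1): the sampling no
# longer depends on mu, so the derivative moves inside the expectation:
#   d/dmu E[(mu + sigma * eps)^2] = E[2 * (mu + sigma * eps)]
mu, sigma, S = 1.5, 2.0, 100_000
eps = rng.normal(size=S)
x = mu + sigma * eps
grad_estimate = np.mean(2.0 * x)

print(grad_estimate)   # close to 2 * mu = 3.0
```

The key point is that the randomness lives entirely in `eps`, so the expression being averaged is an ordinary deterministic function of $\mu$.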

# The log-derivative trick

The log-derivative trick is also called the score function method, REINFORCE, or black-box variational inference. The term black-box variational inference reveals that this trick is completely general: it can be applied to any model, for instance models that have both continuous and discrete latent variables. The joint distribution does not need to be differentiable either.

It uses the following identity:

$$\nabla_\phi \log p_\phi(x) = \frac{\nabla_\phi p_\phi(x)}{p_\phi(x)}$$

This identity is obtained from $\nabla \log x = \frac{1}{x}$ and the chain rule: $\nabla \log f(x) = \frac{1}{f(x)} \nabla f(x)$. Let’s subsequently rewrite this identity as a product:

$$\nabla_\phi p_\phi(x) = p_\phi(x) \, \nabla_\phi \log p_\phi(x)$$

The expected cost we want to minimize:

$$\mathcal{L}(\phi) = E_{p_\phi(x)}[f_\theta(x)] = \int p_\phi(x) \, f_\theta(x) \, dx$$

We can use Leibniz’s integral rule (differentiation under the integral sign) to move the differential operator into the integral. To recall the rule (with constant limits):

$$\frac{d}{dt} \int_a^b f(x, t) \, dx = \int_a^b \frac{\partial}{\partial t} f(x, t) \, dx$$

In our case:

$$\nabla_\phi \mathcal{L}(\phi) = \nabla_\phi \int p_\phi(x) \, f_\theta(x) \, dx = \int \nabla_\phi p_\phi(x) \, f_\theta(x) \, dx$$

Using the log identity:

$$\nabla_\phi \mathcal{L}(\phi) = \int p_\phi(x) \, \nabla_\phi \log p_\phi(x) \, f_\theta(x) \, dx = E_{p_\phi(x)}\big[f_\theta(x) \, \nabla_\phi \log p_\phi(x)\big]$$

Now we can use Monte Carlo to estimate it:

$$\nabla_\phi \mathcal{L}(\phi) \approx \frac{1}{S} \sum_{s=1}^{S} f_\theta(x_s) \, \nabla_\phi \log p_\phi(x_s)$$

Here $x_s \sim p_\phi(x)$ i.i.d. This is a general estimator: $f_\theta(x)$ does not need to be differentiable or continuous with respect to $x$. Note, however, that $\log p_\phi(x)$ does need to be differentiable with respect to $\phi$.
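As a toy illustration of this estimator (not from the paper), consider again $\frac{\partial}{\partial \mu} E_{x \sim \mathcal{N}(\mu, \sigma^2)}[x^2] = 2\mu$, where the score of a Gaussian is $\nabla_\mu \log \mathcal{N}(x; \mu, \sigma^2) = (x - \mu)/\sigma^2$. The constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: d/dmu of E_{x ~ N(mu, sigma^2)}[x^2], which equals 2*mu.
# The score function of a Gaussian is grad_mu log N(x; mu, sigma^2)
# = (x - mu) / sigma^2, so the estimator averages f(x) * score(x).
mu, sigma, S = 1.5, 2.0, 1_000_000
x = rng.normal(mu, sigma, size=S)
score = (x - mu) / sigma**2
grad_estimate = np.mean(x**2 * score)

print(grad_estimate)   # close to 2 * mu = 3.0, though with high variance
```

Note that $f(x) = x^2$ is only evaluated, never differentiated, which is exactly what makes the estimator black-box. The price is variance: far more samples are needed here than a pathwise estimator would require for the same accuracy.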

We should show that the variance can actually be reduced… However, let us first explain something that you will find time after time: the notion of control variates…

# Control variates

Let us estimate the expectation $E_x[f(x)]$ of a function $f(x)$. The plain Monte Carlo estimator is of the form $E_x[f(x)] \approx \frac{1}{k} \sum_i f(x^i)$ with $x^i \sim p(x)$. We can introduce a control variate, a function $g(x)$ with known expectation $E_x[g(x)]$, to reduce the variance:

$$E_x[f(x)] \approx \frac{1}{k} \sum_i \Big( f(x^i) + \eta \big( g(x^i) - E_x[g(x)] \big) \Big)$$

The parameter $\eta$ can be chosen to minimize the variance; the optimal choice turns out to be:

$$\eta^* = -\frac{Cov(f, g)}{Var(g)}$$

More information can be found at Wikipedia. The final variance will be:

$$Var\big(f + \eta^* (g - E[g])\big) = Var(f) - \frac{Cov(f, g)^2}{Var(g)} = \big(1 - \rho_{f,g}^2\big) \, Var(f)$$

where $\rho_{f,g}$ is the correlation coefficient between $f$ and $g$.

Here $Var(f) = E[f^2] - E[f]^2$ and $Cov(f,g) = E[(f-E[f])(g-E[g])]$. So, how can we explain this best?

Assume we have to integrate the function $f(x) = 1/(1+x)$ with $x \in [0, 1]$. If we sample uniformly random values between $0$ and $1$, we will get results between $f(0) = 1$ and $f(1) = 1/2$. We would like to transform this function in such a way that these results are closer to each other: the values at $x=0$ should move towards the mean, and the values at $x=1$ as well. Wikipedia gives the example of the control variate $g(x) = 1 + x$ (this could just as well have been $g(x) = x$). By adding $\eta \, g(x)$ and subtracting its average (in this case $\eta \int_0^1 (1+x) \, dx = \eta \cdot 3/2$), with $\eta = 0.4773$, we make the function flatter; in other words, we reduce the variance. We sample 100 values uniformly and demonstrate in the following graph that the function using the control variate is indeed flatter.

Another control variate could be $g(x) = \log(x + 1)$. We then have to subtract the expectation of that function, namely $\int_0^1 \log(x+1) \, dx = \log(4) - 1$. This makes the function even flatter, with an even smaller variance; you can see that in the graph above. Here we picked $\eta = 0.72$. The control variate that would make the compound function completely flat is $g(x) = 1/(2-x)$, which is $f(x)$ mirrored over the range $x \in [0, 1]$. However, this would render the Monte Carlo sampling redundant, because we would need the expectation of $g(x)$, which in this case is just as hard to obtain as that of $f(x)$.
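The example with $g(x) = 1 + x$ can be checked numerically. A small sketch, using the convention $f + \eta(g - E[g])$ with optimal $\eta^* = -Cov(f,g)/Var(g)$; the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(U)] for f(u) = 1/(1+u), U ~ Uniform(0,1); the true value
# is ln(2). Control variate: g(u) = 1 + u with known mean E[g] = 3/2.
S = 100_000
u = rng.random(S)
f = 1.0 / (1.0 + u)
g = 1.0 + u

# Optimal coefficient eta* = -Cov(f, g) / Var(g), roughly 0.4773 here.
eta = -np.cov(f, g)[0, 1] / np.var(g)
f_cv = f + eta * (g - 1.5)   # corrected samples: same mean, lower variance

print(np.mean(f), np.mean(f_cv))   # both close to ln(2) ~= 0.6931
print(np.var(f), np.var(f_cv))     # the corrected samples vary far less
```

In practice $\eta$ is estimated from the same samples, as here; this introduces a tiny bias that vanishes with sample size.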

# Recent approaches (and combinations)

The log-derivative trick (or score function estimator) still suffers from high variance. A common technique to reduce the variance is to introduce baselines. Examples of unbiased single-sample gradient estimators are NVIL (Mnih and Gregor, 2014) and MuProp (Gu et al., 2015). An example for the unbiased multi-sample case is VIMCO (Mnih and Rezende, 2016).

Examples of biased single-sample gradient estimators are Gumbel-Softmax (Jang et al., 2016) and Concrete relaxations (Maddison et al., 2017): independent researchers arriving at the same strategy. The family of Concrete distributions has closed-form densities and a simple reparametrization. During training, the Concrete distributions replace the discrete distributions so that all gradients can be calculated properly; at test time, they are swapped back for discrete distributions.
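A minimal sketch of drawing one Gumbel-Softmax / Concrete sample (the logits and temperature are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    """Draw one relaxed one-hot sample: perturb the logits with
    Gumbel(0, 1) noise and push them through a tempered softmax.
    As tau -> 0, samples approach discrete one-hot vectors; for
    tau > 0 the sample is a differentiable function of the logits."""
    gumbel = -np.log(-np.log(rng.random(len(logits))))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())   # numerically stable softmax
    return y / y.sum()

sample = gumbel_softmax(np.log(np.array([0.2, 0.3, 0.5])))
print(sample)   # a point on the probability simplex (entries sum to 1)
```

Because the sample is a smooth function of the logits, gradients flow through it, at the cost of the bias the text mentions: the relaxed distribution is not exactly the discrete one.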

REBAR (Tucker et al., 2017) is a new approach that uses a novel control variate to make the Concrete relaxation approach unbiased again.

# References

## Machine Learning Done Bayesian

In the dark corners of the academic world there is a rampant fight between practitioners of deep learning and researchers of Bayesian methods. This polemic article testifies to this, although firmly establishing itself as anti-Bayesian.

There is not much you can have against Bayes’ rule, so the hate runs deeper than this. I think it stems from the very behavior of Bayesian researchers rewriting existing methods as approximations to Bayesian methods.

Ferenc Huszár, a machine learning researcher at Twitter, describes some of these approximations.

• L1 regularization is just Maximum A Posteriori (MAP) estimation with sparsity inducing priors;
• Support vector machines are just the wrong way to train Gaussian processes;
• Herding is just Bayesian quadrature done slightly wrong;
• Dropout is just variational inference done slightly wrong;
• Stochastic gradient descent (SGD) is just variational inference (variational EM) done slightly wrong.

Do you have other approximations you can think of?

## Inference in Deep Learning

There are many, many new generative methods that have been developed in recent years.

• denoising autoencoders
• generative stochastic networks
• variational autoencoders
• importance weighted autoencoders
• infusion training
• variational walkback
• generative latent optimization
• deep learning through the use of non-equilibrium thermodynamics

# Deep Models

We can’t delve into the details of those old workhorse models, but let us summarize a few of them nevertheless.

A Boltzmann machine can be seen as a stochastic generalization of a Hopfield network. In their unrestricted form Hebbian learning is often used to learn representations.


# Kullback-Leibler divergence

In contrastive divergence the Kullback-Leibler divergence (KL divergence) between the data distribution and the model distribution is minimized (here we assume $x$ to be discrete):

$$D_{KL}\big(P_0 \,\|\, P(\cdot \mid W)\big) = \sum_x P_0(x) \log \frac{P_0(x)}{P(x \mid W)}$$

Here $P_0(x)$ is the observed data distribution, $P(x \mid W)$ is the model distribution, and $W$ are the model parameters. A divergence (wikipedia) is a fancy term for something that resembles a metric distance. It is not an actual metric because the divergence of $x$ given $y$ can be (and often is) different from the divergence of $y$ given $x$. The Kullback-Leibler divergence $D_{KL}(P \mid\mid Q)$ is only finite if $Q(x) = 0$ implies $P(x) = 0$ for every $x$.
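For discrete distributions the KL divergence is a short computation. A small sketch illustrating both its non-negativity and its asymmetry (the example distributions are made up):

```python
import numpy as np

def kl_divergence(p0, p, eps=1e-12):
    """D_KL(P0 || P) for discrete distributions: the sum of
    P0(x) * log(P0(x) / P(x)). Terms with P0(x) = 0 contribute
    nothing; P(x) is clamped at eps to avoid division by zero."""
    p0, p = np.asarray(p0, dtype=float), np.asarray(p, dtype=float)
    mask = p0 > 0
    return float(np.sum(p0[mask] * np.log(p0[mask] / np.maximum(p[mask], eps))))

kl_ab = kl_divergence([0.5, 0.5], [0.25, 0.75])
kl_ba = kl_divergence([0.25, 0.75], [0.5, 0.5])
print(kl_ab, kl_ba)   # both positive, and not equal: the divergence is asymmetric
```

Swapping the arguments gives a different value, which is exactly why KL is a divergence rather than a metric distance.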
