Kullback-Leibler divergence

In contrastive divergence the Kullback-Leibler divergence (KL-divergence) between the data distribution and the model distribution is minimized (here we assume $x$ to be discrete):

Here $P_0(x)$ is the observed data distribution, $P(x\mid W)$ is the model distribution and $W$ are the model parameters. A divergence (wikipedia) is a fancy term for something that resembles a metric distance. It is not an actual metric because the divergence of $x$ given $y$ can be different (and often is different) from the divergence of $y$ given $x$. The Kullback-Leibler divergence $D_{KL}(P \mid \mid Q)$ exists only if $Q(\cdot) = 0$ implies $P(\cdot) = 0$.

The model distribution can be written in the form of a normalized energy function:

The partition function can be written as the sum over all states:

In both cases $\lambda$ is a predefined parameter. It can be constant, but in learning methods this can also be a function called the learning rate. The parameter $\lambda$ might depend on time $t$.

For both gradient descent and gradient ascent $W_{t+1} - W_t = 0$ means that $\nabla f(W_t) = 0$. Descending a slope up to a zero gradient leads to a minimum if there is one. Ascending a slope up to a zero gradients leads to a maximum if there is one. The extremum found does not necessarily need to be unique, except if the function is concave, respectively convex.

Below you will find a step-by-step derivation of a description of gradient descent for the KL-divergence. It needs to be minimization so we will indeed need gradient descent (not ascent). Formally, we have to calculate:

KL-divergence parts that depend on $W$

We are gonna rewrite this equation is a way relevant to taking a derivative: (1) reorganize the equation such that the terms not involving $W$ are separate terms, (2) using log identities to write it as a sum of terms, and (3) removing the terms not involving $W$.

Hence, first, let us rewrite the divergence to obtain separate terms that do and do not involve $W$. Herefore we substitute $P(x\mid W)$ on the fourth line:

Second, use the following identity $\log a + \log b = \log a b$ to reach a sum of terms:

Third, get rid of the first term that does not depend on $W$. Now the part relevant to our derivative is:

In “On Contrastive Divergence Learning” by Carreira-Perpinan and Hinton (proceedings AISTATS 2015) this is written as the log-likelihood objective:

Note, that there is a negative sign here. The maximum log-likelihood is identical to the minimum KL divergence.

Taking the gradient with respect to $W$ (we can then safely omit the term that does not depend on $W$):

Recall the derivative of a logarithm:

Take derivative of logarithm:

The derivative of the partition function:

Recall the derivative of an exponential function:

Use this for the partition function derivative:

Hence:

Using $P(x \mid W)$:

Again, the gradient of the divergence was:

Hence:

Compare with Hinton:

Contrastive divergence uses a different (empirical) distribution to get rid of $P(x \mid W)$: