What Is Contrastive Divergence?
In contrastive divergence the Kullback-Leibler divergence (KL-divergence) between the data distribution and the model distribution is minimized (here we assume $x$ to be discrete):
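$$KL(p_0(x) \,\|\, p(x;\theta)) = \sum_x p_0(x) \log \frac{p_0(x)}{p(x;\theta)}$$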
Here $p_0(x)$ is the observed data distribution, $p(x;\theta)$ is the model distribution and $\theta$ are the model parameters. A divergence (wikipedia) is a fancy term for something that resembles a metric distance. It is not an actual metric because the divergence of $p_0$ given $p$ can be different (and often is different) from the divergence of $p$ given $p_0$. The Kullback-Leibler divergence exists only if $p(x;\theta) = 0$ implies $p_0(x) = 0$.
The model distribution can be written in the form of a normalized energy function:
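$$p(x;\theta) = \frac{e^{-E(x;\theta)}}{Z(\theta)}$$

Here $E(x;\theta)$ is the energy assigned to state $x$.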
The partition function can be written as the sum over all states:
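$$Z(\theta) = \sum_x e^{-E(x;\theta)}$$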
With gradient descent we use the gradient negatively:
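$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta f(\theta_t)$$

Here $f(\theta)$ is the function to be minimized and $\eta$ is the step size.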
With gradient ascent we use the gradient positively:
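$$\theta_{t+1} = \theta_t + \eta \, \nabla_\theta f(\theta_t)$$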
In both cases $\eta$ is a predefined parameter. It can be constant, but in learning methods it can also be a function called the learning rate. The parameter might depend on the time step $t$.
For both gradient descent and gradient ascent, $\theta_{t+1} = \theta_t$ means that $\nabla_\theta f(\theta_t) = 0$. Descending a slope up to a zero gradient leads to a minimum if there is one. Ascending a slope up to a zero gradient leads to a maximum if there is one. The extremum found does not necessarily need to be unique, except if the function is strictly convex (for descent), respectively strictly concave (for ascent).
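To make the two update rules concrete, here is a minimal sketch in Python, using a toy quadratic $f(\theta) = (\theta - 3)^2$ that is purely illustrative and not part of the derivation below:

```python
# Minimal sketch: gradient descent on f(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3); the minimum lies at theta = 3.
def grad_f(theta):
    return 2.0 * (theta - 3.0)

eta = 0.1      # learning rate; constant here, but it may depend on the time step t
theta = 0.0    # initial parameter value

for t in range(100):
    theta = theta - eta * grad_f(theta)   # gradient descent: minus sign
    # gradient ascent (e.g. on -f) would use a plus sign instead

print(theta)   # close to 3.0, where the gradient is zero (the minimum)
```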
Gradient descent of the KL-divergence
Below you will find a step-by-step derivation of gradient descent for the KL-divergence. It needs to be a minimization, so we will indeed need gradient descent (not ascent). Formally, we have to calculate:
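$$\min_\theta \, KL(p_0(x) \,\|\, p(x;\theta))$$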
KL-divergence parts that depend on $\theta$
We are going to rewrite this equation in a way that is relevant to taking a derivative: (1) reorganize the equation such that the terms not involving $\theta$ are separate terms, (2) use log identities to write it as a sum of terms, and (3) remove the terms not involving $\theta$.
Hence, first, let us rewrite the divergence to obtain separate terms that do and do not involve $\theta$. To do so we substitute $p(x;\theta) = e^{-E(x;\theta)}/Z(\theta)$ on the fourth line:
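$$
\begin{aligned}
KL(p_0(x) \,\|\, p(x;\theta)) &= \sum_x p_0(x) \log \frac{p_0(x)}{p(x;\theta)} \\
&= \sum_x p_0(x) \left[ \log p_0(x) - \log p(x;\theta) \right] \\
&= \sum_x p_0(x) \log p_0(x) - \sum_x p_0(x) \log p(x;\theta) \\
&= \sum_x p_0(x) \log p_0(x) - \sum_x p_0(x) \log \frac{e^{-E(x;\theta)}}{Z(\theta)}
\end{aligned}
$$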
Second, use the following identity to reach a sum of terms:
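$$\log \frac{a}{b} = \log a - \log b$$

Applying it to the last term, and using $\log e^{-E(x;\theta)} = -E(x;\theta)$ and $\sum_x p_0(x) = 1$, gives:

$$KL(p_0(x) \,\|\, p(x;\theta)) = \sum_x p_0(x) \log p_0(x) + \sum_x p_0(x) E(x;\theta) + \log Z(\theta)$$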
Third, get rid of the first term, which does not depend on $\theta$. Now the part relevant to our derivative is:
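$$\sum_x p_0(x) E(x;\theta) + \log Z(\theta)$$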
In “On Contrastive Divergence Learning” by Carreira-Perpinan and Hinton (proceedings of AISTATS 2005) this is written as the log-likelihood objective:
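$$\mathcal{L}(\theta; X) = \frac{1}{N} \sum_{n=1}^{N} \log p(x_n;\theta) = -\left\langle E(x;\theta) \right\rangle_0 - \log Z(\theta)$$

Here $\langle \cdot \rangle_0$ denotes the expectation with respect to the data distribution $p_0$ (in the paper, the average over the $N$ observed data points $X = \{x_n\}_{n=1}^N$).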
Note that there is a negative sign here. Maximizing the log-likelihood is identical to minimizing the KL-divergence, because the term we dropped, $\sum_x p_0(x) \log p_0(x)$, does not depend on $\theta$.
The gradient of the KL-divergence
Taking the gradient with respect to $\theta$ (we can then safely omit the term that does not depend on $\theta$):
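$$\frac{\partial}{\partial \theta} KL(p_0(x) \,\|\, p(x;\theta)) = \sum_x p_0(x) \frac{\partial E(x;\theta)}{\partial \theta} + \frac{\partial \log Z(\theta)}{\partial \theta}$$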
Recall the derivative of a logarithm:
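$$\frac{\partial \log f(\theta)}{\partial \theta} = \frac{1}{f(\theta)} \frac{\partial f(\theta)}{\partial \theta}$$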
Apply this to the logarithm of the partition function:
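$$\frac{\partial \log Z(\theta)}{\partial \theta} = \frac{1}{Z(\theta)} \frac{\partial Z(\theta)}{\partial \theta}$$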
The derivative of the partition function:
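$$\frac{\partial Z(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_x e^{-E(x;\theta)} = \sum_x \frac{\partial}{\partial \theta} e^{-E(x;\theta)}$$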
Recall the derivative of an exponential function:
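$$\frac{\partial e^{f(\theta)}}{\partial \theta} = e^{f(\theta)} \frac{\partial f(\theta)}{\partial \theta}$$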
Use this for the partition function derivative:
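$$\frac{\partial Z(\theta)}{\partial \theta} = \sum_x e^{-E(x;\theta)} \left( -\frac{\partial E(x;\theta)}{\partial \theta} \right) = -\sum_x e^{-E(x;\theta)} \frac{\partial E(x;\theta)}{\partial \theta}$$

Dividing by $Z(\theta)$ and recognizing the model distribution $p(x;\theta) = e^{-E(x;\theta)}/Z(\theta)$:

$$\frac{1}{Z(\theta)} \frac{\partial Z(\theta)}{\partial \theta} = -\sum_x p(x;\theta) \frac{\partial E(x;\theta)}{\partial \theta}$$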
Again, the gradient of the divergence was:
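$$\frac{\partial}{\partial \theta} KL(p_0(x) \,\|\, p(x;\theta)) = \sum_x p_0(x) \frac{\partial E(x;\theta)}{\partial \theta} + \frac{\partial \log Z(\theta)}{\partial \theta}$$

Substituting the result for the partition function derivative:

$$\frac{\partial}{\partial \theta} KL(p_0(x) \,\|\, p(x;\theta)) = \sum_x p_0(x) \frac{\partial E(x;\theta)}{\partial \theta} - \sum_x p(x;\theta) \frac{\partial E(x;\theta)}{\partial \theta}$$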
Compare with Hinton:
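$$\frac{\partial \mathcal{L}(\theta; X)}{\partial \theta} = -\left\langle \frac{\partial E(x;\theta)}{\partial \theta} \right\rangle_0 + \left\langle \frac{\partial E(x;\theta)}{\partial \theta} \right\rangle_\infty$$

Here $\langle \cdot \rangle_\infty$ denotes the expectation with respect to the model distribution $p(x;\theta)$ (in the paper: the equilibrium distribution of a Markov chain run infinitely long). The signs are flipped compared to our expression because the log-likelihood is maximized rather than the KL-divergence minimized.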
We have arrived at a formulation of the minimization of the KL-divergence that allows us to compare it with contrastive divergence.
Contrastive divergence uses a different (empirical) distribution to get rid of the expectation over the model distribution, $\langle \cdot \rangle_\infty$:
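$$\frac{\partial \mathcal{L}(\theta; X)}{\partial \theta} \approx -\left\langle \frac{\partial E(x;\theta)}{\partial \theta} \right\rangle_0 + \left\langle \frac{\partial E(x;\theta)}{\partial \theta} \right\rangle_1$$

Here $\langle \cdot \rangle_1$ denotes the expectation over the samples obtained by starting a Markov chain at the data and running it for only one step (for example one step of Gibbs sampling), which avoids both the partition function and a chain that has to reach equilibrium.

As a concrete sketch of what such an update could look like in code, here is one CD-1 parameter update for a binary restricted Boltzmann machine, the kind of model for which Hinton proposed the algorithm. This is only an illustration under that assumption; the function and variable names (`cd1_update`, `W`, `b`, `c`) are ours, not from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, eta=0.1):
    """One CD-1 update for a binary RBM with energy
    E(v, h) = -v.T @ W @ h - b.T @ v - c.T @ h."""
    # Positive phase: hidden probabilities and samples given the data v0.
    p_h0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # One Gibbs step: reconstruct visible units, then hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + c)

    # Gradient estimate: data statistics <.>_0 minus one-step statistics <.>_1.
    n = v0.shape[0]
    dW = (v0.T @ p_h0 - v1.T @ p_h1) / n
    db = (v0 - v1).mean(axis=0)
    dc = (p_h0 - p_h1).mean(axis=0)

    # Ascend the approximate log-likelihood gradient.
    return W + eta * dW, b + eta * db, c + eta * dc

# Tiny usage example: 4 visible units, 3 hidden units, a batch of 5 samples.
W = 0.01 * rng.standard_normal((4, 3))
b = np.zeros(4)
c = np.zeros(3)
v0 = (rng.random((5, 4)) < 0.5).astype(float)
W, b, c = cd1_update(v0, W, b, c)
```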