## What Is Contrastive Divergence?

# Kullback-Leibler divergence

In contrastive divergence the Kullback-Leibler divergence (KL-divergence) between the data distribution and the model distribution is minimized (here we assume to be discrete):

Here is the observed data distribution, is the model distribution and are the model parameters. A **divergence** (wikipedia) is a fancy term for something that resembles a **metric** distance. It is not an actual metric because the divergence of given can be different (and often is different) from the divergence of given . The Kullback-Leibler divergence exists only if implies .

The model distribution can be written in the form of a normalized energy function:

The partition function can be written as the sum over all states:

## Gradients

With gradient descent we use the gradient negatively:

With gradient ascend we use the gradient positively:

In both cases is a predefined parameter. It can be constant, but in learning methods this can also be a function called the **learning rate**. The parameter might depend on time .

For both gradient descent and gradient ascent means that . Descending a slope up to a zero gradient leads to a minimum if there is one. Ascending a slope up to a zero gradients leads to a maximum if there is one. The extremum found does not necessarily need to be unique, except if the function is concave, respectively convex.

## Gradient descent of the KL-divergence

Below you will find a step-by-step derivation of a description of gradient descent for the KL-divergence. It needs to
be **minimization** so we will indeed need gradient descent (not ascent). Formally, we have to calculate:

### KL-divergence parts that depend on

We are gonna rewrite this equation is a way relevant to taking a derivative: (1) reorganize the equation such that the terms not involving are separate terms, (2) using log identities to write it as a sum of terms, and (3) removing the terms not involving .

Hence, first, let us rewrite the divergence to obtain separate terms that do and do not involve . Herefore we substitute on the fourth line:

Second, use the following identity to reach a sum of terms:

Third, get rid of the first term that does not depend on . Now the part relevant to our derivative is:

In “On Contrastive Divergence Learning” by Carreira-Perpinan and Hinton (proceedings AISTATS 2015) this is written as the log-likelihood objective:

Note, that there is a negative sign here. The maximum log-likelihood is identical to the minimum KL divergence.

### The gradient of the KL-divergence

Taking the gradient with respect to (we can then safely omit the term that does not depend on ):

Recall the derivative of a logarithm:

Take derivative of logarithm:

The derivative of the partition function:

Recall the derivative of an exponential function:

Use this for the partition function derivative:

Hence:

Using :

Again, the gradient of the divergence was:

Hence:

Compare with Hinton:

Gradient descent:

Thus,

We arrived at the formulation of minimization of KL-divergence that allows comparing it with Contrastive divergence.

# Constrastive divergence

Contrastive divergence uses a different (empirical) distribution to get rid of :