Sampling of Dirichlet Process
Thousands of articles describe the use of the Dirichlet Process, but very few describe how to sample from it. Most often one is referred to Markov chain sampling methods for Dirichlet process mixture models (pdf) by Radford Neal (at University of Toronto), which is a nice piece of work, but still a bit dense as an introduction. I contacted him by email about certain things that were difficult to understand at first and he was kind enough to respond, thanks a lot! Definitely also check out his blog in which he regularly showcases his fast version of R.
Let us introduce the Dirichlet Process with two lines:
First of all, if you haven’t been reading an article about this material in the last decennia, the notation stands for “is distributed as” (see stackexchange).
On the first line we have defined that being distributed as a Dirichlet Process (DP) with parameters and . The parameter has kind of the role of a mean, the parameter has kind of the role of precision (inverse variance). The parameter is called the base distribution, the parameter is called the dispersion factor.
On the second line we generate a bunch of parameters conditioned on the generated distribution . A peculiar property of this distribution is that exactly the same can be sampled multiple times. Hence, the Dirichlet Process can generate discrete objects out of a continuous base distribution .
If we have parameters we can subsequently generate observations as well:
Here describes the mapping of parameters to observations.
We can integrate over to obtain a presentation such as:
And just to write this down in a totally equivalent notation for your convenience:
Here we have picked an index mid-way, and not a new item which complicates the notation considerably. The notation with the minus sign means in this case the set of parameters with excluded.
This trick in which we can act as if a parameter mid-way is the same as a new one, is possible because of an underlying assumed property: exchangeability of observations.
In both notations we have , the distribution at the point . Here has the beautiful role of a functor which maps a function to a value. Something we will see back when we will take Hilbert spaces seriously, but that’s something for later.
If we incorporate the likelihood , we will arrive at the following formula:
Here is a normalization factor to make it a proper probability summing to one. The entity is the posterior distribution of given as prior and given observation . The integral on the right side has the imposing form of a (Lebesgue-)Stieltjes integral. The Lebesgue integral is a nice topic on its own as well, but skimming through wikipedia should be sufficient. A one-line summary: For a two-dimensional function you can see it as “horizontal” integration rather than “vertical” integration. The Stieltjes part of it generalizes integration by (linear) parts to that of integration by . If is differentiable everywhere and the derivative is continuous, it has the more familiar presentation:
There is a solid reason why this familiar notation is insufficient here. In probability theory we often have a nice cumulative distribution function . However, things become nasty for the probability density function as such. The Lebesgue measure for (the “horizontal” integration) if the distribution of is discrete becomes nonsense.
Studying this multiplicative (or convolutive) form it is clear that a conjugate pair of and is extremely convenient.
The definition of leads to Gibbs sampling by just drawing from it for all indices .
To remind you about the basics of Gibbs sampling; if we sample the conditionals and iteratively, we will get the distribution which is joint in and . So, our sampling results in with all observations. Note that Gibbs sampling allows you to forget a value when it is time to sample it again.
Using cluster indices
It is possible to calculate stuff using the parameters directly, but considering that we are here interested in clustering, it makes sense to introduce one level of indirection. In this case: cluster indices. Each parameter is paired with a cluster index . The values of the cluster indices do not matter, as long as all the parameters that belong to the same cluster are having the same value . For computational convenience a series of integers is appropriate, but it could just have as well be names from a dictionary about tropical trees.
In Neal’s paper the derivation of the following is quite brief, so let me expand here a bit. The system under consideration here is not a Dirichlet Process, but a mixture of Dirichlet distributions:
Now we have to integrate out the mixing proportions to get a formula that operates on cluster indices. Again, the number of cluster indices is exactly equal to the number of parameters and exactly equal to the number of observations.
Let us now consider the Dirichlet-Multinomial as it is often called (terrible name!). As you see from above it should be called the Dirichlet-Categorical (see for a derivation stackexchange again).
Here , and runs from to (over all observations). We will use the shorter notation as the number of cluster indices that point to the same cluster . Do not forget however that this number is a function of , or else you might think: where have all the ’s gone!? And of course the guy who uses a variable in probabilistic texts should be a -victim in Jackass. :-) I think you deserve a break on the moment:
In our case is taken to be the same for all clusters, so :
What we are after is:
Basically the means that we identified it as a variable. However, we haven’t decided yet if it is going to be a random variable or just a parameter. We can just leave it out in the next equations, just remember that the entire thing depends on in the end. Anyway, the above two functions are required to calculate the entity we are after. Using the definition of a conditional probability, we calculate the probability of a value for a specific cluster index given all other cluster indices:
I try to follow here the notation in Neal’s paper as close as possible. Do not fry me because I use here as an instance of . It ain’t pretty, I know!
And the denominator where we have one less component index to calculate with (so to sum over):
The notation counts the cluster indices as for but does not take into account . This notation is a bit awkward admittedly, because the left-hand side does not seem to contain . But it does because if you remember. The notation just becomes a bit less cluttered by writing it down like this, trust me. The division becomes:
And we are happy to see we can cross-out a lot of the terms in this fraction:
As you probably know , so it is not hard to see that just as with a factorial: . Also, , which leads to a simplification of the first fraction:
The second fraction requires splitting of the product for or, equivalently, . So, we bring the index that (amongst others) is pointed at by observation to the front:
If we think this through, we now know that does contain the cluster corresponding with the observation with index and henceforth that does not contain that cluster. Thus the following is absolutely equivalent to the above:
Thinking this through, we also realize that does indeed contain this cluster corresponding with observation , and therefore must be exactly one larger than if it would not be counting !
We can again make use of the fact:
And we can move the Gamma term with within the product again:
Let me reiterate the equation from above (because it has now scrolled off your screen except if you keep your laptop vertically - on its side - when reading this, or if you use a ridiculous high resolution):
Fill in our neat endevour with shifting in and out of the product, just to go through a Gamma function, and we arrive at:
The first time I encountered the sentence “just integrate out the mixing proportions ” I did not have a clue that a dozen steps like these were required. But, perhaps, I need them spelled out in more detail than a person with a normal brain…
The step subsequently fo contains probably also many steps, but only for a mathematical rigorous mind, not my sloppy one. If we only consider cluster , we can use the equation above and assume goes to zero. The only other assumption we need in this case, is that there is some other observation with cluster index , so .
If we consider again and now with :
I won’t say often “of course”, but here it is of course that . Then if we allow to take all possible values, in this case the sum is not one because of the “and” operation:
So, this is straightforward the number of clusters, , times the previously calculated probability for a single cluster. We arrive sweetly - without taking limits in this case - at:
We kept here explicit that we consider all possible cluster indices for by drawing it from the set . Sorry, this is not standard notation, feel free to suggest improvements using the discussion section! The reason why I explicitly add it is because might be read as the sum of the probabilities of any two cluster indices being unequal (where we do not fix to a specific cluster index). Note also that in Neal’s exposition is the last observation. The notation here does not assume that is the last observation, and hence uses rather than , but this is absolutely equivalent.
Neal’s article is again a source of insight. Make sure however to also take a look at the first people working on these problems. Antoniak for example extended Ferguson’s Dirichlet Process to the mixture model case and reading this old text might be enlightening as well. Let us recapitulate quickly what we have. We have an equation for the conditional probabilities of cluster indices given the other cluster indices for an infinite number of clusters. The equation has two parts. One in which we have a cluster index that has been encountered before:
And one part in which we have a cluster index that has not been encountered before:
The term is true for any , while is true only when all of are unequal to .
We now have to take a shortcut and bluntly introduce the Metropolis-Hastings algorithm. Why it works I would love to show another time. It is a typical Monte Carlo method driven by detailed balance. Ehrenfest, the inventer of the dog-flea model would have been proud.
To sample a density a proposal density, , is used in calculating an acceptance probability :
With probability the new state is accepted. With probability the old state is maintained.
Here we made explicit that and might be different. Moreover, the notation stands for with excluded.
First, I went on a tangent here. I thought Neal was picking the following as a transition probability:
And subsequently I started calculating this ratio, where I made mistakes as well not realizing at first that $k$ and $k^*$ do not need to be the same. However, after “breaking my head”, I finally understood that Neal proposes something much simpler:
In such a case the acceptance rule becomes indeed quite simple:
Only the likelihood is used to calculate the Metropolis-Hastings ratio.
I have to study this part later.