Sampling of the Dirichlet Process

Thousands of articles describe the use of the Dirichlet Process, but very few describe how to sample from it. Most often one is referred to Markov chain sampling methods for Dirichlet process mixture models (pdf) by Radford Neal (at the University of Toronto), which is a nice piece of work, but still a bit dense as an introduction. I contacted him by email about certain things that were difficult to understand at first and he was kind enough to respond, thanks a lot! Definitely also check out his blog in which he regularly showcases his fast version of R.

Let us introduce the Dirichlet Process with two lines:

$$G \sim \mathrm{DP}(G_0, \alpha)$$

$$\theta_i \mid G \sim G$$

First of all, in case you haven't read an article about this material in the last decades: the notation $\sim$ stands for "is distributed as" (see stackexchange).

On the first line we have defined that $G$ is distributed as a Dirichlet Process (DP) with parameters $G_0$ and $\alpha$. The parameter $G_0$ plays somewhat the role of a mean, the parameter $\alpha$ somewhat the role of a precision (inverse variance). The parameter $G_0$ is called the base distribution, the parameter $\alpha$ is called the dispersion factor.

On the second line we generate a bunch of parameters $\theta_i$ conditioned on the generated distribution $G$. A peculiar property of this distribution is that exactly the same $\theta_i$ can be sampled multiple times. Hence, the Dirichlet Process can generate discrete objects out of a continuous base distribution $G_0$.

If we have parameters $\theta_i$ we can subsequently generate observations $y_i$ as well:

$$y_i \mid \theta_i \sim F(\theta_i)$$

Here $F$ describes the mapping of parameters to observations.

We can integrate over $G$ to obtain a presentation such as:

$$\theta_i \mid \theta_1, \ldots, \theta_{i-1} \sim \frac{1}{i-1+\alpha} \sum_{j=1}^{i-1} \delta(\theta_j) + \frac{\alpha}{i-1+\alpha} G_0$$

And just to write this down in a totally equivalent notation for your convenience:

$$\theta_i \mid \theta_{-i} \sim \frac{1}{n-1+\alpha} \sum_{j \neq i} \delta(\theta_j) + \frac{\alpha}{n-1+\alpha} G_0$$

Here we have picked an index $i$ mid-way, and not a new item, which complicates the notation considerably. The notation with the minus sign, $\theta_{-i}$, means in this case the set of parameters with $\theta_i$ excluded.

This trick in which we can act as if a parameter mid-way is the same as a new one, is possible because of an underlying assumed property: exchangeability of observations.

In both notations we have $\delta(\theta_j)$, the distribution concentrated at the point $\theta_j$. Here $\delta$ has the beautiful role of a functional which maps a function to a value. Something we will see again when we take Hilbert spaces seriously, but that's for later.
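To make the sampling mechanics concrete, here is a minimal Python sketch of the predictive (Pólya urn) representation above, assuming a standard normal base distribution $G_0$ (my choice for illustration, not something fixed by the text):

```python
import random

def polya_urn_sample(n, alpha, base_draw=lambda: random.gauss(0.0, 1.0)):
    """Draw theta_1..theta_n from the Polya urn representation of a DP.

    A fresh value is drawn from the base distribution G_0 with probability
    alpha / (i - 1 + alpha); otherwise one of the earlier draws is reused,
    picked uniformly, so values can repeat (the discreteness property).
    """
    thetas = []
    for i in range(1, n + 1):
        if random.random() < alpha / (i - 1 + alpha):
            thetas.append(base_draw())            # new value from G_0
        else:
            thetas.append(random.choice(thetas))  # reuse an earlier theta_j
    return thetas

if __name__ == "__main__":
    draws = polya_urn_sample(20, alpha=1.0)
    print(len(set(draws)), "distinct values among", len(draws), "draws")
```

Running it a few times shows the ties: typically far fewer distinct values than draws, even though $G_0$ is continuous.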

If we incorporate the likelihood $F(y_i, \theta)$, we will arrive at the following formula:

$$\theta_i \mid \theta_{-i}, y_i \sim b \sum_{j \neq i} F(y_i, \theta_j)\, \delta(\theta_j) + b\, \alpha \left( \int F(y_i, \theta)\, dG_0(\theta) \right) H_i$$

Here $b$ is a normalization factor to make it a proper probability summing to one. The entity $H_i$ is the posterior distribution of $\theta$ given $G_0$ as prior and given observation $y_i$. The integral on the right side has the imposing form of a (Lebesgue-)Stieltjes integral. The Lebesgue integral is a nice topic on its own as well, but skimming through wikipedia should be sufficient. A one-line summary: for a two-dimensional function you can see it as "horizontal" integration rather than "vertical" integration. The Stieltjes part of it generalizes integration by (linear) parts $d\theta$ to that of integration by $dG_0(\theta)$. If $G_0$ is differentiable everywhere and the derivative $g_0$ is continuous, it has the more familiar presentation:

$$\int F(y_i, \theta)\, dG_0(\theta) = \int F(y_i, \theta)\, g_0(\theta)\, d\theta$$

There is a solid reason why this familiar notation is insufficient here. In probability theory we often have a nice cumulative distribution function $G_0$. However, things become nasty for the probability density function $g_0$ as such. The density $g_0$ with respect to the Lebesgue measure (the "horizontal" integration) becomes nonsense if the distribution of $\theta$ is discrete.

Studying this multiplicative (or convolutive) form, it is clear that a conjugate pair of $F$ and $G_0$ is extremely convenient.

The definition of $\theta_i \mid \theta_{-i}, y_i$ above leads to Gibbs sampling by just drawing from it for all indices $i$.

To remind you about the basics of Gibbs sampling: if we sample the conditionals $p(a \mid b)$ and $p(b \mid a)$ iteratively, we will get samples from the distribution which is joint in $a$ and $b$. So, our sampling results in draws from $p(\theta_1, \ldots, \theta_n \mid y_1, \ldots, y_n)$ with $y_1, \ldots, y_n$ all observations. Note that Gibbs sampling allows you to forget a value when it is time to sample it again.
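As a reminder of how plain Gibbs sampling behaves, here is a tiny sketch for a toy case, a zero-mean bivariate normal with correlation 0.8, where both conditionals are known exactly (the example and all numbers are my own, not from Neal):

```python
import random

def gibbs_bivariate_normal(n_samples, rho=0.8):
    """Gibbs sampling for a zero-mean bivariate normal with correlation rho.

    Both full conditionals are one-dimensional normals:
      a | b ~ N(rho * b, 1 - rho**2)   and   b | a ~ N(rho * a, 1 - rho**2).
    Alternating the two draws yields samples from the joint p(a, b).
    """
    a, b = 0.0, 0.0
    samples = []
    sd = (1.0 - rho ** 2) ** 0.5
    for _ in range(n_samples):
        a = random.gauss(rho * b, sd)  # draw a from p(a | b)
        b = random.gauss(rho * a, sd)  # draw b from p(b | a)
        samples.append((a, b))
    return samples

if __name__ == "__main__":
    pairs = gibbs_bivariate_normal(50_000)
    mean_ab = sum(a * b for a, b in pairs) / len(pairs)
    print("empirical E[a*b] ~", round(mean_ab, 2), "(should be close to 0.8)")
```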

Using cluster indices

It is possible to calculate stuff using the parameters $\theta_i$ directly, but considering that we are here interested in clustering, it makes sense to introduce one level of indirection. In this case: cluster indices. Each parameter $\theta_i$ is paired with a cluster index $c_i$. The values of the cluster indices do not matter, as long as all the parameters that belong to the same cluster have the same value $c_i$. For computational convenience a series of integers is appropriate, but it could just as well have been names from a dictionary about tropical trees.

In Neal's paper the derivation of the following is quite brief, so let me expand here a bit. The system under consideration here is not a Dirichlet Process, but a finite mixture with a Dirichlet distribution on the mixing proportions:

$$y_i \mid c_i, \phi \sim F(\phi_{c_i})$$

$$c_i \mid p \sim \mathrm{Categorical}(p_1, \ldots, p_K)$$

$$p \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$$

Now we have to integrate out the mixing proportions $p$ to get a formula that operates on cluster indices. Again, the number of cluster indices is exactly equal to the number of parameters and exactly equal to the number of observations.

Let us now consider the Dirichlet-Multinomial, as it is often called (terrible name!). As you see from the above it should rather be called the Dirichlet-Categorical (see stackexchange again for a derivation):

$$P(c_1, \ldots, c_n \mid \alpha_1, \ldots, \alpha_K) = \int P(c_1, \ldots, c_n \mid p)\, P(p \mid \alpha_1, \ldots, \alpha_K)\, dp = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\Gamma\!\left(n + \sum_k \alpha_k\right)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}$$

Here $c = (c_1, \ldots, c_n)$, and $i$ runs from $1$ to $n$ (over all observations). We will use the shorter notation $n_k$ for the number of cluster indices that point to the same cluster $k$. Do not forget however that this number is a function of $c$, or else you might think: where have all the $c_i$'s gone!? And of course the guy who uses $p$ as a variable in probabilistic texts should be a $\pi$-victim in Jackass. :-) I think you deserve a break at this moment.

In our case $\alpha_k$ is taken to be the same for all clusters, so $\alpha_k = \alpha/K$:

$$P(c_1, \ldots, c_n \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha/K)}{\Gamma(\alpha/K)}$$
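Since this marginalization is easy to get wrong, here is a small sketch that checks the closed-form Dirichlet-Categorical probability against a Monte Carlo average over mixing proportions drawn from the symmetric Dirichlet prior (the counts, $\alpha$ and $K$ below are arbitrary illustration values; numpy is assumed to be available):

```python
import math
import numpy as np

def dirichlet_categorical_logpmf(counts, alpha, K):
    """Closed form log P(c_1..c_n | alpha) under a symmetric Dirichlet(alpha/K) prior."""
    n = sum(counts)
    out = math.lgamma(alpha) - math.lgamma(n + alpha)
    for n_k in counts:
        out += math.lgamma(n_k + alpha / K) - math.lgamma(alpha / K)
    return out

def monte_carlo_estimate(counts, alpha, K, num_draws=200_000, seed=0):
    """Estimate P(c | alpha) = E_p[ prod_k p_k^{n_k} ] with p ~ Dirichlet(alpha/K, ..., alpha/K)."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet([alpha / K] * K, size=num_draws)
    return float(np.mean(np.prod(p ** np.array(counts), axis=1)))

if __name__ == "__main__":
    counts, alpha, K = [3, 1, 0], 1.5, 3   # n_k for each of the K clusters
    exact = math.exp(dirichlet_categorical_logpmf(counts, alpha, K))
    approx = monte_carlo_estimate(counts, alpha, K)
    print("closed form:", exact, " monte carlo:", approx)
```

The two numbers should agree up to Monte Carlo noise, which is a cheap sanity check on the formula above.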

What we are after is:

$$P(c_i = c, c_{-i} \mid \alpha)$$

And:

$$P(c_{-i} \mid \alpha)$$

Basically, conditioning on $\alpha$ means that we identified it as a variable. However, we haven't decided yet if it is going to be a random variable or just a parameter. We can just leave it out in the next equations; just remember that the entire thing depends on $\alpha$ in the end. Anyway, the above two functions are required to calculate the entity we are after. Using the definition of a conditional probability, we calculate the probability of a value for a specific cluster index given all other cluster indices:

$$P(c_i = c \mid c_{-i}, \alpha) = \frac{P(c_i = c, c_{-i} \mid \alpha)}{P(c_{-i} \mid \alpha)}$$

I try to follow here the notation in Neal's paper as closely as possible. Do not fry me because I use $c$ here as an instance of $c_i$. It ain't pretty, I know!

The numerator:

$$P(c_i = c, c_{-i}) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha/K)}{\Gamma(\alpha/K)}$$

And the denominator, where we have one less component index to calculate with (so to sum over):

$$P(c_{-i}) = \frac{\Gamma(\alpha)}{\Gamma(n - 1 + \alpha)} \prod_{k=1}^{K} \frac{\Gamma(n_{-i,k} + \alpha/K)}{\Gamma(\alpha/K)}$$

The notation $n_{-i,k}$ counts the cluster indices just as $n_k$ does, but does not take $c_i$ into account. This notation is a bit awkward admittedly, because the left-hand side does not seem to contain $n_{-i,k}$. But it does, because $n_{-i,k}$ is a function of $c_{-i}$, if you remember. The notation just becomes a bit less cluttered by writing it down like this, trust me. The division becomes:

$$P(c_i = c \mid c_{-i}) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha/K)}{\Gamma(\alpha/K)} \cdot \frac{\Gamma(n - 1 + \alpha)}{\Gamma(\alpha)} \prod_{k=1}^{K} \frac{\Gamma(\alpha/K)}{\Gamma(n_{-i,k} + \alpha/K)}$$

And we are happy to see we can cross out a lot of the terms in this fraction:

$$P(c_i = c \mid c_{-i}) = \frac{\Gamma(n - 1 + \alpha)}{\Gamma(n + \alpha)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha/K)}{\Gamma(n_{-i,k} + \alpha/K)}$$

As you probably know $\Gamma(x + 1) = x\,\Gamma(x)$, so it is not hard to see that, just as with a factorial, $\frac{\Gamma(x)}{\Gamma(x+1)} = \frac{1}{x}$. Also, $\Gamma(n + \alpha) = (n - 1 + \alpha)\,\Gamma(n - 1 + \alpha)$, which leads to a simplification of the first fraction:

$$P(c_i = c \mid c_{-i}) = \frac{1}{n - 1 + \alpha} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha/K)}{\Gamma(n_{-i,k} + \alpha/K)}$$

The second fraction requires splitting off the term of the product for $k = c$ or, equivalently, $k = c_i$. So, we bring the cluster index that (amongst others) is pointed at by observation $i$ to the front:

$$P(c_i = c \mid c_{-i}) = \frac{1}{n - 1 + \alpha} \cdot \frac{\Gamma(n_c + \alpha/K)}{\Gamma(n_{-i,c} + \alpha/K)} \prod_{k \neq c} \frac{\Gamma(n_k + \alpha/K)}{\Gamma(n_{-i,k} + \alpha/K)}$$

If we think this through, we now know that the term for $k = c$ does contain the cluster corresponding with the observation with index $i$, and hence that the product over $k \neq c$ does not contain that cluster. For $k \neq c$ we thus have $n_k = n_{-i,k}$, and the following is absolutely equivalent to the above:

$$P(c_i = c \mid c_{-i}) = \frac{1}{n - 1 + \alpha} \cdot \frac{\Gamma(n_c + \alpha/K)}{\Gamma(n_{-i,c} + \alpha/K)} \prod_{k \neq c} \frac{\Gamma(n_{-i,k} + \alpha/K)}{\Gamma(n_{-i,k} + \alpha/K)}$$

Thinking this through, we also realize that $n_c$ does indeed contain the cluster index corresponding with observation $i$, and therefore must be exactly one larger than $n_{-i,c}$, which does not count $c_i$: $n_c = n_{-i,c} + 1$.

We can again make use of the fact $\Gamma(x + 1) = x\,\Gamma(x)$:

$$\frac{\Gamma(n_c + \alpha/K)}{\Gamma(n_{-i,c} + \alpha/K)} = \frac{\Gamma(n_{-i,c} + 1 + \alpha/K)}{\Gamma(n_{-i,c} + \alpha/K)} = n_{-i,c} + \alpha/K$$

And we can move the Gamma term with $k = c$ back within the product again:

$$\prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha/K)}{\Gamma(n_{-i,k} + \alpha/K)} = n_{-i,c} + \alpha/K$$

Let me reiterate the equation from above (because it has by now scrolled off your screen, except if you keep your laptop vertically - on its side - when reading this, or if you use a ridiculously high resolution):

$$P(c_i = c \mid c_{-i}) = \frac{1}{n - 1 + \alpha} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha/K)}{\Gamma(n_{-i,k} + \alpha/K)}$$

Fill in our neat endeavour with shifting in and out of the product, just to go through a Gamma function, and we arrive at:

$$P(c_i = c \mid c_{-i}) = \frac{n_{-i,c} + \alpha/K}{n - 1 + \alpha}$$

The first time I encountered the sentence "just integrate out the mixing proportions $p$" I did not have a clue that a dozen steps like these were required. But, perhaps, I need them spelled out in more detail than a person with a normal brain…

The step that follows probably also contains many steps, but only for a mathematically rigorous mind, not my sloppy one. If we only consider cluster $c$, we can use the equation above and assume $\alpha/K$ goes to zero. The only other assumption we need in this case is that there is some other observation with cluster index $c$, so $n_{-i,c} > 0$:

$$P(c_i = c \mid c_{-i}) = \frac{n_{-i,c} + \alpha/K}{n - 1 + \alpha} \rightarrow \frac{n_{-i,c}}{n - 1 + \alpha}$$

Or, equivalently:

If we consider $P(c_i = c \mid c_{-i})$ again and now with $n_{-i,c} = 0$:

$$P(c_i = c \mid c_{-i}) = \frac{\alpha/K}{n - 1 + \alpha}$$

I won't say "of course" often, but here it is of course the case that this probability is the same for every such cluster $c$. Then, if we allow $c_i$ to take all possible values, in this case the sum is not one because of the "and" operation:

$$P(c_i \neq c_j \text{ for all } j \neq i \mid c_{-i}) = \sum_{c\,:\,n_{-i,c} = 0} \frac{\alpha/K}{n - 1 + \alpha}$$

So, this is straightforwardly the number of clusters, $K$, times the previously calculated probability for a single cluster. We arrive sweetly - without taking limits in this case - at:

$$P(c_i \neq c_j \text{ for all } j \neq i \mid c_{-i}) = K \cdot \frac{\alpha/K}{n - 1 + \alpha} = \frac{\alpha}{n - 1 + \alpha}$$

We kept it explicit here that we consider all possible cluster indices for $c_i$ by drawing it from the set $\{1, \ldots, K\}$. Sorry, this is not standard notation, feel free to suggest improvements using the discussion section! The reason why I explicitly add it is because $P(c_i \neq c_j \text{ for all } j \neq i \mid c_{-i})$ might be read as the sum of the probabilities of any two cluster indices being unequal (where we do not fix $c_i$ to a specific cluster index). Note also that in Neal's exposition $i$ is the last observation. The notation here does not assume that $i$ is the last observation, and hence uses $c_{-i}$ rather than $c_1, \ldots, c_{i-1}$, but this is absolutely equivalent.
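Put together, the two cases can be computed with a few lines of Python. The sketch below returns the conditional prior over $c_i$ given the other indices, following the two formulas above (the helper name and example values are mine):

```python
from collections import Counter

def crp_conditional(c_minus_i, alpha):
    """Conditional prior over the cluster index c_i given all other indices.

    Returns (probability per represented cluster, probability of a new cluster):
    n_{-i,c} / (n - 1 + alpha) for represented clusters and
    alpha / (n - 1 + alpha) for an unrepresented one.
    """
    n = len(c_minus_i) + 1                      # total number of observations
    counts = Counter(c_minus_i)                 # n_{-i,c} for every represented c
    denom = n - 1 + alpha
    existing = {c: n_c / denom for c, n_c in counts.items()}
    new_cluster = alpha / denom
    return existing, new_cluster

if __name__ == "__main__":
    existing, new = crp_conditional([0, 0, 1, 2, 2, 2], alpha=1.0)
    print(existing, new, "sum:", sum(existing.values()) + new)
```

The printed sum is 1, which is a quick check that the two parts together form a proper distribution over the choice "join an existing cluster or open a new one".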

Nonconjugate priors

Neal's article is again a source of insight. Make sure however to also take a look at the first people working on these problems. Antoniak, for example, extended Ferguson's Dirichlet Process to the mixture model case, and reading this old text might be enlightening as well. Let us recapitulate quickly what we have. We have an equation for the conditional probabilities of cluster indices given the other cluster indices for an infinite number of clusters. The equation has two parts. One in which we have a cluster index that has been encountered before:

$$P(c_i = c \mid c_{-i}) = \frac{n_{-i,c}}{n - 1 + \alpha}$$

And one part in which we have a cluster index that has not been encountered before:

$$P(c_i \neq c_j \text{ for all } j \neq i \mid c_{-i}) = \frac{\alpha}{n - 1 + \alpha}$$

The term $c_i = c$ is true for any $c$ that equals one of the $c_j$ with $j \neq i$, while $c_i \neq c_j \text{ for all } j \neq i$ is true only when all of the $c_j$ with $j \neq i$ are unequal to $c_i$.

We now have to take a shortcut and bluntly introduce the Metropolis-Hastings algorithm. Why it works I would love to show another time. It is a typical Monte Carlo method driven by detailed balance. Ehrenfest, the inventor of the dog-flea model, would have been proud.

To sample a density $P(c_i \mid c_{-i}, y_i)$, a proposal density $g(c_i^* \mid c_i)$ is used in calculating an acceptance probability $a(c_i^*, c_i)$:

$$a(c_i^*, c_i) = \min\left[1, \frac{P(c_i^* \mid c_{-i}, y_i)\, g(c_i \mid c_i^*)}{P(c_i \mid c_{-i}, y_i)\, g(c_i^* \mid c_i)}\right]$$

With probability $a(c_i^*, c_i)$ the new state $c_i^*$ is accepted. With probability $1 - a(c_i^*, c_i)$ the old state $c_i$ is maintained.

Here we made explicit that $g(c_i^* \mid c_i)$ and $g(c_i \mid c_i^*)$ might be different. Moreover, the notation $c_{-i}$ stands for $c$ with $c_i$ excluded.

First, I went on a tangent here. I thought Neal was picking the following as a transition probability:

$$g(c_i^* \mid c_i) = P(c_i^* \mid c_{-i}, y_i)$$

And subsequently I started calculating this ratio, where I made mistakes as well, not realizing at first that $k$ and $k^*$ do not need to be the same. However, after "breaking my head", I finally understood that Neal proposes something much simpler:

$$g(c_i^* \mid c_i) = P(c_i^* \mid c_{-i})$$

In such a case the acceptance rule becomes indeed quite simple:

$$a(c_i^*, c_i) = \min\left[1, \frac{F(y_i, \phi_{c_i^*})}{F(y_i, \phi_{c_i})}\right]$$

Only the likelihood $F(y_i, \phi_{c_i})$ is used to calculate the Metropolis-Hastings ratio.
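A sketch of how such an update could look in code is given below. The proposal is drawn from the conditional prior (the Pólya urn over $c_{-i}$), a new cluster gets a fresh parameter from $G_0$, and the move is accepted with the likelihood ratio. The Gaussian observation model, the unit variances and all concrete values are assumptions of mine for illustration, not Neal's choices:

```python
import math
import random
from collections import Counter

def gaussian_likelihood(y, phi_c, sigma=1.0):
    """F(y_i, phi_c): a hypothetical Gaussian observation model with unit variance."""
    return math.exp(-0.5 * ((y - phi_c) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mh_update_cluster_index(i, y, c, phi, alpha, base_draw=lambda: random.gauss(0.0, 1.0)):
    """One Metropolis-Hastings update of c[i], proposing from the conditional prior.

    Because the proposal g(c_i* | c_i) = P(c_i* | c_{-i}) cancels against the prior in
    the MH ratio, only the likelihood ratio F(y_i, phi[c*]) / F(y_i, phi[c_i]) remains.
    """
    c_minus_i = [c[j] for j in range(len(c)) if j != i]
    counts = Counter(c_minus_i)
    denom = len(c_minus_i) + alpha
    # Draw the proposal c* from the conditional prior (Polya urn over c_{-i}).
    u, acc, proposal = random.random() * denom, 0.0, None
    for label, n_c in counts.items():
        acc += n_c
        if u < acc:
            proposal = label
            break
    if proposal is None:                      # new cluster: draw a fresh phi from G_0
        proposal = max(phi.keys(), default=-1) + 1
        phi[proposal] = base_draw()
    ratio = gaussian_likelihood(y[i], phi[proposal]) / gaussian_likelihood(y[i], phi[c[i]])
    if random.random() < min(1.0, ratio):     # accept with the likelihood ratio
        c[i] = proposal
    return c

if __name__ == "__main__":
    y = [0.1, 0.2, 3.9, 4.1]
    c = [0, 0, 0, 1]
    phi = {0: 0.0, 1: 4.0}
    for _ in range(100):
        for i in range(len(y)):
            c = mh_update_cluster_index(i, y, c, phi, alpha=1.0)
    print(c)
```

Note that this sketch never cleans up cluster parameters that have become unused; a real sampler would also resample the $\phi$'s given the current assignments.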

Auxiliary parameters

I have to study this part later.

Bayesian Approximation

In the world of Bayesians, a model is a triplet $p(x,z,\theta)$. The observations are random variables $x$. And then there is a seemingly artificial distinction between the random variables that are called hidden ($z$) and other random variables that are called parameters ($\theta$), which are hidden as well! So, how come parameters got their own distinguished name? In the case of, for example, a clustering task, we can assign each observation a corresponding hidden variable: an index of the cluster it belongs to. Hence, there are as many hidden variables as there are observations. Now, in contrast, we might define parameters in two different ways:

  • The number of parameters is fixed. The parameters are defined as that part of the model that is not adjusted with respect to the number of observations. In $k$-means there are $k$ means that need to be estimated. No more means need to be estimated if the number of observations grows.
  • Parameters are more variable than the hidden variables. There are not just as many parameters as observations; no, their number needs to be derived from the data. In nonparametric Bayesian methods there will be more clusters (means) when the number of observations grows.

Consider a dataset $D$; we would like to get our hands on $p(D,z,\theta)$. If this all concerns discrete variables, this is a 3D 'matrix' (a tensor of rank 3) with the observations along one dimension, the hidden variables along another dimension, and the parameters along the third dimension. The joint distribution $p(D,z,\theta)$ would tell us everything. Some examples where we either marginalize out a variable or use the definition of a conditional probability:

To get to the matrix $p(D,z)$ we sum over all entries in the rank-3 matrix in the direction of the dimension $\theta$: $p(D,z) = \sum_\theta p(D,z,\theta)$. In other words, we are not interested anymore in what kind of fine-grained information $\theta$ brings to bear. We are only interested in how often $D$ occurs together with $z$.

The problem is, we often do not have $p(D,z,\theta)$. Now, suppose we have instead all of the following conditional distributions, $p(D|z,\theta)$, $p(z|\theta,D)$, and $p(\theta|D,z)$; would we have enough information?

The answer is a resounding yes. If you have all conditional distributions amongst a set of random variables you can reconstruct the joint distribution. It always helps to experiment with small discrete conditional and joint probability tables. In this question on https://stats.stackexchange.com you see how I construct a joint probability table from the corresponding conditional probability tables. Note the word corresponding: as Lucas points out in his answer there are problems with compatibility. If the conditional probabilities correspond to a joint probability, the latter can be found. However, if you come up with random conditional probability tables it is very likely that the joint probability table does not exist.
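A tiny numerical sketch of this claim: start from a (hypothetical) joint table over two binary variables, derive its conditionals, then run Gibbs sampling that only ever touches the conditionals, and compare the empirical joint with the original (all numbers below are made up for illustration):

```python
import random
from collections import Counter

# A small (hypothetical) joint distribution over two binary variables.
joint = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.40, (1, 1): 0.20}

def conditional_a_given_b(b):
    """p(a | b) derived from the joint table."""
    norm = joint[(0, b)] + joint[(1, b)]
    return [joint[(0, b)] / norm, joint[(1, b)] / norm]

def conditional_b_given_a(a):
    """p(b | a) derived from the joint table."""
    norm = joint[(a, 0)] + joint[(a, 1)]
    return [joint[(a, 0)] / norm, joint[(a, 1)] / norm]

def gibbs(n_samples, seed=0):
    """Sample the joint using only the two conditionals."""
    random.seed(seed)
    a, b = 0, 0
    counts = Counter()
    for _ in range(n_samples):
        a = 0 if random.random() < conditional_a_given_b(b)[0] else 1
        b = 0 if random.random() < conditional_b_given_a(a)[0] else 1
        counts[(a, b)] += 1
    return {k: v / n_samples for k, v in counts.items()}

if __name__ == "__main__":
    empirical = gibbs(200_000)
    for key in sorted(joint):
        print(key, "true:", joint[key], "gibbs:", round(empirical.get(key, 0.0), 3))
```

Because these conditionals were derived from an actual joint, they are compatible by construction, and the empirical table converges to the original one.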

The paper that I found very clarifying with respect to the different ways to do approximations to the full Bayesian method is Bayesian K-Means as a “Maximization-Expectation” Algorithm (pdf) by Welling and Kurihara.

So, let us assume we are given a dataset $D$ (hence $p(D)$) and we want to find $p(z,\theta|D)$.

1. Gibbs sampling

We can approximate $p(z,\theta|D)$ by sampling. Gibbs sampling gives $p(z,\theta|D)$ by alternating:

$$\theta \sim p(\theta \mid z, D) \qquad z \sim p(z \mid \theta, D)$$

2. Variational Bayes

Instead, we can try to establish distributions $q(\theta)$ and $q(z)$ and minimize their difference with the distribution we are after, $p(\theta,z|D)$. The difference between distributions has a convenient fancy name, the Kullback-Leibler divergence. To minimize $KL[q(\theta)q(z)||p(\theta,z|D)]$ we update:

$$q(\theta) \propto \exp\left( E_{q(z)}[\log p(D, z, \theta)] \right) \qquad q(z) \propto \exp\left( E_{q(\theta)}[\log p(D, z, \theta)] \right)$$

This type of variational Bayes that uses the Kullback-Leibler divergence is also known as mean-field variational Bayes.

3. Expectation-Maximization

To come up with full-fledged probability distributions for both $z$ and $\theta$ might be considered extreme. Why don't we instead consider a point estimate for one of them and keep the other nice and nuanced? In Expectation-Maximization the parameter $\theta$ is established as the one unworthy of a full distribution, and set to its MAP (Maximum A Posteriori) value, $\theta^*$:

$$\theta^* = \operatorname{argmax}_{\theta} E_{q(z)}[\log p(D, z, \theta)] \qquad q(z) = p(z \mid D, \theta^*)$$

Here $\theta^* \in \operatorname{argmax}$ would actually be a better notation: the argmax operator can return multiple values. But let’s not nitpick. Compared to variational Bayes you see that correcting for the $\log$ by $\exp$ doesn’t change the result, so that is not necessary anymore.
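For concreteness, here is a minimal EM sketch for a mixture of two unit-variance Gaussians in one dimension, with a flat prior so that the MAP estimate coincides with maximum likelihood; the fixed 0.5 mixing weights and all data values are my own simplifications for illustration:

```python
import math
import random

def em_two_gaussians(data, iters=50):
    """Minimal EM for a 1-D mixture of two unit-variance Gaussians.

    E-step: q(z_i) is the full posterior responsibility of each component.
    M-step: the means (the 'theta') are replaced by point estimates.
    Mixing weights are kept fixed at 0.5 to keep the sketch short.
    """
    mu = [min(data), max(data)]                 # crude initialisation
    for _ in range(iters):
        # E-step: responsibilities r_i = q(z_i = 0)
        r = []
        for x in data:
            w0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            w1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            r.append(w0 / (w0 + w1))
        # M-step: point estimates of the means, weighted by the responsibilities
        mu[0] = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu[1] = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu

if __name__ == "__main__":
    random.seed(1)
    data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
           [random.gauss(4.0, 1.0) for _ in range(200)]
    print("estimated means:", em_two_gaussians(data))
```

The structure mirrors the description above: $z$ keeps a full distribution (the responsibilities), while $\theta$ (the means) is collapsed to a point estimate in every iteration.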

4. Maximization-Expectation

There is no reason to treat $z$ as a spoiled child. We can just as well use point estimates $z^*$ for our hidden variables and give the parameters $\theta$ the luxury of a full distribution:

$$z^* = \operatorname{argmax}_{z} E_{q(\theta)}[\log p(D, z, \theta)] \qquad q(\theta) = p(\theta \mid D, z^*)$$

If our hidden variables $z$ are indicator variables, we suddenly have a computationally cheap method to perform inference on the number of clusters. This is, in other words, model selection (or automatic relevance determination, or imagine another fancy name).

5. Iterated conditional modes

Of course, the poster child of approximate inference is to use point estimates for both the parameters $\theta$ and the hidden variables $z$:

$$\theta^* = \operatorname{argmax}_{\theta}\, p(\theta \mid D, z^*) \qquad z^* = \operatorname{argmax}_{z}\, p(z \mid D, \theta^*)$$

To see how Maximization-Expectation plays out I highly recommend the article. In my opinion, the strength of this article is however not the application to a $k$-means alternative, but its lucid and concise exposition of the approximation schemes.

Where Are the Aliens? Where Are the Dragons?

Where are the aliens? Where are the dragons?

One night I was lying down staring at the stars and it dawned upon me that I was not alone. I was only one pair of the many alien eyes. Just like them I was figuring out if my god existed. I felt part of this cosmic family more than anything before. Something bigger than our soccer team, our continental heritage, or our worldwide scientific efforts. All these eyes… The universe becoming aware of itself.

Where are the aliens?

But where are they? A question asked many times before (see the Fermi paradox). I would like to explore two options here. The first is actually pretty obvious, and the one I like most. So, the thesis is that the earth is not rare, and that there exist other civilizations in our own galaxy. The Milky Way is around 100,000 light years in diameter. Physically populating our galaxy would take, if we consider moving 1000 times slower than the speed of light, only 100 million years. Considering that there are many planets billions of years older than earth, that must have happened already many times over. So, why don't we see aliens or million-year-old alien artifacts on earth? There might be a very simple answer: there is a cosmic communication platform. There is a way to communicate between intelligent species in which cosmic matters such as expansion to other galaxies are discussed. Species negotiate with each other about how many resources are required and where they are allowed to go. They are careful with inter-world interactions because of viruses and other undesirable side effects. Inevitably, a civilization discovers this communication structure before it is intellectually capable of physical expansion to other galaxies.

So, it's up to us to find out how they are communicating! If there were a faster-than-light communication method, how would it be implemented?! Has there been a physical coverage of quantum-coupled atoms all over the galaxy, and do we have to phone in by finding these atoms? How would they have labelled these atoms, so we can find them easily?

There is another explanation, which is less exciting, but stems from my experience with evolutionary algorithms.

Where are the dragons?

Alien dragons? © 2012 Fantasy World, Inc

Evolutionary algorithms seem to have glass ceilings, just as in the job market. This holds even for highly dynamic variants such as artificial gene regulatory networks (Bongard) or neuroevolution of augmenting topologies (Stanley), and I went even further (from evo-devo, to eco-devo, to epigenetics) in incorporating body changes in response to perceptual phenomena (pdf), as required in the evolution of circadian rhythms. Although interesting light-dependent metamorphosis was developed in a simulator, it became quite obvious to me that it is very hard to scale these things. It is very well known in machine learning that local optima can be very troubling for an algorithm, and evolutionary algorithms, although versatile, are no exception.

Why this long prelude? We have to search over all possible evolutionary trajectories not so much for paths that would lead to our type of intelligence, but for paths that lead to local optima from which it is hard to escape. With our human dexterity, we became quite proficient tool users (as did crows) and, very importantly, we got fire under our control. But what if evolution "got there first"!? Our stomachs have evolved to withstand their acidic content. Perhaps it wouldn't have been such a stretch to have evolved a small biological oven as an organ, burning all kinds of material that is normally hard to chew (for us mere mortals). Cooling systems are required of course, but those have evolved multiple times over anyway. A newborn would need to get fire from its mother, but that seems a pretty easy procedure compared to feeding milk through specially developed organs for many weeks or months.

It might be that this is actually the normal way for evolution to proceed. And then? We have fire-breathing creatures on this alien planet. Would it mean that tool use would get stalled? Or would dragons build spaceships as well in the end? Who knows? Perhaps all aliens are dragons…

Google Glass - Translating Cat Meows

Interesting applications of Google Glass? I have encountered very few so far. I think some creative minds have to sit together and go for it! Translations of foreign languages, reading out loud for blind people, or the illiterate. Sure, two minutes of a creative session under the shower, and you will come up with such ideas. But what's next? Do we really need to translate all people around us? There are so many annoying conversations! Perhaps the glass can assemble them into a nice creative story, or a poem! And of course, there is no reason to only use human input. A sound from an animal can be directly translated into a warm male or female voice. The barks of your dog become "Hey! I see someone I don't recognize!", or "Dude, I am so hungry!".

But what about inanimate matter! Why wouldn't your plant be able to talk to you!? It will kindly remind you that it's quite thirsty. The glass will swiftly translate its message to you through bone conduction. Of course, as soon as everything is able to talk to you, you will want to block things out. Or else, you will get advertisements from every product on your table until you have to be carried away to the madhouse.

A totally different type of game will arise. Don't be surprised if, in the future, there will be a discussion about violent games in which you can virtually kill people with a virtual ax or a virtual chainsaw. A seemingly innocent person with a glass looking at you might be virtually heaving her ax. You might never know!

Do you have a crazy idea!? Share it with the world!

Our Future

Black Mirror: the first television series that really describes the future. The near future, a black future.

The future we have to prevent. It is easy to think of the future as science fiction. Monsters from other planets, people with unusual mutations, worldwide disasters. But the future is us. What do we want? What do we need?

Credit

Your online credibility is independent of your offline credibility

We are important, we are people. I have an identity. You can be trusted. The stakes of online transactions will go up. Buying houses online, investing online, dating online. My prediction is that there will be a company in the future, "Identity Inc.", which delivers services around identity. Being able to guarantee that you are who you say you are online will be incredibly important. Commitment towards providing people with the means to see who you are will give you power. A company that can provide online privacy while at the same time providing online credibility will prosper. Currently, credibility is an opaque concept, conflated with financial credibility. We all know what happens if bankers are allowed to define credibility. In the end, a transparent system will win. A system in which you can be anonymous when you want, and prove who you are when you want.

Money

He who does not honour the small is not worthy of the great (Dutch proverb: "Wie het kleine niet eert, is het grote niet weerd")

You and I buy potatoes, bread, coffee, newspapers and the like every day. And we pay a lot for these things! Counting in whole dollar cents, euro cents, or the exotic currencies of the future. However, online services we want for free. Their real cost might be a fraction of a potato, but we want them for free. Why is that? Do we really not want to pay 10 cents per month for our Google searches? Actually, I don't think so. It is the hassle of micro-payments that keeps us from paying for these things. We now pay a few bucks for a game from an app store. Why? Because it is made easy. Just as micro-loans open up opportunities in developing countries, there are immense opportunities for the people who crack the problem of micro-payments and micro-subscriptions. If money has to flow, if liquidity is important, this is the way to go. The future lies in the miniaturization of our money.

Clothes

Nobody “wears” their nails

Naked. We are alone. Till now. This will change. As humans, we have been good at decorating ourselves with eye extensions, skin extensions, hair extensions. We came up with special words to describe them: glasses, clothes, hats. Our technology however hasn't been organic enough to seamlessly integrate with us. When we sleep, when we shower, when we do sports, we get rid of it. Google Glass is one example of technology that wants to be more integrated with us. But it is a clumsy attempt. The struggle for the always-there device is only starting right now. What will win? It can be electronic lenses, piercings, underwear, jewelry, fingernails, wigs, artificial teeth, but in the end, it will be more organic than any of these. It will be beautiful, it will be us. The modern human is born.