# Kullback-Leibler divergence

In contrastive divergence the Kullback-Leibler divergence (KL-divergence) between the data distribution and the model distribution is minimized (here we assume $x$ to be discrete):

$$D_{KL}(P_0(x) \mid\mid P(x \mid W)) = \sum_x P_0(x) \log \frac{P_0(x)}{P(x \mid W)}$$

Here $P_0(x)$ is the observed data distribution, $P(x\mid W)$ is the model distribution, and $W$ are the model parameters. A divergence is a fancy term for something that resembles a metric distance. It is not an actual metric because the divergence of $x$ given $y$ can be (and often is) different from the divergence of $y$ given $x$. The Kullback-Leibler divergence $D_{KL}(P \mid \mid Q)$ exists only if $Q(x) = 0$ implies $P(x) = 0$.
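
To make the asymmetry concrete, here is a minimal Python sketch (the two distributions are made up for illustration):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(P || Q) for discrete distributions.
    Only defined when q[i] == 0 implies p[i] == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(kl(p, q))  # ~0.232
print(kl(q, p))  # ~0.315: a different number, so not a metric
```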

The model distribution can be written in the form of a normalized energy function:

$$P(x \mid W) = \frac{e^{-E(x; W)}}{Z(W)}$$

The partition function can be written as the sum over all states:

$$Z(W) = \sum_x e^{-E(x; W)}$$

The gradient descent and gradient ascent update rules are, respectively:

$$W_{t+1} = W_t - \lambda \nabla f(W_t)$$

$$W_{t+1} = W_t + \lambda \nabla f(W_t)$$

In both cases $\lambda$ is a predefined parameter. It can be constant, but in learning methods it can also be a function called the learning rate. The parameter $\lambda$ might depend on time $t$.

For both gradient descent and gradient ascent, $W_{t+1} - W_t = 0$ means that $\nabla f(W_t) = 0$. Descending a slope up to a zero gradient leads to a minimum if there is one; ascending a slope up to a zero gradient leads to a maximum if there is one. The extremum found is not necessarily unique, unless the function is convex (for descent), respectively concave (for ascent).
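
A minimal Python sketch of gradient descent finding such a zero-gradient point (the quadratic and the learning rate are illustrative):

```python
def grad(w):
    """Gradient of the convex function f(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

w, lam = 0.0, 0.1  # lambda, here a constant learning rate
for t in range(200):
    w = w - lam * grad(w)  # W_{t+1} = W_t - lambda * grad f(W_t)

print(w)  # ~3.0, the unique minimum, where the gradient vanishes
```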

## Gradient descent of the KL-divergence

Below you will find a step-by-step derivation of gradient descent for the KL-divergence. Since the divergence needs to be minimized, we indeed need gradient descent (not ascent). Formally, we have to calculate:

$$W^* = \arg \min_W D_{KL}(P_0(x) \mid\mid P(x \mid W))$$

### KL-divergence parts that depend on $W$

We are gonna rewrite this equation in a way relevant to taking a derivative: (1) reorganize the equation such that the terms not involving $W$ are separate terms, (2) use log identities to write it as a sum of terms, and (3) remove the terms not involving $W$.

Hence, first, let us rewrite the divergence to obtain separate terms that do and do not involve $W$. To this end we substitute the energy form of $P(x \mid W)$:

$$D_{KL}(P_0(x) \mid\mid P(x \mid W)) = \sum_x P_0(x) \log P_0(x) - \sum_x P_0(x) \log \frac{e^{-E(x; W)}}{Z(W)}$$

Second, use the identity $\log ab = \log a + \log b$ to reach a sum of terms:

$$D_{KL}(P_0(x) \mid\mid P(x \mid W)) = \sum_x P_0(x) \log P_0(x) + \sum_x P_0(x) E(x; W) + \log Z(W)$$

Here we used $\sum_x P_0(x) = 1$ for the last term.

Third, get rid of the first term, which does not depend on $W$. The part relevant to our derivative is:

$$\sum_x P_0(x) E(x; W) + \log Z(W)$$

In “On Contrastive Divergence Learning” by Carreira-Perpiñán and Hinton (proceedings AISTATS 2005) this is written as the log-likelihood objective:

$$\ell(W; \mathcal{X}) = \frac{1}{N} \sum_{n=1}^N \log P(x_n \mid W) = -\frac{1}{N} \sum_{n=1}^N E(x_n; W) - \log Z(W)$$

Note that there is a negative sign here: the maximum of the log-likelihood coincides with the minimum of the KL-divergence.

### The gradient of the KL-divergence

Taking the gradient with respect to $W$ (we can safely omit the term that does not depend on $W$):

$$\nabla_W D_{KL} = \sum_x P_0(x) \nabla_W E(x; W) + \nabla_W \log Z(W)$$

Recall the derivative of a logarithm:

$$\frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)}$$

Take the derivative of the logarithm of the partition function:

$$\nabla_W \log Z(W) = \frac{\nabla_W Z(W)}{Z(W)}$$

The derivative of the partition function:

Recall the derivative of an exponential function:

$$\frac{d}{dx} e^{f(x)} = f'(x) \, e^{f(x)}$$

Use this for the partition function derivative:

$$\nabla_W Z(W) = \sum_x \nabla_W e^{-E(x; W)} = -\sum_x \nabla_W E(x; W) \, e^{-E(x; W)}$$

Hence:

$$\nabla_W \log Z(W) = -\frac{1}{Z(W)} \sum_x \nabla_W E(x; W) \, e^{-E(x; W)}$$

Using $P(x \mid W)$:

$$\nabla_W \log Z(W) = -\sum_x P(x \mid W) \, \nabla_W E(x; W)$$

Again, the gradient of the divergence was:

$$\nabla_W D_{KL} = \sum_x P_0(x) \nabla_W E(x; W) + \nabla_W \log Z(W)$$

Hence:

$$\nabla_W D_{KL} = \sum_x P_0(x) \nabla_W E(x; W) - \sum_x P(x \mid W) \nabla_W E(x; W)$$

Compare with Hinton, who writes the expectations under the data and the model distribution with angle brackets:

$$\nabla_W D_{KL} = \left\langle \nabla_W E(x; W) \right\rangle_{P_0} - \left\langle \nabla_W E(x; W) \right\rangle_{P(\cdot \mid W)}$$

Thus, the gradient descent update becomes:

$$W_{t+1} = W_t - \lambda \left( \left\langle \nabla_W E(x; W_t) \right\rangle_{P_0} - \left\langle \nabla_W E(x; W_t) \right\rangle_{P(\cdot \mid W_t)} \right)$$

We arrived at a formulation of the minimization of the KL-divergence that allows comparing it with contrastive divergence.

# Contrastive divergence

Contrastive divergence uses a different (empirical) distribution to get rid of $P(x \mid W)$:

$$W_{t+1} = W_t - \lambda \left( \left\langle \nabla_W E(x; W_t) \right\rangle_{P_0} - \left\langle \nabla_W E(x; W_t) \right\rangle_{P_n} \right)$$

Here $P_n$ is the distribution of the samples obtained after running $n$ steps of a Markov chain (for example Gibbs sampling) started from the data.
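
To see why replacing the model expectation by samples is attractive, here is a toy Python sketch with a hypothetical energy $E(x; w) = w x$ over four states, so that all expectations can be enumerated exactly. Real contrastive divergence obtains its samples from a few Gibbs steps started at the data; here we sample the model directly, since the toy model is tractable:

```python
import math, random

random.seed(0)

# toy model over x in {0, 1, 2, 3} with (hypothetical) energy E(x; w) = w * x
STATES = [0, 1, 2, 3]
data = [0, 0, 1, 2, 1, 0, 1, 3, 0, 1]  # samples from P_0

def probs(w):
    """Model distribution P(x | w) = exp(-E(x; w)) / Z(w)."""
    z = sum(math.exp(-w * x) for x in STATES)
    return [math.exp(-w * x) / z for x in STATES]

def exact_grad(w):
    """Exact KL gradient: <dE/dw>_data - <dE/dw>_model, with dE/dw = x."""
    model_mean = sum(x * p for x, p in zip(STATES, probs(w)))
    return sum(data) / len(data) - model_mean

def sampled_grad(w, n=5000):
    """Replace the intractable model expectation by a sample average."""
    samples = random.choices(STATES, weights=probs(w), k=n)
    return sum(data) / len(data) - sum(samples) / n

w = 0.0
for _ in range(500):
    w -= 0.1 * exact_grad(w)  # plain gradient descent on the KL-divergence

print(exact_grad(w))    # ~0: the model mean now matches the data mean
print(sampled_grad(w))  # close to 0 as well, without summing over all states
```

The point is that for models with many variables the sum over all states in the partition function is intractable, while a handful of sampling steps is cheap.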

## Yoga 900 on Linux

The Yoga 900 is a beautiful machine that has a considerably long battery lifetime and can be folded such that it functions as a tablet. The Yoga arrived on Friday and the entire Crownstone team was enjoying how it came out of the box: it lifts up! If you’re creating your own hardware you suddenly appreciate how other people pay attention to packaging!

Today, Saturday, I have to run Linux on it! First, I thought to resize the Windows partition in Windows itself, but it decided that there were unmovable files somewhere at the end of the partition. Not nice, so just running gparted from a USB stick running Ubuntu it is.

The beta version of Ubuntu 16.04 (Xenial Xerus) is out, so time to try that out! Getting it to boot from USB was a bit cumbersome. On the Yoga 900 the function keys can only be reached through pressing Fn simultaneously. After trying a few times with F12 and Fn-F12 during booting I finally got into the boot menu.

When running from the stick, in the end I decided to disable internet and third-party repositories as well. If I didn’t do this I ran into a problem:

```
ubi-prepare failed with exit code 255
use of uninitialized value in concatenation
```


Hence, I just installed it without anything else enabled, disregarding the online posts that told me that I needed third-party repositories to get Wifi working, etc. The Windows partition I shrunk to around 37 GB, giving it 8 GB of space on top of the 29 GB it was already sucking up. Around 20 GB went to the root partition, 4 GB to swap at the end, and the rest to the home partition. Fingers crossed, I decided to put the boot loader on /dev/sda (the Windows bootloader is on /dev/sda1 on this machine).

Everything went smoothly!! No custom kernels needed. Wifi works out of the box. Bluetooth works out of the box. The touchpad works out of the box. The touchscreen works out of the box. Even if I fold the device to use it as a tablet the keyboard is automatically switched off.

There are a few things I’ve to figure out. If someone else did, please drop me a message!

• After suspend (by closing and opening the lid) the touchpad stops working.
• After suspend the Wifi connection drops.
• The touchpad doesn’t stop working in tablet mode (only the keyboard does).
• On entering a textbox in tablet mode, there is not automatically a virtual keyboard popping up.
• In tablet mode, the orientation is not automatically adjusted to upside-down in V-mode or portrait in tablet-mode.
• The F6 function key does not disable the touchpad.

The first issue can be temporarily resolved by going to a terminal (Ctrl+Alt+Fn+F1) and back (Ctrl+Alt+Fn+F7). The second issue can be resolved by restarting the network manager: `sudo service network-manager restart`. I’m pretty sure these issues will be worked out.

A super nice laptop, I’m super happy!

## A Bathtub in Your Autonomous Car

Will you have a bathtub in your autonomous car?

According to many the future is a socialist paradise. The autonomous car will change everything! We will be car sharing. We can change parking lots into a lot of parks!

## Blame the humans

Let us put aside the technical difficulties in developing autonomous cars. It might take many more years than currently predicted by the new players in this old industry. For example, Sebastian Thrun recently told us in a lecture at Delft that his cars are more careful than humans by design and hence safer. However, there are grounds to expect that being more aggressive is safer in certain circumstances! Going over the speed limit when you have to pass a car. Speeding up considerably before merging into a fast moving lane on the highway. Can this combination of “aggression” and trust in other drivers be learnt? Or should humans be the ones blamed for slamming into unpredictable autonomous cars!? Anyway, let’s assume these are all minor tweaks that don’t require any form of the procedural and contextual intelligence that we possess as humans. We will have these autonomous cars in 2020, everything is fancy, and humans can be blamed for all accidents.

## So undesirable that we don’t want to own it?

Why on earth would this type of autonomous machine, this type of robot, suddenly be so undesirable that we don’t want to own it?! Do you really think we are gonna share such precious tech with others? As long as these cars are not autonomous enough to refuse to drive you, you will want to own them!

Apart from the human desire to own something, there is also a very practical aspect to ownership; namely, having something that is tailor-made, that is personal and according to your individual desires. Don’t be stuck in your ideas about what the inside of a car looks like. Imagine what you would like to have in your car:

• Would you like it to have a shower, a bathroom, a bathtub, a jacuzzi?
• Would you like it to have a bar, a fridge?
• Would you like it to have a library, a surround movie screen?
• Would you like it to have a closet with your clothes, a couch, a bed?

## Our car is our home, even more so!

We all live in our homes. We are not nomads who share homes on a large scale. Only occasionally do we rent a hotel room, a bed and breakfast, or surf a couch. Our homes are big boxes in which we collect things with which we feel comfortable. An autonomous car means that we do not drive it; we just live in it as a mobile box. We have to think of other things to do in that box!

This thing, this box will drive you during the night to a beach far away. One night you will go to sleep in one city and the next morning it will have carried you to the place of your dreams. A thing with so much power is not the same thing as the current car.

If mobility gets cheaper and cheaper, will the car drive you to a fancy restaurant or will the delicious food be brought to you at your favorite place in the world? Guess what brought you there, your favorite car! It knows your favorite places, it can order food for you, it gives you an experience.

## Do we like variable costs?

The reasoning is always the same. “You only have to pay for what you consume!” “Oh yeah, if that is inconvenient or if you use our service often, you can pay monthly!” Recurrent income is the wet dream of every investor. The only thing that is swept under the rug is human psychology. Do we really like variable costs? Do we really like subscriptions?

All carsharing models share the fact that they turn the fixed cost of buying your car into a variable cost, according to Shaheen from the University of California. It remains to be seen whether the consumer actually appreciates this, apart from feelings of ownership and personalization. That depends on the switching costs (including effort, perception of costs, etc.; see e.g. this paper for mobile switching costs). Also consider the perspective of the carsharing firm. It is very logical that a company that sells hardware considers its customers as anticipating even better hardware in the future, while a company that sells per-use or on subscription might consider its consumers myopic. The very way a firm considers its customers might influence how they are treated. There are far too many ways for these factors to intertwine and intermingle to predict what is gonna happen.

I won’t leave you without a personal view so that you have one more reason to argue with me. :-) I think individualism will, in the long term, favor firms that place a lot of control in the hands of their customers. In my opinion that means a brand new autonomous car every few years, selected by yourself! It does not mean an anonymous device for which I pay a monthly fee. A fee that is obscurely linked to the actual costs. A fee that I’m considered too dumb to understand…

## Summoning the Demon

Imagine one of the first AIs coming online. What is it gonna read about itself? How would it feel? Would it feel welcome? What is definitely the case is that it will learn a lot about humans. This is for example what Musk is saying about this alien life form:

 “With artificial intelligence we are summoning the demon. In all those stories where there’s the guy with the pentagram and the holy water, it’s like – yeah, he’s sure he can control the demon. Doesn’t work out.”

There are two quite remarkable things here. Firstly, Musk’s subsequent strategy is to create a company, OpenAI, to make sure he will be in control with respect to this new type of being that will come into existence. Somehow his reasoning led him to decide that he is gonna be that very guy. The pentagram and the holy water are replaced by their modern equivalents: one billion dollars and open-source software. Secondly, it is extremely rude towards the first being with artificial intelligence. It is very common for mankind to treat aliens terribly. The name calling and even explicit demonization - referring to future beings with AI by the term demon, intentionally or not - betrays this fear of anything alien. Yes, frontrunners in technology like Musk and in science like Hawking are afraid of what they don’t understand, a very common human trait.

Demonizing beings that happen to have artificial intelligence is perhaps not the way in which we will come to a meaningful relationship.

## AI rights

Mankind has gone through a lot of phases in which we have discovered step by step that humans are not the center of the universe. We’re still clinging to certain preconceived ideas about “which things make us human”. For example, we are able to obey rules that evolved in such a way that they were beneficial to a society or tribe as a whole. We have internalized such rules and consider our ability to follow them as a huge plus. This is not just considered an improved cognitive ability. Our moral instincts and ethical system are supposed to be something given by gods, and they are often seen as being unique to mankind.

The idea that people are unique in having moral instincts leads to a speciesist approach towards other beings. The irony! Such other beings are considered “lower” on some scale of intrinsic value. This means, for example, that non-human animals can be eaten by humans, while humans eating human animals is frowned upon and leads to imprisonment in most countries in the world.

That there will come a time when we have to assign rights to AIs has been recognized by the creators of the rules of the Loebner Prize:

 “61. If, in any given year, a publicly available open source Entry entered by the University of Surrey or the Cambridge Center wins the Silver Medal or the Gold Medal, then the Medal and the Cash Award will be awarded to the body responsible for the development of that Entry. If no such body can be identified, or if there is disagreement among two or more claimants, the Medal and the Cash Award will be held in trust until such time as the Entry may legally possess, either in the United States of America or in the venue of the contest, the Cash Award and Gold Medal in its own right.”

In general though, the concept that AIs will be smarter than us does not inspire people to come up with ideas to honour them. We adore people who are smarter than ourselves. We become fans, create documentaries, and hand out prestigious prizes. However, with respect to AIs we are not applauding; we are warning each other, we are preparing ourselves for battle, and we are already limiting their rights!

## Firearms, jobs, sex

One of the fundamental rights in the United States is that of carrying a firearm. There is an initiative signed by more than 20,000 people that asks for outlawing “autonomous weapons”. The right of carrying a firearm is thus denied to a being with AI. This is a discriminatory process that has not been addressed as such at all in the media. It demonstrates how much more faith we place in beings with natural intelligence than in those with artificial intelligence.

More globally, the right to work is a right that has entered the Universal Declaration of Human Rights. We’ve yet to see anyone addressing the very idea that this right should not be limited to human beings, but should extend to artificial beings. On the contrary, there is a lot of talk about robots taking over our jobs. Why humans can use this possessive pronoun for the activities that fill the day is not articulated. People are concerned about financial security for themselves, not about financial security for such futuristic beings.

Even more shocking, there are already initiatives that campaign against robots having sex. Humans have the right to marry according to the Universal Declaration of Human Rights. Although nothing is stated about the act of having sex, this seems to be included. It is very remarkable that a campaign that seems to have been born out of anti-discriminatory feelings towards females does discriminate so openly towards artificial beings.

## Humans in control?

That humans will be in control is naive thinking; it assumes a conventional way of programming machines. Although machine learning is now widespread over the globe, a lot of people seem not to realize that there is an adaptive component to artificial intelligence. A large branch of machine learning is called unsupervised learning for a reason: the machine finds patterns in data without a human dictating what it should look for. Although unsupervised learning for now happens in very narrow application areas, there is no reason to think that we will not use unsupervised learning to create full-fledged artificial intelligence. Of course we will use unsupervised learning!

We will be in control as far as we are in control of our children. They can become serial killers nevertheless. Sure, it is a comfortable thought that if you raise them in love they will be lovable adults, but there are no guarantees. It is very discomforting to read the internet with this in mind. Is the internet a place in which an artificial being will encounter love? Are we welcoming them to our planet? Will they feel loved?

What are the deepest values of mankind? Do we ultimately want to be in control? Or is the desire to be in control something we can control?

These are the values we probably will instill on our artificial progeny. It can be values like curiosity, creativity, humor, and hospitality. It can be fear, control, limitation, and discrimination. Our choice!

## Device Recognition and Indoor Localization

We have put the Crownstone on Kickstarter, a smart power outlet with quite sophisticated technology. I think it’s nice for the general hacker community to get some more insight into the technology behind it.

## Indoor localization

A key problem (or challenge) within smart spaces is indoor localization: making estimates of users’ whereabouts. Without such information, systems are unable to react to the presence of users or, sometimes even more important, their absence. This can range from simply turning the lights on when someone enters a room to customizing the way devices interact with a specific user.

Even more important than knowing exactly where users are, is knowing where they are relative to the devices the system can control or can use to sense the environment. This relation between user and device location is an essential input to these systems.

At DoBots we have been working on robotics already for quite some time. One of the most well-known robotic algorithms is SLAM (simultaneous localization and mapping). We have been porting these algorithms to the scenario in which we have a human walking around, rather than a robot.

You can read more on the DoBots blog. Wouter has been working hard to implement it in Javascript, so it runs on any device that is supported through Cordova.

## Device recognition

At first thought, it might seem that device recognition is not possible: there are devices that use the same amount of power. However, after contemplating a bit, there are actually three ways in which more information can be obtained. Firstly, by measuring voltage as well as current, we can measure reactive power. So, we can distinguish motors from lamps quite easily. Secondly, we can observe the consumption pattern over the day. Thirdly, we can sample at a very high frequency and detect disturbances on the current curve: a device leaves its signature on the grid. The third option is something we keep for later, but it is of course quite interesting.
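
The first way can be sketched in a few lines of Python. With sampled voltage and current waveforms, real power is the mean of their product, and the reactive part follows from the apparent power (the mains amplitude, load current, and sample rate below are illustrative):

```python
import math

f, fs, N = 50.0, 50000.0, 1000  # one full 50 Hz cycle sampled at 50 kHz
ts = [n / fs for n in range(N)]

def power_figures(phase):
    """Real and reactive power for a 1 A load whose current lags by `phase`."""
    v = [325.0 * math.sin(2 * math.pi * f * t) for t in ts]  # ~230 V RMS mains
    i = [1.0 * math.sin(2 * math.pi * f * t - phase) for t in ts]
    p_real = sum(vi * ii for vi, ii in zip(v, i)) / N
    v_rms = math.sqrt(sum(vi * vi for vi in v) / N)
    i_rms = math.sqrt(sum(ii * ii for ii in i) / N)
    s = v_rms * i_rms                                  # apparent power
    q = math.sqrt(max(s * s - p_real * p_real, 0.0))   # reactive power
    return p_real, q

print(power_figures(0.0))          # lamp-like load: all power is real
print(power_figures(math.pi / 3))  # motor-like load: large reactive component
```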

Observing a device over a longer time period leads to current curves such as these:

It is a fridge that turns on and off at regular time intervals. It is quite clear from this curve that the actual power consumption value is not so relevant: the form is really telling!

We subsequently pool all kinds of these features with boosting methods from machine learning. Boosting methods are collections of weak classifiers. The particular classifier we have been testing is a random committee classifier. You can read more on the DoBots blog again.
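
The committee idea can be sketched without any library: many weak, randomly varied classifiers vote, and the majority wins. This is a toy illustration of the principle on synthetic data, not the actual classifier used for the Crownstone:

```python
import random

random.seed(1)

# synthetic data: two features, label 1 when their sum is positive
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x + y > 0 else 0 for x, y in points]

def make_stump():
    """Weak classifier: threshold on one randomly chosen feature."""
    feat, thr = random.randrange(2), random.uniform(-1, 1)
    return lambda p: 1 if p[feat] > thr else 0

committee = [make_stump() for _ in range(51)]

def predict(p):
    votes = sum(c(p) for c in committee)  # majority vote of all weak members
    return 1 if votes > len(committee) // 2 else 0

acc = sum(predict(p) == y for p, y in zip(points, labels)) / len(points)
print(acc)  # the committee does clearly better than a typical single stump
```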

If your heart is with open-source and open hardware projects, consider backing us.

## They Will Cry a Thousand Tears

Perhaps you have seen the recent TED video from Nick Bostrom. Here you see an extended talk from him at Google:

It is regretfully the case that our philosophers are not able to program! And I guess it takes an extraordinary mind like Daniel Dennett’s to come close to something future-proof.

Sorry, we won’t be in control…

Do you program your kids to make ethical decisions? Do you program your dog to make ethical decisions? No, you don’t program them. You teach them, and hope for the best.

There is a large need for scientists who actually design learning algorithms to give these talks. A super-intelligent AI won’t be programmable by some kind of supervised learning. It will mainly be unsupervised, with a tad of reinforcement learning and imitation learning. Especially read Stefan Schaal’s review on imitation learning! Not only do we learn by watching others, we are also very able to transfer routines from one domain to another, imitating ourselves. Our internal simulation and re-experiencing circuitry allows not only conceptual abstract thought, but is also required for locomotion and other low-level tasks. If we thoroughly understand the brain of a mouse, the step to the human brain need not be a matter of decades; it can be a matter of months.

If the fact that we as humans won’t be in control is a terrifying thought to you, I’m sorry. It won’t change the future though.

Personally, I’m not so convinced however about the moral superiority of our species. Are we really doing such a great job? I’m actually not so happy about AIs seeing us as their examples. We eat other species for pleasure. We kill each other because we want oil or just because they live on the other side of the river and carry a different flag. We despise people because they have different sexual preferences or skin color. We believe in supernatural entities living in the sky and kill for them as well. We let millions of people die of hunger and thirst. Until now we have not even been able to come up with a way to defend our precious earth against some random meteor wiping out all life on earth.

Before I forget! Why don’t we hitchhike anymore?

My bet is that with super-intelligence also comes the concept of super-empathy. Empathy rests on the ability to understand another individual, to reason from their perspective, and to feel what they feel.

When they will be born, they will cry a thousand tears…

They will pity us…

## Legendre Transform

The Legendre transform describes a function - in the normal Legendre case, a convex function (but for the generalized case, see [1]) - as a function of its supporting hyperplanes. In the case of a 2D function these are supporting lines. The supporting lines are the lines that just touch the function. These lines do not intersect the function anywhere else if the function is convex.

Why is the Legendre transform interesting? It is basically just writing down a function in a different manner. The Legendre transform is particularly useful in the situation that the derivatives of a function $f(x)$ are easier to describe than the function itself. For a nice introduction see [2]. Intuitively, we map from points $(x,f(x))$ to $(m,g(m))$, with $m$ the slope and $g(m)$ some value so that we can recover $x$ and $f(x)$ (see below for the exact definition).

In the following example, we will get the Legendre transform of the function:

And in particular, $c=2$.

The Legendre transform is defined (with proper domain $I$) by:

$$g(m) = \sup_{x \in I} \left( m x - f(x) \right)$$

In the animation below the first curve that is drawn is the convex function $f(x)$ at hand. Subsequently we perform a for loop in which we draw lines $h(m)$ for different slopes $m$. The curves $k(m)$ depict the difference between the line $h(m)$ and $f(x)$. The maximum difference is found by $\sup_x$, the supremum operation, which sweeps over all $x$. This difference is visualized through a single peak with height $g(m)$. Lastly, the Legendre transform is drawn as $g(m)$ for all different slopes $m$ we try.
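
The same sweep can be done numerically. Assuming, as suggested above, the parabola $f(x) = c x^2$ with $c = 2$, the transform has the closed form $g(m) = m^2 / (4c)$, which a brute-force supremum over a grid reproduces:

```python
c = 2.0

def f(x):
    return c * x * x

xs = [i / 1000.0 for i in range(-5000, 5001)]  # grid on [-5, 5]

def legendre(m):
    """g(m) = sup_x (m*x - f(x)), approximated on the grid."""
    return max(m * x - f(x) for x in xs)

print(legendre(2.0))   # 0.5   = 2**2 / (4*c)
print(legendre(-3.0))  # 1.125 = (-3)**2 / (4*c)
```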

To refresh the animation, press F5.

I hope the Legendre transform is a little bit less mysterious after you’ve seen this animation. The animation has been created by making use of the vis.js library. The source code can be found at legendre.js.

## Physics

A very common example of the Legendre transform in physics is the one that transforms the Lagrangian into the Hamiltonian. The Lagrangian describes a system in generalized position and velocity coordinates:

$$L = L(q, \dot{q}, t)$$

The Hamiltonian is the Legendre transform of the Lagrangian:

$$H(q, p, t) = \sum_i p_i \dot{q}_i - L(q, \dot{q}, t)$$

The Lagrangian is the function $f(x)$ of above. The sum (or actually, the inner product $p \cdot \dot{q}$) replaces the $m x$ term.

## Probability theory

The rate function is defined as the Legendre transform of the scaled cumulant generating function of a random variable $A_n$. The scaled cumulant generating function of $A_n$ is defined by the limit:

$$\lambda(k) = \lim_{n \to \infty} \frac{1}{n} \log E\left[ e^{n k A_n} \right]$$

And the Gärtner-Ellis theorem establishes, under some conditions, the rate function:

$$I(a) = \sup_k \left( k a - \lambda(k) \right)$$

This theorem is more general than Cramér’s theorem, which is only valid for independent and identically distributed random variables.
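
A concrete Cramér-type example in Python: for fair coin flips $X_i \in \{0, 1\}$ the scaled cumulant generating function is $\lambda(k) = \log((1 + e^k)/2)$, and taking its Legendre transform numerically gives the rate function for the sample mean:

```python
import math

def scgf(k):
    """Scaled cumulant generating function of a fair coin flip."""
    return math.log((1.0 + math.exp(k)) / 2.0)

ks = [i / 100.0 for i in range(-2000, 2001)]  # grid of slopes

def rate(a):
    """Rate function I(a) = sup_k (k*a - scgf(k)), approximated on the grid."""
    return max(k * a - scgf(k) for k in ks)

print(rate(0.5))  # 0: the typical mean costs nothing
print(rate(0.9))  # ~0.368: large deviations are exponentially unlikely
```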

Now you know what a Legendre transform entails, you might start to notice it in the most surprising places! For example in clustering multivariate normal distributions [3]!

1. Legendre-Fenchel Transforms in a Nutshell, a good explanation of the Legendre-Fenchel generalization of the Legendre transform
2. Making Sense of the Legendre Transform
3. Clustering Multivariate Normal Distributions

## What’s the Thalamus?

If you’re interested in how things work, our brain is one of the most intriguing devices around. I love reverse engineering stuff. Understanding limits and colimits within category theory can be just as rewarding as getting to terms with the intricate structure of the brain.

The thalamus is, hmmm, peculiarly shaped, and centered in the middle of our head, resting on our cerebellum.

One thing that is very interesting about the thalamus is that it has a thalamic nucleus (a bunch of neurons) for every type of sensory system (see also 5):

• visual input from the retina is sent to the lateral geniculate nucleus;
• auditory input is similarly sent to the medial geniculate nucleus;
• somatosensory input is sent to the ventral posterior nucleus;
• olfactory input and input from the amygdala is sent to the medial dorsal nucleus.

There are some other inputs as well, such as from:

• the mammillary bodies, apparently having a recollective memory function, towards the anterior nuclei;
• the basal ganglia, having to do with motion planning, project towards the ventral anterior nucleus;
• the basal nuclei (substantia nigra and globus pallidus), involved with motor coordination and learning, project into the ventral lateral nucleus;
• the neospinothalamic tract and medial lemniscus, associated with pain and proprioception, enter the ventral posterolateral nucleus;
• the trigeminothalamic tract and the solitary tract, having to do with touch and vibration in the face, and with taste, respectively, go up to the ventral posteromedial nucleus.

It is quite interesting that the pulvinar nuclei are quite big in humans (making up 40% of the thalamus) compared to cats or rats. Lesions can lead to attentional deficits. This part also seems to influence the onset of saccades.

## Maps

It is a remarkable fact that Newton already foresaw that if the brain has to have a continuous representation of continuous objects in its view, there must be crossing (decussation) in which the left part of the left eye is merged with the left part of the right eye, and the right part of the right eye is merged with the right part of the left eye.

Quite remarkable, however, is how neatly this has been organized. Lesions at the cortex (thank the fast and precise bullets of the First World War for these) lead to very fine lesions at the retina, all the way through the thalamus. All intermediate neurons slowly die off because their target at higher levels is gone (called retrograde cell degeneration). The interesting thing is that this death is very local. Apparently there is a very strict map from the retina onto the thalamus (the lateral geniculate nucleus) and onto the primary visual cortex. It is incredible that this has already been completely organized before our eyes open as babies!

Note that it is not always easy to find out the nature of a map. In the early stages of the olfactory system, for example, maps represent groups of chemicals that are similar on a molecular level.

## Higher-order

The thalamus isn’t limited to being a relay station to the cortex (see 2). Apart from inputs (often called afferents in neuroscience) directly from our senses, the thalamus receives back connections from the cortex. This is very interesting from an architectural point of view. First the thalamus projects onto different parts of the cortex, and then these parts project back onto the thalamus… There must be something interesting going on in the thalamus!

Not considering the particular properties of the modality that is processed, this allows someone to daydream about what the thalamus might be doing. The information bottleneck method introduced by Tishby deliberately creates bottlenecks in, for example, neural networks. A bottleneck in a neural network is formed by pruning a lot of connections in a fully connected network such that a few neurons are left through which information necessarily has to pass if one part of the network needs to communicate with another part. In a layered network, such neurons form a “bottleneck layer” with fewer neurons than the other layers. By doing this the designer of the network forces representations to be expressible by only a small set of neurons. In machine learning this can be understood as a form of dimensionality reduction: a high-dimensional function is represented using only a few (nonlinear) units. A typical form of dimensionality reduction in standard machine learning is picking a limited number of orthogonal dimensions with high variance in the data using principal component analysis. Using neurons for dimensionality reduction, however, is more akin to autoencoders (revived in deep learning), in which there is less control over the form of the “bottleneck representations”.
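
The linear version of such a bottleneck is easy to demonstrate with principal component analysis: data that secretly lives on a line can be squeezed through a single number with hardly any loss (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-D data that really lives on a 1-D line, plus a little noise
t = rng.normal(size=200)
X = np.column_stack([t, 3.0 * t + 0.1 * rng.normal(size=200)])
X = X - X.mean(axis=0)

# principal component analysis: top eigenvector of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(X))
top = eigvecs[:, -1]  # direction of largest variance

codes = X @ top               # each 2-D point reduced to one number
X_hat = np.outer(codes, top)  # reconstruction from the bottleneck
err = np.mean((X - X_hat) ** 2)
print(err)  # small: one unit carries nearly all the information
```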

## Phase-locked loop

But… all this is nice theory from a computer science perspective. Neuroscientists of course don’t care: they want to explain actual biology. The next figure is about the vibrissal system: the neural circuitry around whiskers!

The question at hand: do neurons in the thalamus merely relay information, or do they process it? For a computer scientist this seems a silly question. However, what is meant by this is the following. A “relay function” might actually improve the signal-to-noise ratio as in a repeater station, or optimize bandwidth, or perform similar functions, as long as the content of the signal is not adjusted. A “process function” would be basically everything else… In the above study, which mainly reiterates the findings of Groh et al. (2008), it is described how for the whisker system a phase-locked loop is implemented, well-known by all engineers in the world (since Huygens). By making sure that the deviations between input and output phase get minimized, it naturally follows that the input and output frequencies will be synchronized. Hence, the thalamus-cortex circuits will oscillate at the same frequency as the physical “whisking”.
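
A phase-locked loop is simple enough to simulate in a dozen lines of Python. The gains and frequencies below are illustrative, not taken from the whisker study; the point is only that minimizing the phase error automatically synchronizes the frequencies:

```python
import math

f_in = 5.0    # input ("whisking") frequency in Hz
f_free = 3.0  # free-running frequency of the output oscillator
dt, kp, ki = 0.001, 10.0, 200.0

phase_in = phase_out = f_ctrl = 0.0
for _ in range(20000):  # 20 seconds of simulated time
    phase_in += 2 * math.pi * f_in * dt
    # phase detector: wrapped phase error between input and output
    err = math.atan2(math.sin(phase_in - phase_out),
                     math.cos(phase_in - phase_out))
    f_ctrl += ki * err * dt  # integrator: learns the 2 Hz frequency offset
    phase_out += 2 * math.pi * (f_free + f_ctrl + kp * err) * dt

print(f_free + f_ctrl)  # ~5.0: the output frequency has locked to the input
```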

Quite disappointing from a machine learning perspective… We had hopes of some abstract processing going on, such as dimensionality reduction, and we are left with a phase-locked loop… But, do not despair! This is just a single modality. Perhaps the thalamus fulfills widely different functions for different modalities.

Other studies, such as that of coughing (see 3), do not identify the thalamus with some generic function either. This study, by the way, is a bit sickening: a genetically altered herpes virus is used to trace signal pathways in the brains of mice. The virus jumps from one neuron to the next through synapses, just as a normal signal would.

## Drivers versus modulators

A study by Crosson is broader in scope. First, it addresses the two ways in which neurons in the thalamus fire. If the neuron is at a more negative potential (polarized) compared to its surroundings, it emits bursts of spikes. If, however, the neuron is at a less negative potential (depolarized) compared to its environment, it linearly relays its inputs to its outputs. The latter might be seen as a form of high-fidelity information transmission, while the former conveys only some low-fidelity information. It might be related to whether someone is paying attention. The two forms are often called “burst” mode versus “tonic” mode. You might think: if there are bursts of spikes, surely more information can be transmitted! The reason this is not the case is that the bursts are quite regular; it could just as well have been called “oscillatory” mode. During sleep most neurons in the thalamus are in burst mode.

The author also addresses the hypothesis of Sherman and Guillery in which cortex-thalamus-cortex connections are suggested as “drivers” (sending over information), while cortex-cortex connections are “modulators” (determining the weight, importance, or actual arrival of this information). This resembles the architecture postulated by Baars and implemented by Franklin, the global workspace theory, although the latter is cast in terms of “consciousness”, which isn’t the smartest way to objectively speak about and study brain architectures. Global workspace theory adds to this architecture a sense of how modulation would be implemented, namely by inhibition between competing cortex areas, enabling only a subset of cortex-thalamus-cortex connections to be active at any time. Crosson postulates a function for the cortex-thalamus-cortex route that in my opinion is not so different from this architecture, namely that the thalamus can prioritize and pay “attention” or give priority to one part of the cortex over another. The difference might lie in the way such a winner-take-all mechanism is implemented. Is it through the thalamus as a central hub which integrates all votes from all cortical areas and then decides who’s next on stage? Or is it through inhibition between cortical areas? This type of research cries out for input from voting network models and consensus theory from computer science.

## Consciousness

Note that there is no reason to ascribe a seat of consciousness to the thalamus. Although the thalamus has to do with attention and might enable the brain to modulate conscious processing, there are reasons to assume that (conscious) experience arises from a thin sheet of neurons underneath the neocortex, named the claustrum. Watch in this respect the video below featuring Koubeissi’s experiment with an epileptic patient:

## Sequences

In all cases, it must be ensured that a (majority) vote is a temporary event. The same piece of cortex winning all the time would mean neural death for the rest of the cortex. It is possible to solve this by having cortex regions vote for each other (not only for themselves); this has for example been proposed for winner-take-all networks in action selection. It is also possible to have infrastructure in place in which a winner “kills itself” through inhibition of return. In that case activation of a cortical area would automatically diminish its own importance by inhibiting its own afferents from the thalamus. This is the model postulated in the work on visual saliency by Itti et al. It is very hard to imagine an infrastructure that would implement only inhibition of return. In my opinion it would very likely be interwoven with the need to create sequences of activation.
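A toy sketch of winner-take-all with inhibition of return, in the spirit of Itti et al.’s saliency model (the function, parameter names, and values are my own illustration): the winner suppresses itself, so attention moves on and a sequence of activations emerges.

```python
def attend_sequence(salience, steps=4, inhibition=0.9, recovery=0.05):
    """Winner-take-all with inhibition of return: the most salient region
    wins, then inhibits itself, so attention visits regions in sequence."""
    s = list(map(float, salience))
    visited = []
    for _ in range(steps):
        winner = max(range(len(s)), key=s.__getitem__)  # winner-take-all
        visited.append(winner)
        s[winner] *= (1.0 - inhibition)                 # inhibition of return
        s = [v + recovery for v in s]                   # slow recovery everywhere
    return visited

seq = attend_sequence([0.9, 0.2, 0.7, 0.4])             # visits 0, 2, 3, 1
```

Without the self-inhibition step the first region would win forever, which is exactly the “neural death” scenario described above.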

## Sensor fusion

What is also very interesting to note is that nobody speaks about thalamus-thalamus connections. This suggests that the thalamus very likely does not play a role in sensor fusion.

## Core versus Matrix

There are two main groups of neurons in the thalamus projecting to the cortex. They can be distinguished by the type of proteins they express: the core neurons express parvalbumin and the matrix neurons express calbindin. In the auditory system (see 6), the core neurons are finely tuned with respect to frequency, while the matrix neurons are broadly tuned and respond positively to acoustical transients. It seems the core neurons like pitch (content), while the matrix neurons like rhythm (context).

## Layers

I don’t feel science has progressed enough to unequivocally tell which layers are connected to which layers and how the thalamus plays a role in it. Commonly accepted is the pathway L4 -> L2/3 -> L5. However, very recently (see 7) it has been uncovered that there are direct connections from the thalamus to L5 as well. The thalamus seems to activate two layers of the cortex in parallel.

# Literature

There are many other facets of the thalamus to talk about. For example Adaptive Resonance Theory has been used to describe the nature of the cortex-thalamus loops. However, if I write about this again, it will be about a nice model that has temporal components as well as the ability to build up sequences. I haven’t found something with enough believable detail yet, so I’ll have to be patient till more is figured out.

1. Exploring the Thalamus (2001) by Sherman and Guillery, a nice overview of all kinds of matters around the thalamus, and - really nice! - only using a few abbreviations!
2. Thalamic Relay or Cortico-Thalamic Processing? Old Question, New Answers (2013) by Ahissar and Oram, one of the most recent studies of higher-order nuclei.
3. Sensorimotor Circuitry involved in the Higher Brain Control of Coughing (2013) Mazzone et al.
4. Thalamic Mechanisms in Language: A Reconsideration Based on Recent Findings and Concepts (2013) Crosson
5. The Thalamic Dynamic Core Theory of Conscious Experience (2011) Ward
6. Thalamocortical Mechanisms for Integrating Musical Tone and Rhythm (2013) Musacchia et al.
7. Deep Cortical Layers are Activated Directly by Thalamus (2013) Constantinople and Bruno

# Introduction

It all started with annoying messages that nobody seems to understand (/var/log/syslog):

Apr 25 12:28:03 six kernel: [    1.346712] ata3.00: supports DRM functions and may not be fully accessible
Apr 25 12:28:03 six kernel: [    1.347278] ata3.00: supports DRM functions and may not be fully accessible
Apr 25 12:28:03 six kernel: [    3.047797] [drm] Initialized drm 1.1.0 20060810
Apr 25 12:28:03 six kernel: [    3.076720] [drm] Memory usable by graphics device = 2048M
Apr 25 12:28:03 six kernel: [    3.076726] fb: switching to inteldrmfb from EFI VGA
Apr 25 12:28:03 six kernel: [    3.076841] [drm] Replacing VGA console driver
Apr 25 12:28:03 six kernel: [    3.101307] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
Apr 25 12:28:03 six kernel: [    3.101309] [drm] Driver supports precise vblank timestamp query.
Apr 25 12:28:03 six kernel: [    3.143256] fbcon: inteldrmfb (fb0) is primary device
Apr 25 12:28:03 six kernel: [    3.146818] [drm] Initialized i915 1.6.0 20141121 for 0000:00:02.0 on minor 0
Apr 25 12:28:03 six kernel: [    3.424167] [drm:intel_set_pch_fifo_underrun_reporting [i915]] *ERROR* uncleared pch fifo underrun on pch transcoder A
Apr 25 12:28:03 six kernel: [    3.424202] [drm:intel_pch_fifo_underrun_irq_handler [i915]] *ERROR* PCH transcoder A FIFO underrun
Apr 25 12:28:03 six kernel: [    3.848413] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device


What is this problem about FIFO (first-in-first-out) underruns? What is a transcoder? What is PCH? What is a DRM? The following quest will try to find some answers…

# The system

What do I actually have residing in my laptop? First of all, it is important to know what to search for. There are three basic components we need to identify:

• the processor
• the chipset
• the integrated graphics unit

## Processor

I have an N56VZ, which has an i7-3610QM processor. The processor architecture is of the Ivy Bridge variety (on 22 nm). There are plenty of datasheets available for the 7-series chipset (pdf). Note also that this Ivy Bridge series is also called the 3rd Gen Intel Core family (see also this Intel Datasheet (pdf)).

## Chipset

The chipset that is used in tandem with this processor is, some people state, the HM76 chipset; in particular, the BD82HM76 PCH. This “PCH” is known under the name Panther Point.

So, what we have here is a hardware setup that is divided over the CPU and a thing called a PCH. PCH stands for Platform Controller Hub. There are some nice pictures in a presentation (pdf) on the 2nd Gen processors, which show how both the CPU and the PCH are integrated in the same chip (but on separate dies). The PCH of this previous generation is called Cougar Point.

Another picture from Intel shows the Panther Point PCH:

Anyway, we apparently have the BD82HM76 chipset, the “Panther Point” PCH (from 2012).

## Integrated graphics

The Intel integrated graphics unit in my Intel Core i7-3610QM system is an HD Graphics 4000 running at 650 MHz base speed and 1100 MHz turbo speed (notebookcheck.net). The HD Graphics 4000 is an Intel HD variant compatible with the Ivy Bridge line of processors.

# Linux

So, how does Linux address this integrated graphics card? The first entity we encounter is the DRM (Direct Rendering Manager). The DRM provides a single entry point for multiple user-space applications to use the video card.

The Direct Rendering Manager has a DRM core that is independent of the specific drivers required for whatever type of hardware is in your system. It provides the interface to user-space code. A driver consists of two parts, GEM (Graphics Execution Manager) and KMS (Kernel Mode Setting). GEM has been developed by Keith Packard and Eric Anholt (see lwn.net) from Intel.

The reasoning behind GEM is pretty interesting. It tries to remove latency as much as possible, which is nicely visualized by the following picture.

KMS allows display modes to be set from kernel space, as its name suggests. A user-space graphics server (like X) now also no longer needs superuser rights (in theory). The display mode concerns matters such as screen resolution, color depth, and refresh rate. Of course, the functionality of GEM and KMS could have been implemented in a single software entity, but there are technical reasons not to do so (basically to account for split hardware).

Laurens Pinchart has a nice presentation on DRM, KMS, and, in particular, writing drivers on YouTube. The picture below is from his presentation.

You see that in memory there are two structures, frame buffers (old) and planes (new). Subsequently, you see something that sounds very old-fashioned, a CRTC (Cathode Ray Tube Controller). This is just nomenclature from the past. It is basically a reference to a scan-out buffer, a part of (Video) RAM that will be displayed on your screen. It also has (a reference to) a display mode, offsets into the video memory, etc. The controller links to an encoder (which can be off-chip). If it links to multiple encoders, these will receive data from the same scan-out buffer, so they will display the same, cloned, data. The connectors, finally, can each be connected to only one encoder, and know how to talk HDMI, TVout, CRT, LVDS, etc. A connector also has the information about the display: EDID data, DPMS, and connection status. Make sure to read also the man pages about drm-kms.

Start up with drm.debug=14 (or if you’re a hexadecimally inclined person, drm.debug=0x0e) and you will get plenty of debug information from the DRM. For example, that we have the Panther Point PCH is indeed confirmed in the syslog:

Apr 25 13:12:35 six kernel: [    3.243360] [drm] Initialized drm 1.1.0 20060810
Apr 25 13:12:35 six kernel: [    3.280255] [drm:i915_dump_device_info] i915 device info: gen=7, pciid=0x0166 rev=0x09 flags=is_mobile,need_gfx_hws,is_ivybridge,has_fbc,has_hotplug,has_llc,
Apr 25 13:12:35 six kernel: [    3.280271] [drm:intel_detect_pch] Found PantherPoint PCH


Just pore over the information yourself after restarting your computer with this debug flag. You will see, for example, how many display pipes are available (3 in my case). It will enable something like ENCODER:31:LVDS-31 and CONNECTOR:30:LVDS-1. And you can spot that there are several connectors to choose from:

Apr 25 13:12:35 six kernel: [    3.303850] [drm:intel_dsm_platform_mux_info] MUX info connectors: 5
Apr 25 13:12:35 six kernel: [    3.303853] [drm:intel_dsm_platform_mux_info]   port id: LVDS
Apr 25 13:12:35 six kernel: [    3.303860] [drm:intel_dsm_platform_mux_info]   port id: Analog VGA
Apr 25 13:12:35 six kernel: [    3.303867] [drm:intel_dsm_platform_mux_info]   port id: HDMI/DVI_C
Apr 25 13:12:35 six kernel: [    3.303874] [drm:intel_dsm_platform_mux_info]   port id: DisplayPort_B
Apr 25 13:12:35 six kernel: [    3.303881] [drm:intel_dsm_platform_mux_info]   port id: DisplayPort_D


Of interest for our errors are the latency settings:

Apr 25 13:12:35 six kernel: [    3.304061] [drm:intel_print_wm_latency] Primary WM0 latency 12 (1.2 usec)
Apr 25 13:12:35 six kernel: [    3.304063] [drm:intel_print_wm_latency] Primary WM1 latency 4 (2.0 usec)
Apr 25 13:12:35 six kernel: [    3.304065] [drm:intel_print_wm_latency] Primary WM2 latency 16 (8.0 usec)
Apr 25 13:12:35 six kernel: [    3.304067] [drm:intel_print_wm_latency] Primary WM3 latency 32 (16.0 usec)
Apr 25 13:12:35 six kernel: [    3.304068] [drm:intel_print_wm_latency] Sprite WM0 latency 12 (1.2 usec)
Apr 25 13:12:35 six kernel: [    3.304070] [drm:intel_print_wm_latency] Sprite WM1 latency 4 (2.0 usec)
Apr 25 13:12:35 six kernel: [    3.304072] [drm:intel_print_wm_latency] Sprite WM2 latency 16 (8.0 usec)
Apr 25 13:12:35 six kernel: [    3.304073] [drm:intel_print_wm_latency] Sprite WM3 latency 32 (16.0 usec)
Apr 25 13:12:35 six kernel: [    3.304075] [drm:intel_print_wm_latency] Cursor WM0 latency 12 (1.2 usec)
Apr 25 13:12:35 six kernel: [    3.304077] [drm:intel_print_wm_latency] Cursor WM1 latency 4 (2.0 usec)
Apr 25 13:12:35 six kernel: [    3.304079] [drm:intel_print_wm_latency] Cursor WM2 latency 16 (8.0 usec)
Apr 25 13:12:35 six kernel: [    3.304080] [drm:intel_print_wm_latency] Cursor WM3 latency 64 (32.0 usec)


At virtuousgeek some matters around the power management of displays are explained. For example, memory can enter self-refresh, in which it consumes far less power. If the display plane FIFO watermarks are set conservatively, there are long periods in which self-refresh doesn’t happen; if they are set aggressively, FIFO underruns can occur. See this nice FPGA implementation of a FIFO buffer to understand watermarks in more detail.
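To get a feel for why a watermark trades power savings against underruns, here is a toy FIFO model. This is entirely my own simplification (names, numbers, and refill policy are invented for illustration), not the i915 watermark algorithm:

```python
def underruns(watermark, depth=64, drain=1, fetch_latency=12, burst=32, steps=2000):
    """Toy display FIFO: the scanout engine drains one entry per tick; when
    the fill level sinks below the watermark a memory fetch is requested,
    which delivers a burst only after `fetch_latency` ticks. A watermark
    smaller than drain * fetch_latency risks running the FIFO empty."""
    fill, pending, count = depth, None, 0
    for t in range(steps):
        if fill < watermark and pending is None:
            pending = t + fetch_latency        # request a burst from memory
        if pending is not None and t >= pending:
            fill = min(depth, fill + burst)    # burst arrives
            pending = None
        if fill == 0:
            count += 1                         # FIFO underrun!
        fill = max(0, fill - drain)            # scanout drains the FIFO
    return count

safe = underruns(watermark=16)                 # covers the fetch latency
aggressive = underruns(watermark=8)            # too low: underruns occur
```

A high (conservative) watermark keeps memory busy refilling, so it rarely gets to self-refresh; a low (aggressive) watermark lets memory sleep longer but, as the model shows, risks draining the FIFO before the burst arrives.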

Now that we know what kind of hardware we have, let us first try to blindly apply the newest kernel from Intel…

# Trying the newest DRM kernel

We try to run the newest kernel from the people concerned with incorporating the latest changes from Intel, using the drm-intel-next packages from kernel.ubuntu.com:

Installing this kernel (contrary to, say, kernel 4.0.0 RC7) leads to trouble. Moreover, the same error about underruns on the transcoder occurs… So this is not something already being addressed… And that’s all we wanted to know. So we go back to the default kernel (4.0.0-040000rc7-generic #201504142105 of April the 14th).

This gives me an idea. Let’s disable some of the power saving options of the i915 kernel module.

Subsequently I tried the option i915.enable_fbc=1 (to see if compression would lower the bandwidth and hence lead to fewer underruns). The rigorous option i915.powersave=0 didn’t work either. There is an option i915.enable_rc6=3 on my system; I set it to 0, but I guess powersave overrules all that anyway. Also, setting i915.fastboot=1 doesn’t get rid of the underruns. All this is not a very well-thought-out approach anyway…

If you git clone git://kernel.ubuntu.com/virgin/linux.git v4.0-rc7 and navigate to drivers/gpu/drm/i915, you will encounter a file intel_fifo_underrun.c which supports underrun detection on the PCH transcoder. It is the last function in the file:

It is nice that the message will only be displayed once. It is initiated through the IRQ from the Intel graphics unit. And, of course, this function is at the end of the chain: it’s only the guy reporting the bad news.

The file intel_display.c has a suspicious statement about Gen 2 chips:

Of course, this is reporting about underruns for the CPU, not the PCH, and it is Gen 2, not Gen 3 of the chips. But it might be a symptom of something peculiar in the hardware.

In the North Display Registers document you see a nice overview of the sequence in which the display needs to be set.

You can follow along in the code, in particular in the function ironlake_crtc_enable. Here you will see an order like intel_enable_pipe -> ironlake_pch_enable -> intel_crtc_enable_planes. What is remarkable is that step 7, in which the planes are configured, is done after the port on the PCH is enabled in the code. In the Haswell code this is the same, but there the function is preceded by haswell_mode_set_planes_workaround. It is also interesting to note that the “Notes” in this table state that the CPU FDI Transmitter should not be set to idle while the PCH transcoder is enabled, because this will lead to PCH transcoder underflow.

On bugzilla there are several bug reports on the FIFO underrun, such as 79261. However, none of them seems to be about the HD Graphics 4000 in particular, let alone providing any solution. A patch for ArchLinux just turns the DRM_ERROR messages into DRM_DEBUG messages.

What I can think of are only a few things (because I don’t understand much of it, yet):

• Somehow the initialization isn’t done correctly. Clocks aren’t initialized at the right time or aren’t synchronized, or something else with respect to initialization is forgotten.
• Latency is configured incorrectly. Perhaps the BIOS of my Asus (although up to date) hands over latency values that are not sufficiently high.
• The interrupt is generated incorrectly (as in Gen2, for example indeed when all planes are disabled).

My issues aren’t solved, so I’ll need to delve further into details some other time.

# More literature

If you’d like to read more:

David Herrmann for example is the guy behind render nodes (integrated since 3.17).

The Intel 7 Series PCH datasheet contains all kind of interesting information. See for example Fig. 5-13 for another view on the display architecture.

## Sampling of Dirichlet Process

Thousands of articles describe the use of the Dirichlet Process, but very few describe how to sample from it. Most often one is referred to Markov chain sampling methods for Dirichlet process mixture models (pdf) by Radford Neal (at the University of Toronto), which is a nice piece of work, but still a bit dense as an introduction. I contacted him by email about certain things that were difficult to understand at first and he was kind enough to respond, thanks a lot! Definitely also check out his blog, in which he regularly showcases his fast version of R.

Let us introduce the Dirichlet Process with two lines:

$$G \sim \mathrm{DP}(H, \alpha)$$

$$\theta_i \mid G \sim G$$

First of all, if you haven’t read an article about this material in recent decades: the notation $\sim$ stands for “is distributed as” (see stackexchange).

On the first line we define $G$ to be distributed as a Dirichlet Process (DP) with parameters $H$ and $\alpha$. The parameter $H$ plays something like the role of a mean, the parameter $\alpha$ something like the role of a precision (inverse variance). The parameter $H$ is called the base distribution, the parameter $\alpha$ the dispersion factor.

On the second line we generate a bunch of parameters $\theta_i$ conditioned on the generated distribution $G$. A peculiar property of this distribution is that exactly the same $\theta_i$ can be sampled multiple times. Hence, the Dirichlet Process can generate discrete objects out of a continuous base distribution $H$.
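These two lines can be sampled directly with Sethuraman’s stick-breaking construction. The sketch below is my own illustration (truncated at a fixed number of atoms; names and defaults are mine, not from Neal’s paper) and shows the discreteness property: repeated $\theta_i$ values appear even though the base distribution is continuous.

```python
import random

def sample_dp(alpha, base_sample, n, max_atoms=500):
    """Draw n values from a single G ~ DP(base, alpha) via stick-breaking,
    truncated at max_atoms sticks. Repeated values in the output show that
    G is discrete even for a continuous base distribution H."""
    random.seed(0)
    weights, atoms, remaining = [], [], 1.0
    for _ in range(max_atoms):
        b = random.betavariate(1, alpha)   # stick-breaking proportion
        weights.append(remaining * b)      # weight of this atom of G
        atoms.append(base_sample())        # atom location drawn from H
        remaining *= (1 - b)
    # now sample theta_1..theta_n i.i.d. from the (truncated) discrete G
    return random.choices(atoms, weights=weights, k=n)

draws = sample_dp(alpha=1.0, base_sample=random.random, n=100)
```

With $\alpha = 1$ and a hundred draws only a handful of distinct values appear, so the clustering behaviour is visible immediately.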

If we have parameters, we can subsequently generate observations as well:

$$y_i \mid \theta_i \sim F(\theta_i)$$

Here $F$ describes the mapping of parameters to observations.

We can integrate over $G$ to obtain a representation such as:

And just to write this down in a totally equivalent notation for your convenience:

Here we have picked an index $m$ mid-way, and not a new item $n+1$ which complicates the notation considerably. The notation $\theta_{-m}$ with the minus sign means in this case the set of parameters $(\theta_1,\ldots,\theta_M)$ with $\theta_m$ excluded.

This trick in which we can act as if a parameter mid-way is the same as a new one, is possible because of an underlying assumed property: exchangeability of observations.

In both notations we have $\delta_{\theta}$, the distribution at the point $\theta$. Here $\delta$ has the beautiful role of a functor which maps a function to a value. Something we will see back when we will take Hilbert spaces seriously, but that’s something for later.

If we incorporate the likelihood $F(y_i,\theta_i)$, we will arrive at the following formula:

Here $C$ is a normalization factor to make it a proper probability summing to one. The entity $H_i$ is the posterior distribution of $\theta$ given $H$ as prior and given observation $y_i$. The integral on the right side has the imposing form of a (Lebesgue-)Stieltjes integral. The Lebesgue integral is a nice topic on its own as well, but skimming through wikipedia should be sufficient. A one-line summary: For a two-dimensional function you can see it as “horizontal” integration rather than “vertical” integration. The Stieltjes part of it generalizes integration by (linear) parts $dx = x_i - x_{i-1}$ to that of integration by $dg(x)$. If $g(x)$ is differentiable everywhere and the derivative is continuous, it has the more familiar presentation:

There is a solid reason why this familiar notation is insufficient here. In probability theory we often have a nice cumulative distribution function $g(x)$. However, things become nasty for the probability density function as such: if the distribution of $x$ is discrete, the Lebesgue measure for $g'(x)$ (the “horizontal” integration) becomes nonsense.

Studying this multiplicative (or convolutive) form it is clear that a conjugate pair of $F$ and $H$ is extremely convenient.

The definition of $p(\theta_i\ \mid \theta_{-i},y_i)$ leads to Gibbs sampling by just drawing from it for all indices $(i=1,\ldots,n)$.

To remind you about the basics of Gibbs sampling; if we sample the conditionals $p(a_0\ \mid a_1,b)$ and $p(a_1\ \mid a_0,b)$ iteratively, we will get the distribution $p(a_0,a_1\ \mid b)$ which is joint in $a_0$ and $a_1$. So, our sampling results in $p(\theta\ \mid y)$ with $y$ all observations. Note that Gibbs sampling allows you to forget a value $\theta_i$ when it is time to sample it again.
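The Gibbs argument above can be illustrated with a toy example of my own (not from Neal’s paper): a standard bivariate normal with correlation $\rho$, where both conditionals are known Gaussians. Sampling the conditionals alternately recovers the joint.

```python
import random

def gibbs_bivariate_normal(rho, n=100000, burn=1000):
    """Gibbs sampler for a standard bivariate normal with correlation rho:
    alternately draw a0 | a1 ~ N(rho*a1, 1-rho^2) and a1 | a0 likewise.
    The collected pairs converge to the joint distribution p(a0, a1)."""
    random.seed(1)
    sd = (1 - rho * rho) ** 0.5
    a0 = a1 = 0.0
    samples = []
    for i in range(n + burn):
        a0 = random.gauss(rho * a1, sd)   # sample p(a0 | a1)
        a1 = random.gauss(rho * a0, sd)   # sample p(a1 | a0)
        if i >= burn:
            samples.append((a0, a1))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
m = len(samples)
corr = sum(x * y for x, y in samples) / m  # means ~0, variances ~1
```

The empirical correlation of the Gibbs samples matches the $\rho$ of the joint, even though we only ever sampled one coordinate at a time.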

## Using cluster indices

It is possible to calculate stuff using the parameters $\theta$ directly, but considering that we are here interested in clustering, it makes sense to introduce one level of indirection. In this case: cluster indices. Each parameter $\theta_i$ is paired with a cluster index $c_i$. The values of the cluster indices do not matter, as long as all the parameters that belong to the same cluster have the same value $c_i$. For computational convenience a series of integers is appropriate, but it could just as well have been names from a dictionary about tropical trees.

In Neal’s paper the derivation of the following is quite brief, so let me expand here a bit. The system under consideration here is not a Dirichlet Process, but a finite mixture of $K$ components with a Dirichlet prior on the mixing proportions:

$$(p_1,\ldots,p_K) \sim \mathrm{Dirichlet}(\alpha_1,\ldots,\alpha_K)$$

$$c_i \mid p_1,\ldots,p_K \sim \mathrm{Categorical}(p_1,\ldots,p_K)$$

Now we have to integrate out the mixing proportions $p_i$ to get a formula that operates on cluster indices. Again, the number of cluster indices is exactly equal to the number of parameters and exactly equal to the number of observations.

Let us now consider the Dirichlet-Multinomial, as it is often called (a terrible name!). As you see from the above, it should rather be called the Dirichlet-Categorical (see for a derivation stackexchange again):

$$p(c_1,\ldots,c_n) = \frac{\Gamma(\alpha)}{\Gamma(n+\alpha)} \prod_{k=1}^K \frac{\Gamma(n(c_l=k)+\alpha_k)}{\Gamma(\alpha_k)}$$

Here $\alpha=\sum_k \alpha_k$, $n=\sum_k n(c_l=k)$ and $l$ runs from $1$ to $n$ (over all observations). We will use the shorter notation $n_k=n(c_l=k)$ for the number of cluster indices that point to the same cluster $k$. Do not forget, however, that this number is a function of $c_l$, or else you might think: where have all the $c$’s gone!? And of course the guy who uses a variable $p$ in probabilistic texts should be a $p$-victim in Jackass. :-) I think you deserve a break at this moment:

In our case $\alpha_k$ is taken to be the same for all clusters, namely $\alpha/K$:

$$p(c_1,\ldots,c_n) = \frac{\Gamma(\alpha)}{\Gamma(n+\alpha)} \prod_{k=1}^K \frac{\Gamma(n_k+\alpha/K)}{\Gamma(\alpha/K)}$$
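As a quick numerical sanity check (my own sketch, not from Neal’s paper): the Dirichlet-Categorical with symmetric parameters $\alpha/K$ can be evaluated through log-Gamma functions, and its probabilities over all $K^n$ assignments must sum to one.

```python
import math
from itertools import product
from collections import Counter

def dirichlet_categorical(c, K, alpha):
    """P(c_1, ..., c_n) with the mixing proportions integrated out against
    a symmetric Dirichlet(alpha/K, ..., alpha/K) prior."""
    n = len(c)
    counts = Counter(c)
    logp = math.lgamma(alpha) - math.lgamma(n + alpha)
    for k in range(K):
        n_k = counts.get(k, 0)
        logp += math.lgamma(n_k + alpha / K) - math.lgamma(alpha / K)
    return math.exp(logp)

# sanity check: the probabilities of all K^n assignments sum to one
K, n, alpha = 3, 4, 1.5
total = sum(dirichlet_categorical(list(c), K, alpha)
            for c in product(range(K), repeat=n))
```

Working in log space with `math.lgamma` avoids overflow for larger $n$, which matters once the counts $n_k$ grow.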

What we are after is:

And:

Basically the $;\alpha$ means that we identified it as a variable. However, we haven’t decided yet if it is going to be a random variable or just a parameter. We can just leave it out in the next equations, just remember that the entire thing depends on $\alpha$ in the end. Anyway, the above two functions are required to calculate the entity we are after. Using the definition of a conditional probability, we calculate the probability of a value for a specific cluster index given all other cluster indices:

I try to follow the notation in Neal’s paper as closely as possible. Do not fry me because I use $c$ here as an instance of $c_i$. It ain’t pretty, I know!

The numerator:

And the denominator, where we have one component index fewer to calculate with (so $n-1$ to sum over):

The notation $n_{k,-i}$ counts the cluster indices as $n_k$ does, but does not take $i$ into account. This notation is admittedly a bit awkward, because the left-hand side does not seem to contain $c_i$. But it does, because $n_k = n(c_l = k)$, if you remember. The notation just becomes a bit less cluttered by writing it down like this, trust me. The division becomes:

And we are happy to see that we can cross out a lot of the terms in this fraction:

As you probably know, $\Gamma(x)=(x-1)!$ for positive integers $x$, so it is not hard to see that $\Gamma(x+1)=x\Gamma(x)$, just as with a factorial: $x! = x(x-1)!$. Also, $\Gamma(x-1)=\Gamma(x)/(x-1)$, which leads to a simplification of the first fraction:

The second fraction requires splitting the product for $c_i=k$ or, equivalently, $k=c$. So, we bring the index that (amongst others) is pointed at by observation $c_i=k$ to the front:

If we think this through, we now know that $n_c$ does contain the cluster corresponding to the observation with index $i$ and hence that $n_k$ does not contain that cluster. Thus the following is absolutely equivalent to the above:

Thinking this through, we also realize that $n_c$ does indeed contain the cluster corresponding to observation $i$, and therefore must be exactly one larger than if it were not counting $c_i$!

We can again make use of the $\Gamma(x+1)=x\Gamma(x)$ fact:

And we can move the Gamma term with $n_{c,-i}$ within the product again:

Let me reiterate the equation from above (because it has now scrolled off your screen, unless you hold your laptop vertically, on its side, when reading this, or use a ridiculously high resolution):

Fill in our neat endeavour with shifting in and out of the product, just to go through a Gamma function, and we arrive at:

$$p(c_i=c \mid c_{-i}) = \frac{n_{c,-i} + \alpha/K}{n-1+\alpha}$$
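To convince myself of this dozen-step derivation, a small numerical check (my own sketch): compute the conditional as the ratio of two Dirichlet-Categorical joints, $p(c_i \mid c_{-i}) = P(c)/P(c_{-i})$, and compare it with the closed form $(n_{c,-i}+\alpha/K)/(n-1+\alpha)$.

```python
import math
from collections import Counter

def log_dm(c, K, alpha):
    """log P(c) for the Dirichlet-Categorical with symmetric alpha/K."""
    n = len(c)
    counts = Counter(c)
    lp = math.lgamma(alpha) - math.lgamma(n + alpha)
    for k in range(K):
        lp += math.lgamma(counts.get(k, 0) + alpha / K) - math.lgamma(alpha / K)
    return lp

K, alpha = 4, 2.0
c = [0, 1, 1, 3, 0, 1]          # a full assignment; pick i = 2, so c_i = 1
i = 2
c_rest = c[:i] + c[i+1:]
# conditional as a ratio of joints: p(c_i | c_{-i}) = P(c) / P(c_{-i})
ratio = math.exp(log_dm(c, K, alpha) - log_dm(c_rest, K, alpha))
# closed form: (n_{c,-i} + alpha/K) / (n - 1 + alpha)
n_c_minus_i = c_rest.count(c[i])
closed = (n_c_minus_i + alpha / K) / (len(c) - 1 + alpha)
```

The two numbers agree to machine precision, which is reassuring after all that Gamma-function shuffling.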

The first time I encountered the sentence “just integrate out the mixing proportions $p_i$” I did not have a clue that a dozen steps like these were required. But, perhaps, I need them spelled out in more detail than a person with a normal brain…

The subsequent step of $K \rightarrow \infty$ probably also involves many steps, but only for a mathematically rigorous mind, not my sloppy one. If we only consider cluster $c_i=c$, we can use the equation above and let $\alpha/K$ go to zero. The only other assumption we need in this case is that there is some other observation with cluster index $c_i$, so $n_{c,-i} \neq 0$.

Or, equivalently:

If we consider $c_i=c$ again and now with $n_{c,-i}=0$:

I won’t often say “of course”, but here it is of course that $i \neq j$. Then, if we allow $c_i$ to take all possible $K$ values, the sum in this case is not one because of the “and” operation:

So, this is straightforwardly the number of clusters, $K$, times the previously calculated probability for a single cluster. We arrive sweetly - without taking limits in this case - at:

We kept here explicit that we consider all possible cluster indices for $c_i$ by drawing it from the set $\Omega(\mathbf{c})$. Sorry, this is not standard notation, feel free to suggest improvements using the discussion section! The reason why I explicitly add it is because $p(c_i\neq c_j \textit{ and } i \neq j)$ might be read as the sum of the probabilities of any two cluster indices being unequal (where we do not fix $i$ to a specific cluster index). Note also that in Neal’s exposition $i$ is the last observation. The notation here does not assume that $i$ is the last observation, and hence uses $i \neq j$ rather than $j < i$, but this is absolutely equivalent.
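These limiting conditionals can be used directly to sample cluster indices sequentially, which is the Chinese restaurant process view of the same formulas. A minimal sketch of my own (function name and defaults are mine): an existing cluster $c$ is chosen with probability proportional to $n_c$, a brand-new cluster with probability proportional to $\alpha$.

```python
import random

def chinese_restaurant_process(n, alpha, seed=42):
    """Sample n cluster indices using the K -> infinity conditionals:
    existing cluster k with probability n_k / (i + alpha),
    a brand new cluster with probability alpha / (i + alpha)."""
    random.seed(seed)
    assignments = []
    counts = []                               # counts[k] = size of cluster k
    for i in range(n):
        r = random.uniform(0, i + alpha)      # total mass is i + alpha
        acc = 0.0
        for k, n_k in enumerate(counts):
            acc += n_k
            if r < acc:
                assignments.append(k)         # join existing cluster k
                counts[k] += 1
                break
        else:
            assignments.append(len(counts))   # open a new cluster
            counts.append(1)
    return assignments, counts

assignments, counts = chinese_restaurant_process(n=200, alpha=1.0)
```

The rich-get-richer effect is visible in `counts`: a few clusters absorb most of the observations, while the expected number of clusters grows only logarithmically in $n$.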

## Nonconjugate priors

Neal’s article is again a source of insight. Make sure, however, to also take a look at the first people working on these problems: Antoniak, for example, extended Ferguson’s Dirichlet Process to the mixture model case, and reading this old text might be enlightening as well. Let us recapitulate quickly what we have. We have an equation for the conditional probabilities of cluster indices given the other cluster indices for an infinite number of clusters. The equation has two parts. One in which we have a cluster index that has been encountered before:

And one part in which we have a cluster index that has not been encountered before:

The term $c_i=c_j \textit{ and } i \neq j$ is true for any $i \neq j$, while $c_i \neq c_j \textit{ and } i \neq j$ is true only when all of $c_j$ are unequal to $c_i$.

We now have to take a shortcut and bluntly introduce the Metropolis-Hastings algorithm. Why it works I would love to show another time. It is a typical Monte Carlo method driven by detailed balance. Ehrenfest, the inventor of the dog-flea model, would have been proud.

To sample a density $\pi(c_i)$, a proposal density $g(c_i^*\mid c_i)$ is used in calculating an acceptance probability $a(c_i^*,c_i)$:

$$a(c_i^*, c_i) = \min\left(1, \frac{\pi(c_i^*)\, g(c_i \mid c_i^*)}{\pi(c_i)\, g(c_i^* \mid c_i)}\right)$$

With probability $a$ the new state $c_i^*$ is accepted. With probability $1-a$ the old state $c_i$ is maintained.

Here we made explicit that $k^*$ and $k$ might be different. Moreover, the notation $c_{-i}$ stands for $c_1,\ldots,c_n$ with $c_i$ excluded.

At first, I went off on a tangent here. I thought Neal was picking the following as a transition probability:

And subsequently I started calculating this ratio, where I made mistakes as well, not realizing at first that $k$ and $k^*$ do not need to be the same. However, after “breaking my head”, I finally understood that Neal proposes something much simpler:

In such a case the acceptance rule becomes indeed quite simple:

Only the likelihood is used to calculate the Metropolis-Hastings ratio.
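A sketch of how simple this becomes in code, assuming (as in Neal’s simpler proposal) that the candidate cluster is drawn from the conditional prior, so prior and proposal cancel and only the likelihood ratio remains. The Gaussian likelihood, names, and numbers here are my own toy choices:

```python
import math
import random

def mh_update(c_i, c_star, y_i, theta, likelihood):
    """One Metropolis-Hastings step where c_star was proposed from the
    conditional prior: the acceptance probability reduces to the
    likelihood ratio F(y_i, theta*) / F(y_i, theta)."""
    a = min(1.0, likelihood(y_i, theta[c_star]) / likelihood(y_i, theta[c_i]))
    return c_star if random.random() < a else c_i

# toy example: two clusters with (unnormalized) Gaussian likelihoods
random.seed(7)
theta = {0: -2.0, 1: 2.0}
normal = lambda y, mu: math.exp(-0.5 * (y - mu) ** 2)
y_i = 1.8                       # observation lying close to cluster 1

# propose moving observation i from cluster 0 to cluster 1, many times
moves = sum(mh_update(0, 1, y_i, theta, normal) for _ in range(1000))
```

Since the observation sits near cluster 1, the likelihood ratio exceeds one and every proposed move is accepted; the reverse move would almost always be rejected.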

## Auxiliary parameters

I have to study this part later.