## Legendre Transform

The Legendre transform describes a function - in the normal Legendre case, a convex function (but for the generalized case, see [1]) - as a function of its supporting hyperplanes. In the case of a 2D function these are supporting lines. The supporting lines are the lines that just touch the function. These lines do not intersect the function anywhere else if the function is convex.

Why is the Legendre transform interesting? It is basically just writing down a function in a different manner. The Legendre transform is particular useful in the situation that the derivatives of a function $f(x)$ are easier to describe than the function itself. For a nice introduction see [2]. Intuitively, we map from points $(x,f(x))$ to $(m,g(m))$, with $m$ the slope and $g(m)$ some value so that we can recover $x$ and $y$ (see below for the exact definition).

In the following example, we will get get the Legendre transform of the function:

And in particular, $c=2$.

The Legendre transform is defined (with proper domain $I$) by:

In the animation below the first curve that is drawn is the convex function $f(x)$ at hand. Subsequently we will be performing a for loop in which we draw lines $h(m)$ over different slopes $m$. The curves $k(m)$ depict the difference between $f(x)$ and the line $h(m)$. The maximum difference is found by $sup_x$, the supremum operation, which sweeps over all $x$. This difference is visualized through a single peak with height $g(m)$. Last, the Legendre transform is drawn as $g(m)$ for all different slopes $m$ we try.

To refresh the animation, press F5.

I hope the Legendre transform is a little bit less mysterious after you’ve seen this animation. The animation has been created by making use of the vis.js library. The source code can be found at legendre.js.

## Physics

A very common example of the Legendre transform in physics is the one that transform the Lagrangian into the Hamiltonian. The Lagrangian describes a system in generalized position and velocity coordinates:

The Hamiltonian is the Legende transform of the Lagrangian:

The Lagrangian is the function $f(x)$ of above. The sum (or actually, the inner product $% %]]>$) is replacing the $mx$ term.

## Probability theory

The rate function is defined as the Legendre transform of the scaled cumulant generating function of a random variable $A_n$. The scaled cumulant generation function of $A_n$ is defined by the limit:

And the Gärtner-Ellis theorem establishes under some conditions the rate function:

This theorem is more general than Cramér’s theorem, which is only valid for independent and identically distributed random variables.

Now you know what a Legendre transform entails, you might start to notice it in the most surprising places! For example in clustering multivariate normal distributions [3]!

1. Legendre-Fenchel Transforms in a Nutshell, a good explanation of the Legendre-Fenchel generalization of the Legendre transform
2. Making Sense of the Legendre Transform
3. Clustering Multivariate Normal Distributions

## What's the Thalamus?

If you’re interested in how things work, our brain is one of the most intriguing devices around. I love reverse engineering stuff. Understanding limits and colimits within category theory can be just as rewarding as getting to terms with the intricate structure of the brain.

The thalamus is, hmmm, particular shaped, and centered in the middle of our head, resting on our cerebellum.

One thing that is very interesting of the thalamus is that it has a thalamic nucleus (a bunch of neurons) for every type of sensory system (see also 5):

• visual input from the retina is sent to the lateral geniculate nucleus;
• auditory input is similarly sent to the medial geniculate nucleus;
• somatosensory input is sent to the ventral posterior nucleus;
• olfactory input and input from the amygdala is sent to the medial dorsal nucleus.

There are some other inputs as well, such as from:

• the mammillary bodies, apparently having a recollective memory function, towards the anterior nuclei;
• the basal ganglia, having to do with motion planning, project towards the ventral anterior nucleus;
• the basal nuclei (substantia negra and globus pallidus), involved with motor coordination and learning, project into the ventral lateral nucleus;
• the neospinothalamic tract and medial lemniscus, associated with pain and proprioception, enter the ventral posterolateral nucleus;
• the trigeminothalamic tract and the solitary tract, having to do with touch and vibration in the face, respectively, being involved in taste, go up to the ventral posteromedial nucleus.

It is quite interesting that the pulvinar nuclei are quite big in humans (making up 40% of the thalamus) compared to cats or rats. Lesions can lead to attentional deficits. This part also seems to influence the onset of saccades.

## Maps

It is a remarkable fact that Newton already foresaw that if the brain has to have a continuous representation of continuous objects in its view, there must be crossing (decussation) in which the left part of the left eye is merged with the left part of the right eye, and the right part of the right eye is merged with the right part of the left eye.

Quite remarkably however is how neatly this has been organized. Lesions at the cortex (thank fast and precise bullets of the first world war for these) lead to very fine lesions at the retina all the way through the thalamus. All intermediate neurons die slowly off because of a lack of input of higher levels (called retrograde cell degradation). The interesting thing is that this death is very local. Apparently there is a very strict map from the retina onto the thalamus (the lateral geniculate nucleus) and onto the primary visual cortex. It is incredible that this has been already completely organized before our eyes open as babies!

Note, that it is not always easy to find out the nature of a map. In the early stages of the olfactory system, for example, maps represent groups of chemicals that are similar on a molecular level.

## Higher-order

The thalamus isn’t limited to being a relay station to the cortex (see 2). Apart from inputs (often called afferents in neuroscience) directly from our senses, the thalamus receives back connections from the cortex. This is very interesting from an architectural point of view. First the thalamus projects unto different parts of the cortex, and then these parts project back on the thalamus… There must be something interesting going on in the thalamus!

Not considering the particular properties of the modality that is processed, this allows someone to daydream about what the thalamus might be doing. The information bottleneck method introduced by Tishby introduces bottlenecks in for example networks on purpose. A bottleneck in a neural network is formed by pruning a lot of connections in a fullly connected network such that there are a few neurons left through which information necessarily has to pass if one part of the network needs to communicate with another part of the network. In a layered network, such neurons form a “bottleneck layer” with fewer neurons than in the other layers. By doing this the designer of the network forces representations to be expressable by only a small set of neurons. In machine learning this can be understood as a form of dimensionality reduction. A high-dimensional function is represented using only a few (nonlinear) units. A typical form of dimensionality reduction in standard machine learning is by picking a limited number of orthogonal dimensions in the data with high variance using principle component analysis. Using neurons to do dimensionality reduction however is more akin to autoencoders (revived in deep learning) in which there is less control of the form of the “bottleneck representations”.

## Phase-locked loop

But… all this is nice theory from a computer science perspective. Neuroscientists of course don’t care: they want to explain actual biology. The next figure is about the vibrissal system: the neural circuitry around whiskers!

The question at hand: Do neurons in the thalamus merely relay information or do they process information. For a computer scientist this seems a silly question. However, what is meant with this is the following. A “relay function” might actually improve the signal to noise ratio as in a repeater station, or optimize bandwidth, or other such functions as long as the content of the signal is not adjusted. A “process function” would be basically everything else… In the above study which mainly reiterates the findings of Groh et al. (2008) it is described how for the whisker system a phase-locked loop is implemented, well-known by all engineers in the world (since Huygens). By making sure that the deviations between input and output phase get minimized it naturally follows that the frequencies of the input and output will be synchronized. Hence, the thalamus-cortex circuits will oscillate on the same frequency as the physical “whisking”.

Quite disappointing from a machine learning perspective… We had hopes on some abstract processing going on, such as dimensionality reduction, and we are left with phase-locked loop… But, do not dispair! This is just a single modality. Perhaps the thalamus fulfills widely different functions for different modalities.

Other studies such as that of coughing (see 3), neither identify the thalamus with some generic function. This study by the way is a bit sickening. A genetically altered herpes virus is used to trace signal pathways in the brains of mice. The virus jumps from one neuron to the next through synapses as would a normal signal as well.

## Drivers versus modulators

A study by Crosson is broader in scope. First, it addresses the two ways in which neurons in the thalamus fire. If the neuron is under a larger negative potential (polarized) compared to its surroundings, it emits bursts of spikes. If, however, the neuron has a smaller negative potential (depolarized) compared to its environment, it linearly sends its inputs to its outputs. The latter might be seen as a form of high-fidelity information transmission, while the former does only convey some low-fidelity information. It might be related to the difference when someone is paying attention. The two different forms are often called “burst” mode versus “tonic” mode. You might think, but if there are bursts of spikes, it seems more information can be transmitted! The reason why this is not the case is because bursts are quite regular, it could just have been called “oscillatory” mode as well. During sleep most neurons in the thalamus are in burst mode.

The author addresses also the hypothesis of Sherman and Guillery in which cortex-thalamus-cortex connections are suggested as “drivers” (sending over information), while cortex-cortex connections are “modulators” (determining the weight, importance, or actual arrival of this information). This seems like the architecture as postulated by Baars and implemented by Franklin, the global workspace theory, although the latter is cast in terms of “consciousness”, which isn’t the smartest way to objectively speak about and study brain architectures. Global workspace theory adds to this architecture a sense of how modulation would be implemented, namely by inhibation between competing cortex areas, enabling only a subset of cortex-thalamus-cortex connections to be active at any time. Crosson postulates a function for the cortex-thalamus-cortex route that in my opinion is not so different from this architecture, namely that the thalamus can prioritize and pay “attention” or give priority to one part of the cortex above the other. The difference might exist in the nature in which such a winner-take-all is implemented. Is it through the thalamus as central hub who integrate all votes from all cortical areas and then decide who’s next on stage? Or is it through inhibition from cortical areas to each other? This type of research cries on input from voting network models and consensus theory from computer science.

## Consciousness

Note, that there is no reason to subscribe a conscious seat to the thalamus. Although the thalamus has to do with attention and might enable the brain to modulate conscious processing, there are reasons to assume that (conscious) experiences arises from a thin sheet of neurons underneath the neocortex, named the claustrum. Watch in this respect the below video featuring Koubeissi’s experiment with an epileptic patient:

## Sequences

In all cases, it must be ensured that a (majority) vote should be a temporary event. The same piece of cortex winning all the time means neural death for the rest of the cortex. It is indeed possible to solve it by having cortex regions vote for each other (not only for themselves). This is for example proposed for winner-take-all networks in action selection. It is also possible to have infrastructure in place in which a winner “kills itself” through inhibition of return. In that case activation of a cortical area would automatically diminish its own importance by inhibiting its own afferents from the thalamus. This is the model postulated in the work on visual saliency by Itti et al. It is very hard to imagine an infrastructure which only would implement inhibition of return. In my opinion this would very likely be interwoven with the need of creating sequences of activation.

## Sensor fusion

What is also very interesting to note is that nobody speaks about thalamus-thalamus connections. This means that the thalamus very likely does not play a role in sensor fusion.

## Core versus Matrix

There are two main groups of neurons in the thalamus projecting to the cortex. They can be distinghuished by the type of proteins they express. The core neurons express parvalbumin and the matrix neurons express calbindin. In the auditory system (see 6), the core neurons are finely tuned with respect to frequency, while the matrix neurons are broadly tuned and respond positively to acoustical transients. It seems the core neurons like pitch (content), while the matrix neurons like rhythm (context).

## Layers

I don’t feel science has progressed enough to unequivocally tell which layers are connected to which layers and how the thalamus plays a role in it. Commonly accepted is the pathway L4 -> L2/3 -> L5. However, very recently (see 7) it has been uncovered that there are direct connections from the thalamus to L5 as well. The thalamus seems to active two layers of the cortex in parallel.

# Literature

There are many other facets of the thalamus to talk about. For example Adaptive Resonance Theory has been used to describe the nature of the cortex-thalamus loops. However, if I write about this again, it will be about a nice model that has temporal components as well as the ability to build up sequences. I haven’t found something with enough believable detail yet, so I’ll have to be patient till more is figured out.

1. Exploring the Thalamus (2001) Sherman and Guillery a nice overview of all kind of matters around the thalamus, and - really nice! - only using a few abbreviations!
2. Thalamic Relay or Cortico-Thalamic Processing? Old Question, New Answers (2013) by Ahissar and Oram, one of the most recent studies of higher-order nuclei.
3. Sensorimotor Circuitry involved in the Higher Brain Control of Coughing (2013) Mazzone et al.
4. Thalamic Mechanisms in Language: A Reconsideration Based on Recent Findings and Concepts (2013) Crosson
5. The Thalamic Dynamic Core Theory of Conscious Experience (2011) Ward
6. Thalamocortical Mechanisms for Integrating Musical Tone and Rhythm (2013) Musacchia et al.
7. Deep Cortical Layers are Activated Directly by Thalamus (2013) Constantinople and Bruno

# Introduction

It all started with annoying messages that nobody seems to understand (/var/log/syslog):

Apr 25 12:28:03 six kernel: [    1.346712] ata3.00: supports DRM functions and may not be fully accessible
Apr 25 12:28:03 six kernel: [    1.347278] ata3.00: supports DRM functions and may not be fully accessible
Apr 25 12:28:03 six kernel: [    3.047797] [drm] Initialized drm 1.1.0 20060810
Apr 25 12:28:03 six kernel: [    3.076720] [drm] Memory usable by graphics device = 2048M
Apr 25 12:28:03 six kernel: [    3.076726] fb: switching to inteldrmfb from EFI VGA
Apr 25 12:28:03 six kernel: [    3.076841] [drm] Replacing VGA console driver
Apr 25 12:28:03 six kernel: [    3.101307] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
Apr 25 12:28:03 six kernel: [    3.101309] [drm] Driver supports precise vblank timestamp query.
Apr 25 12:28:03 six kernel: [    3.143256] fbcon: inteldrmfb (fb0) is primary device
Apr 25 12:28:03 six kernel: [    3.146818] [drm] Initialized i915 1.6.0 20141121 for 0000:00:02.0 on minor 0
Apr 25 12:28:03 six kernel: [    3.424167] [drm:intel_set_pch_fifo_underrun_reporting [i915]] *ERROR* uncleared pch fifo underrun on pch transcoder A
Apr 25 12:28:03 six kernel: [    3.424202] [drm:intel_pch_fifo_underrun_irq_handler [i915]] *ERROR* PCH transcoder A FIFO underrun
Apr 25 12:28:03 six kernel: [    3.848413] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device


What is this problem about FIFO (first-in-first-out) underruns? What is a transcoder? What is PCH? What is a DRM? The following quest will try to find some answers…

# The system

What do I actually have residing in my laptop? First of all, it is important to know what to search for. There are three basic components that we need to know, namely, what is:

• the processor
• the chipset
• the integrated graphics unit

## Processor

I have a N56VZ which has an i7-3610QM processor. The processor architecture is from the Ivy Bridge variety (on 22 mm). There are plenty of datasheets available for the 7-series chipset (pdf). Note also that this Ivy Bridge series is also called the 3rd Gen Intel Core family (see also this Intel Datasheet (pdf)).

## Chipset

The chipset that is used in tandem with this processor is, some people state, the HM76 chipset. In particular, the BD82HM76 PCH. This “PHC” is known under the name Panther Point.

So, what we have here is a hardware setup that is divided over the CPU and a thing called a PCH. PCH stands for Platform Controller Hub. There are some nice pictures in a presentation (pdf) on the 2nd Gen processors, which show how both the CPU and the PCH are integrated in the same chip (but on separate dies). The PCH of this previous generation is called Cougar Point.

Another picture from Intel shows the Panther Point PCH:

Anyway, we apparently have the BD82HM76 chipset, the “Panther Point” PCH (from 2012).

## Integrated graphics

The Intel integrated graphics unit in my Intel Core i7-3610QM system is a HD Graphics 4000 running on 650 MHz base speed and 1100 MHz turbo speed notebookcheck.net. The HD Graphics 4000 is an Intel HD variant compatible with the Ivy Bridge line of processors.

# Linux

So, how does Linux address this integrated graphics card? The first entity we encounter is the DRM (Direct Rendering Manager). The DRM provides a single entry point for multiple user-space applications to use the video card.

The Direct Rendering Manager has a DRM core that is independent of the specific drivers required for whatever type of hardware that is on your system. It provides the interface to user-space code. A driver consists out of two parts, GEM (Graphics Execution Manager) and KMS (Kernel Mode Setting). GEM has been developed by Keith Packard and Eric Anholt (see lwn.net) from Intel.

The reasoning behind GEM is pretty interesting. It tries to remove latency as much as possible, which is nicely visualized by the following picture.

KMS allows to set display modes, as its name betrays, from kernel-space. A user space graphics server (like X) does now also not need superuser rights anymore (in theory). The display mode concerns matters such as screen resolution, color depth, and refresh rate. Of course, the functionality of GEM and KMS could have been implemented in a single software entity, but there are technical reasons not to do so (basically to account for split hardware).

Laurens Pinchart has a nice presentation on DRM, KMS, and in particular, writing drivers at YouTube. The picture below is from his presentation.

You see that in memory there are two structures, frame buffers (old) and planes (new). Subsequently, you see something that sounds very old-fashioned, a CRTC (Cathode Ray Tube Controller). This is just nomenclature from the past. It is basically a reference to a scan-out buffer, a part of (Video) RAM that will be displayed on your screen. It also has (a reference to) a display mode, offsets into the video memory, etc. The controller links to an encoder (which can be off-chip). If it links to multiple encoders, these will receive data from the same scanout buffer, so they will display the same, cloned, data. The connectors, finally, can be only connected to one encoder, and know how to talk HDMI, TVout, CRT, LVDS, etc. It also has the information about the display, EDID data, DPMS, and connection status. Make sure to read also the man-pages about drm-kms.

Startup with drm.debug=14 (or if you’re a hexademically inclined person, drm.debug=0x0e) and you will get plenty of debug information from the DRM. For example, that we have the Panther Point PCM is indeed confirmed in the syslog:

Apr 25 13:12:35 six kernel: [    3.243360] [drm] Initialized drm 1.1.0 20060810
Apr 25 13:12:35 six kernel: [    3.280255] [drm:i915_dump_device_info] i915 device info: gen=7, pciid=0x0166 rev=0x09 flags=is_mobile,need_gfx_hws,is_ivybridge,has_fbc,has_hotplug,has_llc,
Apr 25 13:12:35 six kernel: [    3.280271] [drm:intel_detect_pch] Found PantherPoint PCH


Just pore over the information yourself, restarting your computer with this debug flag. You will see for example how many display pipes are available (3 in my case). It will enable something like ENCODER:31:LVDS-31 and CONNECTOR:30:LVDS-1. And you spot how there are several connectors to choose from:

Apr 25 13:12:35 six kernel: [    3.303850] [drm:intel_dsm_platform_mux_info] MUX info connectors: 5
Apr 25 13:12:35 six kernel: [    3.303853] [drm:intel_dsm_platform_mux_info]   port id: LVDS
Apr 25 13:12:35 six kernel: [    3.303860] [drm:intel_dsm_platform_mux_info]   port id: Analog VGA
Apr 25 13:12:35 six kernel: [    3.303867] [drm:intel_dsm_platform_mux_info]   port id: HDMI/DVI_C
Apr 25 13:12:35 six kernel: [    3.303874] [drm:intel_dsm_platform_mux_info]   port id: DisplayPort_B
Apr 25 13:12:35 six kernel: [    3.303881] [drm:intel_dsm_platform_mux_info]   port id: DisplayPort_D


Something interesting to our errors are latency settings:

Apr 25 13:12:35 six kernel: [    3.304061] [drm:intel_print_wm_latency] Primary WM0 latency 12 (1.2 usec)
Apr 25 13:12:35 six kernel: [    3.304063] [drm:intel_print_wm_latency] Primary WM1 latency 4 (2.0 usec)
Apr 25 13:12:35 six kernel: [    3.304065] [drm:intel_print_wm_latency] Primary WM2 latency 16 (8.0 usec)
Apr 25 13:12:35 six kernel: [    3.304067] [drm:intel_print_wm_latency] Primary WM3 latency 32 (16.0 usec)
Apr 25 13:12:35 six kernel: [    3.304068] [drm:intel_print_wm_latency] Sprite WM0 latency 12 (1.2 usec)
Apr 25 13:12:35 six kernel: [    3.304070] [drm:intel_print_wm_latency] Sprite WM1 latency 4 (2.0 usec)
Apr 25 13:12:35 six kernel: [    3.304072] [drm:intel_print_wm_latency] Sprite WM2 latency 16 (8.0 usec)
Apr 25 13:12:35 six kernel: [    3.304073] [drm:intel_print_wm_latency] Sprite WM3 latency 32 (16.0 usec)
Apr 25 13:12:35 six kernel: [    3.304075] [drm:intel_print_wm_latency] Cursor WM0 latency 12 (1.2 usec)
Apr 25 13:12:35 six kernel: [    3.304077] [drm:intel_print_wm_latency] Cursor WM1 latency 4 (2.0 usec)
Apr 25 13:12:35 six kernel: [    3.304079] [drm:intel_print_wm_latency] Cursor WM2 latency 16 (8.0 usec)
Apr 25 13:12:35 six kernel: [    3.304080] [drm:intel_print_wm_latency] Cursor WM3 latency 64 (32.0 usec)


At virtuousgeek some matters are explained around power management of displays. For example memory can enter self-refresh in which it consumes way less power. The display plane FIFO watermarks can be set conservatively, however, this leads to long periods in which self-refresh doesn’t happen. It can be set aggressively, FIFO underruns can occur. See this nice FPGA implementation of a FIFO buffer to understand watermarks in more detail.

Now, we know what kind of hardware we have, let us first try to blindly apply the newest kernel from Intel…

# Trying the newest DRM kernel

We try to run the newest kernel from the guys who are concerned with incorporating the lastest changes from Intel using the drm-intel-next packages from kernel.ubuntu.com:

#!/bin/sh

website=kernel.ubuntu.com/~kernel-ppa/mainline
kernel=2015-04-24-vivid
version=4.0.0-997
subversion=4.0.0-997.201504232205

wget ${website}/drm-intel-next/${kernel}/linux-image-${version}-generic_${subversion}_amd64.deb
wget ${website}/drm-intel-next/${kernel}/linux-headers-${version}_${subversion}_all.deb
wget ${website}/drm-intel-next/${kernel}/linux-headers-${version}-generic_${subversion}_amd64.deb

sudo dpkg -i linux-headers-${version}*.deb linux-image-${version}*.deb


Installing this kernel (contrary to just kernel 4.0.0 RC7 for example) leads to trouble. Moreover, the same error about underruns on the transcoder occurs… So, this is not something already being addressed… And that’s all we wanted to know. So, we go back to the default kernel (4.0.0-040000rc7-generic #201504142105 of April the 14th).

This gives me an idea. Let’s disable some of the power saving options of the i915 kernel module.

sudo grep '' /sys/module/i915/parameters/*


Subsequently I tried the option i915.enable_fbc=1 (to see if compression would lower the bandwidth and hence lead to fewer underruns). The rigorous option i915.powersave=0 didn’t work as well. There is an option i915.enable_rc6=3 on my system. I set it to 0, but I guess powersave overrules all that anyway. Also setting i915.fastboot=1 doesn’t get rid of the underruns. All this is not a very thought out approach anyway…

If you git clone git://kernel.ubuntu.com/virgin/linux.git v4.0-rc7, and navigate to drivers/gpu/drm/i915 you will encounter a file intel_fifo_underrun.c which supports underrun detection on the PCH transcoder. It is the last function in the file:

/**
* intel_pch_fifo_underrun_irq_handler - handle PCH fifo underrun interrupt
* @dev_priv: i915 device instance
* @pch_transcoder: the PCH transcoder (same as pipe on IVB and older)
*
* This handles a PCH fifo underrun interrupt, generating an underrun warning
* into dmesg if underrun reporting is enabled and then disables the underrun
* interrupt to avoid an irq storm.
*/
void intel_pch_fifo_underrun_irq_handler(struct drm_i915_private *dev_priv,
enum transcoder pch_transcoder)
{
if (intel_set_pch_fifo_underrun_reporting(dev_priv, pch_transcoder,
false))
DRM_ERROR("PCH transcoder %c FIFO underrun\n",
transcoder_name(pch_transcoder));
}


It is nice that the message will be only displayed once. This is initiated through the IRQ from the Intel graphics unit. And, of course, this is at the end: it’s only the guy reporting the bad news.

The intel_display.c has a suspicious statement about Gen 2 chips:

/*
* Gen2 reports pipe underruns whenever all planes are disabled.
* So don't enable underrun reporting before at least some planes
* are enabled.
* FIXME: Need to fix the logic to work when we turn off all planes
* but leave the pipe running.
*/
if (IS_GEN2(dev))
intel_set_cpu_fifo_underrun_reporting(dev_priv, pipe, true);


Of course, this is reporting about underruns for the CPU, not the PCH, and it is Gen 2, not Gen 3 of the chips. But it might be a symptom of something peculiar in the hardware.

In the North Display Registers document you see a nice overview of the sequence in which the display needs to be set.

You can follow along in the code, in particular, in the function ironlake_crtc_enable. Here you will see an order like intel_enable_pipe -> ironlake_pch_enable -> intel_crtc_enable_planes. What is remarkable is that step 7 in which the planes are configured is done after the port on the PHC is enabled in the code. In the haswell code, this is the same, but this function is preceded with haswell_mode_set_planes_workaround. It is also interesting to note that the “Notes” in this table state that the CPU FDI Transmitter should not be set to idle while the PCH transcoder is enabled because it will lead to PCH transcoder underflow.

On bugzilla there are several bug reports on the FIFO underrun, such as 79261. However, none of them seems to be about the HD Graphics 4000 in particular, and even less providing any solution. A patch for ArchLinux just makes DRM_DEBUG messages from the DRM_ERROR messages.

What I can think of are only a few things (because I don’t understand much of it, yet):

• Somehow the initialization isn’t done correctly. Clocks aren’t initialized at the right time, not synchronized. Or anything else with respect to initialization is forgotten.
• Latency is configured incorrectly. Perhaps the BIOS of my Asus (although up to date) hands over latency values that are not sufficiently high.
• The interrupt is generated incorrectly (as in Gen2, for example indeed when all planes are disabled).

My issues aren’t solved, so I’ll need to delve further into details some other time.

# More literature

If you’d like to read more:

David Herrmann for example is the guy behind render nodes (integrated since 3.17).

The Intel 7 Series PCH datasheet contains all kind of interesting information. See for example Fig. 5-13 for another view on the display architecture.

## Sampling of Dirichlet Process

Thousands of articles describe the use of the Dirichlet Process, but very few describe how to sample from it. Most often one is referred to Markov chain sampling methods for Dirichlet process mixture models (pdf) by Radford Neal (at University of Toronto), which is a nice piece of work, but still a bit dense as an introduction. I contacted him by email about certain things that were difficult to understand at first and he was kind enough to respond, thanks a lot! Definitely also check out his blog in which he regularly showcases his fast version of R.

Let us introduce the Dirichlet Process with two lines:

First of all, if you haven’t been reading an article about this material in the last decennia, the notation $\sim$ stands for “is distributed as” (see stackexchange).

On the first line we have defined that $G$ being distributed as a Dirichlet Process (DP) with parameters $H$ and $\alpha$. The parameter $H$ has kind of the role of a mean, the parameter $\alpha$ has kind of the role of precision (inverse variance). The parameter $H$ is called the base distribution, the parameter $\alpha$ is called the dispersion factor.

On the second line we generate a bunch of parameters $\theta_i$ conditioned on the generated distribution $G$. A peculiar property of this distribution is that exactly the same $\theta_i$ can be sampled multiple times. Hence, the Dirichlet Process can generate discrete objects out of a continuous base distribution $H$.

If we have parameters we can subsequently generate observations as well:

Here $F$ describes the mapping of parameters to observations.

We can integrate over $G$ to obtain a presentation such as:

And just to write this down in a totally equivalent notation for your convenience:

Here we have picked an index $m$ mid-way, and not a new item $n+1$ which complicates the notation considerably. The notation $\theta_{-m}$ with the minus sign means in this case the set of parameters $(\theta_1,\ldots,\theta_M)$ with $\theta_m$ excluded.

This trick in which we can act as if a parameter mid-way is the same as a new one, is possible because of an underlying assumed property: exchangeability of observations.

In both notations we have $\delta_{\theta}$, the distribution at the point $\theta$. Here $\delta$ has the beautiful role of a functor which maps a function to a value. Something we will see back when we will take Hilbert spaces seriously, but that’s something for later.

If we incorporate the likelihood $F(y_i,\theta_i)$, we will arrive at the following formula:

Here $C$ is a normalization factor to make it a proper probability summing to one. The entity $H_i$ is the posterior distribution of $\theta$ given $H$ as prior and given observation $y_i$. The integral on the right side has the imposing form of a (Lebesgue-)Stieltjes integral. The Lebesgue integral is a nice topic on its own as well, but skimming through wikipedia should be sufficient. A one-line summary: For a two-dimensional function you can see it as “horizontal” integration rather than “vertical” integration. The Stieltjes part of it generalizes integration by (linear) parts $dx = x_i - x_{i-1}$ to that of integration by $dg(x)$. If $g(x)$ is differentiable everywhere and the derivative is continuous, it has the more familiar presentation:

There is a solid reason why this familiar notation is insufficient here. In probability theory we often have a nice cumulative distribution function $g(x)$. However, things become nasty for the probability density function as such. The Lebesgue measure for $g'(x)$ (the “horizontal” integration) if the distribution of $x$ is discrete becomes nonsense.

Studying this multiplicative (or convolutive) form it is clear that a conjugate pair of $F$ and $H$ is extremely convenient.

The definition of $p(\theta_i\ \mid \theta_{-i},y_i)$ leads to Gibbs sampling by just drawing from it for all indices $(i=1,\ldots,n)$.

To remind you about the basics of Gibbs sampling; if we sample the conditionals $p(a_0\ \mid a_1,b)$ and $p(a_1\ \mid a_0,b)$ iteratively, we will get the distribution $p(a_0,a_1\ \mid b)$ which is joint in $a_0$ and $a_1$. So, our sampling results in $p(\theta\ \mid y)$ with $y$ all observations. Note that Gibbs sampling allows you to forget a value $\theta_i$ when it is time to sample it again.

## Using cluster indices

It is possible to calculate stuff using the parameters $\theta$ directly, but considering that we are here interested in clustering, it makes sense to introduce one level of indirection. In this case: cluster indices. Each parameter $\theta_i$ is paired with a cluster index $c_i$. The values of the cluster indices do not matter, as long as all the parameters that belong to the same cluster are having the same value $c_i$. For computational convenience a series of integers is appropriate, but it could just have as well be names from a dictionary about tropical trees.

In Neal’s paper the derivation of the following is quite brief, so let me expand here a bit. The system under consideration here is not a Dirichlet Process, but a mixture of $K$ Dirichlet distributions:

Now we have to integrate out the mixing proportions $p_i$ to get a formula that operates on cluster indices. Again, the number of cluster indices is exactly equal to the number of parameters and exactly equal to the number of observations.

Let us now consider the Dirichlet-Multinomial as it is often called (terrible name!). As you see from above it should be called the Dirichlet-Categorical (see for a derivation stackexchange again).

Here $\alpha=\sum_k \alpha_k$, $n=\sum_n n(c_l=k)$ and $l$ runs from $1$ to $n$ (over all observations). We will use the shorter notation $n_k=n(c_l=k)$ as the number of cluster indices that point to the same cluster $k$. Do not forget however that this number is a function of $c_l$, or else you might think: where have all the $c$’s gone!? And of course the guy who uses a variable $p$ in probabilistic texts should be a $p$-victim in Jackass. :-) I think you deserve a break on the moment:

In our case $\alpha_k$ is taken to be the same for all clusters, so $\alpha/K$:

What we are after is:

And:

Basically the $;\alpha$ means that we identified it as a variable. However, we haven’t decided yet if it is going to be a random variable or just a parameter. We can just leave it out in the next equations, just remember that the entire thing depends on $\alpha$ in the end. Anyway, the above two functions are required to calculate the entity we are after. Using the definition of a conditional probability, we calculate the probability of a value for a specific cluster index given all other cluster indices:

I try to follow here the notation in Neal’s paper as close as possible. Do not fry me because I use $c$ here as an instance of $c_i$. It ain’t pretty, I know!

The nominator:

And the denominator where we have one less component index to calculate with (so $n-1$ to sum over):

The notation $n_{k,-i}$ counts the cluster indices as for $n_k$ but does not take into account $i$. This notation is a bit awkward admittedly, because the left-hand side does not seem to contain $c_i$. But it does because $n_k = n(c_l = k)$ if you remember. The notation just becomes a bit less cluttered by writing it down like this, trust me. The division becomes:

And we are happy to see we can cross-out a lot of the terms in this fraction:

As you probably know $\Gamma(x)=(x-1)!$, so it is not hard to see that $\Gamma(x+1)=x\Gamma(x)$ just as with a factorial: $x! = x(x-1)!$. Also, $\Gamma(x-1)=\Gamma(x)/(x-1)$, which leads to a simplification of the first fraction:

The second fraction requires splitting of the product for $c_i=k$ or, equivalently, $k=c$. So, we bring the index that (amongst others) is pointed at by observation $c_i=k$ to the front:

If we think this through, we now know that $n_c$ does contain the cluster corresponding with the observation with index $i$ and henceforth that $n_k$ does not contain that cluster. Thus the following is absolutely equivalent to the above:

Thinking this through, we also realize that $n_c$ does indeed contain this cluster corresponding with observation $i$, and therefore must be exactly one larger than if it would not be counting $c_i$!

We can again make use of the $\Gamma(x+1)=x\Gamma(x)$ fact:

And we can move the Gamma term with $n_{c,-i}$ within the product again:

Let me reiterate the equation from above (because it has now scrolled off your screen except if you keep your laptop vertically - on its side - when reading this, or if you use a ridiculous high resolution):

Fill in our neat endevour with shifting in and out of the product, just to go through a Gamma function, and we arrive at:

The first time I encountered the sentence “just integrate out the mixing proportions $p_i$” I did not have a clue that a dozen steps like these were required. But, perhaps, I need them spelled out in more detail than a person with a normal brain…

The step subsequently fo $K \rightarrow \infty$ contains probably also many steps, but only for a mathematical rigorous mind, not my sloppy one. If we only consider cluster $c_i=c$, we can use the equation above and assume $\alpha/K$ goes to zero. The only other assumption we need in this case, is that there is some other observation with cluster index $c_i$, so $n_{c,-i} \neq 0$.

Or, equivalently:

If we consider $c_i=c$ again and now with $n_{c,-i}=0$:

I won’t say often “of course”, but here it is of course that $i \neq j$. Then if we allow $c_i$ to take all possible $K$ values, in this case the sum is not one because of the “and” operation:

So, this is straightforward the number of clusters, $K$, times the previously calculated probability for a single cluster. We arrive sweetly - without taking limits in this case - at:

We kept here explicit that we consider all possible cluster indices for $c_i$ by drawing it from the set $\Omega(\mathbf{c})$. Sorry, this is not standard notation, feel free to suggest improvements using the discussion section! The reason why I explicitly add it is because $p(c_i\neq c_j \textit{ and } i \neq j)$ might be read as the sum of the probabilities of any two cluster indices being unequal (where we do not fix $i$ to a specific cluster index). Note also that in Neal’s exposition $i$ is the last observation. The notation here does not assume that $i$ is the last observation, and hence uses $i \neq j$ rather than $% $, but this is absolutely equivalent.

## Nonconjugate priors

Neal’s article is again a source of insight. Make sure however to also take a look at the first people working on these problems. Antoniak for example extended Ferguson’s Dirichlet Process to the mixture model case and reading this old text might be enlightening as well. Let us recapitulate quickly what we have. We have an equation for the conditional probabilities of cluster indices given the other cluster indices for an infinite number of clusters. The equation has two parts. One in which we have a cluster index that has been encountered before:

And one part in which we have a cluster index that has not been encountered before:

The term $c_i=c_j \textit{ and } i \neq j$ is true for any $i \neq j$, while $c_i \neq c_j \textit{ and } i \neq j$ is true only when all of $c_j$ are unequal to $c_i$.

We now have to take a shortcut and bluntly introduce the Metropolis-Hastings algorithm. Why it works I would love to show another time. It is a typical Monte Carlo method driven by detailed balance. Ehrenfest, the inventer of the dog-flea model would have been proud.

To sample a density $\pi(c_i)$ a proposal density, $g(c_i^*\mid c_i)$, is used in calculating an acceptance probability $a(c_i^*,c_i)$:

With probability $a$ the new state $c_i^*$ is accepted. With probability $1-a$ the old state $c_i$ is maintained.

Here we made explicit that $k^*$ and $k$ might be different. Moreover, the notation $c_{-i}$ stands for $c_1,\ldots,c_n$ with $c_i$ excluded.

First, I went on a tangent here. I thought Neal was picking the following as a transition probability:

And subsequently I started calculating this ratio, where I made mistakes as well not realizing at first that $k$ and $k^*$ do not need to be the same. However, after “breaking my head”, I finally understood that Neal proposes something much simpler:

In such a case the acceptance rule becomes indeed quite simple:

Only the likelihood is used to calculate the Metropolis-Hastings ratio.

## Auxiliary parameters

I have to study this part later.

## Bayesian Approximation

In the world of Bayesian’s, a model is a triplet $p(x,z,\theta)$. The observations are random variables $x$. And then there is a seemingly artificial distinction between the random variables that are called hidden ($z$) and other random variables that are called parameters ($\theta$), and are hidden as well! So, how come that parameters got their distinguished name? In the case of for example a clustering task, we can assign each observation a corresponding hidden variable: an index of the cluster it belongs to. Hence, there are as many hidden variables as there are observations. Now, in contrary, we might define parameters in two different ways:

• The number of parameters is fixed. The parameters are defined as that part of the model that is not adjusted with respect to the number of observations. In $k$-means there are $k$ means that need to be estimated. There are not more means to be estimated if the number of observations grows.
• Parameters are more variable then the hidden variables. There are not just as many parameters as observations, no their number needs to be derived from the data. In nonparameteric Bayesian methods there will be more clusters (means) when the number of observations grows.

Consider a dataset $D$, we would like to get our hands on $p(D,z,\theta)$. If this all concerns discrete variables, this is a 3D ‘matrix’ (a tensor of rank 3) with the observations along one dimension, the hidden variables along another dimension, and the parameters along the third dimension. The joint distribution $p(D,z,\theta)$ would tell us everything. Some examples where either marginalize out a variable, or use the definition of a conditional probability:

To get to the matrix $p(D,z)$ we sum over all entries in the rank 3 matrix in the direction of the dimension $\theta$. In other words, we are not interested anymore to what kind of fine granular information $\theta$ brings to bear. We are only interested in how often $D$ occurs with $z$.

The problem is, we often do not have $p(D,z,\theta)$. Now, we suppose we have instead all of the following conditional distributions, $p(D|z,\theta)$, $p(z|\theta,D)$, and $p(\theta|D,z)$, would we have enough information?

The answer is a resounding yes. If you have all conditional distributions amongst a set of random variables you can reconstruct the joint distribution. It always helps to experiment with small discrete conditional and joint probability tables. In this question on https://stats.stackexchange.com you see how I construct a joint probability table from the corresponding conditional probability tables. Note the word corresponding: as Lucas points out in his answer there are problems with compatibility. If the conditional probabilities correspond to a joint probability, the latter can be found. However, if you come up with random conditional probability tables it is very likely that the joint probability table does not exist.

The paper that I found very clarifying with respect to the different ways to do approximations to the full Bayesian method is Bayesian K-Means as a “Maximization-Expectation” Algorithm (pdf) by Welling and Kurihara.

So, let us assume we are given a dataset $D$ (hence $p(D)$) and we want to find $p(z,\theta|D)$.

1. Gibbs sampling

We can approximate $p(z,\theta|D)$ by sampling. Gibbs sampling gives $p(z,\theta|D)$ by alternating:

2. Variational Bayes

Instead, we can try to establish a distribution $q(\theta)$ and $q(z)$ and minimize the difference with the distribution we are after $p(\theta,z|D)$. The difference between distributions has a convenient fancy name, the Kullback-Leibler divergence. To minimize $KL[q(\theta)q(z)||p(\theta,z|D)]$ we update:

This type of variational Bayes that uses the Kullback-Leibler divergence is also known under mean-field variational Bayes.

3. Expectation-Maximization

To come up with full-fledged probability distributions for both $z$ and $\theta$ might be considered extreme. Why don’t we instead consider a point estimate for one of these and keep the other nice and nuanced. In Expectation Maximization the parameter $\theta$ is established as the one being unworthy of a full distribution, and set to its MAP (Maximum A Posteriori) value, $\theta^*$.

Here $\theta^* \in \operatorname{argmax}$ would actually be a better notation: the argmax operator can return multiple values. But let’s not nitpick. Compared to variational Bayes you see that correcting for the $\log$ by $\exp$ doesn’t change the result, so that is not necessary anymore.

4. Maximization-Expectation

There is no reason to treat $z$ as a spoiled child. We can just as well use point estimates $z^*$ for our hidden variables and give the parameters $\theta$ the luxury of a full distribution.

If our hidden variables $z$ are indicator variables, we suddenly have a computationally cheap method to perform inference on the number of clusters. This is in other words: model selection (or automatic relevance detection or imagine another fancy name).

5. Iterated conditional modes

Of course, the poster child of approximate inference is to use point estimates for both the parameters $\theta$ as well as the observations $z$.

To see how Maximization-Expectation plays out I highly recommend the article. In my opinion, the strength of this article is however not the application to a $k$-means alternative, but this lucid and concise exposition of approximation.