Jekyll2021-06-10T22:36:26+00:00https://annevanrossum.com/feed.xmlRobots, machine learning, global issuesA blog about robots, machine learning, and other random stuffAnne van RossumPlaying with DMX2020-05-16T20:28:06+00:002020-05-16T20:28:06+00:00https://annevanrossum.com/2020/05/16/playing-with-dmx<p>During these times I decided to start playing with DMX. I bought the <a href="https://www.lumeri.nl/lumeri-wash-710.html">Lumeri Wash 7.10</a>. It has RGBW LEDs, 9 or 16 channels, and a moving head. It uses DMX512.
The DMX in <a href="https://www.element14.com/community/groups/open-source-hardware/blog/2017/08/24/dmx-explained-dmx512-and-rs-485-protocol-detail-for-lighting-applications">DMX512</a> stands for Digital Multiplex (protocol). Lights like this have a DMX input and output, so they can be chained. A collection of DMX devices is called a <strong>universe</strong>.</p>
<p>DMX is super simple. It is a serial interface at 250,000 bits per second, with RS-485 as the electrical interface. The 512 in DMX512 stands for the number of data bytes that can be sent per frame. If each device uses one channel, a universe can hold 512 devices. The above device already uses 9 or 16 channels, so I guess a universe can fill up quickly.</p>
<!--more-->
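<p>The frame layout described above can be sketched in a few lines of Python. This is my own hypothetical illustration of the DMX512 byte layout; the function name is made up and not tied to any interface library:</p>

```python
def build_dmx_frame(channels, n_slots=512):
    """Build a DMX512 frame: a zero start code followed by up to 512
    one-byte channel slots (values 0-255). Channel numbers are 1-based."""
    frame = bytearray(1 + n_slots)  # slot 0 is the start code (0x00)
    for channel, value in channels.items():
        if not 1 <= channel <= n_slots:
            raise ValueError("channel %d out of range" % channel)
        if not 0 <= value <= 255:
            raise ValueError("value %d out of range" % value)
        frame[channel] = value
    return bytes(frame)

# A 9-channel fixture at address 10 occupies channels 10-18; a 512-slot
# universe therefore fits at most 512 // 9 = 56 such fixtures.
frame = build_dmx_frame({10: 255, 11: 128})
```

<p>Sending the frame over the wire is then a matter of writing these bytes to the serial device at 250 kbaud, preceded by a break.</p>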
<p>Steering DMX from the Pi’s UART requires it to run at this speed of 250 kbaud. This is possible with some tricks. See this article on someone creating <a href="https://eastertrail.blogspot.com/2014/04/command-and-control-ii.html">OLA support</a>. These instructions can be found at various other locations as well.</p>
<p>Adjust <code class="language-plaintext highlighter-rouge">/boot/config.txt</code> and add this line at the end:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>init_uart_clock=16000000
</code></pre></div></div>
<p>Also adjust the kernel to not use serial at boot (see <a href="https://elinux.org/RPi_Serial_Connection#Preventing_Linux_using_the_serial_port">here</a>).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo raspi-config
</code></pre></div></div>
<p>Disable the boot messages here, but not the serial device itself.</p>
<p>Also, remove the console parameter from <code class="language-plaintext highlighter-rouge">/boot/cmdline.txt</code>. I don’t know if this is actually necessary, because it already has <code class="language-plaintext highlighter-rouge">plymouth.ignore-serial-consoles</code>. I also didn’t find a getty on the serial line, and there is no <code class="language-plaintext highlighter-rouge">/etc/inittab</code> file. None of the processes was using <code class="language-plaintext highlighter-rouge">ttyAMA0</code> (a quick check with <code class="language-plaintext highlighter-rouge">ps</code>). I’ve executed the following anyway:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo systemctl disable serial-getty@ttyAMA0.service
</code></pre></div></div>
<p>When reading <a href="https://www.raspberrypi.org/forums/viewtopic.php?t=244741">here</a>, it states that GPIO 14/15 are used by the Bluetooth device. It can be disabled. The instructions in <code class="language-plaintext highlighter-rouge">/boot/overlays/README</code> are actually quite clear.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name: disable-bt
Info: Disable onboard Bluetooth on Pi 3B, 3B+, 3A+, 4B and Zero W, restoring
UART0/ttyAMA0 over GPIOs 14 & 15.
N.B. To disable the systemd service that initialises the modem so it
doesn't use the UART, use 'sudo systemctl disable hciuart'.
Load: dtoverlay=disable-bt
Params: <None>
</code></pre></div></div>
<p>Indeed <code class="language-plaintext highlighter-rouge">sudo systemctl disable hciuart</code> sounds like something that should be done then as well.</p>
<p>If we need Bluetooth later on, we might want to use <code class="language-plaintext highlighter-rouge">core_freq_min=500</code> to prevent core clock scaling, which is the actual issue: the GPIO pins derive their clock from the system bus clock, and that clock changes depending on the system load.</p>
<h1 id="dmx-interface">DMX interface</h1>
<p>I got a <a href="https://bitwizard.nl/wiki/Dmx_interface_for_raspberry_pi">DMX interface</a> for the Raspberry Pi. You can also buy a case for it, which is really neat.</p>
<p>First I tried QLC+.</p>
<p>When trying OLA, adjust <code class="language-plaintext highlighter-rouge">/etc/ola/ola-uartdmx.conf</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/dev/ttyAMA0-break = 100
/dev/ttyAMA0-malf = 24000
device = /dev/ttyAMA0
enabled = true
</code></pre></div></div>
<p>Following the instructions on the <a href="https://bitwizard.nl/wiki/Dmx_interface_for_raspberry_pi">site</a>, I tried to set the board to output mode.</p>
<p>The <code class="language-plaintext highlighter-rouge">gpio</code> utility doesn’t seem to be maintained anymore. There is <code class="language-plaintext highlighter-rouge">raspi-gpio</code>, however. Display the configuration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>raspi-gpio get
</code></pre></div></div>
<p>Set it like described:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>raspi-gpio set 18 op
raspi-gpio set 18 dh
raspi-gpio set 14 a0
raspi-gpio set 15 a0
</code></pre></div></div>
<p>The result is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GPIO 14: level=1 fsel=4 alt=0 func=TXD0 pull=NONE
GPIO 15: level=1 fsel=4 alt=0 func=RXD0 pull=UP
...
GPIO 18: level=1 fsel=1 func=OUTPUT pull=DOWN
</code></pre></div></div>
<p>In <a href="https://www.raspberrypi.org/forums/viewtopic.php?t=176531">this post</a> it is argued that all of this is too complicated.</p>
<h2 id="ola">OLA</h2>
<p>Installing OLA was very simple, just:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt install ola
</code></pre></div></div>
<p>Navigate to something like <code class="language-plaintext highlighter-rouge">192.168.86.246:9090</code>, replacing the IP address with that of your Pi.</p>
<p><img src="/images/blog/uart_native_dmx.png" alt="UART driver" /></p>
<p>Now I’ll browse forums to make this work….</p>Anne van RossumStreaming to your TV2020-03-22T11:21:30+00:002020-03-22T11:21:30+00:00https://annevanrossum.com/2020/03/22/streaming-to-your-tv<p>If you’re in quarantine or in isolation, there’s a lot of staying inside. Perhaps you have to be in another room.
Perhaps you just want to stream some online event to a larger screen. In either case, you want to figure out how
to stream your desktop to your TV. If you happen to have a Chromecast, this is possible, but there are many ways to
accomplish this. We will go through a few.</p>
<!--more-->
<p>Streaming from Firefox is possible through a utility that’s called <code class="language-plaintext highlighter-rouge">fx_cast</code>. It only works for a select list of (whitelisted)
pages. Netflix can be streamed like this for example.</p>
<p>If you want to have more freedom in what you stream, it is worth looking at <code class="language-plaintext highlighter-rouge">mkchromecast</code> (or Home Assistant), which
is a wrapper around <a href="https://github.com/balloob/pychromecast">pychromecast</a>. The latest release of <a href="https://github.com/muammar/mkchromecast">mkchromecast</a>
is from December 2017, version 0.3.8.1. You can also clone and install the newest version 0.3.9 (not released).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/muammar/mkchromecast
cd mkchromecast
pip3 install .
</code></pre></div></div>
<p>We can slowly go through all kinds of variants to call it, but let’s just drop the bomb:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkchromecast --video --command 'ffmpeg \
-f pulse -ac 2 \
-i default -acodec aac \
-f x11grab -framerate 30 -video_size 3200x1800 \
-i :0.0+0,0 \
-vaapi_device /dev/dri/renderD128 -vf format=nv12,hwupload,scale_vaapi=w=1920:h=1080 -c:v h264_vaapi \
-bf 4 -threads 4 \
-preset ultrafast -tune zerolatency -maxrate 10M -bufsize 20M \
-pix_fmt yuv420p \
-g 15 \
-f mp4 \
-max_muxing_queue_size 999 \
-movflags frag_keyframe+empty_moov \
pipe:1'
</code></pre></div></div>
<p>You might need to remove the tabs and put it all on one line if you actually run this on the command line! So, what
does it all mean?</p>
<p>The <code class="language-plaintext highlighter-rouge">lavfi</code> parameter stands for a libavfilter input virtual device. This reads data from input devices that can be
anything (they do not need to be files). You can see examples online where just colors are streamed for example, or
where video is negated or other special effects are applied. Here it turns out not to be necessary. :-)</p>
<p>The <code class="language-plaintext highlighter-rouge">pulse</code> parameter is for audio. It uses <code class="language-plaintext highlighter-rouge">pulseaudio</code>, has two channels <code class="language-plaintext highlighter-rouge">-ac 2</code>, uses the <code class="language-plaintext highlighter-rouge">default</code> source, and the
<code class="language-plaintext highlighter-rouge">aac</code> audio codec. The <code class="language-plaintext highlighter-rouge">-strict experimental</code> option is not necessary.</p>
<p>Note that in pulseaudio you will need to change the input from the microphone to the “monitor” of that microphone to
be able to stream the audio that normally would come out of your laptop speakers.</p>
<p>When I had both lavfi and experimental I had a big mismatch between video and audio. I’ll have to figure out where it
comes from. In <code class="language-plaintext highlighter-rouge">pavucontrol</code> I selected the “Monitor of Built-in Audio Digital Stereo (HDMI)” channel. Now I selected the
“Monitor of Null Output”. It does not sound like it went okay, but there’s no mismatch now. :-)</p>
<p>Then we want to broadcast our desktop; this is done through a screen grab command, <code class="language-plaintext highlighter-rouge">-f x11grab</code>. The frame rate and
video size are obvious. Note that the latter is quite high. Adjust it to your own screen’s resolution. Check that
e.g. by <code class="language-plaintext highlighter-rouge">xdpyinfo | awk '/dimensions/{print $2}'</code>. The screen we pick is the one at <code class="language-plaintext highlighter-rouge">:0.0</code>. If you don’t have a
second monitor that’s probably the same for you.</p>
<p>This is a Yoga 900 laptop. It has an integrated Intel GPU. This can be put to work with the following combination of flags:
<code class="language-plaintext highlighter-rouge">-vaapi_device /dev/dri/renderD128 -vf format=nv12,hwupload,scale_vaapi=w=1920:h=1080 -c:v h264_vaapi</code>.</p>
<p>I didn’t find any improvements using <code class="language-plaintext highlighter-rouge">-re</code>, which is supposed to help with real-time streaming. The <code class="language-plaintext highlighter-rouge">-f ismv</code> for smooth streaming does
not help either. It is a fragmented format. The packets and metadata about these packets are stored together. A
fragmented file can be decodable even if the writing is interrupted. It also requires less memory. It can be considered
as setting a bunch of flags like <code class="language-plaintext highlighter-rouge">-movflags empty_moov,faststart</code>, etc.</p>
<p>The <a href="https://developers.google.com/cast/docs/reference/messages#MediaData">Google Cast documentation</a> has LIVE as a
possible <code class="language-plaintext highlighter-rouge">streamType</code>. This is used in version <code class="language-plaintext highlighter-rouge">0.3.9</code> of <code class="language-plaintext highlighter-rouge">mkchromecast</code>. The <code class="language-plaintext highlighter-rouge">currentTime</code> option should definitely
<strong>not</strong> be set. If not specified, the stream will start at the live position.</p>
<p>According to <a href="https://www.reddit.com/r/PleX/comments/b768ym/pretranscoding_question_best_ffmpeg_settings_for/">this post</a>
the Chromecast (v2) is limited to 11 Mbps. A buffer should be 2x the bitrate, so at 8 Mbps the buffer should be set to 16M.</p>
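<p>That rule of thumb translates into trivial arithmetic. A minimal sketch of my own (the names are made up):</p>

```python
CHROMECAST_V2_MAX_MBPS = 11  # reported limit of the Chromecast v2

def bufsize_for(maxrate_mbps):
    """Buffer size (in Mbit) as 2x the maximum bitrate."""
    if maxrate_mbps > CHROMECAST_V2_MAX_MBPS:
        raise ValueError("maxrate exceeds the Chromecast v2 limit")
    return 2 * maxrate_mbps

# Matches the ffmpeg command above: -maxrate 10M goes with -bufsize 20M.
bufsize = bufsize_for(10)
```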
<p><a href="https://www.videosolo.com/tutorials/chromecast-mkv.html">Here</a> it states what formats Chromecast supports:</p>
<ul>
<li>MP4</li>
<li>WebM</li>
<li>MPEG-DASH</li>
<li>Smooth Streaming</li>
<li>HTTP Live Streaming (HLS)</li>
</ul>
<p>A Chromecast can support a range of container formats (e.g. also MKV) as long as the content uses an H.264 video codec and/or an
AAC audio codec.</p>
<p><a href="https://en.wikipedia.org/wiki/Dynamic_Adaptive_Streaming_over_HTTP">DASH</a> stands for Dynamic Adaptive Streaming over
HTTP. It is codec-agnostic and can again use H.264 (or VP8).</p>
<p>According to <a href="https://developers.google.com/cast/docs/media">Google</a>, the 1st and 2nd generation Chromecast can
support the H.264 High Profile up to level 4.1 (720p/60fps, or 1080p/30fps), or VP8. Then there are several delivery
methods and adaptive streaming protocols through the <a href="https://developers.google.com/cast/docs/caf_receiver">Cast Application Framework (CAF)</a>,
each with DRM support as well (not relevant to us):</p>
<ul>
<li>MPEG-DASH (<code class="language-plaintext highlighter-rouge">.mpd</code>)</li>
<li>SmoothStreaming (<code class="language-plaintext highlighter-rouge">.ism</code>)</li>
<li>HTTP Live Streaming (HLS) (<code class="language-plaintext highlighter-rouge">.m3u8</code>)</li>
</ul>
<p>There is also a progressive download format without adaptive switching.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkchromecast --video --command 'ffmpeg \
-re \
-f pulse -ac 2 -i default -acodec aac \
-f x11grab -framerate 30 -video_size 3200x1800 -i :0.0+0,0 \
-vaapi_device /dev/dri/renderD128 -vf format=nv12,hwupload,scale_vaapi=w=1920:h=1080 -c:v h264_vaapi \
-bf 4 -threads 4 -preset ultrafast -tune zerolatency -maxrate 10M -bufsize 20M \
-pix_fmt yuv420p -g 30 \
-movflags isml+frag_keyframe \
-f ismv \
pipe:1'
</code></pre></div></div>
<p>Streaming format <code class="language-plaintext highlighter-rouge">hls</code> stands for Apple HTTP Live Streaming. With it, ffmpeg was unable to find a suitable output format.</p>
<p><a href="https://developers.google.com/cast/v2/mpl_player#cors">Suggestion</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>To start at "live" you can specify the Infinity property as the initialTime parameter to the player.load API call
</code></pre></div></div>
<p>Spotify’s approach to the Chromecast bitrate is discussed in <a href="https://community.spotify.com/t5/Other-Partners-Web-Player-etc/Chromecast-bitrate-solution-verified/td-p/4661520">this Spotify community thread</a>.</p>
<p>I changed <code class="language-plaintext highlighter-rouge">mtype</code> in <code class="language-plaintext highlighter-rouge">mkchromecast/video.py</code> to <code class="language-plaintext highlighter-rouge">application/x-mpegurl</code>.</p>
<p>Something on H.264 vs. H.265 still needs to be written up.</p>Anne van RossumWasserstein and Gromov-Wasserstein2019-08-01T10:12:07+00:002019-08-01T10:12:07+00:00https://annevanrossum.com/2019/08/01/wasserstein-and-gromov-wasserstein<p>Suppose we have to come up with some kind of function that defines how different two probability distributions are.
One such function is the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a>.
It is an asymmetric function: it gives a different value for
probability distribution $A$ given probability distribution $B$ versus the other way around. It is therefore not a
true <strong>distance</strong> (which is symmetric), but a so-called <strong>divergence</strong>. A divergence also need not satisfy the
“triangle inequality”: \(D(x \| z) \leq D(x \| y) + D(y \| z)\) is not necessarily true for all $x$, $y$, and $z$.
It does, however, satisfy two other important conditions: a divergence is always zero or larger, and it
is zero if and only if \(x = y\).</p>
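<p>The asymmetry is easy to verify numerically. A small sketch of my own, for discrete distributions:</p>

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions
    (terms with p_i = 0 contribute nothing; q must be strictly positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)  # D(p || q)
d_qp = kl_divergence(q, p)  # D(q || p): a different value
# kl_divergence(p, p) is exactly zero and both directions are positive,
# but d_pq != d_qp: the divergence is not symmetric.
```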
<h2 id="earth-movers-distance">Earth Mover’s distance</h2>
<p>There is another function that defines how different two probability distributions are. It is called the
<a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">Earth Mover’s Distance</a>. This name has been suggested
by Jorge Stolfi when he was working with CAD programs that have a function to compute the optimal earth displacement
from roadcuts to roadfills. The concept is much older though (from <a href="https://en.wikipedia.org/wiki/Gaspard_Monge">Gaspard Monge</a>
in 1781, more than two centuries ago).</p>
<!--more-->
<h2 id="1st-wasserstein-distance">1st Wasserstein distance</h2>
<p>Mathematically this function is equivalent to the 1st <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a>.
The 1st Wasserstein distance between two probability measures $\mu$ and $\nu$ is defined as:</p>
\[W(\mu, \nu) = \inf_{\lambda \in \Gamma(\mu,\nu)} \int_{M \times M} d(x,y) d\lambda(x,y)\]
<p>where $\Gamma(\mu,\nu)$ is the collection of all measures on $M \times M$ with marginals $\mu$ and $\nu$.
This set $\Gamma(\mu,\nu)$ is called the set of all couplings of $\mu$ and $\nu$. We can also write it as:</p>
\[W(\mu, \nu) = \inf_{\lambda \in \Gamma(\mu,\nu)} \int_\mu \int_\nu d(x,y) \lambda(x,y) dx dy\]
<p>where we assume $\lambda$ to be well-behaved.</p>
<p>Informally, among all the ways to send over “stuff” from position $x$ to position $y$, we find the one that
minimizes (more precisely: takes the infimum of) the total cost weighted by the distance $d(x,y)$, so that everything is moved.
Note, that $\mu$ and $\nu$ are probability measures. They sum to one. There is no mass lost or gained in the process.</p>
<p>We can use the Wasserstein distance to calculate the difference between point clouds. The above distance now becomes
discrete:</p>
\[W(\mu, \nu) = \min_{\lambda \in \Gamma(\mu,\nu)} \sum_{x \in \mu} \sum_{y \in \nu} d(x,y) \lambda(x,y)\]
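<p>For two equal-size point sets carrying uniform mass \(1/n\), the optimal coupling is a permutation, so for tiny sets we can brute-force the minimum. A pure-Python sketch of my own, feasible only for very small \(n\):</p>

```python
import math
from itertools import permutations

def wasserstein_1_bruteforce(mu, nu):
    """1st Wasserstein distance between two equal-size point sets with
    uniform mass 1/n per point, by brute-forcing all couplings that are
    permutations. O(n!), so for illustration only."""
    n = len(mu)
    assert len(nu) == n
    d = [[math.dist(x, y) for y in nu] for x in mu]
    return min(
        sum(d[i][perm[i]] for i in range(n)) / n
        for perm in permutations(range(n))
    )

mu = [(0.0, 0.0), (1.0, 0.0)]
nu = [(0.0, 1.0), (1.0, 1.0)]
# Each point moves straight up by 1, so the distance is 1.0; the crossed
# assignment would cost sqrt(2) per point and is rejected.
w = wasserstein_1_bruteforce(mu, nu)
```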
<!--
We can define the problem as a minimization problem with particular constraints.
Find a flow $F = [f_{i,j}]$ with $f_{i,j}$ the flow between $x_i$ and $y_i$ that minimizes the overall cost. That is,
solve:
$$\min \sum_i \sum_j f_{i,j} d(x_i,y_i)$$
subject to
$$f_{i,j} > 0, 1 \leq i \leq m, 1 \leq j \leq n$$
$$\sum_{i=1}^m f_{i,j} \leq 1, 1 \leq j \leq n$$
$$\sum_{j=1}^n f_{i,j} \leq 1, 1 \leq i \leq m$$
$$\sum_{i=1}^m \sum_{j=1}^n f_{i,j} = \min (m, n)$$.
-->
<h2 id="assignment-problem">Assignment problem</h2>
<p>This can also be formulated as an assignment problem or a matching problem on a (complete) bipartite graph.
The nodes $x \in \mu$ on the left, the nodes $y \in \nu$ on the right.
The edges between the nodes in $\mu$ and the nodes in $\nu$ have weights corresponding to their distance $d(x,y)$. Now,
define a mapping $\lambda$ that minimizes \(W(\mu,\nu)\).</p>
<p>We can write down the constraints that state that each vertex is adjacent to exactly one edge:</p>
\[\sum_{x \in \mu} \lambda(x,y) = 1 \quad \text{for} \quad y \in \nu\]
\[\sum_{y \in \nu} \lambda(x,y) = 1 \quad \text{for} \quad x \in \mu\]
\[0 \leq \lambda(x,y) \leq 1 \quad \text{for} \quad x \in \mu, y \in \nu\]
\[\lambda(x,y) \in \mathbb{Z} \quad \text{for} \quad x \in \mu, y \in \nu\]
<p>The last constraint can actually be removed: the constraint matrix of the assignment problem is totally unimodular, so when we solve the problem without this constraint, we still end up with an optimal
solution that satisfies it.</p>
<h2 id="gromov-wasserstein">Gromov-Wasserstein</h2>
<p>The Gromov-Wasserstein distance can compare point clouds of different dimension. Suppose we have
a point cloud A in 2D and a point cloud B in 3D; it can still assign a distance to the pair. How is this possible? The distance
between two points in any dimension is just a scalar. Hence, if we do not work with the point coordinates themselves,
but only with the pairwise distances between points, we can minimize a function between the pairwise distances in A
and the pairwise distances in B.</p>
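<p>The trick of comparing pairwise-distance matrices instead of coordinates can be shown in a few lines. A sketch of my own; the two example clouds live in different dimensions but happen to be isometric, so their distance matrices coincide:</p>

```python
import math

def pairwise_distances(points):
    """The n x n matrix of Euclidean distances between points. Works for
    points of any dimension, since every entry is just a scalar."""
    return [[math.dist(p, q) for q in points] for p in points]

cloud_2d = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
cloud_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
D_a = pairwise_distances(cloud_2d)  # 3 x 3 matrix of scalars
D_b = pairwise_distances(cloud_3d)  # also a 3 x 3 matrix of scalars
# The coordinates are incomparable (2D vs 3D), but the distance matrices
# are directly comparable -- and in this case even identical.
```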
<h1 id="challenge">Challenge</h1>
<p>Now, the kind of distance we would like to formulate is one that can be used for multiple objects.</p>Anne van RossumNonnegative Autoencoders2018-09-30T14:36:27+00:002018-09-30T14:36:27+00:00https://annevanrossum.com/autoencoder/nonnegativity/2018/09/30/nonnegative-autoencoders<p>My intuition would say that a part-based decomposition should arise naturally within an autoencoder. To incorporate
the next image in an image recognition task, it should be more beneficial for gradient descent to be able to
navigate towards the optimal set of neural network weights for that image. If not, gradient descent
is forever chasing some kind of common denominator, and none of the images is actually properly represented:
for each new image that gets classified better, the other images are classified worse. With a proper
decomposition, learning the next representation will not interfere with previous representations. In
Adaptive Resonance Theory (ART), Grossberg calls this interference catastrophic forgetting.</p>
<p>Maybe if we train a network long enough this will indeed be the emerging decomposition strategy. However, this is not
what is normally found. The different representations become coupled, and there is no decomposition that allows
the network to explore different feature dimensions independently.</p>
<p>One of the means to obtain a part-based representation is to force positive or zero weights in a network. In the
literature <a href="https://yliapis.github.io/Non-Negative-Matrix-Factorization/">nonnegative matrix factorization</a> can be
found. Due to the nonnegativity constraint the features are additive. This leads to a (sparse) basis where through
summation “parts” are added up to a “whole” object. For example, faces are built up out of features like eyes, nostrils,
mouth, ears, eyebrows, etc.</p>
<p><img src="/images/blog/nonnegative_examples.jpg" alt="Nonnegative examples. From top to bottom: 1) Sparse Autoencoder, 2) Nonnegative Sparse Autoencoder, 3) Nonnegativity Constrained Autoencoder, and 4) Nonnegative Matrix Factorization. The nonnegative examples do not use clear cut facial features like eyes and ears, but you see only parts of the image being nonnegative (white). This means an image can be composed using a sum of the displayed images. Copyright Hosseini-Asl et al." /></p>
<!--more-->
<h1 id="sparse-autoencoder-with-nonnegativity-constraint">Sparse Autoencoder with Nonnegativity Constraint</h1>
<p>At Louisville university
<a href="https://github.com/ehosseiniasl">Ehsan Hosseini-Asl (github)</a>,
<a href="http://www.jacekzurada.org/">Jacek Zurada</a> (who is running for 2019 IEEE president), and
<a href="https://twitter.com/olfanasraoui">Olfa Nasraoui (twitter)</a>
studied how nonnegative constraints can be added to an autoencoder in
<a href="https://arxiv.org/pdf/1601.02733.pdf">Deep Learning of Part-based Representation of Data Using Sparse Autoencoders with Nonnegativity Constraints (2016)</a>.</p>
<p>An autoencoder which has a latent layer that contains a part-based representation, only has a few of the nodes active
at a particular input. In other words, such a representation is sparse.</p>
<p>One of the ways a sparse representation can be enforced is to limit the average activation of each hidden unit over all
\(m\) data items (indexed by \(r\)). The average activation of unit \(j\) is:</p>
\[\hat{p}_j = \frac{1}{m} \sum_{r=1}^m h_j(x^{(r)})\]
<p>To make sure that the activation is limited, we can bound \(\hat{p}_j < p\) with \(p\) a small value close to zero.</p>
<p>The usual cost function is just the reconstruction error \(J_E\). Here, we include the activation limitation by adding an additional term:</p>
\[J_{KL}(p || \hat{p}) = \sum_{j=1}^n p \log \frac{p}{\hat{p}_j} + (1-p) \log \frac{1-p}{1-\hat{p}_j}\]
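<p>As a sanity check, this penalty can be evaluated directly. A small sketch of my own, with a scalar target \(p\) and a list of measured average activations:</p>

```python
import math

def kl_sparsity_penalty(p, p_hat):
    """Sum over hidden units of the KL divergence between a Bernoulli(p)
    target activation and the unit's average activation p_hat_j.
    Near zero when every p_hat_j is close to p."""
    return sum(
        p * math.log(p / pj) + (1 - p) * math.log((1 - p) / (1 - pj))
        for pj in p_hat
    )

p = 0.05
low = kl_sparsity_penalty(p, [0.05, 0.05])   # both units on target: ~0
high = kl_sparsity_penalty(p, [0.05, 0.50])  # one unit far too active
```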
<p>We can prevent overfitting by regularization. This can be done by adding noise to the input, dropout, or by penalizing large weights. The latter corresponds to yet another term:</p>
\[J_{O}(W,b) = \frac{1}{2} \sum_{l=1}^2 \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( w_{ij}^l \right)^2\]
<p>The sizes of adjacent layers are indicated by \(s_l\) and \(s_{l+1}\) (the sum over \(l\) runs over the autoencoder’s two weight layers).</p>
<p>The total cost function used by the authors for the sparse autoencoder contains all the above cost functions, each weighted, one by parameter \(\beta\), the other by \(\lambda\).</p>
\[J_{SAE}(W,b) = J_E(W,b) + \beta J_{KL}(p||\hat{p}) + \lambda J_O(W,b)\]
<p>To enforce nonnegativity we can introduce a different regularization term:</p>
\[J_{O}(W,b) = \frac{1}{2} \sum_{l=1}^2 \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} f \left( w_{ij}^l \right)\]
<p>For the nonnegative constrained autoencoder the authors suggest:</p>
\[f(w_{ij}) =
\begin{cases}
w_{ij}^2 & w_{ij} < 0 \\
0 & \text{otherwise}
\end{cases}\]
<p>This term penalizes all negative values. All positive values do not contribute to the cost function.</p>
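<p>This one-sided penalty is straightforward to write down. A small sketch of my own, for a flat list of weights:</p>

```python
def nonneg_penalty(weights):
    """0.5 * sum of w^2 over the negative weights only: negative weights
    are pushed toward zero, positive weights are free."""
    return 0.5 * sum(w * w for w in weights if w < 0)

# Only the two negative entries contribute: 0.5 * (0.25 + 1.0) = 0.625
penalty = nonneg_penalty([0.3, -0.5, 0.0, 2.0, -1.0])
```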
<h2 id="results">Results</h2>
<p>Results are compared between the Sparse Autoencoder (SAE), the Nonnegative Sparse Autoencoder (NNSAE), the Nonnegativity Constrained Autoencoder (NCAE), and Nonnegative Matrix Factorization (NMF).</p>
<p><img src="/images/blog/nonnegative_autoencoder_representation_comparison.jpg" alt="Comparison of representations. 1) SAE, 2) NNSAE, 3) NCAE, 4) NMF" /></p>
<p>The SAE representation contains negative values (dark pixels). The NNSAE representation has neurons with zero weights (complete black nodes).</p>
<p>The receptive fields learned by NCAE are more sparse than the others.
The features from NNSAE and NMF are more local.</p>
<p><img src="/images/blog/nonnegative_constrained_mnist_comparison.png" alt="Nonnegativity Constrained Autoencoder compared using the MNIST classification task with other reconstruction methods. Rows: 1) Original digits, 2) Sparse Autoencoder, 3) Nonnegative Sparse Autoencoder, 4) Nonnegativity Constrained Autoencoder, and 5) Nonnegative Matrix Factorization." /></p>
<h2 id="ideas">Ideas</h2>
<p>To really encourage a part-based decomposition it would be best to enforce either very large values or values that are zero: something like the sum over \(x\) divided by the number of nonzero components, with each \(x\) nonnegative, and maximizing over this.</p>Anne van RossumGenerating Point Clouds2018-09-26T08:15:22+00:002018-09-26T08:15:22+00:00https://annevanrossum.com/deep%20learning/autoencoders/point%20clouds/2018/09/26/generating-point-clouds<p>If we do want robots to learn about the world, we can use computer vision. We can employ traditional methods: build up a full-fledged model from corner detectors, edge detectors, feature descriptors, gradient descriptors, etc. We can also use modern deep learning techniques. One large neural network hopefully captures similar or even better abstractions compared to the conventional computer vision pipeline.</p>
<p>Computer vision is not the only choice though! In recent years there is a proliferation of a different type of data: depth data. Collections (or clouds) of points represent 3D shapes. In a game setting the Kinect was a world-shocking invention using structured light. In robotics and autonomous cars LIDARs are used. There is huge debate about which sensors are gonna “win”, but I personally doubt there will be a clearcut winner. My line of reasoning:</p>
<ul>
<li>Humans use vision and have perfected this in many ways. It would be silly to not use cameras.</li>
<li>Depth sensors can provide information when vision gets obstructed.</li>
<li>Humans use glasses, microscopes, infrared goggles, all to enhance our senses. We are basically cyborgs.</li>
<li>Robots will benefit from a rich sensory experience just like we do. They want to be cyborgs too.</li>
</ul>
<!--more-->
<h1 id="point-clouds">Point clouds</h1>
<p>Point clouds are quite a good description. Objects are represented by individual points in a 3D space. By making the points a bit bigger you can easily figure out the shapes yourself (see the figure below).</p>
<p><img src="/images/blog/dfaust_dataset.png" alt="The D-FAUST dataset and interpolation between the figure totally at the left and the one totally at the right. Copyright: Achlioptas et al. (2018)." title="Interpolation between two distinct human figures shows a gradual transition from one figure to the next" /></p>
<p>A group of researchers who started to study point clouds are
<a href="http://web.stanford.edu/~optas/">Panos Achlioptas</a>,
<a href="https://geometry.stanford.edu/person.php?id=diamanti">Olga Diamanti</a>,
<a href="http://mitliagkas.github.io/">Ioannis Mitliagkas</a>, and
<a href="https://geometry.stanford.edu/member/guibas/">Leonidas Guibas</a> in the paper <a href="https://arxiv.org/pdf/1707.02392.pdf">Learning Representations and Generative Models for 3D Point Clouds</a> from Leonidas Guibas’ Geometric Computation group in the Computer Science Department of Stanford University. Side note: prof. Guibas obtained his PhD under the supervision of Donald Knuth. You can reach Panos via <a href="http://web.stanford.edu/~optas/contact.html">his website</a>. I haven’t found his Twitter account.</p>
<p>There are two main reasons why point clouds should be treated differently from pixels in a neural network:</p>
<ol>
<li>The convolution operator works on a grid. Encoding 3D data on a grid would encode a lot of empty voxels. This means that for point clouds we cannot simply apply convolutions.</li>
<li>Point clouds are permutation invariant. Only the 3D position of a point matters, their id does not. Points in a point cloud can be numbered in any way. It still describes the same object. A comparison operator between two point clouds needs to take this into account. Preferably the latent variables will also be permutation invariant.</li>
</ol>
<h2 id="permutation-invariant-distance-metrics">Permutation invariant distance metrics</h2>
<p>There are permutation invariant distance metrics available. The authors describe the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">Earth Mover’s Distance (EM)</a>. This concept goes back to Gaspard Monge in 1781, who studied how to transport soil from one place to another with minimal effort. Let us define the flow \(f_{i,j}\) between locations \(i\) and \(j\) with distance \(d_{i,j}\); the EM distance between the point sets \(P\) and \(Q\) (with \(i \in P\) and \(j \in Q\)) is then as follows:</p>
\[d_{EM}(P,Q) = \frac{ \sum_{i=1}^m \sum_{j=1}^n f_{i,j} d_{i,j} }{ \sum_{i=1}^m \sum_{j=1}^n f_{i,j} }\]
<p>The individual flow is multiplied with the corresponding distance. The overall sum is normalized with the overall flow. In mathematics this is known as the Wasserstein metric. <a href="https://vincentherrmann.github.io/blog/wasserstein/">This blog post</a> introduces the Wasserstein metric perfectly and <a href="https://www.alexirpan.com/2017/02/22/wasserstein-gan.html">this blog post</a> explains its relevance to Generative Adversarial Networks.</p>
<p>The Wasserstein metric is differentiable almost everywhere. <a href="https://en.wikipedia.org/wiki/Almost_everywhere">Almost everywhere (a.e.)</a> is a technical term related to a set having measure zero. It states that the set of elements for which the property (in this case being differentiable) does not hold has measure zero. Another example of a function that is differentiable a.e. is a monotonic function on \([a,b] \rightarrow \mathbb{R}\).</p>
<p>The Chamfer pseudo-distance (C) is another measure; it sums the squared distance from each point in one set to its nearest neighbour in the other set:</p>
\[d_{C}(P,Q) = \sum_{i=1}^m \min_j d_{i,j}^2 + \sum_{j=1}^n \min_i d_{i,j}^2\]
<p>It’s stated that \(d_{C}\) is more efficient to compute than \(d_{EM}\).</p>
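As a sketch of mine (not the authors' implementation), the Chamfer pseudo-distance is easy to compute with NumPy: take all pairwise squared distances and sum the nearest-neighbour minima in both directions.

```python
import numpy as np

def chamfer(P, Q):
    """Chamfer pseudo-distance between point sets P (m x 3) and Q (n x 3)."""
    # pairwise squared Euclidean distances d_{i,j}^2, shape (m, n)
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)
    # nearest neighbour in Q for each point of P, and vice versa
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

rng = np.random.default_rng(0)
P = rng.normal(size=(128, 3))
Q = rng.normal(size=(64, 3))
assert chamfer(P, P) == 0.0                             # identical clouds
assert np.isclose(chamfer(P, Q), chamfer(Q, P))         # symmetric
assert np.isclose(chamfer(P, Q), chamfer(P[::-1], Q))   # permutation invariant
```

Note that this quadratic-memory version is only meant to illustrate the definition; it also makes the permutation invariance explicit, since reordering the rows of \(P\) or \(Q\) does not change the result.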
<p>Immediately we can observe from these metrics that they are not invariant with respect to rotations, translations, or scaling.</p>
<h2 id="comparison-metrics">Comparison metrics</h2>
<p>To compare shapes in a 3D space we can follow different strategies. We describe three methods that take the spatial
nature into account and that compare over sets of 3D objects (so we can compare set \(P\) with set \(Q\)):</p>
<ol>
<li>Jensen-Shannon divergence</li>
<li>Coverage</li>
<li>Minimum Matching distance</li>
</ol>
<h3 id="jensen-shannon">Jensen-Shannon</h3>
<p>First of all, we can align the point cloud data along the axes, introduce voxels, and count the number of points in
each corresponding voxel. Then we apply a distance metric to these quantities, in this case the Jensen-Shannon divergence.</p>
\[d_{JS}(P||Q) = \frac{1}{2} D_{KL}(P||M) + \frac{1}{2} D_{KL}(Q||M)\]
<p>Here \(M = \frac{1}{2}(P + Q)\) and \(D_{KL}\) is the Kullback-Leibler divergence metric.</p>
<p>To compare sets of point clouds we can do exactly the same. In this case the number of points in each voxel is just the
collection of points across all point clouds in that set.</p>
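As an illustration (my own sketch, not code from the paper), the Jensen-Shannon divergence over two normalized voxel-count histograms can be computed directly from the definition above:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    mask = p > 0  # by convention 0 * log(0/q) contributes 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence with mixture M = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# normalized voxel counts of two (sets of) point clouds
p = np.array([0.7, 0.2, 0.1, 0.0])
q = np.array([0.1, 0.4, 0.3, 0.2])
assert js(p, p) == 0.0                   # zero for identical histograms
assert np.isclose(js(p, q), js(q, p))    # symmetric, unlike plain KL
assert 0.0 <= js(p, q) <= np.log(2)      # bounded by log 2 (natural log)
```

The mixture \(M\) is strictly positive wherever \(P\) or \(Q\) has mass, which is what makes the divergence well defined even for histograms with empty voxels.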
<h3 id="coverage">Coverage</h3>
<p>Coverage is defined as the fraction of point clouds in \(P\) that are matched with point clouds in \(Q\). The match is
defined by one of the permutation invariant distance metrics. It’s not entirely clear to me how this is fully
specified. Is there a threshold that defines whether it is a match? Or is it just a sum or average?</p>
<h3 id="minimum-matching-distance">Minimum Matching distance</h3>
<p>The minimum matching distance also measures the fidelity of \(P\) versus \(Q\) (complementing the coverage metric). It
indeed uses an average over the (permutation invariant) distances between point clouds.</p>
<h2 id="generation">Generation</h2>
<p>The pipeline followed by the authors is similar to that of PointNet. The point cloud contains 2048 3D points.
This data is fed into the encoder. The encoder consists of five 1D convolutional layers with a kernel size of 1, each followed by a ReLU layer and a layer that
performs batch normalization. After this pipeline a permutation-invariant max layer is placed. To read more on
1D convolutional layers check the following clip by Andrew Ng. On just a 2D matrix a 1D convolution would be
just a multiplication by a factor. However, on volumes with multiple channels it can be used to reduce the number of channels or introduce a
nonlinearity.</p>
<iframe width="740" height="480" src="//www.youtube.com/embed/vcp0XvDAX68" frameborder="0" allowfullscreen=""></iframe>
<p>The decoder contains three layers, the first two followed by ReLUs.</p>
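A kernel-size-1 convolution is just a shared linear map applied to every point, so the encoder amounts to a per-point MLP followed by a max over points. A minimal NumPy sketch of mine (random placeholder weights, batch normalization omitted, two layers instead of five) makes the permutation invariance explicit:

```python
import numpy as np

rng = np.random.default_rng(42)
W1, b1 = rng.normal(scale=0.1, size=(3, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.1, size=(64, 128)), np.zeros(128)

def encode(points):
    """points: (n, 3) array -> (128,) latent code."""
    h = np.maximum(points @ W1 + b1, 0.0)  # Conv1d with kernel size 1 == shared per-point layer
    h = np.maximum(h @ W2 + b2, 0.0)
    return h.max(axis=0)                   # permutation-invariant max pooling

cloud = rng.normal(size=(2048, 3))
z = encode(cloud)
z_perm = encode(cloud[rng.permutation(2048)])
assert z.shape == (128,)
assert np.allclose(z, z_perm)  # reordering the points leaves the code unchanged
```

Because the only interaction between points is the final max, any reordering of the 2048 input points yields exactly the same latent code.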
<h2 id="results">Results</h2>
<p>Some of the results from this work:</p>
<ul>
<li>The GAN operating on the raw data converges much slower than the GAN operating on the latent variables.</li>
<li>The latent GAN model using the AE with Earth Mover’s distance outperforms the one with the Chamfer pseudo distance.</li>
<li>Both latent GAN models suffer from mode collapse. Wasserstein GAN does not…</li>
</ul>
<p>Note that the Wasserstein metric is the same as the Earth Mover’s distance. So the above basically states that using
Wasserstein distance makes sense for both the autoencoder as well as the GAN involved.</p>
<p>There are not yet many models that operate directly on point clouds. PointNet is one of the most famous ones.
In a ModelNet40 shape classification task it has the following performance:</p>
<ul>
<li>87.2% - Vanilla PointNet, without transformation networks</li>
<li>89.2% - PointNet, with transformation networks</li>
<li>90.7% - PointNet++, with multi-resolution grouping to cope with non-uniform sampling densities</li>
<li>91.9% - PointNet++, with face normals as additional point features</li>
</ul>
<p>And the paper’s performance:</p>
<ul>
<li>84.0% - Earth Mover’s distance</li>
<li>84.5% - Chamfer distance</li>
</ul>
<p>It is not clear to me why they don’t list the results of the PointNet and PointNet++ papers, both of which they cite.
If they think the numbers cannot be compared, they should definitely have explained why.</p>
<h2 id="promising-research-direction">Promising research direction</h2>
<p>One of the most obvious improvements seems to be the choice of the Wasserstein metric in different parts of the
architecture. Another paper that caught my interest is <a href="https://openreview.net/pdf?id=B1xsqj09Fm">Large Scale GAN Training for High Fidelity Natural Image Synthesis (pdf)</a>
under review at ICLR 2019.</p>
<p>An interesting aspect of this paper is that they sample from a different distribution at test time than at training time.
They introduce a “truncation trick”: it samples the latent variables from a truncated Normal distribution rather than \(N(0,\sigma)\). (Values above a particular threshold are simply resampled until they fall below that threshold.) I don’t completely get this. What’s going on there? Is there a mode in the network that defines the prototypical dog, with other dogs defined by nonzero values in the latent variable space? Then this seems to show that the layer exhibits a non-intuitive decomposition of the task at hand. I would expect a zero vector to correspond to an “abstract dog” and have all nonzero parameters contribute in an attribute-like fashion. This seems to be more prototype-like, similar to old-fashioned vector quantization.</p>
<p>They however also define other latent random variables (in appendix E). The censored normal \(\max[N(0,\sigma),0]\) is interesting. It reminds me of <a href="https://yliapis.github.io/Non-Negative-Matrix-Factorization/">nonnegative matrix factorization</a>. By using a nonnegative constraint the representation becomes additive (part-based) and sparse. That’s quite different from prototype-like methods.</p>
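Both latent samplers are easy to mimic. Here is a sketch of mine (the threshold value is an arbitrary choice, not one from the paper) of the truncation trick via resampling, next to the censored normal:

```python
import numpy as np

def truncated_normal(sigma, threshold, size, rng):
    """The truncation trick: resample values whose magnitude exceeds the threshold."""
    z = rng.normal(0.0, sigma, size)
    while True:
        bad = np.abs(z) > threshold
        if not bad.any():
            return z
        z[bad] = rng.normal(0.0, sigma, bad.sum())

def censored_normal(sigma, size, rng):
    """max[N(0, sigma), 0]: negative values are clipped to zero."""
    return np.maximum(rng.normal(0.0, sigma, size), 0.0)

rng = np.random.default_rng(1)
zt = truncated_normal(1.0, 0.5, 10_000, rng)
zc = censored_normal(1.0, 10_000, rng)
assert np.abs(zt).max() <= 0.5   # every truncated sample stays inside the bound
assert zc.min() == 0.0           # censoring collapses all negative samples onto zero
```

The contrast is visible in the samples: truncation reshapes the distribution but keeps it continuous, whereas censoring puts a point mass at zero, which is what invites the part-based, sparse interpretation.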
<p>In the last few months I’ve been trying nonparametric extensions to the latent layer, but these experiments do not seem to be
very promising.</p>
<p>A promising research direction might be to study autoencoders where the latent variables are such that they exhibit the
same <strong>nonnegative</strong> (part-based representation) features. When we have a latent layer that decomposes the input like this,
it might become more valuable to subsequently have a nonparametric extension.</p>Anne van RossumIf we do want robots to learn about the world, we can use computer vision. We can employ traditional methods. Build up a full-fledged model from corner detectors, edge detectors, feature descriptors, gradient descriptors, etc. We can also use modern deep learning techniques. One large neural network hopefully captures similarly or even better abstractions compared to the conventional computer vision pipeline.Attend, infer, repeat2018-09-20T08:57:08+00:002018-09-20T08:57:08+00:00https://annevanrossum.com/deep%20learning/nonparametric%20latent%20layer/attention/variational%20method/2018/09/20/attend-infer-repeat<p>A long, long time ago - namely, in terms of these fast moving times of advances in deep learning - two years (2016),
there was once a paper studying how we can teach neural networks to count.</p>
<h1 id="attend-infer-repeat">Attend, infer, repeat</h1>
<p><a href="https://papers.nips.cc/paper/6230-attend-infer-repeat-fast-scene-understanding-with-generative-models.pdf">This paper</a>
is titled “Attend, infer, repeat: Fast scene understanding with generative models” and the authors are
<a href="http://arkitus.com/">Ali Eslami</a>,
Nicolas Heess,
<a href="http://thphn.com/">Theophane Weber</a>,
Yuval Tassa (<a href="https://github.com/yuvaltassa">github</a>, nice, he does couchsurfing),
<a href="http://szepi1991.github.io/">David Szepesvari</a>,
<a href="https://koray.kavukcuoglu.org/">Koray Kavukcuoglu</a>,
and
<a href="http://www.cs.toronto.edu/~hinton/">Geoffrey Hinton</a>.
A team at Deepmind based in London.</p>
<p>This has been a personal interest of mine. I felt it very satisfying that bees for example can
<a href="https://motherboard.vice.com/en_us/article/pgkman/bees-can-count-to-four-display-emotions-and-teach-each-other-new-skills">count landmarks</a> or
at least have a capability that approximates this fairly well. It is such an abstract concept, but very rich. Just
take the fact that you can recognize yourself in the mirror (I hope). It’s grounded on something that really strongly
believes that there is only one of you, that you are pretty unique.</p>
<!--more-->
<p>From a learning perspective, counting feels like mapping in autonomous robotics. The very well-known chicken and egg
problem of simultaneous localisation and mapping (SLAM) shows that mapping and localisation are
intertwined tasks where one immediately influences the other. To properly map it would be very useful if
you have good odometry and can tell accurately how your location is changing. To properly locate yourself it would be
very useful to have a very good map. In the beginning the robot sucks at both, but by learning (for example through
expectation maximization) it learns to perform both better and better.</p>
<p>Counting objects likewise benefits from properly being able to recognize objects. Moreover, it also likely benefits
from localizing objects. A child counts by pointing to the objects and even sometimes verbalizes the object in the
process. Of course a network might do all three things in different layers, but that would remove the chance for
these layers to inform each other. If we introduce cross-connections manually the network would not learn to decompose
in an autonomous manner. Ideally the network learns the decomposition itself so that we do not artificially
introduce limitations in the information transfer between those tasks.</p>
<p>The paper by Eslami introduces several aspects that are important for a system like this:</p>
<ul>
<li>Learning latent spaces of variable dimensionality.</li>
<li>An iterative process that attends to one object at a time. This requires also a stopping condition to stop counting.</li>
<li>Complete end-to-end learning by amortized variational inference.</li>
</ul>
<p>It is coined the <strong>AIR model</strong> by the authors: attend, infer, repeat.</p>
<h2 id="learning-latent-spaces-of-variable-dimensions">Learning latent spaces of variable dimensions</h2>
<p>The representation of a scene has a fixed upper limit on the number of objects. A nice extension would be to
make this a nonparametric prior like a Dirichlet Process. The number of objects is drawn from a Binomial distribution,
\(p_N(n)\), and the scene model generates a variable length feature vector \(z \sim p_\theta(\cdot|n)\).
The data itself is generated from the features through \(x \sim p_\theta(\cdot|z)\). Summarized:</p>
\[p_\theta(x) = \int p_\theta(z) p_\theta(x|z) dz\]
<p>with the prior decomposed as:</p>
\[p_\theta(z) = \sum_{n=1}^N p_N(n) p_\theta(z|n)\]
<p>The posterior is given by Bayes’ rule, prior times likelihood divided by the evidence:</p>
\[p_\theta(z|x) = \frac{p_\theta(z) p_\theta(x|z) }{p_\theta(x)}\]
<p>Equivalently:</p>
\[p_\theta(x|n) = \int p_\theta(z|n) p_\theta(x|z, n) dz\]
<p>And:</p>
\[p_\theta(z,n|x) = \frac{p_\theta(z|n) p_\theta(x|z, n) }{p_\theta(x|n)}\]
<p>We approximate the posterior variationally by a simpler distribution \(q_\phi(z,n|x)\) using the Kullback-Leibler
divergence:</p>
\[KL\left[q_\phi(z,n|x)|| p_\theta(z,n|x)\right]\]
<p>The divergence is minimized by searching through the parameter space \(\phi \in \Phi\).</p>
<h2 id="an-iterative-process-and-a-stopping-condition">An iterative process and a stopping condition</h2>
<p>One difficulty arises through \(n\) being generated through a random variable. This requires evaluating:</p>
\[p_N(n|x) = \int p_\theta(z,n|x) dz\]
<p>for all values of \(n = 1 \ldots N\).</p>
<p>Now it is suggested to represent \(n\) through a latent vector \(z_{present}\) that is formed out of \(n\) ones followed
by a zero (and hence has size \(n + 1\)). So we have \(q_\phi(z,z_{present}|x)\) rather than \(q_\phi(z,n|x)\).
The posterior then has the following form:</p>
\[q_\phi(z,z_{present}|x) = q_\phi(z_{present}^{n+1} = 0 | z^{1:n}, x) \prod_{i=1}^n q_\phi(z^i, z_{present}^i = 1|z^{1:i-1},x)\]
<p>The first term describes the stopping condition. If \(z_{present} = 0\) then there are no more objects to detect.
The second term contains a conditional on previous objects. We do not want to describe the same object twice!</p>
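To make this concrete, a count \(n\) becomes a run of \(n\) ones terminated by a zero, and inference proceeds step by step until the sampled presence variable comes out zero. A toy sketch of mine (not the paper's code; the presence probabilities are a stand-in for the network's output):

```python
import numpy as np

def z_present_vector(n):
    """Encode a count n as n ones followed by a terminating zero."""
    return np.array([1] * n + [0])

def infer_count(presence_prob, rng, max_steps=10):
    """Sample z_present one step at a time; stop at the first zero."""
    n = 0
    for _ in range(max_steps):
        present = rng.uniform() < presence_prob(n)  # Bernoulli draw for step n
        if not present:
            break
        n += 1
    return n

assert list(z_present_vector(3)) == [1, 1, 1, 0]
# degenerate probabilities (1 for the first two steps, then 0) give exactly n = 2
rng = np.random.default_rng(0)
assert infer_count(lambda n: 1.0 if n < 2 else 0.0, rng) == 2
```

This is exactly why the factorization above conditions each step on \(z^{1:i-1}\): the decision to attend to another object can depend on which objects have already been explained.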
<h2 id="a-variational-implementation">A variational implementation</h2>
<p>To optimize for \(\theta\) and \(\phi\) we use the negative free energy \(\mathcal{L}\). The negative free energy is
guaranteed to be smaller than \(\log p_\theta(x)\) so can be used to approximate the latter by increasing it as much
as possible.</p>
\[\mathcal{L}(\theta,\phi) = \mathop{\mathbb{E_{q_\phi}}} \left[ \log p_\theta(x,z,n) - \log q_\phi(z,n|x) \right]\]
<p>We now have to calculate both \(\frac{\partial}{\partial\theta} \mathcal{L}\) and
\(\frac{\partial}{\partial\phi} \mathcal{L}\)
to perform gradient ascent.</p>
<p>The estimate of the latter term is quite involved. First \(\omega^i\) denotes all parameters at time step \(i\) in
\((z_{present}^i, z^i)\). Then we map \(x\) to \(\omega^i\) through a recurrent function \((\omega^i,h^i) = R_\phi(x,h^{i-1})\).
Here the recurrent function \(R_\phi\) is a recurrent neural network. The gradient obeys the chain rule:</p>
\[\frac{\partial \mathcal{L}}{ \partial \phi} = \sum_i \frac{ \partial \mathcal{L} }{ \partial \omega^i} \times \frac{\partial \omega^i}{ \partial \phi}\]
<p>Now, we have to calculate \(\frac{\partial \mathcal{L}}{\partial \omega^i}\). Remember that \(\omega^i\) can contain either
continuous or discrete variables. For continuous variables the reparametrization trick is applied. For discrete
variables a likelihood ratio estimator is used. The latter might have high variance, which is reduced using
structured neural baselines.</p>
<h2 id="results">Results</h2>
<p>The results on a multi MNIST learning task can be seen in the next figure.</p>
<p><img src="/images/blog/multi-mnist-training.png" alt="Multi MNIST task. Copyright: Eslami et al, 2018" title="From top to bottom training advances. Different numbers from the MNIST dataset are better recognized the longer the system runs. It learns to count from zero to three." /></p>
<p>The figure shows how the system properly recognizes multiple visual digits from the MNIST training set. The boxes show attention windows. From top to bottom there is a steady improvement in count accuracy over time.</p>
<h2 id="discussion">Discussion</h2>
<p>What do we learn from this?</p>
<ul>
<li>We have to come up with a <strong>particular representation</strong> of the number of objects. Using this representation we do not
only inform the network that it has to count, but also that this has to be used as a stopping condition. It very much
looks like a handcrafted architecture.</li>
<li>There is apparently <strong>no satisfying black-box approach</strong> to calculate the gradients. Not only do we have to manually
describe which strategy has to be used for which parameter; for discrete variables we have to go even further and
come up with ways to reduce the variance of the estimator.</li>
</ul>
<p>If we were to use this architecture, would we be surprised that the network learns to count? No, I don’t think so. We
pretty much hardcoded this in the architecture.</p>
<p>An interesting observation by the authors concerns generalization. When the model is trained on images with up to
two digits in a multi-MNIST task, it will not generalize to three digits. Likewise, if it is trained on images with
zero, one, or three digits, it will not be able to handle images with two digits. An architecture change has
been made in which the recurrent network is fed by differences with the input:
\((\omega^i,h^i) = R_\phi(x^i - x, h^{i-1})\). The authors coin this the DAIR model rather than just the AIR model.</p>
<p>The authors compare the system with the Deep Recurrent Attentive Writer (DRAW) architecture. The latter
exhibits good performance on the same counting task. Where it falls short is when the task of counting zero, one, or
two digits is followed by another task using those two digits: a) summing the two digits, or
b) determining whether the digits are in ascending order. Here the AIR model outperforms DRAW.</p>
<h2 id="research-direction">Research direction</h2>
<p>One of the interesting concepts in the neuroscientific literature is subitizing. It might, or
might not, be the case that it is faster to count up to four than upwards from four. Above four there is a sequential
process like the one described in this blog post. Some scientists think there is a different pathway that allows a
more instantaneous response if there are only a few objects.</p>
<p><a href="https://arxiv.org/pdf/1808.00257.pdf">The paper</a> titled “Subitizing with Variational Autoencoders” by the authors
Rijnder Wever (<a href="https://github.com/rien333">github</a>)
and
<a href="http://tomrunia.github.io/">Tom Runia</a>
from the University of Amsterdam describes subitizing as an emergent phenomenon in an ordinary autoencoder. A
supervised classifier is trained on top of this unsupervised autoencoder. It is not entirely clear to me that the
latent representation indeed somehow disentangled the object identification from the number of objects.</p>Anne van RossumA long, long time ago - namely, in terms of these fast moving times of advances in deep learning - two years (2016), there was once a paper studying how we can teach neural networks to count.Random gradients2018-05-26T13:30:44+00:002018-05-26T13:30:44+00:00https://annevanrossum.com/gradients/reparametrization%20trick/log-derivative%20trick/2018/05/26/random-gradients<p>Variational inference approximates the posterior distribution in probabilistic models.
Given observed variables \(x\) we would like to know the underlying phenomenon \(z\),
defined probabilistically as \(p(z | x)\).
Variational inference approximates \(p(z|x)\) through a simpler distribution \(q(z,v)\).
The approximation is defined through a distance/divergence, often the <a href="/gradient%20descent/gradient%20ascent/kullback-leibler%20divergence/contrastive%20divergence/2017/05/03/what-is-contrastive-divergence.html">Kullback-Leibler divergence</a>:</p>
\[v = \arg\min_v D_{KL}(q(z,v) || p(z|x))\]
<p>It is interesting to see that this <strong>deterministic</strong> strategy does not require Monte Carlo updates: it can be seen as a deterministic optimization problem. However, it is definitely possible to solve this deterministic problem <strong>stochastically</strong> as well, by formulating it as a stochastic optimization problem!</p>
<p>There are two main strategies:</p>
<ul>
<li>the reparametrization trick</li>
<li>the log-derivative trick</li>
</ul>
<!--more-->
<p>The log-derivative trick is quite general but still suffers from high variance. Hence, so-called control variates
have been introduced that reduce the variance. We will spend quite a bit of time clarifying what a control variate is.
The last section describes modern approaches that combine features from both strategies.</p>
<h1 id="the-reparametrization-trick">The reparametrization trick</h1>
<p>The reparametrization trick introduces auxiliary random variables to carry the stochasticity, such that the parameters to be
optimized over only occur in deterministic functions. This is convenient because it can reduce variance and because
sometimes the derivatives of the probability density functions do not exist in closed form (which means no autodifferentiation).
See the <a href="/inference/deep%20learning/2018/01/30/inference-in-deep-learning.html">Inference in deep learning</a> post.</p>
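As a tiny illustration (my own sketch), consider estimating \(\nabla_\mu E_{z \sim N(\mu,1)}[z^2]\), which is \(2\mu\) analytically. Writing \(z = \mu + \epsilon\) with \(\epsilon \sim N(0,1)\) moves all randomness into \(\epsilon\), so the gradient can flow through the deterministic map:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
eps = rng.normal(size=100_000)  # auxiliary noise, independent of mu
z = mu + eps                    # deterministic in mu, given eps
# d/dmu of z^2 is 2 * z * dz/dmu = 2 * z; average over samples
grad_estimate = np.mean(2.0 * z)

assert abs(grad_estimate - 2.0 * mu) < 0.05  # close to the analytic gradient 2*mu = 3
```

Because the per-sample gradient \(2z\) concentrates tightly around \(2\mu\), this estimator has low variance; the same quantity estimated with the score function below behaves noticeably worse.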
<h1 id="the-log-derivative-trick">The log-derivative trick</h1>
<p>The log-derivative trick is also called the score function method, REINFORCE, or black-box variational inference.
The term black-box variational inference reveals that this trick is completely general.
It can be applied to any model. For instance, models that have both continuous and discrete latent variables.
The joint distribution does not need to be differentiable either.</p>
<p>It uses the following identity:</p>
\[\nabla_\phi \log p_\phi(x) = \frac{ \nabla_\phi p_\phi(x) } { p_\phi(x)}\]
<p>This identity is just obtained by differentiating using \(\nabla \log x = \frac{1}{x}\) and applying the chain rule \(\nabla \log f(x) = \frac{1}{f(x)} \nabla f(x)\).
Let’s subsequently rewrite this identity as a product:</p>
\[\nabla_\phi p_\phi(x) = p_\phi(x) \nabla_\phi \log p_\phi(x)\]
<p>The gradient of the expected cost we want to minimize:</p>
\[\nabla_\phi L(\theta,\phi) = \nabla_\phi E_{x \sim p_\phi(x)} [f_\theta(x) ] = \nabla_\phi \int_x f_\theta(x) p_\phi(x) dx\]
<p>We can use <a href="https://en.wikipedia.org/wiki/Leibniz_integral_rule">Leibniz’s integral rule</a> (differentiation under the integral sign) to shift the differential operator into the integral. To recall the rule:</p>
\[\nabla_x \int f(x,t) dt = \int \nabla_x f(x,t) dt\]
<p>In our case:</p>
\[\nabla_\phi L(\theta,\phi) = \int_x f_\theta(x) \nabla_\phi p_\phi(x) dx\]
<p>Using the log identity:</p>
\[\nabla_\phi L(\theta,\phi) = \int_x f_\theta(x) p_\phi(x) \nabla_\phi \log p_\phi(x) dx\]
\[\nabla_\phi L(\theta,\phi) = E_{x \sim p_\phi(x)} [ f_\theta(x) \nabla_\phi \log p_\phi(x) ]\]
<p>Now we can use Monte Carlo to estimate:</p>
\[\nabla_\phi L(\theta,\phi) \approx \frac{1}{S} \sum_{s=1}^S [ f_\theta(x^s) \nabla_\phi \log p_\phi(x^s) ]\]
<p>Here \(x^s \sim p_\phi(x)\) i.i.d. This is a general estimator: \(f_\theta(x)\) does not need to be differentiable or
continuous with respect to \(x\).
Note that \(\log p_\phi(x)\) needs to be differentiable with respect to \(\phi\).</p>
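As a toy check (a sketch of mine), estimate \(\nabla_\mu E_{z \sim N(\mu,1)}[z^2] = 2\mu\) with this estimator. For a unit-variance Gaussian the score is \(\nabla_\mu \log N(z;\mu,1) = z - \mu\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
z = rng.normal(loc=mu, scale=1.0, size=200_000)
# per-sample estimator: f(z) * d/dmu log p(z) = z^2 * (z - mu)
samples = (z ** 2) * (z - mu)
grad_estimate = samples.mean()

assert abs(grad_estimate - 2.0 * mu) < 0.15  # unbiased but noisy estimate of 2*mu = 3
```

The estimator is unbiased, yet its per-sample values swing wildly (the cubic term \(\epsilon^3\) dominates the tails), which is exactly the high-variance problem that control variates and baselines are meant to fix.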
<p>We should show that the variance is actually reduced… But let us first explain something that you will find
time and time again: the notion of control variates…</p>
<h1 id="control-variates">Control variates</h1>
<p>Let us estimate the expectation over a function \(E_x[f(x)]\) given a function \(f(x)\).
The Monte Carlo estimator is of the form \(E_x[f(x)] \approx \frac{1}{k} \sum_i f(x^i)\) with \(x^i \sim p(x)\).
We can introduce a control variate to reduce the variance:</p>
\[E[f(x)] \approx \left( \frac{1}{k} \sum_i f(x^i) - \eta g(x^i) \right) + \eta E[g(x)]\]
<p>The parameter \(\eta\) can be chosen to minimize the variance; the optimal value turns out to be:</p>
\[\eta = \frac{Cov(f,g)}{Var(g)}\]
<p>More information can be found at <a href="https://en.wikipedia.org/wiki/Control_variates">Wikipedia</a>. The final variance will be something along the lines of:</p>
\[Var(f) - \frac{ Cov(f,g)^2}{ Var(g)}\]
<p>Here \(Var(f) = E[f^2] - E[f]^2\) and \(Cov(f,g) = E[(f-E[f])(g-E[g])]\).
So, how we can explain this best?</p>
<p>Assume we want to estimate the integral of <font color="blue">the function</font> \(f(x) = 1/(1+x)\) with \(0 < x < 1\); if we sample uniformly random values between \(0\) and \(1\) we will get results between \(1/(1+0)=1\) and \(1/(1+1)=1/2\).
We would like to transform this function in such a way that these results are closer to each other.
The values at \(x=0\) should move toward the mean, and the values at \(x=1\) as well.
At Wikipedia they give the example of the <font color="orange">covariate</font> \(g(x) = 1 + x\) (this could have just been \(g(x) = x\)). By adding \(x\), subtracting the average (in this case \(\int_0^1 (1+x) dx = 3/2\)), and picking \(\eta=0.4773\) we make the function <strong>flatter</strong>; in other words, we reduce the variance. We sample 100 values uniformly and demonstrate in the following graph that the function using the covariate is indeed flatter.</p>
<div id="visualization-controlvariates"></div>
<p>Another <font color="purple">covariate</font> could be \(g(x) = \log(x + 1)\). We then have to subtract the expectation of that function, namely \(\int_0^1 \log(x+1) dx = \log(4)-1\). This function is even flatter and has an even smaller variance. You can see that in the graph above. We have picked a value for \(\eta=0.72\).
The covariate which would make the compound function completely flat would be \(g(x) = 1/(2-x)\), which is \(f(x)\) mirrored over the range from \(x=[0,1]\). However, this would of course render the Monte Carlo sampling redundant, because we would need the expectation over \(g(x)\) which is in this case just as hard as that over \(f(x)\).</p>
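The example above is easy to verify numerically. A small sketch of mine that estimates \(E[f(x)] = \ln 2\) with and without the covariate \(g(x) = 1 + x\), fitting \(\eta\) empirically from the samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=100_000)
f = 1.0 / (1.0 + x)   # target; E[f] = ln(2)
g = 1.0 + x           # covariate; E[g] = 3/2 analytically

eta = np.cov(f, g)[0, 1] / np.var(g)   # optimal eta = Cov(f, g) / Var(g)
h = f - eta * (g - 1.5)                # variance-reduced samples with the same mean

assert abs(f.mean() - np.log(2)) < 0.01  # plain Monte Carlo is already close
assert abs(h.mean() - np.log(2)) < 0.01  # the control variate keeps the estimate unbiased
assert np.var(h) < np.var(f) / 10        # ...while shrinking the per-sample variance a lot
```

Note that subtracting \(\eta (g - E[g])\) changes nothing in expectation, since \(E[g - E[g]] = 0\); only the spread of the individual samples shrinks.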
<!--
Why is a biased estimator at times not a problem. Suppose the expectation is consistently above or consistently below the real expectation. Assume the expectation is used within a (variational) minimization or maximization problem. Then it does not matter if we minimize including a particular offset.
-->
<!-- ## Local expectation gradients -->
<h1 id="recent-approaches-and-combinations">Recent approaches (and combinations)</h1>
<p>The log-derivative trick (or the score function estimator) still suffers from high variance. A common technique to
reduce the variance is to introduce baselines. Examples of unbiased single-sample gradient estimators are
NVIL (Mnih and Gregor, 2014) and MuProp (Gu et al., 2015).
An example of an unbiased multi-sample estimator is VIMCO (Mnih and Rezende, 2016).</p>
<p>Examples of biased single-sample gradient estimators are Gumbel-Softmax (Jang et al., 2016) and Concrete relaxations
(Maddison et al., 2017), two independent groups arriving at the same strategy.
The family of concrete distributions (Maddison et al., 2017) has closed-form densities and a simple reparametrization. The
concrete distributions can replace discrete distributions during training so that all gradients can properly be calculated. At
test time the concrete distributions can be replaced by discrete distributions again.</p>
<p>REBAR (Tucker et al., 2017) is a new approach that uses a novel control variate to make the Concrete relaxation
approach unbiased again.</p>
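A minimal sketch of mine (not a reference implementation) of drawing one Gumbel-Softmax/Concrete sample: perturb the logits with Gumbel noise and push them through a temperature-controlled softmax. As the temperature \(\tau \rightarrow 0\) the sample approaches a discrete one-hot vector:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed categorical sample at temperature tau."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))             # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    e = np.exp(y - y.max())                  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 0.5])
sample = gumbel_softmax(logits, tau=0.5, rng=rng)
assert np.isclose(sample.sum(), 1.0)         # lies on the probability simplex
assert (sample >= 0).all()
# at a very low temperature the samples are nearly one-hot on average
colds = [gumbel_softmax(logits, tau=0.01, rng=rng) for _ in range(20)]
assert np.mean([c.max() for c in colds]) > 0.9
```

Because the sample is a differentiable function of the logits (the noise sits in `gumbel`, not in the parameters), gradients can flow through it, which is exactly what the discrete distribution itself does not allow.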
<!--
Toy problem:
$$E_{p(b)} [ f(b,\theta) ]$$
The random variables $$b \sim Bernoulli(\theta)$$ are independent variables parameterized by $$\theta$$.
The function $$f(b,\theta)$$ is differentiable.
This can be estimated through gradient ascent:
$$\nabla_\theta E_{p(b)} [ f(b,\theta) ] = E_{p(b)} [ \nabla_\theta f(b,\theta) + f(b,\theta) \nabla_\theta \log p(b) ] $$
-->
<h1 id="references">References</h1>
<ul>
<li><a href="https://gabrielhuang.gitbooks.io/machine-learning/content/reparametrization-trick.html">Reparametrization Trick (Huang, 2018, blog post)</a></li>
<li><a href="http://papers.nips.cc/paper/6328-the-generalized-reparameterization-gradient.pdf">The Generalized Reparameterization Gradient (Ruiz et al., 2016)</a></li>
<li><a href="https://arxiv.org/pdf/1503.01494.pdf">Local Expectation Gradients for Doubly Stochastic Variational Inference (Titsias, 2015)</a></li>
<li><a href="https://arxiv.org/pdf/1402.0030.pdf">Neural Variational Inference and Learning in Belief Networks (Mnih, Gregor, 2014)</a></li>
<li><a href="https://arxiv.org/pdf/1511.05176.pdf">MuProp: Unbiased Backpropagation for Stochastic Neural Networks, (Gu et al, 2016)</a></li>
<li><a href="https://arxiv.org/pdf/1602.06725.pdf">Variational Inference for Monte Carlo Objectives (Mnih, Rezende, 2016)</a></li>
<li><a href="https://arxiv.org/pdf/1611.01144.pdf">Categorical Reparameterization with Gumbel-Softmax (Jang et al., 2016)</a></li>
<li><a href="https://arxiv.org/pdf/1611.00712.pdf">The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables (Maddison et al., 2017)</a></li>
<li><a href="https://openreview.net/pdf?id=ryBDyehOl">REBAR: Low-Variance, Unbiased Gradient Estimates for Discrete Latent Variable Models (Tucker et al., 2017)</a></li>
</ul>
<script type="text/javascript" src="/javascripts/controlvariates.js">
</script>Anne van RossumVariational inference approximates the posterior distribution in probabilistic models. Given observed variables \(x\) we would like to know the underlying phenomenon \(z\), defined probabilistically as \(p(z | x)\). Variational inference approximates \(p(z|x)\) through a simpler distribution \(q(z,v)\). The approximation is defined through a distance/divergence, often the Kullback-Leibler divergence:Machine learning done Bayesian2018-05-23T10:00:08+00:002018-05-23T10:00:08+00:00https://annevanrossum.com/machine%20learning/bayesian/l1%20regularization/support%20vector%20machines/herding/dropout/stochastic%20gradient%20descent/2018/05/23/machine-learning-done-bayesian<p>In the dark corners of the academic world there is a rampant fight between practitioners of deep learning and researchers of Bayesian methods. This polemic <a href="https://medium.com/intuitionmachine/cargo-cult-statistics-versus-deep-learning-alchemy-8d7700134c8e">article</a> testifies to this, although firmly establishing itself as anti-Bayesian.</p>
<p>There is not much you can have against Bayes’ rule, so the hate runs deeper than this. I think it stems from the very behavior of Bayesian researchers rewriting existing methods as approximations to Bayesian methods.</p>
<p>Ferenc Huszár, a machine learning researcher at Twitter <a href="http://www.inference.vc/everything-that-works-works-because-its-bayesian-2/">describes</a> some of these approximations.</p>
<ul>
<li>L1 regularization is just Maximum A Posteriori (MAP) estimation with sparsity inducing priors;</li>
<li>Support vector machines are just the wrong way to train Gaussian processes;</li>
<li>Herding is just Bayesian quadrature done <a href="https://arxiv.org/abs/1204.1664">slightly wrong</a>;</li>
<li>Dropout is just variational inference done <a href="https://arxiv.org/abs/1506.02142">slightly wrong</a>;</li>
<li>Stochastic gradient descent (SGD) is just variational inference (variational EM) done <a href="https://arxiv.org/pdf/1704.04289.pdf">slightly wrong</a>.</li>
</ul>
<p>Do you have other approximations you can think of?</p>Anne van RossumIn the dark corners of the academic world there is a rampant fight between practitioners of deep learning and researchers of Bayesian methods. This polemic article testifies to this, although firmly establishing itself as anti-Bayesian.Inference in deep learning2018-01-30T00:00:00+00:002018-01-30T00:00:00+00:00https://annevanrossum.com/inference/deep%20learning/2018/01/30/inference-in-deep-learning<p>There are many, many new generative methods developed in recent years.</p>
<ul>
<li>denoising autoencoders</li>
<li>generative stochastic networks</li>
<li>variational autoencoders</li>
<li>importance weighted autoencoders</li>
<li>generative adversarial networks</li>
<li>infusion training</li>
<li>variational walkback</li>
<li>stacked generative adversarial networks</li>
<li>generative latent optimization</li>
<li>deep learning through the use of non-equilibrium thermodynamics</li>
</ul>
<h1 id="deep-models">Deep Models</h1>
<p>We can’t delve into the details of those old workhorse models, but let us summarize a few of them nevertheless.</p>
<p>A Boltzmann machine can be seen as a stochastic generalization of a Hopfield network. In its unrestricted form, Hebbian learning is often used to learn representations.</p>
<!--more-->
<p>A restricted Boltzmann machine, or Harmonium, restricts a Boltzmann machine in the sense that the neurons have to form a bipartite graph. Neurons in one “group” are allowed connections to the other group, but not to neurons within their own group. This restriction naturally, though not necessarily, leads to structures that resemble layers.</p>
<p>A deep belief network and deep Boltzmann machines have multiple (hidden) layers that are each connected to each other in the restricted sense described above. These models are basically stacks of restricted Boltzmann machines. This is, by the way, only true in a handwaving manner. A deep belief network is not a true Boltzmann machine because its lower layers form a <em>directed</em> generative model. <a href="http://proceedings.mlr.press/v5/salakhutdinov09a/salakhutdinov09a.pdf">Salakhutdinov and Hinton (pdf)</a> spell out the differences in detail.</p>
<h1 id="markov-chain-monte-carlo-mcmc">Markov Chain Monte Carlo (MCMC)</h1>
<p>Restricted Boltzmann Machines, Deep Belief Networks, and Deep Boltzmann Machines were trained by MCMC methods. MCMC computes the gradient of the log-likelihood (see the post on <a href="/gradient%20descent/gradient%20ascent/kullback-leibler%20divergence/contrastive%20divergence/2017/05/03/what-is-contrastive-divergence.html">contrastive divergence</a>).
MCMC has particular difficulty in mixing between modes.</p>
<h1 id="autoencoder">Autoencoder</h1>
<p>An autoencoder has an input layer, one or more hidden layers, and an output layer. If the hidden layer has fewer nodes than the input layer it is a dimension reduction technique. Given a particular input, the hidden layer represents only particular abstractions that are subsequently enriched so that the output corresponds to the original input. Another dimension reduction technique is, for example, principal component analysis, which has some additional constraints such as linearity of the nodes. Given its shape, an autoencoder can also be called a bottleneck or sandglass network.</p>
<p>Let the encoder be \(F: X \rightarrow H\) and the decoder \(G: H \rightarrow X\). Applying both to an individual \(x\) gives \(x' = (G \circ F)x\). We can then define the autoencoder as:</p>
\[\{F, G \} = \arg \min_{F,G} \| X - X'\|^2\]
<p>Here we choose an L2 norm for the reconstruction: \(L(x,x') = \| x-x' \|^2\).</p>
<p><img src="/images/blog/autoencoder.png" alt="The autoencoder exists of an encoder F and a decoder G. The encoder maps the input to a hidden set of variables, the decoder maps it back as good as possible to the original input. The difference between original and generated output is used to guide the process to converge to optimal F and G." title="Autoencoder" /></p>
<p>An autoencoder is typically trained using a variant of backpropagation (conjugate gradient method, steepest descent). It is possible to use so-called pre-training. Train each two subsequent layers as a restricted Boltzmann machine and use backpropagation for fine-tuning.</p>
<p>A nice blog post at <a href="https://blog.keras.io/building-autoencoders-in-keras.html">Keras</a> also explains some of the disadvantages of autoencoders, a very clarifying read!</p>
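<p>To make the pipeline \(x' = (G \circ F)x\) concrete, here is a minimal sketch (not from the original post) of a linear autoencoder trained by gradient descent on the L2 reconstruction loss; all dimensions, learning rate, and data are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^8 that actually lie on a 2-D linear subspace.
basis = rng.normal(size=(2, 8))
X = rng.normal(size=(200, 2)) @ basis

# Linear encoder F: X -> H and decoder G: H -> X (weights only, no biases).
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def reconstruct(X):
    # x' = (G o F) x
    return X @ W_enc @ W_dec

initial_loss = np.mean((X - reconstruct(X)) ** 2)
lr = 0.01
for _ in range(2000):
    H = X @ W_enc                 # encode
    err = H @ W_dec - X           # reconstruction error
    # Gradients of the mean L2 reconstruction loss (constant factors folded into lr).
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = np.mean((X - reconstruct(X)) ** 2)
print(initial_loss, final_loss)   # the reconstruction loss decreases
```

<p>Because the data lie on a low-dimensional subspace, a two-unit bottleneck suffices for near-perfect reconstruction in this toy setting.</p>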
<h1 id="denoising-autoencoders">Denoising Autoencoders</h1>
<p>A denoising autoencoder (DAE) is a regular autoencoder with the input signal corrupted by noise (on purpose: \(\tilde{x} = B(x)\)). This forces the autoencoder to be resilient against missing or corrupted values in the input.</p>
<p>The reconstruction error is again measured by \(L(x,x') = \| x - x'\|^2\), but now \(x'\) is formed by a distortion of the original \(x\), denoted by \(\tilde{x}\), hence \(x' = (G \circ F) \tilde{x}\).</p>
<p><img src="/images/blog/denoising-autoencoder.png" alt="The denoising autoencoder is like the autoencoder but has first a step in which the input is distorted before it is fed into the encoder F and a decoder G." title="Denoising autoencoder" /></p>
<p>Note that a denoising autoencoder can be seen as a stochastic transition operator from <strong>input space</strong> to <strong>input space</strong>. In other words, if some input is given, it will generate something “nearby” in some abstract sense. An autoencoder is typically started from or very close to the training data. The goal is to get an equilibrium distribution that contains all the modes. It is hence important that the autoencoder mixes properly between the different modes, including modes that are “far” away.</p>
<!--
# Generative Stochastic Networks
Generative Stochastic Networks ([Alain et al., 2015](https://www.researchgate.net/profile/Saizheng_Zhang/publication/273788029_GSNs_Generative_Stochastic_Networks/links/55140dbf0cf2eda0df303dad/GSNs-Generative-Stochastic-Networks.pdf)) generalize denoising autoencoders. It learns the transition operator of a Markov chain such that its stationary distribution approaches the data distribution.
![Denoising Autoencoder vs a Generative Stochastic Network (copyright Alain et al.). Top: the denoising autoencoder corrupts X and subsequently tries to reconstruct X. Bottom: a generative stochastic network introduces arbitrary random variables H, rather than just a distorted version of X and reconstructs X given H.](/images/blog/generative_stochastic_networks.png)
-->
<h1 id="variational-autoencoders">Variational Autoencoders</h1>
<p>The post by <a href="http://blog.fastforwardlabs.com/2016/08/22/under-the-hood-of-the-variational-autoencoder-in.html">Miriam Shiffman</a> is a nice introduction to variational autoencoders. They have been designed by <a href="https://arxiv.org/pdf/1312.6114.pdf">(Kingma and Welling, 2014)</a> and <a href="https://arxiv.org/pdf/1401.4082.pdf">(Rezende et al., 2014)</a>. The main difference is that \(h\) is now a full-fledged random variable, often Gaussian.</p>
<p><img src="/images/blog/variational_autoencoder.png" alt="Variational Autoencoder. The hidden (latent) variables in a variational autoencoder are random variables. A variational autoencoder is a probabilistic autoencoder rather than a conventional deterministic one. This means that it becomes possible that there are closed form descriptions for p and q and that standard Bayesian inference can be applied." title="Variational Autoencoder" /></p>
<p>A variational autoencoder can be seen as a (bottom-up) recognition model and a (top-down) generative model. The recognition model maps observations to latent variables. The generative model maps latent variables to observations. In an autoencoder setup the generated observations should be similar to the real observations that go into the recognition model. Both models are trained simultaneously. The latent variables are constrained in such a way that a representation is found that is approximately factorial.</p>
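<p>As a small illustration (with made-up numbers standing in for the recognition network's output), the Gaussian latent variable can be sampled via \(h = \mu + \sigma \epsilon\), and the KL term of the variational objective has a closed form against a standard normal prior:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Suppose (hypothetically) the recognition model mapped an input x to these
# parameters of the approximate posterior q(h|x) = N(mu, diag(sigma^2)).
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])

# Sample the latent variable with the reparameterization h = mu + sigma * eps.
eps = rng.normal(size=mu.shape)
h = mu + np.exp(0.5 * log_var) * eps

# Closed-form KL(q(h|x) || p(h)) against the prior p(h) = N(0, I); in the
# ELBO this term regularizes the latent code toward the prior.
kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
print(h, kl)
```
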
<h1 id="helmholtz-machine">Helmholtz Machine</h1>
<p>A Helmholtz machine is a probabilistic model similar to the variational autoencoder. It is trained by the so-called sleep-wake algorithm (similar to expectation-maximization).</p>
<!-- See http://artem.sobolev.name/posts/2016-07-11-neural-variational-inference-variational-autoencoders-and-Helmholtz-machines.html -->
<h1 id="importance-weighted-autoencoders">Importance weighted Autoencoders</h1>
<p>The importance weighted autoencoder (<a href="https://arxiv.org/pdf/1509.00519.pdf">Burda et al., 2015</a>) is similar to the variational autoencoder, but it uses a tighter log-likelihood lower bound obtained through importance weighting. The main difference is that the recognition model uses <strong>multiple samples</strong> (to approximate the posterior distribution over latent variables given the observations). In other words, the recognition model is run a few times and the suggested latent variables are combined to get a better estimate. The model gives more weight to the recognition model than to the generative model.</p>
\[\mathcal{L}_k(x) = \mathbb{E}_{z_1, \ldots, z_k \sim q(z|x) } \left[ \log \frac{1}{k} \sum_{i=1}^k \frac{p(x,z_i)}{q(z_i|x)} \right]\]
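<p>Numerically the bound is a log-mean-exp of log importance weights. A sketch with hypothetical log-densities (the networks that would produce them are omitted):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical log-densities for k samples z_1..z_k drawn from q(z|x):
# log p(x, z_i) under the generative model, log q(z_i|x) under the recognition model.
k = 5
log_p_xz = rng.normal(loc=-3.0, size=k)
log_q_zx = rng.normal(loc=-2.0, size=k)

log_w = log_p_xz - log_q_zx  # log importance weights

def logmeanexp(a):
    # Numerically stable log of the mean of exp(a).
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

# The bound: log (1/k) sum_i p(x, z_i) / q(z_i|x), computed in log space.
L_k = logmeanexp(log_w)
print(L_k)
```

<p>With \(k = 1\) this reduces to a single-sample ELBO estimate \(\log p(x,z) - \log q(z|x)\); larger \(k\) gives a tighter bound in expectation.</p>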
<h1 id="generative-adversarial-networks">Generative Adversarial Networks</h1>
<p>Generative Adversarial Networks (<a href="http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf">Goodfellow et al., 2014</a>) use two networks: a generative model \(G\) and a discriminative model \(D\). The generative model maps latent variables \(z\) to data points \(x'\). The discriminator has to make a choice between true data \(x\) and fake data \(x'\). Here \(D(x)\) should have a large value and \(D(x')\) a small value. The discriminator maximizes (we fix the generator):</p>
\[V(D) = \mathbb{E}_{x\sim p_{data}(x)} \left[ \log D(x) \right] + \mathbb{E}_{x' \leftarrow G(z)} \left[ \log ( 1 - D(x') ) \right]\]
<p>The generator, in contrast, minimizes (only the second term depends on \(G\)):</p>
\[V(D,G) = \mathbb{E}_{x\sim p_{data}(x)} \left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log ( 1 - D(G(z)) ) \right]\]
<p>It is clearly visualized by Mark Chang’s <a href="https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks">slide</a>.</p>
<p><img src="/images/blog/gan.png" alt="Generative Adversarial Net. The discriminator is trying to score as high as possible by assigning ones to real data and zeros to fake data. The generator is trying to make this job as difficult as possible by having the fake data look similar to the real data. The log function punishes false positives and false negatives extraordinarly hard." title="Generative Adversarial Net" /></p>
<p>The distribution \(p_z(z)\) is an arbitrary noise distribution. In other words, the generator morphs totally random stuff into meaningful \(x\). It is like throwing darts randomly into a dart board and the generator folding the board into a hat. Similarly from pure random values we can draw point clouds that have elaborate structure.</p>
<p>The latent variables \(z\) are totally random, however there is something else important here. If \(z\) is a multidimensional random variable, information across all dimensions can be used to construct \(x' \leftarrow G(z)\). There is no way to recover information about \(z\) if we would like to reason back from \(x'\). This means that from a representation learning perspective the unconstrained use of \(z\) leads to entangled use of it in \(G\). InfoGAN introduces an additional mutual information term between a latent code \(C\) and generated data \(X\).</p>
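<p>The two objectives can be sketched with a toy discriminator (a logistic regression standing in for a real network) and a fixed transformation standing in for \(G\); everything here is illustrative, and the generator loss shown is the non-saturating variant often used in practice rather than the minimax form above.</p>

```python
import numpy as np

rng = np.random.default_rng(4)

def discriminator(x, w):
    # Toy D(x) in (0,1): logistic regression on x (illustrative only).
    return 1.0 / (1.0 + np.exp(-(x @ w)))

w = rng.normal(size=3)
x_real = rng.normal(loc=2.0, size=(64, 3))  # samples from p_data
z = rng.normal(size=(64, 3))                # latent noise z ~ p_z
x_fake = z * 0.5 - 1.0                      # stand-in for G(z)

eps = 1e-12
d_real = discriminator(x_real, w)
d_fake = discriminator(x_fake, w)

# Discriminator objective V(D): reward D(x) -> 1 on real data, D(x') -> 0 on fakes.
v_discriminator = np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# Non-saturating generator loss: push D(G(z)) -> 1.
loss_generator = -np.mean(np.log(d_fake + eps))
print(v_discriminator, loss_generator)
```
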
<h1 id="adversarial-autoencoders">Adversarial Autoencoders</h1>
<p>An adversarial autoencoder (<a href="https://arxiv.org/pdf/1511.05644.pdf">Makhzani et al., 2016</a>) is an autoencoder that uses a generative adversarial network. The latent variables (the code) are matched with a prior distribution. This prior distribution can be anything. The autoencoder subsequently maps this to the data distribution.</p>
<p><img src="/images/blog/adversarial_autoencoder.png" alt="Adversarial Autoencoder. The latent variables (code) are denoted by h. Samples are drawn from e.g. a Normal distribution p(h). The discriminator (bottom-right) has the task to distinguish positive samples h' from negative samples h. Preferably p(h) will look like p(h') in the end. In the meantime the top row is reconstructing the image x from h as well." title="Adversarial Autoencoder" /></p>
<p>Note that the use of the adversarial network is on the level of the hidden variables. The discriminator attempts to distinguish “true” from “fake” hidden variables.</p>
<p>This immediately raises the following question: can we also generate fake data as well? If one discriminator has the goal to distinguish true from fake hidden variables, the other can have as its goal to distinguish true from fake data. We should take provisions so that the former discriminator is not punished by a badly performing second discriminator.</p>
<h1 id="deep-learning-through-the-use-of-non-equilibrium-thermodynamics">Deep Learning Through The Use Of Non-Equilibrium Thermodynamics</h1>
<p>Non-equilibrium thermodynamics (<a href="https://arxiv.org/pdf/1503.03585.pdf">Sohl-Dickstein et al., 2015</a>) slowly destroys structure in a data distribution through a diffusion process. Then a reverse diffusion process is learned that restores the structure in the data.</p>
<p>Both processes are factorial Gaussians: the forward process \(p(x^{t} \mid x^{t-1})\) and the inverse process
\(p(x^{t-1} \mid x^{t})\).</p>
<p>To have an exact inverse diffusion the chain requires thousands of small steps.</p>
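<p>The forward (structure-destroying) process can be sketched as repeated small Gaussian steps that contract the signal and inject noise; the values of \(\beta\) and \(T\) here are illustrative, not taken from the paper.</p>

```python
import numpy as np

rng = np.random.default_rng(5)

# Forward diffusion: q(x^t | x^{t-1}) = N(sqrt(1 - beta) x^{t-1}, beta I).
beta = 0.02   # per-step noise rate (illustrative)
T = 1000      # many small steps, as the text notes

# "Structured" data: a tight cluster far from the origin.
x = rng.normal(loc=5.0, scale=0.1, size=500)
for _ in range(T):
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)

# After enough steps the structure is destroyed: x is close to N(0, 1).
print(x.mean(), x.std())
```

<p>The learned model would then invert this chain step by step, which is why thousands of small steps are needed for an accurate reversal.</p>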
<!-- We can also have "heat up" the diffusion operator. -->
<h1 id="infusion-training">Infusion Training</h1>
<p>Infusion training (<a href="https://arxiv.org/pdf/1703.06975.pdf">Bordes et al., 2017</a>) learns a generative model as the transition operator of a Markov chain. When applied multiple times on unstructured random noise, infusion training will denoise it into a sample that matches the target distribution.</p>
<p><img src="/images/blog/infusion-training.png" alt="Infusion training (copyright Bordes et al.) infuses in this case target x=3 into the chain. First row: random initialization of network weights. Second row: after 1 training epoch. Third row: after 2 training epochs, etc. Bottom row: the network learned how to denoise as fast as possible to x=3." title="Infusion training" /></p>
<!-- compatitive results compared to GAN -->
<h1 id="variational-walkback">Variational Walkback</h1>
<p>Variational Walkback (<a href="http://papers.nips.cc/paper/7026-variational-walkback-learning-a-transition-operator-as-a-stochastic-recurrent-net.pdf">Goyal et al., 2017</a>) learns a transition operator as a stochastic recurrent network. It learns those operators which can represent a nonequilibrium stationary distribution (also violating detailed balance) directly. The training objective is a variational one. The chain is allowed to “walk back” and revisit states that were quite “interesting” in the past.</p>
<p>Compared to MCMC we do not have detailed balance, nor an energy function. A detailed balance condition would by the way mean a network with symmetric weights.</p>
<!--
# Stacked Generative Adversarial Networks
Stacked Generative Adversarial Networks ([Huang et al., 2017](https://arxiv.org/pdf/1612.04357.pdf))
![Stacked Generative Adversarial Networks (copyright Huang et al.) ](/images/blog/stacked_gans.jpg "Stacked Generative Adversarial Networks")
-->
<h1 id="nonparametric-autoencoders">Nonparametric autoencoders</h1>
<p>The latent variables in the standard variational autoencoder are Gaussian and fixed in number. The ideal hidden representation however might require a dynamic number of such latent variables. For example, if the neural network has only 8 latent variables in the MNIST task it has to somehow represent 10 digits with these 8 variables.</p>
<p>To extend the hidden layer from a fixed to a variable number of nodes it is possible to use methods developed in the nonparametric Bayesian literature.</p>
<p>There have been already several developments:</p>
<ul>
<li>A stick-breaking variational autoencoder (<a href="https://arxiv.org/pdf/1605.06197.pdf">Nalisnick and Smyth, 2017</a>) where the latent variables are represented by a stick-breaking process (SB-VAE). The inference is done using stochastic gradient descent, which requires a representation where the parameters of a distribution are separated from an independent stochastic noise factor, called a <strong>differentiable, non-centered parametrization</strong> (DNCP). With a Gaussian distribution this is done through the <strong>reparameterization trick</strong> (see below). For a stick-breaking process Beta random variables need to be sampled. This can be done by drawing \(x \sim Gamma(\alpha,1)\) and \(y \sim Gamma(\beta,1)\) and setting \(v = x/(x+y)\), corresponding to \(v \sim Beta(\alpha,\beta)\). This does not work as a DNCP though, because the Gamma distribution does not have a DNCP with respect to its shape parameter. When close to zero an inverse CDF might be used. However, the authors opt for a so-called Kumaraswamy distribution;</li>
<li>A nested Chinese Restaurant Process as a prior on the latent variables (<a href="https://arxiv.org/pdf/1703.07027.pdf">Goyal et al., 2017</a>);</li>
<li>An (ordinary) Gaussian mixture as a prior distribution on the latent variables (<a href="https://arxiv.org/pdf/1611.02648.pdf">Dilokthanakul et al., 2017</a>), but see <a href="http://ruishu.io/2016/12/25/gmvae/">this interesting blog post</a> for a critical review (GMVAE);</li>
<li>A deep latent Gaussian mixture model (<a href="http://www.ics.uci.edu/~enalisni/BDL_paper20.pdf">Nalisnick et al, 2016</a>) where a Gaussian mixture is used as the approximate posterior (DLGMM);
<!-- $$z ~ DP(\alpha)$$ and $$x ~ p_\theta(x|z_i)$$ with $$p_\theta$$ the generating network (DLGMM); --></li>
<li>Variational deep embedding uses (again) a mixture of Gaussians as a prior (<a href="https://arxiv.org/pdf/1611.05148.pdf">Jiang et al., 2017</a>) (VaDE);</li>
<li>Variational autoencoded deep Gaussian Processes (<a href="https://arxiv.org/pdf/1511.06455.pdf">Dai et al., 2016</a>) uses a “chain” of Gaussian Processes to represent multiple layers of latent variables (VAE-DGP).</li>
</ul>
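<p>The stick-breaking construction mentioned for the SB-VAE can be sketched directly. Unlike Beta, the Kumaraswamy distribution has a simple closed-form inverse CDF, so its samples are a deterministic transformation of uniform noise (a DNCP); parameter values below are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(6)

def kumaraswamy_sample(a, b, size):
    # Inverse-CDF sampling: F(v) = 1 - (1 - v^a)^b, so v = (1 - u^(1/b))^(1/a)
    # (using that 1 - u is also uniform). This is differentiable in a and b.
    u = rng.random(size)
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)

def stick_breaking(a, b, k):
    # Break a unit-length stick k times; returns k + 1 weights summing to 1.
    v = kumaraswamy_sample(a, b, k)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)])
    pieces = remaining[:-1] * v
    return np.concatenate([pieces, [remaining[-1]]])

weights = stick_breaking(a=1.0, b=3.0, k=10)
print(weights, weights.sum())
```
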
<p>The problem with autoencoders is that they actually do not define how the latent variables are to be used.</p>
<p>Inherently, without additional constraints the representation problem is ill-posed. Suppose for example that the generator is just a dictionary of images and that training will make the latent variables point to a particular index in this dictionary. In this way no deep structure has been uncovered by the network at all. It’s pretty much just pointing at what it has seen during training. Generalization can be expected to be pretty bad.</p>
<p>Especially when variational autoencoders are used in sequence modeling it becomes apparent that the latent code is generally not used. The variational lossy autoencoder introduces control over the latent code to successfully combine them with recurrent networks (<a href="https://arxiv.org/pdf/1611.02731.pdf">Chen et al., 2017</a>).</p>
<p>From an information-theoretic perspective the differences can be formulated in an extreme manner: <strong>maximization or minimization</strong> of mutual information. With InfoGAN (not explained in this blog post) mutual information between input and latent variables is maximized to make sure that the variables are all used. This is useful to avoid the “uninformative latent code problem”, where latent features are actually not used in the training. However, with for example the information bottleneck approach the mutual information between input and latent variables is minimized (under the constraint that the features still predict some labels). This makes sense from the perspective of compression. This behavior can all be seen as a so-called information-autoencoding family (<a href="http://bayesiandeeplearning.org/2017/papers/60.pdf">Zhao et al., 2017</a>).</p>
<p>It is interesting to study how nonparametric Bayesian methods fare with respect to this family and what role they fulfill in such a constrained optimization problem. Existing models namely use fixed values for the Lagrangian multipliers (the tradeoffs they make).</p>
<h1 id="mode-collapse">Mode Collapse</h1>
<p>There are several research directions where mode collapse is the main topic. Mode collapse is especially prevalent in generative adversarial networks. In distributional adversarial networks (<a href="https://arxiv.org/pdf/1706.09549.pdf">Li et al., 2017</a>) two adversaries are defined that are slightly different from the normal one, both based on a so-called <strong>deep mean encoder</strong>. The deep mean encoder has the form:</p>
\[\eta(P) = \mathop{\mathbb{E}}_{x \sim P} [ \phi(x) ]\]
<p>The GAN objective function is:</p>
\[\min_G \max_D { \mathop{\mathbb{E}}_{x \sim P_x} [ \log D(x) ] + \mathop{\mathbb{E}}_{z \sim P_z} [ \log (1 - D(G(z))) ] }\]
<p>The authors extend it with an additional term:</p>
\[\min_G \max_{D,M} { \lambda_1 \mathop{\mathbb{E}}_{x \sim P_x} [ \log D(x) ] + \mathop{\mathbb{E}}_{z \sim P_z} [ \log (1 - D(G(z))) ] + \lambda_2 M(P_x,P_G) }\]
<p>The sample classifier \(\psi\) uses the above intermediate summary statistics \(\eta(P)\) to define a cost (it outputs 1 if a sample is drawn from \(P_x\) and 0 otherwise).</p>
\[M(P_x,P_G) = \log \psi (\eta (P_G)) + \log (1 - \psi (\eta(P_x)))\]
<h1 id="generalization">Generalization</h1>
<p>The GAN objective:</p>
\[\min_{u \in U} \max_{v \in V} { \mathop{\mathbb{E}}_{x \sim D_{real}} [ \phi ( D_v(x) ) ] + \mathop{\mathbb{E}}_{x \sim D_{G_u}} [ \phi (1 - D_v(x)) ] }\]
<p>This objective assumes we have an infinite number of samples from \(D_{real}\) to estimate
\(\mathop{\mathbb{E}}_{x \sim D_{real}} [ \phi ( D_v(x) ) ]\). If we have only a finite number of training examples \(x_1, \ldots, x_m \sim D_{real}\), we use the following to estimate this expectation: \(\frac{1}{m} \sum_{i=1}^m \phi(D_v(x_i))\).</p>
<!--
# Nonparametric view of the GAN
Theorem (oracle inequality for GAN). Let $$F$$ be any critic function class. Denote $$\mu_n$$ as the solution with respect to the empirical estimate $$\nu_n$$ to GAN with generator $$\mu_G$$ and discriminator $$F_D$$:
$$\mu_n = \arg\min_{\mu \sim \mu_G} \max_{f \in F_D} E_{Y \sim \mu} f(Y) - E_{X \sim \nu_n} f(X)$$
The the following decompositions hold for any distribution $$\nu$$,
$$d_{F_D}(\mu_n,\nu) \leq \min_{\mu \in \mu_G} d_{F_D}(\mu,\nu) + d_{F_D}(\nu,\nu_n) + d_{F_D}(\nu_n,\nu)$$
$$d_{F}(\mu_n,\nu) \leq \min_{\mu \in \mu_G} d_{F_D}(\mu,\nu) + (1 + {||\nu_n||}_1 ) \max_{f \in F} \min_{f' \in F_D} {|| f - f'||}_\infty + d_{F_D}(\nu,\nu_n) + d_{F}(\nu_n,\nu)$$
In the first decomposition $$d_{F_D}$$ is the objective evaluation metric. The first term is a minimization term, the best approximation error within the generator class when having population access to the true measure $$\nu$$. The second term is the statistical error, also called the generalization error, due to the fact that there are only $$n$$ samples available.
In the second decomposition a different $$d_F$$ is the objective metric. The first term is the approximation error induced by the generator. The second term defines how well the discriminator serves as a surrogate for the objective metric, and the third term is the statistical error.
-->
<h1 id="regularization">Regularization</h1>
<p>Training deep networks has undergone several advances. One of the first innovations has been layer-by-layer training.
Other concepts you will find are:</p>
<ul>
<li>dropout</li>
<li>stochastic gradient descent</li>
<li>batch normalization</li>
<li>residual training</li>
<li>reparameterization trick</li>
</ul>
<p>We will briefly describe them, but they each easily deserve a dedicated explanation as well. So little time!</p>
<h2 id="dropout">Dropout</h2>
<p>Another key idea has been to randomly drop units, including their connections, during training. This prevents overfitting. In this way a collection of differently thinned networks is used during training. At test time a single unthinned network is used. This is called dropout (<a href="http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf">Srivastava et al., 2014</a>).</p>
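<p>A minimal sketch of dropout in its “inverted” form: rescaling by \(1/(1-p)\) at training time is one common way to let the unthinned network be used unchanged at test time. The function and parameters below are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout(h, p=0.5, train=True):
    # Inverted dropout: at training time zero each unit with probability p
    # and rescale survivors by 1/(1-p); at test time pass through unchanged.
    if not train:
        return h
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones((1000, 20))
h_train = dropout(h, p=0.5, train=True)
h_test = dropout(h, p=0.5, train=False)

# In expectation the training-time activations match the test-time ones.
print(h_train.mean(), h_test.mean())
```
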
<h2 id="stochastic-gradient-descent">Stochastic gradient descent</h2>
<p>Gradient descent or steepest descent is an iterative method where we take steps that depend on the slope (or, more generally, on the gradient) with the purpose of ending up at a minimum. To get (negative) gradients we need differentiable functions.</p>
<p>Stochastic gradient descent is a stochastic approximation to gradient descent. What is approximated is the true gradient. Adjusting the parameters \(\theta\) it minimizes the following loss function:</p>
\[\theta = \arg \min_\theta \frac{1}{N} \sum_{i=1}^N L(x_i;\theta)\]
<p>Here \(x_1, \ldots, x_N\) is the training set. Stochastic gradient descent now incrementally navigates to the values for \(\theta\) where the sum over the function \(L(x_i;\theta)\) is minimized. The parameter \(\theta\) is continuously adjusted by looping over all observations \(x_i\):</p>
\[\theta' = \theta - \eta \frac{\partial }{\partial \theta} L(x_i;\theta)\]
<p>After looping over all observations, stochastic gradient descent performs this loop again and again till some kind of convergence criterion is met or until the researcher likes to have a beer, read a book, or spend time on social media.</p>
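<p>A toy run of the loop above, for the made-up per-observation loss \(L(x_i;\theta) = (x_i - \theta)^2\), whose minimizer is the sample mean:</p>

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy problem: fit theta so that L(x_i; theta) = (x_i - theta)^2 is minimal;
# the exact minimizer of the summed loss is the sample mean of the data.
x_train = rng.normal(loc=3.0, size=200)

theta = 0.0
eta = 0.01
for _ in range(50):                   # loop over the data again and again
    for x_i in rng.permutation(x_train):
        grad = -2.0 * (x_i - theta)   # dL/dtheta for one observation
        theta = theta - eta * grad

print(theta, x_train.mean())          # theta ends up near the sample mean
```

<p>Note that with a fixed step size \(\eta\) the iterate keeps fluctuating around the minimizer; in practice \(\eta\) is decayed to get convergence.</p>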
<!--
Stochastic gradient descent typically uses a mini-batch $$x_1, \ldots, x_m$$ of size $$m$$. The gradient is then approximated by:
$$\frac{1}{m} {\partial l(x_i, \theta)}{\partial \theta}$$
-->
<h2 id="batch-normalization">Batch normalization</h2>
<p>The distribution of network activities change during training due to the fact that the network parameters change. This phenomenon is called <strong>internal covariate shift</strong>. It is possible to fix the distribution of the layer inputs \(x\) as the training progresses. It is for example well-known that whitening the inputs (linear transforming them to zero means, unit variances and decorrelating them) makes a network converge faster. Batch normalization does not simply whiten each layer’s input, but makes two simplifications: (1) normalize each scalar feature independently, and (2) introduce scale and shift parameters to preserve nonlinearities. Batch normalization improved significantly on the ImageNet classification task (<a href="https://arxiv.org/pdf/1502.03167.pdf">Ioffe and Szegedy, 2015</a>).</p>
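<p>The normalization step itself is small. A sketch of the per-feature transform with learnable scale \(\gamma\) and shift \(\beta\) (training-time mini-batch statistics only; the running averages used at test time are omitted):</p>

```python
import numpy as np

rng = np.random.default_rng(9)

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then scale and shift so the
    # layer can still represent the identity (and preserve nonlinearities).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = rng.normal(loc=10.0, scale=4.0, size=(64, 3))  # badly scaled layer input
gamma, beta = np.ones(3), np.zeros(3)
y = batch_norm(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))  # ~zero mean, ~unit std per feature
```
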
<h2 id="residual-learning">Residual learning</h2>
<p>Making networks deeper and deeper counterintuitively increases the training error and thus the test error. Consider for example an identity mapping (as with autoencoders): a network needs to learn to duplicate the input to generate the output. Empirical evidence shows that learning the difference (in this case zero between input and output) is easier for a network. This is called residual learning (<a href="https://arxiv.org/pdf/1512.03385.pdf">He et al., 2015</a>). At ImageNet such residual nets achieve 3.57% error on the test set. It is hence no surprise that the fourth edition of the Inception networks use residual learning (<a href="http://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14806/14311">Szegedy et al., 2017</a>).</p>
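<p>A residual block in its simplest form computes \(y = x + F(x)\): the layers only learn the residual \(F\), so with \(F \approx 0\) the block is already an identity mapping. A sketch (not the architecture from the paper; dimensions and weight scales are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(10)

def residual_block(x, w1, w2):
    # y = x + F(x), with F a small two-layer network with a ReLU in between.
    h = np.maximum(0.0, x @ w1)
    return x + h @ w2

d = 4
x = rng.normal(size=(2, d))
# Near-zero weights: the block starts out close to the identity mapping.
w1 = rng.normal(scale=1e-3, size=(d, d))
w2 = rng.normal(scale=1e-3, size=(d, d))
y = residual_block(x, w1, w2)
print(np.max(np.abs(y - x)))  # tiny: output is approximately the input
```
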
<h2 id="reparameterization-trick">Reparameterization Trick</h2>
<p>The reparameterization trick replaces a (blackbox) stochastic node in a computational graph with a node that is non-stochastic (of which a gradient can be calculated) with the noise added separately. It’s just as if the salt is added after you have made the soup. It substitutes a random variable by a deterministic transformation of a simpler random variable. There are three popular methods (<a href="http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/">Shakir Mohamed blog</a>):</p>
<ol>
<li>Inverse sampling. The inverse cumulative distribution function can be used as the transformation.</li>
<li>Polar methods. Generating pairs (e.g. the basis of the Box-Muller transform).</li>
<li>Coordinate transformation methods (shifting and scaling).</li>
</ol>
<p>The last example uses the fact that the transformation \(x = g(\epsilon;\theta)\) is valid for particular well-chosen choices of \(g\):</p>
\[\frac{\partial}{\partial \theta} \mathbb{E}_{p(x;\theta)} \left[ f(x) \right] =
\frac{\partial}{\partial \theta} \mathbb{E}_{p(\epsilon)} \left[ f(g(\epsilon;\theta)) \right]
\approx \frac{1}{N} \sum_{i=1}^N \frac{\partial}{\partial \theta} f(g(\epsilon_i;\theta))\]
<p>For example the (zero-centered) Normal distribution is defined as:</p>
\[p(x;\theta) = N(0,\theta)\]
<p>We can write this as a standard Normal distribution with a deterministic transformation:</p>
\[p(\epsilon) = N(0,1)\]
\[g(\epsilon; \theta) = \theta \epsilon\]
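<p>For this example the pathwise gradient estimator can be written down directly, treating \(\theta\) as the standard deviation so that \(g(\epsilon;\theta) = \theta \epsilon\); the target function \(f(x) = x^2\) is a made-up example for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(11)

theta = 2.0
n = 100_000

# The stochastic node is parameter-free: eps ~ N(0, 1), and the sample
# x = g(eps; theta) = theta * eps is a deterministic transformation of it.
eps = rng.normal(size=n)
x = theta * eps

# Estimate d/dtheta E[f(x)] for f(x) = x^2 by differentiating the composite
# f(g(eps; theta)) = theta^2 * eps^2 pathwise: its gradient is 2 * theta * eps^2.
grad_est = np.mean(2.0 * theta * eps ** 2)
print(grad_est)  # E[x^2] = theta^2, so the true gradient is 2 * theta = 4
```
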
<!--
For example the standard Cauchy distribution (position at zero, $x_0 = 0$, and scale at one, $\gamma=1$) is defined as:
$$p(x;\theta) = \frac{1}{\pi (1 + x^2)}$$
The quantile function (inverse cumulative function) is:
$$Q(\epsilon) = \tan (\pi (\epsilon - 1/2) )$$
Thus is can also be written as a uniform base distribution with a subsequent deterministic transformation:
$$p(\epsilon) = U[0,1]$$
$$g(\epsilon; \theta) = \tan(\pi \epsilon)$$
-->
<!--
Suppose we want to optimize over $\sigma$:
$$
x = U(-1, 1) \\
y = N(0, \sigma) \\
\arg \min_{\sigma} \frac{1}{N} \sum_{i=1}^N L(x_i,y_i)
$$
Here $L$ is the loss function. It is often the mean squared error between $x$ and $y$, let us assume this here as well.
$$L(x,y) = (x - y)^2$$
In a probabilistic model the output is **different** each time we run, even if the input is the same. To do gradient descent we define a closed-form formula about how to change our parameter $\sigma$ to get a lower value for $L$. The gradient defines how much the output changes when we alter the input.
$$\frac{\partial}{\partial \sigma} L(x,y) = \frac{\partial}{\partial \sigma} {(x - c \sigma)}^2 $$
Here we have $c ~\sim N(0,1)$ simulated from a normal distribution without parameters.
$$
x = U(-1, 1) \\
y = c \sigma \\
\arg \min_{\sigma} \frac{1}{N} \sum_{i=1}^N L(x_i,y_i)
$$
We can solve:
$$\frac{\partial}{\partial \sigma} L(x,y) = \frac{\partial}{\partial \sigma} {(x - c \sigma)}^2 =
\frac{\partial}{\partial \sigma} x^2 - 2 c \sigma x + {(c \sigma)}^2 = - 2 x c + 2 c^2 \sigma $$
If we now perform gradient descent it will be an iterative execution of:
$$\sigma' = \sigma - \eta ( - 2 x c + 2 c^2 \sigma ) = \sigma + 2 \eta c ( x - c \sigma) $$
Now we can reparametrize in such way that $x$ and $\sigma$ are not parameters of a probability distribution. They are just parameters of a deterministic function (that is transformed through a stochastic function that has **no parameters** to optimize over).
-->
<p>The result is that through this reparameterization the variance can (potentially!) be substantially reduced.
The reparameterization trick is well explained by Goker Erdogan in this <a href="http://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb">Jupyter notebook</a>.</p>
<!-- The way gradient descent is done in this type of setting is stochastic as well. Single $x,y$ pairs are used to find out in which way $\sigma$ has to be adjusted. The difference with the setting above is that -->
<!-- if the primal is infeasible (insufficient model capacity) the choice of Lagrange multipliers prioritizes different constraints -->
<!--
# Generative Latent Optimization
# Other typical concepts
## Energy function representation of probability
A probability distribution can be represented through an energy function
$$p(x) = \frac { \exp^{-E(x)} }{ \sum_{x \in X} \exp^{-E(x)} } = \frac{1}{Z} \exp^{-E(x)}$$
It relates the probability of $$x$$ with that of all possible other states. A so-called canonical assemble represents the states of a system by such an exponential. The above is actually using the so-called canonical [partition function](https://en.wikipedia.org/wiki/Partition_function_(mathematics)). Note, that in both cases the original definitions contains a thermodynamic beta $$\beta$$. This (inverse) temperature can be used to compare systems: they will be in equilibrium if their temperature is the same.
The generalization of the canonical assemble to an infinite number of states is called the [Gibbs measure](https://en.wikipedia.org/wiki/Gibbs_measure):
$$P(X=x) = \frac{1}{Z(\beta)} \exp^{-\beta E(x)}$$
What it all narrows down to is that not every state is counted equally. The Boltzmann factor is a weight. A low energy state is easier to access and weighs much more than a high energy state. If the temperature increases, this difference diminishes.
-->
<!--
Kullback-Leibler
Thus, we have for example:
$$\log p(x) = -E(x) - \log Z$$
-->
<!--
## Monte Carlo simulation
If a probability density function $$p(x)$$ is known, its statistical properties such as mean, variance, etcetera can be found through integration:
$$E[h(X)] = \int h(x) p(x) dx$$
This integral can be approximated by Monte Carlo simulation by drawing many $$X_i$$ from $$p(x)$$:
$$\mu_h = \int h(x) p(x) dx \approx \frac{1}{n} \sum_{i=1}^n h(X_i)$$
That the latter converges to the expectation $$\mu_h$$ of $$h(x)$$ is known as the law of large numbers.
## Jensen's inequality
To really appreciate Jensen's inequality I'd recommend the Convex Optimization course by Stephen Boyd at the Stanford University, Electrical Engineering department ([youtube lecture series](https://www.youtube.com/watch?v=McLq1hEq3UY)).
## Reparameterization Trick
The reparameterization trick is well explained by Goker Erdogan in this [Jupyter notebook](http://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb).
Suppose we have a simple distribution $$q_{\theta}(x) = N(\theta, 1)$$ and we want to solve the following toy problem:
$$\min_\theta E_q \left[ x^2 \right]$$
The gradient over $$\theta$$ we can calculate (see the Jupyter notebook):
$$\nabla_\theta E_q \left[ x^2 \right] = E_q \left[ x^2 (x - \theta) \right]$$
The sampled gradient is zero at $$x = \theta$$ and scales with $$x^2$$, all with $$x$$ sampled from $$N(\theta,1)$$, which makes this estimator high-variance.
However, if we separate $$x = \theta + \epsilon$$ with $$\epsilon \sim N(0,1)$$, we can write:
$$\nabla_\theta E_q \left[ x^2 \right] = E_p \left[ 2 (\theta + \epsilon) \right]$$
Here $$p$$ is the distribution over $$\epsilon$$, namely $$N(0,1)$$, not the distribution over $$N(\theta,1)$$. This is the whole trick, the distribution $$p$$ does not depend on $$\theta$$. The result is that through this reparameterization the variance can be substantially reduced. In a handwaving manner this is logical. Do not try to differentiate over stochastic elements, that only amplifies deltas.
## Backpropagation
The backprop paper came out in 1986.
## Likelihood
Likelihood, call it an error function if you hate Bayesian terms. :-)
-->

<p>Anne van Rossum</p>

<p><strong>What is contrastive divergence?</strong> (2017-05-03)</p>

<p>In contrastive divergence the Kullback-Leibler divergence (KL-divergence) between the data distribution and the model distribution is minimized (here we assume \(x\) to be discrete):</p>
\[D(P_0(x) \mid\mid P(x\mid W)) = \sum_x P_0(x) \log \frac {P_0(x) }{P(x\mid W)}\]
<p>Here \(P_0(x)\) is the observed data distribution, \(P(x\mid W)\) is the model distribution, and \(W\) are the model parameters. A <strong>divergence</strong> (<a href="https://en.wikipedia.org/wiki/Divergence_(statistics)">wikipedia</a>) is a fancy term for something that resembles a <strong>metric</strong> distance. It is not an actual metric because it is not symmetric: the divergence of \(P\) from \(Q\) can be (and often is) different from the divergence of \(Q\) from \(P\). The Kullback-Leibler divergence \(D_{KL}(P \mid \mid Q)\) exists only if \(Q(\cdot) = 0\) implies \(P(\cdot) = 0\).</p>
<!--more-->
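This asymmetry is easy to see numerically. Below is a small sketch with two made-up discrete distributions:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D(p || q); requires q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])
# The two directions generally give different values: not a metric
print(kl(p, q), kl(q, p))
```

Both values are non-negative and zero only when the distributions coincide, but the two directions disagree, which is why it is a divergence rather than a distance.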
<p>The model distribution can be written in the form of a normalized energy function:</p>
\[P(x|W) = \frac {\exp \{ -E(x,W) \} } { Z(W) }\]
<p>The partition function can be written as the sum over all states:</p>
\[Z(W) = \sum_x \exp \{ -E(x,W) \}\]
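As a toy instance (the quadratic energy \(E(x,W) = W x^2\) and the small state space below are made up for illustration), normalizing the Boltzmann factors by the partition function yields a valid probability distribution:

```python
import numpy as np

def energy(x, W):
    return W * x**2  # toy quadratic energy; W is the single model parameter

states = np.arange(-3, 4)          # discrete state space x in {-3, ..., 3}
W = 0.5
unnorm = np.exp(-energy(states, W))
Z = unnorm.sum()                   # partition function: sum over all states
P = unnorm / Z                     # P(x|W) = exp(-E(x,W)) / Z(W)
print(P.sum())                     # sums to 1 by construction
```

The lowest-energy state (here \(x = 0\)) receives the highest probability, consistent with the Boltzmann weighting.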
<h2 id="gradients">Gradients</h2>
<p>With gradient descent we use the gradient negatively:</p>
\[W_{t+1} = W_t - \lambda \nabla f(W_t)\]
<p>With gradient ascent we use the gradient positively:</p>
\[W_{t+1} = W_t + \lambda \nabla f(W_t)\]
<p>In both cases \(\lambda\) is a predefined parameter. It can be constant, but in learning methods this can also be a function called the <strong>learning rate</strong>. The parameter \(\lambda\) might depend on time \(t\).</p>
<p>For both gradient descent and gradient ascent, \(W_{t+1} - W_t = 0\) means that \(\nabla f(W_t) = 0\). Descending a slope up to a zero gradient leads to a minimum if there is one. Ascending a slope up to a zero gradient leads to a maximum if there is one. The extremum found does not necessarily need to be unique, except if the function is convex, respectively concave.</p>
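A minimal sketch of the descent iteration, using a hypothetical convex objective \(f(w) = (w-3)^2\) with a constant \(\lambda\):

```python
def gradient_descent(grad, w0, lam=0.1, steps=100):
    """Iterate w <- w - lambda * grad f(w)."""
    w = w0
    for _ in range(steps):
        w = w - lam * grad(w)
    return w

# f(w) = (w - 3)^2 is convex, so descent converges to the unique minimum w = 3
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

Because the objective is convex, the zero-gradient point the iteration settles on is the unique global minimum; flipping the sign of the update would perform ascent instead.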
<h2 id="gradient-descent-of-the-kl-divergence">Gradient descent of the KL-divergence</h2>
<p>Below you will find a step-by-step derivation of gradient descent for the KL-divergence. Since we want to
<strong>minimize</strong> the divergence, we indeed need gradient descent (not ascent). Formally, we have to calculate:</p>
\[W_{t+1} - W_t = - \lambda \left\{ \nabla D(P_0(x) \mid\mid P(x\mid W)) \right\}\]
<h3 id="kl-divergence-parts-that-depend-on-w">KL-divergence parts that depend on \(W\)</h3>
<p>We are going to rewrite this equation in a way relevant to taking a derivative: (1) reorganize the equation such that the
terms not involving \(W\) are separate terms, (2) use log identities to write it as a sum of terms, and (3) remove
the terms not involving \(W\).</p>
<p>Hence, first, let us rewrite the divergence to obtain separate terms that do and do not involve \(W\). For this, we substitute \(P(x\mid W)\) on the fourth line:</p>
\[D(P_0(x) \mid\mid P(x\mid W)) = \sum_x P_0(x) \color{green}{ \log \frac {P_0(x) }{P(x\mid W)}}\]
\[= \sum_x P_0(x) \color{green}{\left\{ \log P_0(x) - \log P(x\mid W) \right\}}\]
\[= \sum_x P_0(x) \log P_0(x) - \sum_x P_0(x) \log \color{purple} { P(x\mid W) }\]
\[= \sum_x P_0(x) \log P_0(x) - \sum_x P_0(x) \log \color{purple} { \frac {\exp \{ -E(x,W) \} } { Z(W) } }\]
<p>Second, use the identity \(\log ab = \log a + \log b\), together with \(\sum_x P_0(x) = 1\) to pull the term \(\log \frac{1}{Z(W)}\) out of the sum:</p>
\[= \sum_x P_0(x) \log P_0(x) - \left\{ \sum_x P_0(x) \{ -E(x,W) \} + \log \frac{1}{Z(W)} \right\}\]
\[= \sum_x P_0(x) \log P_0(x) - \left\{ \sum_x P_0(x) \{ -E(x,W) \} - \log Z(W) \right\}\]
\[= \sum_x P_0(x) \log P_0(x) + \sum_x P_0(x) E(x,W) + \log Z(W)\]
<p>Third, get rid of the first term that does not depend on \(W\). Now the part relevant to our derivative is:</p>
\[\sum_x P_0(x) E(x,W) + \log Z(W)\]
<p>In “On Contrastive Divergence Learning” by Carreira-Perpinan and Hinton (<a href="http://www.gatsby.ucl.ac.uk/aistats/aistats2005_eproc.pdf">proceedings AISTATS 2005</a>) this is written as the log-likelihood objective:</p>
\[L(x,W) = -\left\langle E(x,W) \right\rangle_0 - \log Z(W)\]
<p>Note the negative sign here: the maximum log-likelihood is identical to the minimum KL divergence, because the two objectives differ only in sign and in the entropy term \(\sum_x P_0(x) \log P_0(x)\), which does not depend on \(W\).</p>
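This equivalence can be verified numerically. The sketch below uses a made-up discrete model with energy \(E(x,W) = W x^2\) and checks that \(D(P_0 \mid\mid P_W) = -L(W) - H(P_0)\), so the two objectives differ only by the constant entropy of the data distribution:

```python
import numpy as np

states = np.arange(-3, 4).astype(float)   # small discrete state space
P0 = np.ones(len(states)) / len(states)   # toy data distribution (uniform)

def model(W):
    u = np.exp(-W * states**2)            # exp{-E(x,W)} with E = W x^2
    return u / u.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def loglik(W):
    # L(x, W) = -<E(x,W)>_0 - log Z(W)
    return float(-np.sum(P0 * W * states**2) - np.log(np.exp(-W * states**2).sum()))

H0 = float(-np.sum(P0 * np.log(P0)))      # entropy of the data distribution
for W in (0.1, 0.5, 1.0):
    # D(P0 || P_W) and -L(W) - H(P0) coincide for every W
    print(W, kl(P0, model(W)), -loglik(W) - H0)
```

Since \(H(P_0)\) is independent of \(W\), any \(W\) that maximizes \(L\) also minimizes the KL divergence.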
<h3 id="the-gradient-of-the-kl-divergence">The gradient of the KL-divergence</h3>
<p>Taking the gradient with respect to \(W\) (we can then safely omit the term that does not depend on \(W\)):</p>
\[\nabla D(P_0(x) \mid\mid P(x\mid W)) = \frac{ \partial \sum_x P_0(x) E(x,W)}{\partial W} + \frac{\partial \log Z(W)}{ \partial W}\]
<p>Recall the derivative of a logarithm:</p>
\[\frac{ \partial \log f(x) }{\partial x} = \frac{1}{f(x)} \frac{\partial f(x)}{\partial x}\]
<p>Take derivative of logarithm:</p>
\[\nabla D(P_0(x) \mid\mid P(x\mid W)) = \sum_x P_0(x) \frac{\partial E(x,W)}{\partial W} + \frac{1}{Z(W)} \frac{\partial Z(W)}{ \partial W}\]
<p>The derivative of the partition function:</p>
\[Z(W) = \sum_x \exp \{ -E(x,W) \}\]
\[\frac{\partial Z(W)}{ \partial W} = \frac{ \partial \sum_x \exp \{ -E(x,W) \} }{ \partial W }\]
<p>Recall the derivative of an exponential function:</p>
\[\frac{ \partial \exp f(x) }{\partial x} = \exp f(x) \frac{\partial f(x)}{\partial x}\]
<p>Use this for the partition function derivative:</p>
\[\frac{\partial Z(W)}{ \partial W} = \sum_x \exp \{ -E(x,W) \} \frac{ \partial \{-E(x,W) \} }{ \partial W }\]
<p>Hence:</p>
\[\frac{1}{Z(W)} \frac{\partial Z(W)}{ \partial W} = \sum_x \frac{\exp \{ -E(x,W) \} }{Z(W)} \frac{ \partial \{ -E(x,W) \} }{ \partial W }\]
<p>Using \(P(x \mid W)\):</p>
\[= \sum_x P(x \mid W) \frac{ \partial \{ -E(x,W) \} }{ \partial W }\]
<p>Again, the gradient of the divergence was:</p>
\[\nabla D(P_0(x) \mid\mid P(x\mid W)) = \sum_x P_0(x) \frac{\partial E(x,W)}{\partial W} + \frac{1}{Z(W)} \frac{\partial Z(W)}{ \partial W}\]
<p>Hence:</p>
\[\nabla D(P_0(x) \mid\mid P(x\mid W)) = \sum_x P_0(x) \frac{\partial E(x,W)}{\partial W} + \sum_x P(x \mid W) \frac{ \partial \{ -E(x,W) \} }{ \partial W }\]
\[\nabla D(P_0(x) \mid\mid P(x\mid W)) = \sum_x P_0(x) \frac{\partial E(x,W)}{\partial W} - \sum_x P(x \mid W) \frac{ \partial E(x,W) }{ \partial W }\]
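As a sanity check, this gradient can be compared against a finite-difference approximation of the KL-divergence on a toy model (the quadratic energy \(E(x,W) = W x^2\) and the data distribution below are made up for illustration):

```python
import numpy as np

states = np.arange(-3, 4).astype(float)
P0 = np.array([0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05])  # toy data distribution

def model(W):
    u = np.exp(-W * states**2)   # E(x, W) = W x^2
    return u / u.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

W = 0.5
# analytic gradient: sum_x P0 dE/dW - sum_x P(x|W) dE/dW, with dE/dW = x^2
analytic = float(np.sum(P0 * states**2) - np.sum(model(W) * states**2))
h = 1e-6
numeric = (kl(P0, model(W + h)) - kl(P0, model(W - h))) / (2 * h)
print(analytic, numeric)  # the two agree closely
```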
<p>Compare with Hinton:</p>
\[\frac{ \partial L(x,W) }{ \partial W} = - \left\langle \frac{\partial E(x,W)}{\partial W} \right\rangle_0 + \left\langle \frac{ \partial E(x,W) }{ \partial W } \right\rangle_{\infty}\]
<p>Gradient descent:</p>
\[W_{t+1} - W_t = - \lambda \nabla f(W_t)\]
<p>Thus,</p>
\[W_{t+1} - W_t = \lambda \left\{ - \sum_x P_0(x) \frac{\partial E(x,W)}{\partial W} + \sum_x P(x \mid W) \frac{ \partial E(x,W) }{ \partial W } \right\}\]
<p>We have arrived at a formulation of the minimization of the KL-divergence that allows comparing it with contrastive divergence.</p>
<h1 id="constrastive-divergence">Contrastive divergence</h1>
<p>Contrastive divergence uses a different (empirical) distribution to get rid of \(P(x \mid W)\): the model expectation is replaced by an expectation over \(Q_W(x)\), the distribution obtained by starting a Markov chain (e.g. Gibbs sampling) at the data and running it for only a few steps (often just one):</p>
\[W_{t+1} - W_t = \lambda \left\{ - \sum_x P_0(x) \frac{\partial E(x,W)}{\partial W} + \sum_x \color{blue}{Q_W(x)} \frac{ \partial E(x,W) }{ \partial W } \right\}\]
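To make this concrete, here is a toy sketch of such an update. The quadratic energy \(E(x,W) = W x^2\), the uniform data, and the single Metropolis step (standing in for the short Gibbs chain usually run from the data) are all illustrative assumptions, not the setup from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
states = np.arange(-3, 4).astype(float)

def dE_dW(x):
    return x**2          # E(x, W) = W * x^2, so dE/dW = x^2

def metropolis_step(x, W):
    """One Metropolis step per sample, starting from the data (a CD-1 style chain)."""
    prop = x + rng.choice([-1.0, 1.0], size=x.shape)
    in_range = (prop >= states.min()) & (prop <= states.max())
    accept = rng.random(x.shape) < np.exp(-W * (prop**2 - x**2))
    return np.where(in_range & accept, prop, x)

W, lam, n = 1.0, 0.02, 500
for _ in range(300):
    data = rng.choice(states, size=n)    # toy data: uniform over the states
    neg = metropolis_step(data, W)       # negative samples, roughly from Q_W
    grad = dE_dW(data).mean() - dE_dW(neg).mean()
    W = W - lam * grad                   # the contrastive divergence update
print(W)
```

With uniform data the best-fitting model is itself uniform, so the updates drive \(W\) toward zero; the short chain gives a cheap, if biased, stand-in for the intractable model expectation.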