Nonnegative Autoencoders

My intuition says that a part-based decomposition should arise naturally within an autoencoder. To incorporate the next image in an image recognition task, it should be beneficial if gradient descent can navigate towards the optimal set of neural network weights for that image. If not, gradient descent is forever navigating towards some kind of common denominator, and none of the images is actually properly represented. For each new image that gets classified better, the other images get classified worse. With a proper decomposition, learning the next representation does not interfere with previous representations. In Adaptive Resonance Theory (ART), Grossberg calls this catastrophic forgetting.

Maybe if we train a network long enough this decomposition strategy will indeed emerge. However, this is not what is normally found. The different representations get coupled and there is no decomposition that allows the network to explore different feature dimensions independently.

One way to obtain a part-based representation is to force the weights in a network to be positive or zero. A classic example from the literature is nonnegative matrix factorization. Due to the nonnegativity constraint the features are additive. This leads to a sparse basis where “parts” are summed up into a “whole” object. For example, faces are built up out of features like eyes, nostrils, mouth, ears, eyebrows, etc.
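To make the additivity concrete, here is a minimal sketch of nonnegative matrix factorization with the multiplicative updates of Lee and Seung, written in plain NumPy (the data matrix V, the rank, and the iteration count are illustrative placeholders, not values from any particular experiment):

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9):
    """Factor a nonnegative matrix V (m x n) into W (m x rank) @ H (rank x n),
    both nonnegative, using Lee & Seung multiplicative updates."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        # Every factor in the updates is nonnegative, so W and H stay
        # nonnegative throughout training.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

If V holds vectorized face images in its columns, the columns of W end up as additive “parts” and H describes how much of each part goes into every face.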


Generating Point Clouds

If we want robots to learn about the world, we can use computer vision. We can employ traditional methods and build up a full-fledged model from corner detectors, edge detectors, feature descriptors, gradient descriptors, etc. We can also use modern deep learning techniques, where one large neural network hopefully captures similar or even better abstractions compared to the conventional computer vision pipeline.

Computer vision is not the only choice though! In recent years there has been a proliferation of a different type of data: depth data. Collections (or clouds) of points represent 3D shapes. In gaming, the Kinect was a world-shaking invention based on structured light. In robotics and autonomous cars, LIDARs are used. There is a huge debate about which sensors are gonna “win”, but I personally doubt there will be a clear-cut winner. My line of reasoning:

  • Humans use vision and have perfected this in many ways. It would be silly to not use cameras.
  • Depth sensors can provide information when vision gets obstructed.
  • Humans use glasses, microscopes, infrared goggles, all to enhance our senses. We are basically cyborgs.
  • Robots will benefit from a rich sensory experience just like we do. They want to be cyborgs too.

Attend, Infer, Repeat

A long, long time ago - at least measured against the fast-moving pace of advances in deep learning - namely two years ago (2016), there was a paper studying how we can teach neural networks to count.

Attend, infer, repeat

This paper is titled “Attend, infer, repeat: Fast scene understanding with generative models” and the authors are Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa (github, nice, he does couchsurfing), David Szepesvari, Koray Kavukcuoglu, and Geoffrey Hinton, a team at DeepMind based in London.

Counting has been a personal interest of mine. I find it very satisfying that bees, for example, can count landmarks, or at least have a capability that approximates this fairly well. It is such an abstract concept, yet very rich. Just take the fact that you can recognize yourself in the mirror (I hope). It is grounded in something that very strongly believes there is only one of you, that you are pretty unique.


Random Gradients


Variational inference approximates the posterior distribution in probabilistic models. Given observed variables $x$ we would like to know the underlying phenomenon $z$, defined probabilistically as $p(z \mid x)$. Variational inference approximates $p(z \mid x)$ through a simpler distribution $q(z)$. The approximation is defined through a distance/divergence, often the Kullback-Leibler divergence:

$$\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \int q(z) \log \frac{q(z)}{p(z \mid x)} \, dz.$$

It is interesting that this strategy does not require Monte Carlo updates: minimizing the divergence can be seen as a deterministic optimization problem. It is, however, perfectly possible to solve this deterministic problem stochastically as well, by formulating it as a stochastic optimization problem.
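As a toy illustration of the deterministic view, here is a small sketch of my own (the Gaussians, learning rate, and step count are arbitrary choices) that fits a Gaussian $q$ to a fixed Gaussian posterior by plain gradient descent on the closed-form KL divergence, with no sampling anywhere:

```python
import numpy as np

# Fixed "posterior" p(z | x) = N(mu_p, sigma_p^2) that we want to approximate.
mu_p, sigma_p = 3.0, 0.5

# Variational parameters of q(z) = N(mu_q, sigma_q^2).
mu_q, log_sigma_q = 0.0, 0.0
lr = 0.1

for step in range(500):
    sigma_q = np.exp(log_sigma_q)
    # Closed-form KL(q || p) for two univariate Gaussians:
    #   log(sigma_p / sigma_q) + (sigma_q^2 + (mu_q - mu_p)^2) / (2 sigma_p^2) - 1/2
    # Its gradients w.r.t. mu_q and log_sigma_q are fully deterministic.
    d_mu = (mu_q - mu_p) / sigma_p**2
    d_log_sigma = -1.0 + sigma_q**2 / sigma_p**2  # chain rule through sigma_q = exp(log_sigma_q)
    mu_q -= lr * d_mu
    log_sigma_q -= lr * d_log_sigma

print(mu_q, np.exp(log_sigma_q))  # converges to (mu_p, sigma_p)
```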

There are two main strategies (both sketched in code right after this list):

  • the reparametrization trick
  • the log-derivative trick
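
To make the two strategies concrete, here is a minimal NumPy sketch under assumptions of my own choosing (a Gaussian q(z) = N(mu, sigma^2) and a toy objective f(z) = (z - 3)^2) that estimates the gradient of E_q[f(z)] with both tricks:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    # Toy objective whose expectation under q we want to differentiate.
    return (z - 3.0) ** 2

mu, sigma = 0.0, 1.0   # parameters of q(z) = N(mu, sigma^2)
n_samples = 10_000

# --- Reparametrization trick ----------------------------------------
# Write z = mu + sigma * eps with eps ~ N(0, 1); the gradient flows
# through z: dE/dmu = E[f'(z)], dE/dsigma = E[f'(z) * eps].
eps = rng.standard_normal(n_samples)
z = mu + sigma * eps
df_dz = 2.0 * (z - 3.0)              # analytic derivative of f
grad_mu_rep = np.mean(df_dz)
grad_sigma_rep = np.mean(df_dz * eps)

# --- Log-derivative (score function) trick ---------------------------
# grad_theta E_q[f] = E_q[f(z) * grad_theta log q(z)], no f'(z) needed.
z = rng.normal(mu, sigma, n_samples)
score_mu = (z - mu) / sigma**2                        # d log q / d mu
score_sigma = ((z - mu)**2 - sigma**2) / sigma**3     # d log q / d sigma
grad_mu_score = np.mean(f(z) * score_mu)
grad_sigma_score = np.mean(f(z) * score_sigma)

print(grad_mu_rep, grad_mu_score)        # both estimate 2 * (mu - 3)
print(grad_sigma_rep, grad_sigma_score)  # both estimate 2 * sigma
```

The reparametrization estimator needs the derivative of f but has low variance; the log-derivative (score function) estimator only needs f itself, at the price of higher variance, which is why a baseline is typically subtracted in practice.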

Machine Learning Done Bayesian

In the dark corners of the academic world there is a rampant fight between practitioners of deep learning and researchers of Bayesian methods. This polemic article testifies to it, firmly establishing itself on the anti-Bayesian side.

There is not much you can have against Bayes’ rule, so the hate runs deeper than this. I think it stems from the very behavior of Bayesian researchers rewriting existing methods as approximations to Bayesian methods.

Ferenc Huszár, a machine learning researcher at Twitter, describes some of these approximations.

  • L1 regularization is just Maximum A Posteriori (MAP) estimation with sparsity-inducing priors (the first one is worked out below this list);
  • Support vector machines are just the wrong way to train Gaussian processes;
  • Herding is just Bayesian quadrature done slightly wrong;
  • Dropout is just variational inference done slightly wrong;
  • Stochastic gradient descent (SGD) is just variational inference (variational EM) done slightly wrong.
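
To give a flavor of what such a rewriting looks like, here is the standard derivation behind the first bullet, with a Laplace prior $p(w_i) \propto \exp(-\lambda |w_i|)$ chosen as the sparsity-inducing prior (notation is mine, not Huszár's):

$$\hat{w}_{\mathrm{MAP}} = \arg\max_{w} \Big[ \log p(\mathcal{D} \mid w) + \log p(w) \Big] = \arg\min_{w} \Big[ -\log p(\mathcal{D} \mid w) + \lambda \sum_i |w_i| \Big],$$

so the L1 penalty is exactly the negative log of a Laplace prior, up to an additive constant.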

Do you have other approximations you can think of?