
The Confusion of Variational Autoencoders - jaan
https://jaan.io/unreasonable-confusion/
======
murbard2
What puzzles me about the variational autoencoder is that there is no reason
to expect the covariance of p(z|x) to be diagonal. This sounds like such a
crude approximation that there ought to be little benefit in even treating it
as a distribution rather than a point mass. And yet it seems to do rather well
(though not as well as GANs, which can represent arbitrary distributions).

~~~
taliesinb
VAEs can be extended to make the latent variables dependent. OpenAI's inverse
autoregressive flow is one recent and particularly efficient way to do this:
[http://arxiv.org/pdf/1606.04934v1.pdf](http://arxiv.org/pdf/1606.04934v1.pdf).
Linear IAF is the simplest form; with it you can model a normal z with an
arbitrary covariance matrix.
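
A rough sketch of what the linear case buys you (my own PyTorch code, not the
paper's; the function name is made up): have the encoder emit a mean and an
unconstrained matrix, build a lower-triangular L from it, and reparameterize
as z = mu + L @ eps, which gives z covariance L L^T with a cheap log-density:

    import math
    import torch

    def sample_full_cov_posterior(mu, raw_L):
        """mu: (batch, d); raw_L: (batch, d, d), both encoder outputs."""
        d = mu.shape[-1]
        # Keep the strictly lower triangle; force the diagonal positive.
        diag = torch.nn.functional.softplus(
            torch.diagonal(raw_L, dim1=-2, dim2=-1))
        L = torch.tril(raw_L, diagonal=-1) + torch.diag_embed(diag)
        eps = torch.randn_like(mu)                    # eps ~ N(0, I)
        z = mu + (L @ eps.unsqueeze(-1)).squeeze(-1)  # z ~ N(mu, L L^T)
        # Change of variables: log q(z|x) = log N(eps; 0, I) - log|det L|,
        # and the det of a triangular matrix is the product of its diagonal.
        log_q = (-0.5 * (eps ** 2).sum(-1)
                 - 0.5 * d * math.log(2 * math.pi)
                 - torch.log(diag).sum(-1))
        return z, log_q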

But aside from that, there is an information-theoretic view on why you might
prefer VAEs over AEs. In short, having p(z|x) not be a point mass (as it is in
an ordinary AE) allows you to bound the information flow through the
bottleneck. The KL loss on p(z|x) forces the network to be honest about how
much information it is cramming into z for the purposes of reconstruction.

To unpack that a bit: in theory, even a single real-valued latent variable z
could store an arbitrary amount of information (if the encoder and decoder
conspired cleverly enough). But if you make z stochastic, or in other words if
your encoder's job is to calculate the parameters of a distribution from which
you sample z, you're essentially introducing a noisy channel in the middle of
your network, and you can then bound how much information is flowing across
that channel. But to do that you still need a KL divergence loss encouraging
p(z|x) to approximate your chosen latent distribution; otherwise your encoder
and decoder might cheat, e.g. by collapsing to a near-point-mass z and turning
back into an ordinary AE.

Or in deep learning speak, it's a form of regularization with a particularly
rich and interpretable statistical motivation.
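
In code, the standard diagonal-Gaussian version of that information budget
looks like this (a minimal sketch with a made-up helper name, in PyTorch): the
encoder emits mu and log_var, the sample passes through the noisy channel, and
the analytic KL to N(0, I) is the per-example information penalty in nats:

    import torch

    def sample_and_kl(mu, log_var):
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps  # the noisy channel
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.
        kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(-1)
        return z, kl

    # Driving log_var -> -inf would recover a deterministic AE, but kl
    # blows up, which is exactly what blocks the near-point-mass cheat.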

~~~
murbard2
I get the regularization part, but don't you get essentially the same
regularization from using a sparse autoencoder? If the encoder realizes it
doesn't have much information, it will turn on few units.

What I don't really intuit is: is it just basically doing regularization, or
is the interpretation in terms of learning to infer the posterior meaningful?

~~~
taliesinb
> I get the regularization part, but don't you get essentially the same
> regularization from using a sparse autoencoder? If the encoder realizes it
> doesn't have much information, it will turn on few units.

Putting a sparsity loss on z in a regular AE will encourage the code to have
smaller magnitudes, and with ReLU those units will tend to saturate at zero,
yes.

But the original point was that even a single continuous unit can be used to
transmit an arbitrary amount of information. Not that this happens much in
practice (the encoder and decoder would need access to something like modulo
arithmetic to pull off the most obvious kinds of cheating), but from an
information-theoretic point of view you can't really talk about how much
information a continuous variable transmits unless it passes through a noisy
channel whose distributions have measurable entropies. And indeed you can
formally derive how a given KL loss bounds the information transmitted by z.
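
For reference, a sketch of that derivation (my notation): write q(z) =
E_{p(x)}[q(z|x)] for the aggregate code distribution. Then the mutual
information between data and code satisfies

    I(x;z) = \mathbb{E}_{p(x)}\left[ \mathrm{KL}\left( q(z \mid x) \,\|\, q(z) \right) \right]
           \le \mathbb{E}_{p(x)}\left[ \mathrm{KL}\left( q(z \mid x) \,\|\, p(z) \right) \right],

since the gap between the two sides is exactly \mathrm{KL}(q(z) \,\|\, p(z))
\ge 0. So the expected KL loss is an upper bound on the nats z carries about x.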

> What I don't really intuit is: is it just basically doing regularization, or
> is the interpretation in terms of learning to infer the posterior
> meaningful?

Both, which I think is really nice. You can look at it either way.

The Bayesian interpretation is powerful because you now have a principled way
to calculate p(x), which you didn't have before. You can also introduce
multiple latent variables into your network (as long as no layer takes inputs
from both ordinary and sampling layers), which gives you some flexibility to
do limited forms of graphical modelling that support efficient forward
inference and GPU acceleration. And the inference machinery can be trained via
cheap backpropagation instead of expensive sampling.
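
To spell out the "principled way to calculate p(x)" (standard ELBO notation,
nothing specific to this article): by Jensen's inequality,

    \log p(x) = \log \mathbb{E}_{q(z \mid x)}\!\left[ \frac{p(x \mid z)\, p(z)}{q(z \mid x)} \right]
            \ge \mathbb{E}_{q(z \mid x)}\left[ \log p(x \mid z) \right] - \mathrm{KL}\left( q(z \mid x) \,\|\, p(z) \right),

which is exactly the reconstruction-plus-KL objective the network is trained
on, and the same importance-weighting idea gives you (lower-bound) estimates
of log p(x) on held-out data.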

------
conjectures
There were some nice things about this article. However, I wouldn't recommend
it as a cure for confusion. E.g.

> in mean-field variational inference, we have parameters for each datapoint
> ... In the variational autoencoder setting, we do amortized inference where
> there is a set of global parameters ...

Mean-field means the variational posterior is modelled as factorising over
the different latent variables involved. Some latent variables can be local
(unique to a data point) and some can be global (shared across data points);
the two distinctions are independent.
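
To make the two notions concrete (my notation, not the article's):

    \text{mean-field:} \quad q(z) = \prod_i q_{\lambda_i}(z_i)
    \qquad\qquad
    \text{amortized:} \quad \lambda_i = f_\phi(x_i)

The first is a factorization assumption on the posterior; the second is about
whether each datapoint gets its own free parameters \lambda_i or they are
computed by one shared encoder f_\phi. They are orthogonal choices.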

------
jayajay
Recently, someone shared a link on Hacker News to this website:
[https://pomax.github.io/nrGrammar/](https://pomax.github.io/nrGrammar/). If
you look carefully at section 1.1.4, which visually compares the Hiragana and
Katakana scripts, you can see that there is a "logic" to transitioning from a
character in Hiragana to the same character in Katakana. In the same way, it
seems that an autoencoder is capable of capturing this kind of logic.

~~~
dietrichepp
Note that this is because a subset of the kana were derived from the same Han
characters in both scripts, but this does not apply to all kana.

------
tiiualto
It sure is complicated! I huff and I puff, but I still don't understand a
thing!

