
Variational Autoencoders Explained - lainon
http://anotherdatum.com/vae.html
======
sdenton4
Two things I believe to be true...

1) Auto-encoders are overplayed, mostly because they're a pretty easy intro ML
project. There was a brief moment (before ResNets and batch normalization
showed up) when they were useful for bootstrapping a representation, but they
aren't serving a terribly concrete purpose now that it's so much easier to get
end-to-end deeper pipelines running and learning their own representations.
The common criticism is that a representation that's great for reconstruction
may still not be super useful for classification (or whatever your real goal
is). And, compared to 'real' engineering, autoencoders do a pretty crap job at
data compression.

2) That said, /variational/ auto-encoders are doing some interesting things.
There was a nice paper [pdf:
[https://arxiv.org/pdf/1612.00410.pdf](https://arxiv.org/pdf/1612.00410.pdf) ]
using variational methods to try to take advantage of the (possibly
chimerical!) information bottleneck of Tishby. And I think the general idea of
placing a loss in the middle of the stack may still have some use for helping
models generalize; a loss based entirely on reconstruction error seems a bit
too constrictive, though.

~~~
nabla9
> auto-encoders are overplayed,

What are you using for unsupervised learning instead of autoencoders?

~~~
sdenton4
Triplet loss is a good alternative, which plays nicely with large, partially
labelled data sets. Here's a random blog post:
[https://omoindrot.github.io/triplet-loss](https://omoindrot.github.io/triplet-loss)

~~~
nabla9
It's not unsupervised learning.

~~~
sdenton4
You can run triplet learning as unsupervised representation learning. Let
aug(X) be an augmentation of X (noise, translation, etc.), take Y to be some
other sample, and form the triple (X, aug(X), Y). Let f be your neural network
(or whatevs). Then reduce the distance d(f(X), f(aug(X))) while increasing
d(f(X), f(Y)). No labels needed.
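
Roughly, in PyTorch (just a sketch; the function names, the margin, and aug are
placeholders, pick whatever augmentation makes sense for your data):

```python
# Sketch of an unsupervised triplet loss: pull f(X) toward f(aug(X)),
# push it away from f(Y), with a hinge at `margin`.
import torch
import torch.nn.functional as F

def unsupervised_triplet_loss(f, x, y, aug, margin=1.0):
    anchor = f(x)            # embedding of the original sample
    positive = f(aug(x))     # embedding of its augmentation -> pull together
    negative = f(y)          # embedding of an unrelated sample -> push apart
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # only penalize when the negative isn't at least `margin` farther away
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```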

~~~
nabla9
Has anyone done that and what is the advantage over modern autoencoders?

I would really like to know, because I'm currently working with autoencoders.
Relying only on synthetic samples can introduce all kinds of biased denoising
and require endless tweaking.

When you tie the decoder's weights and convolution kernels to those of the
encoder, you get relatively fast learning without an excessive number of
variables, and the encoder-decoder can be much deeper.
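
Roughly what I mean by tying, in a toy fully-connected PyTorch sketch (dims are
placeholders; the conv case ties kernels the same way):

```python
# Toy tied-weight autoencoder: the decoder reuses the transposed encoder
# weight, so only the encoder weight and two biases are learned.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)
        self.dec_bias = nn.Parameter(torch.zeros(in_dim))  # decoder gets its own bias only

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        # decode with the transposed encoder weight -> no separate decoder matrix
        return F.linear(z, self.encoder.weight.t(), self.dec_bias)

# train on reconstruction as usual, e.g. loss = F.mse_loss(model(x), x)
```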

~~~
sdenton4
Here's the OG triplet metric learning paper:
[https://arxiv.org/abs/1412.6622](https://arxiv.org/abs/1412.6622) Their
strategy is for labeled data; just take X and X' with the same label, and Y
with a different label.

And here's an example of unlabeled triplet learning:
[https://arxiv.org/abs/1711.02209](https://arxiv.org/abs/1711.02209) In this
case, aug(X) might be a slightly time or pitch shifted example. So whatever
the label is for X, it will be the same as the label for aug(X).

The advantage here is that as a framework it plays nicely with 'real'
problems. Representation learning is rarely the actual problem; it's certainly
not something an end-user cares about! If you are ultimately working on a
classification problem and have a huge amount of unlabeled data and a tiny bit
of labeled data, you can train the metric with both the Hoffer scheme (X and
X' with same label) and the unsupervised scheme (X and aug(X)) simultaneously,
and take advantage of the large dataset.
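
Concretely, something like this (PyTorch sketch; names are placeholders, not
from either paper):

```python
# Mix labeled and unlabeled triplets in one training step.
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def combined_loss(f, x_lab, x_pos, x_neg, x_unlab, y_unlab, aug):
    # labeled triplets: x_pos shares x_lab's label, x_neg has a different label
    sup = triplet(f(x_lab), f(x_pos), f(x_neg))
    # unlabeled triplets: an augmentation is the positive, any other sample the negative
    unsup = triplet(f(x_unlab), f(aug(x_unlab)), f(y_unlab))
    return sup + unsup
```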

A more general argument for metric learning is that, again, reconstruction
error proooobably isn't what you actually care about. (Sure, you're sharing
encoder and decoder weights, but why do you even need to decode?! How many
parameters are being wasted on the need for the weight matrices to support
decoding?) If clustering is what you want, metric learning gets at things
being close or far from one another more directly.

~~~
nabla9
Thanks for the info. I'll look into it.

------
theCricketer
The insight that made it possible for me to grasp VAEs was digging into the
probabilistic setup that leads to this formulation. The neural networks are
"just" a powerful function approximators applied on top of this probabilistic
framework.
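
For reference, the objective that probabilistic setup leads to is the evidence
lower bound (standard VAE notation, not lifted from the article):

```latex
% ELBO maximized by a VAE: q_\phi is the encoder, p_\theta the decoder,
% p(z) the prior (typically a standard normal).
\log p_\theta(x) \ge
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```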

