
Neural Networks Are Impressively Good at Compression - ingve
https://probablydance.com/2016/04/30/neural-networks-are-impressively-good-at-compression/
======
kaffeinecoma
People are knocking this guy for not being an expert and maybe getting some
details wrong. Maybe it's a little like watching a non-programmer stumble
their way through a blog post about learning to program: experienced
programmers may cringe a bit.

But I really appreciate these kinds of write-ups: he declares his non-
expertise up-front, and then proceeds to document his understanding as he goes
along. There's something useful about this kind of blog post for non-experts.

I'm working my way through Karpathy's writeup on RNNs
([http://karpathy.github.io/2015/05/21/rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness)). I've
mechanically translated his Python to Go, and even managed to make it work.
But I _still_ don't entirely understand the math behind it. Now obviously
Karpathy IS an expert, but despite his extremely well-written blog post, a lot
of it is still somewhat impenetrable to me ("gradient descent"? I took Linear
Algebra oh, about 25 years ago). So sometimes it's nice to see other people
who are a bit bewildered by things like tanh(), yet still press on and try to
understand the overall process.
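For what it's worth, the core idea of gradient descent fits in a few lines. A minimal sketch of my own (not from Karpathy's post), minimizing f(x) = x² by repeatedly stepping against the derivative:

```python
# Minimal gradient descent on f(x) = x^2, whose derivative is 2x.
# Each step moves a small amount against the gradient ("downhill").
def gradient_descent(x=5.0, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * x          # derivative of x^2 at the current x
        x = x - lr * grad     # step opposite the gradient
    return x

print(gradient_descent())  # converges toward the minimum at x = 0
```

Training a neural net is the same loop, just with x replaced by the weight matrices and the derivative computed by backpropagation.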

And FWIW I had the same reaction as the author when I started toying around
with neural nets- it's shocking how small the hidden layer can be and still do
useful stuff. It seems like magic, and sometimes you have to run through it
step-by-step to understand it.

~~~
karpathy
Sorry about that! There's a lot to cover for one blog post to do satisfyingly.
I encourage you to check CS231n for a more thorough treatment: we discuss, for
example, the tradeoffs of different activation functions like tanh(), give a
gentler introduction to gradient descent, I devote a whole lecture to char-rnn,
and assignment #1 (the assignments are available) would demystify the backward
pass, etc.

Also definitely +1 for not putting down people who write similar posts. I
encourage everyone who is trying to learn to do it through blog posts because
it lets you explain/organize thoughts. I also enjoy reading them quite a bit
because it illustrates the kinds of conceptual problems beginners face (which
is not at all obvious once you've been in the area for a few years). And it's
also interesting to see many different interpretations of the same concepts,
as everyone has different background and the way they reason through things is
usually quite unique. Granted, this one could have been named something more
appropriate!

~~~
kaffeinecoma
No need to apologize- I learned SO much from your blog, thank you. I didn't
realize the course was online
([https://www.youtube.com/watch?v=NfnWJUyUJYU](https://www.youtube.com/watch?v=NfnWJUyUJYU)).
Also, looks like there's a subreddit for it as well:
[https://www.reddit.com/r/cs231n](https://www.reddit.com/r/cs231n)

It's really wonderful that all of this is freely available, thank you.

~~~
jdminhbg
The lecture that covers gradient descent in the YouTube list you linked there
is the first time gradient descent actually clicked for me, and I made it
through the entire Andrew Ng Coursera ML course. Highly, highly recommend it.

~~~
andycjw
The video became private. Does anyone know the title of the video, or is there
another copy of it somewhere else?

------
svantana
I thought this would be about something like giving the Hutter Prize [1] a go
using character RNNs [2]. Instead, it's a somewhat confused "gentle
introduction" to neural nets (of which there are already plenty, of higher
quality), and compression is discussed only in a handwavy way, not properly
with bits and entropy like us information theorists would have it :)

[1] [http://prize.hutter1.net](http://prize.hutter1.net)

[2] [https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)

~~~
blazespin
It's an interesting idea though, as sophomoric as the attempt was. I'd be
curious to see images lossily compressed via a neural net. Do you know of any
better attempts than this?

~~~
petra
Not exactly what you asked, but even more interesting: there's a startup
that uses deep learning to take a compressed video and add details to improve
the quality, just by predicting what we expect to see.

Can't find the startup's name though.

~~~
hellameta
[http://www.piedpiper.com/](http://www.piedpiper.com/)

~~~
fegu
The middle-out compression company

------
jg8610
It's good to see people write up their experiments, it's useful for the rest
of us to test how we understand neural nets.

I think there are a few mistakes in your maths though. You can learn a 1-1
discrete mapping through a single node when you are using a one-hot vector:
you just assign a weight to each of the input nodes, and then use a delta
function on the other side. If I understood correctly, this is what you are
doing.

Also, if you use a tanh in your input layer but keep a linear output layer
(as you start off with), you are still doing a linear approximation, because
you have a rank-H matrix (where H is the hidden layer size) that is trying to
linearly approximate your input data. This is done optimally by PCA.
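That PCA claim can be checked numerically: by the Eckart-Young theorem, the best rank-H linear reconstruction of centered data is the projection onto the top H principal components, which is exactly the truncated SVD. A small numpy sketch of my own (variable names are mine, not from the comment):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)              # center the data, as PCA assumes

H = 2                            # size of the linear "hidden layer"
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-H reconstruction via the top-H principal components (rows of Vt):
X_pca = X @ Vt[:H].T @ Vt[:H]

# The truncated SVD gives the same reconstruction, and Eckart-Young says
# no rank-H linear encoder/decoder pair can do better in squared error.
X_svd = (U[:, :H] * s[:H]) @ Vt[:H]

print(np.allclose(X_pca, X_svd))  # True
```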

I'd second the advice to look into the Coursera courses, or Nando de
Freitas's Oxford course on YouTube (which actually has a really nice
derivation of backprop).

------
brendanofallon
I think some of his statements are a little off-base, for instance, regarding
the choice of tanh() for the activation function he says: "But mostly it’s
'because it works like this and it doesn’t work when we change it.'" People
have spent a fair amount of time investigating the properties of different
activation functions, and my understanding is that tanh() is generally not in
favor, and ReLU (rectified linear) is probably a better choice for many
applications, for reasons that are well understood. Maybe the author isn't all
that familiar with the field?

~~~
tgflynn
It depends on who you read. LeCun certainly advocated tanh for a long time.
ReLU seems to be currently favored, especially for deep nets but I'm not sure
it's being used universally. In my own experiments I am seeing it train faster
than logistic, but not necessarily give better results (maybe because you
don't want the net to train too fast to reach the best optimum).
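One of the well-understood reasons mentioned above is easy to see numerically: tanh saturates, so its gradient vanishes for large inputs, while ReLU's gradient stays at 1 for any positive input. A quick sketch (function names are mine):

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # derivative of tanh(x)

def relu_grad(x):
    return (x > 0).astype(float)   # derivative of max(0, x), for x != 0

x = np.array([0.0, 2.0, 10.0])
print(tanh_grad(x))   # roughly [1.0, 0.07, 8e-9] -- saturates quickly
print(relu_grad(x))   # [0.0, 1.0, 1.0] -- constant for positive inputs
```

In a deep net these gradients get multiplied layer by layer, which is why saturating units can make training slow (the "vanishing gradient" problem).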

His overall point that the field is largely empirical with little solid theory
to guide one's choices is, in my opinion, not wrong. It may be that a general
theory doesn't exist and different network architectures and hyperparameters
are needed for different types and sizes of problems.

~~~
argonaut
It doesn't depend on who you read. ReLUs (and their variants) are virtually
universal now. LeCun advocated tanh but that was years and years ago before
ReLUs came out (they came out in 2011) and he advocated ReLUs as early as
three years ago ([http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf](http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf)).
Empirically they give better results and are faster.

(I say "virtually" because, while I haven't seen a single state-of-the-art
result without ReLUs, I'm sure there is probably a random paper out there.)

~~~
tgflynn
As far as I can tell, LSTMs still use a combination of logistic and tanh
transfer functions, at least through 2014, and some of the most impressive
deep learning results have been obtained with them.

For example: [http://arxiv.org/abs/1308.0850](http://arxiv.org/abs/1308.0850)
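Both nonlinearities have a role in an LSTM cell: the gates use the logistic sigmoid (outputs in [0, 1], acting as soft switches) while the cell input and output use tanh (outputs in [-1, 1]). A minimal numpy sketch of one step, with weight names of my own choosing (not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,). Returns (h', c')."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)                    # input, forget, output, candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates: logistic, in (0, 1)
    g = np.tanh(g)                                 # candidate cell update: tanh
    c_new = f * c + i * g                          # gated memory update
    h_new = o * np.tanh(c_new)                     # gated output
    return h_new, c_new

D, H = 3, 4
rng = np.random.default_rng(0)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c,
                 rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)),
                 np.zeros(4 * H))
print(h.shape, np.all(np.abs(h) < 1.0))  # h stays bounded: sigmoid * tanh
```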

------
astazangasta
Lossy compression. Reminds me of this hilarious bit:
[http://web.archive.org/web/20050402231231/http://lzip.source...](http://web.archive.org/web/20050402231231/http://lzip.sourceforge.net:80/index.html)

------
algorithm314
Neural networks have been used for over a decade in Context Mixing compressors
like PAQ: [https://en.wikipedia.org/wiki/PAQ](https://en.wikipedia.org/wiki/PAQ)

------
pigscantfly
I was expecting to read about some experiments with autoencoders here, not
this tutorial. I'm not sure how the author is learning about neural nets or
where they are now, but anyone at this level of knowledge who wants to know
more would be well served by going through Geoff Hinton's class on Coursera.

------
jcoffland
This is not compression. The author focuses on the small number of nodes and
calls it compression but it's the edges that encode the information and there
are a lot of them.

~~~
zardeh
>If you count the number of connections on that last picture you will notice
that there are fewer than there were in the first network. There were 5*5=25
at first, now there are 5*2*2=20. That reduction would be larger if I had
more input and output nodes. For 10 nodes that reduction would be from 100
connections down to 40 when inserting two nodes in the middle. That’s where
the compression in the title of this blog post comes from.
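The arithmetic in the quoted passage generalizes: a direct n-to-n layer has n² connections, while routing through a hidden layer of h nodes gives 2·n·h. A trivial sketch (the function name is mine):

```python
def connections(n, h=None):
    """Edges in a dense n->n mapping, optionally through a hidden layer of h."""
    return n * n if h is None else 2 * n * h

print(connections(5))      # 25 direct connections
print(connections(5, 2))   # 20 through a 2-node hidden layer
print(connections(10, 2))  # 40 instead of 100, as the quote says
```

The saving only kicks in when 2·h is smaller than n, which is why the reduction grows with more input and output nodes.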

------
frozenport
How can the author draw conclusions without citing any code or experiments?
For example, if we try to use the features DeepDream learned, we can see that
our image will have terrible compression artifacts (faces of cats). DNNs do
dimensionality reduction; it is not clear that this correctly preserves image
features.

------
IamFermat
This is really cool. I can totally see a way the algorithm could adapt based
on the object being compressed.

------
john_reel
ZPAQ and pngwolf have very nice approaches to learning algorithms for
compression. Especially ZPAQ.

------
rdlecler1
I'm sure with perturbation analysis you could also remove even more edges from
the ANN.

------
tacos
This is both a fascinating read of an obviously bright and clever fellow and
also a TERRIBLE way to approach building intuition around the emergent
behaviors of neural networks. If you want to learn about them without
repeating the terrible mistakes this guy is about to make if he continues down
this path, pick up a book.

~~~
CardenB
What sort of terrible mistakes are you referring to?

~~~
tacos
A fundamental misunderstanding of the space, focusing on the wrong things,
thinking he can visualize (or should visualize) the mechanics, hyperfocus on
topologies, misunderstanding of key distance functions, sigmoids,
training/overtraining, add buzzword here.

Right now he's a chef who obsessively spent six months thinking about knives.
But it's about the meat and the heat, not the chef knife.

Just pick a problem and start playing with it! If you lean toward this sort of
pedantic "I MUST UNDERSTAND ALL FUNDAMENTALS" approach (I do!), I still
recommend Matlab (available for $49-$99 in student editions) because the tools
are so good and the docs are so damn readable. This post is basically a Matlab
doc page minus the links to the stuff that actually works.

~~~
eli_gottlieb
>Just pick a problem and start playing with it! If you lean to this sort of
pedantic "I MUST UNDERSTAND ALL FUNDAMENTALS" (I do!) I still recommend Matlab
(available for $49-$99 in student editions) because the tools are so good and
the docs are so damn readable.

 _Don't pay to torture yourself!_ Use Python instead! Numpy and scipy are the
standard libraries for most scientific code nowadays, and you'll be able to
use Theano, Caffe, TensorFlow, PyMC, scikit-learn, pybrain, and loads of other
machine-learning and statistics libraries with Python interfaces!

~~~
tacos
+1 for python however good luck building fundamentals given the sea of chaotic
products. The whole point of the Matlab recommendation is that, like Amazon
S3, they haven't really touched the damn NN toolkit in a decade. While y'all
were tinkering with MVC apps in 2007 some of us were researching and
publishing and laying the foundation for the nine different open source
learning libraries that can't agree on terminology or workflow patterns or
which cloud provider they are clandestinely pushing you toward.

That said, I deploy using Python and several of these libraries (and more
interesting unmentioned ones). But often, Matlab for exploratory dev! I re-
recommend it for getting started.

~~~
thatcat
>more interesting unmentioned ones

care to share more info?

~~~
tacos
NN is flavor of the month at HN but these core toolkits don't really do much.
They're all running the same Fortran code and MIT FFT libraries Matlab's been
using for 20+ years. It's the base that the cool experiments happen on top of.

Once you dig in and try to actually do something, you stumble on the cool
stuff. It's often Python throwaway or Matlab spaghetti code in a .edu
directory that starts with a tilde. And it disappears when you need it most.

(If it's Python, it's guaranteed to be something wonky that uses 2.7 if you're
on 3.x or vice versa, or requires three additional packages that throw obscure
build errors on your distro. And if it's Matlab, it inevitably requires you to
shell out another $299 for some stupid toolbox just to generate white
noise...)

Good times. I won't kill your fun by being overly specific. Dig in!

------
pjbrunet
I suppose that one episode of POI (the chain of laptops on ice with Pink
Floyd) could have implied a neural network of sorts, manifesting at the
hardware level. If you remember the episode, they had to compress the AI to
fit in a briefcase.

------
robbiep
Ok so 3 layers increases the complexity but simplifies the connections. The
human cortex has 6 layers.

Now create 6 layers, classify different sets of inputs as 'different' to
represent different neurochemicals (you need several excitatory and several
inhibitory and then a couple of very small master neurochemicals that have
major excitatory and inhibitory responses to represent the dopamine network
and a whole system for the amygdala), cluster different groups to either
respond to inputs or create outputs, and set it loose on an environment. How
close would we come to something that behaves as if conscious?

~~~
zamalek
> How close would we come to something that behaves as if conscious?

The human brain is recursive in nature - the neurons and synapses form a
cyclic graph. You can't do that with a simple perceptron.

~~~
otempomores
If the human brain is recursive... what's stopping it from going in infinite
circles? Chemistry? If you inhibit that reaction, do you crash? Can you
become a master of a field by having the specialized neurons keep their cycle
supported?

~~~
__s
Cyclic graph with distributed/parallel execution means infinite loops don't
halt the entire system. Some specialized neurons being tightly recursive don't
generate new information, thus wouldn't build expertise, but obsession can be
fruitful

