
New Method for Compressing Neural Networks Better Preserves Accuracy - georgecarlyle76
https://developer.amazon.com/blogs/alexa/post/a7bb4a16-c86b-4019-b3f9-b0d663b87d30/new-method-for-compressing-neural-networks-better-preserves-accuracy#
======
twotwotwo
Interesting 'cause once you think of it as a lossy compression task, lots of
possible tricks come to mind. One is that you could "cheat" to correct the
most important errors: look at which input words have high (distance from
original point * frequency of use), and save the (in their example)
300-element error vector (original embedding - estimate) for those. Since we
use the most common words a _lot_, you might be able to recover a noticeable
chunk of accuracy without using much additional space.

Really, you want to correct the errors that are _important_, and importance
to the task is a different thing from error magnitude * word frequency. You
might run the network on real tasks with the uncompressed table and the
compressed table, and add correction vectors for the words to which you can
attribute the most mistakes made by the compressed net. And you don't
necessarily have to correct the whole vector or correct it perfectly; you
could round values in the correction vector down to zero when they're small
enough, saving space and computation, or save your corrections with low
precision, or try another round of SVD on some error vectors and only store 30
more weights for correcting important errors (and one 30x300 matrix) not 300,
or some combo of tricks like that.
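
(For concreteness, a minimal numpy sketch of that first trick; the sizes, threshold, and frequency model here are made up for illustration:)

```python
import numpy as np

# Toy setup: original embeddings E (V x 300) and a lossy reconstruction E_hat.
rng = np.random.default_rng(0)
V, d = 10_000, 300
E = rng.normal(size=(V, d)).astype(np.float32)
E_hat = E + rng.normal(scale=0.1, size=(V, d)).astype(np.float32)
freq = rng.zipf(1.5, size=V).astype(np.float32)  # stand-in for real word frequencies

# Score words by (reconstruction error * frequency of use), keep the worst offenders.
err = np.linalg.norm(E - E_hat, axis=1)
top = np.argsort(err * freq)[-100:]  # correct only the 100 most "important" words

# Sparse, low-precision corrections: round small components down to zero,
# store the rest in float16 to save space.
corrections = E[top] - E_hat[top]
corrections[np.abs(corrections) < 0.05] = 0.0
corrections = corrections.astype(np.float16)

# At lookup time, add the correction back for words that have one.
E_hat[top] += corrections.astype(np.float32)
```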

(I think this paper is saying that after you get the SVD-compressed lookup
table, you do more NN training that tweaks the embedding and possibly the rest
of the net. The links and weights for correcting important errors could be
tweaked in the same process.)

It's tempting sometimes to treat neural networks as their own black box in a
world apart from the rest of CS, when in fact there can be lots of interesting
connections. Now, where all those connections ought to be and what their ideal
weights are, I don't know :P

------
azhenley
The article's title is actually "New Method for Compressing Neural Networks
Better Preserves Accuracy".

Summary of a key result: "In one set of experiments, for instance, we showed
that our system could shrink a neural network by 90 percent while reducing its
accuracy by less than 1%. At the same compression rate, the best prior method
reduced the accuracy by about 3.5%."

Link to the 2019 AAAI paper:
[https://arxiv.org/pdf/1811.00641.pdf](https://arxiv.org/pdf/1811.00641.pdf)

~~~
ralusek
Do they mean compressing the number of neurons involved in computation, or do
they mean compressing the output model?

~~~
irodov_rg
It's the lookup table that takes up most of the space. This work is about
breaking it into two layers and continuing to train to regain accuracy. The
output model ends up 90% smaller than the original model.
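
(Roughly, the factorization looks like this in numpy; the sizes are toy values, with 30 as the example rank from the article:)

```python
import numpy as np

# Hypothetical lookup table: 50k words x 300 dims (the article's is ~1M x 300).
V, d, k = 50_000, 300, 30
E = np.random.randn(V, d).astype(np.float32)

# Truncated SVD: E is approximated by (U * S) @ Vt, i.e. two smaller layers.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :k] * S[:k]  # V x k: the new, much smaller lookup table
B = Vt[:k, :]         # k x d: a small dense layer stacked on top

print(E.size, A.size + B.size)  # 15,000,000 vs ~1,509,000 weights: ~90% smaller
```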

------
jknz
The goal of the post is to compress an already trained network.

A more ambitious related goal is to perform the training of the network with
at most 10% (or some other small percentage) of the network weights being
nonzero at any time during training. If that is possible without a great loss
in accuracy, it could be particularly useful for saving memory during
training. The paper
[https://arxiv.org/abs/1711.05136](https://arxiv.org/abs/1711.05136) proposes
a method to achieve this.

------
arnocaj
I would not consider word embeddings state of the art anymore. Word embeddings
are now what TF-IDF was when word embeddings came out. Have a look at the BERT
model, which was just recently published and is outperforming on all kinds of
NLP tasks with one main architecture. I would consider the BERT language model
two levels above word embeddings, as it produces fully context-sensitive
embeddings that depend on the text to the left and to the right of the word
simultaneously.

~~~
yazr
Is BERT computationally (and sample-wise) equivalent to previous SOTA?

(I do DRL but not NLP)

I sometimes read these DL papers and the requirements are not really feasible
if you have to re-implement them in a modified domain.

~~~
irodov_rg
BERT is more computationally expensive. It might end up giving better results
on the task mentioned in the paper, but we don't know. At the time of writing,
all of the contextual word embedding techniques were fairly new and had not
been tried.

------
anishacharya
Just to clarify a bunch of questions below: the idea is, roughly, to start
with a full embedding layer and train the network; after training stabilizes,
break the first layer into two using the SVD formulation in an online fashion
and continue training. After only a few more epochs you regain almost all of
the accuracy.

This is particularly interesting and efficient compared to quantization/
pruning methods, as there is no way to regain lost accuracy with those.
Quantized distillation can be an option, but it is too slow, as the authors
point out. Offline embedding compression like FastText.zip compresses the
embedding matrix offline and seems to preserve less information than this
method, leading to worse performance.

Another important thing to note: this method shows significant improvement
over quantization/pruning/hashing when the network is deep. The deeper the
network, the larger the quantization error, due to more FLOPs running in low
precision, whereas this algorithm, being principled from an information-
theoretic perspective, maintains steady performance irrespective of network
depth; deeper nets will actually give better gains.

In many of our experiments on smaller tasks this method also acts as a cheap
regularizer, leading to boosted performance. I also think the mathematical
elegance and ease of implementation deserve appreciation, along with the
numerical analysis that provides guarantees and bounds, unlike quantization
methods.
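
(A PyTorch-flavored sketch of that two-step recipe, not the paper's code; the sizes and the surrounding model are placeholders:)

```python
import torch
import torch.nn as nn

V, d, k = 5_000, 300, 30

# Step 1: train the network with a full V x d embedding layer as usual.
emb = nn.Embedding(V, d)
# ... ordinary training until accuracy stabilizes ...

# Step 2: factor the learned table with a truncated SVD, rebuild the first
# layer as two smaller ones, and continue training the whole net end to end.
with torch.no_grad():
    U, S, Vt = torch.linalg.svd(emb.weight, full_matrices=False)
    small_emb = nn.Embedding(V, k)
    small_emb.weight.copy_(U[:, :k] * S[:k])
    proj = nn.Linear(k, d, bias=False)
    proj.weight.copy_(Vt[:k, :].T)  # nn.Linear stores weight as (out, in)

# proj(small_emb(ids)) now approximates the original embeddings; a few more
# epochs of end-to-end training recover almost all of the lost accuracy.
ids = torch.tensor([1, 42, 7])
x = proj(small_emb(ids))  # shape (3, d), feeds the rest of the network
```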

------
nestorD
So they took a NN taking embeddings as input, added a first layer going from
one-hot encodings to embeddings (you just use the embedding matrix as the
first layer), did an SVD on it, and retrained the network on top.

Using SVD plus retraining to compress a neural network is a common idea; their
gains come mostly from applying it to the embeddings, which are usually
considered separately from the network (but do take up space on your device).

------
mlucy
Interesting. It took me a while to figure out what the main contribution here
is, since doing dimensionality reduction on embeddings is fairly common. I
think the main contribution is an empirical measure of A) how little is lost
by reducing the dimensionality of the embeddings, and B) how much the
dimensionality-reduced embeddings benefit from being fine-tuned on the final
dataset.

It looks like fine-tuning the lower-dimensionality embeddings helps a lot (1%
degradation vs. 2% for the offline dimensionality reduction according to the
paper).

This makes me pretty happy, since I run a startup that trains and hosts
embeddings ([https://www.basilica.ai](https://www.basilica.ai)), and we made
the decision to err on the side of providing larger embeddings (up to 4096
dimensions). We made this decision based on the efficacy of offline
dimensionality reduction, so if fine-tuned dimensionality reduction works even
better, that's great news.

~~~
DoctorOetker
My background is in physics, but I have been reading and following AI papers
for many years without actually applying the knowledge.

I have an idea I wish to try out regarding word embeddings, and I am looking
for either a minimalist or a very clearly documented library for word
embeddings. By minimalist I do _not_ mean simplest in usage, but simplest in
getting acquainted with the underlying codebase.

I have no interest in becoming a _proficient user_ of a commercially relevant
library, since machine learning is not how I earn money. But I can imagine one
of the more theoretical courses on machine learning has a codebase for
students to tinker with, i.e. one where the educational facet of how word
embedding algorithms work is somewhat prioritized over the commercial
relevance and computational efficiency facets. Do you know of any such minimal
implementations?

The idea concerns _objective_ bias removal (potentially at the cost of loss of
factual knowledge), so if it works it may be interesting outside of word
embeddings as well...

EDIT: The only functional requirement I have of the codebase is that, given a
corpus, it can generate an embedding such that the classic "king" + "woman"
- "man" ≈ "queen" relation is satisfied. The only computational-efficiency
requirement is that it can generate this embedding on a single-core laptop
within a day for a roughly billion-word corpus. Any additional functionality
is, for my purposes, overcomplication. So I am looking for the simplest
(well-documented, preferably in paper form) codebase which achieves this.
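
(For concreteness, the functional requirement itself is only a few lines with, e.g., gensim's word2vec; the corpus below is a meaningless placeholder, so a real run needs a real corpus:)

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences (replace with a real one).
sentences = [["king", "man", "crown"], ["queen", "woman", "crown"]]
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=1)

# The classic analogy check: king + woman - man should land near queen.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```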

~~~
speedplane
>I have an idea I wish to try out regarding word embeddings, and I am looking
for either a minimalist or a very clearly documented library for word
embeddings.

Using AI on natural language is still in its infancy. You're not going to find
a clean off-the-shelf solution if you're doing something unique.

Google Cloud has a natural language AI module that makes solving certain
problems pretty easy (little to no coding). The next step is to look at
packages like NLTK, which are more complex but can handle a larger set of
natural language processing tasks (you need to be comfortable with Python). If
neither of the above tools does the job, you're going to need to dive much
deeper, learn the fundamentals of neural nets and natural language processing,
and become familiar with tools like Tensorflow.

Natural language processing is a much more open field than image processing,
despite seemingly much lower bandwidth and data sizes. As far as data size
goes, all of Shakespeare's works can fit into a single iPhone image, but
understanding what's going on in natural language is often a far more
difficult task than understanding what's going on in an image.

~~~
DoctorOetker
I apologize for my late edit, and thank you for your response. I just need a
minimal codebase that generates a word embedding satisfying these quasi-linear
relationships.

------
karmakaze
This leads in a very interesting direction. Previously I read that a DNN is
just as content to fit random input as it is to fit meaningful categories of
input, which makes it seem that the DNN is just a memory trick and can't
possibly be extracting any meaning.

Now that the network can be compressed after being trained, it's evident that
not all the nodes were needed/used. A good experiment would be to train a DNN
on random input samples and verify that it is far less compressible. If so,
the degree of compressibility could be used as a proxy estimate of how much
'conceptualization' is being done.

~~~
anishacharya
The overcomplete nature of DNNs is a common phenomenon, often advocated as a
way to convert saddle points into local minima. Understanding the actual loss
landscape is difficult due to the inherent nonlinearity. This indeed leads to
an interesting research direction, and many prominent research groups have
started investing heavily in it.

------
atrudeau
Very cool idea. How does fine-tuning the SVD initialization compare to
training from random initialization using the same architecture? I couldn't
find this in the paper.

------
ppod
Something I don't understand:

I thought that when using an LSTM for classification, the input for the
sentence was a single vector of length 300, recurrently updated with each word
in the text. And for the averaging method, it was a vector of length 300: the
average of the length-300 vectors of the words in the sentence. At what stage
in the standard process would a matrix of size V*300 be used to represent a
sentence or document?

~~~
twotwotwo
I think the big matrix in the post is the "dictionary" used to map words to NN
input vectors, not a representation of one input sentence or document. So its
size is (total words understood by NN * dimensions of embedding).

~~~
ppod
Yes, sorry that's actually quite clear in the text now that I read it again.

------
drewm1980
"NLP" would be welcome in the title since this seems specific to natural
language processing.

------
anishacharya
The paper is here:
[https://arxiv.org/pdf/1811.00641.pdf](https://arxiv.org/pdf/1811.00641.pdf)

------
darawk
They're just SVDing the embeddings? Is that really new?

~~~
twotwotwo
I think the new things are integrating the SVD output into the NN as a couple
of layers and then training it more, and adjusting the learning rate
differently from how it was done before. The training step will tend to tune
the compression to work well for the specific task.

Getting good results on the task is different from making the compressed
representation of the embedding better on all-purpose metrics (like average
magnitude of difference from the original embedding). For example, you might
end up with changes that improve representation of common words, or words
where errors actually mess up results, at the cost of making the decompression
worse by simple metrics.

(It wasn't clear to me if the training adjusts the _whole_ net's weights or
just the two layers they add up front. If it adjusts everything, you could
also get joint changes to the original net + decompression machinery that only
work well together.)
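
(For illustration, the "adjust the learning rate differently" part might look like this in PyTorch; the modules and rates here are invented:)

```python
import torch
import torch.nn as nn

# Hypothetical pieces: the two SVD-derived layers plus the rest of the net.
small_emb = nn.Embedding(5_000, 30)
proj = nn.Linear(30, 300, bias=False)
rest = nn.Linear(300, 2)  # stand-in for the original network body

# Separate learning rates per parameter group.
optimizer = torch.optim.SGD([
    {"params": small_emb.parameters(), "lr": 1e-3},  # new factorized layers
    {"params": proj.parameters(), "lr": 1e-3},
    {"params": rest.parameters(), "lr": 1e-4},       # pre-trained body: gentler updates
])
```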

~~~
irodov_rg
Training adjusts the whole net's weights. And you are right: we do get joint
changes to the original net and the decompression machinery that work
together.

------
josteink
Can someone translate the title into non-obfuscated language, please?

This is ridiculous.

~~~
rococode
NN = Neural Network

SOA = State of the Art

The rest should be clear: it's a new method that shrinks the size of neural
nets by 90% while dropping accuracy by less than 1%. The previous state of the
art for the same shrinking task dropped the accuracy by about 3.5%.

Compressing/shrinking neural nets is an important task because most edge
devices (e.g. smart devices like thermostats, security cams, etc.) don't have
the compute power to run large neural networks and need more compact ways to
do the same tasks. Currently the solution is to use internet connectivity to
query an API, but that's not ideal for a variety of reasons.

~~~
visarga
Not neural nets in general, just the word embedding layer. This layer tends to
be large, like 1,000,000 x 300, so there are problems deploying it to
production, especially in mobile environments. They use SVD to reduce the
inner dimension to something like 30, replacing the large layer with two
smaller ones, and then retraining.
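
(The arithmetic behind the ~90% figure, using those sizes:)

```python
V, d, k = 1_000_000, 300, 30
full = V * d              # 300,000,000 weights in the original table
factored = V * k + k * d  # 30,009,000 weights in the two smaller layers
print(1 - factored / full)  # ~0.9, i.e. a ~90% reduction
```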

Note that whole networks can also be compressed to similar levels with a
variety of techniques; this one is just slightly better for this specific type
of layer.

