
Reformer, the Efficient Transformer - davidfoster
https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html
======
gwern
Discussion & links to various implementations:
[https://www.reddit.com/r/MachineLearning/comments/eg1wr3/ref...](https://www.reddit.com/r/MachineLearning/comments/eg1wr3/reformer_the_efficient_transformer_anonymous_et/)

------
occamrazor
The demonstration on images is underwhelming at best. It is only marginally
better than extending the bottom row of pixels vertically.

~~~
anentropic
The same goes for the demos of completing a phrase based on Crime &
Punishment... having all recently seen the text-completion demos of GPT-2, the
Reformer examples are decidedly underwhelming.

I mean, I'm sure it's a great new technique and all

------
lapink
There is no argument for why the LSH would work well, especially at the
beginning of training. As the weights are initially random, bucket assignment
would be random as well. If predicting at position A requires info from
position B, but they are not in the same bucket, there will be no gradient to
get the query embedding of A closer to the key embedding of B. The reversible
layer trick is neat though.

~~~
gwern
Why is that any worse than, say, starting with randomly initialized weights in
general?

~~~
MiroF
I haven't read this paper yet, but to answer your question:

because bucket choice is a discrete decision - and discrete decisions are hard
to pass gradients through

~~~
gwern
But you aren't 'making decisions' to pass gradients through them at all. It's
just fixed random projections AFAICT
([https://openreview.net/pdf?id=rkgNKkHtvB#page=3](https://openreview.net/pdf?id=rkgNKkHtvB#page=3)).
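
The hashing itself (as described in §3 of the paper) is just an argmax over
those fixed random projections. A minimal sketch, assuming d-dimensional
query/key vectors and an even number of buckets (variable names mine):

```python
import numpy as np

def lsh_bucket(x, R):
    """Angular LSH as described in the paper: h(x) = argmax([xR; -xR]).
    R is a fixed random projection matrix -- nothing here is trained
    or differentiated through."""
    proj = x @ R                                   # shape (n_buckets // 2,)
    return int(np.argmax(np.concatenate([proj, -proj])))

d, n_buckets = 64, 8
rng = np.random.default_rng(0)
R = rng.normal(size=(d, n_buckets // 2))           # frozen at initialization
q = rng.normal(size=d)
print(lsh_bucket(q, R))  # nearby vectors map to the same bucket w.h.p.
```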

~~~
MiroF
Okay, I've now done a brief perusal of the paper.

Perhaps I'm wrong, but it seems to me that deciding on the bucket is the
discrete decision. If you have two "words"/"contexts" in a sequence that ought
to attend to each other, but they don't get bucketed together early in
training, then there is no gradient pushing those two hidden states to be
close to each other, because there is no comparison being done between the two
contexts.

In a standard transformer, on the backprop we can see something like "oh, you
would have been quite closer to the correct answer on this sentence if you had
matched the context for 'dog' with the context for 'treat' about 20 words
back." But, here, if 'dog' doesn't get bucketed with 'treat', then there's no
such gradient pressure.

Eventually (and with enough hashing+bucketing), the embedding of the more
relevant contexts will move closer together, but I'd suspect this might occur
more slowly. Here's the authors describing the process:

> We don’t differentiate through the hash bucket assignment procedure, or the
> choice of what order to sort the items into. Rather, these operations take
> query/key vectors as input where LSH maps nearby vectors to the same bucket
> with high probability. Therefore, the sorting re-adjusts any time parameter
> updates to cause relevant vector pairs to have higher dot product, and
> “unhelpful” vector pairs to have lower dot products.

Edit: And here is a reviewer noting what I suspected about the number of
gradient updates:

> the performance achieved by the proposed method after 140k iterations is
> achieved by the full attention after ~40k iterations [on imagenet64]
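
To make the gradient blocking concrete, here's a toy version of within-bucket
attention (illustrative only - not the paper's chunked/sorted implementation):
the dot product between a query and a key in a different bucket is simply
never computed, so no gradient can ever flow between that pair.

```python
import numpy as np

def bucketed_attention(Q, K, V, buckets):
    """Attend only within each hash bucket. Cross-bucket pairs are
    never compared, so they contribute nothing to the loss."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]            # positions in bucket b
        scores = Q[idx] @ K[idx].T / np.sqrt(d)    # within-bucket only
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[idx] = w @ V[idx]
    return out
```

If 'dog' and 'treat' land in different buckets, their score never appears in
`scores`, which is the missing gradient pressure described above.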

~~~
octbash
Yes, the Reformer is basically trading noisier training for speed and memory
savings.

~~~
MiroF
Yep, and I'm not saying it's a bad approach! Just trying to answer "why is
that any worse than, say, starting with randomly initialized weights in
general?" wrt gradient passing.

I'm not sure I'd agree with the "noisy" characterization - which to me implies
stochasticity - whereas this is just blocking off the flow of gradient
information to save memory.

------
sillysaurusx
One neat trick is that you can extend GPT-2 117M's context window from 1024 up
to 30k on a TPU, since TPUs can allocate up to 300GB of memory for backprop.
[https://twitter.com/gwern/status/1218001309435072513](https://twitter.com/gwern/status/1218001309435072513)

It's not quite 1M words, but a 30k context window is big enough for e.g. most
MIDI songs.
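
Back-of-the-envelope for why ~30k is about where it tops out (my arithmetic,
assuming attention scores are kept in bfloat16 and ignoring all other
activations):

```python
# GPT-2 117M: 12 layers x 12 heads, one L x L score matrix per head.
layers, heads, L = 12, 12, 30_000
bytes_per_score = 2  # bfloat16 on TPU
attn_gb = layers * heads * L**2 * bytes_per_score / 1e9
print(f"{attn_gb:.0f} GB")  # ~259 GB, close to the ~300 GB ceiling
```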

------
darawk
This seems like a big deal. An asymptotic reduction in the resource explosion
created by larger attention windows should allow the development of
substantially more complex models here.
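
To put numbers on the asymptotics (using the complexities stated in the paper
- full attention O(L^2), LSH attention O(L log L)):

```python
import math

for L in (1_024, 30_000, 1_000_000):
    ratio = (L * L) / (L * math.log2(L))
    print(f"L = {L:>9,}: full / LSH cost ratio ~ {ratio:,.0f}x")
```

At a million tokens that's roughly a 50,000x gap, which is why quadratic
attention is the binding constraint.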

------
overlords
Vowpal Wabbit has been doing this 'hashing trick' since the 2000s.

It also has feature interactions, which are the same thing as a layer in
transformers (an all-against-all matrix).

So it seems like they are still catching up to where John Langford and crew
were over a decade ago.

And, the vowpal wabbit approach is extremely fast to train because it's only
doing stochastic gradient descent on a linear function - linear regression.
Transformers are much slower to train.

EDIT: Downvoters, please see my last comment below for why they're effectively
the same. The guy responding here seems unfamiliar with all the functionality
of vowpal wabbit.

~~~
overlords
Downvoters, please see

[http://matpalm.com/resemblance/simhash/](http://matpalm.com/resemblance/simhash/)

[https://en.wikipedia.org/wiki/SimHash](https://en.wikipedia.org/wiki/SimHash)

SimHash, a type of locality-sensitive hashing - applying hash functions to
n-grammed data.

That is exactly what Vowpal Wabbit does.
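
A minimal SimHash sketch, following the description at the links above
(Python's built-in `hash` standing in for a proper 32-bit hash):

```python
def simhash(tokens, bits=32):
    """Near-duplicate token sets produce fingerprints differing
    in only a few bits."""
    sums = [0] * bits
    for tok in tokens:
        h = hash(tok)  # stand-in; use a stable 32-bit hash in practice
        for i in range(bits):
            sums[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if sums[i] > 0)

print(bin(simhash("the quick brown fox".split())))
```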

~~~
overlords
I'm going to write this out more clearly, because I'm still getting downvotes
for my correct answer.

Why neural networks?
[https://en.wikipedia.org/wiki/Universal_approximation_theore...](https://en.wikipedia.org/wiki/Universal_approximation_theorem)

Can polynomials do this? (Yes)
[https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theo...](https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem)

What is a transformer and attention? [https://pathmind.com/wiki/attention-mechanism-memory-network](https://pathmind.com/wiki/attention-mechanism-memory-network)

Attention = polynomial (x^2, x^3, etc.)

Polynomial = interaction. VW flag: --interactions

1-layer transformer = x·x (x^2)

2-layer transformer = x·x·x (x^3)

3 ... etc.

What is reformer? Transformer where LSH is applied.

One type of LSH is SimHash: n-grams of strings, followed by a 32-bit hash.

Vowpal Wabbit: the --ngram flag for n-grams.

Run vw with --interactions and --ngram 2/--ngram 3 and you get n-grams + a
32-bit hash doing SGD over a weight vector (sketched below).

This vector is equivalent to a 2-layer reformer.
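
A rough Python rendering of that recipe (my sketch, not an actual vw
invocation - feature hashing, n-grams, and pairwise interactions feeding SGD
on a linear model):

```python
import numpy as np
from itertools import combinations

D = 2 ** 20           # hashed feature space
w = np.zeros(D)       # the single weight vector

def features(tokens, n=2):
    """Unigrams + n-grams (like vw's ngram option) + pairwise
    interactions (like vw's interactions option), hashed into D slots."""
    feats = list(tokens)
    feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    feats += [a + "*" + b for a, b in combinations(tokens, 2)]
    return [hash(f) % D for f in feats]

def sgd_step(tokens, label, lr=0.1):
    idx = features(tokens)
    pred = 1 / (1 + np.exp(-w[idx].sum()))   # logistic regression
    np.add.at(w, idx, -lr * (pred - label))  # handles repeated indices
```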

Non-linear activation is not needed because polynomials are already nonlinear.

So vw + interactions + ngrams (almost) = reformer encoder. (If the Reformer
used SimHash, they would be identical.)

Transformer/Reformer have an advantage: the encoder-decoder can learn from
unlabeled data.

However, you can get similar results from unlabeled data using preprocessing
such as introducing noise into the data and then treating it as
noise/non-noise binary classification. (It can even be thought of as
reinforcement learning, with the 0-1 labels as the reward, using vw's
contextual-bandits functionality. This can then do what GANs do - climb from
noise to perfection.)

~~~
visarga
> This vector is equivalent to a 2 layer reformer.

There is no feed-forward layer, no skip connections, and no layer
normalization in VW. In the Reformer, hashing is followed by dot products; in
VW, hashing just collides some tokens together, followed by a linear layer.

Also, 2 layers of transformer is a little shallow. In practice it's 12-14
layers or more.

In order to be equivalent, there would need to be equally good results on
translation from VW, but I've never seen it used for translation. I'm
wondering why?
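
Concretely, these are the pieces a hashed linear model doesn't have (a
schematic single-head block, my notation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """One single-head block: attention + feed-forward,
    each wrapped in a skip connection and layer norm."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    a = np.exp(scores - scores.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    x = layer_norm(x + a @ v)          # skip connection + layer norm
    ffn = np.maximum(0, x @ W1) @ W2   # position-wise feed-forward (ReLU)
    return layer_norm(x + ffn)         # second skip + norm
```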

~~~
overlords
- Hashing followed by dot product in the transformer, you said.

- You were doing dot products at each layer to introduce non-linearity in the
transformer (and neural nets in general). Polynomials are already non-linear,
so you don't need that. Transformer and vw --interactions are polynomials.
Maybe the feed-forward layers and skip connections are not actually needed.

- 12 layers? vw --interactions xxxxxxxxxxxxx is 12 layers. You need a lot of
memory for that, but in principle vw interactions can do any number of them.

These results are coming from Google and their massive compute resources. If
they ran vw with interactions up to x^13, they might get similar results.

We're really talking about polynomial approximation here, in both the
transformer and vw used in this way. And that is in theory able to approximate
any continuous function (just like neural networks).

------
foota
I wonder if this could be used for the Wikipedia compression challenge?

~~~
gwern
Probably, but it seems like it would run into diminishing returns except on
the longest articles, because AFAIK the articles in wikitext are provided
alphabetically, and consecutive articles may have little or nothing to do with
each other, rendering the very wide window pointless.

~~~
foota
Eh? My understanding from the article was that 'long' is anything beyond a
couple of paragraphs; many (though maybe a minority by count) of the Wikipedia
pages are much longer than that.

~~~
gwern
Last I saw any Wikipedia statistics, the average page was ridiculously small,
like a few paragraphs. People just happen to not spend much time reading
'stubs', is all, but you can get an idea by spending some time on
Special:Random. So, most WP articles are something that would fit entirely
into many architectures' windows: for example, GPT-2 at 1024 context is
roughly 3k characters (and you can easily scale GPT-2 way beyond that, see
sillysaurusx's comment above - we're training a GPT-2 with a context window of
25k right this second). Chonky paragraphs!

Reformer's advantage would come only from the subset of articles longer than
that, and only from the improvement in prediction from the subset of
characters out of window at the beginning of the article in trying to predict
toward the end of the article. And then you have the article boundaries which
largely 'reset' the memory. Reformer's advantage then would have to come from
the chance that there is a relevant article somewhere accidentally
alphabetically close enough to be in its window while predicting the current
article.

~~~
foota
Maybe you could do a topological sort over references between articles?
There are certainly cycles, but those could be broken arbitrarily.
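
Something like Kahn's algorithm with stuck cycles cut arbitrarily would do it
(a sketch; `links` is a hypothetical dict mapping each article to the articles
it references):

```python
from collections import deque

def topo_order(links):
    """Kahn's algorithm; when only cycles remain, release an
    arbitrary node to break the cycle."""
    indeg = {a: 0 for a in links}
    for refs in links.values():
        for b in refs:
            indeg[b] = indeg.get(b, 0) + 1
    queue = deque(a for a, d in indeg.items() if d == 0)
    remaining, order = set(indeg), []
    while remaining:
        if not queue:                         # cycle: break it arbitrarily
            queue.append(next(iter(remaining)))
        a = queue.popleft()
        if a not in remaining:
            continue
        remaining.discard(a)
        order.append(a)
        for b in links.get(a, ()):
            indeg[b] -= 1
            if indeg[b] == 0:
                queue.append(b)
    return order

print(topo_order({"A": ["B"], "B": ["C"], "C": ["A"]}))  # cycle handled
```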

~~~
gwern
You could, and it would be an interesting challenge to come up with the
sorting which makes the data most compressible. (The question of how to sort
data for optimum compressibility is something I've messed around with a bit:
[https://www.gwern.net/Archiving-URLs#sort---key-compression-trick](https://www.gwern.net/Archiving-URLs#sort---key-compression-trick).
An interesting tool designed just for this is 'binsort':
[http://neoscientists.org/~tmueller/binsort/](http://neoscientists.org/~tmueller/binsort/).)
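
In miniature, the greedy version of that idea (a toy sketch with bag-of-words
overlap as the similarity measure; binsort itself is more sophisticated):

```python
def greedy_order(docs):
    """Chain documents so each is placed next to the remaining one it
    shares the most words with -- similar neighbors help windowed
    compressors like gzip. O(n^2), toy quality."""
    sets = [set(d.split()) for d in docs]
    remaining = set(range(len(docs)))
    order = [remaining.pop()]
    while remaining:
        last = sets[order[-1]]
        nxt = max(remaining, key=lambda i: len(sets[i] & last))
        remaining.discard(nxt)
        order.append(nxt)
    return [docs[i] for i in order]
```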

But it would also be completely different from (and easier than) the existing
benchmark of wikitext, which is what everyone uses and judges NN natural
language modeling progress by, so there wouldn't be too much research interest
in it, and it's not clear how useful it would be. After all, AIs don't get to
reorganize the entire universe to make inputs come in the most convenient
order for compression.

------
The_rationalist
How does accuracy compare on NLP tasks vs. XLNet? If we can have XLNet
accuracy and fast inference on a single GPU, that would be revolutionary!

------
the8472
This looks like building blocks from cryptography inspiring ML.

------
4gotunameagain
Would it be reasonable to add something to the title so it's clear it has
nothing to do with electronics? Maybe it's just me.

~~~
bluescrn
Nothing to do with electronics, or with robots in disguise...

------
silvestrov
so many smart people and still using fuzzy PNG instead of SVG

~~~
bowmessage
The input examples are photographs; how can one take an SVG photograph?

