
Single Headed Attention RNN: Stop Thinking With Your Head
https://arxiv.org/abs/1911.11423
======
albertzeyer
The writing style is amusing. :)

Some notes from a first glance:

* In the experiments, I see that he actually also uses the Single Headed Attention model with 4 heads, which kind of contradicts the name, doesn't it?

* The main motivation is performance (mostly training speed). So some absolute numbers, e.g. training time, would be nice to have in the comparisons. He mentions, for example, that the Adaptive Transformer can also be trained on a single GPU within hours, and in the comparison the Adaptive Transformer gets a much better BPC (enwik8) and even uses slightly fewer parameters. So isn't the Adaptive Transformer better in every aspect (speed and BPC)? Or how does it compare in speed? As far as I remember, the Sparse Transformer is also more efficient (due to its sparsity), so again a speed comparison would be interesting here. Or is the argument about inference speed? But then inference speed should be compared, shouldn't it?

~~~
dual_basis
I don't think that was his motivation; I think his motivation was stated quite
clearly in the abstract:

> The author's lone goal is to show that the entire field might have evolved a
> different direction if we had instead been obsessed with a slightly
> different acronym and slightly different result.

------
citilife
Honestly, I wish all research papers were written this way. Easy to
understand, kept me entertained, and presented meaningful results with a way
to reproduce (on a single GPU).

I grant that not all research papers on deep learning can be reproduced on a
single GPU in a reasonable time, but it should happen more often IMO. It seems
lazy to just toss out a paper saying "we hit new benchmarks by increasing the
parameters and throwing more compute at it". I'd like to see "we hit new
benchmarks with a new design; the old ones had this issue", etc.

Anyway, great read, recommend. Also, happy for the author haha

"The author has also moved to a one bedroom apartment in San Francisco,
removing themselves from proximity to the alley of questionable odors and
unsavory noises."

~~~
LudwigNagasena
Now imagine reading papers is your job and you have to skim through dozens of
wannabe stand-up comedians each day.

~~~
6gvONxR4sf7o
Is dozens of papers per day how academics work? Holy crap.

~~~
throwlaplace
Notice the word skim. Yes, when I'm trying to figure something out I often have
10 tabs with papers open and I'm flipping between them.

------
czr
for those who aren't familiar with the author, he previously worked at
metamind / salesforce research doing nlp and has published many successful nlp
papers [0]. he opted to write an informal paper for this project (similar to
yolov3 [1]), but the work itself should still be taken seriously.

[0]
[https://scholar.google.com/citations?user=AolIi4QAAAAJ](https://scholar.google.com/citations?user=AolIi4QAAAAJ)

[1]
[https://pjreddie.com/media/files/papers/YOLOv3.pdf](https://pjreddie.com/media/files/papers/YOLOv3.pdf)

------
1maginary
You just have to love Stephen Merity.

His work on QRNNs saved me quite a bit of time and money when I was doing my
undergrad dissertation on language models.

This SHA-RNN seems to have surfaced from a similar line of thinking that
spawned the QRNN.

~~~
technics256
Are QRNNs still used much?

~~~
polymorph1sm
check out MultiFiT [0] from fastai; it uses QRNN for speed.

[0]
[http://nlp.fast.ai/classification/2019/09/10/multifit.html](http://nlp.fast.ai/classification/2019/09/10/multifit.html)

------
lopuhin
The paper raises a great point about tokenization affecting perplexity: we
can't compare perplexities across different tokenizers, say BPE vs. word
tokenization, even after re-normalizing by token counts. This example nails
it:
[https://twitter.com/Smerity/status/1192252147598909441](https://twitter.com/Smerity/status/1192252147598909441)

~~~
pcwelder
I don't see his point. Doesn't renormalizing by token counts essentially
eliminate the effect of tokenization? The perplexity we then get is
essentially representative of how well a model compresses the test document.
Isn't that the whole point? A better model compresses the document better; how
does it matter whether you model each character, each word, bigrams, or even
the bits directly?

The main disadvantage of word-level models is a large vocabulary size, but the
tweet completely ignores the advantage: the sequence length becomes shorter,
so the model only has to look a few tokens back to find the reference to "Bob"
and "Alice".

The same model at word level writes more sensible sentences than at character
level. There's a tradeoff between a larger vocabulary and modelling longer
dependencies. A model which can encode a text document more effectively is
better; tokenization is just part of the modelling. You just need to take care
of the "per word" part of "perplexity per word" (i.e. normalize by the number
of words or characters) and then you can directly compare their performances.
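
Concretely, by renormalizing I just mean converting token-level perplexity
into a per-character (or per-word) quantity. A minimal sketch in Python, with
made-up numbers and function names of my own:

```python
import math

def bits_per_char(token_ppl: float, n_tokens: int, n_chars: int) -> float:
    """Convert token-level perplexity to bits per character.

    Total NLL in nats is n_tokens * ln(token_ppl); dividing by
    n_chars * ln(2) gives bits per character, which is comparable
    across different tokenizations of the same text.
    """
    return n_tokens * math.log(token_ppl) / (n_chars * math.log(2))

# The same (hypothetical) text scored under two tokenizers:
print(bits_per_char(token_ppl=60.0, n_tokens=1000, n_chars=5000))  # word-level
print(bits_per_char(token_ppl=3.2,  n_tokens=4500, n_chars=5000))  # BPE
```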

The author is wrong that entropy collapses once the "A" of "Alice" is given.
Entropy will only collapse if the model has really "understood" the context
and modelled that "Bob" and "Alice" are the only options here. The entropy
won't collapse for a sentencepiece-based bigram model, for example.

In his example, it is not clear that the wordpiece model is at an advantage.
Suppose both models "understand" that there are two options, "Bob" and
"Alice". Then the word-level model only has to predict one token, which can be
either of the names: probability 0.5, i.e. perplexity 2 on that token. The
sentence-piece model _also_ has to choose between two pieces, "B" and "A",
with probability 0.5; the second piece won't add to the perplexity since it'll
be known. Again perplexity 2, and the same total likelihood.
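
And a quick numeric check of that claim (my own toy numbers, not from the
tweet or the paper): as long as the pieces after the first are fully
determined, splitting the name doesn't change the total log-likelihood, so the
renormalized perplexity matches too.

```python
import math

# Word-level: "Alice" is a single token predicted with p = 0.5.
word_nll = -math.log(0.5)                     # ~0.693 nats for the whole name

# Piece-level: "A" with p = 0.5, then "lice" with p = 1.0 (fully determined).
piece_nll = -math.log(0.5) - math.log(1.0)    # also ~0.693 nats

n_chars = len("Alice")
print(word_nll / n_chars, piece_nll / n_chars)  # identical per-character NLL
```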

~~~
lopuhin
Good point; assuming some extent of collapse is crucial, the question is
whether different perplexities due to tokenization can happen in principle.
You are right that in the "Alice" vs. "A|lice" example we get the same
perplexity after re-normalization; I can't come up with an example where it
would be different right now.

------
MiroF
Perhaps I am missing the point of this article. The RNN approach seems to get
similar performance, but uses more parameters and misses the parallelization
benefits that Transformers have.

What is the benefit of the RNN here?

~~~
vsef
The parallelism in a transformer doesn't necessarily translate to less or
faster compute. Each layer has to be computed serially after the previous
layer, and the computation of each attention head is quadratic in the length
of the input sequence. When used this way for language modeling, the
transformer also has to be run step by step for inference, so the parallelism
that was a boon during training is no longer available.
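
To make the asymptotics concrete, here's a rough back-of-the-envelope sketch
(my own toy estimates, not numbers from the paper): per layer, self-attention
is roughly O(n²·d) in sequence length n plus O(n·d²) for the projections,
while an RNN costs O(n·d²) total but its n steps must run serially.

```python
# Rough per-layer cost estimates (constants ignored); toy numbers, not from the paper.
# n = sequence length, d = model width.

def attention_cost(n: int, d: int) -> int:
    # QK^T and the attention-weighted sum are ~n^2 * d;
    # the query/key/value/output projections add ~n * d^2.
    return n * n * d + n * d * d

def rnn_cost(n: int, d: int) -> int:
    # One matrix-vector recurrence per step: ~n * d^2 in total,
    # but the n steps cannot be parallelized across the sequence.
    return n * d * d

for n in (512, 2048, 8192):
    print(n, attention_cost(n, d=1024), rnn_cost(n, d=1024))
```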

The author doesn't do much absolute wall-time comparison, but does mention
that only the adaptive transformer configuration trained in a similar amount
of time on a single GPU.

------
lucidrains
Another work in the opposite direction, introducing gating in Transformer-XL:
[https://arxiv.org/abs/1910.06764](https://arxiv.org/abs/1910.06764)

------
octocop
Hilarious paper, I'm about to drop a SHA-RNN on my GPU to make it sweat.

------
sbpayne
Did anyone one else read "SHA-RNN" as "SHHHAAAAARRRROOOOONNN" in Ozzy's voice?

------
reubens
Now that was some refreshing reading.

------
Dasemu
ok

------
madenine
Now I really want pop music made by Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin

------
toxik
A dissenting voice amid the positive reception here on HN: I thought this
paper was a joke. Single author, no affiliation, snarky language. Why not be
civil instead?

~~~
sqrt17
> Single author, no affiliation, snarky language.

I'd say that all of these are factors that neither add to nor detract from the
value of the paper itself - it's a "hey, I tried this and it works OK despite
not going in the obvious direction". So, limited experiments, but IMO
competently done and with usable information.

It's a pity that all papers nowadays have a gazillion authors from well-funded
research labs, with as-dry-as-possible language that hides the real research
behind a "we knew this all along rather than figuring it out along the way"
facade. OTOH, that's what you get in a large, fairly mature research field,
where most competent people get hired by research labs, do lots of
collaborative research that scales well, and subsequently need to show
publication counts to secure further funding.

~~~
6gvONxR4sf7o
It's a shame that professionalism and showing personality are so at odds all
over the place, from papers to the workplace. For the most part, professional
has aligned with formal. It's clear why, but still sad :(

~~~
toxik
Why is it sad? The whole point of professionalism is disaffective
communication.

