Some notes from a first glance:
* In the experiments, I see that he actually also runs the Single Headed Attention model with 4 heads, which kind of contradicts the name, doesn't it?
* The main motivation is performance (mostly training speed), so some absolute numbers, e.g. training time, would be nice to have in the comparisons. He mentions, for example, that the Adaptive Transformer can also be trained on a single GPU within hours, and in the comparison the Adaptive Transformer gets much better BPC (enwik8) and even uses slightly fewer parameters. So isn't the Adaptive Transformer better in every respect (speed and BPC)? Or how does it compare in speed? As far as I remember, the Sparse Transformer is also more efficient (as it has sparsity), so again a speed comparison would be interesting here. Or is the argument about inference speed? But then inference speed should be compared, shouldn't it?
> The author's lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly different acronym and slightly different result.
I grant that not all deep learning papers can be reproducible on a single GPU in a reasonable time, but it should happen more often IMO. It seems lazy to just toss out a paper saying "we hit new benchmarks by increasing the parameters and throwing more compute at it". I'd rather see "we hit new benchmarks with a new design; the old ones had this issue", etc.
Anyway, great read, recommend. Also, happy for the author haha
"The author has also moved to a one bedroom apartment in
San Francisco, removing themselves from proximity to the
alley of questionable odors and unsavory noises."
1) Reducing the density of information per paragraph (vs. packing information in)
2) Clearly outlined motivation and context (vs. just referencing some other papers and assuming they've been read)
3) Deploying comedy (vs. professionalism)
The first two improve the paper, the third is a step backwards because the reader has to spend effort separating fact & fiction. The combination in this case works and is a lot of fun but it would have been a catastrophic and cringeworthy exercise if the execution of (1) and (2) hadn't worked out so well.
The real trick here is excellent writing; the comedy simply draws attention to it. Much like how an army marching in a silly way draws attention to its discipline: the silly march itself is not a good idea.
His work on QRNNs saved me quite a bit of time and money when I was doing my undergrad dissertation on language models.
This SHA-RNN seems to have surfaced from a similar line of thinking that spawned the QRNN.
The main disadvantage of word-level models is the large vocabulary size; however, the tweet completely ignores the advantage: the sequence length becomes shorter, so the model only has to look a few tokens back to find the references to "Bob" and "Alice".
The same model writes more sensible sentences at word level than at character level. There's a tradeoff between a larger vocabulary and modelling longer dependencies. A model which can encode a text document more effectively is better; tokenization is just part of the modelling. You just need to take care of the "per word" part of "perplexity per word", i.e. normalize both models' total bits by the same unit (such as characters), and you can directly compare their performances.
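To make that concrete, here's a minimal sketch of the conversion (the corpus counts below are made up purely for illustration):

```python
import math

def word_ppl_to_bpc(word_ppl: float, num_words: int, num_chars: int) -> float:
    """Convert word-level perplexity to bits-per-character.

    The total bits spent on a text are the same regardless of tokenization:
    num_words * log2(word_ppl). Dividing by the character count yields a
    number directly comparable to a character-level model's BPC.
    """
    total_bits = num_words * math.log2(word_ppl)
    return total_bits / num_chars

# Hypothetical corpus statistics, for illustration only.
print(word_ppl_to_bpc(word_ppl=60.0, num_words=200_000, num_chars=1_100_000))
# -> ~1.07 BPC
```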
The author is wrong that entropy collapses once the "A" of "Alice" is given. Entropy will only collapse if the model has really "understood" the context and modelled that "Bob" and "Alice" are the only options here. The entropy won't collapse for a sentencepiece-based bi-gram model, for example.
In his example, it is not clear that the wordpiece model is at an advantage. Suppose both models "understand" that there are only two options, "Bob" and "Alice". Then the word-level model only has to predict one token, which can be either of the names: probability 0.5, i.e. 1 bit. The sentencepiece model also has to choose between the two tokens "B" and "A"; the second token won't add anything, since it'll be known: again probability 0.5, i.e. 1 bit.
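A quick sanity check of that accounting, under the same assumption that both models assign probability 0.5 to the correct continuation:

```python
import math

# Assumed setup: both models have narrowed the continuation to {Bob, Alice}.

# Word-level: one token, "Alice", predicted at p = 0.5.
word_level_bits = -math.log2(0.5)  # 1.0 bit

# Subword-level: first piece ("A" vs. "B") at p = 0.5; the remaining
# pieces of "Alice" are then forced (p = 1.0) and contribute 0 bits.
subword_bits = -math.log2(0.5) - math.log2(1.0)  # 1.0 bit

print(word_level_bits, subword_bits)  # both spend exactly 1 bit
```

Neither tokenization comes out ahead; the total bits spent on the name are identical.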
What is the benefit of the RNN here?
The author doesn't do much absolute wall-time comparison, but does mention that only the Adaptive Transformer configuration trained in a similar time on a single GPU.
I'd say that all of these are factors that neither add to nor detract from the value of the paper itself - it's a "hey, I tried this and it works OK despite not going in the obvious direction". So: limited experiments, but IMO competently done and with usable information.
It's a pity that all papers nowadays have a gazillion authors, from well-funded research labs, with as-dry-as-possible language that hides the real research behind a "we knew this all along rather than figuring it out along the way" facade. OTOH that's what you get in a large fairly mature research field, where most competent people get hired by research labs and then do lots of collaborative research that scales well and subsequently need to show publication counts to secure further funding.
I strongly prefer papers written in this style. Not only are they more enjoyable to read, but they are often easier to understand and more genuine as well. Papers written in a formal style often obscure the real motivation and instead provide a fancy-sounding retroactive justification. It makes the authors feel smarter, and I guess some readers feel smarter as well, but it belies the reality of research.
What’s wrong with single authorship and no affiliation?
At the end of the day, if the paper proposes some idea or method, and achieves the stated claims (with reproducible code), then I don’t care who wrote it, how many authors there were, and who the authors work for.
He has worked on YOLO (computer vision) and NLP-related problems.