The Unreasonable Effectiveness of Recurrent Neural Networks (2015)

gwern · on Feb 3, 2019

If this were written today, Karpathy would have to call it "The Unreasonable Effectiveness of Convolutions". Since 2015, convolutions, causal or dilated convolutions, and especially convolutions with attention like the Transformer, have made remarkable inroads onto RNN territory and are now SOTA for most (all?) sequence-related tasks. Apparently RNNs just don't make very good use of that recurrency & hidden memory, and the non-locality & easy optimization of convolutions allow for much better performance through faster training & bigger models. Who knew?

swairshah · on Feb 3, 2019

I don't understand what you mean by 'non-locality of convolutions'. Isn't convolution inherently a local operation? This probably being one of the main reasons that CNNs are biased towards texture [0] and not shapes?

[0] https://openreview.net/forum?id=Bygh9j09KX

gwern · on Feb 3, 2019

Convolutions in a hierarchy of layers, especially with dilated convolutions, provide long-range connections between inputs (handwavily logarithmic). In an RNN, they are separated by however many steps in a linear way, so gradients more easily vanish. Some paper which I do not recall examined them side by side and found that RNNs quickly forget inputs, even with LSTMs, and this means their theoretically unlimited long-range connections between inputs via their hidden state don't wind up being that useful.
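To put rough numbers on the "handwavily logarithmic" part, here is a minimal sketch that counts how many stacked 1-D convolution layers it takes before one output can see a given span of inputs. The helper names and the kernel size of 2 are my own assumptions for illustration, not taken from any particular paper.

```python
def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output position
    after stacking 1-D convolutions with the given dilation rates."""
    field = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation.
        field += (kernel_size - 1) * d
    return field


def layers_needed(kernel_size, span, doubling):
    """Count the layers needed until the receptive field covers `span` inputs."""
    dilations = []
    while receptive_field(kernel_size, dilations) < span:
        # Dilation either doubles per layer (1, 2, 4, ...) or stays at 1.
        dilations.append(2 ** len(dilations) if doubling else 1)
    return len(dilations)


print(receptive_field(2, [1, 2, 4, 8]))        # 16
print(layers_needed(2, 1024, doubling=True))   # 10
print(layers_needed(2, 1024, doubling=False))  # 1023
```

With doubling dilations the depth grows with the log of the span; without dilation it grows linearly, which is the same scaling as the number of steps separating two inputs in a plain RNN.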

trott · on Feb 3, 2019

In an RNN, you could connect each hidden state at time step t, h(t) to h(t-N), instead of, or in addition to, h(t-1), making it analogous to dilated convolutions, but with hidden-to-hidden connections at the same layer.
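A minimal sketch of what that could look like, assuming a plain tanh RNN with random weights; the names `skip_rnn`, `W_1` and `W_n` are made up for illustration and this is not the formulation of any specific paper.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def skip_rnn(inputs, W_x, W_1, W_n, skip):
    """Run a tanh RNN where h(t) depends on x(t), h(t-1) and h(t-skip)."""
    zero = [0.0] * len(W_1)
    states = []
    for t, x in enumerate(inputs):
        h_prev = states[t - 1] if t >= 1 else zero
        h_skip = states[t - skip] if t >= skip else zero
        pre = [
            a + b + c
            for a, b, c in zip(matvec(W_x, x), matvec(W_1, h_prev), matvec(W_n, h_skip))
        ]
        states.append([math.tanh(p) for p in pre])
    return states


def shortest_path(distance, skip):
    """Fewest recurrent hops between two states `distance` steps apart
    when hops of length 1 and length `skip` are both available."""
    return distance // skip + distance % skip


rng = random.Random(0)
hidden, features, skip = 4, 3, 8
W_x = random_matrix(hidden, features, rng)
W_1 = random_matrix(hidden, hidden, rng)
W_n = random_matrix(hidden, hidden, rng)
inputs = [[rng.uniform(-1, 1) for _ in range(features)] for _ in range(20)]

states = skip_rnn(inputs, W_x, W_1, W_n, skip)
print(len(states), len(states[0]))                    # 20 4
print(shortest_path(64, 1), shortest_path(64, skip))  # 64 8
```

The extra connection shortens the path a gradient has to travel between distant time steps, though the computation is still sequential in t.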

So I don't think RNNs are fundamentally more myopic than CNNs (just that there may be practical advantages to using the latter)

Hierarchical RNNs, Clockwork RNNs and Hierarchical Multiscale RNNs and probably others are doing things of this nature.

gwern · on Feb 4, 2019

You could, but it's not equivalent, and no one seems to have been able to use clockwork RNNs or related archs to achieve similar performance, so the differences would seem to make a difference.

trott · on Feb 4, 2019

Right. I'm just saying that this myopia is not a fundamental property of the recurrence any more than of convolution.

Clockwork RNNs subsample, BTW, so they are more analogous to stride=2 in CNNs than to dilation.
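For anyone unsure of the stride/dilation distinction on the CNN side, a small sketch that lists which input indices each output position reads. The `conv_taps` helper is hypothetical and only does index bookkeeping, with no padding.

```python
def conv_taps(length, kernel_size, stride=1, dilation=1):
    """Return, for each output position, the input indices it reads (no padding)."""
    span = (kernel_size - 1) * dilation + 1
    return [
        [start + k * dilation for k in range(kernel_size)]
        for start in range(0, length - span + 1, stride)
    ]


strided = conv_taps(8, kernel_size=2, stride=2)
dilated = conv_taps(8, kernel_size=2, dilation=2)
print(strided)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(dilated)  # [[0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5, 7]]
```

Stride drops output positions (subsampling), while dilation keeps an output for every valid start position and only spreads out the taps.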

an_opabinia · on Feb 4, 2019

That’s an awful lot of woo to describe something “theoretical” in the sense of being imaginary but not theoretical in the sense of ever proven in a rigorous way mathematically.

Just some papers, you know?

We’re so screwed.

stochastic_monk · on Feb 3, 2019

Most benchmarks in [0] use LSTMs.

[0] https://nlpprogress.com

gwern · on Feb 3, 2019

Not sure how up to date that is. How many of those things currently listed as SOTA have been compared to, say, BERT baselines?

carbocation · on Feb 3, 2019

Under their language modeling section, they include the Transformer and have listed some papers from 2019, so it seems to be reasonably up to date.

stochastic_monk · on Feb 4, 2019

Furthermore, Transformer-XL extends the convolutional architecture to use some recurrence. I think that claims that RNNs have been wholesale replaced by convolutions are premature.

carbocation · on Feb 4, 2019

I can't speak for gwern, but I was also under the impression that CNNs had basically replaced RNNs for these tasks. I think my view is colored by Jeremy Howard's FastAI talks. Since he isn't big on publishing papers (though that is changing a bit via collaborations with his students), it may be true that the state of the art is no longer best represented by published papers. (I would argue that avoiding this ambiguity is an argument in favor of publication.)

phowon · on Feb 4, 2019

ULMFiT is LSTM-based. https://arxiv.org/pdf/1801.06146.pdf

carbocation · on Feb 6, 2019

(You are right! I'm going to be rewatching the lectures this spring, and it sounds like I should take notes this time.)

stochastic_monk · on Feb 4, 2019

It’s common to use CNNs in practice because they get you most of the way and are so much cheaper to train/run. They are, however, limited in context to the width of the receptive field.

mrcoder111 · on Feb 4, 2019

How do you handle variable length input without something like an RNN? Even transformers use RNN structures, right?

I suppose convolutions could technically handle variable length inputs (just slide the window of weights over different length inputs) but I don't think tensorflow or pytorch supports this

phowon · on Feb 4, 2019

>Even transformers use RNN structures, right?

Nope.

>How do you handle variable length input without something like an RNN?

Any form of pooling, really. Max, Avg, Sum. The tricky part is how to do the pooling while still taking advantage of the sequential structure of the input information. The Transformer-based models have shown that you can get away with providing very little order information and still go very far.

mrcoder111 · on Feb 4, 2019

Samcodes said it above. How do transformers build a shared representation of two input sentences with different lengths? If you convolve them with the same filter, you get two different sized convolution outputs - the embedding dimensions don't align.

phowon · on Feb 4, 2019

Like I said - pooling.

You can take the mean over 3 elements or 10 elements just the same. Pooling is lossy, but it seems that if you have the right architecture the model can still learn what it needs to.
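A tiny sketch of that point, using made-up 2-feature vectors: pooling over the time axis gives a fixed-size vector however many steps there are.

```python
def mean_pool(vectors):
    """Average each feature over all time steps."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors):
    """Take the maximum of each feature over all time steps."""
    return [max(column) for column in zip(*vectors)]


short = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # 3 time steps, 2 features
longer = short + [[0.0, 6.0]] * 7             # 10 time steps, 2 features

for sequence in (short, longer):
    # The pooled vectors always have one entry per feature.
    print(len(sequence), mean_pool(sequence), max_pool(sequence))
```

Both sequences come out as 2-element vectors, so whatever sits on top of the pooling never sees the original length. The loss is the obvious one: order within the sequence is gone unless the features feeding the pool already encode it.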

It's worth noting that the attention mechanism (at least in RNNs) has always been invariant to input lengths. It's a weighted sum with weights computed per element, so there's no length constraint at all.

mrcoder111 · on Feb 5, 2019

Can you share some paper names or links to architectures that demonstrate the length invariant convolution and attention?

phowon · on Feb 5, 2019

I'm not sure if you're understanding me correctly.

Attention is generally length invariant. You take some transformation on the hidden representations (/+ inputs) at each time step, and then you normalize over all the transformed values to get weights that sum to one. No part of this is constrained by length.

For CNNs, any network that has pooling has the potential to be length/dimension invariant. Whether it actually is is a combination of the architectural design and an implementation detail (e.g. some implementations when trying to pool will specifically define a pooling operation over, say, a 9x9 window. You could define the same pooling operation over a variable-dimension window).

The length/dimension invariance isn't a special or novel property. In the case of attention it's built in. In the case of CNNs, the convolutions are not length invariant, but depending on the architecture, the pooling operations are (or can be modified to be).
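A rough sketch of the CNN case, assuming a single-channel 1-D input, hand-picked toy filters, and a global max pool in front of the fully connected layer; none of this mirrors a specific library's API.

```python
def conv1d(seq, kernels):
    """Valid 1-D convolution of a list of floats with several equal-width filters.
    Returns one feature map per filter; each map's length depends on len(seq)."""
    width = len(kernels[0])
    return [
        [sum(w * seq[i + j] for j, w in enumerate(kernel))
         for i in range(len(seq) - width + 1)]
        for kernel in kernels
    ]


def global_max_pool(feature_maps):
    """Collapse each feature map to one number, whatever its length."""
    return [max(feature_map) for feature_map in feature_maps]


def dense(features, weights, bias):
    """Fully connected layer whose input size is the number of filters."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


KERNELS = [[1.0, -1.0, 0.0], [0.5, 0.5, 0.5]]  # two toy filters of width 3
WEIGHTS = [[1.0, 2.0]]                         # 2 pooled features -> 1 output
BIAS = [0.1]


def forward(seq):
    maps = conv1d(seq, KERNELS)
    pooled = global_max_pool(maps)
    return [len(m) for m in maps], pooled, dense(pooled, WEIGHTS, BIAS)


for seq in ([1.0, 2.0, 4.0, 3.0],
            [0.0, 1.0, 0.0, 2.0, 5.0, 1.0, 1.0, 3.0, 2.0]):
    map_lengths, pooled, output = forward(seq)
    print(len(seq), map_lengths, len(pooled), output)
```

The feature maps differ in length between the two inputs, but the pooled vector always has one entry per filter, so the dense layer's shape can be fixed ahead of time.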

mrcoder111 · on Feb 5, 2019

In order to get a variable length context, you need to add some machinery to some forms of attention. For example, in jointly learning to align and translate, the attention is certainly not invariant to number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over the fixed number of context vectors. You cannot train on images with 5 annotations/context vectors and expect anything to transfer to a setting with 10 annotations. That's why I would be interested in a specific paper to solidify what you're saying.

phowon · on Feb 8, 2019

>For example, in jointly learning to align and translate, the attention is certainly not invariant to number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over the fixed number of context vectors

That's not true.

You compute an attention weight across however many context steps you have by computing an interaction between some current decoder hidden state and every encoder hidden state, and normalizing over all of them via a softmax. There is no constraint whatsoever on a fixed context length or a fixed number of context vectors. See section 3.1 in the paper.
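As a minimal sketch of that computation, assuming an additive (tanh) scoring function with randomly initialized parameters; the names `W_q`, `W_k` and `v` are placeholders of mine, not the paper's notation.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def additive_attention(query, keys, W_q, W_k, v):
    """Score every key against the query, softmax over all keys,
    and return (weights, weighted sum of the keys)."""
    q_proj = matvec(W_q, query)
    scores = []
    for key in keys:
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, matvec(W_k, key))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(keys[0]))]
    return weights, context


rng = random.Random(0)
dim, attn = 4, 3
W_q = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(attn)]
W_k = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(attn)]
v = [rng.uniform(-1, 1) for _ in range(attn)]
query = [rng.uniform(-1, 1) for _ in range(dim)]

for n_keys in (5, 10):
    keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_keys)]
    weights, context = additive_attention(query, keys, W_q, W_k, v)
    # Same parameters, different number of context vectors.
    print(n_keys, len(weights), round(sum(weights), 6), len(context))
```

None of the parameter shapes depend on the number of encoder states; only the length of the weight vector does.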

I will be happy to discuss and clarify over email.

mrcoder111 · on Feb 12, 2019

Sounds good - can you send me an email? I put mine in my about

samcodes · on Feb 4, 2019

The hard part is that after the convolutions you want a fully connected layer or two, and to get those dimensions right you need to know the input dimensions. But, pytorch is building the graph at runtime, so maybe you could do this...

phowon · on Feb 3, 2019

Can you elaborate on how Transformers are "convolutions with attention"?

epberry · on Feb 3, 2019

I found this to be a great read on what I think the parent is referring to: https://lilianweng.github.io/lil-log/2018/06/24/attention-at...

gwern · on Feb 3, 2019

Self-attention uses 1x1 convolutions, and while the original Transformer is fully-connected-only (I think?) the latest & greatest uses convolutions (https://arxiv.org/abs/1901.11117), and even if you're using the original, it's very often being used in conjunction with convolutions elsewhere. So you could argue that the real title would have to be 'The Unreasonable Effectiveness of Attention', but given all the other stuff convolutions have done without attention like WaveNet, I think it's currently fairer to go with convolutions than attention. But we'll see how it goes over the next 4 years...

phowon · on Feb 4, 2019

A 1x1 convolution is such an edge case of convolution that it's really not worth discussing its inclusion as related to the success of the Transformer. Calling the Transformer "convolutions with attention" demonstrates a near-complete misunderstanding of the architecture.
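To make the edge case concrete, here is a small sketch in one dimension with toy weights, showing that a width-1 convolution computes the same thing as applying one linear layer independently at every position. The function names and the weight layout are assumptions for illustration only.

```python
def conv1d_channels(seq, kernel):
    """Valid multi-channel 1-D convolution.
    seq[i][c] is input channel c at position i;
    kernel[k][o][c] is the weight for tap k, output channel o, input channel c."""
    width = len(kernel)
    out = []
    for i in range(len(seq) - width + 1):
        out.append([
            sum(kernel[k][o][c] * seq[i + k][c]
                for k in range(width)
                for c in range(len(seq[0])))
            for o in range(len(kernel[0]))
        ])
    return out


def linear(x, W):
    """Plain matrix-vector product: one output per row of W."""
    return [sum(w * value for w, value in zip(row, x)) for row in W]


W = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]  # 3 input channels -> 2 output channels
seq = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0], [1.0, 1.0, 1.0]]

as_conv = conv1d_channels(seq, [W])        # kernel width 1: a single tap
as_linear = [linear(x, W) for x in seq]    # same weights applied per position
print(as_conv == as_linear)  # True
```

With a single tap there is no mixing across neighbouring positions, which is the part of a convolution that usually does the work.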

There's a reason the Transformer's original paper is entitled "Attention is All You Need": it threw out all the previous structures people assumed were necessary to solve these problems (recurrence from RNNs, local transformations from convolutions) and just threw multiple layers of large multi-headed attention at the problem and got even better results.

gwern · on Feb 4, 2019

> A 1x1 convolution is such an edge case of convolution that it's really not worth discussing its inclusion

It is, nevertheless, still a convolution, and calls to convolution code are how self-attention is implemented. Look inside a SAGAN or something and you'll see the conv2d calls.

> Calling the Transformer "convolutions with attention" demonstrates a near-complete misunderstanding of the architecture.

You're reading that in an overly narrow way and imputing to me something I didn't mean. And again, the original Transformer may not be used in conjunction with convolutions, but it often is, and the best current variant uses convolutions internally and so involves convolutions no matter how you want to slice it. Attention is a powerful construct, but convolutions are pretty darn powerful too, it turns out, even outside images.

phowon · on Feb 4, 2019

>Look inside a SAGAN or something and you'll see the conv2d calls.

...Yes, because SAGANs operate on images, so the foundational operation is a convolution.

>You're reading that in an overly narrow way and imputing to me something I didn't mean.

You characterized the Transformer as "convolutions with attention". You then attributed the success of Transformer-based models to "the non-locality & easy optimization of convolutions". The "SOTA for most (all?) sequence-related tasks" applies to the regular Transformer variants, not the Evolved Transformer, which was published about 5 days ago.

No one is denying that convolutions are useful across many domains. But no one seriously working in the domain of NLP would consider convolutions to be anywhere near the most novel or notable parts of the Transformer.

(In case you do want to look it up, OpenAI's GPT also uses character-level convolutions for its word embeddings. However, BERT does not.)

pilooch · on Feb 4, 2019

Interesting conversation. I would add that papers by Lecun and others have been using character based convolutions on pure text since 2015 with great success. VDCNN is still a very good way to go for classification, and is much faster to train than RNN due to effective parallelization.

On a side note, sad to see these conversations about SOTA deep learning be so adversarial... You're wrong / you're right kinda thing. It's an empirical science mostly at the moment, surf the gradient, be right and wrong at the same time!

phowon · on Feb 4, 2019

And convolution-based models still find use in all sorts of cool applications in language, such as: https://arxiv.org/abs/1805.04833

With regards to adversarial discussions, it's one thing to argue about whether method A or method B gives better results in a largely empirical and experimental field. But giving a very misleading characterization of a model is actively detrimental especially when it would give casual readers the impression that the Transformer is a "convolution-based" model, which no one in the field would do.

phowon · on Feb 3, 2019

But no part of that Transformer section makes any reference to convolutions.

tanilama · on Feb 3, 2019

But RNNs are losing their edge in sequence modeling to fully attentional techniques like the Transformer. At least, the consensus right now seems to be that the Transformer, when done right, offers better performance.

minimaxir · on Feb 3, 2019

I made textgenrnn (https://github.com/minimaxir/textgenrnn) as a higher-level approach to creating char-rnns, solving some issues such as the cold start problem (textgenrnn doesn't need a seed) and incorporating a few newer discoveries since 2015 such as Attention and CuDNN speedups.

There are other strategies for working with text to solve generic classification problems (e.g. BERT), but for text generation, LSTMs still can't be beat even though they still have issues with longer-term dependencies.

bjourne · on Feb 3, 2019

The cool part of RNNs is how simple they are. https://gist.github.com/karpathy/d4dee566867f8291f086 These 112 lines of Python are all it takes to generate very realistic Shakespeare prose (or Trump tweets if that's your thing). The results are much better than when using HMMs.

CoreSet · on Feb 4, 2019

Fun to read this thread and see all the projects it spawned. It's a testament to how compelling the post and example are that they simply demand to be tinkered with.

I used the RNN to train a George RR Martin-sounding Twitter account. Really fun for a while - and drove home the "Unreasonable Effectiveness" part.

jhinra · on Feb 3, 2019

I loved this article; I used Karpathy's tutorial to imitate a local burger reviewer: http://brian.copeland.bz/2018/12/16/the-burger-hunter/

alkonaut · on Feb 4, 2019

I worked with recurrent nets for dynamic system prediction for an undergrad thesis in ‘02 and we really struggled to train them on the CPUs we had. They had probably fewer than 100 parameters.

minimaxir · on Feb 4, 2019

In the GPU space, the new CuDNN RNNs are almost as fast as running MLPs on GPUs.

thinkr42 · on Feb 3, 2019

It is remarkable how frequently “unreasonable effectiveness” is thrown around.

nerdponx · on Feb 3, 2019

It's a deliberate riff on "The Unreasonable Effectiveness of Mathematics in the Natural Sciences", in the same vein of declaring something "considered harmful."

thinkr42 · on Feb 3, 2019

That is something I hadn’t thought of before

tomrod · on Feb 3, 2019

I must be missing something -- I have never considered the "unreasonable effectiveness" and "considered harmful" to be equivalent?

jasode · on Feb 3, 2019

>I have never considered the "unreasonable effectiveness" and "considered harmful" to be equivalent?

I didn't downvote you but as an fyi...

When gp wrote "in the same vein of", he wasn't saying the phrases meant the same thing. He was pointing out that they are the same classification of rhetorical device which some call a "snowclone". (See examples.[1])

The "unreasonable effectiveness of <x>" is re-used as a text template similar to "<x> considered harmful".

For another timely example, the current #1 story on HN is "Debugging Emacs, or How I Learned to Stop Worrying and Love DTrace". I think many readers will recognize it as a snowclone of the Kubrick film title "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb".

The snowclone is: "<X> or: How I Learned to Stop Worrying and Love <Y>".[2]

[1] https://en.wikipedia.org/wiki/Snowclone#Notable_examples

[2] More examples of previous HN submissions that used that Strangelove snowclone: https://hn.algolia.com/?query=or%20%22how%20i%20learned%20to...)

tomrod · on Feb 4, 2019

Ah, that makes sense. Thanks -- and thanks for teaching me about "snowclone".

buboard · on Feb 3, 2019

for the sake of completeness, both are good reads

https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.p...

https://www.maths.ed.ac.uk/~v1ranick/papers/wigner.pdf

adrianhel · on Feb 3, 2019

"The unreasonable effectiveness of ___" and "___ considered harmful" titles are copied a lot. They are not equivalent in any way.

swagasaurus-rex · on Feb 3, 2019

Perhaps it's the educated equivalent of "You'll be amazed at this one amazing trick!"

sherjilozair · on Feb 4, 2019

We embrace that as well: https://arxiv.org/abs/1404.5997