So I don't think RNNs are fundamentally more myopic than CNNs (just that there may be practical advantages to using the latter)
Hierarchical RNNs, Clockwork RNNs, Hierarchical Multiscale RNNs, and probably others are doing things of this nature.
Clockwork RNNs subsample, BTW, so they are more analogous to stride=2 in CNNs than to dilation.
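If it helps, here's a rough PyTorch illustration of the stride-vs-dilation distinction (layer sizes are arbitrary, just for showing the output shapes):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 8, 100)  # (batch, channels, time)

    # stride=2 subsamples: the output has roughly half the time steps
    strided = nn.Conv1d(8, 8, kernel_size=3, stride=2)
    print(strided(x).shape)  # torch.Size([1, 8, 49])

    # dilation=2 keeps every time step but spaces the kernel taps apart,
    # widening the receptive field instead of shrinking the sequence
    dilated = nn.Conv1d(8, 8, kernel_size=3, dilation=2)
    print(dilated(x).shape)  # torch.Size([1, 8, 96])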
Just some papers, you know?
We’re so screwed.
I suppose convolutions could technically handle variable-length inputs (just slide the window of weights over different-length inputs), but I don't think TensorFlow or PyTorch supports this.
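A minimal numpy sketch of that sliding-window idea, i.e. the same kernel applied to inputs of different lengths (the function name is just illustrative):

    import numpy as np

    def conv1d_valid(x, kernel):
        """Slide the same weight window over an input of any length."""
        n = len(x) - len(kernel) + 1
        return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

    kernel = np.array([0.25, 0.5, 0.25])
    print(conv1d_valid(np.arange(5, dtype=float), kernel).shape)   # (3,)
    print(conv1d_valid(np.arange(12, dtype=float), kernel).shape)  # (10,)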
>How do you handle variable length input without something like an RNN?
Any form of pooling, really: max, avg, sum. The tricky part is how to do the pooling while still taking advantage of the sequential structure of the input information. Transformer-based models have shown that you can get away with providing very little order information and still go very far.
You can take the mean over 3 elements or 10 elements just the same. Pooling is lossy, but it seems that if you have the right architecture the model can still learn what it needs to.
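A minimal sketch of that, e.g. mean pooling over per-step features (the sizes here are just for illustration):

    import torch

    def mean_pool(features):
        # features: (seq_len, hidden) -- seq_len can be 3, 10, or anything else
        return features.mean(dim=0)

    print(mean_pool(torch.randn(3, 128)).shape)   # torch.Size([128])
    print(mean_pool(torch.randn(10, 128)).shape)  # torch.Size([128])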
It's worth noting that the attention mechanism (at least in RNNs) has always been invariant to input lengths. It's a weighted sum with weights computed per element, so there's no length constraint at all.
Attention is generally length invariant. You take some transformation of the hidden representations (/+ inputs) at each time step, and then you normalize over all the transformed values to get weights that sum to one. No part of this is constrained by length.
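As a sketch of that (the scoring transformation here is a stand-in; real models use a small learned network or a dot product against a query):

    import torch

    def attend(hidden_states, scoring_vector):
        # hidden_states: (seq_len, hidden) -- seq_len is arbitrary
        scores = hidden_states @ scoring_vector   # one score per element
        weights = torch.softmax(scores, dim=0)    # weights sum to 1, whatever the length
        return weights @ hidden_states            # fixed-size weighted sum, (hidden,)

    v = torch.randn(64)
    print(attend(torch.randn(5, 64), v).shape)   # torch.Size([64])
    print(attend(torch.randn(50, 64), v).shape)  # torch.Size([64])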
For CNNs, any network that has pooling has the potential to be length/dimension invariant. Whether it actually is depends on both the architectural design and implementation details (e.g. some implementations will specifically define the pooling operation over, say, a 9x9 window; you could define the same pooling operation over a variable-dimension window).
Length/dimension invariance isn't a special or novel property. In the case of attention it's built in. In the case of CNNs, the convolutions are not length invariant, but depending on the architecture, the pooling operations are (or can be modified to be).
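For instance, PyTorch's adaptive pooling layers pool over whatever spatial size the input happens to have rather than a fixed window (a minimal sketch; channel counts and input sizes are arbitrary):

    import torch
    import torch.nn as nn

    pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))  # pool each feature map down to 1x1,
                                                     # whatever its spatial size

    small = torch.randn(1, 32, 9, 9)
    large = torch.randn(1, 32, 57, 41)
    print(pool(small).shape, pool(large).shape)  # both torch.Size([1, 32, 1, 1])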
That's not true.
You compute an attention weight across however many context steps you have by computing an interaction between some current decoder hidden state and every encoder hidden state, and normalizing over all of them via a softmax. There is no constraint whatsoever on a fixed context length or a fixed number of context vectors. See section 3.1 in the paper.
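Concretely, something like the following (dot-product scoring is used here as one choice of interaction, not necessarily the exact one in the paper):

    import torch

    def attention_context(decoder_state, encoder_states):
        # decoder_state: (hidden,)   encoder_states: (src_len, hidden), src_len arbitrary
        scores = encoder_states @ decoder_state   # one score per encoder step
        weights = torch.softmax(scores, dim=0)    # normalize over however many steps there are
        return weights @ encoder_states           # context vector, (hidden,)

    dec = torch.randn(256)
    print(attention_context(dec, torch.randn(7, 256)).shape)    # torch.Size([256])
    print(attention_context(dec, torch.randn(123, 256)).shape)  # torch.Size([256])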
I will be happy to discuss and clarify over email.
There's a reason the original Transformer paper is entitled "Attention is All You Need": it threw out all the previous structures people assumed were necessary to solve these problems (recurrence from RNNs, local transformations from convolutions), just threw multiple layers of large multi-headed attention at the problem, and got even better results.
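As a rough illustration of "just attention, stacked": PyTorch's built-in multi-head attention layer applied as self-attention, with no recurrence and no convolution (layer sizes are arbitrary; note it expects (seq, batch, dim) layout by default):

    import torch
    import torch.nn as nn

    attn = nn.MultiheadAttention(embed_dim=64, num_heads=8)

    x = torch.randn(35, 2, 64)    # (seq_len, batch, embed_dim), seq_len can vary
    out, weights = attn(x, x, x)  # self-attention: queries, keys, values are all x
    print(out.shape)              # torch.Size([35, 2, 64])
    print(weights.shape)          # torch.Size([2, 35, 35]) -- every position attends to every other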
It is, nevertheless, still a convolution, and calls to convolution code are how self-attention is implemented. Look inside a SAGAN or something and you'll see the conv2d calls.
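For example, a stripped-down SAGAN-style self-attention block, where the query/key/value projections are 1x1 conv2d calls (a sketch, not the exact SAGAN code; channel sizes are illustrative):

    import torch
    import torch.nn as nn

    class SelfAttention2d(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # the "attention" projections are literally 1x1 convolutions
            self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
            self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
            self.value = nn.Conv2d(channels, channels, kernel_size=1)
            self.gamma = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            b, c, h, w = x.shape
            q = self.query(x).flatten(2).transpose(1, 2)  # (b, h*w, c//8)
            k = self.key(x).flatten(2)                    # (b, c//8, h*w)
            v = self.value(x).flatten(2)                  # (b, c, h*w)
            attn = torch.softmax(q @ k, dim=-1)           # (b, h*w, h*w) attention map
            out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
            return self.gamma * out + x                   # learnable residual, as in SAGAN

    x = torch.randn(1, 64, 16, 16)
    print(SelfAttention2d(64)(x).shape)  # torch.Size([1, 64, 16, 16])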
> Calling the Transformer "convolutions with attention" demonstrates a near-complete misunderstanding of the architecture.
You're reading that in an overly narrow way and imputing to me something I didn't mean. And again, the original Transformer may not be used in conjunction with convolutions, but it often is, and the best current variant uses convolutions internally and so involves convolutions no matter how you want to slice it. Attention is a powerful construct, but convolutions are pretty darn powerful too, it turns out, even outside images.
...Yes, because SAGANs operate on images, so the foundational operation is a convolution.
>You're reading that in an overly narrow way and imputing to me something I didn't mean.
You characterized the Transformer as "convolutions with attention". You then attributed the success of Transformer-based models to "the non-locality & easy optimization of convolutions". The "SOTA for most (all?) sequence-related tasks" applies to the regular Transformer variants, not the Evolved Transformer, which was published about 5 days ago.
No one is denying that convolutions are useful across many domains. But no one seriously working in the domain of NLP would consider convolutions to be anywhere near the most novel or notable parts of the Transformer.
(In case you do want to look it up, OpenAI's GPT also uses character-level convolutions for its word embeddings. However, BERT does not.)
On a side note, it's sad to see these conversations about SOTA deep learning be so adversarial... a you're wrong / you're right kinda thing. It's mostly an empirical science at the moment; surf the gradient, be right and wrong at the same time!
With regards to adversarial discussions: it's one thing to argue about whether method A or method B gives better results in a largely empirical and experimental field. But giving a very misleading characterization of a model is actively detrimental, especially when it would give casual readers the impression that the Transformer is a "convolution-based" model, a characterization no one in the field would make.
There are other strategies for working with text to solve generic classification problems (e.g. BERT), but for text generation, LSTMs still can't be beaten, even though they still have issues with longer-term dependencies.
I used the RNN to train a George RR Martin-sounding Twitter account. Really fun for a while - and drove home the "Unreasonable Effectiveness" part.
I didn't downvote you but as an fyi...
When gp wrote "in the same vein of", he wasn't saying the phrases meant the same thing. He was pointing out that they are the same classification of rhetorical device which some call a "snowclone". (See examples.)
The "unreasonable effectiveness of <x>" is re-used as a text template similar to "<x> considered harmful".
For another timely example, the current #1 story on HN is "Debugging Emacs, or How I Learned to Stop Worrying and Love DTrace". I think many readers will recognize it as a snowclone of the Kubrick film title "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb".
The snowclone is: "<X> or: How I Learned to Stop Worrying and Love <Y>".
More examples of previous HN submissions that used that Strangelove snowclone: https://hn.algolia.com/?query=or%20%22how%20i%20learned%20to...