
Google ‘BigBird’ Achieves SOTA Performance on Long-Context NLP Tasks - Yuqing7
https://syncedreview.com/2020/08/03/google-bigbird-achieves-sota-performance-on-long-context-nlp-tasks/
======
visarga
For an in-depth review you can see:

Yannic Kilcher - Big Bird: Transformers for Longer Sequences (Paper Explained)

[https://www.youtube.com/watch?v=WVPE62Gk3EM](https://www.youtube.com/watch?v=WVPE62Gk3EM)

~~~
mdrabla
Thanks for the link - pretty good review. Totally agree with what he says
about calling it SOTA. It's definitely a very interesting approach but I
wouldn't put that much weight on whether it's achieved SOTA, because of the
amount of compute they threw into it.

------
faitswulff
For those who, like me, weren't familiar with the acronym: SOTA = "state of the
art."

~~~
JshWright
Thanks, that's definitely not what my brain assumed...

[https://www.sota.org.uk/](https://www.sota.org.uk/)

------
gwern
Bibliography of the various approaches to long-context Transformers:
[https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_b...](https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_breaking_the_quadratic_attention_bottleneck_in/)

------
phillypham
I'm one of the authors. I may be able to answer some questions or concerns.

~~~
lacker
If you compare what BigBird is good at to what GPT-3 is good at, what are the
relative strengths of each system?

~~~
phillypham
Right now, BigBird is encoder-only, so it doesn't generate text. Causal
attention with the global memory is a bit weird, but we could probably do it.

GPT-3 is only using a sequence length of 2048. In most of our paper we use
4096, but we can go much larger, 16k+. Of course, we don't have nearly as many
parameters as GPT-3, so our generalization may not be as good.

BigBird is just an attention mechanism and could actually be complementary to
GPT-3.

------
qeternity
Can someone with more domain experience grace us with the ELI5 of how these
sparse transformers differ from vanilla transformers a la GPT-3?

~~~
ianhorn
First off, GPT-3 is absolutely not a vanilla transformer.

The moving pieces here in BigBird are a vector associated with each token in
the sequence and a few more global vectors that you can think of as latent
variables. Those pieces are present at each layer. The vector in layer i+1 at
position t in the sequence is a function of a bunch of the vectors in layer i.

If position t depends on all of the other positions, then you end up computing
all pairwise combinations, and a sequence of length N has N^2 of them. Papers
like this use different, sparser dependency patterns to avoid that cost.

In this one, position t only depends on local positions in some small window
around t, those global/latent-ish variables, and a small number of random
positions. The number of combinations is (the size of the local window + the
number of random pieces to attend to + the number of global vectors to attend
to) times the number of positions, so it's some k * N rather than N^2. That
lets you scale to longer sequences.
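A minimal sketch of what such a mask could look like (the window size, number
of random positions, and number of global tokens below are made-up illustrative
values, not the paper's):

    import numpy as np

    def sparse_attention_mask(seq_len, window=3, n_random=2, n_global=2, seed=0):
        """mask[t, s] == True means position t attends to position s."""
        rng = np.random.default_rng(seed)
        mask = np.zeros((seq_len, seq_len), dtype=bool)
        for t in range(seq_len):
            # local window around t
            mask[t, max(0, t - window):min(seq_len, t + window + 1)] = True
            # a few random positions
            mask[t, rng.choice(seq_len, size=n_random, replace=False)] = True
        # global tokens attend everywhere and are attended to by everyone
        mask[:n_global, :] = True
        mask[:, :n_global] = True
        return mask

    m = sparse_attention_mask(16)
    print(m.sum(), "attended pairs vs", 16 * 16, "for full attention")

Each row only has roughly (window + random + global) True entries, so the total
work grows like k * N instead of N^2.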

~~~
YetAnotherNick
> First off, GPT-3 is absolutely not a vanilla transformer.

I thought it had a fixed-length window. Can you explain how it differs from a
vanilla transformer, other than the size?

~~~
ianhorn
Unfortunately, they didn't give many details in the paper. It's frustrating,
to say the least. Yay reproducibility. They say

> We use the same model and architecture as GPT-2 [RWC+19], including the
> modified initialization, pre-normalization, and reversible tokenization
> described therein, with the exception that we use alternating dense and
> locally banded sparse attention patterns in the layers of the transformer,
> similar to the Sparse Transformer.

In the referenced paper (Sparse Transformer) they showed a bunch of different
sparsity patterns, and I believe they're referring to either their banded
block-diagonal sparsity or a true banded diagonal pattern (local windows like
BigBird and some other papers). Unfortunately, that paper _also_ was light on
details, and the repo they open-sourced alongside it is inscrutable.
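The exact layer layout isn't spelled out, but one plausible reading of
"alternating dense and locally banded sparse attention patterns" is something
like the following (the band width and number of layers here are my guesses,
not anything from the paper):

    import numpy as np

    def dense_causal_mask(seq_len):
        # every position attends to all earlier positions (standard causal attention)
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    def banded_causal_mask(seq_len, band=4):
        # each position attends only to the previous `band` positions, itself included
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - band)

    # alternate the two patterns layer by layer
    masks = [dense_causal_mask(8) if layer % 2 == 0 else banded_causal_mask(8, band=4)
             for layer in range(4)]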

~~~
gwern
[https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_b...](https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_breaking_the_quadratic_attention_bottleneck_in/fzh7bpd/)

------
codekansas
This seems pretty similar to the Longformer, and the performance is not _that_
much better. Is the only difference the random attention?

------
karavelov
Was it named BigBird because it is never going to fly?

~~~
bobosha
BERT, ELMo, et al. were named after Sesame Street characters, hence they kept
the theme going.

~~~
ignoramous
Fun fact: Amazon DynamoDB was codenamed _Big Bird_, too.

[https://news.ycombinator.com/item?id=21695147](https://news.ycombinator.com/item?id=21695147)

~~~
karavelov
Yes, like the rest of the AWS database services it used a Sesame Street name,
e.g. Elmo was ElastiCache.

Not flying was a joke from the early design period, I heard. In the end it
took off pretty well.

------
echan00
Is there a GitHub repo?

