Thanks for the link - pretty good review.
Totally agree with what he says about calling it SOTA. It's definitely a very interesting approach, but given the amount of compute they threw at it, I wouldn't put that much weight on whether it achieved SOTA.
Right now, BigBird is encoder only so it doesn't generate text. Causal attention with the global memory is a bit weird, but we could probably do it.
GPT-3 is only using a sequence length of 2048. In most of our paper we use 4096, but we can go much larger, 16k+. Of course, we have nowhere near as many parameters as GPT-3, so our generalization may not be as good.
BigBird is just an attention mechanism and could actually be complementary to GPT-3.
I think the insight is not incredibly original and follows naturally from OpenAI's Sparse Transformer. The idea is similar to Longformer. Two teams at Google had a similar insight, hence the high number of authors.
The original implementation only took a couple of months and was primarily motivated by internal Google applications. Natural Questions was the first external benchmark we tried to validate on, and it took a few months to find the right setup. All of the other datasets took a few weeks each, but the effort was done in parallel given the large team.
There was quite a bit of frustration dealing with Tensorflow, TPUs, and the XLA compiler that maybe set us back a few months, too.
Edit: This is not meant to criticize the Tensorflow, TPU, or XLA team. They responded quickly to our bugs and made the work possible. I just meant there was some extra organizational overhead.
The main advantage of BigBird is its linear complexity in sequence length. If it were trained on the same corpus as GPT-3, what would be the advantages/disadvantages?
I am thinking maybe a longer context window, faster training, and less memory use, but what about performance: will it measure up?
This is something we want to explore. BigBird just replaces the attention mechanism in BERT.
We believe something like BigBird can be complementary to GPT-3. GPT-3 is still limited to 2048 tokens. We'd like to think that we could generate longer, more coherent stories by using more context.
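Just to put rough numbers on the "less memory" part, here's a back-of-the-envelope comparison of how many attention scores get computed per layer/head. The sparse budget below (window + random + global tokens attended per position) is an illustrative guess on my part, not either paper's exact configuration:

```python
# Back-of-the-envelope count of attention scores per layer/head.
# The sparse budget (window=192, random=192, global=64 attended positions
# per token) is illustrative, not the paper's exact configuration.
def dense_scores(n):
    return n * n

def sparse_scores(n, k=192 + 192 + 64):
    return k * n

for n in (2048, 4096, 16384):
    print(f"N={n:6d}  dense={dense_scores(n):>12,}  sparse~{sparse_scores(n):>12,}")
```

The gap is modest at 2048 but grows fast: at 16k, dense attention is computing hundreds of millions of scores per head per layer while the sparse version stays linear.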
Not really. The proofs are more of a curiosity. I think the strong performance on QA tasks that require multi-hop reasoning gives some evidence that the model is capable of complex reasoning.
First off, GPT-3 is absolutely not a vanilla transformer.
The moving pieces here in BigBird are a vector associated with each token in the sequence and a few more global vectors that you can think of as latent variables. Those pieces are present at each layer. The vector in layer i+1 at position t in the sequence is a function of a bunch of the vectors in layer i.
If position t depends on all of the other positions, then you end up computing all combinations. A sequence of length N has N^2 combinations. Papers like this are using different patterns.
In this one, position t only depends on local positions in some small window around t, those global/latent-ish variables, and a small number of random positions. The number of combinations is (the size of the local window + the number of random pieces to attend to + the number of global vectors to attend to) times the number of positions, so it's some k * N rather than N^2. That lets you scale to longer sequences.
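If it helps, here's a toy token-level sketch of that kind of mask. The sizes are illustrative, and the real implementation works on blocks of tokens rather than individual positions, so treat this as a picture of the pattern, not the actual code:

```python
import numpy as np

def bigbird_style_mask(n, window=3, n_random=3, n_global=2, seed=0):
    """Boolean mask where mask[t, s] = True means position t attends to s.

    Token-level toy version: window/random/global sizes are illustrative,
    and real implementations operate on blocks of tokens.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    for t in range(n):
        # local window around position t
        lo, hi = max(0, t - window), min(n, t + window + 1)
        mask[t, lo:hi] = True
        # a small number of random positions
        mask[t, rng.choice(n, size=n_random, replace=False)] = True

    # global tokens attend to everything, and everything attends to them
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

m = bigbird_style_mask(4096)
print(m.sum(), "nonzero entries vs", 4096 * 4096, "for full attention")
```

The nonzero count grows like k * N, which is the whole trick.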
Unfortunately, they didn't give many details in the paper. It's frustrating, to say the least. Yay reproducibility. They say
> We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
In the referenced paper (Sparse Transformer) they showed a bunch of different sparsity patterns, and I believe they're referring to either their banded block-diagonal sparsity or a true banded diagonal pattern (local windows like BigBird and some other papers use). Unfortunately, that paper was also light on details, and the repo they open sourced alongside it is inscrutable.
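For what it's worth, here's how I read "alternating dense and locally banded sparse attention": a causal band for the sparse layers and full causal attention for the dense ones. The window size here is my guess, not anything from either paper:

```python
import numpy as np

def banded_mask(n, window=128):
    """Causal local-window ("banded") mask: t attends to the previous `window` tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def dense_causal_mask(n):
    """Standard causal mask: t attends to all positions <= t."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return j <= i

# Alternate the two patterns across layers, as the GPT-3 quote describes.
def mask_for_layer(layer_idx, n=2048):
    return dense_causal_mask(n) if layer_idx % 2 == 0 else banded_mask(n)
```

Since half the layers are still dense, GPT-3's overall cost is still quadratic in sequence length; the banded layers just cut the constant.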
Yannic Kilcher - Big Bird: Transformers for Longer Sequences (Paper Explained)
https://www.youtube.com/watch?v=WVPE62Gk3EM