Google ‘BigBird’ Achieves SOTA Performance on Long-Context NLP Tasks (syncedreview.com)
96 points by Yuqing7 on Aug 3, 2020 | 27 comments



For an in-depth review you can see:

Yannic Kilcher - Big Bird: Transformers for Longer Sequences (Paper Explained)

https://www.youtube.com/watch?v=WVPE62Gk3EM


Thanks for the link - pretty good review. Totally agree with what he says about calling it SOTA. It's definitely a very interesting approach, but I wouldn't put that much weight on whether it's achieved SOTA, because of the amount of compute they threw at it.


For those who, like me, weren't familiar with the acronym: SOTA = "state of the art."


Thanks, that's definitely not what my brain assumed...

https://www.sota.org.uk/


Bibliography of the various approaches to long-context Transformers: https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_b...


I'm one of the authors. I may be able to answer some questions or concerns.


If you compare what BigBird is good at to what GPT-3 is good at, what are the relative strengths of each system?


Right now, BigBird is encoder-only, so it doesn't generate text. Causal attention with the global memory is a bit weird, but we could probably do it.

GPT-3 only uses a sequence length of 2048. In most of our paper we use 4096, but we can go much larger, 16k+. Of course, we don't have nearly as many parameters as GPT-3, so our generalization may not be as good.

BigBird is just an attention mechanism and could actually be complementary to GPT-3.


I'm curious how long this work took. Did you encounter many negative results before you got good results? What inspired the main insight?


I think the insight is not incredibly original and follows naturally from OpenAI's Sparse Transformer. The idea is similar to Longformer. Two teams at Google had a similar insight, hence the high number of authors.

The original implementation only took a couple of months and was primarily motivated by internal Google applications. Natural Questions was the first external benchmark we tried to validate on, and it took a few months to find the right setup. All the other datasets took a few weeks, but the effort was done in parallel given the large team.

There was quite a bit of frustration dealing with Tensorflow, TPUs, and the XLA compiler that maybe set us back a few months, too.


Edit: This is not meant to criticize the Tensorflow, TPU, or XLA team. They responded quickly to our bugs and made the work possible. I just meant there was some extra organizational overhead.


The main advantage of Big Bird is its linear complexity in sequence length. If it were to be trained on the same corpus as GPT-3 what would be the advantages/disadvantages?

I am thinking maybe a longer context window, faster training, and less memory use, but what about performance? Will it measure up?
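
Rough numbers to make the linear-vs-quadratic point concrete (the per-token budget k below is just a placeholder I picked, not BigBird's actual configuration):

    # Toy comparison: attention pairs per layer for full O(N^2)
    # attention vs. a sparse O(k*N) pattern with a fixed budget k.
    for n in (2048, 4096, 16384):
        k = 256                        # hypothetical per-token budget (window + random + global)
        full, sparse = n * n, k * n
        print(f"N={n:6d}  full={full:>12,}  sparse~{sparse:>12,}")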


This is something we want to explore. BigBird just replaces the attention mechanism in BERT.

We believe something like BigBird can be complementary to GPT-3. GPT-3 is still limited to 2048 tokens. We'd like to think that we could generate longer, more coherent stories by using more context.


Any examples of Turing-complete interaction?


Not really. The proofs are more of a curiosity, really. I think the strong performance on QA tasks that require multihop reasoning gives some evidence that the model is capable of complex reasoning.


Can someone with more domain experience grace us with the ELI5 of how these sparse transformers differ from vanilla transformers à la GPT-3?


First off, GPT-3 is absolutely not a vanilla transformer.

The moving pieces here in BigBird are a vector associated with each token in the sequence and a few more global vectors that you can think of as latent variables. Those pieces are present at each layer. The vector in layer i+1 at position t in the sequence is a function of a bunch of the vectors in layer i.

If position t depends on all of the other positions, then you end up computing all combinations. A sequence of length N has N^2 combinations. Papers like this use different sparsity patterns to avoid computing all of them.

In this one, position t only depends on local positions in some small window around t, those global/latent-ish variables, and a small number of random positions. The number of combinations is (the size of the local window + the number of random pieces to attend to + the number of global vectors to attend to) times the number of positions, so it's some k * N rather than N^2. That lets you scale to longer sequences.
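
Here's a toy sketch of that kind of mask (my own illustration, not the authors' code; all the sizes are arbitrary):

    # Sketch of a BigBird-style sparse attention mask:
    # local window + a few global tokens + a few random positions per row.
    import numpy as np

    def sparse_attention_mask(n=64, window=3, n_global=2, n_random=2, seed=0):
        rng = np.random.default_rng(seed)
        mask = np.zeros((n, n), dtype=bool)
        for t in range(n):
            lo, hi = max(0, t - window), min(n, t + window + 1)
            mask[t, lo:hi] = True                              # local window around t
            mask[t, rng.integers(0, n, size=n_random)] = True  # random positions
        mask[:, :n_global] = True                              # every token attends to the globals
        mask[:n_global, :] = True                              # globals attend to everything
        return mask                                            # ~k*N nonzeros instead of N^2

    m = sparse_attention_mask()
    print(m.sum(), "of", m.size, "entries attended")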


> First off, GPT-3 is absolutely not a vanilla transformer.

I thought it had a fixed-length window. Can you explain how it differs from a vanilla transformer other than the size?


Unfortunately, they didn't give many details in the paper. It's frustrating, to say the least. Yay reproducibility. They say:

> We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.

In the referenced paper (Sparse Transformer) they showed a bunch of different sparsity patterns, and I believe they're referring to either their banded block-diagonal sparsity or a true banded diagonal pattern (local windows like BigBird and some other papers). Unfortunately, that paper was also light on details, and the repo they open-sourced alongside it is inscrutable.
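
If it helps, here's roughly what "alternating dense and locally banded sparse" attention masks could look like layer by layer (my guess at the idea, not OpenAI's actual implementation):

    # Even layers: full causal attention; odd layers: causal attention
    # restricted to a local band. Purely illustrative.
    import numpy as np

    def layer_mask(n, layer, band=4):
        i, j = np.arange(n)[:, None], np.arange(n)[None, :]
        causal = j <= i                       # GPT-style models are autoregressive
        if layer % 2 == 0:
            return causal                     # "dense" layer
        return causal & (i - j <= band)       # "locally banded sparse" layer

    for layer in (0, 1):
        print("layer", layer, "attends to", int(layer_mask(16, layer).sum()), "pairs")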



This seems pretty similar to the Longformer, and the performance is not that much better. Is the only difference the random attention?


Was it named BigBird because it is never going to fly?


I'd guess it has more to do with predecessors ELMo and BERT


BERT, ELMo, et al. were named after Sesame Street characters, hence they kept the theme going.


Fun fact: Amazon DynamoDB was codenamed Big Bird, too.

https://news.ycombinator.com/item?id=21695147


Yes, like the rest of the AWS database services it used a Sesame Street name. E.g. Elmo was ElastiCache.

Not flying was a joke from the early design period, I heard. In the end it took off pretty well.


Is there a GitHub repo?



