
T5: The Text-to-Text Transfer Transformer - theafh
https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
======
thatcherc
The 'fill in the N blanks' results at the end are fascinating! N<64 are all
pretty normal, but then for N=64 and N=512, it starts going on about the old
1930s cookbook it has and its grad school experiences! Wild. I think I would
not be able to distinguish this from a selection of real Amazon reviews or
similar informal text.

~~~
pragmatick
It reads like a typical blog recipe introduction.

~~~
Ajedi32
Which makes sense, given the corpus they trained T5 on.

------
akie
Question: How did they obtain the Colossal Clean Crawled Corpus (C4) they
mention in the article?

Options:

1\. "Mechanical Turk" style, a massive undertaking to manually clean up Common
Crawl, perhaps using underpaid labor in third world countries (such as
samasource.com does)

2\. By means of somehow getting the internet to do it for them with something
like reCAPTCHA

3\. With the help of machine learning / traditional text processing

4\. Some other way

Anyone have any ideas? I'm intrigued. The paper
[[https://arxiv.org/pdf/1910.10683.pdf](https://arxiv.org/pdf/1910.10683.pdf)]
and the website
[[https://www.tensorflow.org/datasets/catalog/c4](https://www.tensorflow.org/datasets/catalog/c4)]
mention almost nothing, except for an option to switch off the cleaning &
deduplication, which hints at option number 3.

~~~
bibobap
In section 2.2 of the paper they describe the process they use: applying a
series of heuristic rules to the text. (Also, the dataset is 750 GB...)

~~~
akie
Ah wow, thanks! Not sure how I missed that. For other interested parties,
here's the key section:

> Unfortunately, the majority of [the text in Common Crawl] is not natural
> language. Instead, it largely comprises gibberish or boiler-plate text like
> menus, error messages, or duplicate text. Furthermore, a good deal of the
> scraped text contains content that is unlikely to be helpful for any of the
> tasks we consider (offensive language, placeholder text, source code, etc.).
> To address these issues, we used the following heuristics for cleaning up
> Common Crawl’s web extracted text:

•We only retained lines that ended in a terminal punctuation mark (i.e. a
period, exclamation mark, question mark, or end quotation mark).

•We removed any page that contained any word on the “List of Dirty, Naughty,
Obscene or Otherwise Bad Words”. [[https://github.com/LDNOOBW/List-of-Dirty-
Naughty-Obscene-and...](https://github.com/LDNOOBW/List-of-Dirty-Naughty-
Obscene-and-Otherwise-Bad-Words)]

•Many of the scraped pages contained warnings stating that Javascript should
be enabled so we removed any line with the word Javascript.

•Some pages had placeholder “lorem ipsum” text; we removed any page where the
phrase “lorem ipsum” appeared.

•Some pages inadvertently contained code. Since the curly bracket “{” appears
in many programming languages (such as Javascript, widely used on the web) but
not in natural text, we removed any pages that contained a curly bracket.

•To deduplicate the dataset, we discarded all but one of any three-sentence
span occurring more than once in the dataset.

Additionally, since most of our downstream tasks are focused on English-
language text, we used langdetect
[[https://pypi.org/project/langdetect/](https://pypi.org/project/langdetect/)]
to filter out any pages that were not classified as English with a probability
of at least 0.99.
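The rule-based filters above are simple enough to sketch in a few lines of Python. This is only a rough illustration of the quoted heuristics, not the paper's actual implementation; `clean_page` and the placeholder bad-word entries are made-up names:

```python
# Stand-ins for the LDNOOBW word list (hypothetical placeholders).
BAD_WORDS = {"badword1", "badword2"}
TERMINAL_PUNCT = ('.', '!', '?', '"', "'")

def clean_page(text):
    """Apply C4-style filters to one page of extracted text.

    Returns the cleaned page, or None if the whole page should be dropped.
    """
    lowered = text.lower()
    # Page-level filters: bad-word list, "lorem ipsum" placeholder text,
    # and curly brackets (a cheap proxy for source code).
    if BAD_WORDS & set(lowered.split()):
        return None
    if "lorem ipsum" in lowered:
        return None
    if "{" in text:
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines ending in a terminal punctuation mark.
        if not line.endswith(TERMINAL_PUNCT):
            continue
        # Drop boilerplate lines mentioning Javascript.
        if "javascript" in line.lower():
            continue
        kept.append(line)
    return "\n".join(kept) if kept else None
```

On top of this per-page pass, the paper's pipeline also deduplicates any three-sentence span that occurs more than once across the corpus and drops pages langdetect doesn't classify as English with probability ≥ 0.99.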

~~~
MattConfluence
> We removed any page that contained any word on the “List of Dirty, Naughty,
> Obscene or Otherwise Bad Words”. [[https://github.com/LDNOOBW/List-of-Dirty-
> Naughty-Obscene-and...](https://github.com/LDNOOBW/List-of-Dirty-Naughty-
> Obscene-and...)]

Looking at that list, I wonder what the unintended consequences of a decision
like this are. If you want to create something related to sentiment analysis,
the swear words you discarded are a useful signal, not noise, right? If you
wanted to use the dataset somehow for your tour guide business in Austria, how
does it handle the village called Fucking? Does T5 understand the British
colloquialism for cigarettes? Can ornithologists talk to it about penguins and
eagles, but not about yellow-bellied tits and blue-footed boobies?

~~~
vimax
That made me think of something along the lines of "backdooring a dataset" by
introducing some hard to find but easy to trigger failure modes or
fingerprinting for any application built on top of it.

~~~
lowdose
Sounds like an awesome idea, put some easter eggs in the common crawl to
compromise the future of NLP.

------
anigbrowl
_To put these results in perspective, the T5 team went head-to-head with the
model in a pub trivia challenge and lost!_

Trivia fact recall and NLP seem like two quite different tasks even though
both are required to do well in a quiz.

~~~
craffel
Agreed! The interesting thing is that basic unsupervised pre-training seems to
produce a model which functions not only as a knowledge base but also an NLU
system which can effectively query the knowledge base using natural text
questions. This is exactly what our follow-up paper is on.

------
modeless
The trivia game ([https://t5-trivia.glitch.me/](https://t5-trivia.glitch.me/))
needs a little work.

> Q: How did Gus Grissom, Ed White and Roger B. Chaffee die in 1967?

> You: "Apollo 1" WRONG

> T5: "They were killed when their Apollo 1 spacecraft exploded" WRONG

> Correct answer: burned to death

> Q: Which Alpine peak is known in Italy as Monte Cervino?

> You: "Monte Cervino" CORRECT

I wonder how many of the problems with this game could be fixed by applying T5
itself to the answer grading.

~~~
craffel
Yes, unfortunately we have to rely on the very brittle "exact match" method of
evaluating whether an answer is correct. FWIW and perhaps surprisingly, this
is the primary way question-answering systems are evaluated in common
benchmarks. I totally agree that fine-tuning T5 for answer grading would be
super interesting!

~~~
dmit
I'm sorry for being blunt, but is it possible that the `very brittle "exact
match" method of evaluating whether an answer is correct` means value
equality? Is `==` the secret sauce?

~~~
craffel
It's slightly more than that -- it also involves lowercasing and removing
articles before testing for string equality.
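That normalization is roughly what the standard SQuAD-style evaluation does before comparing strings. A sketch from the benchmark convention, not the exact T5 evaluation code:

```python
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # remove articles
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    """Score a prediction as correct iff the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)
```

So `exact_match("The Matterhorn", "matterhorn")` passes, but a correct answer phrased differently from the reference (as in the Apollo 1 example above) still scores zero.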

------
lanekelly
Anyone know what's new in the blogpost? T5 has been out for a few months now.

~~~
craffel
The blogpost has a summary of our paper from October (a bit late, sorry!) but
also has some (fun?) new results on closed-book question answering and fill-
in-the-blank text generation.

------
kitsune_
No mention of MASS by Microsoft? It was afaik one of the first pretraining
schemes for a full transformer outside of XLM.

Imho a bit unfortunate, as is calling the decoder or the encoder of a
transformer "a transformer", as has happened with GPT and BERT, which now
forces people to say "full transformer" or use phrases like the title of the
blog post.

~~~
craffel
We include MASS in our empirical survey (see e.g. section 3.3.2 of our paper,
[https://arxiv.org/pdf/1910.10683.pdf](https://arxiv.org/pdf/1910.10683.pdf)).
FWIW, people were pre-training Transformers before MASS, e.g. "Improving
Language Understanding by Generative Pre-Training" by Radford et al. from
2018. Even further back, "Semi-Supervised Sequence Learning" by Dai et al.
describe pre-training an RNN encoder-decoder model for subsequent transfer.

~~~
kitsune_
But Radford is just pretraining the decoder, which is qualitatively different
from a seq2seq approach such as MASS. If we just look at the original paper
from Vaswani, then "pretraining a transformer" imho should only ever have
meant pretraining the encoder and decoder. Obviously that ship has sailed.

------
atomoton
Wonder how this would compare with Watson at playing Jeopardy...

~~~
halflings
I would assume most QA (question answering) models blow Watson out of the
water. A lot has been done since then. See:
[https://aclweb.org/aclwiki/Question_Answering_(State_of_the_...](https://aclweb.org/aclwiki/Question_Answering_\(State_of_the_art\))

~~~
nl
(I've done work in QA and have played at building Jeopardy style QA models)

Watson (Jeopardy Watson, not the IBM branding exercise Watson is now) has much
weaker text-understanding models, but much, much better optimisations for the
incremental style of data release that you see in Jeopardy (i.e., you get
more and more data the longer you listen). IBM did a lot of work optimising
_when_ to answer as well as trying to get the correct answer.

The closest analogy that is regularly studied in modern QA research is
"Quizbowl"-style datasets, but these tend to be much smaller than the SQuAD
datasets that most modern neural network QA systems are built against.

------
kuprel
How would this perform on the reading comprehension part of the SAT?

------
vbarrielle
The example figure has a weird entry for the summary task. The input is:

"summarize: state authorities dispatched emergency crews tuesday to survey the
damage after an onslaught of severe weather in mississippi..."

And the output is:

"six people hospitalized after a storm in attala county."

That's quite a bad summary: there's no mention of "six" people in the original
text, and no mention of hospitalization. And "attala" county is too specific,
a precision not present in the original text.

If that's the result of their model, that's not good. If it's coming from the
training set, it's an even bigger problem. I guess it's the result of the
model, because some issues can be explained by correlations ("emergency"
correlates with "hospital", "mississippi" correlates with "attala").

I'm wondering why they chose this example for the flagship figure of their
paper.

~~~
mritun
The “...” at the end of the phrase is the giveaway that they couldn’t fit the
whole article in the picture. Check out the complete paper.

~~~
vbarrielle
I missed that ellipsis cue, thanks for pointing it out. The complete excerpt
is not in the complete paper though, but it's probably in their released data.

------
gwern
If you want to train or run T5 for pure text generation (GPT-2-style), Nax
developed a Colab notebook for that:
[https://twitter.com/NaxAlpha/status/1224912629967310848](https://twitter.com/NaxAlpha/status/1224912629967310848)
[https://colab.research.google.com/drive/1-ROO7L09EupLFLQM-
TW...](https://colab.research.google.com/drive/1-ROO7L09EupLFLQM-
TWgDHa5-FIOdLLh)

------
j0e1
Link to the paper:
[https://arxiv.org/abs/1910.10683](https://arxiv.org/abs/1910.10683)

------
vackosar
You can listen to the paper here:
[https://youtu.be/gyBdnNY1WPI](https://youtu.be/gyBdnNY1WPI)

------
derefr
Now I'm curious what it'd give as output for the missing-tokens task if you
specialized it on understanding SVG vector image data...

~~~
bowmessage
Interesting thought! I think you'd have to provide more than the image XML,
though. The locality of XML elements in text form doesn't necessarily
correspond with their locality in the rendered image, so that would be tricky:
there wouldn't be a lot of "context" to go off of.

------
Jack000
would pre-training on non-English text (using a joint dictionary) improve
translation performance?

I'm not sure if the information on
[http://nlpprogress.com/english/machine_translation.html](http://nlpprogress.com/english/machine_translation.html)
is accurate, but it appears that the top translation results rely on
backtranslation, boosting and other data augmentation techniques with a
_vanilla_ transformer model. It would be interesting to see the BLEU scores
for a T5 that's optimized specifically for translation.

~~~
Ajedi32
It'd be really interesting to see if T5 could be made to perform a reading
comprehension or summarization task where the source article is in a different
language from the question and answer. Seems like a potentially interesting
application for a model as flexible as this one.

------
foota
In case anyone from the team is watching, the colab link at the bottom is
broken.

~~~
craffel
Thanks, fixed!

------
ComputerGuru
Isn’t that the name of Microsoft’s code generation templating language?

~~~
zamalek
That's T4. The naming is pretty unfortunate.

~~~
perl4ever
Not to be confused with the m4 language.

------
slyrus
Of course every equestrian (or parent of one) would know that a sentence like
"the course is jumping well" is totally legit.

------
baq
> Text-To-Text Transfer Transformer (T5)

> Colossal Clean Crawled Corpus (C4)

pun detector at 3.6 punits. not great, not terrible.

------
flying_sheep
This is epic...

> Q: What is the opposite of an acid?

> You: alkaline

> T5: Alkali

> Correct answer: a Base

:-)

------
eutectic
Did you consider incorporating convolution into the model, a la the Evolved
Transformer?

------
riazrizvi
> the full 11 billion parameter model achieves the exact text of the answer
> 50%[ to 30% of the time]

If you need to tweak 11 billion parameters to get a particular result, I don’t
see how you can call that thing a model; it's more like a component of a
model.

------
edsonmedina
Social-media political bots about to get harder to detect.

------
mc3
Ironically this can be used to create search engine spam!

------
kizer
Will it be able to learn the universal Turing machine? We could "train" a
compiler & runtime into existence. Probably too much of a correctness
constraint (since the output has to be the exact result of computing the
encoded program with the encoded machine).

------
gitgud
> _"With T5, we propose reframing all NLP tasks into a unified text-to-text
> format where the input and output are always text strings..."_

So exactly like the "unix pipe" philosophy invented 47 years ago?

I guess ideas are cyclical...

