
The Illustrated BERT: How NLP Cracked Transfer Learning - ghosthamlet
https://jalammar.github.io/illustrated-bert/
======
danieldk
I think the work being done on ELMo, BERT, and others is great and
useful. Unfortunately, there are many grandiose claims circulating around
these papers, such as the title of this blog post.

For example:

 _If we’re using this GloVe representation, then the word “stick” would be
represented by this vector no-matter what the context was. “Wait a minute”
said a number of NLP researchers (Peters et. al., 2017, McCann et. al., 2017,
and yet again Peters et. al., 2018 in the ELMo paper), “stick” has multiple
meanings depending on where it’s used. Why not give it an embedding based on
the context it’s used in – to both capture the word meaning in that context as
well as other contextual information?”. And so, contextualized word-embeddings
were born._

This is blatantly false. Contextualized word representations have been around
for a very long time. For example, the neural probabilistic language model
proposed by Bengio et al., 2003 produces contextual word representations, and
many papers on neural language models followed. The idea is even older:
Schütze's 1993 paper (Word Space) produces context-dependent word
representations with subword units (n-grams).
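
As a toy sketch of the distinction (my own illustration, assuming PyTorch;
not code from any of these papers): a language model's hidden state at a
given position is conditioned on the surrounding words, whereas a static
lookup table is not.

    import torch
    import torch.nn as nn

    embed = nn.Embedding(10_000, 64)         # static, word2vec/GloVe-style table
    lm = nn.LSTM(64, 128, batch_first=True)  # toy stand-in for a neural LM

    a = torch.tensor([[7, 42, 99]])  # same final word id (99)...
    b = torch.tensor([[3, 15, 99]])  # ...preceded by a different context

    # The static table assigns the final word the same vector both times,
    assert torch.equal(embed(a)[0, 2], embed(b)[0, 2])
    # while the LM's hidden state for it changes with the preceding words.
    out_a, _ = lm(embed(a))
    out_b, _ = lm(embed(b))
    assert not torch.equal(out_a[0, 2], out_b[0, 2])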

Researchers have been well aware for decades that ideally one would want
context-sensitive representations, and that representations such as those
produced by word2vec or GloVe have this shortcoming. However, one of the
reasons word2vec became so popular is that it is damn cheap to train [1],
and the possibility to pretrain on much larger corpora gave these simpler
models an edge.

ELMo, BERT, and others (even though they differ quite a bit) are spiritual
successors of earlier neural language models that rely on newer techniques
(bidirectional LSTMs, convolutions over characters, transformers, etc.),
larger amounts of data, and the availability of _much_ faster hardware than
we had one or two decades ago (e.g. BERT was trained on 64 TPU chips, or as
Ed Grefenstette called it, _blowing through a forest's worth of GPU-time_).

Disclaimer: I have nothing against this work. I very much enjoyed the ELMo
paper. I am just objecting to all the hype/marketing out there.

[1] The skip-gram model with negative sampling is very similar to logistic
regression, except that one optimizes the parameters of two vectors (a word
vector and a context vector) rather than just one weight vector.
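
To make the footnote concrete, here is a minimal sketch (my own illustration,
assuming NumPy) of the per-pair skip-gram negative-sampling loss from
Mikolov et al., 2013:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_pair_loss(v_word, v_context, v_negatives):
        # One positive logistic term for the observed (word, context) pair...
        pos = -np.log(sigmoid(v_word @ v_context))
        # ...plus k negative logistic terms for sampled "noise" contexts.
        neg = -sum(np.log(sigmoid(-v_word @ v_neg)) for v_neg in v_negatives)
        # Structurally k+1 logistic-regression losses, except that the "input"
        # vector v_word is itself a trained parameter, not a fixed feature.
        return pos + neg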

~~~
andreyk
This blog post certainly is pretty flawed in its attribution of ideas - it
credits word2vec with being the first to introduce word vectors (which is...
very wrong).

"Word2Vec showed that we can use a vector (a list of numbers) to properly
represent words in a way that captures semantic or meaning-related
relationships (e.g. the ability to tell if words are similar, or opposites, or
that a pair of words like “Stockholm” and “Sweden” have the same relationship
between them as “Cairo” and “Egypt” have between them) as well as syntactic,
or grammar-based, relationships (e.g. the relationship between “had” and “has”
is the same as that between “was” and “is”)."

~~~
KasianFranks
True, we built NLP/NLU vector representations from the ground up at Lawrence
Berkeley National Lab from 2002 to 2008 to tackle hypothesis generation and
hidden-relationship detection connected to genes, genomic pathways, and
therapeutics related to extending human lifespan, DNA repair, and LET
radiation chromosomal damage repair
[https://www.google.com/patents/US7987191](https://www.google.com/patents/US7987191).
That was followed by Tomas Mikolov's/Google's work and preceded by countless
others, of course.

------
PaulHoule
Has anyone developed commercial applications based on word embeddings?

It's clear that people are putting up better and better numbers on certain
tasks that are widely shared, but for all I know these will always be a
bridesmaid and never a bride when it comes to being useful for something.

Back in the 1970s it was clear that it wasn't going to be easy to make rule-
based parsers that were "good enough", but it seems that now the task has been
defined down so that doing better than chance counts as a miracle. Thus
people can kid themselves into thinking they are practicing what Thomas Kuhn
called "normal science", since they are in the same shared reality even if it
is a delusion.

~~~
m_ke
Most modern NLP methods are built on top of word embeddings: things like
neural machine translation, text classification, etc. all convert input
words into word embeddings and then stack a neural network on top of them.
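
For instance, a minimal (hypothetical) PyTorch sketch of that recipe - an
embedding lookup, often initialized from pretrained word2vec/GloVe vectors,
with a network stacked on top:

    import torch.nn as nn

    class TextClassifier(nn.Module):
        def __init__(self, vocab_size=10_000, emb_dim=300, hidden=128, classes=2):
            super().__init__()
            # Token ids -> vectors; weights can be loaded from word2vec/GloVe.
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, classes)

        def forward(self, token_ids):            # (batch, seq_len) token ids
            vectors = self.embed(token_ids)      # (batch, seq_len, emb_dim)
            _, (h_n, _) = self.encoder(vectors)  # final hidden state
            return self.head(h_n[-1])            # (batch, classes) logits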

~~~
PaulHoule
People write papers about these things.

"Do they make commercial applications?" is the question.

The paradigm here is "Does method X get better answers than method Y?" as
opposed to "Will we embarrass ourselves if we put method Z into production?"

~~~
m_ke
All of those things are used daily by billions of people around the world.
Google sells their translation system through Google Cloud, and so do AWS
and a bunch of other NLP startups.

~~~
PaulHoule
Can you point me to a text analysis API that is not embarrassingly bad?

I have tried products from Amazon, IBM, Google and other companies and the one
thing they have in common is they've never passed an acceptance test for me.

Language translation is a particularly bad example: a "Clever Hans" effect
can happen, where the reader's desire and capacity for closure will fill in
for mistakes the system makes and make it look like it performs better than
it really does.

~~~
nl
I don’t use off the shelf APIs, but I’d be surprised if Google’s named entity
recogniser isn’t pretty good in the news or financial domain.

OpenCalais is pretty good in the financial domain.
[http://www.opencalais.com/opencalais-demo/](http://www.opencalais.com/opencalais-demo/)

IBM isn’t great. Don’t know about Amazon.

------
andreyk
See also "NLP's ImageNet moment has arrived"
([https://thegradient.pub/nlp-imagenet/](https://thegradient.pub/nlp-imagenet/))
by one of the researchers involved in the papers surveyed in this post.

~~~
danielbigham
Thanks for sharing this link -- found it quite a good read.

------
deytempo
It’s ironic that the study of language leads to the creation of a new one.

------
julienfr112
How is this related to fastText?

