
Language-Agnostic Bert Sentence Embedding - theafh
https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html
======
richdougherty
Paper: [https://arxiv.org/abs/2007.01852](https://arxiv.org/abs/2007.01852)

"We adapt multilingual BERT to produce language-agnostic sentence embeddings
for 109 languages. The state-of-the-art for numerous monolingual and
multilingual NLP tasks is masked language model (MLM) pretraining followed by
task specific fine-tuning. While English sentence embeddings have been
obtained by fine-tuning a pretrained BERT model, such models have not been
applied to multilingual sentence embeddings. Our model combines masked
language model (MLM) and translation language model (TLM) pretraining with a
translation ranking task using bi-directional dual encoders. The resulting
multilingual sentence embeddings improve average bi-text retrieval accuracy
over 112 languages to 83.7%, well above the 65.5% achieved by the prior state-
of-the-art on Tatoeba. Our sentence embeddings also establish new state-of-
the-art results on BUCC and UN bi-text retrieval."

(Found via
[https://tfhub.dev/google/LaBSE/1](https://tfhub.dev/google/LaBSE/1))
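The "bi-text retrieval" task in the abstract is essentially nearest-neighbor search over sentence embeddings: embed sentences from both languages, then match each source sentence to the target sentence with the highest cosine similarity. A minimal sketch of just that scoring step, using tiny made-up vectors in place of the real LaBSE outputs:

```python
import numpy as np

# Toy stand-ins for sentence embeddings: in practice each row would be the
# vector LaBSE produces for one sentence. These values are made up purely
# to illustrate the retrieval step.
en_emb = np.array([[0.9, 0.1, 0.0],    # "How are you?"
                   [0.0, 0.8, 0.2]])   # "The weather is nice."
es_emb = np.array([[0.1, 0.9, 0.1],    # "El clima es agradable."
                   [0.95, 0.05, 0.0]]) # "¿Cómo estás?"

def normalize(m):
    """L2-normalize each row so dot products become cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Bi-text retrieval: for each English sentence, pick the Spanish sentence
# with the highest cosine similarity.
sims = normalize(en_emb) @ normalize(es_emb).T
matches = sims.argmax(axis=1)
print(matches)  # [1 0] — en[0] pairs with es[1], en[1] with es[0]
```

The Tatoeba accuracy figures in the abstract are just the fraction of sentences for which this argmax picks the true translation.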

------
lacker
After all the attention OpenAI got for making the GPT-3 API semi-publicly-
available, I wonder if Google has considered making some of their research
models available via API. It would be pretty neat to be able to try these
things out, rather than just reading papers about them.

~~~
MiroF
You've actually got it backwards.

Releasing model weights has been common for a long time; it was OpenAI that
regressed by refusing to release its GPT models, and then releasing GPT-3
hidden behind a semi-public API. The Google BERT models are released
pre-trained in full (as mentioned in the article), so you can easily play
around with them on your own.

BERT isn't a forward LM like GPT, so it's a little less easy to play with for
the uninitiated - although there are papers showing text generation with BERT.

~~~
tmabraham
To be fair, IIRC OpenAI did release GPT-2-large after some community pressure,
and it was at least feasible for some people to train it from scratch. GPT-3
is too large: even if they released it, nobody apart from large companies like
Google could do anything with it. If anything, they've made GPT-3 more open
than it would have been had they just released the weights.

At least that's my understanding. Feel free to correct me if I'm wrong.

~~~
liuliu
I don't know. It has 175 billion parameters, thus about 650GiB for fp32
parameters; if these are bfloat16, it is a mere ~325GiB. We know CPU is about
50 times slower than GPU for transformer models, hence we are looking at 4 to
8 minutes per inference (loading parameters on demand from an SSD takes 200
seconds or so, and is probably the bottleneck here). If the parameters can be
loaded into memory (it seems you have to be on a >= 8-channel machine, with
buffered or unbuffered RAM), it will probably be 1 to 2 minutes per inference.

Still in the realm of doing inference in homelab territory (barely).
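The back-of-envelope numbers above are easy to check (the 175B parameter count is from the GPT-3 paper; the ~2 GB/s SSD read speed is an assumed figure for a typical NVMe drive):

```python
PARAMS = 175e9  # GPT-3 parameter count

fp32_gib = PARAMS * 4 / 2**30  # 4 bytes per fp32 weight
bf16_gib = PARAMS * 2 / 2**30  # 2 bytes per bfloat16 weight
print(f"fp32: {fp32_gib:.0f} GiB, bf16: {bf16_gib:.0f} GiB")
# fp32: 652 GiB, bf16: 326 GiB

# Streaming the bf16 weights from an NVMe SSD at an assumed ~2 GB/s:
ssd_seconds = PARAMS * 2 / 2e9
print(f"SSD load: ~{ssd_seconds:.0f} s")  # ~175 s, the "200 seconds or so" ballpark
```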

~~~
hansvm
Aside from having to slice and dice things a bit to fit in the accelerator's
memory and incurring a higher memory bandwidth cost in doing so, is there any
reason a home lab's GPU couldn't be used for GPT-3?

~~~
liuliu
It absolutely can. PCIe 3.0 x16 runs at around 16GB/s, hence uploading the
parameters at full speed would take around 45 seconds. That is probably the
bottleneck in this case, though. A minor issue is that NVIDIA consumer cards
(or any cards besides the new A100) don't support bfloat16, and I am uncertain
whether fp16 is sufficient for this model.

Edit: It probably is, considering you only need the weights to be fp16, while
intermediate layers can be fp32 and reuse that memory.
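The upload-time estimate follows directly from the link bandwidth (16 GB/s is the theoretical peak for PCIe 3.0 x16; real throughput is somewhat lower, so these are best-case figures):

```python
PARAMS = 175e9      # GPT-3 parameter count
PCIE3_X16 = 16e9    # bytes/s, theoretical peak for PCIe 3.0 x16

fp32_upload = PARAMS * 4 / PCIE3_X16
fp16_upload = PARAMS * 2 / PCIE3_X16
print(f"fp32: {fp32_upload:.0f} s, fp16: {fp16_upload:.0f} s")
# fp32: 44 s, fp16: 22 s
```

Halving the weight precision halves the transfer time, which is one more reason fp16 weights would help here.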

------
tasogare
Typical ML-based paper where everything is about machine learning and nothing
is about languages... The biggest issue is that nothing is said about how the
languages were selected. That's the basis when dealing with multilingual data,
as the features the languages share (or don't) matter and will impact the
results. In this respect the last figure is the most interesting one: it
mentions Ido and Interlingua, two constructed languages. Why do they get
included, when NLP is supposed to deal with natural languages? It's not a bad
choice per se, but the selection criteria need to be explained, rather than
just picking languages to get a large number into a research paper.

There is also no reflection on why some languages got better results than
others. Again looking at the last figure: among the top three best-performing
languages with no data, two are Sinitic languages (Cantonese and Wu), which
likely happens because of their closeness to standard Chinese, for which huge
resources are available for training. Conversely, Breton, for which there is
probably very little related-language data in the initial models and training
set besides Irish, gets very poor results, which tends to show the model
doesn't actually transfer very well, if at all...

~~~
mkasu
NLP is not my main field, but it is still relevant to my work because I often
use models and resources from NLP as tools. I'm also personally interested in
linguistics and languages, so I follow related news, sometimes attend NLP
conferences, and follow people in those fields on social media.

It is very concerning how little thought is usually put into linguistic or
language characteristics when dealing with these topics. I also rarely see
cultural considerations, etc. Basically everything is treated as "machine
learning will hopefully get this right given enough data", which is
unfortunate (ML is a great tool, but the conferences are about language
processing).

Another big issue I noticed is that a majority of research only targets or
evaluates English texts. In many cases the language is not even specified
(although it is clear from the figures or examples that they use English). I
have even heard people complain that work on non-English data is treated as
too minor by many reviewers, so it often just gets rejected.

I think this is a really weird development for a field which centers around
natural languages.

~~~
IfOnlyYouKnew
While I sort-of recognize the emotion you describe in myself, it cannot be
ignored that these ignoramuses are simply blowing "traditional" research out
of the water in terms of results. That's true across the board, from NLP to
image data to computational biology.

It's also a bit simplified to frame it as a bifurcation between "traditional"
linguists and AI experts entirely ignorant of the discipline. Long before the
current wave of AI started, Google liked to hire linguists and computational
scientists. These teams probably do have plenty of subject matter experts, but
for now they are reaping the low-hanging fruit of the suddenly-improved
generic methods. As the marginal improvements inevitably diminish, subject
matter expertise will become more salient again.

I'm a computational biologist by training, and have great appreciation for the
often beautiful algorithms, many created in the 70s or 80s and allowing then-
spectacular feats of tackling large datasets. Unfortunately, it isn't always
obvious how to transfer that knowledge to the new way of doing things.

~~~
mkasu
Yes, the apparent performance of (especially) neural models compared to
traditional models is probably the main factor. Although some voices [1] argue
that traditional or much simpler approaches still often do a similar job
compared to super over-engineered models, especially when going even slightly
beyond an existing target dataset or task.

I'd argue that improving the ML models is really the job of ML researchers and
should mainly target ML conferences like AAAI (Adv. of AI). In other
conferences (directly targeting NLP, CV, comp. biology, etc.), the main job
should be to combine those models with domain-specific characteristics (e.g.,
language information for NLP) or "traditional" methods to make for an
interesting discussion.

I was recently doing reviewing for a multimedia conference and quite a lot of
the papers I reviewed were basically pure ML papers. A colleague had the same
experience.

1: [https://arxiv.org/abs/1907.06902](https://arxiv.org/abs/1907.06902)

~~~
tasogare
The ML papers wouldn't bother me if they included specialists of the targeted
domain to address the obvious pitfalls. I've analyzed the figures in the blog
post and skimmed the paper, and both a novelty claim ( _(2) A single
massively multilingual model spanning 109 languages and showing cross-lingual
transfer even to zeroshot cases._ ) and an "explanation" ( _Such positive
language transfer across languages is only possible due to the massively
multilingual nature of LaBSE_ ) can be debunked just by looking carefully at
the figures, like I did in the past hour. The languages on which they test are
also poorly selected (6 constructed languages, one duplicate and one
macro-language), which shows a clear lack of attention to detail and a poor
understanding of some basic linguistics notions. But hey, it's an ML paper,
it's from Google and it has BERT in the title, so it will get attention and
will be cited even if it's half-crap.

