
Facebook releases 300-dimensional pretrained Fasttext vectors for 90 languages - sandGorgon
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
======
ninjin
This has the potential to be very, very useful and it is great that FB has
released them. Some potential caveats: I don't know how well fastText vectors
perform as features for downstream machine learning systems (if anyone knows
of work along these lines, I would be very happy to hear about it), unlike
word2vec [1] or GloVe [2] vectors, which have been used for a few years at
this point. Also, training only on Wikipedia gives the vectors less exposure
to "real world" text, unlike, say, word2vec, which was trained on the whole of
Google News back in the day, or GloVe, which used Common Crawl. Still, if you
need word vectors for a ton of languages this is looking like a great
resource, and it will save you the pre-processing and computational trouble of
producing them on your own.

[1]:
[https://code.google.com/archive/p/word2vec/](https://code.google.com/archive/p/word2vec/)

[2]:
[http://nlp.stanford.edu/projects/glove/](http://nlp.stanford.edu/projects/glove/)

~~~
versteegen
This isn't a real downstream task, but one of the researchers at RaRe compared
fastText to gensim's word2vec (skip-gram) embeddings on the original test sets
for the 'semantic' and 'syntactic' analogy tasks from the word2vec papers,
here:

[https://rare-technologies.com/fasttext-and-gensim-word-embeddings/](https://rare-technologies.com/fasttext-and-gensim-word-embeddings/)

The conclusion:

    
    
       These preliminary results seem to indicate fastText embeddings are
       significantly better than word2vec at encoding syntactic information. This is
       expected, since most syntactic analogies are morphology based, and the char
       n-gram approach of fastText takes such information into account. The original
       word2vec model seems to perform better on semantic tasks, since words in
       semantic analogies are unrelated to their char n-grams, and the added
       information from irrelevant char n-grams worsens the embeddings.
    

Personally I think those analogy test sets are not very good, because they
just test all pairs of relations between a very small number of words from
very limited domains (like capital and country names).

One advantage of FastText should be better learning on small amounts of data
like Wikipedia.

~~~
dzdt
It takes a certain kind of perspective for Wikipedia to be called a "small
amount of data." English Wikipedia alone would run to about 2500 print
volumes. Imagine telling an AI researcher from 1995 that that was "small".

~~~
versteegen
I admit, I was being funny by being unclear. For word-embeddings English
Wikipedia is a moderate-large dataset at 58GB uncompressed (13GB compressed).
But most of those other language wikis really are tiny. Welsh is just 67MB
compressed, and there are plenty of languages more obscure than that on the
list. The point of word2vec was to make use of as much data as possible by
being as fast as possible (processing billions of words an hour) rather than
clever, so it would be impressive if fastText vectors for those wikis were at
all useful.

------
atrudeau
Evaluation of [https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip)
(which unzips to wiki.en.vec) on word similarity tasks (all numbers are
Spearman rank correlations):

    WS-353 Similarity:   0.781
    WS-353 Relatedness:  0.682
    MEN:                 0.765
    MTurk:               0.679
    RW:                  0.487
    SimLex:              0.380
    MC:                  0.812
    RG:                  0.800
    SCWS:                0.667

Impressive for a model trained on Wikipedia alone!

I will post analogy scores for this model as soon as they are done computing.

~~~
atrudeau
Google Analogy Task results (detailed results at
[http://pastebin.com/PF96nMfX](http://pastebin.com/PF96nMfX)):

Semantic accuracy: 63.84 %

Syntactic accuracy: 67.00 %

Here performance is not great (great would be >80% on semantic and >70% on
syntactic).

As this task requires nearest-neighbor lookups, performance is affected by
vocabulary size. Models trained on Wikipedia alone usually limit the
vocabulary to roughly 300k words, so to get scores comparable to those posted
in the GloVe [1] and LexVec [2] papers we can use only the first 300k words of
the pre-trained vectors, which gives the following results:

Semantic accuracy: 77.75 %

Syntactic accuracy: 72.55 %

Impressive stuff!

[1] [http://nlp.stanford.edu/pubs/glove.pdf](http://nlp.stanford.edu/pubs/glove.pdf) - [https://github.com/stanfordnlp/GloVe](https://github.com/stanfordnlp/GloVe)

[2] [https://arxiv.org/pdf/1606.01283v1](https://arxiv.org/pdf/1606.01283v1) - [https://github.com/alexandres/lexvec](https://github.com/alexandres/lexvec)
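
If you want to reproduce the truncated-vocabulary comparison, here is a
minimal sketch using gensim. Assumptions: the questions-words.txt analogy file
from the original word2vec release is on disk, and your gensim version still
exposes this as accuracy() (newer versions renamed it evaluate_word_analogies):

    from gensim.models import KeyedVectors

    # The .vec file is frequency-sorted, so `limit` keeps only the 300k most
    # frequent words, matching the truncation described above.
    wv = KeyedVectors.load_word2vec_format('wiki.en.vec', limit=300000)

    # Evaluate on the Google analogy task, section by section
    sections = wv.accuracy('questions-words.txt')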

------
sandGorgon
One of the biggest things I see in this release is trained vectors for Asian
languages: Hindi, Kannada, Telugu, Urdu, etc.

This is huge, because most other releases have traditionally covered European
languages; it is fairly rare to see Asian languages released.

One challenge is that typical linguistic use in Asia mixes the native language
with English. For example, people in north India use "Hinglish". It is
typically fairly hard to make sense of this.

~~~
kuschku
It's already a challenge to find stuff in German; it's annoying just how much
research focuses only on English.

~~~
lkozma
According to Ethnologue (2005), by number of speakers:

Hindi -- native: 370m | total: 490m

Bengali -- native: 196m | total: 215m

German -- native: 101m | total: 229m

~~~
kuschku
I'm sure you know yourself that number of speakers is rarely the metric used
for choosing a target market; more commonly, products are prioritized by the
potential revenue to be made (which scales with GDP per capita).

Yet most projects target only the Anglosphere; usually not even Europe is
included.

~~~
danmaz74
Problem is that "Europe" means so many different languages... which is also
our biggest remaining obstacle in trying to launch even just web products here
(as compared to launching "in the Anglosphere").

------
Radim
FYI: you can now use fastText directly from gensim (Python) [1]. This allows
you to easily test and compare fastText against other popular embeddings, such
as word2vec, doc2vec or GloVe.

[1] [https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/wrappers/fasttext.py](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/wrappers/fasttext.py)
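
A minimal sketch of loading one of these pre-trained models through that
wrapper (assuming the wrapper API as of the linked file; later gensim versions
moved fastText support to gensim.models.FastText, and path conventions vary):

    from gensim.models.wrappers import FastText

    # Expects the released wiki.en.bin (and wiki.en.vec) in the working directory
    model = FastText.load_fasttext_format('wiki.en')

    print(model.most_similar('language', topn=5))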

------
minimaxir
Played a bit with fastText months ago. The issue I had with it is that, unlike
with CNNs/RNNs, the _relative position_ of a word doesn't matter as much (only
as part of a context window during training of the embeddings), so results can
be worse depending on the case. However, for CPU workloads fastText is
certainly faster, especially since subword information is also incorporated.

I have Python code to process text files into a fastText-friendly format, so I
may clean that up and see how these pre-trained embeddings work.

(Although the English embeddings are 10.36 GB; that might be a tough pill to
swallow for training on machines with only 8 GB of RAM.)

~~~
exgrv
Regarding the size of the word vector files: the text files are sorted by
frequency, so it is easy to load only the top k words.

We might also release smaller models in the future, for training on machines
without large amounts of memory.
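
For those who would rather not pull in a library, a minimal sketch of that
top-k loading (the .vec format is a `vocab_size dim` header line followed by
one `word v1 ... v300` line per word):

    import numpy as np

    def load_top_k(path, k):
        # Read the first k entries of a frequency-sorted .vec file
        words, rows = [], []
        with open(path, encoding='utf-8') as f:
            n, dim = map(int, f.readline().split())
            for _ in range(min(k, n)):
                parts = f.readline().rstrip().split(' ')
                words.append(parts[0])
                rows.append(np.array(parts[1:], dtype=np.float32))
        return words, np.vstack(rows)

    words, vectors = load_top_k('wiki.en.vec', 200000)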

~~~
morenoh149
FWIW I have 32 GB on my workstation and my personal laptop is maxed out at
16 GB. Keeping within these thresholds may be useful to others.

------
GrantS
Does anyone know if the languages all live in the _same_ 300-dimensional
space, or are they each trained independently? (i.e. do words and their
translations have similar vectors?)

~~~
exgrv
Models are trained independently for each language. So unfortunately, you
cannot directly compare words from different languages using these vectors.

If you have a bilingual dictionary, you might try to learn a linear mapping
from one language to the other (e.g. see
[https://arxiv.org/abs/1309.4168](https://arxiv.org/abs/1309.4168) for this
approach).
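
A minimal sketch of that approach (the translation-matrix method from the
linked paper), with random placeholders standing in for the real
dictionary-pair vectors:

    import numpy as np

    # In practice, row i of X is the source-language vector and row i of Y the
    # target-language vector for the i-th pair in the bilingual dictionary.
    X = np.random.randn(5000, 300)
    Y = np.random.randn(5000, 300)

    # Learn the linear map W minimizing ||XW - Y||^2
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

    # Map a source vector into the target space; translating is then a
    # nearest-neighbor lookup among the target-language vectors.
    y_hat = X[0] @ W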

------
rihegher
Can anyone point me to any articles on what can be achieved with this and how?

[EDIT] Answering my own question here:
[https://news.ycombinator.com/item?id=12226988](https://news.ycombinator.com/item?id=12226988)

~~~
rspeer
Words with similar vectors have similar meanings. You can use this for search,
sentiment analysis, topic detection, finding similar text, and classification.

Of course there are tests for this "words with similar vectors have similar
meanings" property, and I'm finding that the fastText vectors aren't doing
that well on them, especially outside of English.
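
For reference, those tests mostly take this shape: a minimal sketch, assuming
a header-less tab-separated file of human-rated word pairs such as WordSim-353
and a dict-like `wv` mapping words to numpy vectors:

    import numpy as np
    from scipy.stats import spearmanr

    def evaluate_similarity(pairs_path, wv):
        # Spearman correlation between human ratings and cosine similarities
        human, model = [], []
        with open(pairs_path, encoding='utf-8') as f:
            for line in f:
                w1, w2, score = line.split('\t')
                if w1 in wv and w2 in wv:
                    a, b = wv[w1], wv[w2]
                    human.append(float(score))
                    model.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        return spearmanr(human, model).correlation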

I'm glad they released them, particularly so anyone can run a fair comparison
between different word vectors, but these things should come with evaluation
results the same way that code should come with tests. These vectors are
performing worse than ones that my company Luminoso released last year [1] (a
better, later post is [2]), and if you don't believe me plugging my own
vectors, I know that Sapienza University of Rome also has better vectors
called NASARI [3].

fastText covers more languages, but most of these languages have no
evaluations. How do you know the Basque vectors aren't just random numbers?

I think that performance hits a plateau when the vectors only come from text,
with no relational knowledge, and especially when that text is only from
Wikipedia. Text exists that isn't written like an encyclopedia. Meanings exist
that aren't obvious from context. My research at Luminoso involves adding
information from the ConceptNet knowledge graph, producing a word vector set
called "ConceptNet Numberbatch" that just won against other systems in SemEval
[4], a simultaneous, blind evaluation. The NASARI vectors are also based on a
knowledge graph.

[1] [https://blog.conceptnet.io/2016/05/19/an-introduction-to-the-conceptnet-vector-ensemble/](https://blog.conceptnet.io/2016/05/19/an-introduction-to-the-conceptnet-vector-ensemble/) -- linking this to establish the date

[2] [https://blog.conceptnet.io/2016/11/03/conceptnet-5-5-and-conceptnet-io/](https://blog.conceptnet.io/2016/11/03/conceptnet-5-5-and-conceptnet-io/)

[3] [http://lcl.uniroma1.it/nasari/](http://lcl.uniroma1.it/nasari/)

[4] [http://alt.qcri.org/semeval2017/task2/](http://alt.qcri.org/semeval2017/task2/)

~~~
elyase
We evaluated using ConceptNet Numberbatch but in the end went with fastText
because of its treatment of OOV words using subword information. This is
important for us because we work with social media, where misspellings are
very frequent, and we have found this helps a lot. Are you also looking into
this sort of enhancement? How do you usually deal with OOV words?

~~~
rspeer
Very good question!

Our OOV strategy was pretty important in SemEval. The first line of defense --
so fundamental to Numberbatch that I don't even think of it as OOV -- is to
see if the term exists in ConceptNet but with too low a degree to make it into
the matrix. In that case, we average the vectors from its neighbors in the
graph that are in the matrix.
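
A minimal sketch of that fallback (`neighbors` here is a hypothetical stand-in
for a ConceptNet graph lookup, and `wv` a dict-like term-to-vector mapping):

    import numpy as np

    def oov_vector(term, neighbors, wv):
        # Average the vectors of the term's graph neighbors that made it into
        # the matrix; None if none of them did.
        known = [wv[n] for n in neighbors(term) if n in wv]
        return np.mean(known, axis=0) if known else None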

For handling words that are truly OOV for ConceptNet, we ended up using a
simple strategy of matching prefixes of the word against known words (and also
checking whether a word that's supposed to be in a different language was
known in English).

fastText's sub-word strategy, which is learned along with the vocabulary
instead of after the fact, is indeed a benefit they have. But am I right that
the sub-word information isn't present in these vectors they released?
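
For context, fastText represents a word as the sum of vectors for its
character n-grams (plus the word itself), with boundary markers added. A
minimal sketch of the n-gram extraction, assuming the paper's default range of
3 to 6:

    def char_ngrams(word, n_min=3, n_max=6):
        # Character n-grams as in fastText, with '<' and '>' boundary markers
        w = '<' + word + '>'
        return [w[i:i + n]
                for n in range(n_min, n_max + 1)
                for i in range(len(w) - n + 1)]

    print(char_ngrams('where'))  # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...]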

There's a paper on the SemEval results that just needs to be reviewed by the
other participants, and I'm also working on a blog update about it.

------
z3t4
It would also be nice to have a guide on how to use them.

------
Y_Y
Disappointing to see Facebook, a company with a huge Irish presence, neglect
Irish in the list. There are plenty of minority languages in there, like
Breton and Scots, and languages not spoken natively anywhere, like Latin,
Esperanto and Volapük.

Google Translate (with a similar number of languages) supports it fine.

~~~
exgrv
Hi, because we trained these vectors on Wikipedia, we released models
corresponding to the 90 largest Wikipedias first (in terms of training data
size). More models are on the way, including Irish.

~~~
Y_Y
I suspected it was something like this. Unfortunately the Vicipéid is not of
very high quality. I just hope Facebook doesn't forget which side its bread is
buttered on.

------
denzil_correa
Can someone explain WHY word vectors with similar contexts cluster together
and work well? One of the papers suggests that "we don't really know" (Section
4).

[0] [https://arxiv.org/abs/1402.3722](https://arxiv.org/abs/1402.3722)

~~~
Houshalter
Consider the classic example of king and queen. "Queen" will tend to occur
near "female" words, like "she", "her", "woman", etc. And vice versa for
"king". But both words will tend to occur near words talking about royalty,
e.g. "castle", "crown", "ruler", etc.

So you can learn a decent amount of information about a word just by looking
at the words around it. This is the same thing we teach kids learning to read
with "context clues". If I talk about bolgorovs and how delicious they are,
and how they are ripe and sweet, etc., you can probably guess that "bolgorovs"
are a fruit, just from the context.
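
The standard demonstration of this in code (a minimal sketch using gensim; the
exact neighbors and scores depend on the training corpus):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format('wiki.en.vec', limit=200000)

    # king - man + woman ~= queen
    print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))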

~~~
saurik
Sometimes, context can be ripe for the picking while simultaneously leaving
your subject rotting on the vine.

Regardless: I think you answered a question more about "what" than "why"; I
know what these models are, and how they are used, but it is still somewhat
surprising to me that you can get as good results as people claim without a
nearly infinite-dimensional space. (And honestly I kind of wonder if this
might be more a model of "the kinds of questions humans most often like to ask
about words when confronted with a word for the first time or wanting to test
a data set" than "a true understanding of what is being said encoded as
vectors", allowing things like "have opposite gender" to be extremely
functional vectors but probably leaving _much more important_ concepts like
"are classic opponents in war" on the floor, which is really important
semantic information that isn't necessarily transitive and might not even be
commutative; a thought process that seems to align with the complaints about
word2vec from ConceptNet.)

Put differently: I bet the set of 300 axes is actually a more useful result
(though one that is more opaque, and I don't hear much about attempts to
analyze it; but I am currently not in this field and haven't been paying
attention to the literature) than the actual vector mapping (which is what
people always seem excited about). I would love to see more talk of "what
questions are these models weirdly good at answering, versus questions where
they seem so limited as to almost be useless".

~~~
webmaven
_> it is still somewhat surprising to me that you can get as good results as
people claim without a nearly infinitely dimensional space_

As I understand it, the _maximum_ number of dimensions required is equal to
the number of words. That is, if you did no dimensional reduction, you have a
vector that expresses exactly how close occurrences of the word in question
are on average to occurrences of each and every other word.

That's a very large number of dimensions, but hardly infinite.

Reducing the number of dimensions turns "distance from every other word" into
"distance from abstract concepts", except that "abstract concept" is
overstating the case, as the "concepts" aren't features of human cognition
per se, except to the extent that those features are reflected statistically
in the corpus that was used. Besides, the choice of the _number_ of dimensions
to reduce to is somewhat arbitrary, and no one knows right now what the
"correct" number is, or even if there is a "correct" number. I'm not even sure
whether much work has been done on the sensitivity of the models to the number
of dimensions.
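
A minimal sketch of that kind of reduction, LSA-style (truncated SVD over a
co-occurrence matrix; the matrix here is a random placeholder):

    import numpy as np

    vocab = 1000                      # placeholder vocabulary size
    C = np.random.rand(vocab, vocab)  # stand-in for a co-occurrence matrix

    # Keep only the k strongest directions of variation
    k = 300
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    embeddings = U[:, :k] * S[:k]     # one k-dimensional row per word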

There is probably a _lot_ of productive work to be done on dimensionality
reduction techniques that make the reduced dimensions map better to
abstractions that a human would recognize, at least faintly, as well as work
to create corpora that better sample the full range of human expression in as
compact a size as possible.

------
sorenvrist
I tried using the Danish .bin with fastText predict and a single Danish
sentence, but I keep getting an assert error from vector.cpp, around A._m not
equal to _m. Am I doing something wrong?

    ./fastText predict wiki.da.bin fileWithASingleLine 1

~~~
exgrv
These models were trained in an unsupervised way, and thus cannot be used with
the "predict" mode of fastText.

The .bin models can be used to generate word vectors for out-of-vocabulary
words:

    
    
      > echo 'list of words' | ./fasttext print-vectors model.bin
    

or

    
    
      > ./fasttext print-vectors model.bin < queries.txt
    

where queries.txt is a list of words you want a vector representation for.

------
saip
If you want to try out fastText without having to do any local setup, see
[https://github.com/floydhub/fastText](https://github.com/floydhub/fastText).

FloydHub [1] is a deep learning PaaS for training and deploying DL models in
the cloud with zero setup.

[1] [https://www.floydhub.com](https://www.floydhub.com)

Disclaimer: I am one of Floyd's co-creators.

------
ma2rten
It doesn't say what data these were trained on, which is kind of important
information. I previously applied word vectors that were trained on news text
to social media posts and it didn't work well at all. Also, I don't think
there is a language called "Western".

Also interesting that this is hosted on S3.

~~~
exgrv
These models were trained on Wikipedia.

It should be "Western Frisian" instead of "Western"
([https://en.wikipedia.org/wiki/West_Frisian_language](https://en.wikipedia.org/wiki/West_Frisian_language)).
Thanks for the catch!

~~~
ma2rten
Thanks.

------
turtles
Can someone please ELI5 why this is good, and what they can be used for? I'm
assuming machine learning...

~~~
cschmidt
I haven't done anything with fastText, but I have with word2vec. It embeds
each word in a 300-dimensional vector such that similar words have a large
cosine similarity. (If you normalize each vector to unit norm, then cosine
similarity is just a dot product.) So in short, it gives you a measure of how
similar each word is to other words.
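
In code, that normalization point is just the following (a minimal sketch with
random vectors):

    import numpy as np

    def cosine(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    u, v = np.random.randn(300), np.random.randn(300)

    # After scaling to unit norm, cosine similarity reduces to a dot product
    u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
    assert np.isclose(cosine(u, v), u_hat @ v_hat)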

This has many uses in machine learning. You can extend it to documents to find
similar documents, find misspellings, use the vectors as features in an ML
model, etc.

There haven't been good vectors in that many languages (that I know of), so
that's a plus for these fastText vectors.

~~~
turtles
ah. Thanks!

------
godmodus
Is this any use for sentiment analysis and plagiarism detection? I might give
it a go after I'm done with my current projects.

~~~
ovi256
This will enable writing a plagiarism detector which will not be fooled by the
simple strategy of replacing words with their synonyms. Given that synonyms
have very similar embeddings, you can compute a distance between two phrases
by computing the distance between their word embeddings. And that's just what
comes to mind right now.
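
A minimal sketch of that phrase comparison, using the mean of the word vectors
as a crude phrase vector (`wv` is a hypothetical word-to-vector mapping;
assumes each phrase contains at least one known word):

    import numpy as np

    def phrase_distance(a, b, wv):
        # Cosine distance between the mean word vectors of two phrases
        va = np.mean([wv[w] for w in a.split() if w in wv], axis=0)
        vb = np.mean([wv[w] for w in b.split() if w in wv], axis=0)
        return 1 - (va @ vb) / (np.linalg.norm(va) * np.linalg.norm(vb))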

For sentiment detection, I could see a similar experiment to [1] working, but
instead of discriminating between newsgroups, you classify sentiment.

[1] [https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)

~~~
webmaven
_> This will enable writing a plagiarism detector which will not be fooled by
the simple strategy of replacing words with their synonyms._

I'm not sure how helpful that will be, as you may end up with a system that
flags whenever a student expresses similar thoughts in their own words (and
let's face it, the educational system is all about getting students to conform
to conventional patterns of thinking).

And if the system _doesn't_ detect re-expression of the same ideas, then a
system that automatically rewrites essays in a slightly different style
(essentially, English-to-English neural machine translation) will defeat it.

The endgame would be grading student essays on how well they express _an
entirely original idea_, which is an unreasonable standard.

------
Houshalter
How does this compare to ConceptNet Numberbatch?
[https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch-a-new-name-for-the-best-word-embeddings-you-can-download/](https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch-a-new-name-for-the-best-word-embeddings-you-can-download/)

------
web64
On their Facebook page [1] they said they are planning to release models for
294 languages very soon.

[1]
[https://www.facebook.com/groups/1174547215919768/](https://www.facebook.com/groups/1174547215919768/)

------
VMG
What is this?

------
torrent-of-ions
What is the point of using GitHub for something like this?

~~~
imron
It's part of the fastText project, which is hosted on GitHub.

The files themselves appear to be hosted on S3.

