Facebook releases 300-dimensional pretrained Fasttext vectors for 90 languages (github.com)
364 points by sandGorgon on Mar 2, 2017 | 70 comments

This has the potential to be very very useful and it is great that FB has released them. Some potential caveats. I don't know how well Fasttext vectors perform as features for downstream machine learning systems (if anyone knows of work along these lines, I would be very happy to know about it), unlike word2vec [1] or GloVe [2] vectors that have been used for a few years at this point. Also, only having trained on Wikipedia gives the vectors less exposure to "real world" text, unlike say word2vec that was trained on the whole of Google News back in the day or GloVe that used Common Crawl. Still, if you need word vectors for a ton of languages this is looking like a great resource and will save you the pre-processing and computational troubles of having to produce them on your own.

[1]: https://code.google.com/archive/p/word2vec/

[2]: http://nlp.stanford.edu/projects/glove/

This isn't a real downstream task, but one of the researchers at RaRe compared FastText to word2vec/gensim/skipgram word embeddings on the original testsets for the 'semantic' and 'syntactic' analogy tasks from the word2vec papers here:


The conclusion:

   These preliminary results seem to indicate fastText embeddings are
   significantly better than word2vec at encoding syntactic information. This is
   expected, since most syntactic analogies are morphology based, and the char
   n-gram approach of fastText takes such information into account. The original
   word2vec model seems to perform better on semantic tasks, since words in
   semantic analogies are unrelated to their char n-grams, and the added
   information from irrelevant char n-grams worsens the embeddings.
Personally I think those analogy testsets are not very good, because they just test all pairs of relations between a very small number of words from very limited domains (like capital and country names).

One advantage of FastText should be better learning on small amounts of data like Wikipedia.

Thank you for the link and I agree with your scepticism towards the analogy tasks. I have had a beef with intrinsic evaluation for years, but never found the time to pursue this line of research. However, there is a consensus at this point among many NLP practitioners that they are flaky at best. There is also research indicating that this is very much the case [1].

[1]: http://www.aclweb.org/anthology/W/W16/W16-2501.pdf

It takes a certain kind of perspective for wikipedia to be called a "small amount of data." English wikipedia alone would run to about 2500 print volumes. Imagine telling an AI researcher from 1995 that that was "small".

I admit, I was being funny by being unclear. For word-embeddings English Wikipedia is a moderate-large dataset at 58GB uncompressed (13GB compressed). But most of those other language wikis really are tiny. Welsh is just 67MB compressed, and there are plenty of languages more obscure than that on the list. The point of word2vec was to make use of as much data as possible by being as fast as possible (processing billions of words an hour) rather than clever, so it would be impressive if fastText vectors for those wikis were at all useful.

Not an expert but I would guess that training on Wikipedia gives you broader coverage than training on just news. Wikipedia's whole point is to cover all kinds of topics while only some topics are usually considered news-worthy. I would also think that news language tends to be more formalized than Wikipedia's language simply because Wikipedia contributors are not trained and come from a more varied population.

Edit: Unfortunately, it's becoming more and more common that people here express disagreement using downvotes. I guess that's a sign that HN is finally going down the drain like all similar platforms.

I agree, you don't deserve downvotes for what you said. Your concerns are legitimate, it is just that quantifying the impact of changes in data domain is difficult. My experience is that my own vectors trained only on Wikipedia have performed worse for a number of tasks, but your mileage may of course vary. Also agreeing that it is sad to see this forum head this way, we can be better than this.

Evaluation of https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.... (unzips wiki.en.vec) on word similarity tasks (all numbers are Spearman rank correlation):

WS-353 Similarity: 0.781

WS-353 Relatedness: 0.682

MEN: 0.765

MTurk: 0.679

RW: 0.487

SimLex: 0.380

MC: 0.812

RG: 0.800

SCWS: 0.667

Impressive for a model trained on Wikipedia alone!

I will post analogy scores for this model as soon as they are done computing.
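For anyone who wants to reproduce this kind of number: a minimal sketch of how a word similarity score is usually computed, assuming a dict of word vectors and a list of human-rated word pairs. The function names and the toy data are made up for illustration, and this is not the evaluation script used for the figures above. Pairs containing an out-of-vocabulary word are simply skipped here, which is one common convention but not the only one.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def evaluate(vectors, gold_pairs):
    """gold_pairs holds (word1, word2, human_score); OOV pairs are skipped."""
    model_scores, human_scores = [], []
    for w1, w2, score in gold_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)

# Toy data, invented for this example.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.2, 0.8],
}
toy_gold = [
    ("cat", "dog", 9.0),
    ("car", "truck", 8.5),
    ("cat", "car", 1.0),
    ("dog", "truck", 2.0),
    ("cat", "zebra", 3.0),  # "zebra" is OOV, so this pair is skipped
]
rho = evaluate(toy_vectors, toy_gold)
```

Only the ordering of the scores matters for Spearman, which is why the absolute cosine values do not need to be on the same scale as the human ratings.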

Google Analogy Task results ( detailed results at http://pastebin.com/PF96nMfX ):

Semantic accuracy: 63.84%

Syntactic accuracy: 67.00%

Here performance is not great (great would be >80% on semantic and >70% on syntactic).

As this task requires nearest neighbor lookups, performance is impacted by vocabulary size. Since models trained using Wikipedia alone usually limit vocabulary to something ~300k words, we can try that to get scores which are comparable to those posted by the GloVe [1] and LexVec [2] papers by only using the first 300k words in the pre-trained vectors, giving the following results:

Semantic accuracy: 77.75%

Syntactic accuracy: 72.55%

Impressive stuff!

[1] http://nlp.stanford.edu/pubs/glove.pdf - https://github.com/stanfordnlp/GloVe

[2] https://arxiv.org/pdf/1606.01283v1 - https://github.com/alexandres/lexvec

One of the biggest things that I see with this release is trained vectors for Asian languages - Hindi, Kannada, Telugu, Urdu, etc.

This is huge - because most other releases have traditionally been in European languages. It is fairly rare to see releases for Asian languages.

One challenge is that typical linguistic use in Asia mixes the native language with English. For example, people in north India use "Hinglish". It is typically fairly hard to make sense out of this.

It's already a challenge to find stuff in German, it's annoying just how much research only focuses on English.

According to Ethnologue (2005), by number of speakers:

Hindi -- native: 370m | total: 490m

Bengali -- native: 196m | total: 215m

German -- native: 101m | total: 229m

I’m sure you know yourself that number of speakers is rarely the metric used for choosing a target market, and more commonly products are launched by the potential revenue to be made (which scales with GDP per capita).

Yet, most projects only target the Anglosphere, not even Europe is usually included.

Problem is that "Europe" means so many different languages... which is also our biggest remaining obstacle in trying to launch even just web products here (as compared to launching "in the Anglosphere").

To be fair, the Wikipedia sites for these languages barely have any content. Of those that do exist most articles are essentially 2-line blurbs. These languages are essentially dead for all practical purposes.

Calling a language with hundreds of millions of speakers "dead" because their wikipedia articles aren't fleshed out seems a bit hyperbolic

I'm not going to water down my analysis for some two bit politically correct occidentals and "hyper-nationalist" orientals.

A large fraction of Indian children will be illiterate in these "thriving" languages in the coming decades, with "zero" monetary loss. Not living is death, and they haven't been alive for a long time - I can neither get any Govt. services in my mother tongues, nor can get the laws done by the colonial state in Delhi.

A zombie is not alive IMO. If Indians wish to parlay their tongues for money, it's up to them. I for one will not deceive myself. We are going to be part of the borg that is the Anglosphere in less than 3-4 generations.

Was the two bit politically correct occidental supposed to be me? Not sure I've ever had the pleasure of being called that before.

Regardless, dead has a specific meaning for languages, and if people speak it (especially hundreds of millions of people) then it is not dead.

Someone who says "essentially dead for all practical purposes" is likely not using the specific technical meaning.

Do you think gnipgnip believes something that isn't true, or do you think they're just using words differently to how you'd use them?

gnipgnip is, first of all, projecting. According to his analysis, Indians will stop speaking local languages and adopt English, and in 80 years it's all over: India is an English-speaking country. The flaw in that logic is that Indians do entertain themselves in their native languages. Yes, many languages spoken by smaller groups of people are certainly under threat, but languages with currently massive footprints like Hindi, Telugu, Bengali etc. ain't going anywhere 80 years from now.

A language that is spoken by 80 million people is not dead, and is not dying. The colonial two-step when it comes to languages has existed for a long time, and it's just a matter of time before people, with the help of machine learning, provide support in multiple languages etc.

For a large part of the Mughal period the court language was Farsi, but Hindi/Urdu survived. You are underestimating the ability of people to straddle multiple languages. For many Indians it's just a necessity.

Language shift occurs when mothers stop talking to their infants in the language. Not being able to access government services in the language may indicate a decline in the language's vitality, in its prestige and influence, but it is a long, long way from language "death".

FYI: you can now use fastText directly from gensim (Python) [1]. This allows you to easily test / compare fastText to other popular embeddings, such as word2vec, doc2vec or GloVe.

[1] https://github.com/RaRe-Technologies/gensim/blob/develop/gen...

Played a bit with fastText months ago. The issue I had with it is that unlike CNNs/RNNs, the relative position of a word doesn't matter as much (only as a part of a context window during training the embeds), and so results can be worse depending on the case. However, for CPU-workloads, fastText is certainly faster, especially since subword information is also incorporated.

I have Python code to process text files into a fastText-friendly format so I may clean that up and see how these pre-trained embeds work.

(although, the English embeds are 10.36 GB; that might be a tough pill to swallow for training on machines with only 8 GB of RAM)

Regarding the size of the word vectors files: the text files are sorted by frequency, so it is possible to easily load the top k words only.
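A minimal sketch of that top-k loading, assuming the usual .vec text layout: a header line with the vocabulary size and dimension, then one word per line followed by its vector components. The function name `load_top_k` is made up for this example, and the demo reads from an in-memory string rather than the real file.

```python
import io

def load_top_k(lines, k):
    """Read at most k word vectors from an iterable of .vec-format lines."""
    lines = iter(lines)
    header = next(lines).split()  # "<vocab_size> <dimension>"
    dim = int(header[1])
    vectors = {}
    for line in lines:
        if len(vectors) >= k:
            break  # the file is frequency-sorted, so we can stop early
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines instead of failing
        vectors[word] = [float(v) for v in values]
    return vectors

# Tiny stand-in for a real file, invented for this example.
sample = io.StringIO(
    "4 3\n"
    "the 0.1 0.2 0.3\n"
    "of 0.4 0.5 0.6\n"
    "and 0.7 0.8 0.9\n"
    "in 1.0 1.1 1.2\n"
)
top2 = load_top_k(sample, 2)

# With the real file this would look like:
# with open("wiki.en.vec", encoding="utf-8") as f:
#     vectors = load_top_k(f, 300000)
```

Because the loop stops after k words, the rest of the file is never read, so memory use scales with k rather than with the full vocabulary.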

We might also release smaller models in the future, for training on machines without large memory.

fwiw I have 32gb on my workstation and my personal laptop is maxed out at 16gb. Keeping within these thresholds may be useful to others.

Does anyone know if the languages all live in the same 300-dimensional space, or are they each trained independently? (i.e. do words and their translations have similar vectors?)

Models are trained independently for each language. So unfortunately, you cannot directly compare words from different languages using these vectors.

If you have a bilingual dictionary, you might try to learn a linear mapping from one language to the other (e.g. see https://arxiv.org/abs/1309.4168 for this approach).
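A rough sketch of the linear-mapping idea, assuming NumPy and two row-aligned matrices holding the source and target vectors of the dictionary pairs. It solves the problem with closed-form least squares rather than the stochastic gradient descent described in the paper, and `learn_mapping` and `translate` are names invented for this example. The data is synthetic, so it only shows the mechanics, not how well this works on real embeddings.

```python
import numpy as np

def learn_mapping(source_vecs, target_vecs):
    """Least-squares W such that source_vecs @ W approximates target_vecs."""
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W

def translate(vec, W, target_words, target_matrix):
    """Map vec into the target space and return the nearest word by cosine."""
    mapped = vec @ W
    norms = np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(mapped)
    sims = (target_matrix @ mapped) / norms
    return target_words[int(np.argmax(sims))]

# Synthetic "dictionary": 20 word pairs in a 4-dimensional space,
# where the target space is an exact linear transform of the source.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))        # source-language vectors
W_true = rng.normal(size=(4, 4))    # hidden transform (unknown in practice)
Z = X @ W_true                      # target-language vectors
words = [f"word{i}" for i in range(20)]

W = learn_mapping(X, Z)
guess = translate(X[3], W, words, Z)
```

On real vectors the relationship is only approximately linear, so the nearest neighbour of the mapped vector is a candidate translation, not a guaranteed one.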

Since it doesn't mention that, I would assume they are in different spaces, trained independently.

Can anyone point me to any articles on what can be achieved with this and how?

[EDIT] I reply to myself here: https://news.ycombinator.com/item?id=12226988

Words with similar vectors have similar meanings. You use this in search, sentiment analysis, topic detection, finding similar text, and classification.

Of course there are tests for this "words with similar vectors have similar meanings" property, and I'm finding that the fastText vectors aren't doing that well on them, especially outside of English.

I'm glad they released them, particularly so anyone can run a fair comparison between different word vectors, but these things should come with evaluation results the same way that code should come with tests. These vectors are performing worse than ones that my company Luminoso released last year [1] (a better, later post is [2]), and if you don't believe me plugging my own vectors, I know that Sapienza University of Rome also has better vectors called NASARI [3].

fastText covers more languages, but most of these languages have no evaluations. How do you know the Basque vectors aren't just random numbers?

I think that performance hits a plateau when the vectors only come from text, with no relational knowledge, and especially when that text is only from Wikipedia. Text exists that isn't written like an encyclopedia. Meanings exist that aren't obvious from context. My research at Luminoso involves adding information from the ConceptNet knowledge graph, producing a word vector set called "ConceptNet Numberbatch" that just won against other systems in SemEval [4], a simultaneous, blind evaluation. The NASARI vectors are also based on a knowledge graph.

[1] https://blog.conceptnet.io/2016/05/19/an-introduction-to-the... -- linking this to establish the date

[2] https://blog.conceptnet.io/2016/11/03/conceptnet-5-5-and-con...

[3] http://lcl.uniroma1.it/nasari/

[4] http://alt.qcri.org/semeval2017/task2/

We evaluated using ConceptNet Numberbatch but in the end went with fasttext because of the treatment of OOV words using sub word information. This is important for us because we work with Social Media where misspellings are very frequent and we have found this helps a lot. Are you also looking into these sort of enhancements? How do you usually deal with OOV words?

Very good question!

Our OOV strategy was pretty important in SemEval. The first line of defense -- so fundamental to Numberbatch that I don't even think of it as OOV -- is to see if the term exists in ConceptNet but with too low a degree to make it into the matrix. In that case, we average the vectors from its neighbors in the graph that are in the matrix.

For handling words that are truly OOV for ConceptNet, we ended up using a simple strategy of matching prefixes of the word against known words (and also checking whether a word that's supposed to be in a different language was known in English).

fastText's sub-word strategy, which is learned along with the vocabulary instead of after the fact, is indeed a benefit they have. But am I right that the sub-word information isn't present in these vectors they released?

There's a paper on the SemEval results that just needs to be reviewed by the other participants, and I'm also working on a blog update about it.

I googled that and got

  Showing results for *conceptnet cumberbatch*
Oh, the many names of Bumblebee Banglesnatch.

> Words with similar vectors have similar meanings.

Can you explain WHY this is the case with word2vec? I have come across a paper which says "we don't know really". Is this true?


I think there have been some attempts to explain, none of them fully satisfying.

But there have also been systems that create vectors with that property more directly instead of as a side-effect of a neural net. Examples include Stanford's GloVe, or Omer Levy's Hyperwords, which is made entirely from old-school ideas such as mutual information and dimensionality reduction, and was for a while the best system if you limited it to fixed training data (still Wikipedia).

A gripe: reviewers don't even seem to like explanations. If you can explain your system with well-understood operations, it's boring and "not novel". But when Google publishes magical mystery vectors, they lap it up.

Edit: I see that I didn't answer your question at all. However I'll leave it here for people who are not that familiar with ML

If I'm not completely wrong, these are so called latent factors of words. That pretty much means computer representations of the meaning of a word. Words with similar meanings would have similar factors, for example the word "Rome" and the word "Italy" will probably in one or more of these dimensions be quite similar.

These vectors usually take a lot of time to train if done properly, and they come out quite similar anyway, so having them precomputed makes it easier for other people without the resources of fb to do NLP.

Another cool thing is that they are available for so many languages, this is the first time I've seen precomputed vectors for my native language.

TLDR: computer representations of words, which makes it easier for people to make machine learning models.

What's really cool is the "word math" that these enable.. equations like QUEEN - KING = DUCHESS - DUKE
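A toy sketch of that word math, with hand-made 3-dimensional vectors invented so that the arithmetic works out exactly. Real embeddings only satisfy such equations approximately, which is why the answer is taken to be the nearest word by cosine similarity, excluding the three query words. The `vocab_limit` parameter is an illustrative assumption: it restricts the search to the first N words, assuming the dict is in frequency order.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c, vocab_limit=None):
    """Solve a : b :: c : ? by finding the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Only search the first vocab_limit words, never the query words.
    candidates = [w for w in list(vectors)[:vocab_limit] if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Invented axes: [monarch, noble, female].
toy = {
    "king": [1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "duke": [0.0, 1.0, 0.0],
    "duchess": [0.0, 1.0, 1.0],
    "apple": [0.2, 0.2, 0.1],
}
answer = analogy(toy, "king", "queen", "duke")
```

Here QUEEN - KING + DUKE lands exactly on DUCHESS, which is the same statement as QUEEN - KING = DUCHESS - DUKE rearranged.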


Would also be nice with a guide on how to use them

Disappointing to see Facebook, a company with a huge Irish presence, neglect Irish in the list. There are plenty of minority languages in there, like Breton and Scots. And languages not spoken natively anywhere like Latin, Esperanto, Volapük.

Google Translate (with a similar number of languages) supports it fine.

Hi, because we trained these vectors on Wikipedia, we released models corresponding to the 90 largest Wikipedias first (in terms of training data size). More models are on the way, including Irish.

I suspected it was something like this. Unfortunately the Vicipéid is not of very high quality. I just hope Facebook doesn't forget which side its bread is buttered on.

It would initially seem like an odd oversight as it has Welsh, with half the number of speakers. Another poster said there are more on the way. I wonder if it's a technical or political reason for it not being in the release? They presumably have Irish speakers on staff...

While both Wales and Ireland do have compulsory education in their respective languages, and the number of people who speak Irish is higher, the number of people who speak Welsh every day is considerably higher.

Esperanto is also really easy to add to this kind of service due to a simplified grammar and the community's online presence.

Can someone explain WHY word vectors with similar contexts club together and work well? One of the papers suggests that "we don't really know" (Section 4) [0]

[0] https://arxiv.org/abs/1402.3722

Consider the classic example of king and queen. "Queen" will tend to occur near "female" words, like "she", "her", "woman", etc. And vice versa for "king". But both words will tend to occur near words talking about royalty, e.g. "castle", "crown", "ruler", etc.

So you can learn a decent amount of information about a word, just by looking at the words around it. This is the same thing we teach kids learning to read with "context clues". If I talk about bolgorovs and how delicious they are, and how they are ripe and sweet, etc. You can probably guess "bolgorovs" are a fruit, just from the context.

Sometimes, context can be ripe for the picking while simultaneously leaving your subject rotting on the vine.

Regardless: I think you answered a question more about "what" than "why"; I know what these models are, and how they are used, but it is still somewhat surprising to me that you can get as good results as people claim without a nearly infinitely dimensional space (and honestly I kind of wonder if this might be more a model of "the kinds of questions humans most often like to ask about words when confronted with a word for the first time or wanting to test a data set" instead of "a true understanding of what is being said encoded as vectors", allowing things like "have opposite gender" to be extremely functional vectors but probably leaving much more important concepts like "are classic opponent in war" on the floor, which is really important semantic information that isn't necessarily transitive and might not even be commutative, a thought process that seems to align with complaints about word2vec from ConceptNet.)

Put differently: I bet the set of 300 axes is actually a more useful result (though one that is more opaque and I don't hear much about attempts to analyze; but I am currently not in this field and haven't been paying attention to the literature) than the actual vector mapping (which is what people always seem excited about). I would love to see more talk of "what questions are these models weirdly good at answering versus questions where they seem so limited as to almost be useless".

> it is still somewhat surprising to me that you can get as good results as people claim without a nearly infinitely dimensional space

As I understand it, the maximum number of dimensions required is equal to the number of words. That is, if you did no dimensional reduction, you have a vector that expresses exactly how close occurrences of the word in question are on average to occurrences of each and every other word.

That's a very large number of dimensions, but hardly infinite.
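To make the "one dimension per word" picture concrete, here is a small sketch that builds unreduced vectors from raw co-occurrence counts within a window. Counting within a fixed window is a simplifying assumption standing in for "how close on average", and the function name and the two-sentence corpus are invented for this example.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Return (vocab, vectors) where each vector has one entry per vocab word."""
    vocab = sorted({w for sent in sentences for w in sent})
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    # Dimension k of a word's vector counts co-occurrences with vocab[k].
    vectors = {w: [counts[w][c] for c in vocab] for w in vocab}
    return vocab, vectors

corpus = [
    ["the", "queen", "wears", "a", "crown"],
    ["the", "king", "wears", "a", "crown"],
]
vocab, vectors = cooccurrence_vectors(corpus)
```

In this tiny corpus "king" and "queen" appear in identical contexts, so their unreduced vectors come out identical. Dimensionality reduction then compresses these vocabulary-sized vectors down to a few hundred dimensions.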

Reducing the number of dimensions turns "distance from every other word" into "distance from abstract concepts", except that "abstract concept" is overstating the case, as the "concepts" aren't features of human cognition per-se except to the extent that those features are reflected statistically in the corpus that was used. Besides, the choice of the number of dimensions to reduce to is somewhat arbitrary, and no one knows right now what the "correct" number is, or even if there is a "correct" number. I'm not even sure whether much work has been done on the sensitivity of the models to the number of dimensions.

There is probably a lot of productive work to be done on dimensionality reduction techniques that make the reduced dimensions map better to abstractions that a human would recognize, at least faintly, as well as work to create corpora that better sample the full range of human expression in as compact a size as possible.

The paper you link is about word2vec, which has taken quite a while to understand. This paper shows word2vec is a particular factorization of the word-context matrix.

Neural Word Embedding as Implicit Matrix Factorization https://papers.nips.cc/paper/5477-neural-word-embedding-as-i...

Which is some of the story. I found this paper also interesting.

Towards a Better Understanding of Predict and Count Models https://arxiv.org/abs/1511.02024

The general idea (using word frequencies in context) is Firth's distributional hypothesis [0]. Related to this is the idea of a distributional representation which is covered by Gardenfors in his book on conceptual spaces.

For word2vec and such, several papers in this year's TACL explain why these methods should work [1,2].

[0] https://www.aclweb.org/aclwiki/index.php?title=Distributiona...

[1] https://transacl.org/ojs/index.php/tacl/article/viewFile/742...

[2] https://transacl.org/ojs/index.php/tacl/article/viewFile/809...

I tried using the Danish .bin with fastText predict and a single Danish sentence but I keep on getting an assert error from vector.cpp around A._m not equal to _m. Am I doing something wrong?

./fastText predict wiki.da.bin fileWithASingleLine 1

These models were trained in an unsupervised way, and thus cannot be used with the "predict" mode of fastText.

The .bin models can be used to generate word vectors for out-of-vocabulary words:

  > echo 'list of words' | ./fasttext print-vectors model.bin

  > ./fasttext print-vectors model.bin < queries.txt
where queries.txt is a list of words you want a vector representation for.
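For context on why out-of-vocabulary words can get a vector at all: as I understand the fastText paper, a word is wrapped in `<` and `>` boundary markers and broken into character n-grams (lengths 3 to 6 by default), and the word's vector is built from the vectors of those n-grams. The sketch below shows only that idea. It uses a plain dict of n-gram vectors and a simple average, whereas the real implementation hashes n-grams into buckets, so treat `char_ngrams` and `oov_vector` as illustrative names and not as fastText's API.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def oov_vector(word, ngram_vectors, dim, min_n=3, max_n=6):
    """Average the vectors of the word's known n-grams (zeros if none)."""
    known = [ngram_vectors[g]
             for g in char_ngrams(word, min_n, max_n)
             if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Invented 2-dimensional n-gram table for the example.
toy_ngrams = {"<wh": [1.0, 0.0], "re>": [0.0, 1.0]}
trigrams = char_ngrams("where", 3, 3)
vec = oov_vector("where", toy_ngrams, dim=2, min_n=3, max_n=3)
```

A misspelling shares most of its n-grams with the correct spelling, which is why this approach helps with noisy text.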

If you want to try out fastText without having to do any local setup, see https://github.com/floydhub/fastText.

FloydHub[1] is a deep learning PaaS for training and deploying DL models in the cloud with zero setup.

[1]https://www.floydhub.com Disclaimer: I am one of Floyd's co-creators

It doesn't say what data these were trained on. That is kind of important information. I previously applied word vectors that were trained on news to social media posts and it didn't work well at all. Also I don't think there is a language called "Western".

Also interesting that this is hosted on S3.

These models were trained on Wikipedia.

It should be "Western Frisian" instead of "Western" (https://en.wikipedia.org/wiki/West_Frisian_language). Thanks for the catch!


> Also I don't think there is a language called "Western".

Going by the ISO 639 code I think that's supposed to be West Frisian

Can someone please ELI5 why this is good, and what they can be used for? I'm assuming machine learning...

I haven't done anything with fastText, but I have with word2vec. It embeds each word in a 300 dimensional vector, such that similar words have a large cosine similarity. (If you normalize each vector to have a unit norm, then cosine similarity is just a dot product.) So in short, it gives you a measure of how similar each word is to other words.
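A quick numeric check of the parenthetical above, using two made-up 2-dimensional vectors in place of real 300-dimensional ones:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u, v = [3.0, 4.0], [4.0, 3.0]
cos_uv = cosine(u, v)                       # 24 / (5 * 5)
dot_unit = dot(normalize(u), normalize(v))  # same value after normalizing
```

Normalizing once up front means every later similarity lookup is a plain dot product, which is cheaper when you compare one word against a whole vocabulary.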

This has many uses in machine learning. You can extend it to documents and find similar documents, find misspellings, use them as features in a ML model, etc.

There haven't been good vectors in that many languages (that I know of), so that's a plus for these fastText vectors.

ah. Thanks!

Is this of any use for sentiment analysis and plagiarism detection? I might give it a go after I'm done with my current projects

This will enable writing a plagiarism detector which will not be fooled by the simple strategy of replacing words with their synonyms. Given that synonyms have very similar embeddings, you can compute a distance between two phrases by computing the distance between their word embeddings. And that's just what comes to mind right now.
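One simple way to turn that into code is to average the word vectors of each phrase and compare the averages with cosine similarity. This is only a sketch under that assumption: the vectors below are invented so that "big"/"large" and "car"/"automobile" sit close together, averaging ignores word order, and a real detector would need much more than this.

```python
import math

def phrase_vector(phrase, vectors):
    """Average the vectors of the known words in the phrase (None if none)."""
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]

def phrase_similarity(p1, p2, vectors):
    """Cosine similarity between averaged phrase vectors."""
    u, v = phrase_vector(p1, vectors), phrase_vector(p2, vectors)
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented toy vectors; synonyms are placed near each other on purpose.
toy = {
    "big": [1.0, 0.0, 0.0],
    "large": [0.9, 0.1, 0.0],
    "car": [0.0, 1.0, 0.0],
    "automobile": [0.1, 0.9, 0.0],
    "small": [-1.0, 0.0, 0.0],
    "river": [0.0, 0.0, 1.0],
}
sim_synonyms = phrase_similarity("big car", "large automobile", toy)
sim_unrelated = phrase_similarity("big car", "small river", toy)
```

The synonym-swapped phrase scores close to 1 while the unrelated phrase scores below 0, which is the property the detector would rely on.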

For sentiment detection, I could see a similar experiment to [1] working, but instead of discriminating between newsgroups, you classify sentiment.

[1] https://blog.keras.io/using-pre-trained-word-embeddings-in-a...

> This will enable writing a plagiarism detector which will not be fooled by the simple strategy of replacing words with their synonyms.

I'm not sure how helpful that will be, as you may end up with a system that detects whenever a student expresses similar thoughts (and let's face it, the educational system is all about getting students to conform to conventional patterns of thinking) in their own words.

And if the system doesn't detect re-expression of the same ideas, then a system that automatically rewrites essays in a slightly different style (essentially, an English-to-English neural machine translation) will defeat it.

The endgame would be grading student essays on how well they express an entirely original idea, which is an unreasonable standard.

I think this would be difficult because a lot of school reports already come down to re-formatting textbook ideas in your own words. Could be a tough decision of how different it needs to be to be considered original work.

might be a lot easier to make a plagiarism generator instead - keep the overall meaning of sentences but use synonyms or deliberately off-meaning words.

How does this compare to Conceptnet Numberbatch? https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch...

On their Facebook page [1] they said they are planning to release models for 294 languages very soon.

[1] https://www.facebook.com/groups/1174547215919768/

What is this?

What is the point of using Github for something like this?

It's part of fastText project, which is hosted on Github.

The files themselves appear to be hosted on s3.
