These preliminary results seem to indicate that fastText embeddings are
significantly better than word2vec at encoding syntactic information. This is
expected, since most syntactic analogies are morphology-based, and the char
n-gram approach of fastText takes such information into account. The original
word2vec model seems to perform better on semantic tasks, since the words in
semantic analogies are unrelated to their char n-grams, and the added
information from irrelevant char n-grams worsens the embeddings.
One advantage of fastText should be better learning on small amounts of data, like Wikipedia.
Edit: Unfortunately, it's becoming more and more common that people here express disagreement using downvotes. I guess that's a sign that HN is finally going down the drain like all similar platforms.
WS-353 Similarity: 0.781
WS-353 Relatedness: 0.682
Impressive for a model trained on Wikipedia alone!
I will post analogy scores for this model as soon as they are done computing.
Semantic accuracy: 63.84% | Syntactic accuracy: 67.00%
Here performance is not great (great would be >80% on semantic and >70% on syntactic).
As this task requires nearest-neighbor lookups, performance is impacted by vocabulary size. Models trained on Wikipedia alone usually limit the vocabulary to roughly 300k words, so to get scores comparable to those posted in the GloVe [1] and LexVec [2] papers, we can use only the first 300k words in the pre-trained vectors, which gives the following results:
Semantic accuracy: 77.75% | Syntactic accuracy: 72.55%
[1] http://nlp.stanford.edu/pubs/glove.pdf - https://github.com/stanfordnlp/GloVe
[2] https://arxiv.org/pdf/1606.01283v1 - https://github.com/alexandres/lexvec
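For anyone who wants to reproduce the restricted evaluation above, here is a rough sketch using gensim; the file names and the evaluate_word_analogies call are my assumptions rather than the exact setup used:

    from gensim.models import KeyedVectors

    # the .vec files are plain word2vec text format; entries are ordered
    # by frequency, so limit=300000 keeps the 300k most frequent words
    vectors = KeyedVectors.load_word2vec_format("wiki.en.vec", limit=300000)

    # questions-words.txt is the standard Google analogy test set
    score, sections = vectors.evaluate_word_analogies(
        "questions-words.txt", restrict_vocab=300000)
    print("overall accuracy: {:.2%}".format(score))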
This is huge, because most other releases have traditionally been in European languages. It is fairly rare to see a release covering Asian languages.
One challenge is that typical linguistic use in Asia mixes the native language with English. For example, people in north India use "Hinglish". It is typically fairly hard to make sense of such text.
Hindi -- native: 370m | total: 490m
Bengali -- native: 196m | total: 215m
German -- native: 101m | total: 229m
Yet most projects only target the Anglosphere; usually not even Europe is included.
A large fraction of Indian children will be illiterate in these "thriving" languages in the coming decades, with "zero" monetary loss. Not living is death, and they haven't been alive for a long time: I can neither get any Govt. services in my mother tongues, nor read the laws passed by the colonial state in Delhi in them.
A zombie is not alive IMO. If Indians wish to parlay their tongues for money, it's up to them. I for one will not deceive myself. We are going to be part of the borg that is the Anglosphere in less than 3-4 generations.
Regardless, "dead" has a specific meaning for languages, and if people speak it (especially hundreds of millions of people) then it is not dead.
Do you think gnipgnip believes something that isn't true, or do you think they're just using words differently to how you'd use them?
For a large part of the Mughal period the court language was Farsi, but Hindi/Urdu survived; you are underestimating people's ability to straddle multiple languages. For many Indians it's just a necessity.
I have Python code to process text files into a fastText-friendly format, so I may clean that up and see how these pre-trained embeddings work.
(although, the English embeds are 10.36 GB; that might be a tough pill to swallow for training on machines with only 8 GB of RAM)
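In case it's useful to others, a minimal sketch of the kind of preprocessing fastText's training examples use (lowercasing and spacing out punctuation); this is my own approximation, and the file names are placeholders:

    import re

    def normalize(line):
        line = line.lower()
        # make punctuation separate tokens by padding it with spaces
        line = re.sub(r"([.,!?'\"()])", r" \1 ", line)
        # collapse runs of whitespace into single spaces
        return re.sub(r"\s+", " ", line).strip()

    with open("corpus.txt", encoding="utf-8") as src, \
         open("corpus.pre.txt", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(normalize(line) + "\n")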
We might also release smaller models in the future, for training on machines without large memory.
If you have a bilingual dictionary, you might try to learn a linear mapping from one language to the other (e.g. see https://arxiv.org/abs/1309.4168 for this approach).
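A bare-bones sketch of that translation-matrix idea, assuming two loaded gensim KeyedVectors and a small list of word pairs (all names here are placeholders, not anyone's actual pipeline):

    import numpy as np

    def learn_mapping(src_vecs, tgt_vecs, word_pairs):
        # least-squares W such that src_vec @ W is close to tgt_vec
        pairs = [(s, t) for s, t in word_pairs
                 if s in src_vecs and t in tgt_vecs]
        X = np.array([src_vecs[s] for s, _ in pairs])
        Y = np.array([tgt_vecs[t] for _, t in pairs])
        W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
        return W

    # en_vecs, da_vecs: loaded KeyedVectors; dictionary: list of
    # (english_word, danish_word) pairs, e.g. [("dog", "hund"), ...]
    # W = learn_mapping(en_vecs, da_vecs, dictionary)
    # print(da_vecs.similar_by_vector(en_vecs["dog"] @ W, topn=5))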
[EDIT] I reply to myself here:
Of course there are tests for this "words with similar vectors have similar meanings" property, and I'm finding that the fastText vectors aren't doing that well on them, especially outside of English.
I'm glad they released them, particularly so anyone can run a fair comparison between different word vectors, but these things should come with evaluation results the same way that code should come with tests. These vectors are performing worse than ones that my company Luminoso released last year (a better, later post is ), and if you don't believe me plugging my own vectors, I know that Sapienza University of Rome also has better vectors called NASARI.
fastText covers more languages, but most of these languages have no evaluations. How do you know the Basque vectors aren't just random numbers?
I think that performance hits a plateau when the vectors only come from text, with no relational knowledge, and especially when that text is only from Wikipedia. Text exists that isn't written like an encyclopedia. Meanings exist that aren't obvious from context. My research at Luminoso involves adding information from the ConceptNet knowledge graph, producing a word vector set called "ConceptNet Numberbatch" that just won against other systems in SemEval, a simultaneous, blind evaluation. The NASARI vectors are also based on a knowledge graph.
 https://blog.conceptnet.io/2016/05/19/an-introduction-to-the... -- linking this to establish the date
Our OOV strategy was pretty important in SemEval. The first line of defense -- so fundamental to Numberbatch that I don't even think of it as OOV -- is to see if the term exists in ConceptNet but with too low a degree to make it into the matrix. In that case, we average the vectors from its neighbors in the graph that are in the matrix.
For handling words that are truly OOV for ConceptNet, we ended up using a simple strategy of matching prefixes of the word against known words (and also checking whether a word that's supposed to be in a different language was known in English).
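To make the idea concrete, here is my reading of such a prefix-matching fallback as a sketch; this is not Numberbatch's actual code, and `vectors` is assumed to be a plain dict of word -> numpy array:

    import numpy as np

    def oov_vector(word, vectors, min_prefix=3):
        # walk from the longest prefix down; average the vectors of all
        # known words sharing it (a trie would make this fast; the
        # linear scan here is just for clarity)
        for n in range(len(word), min_prefix - 1, -1):
            matches = [vec for w, vec in vectors.items()
                       if w.startswith(word[:n])]
            if matches:
                return np.mean(matches, axis=0)
        return None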
fastText's sub-word strategy, which is learned along with the vocabulary instead of after the fact, is indeed a benefit they have. But am I right that the sub-word information isn't present in these vectors they released?
There's a paper on the SemEval results that just needs to be reviewed by the other participants, and I'm also working on a blog update about it.
Showing results for *conceptnet cumberbatch*
Can you explain WHY this is the case with word2vec? I have come across a paper which says "we don't know really". Is this true?
But there have also been systems that create vectors with that property more directly instead of as a side-effect of a neural net. Examples include Stanford's GloVe, or Omer Levy's Hyperwords, which is made entirely from old-school ideas such as mutual information and dimensionality reduction, and was for a while the best system if you limited it to fixed training data (still Wikipedia).
A gripe: reviewers don't even seem to like explanations. If you can explain your system with well-understood operations, it's boring and "not novel". But when Google publishes magical mystery vectors, they lap it up.
If I'm not completely wrong, these are so-called latent factors of words: essentially, computer representations of the meaning of a word.
Words with similar meanings have similar factors; for example, the words "Rome" and "Italy" will probably be quite similar in one or more of these dimensions.
These vectors usually take a lot of time to train if done properly, and they come out quite similar anyway, so having them precomputed makes it easier for people without Facebook's resources to do NLP.
Another cool thing is that they are available for so many languages, this is the first time I've seen precomputed vectors for my native language.
TLDR: computer representations of words, which make it easier for people to build machine learning models.
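To make that concrete, a tiny gensim demo (assuming the released English .vec file is on disk; the limit just keeps memory use down):

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("wiki.en.vec", limit=100000)
    print(vectors.similarity("rome", "italy"))   # relatively high
    print(vectors.similarity("rome", "carrot"))  # much lower
    print(vectors.most_similar("rome", topn=5))  # nearest neighbours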
Google Translate (with a similar number of languages) supports it fine.
So you can learn a decent amount of information about a word, just by looking at the words around it. This is the same thing we teach kids learning to read with "context clues". If I talk about bolgorovs and how delicious they are, and how they are ripe and sweet, etc. You can probably guess "bolgorovs" are a fruit, just from the context.
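That is also all a skip-gram style model ever sees: (word, context word) pairs from a sliding window, something like:

    sentence = "the ripe sweet bolgorovs are delicious".split()
    window = 2

    pairs = []
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        pairs += [(target, sentence[j]) for j in range(lo, hi) if j != i]

    print(pairs[:4])
    # [('the', 'ripe'), ('the', 'sweet'), ('ripe', 'the'), ('ripe', 'sweet')]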
Regardless: I think you answered a question more about "what" than "why"; I know what these models are, and how they are used, but it is still somewhat surprising to me that you can get as good results as people claim without a nearly infinite-dimensional space (and honestly I kind of wonder if this might be more a model of "the kinds of questions humans most often like to ask about words when confronted with a word for the first time or wanting to test a data set" instead of "a true understanding of what is being said encoded as vectors", allowing things like "have opposite gender" to be extremely functional vectors but probably leaving much more important concepts like "are classic opponents in war" on the floor, which is really important semantic information that isn't necessarily transitive and might not even be commutative, a thought process that seems to align with complaints about word2vec from ConceptNet).
Put differently: I bet the set of 300 axes is actually a more useful result (though one that is more opaque and I don't hear much about attempts to analyze; but I am currently not in this field and haven't been paying attention to the literature) than the actual vector mapping (which is what people always seem excited about). I would love to see more talk of "what questions are these models weirdly good at answering versus questions where they seem so limited as to almost be useless".
As I understand it, the maximum number of dimensions required is equal to the number of words. That is, if you did no dimensional reduction, you have a vector that expresses exactly how close occurrences of the word in question are on average to occurrences of each and every other word.
That's a very large number of dimensions, but hardly infinite.
Reducing the number of dimensions turns "distance from every other word" into "distance from abstract concepts", except that "abstract concept" is overstating the case: the "concepts" aren't features of human cognition per se, except to the extent that those features are reflected statistically in the corpus that was used. Besides, the choice of the number of dimensions to reduce to is somewhat arbitrary, and no one knows right now what the "correct" number is, or even if there is a "correct" number. I'm not even sure whether much work has been done on the sensitivity of the models to the number of dimensions.
There is probably a lot of productive work to be done on dimensionality reduction techniques that make the reduced dimensions map better to abstractions that a human would recognize, at least faintly, as well as work to create corpora that better sample the full range of human expression in as compact a size as possible.
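For the curious, a toy end-to-end version of this count-based picture (co-occurrence counts, PPMI weighting, then truncated SVD); a didactic sketch on a two-sentence corpus, not any particular system:

    import numpy as np

    corpus = ["rome is the capital of italy",
              "paris is the capital of france"]
    tokens = sorted({w for s in corpus for w in s.split()})
    index = {w: i for i, w in enumerate(tokens)}

    # co-occurrence counts within a +/-2 word window
    counts = np.zeros((len(tokens), len(tokens)))
    for s in corpus:
        words = s.split()
        for i, w in enumerate(words):
            for j in range(max(0, i - 2), min(len(words), i + 3)):
                if i != j:
                    counts[index[w], index[words[j]]] += 1

    # positive pointwise mutual information weighting
    total = counts.sum()
    p_word = counts.sum(axis=1) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / np.outer(p_word, p_word))
    pmi[~np.isfinite(pmi)] = 0.0
    ppmi = np.maximum(pmi, 0.0)

    # truncated SVD: the first k columns are the low-dimensional vectors
    u, s, _ = np.linalg.svd(ppmi)
    word_vectors = u[:, :4] * s[:4]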
Neural Word Embedding as Implicit Matrix Factorization
Which is some of the story. I found this paper also interesting.
Towards a Better Understanding of Predict and Count Models
For word2vec and such, several papers in this year's TACL explain why these methods should work [1,2].
./fasttext predict wiki.da.bin fileWithASingleLine 1
The .bin models can be used to generate word vectors for out-of-vocabulary words:
> echo 'list of words' | ./fasttext print-vectors model.bin
> ./fasttext print-vectors model.bin < queries.txt
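If you'd rather call that from Python, a small wrapper along these lines should work; I'm assuming the output format is one word followed by its vector per line, so treat that as an assumption:

    import subprocess
    import numpy as np

    def word_vectors(words, model="model.bin", binary="./fasttext"):
        out = subprocess.run(
            [binary, "print-vectors", model],
            input="\n".join(words),
            capture_output=True, text=True, check=True).stdout
        vecs = {}
        for line in out.strip().splitlines():
            parts = line.split()
            vecs[parts[0]] = np.array(parts[1:], dtype=float)
        return vecs

    # vecs = word_vectors(["hund", "hundene"], model="wiki.da.bin")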
FloydHub is a deep learning PaaS for training and deploying DL models in the cloud with zero setup.
Disclaimer: I am one of Floyd's co-creators
Also interesting that this is hosted on S3.
It should be "Western Frisian" instead of "Western" (https://en.wikipedia.org/wiki/West_Frisian_language). Thanks for the catch!
Going by the ISO 639 code I think that's supposed to be West Frisian
This has many uses in machine learning. You can extend word vectors to documents and find similar documents, find misspellings, use them as features in an ML model, etc.
There haven't been good vectors in that many languages (that I know of), so that's a plus for these fastText vectors.
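As a concrete example of extending word vectors to documents, the usual crude baseline is to average the word vectors and compare documents by cosine similarity; a sketch, with `vectors` assumed to be a dict or loaded KeyedVectors:

    import numpy as np

    def doc_vector(text, vectors):
        # ignore words the model doesn't know
        words = [w for w in text.lower().split() if w in vectors]
        return np.mean([vectors[w] for w in words], axis=0)

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # sim = cosine(doc_vector(doc_a, vectors), doc_vector(doc_b, vectors))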
For sentiment detection, I could see a similar experiment to  working, but instead of discriminating between newsgroups, you classify sentiment.
I'm not sure how helpful that will be, as you may end up with a system that detects whenever a student expresses similar thoughts (and let's face it, the educational system is all about getting students to conform to conventional patterns of thinking) in their own words.
And if the system doesn't detect re-expression of the same ideas, then a system that automatically rewrites essays in a slightly different style (essentially, an English-to-English neural machine translation) will defeat it.
The endgame would be grading student essays on how well they express an entirely original idea, which is an unreasonable standard.
Might be a lot easier to make a plagiarism generator instead: keep the overall meaning of sentences but use synonyms or deliberately off-meaning words.
The files themselves appear to be hosted on S3.