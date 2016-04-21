Hacker News
Facebook releases 300-dimensional pretrained Fasttext vectors for 90 languages (github.com)
41 points by sandGorgon 5 hours ago | 5 comments





This has the potential to be very useful, and it is great that FB has released them. Some caveats, though. I don't know how well Fasttext vectors perform as features for downstream machine learning systems (if anyone knows of work along these lines, I would be very happy to hear about it), unlike word2vec [1] or GloVe [2] vectors, which have been in use for a few years at this point. Also, training only on Wikipedia gives the vectors less exposure to "real world" text, unlike, say, the word2vec vectors that were trained on a large Google News corpus back in the day, or the GloVe vectors that used Common Crawl. Still, if you need word vectors for a ton of languages, this looks like a great resource and will save you the pre-processing and computational trouble of producing them on your own.

[1]: https://code.google.com/archive/p/word2vec/

[2]: http://nlp.stanford.edu/projects/glove/

Can anyone point me to any articles on what can be achieved with this and how?

[EDIT] I replied to myself here: https://news.ycombinator.com/item?id=12226988

Edit: I see that I didn't answer your question at all. However, I'll leave this here for people who are not that familiar with ML.

If I'm not completely wrong, these are so-called latent factors of words, which pretty much means computer representations of the meaning of a word. Words with similar meanings have similar factors: for example, "Rome" and "Italy" will probably be quite similar in one or more of these dimensions.
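To make "similar factors" a bit more concrete: a common way to compare two word vectors is cosine similarity. Here is a minimal sketch. The 4-dimensional vectors are made up for illustration (the released ones have 300 dimensions), and the name `toy_vectors` is my own, not something from the release.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, NOT real Fasttext values.
toy_vectors = {
    "rome":   [0.9, 0.8, 0.1, 0.0],
    "italy":  [0.8, 0.9, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9, 0.8],
}

sim_related = cosine_similarity(toy_vectors["rome"], toy_vectors["italy"])
sim_unrelated = cosine_similarity(toy_vectors["rome"], toy_vectors["banana"])

# The related pair should score much closer to 1.0 than the unrelated pair.
print(f"rome/italy:  {sim_related:.3f}")
print(f"rome/banana: {sim_unrelated:.3f}")
```

With real pretrained vectors you would expect the same pattern, though how strong it is depends on the training data.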

These vectors usually take a lot of time to train if done properly, and the results come out quite similar anyway, so having them precomputed makes it easier for people without the resources of FB to do NLP.
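For anyone who wants to poke at precomputed vectors without extra libraries, here is a rough loader sketch. It assumes the plain-text `.vec` layout as I understand it: a header line with the vocabulary size and the dimension, then one line per word with the word followed by its space-separated values. `load_vec` and its `limit` parameter are names I made up, and the sample data is fake.

```python
import io

def load_vec(lines, limit=None):
    """Parse word vectors from an iterable of text lines in the assumed .vec layout."""
    lines = iter(lines)
    # Header: "<vocabulary size> <dimension>"
    _count, dim = (int(n) for n in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # Take the last `dim` fields as numbers; whatever precedes them is the word.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
        if limit is not None and len(vectors) >= limit:
            break  # the full files are large, so allow reading only the first words
    return vectors

# Fake in-memory sample: 3 words, 2 dimensions.
sample = io.StringIO("3 2\nrome 0.1 0.2\nitaly 0.3 0.4\nroma 0.5 0.6\n")
vectors = load_vec(sample, limit=2)
print(vectors)
```

With a downloaded file you would pass an open file handle instead of the `StringIO`, for example `open(path, encoding="utf-8")`, where `path` is wherever you saved the vectors.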

Another cool thing is that they are available for so many languages; this is the first time I've seen precomputed vectors for my native language.

TL;DR: computer representations of words, which make it easier for people to build machine learning models.

What's really cool is the "word math" that these enable: equations like QUEEN - KING = DUCHESS - DUKE

https://blog.acolyer.org/2016/04/21/the-amazing-power-of-wor...
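A toy sketch of how that word math is usually done: rearrange the equation to QUEEN - KING + DUKE, then look for the nearest remaining word by cosine similarity. The 3-dimensional vectors below are hand-built so the analogy works exactly; real vectors only satisfy it approximately. All names here (`toy`, `analogy`) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-made toy vectors: axis 0 ~ rank, axis 1 ~ gender, axis 2 ~ unrelated.
toy = {
    "king":    [1.0,  1.0, 0.0],
    "queen":   [1.0, -1.0, 0.0],
    "duke":    [0.5,  1.0, 0.0],
    "duchess": [0.5, -1.0, 0.0],
    "apple":   [0.0,  0.0, 1.0],
}

def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# QUEEN - KING + DUKE should land on DUCHESS in this toy space.
result = analogy(toy, "queen", "king", "duke")
print(result)
```

Excluding the three input words from the candidates is the usual convention; otherwise the nearest neighbour is often one of the inputs.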

Is this any use for sentiment analysis and plagiarism detection? I might give it a go after I'm done with my current projects.

