
Learning Word Vectors for 157 Languages - ingve
https://arxiv.org/abs/1802.06893
======
visarga
This is great for quick jobs, but if you have a good dataset in your domain
you should retrain using fastText, doc2vecC, word2vec, or StarSpace, to name a
few excellent tools. The window size is important as well, depending on your
downstream task - one size doesn't fit all. Training only takes a few hours on
a beefy CPU with enough RAM, even for a corpus of billions of words.

Word vectors are fascinating representations. There is a huge amount of
information and nuance captured in them. You can use them directly for topic
retrieval (using Annoy or another optimised vector index), or feed them into a
classifier such as those in scikit-learn. All types of neural nets - fully
connected, recurrent, and convolutional - can be applied to word vectors.
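
To make that concrete, here is a minimal sketch of retraining on a domain
corpus and indexing the result for retrieval. It assumes gensim >= 4.0 and
Annoy, and the corpus filename is hypothetical:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence
    from annoy import AnnoyIndex

    # Train domain-specific vectors; tune window/min_count for your downstream task.
    sentences = LineSentence("my_domain_corpus.txt")  # hypothetical file, one sentence per line
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=8)

    # Index the learned vectors with Annoy for fast approximate nearest-neighbour lookup.
    dim = model.wv.vector_size
    index = AnnoyIndex(dim, "angular")
    for i, word in enumerate(model.wv.index_to_key):
        index.add_item(i, model.wv[word])
    index.build(10)  # more trees = better recall, slower build

    # Words closest to "apple" in the learned space.
    neighbours = index.get_nns_by_vector(model.wv["apple"], 10)
    print([model.wv.index_to_key[i] for i in neighbours])

Swapping Word2Vec for gensim's FastText class gives subword information with
much the same interface.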

~~~
rspeer
I argue that you need _both_ domain-general and domain-specific data. Most
domains you could apply NLP to don't have billions of words to train on,
especially if the language of the data isn't English.

Pre-trained data is valuable, and you really aren't going to re-learn it all
from your data, so why throw it out?

~~~
laughingman2
Yes, recent work on transfer learning
[https://arxiv.org/abs/1801.06146](https://arxiv.org/abs/1801.06146) ("Fine-
tuned Language Models for Text Classification") suggests that training an LSTM
on a large corpus for language modeling, then using it as a sentence encoder
after fine-tuning it on a specific dataset, gives considerable performance
gains.

This is covered in fast.ai's new course (part 1).

------
sp332
I guess I can't argue with the results, but the data doesn't look very clean.
I downloaded the Esperanto file because I expected it to be small, but it was
a 1.1 GB download that expanded to 4.4 GB. A lot of the most popular "words"
are punctuation marks, and the first two real words are "la" and "La". Perhaps
naively, I expected those to be the same word.

~~~
exgrv
We decided to keep the casing, as it is useful for some applications such as
named entity recognition.

Regarding the punctuation, as pointed out in another comment, these tokens
might also be useful for some applications (and they are easy to filter out if
you don't need them).
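
For anyone who does want to filter them, a rough sketch that drops pure-
punctuation tokens from a downloaded text-format .vec file (the filenames are
just examples; the assumed format is a "vocab_size dim" header line followed
by one "token v1 ... v_dim" line per word):

    # Buffers the kept lines in memory, which is fine as a sketch but
    # wasteful for multi-gigabyte files.
    def filter_vectors(in_path="cc.eo.300.vec", out_path="cc.eo.300.filtered.vec"):
        with open(in_path, encoding="utf-8") as fin:
            header = fin.readline()  # "vocab_size dim"
            kept = [line for line in fin
                    if any(ch.isalpha() for ch in line.split(" ", 1)[0])]
        _, dim = header.split()
        with open(out_path, "w", encoding="utf-8") as fout:
            fout.write(f"{len(kept)} {dim}\n")
            fout.writelines(kept)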

~~~
sp332
In the Tagalog file, } is near the top but { is over 8,000 lines down. Is
there a reason they have such different frequencies? ( and ) are right next to
each other.

And yes I realize this is a really odd question :)

~~~
exgrv
This is probably due to our preprocessing of Wikipedia that did not get rid of
all the '}' from the markup.

~~~
sp332
Oh true. I tried to clean up Wiki markup for ML years ago and it was a huge
pain. Next time I think I'll parse the HTML version and pull out the text from
the tags explicitly.

~~~
mkl
This is a much better way to do it. It's easier, cleaner, and captures the
text generated by templates, of which there is a surprising amount (otherwise
you get weird artifacts from that).

------
closed
I'm always amazed that, although Cantonese is about as commonly spoken as
Italian, so little written corpus data exists for it (since most Cantonese
speakers write in a form of Mandarin Chinese).

For example, there are all kinds of useful things we can do with these 157
sets of word vectors, but Cantonese escaped the list because most of its
communication happens off the page.

[https://en.wikipedia.org/wiki/Written_Cantonese](https://en.wikipedia.org/wiki/Written_Cantonese)

------
Choco31415
There seems to be some interest in learning what word vectors are, so I'll
give a basic overview here, then tie it into the research.

"One of the challenges for Natural Language Processing (NLP) systems is the
question of how to represent input such that the network runs quickly, but
also learns well. It's possible to represent each word as a one hot vector,
but that's computationally slow. It's also possible to represent each word as
some number, but then lots of words look very similar.

Instead, why not use a mix? Introducing word embeddings. We'll represent each
word as a n-dimensional vector, with each dimension representing a trait about
the word. For example, "fruit" might be represented as {food: 0.99, gender:
-0.05, size: 0.2}, and "king" might be represented as {food: -0.9, gender:
0.92, size: 0.56}." [Quoted from MuffinTech.org] [See v1n337 for caveats. [0]]

Two similar words should have similar word vectors, like "apple" and "peach".
If we learn some fact about apples, like "Humans eat apples.", then we can
easily generalize that to peaches, pineapples, etc...
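
As a toy illustration of that "similar words have similar vectors" idea (the
numbers below are invented, not taken from any trained model):

    import numpy as np

    # Made-up 3-dimensional vectors, purely for illustration.
    apple = np.array([0.9, 0.1, 0.3])
    peach = np.array([0.8, 0.2, 0.35])
    king = np.array([-0.7, 0.9, 0.5])

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(apple, peach))  # ~0.99: the two fruits point in nearly the same direction
    print(cosine(apple, king))   # ~-0.33: little shared meaning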

Let's tie this back to the research. Since we now have word vectors for many
languages, it becomes easier to build NLP systems in those languages - for
example, an English->French translator.

[0]
[https://news.ycombinator.com/item?id=16448960](https://news.ycombinator.com/item?id=16448960)

~~~
gigogkggi
If apple and peach had very similar word vectors, an English apple and a
French peach would too, and there would be a risk of mistranslation. How is
that situation handled?

~~~
yorwba
It is handled by supervised training with paired translations, so that English
apple will be associated with French pomme instead of other fruits. If you
don't have a parallel corpus, translation gets significantly harder. I'm
actually more amazed that it's possible at all.
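
For intuition, one classic approach with a small seed dictionary of
translation pairs (not necessarily what any particular system uses) is to
learn a linear map from the English space into the French space, the
"translation matrix" idea of Mikolov et al. A toy sketch with invented
2-dimensional vectors:

    import numpy as np

    # Invented toy vocabularies and vectors, purely to show the mechanics.
    en_vocab = {"apple": 0, "peach": 1, "dog": 2}
    fr_vocab = {"pomme": 0, "pêche": 1, "chien": 2}
    fr_words = ["pomme", "pêche", "chien"]
    en_vecs = np.array([[0.9, 0.1], [0.8, 0.2], [-0.5, 0.7]])
    fr_vecs = np.array([[0.1, 0.9], [0.2, 0.8], [0.7, -0.5]])

    # Seed dictionary of known translation pairs.
    pairs = [("apple", "pomme"), ("dog", "chien")]
    X = np.stack([en_vecs[en_vocab[e]] for e, f in pairs])
    Y = np.stack([fr_vecs[fr_vocab[f]] for e, f in pairs])

    # Least-squares solution to X @ W_T ~ Y, so W maps English vectors into French space.
    W_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    W = W_T.T

    # "Translate" a word outside the seed dictionary via nearest neighbour in French space.
    mapped = W @ en_vecs[en_vocab["peach"]]
    sims = fr_vecs @ mapped / (np.linalg.norm(fr_vecs, axis=1) * np.linalg.norm(mapped))
    print(fr_words[int(np.argmax(sims))])  # expected: "pêche"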

~~~
imhoguy
And what about meanings in different contexts, like financial content with
words such as Apple and Orange?

~~~
Choco31415
That is a slight problem. Disambiguation starts to require higher-level
contextual meaning, where we need to look at nearby words. This means there
are likely some word vectors whose meanings are "muddled", so to speak.

Although, I suppose if we treat "apple" and "Apple" as different words, that
would help.

Fun fact: One of the current NLP problems is detecting which words are names.
Apparently it's really tough, especially with Twitter data!

~~~
radarsat1
I suppose that if you're working across languages, this problem partially
sorts itself out. E.g. in Spanish there will be Apple and manzana, in two
different places due to their different semantics. Now for English, if you
were trying to place "apple" in that space, you would want to put it next to
both of them.

Unfortunately I see a problem in having to specify an exact position per word.
If you think of the position of English "apple" in the Spanish word space as a
distribution instead of a specific location, then it ideally should be a
bimodal distribution, with one peak next to Apple and one peak next to
manzana. If you must use a normal distribution, the variance must be wide
enough to cover both words -- a huge problem, since (a) that assigns a lot of
probable values to one word and (b) the mean value (expected value) lies
between them, not at the semantic location of "apple" at all.

------
slx26
If you are interested in multi-language text analysis, you might want to check
out FreeLing [1], a full-fledged, open source library for language analysis
written in C++ (which also happens to include a simple interface for working
with word vectors).

[1]
[http://nlp.lsi.upc.edu/freeling/node/1](http://nlp.lsi.upc.edu/freeling/node/1)

------
alexott
And corresponding word vectors: [https://fasttext.cc/docs/en/crawl-
vectors.html](https://fasttext.cc/docs/en/crawl-vectors.html)

~~~
rspeer
Are these the right vectors? The filenames correspond to the fastText vectors
I've already tried, which are only in English.

EDIT: Indeed, this is old data from a previous publication. It appears they
have not actually made the new data public yet.

~~~
rspeer
Update: that link points to the right page now.

------
gojomo
Online-reader-friendly (no 2-column PDF!) paper link:

[https://www.arxiv-vanity.com/papers/1802.06893/](https://www.arxiv-
vanity.com/papers/1802.06893/)

------
neves
Newbie here. Can someone explain what would I use these vectors for?

~~~
godelmachine
At one level, it’s simply a vector of weights. In a simple 1-of-N (or ‘one-
hot’) encoding every element in the vector is associated with a word in the
vocabulary. The encoding of a given word is simply the vector in which the
corresponding element is set to one, and all other elements are zero. It's all
about Natural Language Processing.

If you are interested in more, check out these excellent reviews by Adrian
Colyer posted in The Morning Paper.

[https://blog.acolyer.org/2016/04/21/the-amazing-power-of-
wor...](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-
vectors/)

~~~
bllguo
that's a nice link but the excerpt you quote is kind of misleading; the word
vectors in this case are not one-hot encodings. They are learned, continuous
representations. But one-hot representations are also a kind of word vector.

Word vectors are vector representations of each word in the vocabulary. Here
they are learned by a neural net. The length of the vector is the number of
features. Just for intuition, one feature the NN could learn is the gender of
a word, and so on.
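
Purely as intuition (all numbers invented, not from any trained model), the
contrast in code:

    import numpy as np

    vocab = ["apple", "peach", "king", "queen"]

    # One-hot: as long as the whole vocabulary, a single 1, and it says nothing
    # about meaning - every pair of distinct words is equally (un)related.
    one_hot_apple = np.zeros(len(vocab))
    one_hot_apple[vocab.index("apple")] = 1.0  # [1, 0, 0, 0]

    # Learned embedding: a short dense vector whose (made-up) features encode
    # meaning, e.g. food-ness, gender, size. Similar words end up close together.
    embedding_apple = np.array([0.99, -0.05, 0.20])
    embedding_peach = np.array([0.95, -0.02, 0.15])
    print(np.dot(embedding_apple, embedding_peach))  # high: the fruits are similar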

~~~
Choco31415
To help with intuition, here are a few example word vectors that we might
encounter:

"fruit": {food: 0.99, gender: -0.05, size: 0.2}

"king": {food: -0.9, gender: 0.92, size: 0.56}

Building off of what v1n337 stated, though, the axes can easily be skewed and
rotated such that they're still interpretable, just not obviously so.
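
A quick way to see that point: rotating the whole space changes every
coordinate, but preserves all the dot products, so nothing downstream can tell
the difference. A toy check (invented numbers):

    import numpy as np

    # Invented vectors; any orthogonal transform of the space leaves all
    # similarities intact, even though no axis lines up with a readable trait.
    rng = np.random.default_rng(0)
    fruit = np.array([0.99, -0.05, 0.20])
    king = np.array([-0.90, 0.92, 0.56])

    # Random orthogonal matrix (a rotation/reflection) via QR decomposition.
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    print(np.dot(fruit, king), np.dot(Q @ fruit, Q @ king))  # same value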

