
Understanding word vectors - jxub
https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469
======
wenc
The cognoscenti already know this, but word vectors are a game changer for
multi-class text classification. Finding the right representation of the text
makes the classification task so much easier.

For my problem (1000+ total classes, 1 class per input), I experimented with
Naive Bayes + TFIDF (~50% accuracy, < 1 sec training), then Word2vec + CNN
model on GPU (~70% accuracy, 6 hrs training), and finally FastText (99%
accuracy, 10 minutes training).

FastText [0] in particular is quite impressive. It is essentially a variant of
Word2Vec that also supports n-grams, and the reference implementation in C++
ships with a built-in classifier that runs on the command line (no need to set
up TensorFlow or anything like that).
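
If you'd rather call it from Python, the official bindings wrap the same
trainer. A minimal sketch (the file name and hyperparameters here are
placeholders; train.txt uses the "__label__<class> <text>" line format from
the tutorial in [0]):

    import fasttext

    # train.txt: one example per line, formatted as "__label__<class> <text>"
    model = fasttext.train_supervised(input="train.txt",
                                      epoch=10, wordNgrams=2)  # use bigrams
    labels, probs = model.predict("text of a document to classify")
    model.save_model("classifier.bin")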

Despite it only running on plain CPUs and only supporting a linear classifier,
it seems to beat GPU-trained Word2Vec CNN models in both accuracy and speed in
my use cases. I later discovered this paper from the authors comparing CNNs
(and other algorithms) to FastText, and their results track my experiences
[1].

This goes to show that while GPU-accelerated models are cool, sometimes using
a simpler, more suitable model can have a significantly better pay-off.

[0] [https://fasttext.cc/docs/en/supervised-tutorial.html](https://fasttext.cc/docs/en/supervised-tutorial.html)

[1] [https://arxiv.org/abs/1607.01759](https://arxiv.org/abs/1607.01759)

~~~
ppod
If the word embedding is, say, 400 dimensions, are you averaging the vectors
of all the words in the document to get a document vector?

~~~
isoprophlex
Just piping in to say that I'm very curious as well! How do you go from word
vectors to a description of the document?

~~~
useracd
Often the w2v embedding layer will be the first layer of a network, through
which a document of word representations will be passed. The output of the
embedding layer will be a two-dimensional tensor, with the embedding dimension
along one axis and the number of input words along the other. The next step is
often to apply zero or more convolutional units over the word direction before
some kind of pooling to get a one-dimensional output. The output of the
pooling layer will be a document representation based on the word vectors.

Other approaches like doc2vec are also used.
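
If it helps, here's a minimal sketch of that pipeline in Keras (the sizes and
layer choices are illustrative assumptions, not anything canonical):

    from tensorflow.keras import layers, models

    vocab_size, seq_len, embed_dim = 20000, 200, 300  # assumed sizes

    model = models.Sequential([
        # word indices -> (seq_len, embed_dim) tensor of word vectors
        layers.Embedding(vocab_size, embed_dim, input_length=seq_len),
        # convolve over the word direction (window of 5 consecutive words)
        layers.Conv1D(128, 5, activation="relu"),
        # pool over the word direction: one 128-dim document vector
        layers.GlobalMaxPooling1D(),
        # classification head on top of the document representation
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")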

~~~
ppod
Thank you for this very helpful information, and I have just one more
question:

"Often the w2v embedding layer will be the first layer of a network, through
which a document of word representations will be passed. The output of the
embedding layer will be a 2 dimensional tensor with the embedding dimension in
one direction and the number of input words in the other. The next step is
often to apply zero or more convolutional units over the word direction before
some kind of pooling to get a 1 dimensional output. The output of the pooling
layer will be a document representation based upon the word vectors. Other
approaches like doc2vec are also used."

I thought the approach you described _was_ doc2vec. If not, then does it have
a name/citation?

~~~
aglionby
Doc2vec is the name of the gensim implementation of this paper [0]. Briefly,
it creates document embeddings (a document being anything from a phrase, to a
sentence, to a paragraph and beyond) by either predicting a target word from
the context words plus the document vector (distributed memory -- DM), or
predicting the document's words from the document vector alone (distributed
bag of words -- DBOW).

[0] [https://arxiv.org/abs/1405.4053](https://arxiv.org/abs/1405.4053)
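
In gensim that looks roughly like this (a toy sketch; the dm flag switches
between the two training schemes, and the corpus and parameters are
placeholders):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # each document gets a tag so it receives its own vector
    docs = [TaggedDocument(words=text.split(), tags=[i])
            for i, text in enumerate(["the cat sat on the mat",
                                      "dogs bark at the mailman"])]

    # dm=1 -> distributed memory; dm=0 -> distributed bag of words
    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40, dm=1)
    vector = model.infer_vector("a cat on a mat".split())  # embed new text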

------
ovi256
A few remarks:

In the "Doing bad digital humanities with color vectors", if you consider
colors as 3D vectors, which they do, you'll see that summing enough uniformly
sampled vectors always gives medium browns, because that's the color in the
middle of the colorspace. Instead, you should model colors in a polar space
and sum vectors in that space. This will prevent going inside the sphere and
losing color saturation.

It's explained quite well in the Interpolation section here:
[http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/](http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/)
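
As a concrete sketch of what averaging in a polar space buys you (my own
illustration, treating hue as an angle so saturation survives the average):

    import colorsys
    import math

    def average_colors_polar(rgb_colors):
        """Average colors in HSV, treating hue as an angle (circular mean)."""
        hsv = [colorsys.rgb_to_hsv(*c) for c in rgb_colors]
        x = sum(math.cos(2 * math.pi * h) for h, s, v in hsv)
        y = sum(math.sin(2 * math.pi * h) for h, s, v in hsv)
        mean_h = (math.atan2(y, x) / (2 * math.pi)) % 1.0
        mean_s = sum(s for h, s, v in hsv) / len(hsv)
        mean_v = sum(v for h, s, v in hsv) / len(hsv)
        return colorsys.hsv_to_rgb(mean_h, mean_s, mean_v)

    # two saturated colors average to a saturated color, not muddy brown
    print(average_colors_polar([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]))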

If you want to understand contemporary use of word embeddings in ML, a nice
simple model is explained, with full code, here:
[https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)

The original model comes from Kim 2014:
[https://arxiv.org/abs/1408.5882](https://arxiv.org/abs/1408.5882). It's a
very neat use of CNNs for language processing, instead of the more popular
RNNs/LSTMs. CNNs have the advantage of training much faster.

~~~
Ethcad
Sorry if this is a dumb question, but wouldn’t it be a medium gray in RGB? Or
are you talking about HSV?

~~~
Maybestring
HSV is a polar model (holding Value constant).

------
visarga
Word vectors are at the same time amazing, because they contain a huge amount
of latent information, and not good enough, because they collapse a space of
very high dimensionality into ~300 dimensions, so there is a limit to how well
they can discriminate between close topics. I have done a lot of experiments
on classifying text into thousands of topics, and sometimes word vectors work
amazingly well, while other times they are really hard to use, depending on
how close together the topics I want to discriminate between are.

Another problem with word vectors is that any word might actually have
multiple senses, while vectors are just point estimates. To be correct, we
would first need to find the right sense for each word in a phrase and only
then assign the vector. There is research into "on-the-fly" word vectors that
adapt to context, but they are much harder to use.

A third problem with word vectors is out-of-vocabulary (OOV) words and words
with low frequency. For OOV words, the usual solution is to create character
or character-n-gram embeddings that can be used to compute embeddings for new
words. Low-frequency words are usually just ignored (a frequency cutoff is
applied).
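
For example, gensim's fastText implementation builds exactly these
character-n-gram embeddings (a toy sketch; the corpus and parameters are
illustrative):

    from gensim.models import FastText

    sentences = [["word", "vectors", "are", "useful"],
                 ["fasttext", "uses", "character", "ngrams"]]
    model = FastText(sentences, vector_size=50, min_count=1,
                     min_n=3, max_n=5)  # use 3- to 5-character n-grams

    # "wordy" never occurs in the corpus, but its character n-grams
    # overlap with "word", so we still get a usable vector for it
    vector = model.wv["wordy"]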

Then there is the problem of phrases and collocations: some words go together,
such as "New York" and "give up", and the meaning of the phrase is different
from the sum of the meanings of the component words. In these cases we need
lists of phrases, and we replace them in the original text before training the
vectors, so that we get proper vectors for the phrases.
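
gensim's Phrases module can build such lists automatically from co-occurrence
statistics. A toy sketch (thresholds tuned unrealistically low for the tiny
corpus):

    from gensim.models.phrases import Phrases

    sentences = [["i", "love", "new", "york"],
                 ["new", "york", "is", "big"],
                 ["never", "give", "up"],
                 ["do", "not", "give", "up"]]
    bigrams = Phrases(sentences, min_count=1, threshold=1)

    # frequent pairs are merged into single tokens before training vectors
    print(bigrams[["i", "love", "new", "york"]])
    # -> ['i', 'love', 'new_york']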

By the way, one amazing tool that goes with word vectors is the library
'annoy', which can do similarity search in logarithmic time. You can do
roughly 1000 lookups per second per CPU even if the database contains millions
of vectors, which is pretty good. Annoy can be used to find similar articles
or to make music recommendations. Another remark: my preferred word vectors
are computed with Doc2VecC (a variant of doc2vec with corruption). Doc2VecC
seems better at discriminating between topics, but the secret is to feed it
gigabytes of text.
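
Basic annoy usage looks like this (a sketch with random stand-in vectors;
sizes are arbitrary):

    import random
    from annoy import AnnoyIndex

    dim = 300
    index = AnnoyIndex(dim, "angular")  # cosine-style distance
    for i in range(10000):  # pretend these are word/document vectors
        index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
    index.build(10)  # 10 trees; more trees -> better recall, bigger index

    neighbors = index.get_nns_by_item(0, 10)  # 10 approximate neighbors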

Playing with word vectors has taught me intuitively what it is like to
navigate a space of high dimensionality. It feels different from 3D space,
because each point has shortcuts to other points: each point leads to hundreds
of other places which might be far apart from each other. It's like a
kaleidoscope, where a small change can create a very different perspective.

doc2vecC:
[https://github.com/mchen24/iclr2017](https://github.com/mchen24/iclr2017)

annoy: [https://github.com/spotify/annoy](https://github.com/spotify/annoy)

~~~
aglionby
> Another problem of word vectors is that any word might actually have
> multiple senses, while vectors are just point estimates. If we wanted to be
> correct, we needed first to find the right sense for each word in a phrase
> and only then assign the vector. There is research in "on-the-fly" word
> vectors that adapt to context, but it's much harder to use.

This is interesting, and there seems to be a bit of debate about it (at least
with compositional distributional semantic models). [0] and [1] seem to show
that sense disambiguation helps in some contexts, while [2] shows that it
doesn't in others. It isn't immediately clear who is right here, but I agree
with you that disambiguating seems likely to be helpful.

[0]
[https://www.aclweb.org/anthology/W13-3513](https://www.aclweb.org/anthology/W13-3513)

[1]
[https://www.aclweb.org/anthology/P16-1018](https://www.aclweb.org/anthology/P16-1018)

[2]
[https://www.aclweb.org/anthology/D10-1115](https://www.aclweb.org/anthology/D10-1115)

~~~
kevinwang
It seems obvious to me that disambiguating _correctly_ would necessarily
improve performance. As an extreme example, homonyms like bark (the verb) and
bark (on a tree) have nothing to do with each other, and ideally would be
considered two different words.

~~~
taeric
I imagine contranyms would be among the most difficult here. Would be fun to
see how these vectors stack up with those.

~~~
kevinwang
Well, antonyms are usually right next to each other in word embedding spaces
anyway, since the contexts they're used in are often hard to tell apart.

~~~
taeric
Contranyms would be identical in the embedding space, no?

I'm actually intrigued that nobody has made a text processor named after
[https://en.wikipedia.org/wiki/Amelia_Bedelia](https://en.wikipedia.org/wiki/Amelia_Bedelia)
yet. :)

------
stared
For a technical intro to word2vec (what it is exactly, and how to train it):
[https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html](https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html),
and a readable implementation of it in PyTorch:
[https://adoni.github.io/2017/11/08/word2vec-pytorch/](https://adoni.github.io/2017/11/08/word2vec-pytorch/).

And to get an idea of why it works, and to play with examples in your browser:
[http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)
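
And if you want to reproduce the classic analogy yourself, gensim's downloader
makes it a few lines (a sketch; the dataset name is one of gensim's standard
pre-trained downloads):

    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-100")  # downloads the vectors once
    print(wv.most_similar(positive=["king", "woman"],
                          negative=["man"], topn=3))
    # 'queen' should come out on top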

------
sumitgt
My mind was blown when I found out how easy it is to get started with
pre-trained GloVe embeddings in Keras. Took my Kaggle game up a few notches
overnight.

[https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)
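
The core of the recipe from that post, roughly (a sketch; the file name,
dimensions, and the toy word_index are assumptions):

    import numpy as np
    from tensorflow.keras.layers import Embedding

    embed_dim = 100
    word_index = {"cat": 1, "dog": 2}  # normally from a Keras Tokenizer

    # parse the GloVe text file: "word v1 v2 ... vN" per line
    embeddings = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # copy known words into the embedding matrix; unknown rows stay zero
    matrix = np.zeros((len(word_index) + 1, embed_dim))
    for word, i in word_index.items():
        if word in embeddings:
            matrix[i] = embeddings[word]

    layer = Embedding(len(word_index) + 1, embed_dim,
                      weights=[matrix], trainable=False)  # frozen weights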

------
debt
"Poetry is, at its core, the art of identifying and manipulating linguistic
similarity."

ah shit this makes me want to stand on a desk.

------
ZeroCool2u
This is a highly localized problem, but I've wanted to read the past couple of
articles by Allison and cannot, because my org has to block gists.

Allison, we know this is a lot of work for a very small group, but if you see
this, a couple of us here would be super stoked if you could mirror your
articles somewhere else as well!

~~~
wyldfire
Boy, that's a shame for them to do that. But I've been in similar places
before. Do they also block reg'lar github.com?

~~~
ZeroCool2u
Yeah, but we're part of a very small subset of places that genuinely need this
level of security. I used to do counter-threat infosec here before I moved to
data science, so I know how much work goes into our security.

Regular Github is free and clear, so we're good there.

------
radarsat1
Gem in here is that I learned about Annoy.
[https://pypi.python.org/pypi/annoy](https://pypi.python.org/pypi/annoy)

~~~
patelajay285
We've recently created an open source library that can help you get started
with pre-trained word vectors and Annoy quickly:

[https://github.com/plasticityai/magnitude](https://github.com/plasticityai/magnitude)
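
Basic usage looks like this (the file name is a placeholder for one of the
pre-converted models):

    from pymagnitude import Magnitude

    vectors = Magnitude("glove.6B.100d.magnitude")  # placeholder model file
    vec = vectors.query("cat")                  # vector for one word
    print(vectors.most_similar("cat", topn=5))  # nearest neighbors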

~~~
Radim
Interesting effort :-) Unfortunately, your comparison table there is somewhere
between misleading and insulting.

Almost all of the "unique" features listed there are in fact a standard part
of Gensim: fast approximate queries (using Annoy), memory mapping with lazy
loading, n-gram features, format converters, a Python interface,
parallelization, pre-trained models for download…

There is a way to promote cool new libraries, but this ain't it.

~~~
patelajay285
Hey Radim,

Shoot me an e-mail (link in HN profile). We _just_ created this a few days
ago, so it hasn't been up long, and I'm happy to fix any disagreements in the
benchmarks. You're absolutely right about the Annoy indexing, but to be fair,
I don't think it was part of Gensim when I started using it :). Gensim's a
great library, and Magnitude's not meant to be an attack on it (in fact we use
it for our own converter); we also provide zero-training, which Gensim does
handle as well.

~~~
patelajay285
I'll remove the comparison to Gensim :). It really wasn't meant to be an
attack on Gensim. I think it's a good and versatile library that handles a
lot, but the aim of Magnitude was to be to Gensim what Keras is to TensorFlow:
a simpler interface.

For the record, the claim was "Pythonic interface" not "Python" interface
because we support some Pythonic syntactic sugar like "cat in vectors" with
the "__contains__" method and "for key, vector in vectors" with the "__iter__"
method. It wasn't meant to be in bad faith, but I could see how that claim
could be misinterpreted, so I will remove it.

The interface is very similar to Gensim's, but Gensim is after all open
source, and we made it very similar on purpose so it could be easily swapped
out in our own internal codebase :).

Like I said, I think Gensim's a great library! Thanks for making us aware of
your concerns; I also sent you an e-mail. I'll update the repository later
today to remove the comparison.

------
melzarei
For anyone interested in more explanation, the Stanford CS224N course has two
great lectures on them.

------
hoerzu
Where is T-SNE? ;)

~~~
b_tterc_p
t-SNE is a dimensionality reduction technique. For word vectors, the usual use
case is to take, say, a 400-dimensional vector and turn it into a
2-dimensional vector that you can plot in a scatterplot. Similar things will
be close together; dissimilar things will be far apart. It's kind of like
principal component analysis, but quirkier.

For those who like t-SNE: check out the relatively new UMAP, which seems to be
faster and better.
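
A minimal sketch with scikit-learn if you want to try it (random stand-in
data; swap in real word vectors and label the points with their words):

    import numpy as np
    from sklearn.manifold import TSNE

    vectors = np.random.randn(500, 300)  # stand-in for 500 word vectors
    coords = TSNE(n_components=2, perplexity=30).fit_transform(vectors)
    # coords is (500, 2): scatter-plot it, and similar vectors land nearby.
    # umap-learn's UMAP().fit_transform is a near drop-in replacement.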

------
juanmirocks
Love the practicality of this simple tutorial. The only pity is that it's in
Python 2.

~~~
cup-of-tea
Yeah, I really don't get why they would use Python 2 for this kind of thing.

