
Word vectors are awesome but you don’t need a neural network to find them - blopeur
http://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/
======
tensor
After reading this I'm left wondering why anyone should stop using word2vec.
The article makes the point that you can produce word vectors using other
techniques, in this case by computing probabilities of unigrams and skip-grams
and running SVD.

This is all well and good, but from an industry practitioner standpoint this
doesn't explain why one would avoid using, or actually stop using word2vec.

1\. Several known-good word2vec implementations exist, so the complexity of the
technique doesn't really matter: you can just pick one of them and use it.

2\. Pretrained word vectors produced from word2vec and newer algorithms exist
for many languages.

Why should someone stop using these and instead spend time implementing a
simpler method that produces _maybe good enough_ vectors? Being a simpler
method isn't a reason in and of itself.

~~~
efavdb
He's arguing that if you want to develop some custom word embeddings (trained
on whatever data set you have specific to the task at hand), the SVD approach
is often much faster and almost as effective. The argument is: if you're
willing to make that trade, go with SVD. If you're willing to use a pre-built
library, that's certainly fastest of all!

~~~
danieldk
 _He's arguing that if you want to develop some custom word embeddings
(trained on whatever data set you have specific to the task at hand), the SVD
approach is often much faster_

That depends on your corpus and vocabulary size. word2vec is _O(n)_ where _n_
is the length of the training corpus. Vanilla SVD is _O(mn^2)_ for an _m x n_
co-occurrence matrix. I have been training word2vec and GloVe embeddings on
'web-scale' corpora and training time is usually not a problem.

 _the SVD approach is often much faster and almost as effective_

SVD-derived embeddings do badly on some tasks, such as analogy tasks (see
Levy & Goldberg, 2014).

I agree with your parent poster that the author does not really provide good
arguments against the use of word2vec et al.

Moreover,

 _But because of advances in our understanding of word2vec, computing word
vectors now takes fifteen minutes on a single run-of-the-mill computer_

Which advances in our understanding? SVD on word-word co-occurrence matrices
was proposed by Schütze in 1992 ;). There have been many works since then
exploring various co-occurrence measures (including PMI) in combination with
SVD.

~~~
Radim
Chris Moody usually delivers great blog posts; this big-worded rehash of an
ancient 2014 technique (see [https://rare-technologies.com/making-sense-of-
word2vec/](https://rare-technologies.com/making-sense-of-word2vec/) for a more
comprehensive eval of SVD vs word2vec vs GloVe) is weird.

But on twitter, Chris promised a sequel post, with extra tricks and tips. So I
see this as a warm up :) Looking forward to part 2.

~~~
mino
I thought exactly the same. That was a weird post from Chris, and he has
already covered SQL-based SVD as a w2v replacement many times in various
talks.

And your blog post is much better :)

------
jdonaldson
One other benefit of word2vec-style training is that you can also control the
learning rate, and gracefully handle new training data.

SVD must be done all at once, and you need to use sparse-matrix abstractions
for the raw word vectors. The implementation and abstractions you use actually
make it _more_ complex than word2vec imho.

Word2vec can train off pretty much any type of sequence. You can adjust the
learning rate on the fly (to emphasize earlier/later events), stop or start
incremental training, and with Doc2Vec you can train embeddings for more
abstract tokens in a much more straightforward manner (doc ids, user ids,
etc.)

While word2vec embeddings are not always reproducible, they are much more
stable under the addition of new training data. This is key if you want some
stability in a production system over time.
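For concreteness, that incremental workflow looks roughly like the following
with gensim (a minimal sketch assuming gensim 4.x parameter names and a toy
corpus, not the poster's actual setup):

    from gensim.models import Word2Vec

    # Toy corpora; in practice these are streams of tokenized sentences.
    initial_corpus = [["stop", "using", "word2vec"],
                      ["word", "vectors", "are", "awesome"]]
    new_corpus = [["incremental", "training", "of", "word", "vectors"]]

    # Initial training; alpha/min_alpha control the (decaying) learning rate.
    model = Word2Vec(initial_corpus, vector_size=100, window=5, min_count=1,
                     alpha=0.025, min_alpha=0.0001, epochs=5)

    # Later, when new data arrives: extend the vocabulary and keep training
    # from the existing weights instead of refitting from scratch.
    model.build_vocab(new_corpus, update=True)
    model.train(new_corpus, total_examples=len(new_corpus),
                epochs=model.epochs,
                start_alpha=0.01, end_alpha=0.0001)  # gentler rate for updates

    print(model.wv.most_similar("word", topn=3))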

Also, somebody edited the title of the article, thanks! The original title of
"Stop using word2vec" is click-bait FUD rubbish. I think in this case we're
trying too hard to wring a good discussion out of a bad article.

------
stared
Well, the fact that word2vec can (and should) be understood in terms of word
co-occurrences (and pointwise mutual information) is important, but hardly
new. I tried to explain it here: [http://p.migdal.pl/2017/01/06/king-man-woman-
queen-why.html](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html).

There is a temptation to use just the word pair counts, skipping SVD, but it
won't yield the best results. Creating vectors not only compresses data,
but also finds general patterns. This compression is super important for less
frequent words (otherwise we get a lot of overfitting). See "Why do low
dimensional embeddings work better than high-dimensional ones?" from
[http://www.offconvex.org/2016/02/14/word-
embeddings-2/](http://www.offconvex.org/2016/02/14/word-embeddings-2/).

~~~
taliesinb
Thanks for your blog post!

Are you familiar with
[http://www.offconvex.org/2016/07/10/embeddingspolysemy/](http://www.offconvex.org/2016/07/10/embeddingspolysemy/)?

I read a blog post maybe a year ago that explained word embeddings under an
'atom interpretation', in which word embeddings are really sparse combinations
of 'atoms' (I don't really remember more than that). It was very interesting,
but then I forgot about it. Trying to find the post again I only came up with
the above paper, which is probably the same idea, but it wasn't what I
originally read. Wish I could find it.

------
arrmn
Word embeddings are not just useful for text; they can be applied whenever
you have a relation between "tokens". You can use them to identify nodes in
graphs that belong to the same group [0]. Another, in my opinion, really
interesting idea is to apply them to relational databases [1]: you can simply
ask for similar rows.

It's an interesting article, but the author didn't really provide good
arguments for why I should stop using w2v.

[0] [http://www.kdd.org/kdd2017/papers/view/struc2vec-learning-
no...](http://www.kdd.org/kdd2017/papers/view/struc2vec-learning-node-
representations-from-structural-identity) [1]
[https://arxiv.org/abs/1603.07185](https://arxiv.org/abs/1603.07185)

~~~
Matumio
You may be interested in Facebook's recent StarSpace[1] paper, which also
shows how this simple "entity embeddings" approach can be used for different
tasks.

[1]
[https://github.com/facebookresearch/StarSpace](https://github.com/facebookresearch/StarSpace)

------
oh-kumudo
> Word vectors are awesome but you don’t need a neural network – and
> definitely don’t need deep learning – to find them

Word2vec is not deep learning (the skip-gram algorithm is basically one
matrix multiplication followed by a softmax; there isn't even a place for an
activation function, so why call it deep learning?), and it is simple and
efficient. Most of all, there is no overhead to using word2vec, just the
choice between pre-trained vectors and ones you train yourself.
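For reference, the whole skip-gram forward pass fits in a few lines of numpy.
A rough sketch, purely illustrative and not word2vec's actual code:

    import numpy as np

    vocab_size, dim = 10_000, 300
    W_in = np.random.randn(vocab_size, dim) * 0.01   # input (word) vectors
    W_out = np.random.randn(vocab_size, dim) * 0.01  # output (context) vectors

    def skipgram_probs(center_id):
        # One matrix multiplication followed by a softmax over the vocabulary;
        # no hidden layers, no activation functions.
        scores = W_out @ W_in[center_id]
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()

    p_context = skipgram_probs(42)  # P(context word | center word 42)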

I don't understand what this article tries to say.

~~~
kaspermarstal
Exactly. I was so confused about the bashing of deep learning.

------
make3
[https://github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText)
This is Facebook's super efficient word2vec-like implementation. I thought
people might find it interesting.

------
kuschku
So, where do I get premade versions of this that include all words of the 28
largest languages? This is one of the most valuable properties of word2vec and
co: prebuilt versions for many languages, with every word of the dictionary in
them.

Once you have that, we can talk about actually replacing word2vec and similar
solutions.

~~~
rspeer
Look, let's talk about replacing word2vec because it's old and not that good
on its own. Everyone making word vectors, including this article, compares
them to word2vec as a baseline because beating word2vec on evaluations is so
easy. It's from 2013 and machine learning moves fast these days.

You can replace pre-trained word2vec in 12 languages (with aligned vectors!)
with ConceptNet Numberbatch [1]. You can be sure it's better because of the
SemEval 2017 results where it came out on top in 4 out of 5 languages and 15
out of 15 aligned language pairs [2]. (You will not find word2vec in this
evaluation because it would have done poorly.)

If you want to bring your own corpus, at least update your training method to
something like fastText [3], though I recommend looking at how to improve it
using ConceptNet anyway, because distributional semantics alone will not get
you to the state of the art.

Also: what pre-built word2vec are you using that actually contains valid word
associations in many languages? Something trained on just Wikipedia? Al-Rfou's
Polyglot? Have you ever actually tested it?

[1] [https://github.com/commonsense/conceptnet-
numberbatch](https://github.com/commonsense/conceptnet-numberbatch)

[2]
[http://nlp.arizona.edu/SemEval-2017/pdf/SemEval002.pdf](http://nlp.arizona.edu/SemEval-2017/pdf/SemEval002.pdf)

[3] [https://fasttext.cc/](https://fasttext.cc/)

~~~
rpedela
For pre-trained word2vec and fastText vectors, I think kuschku is talking
about
[https://github.com/Kyubyong/wordvectors](https://github.com/Kyubyong/wordvectors)

~~~
rspeer
One thing that pre-trained word2vec and GloVe have going for them (and which
is incorporated in ConceptNet Numberbatch) is that they're trained on broader
corpora, Google News and the Common Crawl respectively. Of course that's what
limits them to English, if you're not willing to use a knowledge graph to
align things across languages.

Training on just Wikipedia is not a representative model of any language,
because not all text is written like an encyclopedia.

~~~
rpedela
Of course the broader the corpora the better in general (sometimes more narrow
is better), but Wikipedia is certainly better than nothing.

------
serveboy
word2vec yields better representations than PMI-SVD. If you want a better
explicit PMI matrix factorization, have a look at
[https://github.com/alexandres/lexvec](https://github.com/alexandres/lexvec)
and the original paper, which explains why SVD performs poorly.

If you are looking for word embeddings for production use, check out
fastText, LexVec, GloVe, or word2vec. Don't use the approach described in this
article.

~~~
serveboy
Pass the "-matrix pmi" flag to factor the PMI matrix as by default it
factorizes PPMI.

------
fnl
SVD scales with the number of items cubed; w2v scales linearly. Typical real-
world vocabularies are 1-10M, not 10-100k. This article is FUD at best and,
IMO, just plain BS.

~~~
ctchocula
My understanding was that you can get some savings from keeping the sparse
matrix and running sparse SVD via scipy.sparse.linalg.svds(PMI, k=256). I am
not certain about the exact time complexity however.
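A minimal sketch of that sparse route (the shapes and PMI values here are
made-up placeholders, just to show the call):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    # Sparse PMI entries for observed (word, context) pairs only.
    rows = np.array([0, 0, 1, 2, 3])
    cols = np.array([1, 3, 2, 0, 1])
    vals = np.array([1.2, 0.4, 2.1, 0.7, 1.5])
    PMI = csr_matrix((vals, (rows, cols)), shape=(4, 4))

    # Truncated SVD: only the top-k singular vectors are computed, and the
    # Lanczos-style iterations work directly on the sparse matrix.
    k = 2  # k=256 above; tiny here because the toy matrix is only 4x4
    U, S, Vt = svds(PMI, k=k)

    # Rows of U scaled by sqrt(S) are one common choice of word embeddings.
    word_vectors = U * np.sqrt(S)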

~~~
fnl
Some minor space savings, maybe. But SVD _runtime_ still scales with the cube
of your vocabulary size. Good luck with SVD on a vocabulary from Wikipedia or
Common Crawl. If anything, using traditional count-based approaches is good
when you can only use a small corpus with a tiny vocabulary (<100k) to develop
your word embeddings. But that's not what this article is proclaiming. Oh, and
yeah, do use fastText, not good old word2vec.

~~~
ctchocula
I don't think you are correct here. The advantage of using sparse storage and
sparse matrix multiplication is that you can get savings in both storage and
runtime. There would be no point in using sparse storage if runtime still
scaled with the cube of vocab size. It would be that way if the best way of
getting sparse SVD is by materializing a dense matrix product, but people have
discovered smarter ways using sparse matvec. The time complexity for obtaining
k eigenvalues seems to be O(dkn) where d is the average number of nonzeroes
per row, and n is the vocab size [1]. Therefore, one can assert that sparse
SVD too is linear in vocab size just like word2vec.

This is corroborated by the link elsewhere in this thread that shows SVD
enjoying lower wall clock time on the 1.9B word Wikipedia dataset [2].

[1]
[https://en.wikipedia.org/wiki/Lanczos_algorithm#Application_...](https://en.wikipedia.org/wiki/Lanczos_algorithm#Application_to_the_eigenproblem)

[2] [https://rare-technologies.com/making-sense-of-word2vec/](https://rare-
technologies.com/making-sense-of-word2vec/)

~~~
fnl
As to [1]: Yes, I was not honest in the sense that non-standard SVD
implementations for generating your PMIs will scale with the square of |V|,
not the cube. But as I will go on to show, that is not good enough to make
count-based approaches competitive with predictive ones.

Re. [2], these measurements by Radim have several issues. First, word2vec is a
poor implementation, CPU-usage wise, as can be seen by profiling word2vec
(fastText is much better at using your CPUs). Second, even Radim states there
that his SVD-based results are significantly poorer than the w2v embeddings
("the quality of both SPPMI and SPPMI-SVD models is atrocious"). Third,
Radim's conclusion there is: "TL;DR: the word2vec implementation is still fine
and state-of-the-art, you can continue using it :-)".

So I don't really get your points. Instead of referencing websites and blogs,
let's take a deeper look at a "proponent" of count-based methods in a peer-
reviewed setting. In Levy & Goldberg's SPPMI model [1,2] they use truncated
SVD. (FYI, that proposed model, SPPMI, is what got used in Radim's blog
above.) So even if you wanted to use SPPMI instead of the sub-optimal SVD
(alone), you would first have to find a really good implementation of it,
i.e., something that is competitive with fastText.

Also note that they only used 2-word windows for SGNS in most comparisons,
which makes the results for neural embeddings a bit dubious. You would
typically use 5-10, and as shown in Table 5 of [2], SGNS is pretty much the
winner in all cases as it "approaches" a 10-word window.

Next, I would only trust Hill's SimLex as a proper evaluation target for word
similarity - simply look at the raw data of the various evaluation datasets
yourself and read Hill's explanations of why he created SimLex, and I am sure
you will agree. "Coincidentally", it also is - by a huge margin - the most
difficult dataset to get right (i.e., all approaches perform worst on SimLex).
However, SGNS nearly always outperforms SVD/SPPMI on precisely that set.

Finally, _even Levy et al._ had to conclude: "Applying the traditional count-
based methods to this setting [=large-scale corpora] proved technically
challenging, as they consumed too much memory to be efficiently manipulated."
So even if they "wanted" to conclude that SVD is just as good as neural
embeddings, their own results (Table 5) and this statement lead us to a
clearly different conclusion: if you use a large enough window size, you are
better off with neural embeddings, particularly for large corpora. And this
work only compares W2V & GloVe to SVD & SPPMI, while fastText in turn works _a
lot_ better than "vanilla" SGNS and GloVe. What I do agree with is that
properly tuning neural embeddings is a bit of a black art, much like anything
with the "neural" tag on it...

QED; This article is horseradish. Neural embeddings work significantly better
than SVD, and SVD is significantly harder to scale to large corpora. Even if
you use SPPMI or other tricks.

[1] [https://papers.nips.cc/paper/5477-neural-word-embedding-
as-i...](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-
matrix-factorization) [2]
[https://www.transacl.org/ojs/index.php/tacl/article/view/570](https://www.transacl.org/ojs/index.php/tacl/article/view/570)

~~~
greeneggs
I don't understand what you mean by "non-standard SVD implementations." No SVD
implementation is going to compute more singular vectors than you ask it to.
It is neither cubic time, as you first said, nor quadratic time, as you now
say. The dimension-dependence is linear.

~~~
fnl
Nope, it's not linear [1].

ADDENDUM: To which I should add, to avoid more discussions, that _parallel_
methods on _dense_ matrices exist that essentially use a prefix sum approach
and double the work, but thereby decrease the absolute running time [2].
However, as that exploits parallelism and requires _dense_ matrices, that does
not apply to this discussion.

[1]
[https://link.springer.com/chapter/10.1007%2F978-1-4615-1733-...](https://link.springer.com/chapter/10.1007%2F978-1-4615-1733-7_22)

[2]
[http://www.netlib.org/lapack/lawnspdf/lawn283.pdf](http://www.netlib.org/lapack/lawnspdf/lawn283.pdf)

~~~
fnl
And here is a reference to what I mean by non-standard methods.

[http://sysrun.haifa.il.ibm.com/hrl/bigml/files/Holmes.pdf](http://sysrun.haifa.il.ibm.com/hrl/bigml/files/Holmes.pdf)

~~~
fnl
Finally, to tie this discussion off, two truly official references that
explicitly address the issue of runtime complexity.

In the best case, as determined by Halko et al., your rank-k approximation of
an _n x m_ term-document matrix is O(nmk), and randomized approximations get
that down to O(nm log(k)) [1]. And, according to Rehurek's own investigations
[2], those approximated eigenvectors are typically good enough. I.e., in both
cases, the decomposition scales with the _product_ of documents and words, not
their _sum_. Therefore, this is _clearly_ not a linear problem.

On top of that, when these inverted indices grow too large to be computed on
a single machine, earlier methods required k passes over the data. These newer
approaches [1,2] can make do with a single pass, meaning that the thing that
indeed scales linearly here is the performance gain from scaling your SVD
across a cluster with these newer approaches. Maybe this is the source of
confusion for some commenters here.

[1]
[https://authors.library.caltech.edu/27187/](https://authors.library.caltech.edu/27187/)

[2]
[https://link.springer.com/chapter/10.1007%2F978-3-642-20161-...](https://link.springer.com/chapter/10.1007%2F978-3-642-20161-5_29)

------
rpedela
This is a great, simple explanation of word vectors. However, I think the
argument would have been stronger with numbers showing that this simplified
method and word2vec are similarly accurate, as the author claims.

------
kirillkh
Asking as someone who barely has any clue in this field: is there a way to
use this for full-text search, e.g. Lucene? I know from experience that for
some languages (e.g. Hebrew) there are no good stemmers available out of the
box, so can you easily build a stemmer/lemmatizer (or even something more
powerful [1]) on top of word2vec or fastText?

[1] E.g., for each word in a document or a search string, it would generate
not just its base form, but also a list of top 3 base forms that are
different, but similar in meaning to this word's base form (where the meaning
is inferred based on context).

~~~
visarga
You can do all that and more: for example, to find lexical variations of a
word, just compute word vectors for the corpus and then search for the vectors
most similar to a root word that also share its first letters (the first 3 or
4 letters). It's almost perfect at finding not only legitimate variations, but
also misspellings.

In general, if you want to search over millions of documents, use Annoy from
Spotify. It can index millions of vectors (document vectors for this
application) and find similar documents in logarithmic time, so you can search
in large tables by fuzzy meaning.

[https://github.com/spotify/annoy](https://github.com/spotify/annoy)
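A minimal Annoy sketch (the vectors below are random placeholders standing in
for real document vectors):

    import random
    from annoy import AnnoyIndex

    dim = 100
    index = AnnoyIndex(dim, 'angular')  # angular ~ cosine distance

    # One item per document.
    for doc_id in range(10_000):
        index.add_item(doc_id, [random.gauss(0, 1) for _ in range(dim)])

    index.build(10)         # number of trees: more trees, better recall
    index.save('docs.ann')  # memory-mapped, so it can be shared by processes

    query = [random.gauss(0, 1) for _ in range(dim)]
    similar_docs = index.get_nns_by_vector(query, 10)  # approximate top-10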

------
hellrich
One argument for SVD is the low reliability (as in results fluctuate with
repeated experiments) of word2vec embeddings, which hampers (qualitative)
interpretation of the resulting embedding spaces, see:
[http://www.aclweb.org/anthology/C/C16/C16-1262.pdf](http://www.aclweb.org/anthology/C/C16/C16-1262.pdf)

------
Piezoid
Random projection methods are cheaper alternatives to SVD. For example, you
can bin contexts with a hash function and count collocations between words and
binned contexts the same way this article does. Then apply weighting and SVD
if you really want the top n principal components.

What's nice with counting methods is that you can simply add matrices from
different collections of documents.
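A crude sketch of that hashed-context counting (everything here is
illustrative and not tied to any particular library):

    from collections import Counter

    NUM_BINS = 1024  # hashed context dimensions; collisions act as a cheap
                     # random projection of the context space

    def hashed_collocations(sentences, window=2):
        counts = Counter()  # counts[(word, bin)] = co-occurrence count
        for tokens in sentences:
            for i, word in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[(word, hash(tokens[j]) % NUM_BINS)] += 1
        return counts

    # Counts from different document collections simply add together:
    #   total = hashed_collocations(corpus_a) + hashed_collocations(corpus_b)
    # (Python's built-in hash is salted per process, so use a stable hash if
    # counts need to be merged across runs.)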

------
Radim
Article explaining this relationship between matrix factorizations (SVD) and
word2vec [2014]:

[https://rare-technologies.com/making-sense-of-word2vec/](https://rare-
technologies.com/making-sense-of-word2vec/)

(also contains benchmark experiments with concrete numbers and Github code --
author here)

------
kevinalbert
I must be missing something here - in step 3, PMI for x, y is calculated as:

log( P(x|y) / ( P(x)P(y) ) )

Because the skipgram probabilities are sparse, P(x|y) is often going to be
zero, so taking the log yields negative infinity. The result is a dense PMI
matrix filled (mostly) with -Inf.

Should we be adding 1 before taking the log?
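For what it's worth, the workaround other comments allude to is to compute
(positive) PMI only over observed pairs, so zero-probability pairs never reach
the log and the matrix stays sparse. A rough sketch, assuming raw skip-gram
counts rather than probabilities:

    import numpy as np

    def ppmi_entries(pair_counts, word_counts, total_pairs):
        # pair_counts[(w, c)]: count of the observed skip-gram pair (w, c)
        # word_counts[w]: marginal count of w over those same pairs
        # total_pairs: total number of skip-gram pairs
        entries = {}
        for (w, c), n_wc in pair_counts.items():
            pmi = np.log((n_wc * total_pairs) /
                         (word_counts[w] * word_counts[c]))
            if pmi > 0:  # clip negative/undefined values -> PPMI
                entries[(w, c)] = pmi
        return entries  # feed into a sparse matrix, then truncated SVD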

------
wodenokoto
I thought w2v didn't have any hidden layers or non-linear activation function,
making it essentially a linear regression.

Do I need to reread some papers?

------
make3
Also, word2vec is super fast and works great. The text has no convincing
argument for why not to use it, unless you don't want to learn basic neural
nets. Even then, just use Facebook's fastText:
[https://github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText)

------
justwantaccount
I thought this article was going to talk about GloVe, which actually performs
better than Google's word2vec without a neural network according to its paper,
but I guess not.

------
anentropic
One of the tables in the article mentions something called "word2tensor", but
Google doesn't throw up anything,

except this tweet, which seems to be from a conference:
[https://twitter.com/ic/status/756918600846356480?lang=en](https://twitter.com/ic/status/756918600846356480?lang=en)

Does anyone have any info about it?

~~~
hellrich
Baroni published on using a word-word-link (e.g., object-of) tensor instead of
the more common word-word matrices:
[http://www.mitpressjournals.org/doi/pdf/10.1162/coli_a_00016](http://www.mitpressjournals.org/doi/pdf/10.1162/coli_a_00016)

------
KasianFranks
Here's one more reason:

Word2Vec is based on an approach from Lawrence Berkeley National Lab. As
posted in the "Bag of Words Meets Bags of Popcorn" Kaggle discussion: "Google
silently did something revolutionary on Thursday. It open sourced a tool
called word2vec, prepackaged deep-learning software designed to understand the
relationships between words with no human guidance. Just input a textual data
set and let underlying predictive models get to work learning."

“This is a really, really, really big deal,” said Jeremy Howard, president and
chief scientist of data-science competition platform Kaggle. “… It’s going to
enable whole new classes of products that have never existed before.”
[https://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-
learn...](https://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-
the-masses-you-can-thank-google-later/)

Spotify seems to be using it now:
[http://www.slideshare.net/AndySloane/machine-learning-
spotif...](http://www.slideshare.net/AndySloane/machine-learning-spotify-
madison-big-data-meetup) pg 34

But here's the interesting part:

Lawrence Berkeley National Lab had been working on an approach more detailed
than word2vec (in terms of how the vectors are structured) since 2005, judging
by the bottom of their patent:
[http://www.google.com/patents/US7987191](http://www.google.com/patents/US7987191)
The Berkeley Lab method also seems much more exhaustive, using a Fibonacci-
based distance decay for proximity between words such that vectors contain up
to thousands of scored and ranked feature attributes beyond the bag-of-words
approach. They also use filters to control the context of the output. It was
also made part of search/knowledge-discovery tech that won a 2008 R&D 100
award
[http://newscenter.lbl.gov/news-
releases/2008/07/09/berkeley-...](http://newscenter.lbl.gov/news-
releases/2008/07/09/berkeley-lab-wins-four-2008-rd-100-awards/) &
[http://www2.lbl.gov/Science-
Articles/Archive/sabl/2005/March...](http://www2.lbl.gov/Science-
Articles/Archive/sabl/2005/March/06-genopharm.html)

A search company that competed with Google called "seeqpod" was spun out of
Berkeley Lab using the tech but was then sued for billions by Steve Jobs
[https://medium.com/startup-study-group/steve-jobs-made-
warne...](https://medium.com/startup-study-group/steve-jobs-made-warner-music-
sue-my-startup-9a81c5a21d68#.jw76fu1vo) and a few media companies
[http://goo.gl/dzwpFq](http://goo.gl/dzwpFq)

We might combine these approaches, as there seems to be something fairly
important happening in this area. Recommendations and sentiment analysis seem
to be driving the bottom lines of companies today, including Amazon, Google,
Netflix, Apple et al.

[https://www.kaggle.com/c/word2vec-nlp-
tutorial/discussion/12...](https://www.kaggle.com/c/word2vec-nlp-
tutorial/discussion/12349)

~~~
visarga
We don't need w2v precursors from 2005; we have more embeddings than we care
to use, and we can start from random embeddings and train them on the project
for even better results.

------
make3
Professionals don't use pretrained word2vec vectors in the really complex
deep learning models (like neural machine translation) anymore; they let the
models train their own word embeddings directly, or let the models learn
character-level embeddings.

~~~
arrmn
What exactly do you mean by "they let the models train their own word
embeddings"? Can you elaborate on this, or are there any current papers about
this topic?

~~~
make3
The embedding layer is the layer that converts the one-hot word feature into
a continuous multi-dimensional vector that the deep net can learn with. They
used to pretrain that layer separately with word2vec. Now, since it's just a
neural net layer, they let the translation model train it with backprop on the
main task (translation / dialog / QA, etc.) as a regular layer.
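A bare-bones PyTorch sketch of that setup, just to make the point concrete
(this is illustrative and not taken from any particular paper):

    import torch
    import torch.nn as nn

    class TinyClassifier(nn.Module):
        def __init__(self, vocab_size=10_000, emb_dim=256, num_classes=5):
            super().__init__()
            # The embedding layer: a lookup table mapping word ids to dense
            # vectors. No word2vec pretraining; it is a regular layer that
            # gets updated by backprop on the main task.
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.classifier = nn.Linear(emb_dim, num_classes)

        def forward(self, token_ids):            # (batch, seq_len)
            vectors = self.embedding(token_ids)  # (batch, seq_len, emb_dim)
            pooled = vectors.mean(dim=1)         # crude sentence representation
            return self.classifier(pooled)

    model = TinyClassifier()
    logits = model(torch.randint(0, 10_000, (8, 20)))  # random token ids
    loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (8,)))
    loss.backward()  # gradients flow into the embedding table like any layer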

------
phy6
I find these baiting titles tiresome, and I generally assume (even if it
makes an ass out of me) that the author is splitting hairs or wants to
grandstand on some inefficiency that most of us knew was there already. (I'm
assuming that if they had a real argument, it would have been in a descriptive
title.) With titles like these I'll go straight to the comments section before
giving you any ad revenue. This is HN, not Buzzfeed, and we deserve better
than this.

~~~
2bitencryption
"stop using [something useful]" considered harmful

~~~
mdellabitta
'"stop using [something useful]" considered harmful' is dying

------
JKirchartz
Wake me up when there are readily available libraries implementing this in
multiple programming languages.

~~~
s17n
There are libraries for SVD in any language you'd want to use.

------
yters
This is excellent. This is the kind of machine learning we need, that provides
understanding instead of "throw this NN at this lump of data and tweak
parameters until the error is small enough."

