
Parallelizing Word2Vec in Multi-Core and Many-Core Architectures - JoshTriplett
https://arxiv.org/abs/1611.06172
======
minimaxir
It's worth noting that fastText, which was made in part by the original
word2vec authors, can handle as many cores as you throw at it.

[https://github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText)

~~~
boomzilla
The original word2vec, written in C, could utilize all the cores available.
It's actually refreshing to read that code: a true one-person shop, hard
engineering code :)

~~~
ncdr
That's not true - they don't lock global weights when updating, so if you have
lots of cores and threads, asynchronous updates will result in very poor
accuracy, making training useless.

~~~
utopcell
The opposite is actually true.

If you train on the big dataset produced by demo-train-big-model-v1.sh (which
includes news corpora from 2012 & 2013, the 1BN-word dataset from statmt, the
UMBC web corpus and all of Wikipedia) using only one thread, accuracy on the
Google analogy dataset drops to 68% (down from ~71.5% when using 20 threads).

This is due to the learning-rate schedule used: the learning rate is reduced
linearly with the number of processed words. When K threads are used, the
input dataset is split into K parts that are processed in parallel, which
means that more parts of the dataset have a chance to influence the resulting
vectors at the beginning of the computation (when the learning rate is still
relatively high).
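
For reference, this is roughly how the reference word2vec.c implements that
decay; a simplified sketch with names following the original code, not a
verbatim excerpt. word_count_actual is a single counter shared by all
threads, so with K threads the early, high-alpha phase is spread across K
different parts of the corpus:

    /* Simplified sketch of the linear learning-rate decay in the
     * reference word2vec.c (names follow the original code). */
    typedef float real;

    real starting_alpha = 0.025;   /* default for skip-gram */
    long long train_words;         /* total words in the corpus */
    long long word_count_actual;   /* words processed so far, by all threads */
    long long iter = 5;            /* number of training epochs */

    real current_alpha(void) {
        real alpha = starting_alpha *
            (1 - word_count_actual / (real)(iter * train_words + 1));
        if (alpha < starting_alpha * 0.0001)
            alpha = starting_alpha * 0.0001;
        return alpha;
    }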

~~~
ncdr
No it is not: as the number of cores approaches infinity, the validation
accuracy will approach zero, due to the lack of locking of shared memory.
There is definitely a sweet spot in the number of cores for the original
code, but it does not scale to infinity. Therefore, it cannot utilize any
number of cores.

~~~
utopcell
Aligned float updates are atomic on all architectures that matter. Also,
unsynchronized parameter updates for SGD have actually been studied in [1],
where it was shown that they do not materially hurt convergence.

In the limit, convergence would indeed suffer, as all updates would happen in
parallel.

[1] Recht, Benjamin, et al. "Hogwild!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent." Advances in Neural Information Processing
Systems, 2011.
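
For illustration, a toy Hogwild-style loop in C (a minimal sketch, not the
word2vec code): each thread applies SGD steps to a shared weight array with
plain stores and no locks, and the occasional lost update is tolerated
because updates are sparse and rarely collide:

    /* Toy Hogwild-style loop: 8 threads update a shared parameter
     * vector without locks or atomics.  Races may drop individual
     * updates, but sparse, rarely-colliding updates barely hurt
     * convergence [1].  Compile with -pthread. */
    #include <pthread.h>
    #include <stdlib.h>

    #define DIM      1000
    #define STEPS    1000000
    #define NTHREADS 8

    static float weights[DIM];             /* shared, unsynchronized */

    static void *worker(void *arg) {
        unsigned int seed = (unsigned int)(size_t)arg;
        for (long i = 0; i < STEPS; i++) {
            int j = rand_r(&seed) % DIM;       /* sparse update: one coordinate */
            float grad = weights[j] - 1.0f;    /* gradient of (w[j] - 1)^2 / 2 */
            weights[j] -= 0.01f * grad;        /* plain read-modify-write, no lock */
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (size_t t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)(t + 1));
        for (size_t t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }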

~~~
nkurz
There's another paper describing the "Hogbatch" approach that shows more
precisely the effect of adding cores on accuracy:
[http://www.ece.ubc.ca/~matei/papers/ipdps16.pdf](http://www.ece.ubc.ca/~matei/papers/ipdps16.pdf).

The summary would be that accuracy per pass suffers slightly, but since the
speedup is close to linear for the first dozen or so cores, each pass is much
faster to run. The result is that the wall time to achieve a given level of
accuracy is much shorter despite the slightly lower accuracy per pass.

------
novalis78
"...and process hundreds of millions of words per second, which is the fastest
word2vec implementation to the best of our knowledge." Sweet! Just today I saw
someone on the gensim mailing list trying to process 250Gb worth of data
feeding it into word2vec
[https://groups.google.com/forum/#!topic/gensim/QvSJd4Ma6oE](https://groups.google.com/forum/#!topic/gensim/QvSJd4Ma6oE)

------
utopcell
100M words/sec is quite impressive, although this approach does not seem to be
able to handle very large dictionaries.

If the authors are reading this, it'd be nice to see the actual accuracy on
the Google analogy dataset (the paper states it is within 1% of the reference
implementation) and the performance on the large dataset produced by
demo-train-big-model-v1.sh.

Incidentally, at Yahoo we can learn from a dataset of 1066 billion words,
with a dictionary of 1.42 billion unique terms, in 7344 seconds (~145M
words/sec).

~~~
programnature
How?

~~~
utopcell
We formulate SGNS word2vec as a distributed graph problem, where the nodes
are all unique tokens in the corpus (the dictionary) and the edges are
defined by skipgrams: for a skipgram (w_in, w_center), there is an edge from
w_in to w_center.

Tokens are randomly distributed over a set of workers. Each worker iterates
over its edges in parallel with all other workers and performs the appropriate
computation.

Drawing negative samples is done in two steps. We first draw a worker W from
a suitable distribution over the workers and then draw a word from W. The
overall word sampling distribution is the same as in the reference
implementation (i.e., the unigram distribution raised to the 3/4 power).
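
A rough single-process sketch of that two-step draw (hypothetical names, not
our actual distributed code). Picking the worker in proportion to its share
of the total unigram^(3/4) mass, then a word from that worker's local table,
reproduces the reference distribution exactly:

    /* Sketch of the two-step negative-sampling draw (hypothetical
     * names; a single-process illustration, not the distributed code).
     * P(word) = P(worker) * P(word | worker)
     *         = (worker_mass / total) * (word_mass / worker_mass)
     *         = word_mass / total, i.e. normalized unigram^(3/4). */
    #include <stdlib.h>

    #define NUM_WORKERS 4

    typedef struct {
        int    *words;   /* token ids owned by this worker */
        double *cum;     /* cumulative, normalized unigram^(3/4) weights */
        int     n;
    } Worker;

    /* cumulative share of the total unigram^(3/4) mass per worker */
    static double worker_cum[NUM_WORKERS];

    static int draw_worker(void) {           /* step 1: pick a worker W */
        double r = drand48();
        for (int w = 0; w < NUM_WORKERS; w++)
            if (r <= worker_cum[w]) return w;
        return NUM_WORKERS - 1;
    }

    static int draw_word(const Worker *w) {  /* step 2: pick a word from W */
        double r = drand48();
        int lo = 0, hi = w->n - 1;
        while (lo < hi) {       /* binary search the cumulative table */
            int mid = (lo + hi) / 2;
            if (w->cum[mid] < r) lo = mid + 1; else hi = mid;
        }
        return w->words[lo];
    }

    int negative_sample(const Worker workers[]) {
        return draw_word(&workers[draw_worker()]);
    }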

This work will soon be made public [1].

[1] Stergios Stergiou, Zygimantas Straznickas, Rolina Wu and Kostas
Tsioutsiouliklis, "Distributed Negative Sampling for Word Embeddings." AAAI
2017.

------
pavanky
BTW the NVIDIA results they are comparing to are from the previous generation.
The current generation has 50% more FLOPS and 33% more bandwidth.

~~~
yvdriess
Titan-X is the current generation; DeepBench also shows results with the
Titan-X. The P100 has not been released.

~~~
jimfleming
Titan X is a product line spanning multiple generations, with the Pascal
architecture being the latest.

