
Differences between the word2vec paper and its implementation - bollu
https://github.com/bollu/bollu.github.io#everything-you-know-about-word2vec-is-wrong
======
DannyBee
Speaking as someone who has read about 40 years of papers in compiler
optimization, I find this very interesting.

In the early days, there were fairly exact algorithms that worked as
described, and were implemented as described, but were pseudocoded in papers.
Where the pseudocode differed from the implementation, the differences were
described in great detail (i.e., they might say an array can be shared but
isn't, to make the pseudocode easier to describe).

Then, over time, things start to get further away from that. You start to see
papers published with algorithms that either don't work as described, or are
so inefficient as to be unusable. Like literally cannot work. Where you can
get the source code, looking at it later shows that's not what they did at
all.

One infamous example of this is SSAPRE - to this day, people have a lot of
trouble understanding the paper (and it has significant errors that make the
algorithm incorrect as written). The concept sure, but the exact algorithm -
less so. Reading the source code to it in Open64 - it is just wildly different
than the paper (and often requires a lot of thought for people to convince
themselves it is correct).

It's not just better engineering/datastructures vs research algorithms.

The one shining counterexample is the Rice folks, who wrote their massively
scalar compiler in nuweb (one of many literate-programming environments), so
the descriptions/papers and code were in the same place - these are very, very
readable and useful papers in my experience.

Nowadays it's coming back to the earlier days due to GitHub et al. People seem
to try to make the code more like the paper algorithm since they now release
the code.

Word2vec appears to be a counterexample (maybe because they released the code,
they didn't feel the need to get the paper quite as right).

~~~
bayareanative
Editors gotta be more rigorous and only accept papers with completely
reproducible, portable examples, i.e., Docker images, literate code, and
source code repos. Pseudocode is helpful for staying platform-neutral, but if
it's not precise enough to be implemented as code, then it's still a
proprietary figment of someone else's imagination, akin to the squishy social
sciences where almost anything goes, not rigorously reproducible science. Keep
the standards high or the quality will taper off.

PS: Sometimes I think some researchers believe they're helping themselves by
keeping their research proprietary, so they will be able to monetize their
special knowledge or implementation, especially if no one else can make it
work ("knowledge" (job) security/silo). Why do the hard work of figuring out a
novel AI/ML algorithm if others can readily monetize it commercially without
recompense? (Modern Western civ doesn't have a good patronage system to
uniformly support arts, trades, and sciences.)

~~~
opportune
In some fields, like ML/AI or other data-sciencey fields, keeping your code /
training data closed prevents other researchers from building or improving on
your work. It's more than just monetization; in that case it's
tragedy-of-the-commons career growth.

~~~
duckmysick
> keeping your code / training data closed prevents other researchers from
> building or improving on your work.

I might be naive in this regard, but isn't that the main point of doing
research?

~~~
opportune
In theory, yes. The actual main point of doing research, from the researcher's
point of view, is to advance their career. Typically this means posting
cutting-edge results in high impact journals using novel methods.
Relinquishing control over the crucial details that allow a researcher to
continue publishing high-impact papers would increase the competition for that
researcher in their academic space, making it harder for that individual to
advance their career.

~~~
duckmysick
How do the sources of funding (R&D divisions, universities, governments) fit
into this picture? Do they have access to the "secret sauce", or do they have
to pay extra consultation fees on top of research grants?

------
RyEgswuCsn
I find the title of the article rather exaggerated...

As for the first difference pointed out in the article, one of the CS224D
lectures on word2vec did address it:

[https://youtu.be/aRqn8t1hLxs?t=2650](https://youtu.be/aRqn8t1hLxs?t=2650)

It was also mentioned later in the lecture that having two vectors
representing each word is meant to make the optimisation easier (so it's kind
of a trick); in the end, the two learnt vectors have to be averaged to arrive
at a single vector for each word.
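
A minimal numpy sketch of that averaging step (matrix names are my own; in
implementations like the original C code these roughly correspond to `syn0`
and `syn1neg`):

```python
import numpy as np

vocab_size, dim = 10_000, 300

# Two embeddings per word: one used when the word is the input (center)
# word, one used when it is the output (context) word.
W_in = np.random.rand(vocab_size, dim)   # input / center-word vectors
W_out = np.random.rand(vocab_size, dim)  # output / context-word vectors

# After training, collapse the two into one vector per word by averaging.
W_final = (W_in + W_out) / 2
```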

To be fair, the fact that each word is represented by two vectors was also
mentioned in the original paper describing word2vec:

[https://arxiv.org/pdf/1310.4546.pdf](https://arxiv.org/pdf/1310.4546.pdf)

On page 3, just beneath equation (2).

Why so surprised?

------
billconan
For the past week I have been frustrated by the open-source code of a deep
learning paper. This sort of thing is so common in academia. The particular
code I looked at has missing documentation, hardcoded local paths, broken
dataset download links, and broken pretrained model download links. I had to
fix bugs before the code could run. I'm very curious how the author ran that
code with those bugs.

I call them insincerely open-sourced projects.

~~~
throwaway287391
This sort of thing is aggravating to read. Frankly it comes off as really
entitled. As researchers, the expectation is now that we not only have to do
the research and write a paper like the good old days, but we have to release
the code too. Okay, fine. But now that's not enough either -- the code has to
be well-documented and clean. Ugh, alright, fine -- it's going to take me a
few extra weeks of not doing research, but I'll clean up all the code, rerun
experiments to make sure it all still works like before, and add a bunch of
documentation. But no, still not good enough -- it has to run at the press of
a button in _your_ particular programming environment. If we don't know how to
write a script (or couldn't be bothered to spend the time writing one) to
check that the data is on disk and, if not, crawl a website to download some
huge dataset in one click, test our code on your OS, your CPU/GPU/TPU/...,
etc., we were being "insincere" with our open-sourcing efforts.

~~~
KirinDave
Pardon me, maybe I just misunderstood the whole idea of research, but _what
good is it if it's not reproducible?_

I can understand it may be part of a meaningful personal journey for you, and
I appreciate that. But if no one else can validate your research they're
correct to discredit it and you.

So what is the optimal outcome here? Should we hold you to a standard of
reproducibility even if it is as minimal as, "actually describe your
algorithms correctly and don't misrepresent a piece of code and a paper?" Or
should everyone just decide you can find your own research funding if it's not
going to help anyone?

~~~
archgoon
The original idea behind "reproducible" is that the ideas conveyed in the
paper should be enough to reproduce the results. Physicists and biologists are
not expected to drive over to your lab to figure out what's wrong with your
setup.

Now, that said, reproducibility is terrible in many fields. CS has an
opportunity to act as a trailblazer here, but it should be noted that this
would be holding themselves to a higher standard than their peers in other
fields. As a result, there's going to be a learning process for everyone as
they figure out how to make this all work. :)

~~~
throwawayjava
Some pretty good computer science got done before devops was gifted to the
world.

And some pretty good science got done before computer scientists were gifted
to the world.

I'm genuinely skeptical that modern software engineering practices are a good
way of thinking about reproduction in science. Even in computer science.
There's a lot that scientists can learn from software engineering (and in fact
I've helped run workshops in the past on exactly this topic), but science is
not engineering.

~~~
KirinDave
> Some pretty good computer science got done before devops was gifted to the
> world.

I'm happy to talk about this if you want. One of the most important aspects of
this work was that people like Dijkstra started using notions that approached
what real computers could read while remaining human-readable. This is some
measure of classical "reproducibility". And work like McCarthy's was
revolutionary in part because it was a definition of reproducibility as a
result!

I can give examples of shockingly good papers that are struggling to see the
light of day in their industry because they're written in ways that make them
hard not only to understand, but to reproduce.

So don't presume to lecture me about this. Part of the reason the word2vec
paper stands out is precisely because this is such a deviation from the norm
to have a paper misrepresent its most fundamental component: the algorithm.

------
utopcell
I think it is very unfair to the original set of word2vec papers to talk about
'academic dishonesty'. This is a case of a user who has little to no
experience with neural networks. There are a ton of articles describing the
need for random initialization [1][2]. In fact, if one spends a few seconds
thinking about it, the need is evident. Without it, the NN cannot perform
symmetry breaking: if the vectors are initialized to zero, all neurons perform
the same calculations, rendering the network useless.

[1] google: "neural networks vector initialization"

[2] [http://deeplearning.ai/ai-notes/initialization/](http://deeplearning.ai/ai-notes/initialization/)
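
A sketch of the initialization under discussion, roughly following what the
original `word2vec.c` does (variable shapes and names simplified): the input
vectors are random and the output vectors start at zero, and it is the random
half that breaks the symmetry.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, dim = 10_000, 300

# Random, small-magnitude initialization breaks the symmetry: if every
# vector started at zero, all words would receive identical gradients
# and the embeddings could never differentiate.
syn0 = (rng.random((vocab_size, dim)) - 0.5) / dim  # input vectors: random
syn1neg = np.zeros((vocab_size, dim))               # output vectors: zero
```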

~~~
bollu
That's hardly the point of the article --- the actual paper does not describe
the use of two separate vectors for each word. The initialization was an
interesting tidbit.

~~~
exgrv
Except it does? After Equation 2: "v_w and v'_w are the input and output
vector representations of w."

------
b_tterc_p
On a similar note, a long time ago I read the Doc2Vec paper, then looked at
popular Doc2Vec implementations. They didn’t seem to do the same thing. The
paper said you basically make vectors for words, then append on an additional
space that represents the additional information of documents as opposed to
single words.

All popular implementations I found seemed to put the document vectors into
the same space as the word vectors. They also didn’t seem to do any better
than a tf-idf weighted average of word vectors... curious if anyone has ever
bumped against this.
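
For reference, the tf-idf weighted average baseline mentioned above is only a
few lines. A sketch, assuming `word_vecs` maps words to numpy arrays and `idf`
maps words to idf weights (both hypothetical inputs):

```python
import numpy as np

def doc_vector(tokens, word_vecs, idf):
    """tf-idf weighted average of word vectors: each token occurrence is
    weighted by its idf, so repeated tokens contribute their tf as well."""
    pairs = [(word_vecs[w], idf.get(w, 1.0)) for w in tokens if w in word_vecs]
    if not pairs:
        return None
    vecs = np.array([v for v, _ in pairs])
    weights = np.array([w for _, w in pairs])
    return (vecs * weights[:, None]).sum(axis=0) / weights.sum()
```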

~~~
gojomo
The only code released by the 'Paragraph Vector' paper authors was a small
patch, from Mikolov, that added paragraph-vectors to the original `word2vec.c`
implementation in a very simple way: treating the 1st token of each line as a
special paragraph-vector, still string-named (and allocated in the same lookup
dictionary). Only by convention (a special prefix on those paragraph-vector
tokens) could collisions with similarly-named word-vectors be avoided.

That's a nice minimal way to demo/test the idea, but limited and fragile in
other ways. The initial gensim implementation did something similar, then I
changed it to use a separate doc-vectors space, to better support a lot of
options (including the PV-DM mode with a concatenative input layer – which has
never been confirmed to perform as well as the original paper implied).

~~~
b_tterc_p
Insightful. Thanks

------
slx26
Just as a curiosity, complementing what others have already written... I read
part of Mikolov's thesis and code in the past (when I was still studying at
the university, so I might have got everything wrong (I still don't get half
of it :D)). First I found it quite shocking that the code was so bad. The
training code was pretty confusing to me, and I found the lack of useful
comments discouraging. The test code (which loaded stored embeddings from a
file and allowed some basic operations) was even much, much worse. Like,
declaring three variables (a, b, c) and reusing them for different things in
the main functions without explaining anything, and doing linear searches
through the whole embeddings to find a word vector... very ugly and scary
things.

So, I had a very bad impression of the code. But then, I checked the thesis,
and I found it awesome. The amount of tests and implementations the guy made,
and how he showed in practice how better results could be achieved in a good
number of different setups... I found it really impressive. But such great
work paired with such bad code! I was just a CS student, so I found it
shocking. Nowadays I realize he was simply focused on a different thing, and
the results he obtained were indeed outstanding and speak for themselves.

It's easy to look back and criticise the code, but when you look at the work
he did in perspective... it's completely unfair to ask more from him
(admittedly, they had time to address some of the issues later, but they
probably had better things to do too).

------
newen
This kind of thing happens all the time in academia. The authors are either
constrained by page limits or too lazy to explain all the little details that
go into the algorithm.

I used to do research in computer vision a few years ago, and back then people
wouldn't publish their code _and_ they purposely wouldn't put all of the
details of the algorithm in the paper. Many of those algorithms were
patent-pending, and I assume the authors were hoping to make some money from
the patents. Compared to that, it's a lot better nowadays, when most of the
popular papers come with published code.

~~~
bollu
Is this really that common? That's disheartening; I want to spend time in
academia, but experiences like this are sucking the fun out of it for me...

~~~
toast0
I tried to make use of some public audio research and it was pretty bad. There
was an audio comprehensibility competition a few years ago. Some of the papers
submitted are still around, as well as the summary paper describing the
results. But many papers are hard to find, and for those that claimed to have
source code available, the code itself is hard to find --- I was able to get
Matlab sources for a few algorithms, but while they somehow work on the
example files, they mostly crash on my files.

It's a shame, because I understand the idea of the paper and have an excellent
place to apply it, but I lack the DSP background, so I can't really rebuild
the code from scratch -- the work goes unused.

~~~
DoctorOetker
This sounds interesting; would you care to reference the paper in question?

~~~
toast0
I'm not sure if I can find the exact paper anymore. This was in response to
the Hurricane Challenge, a summary of results is available [1]. I tried to use
code for uwSSDRCt available from the legacy page of the conference [2], under
the link "Live and recorded speech modifier", direct download here [3].

The basic context is verification code delivery -- I'm playing pre-recorded
samples of numbers to users, and can't control or sample the noise (either
transmission or environmental), but would like to enhance intelligibility to
reduce user effort, improve experience, and reduce costs.

[1]
[https://www.research.ed.ac.uk/portal/files/17887878/Cooke_et...](https://www.research.ed.ac.uk/portal/files/17887878/Cooke_et_al_2013_Intelligibility_enhancing_speech_modifications.pdf)

[2]
[https://web.archive.org/web/20131012005150/http://listening-...](https://web.archive.org/web/20131012005150/http://listening-talker.org/legacy.html)

[3]
[http://www.laslab.org/resources/LISTA/code/D4.3.zip](http://www.laslab.org/resources/LISTA/code/D4.3.zip)

------
gojomo
They're not really that different.

There's only a second vector for a word in the (common, default) negative-
sampling case, where each predictable word has a distinct "output node" of the
neural network, and the second vector is the in-weights to that one node.
Still, most implementations don't emphasize this vector – the classic "word-
vector" is a word's representation when it's a neural-network input. And in
the hierarchical-softmax training mode, there's no clear second vector.
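
A sketch of that second vector's role under negative sampling (matrix names
are mine, not the original C names): a (center, context) pair is scored by a
sigmoid over the dot product of the center word's input vector and the context
word's output-node in-weights.

```python
import numpy as np

def ns_score(center, context, W_in, W_out):
    """Probability that `context` truly co-occurs with `center`: sigmoid of
    the dot product between the center word's input vector (the classic
    word-vector) and the context word's output-node weights (the
    second vector)."""
    z = W_in[center] @ W_out[context]
    return 1.0 / (1.0 + np.exp(-z))
```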

I suspect the original word2vec authors left out a clearer description of the
initialization as they were following some oft-assumed practices implied by
their other descriptions.

Another minor difference between the literal description and the original C
implementation was a slightly different looping order in skip-gram training:
holding a target-word, and then looping over all context-words, rather than
holding a context-word, then looping over all neighboring target-words. One of
the authors once mentioned that the shipped approach was slightly more
efficient – maybe it was due to CPU cache issues? In any case all the same
context->target pairs get trained either way, just in a slightly different
order.
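
A small sketch of why the two loop orders are equivalent, collecting
(context, target) pairs instead of actually training (helper names are
hypothetical):

```python
def neighbors(sentence, pos, window):
    """Words within `window` positions of `pos`, excluding `pos` itself."""
    lo, hi = max(0, pos - window), min(len(sentence), pos + window + 1)
    return [sentence[i] for i in range(lo, hi) if i != pos]

def pairs_paper_order(sentence, window):
    # As the paper reads: hold a context word, loop over target words.
    return [(ctx, tgt) for i, ctx in enumerate(sentence)
            for tgt in neighbors(sentence, i, window)]

def pairs_shipped_order(sentence, window):
    # As word2vec.c ships: hold a target word, loop over context words.
    return [(ctx, tgt) for i, tgt in enumerate(sentence)
            for ctx in neighbors(sentence, i, window)]

# Same multiset of pairs either way, just visited in a different order.
sent = ["the", "quick", "brown", "fox", "jumps"]
assert sorted(pairs_paper_order(sent, 2)) == sorted(pairs_shipped_order(sent, 2))
```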

------
eggie5
Instead of thinking about what it is in practice (skip-gram with negative
sampling), I think it's much more intuitive to think about what it is in
theory: extreme multi-class classification.

word2vec is a multi-class classification problem with a softmax output layer
and cross-entropy loss. The novel part of word2vec, in my opinion, is twofold:

1. dataset generation (proximal input word & output word) from documents,
e.g. skip-gram, CBOW, etc.

2. an engineering speedup for the softmax: approximate softmax, e.g. negative
sampling using NCE, hierarchical softmax, etc.

If you just build word2vec without step 2, it's easier to understand. Then,
when you get that working, add in the negative-sampling speedup trick, which
isn't core to the theoretical algorithm.
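
A sketch of that step-1-only version: skip-gram with a full softmax over the
vocabulary and cross-entropy loss, in plain numpy (all names and
hyperparameters here are illustrative, not from the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 50                      # vocab size, embedding dimension
W_in = rng.normal(0, 0.1, (V, D))    # input (center-word) embeddings
W_out = rng.normal(0, 0.1, (V, D))   # output (softmax) weights
lr = 0.025

def train_step(center, context):
    """One SGD step of skip-gram with a full softmax: plain multi-class
    classification of the context word given the center word."""
    h = W_in[center].copy()             # hidden layer = center embedding
    logits = W_out @ h                  # a score for every word in the vocab
    p = np.exp(logits - logits.max())
    p /= p.sum()                        # softmax over all V classes
    grad = p.copy()
    grad[context] -= 1.0                # d(cross-entropy)/d(logits)
    W_in[center] -= lr * (W_out.T @ grad)   # backprop into center embedding
    W_out[...] -= lr * np.outer(grad, h)    # backprop into softmax weights

train_step(3, 17)  # hypothetical (center, context) word-index pair
```

The O(V) softmax in every step is exactly what negative sampling or
hierarchical softmax later approximates away.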

~~~
utopcell
Can't really call it a speedup trick, since it actually improves the
performance of the embeddings; but in terms of qualitative understanding, I
see where you're coming from.

------
xxxpupugo
The title reads to me like hyperbole.

The implementation can differ; they had time to refactor/optimize it after
publication, but they probably can't revise the paper itself. As long as the
code is there and can produce the stated (or better) results, it is probably
your responsibility to keep the differences in check.

This is actually quite common for deep learning papers overall: the GitHub
repo gets updated after the paper is out, and that's where you will find the
divergence.

------
MichaelStaniek
My intuition for that (and you can tell me if it's wrong):

The normal explanation for Word2Vec is two weight matrices, so the formula
looks like this: (One_hot_input x W1) x W2, which is then softmaxed.

W1 is then the matrix we take our focus embeddings from, but if we only
evaluate specific words on the target side, then W2 actually holds our context
embeddings, and the normal multiplication is focus_w x context_w.
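
A tiny numpy sketch of that formulation (sizes and the focus index are made
up):

```python
import numpy as np

V, D = 5, 3                     # vocab size, embedding dimension
rng = np.random.default_rng(1)
W1 = rng.normal(size=(V, D))    # focus embeddings live in the rows of W1
W2 = rng.normal(size=(D, V))    # context embeddings live in the columns of W2

one_hot = np.zeros(V)
one_hot[2] = 1.0                # hypothetical focus word, index 2

logits = (one_hot @ W1) @ W2    # (One_hot_input x W1) x W2
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocab

# (one_hot @ W1) just selects row 2 of W1, so each logit is
# W1[2] . W2[:, j] -- a focus-embedding / context-embedding dot product.
```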

Am I wrong?

~~~
bollu
Now it's `one_hot_focus x W1 x (one_hot_context x W2)^T`. So we still pick one
row of the matrix from the focus and context embeddings, but they're separate
embeddings.

~~~
MichaelStaniek
Yes, but that's also what happens in the normal formulation, no? So the second
weight matrix actually holds our context embeddings?

------
lelf
It’s patented BTW —
[https://patents.google.com/patent/US9037464B1/en](https://patents.google.com/patent/US9037464B1/en)

~~~
rurban
Turns out the patent only describes the paper, not the implementation. Great,
and somewhat ironic.

------
skythomas
Grossly unfair title

