> Although continuous bag-of-words (CBOW) embeddings can be trained more quickly than skip-gram (SG) embeddings, it is a common belief that SG embeddings tend to perform better in practice. This was observed by the original authors of Word2Vec and also in subsequent work. However, we found that popular implementations of word2vec with negative sampling, such as word2vec and gensim, do not implement the CBOW update correctly, thus potentially leading to misconceptions about the performance of CBOW embeddings when trained correctly.
The upshot is that they get similar results with CBOW while training three times faster than with skip-gram.
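To make the bug concrete, here's a minimal numpy sketch of a single CBOW/negative-sampling update, under my reading of the paper's claim: `word2vec.c` averages the context vectors in the forward pass but then applies the *full* error gradient to every context vector, while the koan correction divides that gradient by the number of context words. The function name and structure are my own illustration, not code from either implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_pair_update(ctx_vecs, out_vec, label, lr, divide_by_count):
    """One CBOW/negative-sampling update for a single (context, output) pair.

    ctx_vecs: (C, d) array of input vectors for the C context words.
    out_vec:  (d,) output vector for the centre word (label=1) or a
              negative sample (label=0).
    divide_by_count: False mimics the word2vec.c behaviour described in
              the paper; True is the koan correction.
    """
    h = ctx_vecs.mean(axis=0)                # forward pass averages contexts
    g = lr * (label - sigmoid(h @ out_vec))  # scalar error term
    ctx_grad = g * out_vec                   # gradient w.r.t. the hidden layer
    if divide_by_count:
        ctx_grad = ctx_grad / len(ctx_vecs)  # koan: share gradient across contexts
    new_out = out_vec + g * h
    new_ctx = ctx_vecs + ctx_grad            # broadcast to every context vector
    return new_ctx, new_out
```

With the correction, each context vector moves 1/C as far per pair, which is why the two variants can end up in noticeably different places after training.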
Given the popularity of Transformers, and that fastText exists, I'm curious as to what inspired them to even try this, but it's certainly an interesting result. There's so much word vector research that relies on quirks of the word2vec implementation.
It's been a few years since I looked at it, but IIRC fastText is basically just w2v with subwords, so it's also possible this negative sampling fix applies to w2v and fastText equally.
If there's an actual benefit to be had here, Gensim could add it as an option - but would likely always default to the same CBOW behavior as in `word2vec.c` (& similarly, FastText) - rather than this 'koan' variant.
The `koan` CBOW change has mixed effects on benchmarks, and makes their implementation no longer match the choices of the original, canonical `word2vec.c` release from the original Google authors of the word2vec paper. (Or, by my understanding, the CBOW mode of the FastText code.)
So all the reasoning in that issue for why Gensim didn't want to make any change stands. Of course, if there's an alternate mode that offers proven benefits, it'd be a welcome suggestion/addition. (At this point, it's possible that simply using the `cbow_mean=0` sum-rather-than-average mode, or a different starting `alpha`, matches any claimed benefits of koan_CBOW.)
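One way to see why retuning `alpha` might recover much of the claimed benefit: for a *fixed* context count C, the koan-style context update at learning rate `lr` is numerically identical to the uncorrected update at `lr / C`. A tiny numpy check (the numbers are made up, and in practice C varies per example because word2vec randomly shrinks the window, so the equivalence is only approximate):

```python
import numpy as np

C, lr = 5, 0.025
out = np.array([0.3, -0.1, 0.2])  # output vector for the current pair
g = 0.4                           # some scalar prediction error

# word2vec.c-style: full gradient applied to each of the C context vectors
plain_step = lr * g * out
# koan-style: the same gradient shared across the C context vectors
koan_step = lr * g * out / C

# For fixed C, koan's context step equals the plain step at lr / C.
assert np.allclose(koan_step, (lr / C) * g * out)
```

The output-vector updates still differ between the two regimes, and C isn't actually constant, so this is a heuristic argument rather than an exact equivalence.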
What if that's the real reason for their sometimes slightly-better, sometimes slightly-worse performance on some benchmarks? Perhaps there are other changes, too.
This is why I continue to think Gensim's policy of matching the reference implementations from the original authors, at least by default, is usually the best policy – rather than using an alternate interpretation of the often-underspecified papers.
This is another paper that's basically just about some details of word2vec and GloVe and their effects on the results:
Improving Distributional Similarity with Lessons Learned from Word Embeddings - ACL Anthology
For instance, they store the vocabulary. I can query for similar words, or do vector math and convert it back to words. That is much harder to do with transformers.
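The "vector math and convert it back to words" workflow is just nearest-neighbour lookup over the stored vocabulary. A self-contained sketch with a toy, hand-made embedding table (the vectors are invented for illustration, not from a trained model):

```python
import numpy as np

# Toy embedding table standing in for a trained word2vec vocabulary.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def most_similar(vec, exclude=()):
    """Return the vocabulary word whose vector has the highest cosine
    similarity to `vec`, skipping any words in `exclude`."""
    best, best_sim = None, -np.inf
    for word, v in vocab.items():
        if word in exclude:
            continue
        sim = vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Vector arithmetic: king - man + woman lands nearest queen
# (with these toy vectors).
target = vocab["king"] - vocab["man"] + vocab["woman"]
print(most_similar(target, exclude={"king", "man", "woman"}))
```

Libraries like Gensim expose the same idea directly (e.g. a `most_similar` method on the keyed vectors), which is exactly the convenience the comment is pointing at.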
Also, not surprised at all that this kind of bug made it through in spite of how popular word2vec is. NLP is chock-full of tiny bugs like this, and there is all sorts of low-hanging fruit for sufficiently interested researchers...
(I recall seeing some hints in early word2vec code of an HS-based vocabulary that wasn't based on mere word-frequency, but some earlier or perhaps iterated semantic-clustering steps, that I think managed to give similar words shared codes. But I've not seen more on that recently.)
1. The original word2vec is considered a reference implementation, used for benchmarks, so some people might not want it to change much.
2. The original hasn't been updated since 2017, and development was never very interactive.
3. word2vec has been widely reimplemented, and some reimplementations may be more widely used than the original (particularly Gensim).
4. The original had a paper from the very start (rather than being an open source project where a paper came later), so other papers reference it. Having a paper for the koan variant likewise makes it easy for future work to cite and build on in the same way.