Author here. I wrote this blog post attempting to visually explain the mechanics of word2vec's skipgram with negative sampling algorithm (SGNS). It's motivated by:
1- The need to develop more visual language around embedding algorithms.
2- The need for a gentle on-ramp to SGNS for people who are using it for recommender systems, a use case I find very interesting (there are links in the post to such applications).
I'm hoping it could also be useful if you want to explain the value of vector representations of things to someone new to the field. Hope you enjoy it. All feedback is appreciated!
Here's some more layman reading "from back when", for people interested in how word2vec compares to other methods and works technically:
- https://rare-technologies.com/making-sense-of-word2vec/ (my experiments with word2vec vs GloVe vs sparse SVD / PMI)
- https://www.youtube.com/watch?v=vU4TlwZzTfU&t=3s (my PyData talk on optimizing word2vec)
The BERT article has 'em too!
I believe I understand the concepts of CBOW and skip-gram, but I'm a bit stuck. I don't really understand this figure. In fact, I understand it so poorly that I can't even formulate a question around it.
Now what do we do?
Edit: An attempt at formulating a question: is it the process of feeding the model with the [context][context][output] vector that you are depicting?
I'll be honest, I personally found this figure puzzling too. I'm still not 100% clear on it, but I don't believe it refers to the negative-sampling approach. My best guess is that it's referring to the earlier word2vec variants, where the input vector in skip-gram (or the sum of the input vectors in CBOW) is multiplied by a weights matrix that projects it onto scores over the whole vocabulary.
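For anyone else puzzling over it, here's roughly what I mean by that older full-softmax variant, as a toy numpy sketch. The sizes, names, and initialization below are made up for illustration, not taken from the post:

    import numpy as np

    # Toy sizes and randomly initialized weights, purely for illustration
    vocab_size, embed_dim = 10_000, 300
    W_in = np.random.randn(vocab_size, embed_dim) * 0.01   # input embedding matrix
    W_out = np.random.randn(embed_dim, vocab_size) * 0.01  # output projection matrix

    def skipgram_full_softmax(center_word_id):
        """Score every vocabulary word as a context candidate, then normalize.
        This full-vocabulary softmax is the expensive step that negative
        sampling was introduced to avoid."""
        v = W_in[center_word_id]        # (embed_dim,)
        scores = v @ W_out              # (vocab_size,) one score per vocab word
        scores -= scores.max()          # subtract max for numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()

With negative sampling, you never compute that full distribution; you only score the true context word plus a few sampled negatives.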
Is there a reason why training starts with two separate matrices, the embedding matrix and the context matrix? If the context matrix is discarded at the end anyway, why not start with, and train, only the embedding matrix?
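To make my question concrete, here's a toy numpy sketch of the two-matrix setup as I understand it (the names, shapes, and initialization are my own guesses, not from the post): both matrices get updated during training, but only one is kept.

    import numpy as np

    # Toy setup just to illustrate the question
    vocab_size, embed_dim = 10_000, 300
    W_embedding = np.random.randn(vocab_size, embed_dim) * 0.01  # kept after training
    W_context   = np.random.randn(vocab_size, embed_dim) * 0.01  # discarded after training

    def sgns_loss(center_id, context_id, negative_ids):
        """Pull the center word's embedding toward its true context vector,
        and push it away from a handful of randomly sampled 'noise' vectors."""
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
        v = W_embedding[center_id]                             # (embed_dim,)
        pos_score = sigmoid(v @ W_context[context_id])         # true (center, context) pair
        neg_scores = sigmoid(-(W_context[negative_ids] @ v))   # sampled negative pairs
        return -(np.log(pos_score) + np.log(neg_scores).sum())

    # e.g. sgns_loss(3, 17, np.array([5, 42, 99]))

So the question is: why not use W_embedding for both the center word and the context/negative words?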
Also thumbs up for the Dune references :)