
Primer on Neural Network Models for Natural Language Processing[pdf] - fitzwatermellow
http://u.cs.biu.ac.il/~yogo/nnlp.pdf
======
vonnik
Yoav is one of the smartest guys in this space. He and Omer Levy also wrote a
great explanation of word2vec:

[http://arxiv.org/pdf/1402.3722v1.pdf](http://arxiv.org/pdf/1402.3722v1.pdf)
[https://levyomer.wordpress.com/2014/04/25/word2vec-explained...](https://levyomer.wordpress.com/2014/04/25/word2vec-explained-deriving-mikolov-et-al-s-negative-sampling-word-embedding-method/)
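
For reference, the per-(word, context) term of the objective those notes
derive is roughly the following (in slightly simplified notation of my own:
w is the target word's vector, c the observed context word's vector, the c_i
are the k sampled negative contexts drawn from a noise distribution P_n, and
sigma is the logistic sigmoid):

    \log \sigma(\vec{c} \cdot \vec{w}) \;+\; \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\left[\log \sigma(-\vec{c_i} \cdot \vec{w})\right]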

~~~
Karlozkiller
I found that explanation to focus almost entirely on negative sampling,
without explaining much about the actual Skip-Gram and CBOW models.

However, as I understand it, negative sampling is a big part of why those
models are so computationally efficient, combined with hierarchical softmax
to reduce the complexity further.
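
(Roughly, the per-example cost of the output step, as I understand it, with
|V| the vocabulary size and k the number of negative samples:)

    \text{full softmax: } O(|V|) \qquad \text{hierarchical softmax: } O(\log |V|) \qquad \text{negative sampling: } O(k+1)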

~~~
gojomo
This current article seems to cover the various choices for constructing
'contexts' (which include skip-gram and CBOW) pretty well.

Note that negative-sampling and hierarchical-softmax are actually
_alternative_ choices for interpreting the hidden layer and arriving at
error-values to back-propagate. Each can be used completely independently.

If you enable them both, you're training two independent sets of output
weights, which then update the same shared input-vectors in an interleaved
fashion. (Essentially, it's joint training of each example via the
hierarchical-softmax codepath to nudge the vectors, then via the separate
negative-sampling codepath to nudge the vectors.) So the combination doesn't
reduce the complexity – it's additive to model state size and training time –
and I think most projects with large amounts of data just use one or the other
(usually just negative-sampling).
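
If it helps to see the two switches concretely, here's a minimal gensim
sketch (assuming gensim 4.x parameter names; the toy corpus is just a
placeholder):

    from gensim.models import Word2Vec

    # toy corpus: a list of tokenized sentences (placeholder data)
    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]

    # negative sampling only (the common choice): hs=0, negative>0
    model_ns = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                        sg=1,           # skip-gram; sg=0 would be CBOW
                        hs=0, negative=5)

    # hierarchical softmax only: hs=1, negative=0
    model_hs = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                        sg=1, hs=1, negative=0)

    # enabling both (hs=1, negative=5) trains two separate sets of output
    # weights that each update the same shared input vectors: more state
    # and more time per example, not less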

~~~
Karlozkiller
Ah, thank you for pointing that out. I guess I got confused in all the papers
I've read on the topic recently. It's hard to get into.

However, I still wouldn't agree that the linked article on negative sampling
explains how word2vec works well enough, or maybe I just didn't understand it.

Either way I recommend looking at this article as well if anyone wants to
understand word2vec:
[http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf](http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)

------
mark_l_watson
Nice paper. I especially like how he has equations, pseudo-code, and Python
code snippets. He could turn this paper into a book, adding full Python
examples, and I would buy a copy.

------
Karlozkiller
Huh, I wrote a few pages on neural networks for Natural Language Processing
just a few days ago. Too bad I didn't have access to this. It seems to mention
all the different kinds of networks I figured were relevant, and it has a
comprehensive explanation of Recursive Neural Networks, which I couldn't
really find elsewhere.

Nice one.

~~~
stevetjoa
I glanced through the entire PDF. While it looks like an outstanding,
comprehensive overview of neural networks, it doesn't appear to really address
NLP all that much, despite the title.

I would gladly welcome it if you or someone else could write a guide with the
comprehensiveness of the PDF above but with more NLP domain-specific
discussion and concrete examples.

~~~
herewego
I think you should give it another look because, from my reading of the PDF,
it's most definitely all about NN-based NLP.

