
Embed, encode, attend, predict: the new deep learning formula for NLP models - jast
https://explosion.ai/blog/deep-learning-formula-nlp
======
YeGoblynQueenne
That's a nice article in that it manages to get excitement across without
forgetting to balance it somewhat. Kind of a rare thing these days.

Btw, I'm interested to hear how well training with large one-hot encoded
vectors scales. A paper someone pointed me to recently on HN suggested that it
doesn't scale very well:

One-shot Learning with Memory-Augmented Neural Networks
[[https://arxiv.org/abs/1605.06065](https://arxiv.org/abs/1605.06065)]

------
syllogism
Here's the implementation of the first example model, the Parikh et al. (2016)
textual entailment system:

[https://github.com/explosion/spaCy/tree/master/examples/kera...](https://github.com/explosion/spaCy/tree/master/examples/keras_parikh_entailment)

This got dropped during editing... Updating the post to make this more
prominent.

~~~
1024core
I read the full post, thanks for writing it. It is very clear, but I do have a
couple of questions:

1. In Step (2), Bidirectional RNN: what are you making the forward/backward
passes over? How do the tokens get turned into a "matrix"? What is the
dimensionality of this matrix?

2. Step 3 is a bit unclear. Where do Parikh et al. get their two matrices
from?

It would be nice to bring in some concreteness: talk about sentences,
documents, etc. and how they map into this scheme.

Thanks!

~~~
syllogism
The implementation and papers are probably much clearer about the details.
This post might also help:
[https://explosion.ai/blog/spacy-deep-learning-keras](https://explosion.ai/blog/spacy-deep-learning-keras)

I'll answer briefly about the Parikh et al. model.

1) Input: (ids1, ids2). These are integer-typed arrays of length len1 and len2

2) sent1 = embed(ids1); sent2 = embed(ids2). Data is now real-value arrays of
shape (len1, vector_dim) and (len2, vector_dim) respectively. 300 is a common
value for vector_dim, e.g. from the GloVe common crawl model.

3) sent1 = encode(sent1); sent2 = encode(sent2). Data is now real-valued
arrays of shape (len1, fwd_dim+bwd_dim), (len2, fwd_dim+bwd_dim).

4a) attention = create_attention_matrix(sent1, sent2). This is a real-valued
array of shape (len1, len2)

4b) align1 = soft_align(sent1, attention); align2 = soft_align(sent2,
transpose(attention)). These are real-valued arrays of shape (len1,
compare_dim) and (len2, compare_dim)

4c) feats1 = sum(map(compare(sent1, align2))); feats2 = sum(map(compare(sent2,
align1))). These are real-valued arrays of shape (predict_dim,),
(predict_dim,)

5) class_id = predict(feats1, feats2)

The post describes steps 4a, 4b and 4c as a single operation that takes the
two 2-dimensional sentence representations as input and outputs a single
vector (obtained by concatenating the representations feats1 and feats2 in
this description).
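
To make the shapes concrete, here's a rough numpy sketch of the same data flow.
This is not the actual spaCy/Keras implementation: the random matrices stand in
for the learned embed, compare and predict layers, and a single projection
stands in for the BiLSTM encoder.

```python
import numpy as np

vocab, vector_dim, encode_dim, predict_dim = 10000, 300, 128, 64
len1, len2 = 7, 9

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

ids1 = np.random.randint(0, vocab, size=len1)               # 1) integer ids
ids2 = np.random.randint(0, vocab, size=len2)

embed_table = np.random.randn(vocab, vector_dim)
sent1 = embed_table[ids1]                                   # 2) (len1, vector_dim)
sent2 = embed_table[ids2]                                   #    (len2, vector_dim)

W_encode = np.random.randn(vector_dim, encode_dim)          # stand-in for the BiLSTM
sent1 = np.tanh(sent1 @ W_encode)                           # 3) (len1, encode_dim)
sent2 = np.tanh(sent2 @ W_encode)                           #    (len2, encode_dim)

attention = sent1 @ sent2.T                                 # 4a) (len1, len2)

align1 = softmax(attention, axis=1) @ sent2                 # 4b) (len1, encode_dim)
align2 = softmax(attention.T, axis=1) @ sent1               #     (len2, encode_dim)

W_compare = np.random.randn(2 * encode_dim, predict_dim)    # stand-in for compare MLP
feats1 = (np.hstack([sent1, align1]) @ W_compare).sum(axis=0)  # 4c) (predict_dim,)
feats2 = (np.hstack([sent2, align2]) @ W_compare).sum(axis=0)  #     (predict_dim,)

W_predict = np.random.randn(2 * predict_dim, 3)             # entail/contradict/neutral
class_id = int(np.argmax(np.hstack([feats1, feats2]) @ W_predict))  # 5)
print(class_id)
```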

------
sixhobbits
I really like this post. So much NLP research is 'locked away' in academic
papers, and making the knowledge more accessible through posts like this is
very important for large-scale adoption by non-academics.

Also, really well done on the site design. Love the graphics, font, layout and
'progress bar' animation at the top. Very nice UX overall.

------
mtrimpe
I really loved reading this article but it's always so hard to figure out
exactly how these things work out in detail.

I understand matrix multiplication, but it seems that (some of) these matrix-
to-vector calculations are actually trained as part of the neural net... and
exactly how that works is something I can't figure out from articles like
this.

~~~
syllogism
(Author here)

Thanks! I'm planning to make two follow-up posts, one on each of the systems,
that go through those details. I blurred them out in this post because I
wanted to get across the more abstract story about the data types and
transformations.

There are lots of good posts about attention mechanisms. The WildML post is
good, as is Chris Olah's post. Bidirectional RNNs are a little less well
covered, but the idea is not too difficult to understand given a single RNN
(or LSTM, GRU, etc.).
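
For what it's worth, the bidirectional idea fits in a few lines of plain
numpy: run one RNN left-to-right, another right-to-left, and concatenate the
two hidden states for each token. This is just a toy sketch with made-up
dimensions, not the code from the post.

```python
import numpy as np

def rnn(inputs, Wx, Wh, b):
    # Vanilla RNN: returns one hidden state per input vector.
    h = np.zeros(Wh.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(x @ Wx + h @ Wh + b)
        states.append(h)
    return np.stack(states)

in_dim, hid_dim, n_tokens = 300, 64, 8
sent = np.random.randn(n_tokens, in_dim)            # an embedded sentence

fwd_Wx, fwd_Wh, fwd_b = (np.random.randn(in_dim, hid_dim),
                         np.random.randn(hid_dim, hid_dim), np.zeros(hid_dim))
bwd_Wx, bwd_Wh, bwd_b = (np.random.randn(in_dim, hid_dim),
                         np.random.randn(hid_dim, hid_dim), np.zeros(hid_dim))

fwd = rnn(sent, fwd_Wx, fwd_Wh, fwd_b)              # left-to-right states
bwd = rnn(sent[::-1], bwd_Wx, bwd_Wh, bwd_b)[::-1]  # right-to-left, re-reversed
encoded = np.concatenate([fwd, bwd], axis=1)        # (n_tokens, fwd_dim + bwd_dim)
print(encoded.shape)                                # (8, 128)
```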

You should also read the papers :). This is how most people who are doing ML
--- including the people building practical things, not researchers --- are
staying up to date. Academia is so competitive and writing is cheap relative
to experimentation. The deep learning literature is really pretty easy to
follow.

~~~
dharma1
Really like what you're doing with spaCy and Explosion AI, good stuff :)

What do you think about dilated convolutional encoder/decoder networks [1]?
Useful for NLP beyond machine translation?

[1] [https://arxiv.org/abs/1610.10099](https://arxiv.org/abs/1610.10099),
[https://github.com/paarthneekhara/byteNet-tensorflow](https://github.com/paarthneekhara/byteNet-tensorflow)

~~~
syllogism
Thanks!

I don't understand those models very well yet. I haven't implemented one, or
really sat down with the paper and worked through it.

------
imh
One other cool part of attention is that you can attend to the m-dimensional
parts of an n-by-m matrix just as well as a k-by-m matrix. Objects (sentences)
of varying size can be treated the same way, in a really nice, principled
manner.
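
A toy numpy sketch of the point (the query vector is random here, standing in
for a learned parameter): the same attend() call reduces a 5-by-m or a
40-by-m sentence to one m-dimensional summary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(matrix, query):
    weights = softmax(matrix @ query)        # one weight per row, any n works
    return weights @ matrix                  # weighted sum: shape (m,)

m = 128
query = np.random.randn(m)                   # stand-in for a learned parameter

short_sentence = np.random.randn(5, m)       # 5 tokens
long_sentence = np.random.randn(40, m)       # 40 tokens

print(attend(short_sentence, query).shape)   # (128,)
print(attend(long_sentence, query).shape)    # (128,)
```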

------
bertomartin
Could you use such a model to do sequence labeling? For example, I have a text
document, such as a financial document, and I want to detect where in that
document it states that a "stock split" or "share repurchase" will occur, and
for how much. This seems like a good approach, given that it learns context. I
know there are NER methods, but this is slightly different: I want to train a
model to recognize specific events. The best I can do right now is a regex.

~~~
syllogism
If you want to tag sequential spans of text, then you've basically got the
same "shape" of problem as named entity recognition, just with different
labels and data. BiLSTMs work well for this.
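
As a sketch of what that could look like in Keras (not code from the post; the
vocabulary size, sequence length and event label set below are invented just
to show the shape of the problem, which is the same BIO-style tagging setup as
NER):

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

vocab_size = 20000   # assumed vocabulary size
max_len = 200        # assumed (padded) document length in tokens
n_labels = 5         # e.g. O, B-STOCK_SPLIT, I-STOCK_SPLIT, B-REPURCHASE, I-REPURCHASE

model = Sequential([
    Embedding(vocab_size, 128, input_length=max_len),        # embed token ids
    Bidirectional(LSTM(64, return_sequences=True)),          # encode each token in context
    TimeDistributed(Dense(n_labels, activation='softmax')),  # one label per token
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Train with model.fit(token_id_matrix, per_token_labels), exactly as for NER.
```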

------
Mathnerd314
typo: contradition

I just wish I understood the rest of the article...

