
An Introduction to Recurrent Neural Networks
https://victorzhou.com/blog/intro-to-rnns/
======
stereolambda
It's worth noting that apparently (as I learned recently) RNNs are going
slightly out of fashion because they are hard to parallelize and have trouble
remembering important information over longer distances. Transformers are
proposed as a possible solution: very roughly speaking, they use attention
mechanisms instead of recurrent memory and can run in parallel.
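
Very roughly, the core operation is scaled dot-product self-attention: every
position looks at every other position at once, so there's no sequential
bottleneck. A toy NumPy sketch of my own (not from the article or the links
below), just to make the idea concrete:

    import numpy as np

    def softmax(x, axis=-1):
        # Shift by the max for numerical stability.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Scaled dot-product attention: each row of the score matrix says
        # how much one position attends to every other position, and all
        # rows are computed in one matrix product (no recurrence).
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq) similarity matrix
        return softmax(scores, axis=-1) @ V  # weighted sum of values

    # Toy example: 4 tokens with 8-dimensional representations.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    out = attention(x, x, x)  # self-attention: Q = K = V = the input
    print(out.shape)          # (4, 8)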

I have to say that while I understand the problems with recurrent nets (which
I've used many times), I haven't yet grokked the alternatives. Here are some
decent-looking search results as starting points. A warning: these can be
longer, heavier reads, probably not for beginners.

[https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c7...](https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0) (there's some sensationalism here to be fair)

[https://mchromiak.github.io/articles/2017/Sep/12/Transformer...](https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/)

[https://www.analyticsvidhya.com/blog/2019/06/understanding-t...](https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/)

[https://www.tensorflow.org/beta/tutorials/text/transformer](https://www.tensorflow.org/beta/tutorials/text/transformer)

That being said, I think understanding RNNs is very beneficial conceptually,
and nowadays there are relatively easy-to-use implementations that should be
good enough for many use cases.
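
For instance (a sketch assuming TensorFlow 2.x / Keras, one of several such
libraries):

    import tensorflow as tf

    # A ready-made recurrent layer: 64 hidden units, one line.
    layer = tf.keras.layers.SimpleRNN(64)
    outputs = layer(tf.random.normal([8, 10, 32]))  # (batch, time, features)
    print(outputs.shape)  # (8, 64): final hidden state per sequence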

~~~
ma2rten
Mainly, RNNs are much slower to train than Transformers.

~~~
MiroF
As well as stability issues, trouble with long-range dependencies, etc.

------
vzhou842
Hey, author here. Happy to answer any questions or take any suggestions.

Runnable code from the article: [https://repl.it/@vzhou842/A-RNN-from-scratch](https://repl.it/@vzhou842/A-RNN-from-scratch)

~~~
RyEgswuCsn
Very nicely written post. I particularly like how you attached a link to your
codebase on repl.it so anyone who is interested can tinker with the code.

One thing I have been wondering for some time is whether a vanilla RNN can
learn negations (i.e. 'not good' == 'bad') and valence shifts (e.g. modifier
words like 'very': they do not carry sentiment connotations themselves, but
may amplify or dampen the sentiment of the words they modify; negations like
'not' can be considered a special-case valence shifter that inverts the
sentiment of the following word).

My suspicion is that vanilla RNNs are not capable of modelling negations and
valence shifters, since they infer the sentiment of a sentence by 'adding up'
the sentiment connotations of its constituent words; negations and valence
shifts, however, work more like multiplications than additions.

I see you already have such examples in your dataset, so I thought I'd do some
experiments. I simplified your original dataset to the following:

      train_data = {
        'good': True,
        'bad': False,
        'not good': False,
        'not bad': True,
        'very good': True,
        'very bad': False,
        'not very good': False,
        'not very bad': True
      }

      test_data = {
        'very not bad': True,
        'very not good': False
      }
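
As a quick sanity check of my own (not from the article): even this reduced
training set provably can't be fit by a purely additive per-word scorer, which
makes the suspicion above concrete:

    import numpy as np

    # If sentiment were just a sum of per-word scores s(w), we'd need:
    #   s(good) = +1 and s(bad) = -1,
    #   s(not) + s(good) = -1  =>  s(not) = -2,
    #   s(not) + s(bad)  = +1  =>  s(not) = +2,
    # which is a contradiction. Least squares confirms there is no fit:
    X = np.array([[1, 0, 0],    # 'good'     (columns: good, bad, not)
                  [0, 1, 0],    # 'bad'
                  [1, 0, 1],    # 'not good'
                  [0, 1, 1]])   # 'not bad'
    y = np.array([1.0, -1.0, -1.0, 1.0])
    s, residual, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(s, residual)  # best fit is s ~ [0, 0, 0] with residual ~ 4.0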

While the test cases do not reflect how people actually speak, the hope is
that the model should be able to apply what it learned to infer their
sentiment. For me, however, training failed to converge with the default
parameter settings (hidden_size=64).

It would be interesting to see how other architectures (e.g. LSTMs, or
Transformers) fare with negations and valence shifters.
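
If anyone wants to try, a minimal LSTM version is only a few lines in Keras (a
sketch assuming TensorFlow 2.x; the layer sizes are arbitrary guesses, and
you'd still need to integer-encode the phrases):

    import tensorflow as tf

    vocab_size = 4  # good, bad, not, very (toy vocabulary)
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8),
        tf.keras.layers.LSTM(16),                        # gated recurrent layer
        tf.keras.layers.Dense(1, activation='sigmoid'),  # P(positive)
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    # model.fit(encoded_phrases, labels, epochs=...) would go here.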

P.S.: When calculating softmax, it is better to use built-in functions, or at
least use the log-sum-exp trick, to prevent numerical overflow and underflow.
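
Concretely, the trick is just to subtract the max before exponentiating; a
minimal sketch:

    import numpy as np

    def softmax_naive(x):
        # np.exp overflows for large inputs: exp(1000) -> inf, inf/inf -> nan.
        e = np.exp(x)
        return e / e.sum()

    def softmax_stable(x):
        # Subtracting the max (the heart of the log-sum-exp trick) keeps
        # every exponent <= 0, so np.exp can never overflow.
        e = np.exp(x - x.max())
        return e / e.sum()

    x = np.array([1000.0, 1001.0])
    print(softmax_naive(x))   # [nan nan], with a RuntimeWarning
    print(softmax_stable(x))  # [0.26894142 0.73105858]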

~~~
vzhou842
Thanks for the comments! Interesting experiment - I wouldn't be surprised if
better RNN architectures were more effective for this example.

Appreciate the softmax tip, I'll update soon.

------
wish5031
Nice! I like that the author wrote the code by hand rather than leaning on
some framework. It makes it a lot easier to connect the math to the code. :)

As a meta-comment on these "Introduction to _____ neural network" articles
(not just this one), I wish people would spend more time talking about when
their neural net isn't the right tool for the job. SVMs, kNN, and even basic
regression techniques aren't any less effective than they were 20 years ago.
They're easier to interpret and debug, require far fewer parameters, and can
be faster at both training and evaluation time (possibly with a few tricks
here and there).

------
cheez
This kind of article is absolutely the thing everyone new to deep
learning/neural networks should read. I wish there were one for each type of
algorithm.

~~~
vzhou842
I do have similar articles for Neural Networks (MLP) and CNNs:

---

NN: [https://victorzhou.com/blog/intro-to-neural-networks/](https://victorzhou.com/blog/intro-to-neural-networks/)

NN HN discussion:
[https://news.ycombinator.com/item?id=19320217](https://news.ycombinator.com/item?id=19320217)

---

CNN: [https://victorzhou.com/blog/intro-to-cnns-part-1/](https://victorzhou.com/blog/intro-to-cnns-part-1/)
[https://victorzhou.com/blog/intro-to-cnns-part-2/](https://victorzhou.com/blog/intro-to-cnns-part-2/)

CNN HN discussions:
[https://news.ycombinator.com/item?id=19981736](https://news.ycombinator.com/item?id=19981736)
[https://news.ycombinator.com/item?id=20064900](https://news.ycombinator.com/item?id=20064900)

~~~
cheez
Awesome!

------
rrggrr
Would be great if you showed the final output (e.g. the sentiment analysis
result).

------
mlevental
Why do people insist on mentioning the bias terms in expository essays? It's
a detail that clutters the equations. Why not keep the transformations linear
and then note at the end that you also need to shift by a bias term?
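
For what it's worth, there's a standard trick that gets the best of both: fold
the bias into the weight matrix by appending a constant 1 to the input, so
every transformation stays a single matrix product. A toy NumPy sketch (the
sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))  # weights
    b = rng.normal(size=3)       # bias
    x = rng.normal(size=4)       # input

    # Affine form: W x + b.
    affine = W @ x + b

    # Linear form: append b as an extra column of W, and a constant 1 to x.
    W_aug = np.hstack([W, b[:, None]])  # shape (3, 5)
    x_aug = np.append(x, 1.0)           # shape (5,)
    linear = W_aug @ x_aug

    assert np.allclose(affine, linear)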

------
ape4
I doubt Google Translate uses RNNs. They use Statistical Machine Translation.
Oops, I see they switched to NNs in 2016.
[https://en.wikipedia.org/wiki/Google_Translate](https://en.wikipedia.org/wiki/Google_Translate)

~~~
probably_wrong
I feel I should point out that those two things are not mutually exclusive.
RNNs are, after all, a mechanism for learning conditional probabilities.

I think the confusion comes from Google itself, which used the term
"Statistical Machine Translation" (SMT) to refer to "Rule-based SMT". Both
methods are statistical.

~~~
MiroF
SMT as a term of art in the translation field means rule-based SMT; it's not
a Google particularity. I see the same usage in both industry and academia.
