
The Unreasonable Effectiveness of Recurrent Neural Networks - benfrederickson
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
======
Smerity
Karpathy is one of my favourite authors - not only is he deeply involved in
technical work (audit the CS231n course for more[1]!), he spends much of his
time demystifying the field itself, which is a brilliant way to encourage
others to explore it :)

If you enjoyed his blog posts, I highly recommend watching his talk on
"Automated Image Captioning with ConvNets and Recurrent Nets"[2]. In it he
raises many interesting points that he hasn't had a chance to cover fully in
his articles.

He humbly says that his captioning work is just stacking image recognition
(CNN) on to sentence generation (RNN), with the gradients effectively
influencing the two to work together. Given that we now have powerful enough
machines, I think we'll be seeing a lot of stacking of previously separate
models, either to improve performance or to perform multi-task learning[3]. A
very simple concept but one that can still be applied to many other fields of
interest.
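The stacking framing is easy to see in code. Here's a toy sketch in numpy (every weight is a random placeholder rather than a trained model, and all names are made up for illustration): an image encoder produces a feature vector that seeds the hidden state of an RNN decoder, so gradients could in principle flow through both pieces end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "CNN": stands in for a real convnet; here just a fixed random
# projection of a flattened image to a feature vector.
def cnn_encode(image, W_enc):
    return np.tanh(W_enc @ image.ravel())

# Toy RNN decoder: the image features initialize the hidden state,
# then each step consumes the previous token and emits the next.
def rnn_decode(h, W_xh, W_hh, W_hy, start_token, steps):
    vocab = W_hy.shape[0]
    x = np.eye(vocab)[start_token]
    tokens = []
    for _ in range(steps):
        h = np.tanh(W_xh @ x + W_hh @ h)
        t = int(np.argmax(W_hy @ h))   # greedy decoding
        tokens.append(t)
        x = np.eye(vocab)[t]
    return tokens

img_dim, hid, vocab = 16, 8, 5
W_enc = rng.standard_normal((hid, img_dim))
W_xh = rng.standard_normal((hid, vocab))
W_hh = rng.standard_normal((hid, hid))
W_hy = rng.standard_normal((vocab, hid))

h0 = cnn_encode(rng.standard_normal((4, 4)), W_enc)
caption = rnn_decode(h0, W_xh, W_hh, W_hy, start_token=0, steps=6)
print(caption)  # a sequence of token ids conditioned on the image
```

In the real captioning work both halves are trained jointly; this only shows how the two models compose.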

[1]: [http://cs231n.stanford.edu/](http://cs231n.stanford.edu/)

[2]:
[https://www.youtube.com/watch?v=xKt21ucdBY0](https://www.youtube.com/watch?v=xKt21ucdBY0)

[3]: One of the earliest - "Parsing Natural Scenes and Natural Language with
Recursive Neural Networks"
[http://nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf](http://nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf)

~~~
cOgnaut
I've read over [1] and am currently watching [2], and I really can't get over
a not-insignificant bit of dissonance:

(a) He seems to be _very_ intelligent. Kudos. But…

(b) How good of an idea is it _really_ to create software with these
abilities? We're already making machines that can do most things that had
once been exclusive to humans. Pretty soon we'll be completely obsolete. Is
that REALLY a good idea? To create "face detectors" (his words!)?

~~~
firethief
Our generation is going to get old and feeble and eventually die. If we have
children, they'll completely supplant us.

Our relevance is ephemeral, but our influence will be lasting. Do we want to
have a legacy of clinging to our personal feelings of importance, or of
embracing the transience of our existence and nurturing our (intellectual)
progeny?

~~~
cousin_it
We let our children inherit the world because we want them to be happy. Not so
with machines designed to carry out industrial tasks.

------
cs702
Nice. Andrej Karpathy deserves some kind of award for _demystifying_ deep
learning and making the subject so accessible to a wider audience. If you're a
developer who knows little about the subject and want to learn more, a great
starting point is the home page for his ConvNetJS project.[1]

--

[1]
[http://cs.stanford.edu/people/karpathy/convnetjs/](http://cs.stanford.edu/people/karpathy/convnetjs/)

~~~
choppaface
And if you're more comfortable with Python, I strongly recommend the CS231n
assignments / labs: [http://cs231n.github.io/](http://cs231n.github.io/)

Assignments 1 and 2 alone give a solid intro to implementing these algorithms,
and the lab-oriented IPython-based format gives you a very high probability of
writing a correct implementation even if you're clueless at the start.

------
swalsh
As a father, the output feels really familiar. It's like a child learning to
talk. At first, the words they say are actual words (and mean something to
you), but the child has no idea what they mean. Eventually, though, they
start understanding the meaning, which, combined with the syntax, creates a
person who can communicate.

I wonder if all that's missing is just a few more layers, and another source
of input. Maybe a list of requirements/output/input matched with the code so
it understands why what was written was written. I wonder what would happen if
you ran the program, took the output, and fed it back in as input.

Really cool stuff here.

~~~
digikata
Human children have the great benefit of interactively learning from their
parents and other humans raising them. Could we expect a child to learn to
speak if they only heard recordings of existing speech with otherwise no human
interaction/feedback - correcting them or offering customized and contextual
new bits of information? It would be interesting to add a feedback path for
human corrective input; i.e., because it's direct interaction, feed it back
but weight it somewhat more heavily than just another corpus input.

~~~
tzs
> Could we expect a child to learn to speak if they only heard recordings of
> existing speech with otherwise no human interaction/feedback - correcting
> them or offering customized and contextual new bits of information?

I once asked a similar question on some online forum [1] where many linguists
hung out. My question was: if an English-only-speaking household left a
general-interest Spanish-language TV station on most of the time when they
weren't actively using the TV to watch something, so that their child received
a very large exposure to Spanish-language programming (news, sports, soap
operas, sitcoms, movies, etc.) from birth onward, would the child naturally
learn Spanish?

I don't recall for sure what the linguists who responded said, but I _think_
they all said the child would not learn Spanish from this.

[1] I have no recollection of where this was.

~~~
spdionis
When I was little (5-7 years old) I had quite a few anime videos and magazines
in Italian sent to me by my parents, who were abroad. Where I lived no one knew
a single word of Italian. I often watched and rewatched those videos and read
those magazines without any other external input. I can tell you that by doing
that I easily learned the language. When I was 8 years old I also left for
Italy, and within two weeks I had already started speaking fluently, albeit
with a few mistakes.

If the child actually watches the Spanish TV, he will learn the language.

EDIT: Even now I often learn new Japanese words (and remember them) just by
watching anime. The difference is that now I have English subtitles, but back
then I had no subtitles, only the images to help me understand the meaning.

~~~
digikata
But this is a little different: with the AI we want the ability to form
syntactically correct sentences, but also some intelligence behind the
sentences. You as a human had another foundation of intelligence to lean
on (your native language) and an understanding of the world beyond the
Italian language itself. If you had no other human interaction, would
you have learned any language? That's closer to the situation of these AI
algorithms.

~~~
spdionis
I was only arguing that the child could actually learn Spanish, nothing more.

Not knowing basically anything about the state of the art in AI: what stops us
from feeding an RNN image data and text data and making it correlate them
automatically by context? Just as a child learns words by hearing them many
times in similar contexts, so could an RNN.

I imagine the biggest problem is gathering and structuring the data. We humans
receive _lots_ of data and have _lots_ of time to process it over our lives by
comparison. And by _lots_ I mean a difference of a few orders of magnitude.
It's amazing what this thing learns in just a few hours of processing.

~~~
digikata
I agree the RNN performance is really amazing!

------
fdej
"This sample from a relatively decent model illustrates a few common mistakes.
For example, the model opens a \begin{proof} environment but then ends it with
a \end{lemma} ... By the time the model is done with the proof it has
forgotten whether it was doing a proof or a lemma. Similarly, it opens an
\begin{enumerate} but then forgets to close it."

Ah, so strong AI is finally here. A computer program that makes just the same
mistakes as humans when writing in TeX.

------
tshadwell
I'm not sure how "unreasonable" the effectiveness of RNNs is if the corpus
output at 2000 iterations isn't significantly better than a simple prefix-
based Markov chain implementation [1] (and, for the regular languages, with
some extra bracket-checking), but I found the evolution visualizations really
interesting.

[1]
[http://thinkzone.wlonk.com/Gibber/GibGen.htm](http://thinkzone.wlonk.com/Gibber/GibGen.htm)
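For reference, the baseline being compared against fits in a few lines. A minimal prefix-based (order-k) character Markov chain, roughly the kind of generator behind [1] (the details of that page's implementation are assumed, not verified):

```python
import random
from collections import defaultdict

# Order-k character Markov chain: map each k-character prefix to the
# list of characters observed to follow it.
def train_markov(text, k=3):
    model = defaultdict(list)
    for i in range(len(text) - k):
        model[text[i:i + k]].append(text[i + k])
    return model

def generate(model, seed, length, rng=random.Random(0)):
    k = len(seed)
    out = list(seed)
    for _ in range(length):
        followers = model.get("".join(out[-k:]))
        if not followers:
            break  # prefix never seen; stop
        out.append(rng.choice(followers))
    return "".join(out)

corpus = "the cat sat on the mat and the cat ran to the hat " * 20
model = train_markov(corpus, k=3)
sample = generate(model, "the", 40)
print(sample)
```

Unlike an RNN, this keeps no state beyond the last k characters, which is exactly why it can't learn to balance brackets over long spans.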

~~~
pohl
Welcome to the unbearable forced-ness of titles. Everyone's making a nod to
Milan Kundera these days.

~~~
coldtea
First, it's not a nod to Kundera, but to a classic math related work that
predates Kundera's book.

Second, even if it was, really? As if we see plays on Kundera titles regularly
on the web?

~~~
tim333
I doubt it's a reference to Kundera.

I was thinking that both Eugene Wigner's 1960 article 'The Unreasonable
Effectiveness of Mathematics in the Natural Sciences'[0] and Karpathy's 'The
Unreasonable Effectiveness of Recurrent Neural Networks' probably touch on
deep aspects of the nature of existence: the first on why the universe exists
and is mathematical (because at the fundamental level it is mathematical[1]),
and in Karpathy's case because RNNs are probably effective in that they are
close to the mechanisms of human consciousness.

[0] Wigner's article:
[http://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html](http://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html)

[1] 'physical world is completely mathematical' theory:
[http://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_...](http://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences#Max_Tegmark)

------
Patryk
This same thing (i.e., using recurrent neural networks to predict characters
(and even words)) was done by Elman in 1990 in a paper called "Finding
Structure in Time"[1]. In that paper, Elman goes several steps further and
carries out analysis to show what kind of information the recurrent
neural network maintains about the ongoing inputs.

It's an excellent read for anyone interested in learning about recurrent
neural networks.

[1]
[http://crl.ucsd.edu/~elman/Papers/fsit.pdf](http://crl.ucsd.edu/~elman/Papers/fsit.pdf)
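The core of Elman's setup is small enough to sketch. Here is a toy forward pass of a simple recurrent network over characters (untrained random weights, purely illustrative; Elman trained such nets with backpropagation): the previous hidden state acts as a "context layer" alongside the current input, and each step outputs a distribution over the next character.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

chars = sorted(set("finding structure in time"))
vocab = len(chars)
idx = {c: i for i, c in enumerate(chars)}

rng = np.random.default_rng(42)
hid = 10
W_xh = rng.standard_normal((hid, vocab)) * 0.1  # input -> hidden
W_hh = rng.standard_normal((hid, hid)) * 0.1    # context -> hidden
W_hy = rng.standard_normal((vocab, hid)) * 0.1  # hidden -> output

h = np.zeros(hid)  # the context layer, initially empty
probs = []
for c in "finding ":
    x = np.eye(vocab)[idx[c]]
    h = np.tanh(W_xh @ x + W_hh @ h)   # input + context -> new hidden
    probs.append(softmax(W_hy @ h))    # predicted next-character dist

# Each step yields a proper probability distribution over the vocabulary.
print(len(probs), probs[-1].sum())
```

Training adjusts the three weight matrices so those distributions put mass on the character that actually comes next; Elman's analysis looks at what the hidden state ends up encoding along the way.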

~~~
sushirain
It's amazing how much was already known decades ago. Elman and others did much
more, and hopefully the field will now take the next step (which was long
delayed), with the help of today's computing power.

------
pcmonk
The code generator is awesome. There's hardly a syntax error. The file headers
are the best.

Nitpick: although tty == tty is, as you say, vacuously true in this case,
that's just because tty is a pointer. If tty were a float, this wouldn't be
the case, since it could be NaN. I wouldn't be surprised if it learned to test
a variable for equality against itself from some floating point code.
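The NaN point is easy to verify (shown here in Python, though C floats behave the same under IEEE 754):

```python
# NaN is the one float value that compares unequal to itself, so
# `x == x` is not vacuously true for floats:
x = float("nan")
print(x == x)      # False
print(2.0 == 2.0)  # True

# which is why `v != v` is a classic portable NaN test
def is_nan(v):
    return v != v
```

So floating-point code really does contain meaningful self-comparisons for the model to pick up.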

~~~
zxyzzxxx
The code is nonsense. Their method is good for fuzzy tasks like recognition,
but this approach with code will never work for anything other than an art
project.

~~~
nomel
Feed it all of GitHub, and I'm sure you could come up with some interesting
autocomplete code generation tools. Of course, coming from GitHub, it'll be
poorly documented and filled with buffer overflows :D

~~~
saidajigumi
I'll agree that this is interesting, but it seems like a lot of people in this
thread miss the point: we're working with multi-layer tools now. This enables
modeling of multi-layer processes. The code generation as it stands is
obviously a toy, but what happens if we actually think about the real
processing layers?

Take this example of code processing, and instead front it with a parser that
generates an AST. For now, an actual parser for a single language. Maybe
later, a network trained to be a parser. The AST is then fed to our network.
What could we get out of the AST network? Could we get interesting static
analysis out of it? Tell us the time and/or space complexity? Perhaps we
discover that we need other layers to perform certain tasks.

This, of course, has parallels in language processing. Humans don't just go in
a single (neural) step from excitation of inner ear cells ("sound") directly
to "meaning". Cog sci and linguistics work has broken out a number of discrete
functions of language processing. Some have been derived via experiment, some
observed via individuals with brain lesions, others worked out by studies of
children and adult language learners. These "layers" provide their own
information and inspiration for building deep learning systems.

------
rsp1984
I wonder what would happen if you trained an RNN as described with, say, the
scores of all of Mozart's chamber music and then let it generate new music
from the learned pieces. How would it sound? Would it figure out beat? Chords?
Harmonies? Might it even sound a bit like Mozart?

~~~
kastnerkyle
The work of Nicolas Boulanger-Lewandowski was extensively focused on this
topic, see his work [1]. He wrote a Theano deep learning tutorial on this
topic [2], and several people (e.g. Kratarth Goel) [3][4] have advanced the work to
use LSTM and deep belief networks.

For a brief while RNN-NADE made an appearance as well, though I do not know of
an open source implementation.

There are also a few of us who are working on more advanced versions of this
model for speech synthesis, versus operating on the MIDI sequence. Stay tuned
in the near future!

I can say from experience that some of the samples from the LSTM-DBN are
shockingly cool, and drove me to spend about a week playing with K-means-coded
speech. It made robo-voices at least, but our research moved past that pretty
fast.

[1] [http://www-etud.iro.umontreal.ca/~boulanni/](http://www-etud.iro.umontreal.ca/~boulanni/)

[2] [http://deeplearning.net/tutorial/rnnrbm.html](http://deeplearning.net/tutorial/rnnrbm.html)

[3] [http://arxiv.org/pdf/1412.6093.pdf](http://arxiv.org/pdf/1412.6093.pdf)

[4] [https://github.com/kratarth1203/NeuralNet/blob/master/rnndbn...](https://github.com/kratarth1203/NeuralNet/blob/master/rnndbn.py)

~~~
JonnieCache
Is the robot-voice code published anywhere?

You can make money out of that kind of thing btw!

[https://soniccharge.com/bitspeek](https://soniccharge.com/bitspeek)

(Obviously not the same thing but the point is that silly robo-voice code is
marketable :)

------
narrator
The thing about neural nets is that they are pretty opaque from an analyst's
point of view. It's hard to figure out why they do what they do, except that
they have been trained to optimize a particular cost function. I think Strong
AI will never happen because the people in charge will not give control over
to a system that makes important decisions without explaining why. They will
certainly not give control over the cost function to a strong AI, because
determining the cost function is the axis upon which all power will rest.

~~~
dimatura
Our life is dominated by systems we don't understand. I have some
understanding of how my cell phone works at the software level, but when it
comes to details at the hardware level I just trust that the electrical
engineers knew what they were doing. I have virtually no understanding of how the engine
in the bus operates beyond what I learned in thermodynamics 101. Sure, you
might say - _someone_ understands these things. But for some systems, it's
hard to pinpoint these people. And for some other complex systems, like the
stock market, nobody really understands them or (completely) controls them.
But we still use them every day. I think once AI becomes useful enough, people
will gladly hand control over.

~~~
imaginenore
But some engineer out there understands how your phone works.

With neural nets NOBODY really understands how they work.

~~~
Nadya
Maybe my understanding of neural networks is wrong... but I'm under the
impression they work from weighted criteria. With enough weight an answer is
selected as being the most likely. A well-trained neural network has enough
data to weight options and pick with high accuracy.

Then again, this is essentially black magic to me:

[http://hplusmagazine.com/2015/02/26/ai-masters-classic-video...](http://hplusmagazine.com/2015/02/26/ai-masters-classic-video-games-without-being-told-the-rules/)

------
myth_buster
This is quite incredible. The stylistic similarities of the generated
Shakespeare, Linux code, etc. were quite startling. Perhaps we can train a
Haiku/fortune-cookie generator which could occasionally be quite profound.

~~~
seiji
> Linux code etc

People are always worried about "computers taking factory jobs" resulting in
mass unemployment, but the truth is, a rudimentary AI with acceptance tests on
output will obsolete every programmer alive.

Hell, half the programming people do these days is just gluing APIs together
and then seeing if it actually works. It doesn't take 16 years of rich inner
human life experience to accomplish that, just exhaustive combinatorial
parameter searching on the subset of API interactions you're interested in
evaluating.

~~~
myth_buster
Douglas Crockford touches on this aspect in this entertaining and insightful
talk [0]. I'm guilty of what you state, and I think a large part of
"programming" is rudimentary boilerplate coding/configuration and staring
into the Abyss. I think our role will be to design algorithms, come up with
creative solutions/hacks (which would be difficult for a program), and
design a workflow/flowchart to feed into a program which spits out
binaries and flags edge cases. A whole swath of industries and economies
(read: outsourcing) will become redundant, and the only outsourcing done will
be to the generator.

[0]:
[https://www.youtube.com/watch?v=taaEzHI9xyY](https://www.youtube.com/watch?v=taaEzHI9xyY)

------
fpgaminer
I'm in the middle of reading this article (very much appreciate Karpathy's
writings), but I also wanted to brain dump some of my musings on modern
machine learning; RNNs in particular. Sorry if this is redundant to anything
the article talks about.

Deep learning has made great strides in recent years, but I don't think
architectures which aren't recurrent will ever give rise to mammalian
"thought". In my opinion, thought is equivalent to state, and feed-forward
networks do not have immediate state, not in any relevant sense. Therefore
they can never have thought.

RNNs, on the other hand, do have state, and therefore are a real step towards
building machines that possess the capacity to think. That said, modern deep
learning architectures based around feed-forward networks are still very
important. They aren't thinking machines, but they are helping us to build all
those important pre-processing filters mammalian brains have (e.g. the visual
cortex). This means we won't have to copy the mammalian versions, which would
be rather tedious. We can just "learn" a V1, V2, etc. from scratch. Wonderful.
And they'll be helpful for building machines with senses different from any
biology has yet evolved. But, again, these feed-forward networks won't lead to
thought.

My second musing is on where I think the next leap in machine learning will
occur. To date, efforts have been focused on building algorithms that
optimize the NN architecture (i.e. optimize weights, biases, etc.). But
mammalian brains seem to possess the ability to problem-solve on the fly, far
faster than I imagine tweaks to architecture could account for. We solve
problems in-thought, rather than in-architecture; we think through a problem.
Machine learning doesn't possess this ability. It can only learn by torturing
its architecture.

So, I believe there is this distinction to the learning that mammalian brains
are able to do on the fly, using just their thoughts, and the learning they do
long term by adjusting synaptic connections/response. It seems as if they
solve a problem in the short term, and then store the way they solved it in
the underlying architecture over the long term. Tweaking the architecture then
makes solving similar problems in the future easier. The synaptic weights lead
to what we call intuition, understanding, and wisdom. They make it so we don't
have to think about a class of problems; we just know the solutions without
thought. (Note how I say class of problems; this isn't just long term memory).

Along those lines, I come to my final musing. That mammalian brains are
motivated by optimization of energy expenditure. Like anything in biologically
evolved systems, energy efficiency is key, since food is often scarce. So why
wouldn't brains also be motivated to be energy efficient? To that end, I
believe tweaking synaptic weights, that kind of learning that machine learning
does so well, is a result of the brain trying to reduce energy expenditure.
Thoughts are expensive. Any time you have a thought running through your
brain, there is some neuronal activity associated with it. That activity
costs energy. So minimizing the amount we have to think on a day-to-day basis
is important. And that, again, is where architecture changes come
in. They are not the basis for learning; they are the basis for making future
problem solving more efficient. Like I said, once a class of problems has been
carved into your synaptic weights, you no longer have to think about that
class of problems. The solutions come immediately. You don't think about
walking; you just do it. But when you were a baby, I'll bet the bank that your
young mind thought about walking a lot. Eventually all the mechanics of it
were carved into your brain's architecture and now it requires many orders of
magnitude less energy expenditure by your brain to walk.

So, the obvious question is: how do mammalian brains problem-solve using just
thoughts? The answer, as I mentioned, is likely to lead to the
next leap in machine learning. And it will, more likely than not, come from
research on RNNs. What we need to do is find a way to train RNNs that are able
to adapt to new problems immediately without tweaking their weights (which
should be a slower, longer term process).

P.S. Yes, I know this was probably a bit off-topic and quite a bit wandering.
I've had these musings percolating for a while and don't really have an outlet
for them at the moment. I hope it's on topic enough, and at least stimulates
some interesting discussion. Machine learning is fascinating.

~~~
kylebrown
> _That mammalian brains are motivated by optimization of energy expenditure.
> Like anything in biologically evolved systems, energy efficiency is key,
> since food is often scarce._

That doesn't square with empirical reality. Evolved biological systems appear
to be optimized for robustness to perturbations, not efficiency (John Doyle
argues that there is in fact a fundamental tradeoff between robustness and
efficiency, for all types of complex systems not just biological).

> _how do mammalian brains problem solve using just thoughts._

They don't. Sensory input is required for brains to learn new classes of
problems.

> _find a way to train RNNs that are able to adapt to new problems_

Is this something different than multi-task learning?

~~~
Lambdanaut
> They don't. Sensory input is required for brains to learn new classes of
> problems.

Sensory input is required to gain the knowledge, but then you can just as
easily muse over your gained knowledge for further insights in a sensory
deprivation chamber as you can in a classroom.

------
snikeris
In the spirit of:

[https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.htm...](https://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html)

[http://www.researchgate.net/profile/Derek_Abbott/publication...](http://www.researchgate.net/profile/Derek_Abbott/publication/256838918_The_Reasonable_Ineffectiveness_of_Mathematics/links/00b7d523d5bd289428000000.pdf)

~~~
iyn
Very short video about the topic:
[https://www.youtube.com/watch?v=ZBkzqLJPkmM](https://www.youtube.com/watch?v=ZBkzqLJPkmM)

------
dools
Web spam 2.0:

1) Take the entire works of several popular content creators in a given field,
complete with links out to articles etc.

2) Concatenate them into a single file

3) Train this thing to generate new articles

4) Create a map of popular articles that other people have written, to
articles you have written on similar topics

5) Replace the originals with your articles

6) Publish millions of articles that can't be detected as spam automatically
by Google

It's like bot wars: Spammers can train their robots to try and defeat Google's
robots.

~~~
stefs
well, i don't see how they - the spammers - would fake google's valuation
system of valuing incoming links from valuable sources. it's not like many
valuable sites outside this relatively insular system would link to those
generated nonsense pages. that'd practically create an insular babblenet that
could be relatively easily identified.

i mean, it's not like that's exactly what's happening right now.

~~~
dools
Okay, so in the system I'm hypothesising, I pick a topic -- say content
marketing. I go to the Neil Patel and KissMetrics blogs and get all their
articles on content marketing, and train this thingy with them.

I then buy, say, 1,000 domains. Doesn't matter what they are. Or I buy 100
domains and set up 300 Tumblr blogs, 300 Blogger blogs and 300
wordpress.com blogs.

Now I drip feed content to each of those blogs, but instead of linking to the
articles on content marketing that kissmetrics and neil patel originally
reference, I link to articles I have created instead.

How can Google tell the difference between a tonne of nobody bloggers linking
to Neil Patel's articles, and my bots linking to my articles? The fact is that
if you blog on niche topics, with good article titles reflecting
low-competition long-tail keywords, you'll get some traffic from Google pretty
easily -- how can Google possibly tell that links are coming from shitty
bot-generated pages versus from a tonne of obscure bloggers with virtually no
audiences (of which there are thousands)?

The way they can tell the difference is Panda (or Penguin? I think it's Panda
... ). So as long as your pet robot can learn from Neil Patel and KissMetrics
well enough to produce content that cannot be penalised by Panda, and so long
as you don't do it stupidly (like having the same anchor text for all the
articles, or doing 1,000 articles overnight) and actually phase it in so that
it looks as though you're getting some reasonable organic spread, you'll be
able to game Google's rankings pretty reliably for your _real_ articles that
you're trying to promote, and get higher volumes of traffic to those articles
than you would by just focusing on niche, long-tail articles (for example,
because you'd be able to get on page #1 or in the top 5 for much
higher-volume keywords).

You would then get shares etc. for your actual content -- just because those
"spam farms" don't have social shares or backlinks from PR6 blogs doesn't mean
Google completely disregards them, just means that you need a lot more of them
to make the same impact as lots of shares/backlinks from PR6 blogs.

This strategy is old, and was killed by Panda, but if you could beat Panda
using an RNN then this would work again.

------
0xdeadbeefbabe
I'm getting the funny impression that what distinguishes an algorithm from an
AI algorithm isn't the algorithm itself, but how people treat the algorithm.
It's an AI algorithm if they describe it behaving intelligently, i.e. painting
numbers on a house, learning English first, being born, being tricked into
painting a fence, etc. Otherwise it's just an algorithm.

~~~
Jtsummers
This is an old problem in AI. Chess was an AI problem, until a computer beat a
grandmaster. Vision was an AI problem, now we have OpenCV. Many AI problems
get shifted out of "AI" once they're solved.

~~~
TheLoneWolfling
It stems from our definition of an AI.

An AI is a computer doing those things a computer cannot do. As such, anything
that a computer cannot do isn't AI, and anything a computer can do isn't AI
either.

~~~
dtparr
Hmm, the 'No true AI' fallacy, then, eh?

~~~
TheLoneWolfling
Pretty much, assuming you're making an analogy to the "no true Scotsman"
fallacy.

------
waterlesscloud
Side note: The title is a reference to this famous paper from 1960:
[http://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_...](http://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences)

The form of the title has become a common trope.

~~~
nmyk
"Unreasonable Effectiveness Considered Harmful"

~~~
pizza
_' Considered Harmful Essays' Considered Harmful_:
[http://meyerweb.com/eric/comment/chech.html](http://meyerweb.com/eric/comment/chech.html)

------
danans
I'm curious to know, since these networks can learn syntax, whether they
can also be re-purposed as syntax checkers, not just syntax generators. That
is, can the syntactic knowledge learned by these models be run in a static
classification mode on some input text to recognize the anomalies within and
suggest fixes?
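That mode seems plausible with any model that assigns a probability to the next character: run it over existing text and flag low-probability positions. A sketch with a bigram character model standing in for a trained RNN (the RNN would just supply much better probabilities):

```python
import math
from collections import defaultdict

# A bigram character model stands in for a trained RNN here: anything
# that assigns P(next char | history) can score existing text.
def train_bigram(text):
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def surprisal(counts, text):
    # -log P of each transition; infinite for transitions never seen
    scores = []
    for i, (a, b) in enumerate(zip(text, text[1:])):
        total = sum(counts[a].values())
        p = counts[a][b] / total if total else 0.0
        scores.append((i + 1, -math.log(p) if p else float("inf")))
    return scores

counts = train_bigram("int x = 0; int y = 1; int z = 2; " * 50)
# "in7" contains transitions the model has never seen:
flags = [i for i, s in surprisal(counts, "int y = 1; in7 z = 2;")
         if s == float("inf")]
print(flags)  # → [13, 14], flagging the corrupted "in7"
```

Suggesting fixes would then amount to asking the same model which replacement character makes the flagged span most probable.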

------
j2kun
What's unreasonable about neural networks (in general, not just recurrent
ones) is that we don't _really_ have any theoretical understanding of why they
work. In fact, we don't even really understand what sorts of functions neural
networks compute.

------
ux-app
I'm an absolute layman with regard to AI, so I'd be keen to hear some
explanations with regard to the possibility of creating strong AI in silicon.

Might there be properties of our biological brain that silicon can't capture?
Is this related to the concept of computability? I'm not suggesting that there
is a spiritual or metaphysical component to thinking; I'm a
materialist through and through. I just wonder if maybe there is some
component of non-deterministic behavior occurring inside a brain that our
current silicon-based computing does not capture.

Another way to ask this is will we need to incorporate some form of wetware to
achieve strong AI?

~~~
aamar
These are not fully settled questions, though the answer is probably no.

Most researchers believe that brains are Turing machine equivalent, therefore
can be simulated by any other equivalents. Even Gödel believed this, though he
believed the mind had more capabilities than the brain.[1] As a materialist,
you would share the commonly-accepted view and reject his latter claim.

There is a small minority of philosophers and physicists who believe that
there are meaningful quantum reactions happening in the brain, distinguishing
them from classical computers.[2] Some recent computer simulations have shown
this to be plausible, but the general impression is that it seems unlikely,
and we don't have specific evidence of effects of this sort.

Quantum effects of certain sorts are computationally infeasible to perform
with classical computers. And it's theoretically plausible that such effects
cannot be conducted at scale with in-development quantum computer technology,
and are only practical with organic chemistry, but again, this is quite a
minority view.

It's also possible that classical brain features, such as its massive
concurrence or various clever algorithms, prove difficult to replicate or
simulate. If these are easy problems to solve, then strong AI may arrive in
decades; if very difficult, centuries. In the latter case, it seems plausible
that incorporating wetware would be a useful shortcut. But there's good reason
to believe that the practical disadvantages of wetware (e.g. keeping it alive,
coordinating with its slow "clock speed") overwhelm the computational
conveniences.

--

[1]
[http://www.hss.cmu.edu/philosophy/sieg/onmindTuringsMachines...](http://www.hss.cmu.edu/philosophy/sieg/onmindTuringsMachines.pdf)

[2]
[http://en.wikipedia.org/wiki/Quantum_mind](http://en.wikipedia.org/wiki/Quantum_mind)

~~~
ux-app
Thank you for the detailed response. I'm looking forward to digging into the
links you posted.

> There is a small minority of philosophers and physicists who believe that
> there are meaningful quantum reactions happening

I wonder why this is a minority view. Bear in mind that I am an armchair
scientist, but I recall reading that meaningful quantum effects are
responsible for the efficiency of photosynthesis. It seems quite plausible
(due to the electro-chemical nature of brain functioning) that there might be
similar effects present in the brain.

Fascinating stuff.

------
maaaats
Isn't the author's definition of RNNs wrong?

I thought the difference is that a RNN allows connection back to previous
layers, compared to a feed-forward net. Not this talk about "fixed sizes" and
"accepting vectors". Or am I wrong?

~~~
fpgaminer
Karpathy usually talks about machine learning topics from multiple viewpoints,
and usually (in my experience with his writings) prefers more loose, non-
traditional interpretations (that ultimately lead to better understanding of
the underlying mechanics of the approach).

In this case, his point was that one way RNNs differ from FFNNs is their
ability to accept arbitrarily sized inputs and generate arbitrarily sized
outputs. That's pretty important, which is likely why he emphasizes it.

But the rest of the article shows the salient point: RNNs are NNs that hold a
state vector.

Saying that RNNs are NNs that allow connections back to previous layers is
true, but that's only one way of looking at it. Holding state is another,
since it implies backwards connections. Feedback is another term. And because
they have backwards connections, state, feedback, etc., they also possess the
capacity to handle non-fixed-size inputs and outputs.

In summary; it's different viewpoints of the same mathematical object.
Karpathy focuses on the ability of RNNs to handle arbitrarily long inputs and
outputs, because that's something FFNNs cannot do.
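The "NN that holds a state vector" view is easy to make concrete. Here's a minimal numpy sketch (toy sizes, untrained weights, loosely in the spirit of the step API in Karpathy's post, not his actual code): the whole recurrence is a step function from (input, previous state) to (output, new state).

```python
import numpy as np

# A vanilla RNN reduced to its essence: a step function mapping
# (input, previous state) -> (output, new state). Sizes are arbitrary.
class VanillaRNN:
    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        self.Wxh = rng.normal(0, 0.01, (hidden_size, input_size))
        self.Whh = rng.normal(0, 0.01, (hidden_size, hidden_size))
        self.Why = rng.normal(0, 0.01, (output_size, hidden_size))
        self.h = np.zeros(hidden_size)  # the state vector

    def step(self, x):
        # The new state depends on the input AND the previous state,
        # which is what lets the output depend on the whole history --
        # and what lets you feed in a sequence of any length.
        self.h = np.tanh(self.Wxh @ x + self.Whh @ self.h)
        return self.Why @ self.h

rnn = VanillaRNN(input_size=5, hidden_size=16, output_size=5)
for x in np.eye(5):       # feed five one-hot inputs, one at a time
    y = rnn.step(x)
```

Feeding the same input twice gives different outputs, because the state vector has changed in between; that's the feedback/backwards-connection view in code form.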

------
clickok
I love stuff like this, and I think "unreasonable" is almost an
understatement.

It's "unreasonable" mainly because it occasionally captures subtle aspects of
the data source for "free". If you've worked with procedurally generated
content, Markov chains, and so on, you probably have had to perform a few
tweaks in order to get plausible results[1]. From the article, an excerpt of
the output from an RNN trained on Shakespeare:

    
    
      Second Lord:
      They would be ruled after this chamber, and
      my fair nues begun out of the fact, to be conveyed,
      Whose noble souls I'll have the heart of the wars.
    
      Clown:
      Come, sir, I will make did behold your worship.
    
      VIOLA:
      I'll drink it.
    

Sure, the individual blocks are similar to what you'd get from a Markov text
generator-- _but it gets that after a full stop, there comes a newline, a new
character name, and a new text block_. To my eyes, this is a qualitative leap
in performance. It suggests that the model has figured out some things about
the data stream that you'd normally have to add in by hand[2].

It's also unreasonable that the same framework works well for so many
different data sources. My experience with other generative methods has been
that they were fragile and prone to pathological behaviour, and that getting
them to work for a specific use case required a bunch of unprincipled
hacks[3]. It used to be that when a talk started to veer towards generative
models, I'd start looking around the room, wondering whether I could survive
the drop from any outside-facing windows. But with RNNs using LSTM (or neural
Turing machines!) you can consider incorporating a generative model in the
solution you're putting together without having to spend a huge chunk of time
massaging it into usefulness or purchasing time on a supercomputer[4].

1\. I once wrote a quick Reddit bot with the aim of learning to repost
frequent highly upvoted comments and trained it using a simple k-Markov
model... it was not good at first, and in order to get it to work I had to do
a lot of non-fun stuff like sanitizing input, adding heuristics for when/where
to post, and at the end it was mediocre.
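(For contrast with the RNN samples: a character-level order-k Markov generator of the sort that footnote describes fits in a few lines. This is a toy sketch, not the actual bot.)

```python
import random
from collections import defaultdict

def train_markov(text, k=3):
    """Map each k-character context to the characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - k):
        model[text[i:i + k]].append(text[i + k])
    return model

def generate(model, k=3, length=200, seed=None):
    rng = random.Random(seed)
    context = rng.choice(list(model))           # random starting context
    out = list(context)
    for _ in range(length):
        followers = model.get("".join(out[-k:]))
        if not followers:                       # dead end: hop to a random context
            followers = model[rng.choice(list(model))]
        out.append(rng.choice(followers))
    return "".join(out)

corpus = "to be or not to be that is the question " * 20
print(generate(train_markov(corpus), seed=42)[:60])
```

The output is locally plausible but has no memory beyond k characters, which is exactly the qualitative gap the RNN samples close.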

2\. Alex Graves (from DeepMind) has a demo about using RNNs to "hallucinate"
the evolution of Atari games, using the pixels from the screen as inputs. It's
interesting because it shows that same sort of tendency to capture the subtle
stuff:
[https://youtu.be/-yX1SYeDHbg?t=2968](https://youtu.be/-yX1SYeDHbg?t=2968)

3\. As in occult knowledge and rules-of-thumb, but you might also read this as
a double entendre about myself and my colleagues.

4\. Well, you still might need an AWS GPU instance if you don't have a fancy
graphics card.

~~~
jameshart
The Shakespeare generator isn't just reproducing the syntactic structures; it
occasionally seems to capture meter. The samples you've reproduced here aren't
iambic, but they are around ten or eleven syllables per line, which is
impressive enough in itself. In the longer passages, it manages some proper
iambic pentameter:

    
    
       My power to give thee but so much as hell:
       Some service in the noble bondman here
    

It doesn't seem to have managed to pick up on rhyming couplets, though.

A quick search of Shakespeare's corpus also shows that Shakespeare never
called a bondman 'noble'; there must be some conception of parts of speech
being captured by the RNN, to enable it to decide that 'bondman' is a
reasonable word to follow 'noble'.

So yes, "unreasonable" seems about right.

~~~
ryukafalz
I'd imagine the lack of rhyme is likely due to the fact that English
pronunciation is ambiguous. Given only the text, it would have no way of
picking up the fact that, say, "here" and "beer" rhyme, while "there" does
not.

(Put another way, English text is a lossy representation of English speech.)

Perhaps if you were to feed the IPA representation of each word in alongside
the text, the RNN would do a bit better, though admittedly I'm not sure how
you would do so.

If this is the case, I'd imagine training it against Lojban text would see
similar results.

~~~
Houshalter
Very relevant recent paper:
[http://arxiv.org/pdf/1505.04771v1.pdf](http://arxiv.org/pdf/1505.04771v1.pdf)

DopeLearning: A Computational Approach to Rap Lyrics Generation

------
mikecmpbll
This is my deep learning enlightenment moment. 22/05/15

~~~
phyalow
Me too, mesmerised.

------
TheLoneWolfling
My question, and something this doesn't get into, is this: how do you train an
RNN?

~~~
deepnet
You need an error signal - a target value is compared with the network's
prediction. That error is carefully assigned proportionally to the network
weights that contributed to it, and the weights are adjusted a small amount in
that direction. This is repeated many times.

Backpropagation suffers from vanishing gradients on very deep neural nets.

Recurrent Neural Nets can be very deep in time.

Or the weights could be evolved using Genetic Programming.
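As a rough numerical sketch of that loop (toy dimensions, numpy, nobody's production code): one step of backpropagation through time runs the net forward over a short sequence, assigns the error to each weight in proportion to its contribution, then nudges every weight a small amount.

```python
import numpy as np

np.random.seed(0)
V, H = 4, 8                       # vocabulary size, hidden size (made up)
Wxh = np.random.randn(H, V) * 0.01
Whh = np.random.randn(H, H) * 0.01
Why = np.random.randn(V, H) * 0.01

inputs, targets = [0, 1, 2], [1, 2, 3]   # toy task: predict the next symbol
hs = {-1: np.zeros((H, 1))}
xs, ps, loss = {}, {}, 0.0

# Forward pass: carry the hidden state across time steps.
for t, (i, tgt) in enumerate(zip(inputs, targets)):
    xs[t] = np.zeros((V, 1)); xs[t][i] = 1           # one-hot input
    hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])   # new state
    y = Why @ hs[t]
    ps[t] = np.exp(y) / np.exp(y).sum()              # softmax
    loss += -np.log(ps[t][tgt, 0])                   # cross-entropy

# Backward pass: accumulate gradients back through time.
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dhnext = np.zeros((H, 1))
for t in reversed(range(len(inputs))):
    dy = ps[t].copy(); dy[targets[t]] -= 1
    dWhy += dy @ hs[t].T
    dh = Why.T @ dy + dhnext
    draw = (1 - hs[t] ** 2) * dh                     # through the tanh
    dWxh += draw @ xs[t].T
    dWhh += draw @ hs[t - 1].T
    dhnext = Whh.T @ draw

# Adjust each weight a small amount against its gradient.
lr = 0.1
for W, dW in [(Wxh, dWxh), (Whh, dWhh), (Why, dWhy)]:
    W -= lr * dW
```

The repeated multiplication by Whh in the backward loop is also where the vanishing-gradient problem mentioned above comes from: over many time steps those factors shrink (or blow up) the error signal.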

~~~
skorgu
It would be interesting to occasionally train the generated C against a
compiler.

~~~
deepnet
In "Learning to Execute" by Zaremba & Sutskever
([http://arxiv.org/abs/1410.4615](http://arxiv.org/abs/1410.4615)), an RNN
learns snippets of Python.

Their next paper is "Reinforcement Learning Neural Turing Machines"
[http://arxiv.org/abs/1505.00521](http://arxiv.org/abs/1505.00521) based on
Graves "Neural Turing Machines"
[http://arxiv.org/abs/1410.5401](http://arxiv.org/abs/1410.5401), which
attempts to infer algorithms from the result.

In a lost BBC interview from 1951, Turing reputedly spoke of evolving CPU
bitmasks for computation.

------
mangeletti
Imagine a conversion-optimizing genetic algorithm for spam (web and/or email)
generation, using a tool like this (e.g., when users perform the intended
actions, DNA is passed on to the next iteration).

That would be one positive feedback loop to rule them all.

------
stcredzero
So, if Neural Networks can be thought of as just an optimized way of
implementing unreasonably large dictionaries, Recurrent Neural Networks could
be thought of as an optimized way of implementing unreasonably large Markov
chains.

------
efnx
I've only read the first section but it seems RNNs are very close in concept
to Mealy machines.

[http://hackage.haskell.org/package/machines-0.4.1/docs/Data-Machine-Mealy.html](http://hackage.haskell.org/package/machines-0.4.1/docs/Data-Machine-Mealy.html)

> They accept an input vector x and give you an output vector y. However,
> crucially this output vector's contents are influenced not only by the input
> you just fed in, but also on the entire history of inputs you've fed in in
> the past.

~~~
teraflop
If it helps, you can think of a RNN as being analogous to a finite state
machine. But instead of a single discrete state, it's a continuous, high-
dimensional vector. That has the extremely important effect that the output is
a continuous function of the input, which is necessary for training using
gradient descent.
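The analogy is easy to make concrete: a Mealy machine and an RNN share the same step signature (output and next state are both functions of state and input); the RNN just makes the state a continuous vector, so the transition is differentiable. A hypothetical sketch, comparing a discrete parity checker with an untrained RNN cell:

```python
import numpy as np

# Mealy machine: output and next state are functions of (state, input).
# Here, a one-bit parity checker over a bit stream.
def mealy_step(state, bit):
    new_state = state ^ bit      # flip parity on each 1 bit
    return new_state, new_state  # output current parity

# RNN cell with the same signature, but a real-valued state vector
# and a differentiable transition (weights are random, i.e. untrained).
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.5, (4, 1))
Whh = rng.normal(0, 0.5, (4, 4))
Why = rng.normal(0, 0.5, (1, 4))

def rnn_step(state, x):
    new_state = np.tanh(Wxh @ x + Whh @ state)
    return Why @ new_state, new_state

s = 0
for b in [1, 0, 1, 1]:
    out, s = mealy_step(s, b)    # out ends up as the parity of the stream

h = np.zeros(4)
for b in [1.0, 0.0, 1.0, 1.0]:
    y, h = rnn_step(h, np.array([b]))
```

Because `rnn_step` is continuous in its weights, you can train it by gradient descent to approximate behaviours like the parity machine, which is exactly what you can't do with the discrete version.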

------
wonderingwhere
this is quite possibly the most interesting item I've read on HN

------
noahmbarr
Would the returned samples from the PG/Shakespeare/Wikipedia examples be of
higher quality if you used a word-level language model instead of a character
model with similar parameters?

I was curious whether the overhead of learning how to spell words (vs. a pure
task of sentence construction with word objects) outweighs the reduction in
sample set size.

(Awesome article for a RNN newbie)

~~~
fpgaminer
Karpathy states in the blog post that word-level models currently tend to beat
character models, across the broad field of NLP related RNNs. But he argues
that character models will eventually overtake (much in the same way that
ConvNets have "replaced" manual feature extraction).

That said, I think the RNNs here are limited by the corpus. They need to be
exposed to more writing. Even if all you want is a Shakespeare generator, you
still need to expose it to other literature. That will give it greater
context, and more freedom of expression and, dare I say, creativity. I mean,
imagine if all you were exposed to your whole life was Shakespeare. Nothing
else (no other senses). Even with your superior mind, I doubt you'd generate
anything better than what this RNN spits out.

So yeah, it needs a large corpus to build a broader model. Then we need a way
to instruct the broadly trained RNN to generate only Shakespeare-like text.
Perhaps by adding an "author" or "style" input.

~~~
kylebgorman
I fail to see how word-based models are character-based models with manual
feature extraction. Word boundaries are read directly from deterministically
tokenized inputs.

And, as I mentioned upthread, it has been known for about ten years, long
before the current neural net revival, that high-order character-based models
are competitive with word-based models (at least in terms of perplexity).
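(For anyone unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower is better and a uniform model over N symbols scores exactly N. A toy computation, with made-up probabilities:)

```python
import math

def perplexity(probs):
    """probs: the model's probability for each token of a held-out sequence."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model assigning uniform probability over a 27-character alphabet
# has perplexity 27; a model that has learned structure does better.
uniform = [1 / 27] * 100
confident = [0.5] * 100
print(perplexity(uniform))     # ~27
print(perplexity(confident))   # ~2
```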

------
viraptor
I found the learning progress section great. I was thinking some time ago
about how to generate English-sounding words which don't exist. Well, here
they are (from iteration 700):

Aftair, unsuch, hearly, arwage, misfort, overelical, ...

(although I admit, some of them may be just old words I haven't heard of
before)

------
oggy
In all the examples on the page, the RNN is first trained and then used to
generate the text. Is there a way to use RNNs for something interactive? For
instance, can one train an RNN to mimic Paul Graham in a discussion, and not
only in writing an essay?

------
hgibbs
I did have a bit of a chuckle when they got to Algebraic Geometry. That's
incredible.

------
lqdc13
Does anyone know if these are/can be good for named entity recognition? I am
stuck implementing second order CRFs right now for the lack of a good
implementation, and this seems a lot easier.

~~~
syllogism
I'm not aware of any strong RNN results for NER, no.

You'd probably find the paper here:
[http://aclweb.org/anthology/](http://aclweb.org/anthology/) (everything in CL
is open access). You want the proceedings of CL, TACL, ACL, EMNLP, EACL, and
NAACL. Don't bother with the workshops.

------
higherpurpose
If neural networks are the way to build strong AI and neural nets are all
about optimization, wouldn't a quantum computer be ideal to power an AI?
(assuming we can get one to work)

~~~
Houshalter
I don't think so. NNs have millions of parameters, and making a quantum
computer that large, and with that many complex interactions, would be very
difficult.

Optimization of NNs isn't really that bad. Stochastic gradient descent is
extremely powerful and roughly linear with the number of parameters, possibly
better.

------
tormeh
I've thought a bit about RNNs, and I can see an obvious problem: Fixed amount
of memory.

Is there any chance someone's come up with an RNN that has dynamic amounts of
memory?

~~~
varelse
There's a huge degree of data re-use in the weights. This should be exploited.

Second, one could envision paging the hidden units back to system memory on a
coprocessor-based implementation (GPUs/FPGAs/not Xeon Phi, gag me). 256 GB
servers are effectively peanuts these days relative to developer salaries and
university grants (datapoint: my grad school work system was ~$100K in 1990
dollars), so unless you're trying to create the first strong AI*, I don't
think this is a serious constraint.

*Good luck with that no matter what Stephen Hawking, Elon Musk, and Nick
Bostrom harp on about: we have _no idea_ what the error function for strong AI
ought to be, and even if we did, it's over a MW using current technology to
achieve the estimated FLOPS of a human cerebrum.

~~~
tormeh
I meant that the state vector has constant size and just setting it at the
maximum available might give you problems with training.

~~~
varelse
Nothing you can't work around if you're willing to roll your own code. That
said, I agree 100% if you're dependent on someone else's framework...

------
evc123
Someone should train an RNN on neural network source code to see if it's
possible to get neural networks to generate neural networks.

------
divs1210
This felt like watching Ex Machina. Thanks a lot, this was extremely
informative and super fun.

------
thewarrior
I have a dumb question. How is a recurrent neural network different from a
Markov model?

------
jgmmo
Very neat, and funny article. I love the PG generator.

