The Unreasonable Effectiveness of Recurrent Neural Networks (karpathy.github.io)
913 points by benfrederickson on May 21, 2015 | 207 comments

Karpathy is one of my favourite authors - not only is he deeply involved in technical work (audit the CS231n course for more[1]!), he spends much of his time demystifying the field itself, which is a brilliant way to encourage others to explore it :)

If you enjoyed his blog posts, I highly recommend watching his talk on "Automated Image Captioning with ConvNets and Recurrent Nets"[2]. In it he raises many interesting points that he hasn't had a chance to get around to fully in his articles.

He humbly says that his captioning work is just stacking image recognition (CNN) on to sentence generation (RNN), with the gradients effectively influencing the two to work together. Given that we've powerful enough machines now, I think we'll be seeing a lot of stacking of previously separate models, either to improve performance or to perform multi-task learning[3]. A very simple concept but one that can still be applied to many other fields of interest.

[1]: http://cs231n.stanford.edu/

[2]: https://www.youtube.com/watch?v=xKt21ucdBY0

[3]: One of the earliest - "Parsing Natural Scenes and Natural Language with Recursive Neural Networks" http://nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf

> he spends much of his time demystifying the field itself, which is a brilliant way to encourage others to explore it :)

yup. this is the first time I understood someone from this field. Honestly, this dude just broke down the wall.

What's more important, passion flows through his writing. And it can be felt. I got so excited while reading it.

Andrej is also a great lecturer; his CS231n class in the winter was both the most enjoyable and educational I've taken all year. All of the materials are available at cs231n.stanford.edu, although I can't seem to find the lecture videos online. It may not have been recorded.

As a bonus, there's an ongoing class on deep learning architectures for NLP which covers Recurrent (and Recursive) Neural nets in depth (as well as LSTM's and GRU's). Check out cs224d.stanford.edu for lecture notes and materials. The lectures are definitely being recorded, but I don't think they're publicly available yet.

Yay, looks like the videos for cs224d are available on http://cs224d.stanford.edu/syllabus.html

He's good at demystifying a lot of things. He's taught thousands (at least) of people how to get started with solving the Rubik's cube competitively (shortest-time) via his YouTube channel.

His username is badmephisto if you're interested.

The tree that just gives and gives! Thank you!

It's cool having the network running live in your browser on the cs231n.stanford.edu page

Read over [1] and am currently watching [2], and I really can't get over a not insignificant bit of dissonance:

(a) He seems to be very intelligent. Kudos. But…

(b) How good of an idea is it really to create software with these abilities? We're already making machines that can do most things that had once been exclusive to humans. Pretty soon we'll be completely obsolete. Is that REALLY a good idea? To create "face detectors" (his words!)?

Our generation is going to get old and feeble and eventually die. If we have children, they'll completely supplant us.

Our relevance is ephemeral, but our influence will be lasting. Do we want to have a legacy of clinging to our personal feelings of importance, or of embracing the transience of our existence and nurturing our (intellectual) progeny?

We let our children inherit the world because we want them to be happy. Not so with machines designed to carry out industrial tasks.

Good for the entrepreneurs that invent them.

Nice. Andrej Karpathy deserves some kind of award for demystifying deep learning and making the subject so accessible to a wider audience. If you're a developer who knows little about the subject and want to learn more, a great starting point is the home page for his ConvNetJS project.[1]


[1] http://cs.stanford.edu/people/karpathy/convnetjs/

And if you're more comfortable with Python, I strongly recommend the CS231n assignments / labs: http://cs231n.github.io/

Assignments 1 and 2 alone give a solid intro to implementing these algorithms, and the lab-oriented iPython-based format gives you a very high probability of writing a correct implementation even if you're clueless at the start.

As a father, the output feels really familiar. It's like a child learning to talk. At first, though the words they say are actual words (and mean something to you), they themselves have no idea what the meaning is. Eventually though they start understanding the meaning, which combined with the syntax creates a person who can communicate.

I wonder if all that's missing is just a few more layers, and another source of input. Maybe a list of requirements/output/input matched with the code so it understands why what was written was written. I wonder what would happen if you ran the program, took the output, and fed it back in as input.

Really cool stuff here.

I think you are right in that other inputs are needed to decipher meaning. Humans for example tend to have quite a lot of different sources of input -- as when we are children and learning new words we have the spelling (visual), how it sounds (auditory), and possibly another image that shows what the thing means ("cat"). Or maybe we have the auditory ("mommy") and the visual (the child's mother). If you were trained strictly on text, then the meaning of concepts is harder to decipher. It might be why abstract concepts like higher level math are hard for a lot of people to grasp -- their only exposure to the concepts is usually just in the form of text.

As an exercise, when I think of the word "circle", images of circles and spheres show up in my head. Also the equation of a circle. My quick definition of it would be "a perfectly round object" which leads to questions of what "round" and "perfect" mean. The more I think about it, all my knowledge seems quite circular in that there are no axiomatic concepts, everything is relative and it just builds on itself. I wonder if that's the key to decipher meaning, increase the connections of the web -- with strong enough references you can pinpoint which of the nodes in the web something refers to.

What about programming, for example? It's entirely abstract and doesn't necessarily have any visual representation. Programming is best learned through examples. E.g. "here's a line of code, here is what it outputs. Now try to figure out what the rules of the language are."

In the case of this article, the NN isn't being asked to do any abstract task like "decipher meaning", but the very concrete task of "predict the next word". As the article shows NNs can do this fairly well.

There is also evidence that they can learn very high level knowledge about words and objects. See the success of word vectors: http://technology.stitchfix.com/blog/2015/03/11/word-is-wort...

> when I think of the word "circle", images of circles and spheres show up in my head

There seems to be some evidence that this stuff is fairly central to human intelligence and that the ability to visualize in 3D is kind of hard-wired. Deciphering meaning is approximately "seeing what it means", which can correspond to visualizing it in your head. For example, "the cat sat on the mat" is a bunch of symbols, but if someone or some machine can convert that to an image of a cat sitting on a mat, then I guess they've understood it.

This part of your comment, "...there are no axiomatic concepts, everything is relative and it just builds on itself", reminded me of this Marvin Minsky paper. If you haven't read him before - enjoy the ride!


As a father, and as someone interested in this discussion (about "child learning to talk"), I think you will love Prof Deb Roy's insights into how his infant son learned language.


As one of the other commenters pointed out - it is like a tree (words/concepts) branching out from one another. I would be fascinated by seeing if this research can be continued into adulthood, where the individual "concepts" aren't as important as the interplay between them.

Human children have the great benefit of interactively learning from their parents and other humans raising them. Could we expect a child to learn to speak if they only heard recordings of existing speech with otherwise no human interaction/feedback - correcting them or offering customized and contextual new bits of information? It would be interesting to add a feedback path for human corrective input, i.e. because it's direct interaction, feed it back but somehow weight it a little more than just another corpus input.

> Could we expect a child to learn to speak if they only heard recordings of existing speech with otherwise no human interaction/feedback - correcting them or offering customized and contextual new bits of information?

I once asked a similar question on some online forum [1] where many linguists hung out. My question was if an English-only speaking household left a general interest Spanish language TV station on most of the time when they weren't actively using the TV to watch something, so that their child received a very large exposure to Spanish language programming (news, sports, soap operas, sitcoms, movies, etc) from birth onward, would the child naturally learn Spanish?

I don't recall for sure what the linguists who responded said, but I think they all said the child would not learn Spanish from this.

[1] I have no recollection of where this was.

When I was little (5-7 years old) I had quite a few anime videos and magazines in italian sent to me by my parents who were abroad. Where I lived no one knew a single word of italian. I often watched and rewatched those videos and read those magazines without any other external input. I can tell you that doing that I easily learned the language. When I was 8 years old I also left for Italy and in two weeks of time I already started speaking fluently, albeit with a few mistakes.

If the child will actually watch the Spanish TV he will learn the language.

EDIT: Even now I often learn new japanese words (and remember them) just by watching animes. The difference is that now I have english subtitles but back then I had no subtitles, only the images to help me understand the meaning.

But this is a little different, in the AI we want the ability to form syntactically correct sentences, but also some intelligence behind the sentences too. You as a human had another foundation of intelligence to lean on, your native language, and an understanding of the world outside of learning the Italian language. If you had no other human interaction, would you have learned any language? That's the more the situation of these AI algorithms.

I was only arguing that the child could actually learn Spanish, nothing more.

Not knowing basically anything about AI state of the art what stops us from feeding a RNN image data and text data and make it correlate them automatically by context? Just like a child learns words by hearing them many times in similar contexts so could a RNN.

I imagine the biggest problem is gathering and structuring the data. We humans receive lots of data and have lots of time to process it over our lives by comparison. And by lots I mean a difference of a few orders of magnitude. It's amazing what this thing learns in just a few hours of processing.

I agree the RNN performance is really amazing!

I've seen something like this in action with young kids who are given a tablet and stumble upon cartoons they like on YouTube, but in a different language. After they watch a few cartoons, YouTube's recommender system keeps offering them more cartoons in that language. And it isn't long before they start spouting words and phrases in that language.

In the Baltic states, the majority of TV content either comes in native German or is US shows translated to Russian.

I've picked up quite a bit of Russian by watching Discovery channel this way.

There are some recent examples where people have trained a collection of large nets which are then used to teach a smaller net. The smaller net can learn more quickly and finally achieves better performance than the large collection.

The methods involve providing more detailed feedback at each example. With most training data used now, we give a 0 or 1: does this example belong to this class? In the teacher networks, they were able to teach with more subtlety: this is definitely not a car, it is very lizard-like and a little snake-like.
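To make the "0 or 1" versus "more detailed feedback" contrast concrete, here is a minimal sketch of a hard label next to a softened teacher output. The class names, the teacher scores, the temperature of 3.0 and the helper names are all made up for illustration; this is not taken from the talk or from any particular paper's code.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Loss of the predicted distribution against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Hypothetical teacher scores for one image (assumed numbers).
classes = ["car", "lizard", "snake"]
teacher_logits = [-4.0, 5.0, 2.0]

hard_label = [0.0, 1.0, 0.0]                           # the usual 0/1 target
soft_label = softmax(teacher_logits, temperature=3.0)  # "very lizard, a little snake"

# A hypothetical student prediction, scored against both kinds of target.
student_probs = softmax([0.0, 1.0, 0.5])
loss_hard = cross_entropy(hard_label, student_probs)
loss_soft = cross_entropy(soft_label, student_probs)

for name, p in zip(classes, soft_label):
    print(f"{name}: {p:.3f}")
```

The soft target still ranks "lizard" first, but it also tells the student that "snake" is a far more plausible confusion than "car", which a 0/1 label cannot express.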

Do you have a reference?

Geoffrey Hinton gave results to this effect in a talk about "Dark Knowledge" [1]. Haven't seen any of these results published, though. I think he mentions something in the talk about NIPS rejecting the paper.

[1] - https://www.youtube.com/watch?v=EK61htlw8hY

The TL;DR appears to be "no", interaction is necessary:


Although for obvious reasons this is very hard to study experimentally:


Proper language usage probably involves causal modelling, in which case intervention experiments are one of the only known ways to learn correctly.

I'm convinced that voice pitch, syllable meter, hand gestures, facial gestures, and other forms of non-word emphasis are also crucial to figuring out basic phrase chunking and word types, with minimal interaction, early on.

Seems it would be far harder to infer the basic initial structure from just plain text.

Umm. When you train you train against a cost function...

Well not exactly, if we had a cost function as intelligent as a human, the cost function to train the AI, would be an AI. Or maybe I'm completely off base here...

A machine will never get the meaning of a word, unlike a very small child. I am simply amazed by the fact that a child can learn a language, catch what a question is, offer an answer, say no (and how they like to say no), and all. As much as I wish it was possible, that much I believe it's not. The best we can do is put our knowledge of our ability to infer meaning of words into machine code.

So far scientists haven't found anything special about the human brain that can't be mimicked by a machine. Given enough neural connections, and a large enough data set, and a long enough training period there is no reason to think that a machine can't do everything a human brain can do.

Put another way there is nothing magical about a child learning about the world. A child's brain is just a large neural network being fed patterned data over the course of many years by a variety of extremely high resolution analog sensors. Eventually the child begins to respond to the patterns.

Not really, there are clearly epigenetic changes to neuron DNA w/r to memory formation and I don't think anyone has estimated what kind of computational firepower that represents.

Second, the 3D topology of a neuron is IMO more complex than reducing it to an FP32 activation threshold (all IMO of course).

Finally, I have to admit as a former biologist, I'm intrigued by microtubule activity and it seems like Dileep George and even Geoffrey Hinton are heading towards smarter but fewer neurons as opposed to just increasing the neuron count. Not surprisingly, the deep learning digerati are resisting this notion mightily just like the SVM peeps harped on neural networks until they kicked them in the keester.

TLDR: It's still early, and I'm biased that there are some interesting twists and turns yet to unfold here.

Also, brain chemistry. You can't give psychoactive drugs to a circuit board...

I'll bet in a matter of time we will find that indeed you can give something quite analogous to "psychoactive drugs" to a strong AI!

I've done some thinking on this.

If you can computationally define how different common neurotransmitters affect the function of neurons at a broad, high level, then you can create your "psychoactive drug" by just writing a routine that excessively applies the function that those neurotransmitters represent.

An artificial serotonin reuptake inhibitor would just allow the serotonin-like activity to be more active in the model.

Turbo button suddenly has a real meaning.

The parameter Karpathy calls 'temperature' seems not dissimilar in effect to a psychoactive drug, low temperature corresponding roughly to sober and high to being a bit, well, high.
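For anyone who hasn't seen the temperature knob in code: a rough sketch, assuming the common formulation where the scores are divided by the temperature before the softmax. The scores, the seed and the function names are invented for the example and are not from the article's code.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Pick an index with probability proportional to exp(score / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def pick_counts(logits, temperature, n=1000, seed=0):
    """Count how often each index is sampled over n draws."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(n):
        counts[sample_with_temperature(logits, temperature, rng)] += 1
    return counts

logits = [3.0, 1.0, 0.0]                    # made-up scores for three characters
sober = pick_counts(logits, temperature=0.2)  # almost always the top choice
high = pick_counts(logits, temperature=5.0)   # choices spread out much more

print("low temperature: ", sober)
print("high temperature:", high)
```

At low temperature the sampler keeps repeating the safest choice; at high temperature the unlikely options show up often, which is where the more "creative" (and more error-prone) output comes from.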

> A child's brain is just a large neural network being fed patterned data over the course of many years by a variety of extremely high resolution analog sensors. Eventually the child begins to respond to the patterns

Seems a bit early to jump to the conclusion that we understand cognition. We don't. I agree that there is nothing exotic or metaphysical about brain meat, but really we're still feeling around in the dark with respect to how thinking occurs.

I'm confident that we'll get there eventually though.

My guess is that it's probably a bit like evolution, in that fairly simple pressures and rules carried out an astronomical number of times across a huge number of interacting individuals yield surprisingly complicated outcomes.

> So far scientists haven't found anything special about the human brain that can't be mimicked by a machine.

mimicking the brain's power-consumption-to-compute-power ratio is difficult, if not impossible, with today's technology.

an aside: since reading an article about the potential role of quantum mechanics in photosynthesis, i've wondered, as a lay person, whether quantum mechanics play a role in human cognition.

Theoretical physicist Roger Penrose is a proponent of this view, but theoretical computer scientist Scott Aaronson presents a rebuttal of his points [1]. Another article claims that the distance between synapses is two orders of magnitude too big for quantum mechanical effects to be effective, which seems like a plausible rebuttal to me [2].

[1] http://www.scottaaronson.com/democritus/lec10.5.html [2] http://www.csicop.org/sb/show/is_the_brain_a_quantum_device

There's regular quantum mechanics which underlies all chemistry and that you can use to calculate molecular properties and then the woo woo kind which Penrose seems to propose as behind consciousness on the basis that both are a bit mysterious so maybe one causes the other.

Neal Stephenson's Anathem is a novel addressing this (among other things): practically, is the brain at least partially a quantum computer?

So AI has been solved, eh?

> A machine will never get the meaning of a word, unlike a very small child.

Why not? Your brain isn't magic, just highly associative. We can do the same thing with computers real soon now.

> We can do the same thing with computers real soon now.

Haven't people been saying this for decades? AI has a long history of impressive results, but somehow none of them have actually produced "thought".

Nobody even understands how the brain "thinks" at a neural level, let alone how to model that. All we can do at this point is try different models (which may or may not actually match reality) and hope we find one that works. But there's no evidence that we'll find a working model "real soon now". Impressive results that we can kinda-sorta imagine being the product of an intelligent system haven't historically been enough.

> saying this for decades?

A handful of years ago I put together a fully loaded computer that gave me 1 teraflop of computing power.

Today I can put together a computer the same size that will give me 32 to 50 teraflops of programmable computing power.

Many of the "AI" advances since 2007 are just running old 1970s-1990s AI algorithms on faster and faster and more parallel hardware. If you have to train a model for a few hundred trillion instructions, but your CPU only does 20 operations per second (and you have to share it with 1,000 other people), you can't iterate your science fast enough to make progress. Now we can iterate our science almost too quickly.
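A back-of-the-envelope version of that argument, taking the comment's figures at face value. The 300 trillion operations stand in for "a few hundred trillion" and are an assumption, and treating one floating-point operation as one "instruction" is a simplification.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def training_time_seconds(total_ops, ops_per_second, sharers=1):
    """Wall-clock time if the machine's throughput is split evenly among sharers."""
    return total_ops / (ops_per_second / sharers)

total_ops = 300e12  # assumed size of one training run

# The comment's slow machine: 20 ops/s shared with 1,000 other people.
slow = training_time_seconds(total_ops, 20, sharers=1000)
# The 1-teraflop machine, used alone.
fast = training_time_seconds(total_ops, 1e12)

print(f"shared old machine: about {slow / SECONDS_PER_YEAR:.1e} years")
print(f"1 teraflop machine: about {fast:.0f} seconds")
```

Whatever the exact numbers, the point survives: one setup allows a handful of experiments per career, the other allows many per hour.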

> how the brain "thinks" at a neural level,

Planes don't fly like birds. Birds don't fly like bees. True AI doesn't have to replicate mammalian (or avian or reptilian) neural topology.

Upvote for "Birds don't fly like bees", I like the freshness of not merely pointing out that "humans can do better", but that in effect there are several paths that avoid different constraints to get to the same point!

I do not think that we have the capacity to create a brain capable of being conscious with our current technology. Storing zeroes and ones deterministically on pieces of silicon with crammed together transistors and doing computations by what are basically logical gates is kind of limiting and inefficient. And let's say that the teraflops we're talking about are meaningful. How many teraflops do we need anyway? Shouldn't Google's data-centers suffice already for reaching the potential of a piece of gray matter that fits in under 60 cm of circumference?

I also agree that AI will never be "human" (i.e. it will be different), however without understanding how the human brain works, what chances do we have to create AI?

And we have yet to crack that nut. We have yet to understand even high-level stuff in detail, like how information is flowing from short-term memory to long-term and how we forget and why we do that (i.e. forgetfulness is surely an evolutionary trait). A brain is also fascinating in how it recovers from serious strokes by re-purposing brain structures. We have yet to produce software that is that sophisticated. And we don't even understand the brain from a biological perspective yet.

Surely huge progress has been made, but on the other hand we may still be hundreds of years away and there's a very real possibility that we lack the intellectual capability, or maybe the resources to do it (we have a history of settling for lesser solutions if we stop seeing financial benefits, like with space exploration).

> Storing zeroes and ones deterministically

Turing-complete platforms are universal simulators. There's nothing they can't represent.

> like how information is flowing from short-term memory to long-term

Sure, we know that. The little seahorse helps out.

> re-purposing brain structures

rudimentary artificial neural nets do the same thing. they also self-specialize automatically with no innate programming (line detectors, edge detectors, eye detectors, cat detectors, all the way up—automatically).

> we may still be hundreds of years away

lol. nope. gotta think exponentially.

> lesser solutions if we stop seeing financial benefits, like with space exploration

can't do space exploration without the approval of a nation-state. can do AI tomfoolery in your own basement with nobody else finding out until it's too late.

> Turing-complete platforms are universal simulators. There's nothing they can't represent.

No existing computer is a universal turing machine. The infinite ram requirement is pretty hard to implement in practice.

> Storing zeroes and ones deterministically on pieces of silicon with crammed together transistors and doing computations by what are basically logical gates is kind of limiting and inefficient.

And yet it is less limiting and more efficient than pretty much all analog computing devices we have built. I don't think the hardware is the issue anymore, I suspect that with the right models and training we can have thinking machines.

Jeff Hawkins' team of researchers and the people behind NuPIC and Numenta.org, at least, given how it was explained to me, believe that the human brain does compute digitally (ie the analog values don't matter, the presence or absence of the signals do). Geoff Hinton also appears to believe that the biological neural signals are interpreted in a binary way.

I could have misinterpreted their work, though, as I'm far from an expert, but that's what it sounded like to me.

"Real soon now" is an ironic term [1][2]

I think what the parent is trying to say is not that it's easy (it's not) but that there is nothing, in principle, to stop us from writing a program that acts like a brain.

[1]http://www.catb.org/jargon/html/R/Real-Soon-Now.html [2]http://c2.com/cgi/wiki?RealSoonNow

>Your brain isn't magic, just highly associative

It's also not pure algorithm, it's a physical entity, tangible and with real world properties and interactions.

Who said (or proved) it's just an information processing device?

> It's also not pure algorithm, it's a physical entity, tangible and with real world properties and interactions.

So are computers.

Yes, but computers are not what's important in calculation. Algorithms are. You could do exactly what a computer does with pen and paper (it would just take a much longer time). The physical properties of the computer don't matter in this regard.

Whether that's the case in human cognition remains to be shown (else we're taking for granted what we're trying to prove).

> Yes, but computers are not what's important in calculation. Algorithms are.

That's not correct.

> The physical properties of the computer don't matter in this regard.

That's not correct.

> You could do exactly what a computer does with pen and paper (it would just take a much longer time).

Yes, and that time matters greatly as it's the difference between practical and hypothetical. Beyond that, programs that can evolve their hardware have been shown to come up with optimizations no human could have created and thus the physical properties of the computer do matter.

A simulated being in a simulated world is just as real in its world as we are in ours.

If we can bridge the simulated world to our world then we can interact with it.

Being in different worlds does not imply that it can never reach consciousness (among other properties). To imply that is invoking magic.

Anything from our world can be simulated.

>A simulated being in a simulated world is just as real in its world as we are in ours.

To be literally "as real in its world (as we in ours)" several things need to happen:

1) its world should be a 1-1 simulated mapping of our world. Perhaps not to its whole extent (e.g. not the whole universe), but to ANY extent that affects the final result.

2) its world should have randomness equivalent to the quality of randomness (not sure if it's perfect) that our world has.

As for "Anything from our world can be simulated" -- that's a bold claim, provided that we haven't simulated ANYTHING at all yet, to the degree of interactions and complexity that exist in our world.

When we simulate the behavior of water in a fluids physics simulation, or the behavior of planets etc, it's amazing how much stuff we leave out. Our simulations are to a full-blown simulation what South Park cut-outs are to a photograph.

Besides, this notion reminds me of the naive 19th century ideas, that they could predict the course of the universe if only they had the details (motion, momentum, weight, etc) of all objects and the capacity to calculate their interactions. QM put a hole in that.

Regarding 1) There is no requirement for the simulated world to be a 1:1 mapping of our world. It can be completely different, a simplified subset or whatever it likes; this does not change the premise that to the inhabitants of that world, it is real. It's not our world, but that has no relevance to anything; there is no rule that says it has to be a 1:1 mapping.

As for 2) likewise, randomness isn't a requirement, you're arbitrarily picking one quality and saying that quality has to be identical for it to be real. why? I don't believe that for a second.

I'm fully aware of the simplifications of simulation... being simplified compared to an external universe does not change the premise of it being real to its inhabitants. Quantum Mechanics does not say that the universe is not mechanistic, just that there is a random element (that in itself may ultimately be modelled).

>There is no requirement for the simulated world to be a 1:1 mapping of our world

It has, if it has to be "moisture" and also to be "just as real".

Else, you can define as "moisture" any parameter in the simulation (since it can be "whatever it likes").

E.g. the property of being "alive" in Conway's Game of Life.

In what sense will that be a simulation of "moisture" and "just as real" inside the simulation as moisture is to us?
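For readers who don't know the Game of Life example: "alive" there is just a per-cell flag updated by a counting rule. A minimal sketch of one generation using the standard rules (a cell is born with exactly 3 live neighbours and survives with 2 or 3); the function name and the set-of-coordinates representation are choices made for this example.

```python
from collections import Counter

def step(live_cells):
    """Advance one generation; live_cells is a set of (x, y) coordinates."""
    # Count, for every cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbours, survival on 2 or 3.
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live_cells)
    }

# A "blinker": three cells in a row that flip between horizontal and vertical.
blinker = {(0, 1), (1, 1), (2, 1)}
after_one = step(blinker)
after_two = step(after_one)
print(sorted(after_one))
print(after_two == blinker)
```

Nothing in the rule refers to anything outside the grid, which is exactly the property both sides of the argument are leaning on.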

you're mixing your frames of reference.... The simulated 'moisture' would exist within the simulated universe using simulated water and simulated fabric (for example). Within the simulation, that fabric would be just as moist as a piece of cloth left out in the rain in our universe. you can't mix your frames of reference.

I often wonder if the gap is that we're so caught up on training our networks on vision and text that we're ignoring the fact that living beings have a sense of time and reward as part of their input.

A child knows that if it says "Mama food," it is likely to get attention, and if it gets attention, it is likely to minimize its hunger. Right now, a neural network can be trained to know that "Mama" occurs often in human dialogue, what words occur around it, even its dictionary definition and images of mothers. But it's not making the deeper connection to a strategy that minimizes hunger.

When I think about this, I wonder if insights from the world of gaming "AI" would be useful in developing the training datasets for real AI. Because you can't be a mother to a billion virtual babies, but you might be able to program a set of heuristics to be a mother to a billion virtual babies. Then you have some system that trains on their life experiences...? All speculation, but very interesting stuff.

There is a huge amount of research which is combining the power of deep learning for automated feature extraction with reinforcement learning for learning "natural reward signals" without label information.

See any of the recent papers from Google DeepMind, such as [1] or their most recent work which is startlingly good [2]

[1] http://www.nature.com/nature/journal/v518/n7540/full/nature1... [2] http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-i...

I'd forgotten about that! The "game" needs to get much more complex to simulate life, of course. Now I wonder if they could throw that infrastructure at Minecraft survival mode...

What makes the human brain not a machine?

Physical properties? What if those kind of properties of physical materials are needed in cognition?

The problem is simulations of the brain are not "machines", they are algorithms, e.g. they assume everything is happening at the information processing level.

To use your own example, we can design an algorithm to simulate making coffee. But the algorithm can never make coffee -- unless it's fitted and connected to a coffee making apparatus.

Or take something being "wet" for example. We can emulate the motions and powers in play in liquids, but not "wetness" in the sense of the physical property (moisture etc). If something depends on it, e.g. the emulation actually watering some actual flowers, then it will fail. An emulation can only water emulated flowers.

> The problem is simulations of the brain are not "machines", they are algorithms,

Simulations are executed on concrete machines that exist in the real world. Algorithms are abstract concepts.

> e.g. they assume everything is happening at the information processing level.

Everything does happen at the information processing level. Any kind of physical process can be seen as a type of information processing. Information processing is not an abstract concept like an algorithm, for it to occur requires the time-evolution of concrete physical processes.

> We can emulate the motions and powers in play in liquids, but not "wetness" in the sense of the physical property (moisture etc).

The physical property is experienced as sensory input. Machines can have sensory input.

> An emulation can only water emulated flowers.

You are asserting that virtual reality is different from reality, which is true. That's not the GP's question. The question is whether there is a fundamental difference between machines in the real world (with sensors and arms and so on) and the human body and brain.

> The question is whether there is a fundamental difference between machines in the real world (with sensors and arms and so on) and the human body and brain.

This is pure philosophy, as no one yet knows the answers, but what if brain-like intelligence is an emergent property of non-deterministic processes? Wouldn't it then follow that a classical computer would not be able to compute the "think function" before the heat death of the universe?

personally my intuition says that strong AI cannot be encoded in silicon, or that it is a victim of the halting problem. I think we need a different substrate on which to model cognition. Or maybe not. Who knows?

It was pure philosophy, but it is becoming less so as we make things like cochlear implants that replace some neural circuitry with electronics.

> A machine will never get the meaning of a word

That's an irrational and indefensible position.

"This sample from a relatively decent model illustrates a few common mistakes. For example, the model opens a \begin{proof} environment but then ends it with a \end{lemma} ... By the time the model is done with the proof it has forgotten whether it was doing a proof or a lemma. Similarly, it opens an \begin{enumerate} but then forgets to close it."

Ah, so strong AI is finally here. A computer program that makes just the same mistakes as humans when writing in TeX.

I'm not sure how "unreasonable" the effectiveness of RNNs is if the corpus output at 2000 iterations isn't significantly better than a simple prefix-based Markov chain implementation [1] (and for the regular languages, with some extra bracket-checking), but I found the evolution visualizations really interesting.

[1] http://thinkzone.wlonk.com/Gibber/GibGen.htm

It's quite unreasonable. He could have optimized it more for fooling humans in gibberish generation, but that would not show the general effectiveness of the approach. The power shows (quantifiably) in compression: 1.57 bits per character on Wikipedia is quite hard to beat. Of course, Markov chains are essentially universal models, so the training algorithm is the crucial distinction.

I believe Markov chains as a model quickly become inefficient (especially memory-wise) as you increase the complexity (long-range correlations) of your prediction. It's an unnecessarily restrictive model for high-complexity behavior that state-of-the-art RNNs skip entirely.

The state of the art in compressing Wikipedia is 1.278 bits (on a certain subset) [1]. So that does seem pretty good.

[1] http://prize.hutter1.net/

Except this NN isn't really a compression of Wikipedia since it can only generate Wikipedia-like nonsense.

There's very little difference between a contextual predictive model like this and the guts of a compressor.

If your prediction is good enough that you can always come up with two possible predictions for each character, each of which has a 50% chance of being correct, then obviously you can compress your input down to one bit per character by storing just enough information to tell you which choice to pick. More generally, you can use arithmetic coding to do the same thing with an arbitrary set of letter probabilities, which is exactly what you get as the output of a neural network.

When the blog post says the model achieved a performance of "1.57 bits per character", that's just another way of saying "if we used the neural network as a compressor, this is how well it would perform."
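To make the "bits per character" figure concrete, here is a minimal sketch of how it could be computed from any next-character predictor. The names `bits_per_character`, `coin_flip_model` and `skewed_model` are made up for illustration, and `predict` is only a stand-in for the network, not Karpathy's code. The number it returns is the size an ideal arithmetic coder driven by the model would approach, ignoring the coder's small constant overhead.

```python
import math

def bits_per_character(text, predict):
    """Average bits per character an ideal arithmetic coder would need.

    `predict(context)` is a hypothetical stand-in for the network: it
    returns a dict mapping each possible next character to a probability.
    """
    total_bits = 0.0
    for i, ch in enumerate(text):
        p = predict(text[:i])[ch]
        total_bits += -math.log2(p)  # cost of coding `ch` under the model
    return total_bits / len(text)

def coin_flip_model(context):
    # Toy model from the comment above: two candidates, 50% each.
    return {"a": 0.5, "b": 0.5}

def skewed_model(context):
    # A more confident toy model: 'a' is predicted 90% of the time.
    return {"a": 0.9, "b": 0.1}

print(bits_per_character("abba", coin_flip_model))   # 1.0
print(bits_per_character("aaaaaaaaab", skewed_model))
```

The coin-flip model lands at exactly one bit per character, as in the comment above. The skewed model does better than one bit on text that matches its expectations, which is the whole point: better prediction means fewer bits.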

I'd be interested in seeing this NN perform a lossless compression of Wikipedia at 1.57 bits per character.

It's a compression of Wikipedia in the sense that the NN generates probability estimates of the next character given the previous; the gibberish is simply greedily asking the NN repeatedly what the most-likely next character is. However, plug it into an arithmetic coder and start feeding in an actual Wikipedia corpus, and hey presto! a pretty high performance Wikipedia compressor, which works well on Wikipedia text but not so well on other texts (like this one, with its lack of brackets).

That Markov Chain model operates on 4-grams by default. The RNN featured in the article generates output character-by-character, which is significantly more impressive. Here's a sample from the Markov Chain model operating on 4-grams:

  Ther deat is more; for in thers that undiscorns the unwortune, 
  the pangs against a life, the law's we know no trave, the hear, 
  thers thus pause. 
The only reason why it seems like the model can occasionally spell, and create anglo-sounding neologisms, is because it operates on 4-grams.

Here's some character-by-character output from the same Markov Chain model.

  T,omotsuo ait   pw,, l f,s teo efoat t hoy tha fm nwo   
     bs rs a h enwcbr lwntikh  wqmaohaaer ah es aer 
  mkazeoltl.etnhhifcmfeifnmeeoddssmusoat irca   
  do'ltyuntos sih i etsoatbrbdl
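For anyone who wants to reproduce the comparison, here is a minimal sketch of a character-level Markov chain with a configurable context length. This is not the code behind the linked generator; `train_markov`, `generate` and the `order` parameter are names I made up. I'm assuming `order=0` corresponds to the context-free character soup above and `order=3` (three characters of context plus the predicted one) is roughly what the tool calls 4-grams.

```python
import random
from collections import defaultdict

def train_markov(text, order):
    """Map each `order`-character context to the characters seen after it."""
    table = defaultdict(list)
    for i in range(len(text) - order):
        table[text[i:i + order]].append(text[i + order])
    return table

def generate(table, order, seed, length, rng):
    """Extend `seed` (at least `order` characters long) by up to `length` characters."""
    out = seed
    for _ in range(length):
        context = out[len(out) - order:]
        followers = table.get(context)
        if not followers:  # context never seen in training: stop early
            break
        out += rng.choice(followers)
    return out

corpus = "to be, or not to be, that is the question. " * 3
rng = random.Random(0)
for order in (0, 3):
    table = train_markov(corpus, order)
    print(order, repr(generate(table, order, corpus[:order], 40, rng)))
```

With `order=0` every character is drawn from the overall letter frequencies, so nothing resembling a word survives. With a few characters of context the output starts to look spelled, but only because the table is copying short runs from the corpus.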

"do'lty untos sih i"

maybe the computer was drunk?

it's a completely legit invocation for awakening cthulhu.

It balances parentheses and keeps track of other long range dependencies, something markov chain implementations cannot do.

Welcome to the unbearable forced-ness of titles. Everyone's making a nod to Milan Kundera these days.

First, it's not a nod to Kundera, but to a classic math related work that predates Kundera's book.

Second, even if it was, really? As if we see plays on Kundera titles regularly on the web?

I doubt it's a reference to Kundera.

I was thinking that both Eugene Wigner's 1960 article 'The Unreasonable Effectiveness of Mathematics in the Natural Sciences'[0] and Karpathy's 'The Unreasonable Effectiveness of Recurrent Neural Networks' probably touch deep aspects of the nature of existence. The first on why the universe exists and is mathematical - because at the fundamental level it is mathematical[1], and in Karpathy's case the RNNs are probably effective because they are close to the mechanisms of human consciousness.

[0] Wigner's article: http://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html

[1] 'physical world is completely mathematical' theory: http://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_...

This same thing (i.e., using recurrent neural networks to predict characters (and even words)) was done by Elman in 1990 in a paper called "Finding Structure in Time"[1]. In that paper, Elman goes several steps further and carries out some analysis to show what kind of information the recurrent neural network maintains about the on-going inputs.

It's an excellent read for anyone interested in learning about recurrent neural networks.

[1] http://crl.ucsd.edu/~elman/Papers/fsit.pdf

It's amazing how much was already known decades ago. Elman and others did much more, and hopefully, now the field will take the next step (which was long delayed), with the help of today's computer power.

The code generator is awesome. There's hardly a syntax error. The file headers are the best.

Nitpick: although tty == tty is, as you say, vacuously true in this case, that's just because tty is a pointer. If tty were a float, this wouldn't be the case, since it could be NaN. I wouldn't be surprised if it learned to test a variable for equality against itself from some floating point code.
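A quick illustration of the NaN point, in Python rather than C for brevity. The behavior comes from IEEE 754 floating point, so the same comparison is false for a NaN float or double in C as well.

```python
import math

x = float("nan")
print(x == x)         # False: NaN compares unequal to everything, itself included
print(math.isnan(x))  # True

y = 1.5
print(y == y)         # True for any ordinary value
```

So `v == v` is a real (if old-fashioned) NaN test on floating point values, and only vacuous for things like pointers.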

If nothing else, the RNN would be great for generating bogus source code for use in television programs and movies.

It would drive those who attempt to understand & reference it absolutely crazy. :D

The code is nonsense. Their method is good for fuzzy logic like recognition, but this approach with code will never work for anything other than an art project.

Currently it doesn't work, but saying it'll never work is pretty strong.

This kind of demo shows that deep neural networks can capture the structure of language, if not the semantics, in a very general way. And we have separate evidence that they can (in principle) capture semantic meaning and algorithmic reasoning as well, for example: http://arxiv.org/pdf/1410.5401v2.pdf (the "neural Turing machines" paper from DeepMind)

This is better, but you get pretty far with just markov chains with probabilities for letters actually.

Show me a Markov chain implementation that can write code letter by letter and I'll give you a car.

(And I mean plain markov chain, not something with additional logic that understands code structure)

The comment by samizdatum shows pretty well how Markov chains work without some tweaking.

Feed it all of GitHub, and I'm sure you could come up with some interesting auto-complete code generation tools. Of course, coming from GitHub, it'll be poorly documented and filled with buffer overflows :D

I'll agree that this is interesting, but it seems like a lot of people in this thread miss the point: we're working with multi-layer tools now. This enables modeling of multi-layer processes. The code generation as it stands is obviously a toy, but what happens if we actually think about the real processing layers?

Take this example of code processing, and instead front it with a parser that generates an AST. For now, an actual parser for a single language. Maybe later, a network trained to be a parser. The AST is then fed to our network. What could we get out of the AST network? Could we get interesting static analysis out of it? Tell us the time and/or space complexity? Perhaps we discover that we need other layers to perform certain tasks.
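As a rough sketch of the "front it with a parser" step, assuming Python source and the standard library `ast` module: parse the code, then flatten the tree into a sequence of node type names that a sequence model could consume instead of raw characters. `node_type_sequence` is a made-up helper, and flattening to type names is just one of many possible encodings.

```python
import ast

def node_type_sequence(source):
    """Parse Python source and list AST node type names as ast.walk yields them."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

source = "def add(a, b):\n    return a + b\n"
print(node_type_sequence(source))
```

The network would then see tokens like `FunctionDef`, `Return` and `BinOp` rather than individual letters, so it no longer has to spend capacity on learning to spell keywords or balance brackets.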

This, of course, has parallels in language processing. Humans don't just go in a single (neural) step from excitation of inner ear cells ("sound") directly to "meaning". Cog sci and linguistics work has broken out a number of discrete functions of language processing. Some have been derived via experiment, some observed via individuals with brain lesions, others worked out by studies of children and adult language learners. These "layers" provide their own information and inspiration for building deep learning systems.

But how will you find that needle in the haystack that works? This is effectively producing random code samples that look syntactically correct.

There is no need to produce readable code. Readability makes things easy for humans, but computers have no problem generating and subsequently understanding unreadable assembly.

Could be interesting to plug this kind of generator into American Fuzzy Lop.

There's not a lot of floating point in the kernel though.

Yes, and feed it into hackertyper.net and you can entertain an 8 year old for hours :-)

I wonder what would happen if you trained an RNN like the one described with, say, the scores of all of Mozart's chamber music and then let it generate new music from the learned pieces. How would it sound? Would it figure out beat? Chords? Harmonies? Might it even sound a bit like Mozart?

The work of Nicolas Boulanger-Lewandowski was extensively focused on this topic, see his work [1]. He wrote a Theano deep learning tutorial on this topic [2], and several people (Kratarth Goel) [3][4] have advanced the work to use LSTM and deep belief networks.

For a brief while RNN-NADE made an appearance as well, though I do not know of an open source implementation

There are also a few of us who are working on more advanced versions of this model for speech synthesis, versus operating on the MIDI sequence. Stay tuned in the near future!

I can say from experience that some of the samples from the LSTM-DBN are shockingly cool, and drove me to spend about a week using K-means coded speech. It made robo-voices at least but our research moved past that pretty fast.

[1] http://www-etud.iro.umontreal.ca/~boulanni/ [2] http://deeplearning.net/tutorial/rnnrbm.html [3] http://arxiv.org/pdf/1412.6093.pdf [4] https://github.com/kratarth1203/NeuralNet/blob/master/rnndbn...

Is the robot-voice code published anywhere?

You can make money out of that kind of thing btw!

(Obviously not the same thing but the point is that silly robo-voice code is marketable :)

There's a few such projects in existence. Perhaps not RNN-Mozart inspired, but I'm sure that exists too.

Emily Howell

Here's a Bach-inspired computer-generated song:


The thing about neural nets is that they are pretty opaque from an analyst point of view. It's hard to figure out why they do what they do, except that they have been trained to optimize a particular cost function. I think Strong AI will never happen because the people in charge will not give control over to a system that makes important decisions without explaining why. They will certainly not give control over the cost function to a strong AI because control of determination of the cost function is the axis upon which all power will rest.

Our life is dominated by systems we don't understand. I have some understanding of how my cell phone works at the software level, but when it comes to details at the hardware level I just trust the electrical engineers knew what they're doing. I have virtually no understanding of how the engine in the bus operates beyond what I learned in thermodynamics 101. Sure, you might say - someone understands these things. But for some systems, it's hard to pinpoint these people. And for some other complex systems, like the stock market, nobody really understands them or (completely) controls them. But we still use them every day. I think once AI becomes useful enough, people will gladly hand control over.

But some engineer out there understands how your phone works.

With neural nets NOBODY really understands how they work.

Maybe my understanding of neural networks is wrong... but I'm under the impression they work from weighted criteria. With enough weight an answer is selected as being the most likely. A well-trained neural network has enough data to weight options and pick with high accuracy.
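That intuition is roughly right for a single layer. A minimal sketch, with hand-picked numbers that are purely hypothetical (a trained network learns its weights and stacks many such layers with nonlinearities in between): each class gets a weighted sum of the inputs, and a softmax turns the scores into probabilities.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, biases, features):
    """One linear layer: a weighted sum per class, then softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Hypothetical, hand-picked weights for two classes over three input features.
weights = [[2.0, -1.0, 0.5],
           [-1.0, 1.5, 0.0]]
biases = [0.0, 0.1]
probs = predict(weights, biases, [1.0, 0.2, 0.4])
best = max(range(len(probs)), key=probs.__getitem__)
print(probs, best)
```

Here class 0 wins because its weighted sum is larger. The "black magic" part is not this arithmetic but how training finds millions of such weights that work together.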

Then again, this is essentially black magic to me:

A trained neural network is like a horrible, huge ball of spaghetti code you've inherited after its programmer was run over by a bus, and that for some miraculous reason happens to work mostly correctly.

However, you won't be able to understand why or how it works. That also means you won't be able to modify/improve/fix it using systematic methods. Only trial and error and it will be 'error' most of the time.

and, well, that's just a faulty assumption. of course we do know how they work.

This is a common criticism. However, almost all ML methods have some built-in heuristic choices that are the result of finding something that both works and is mathematically nice. Each of these choices restricts us to some family of functions where it's hard to justify why it's really relevant to the problem at hand, e.g. convex loss functions (l1, l2, ..), convex regularizers (l1, l2, ..), Gaussian priors, linear classifiers, some mathematically nice kernel functions, etc. In the end, people usually statistically estimate the performance of the methods and use what works.

> the people in charge will not give control over to a system that makes important decisions without explaining why

They will if it gives the answers they want to hear. History is full of critical decisions based on ridiculous pretexts or unclear processes.

It may be the case, though, that companies that relinquish control to neural nets will have better results than companies that don't. In fact, there's a winner-take-all effect in many markets, so in those even a slight improvement over humans would lead to massive benefits, rapidly pushing human analysts out of the market.

That's the (morally neutral) wonder of the market--it'll beat ideological or emotional objections into the ground, for better or for worse.

And sooner or later, someone might start a company where all decision making is performed by a neural net...

I've kind of drifted into the camp of transhumanism as the future, where humans are enhanced by all the smart sub-AI problem solvers but it's generally the humans who make the decision at the end of the day. Another problem, I think, is that for strong AI to exist we'd need an "objective function" for the AI to work towards, and we are not sure what that would be.

> the people in charge will not give control

Eliezer Yudkowsky would likely disagree with you: http://www.yudkowsky.net/singularity/aibox

EDIT: Also - http://www.explainxkcd.com/wiki/index.php/1450:_AI-Box_Exper...

I remember wanting to train a neural net for my MSc thesis more than 20 years ago, but my tutor recommended against doing so for precisely this reason, i.e. he said it is very difficult to prove your results. While not being able to prove your results might be a bad idea if you're trying to get your MSc, I don't see it holding back other advances.

What if an AI saves money, though? (i.e. is cheaper than hiring a real person for a simple task)

"Never! Companies would never sacrifice principle and safety to save money!"

We'll see...

This is quite incredible. The stylistic similarities of the generated Shakespearean saga, Linux code, etc. were quite startling. Perhaps we can train a haiku/fortune cookie generator which could occasionally be quite profound.

> Linux code etc

People are always worried about "computers taking factory jobs" resulting in mass unemployment, but the truth is, a rudimentary AI with acceptance tests on output will obsolete every programmer alive.

Hell, half the programming people do these days is just gluing APIs together then seeing if it actually works. It doesn't take 16 years of rich inner human life experience to accomplish that, just exhaustive combinational parameter searching on the subset of API interactions you're interested in evaluating.

Douglas Crockford touches on this aspect in this entertaining and insightful talk [0]. I'm guilty of what you state, and I think a large part of "programming" is rudimentary boilerplate coding/configuration and staring into the Abyss. I think our role will be to design algorithms and come up with creative solutions/hacks (which would be difficult for a program), then design a workflow/flow chart and feed it into a program which spits out binaries and flags edge cases. A whole swath of industries and economies (read: outsourcing) will become redundant, and the only outsourcing done would be to the generator.

[0]: https://www.youtube.com/watch?v=taaEzHI9xyY

Who do you think will write the acceptance tests? (To be honest, they're sometimes more complex than the code itself. E.g. write the acceptance tests for x=a/b for a and b as inputs.)
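Taking the x=a/b challenge literally, here is a hedged sketch of what even a partial acceptance suite looks like. The one-line function needs decisions about zero, signs, rounding, NaN and infinity before you can say what "correct" means. `divide` and `run_acceptance_tests` are hypothetical names, and raising on a zero divisor is my own choice of spec, not the only reasonable one.

```python
import math

def divide(a, b):
    """Return a / b for real numbers, raising on division by zero."""
    if b == 0:
        raise ZeroDivisionError("b must be non-zero")
    return a / b

def run_acceptance_tests(f):
    """Check a candidate implementation `f` against the assumed spec."""
    assert f(6, 3) == 2
    assert f(-6, 3) == -2                      # sign handling
    assert f(0, 5) == 0
    assert math.isclose(f(1, 3) * 3, 1.0)      # rounding tolerance
    assert math.isnan(f(float("nan"), 1))      # NaN propagates
    assert f(float("inf"), 2) == float("inf")  # infinity is preserved
    try:
        f(1, 0)
    except ZeroDivisionError:
        pass
    else:
        raise AssertionError("expected ZeroDivisionError")
    return True

print(run_acceptance_tests(divide))  # True
```

The tests are already several times longer than the function, and they still say nothing about overflow, integer versus float inputs, or negative zero.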

I'm all for it, it's going to be a productivity gain. It's like going from a manual screwdriver to a motorized one.

That particular stuff is actually pretty typical. I have a textbook that shows similar results on Shakespeare using N-grams from years ago.

Capturing writing style with ngram-based input and individual-character input are very, very different tasks. That's several ballparks higher in difficulty.

With ngrams, Markov models are perfectly sufficient. With individual characters, complex concepts need to be remembered across many, many characters of input.

I'm in the middle of reading this article (very much appreciate Karpathy's writings), but I also wanted to brain dump some of my musings on modern machine learning; RNNs in particular. Sorry if this is redundant to anything the article talks about.

Deep learning has made great strides in recent years, but I don't think architectures which aren't recurrent will ever give rise to mammalian "thought". In my opinion, thought is equivalent to state, and feed forward networks do not have immediate state. Not in any relevant sense. So therefore they can never have thought.

RNNs, on the other hand, do have state, and therefore are a real step towards building machines that possess the capacity to think. That said, modern deep learning architectures based around feed forward networks are still very important. They aren't thinking machines, but they are helping us to build all those important pre-processing filters mammalian brains have (e.g. the visual cortex). This means we won't have to copy the mammalian versions, which would be rather tedious. We can just "learn" a V1, V2, etc from scratch. Wonderful. And they'll be helpful for building machines with senses different than biology has yet evolved. But, again, these feed forward networks won't lead to thought.

My second musing is where I think the next leap in machine learning will occur. To-date efforts have been focused on how to build algorithms that optimize the NN architecture (i.e. optimize weights, biases, etc). But mammalian brains seem to possess the ability to problem solve on the fly, far faster than I imagine tweaks to architecture could account for. We solve problems in-thought, rather than in-architecture; we think through a problem. Machine Learning doesn't possess this ability. It can only learn by torturing its architecture.

So, I believe there is this distinction to the learning that mammalian brains are able to do on the fly, using just their thoughts, and the learning they do long term by adjusting synaptic connections/response. It seems as if they solve a problem in the short term, and then store the way they solved it in the underlying architecture over the long term. Tweaking the architecture then makes solving similar problems in the future easier. The synaptic weights lead to what we call intuition, understanding, and wisdom. They make it so we don't have to think about a class of problems; we just know the solutions without thought. (Note how I say class of problems; this isn't just long term memory).

Along those lines, I come to my final musing. That mammalian brains are motivated by optimization of energy expenditure. Like anything in biologically evolved systems, energy efficiency is key, since food is often scarce. So why wouldn't brains also be motivated to be energy efficient? To that end, I believe tweaking synaptic weights, that kind of learning that machine learning does so well, is a result of the brain trying to reduce energy expenditure. Thoughts are expensive. Any time you have a thought running through your brain, it has some neuronal activity associated with it. That activity costs energy. So minimizing the amount we have to think on a day-to-day basis is important. And that, again, is where architecture changes come in. They are not the basis for learning; they are the basis for making future problem solving more efficient. Like I said, once a class of problems has been carved into your synaptic weights, you no longer have to think about that class of problems. The solutions come immediately. You don't think about walking; you just do it. But when you were a baby, I'll bet the bank that your young mind thought about walking a lot. Eventually all the mechanics of it were carved into your brain's architecture and now it requires many orders of magnitude less energy expenditure by your brain to walk.

So, the obvious question is ... how do mammalian brains problem solve using just thoughts? The answer to that, as I mentioned, is likely to lead to the next leap in machine learning. And it will, more likely than not, come from research on RNNs. What we need to do is find a way to train RNNs that are able to adapt to new problems immediately without tweaking their weights (which should be a slower, longer term process).

P.S. Yes, I know this was probably a bit off-topic and quite a bit wandering. I've had these musings percolating for a while and don't really have an outlet for them at the moment. I hope it's on topic enough, and at least stimulates some interesting discussion. Machine learning is fascinating.

> That mammalian brains are motivated by optimization of energy expenditure. Like anything in biologically evolved systems, energy efficiency is key, since food is often scarce.

That doesn't square with empirical reality. Evolved biological systems appear to be optimized for robustness to perturbations, not efficiency (John Doyle argues that there is in fact a fundamental tradeoff between robustness and efficiency, for all types of complex systems not just biological).

> how do mammalian brains problem solve using just thoughts.

They don't. Sensory input is required for brains to learn new classes of problems.

> find a way to train RNNs that are able to adapt to new problems

Is this something different than multi-task learning?

> They don't. Sensory input is required for brains to learn new classes of problems.

Sensory input is required to gain the knowledge, but then you can just as easily muse over your gained knowledge for further insights in a sensory deprivation chamber as you can in a classroom.

> feed forward networks do not have immediate state. Not in any relevant sense.

Feed-forward networks do have state, but all the useful parts are obtained through explicit training (ye olde backprop, ye older hebbian). The typical scenario is "train model (write mode), deploy model (read-only mode)," which, as you point out, has no "thought" since at runtime, no changes or introspections are happening.

> So therefore they can never have thought.

The key idea here would be: generative models. Most current AI fads are driven by discriminative models (image recognition, speech recognition, etc) which provide very narrow "faster than human" output, but, as you point out, have no thought or will or motives of their own.

But, once you have a sufficiently connected network, you can start to ask it open-ended questions ("draw a cat for me") in the form of sampling from the network (gibbs sampling, MCMC, ...) and it fills in the blanks.

The extra oomph of providing actual agency and intent and desire to the model is an exercise left to the reader.

> (which should be a slower, longer term process).

Sleep is a requirement of all things with neural network based brains as far as we know.

An RNN can, in particular, implement a GOFAI algorithm. I think that's what we basically learn for the first 5-7 years of our lives by analyzing other people's behavior, communicating, etc.

The "DQN" (Deep Q Network) stuff from Google DeepMind has states. (And actions that transition from state to state.) This comes from Reinforcement Learning theory. (The Q-Learner from Reinforcement Learning is the "Q" in Deep Q Networks.) [doi:10.1038/nature14236]

Suri and Schultz argue that dopamine in the mammalian brain follows the "reward prediction error" from Reinforcement Learning [doi:10.1016/S0306-4522(98)00697-6]. (Indeed the DQN paper mentions dopamine in the very first paragraph.)
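For readers who haven't met Q-learning, here is a minimal tabular sketch of the update rule. This is the textbook version with a lookup table; DQN replaces the table with a deep network and adds machinery not shown here. `q_update`, the state names and the `alpha`/`gamma` values are all illustrative assumptions. The `td_error` it returns is the "reward prediction error" the dopamine comparison refers to.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]  # prediction error
    q[(state, action)] += alpha * td_error
    return td_error

q = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
actions = ["left", "right"]
err = q_update(q, "s0", "right", 1.0, "s1", actions)
print(err, q[("s0", "right")])  # 1.0 0.1
```

The first reward is a complete surprise, so the error equals the reward. As the estimate improves, the same reward produces a smaller and smaller error.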

Because of this, I am very excited about DQN. (I do think that it's only a building block towards building a self-aware brain, though.)

This comment is really well written and expresses a lot of my thoughts about recent advances in computer learning as well -- though in a more clear and expressive way than I could, I think. Thanks fpgaminer.

From what I can tell the RNN in question already has mammalian intelligence, but also a weakness for PG, a phase that it will emerge from in 100 years.

nice dump 👍

Very short video about the topic: https://www.youtube.com/watch?v=ZBkzqLJPkmM

Web spam 2.0:

1) Take the entire works of several popular content creators in a given field, complete with links out to articles etc.

2) Concatenate them into a single file

3) Train this thing to generate new articles

4) Create a map of popular articles that other people have written, to articles you have written on similar topics

5) Replace the originals with your articles

6) Publish millions of articles that can't be detected as spam automatically by Google

It's like bot wars: Spammers can train their robots to try and defeat Google's robots.

well, i don't see how they - the spammers - would fake google's valuation system of valuing incoming links from valuable sources. it's not like many valuable sites outside this relatively insular system would link to those generated nonsense pages. that'd practically create an insular babblenet that could be relatively easily identified.

i mean, it's not like that's exactly what's happening right now.

Okay so in the system I'm hypothesising, I pick a topic -- say content marketing. I go to Neil Patel's and KissMetrics blog and get all their articles on content marketing, and train this thingy with them.

I then buy, say, 1,000 domains. Doesn't matter what they are -- Or I buy 100 domains and setup 300 tumblr blogs, and 300 blogger blogs and 300 wordpress.com blogs.

Now I drip feed content to each of those blogs, but instead of linking to the articles on content marketing that kissmetrics and neil patel originally reference, I link to articles I have created instead.

How can Google tell the difference between a tonne of nobody bloggers linking to Neil Patel's articles, and my bots linking to my articles? The fact is that if you blog on niche topics, with good article titles reflecting low competition long tail keywords, you'll get some traffic from Google pretty easily -- how can Google possibly tell that links are coming from shitty bot generated pages versus from a tonne of obscure bloggers with virtually no audiences (of which there are thousands)?

The way they can tell the difference is Panda (or Penguin? I think it's Panda ... ) so as long as your pet robot can learn from Neil Patel and Kissmetrics well enough to produce content that cannot be penalised by Panda, and so long as you don't do it stupidly by like, having the same anchor text for all the articles and doing 1,000 articles overnight and actually phase it in so that it looks as though you're getting some reasonable organic spread, you'll be able to game Google's rankings pretty reliably for your real articles that you're trying to promote, and get higher volumes of traffic to those articles than you would be able to by just focusing on niche, long tail articles (for example because you'd be able to get on page #1 or in the top 5 for much higher volume keywords).

You would then get shares etc. for your actual content -- just because those "spam farms" don't have social shares or backlinks from PR6 blogs doesn't mean Google completely disregards them, just means that you need a lot more of them to make the same impact as lots of shares/backlinks from PR6 blogs.

This strategy is old, and was killed by Panda, but if you could beat Panda using a RNN then this would work again.

They have been doing this for years and years with Markov chains, and it works if you have content farms (100,000 URLs on different IPv4 ranges). Usually Google weeds them out after a while, but as it's all automated it works. It's tricky business as Google gets better and better, but it still works and people are making a lot of money with it.

I'm getting the funny impression that what distinguishes an algorithm from an AI algorithm isn't about the algorithm, but how people treat the algorithm. It's an AI algorithm if they describe it behaving intelligently, i.e. painting numbers on a house, learning English first, being born, being tricked into painting a fence, etc. Otherwise it's just an algorithm.

This is an old problem in AI. Chess was an AI problem, until a computer beat a grandmaster. Vision was an AI problem, now we have OpenCV. Many AI problems get shifted out of "AI" once they're solved.

It stems from our definition of an AI.

An AI is a computer doing those things a computer cannot do. As such, anything that a computer cannot do isn't AI, and anything a computer can do isn't AI either.

Hmm, the 'No true AI' fallacy, then, eh?

Pretty much, assuming you're making an analogy to the "no true Scotsman" fallacy.

One explanation for this could be that we think that some problem is so hard that any solution to it is necessarily so complicated that it could be adapted to solve pretty much anything. When we realize that that isn't the case, we stop calling it AI.

I don't think OpenCV really solved computer vision to be fair. There's definitely no model out there that can do image-based question & answering as well as a human can, or interpret the contents of an image (parse it, if you will) in an accurate way, with the exception of very few special cases.

Learning to do something is an AI problem.

Writing a program to play Chess is not AI but doing so has helped figure learning out.

Can a submarine swim?

This one can: https://www.youtube.com/watch?v=GGrWHlAm7zM (cartoon submarine character TV show) because it has large human like eyes.

Side note: The title is in reference to this famous paper from 1960: http://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_...

The form of the title has become a common trope.

"Unreasonable Effectiveness Considered Harmful"

'Considered Harmful Essays' Considered Harmful: http://meyerweb.com/eric/comment/chech.html

Since these networks can learn syntax, I'm curious whether they can also be re-purposed as syntax checkers, not just syntax generators. That is, can the syntactical knowledge learned by these models be run in a static classification mode on some input text to recognize the anomalies within and suggest fixes?

What's unreasonable about neural networks (in general, not just recurrent ones) is that we don't really have any theoretical understanding of why they work. In fact, we don't even really understand what sorts of functions neural networks compute.

I'm an absolute layman with regard to AI, so I'd be keen to hear some explanations with regard to the possibility of creating strong AI in silicon.

Might there be properties of our biological brain that silicon can't capture? Is this related to the concept of computability? I'm not suggesting that there is a spiritual or metaphysical component to thinking. I'm not, I'm a materialist through and through. I just wonder if maybe there is some component of non-deterministic behavior occurring inside a brain that our current silicon-based computing does not capture.

Another way to ask this is will we need to incorporate some form of wetware to achieve strong AI?

These are not fully settled questions, though the answer is probably no.

Most researchers believe that brains are Turing machine equivalent, and can therefore be simulated by any other equivalent system. Even Gödel believed this, though he believed the mind had more capabilities than the brain.[1] As a materialist, you would share the commonly accepted view and reject his latter claim.

There is a small minority of philosophers and physicists who believe that there are meaningful quantum reactions happening in the brain, distinguishing them from classical computers.[2] Some recent computer simulations have shown this to be plausible, but the general impression is that it seems unlikely, and we don't have specific evidence of effects of this sort.

Quantum effects of certain sorts are computationally infeasible to simulate with classical computers. And it's theoretically plausible that such effects cannot be reproduced at scale with in-development quantum computer technology, and are only practical with organic chemistry, but again, this is quite a minority view.

It's also possible that classical brain features, such as its massive concurrency or various clever algorithms, prove difficult to replicate or simulate. If these are easy problems to solve, then strong AI may arrive in decades; if very difficult, centuries. In the latter case, it seems plausible that incorporating wetware would be a useful shortcut. But there's good reason to believe that the practical disadvantages of wetware (e.g. keeping it alive, coordinating with its slow "clock speed") overwhelm the computational conveniences.


[1] http://www.hss.cmu.edu/philosophy/sieg/onmindTuringsMachines...

[2] http://en.wikipedia.org/wiki/Quantum_mind

Thank you for the detailed response. I'm looking forward to digging into the links you posted.

> There is a small minority of philosophers and physicists who believe that there are meaningful quantum reactions happening

I wonder why this is a minority view. Bear in mind that I am an armchair scientist, but I recall reading that meaningful quantum effects are responsible for the efficiency of photosynthesis. It seems quite plausible (due to the electro-chemical nature of brain functioning) that there might be similar effects present in the brain.

Fascinating stuff.

Isn't the author's definition of RNNs wrong?

I thought the difference is that a RNN allows connection back to previous layers, compared to a feed-forward net. Not this talk about "fixed sizes" and "accepting vectors". Or am I wrong?

Karpathy usually talks about machine learning topics from multiple viewpoints, and usually (in my experience with his writings) prefers more loose, non-traditional interpretations (that ultimately lead to better understanding of the underlying mechanics of the approach).

In this case, his point was that one way RNNs differ from FFNNs is their ability to accept arbitrarily sized inputs and generate arbitrarily sized outputs. That's pretty important, which is likely why he emphasizes it.

But the rest of the article shows the salient point; RNNs are NNs that hold a state vector.

Saying that RNNs are NNs that allow connections back to previous layers is true, but that's only one way of looking at it. Holding state is another, since it implies backwards connections. Feedback is another term. And because they have backwards connections, state, feedback, etc., they also possess the capacity to handle non-fixed-size inputs and outputs.

In summary; it's different viewpoints of the same mathematical object. Karpathy focuses on the ability of RNNs to handle arbitrarily long inputs and outputs, because that's something FFNNs cannot do.
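
To make the "state vector" view concrete, here is a minimal sketch of a vanilla RNN forward pass in plain Python. The weights are hypothetical, hand-picked numbers rather than anything trained, and the names `rnn_step` and `run_rnn` are made up for illustration. The point is only that the same fixed-size weights process a sequence of any length, and the fixed-size state `h` is what carries the history.

```python
import math

def rnn_step(x, h, W_xh, W_hh, b):
    """One vanilla RNN update: h_new = tanh(W_xh @ x + W_hh @ h + b)."""
    h_new = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h[j] for j in range(len(h)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, W_xh, W_hh, b):
    """Feed a sequence of any length through the same weights; return the final state."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h

# Hypothetical hand-picked weights: 2 hidden units, 1 input feature.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.4], [-0.2, 0.3]]
b = [0.0, 0.1]

short_state = run_rnn([[1.0], [0.0]], W_xh, W_hh, b)
long_state = run_rnn([[1.0], [0.0], [1.0], [1.0], [0.0]], W_xh, W_hh, b)
```

Both calls return a 2-element state even though the inputs have different lengths, and feeding the same inputs in a different order gives a different state, which is the "history matters" part.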

I love stuff like this, and I think "unreasonable" is almost an understatement.

It's "unreasonable" mainly because it occasionally captures subtle aspects of the data source for "free". If you've worked with procedurally generated content, Markov chains, and so on, you probably have had to perform a few tweaks in order to get plausible results[1]. From the article, an excerpt of the output from an RNN trained on Shakespeare:

  Second Lord:
  They would be ruled after this chamber, and
  my fair nues begun out of the fact, to be conveyed,
  Whose noble souls I'll have the heart of the wars.

  Come, sir, I will make did behold your worship.

  I'll drink it.
Sure, the individual blocks are similar to what you'd get from a Markov text generator-- but it gets that after a full stop, there comes a newline, a new character name, and a new text block. To my eyes, this is a qualitative leap in performance. It suggests that the model has figured out some things about the data stream that you'd normally have to add in by hand[2].

It's also unreasonable that the same framework works well for so many different data sources. My experience with other generative methods has been that they were fragile and prone to pathological behaviour, and that getting them to work for a specific use case required a bunch of unprincipled hacks[3]. It used to be that when a talk started to veer towards generative models, I'd start looking around the room, wondering whether I could survive the drop from any outside-facing windows. But with RNNs using LSTM (or neural Turing machines!) you can consider incorporating a generative model in the solution you're putting together without having to spend a huge chunk of time massaging it into usefulness and purchasing time on a supercomputer[4].

1. I once wrote a quick Reddit bot with the aim of learning to repost frequent highly upvoted comments and trained it using a simple k-Markov model... it was not good at first, and in order to get it to work I had to do a lot of non-fun stuff like sanitizing input and adding heuristics for when/where to post, and at the end it was mediocre.

2. Alex Graves (from DeepMind) has a demo about using RNNs to "hallucinate" the evolution of Atari games, using the pixels from the screen as inputs. It's interesting because it shows that same sort of tendency to capture the subtle stuff: https://youtu.be/-yX1SYeDHbg?t=2968

3. As in occult knowledge and rules-of-thumb, but you might also read this as a double entendre about myself and my colleagues.

4. Well, you still might need an AWS GPU instance if you don't have a fancy graphics card.
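
For comparison, a k-th order character Markov generator of the kind mentioned above can be sketched in a few lines. This is an illustrative toy (the corpus string and the function names `build_model` and `generate` are made up here), not the Reddit bot from footnote 1. Note that the model only ever looks at the last k characters, so any longer-range structure, such as "a speaker name follows a blank line", would have to be bolted on by hand.

```python
import random
from collections import defaultdict

def build_model(text, k):
    """Map each k-character context to the list of characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - k):
        model[text[i:i + k]].append(text[i + k])
    return model

def generate(model, seed, length, rng):
    """Extend `seed` one character at a time using only the last k characters."""
    k = len(seed)
    out = seed
    for _ in range(length):
        followers = model.get(out[-k:])
        if not followers:  # context never seen in the training text: stop early
            break
        out += rng.choice(followers)
    return out

# Toy corpus, purely for illustration.
corpus = "the cat sat on the mat. the cat ate the rat. "
model = build_model(corpus, k=3)
sample = generate(model, "the", 40, random.Random(0))
```

Every 4-character window of `sample` appears somewhere in `corpus`, which is exactly the limitation: the output is locally plausible but has no memory beyond the context length.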

The Shakespeare generator isn't just reproducing the syntactic structures; it occasionally seems to capture meter. The samples you've reproduced here aren't iambic, but they are around ten or eleven syllables per line, which is impressive enough in itself. In the longer passages, it manages some proper iambic pentameter:

   My power to give thee but so much as hell:
   Some service in the noble bondman here
It doesn't seem to have managed to pick up on rhyming couplets, though.

A quick search of Shakespeare's corpus also shows that Shakespeare never called a bondman 'noble'; there must be some conception of parts of speech being captured by the RNN, to enable it to decide that 'bondman' is a reasonable word to follow 'noble'.

So yes, "unreasonable" seems about right.

I'd imagine the lack of rhyme is likely due to the fact that English pronunciation is ambiguous. Given only the text, it would have no way of picking up the fact that, say, "here" and "beer" rhyme, while "there" does not.

(Put another way, English text is a lossy representation of English speech.)

Perhaps if you were to feed the IPA representation of each word in alongside the text, the RNN would do a bit better, though admittedly I'm not sure how you would do so.

If this is the case, I'd imagine training it against Lojban text would see similar results.

Very relevant recent paper: http://arxiv.org/pdf/1505.04771v1.pdf

DopeLearning: A Computational Approach to Rap Lyrics Generation

This is my deep learning enlightenment moment. 22/05/15

Me too, mesmerised.

My question, and something this doesn't get into, is this: how do you train a RNN?

You need an error signal - a target value is compared with the network's prediction. That error is carefully assigned proportionally to the network weights that contributed to it, and the weights are adjusted a small amount in that direction. This is repeated many times.

Backpropagation suffers from vanishing gradients on very deep neural nets.

Recurrent Neural Nets can be very deep in time.

Or the weights could be evolved using Genetic Programming.

> Backpropagation suffers from vanishing gradients on very deep neural nets.

Especially when using saturating functions (tanh/sigmoid)
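
A rough way to see this, assuming the simplest possible case of a one-unit RNN h_t = tanh(w * h_{t-1} + x): the gradient of the last state with respect to the first is a product of one factor per time step, and each factor is w times the tanh derivative. The function name and the numbers below are illustrative only.

```python
import math

def gradient_through_time(w, steps, h0=0.5, x=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: d tanh(z)/dz = 1 - tanh(z)^2, and dz/dh_prev = w.
        grad *= w * (1.0 - h * h)
    return abs(grad)

# Small recurrent weight: each factor is below 0.9, so the product decays geometrically.
grads = {steps: gradient_through_time(w=0.9, steps=steps) for steps in (1, 10, 50)}

# Large recurrent weight: the unit saturates, 1 - h^2 collapses, and the gradient vanishes anyway.
saturated = gradient_through_time(w=5.0, steps=10)
```

In both regimes the gradient shrinks as the number of unrolled steps grows, which is why a net that is "deep in time" is hard to train with plain backpropagation.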

> Or the weights could be evolved using Genetic Programming

GA, not GP http://en.wikipedia.org/wiki/Genetic_algorithm

> Or the weights could be evolved using Genetic Programming.

Some algorithms, such as NEAT[0], use a genetic algorithm to describe not only the weights on edges in the network, but also the shape of the network itself - e.g., instead of every node of one layer connected to every node of the next, only certain connections are made.

0. http://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_to...
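
As a hedged illustration of evolving weights with a genetic algorithm: the sketch below is a plain GA on a fixed topology (a single tanh neuron learning AND), not NEAT, which also evolves the network structure. The task, population size, mutation scale and function names are arbitrary choices made for the example.

```python
import math
import random

# AND task with targets encoded as -1 / +1 to match tanh's output range.
DATA = [((0, 0), -1.0), ((0, 1), -1.0), ((1, 0), -1.0), ((1, 1), 1.0)]

def error(weights):
    """Sum of squared errors of a single tanh neuron on the AND task."""
    w0, w1, bias = weights
    return sum((math.tanh(w0 * x0 + w1 * x1 + bias) - t) ** 2 for (x0, x1), t in DATA)

def evolve(generations=60, pop_size=20, sigma=0.5, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        parents = population[: pop_size // 4]           # selection: keep the best quarter
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]   # uniform crossover
            child = [w + rng.gauss(0, sigma) for w in child]     # Gaussian mutation
            children.append(child)
        population = parents + children                 # elitism: parents survive unchanged
    population.sort(key=error)
    history.append(error(population[0]))
    return population[0], history

best, history = evolve()
```

Because the best individuals are carried over unchanged, the best error in `history` can never get worse from one generation to the next. No gradients are used anywhere, which is the appeal when gradients vanish or are unavailable.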

It would be interesting to occasionally train the generated C against a compiler.

In "Learning to Execute" by Zaremba & Sutskever (http://arxiv.org/abs/1410.4615), an RNN learns to execute snippets of Python.

Their next paper is "Reinforcement Learning Neural Turing Machines" http://arxiv.org/abs/1505.00521 based on Graves "Neural Turing Machines" http://arxiv.org/abs/1410.5401, which attempts to infer algorithms from the result.

In a lost BBC interview from 1951, Turing reputedly spoke of evolving CPU bitmasks for computation.

It would also be ideal to use a higher-level interpreted language, and have it try to generate one-page scripts rather than giant mega projects like Linux.

there are various ways, but one is to unroll the network about some timestep and treat it as a regular NN. You might find this helpful:


That sounds rather absurdly computationally expensive.

Thanks for the link, I'll take a look.

Imagine a conversion-optimizing genetic algorithm for spam (web and/or email) generation, using a tool like this (e.g., when users perform the intended actions, DNA is passed on to the next iteration).

That would be one positive feedback loop to rule them all.

So, if Neural Networks can be thought of as just an optimized way of implementing unreasonably large dictionaries, Recurrent Neural Networks could be thought of as an optimized way of implementing unreasonably large Markov chains.

I've only read the first section but it seems RNNs are very close in concept to Mealy machines.


> They accept an input vector x and give you an output vector y. However, crucially this output vector's contents are influenced not only by the input you just fed in, but also on the entire history of inputs you've fed in in the past.

If it helps, you can think of a RNN as being analogous to a finite state machine. But instead of a single discrete state, it's a continuous, high-dimensional vector. That has the extremely important effect that the output is a continuous function of the input, which is necessary for training using gradient descent.
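
For reference, here is what the discrete version looks like: a tiny Mealy machine, where each output depends on both the current state and the current input. The parity machine and the `run_mealy` helper are hypothetical examples, not something from the article. An RNN can be read as the same loop with the lookup table replaced by a learned differentiable function and the discrete state replaced by a real-valued vector.

```python
def run_mealy(transitions, start, inputs):
    """transitions maps (state, symbol) -> (next_state, output)."""
    state, outputs = start, []
    for symbol in inputs:
        state, out = transitions[(state, symbol)]
        outputs.append(out)
    return outputs

# Hypothetical example machine: emit the running parity of the 1s seen so far.
PARITY = {
    ("even", 0): ("even", "even"),
    ("even", 1): ("odd", "odd"),
    ("odd", 0): ("odd", "odd"),
    ("odd", 1): ("even", "even"),
}

outputs = run_mealy(PARITY, "even", [1, 0, 1, 1])
# outputs == ["odd", "odd", "even", "odd"]
```

The same input symbol (1) produces "odd" the first time and "even" the second time, because the state differs. A table like this cannot be trained by gradient descent, which is where the continuous state comes in.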

this is quite possibly the most interesting item I've read on HN

Would the returned samples from PG/Shakespeare/Wikipedia examples be of higher quality if you used a word-level language model instead of character model with similar parameters?

I was curious whether the overhead of learning how to spell words (vs. a pure task of sentence construction with word objects) outweighs the reduction in sample set size?

(Awesome article for a RNN newbie)

Karpathy states in the blog post that word-level models currently tend to beat character models, across the broad field of NLP-related RNNs. But he argues that character models will eventually overtake them (much in the same way that ConvNets have "replaced" manual feature extraction).

That said, I think the RNNs here are limited by the corpus. They need to be exposed to more writing. Even if all you want is a Shakespeare generator, you still need to expose it to other literature. That will give it greater context, and more freedom of expression and, dare I say, creativity. I mean, imagine if all you were exposed to your whole life was Shakespeare. Nothing else (no other senses). Even with your superior mind, I doubt you'd generate anything better than what this RNN spits out.

So yeah, it needs a large corpus to build a broader model. Then we need a way to instruct the broadly trained RNN to generate only Shakespeare-like text. Perhaps by adding an "author" or "style" input.

I fail to see how word-based models are character-based models with manual feature extraction. Word boundaries are read directly from deterministically tokenized inputs.

And, as I mentioned upthread, it has been known for about ten years, long before the current neural net revival, that high-order character-based models are competitive with word-based models (at least in terms of perplexity).

"Old-school" Markovian language models (the vast majority of what's being used in production today) are mostly word-based but for text applications with tons of data, high-order character models are competitive with word-based models. (http://www.aclweb.org/anthology/W05-1107)

I found the learning progress great. I was thinking some time ago about how to generate English-sounding words which don't exist. Well, here they are: (from iteration 700)

Aftair, unsuch, hearly, arwage, misfort, overelical, ...

(although I admit, some of them may be just old words I haven't heard of before)

In all the examples on the page, the RNN is first trained and then used to generate the text. Is there a way to use RNNs for something interactive? For instance, can one train an RNN to mimic Paul Graham in a discussion, and not only in writing an essay?

I did have a bit of a chuckle when they got to Algebraic Geometry. That's incredible.

Does anyone know if these are/can be good for named entity recognition? I am stuck implementing second order CRFs right now for the lack of a good implementation, and this seems a lot easier.

I'm not aware of any strong RNN results for NER, no.

You'd probably find the paper here: http://aclweb.org/anthology/ (everything in CL is open access). You want the proceedings of CL, TACL, ACL, EMNLP, EACL, and NAACL. Don't bother with the workshops.

If neural networks are the way to build strong AI and neural nets are all about optimization, wouldn't a quantum computer be ideal to power an AI? (assuming we can get one to work)

I don't think so. NNs have millions of parameters, and making a quantum computer that large, and with that many complex interactions, would be very difficult.

Optimization of NNs isn't really that bad. Stochastic gradient descent is extremely powerful and roughly linear with the number of parameters, possibly better.
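
To illustrate the "roughly linear in the number of parameters" point, here is a minimal SGD sketch on a toy linear model instead of a neural net. The target weights, learning rate and step count are arbitrary assumptions for the example. Each update touches every parameter exactly once, so the cost per step scales with the parameter count.

```python
import random

def sgd_linear(steps=2000, lr=0.05, seed=0):
    """Fit y = w0*x0 + w1*x1 + bias by SGD, one random example per step."""
    rng = random.Random(seed)
    true_w = [2.0, -3.0, 1.0]            # hypothetical target; the last entry is the bias
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]   # the constant 1.0 feeds the bias
        target = sum(t * xi for t, xi in zip(true_w, x))
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - target
        # Gradient of 0.5 * err**2 w.r.t. w[i] is err * x[i]: one multiply-add per parameter.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

weights = sgd_linear()
```

On this noise-free toy problem the weights end up very close to the target values. A real network has a non-convex loss, so this only shows the per-step cost, not any guarantee about where the optimization ends up.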

I've thought a bit about RNNs, and I can see an obvious problem: Fixed amount of memory.

Is there any chance someone's come up with an RNN that has dynamic amounts of memory?

There is this paper by Joulin & Mikolov: Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets (http://arxiv.org/abs/1503.01007).

In this case, the memory of the RNN is an ensemble of differentiable stacks.
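
A simplified sketch of what "differentiable stack" can mean, loosely in the spirit of that paper (the exact parameterization there may differ, and the fixed depth here is just to keep the example short): instead of choosing to push or pop, the network outputs a probability for each action, and the new stack is the probability-weighted blend of the pushed and popped versions.

```python
def soft_stack_update(stack, push_prob, pop_prob, new_value):
    """Blend the 'pushed' and 'popped' versions of a fixed-depth stack.

    stack[0] is the top. push_prob + pop_prob is assumed to equal 1.
    """
    pushed = [new_value] + stack[:-1]   # everything moves down one slot
    popped = stack[1:] + [0.0]          # everything moves up; the bottom is padded with 0
    return [push_prob * p + pop_prob * q for p, q in zip(pushed, popped)]

stack = [0.0, 0.0, 0.0]
stack = soft_stack_update(stack, 1.0, 0.0, 0.7)   # hard push
stack = soft_stack_update(stack, 1.0, 0.0, 0.2)   # hard push
mixed = soft_stack_update(stack, 0.5, 0.5, 0.9)   # half push, half pop
```

With probabilities of exactly 0 or 1 this behaves like an ordinary stack. With anything in between, every slot is a smooth function of the action probabilities, so the push/pop decision can be trained by gradient descent along with the rest of the network.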

There's a huge degree of data re-use in the weights. This should be exploited.

Second, one could envision paging the hidden units back to system memory on a coprocessor-based implementation (GPUs/FPGAs/not Xeon Phi, gag me). 256 GB servers are effectively peanuts these days relative to developer salaries and university grants (datapoint: my grad school work system was ~$100K in 1990 dollars) so unless you're trying to create the first strong AI, I don't think this is a serious constraint.

Good luck with that no matter what Stephen Hawking, Elon Musk, and Nick Bostrom harp on about: we have no idea what the error function for strong AI ought to be and even if we did, it's over a MW using current technology to achieve the estimated FLOPS of a human cerebrum.

I meant that the state vector has constant size and just setting it at the maximum available might give you problems with training.

Nothing you can't work around if you're willing to roll your own code. That said, I agree 100% if you're dependent on someone else's framework...

Look up Neural Turing Machines: connecting neural networks to external banks of memory.

Someone should train an RNN on neural network source code to see if it's possible to get neural networks to generate neural networks.

This felt like watching Ex Machina. Thanks a lot, this was extremely informative and super fun.

I have a dumb question. How is a recurrent neural network different from a Markov model?

Very neat, and funny article. I love the PG generator.
