
Neural Architecture Search with Reinforcement Learning - saycheese
https://arxiv.org/abs/1611.01578
======
ericjang
The primary author is a Google Brain Resident
([https://research.google.com/teams/brain/residency/](https://research.google.com/teams/brain/residency/)).
The Brain Residency is a really great program for starting a career in Deep
Learning and ML research, and I'm really impressed by how quickly these new
researchers churn out great work like this.

disclosure: I work at Google Brain

~~~
cr0sh
I really would like to, someday, do the residency, but at the same time, I
don't know if I can.

Right now, I'm doing the Udacity Self-Driving Car Engineer Nanodegree. I won't
rehash that or my other knowledge here...

My issue, though, is my age, coupled with the fact that I don't have a real
degree (I've got this old associates degree from a tech trade school that is
almost worthless). After I finish the Udacity thing, I have this idea of
pursuing an online BA, then an MA - likely in CompSci. But, we are likely
looking at anywhere from 4 years or more (likely more) to do both things - and
not a small amount of money. After that point, I'll be close to 50 years old.

There's a very good chance that the residency won't be around at that point;
or even the possibility that technology around ML may have
radically changed the world to the point where it would be difficult (or even
meaningless) to try to "catch up"...

Still - I'm not going to let that possibility stop or restrain me; that said,
is it viable for me to even think about this kind of thing - doing such a
residency? Would I even have a chance of being considered, vs someone younger?
Furthermore - how does one do a residency of this nature, while still paying
bills (I live in Phoenix, AZ - I own a house, plus have other bills)?

These are just questions I have - not really concerns, as I am nowhere near
the point I need to be to do the residency - and there isn't any guarantee
that I will be. I'll just continue to enjoy the journey, rather than worrying
about a specific end-goal.

~~~
ericjang
> is it viable for me to even think about this kind of thing - doing such a
> residency? Would I even have a chance of being considered, vs someone
> younger? Furthermore - how does one do a residency of this nature, while
> still paying bills (I live in Phoenix, AZ - I own a house, plus have other
> bills)?

Good on you for pursuing continued education.

Yes it's viable - never lose hope in yourself for _anything_. The Brain
Residency has no age requirement; in fact, the program looks for diverse
academic/career experiences. You'd be surprised how close you are to your
dream job if you are passionate, hardworking, and lucky.

I too, was intimidated by the math & magic of ML when I first took Andrew Ng's
Intro to Machine Learning Course 4 years ago (my first exposure). I took
regular CS courses in college and started focusing on ML 1.5 years ago. Beyond
taking courses, I highly recommend building your own projects - it exercises
your independent research ability and creativity.

The Brain residency pays a good wage for the SF Bay Area (comparable to new
grad salary at average tech firms), so you should be able to pay your bills.
I've heard rumors that many other companies highly invested in ML are
following in Google Brain's example and starting similar residency programs.
Your ML career path need not be at Google, though we're a pretty solid choice
;)

~~~
cr0sh
> I too, was intimidated by the math & magic of ML when I first took Andrew
> Ng's Intro to Machine Learning Course 4 years ago (my first exposure).

Thank you for your kind words and encouragement. I will take them to heart.

I got my first taste of modern machine learning when I took Andrew Ng's ML
Class (sponsored by Stanford), in the fall/winter of 2011; I completed and
passed it. At the same time, I was also taking Thrun and Norvig's AI Class; I
had to drop out due to personal reasons.

In 2012, after Thrun founded Udacity, they released the CS373 "Build Your Own
Self-Driving Vehicle" course (I probably have that title wrong for the time;
they changed the name of the course later) - it was meant as a stand-in for
the original AI Class (I think there were licensing issues or something that
prevented them from presenting it - they later incorporated it as a part of
their offerings). I jumped at the chance, and completed that course as well.

When they announced this Nanodegree course, I knew I had to apply. So far,
things are going well with the course. I'm in the November (2016) cohort, and
currently working on the behavioral cloning project.

I do have in mind several personal projects to pursue, once I can come up for
air from this course (most of my free time has been consumed by it). I do need
more education, though, which is why I want to pursue the BA and MA. Two of my
weak areas are calculus and stats/probabilities - which ML is heavily reliant
on (plus, I really want to understand what is going on "under the covers" as
well).

Onward and upward!

------
gallerdude
I think this is the way that Neural Networks achieve some modicum of
generality - chaining them together.

Let's say you want a robot to grab a can of beer off the counter. You say
"grab that beer" and point to it. The first neural network interprets the
speech and visual input. A second neural network then chooses the proper
neural nets to continue the task, based on the information interpreted by the
first net - it picks one for walking and one for grabbing.
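The chaining idea above could be sketched roughly like this - a toy, hedged illustration only, with plain functions standing in for real trained networks (all names here, like `interpret` and `SKILLS`, are made up for the example):

```python
# Toy sketch of chaining: a top-level "dispatcher" routes a parsed
# command to specialist sub-models. Real neural networks are replaced
# by stand-in functions; every name here is illustrative only.

def interpret(speech, pointing_at):
    """Stand-in for net #1: fuse speech + visual input into an intent."""
    return {"action": "fetch", "object": pointing_at}

SKILLS = {  # stand-ins for specialist networks
    "walk_to": lambda target: f"walking to {target}",
    "grasp": lambda target: f"grasping {target}",
}

def dispatch(intent):
    """Stand-in for net #2: pick and sequence the right specialist nets."""
    if intent["action"] == "fetch":
        return [SKILLS["walk_to"](intent["object"]),
                SKILLS["grasp"](intent["object"])]
    return []

print(dispatch(interpret("grab that beer", "beer can")))
# → ['walking to beer can', 'grasping beer can']
```

The hard part, as the reply below notes, is that real subtasks rarely decompose this cleanly.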

~~~
randcraw
Of course compound task federation like this is not exactly new. Blackboard
systems did the very same thing using expert systems back in the 1970s-80s.

The problems of that era remain today: complex and compound tasks don't
decompose neatly into well-defined constituent subtasks. Sub-task
recombination rapidly devolves into a rat's nest of selecting from competing
subtask components whose semantics aren't clearly distinct, that aren't free
of contextual dependencies, and that don't plug and play independently.

~~~
ced
Do you have any interesting material on this? Where's a good in-depth analysis
of blackboard systems' pros and cons?

~~~
randcraw
I'd start here:
[https://en.wikipedia.org/wiki/Blackboard_system](https://en.wikipedia.org/wiki/Blackboard_system)

I'm not up-to-date, but I haven't heard of new work on blackboards or
federated rule-based systems for maybe 20 years now, after expert systems grew
increasingly probabilistic (and procrustean), and BBSes based on binary rules
showed little sign of escaping the classic brittleness of RBSes.

The wiki article mentions 'Bayesian Blackboard' systems. Maybe they had
greater success?

------
Smerity
Extreme paper tldr - Humans usually construct neural network components and
the graph of how they fit together by hand. This work sets up a "controller"
neural network that constructs two core components in many neural networks, an
RNN and a CNN, through reinforcement learning. This is an intensive and slow
process, requiring 400 CPUs for the RNN search and 800 GPUs for the CNN
search, but it achieves better-than or near state-of-the-art results for
language modeling and vision classification respectively.
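The controller loop can be sketched very roughly as a REINFORCE bandit over a made-up search space - a hedged toy only, not the paper's actual RNN controller, and `reward` here is just a stand-in for "train the child network, return its validation accuracy":

```python
import math
import random

random.seed(0)

# Hypothetical two-decision search space (not the paper's actual space).
SEARCH_SPACE = {
    "filter_size": [3, 5, 7],
    "num_filters": [16, 32, 64],
}

# Controller "parameters": one logit per option for each decision.
logits = {k: [0.0] * len(v) for k, v in SEARCH_SPACE.items()}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_architecture():
    """Sample one child architecture from the controller's distribution."""
    arch, picks = {}, {}
    for key, options in SEARCH_SPACE.items():
        probs = softmax(logits[key])
        idx = random.choices(range(len(options)), weights=probs)[0]
        arch[key], picks[key] = options[idx], idx
    return arch, picks

def reward(arch):
    """Stand-in for 'train the child net, return validation accuracy'."""
    return 1.0 if arch["filter_size"] == 5 else 0.1

baseline, lr = 0.0, 0.3
for _ in range(1000):
    arch, picks = sample_architecture()
    r = reward(arch)
    baseline = 0.9 * baseline + 0.1 * r  # moving-average baseline
    advantage = r - baseline
    # REINFORCE update: push up the log-probability of the sampled
    # options in proportion to how much better than baseline they did.
    for key, idx in picks.items():
        probs = softmax(logits[key])
        for j in range(len(probs)):
            indicator = 1.0 if j == idx else 0.0
            logits[key][j] += lr * advantage * (indicator - probs[j])

print(softmax(logits["filter_size"]))  # mass should concentrate on 5
```

In the real paper the controller is itself an RNN emitting a sequence of architectural tokens, and each "reward" costs a full child-network training run - hence the hundreds of CPUs/GPUs.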

This paper is currently under review for ICLR 2017 and is one of the papers I
was most excited about. I previously wrote an article, "In deep learning,
architecture engineering is the new feature engineering"[1], which discusses
the (often ignored) fact that much of the work in modern deep learning papers
is just assembling components in different combinations. This is one of the
first works that I feel provides a viable answer to my complaint/concern.

The paper itself tackles two problems - first, that optimizing architecture is
usually black magic poorly directed by humans, and second, that humans rarely
spend their time tailoring toward a specific task, instead seeking generality. Zoph
and Le do this by having one neural network generate the architecture for a
later one through a large series of experiments. They perform experiments in
both vision (classification) and text (language modeling), replacing the
convolutional neural network component and the recurrent neural network
component respectively.

The first is that many of the choices regarding constructing the neural network
architecture are somewhat arbitrary and only hit upon experimentally by the
practitioners themselves. Andrej Karpathy noted in one of his lectures
(paraphrased) "Start with an architecture that works, then modify from there"
\- mainly as there's a lot of "black magic" in these architectures that has
only been discovered by spilling blood to the experimental god of a hundred
GPUs and/or "graduate student descent" (i.e. where you lock a poor grad in a
room for an indeterminate period of time and tell them to do better on task
X). Being able to turn to a neural network to run this painful search for you
instead is a good idea - assuming you have the large number of GPUs or CPUs
necessary. In the paper they use 400 CPUs for the language modeling search and
800 GPUs for the CNN classification search!

The other is whether we should generalize or specialize these architectures.
There are many variants of architectures that are not built for or tested
against each possible new task. For example, within recurrent neural networks
(RNNs) we have the RNN/GRU/LSTM/QRNN/RHN/... and a million minor variants
between them, each of which performs slightly differently depending on the
task. While we would like to imagine that the architectures humans make would
get progressively closer to "the perfect generic RNN cell" over time, it makes
sense that certain cells could or should be optimized for a specific task.
Seeking generality isn't always the correct answer. Humans want
to seek generality as we don't have the time to tailor to each specific task -
but what if we could? Maybe in that situation Occam's razor is actually an
impediment to our thinking.

While these are early days, and the approach is hugely resource intensive, it
is likely to become more feasible over time, either as we get more computing
power or become smarter about how we use it. As a researcher in neural
networks, I don't consider this a threat but a useful tool, much as compilers
were a useful tool rather than a threat to assembly programmers.

If people are interested, I can write an article covering many of the details
of this paper like I did for Google's Neural Machine Translation
architecture[2]. In that article I try to step through how these systems work
from the ground up, and the reasoning behind many of the decisions in the
paper, hopefully in an understandable manner for a general audience.

P.S. Merity et al. is one of the baselines they beat in the language modeling
section, so you may read this entire post in a bitter tone if you'd like ;)

P.P.S. This paper has been out since November 2016 or earlier - I think it was
a recent MIT Tech Review article that might have resurfaced it? (oops: wrote
Wired initially, meant MIT Tech Review - thanks @saycheese)

[1]:
[http://smerity.com/articles/2016/architectures_are_the_new_f...](http://smerity.com/articles/2016/architectures_are_the_new_feature_engineering.html)

[2]:
[http://smerity.com/articles/2016/google_nmt_arch.html](http://smerity.com/articles/2016/google_nmt_arch.html)

~~~
saycheese
Found it on MIT Technology Review:

[https://news.ycombinator.com/item?id=13439691](https://news.ycombinator.com/item?id=13439691)

Do you have a link to the Wired article you're thinking of?

------
jorgemf
I would love to see, as a future research project, a Neural Architecture
Search that creates Neural Architecture Searches. Meta-meta-learning. I like
the idea of improving the network which creates other networks.

Also the size of the network can be used as part of the evaluation in order to
minimize the networks and maximize the accuracy.

~~~
westoncb
I would guess the computational complexity involved in training the top level
network would make the project infeasible.

~~~
jorgemf
You can relax the problem to reduce the complexity and computation. For
example: cut off the training of slow-learning networks, keep a database of
already-trained networks, decrease the number of epochs and examples, etc. Or
even create another network which predicts the convergence of a network and
use that as a heuristic. If you also take the size of the network into account
and force the learning to minimize it, then you can train the networks faster.
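One concrete, well-known version of "cut the training of slow learning networks" is successive halving (the core of Hyperband). A toy, hedged sketch, with a made-up `simulated_accuracy` function standing in for actually training candidate child networks:

```python
import math

# Toy successive halving: train all candidates cheaply, keep only the
# top fraction, give the survivors a bigger training budget, repeat.
# Candidates and their learning curves are simulated stand-ins.

def simulated_accuracy(candidate_id, epochs):
    """Stand-in for training candidate `candidate_id` for `epochs`
    epochs and returning validation accuracy; higher ids learn faster."""
    rate = 0.05 * (candidate_id + 1)
    return 1.0 - math.exp(-rate * epochs)

def successive_halving(candidate_ids, min_epochs=1, eta=2, rounds=3):
    survivors = list(candidate_ids)
    epochs = min_epochs
    for _ in range(rounds):
        # Evaluate every survivor at the current (cheap) budget...
        scores = {c: simulated_accuracy(c, epochs) for c in survivors}
        # ...keep only the top 1/eta fraction, and train them longer.
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[:max(1, len(survivors) // eta)]
        epochs *= eta
    return survivors

print(successive_halving(range(8)))  # → [7]
```

The risk, of course, is exactly the convergence-prediction problem mentioned above: a slow starter that would eventually win gets cut early.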

We cannot do it at home, but Google can surely try it on their servers.

------
saycheese
Ran across this research reading this article, "AI Software Learns to Make AI
Software" \- which is already posted here:

[https://news.ycombinator.com/item?id=13436195](https://news.ycombinator.com/item?id=13436195)

------
deepnotderp
This is pretty old, and neural nets can train neural nets too (better than
humans, as usual). Check out "Learning to learn by gradient descent by
gradient descent".

~~~
westoncb
Err.. this paper cites the one you're referring to, so I don't think it's
older (also, compare the dates they were published).

------
jordansmithnz
Wait... if the neural net can design other neural nets, can it be taught to
design itself?

~~~
mastazi
Ironically, this is one of the oldest ideas in computer science
[https://en.wikipedia.org/wiki/Von_Neumann_universal_construc...](https://en.wikipedia.org/wiki/Von_Neumann_universal_constructor)

~~~
Houshalter
Von Neumann's constructor is about self-replication, not quite an AI that can
self-improve.

I think the first person to put the idea forward was I. J. Good in 1962. He
speculated that someday AIs would be good enough to do AI research better than
their human masters. Then they would start making even better AIs. Which would
make even better AIs, and so on. Leading to what he called an "intelligence
explosion". He thought it would be "the last invention that man need ever
make."

[http://web.archive.org/web/20160428183531/http://le-cretin-t...](http://web.archive.org/web/20160428183531/http://le-cretin-transnational.ch/wp-content/uploads/2014/04/Good-Speculations-Concerning-the-First-Ultraintelligent-Machine.pdf)

~~~
mastazi
Parent's comment was about self-replication, I was addressing that.

