
The Bitter Lesson - knbknb
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
======
OliverJones
It's not just AI. Our trade's history is chock-full of valiant efforts to
solve problems that were overrun by the exponential decline of computing costs
by the time they really worked properly.

Remember DSEE / ClearCase? They had all sorts of complicated virtual file
systems to deliver tagged and branched contents of source code repositories.
But drive space expanded with a Moore's-law style curve and now we have "git
pull". Far simpler. System administrators don't hate our guts for adopting
git.

Remember PHIGS? We needed display lists for graphics engines because host
machines were too slow. Silicon Graphics took the other approach, and now we
have GL.

Remember terminal concentrators like Digital's LAT? You don't? Good. I wish I
didn't. (Handling 9600 baud interrupts was too big a load for a host machine.
Really.)

Remember optical typesetting machines? The digital outlines / images for
creating nice-looking letters used to be too big and complex to use for
creating the images for actual pages. You want to use Univers or Gill Sans to
set a document? Fine. Buy a Selectric typeball. Or go pay Linotype or Monotype
a bundle for a little optical thingy with images of all the letters on it.
Take the lid off your typesetting machine and put that thingy into it. You
want to set Japanese? Too bad for you. Apple, Adobe, Chuck Bigelow and Kris
Holmes, and Matt Carter, and Donald Knuth, decided to ride the exponential
rocket and the rest is history.

The bright side: Margaret Hamilton and her team on the Apollo moonshot project
used simple, reliable, radiation-hardened, and redundant computers with
bug-free software to get those guys to the moon and back.

Let's be careful: Generals always fight the last war. Before starting new
valiant efforts, we should carefully assess whether the appropriate technology
for the planned delivery date is JMOS -- Just a Matter of Software. Sometimes
it might not be the case. But most often it will be.

~~~
pjc50
Ironically everything old is new again; now we have
[https://vfsforgit.org/](https://vfsforgit.org/) because repositories have
grown too big to keep everything on one disk, and OpenGL ES gets rid of
immediate mode because communication with the host CPU is too slow.

~~~
blattimwind
OpenGL's immediate mode has been discouraged for a very long time, though.

------
sweezyjeezy
This post oversimplifies the story by putting all the emphasis on compute
power. Deep Blue using brute force to win at chess obviously fits this
pattern, but the others?

Let's take computer vision. Alex Krizhevsky et al destroyed the ImageNet
competition with a neural network in 2012, kicking off the current AI hype
cycle. Essentially everything in their model had been known since the late
80s. But we also didn't know how to train deep networks much before this (it
turned out that how you initialise the neural network was important), and we
also didn't have a big enough dataset to train such a deep model on until
ImageNet. Since then, we have built models that perform another order of
magnitude better than the 2012 model, mainly because of improvements to the
architectures (a combination of ingenuity and a lot of trial and error).
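
(For a concrete flavour of the initialisation point: a minimal sketch, in
Python, of the fan-in-scaled "He" scheme that later became standard for ReLU
layers -- not what Krizhevsky et al. actually used, just an illustration of
why the scaling matters.)

    import numpy as np

    # Hypothetical illustration: scale random weights by the layer's fan-in so
    # activations neither explode nor vanish as networks get deeper.
    def he_init(fan_in, fan_out):
        return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

    W = he_init(fan_in=4096, fan_out=4096)
    print(W.std())  # ~sqrt(2/4096) ~= 0.022, however many layers you stack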

So compute is necessary, but it isn't sufficient; I don't buy that we've
'brute forced' image recognition in the same way as chess.

~~~
sdenton4
Likewise, the search in Go is Monte Carlo tree search, very different from
the kind of search used in chess. And the neural nets in AlphaGo are guiding
where to run the search, which is very, very different from brute-force
search.
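
(To make the "nets guiding the search" point concrete: a toy sketch of the
PUCT selection rule used in AlphaGo-style MCTS. The node/child structure here
is hypothetical.)

    import math

    def select_child(node, c_puct=1.5):
        # Each child carries a visit count N, a mean value Q, and a prior P
        # from the policy network; the prior biases the search toward
        # promising moves instead of expanding everything brute-force.
        total_visits = sum(child.N for child in node.children.values())
        def puct(child):
            return child.Q + c_puct * child.P * math.sqrt(total_visits) / (1 + child.N)
        return max(node.children.values(), key=puct)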

Many of these things have required the giant leaps in compute, but still
wouldn't work at all without the concurrent improvements in algorithms.

Along these lines, here's a classic blog post:
[https://www.johndcook.com/blog/2015/12/08/algorithms-vs-moores-law/](https://www.johndcook.com/blog/2015/12/08/algorithms-vs-moores-law/)

"Grötschel, an expert in optimization, observes that a benchmark production
planning model solved using linear programming would have taken 82 years to
solve in 1988, using the computers and the linear programming algorithms of
the day. Fifteen years later — in 2003 — this same model could be solved in
roughly 1 minute, an improvement by a factor of roughly 43 million. Of this, a
factor of roughly 1,000 was due to increased processor speed, whereas a factor
of roughly 43,000 was due to improvements in algorithms! Grötschel also cites
an algorithmic improvement of roughly 30,000 for mixed integer programming
between 1991 and 2008."
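
(A quick back-of-envelope check of those numbers:)

    # 82 years expressed in minutes is roughly the quoted 43-million-fold
    # speedup, and the two quoted factors multiply out to the same figure.
    print(82 * 365.25 * 24 * 60)   # ~43.1 million minutes
    print(1_000 * 43_000)          # 43,000,000 = hardware x algorithms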

~~~
SmooL
I think the author's point, though, is that all our effort went into
algorithms that do just one thing - search - and that we used that in
conjunction with more compute power.

I'll agree that the author emphasizes compute power, but his real point still
holds. Monte Carlo search may not be classic brute force, and neural networks
guiding it may also not be standard, but the two just let you effectively
search on a massive scale.

------
gwern
This really needs a better title, or at least a subtitle. (I clicked only
because I recognized the domain name; the title itself is vague puffery which
gives no promise of being interesting...)

'compute beats clever'? 'fast > fancy'? 'better big than bright'? 'in the end,
brute force wins'?

Or at least call it "AI's Bitter Lesson" or _something_!

~~~
atomi
I agree with this fellow. The non-descriptive titles are a glaring issue that
needs to be fixed.

------
mirimir
I know ~nothing about AI. But to me, this seems a great summary. And as a one-
time developmental biologist, I'm struck by these observations:

> One thing that should be learned from the bitter lesson is the great power
> of general purpose methods, of methods that continue to scale with increased
> computation even as the available computation becomes very great. The two
> methods that seem to scale arbitrarily in this way are search and learning.

> The second general point to be learned from the bitter lesson is that the
> actual contents of minds are tremendously, irredeemably complex; we should
> stop trying to find simple ways to think about the contents of minds, such
> as simple ways to think about space, objects, multiple agents, or
> symmetries.

From what I know about brain development, "search and learning" are key
mechanisms. Plus massive overproduction and selection, which is basically
learning. Maybe that's the main takeaway from biology.

~~~
beobab
I was thinking the same thing. When I "play" with a new language or tool or
concept, I try lots of different scenarios (search), until I can reliably
predict how the new thing will work (learning).

~~~
mirimir
That's pretty much how our brains develop. Neurons are vastly overproduced,
during fetal development through the first few years. Ones that make useful
connections, and do useful stuff, survive. And the rest die.

Also, as in evolution, ~random variations occur during neuronal proliferation,
so there's also selection on epigenetic differences. The same sort of process
occurs in the immune system.

In this way, organisms can transcend limitations of their genetic sequences.
There's learning at levels of both structure and function.

------
kazinator
> _Time spent on one is time not spent on the other._

But from the AI researcher's view, "the other" doesn't require time; someone
_else_ is advancing the hardware, which is outside of the AI researcher's
area. The general method to be run on better hardware is known today; it
doesn't have to be researched. So should the AI researcher just twiddle their
thumbs, waiting for the hardware to improve?

In games like chess, it has long been known that if you have a big enough
database, an optimal game can be played. For each board configuration, the
entry in the database supplies the optimal move.

So according to the reasoning in this article, we shouldn't bother trying to
improve chess play. The hardware will eventually have enough storage that we
can store all the moves, and be fast enough that we can compute them all to
populate the table. Then twenty lines of Javascript doing hashed lookup of
chess boards with a giant Redis back end will be the grandmaster. :)
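
(In Python rather than Javascript, roughly what that twenty-line grandmaster
would look like -- assuming, of course, the entirely hypothetical precomputed
table:)

    import redis  # assumes a Redis server already holding every position

    r = redis.Redis(host="localhost", port=6379)

    def grandmaster_move(fen):
        # Key the (imaginary) table by the position's FEN string and return
        # the precomputed optimal move.
        move = r.get("chess:" + fen)
        if move is None:
            raise KeyError("position not tabulated yet -- keep waiting for Moore's law")
        return move.decode()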

~~~
mannykannot
I think the author would agree with you that improving chess-playing is not
likely to be a productive domain for advancing AI in the future, and for
essentially the reasons you give here: it is too formal (this guess is based,
in part, on some of the author's other writing).

------
zimablue
You can only reliably say that something which has happened several times is
something that will always happen if you know the reason why it happened
several times. This article seems to think it's Moore's law, which has ended.
I think history tends to go in cycles as people over-index on whatever worked
well for the last n decades.

~~~
bshipp
This seems to be a trait of humanity in general, not just IT. Look at the
finance industry: banks loan based on historical track records, right up until
the bubble pops. Every. Single. Time.

Momentum lends itself to easy statistical support, and paradigm shifts are
notoriously difficult to predict with any degree of confidence.

No matter how long in the tooth a particular trend might be, no matter how
certain you are that a reversal is imminent, it's hard to push against the
weight of trend-line evidence.

------
norswap
I'm suspicious of hindsight bias.

I'm not sure that, had this been written 10-20 years ago, "learning" would
figure so prominently. Who's to say there isn't a third such big method?

Also, while the lesson fits the facts (easy in hindsight), it will hold...
until it doesn't anymore. The end of Moore's law has long been heralded, and
we're starting to enter that era. Progress can be made, probably, but
transistors can't get much tinier, and you can only put so many cores on one
chip. Hardware may continue to provide "free gains", but those will likely be
an order of magnitude (or more?) smaller than before.

~~~
sgt101
In fact I stopped my research in supervised learning and switched to
collaborative agents around '97 because I saw ML as dead-ended. Agents would
be the thing! (Hint: not so much, so far.)

I think Moore's law is interesting. Technically Moore's law is about
transistor density/integration; in effect it became about CPU performance, and
similar phenomena were seen in disk and network performance. Just now we are
seeing a move in general architecture away from spinning rust and towards
chip-based storage - SSDs and Optane (or just huge DRAM) - which has been much
slower than I thought, but is still happening. There will be more progress as
we wring out the opportunities in architecture and network devices, but
overall you are right - no more Moore's.

Also, there's been a wave of progress funded by excitement - it's really hard
to see how Google justified the spend on DeepMind's TPU infrastructure, but
they did - in contrast to a rational investment from a research council, which
would never have bought into AlphaZero and the rest.

There's opportunity to do more - big gaps in datasets, evaluation metrics,
refinement of techniques (mac nets, adversarials, etc.) - but it's back to
hardscrabble now, and I'm interested to see if this is a Warren Buffett
moment. After all, you only see who's wearing shorts when the tide goes out!

------
opticalflow
I hate the term "AI" (even though I am CTO of a company with "AI" in its
name - but since we use machine learning/DCNNs in our systems, it's very
trendy). The problem with "AI" is the "intelligence" part. Intelligence is a
construct like "porn", as in Justice Stewart's famous words about defining the
latter: "...but I know it when I see it". At best, it's very ambiguous -- and
misleading at worst. There have been many attempts to quantitatively and
qualitatively define intelligence, none of which I find particularly
satisfying, and no three scientists in a room will agree on a single
interpretation. My problem with TFA is that it is comparing apples to oranges;
deep convolutional networks are very different tools, useful for a different
subset of problems than Bayesian inference and other statistical methods.
Brute-force methods like image morphology, object counting, and transforms are
useful for yet another set of problems. To say that one has displaced another
is an error; in fact, most useful, modern, production systems use a
combination of all three, each to its purpose. To make direct comparisons
between them while implying that the historical decisions to use one or the
other were due to Moore's Law is a false equivalence.

I clearly need my morning coffee.

------
taneq
> They said that "brute force" search may have won this time, but it was not
> a general strategy

It seems self-evident to me that 'brute force' is the most general strategy
there is. Any (computable) problem is theoretically solvable by just coding
the simplest, most obvious solution, which is usually pretty easy. The run-
time of brute force is sometimes an issue, but that just means you need more
of it!

------
cs702
The majority of businesses and governments are insisting on learning this
bitter lesson anew.

In the minds of many business executives and government officials,
"explainable AI" means, quite literally, "show it to me as a linear
combination of a small number of features" (sometimes called "drivers" or
"factors") that have monotonic relationships with measurable outcomes.

I would go further: most people are understandably _scared_ of and _worried_
about intelligence that arises from scalable search and learning by self-play.

~~~
notacoward
If explainable AI is too limiting, what's the alternative? What's going to
happen when someone gets hauled into court to be held liable for their non-
explainable AI's outcomes? Oh right, I know, they'll hide behind corporate
limited-liability shenanigans, until people get tired of that and go straight
for the guillotines. Or maybe the non-explainable AI's owners will decide they
want to prevent that, and ... do you want Skynet? Because that's how you get
Skynet. Maybe spend some time thinking about the various _awful_ ways this
could play out before concluding that explainability isn't important.

~~~
beefcafe
I love the phrase “explainable AI”. We still can’t explain how our
intelligence works with any degree of biological detail.

~~~
albntomat0
We can't explain the implementation details, but a human system can literally
explain the logic she used to reach a decision. For example, for applications
in the justice system that AI has been recommended for, this is a highly
important quality.

~~~
taneq
Eeeeeeh... what we do is more like parallel construction. We can give a series
of plausible steps to explain where we ended up, but sometimes we can't really
explain why we did some of the steps.

------
tschellenbach
On the other hand, at some point we will want AI to learn from a small number
of interactions, i.e. an AI that beats a human after playing 10 games of
chess/Starcraft etc. Right now it takes millions of training matches. Many
real-world situations don't happen that often, so this fundamentally limits
applications of the current generation of AI.

~~~
beefcafe
Show me a human that can win a Starcraft championship after only playing ten
games. If you find one, they learned the mechanics and strategy somewhere
else. That's transfer learning, which appears to be in its infancy in the ML
community but is making progress.

~~~
albntomat0
The scale matters here. I think a better metric for your parent comment would
be the delta in skill per game played.

A human is significantly better on game 11 than game 1 (I recently got into
Starcraft). Current ML systems are not. It's up for discussion how to take the
human's previous experience into account, but the total amount of experience
is significantly less than the computer's.

------
emilfihlman
This is a horrible post. It advocates just throwing out research and
replacing it with black boxes. Sure, they approximate (or even fully extract)
the actual behaviour, but they are opaque.

I'd like to remind everyone that science is in the business of understanding,
of making things less opaque and less magic, and engineering benefits from
both.

~~~
georgeecollins
I think you missed the point. It is saying that when we build AI systems and
put our understanding of a problem space into the system, we inhibit the
development of a system that can create its own understanding of the problem
space. He gives three very good examples of that. He also explains why people
are tempted to do it: it's satisfying and initially improves the results.

------
olooney
Very interesting article. I've often railed against putting your thumb on the
scale (or even worse, second-guessing) machine learning models by applying too
many so-called "business rules," especially _post hoc_ rules. If the model
doesn't learn _on its own_ what you consider to be the obvious structure of
the data, then either you've chosen the completely wrong model and it won't be
able to learn non-obvious truths either, OR your expectations were wrong and
the "obvious" structure isn't real. Indeed, the model discovering, entirely on
its own, the same structure as a human analyst is often the first evidence we
see that the model works! In any case it does you no good to try and force it
to fit your preconceptions with _post hoc_ adjustments. Either fix your
preconceptions (if they are mistaken) or switch to a model which naturally
agrees with you.

Sutton takes an even more extreme point of view, suggesting that most human
feature engineering is similarly a waste of time. It's hard to argue with if
you know the history: some of the best computer vision algorithms use exactly
two mathematical operations, convolution (which itself only requires addition
and multiplication) and the max(a,b) function. (This is true because both ReLU
and MaxPool can be implemented with max(), and because a fully connected layer
is a special case of a convolution.) A similar story occurred in speech
recognition, with human-designed features like phonemes and MFCCs giving way
to end-to-end learning. Indeed, even general-purpose fully connected neural
networks started to work much better once the biologically motivated sigmoid()
and tanh() were replaced with the much simpler ReLU function, which is just
ReLU(x) = max(x, 0). What _really_ made the difference was leveraging GPUs,
using more data, automating hyperparameter selection, and so on.
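
(A minimal NumPy sketch of that claim -- ReLU and 2x2 max-pooling really are
nothing but max():)

    import numpy as np

    def relu(x):
        return np.maximum(x, 0)  # ReLU(x) = max(x, 0), applied elementwise

    def max_pool_2x2(img):
        # Non-overlapping 2x2 max pooling via reshape -- again just max().
        h, w = img.shape
        return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    x = np.array([[ 1., -2.,  3.,  0.],
                  [-1.,  5., -3.,  2.],
                  [ 0.,  1.,  2., -4.],
                  [ 3., -1.,  0.,  1.]])
    print(max_pool_2x2(relu(x)))  # [[5. 3.] [3. 2.]]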

I'm not sure if there's really a lesson there, or if this trend will hold
indefinitely, and I'm not sure why the lesson would be "bitter" even if it
holds. Certainly opinions are mixed. On the one hand, many researchers such
as Andrew Ng are big proponents of end-to-end learning; on the other hand, no
one can currently conceive of training a self-driving car that way. But
avoiding domain-specific, human-engineered features may be a viable guiding
philosophy for making big, across-the-board advances in machine learning.

~~~
taneq
> Sutton takes an even more extreme point of view, suggesting that most human
> feature engineering is similarly a waste of time.

In fact, wasn't there an article posted here recently saying that they'd had
good results with using learned features to feed traditional non-NN-based
machine learning?

------
zalnyx
OpenAI said as much when discussing their move to a for-profit LP model.

They anticipate that real advances will be made by massively scaling up the
compute power they throw at any given problem. That’s driving their
fundraising efforts.

If the past 5 years are anything to go by, they’re right.

------
sevensor
What's missing in this account is all the interesting stuff that came from the
attempt to emulate human reasoning. Sure, it didn't get us image recognition
or chess mastery, but we have Prolog, much of what we now know as Lisp, and
proof assistants. Deep, powerful tools that augment, but do not replace, human
cognition.

------
gregw2
So if Moore's Law is slowing down and expected to end in 2025 (per its
Wikipedia entry
[https://en.wikipedia.org/wiki/Moore%27s_law](https://en.wikipedia.org/wiki/Moore%27s_law)),
does this "bitter lesson" then need to be reversed?

------
homarp
A reply: A Better Lesson -
[https://rodneybrooks.com/a-better-lesson/](https://rodneybrooks.com/a-better-lesson/)

------
casual_slacker
> Early methods conceived of vision as searching for edges, or generalized
> cylinders, or in terms of SIFT features. But today all this is discarded.

These aren't discarded; they are part of ML vision networks today. Edges are
among the 3x3 convolutions that a network can learn, SIFT and the like
correspond to the dense / clustering nets, and I'll admit I just googled
Generalized Cylinders (very interesting). There are others, like SLAM, as
well.
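
(A quick illustration of the "edges are just one of the 3x3 convolutions"
point: a hand-designed Sobel kernel applied as an ordinary convolution,
exactly the kind of filter a CNN's first layer is free to learn on its own.)

    import numpy as np
    from scipy.signal import convolve2d

    sobel_x = np.array([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

    image = np.zeros((8, 8))
    image[:, 4:] = 1.0                    # a vertical edge down the middle

    edges = convolve2d(image, sobel_x, mode="same")
    print(np.abs(edges).max(axis=0))      # response peaks at the edge columns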

------
cbames89
I learned a similar lesson from working on robots. At first I attempted to
devise methods for Robonaut 2 to do things the way I do, because it was
designed to be like me. It was missing little things that made my approaches
infeasible, and it was infuriating. At that point I decided it only made sense
to build methods that allow the agent to discover its own behaviors, because
its merkwelt and my own will never be the same.

------
CSSer
body { max-width: 50em; margin: 1em auto; }

to make this more readable on a desktop...

~~~
bazzargh
You're not wrong, but I installed bookmarklets on my iPad mini to increase
and decrease font size because of _HN_, which sets lower-than-normal font
sizes. My eyesight isn't great, and this site is the absolute pits for
undersized text without max-width set.

------
known
I think without specialised CPUs, AI will remain futile.

