
The compute and data moats are dead - Smerity
http://smerity.com/articles/2018/limited_compute.html
======
stcredzero
Moats and walls are pretty good analogies here. The magnitude of the barrier
provided by a moat was considerable in the 1300s. By the late 20th century,
it was far less.

[https://www.youtube.com/watch?v=bWMrY49qqDw](https://www.youtube.com/watch?v=bWMrY49qqDw)

In the 11th century, a wooden palisade or an earthen berm fortification could
be held for something like half a year. By the end of WWII, such a
fortification was merely a delaying tactic.

[https://en.wikipedia.org/wiki/Rhino_tank](https://en.wikipedia.org/wiki/Rhino_tank)

Military tactics went through a phase change in the first half of the 20th
century, when the power of mobile mechanized armor and air support greatly
reduced the value of fortifications.

That said, I don't think moats are dead. It's just that the time-scales have
changed.

~~~
blihp
The moats provided by the actual technology in the digital realm have always
been short-lived. That's why you see so much time and money spent on
broadening and extending copyright and patents... that's the long-term moat.

------
ThePhysicist
At PyData DE I just saw an excellent talk about GANs and data augmentation
in image recognition:

[https://www.slideshare.net/FlorianWilhelm2/performance-evalu...](https://www.slideshare.net/FlorianWilhelm2/performance-evaluation-of-gans-in-a-semisupervised-ocr-use-case)

The authors were able to outperform Google ML by a large margin on a vision
task that involved recognizing numbers from car registration documents. With
just 160 manually collected training samples they were able to train a neural
net that recognized characters with 99.7% accuracy. Google ML performed very
poorly in comparison, which surprised me because it didn't seem to be such a
hard recognition task (clean, machine-printed characters on a structured,
green background).
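
To make the augmentation idea concrete, here is a minimal Python sketch of
the general approach using torchvision transforms. The specific transforms
are my own illustrative choices, not the actual pipeline from the talk; the
point is that each of the 160 originals can be perturbed into many plausible
variants.

    import torchvision.transforms as T

    # Illustrative augmentations for document-style character images.
    # The exact transforms the authors used aren't specified here; these
    # are common choices for OCR-type data.
    augment = T.Compose([
        T.RandomAffine(degrees=2, translate=(0.02, 0.02), scale=(0.95, 1.05)),
        T.ColorJitter(brightness=0.2, contrast=0.2),
        T.ToTensor(),
    ])

    # Each pass over an original PIL image yields a slightly different
    # sample, effectively enlarging the training set:
    # augmented = augment(pil_image)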

~~~
QML
Isn't this just the no free lunch theorem? Should you expect a more general
framework to beat a specially trained algorithm?

Another concern is generality: just because it performs well on this dataset
does not mean it will perform well on another.

~~~
ThePhysicist
It really doesn't matter that the model will probably not perform well on
another dataset, as it was built for a specific task.

It's also about flexibility: if Google ML doesn't provide you with a way to
train their algorithms specifically for your use case, it won't help you that
they work well on generic text recognition tasks.

------
aub3bhat
I think you are overgeneralizing the applicability of Neural Architecture
Search etc. and cherry-picking individual examples. There is an enormous gap
between what gets published in academia and what's actually useful.

E.g., the compute wars have only intensified with TPUs and FPGAs. Sure, for
training you might be okay with a few 1080 Tis, but good luck building any
reliable, cheap, and low-latency service that uses DNNs. Similarly, big data
for academia is a few terabytes, but real big data is petabytes of
street-level imagery, video/audio, etc.

~~~
gjstein
Your last comment reminded me of this article [1] on "Google Maps's Moat",
which discusses the _vast_ resources that Google has poured into collecting
data at a global scale to make Google Maps what it is.

[1] [https://www.justinobeirne.com/google-maps-moat/](https://www.justinobeirne.com/google-maps-moat/)

------
korethr
Okay, I have a question about one of his assertions here:

> What may take a cluster to compute one year takes a consumer machine the
> next.

Is that not partly because the hardware is ever improving? I realize this is
a bit of an exaggeration, but does not yesterday's cluster end up fitting
onto the die of tomorrow's GPU? And since it's all on a single die, is not
the overhead of the interconnect drastically reduced? It takes less time to
push information to the next core over when the interconnect is a couple
micrometers of silicon instead of the couple meters of silicon, copper, and
fiber needed when the next core is in the next rack over.

Certainly improving the model will help; who hasn't marvelled at how much
better their code ran after fixing that O(n^2) hot spot? But I can't help
thinking that improving hardware plays a role too.
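
(To be concrete about the kind of fix I mean, here is a toy Python example
of turning an O(n^2) scan into an O(n) one; the function names are made up:)

    # O(n^2): list membership inside a loop rescans the list every time.
    def common_items_slow(a, b):
        return [x for x in a if x in b]      # each 'in b' check is O(n)

    # O(n): hash the second collection once; each lookup becomes O(1).
    def common_items_fast(a, b):
        b_set = set(b)                       # one O(n) pass
        return [x for x in a if x in b_set]  # each check is O(1)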

Am I off base here?

~~~
yetihehe
You're not off base, but most of this "one year effect" is really improved
algorithms, not better hardware. Hardware doesn't improve that fast (nowhere
near 1000x in one year).

~~~
heavenlyblue
"One year effect"? What exactly are they talking about?

The only reason deep learning exists is that by now we've finally learned
how to build GPUs fast enough to run the algorithm invented in 1983.

And let's be honest: most of the current state-of-the-art algorithms only
work today because they have access to massive, scaled-up datasets. You
don't really need to be as smart here any more.

~~~
twtw
I don't mean to diminish the work of whichever researchers published in 1983
(are you thinking of Rumelhart et al. 1986?), but gradient descent and
reverse-mode automatic differentiation are way older than 1983. Adaptive
filters were being "trained" in the 1950s using stochastic gradient descent.

~~~
_0ffh
And backpropagation (using the chain rule) was originally invented in the 60s
in the field of control theory, IIRC.

~~~
twtw
> and reverse mode automatic differentiation

------
QML
Two notes:

1. This article is not talking about how, for neural networks, you can just
use pretrained networks (where the cost of compute and data has already been
incurred) and then run them to classify images or whatnot on your decades-old
computer. Correct? (A sketch of that workflow follows this list.)

2. Oftentimes, problems are "solved" in the sense that they become
irrelevant. Is that also the case here? Compute and data were seemingly the
constraints, but the technology (algorithms) just got more efficient. Should
we not reframe this and say that algorithms are the constraint, then, and
that's what we should aspire to improve? Usually throwing more compute and
data at a problem only yields marginal gains anyhow...
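
(To illustrate note 1, here is a minimal Python sketch of the
pretrained-network workflow with torchvision; the model and preprocessing
are just standard examples, not anything from the article:)

    import torch
    from torchvision import models, transforms

    # Someone else already paid the compute and data cost:
    # the weights ship pretrained.
    model = models.resnet18(pretrained=True)
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Inference on a single image is cheap enough for an old CPU:
    # img = PIL.Image.open("some_image.jpg")
    # with torch.no_grad():
    #     prediction = model(preprocess(img).unsqueeze(0)).argmax(dim=1)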

------
PaulHoule
The real "moat" is more and better training data for commercially useful
tasks.

You can write a lot of papers about Penn Treebank data, but I can't imagine
anything you do with Penn Treebank being commercially useful.

------
fizx
I feel like we're getting these huge gains on tasks that can be made faster
via better architectures, regularization, normalization, data augmentation,
etc., such that he's right.

I just wonder if it will ever feel this way for reinforcement learning.

