
Powerful AI Can Now Be Trained on a Single Computer
https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/powerful-ai-can-now-be-trained-on-a-single-computer
======
dan-robertson
Lots of people are focusing on this being done on a particularly powerful
workstation, but the computer described seems to have power at a similar order
of magnitude to the many servers which would be clustered together in a more
traditional large ML computation. Either those industrial research departments
could massively cut costs/increase output by just “magically keeping things in
ram,” or these researchers have actually found a way to reduce the
computational power that is necessary.

I find the efforts of modern academics to do ML research on relatively
underpowered hardware by being more clever about it reminiscent of Soviet
researchers who, lacking anything like the access to computation of their
American counterparts, were forced to be much more thorough and clever in
their analysis of problems in the hope of making them tractable.

~~~
Barrin92
If anything, it seems to me that doing the most work under resource
constraints is precisely what intelligence is about. I've always wondered why
the consumption of compute resources isn't itself treated as a significant
part of the 'reward' in ML tasks.

At least if you're taking inspiration from biological systems, it clearly is
part of the equation, a really important one even.
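
As a rough sketch of what that could look like (the penalty coefficient and
the `compute_fn` argument here are made up for illustration), one could
subtract a compute-cost term from the task reward:

    import time
    
    COMPUTE_PENALTY = 0.001  # made-up coefficient: cost per millisecond of deciding
    
    def shaped_reward(task_reward, compute_fn, observation):
        # Charge the agent for the wall-clock time its policy spends deciding.
        start = time.perf_counter()
        action = compute_fn(observation)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return action, task_reward - COMPUTE_PENALTY * elapsed_ms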

~~~
westurner
Isn't this what Proof of Work incentivizes? Energy efficiency over transistor
count.

------
datameta
The machine used is a 36-core CPU plus a single GPU. So not quite a home
computer yet, but this is some serious progress!

Paper: [https://arxiv.org/abs/2006.11751](https://arxiv.org/abs/2006.11751)

Source: [https://github.com/alex-petrenko/sample-factory](https://github.com/alex-petrenko/sample-factory)

~~~
cbozeman
I dunno... $8,000 builds a 64c/128t, 256 GB RAM workstation with the same GPU
these researchers used
([https://pcpartpicker.com/list/P6WTL2](https://pcpartpicker.com/list/P6WTL2)).
That's arguably in the realm of a home computer for just about anyone making
$90,000 and above, I would think; and anyone working in those fields could
command at least that salary or greater, unless they're in truly entry-level
positions. Seems it would be a reasonable investment for someone actively
working in machine learning / artificial intelligence.

~~~
nqzero
What's the per-hour spot price of this machine on AWS?

~~~
ChuckMcM
Well, an m5ad.24xlarge is 96 threads and 384 GB with your own 2x SSD (saves on
EBS bandwidth costs). So fewer threads than that build, but a bit more memory.
(We'll guess it's a 48-core EPYC 7642 equivalent with 96 threads, since there
is no 96-core version.)

That bad boy costs $0.96/hr on the current spot prices
([https://aws.amazon.com/ec2/spot/pricing/](https://aws.amazon.com/ec2/spot/pricing/))

~~~
qayxc
That instance type is missing the RTX 2080 Ti-equivalent GPU, though...

The closest in performance to that GPU would be the V100 found in
P3-instances.

~~~
sabalaba
Yea, which are $3/hour. Or you could, shameless plug, use Lambda's GPU cloud,
which has V100s for $1.50/hour!

[https://lambdalabs.com/service/gpu-cloud](https://lambdalabs.com/service/gpu-cloud)

I'm embarrassed that I feel so compelled to post this--on a Friday night at
that--I apologize.

~~~
manjunaths
> I'm embarrassed that I feel so compelled to post this--on a Friday night at
> that--I apologize.

Don't be. Finding gems like this is why some of us read the comments.

~~~
6510
Frowning on self-promotion is what got us into those silos.

------
andreyk
Kind of a weird headline: the vision-based RL this article dubs 'powerful AI'
could already easily be trained on a single (pretty expensive) computer. They
say as much in terms of the speedups they provide:

"Using a single machine equipped with a 36-core CPU and one GPU, the
researchers were able to process roughly 140,000 frames per second while
training on Atari videogames and Doom, or double the next best approach. On
the 3D training environment DeepMind Lab, they clocked 40,000 frames per
second—about 15 percent better than second place."

So, not a massive this-is-now-doable speedup.

------
neatze
Maybe my AI tasks are too simplistic, but I never had a problem training AI on
a single machine, and as many have pointed out, there are always cloud
services. Otherwise I find this impressive work; it takes some courage to say,
well, there are so many huge companies with their frameworks, but we can
outperform them all on specific problems.

A couple of things that stood out from the GitHub page:

> Currently we only support homogenous multi-agent envs (same observation/action
> space for all agents, same episode duration).

> For simplicity we actually treat all environments as multi-agent environments
> with 1 agent.

My speculation is that this is why they gained such a dramatic performance
improvement. (But I might be very wrong.)

------
rgovostes
Is SLIDE being used anywhere, or were flaws discovered? It was supposed to
massively accelerate training on CPUs.

[https://www.hpcwire.com/off-the-wire/rice-researchers-algori...](https://www.hpcwire.com/off-the-wire/rice-researchers-algorithm-that-trains-deep-neural-nets-faster-on-cpus-than-gpus/)

------
fxtentacle
I'm surprised that this is IEEE-worthy and not just common sense. Of course
there'll be huge speedups if, and only if, your dataset fits into main RAM and
your model fits into GPU RAM.

But for most state-of-the-art models (think GPT, with billions of parameters),
that is far from being the case.

~~~
p1esk
Yes. The Jukebox model was trained on 512 V100 GPUs for 4 weeks. Try doing
that on an $8k workstation.

Not saying it wouldn't be a worthwhile goal to improve the algorithms so that
it becomes possible. At least on an 8x V100 machine, for Christ's sake. Because
that's all I got.

~~~
qayxc
> At least on a 8x V100 machine, for Christ's sake. Because that's all I got.

Well, that's still one powerful supercomputer, and it allows you to pretrain
BERT from scratch in just 33 hours [1].

I mean, that's $100,000 in hardware you have at your disposal right there,
which is still an order of magnitude beyond $8k-level workstation hardware...

It speaks to the sad affair that is SOTA in ML/AI: only well-funded private
institutions (like OpenAI) or multinational tech giants can really afford to
achieve it.

It's monopolising a technology and papers like this help democratise it again.

Edit: [1] [https://www.deepspeed.ai/news/2020/05/27/fastest-bert-traini...](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html)

~~~
fxtentacle
Yes, it would be great to see AI training become more democratized again, but
with its mere ~2x speedup this paper won't help that much. Plus, the most
expensive part of training a novel AI might well be hiring all the people you
need to create a dataset spanning millions of examples.

~~~
qayxc
Training data isn't always an issue. There are plenty of methods that don't
require labels or use "weakly labelled" data.

Since most contemporary methods only make sense if lots of training data is
available in the first place, many companies interested in trying ML do have
plenty of manually labelled data available to them.

Their issue is often that they don't want to (or can't, for regulatory
reasons) send their data to the public cloud for processing. Any major
speed-up is welcome in these scenarios.

------
RcouF1uZ4gsC
> His group took advantage of working on a single machine by simply cramming
> all the data to shared memory where all processes can access it
> instantaneously.

If you can get all your data into RAM on a single computer, you can have a
huge speedup, even over a cluster that has in aggregate more resources.
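
As a minimal sketch of that pattern (assuming PyTorch; the paper's actual
mechanism may differ in detail): put the dataset tensor into shared memory
once, and every worker process reads the same physical pages, with the
underlying storage never being copied:

    import torch
    import torch.multiprocessing as mp
    
    def worker(shared_data, rank):
        # Each worker indexes into the same shared storage; the data is not copied.
        batch = shared_data[rank * 1000:(rank + 1) * 1000]
        print(rank, batch.mean().item())
    
    if __name__ == "__main__":
        data = torch.randn(4000, 84, 84)  # stand-in for a dataset / replay buffer
        data.share_memory_()              # move the tensor's storage to shared memory
        procs = [mp.Process(target=worker, args=(data, r)) for r in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()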

Frank McSherry has some more about this, though not directly about ML
training.

[http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...](http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

------
gwern
It's very nice optimization and systems engineering work which shows what a
computer can do if you use it _properly_.

One correction: it's not Doom, but ViZDoom, a simplified version designed for
DRL.

------
alex_petrenko
Author of the paper here. Ask me anything!

~~~
YeGoblynQueenne
Hi, and congratulations on the article in IEEE.

The article makes the point that your technique will give an advantage to
academic teams that don't have the resources of big corporations. To me it
seems that your technique optimises the use of available resources, but the
amount of available resources remains the deciding advantage. That is to say,
both large corporate teams and smaller academic teams can improve their use of
resources using your proposed approach, but large corporate teams have more of
those resources than the smaller academic teams. So the large corporate teams
will still come out ahead and the smaller academic teams will still be left
"in the dust", as the article puts it. What do you think?

~~~
highfrequency
The key is that large corporations/labs achieve scale through _many
distributed machines._ This paper explores optimizations that are particular
to a single multi-core machine. These optimizations exploit low-latency shared
memory between threads on one machine, and thus cannot be replicated on a
distributed cluster.

------
bno1
I wish machine learning had become mainstream in a language with more
competent multithreading capabilities than Python. During my machine learning
course I knew I could squeeze more performance out of my code by parallelising
data preprocessing and training (PyTorch), but Python cannot do proper
multithreading. The multiprocessing module requires you to move data between
processes, which is slow.

~~~
rpedela
It shouldn't be slow if shared memory is used.

[https://docs.python.org/3/library/multiprocessing.shared_mem...](https://docs.python.org/3/library/multiprocessing.shared_memory.html)
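
For reference, a minimal sketch of that module with a NumPy array: the child
process attaches to the same buffer by name, so the array itself is never
copied or pickled:

    from multiprocessing import Process, shared_memory
    import numpy as np
    
    def worker(name, shape, dtype):
        # Attach to the existing block by name and wrap it in an array view.
        shm = shared_memory.SharedMemory(name=name)
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        arr *= 2.0  # mutate in place; the parent sees the change
        shm.close()
    
    if __name__ == "__main__":
        data = np.arange(8, dtype=np.float64)
        shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
        view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
        view[:] = data
        p = Process(target=worker, args=(shm.name, data.shape, data.dtype))
        p.start()
        p.join()
        print(view)  # doubled by the child, with no serialization round-trip
        shm.close()
        shm.unlink()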

~~~
bno1
First time I've seen this; it looks like it's brand new (Python 3.8). The
problem now is that you have to serialize/deserialize your data and modify
your numpy/pytorch code to use the shared memory. It's an improvement in
performance, true, but not as fast and easy as just sharing variables between
threads.

------
0xbkt
Off-topic question: where should an indie .IO game developer start to build AI
players for their game?

For the record, I am a solo developer in the process of developing a
soon-to-be-online browser game. I need to make intelligent bots to keep
players busy until the game has a lot of online players.

I had a look at reinforcement learning, but I am not sure people are really
using it for this use case.

~~~
confuseshrink
I would start with David Silver's (DeepMind) YouTube series to get an idea of
what's possible or not.

Running an already trained reinforcement learning agent is relatively cheap
(unless your model is massive).
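
For a sense of scale, a minimal sketch of inference (the network shape,
observation size, and checkpoint file name here are all made up): one small
forward pass per game tick, which takes microseconds on a CPU for a net this
size:

    import torch
    import torch.nn as nn
    
    # Hypothetical tiny policy network; sizes are illustrative only.
    policy = nn.Sequential(
        nn.Linear(16, 64), nn.ReLU(),
        nn.Linear(64, 4),  # 4 discrete actions
    )
    policy.load_state_dict(torch.load("bot_policy.pt"))  # hypothetical checkpoint
    policy.eval()
    
    def choose_action(observation):
        # One forward pass per game tick, no gradients needed.
        with torch.no_grad():
            logits = policy(torch.as_tensor(observation, dtype=torch.float32))
            return int(logits.argmax())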

I suspect the reason people aren't using it yet is that it's (a) really
difficult to get right in training (even basic convergence is not guaranteed
without careful tuning), and (b) really difficult to guarantee reasonable
behavior outside the scenarios you're able to reach in QA.

edit: Link to lecture series:
[https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTra...](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)

------
ladberg
So this basically boils down to keeping your training data in memory? Is there
something else I missed?

~~~
dan-robertson
It looks obvious when you write it like that, but I think many people are
surprised by just how much slower distributed computations can be compared to
non-distributed systems. E.g. the COST paper [1].

[1] [https://www.usenix.org/system/files/conference/hotos15/hotos...](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf)

------
xiaodai
The most misleading article ever. Twice as fast as the best known method ain't
gonna cut it for training AlphaZero on Go, for example.

But still, 140,000 images per second! That's nuts. Even 70,000 is nuts.

I can only play 1,000 games of 2048 in one second. Damn! I am slow.

------
godelski
There's a lot of research that you can get away with on smaller hardware. Much
of my own I can do with a single 2080Ti. BUT at times it is extremely
frustrating and I don't have the memory to tackle some problems with just that
hardware.

If you want to see major improvements in the academic and home side of AI then
NVIDIA and AMD need to bring more memory to consumer hardware. But there isn't
much incentive because gamers don't need nearly the memory that researchers
do.

------
YeGoblynQueenne
I'm reminded of the following episode of Pinky and the Brain:

    
    
      Brain plans to use a "growing ray" (originally a "shrinking ray") to grow
      Pinky into super-size while dressed up as Gollyzilla, while Brain would turn
      himself gigantic and stop him, using the name Brainodo, in exchange for
      world domination. However, the real Gollyzilla emerges from the ocean and
      starts to rampage through the city, making Brain think that the dinosaur is
      Pinky. The episode ends with the ray going out of control and making
      everything on Earth grow, including the Earth itself, to the point that
      Pinky, the Brain, and even Gollyzilla are mouse-sized by comparison again.
    

[https://en.wikipedia.org/wiki/List_of_Pinky_and_the_Brain_ep...](https://en.wikipedia.org/wiki/List_of_Pinky_and_the_Brain_episodes)

Being able to train deep RL models on commodity hardware is only an advantage
if there isn't anyone that can train on more powerful hardware (or if somehow
training on more powerful hardware fails to improve performance with respect
to your model). Otherwise, you're still just a little mouse and they have all
the compute.

~~~
currymj
well, you don't actually have to have your models fight their models. there's
lots of things to try that aren't the SOTA rat race.

like, if you want to show relative improvement of some new variation of an RL
algorithm, this could be a good way to do it. or if you have a new environment
that you want to solve for yourself. right now if you try to train anything in
a moderately interesting environment on a PC, it takes just a little too long
to get results -- makes the whole research process pretty painful.

~~~
YeGoblynQueenne
I'm afraid that computing resources are not and have never been the limiting
factor for innovative work in machine learning in general and in deep learning
in particular. I have quoted the following interview with Geoff Hinton a
number of times on HN - apologies if this is becoming repetitious:

_GH: One big challenge the community faces is that if you want to get a paper
published in machine learning now it's got to have a table in it, with all
these different data sets across the top, and all these different methods
along the side, and your method has to look like the best one. If it doesn't
look like that, it's hard to get published. I don't think that's encouraging
people to think about radically new ideas._

_Now if you send in a paper that has a radically new idea, there's no chance
in hell it will get accepted, because it's going to get some junior reviewer
who doesn't understand it. Or it's going to get a senior reviewer who's trying
to review too many papers and doesn't understand it first time round and
assumes it must be nonsense. Anything that makes the brain hurt is not going
to get accepted. And I think that's really bad._

_What we should be going for, particularly in the basic science conferences,
is radically new ideas. Because we know a radically new idea in the long run
is going to be much more influential than a tiny improvement. That's, I think,
the main downside of the fact that we've got this inversion now, where you've
got a few senior guys and a gazillion young guys._

[https://www.wired.com/story/googles-ai-guru-computers-think-...](https://www.wired.com/story/googles-ai-guru-computers-think-more-like-brains/)

In other words, yes, unfortunately, everything is the SOTA rat race. At least
anything that is meant for publication, which is the majority of research
output.

~~~
currymj
there is some truth in this, frustratingly.

at the same time, if you go to this year's ICML papers and ctrl-F "policy",
there are several RL papers that come up with a new variant on policy gradient
and validate it using only relatively small computing resources on simpler
environments without any claim of being state of the art. probably many would
directly benefit from this well-optimized policy gradient code.

~~~
YeGoblynQueenne
Well, that's encouraging. "Pourvu que ça dure !" ("Let's hope it lasts!", as
Letizia Bonaparte said).

It's funny, but older machine learning papers (most of what was published
throughout the '70s, '80s and '90s) were a lot less focused on beating the
leaderboard and much more on the discovery and understanding of general
machine learning principles. As an example that I just happened to be reading
recently, Pedro Domingos and others wrote a series of papers discussing
Occam's Razor and why it is basically inappropriate in the form in which it is
often used in machine learning (or rather, data mining and knowledge
discovery, since that was back in the '90s). It seems there was a lively
discussion about that, back then.

Ah, the paper:

[https://link.springer.com/article/10.1023/A:1009868929893](https://link.springer.com/article/10.1023/A:1009868929893)

Not innovation, exactly, but not the SOTA rat race, either.

------
lostmsu
Their results on lasertag are very poor. It looks like their network was
incapable of solving it, which probably means that for an apples-to-apples
comparison to IMPALA they need a better network, which might require them to
forgo their throughput advantage.

~~~
alex_petrenko
The network in our work is exactly the same as in the IMPALA paper. The
overall score of our agent is slightly higher; it did better on some envs and
worse on some others. These lasertag levels are exploration problems, and with
a bit of hyperparameter tuning they are not difficult; it's just that the
agent that learned to do 30 different things somehow sucks on these levels.

~~~
lostmsu
Were other problems less about exploration? If not, I'd argue lasertag might
be more important.

In various benchmarks the geometric mean is usually used to compare total
score across different tasks to account for severe issues with specific tasks.
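
To illustrate why the geometric mean matters here (the scores below are made
up): a near-failure on one task drags the geometric mean down in a way the
arithmetic mean hides:

    import numpy as np
    
    # Hypothetical normalized per-environment scores; 1.0 = baseline parity.
    scores = np.array([1.8, 1.2, 0.05, 1.5])   # one env nearly fails
    arithmetic = scores.mean()                  # ~1.14, hides the failure
    geometric = np.exp(np.log(scores).mean())   # ~0.63, punishes it
    print(arithmetic, geometric)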

If your network is the same as IMPALA, why do you show its results with
different hyperparameters? Were some of them necessary for the optimization
(e.g. reduced batch size)?

------
hrgiger
Interesting article; I will check the GitHub repo at the weekend. But assuming
an Atari resolution of 40x192, a single channel, 4-byte floats, and 140,000
images per second, you get 4,300,800,000 bytes, roughly 4 GB of data per
second. I wonder if they talk about complete training time.
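
The back-of-the-envelope arithmetic, using the same assumed resolution and
frame rate:

    # Throughput of raw observation data under the assumptions above.
    h, w, channels, bytes_per_float = 40, 192, 1, 4
    frames_per_second = 140_000
    bytes_per_second = h * w * channels * bytes_per_float * frames_per_second
    print(bytes_per_second / 1e9)  # ~4.3 GB of observations per second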

------
DrNuke
A mid-range gaming laptop can do much more, and better, in 2020 than in 2018,
thanks to newly available batch functions, a number of preliminary hacks, and
much more hardware-friendly frameworks.

~~~
lostmsu
What batch functions are you talking about?

~~~
DrNuke
Have a look at Google Scholar results for a "batch reinforcement learning"
query since 2020 for the latest.

------
jchrisa
Is this an early warning indicator that the cloud-to-edge pendulum has begun
swinging back in the direction of personal computing?

------
imranq
This is incredible, but won’t these techniques just make the gap wider as big
tech firms create even larger models?

------
lihan
Depends on the size of a single computer. Cloud services have machines with
something like 200 cores and 4 TB of RAM, too.

------
rbanffy
It always could be trained on a single computer. It was just a matter of
physical size versus time.

------
vz8
How much RAM did their test workstation have? I can't seem to spot it.

~~~
m463
I'm guessing most of the perf comes from the GPU memory size.

~~~
Enginerrrd
While that's likely true, I generally find that it's quite rare that enough
attention is paid to how information is moved between disk, RAM, CPU and GPU.
And paying close attention to that can be extremely helpful. Taking the RAM up
to 11 can eliminate a lot of the art to it, which is a good thing.

~~~
rbanffy
A machine in this class can easily have a terabyte of RAM. Add a couple Optane
DC sticks and you have enormous storage at exceedingly high bandwidth.

------
del_operator
How many RPi 4’s would you need?

------
fongitosous
With a Titan and a Threadripper, for instance, what would the results be?

------
softwrdethknell
Apple's in-house SoC is the future.

I suspect the cloud has a decade, maybe less, of hype to grift on.

Huge data sets on a personal computer and opt-in data sharing with business,
healthcare, etc. will be the new norm.

Further out, software as we know it will cease to exist as entirely custom
chips per application become the norm. IN TIME.

New hardware wars to capture consumer attention incoming.

~~~
rbanffy
I don't need to own a fast workstation unless I want to continuously train my
models. I can, however, quickly get a cloud instance that's much larger than
that and train the model in a fraction of the time, at a fraction of the cost
of a desktop workstation.

~~~
neatze
And continuously training models is very hard, practicality in RL environment,
even then cloud services over long term is possibly more cost effective
solution, then hosting your own small cluster (few tightly packed racks).

~~~
rbanffy
Still, I don't mind the excuse to get a 128 core dual EPYC with a terabyte of
RAM and wide PCIe flash storage.

But I would rather not have proprietary GPU drivers.

