
220,000 cores and counting: largest ever public cloud job - Sami_Lehtinen
https://cloudplatform.googleblog.com/2017/04/220000-cores-and-counting-MIT-math-professor-breaks-record-for-largest-ever-Compute-Engine-job.html
======
GistNoesis
We recently developed Sienna
([http://sienna.gistnoesis.net/](http://sienna.gistnoesis.net/)) as an
experiment for Easter (which went rather unnoticed by HN) to distribute
computation. It allows ML researchers to rapidly build a cluster over PeerJS
by asking their friends to visit a webpage, so they can share their unused
resources. We are not yet at 220,000 cores but we can keep dreaming :)

~~~
boulos
Cool! If you wanted you could accept donations and run on Preemptible to
supplement spare desktops ;). I do love the SETI@Home vibe though. Especially
since nothing beats free.

------
throwaway2016a
Can someone explain Google Preemptible Instances vs Amazon Spot Instances?

The author says Sutherland was dissuaded by having to specify a spot price
upfront on AWS but I don't see how that is any different than what Google is
doing.

~~~
user5994461
AWS Spot Instances are sold by bidding. The highest bidder takes the
instances, and the price changes all the time.

Google Spot Instances (preemptibles) are a flat 80% off, and that's it. It's
simple.

~~~
takeda
Can't you just bid for 20% of the original price and get the same behavior?

In AWS you simply state how much you're willing to pay to keep the instance
uninterrupted. If someone is willing to pay more, they get the instance and
yours is shut down.

~~~
boulos
With Spot, if the market price goes above .2x retail and your job runs long
enough, then you _never_ succeed.

By comparison, if you are able to create a preemptible VM you are _guaranteed_
to pay .2x retail and then _likely_ (but not guaranteed) to have 24h to do
your work.

tl;dr: A maximum bid of .2x retail isn't equivalent to preemptible. There's
almost certainly a bid for a given instance and a given runtime that would
result in the same price, but it varies over time, space, runtime and
instance shape.

Disclosure: I work on Google Cloud (and launched preemptible VMs)
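The distinction can be sketched as a toy cost model. All prices and the spot-price series below are made up for illustration; this is not actual AWS or GCE billing logic:

```python
# Toy model contrasting AWS Spot bidding with GCE preemptible pricing.
RETAIL = 1.00  # hypothetical on-demand $/hr for some instance type

def spot_runtime(bid, spot_prices):
    """Hours a Spot instance survives: it is reclaimed as soon as
    the market price exceeds your maximum bid."""
    hours = 0
    for price in spot_prices:
        if price > bid:
            break
        hours += 1
    return hours

# With a max bid of .2x retail, any price spike above $0.20 kills the job,
# so a long-running job may never complete.
spot_prices = [0.15, 0.18, 0.25, 0.17]  # hypothetical hourly market prices
print(spot_runtime(0.2 * RETAIL, spot_prices))  # → 2 (reclaimed in hour 3)

# A preemptible VM always costs .2x retail: the price never moves and you
# are never outbid, though the VM may still be preempted for capacity.
PREEMPTIBLE_PRICE = 0.2 * RETAIL
```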

~~~
williamstein
> likely (but not guaranteed)

In my several years of experience using GCE it is indeed _extremely_ likely.
Two years ago it wasn't very likely, but it became very likely about a year
ago. Incidentally, I'm the guy who convinced Drew Sutherland of this article
to use GCE in the first place :-).

------
t3soro
How does any of this math get used?

[http://www.lmfdb.org/Genus2Curve/Q/11664/a/11664/1](http://www.lmfdb.org/Genus2Curve/Q/11664/a/11664/1)

~~~
64738
I'd like to know, too. Not intending to sound irreverent, but does any of that
math matter? (I'm not rude, I'm ignorant:)

~~~
Game_Ender
There is always the existential "Does anything really matter? Everybody dies
in the end." This could have a direct application, or, as with many math
concepts, we could find an application for it in 50-100 years. Despite all of
that, it doesn't have to matter: the process of exploration can have value in
and of itself.

------
massel
Possibly a dumb question, but if it's this embarrassingly parallel, wouldn't
this be a workload more suited for calculation on a GPU? I'm assuming there's
a good reason he's not using one, so could someone who understands this a
little better explain it?

~~~
scott_s
> _if it's this embarrassingly parallel, wouldn't this be a workload more
> suited for calculation on a GPU?_

Maybe, but one does not necessarily follow from the other. Consider the task
of compiling 1 million separate C++ projects. That is obviously embarrassingly
parallel, but not well suited for a GPU. It's trivial to do many compilations
at once, but compiling itself is not easy to parallelize.

That example is obviously contrived, but I think it demonstrates the principle
that it's the computational profile of the core problem that will determine if
you can use a GPU. If the core problem requires 10s of GB of RAM, or it's
excessively branchy code, it may not be well suited for a GPU.
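The compile-farm point can be sketched with a stand-in workload (Collatz step counts here, purely illustrative and not from the article): the tasks are trivially parallel across processes, yet each task's data-dependent branching maps poorly onto a GPU's lockstep execution model.

```python
# Many independent, branchy tasks: embarrassingly parallel across CPU
# processes, but each task's control flow depends on its input, which
# is a poor fit for SIMD-style GPU execution.
from concurrent.futures import ProcessPoolExecutor

def collatz_length(n):
    """Number of Collatz steps to reach 1; branches differently per input."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

if __name__ == "__main__":
    # Embarrassingly parallel: no task communicates with any other.
    with ProcessPoolExecutor() as pool:
        lengths = list(pool.map(collatz_length, range(1, 1001)))
    print(len(lengths))  # → 1000 independent results
```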

~~~
milcron
A great paper which delves into different approaches for parallel computing is
"Three layer cake for shared-memory programming" [0]. They characterize
parallel programming into three broad strategies:

1. SIMD (parallel lanes)

2. Fork-Join (a directed acyclic graph of operations)

3. Message-Passing (a graph of operations)

GPUs are great at SIMD, but bad at the other sorts of parallelism.

[0]
[https://www.researchgate.net/publication/228683178_Three_lay...](https://www.researchgate.net/publication/228683178_Three_layer_cake_for_shared-memory_programming)
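The three strategies can be sketched in a few lines each (illustrative stand-ins, not code from the paper):

```python
# One minimal sketch per strategy.
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# 1. SIMD / data-parallel: one operation applied across many lanes at once.
lanes = np.arange(8)
doubled = lanes * 2              # vectorized, no explicit loop

# 2. Fork-join: fork subtasks, run them, then join the results (a DAG).
with ThreadPoolExecutor() as pool:
    squares = list(pool.map(lambda x: x * x, range(4)))

# 3. Message passing: independent workers exchanging explicit messages
# (threads and queues here; processes or sockets in real systems).
inbox, outbox = queue.Queue(), queue.Queue()

def worker():
    outbox.put(inbox.get() + 1)  # receive, compute, send back

t = threading.Thread(target=worker)
t.start()
inbox.put(41)
reply = outbox.get()
t.join()

print(doubled.tolist(), squares, reply)
```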

~~~
scott_s
This is a good paper, but not quite how I think about it. I use the terms
data-parallel (for SIMD), task-parallel (for fork-join; kinda) and message
passing. GPUs are basically data-parallel machines, but over the years, GPUs
have been getting more and more capable, so I imagine some people out there
are using them for task-parallel workloads.

~~~
claytonjy
Would tensorflow (or similar) count as task-parallel because the computation
graph is a DAG? If so, there's a pretty popular example of task-parallelism
running on GPU's.

~~~
chubot
I would say TensorFlow is a hybrid of two strategies: SIMD and dataflow/DAG.
(I wouldn't say fork-join and dataflow/DAG are synonymous; rather they are
related but different models/APIs).

At the level of a single node, TensorFlow uses Eigen [1]. Eigen is like BLAS,
but it's a C++ template library rather than Fortran. It compiles to various
flavors of SIMD. Nvidia's proprietary CUDA is the SIMD flavor most commonly
used by TensorFlow programs.

At the level of multiple nodes, TensorFlow derives a program graph from your
Python source code, using high level "ops", in the style of NumPy. Then it
distributes the ops across a cluster using a scheduler:

Quote: _Its dataflow scheduler, which is the component that chooses the next
node to execute, uses the same basic algorithm as Dryad, Flume, CIEL, and
Spark._ [2]

Python is the "control plane" and not the "data plane" -- it describes the
logic and dataflow of the program, but doesn't touch the actual data. When you
use NumPy, the C code and BLAS code are the data plane. When you use
TensorFlow, the Eigen and GRPC/protobuf distribution layer are the data plane.

So you can have a big data dataflow system WITHOUT SIMD, like the four systems
mentioned in the quote. And you can have SIMD without dataflow, i.e. if you
are doing it in pure Eigen or procedural/functional R/Matlab/Julia on a single
machine. Languages like R and Julia may have dataflow extensions, but they're
single-threaded/procedural by default as far as I know.

A mathematical way to think of the DAG model is that your program uses a
partial order on computations rather than a total order (the procedural model)
-- this is what gives you parallelism.

So TensorFlow uses both SIMD and dataflow.
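The partial-order idea can be sketched as a toy dataflow evaluator (node names and ops are made up; real TensorFlow schedules compiled ops across devices, not Python lambdas):

```python
# A toy dataflow graph: each node lists its inputs and an op.
# The graph fixes only a *partial* order -- a node runs after its
# inputs -- so independent nodes (like "c" and "d") could run in parallel.
graph = {
    "a": ([], lambda: 2),
    "b": ([], lambda: 3),
    "c": (["a", "b"], lambda a, b: a * b),   # needs a and b
    "d": (["a"], lambda a: a + 10),          # independent of b and c
    "e": (["c", "d"], lambda c, d: c + d),
}

def evaluate(graph, node, cache=None):
    """Evaluate a node by recursively evaluating its inputs first."""
    cache = {} if cache is None else cache
    if node not in cache:
        deps, op = graph[node]
        cache[node] = op(*(evaluate(graph, d, cache) for d in deps))
    return cache[node]

print(evaluate(graph, "e"))  # → 18, i.e. 2*3 + (2+10)
```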

[1]
[http://eigen.tuxfamily.org/index.php?title=Main_Page](http://eigen.tuxfamily.org/index.php?title=Main_Page)

[2]
[http://download.tensorflow.org/paper/whitepaper2015.pdf](http://download.tensorflow.org/paper/whitepaper2015.pdf)

------
snowwrestler
Imagine a Beowulf cluster of these.

~~~
jackcn
/. lives

------
chubot
Wow I'd be interested to see the code for this. Is it just a tiny amount of C
running on a huge number of machines, or is it a big stack of mathematical
libraries?

I guess since it's discrete math it probably doesn't use Fortran/BLAS?

------
venture_lol
Wow, and folks used to calculate the trajectories that put men on the moon
with just pencil and paper :)

------
strictnein
Sadly no note on how much that cost.

~~~
boulos
220,000 cores of preemptible was about $2200/hr (I believe Drew is using
n1-highcpu-32s).

Disclosure: I work on Google Cloud, launched Preemptible (and approved Drew's
quota requests!)
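A quick back-of-the-envelope using the figures in this comment (220,000 cores on n1-highcpu-32s at roughly $2200/hr; the per-VM and per-core rates below are implied by those numbers, not quoted prices):

```python
# Implied per-VM and per-core rates from the comment's figures.
cores = 220_000
cores_per_vm = 32            # n1-highcpu-32
total_per_hour = 2200.0      # dollars

vms = cores // cores_per_vm
print(vms)                          # → 6875 VMs
print(total_per_hour / vms)         # → ~$0.32 per VM-hour
print(total_per_hour / cores)       # → ~$0.01 per core-hour
```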

~~~
Theodores
The fuel bill per hour for the 737 going over my house right now is easily
double $2200. I can visualise what is going on with a plane burning through
fuel/money, but with 220,000 CPU cores it is all imaginary.

How big a house would be needed for such an array of 'cores'? How much
electricity is required, and again can that be compared to aeroplanes, Teslas
or even toasters?

~~~
pjc50
Let's say you're using 2x 10 core CPUs per 1U PC, in a 42U rack. That's 840
cores per rack, 262 racks. You won't fit that in a house, you need a
moderately large datacenter.

At 100W per CPU that's 22kW, which feels low; about a tenth of a Tesla and a
fraction of a plane. That doesn't account for the cooling you'll need!

(You can probably do a _lot_ better, but that's my piano-tuner estimate)

~~~
JonyEpsilon
If it were 10 cores/CPU then 220,000 cores is 22,000 CPUs. At 100W per CPU
that's 2,200,000W = 2.2MW. So more like a train than a Tesla!

~~~
lightcatcher
Yes, and based on the $0.15/kWh comment, about $450/hour of electricity.

This is more in line with what I would have guessed for 220k cores.

------
ramshanker
I guess it must have heated up the environment a little bit in the respective
data centre. All servers in all racks running at full CPU utilisation ought to
be probabilistically rare.

------
plantain
So that's why my preemption rates are so high lately.

~~~
boulos
Nope.

First, preemptible VMs don't preempt each other. So if you got shot it wasn't
Drew directly.

However, when we're full and someone needs to get shot that can be you or it
could be someone else. Drew being there actually makes it more likely that
_he_ would take the heat. But Drew runs all out well off peak (weekend
mornings like our docs encourage!) so unless you had a bad weekend it wasn't
him :).

Disclosure: I work on Google Cloud (and have gone back and forth with Drew
over email).

~~~
AdamJacobMuller
Do you get refunded if you get preempted?

~~~
boulos
You do if you get shot in the first 10 minutes. Otherwise, no. We happen to
preferentially "shoot the young" to minimize work lost:

[https://cloud.google.com/compute/docs/instances/preemptible#...](https://cloud.google.com/compute/docs/instances/preemptible#preemption_selection)

~~~
zaroth
Oh I can see the headlines now...

~~~
boulos
Yeah, nobody likes that I phrase it this way ;).

------
bjd2385
This is really cool!

------
wkoszek
How much? :)

------
IncRnd
Do you always disclose customer information?

~~~
strictnein
There was an article written about him, which he obviously had to approve. How
is disclosing what product he used "customer information"?

~~~
lightbyte
Not to mention the article already says the machine he started with was the
n1-highcpu-32, so it's a pretty safe bet that he continued using it for this.

~~~
IanCal
For those who don't know Google's list of instance types, this is the largest
high-cpu instance that's not in beta.

~~~
boulos
I should clarify: one of the reasons Drew stuck with those instead of larger
(I had recommended customs with 48s at one point) is so that he can use the
same config in all regions. We have some older zones still with Ivy Bridges
and such that won't do >32.

I fully expect that Drew will add a simple override map to say "use 64 threads
here, 48 here, 32 here and so on" but with a goal to minimize _preemptions_.

------
alexnewman
I've done bigger on AWS.

