220,000 cores and counting: largest ever public cloud job (googleblog.com)
161 points by Sami_Lehtinen 181 days ago | 76 comments



We recently developed Sienna (http://sienna.gistnoesis.net/) as an experiment for Easter (which went rather unnoticed by HN) to distribute computation. It lets ML researchers quickly build a cluster over PeerJS by asking their friends to visit a webpage, so they can share their unused resources. We're not yet at 220,000 cores, but we can keep dreaming :)


Cool! If you wanted you could accept donations and run on Preemptible to supplement spare desktops ;). I do love the SETI@Home vibe though. Especially since nothing beats free.


Can someone explain Google Preemptible Instances vs Amazon Spot Instances?

The author says Sutherland was dissuaded by having to specify a spot price upfront on AWS, but I don't see how that is any different from what Google is doing.


AWS Spot Instances are sold by auction: the highest bidders get the instances, and the price changes all the time.

Google Spot Instances (preemptibles) are 80% off and that's it. It's simple.


Can't you just bid for 20% of the original price and get the same behavior?

In AWS you simply specify how much you're willing to pay to keep the instance uninterrupted. If someone is willing to pay more, they will get the capacity and your instance will be shut down.


With Spot, if the market price ever goes above .2x retail and your job runs long enough, then you never succeed.

By comparison, if you are able to create a preemptible VM you are guaranteed to pay .2x retail and then likely (but not guaranteed) to have 24h to do your work.

tl;dr: A maximum bid of .2x retail isn't equivalent to preemptible. There's almost certainly a bid for a given instance for a given runtime that would result in the same price, but it varies over time, zone, runtime, and instance shape.

Disclosure: I work on Google Cloud (and launched preemptible VMs)
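
To illustrate the tl;dr with a toy sketch (Python; every number below is made up for illustration, not real AWS or GCE price data): a bid capped at .2x retail loses the instance whenever the market clears above the cap, while the preemptible price never moves.

    # Toy comparison of a capped Spot bid vs. a flat preemptible discount.
    retail = 1.00                    # hypothetical on-demand $/hr
    preemptible = 0.2 * retail       # flat 80% off, never changes
    bid_cap = 0.2 * retail           # Spot bid capped at the same level

    # Hypothetical hourly Spot market prices over a 6-hour job.
    spot_market = [0.12, 0.15, 0.22, 0.30, 0.14, 0.11]

    hours_survived = [p for p in spot_market if p <= bid_cap]
    print(f"Preemptible: 6h at ${preemptible:.2f}/hr = ${6 * preemptible:.2f} (if not preempted)")
    print(f"Spot, cap ${bid_cap:.2f}: ran {len(hours_survived)}/6 hours "
          f"(killed whenever market > cap), paid ${sum(hours_survived):.2f} at market price")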


> likely (but not guaranteed)

In my several years of experience using GCE, it is indeed extremely likely. Two years ago it wasn't very likely, but it became very likely about a year ago. Incidentally, I'm the guy who convinced the Drew Sutherland of this article to use GCE in the first place :-).


The main factor is the market for this unused CPU time; a bid of 20% could mean something vastly different on each service. The GCE pricing model effectively has a step function applied to the price, whereas on AWS you can bid however you like. How this affects consumer usage, I would guess, is higher cost on AWS due to perceived competition. This is all assuming the supply of spare CPU time is constant.


AWS also tends to have different prices in different regions and for different-sized machines, so there's more work involved in making your 20% bid work.


That does sound pretty nice.


We run specific components of our application on Spot Instances across different regions, and from what I have seen in the last 3 months the price is generally much less than 80%. I'd say it is less than 20% most of the time, but sometimes there is a spike and it goes beyond 150% for 1 to 3 hours; the frequency of these events varies, but I'd say about once a week.

I think AWS might be cheaper in this regard but less predictable, so it is a tradeoff.

Edit: I haven't looked yet into "Spot Fleets".


Not 80%, 80% off.


Whoops, sorry for my confusion!

Then it is definitely interesting :)


Looking at the spot prices on our AWS account (us-east-1), most of the time you're looking at around 25% of full price. Some instance types seem to go lower than this (r4 is about 12%), but it looks like the 'resting point' is around 25% for the ones I skimmed.


Preemptibles are at a fixed discount to the normal rate. Spot instances fluctuate according to demand.

That said, there is still a "market" here: your work vs Google's work. It just shows up as a private floating probability instead of a public floating price.

Most of the time you get the full 24h maximum window.

edit: wording


Just a small correction: it's not about "your work vs Google's", it's about regular VMs versus preemptible. There's no special code to give Googlers (or Alphabet Characters) special treatment with regards to preemption. The chrome clusterfuzz folks are standing side by side with you.

Disclosure: I work on Google Cloud (and launched Preemptible VMs).


Great answers, thank you.

Economically, because the price is fixed and everyone has the same status, wouldn't that have the effect of oversubscription if that price is always below the Amazon spot price? Are you, e.g., legally prevented from using a spot-like bid? I'd imagine spot would balance supply and demand better.


No, we could have done an auction-based pricing model (Google's ads business is clearly a big fan). But I pushed for flat rate and won ;). However, just a note that you don't get oversubscription but rather a "first come, first served" result, which has its own downsides.


Excellent! One of GCP's many advantages (outside of its technical ones) is its simpler pricing model.


My suspicion is that Google's unit economics for compute and network are better than Amazon's, because they've never historically had the ability to push costs onto a customer. It all fell to their own bottom line.

If they can sustain a lower price, picking a fixed discount gives them a marketing edge, even if AWS would come out lower overall. Most humans will pay a premium for certainty -- the certainty effect or "taxicab effect".


There are certainly particular instances in particular regions or zones that are cheaper than the preemptible price we've set. If you look at larger instances, we often are drastically cheaper. Add in the per-minute versus per-hour pricing, and some customers have seen massive wins (I recall one at nearly 10x less).

The distinction, as you've surmised, is that predictability is awfully useful. This 220k run (and Drew's quick 400k run last Sunday) had a predictable price. I doubt that such a run on Spot would have left the market price untouched ;).

That said, we absolutely have excess capacity that an auction-clearing mechanism might fill. But I don't think it's worth the customer pain. Moreover, even Drew has come to us from Spot, suggesting that simplicity at a usually-fair/good price can take market share away (and a reminder that this is a fast-growing market!).


Right.

The basis of budgeting and capacity planning is that everybody who participates in a free market secretly wishes they didn't have to.


Quick question about preemptible instances:

If I were to launch a Kubernetes cluster (via Container Engine) on such VMs, would it have an issue? (As in: under more load, more preemptible VMs would be provisioned automatically up to a previously set maximum, and if a VM was preempted, another would be created to keep my minimum.)

Am I wrong in thinking that this is a much better option than running dedicated VMs?


I did not realise that. On reflection it makes sense.


Looks like there are a number of differences:

https://www.quora.com/What-are-the-key-differences-between-A...


How does any of this math get used?

http://www.lmfdb.org/Genus2Curve/Q/11664/a/11664/1


"The LMFDB makes visible the connections predicted by the Langlands program. "

The Langlands program is the mathematicians' version of "string theory". And what this guy is doing with Google Cloud is comparable to the LHC: searching for new particles (of mathematics) in order to find the deeper unity therein.


I think it has to do with public key cryptography - in particular, elliptic curve cryptography[0], which Schneier mentioned as one of the key topics on the NSA's agenda[1].

[0] https://arstechnica.com/security/2013/10/a-relatively-easy-t... [1] https://www.schneier.com/blog/archives/2015/10/why_is_the_ns...


I'd like to know, too. Not intending to sound irreverent, but does any of that math matter? (I'm not rude, I'm ignorant:)


There is always the existential "Does anything really matter? Everybody dies in the end." This could have a direct application, or, as with many math concepts, we could find an application for it in 50-100 years. Despite all of that, it doesn't have to matter; the process of exploration can have value in and of itself.


Possibly a dumb question, but if it's this embarrassingly parallel, wouldn't this be a workload more suited for calculation on a GPU? I'm assuming there's a good reason he's not using one, so could someone who understands this a little better explain it?


> if it's this embarrassingly parallel, wouldn't this be a workload more suited for calculation on a GPU?

Maybe, but one does not necessarily follow from the other. Consider the task of compiling 1 million separate C++ projects. That is obviously embarrassingly parallel, but not well suited for a GPU. It's trivial to do many compilations at once, but compiling itself is not easy to parallelize.

That example is obviously contrived, but I think it demonstrates the principle that it's the computational profile of the core problem that will determine if you can use a GPU. If the core problem requires 10s of GB of RAM, or it's excessively branchy code, it may not be well suited for a GPU.
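
A rough Python sketch of that distinction (the "compile" below is just a stand-in for branchy, serial, per-task work; it is not a real compiler): the outer loop spreads trivially across CPU cores, but nothing inside a single task maps well onto a GPU.

    # Embarrassingly parallel outer loop over tasks that are individually
    # serial and branch-heavy: easy on many CPU cores, a poor fit for a GPU.
    from concurrent.futures import ProcessPoolExecutor

    def fake_compile(project_id: int) -> int:
        # Stand-in for a compiler pass: data-dependent branching,
        # no big uniform arrays to vectorize.
        total = 0
        for token in range(50_000):
            if (token * project_id) % 7 == 0:
                total += token
            elif token % 3 == 0:
                total -= 1
        return total

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:       # one worker per core
            results = list(pool.map(fake_compile, range(64)))
        print(f"'compiled' {len(results)} projects")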


A great paper that delves into different approaches to parallel computing is "Three layer cake for shared-memory programming" [0]. They characterize parallel programming in terms of three broad strategies:

1. SIMD (parallel lanes)

2. Fork-Join (a directed acyclic graph of operations)

3. Message-Passing (a graph of operations)

GPUs are great at SIMD, but bad at the other sorts of parallelism.

[0] https://www.researchgate.net/publication/228683178_Three_lay...
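
A loose Python analogue of the three strategies, for anyone who wants something concrete (NumPy stands in for SIMD lanes and OS processes stand in for hardware threads; this only shows the shape of each style, not real performance):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor
    from multiprocessing import Process, Queue

    # 1. SIMD / data-parallel: one operation applied across a whole array.
    x = np.arange(1_000_000)
    y = x * 2 + 1                                 # vectorized, no explicit loop

    # 2. Fork-join: independent subtasks forked out, results joined.
    def square(n: int) -> int:
        return n * n

    # 3. Message passing: workers communicate explicitly over channels.
    def worker(inbox: Queue, outbox: Queue) -> None:
        while True:
            item = inbox.get()
            if item is None:                      # shutdown signal
                break
            outbox.put(item + 1)

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:       # fork ...
            squares = list(pool.map(square, range(8)))   # ... and join

        inbox, outbox = Queue(), Queue()
        p = Process(target=worker, args=(inbox, outbox))
        p.start()
        for i in range(4):
            inbox.put(i)
        inbox.put(None)
        replies = [outbox.get() for _ in range(4)]
        p.join()
        print(y[:3], squares, replies)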


You can express some forms of #2 or #3 well on a GPU. It depends upon how wide the graph of tasks is (maximum number of concurrent tasks possible in the graph).

On Nvidia GPUs, 16 to 32 warps per SM x 60 SMs on a P100 gives a lot of hardware threads (1 thread == 1 warp) in flight at once; these are allowed to branch completely independently of each other (I forget the maximum occupancy of a P100's SM in warps at the lowest resource use). Furthermore, you can use global memory atomics and spin-locks for event-driven programming, work stealing, etc. This kind of stuff is used in, e.g., persistent kernels. Of course, the single kernel being run must handle all of the code for all of the tasks. Not easy to write, but possible.


This is a good paper, but not quite how I think about it. I use the terms data-parallel (for SIMD), task-parallel (for fork-join; kinda) and message passing. GPUs are basically data-parallel machines, but over the years, GPUs have been getting more and more capable, so I imagine some people out there are using them for task-parallel workloads.


Would TensorFlow (or similar) count as task-parallel because the computation graph is a DAG? If so, there's a pretty popular example of task-parallelism running on GPUs.


I would say TensorFlow is a hybrid of two strategies: SIMD and dataflow/DAG. (I wouldn't say fork-join and dataflow/DAG are synonymous; rather they are related but different models/APIs).

At the level of a single node, TensorFlow uses Eigen [1]. Eigen is like BLAS, but it's a C++ template library rather than Fortran. It compiles to various flavors of SIMD. Nvidia's proprietary CUDA is the SIMD flavor most commonly used by TensorFlow programs.

At the level of multiple nodes, TensorFlow derives a program graph from your Python source code, using high level "ops", in the style of NumPy. Then it distributes the ops across a cluster using a scheduler:

Quote: Its dataflow scheduler, which is the component that chooses the next node to execute, uses the same basic algorithm as Dryad, Flume, CIEL, and Spark. [2]

Python is the "control plane" and not the "data plane" -- it describes the logic and dataflow of the program, but doesn't touch the actual data. When you use NumPy, the C code and BLAS code are the data plane. When you use TensorFlow, the Eigen and GRPC/protobuf distribution layer are the data plane.

So you can have a big data dataflow system WITHOUT SIMD, like the four systems mentioned in the quote. And you can have SIMD without dataflow, i.e. if you are doing it in pure Eigen or procedural/functional R/Matlab/Julia on a single machine. Languages like R and Julia may have dataflow extensions, but they're single-threaded/procedural by default as far as I know.

A mathematical way to think of the DAG model is that your program uses a partial order on computations rather than a total order (the procedural model) -- this is what gives you parallelism.

So TensorFlow uses both SIMD and dataflow.

[1] http://eigen.tuxfamily.org/index.php?title=Main_Page

[2] http://download.tensorflow.org/paper/whitepaper2015.pdf
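
A minimal sketch of the control-plane/data-plane split, using the TF 1.x graph API contemporary with that whitepaper (TF 2.x would need tf.compat.v1 for the same calls): the Python lines only declare graph nodes; the kernels do the arithmetic when the session runs.

    import numpy as np
    import tensorflow as tf

    # Control plane: Python declares ops; no data flows yet.
    a = tf.placeholder(tf.float32, shape=(2, 3))
    b = tf.placeholder(tf.float32, shape=(3, 2))
    c = tf.matmul(a, b)               # a graph node, not a result

    # Data plane: the runtime schedules the op and an Eigen/CUDA kernel runs it.
    with tf.Session() as sess:
        result = sess.run(c, feed_dict={a: np.ones((2, 3)),
                                        b: np.ones((3, 2))})
    print(result)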


Good point! Which reminds me that I left off pipeline-parallelism, which is very common in dataflow programming models. And Tensorflow is a dataflow model. But I think that the core computation in Tensorflow programs will tend to be largely data-parallel affairs. That is, I think such programs tend to have a bunch of data-parallel computational kernels connected in a DAG. When I made that comment, I was thinking more of a Cilk style program.

(I work on a dataflow language and system.)


I'd say no, since the purpose of the GPU there is to make the matmul really fast.


That's not a dumb question. As alluded to below, for most people the main challenge is their software stack. Beyond that, 200k+ vCPUs is still a ton, so you'd need hundreds to thousands of GPUs just to match the flops. (Note: I don't know whether Drew's code is vectorised on CPUs or not.)

Put that together with Preemptible VMs (and yes, apologies, we still don't offer GPUs as preemptible) and it's economically rational to use spare CPUs.

Disclosure: I work on Google Cloud.


GPUs are significantly more expensive on GCE; it could just be a cost thing.


Imagine a Beowulf cluster of these.


/. lives


Wow I'd be interested to see the code for this. Is it just a tiny amount of C running on a huge number of machines, or is it a big stack of mathematical libraries?

I guess since it's discrete math it probably doesn't use Fortran/BLAS?


Wow, and folks used to calculate the trajectories that put men on the moon with just pencil and paper :)


Sadly no note on how much that cost.


220,000 cores of preemptible was about $2200/hr (I believe Drew is using n1-highcpu-32s).

Disclosure: I work on Google Cloud, launched Preemptible (and approved Drew's quota requests!)
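
For the curious, the per-machine arithmetic implied by those figures (rounded; the per-VM number is inferred from the quote above, not an official rate card):

    cores = 220_000
    cores_per_vm = 32                 # n1-highcpu-32
    total_per_hour = 2_200            # dollars, from the comment above

    vms = cores // cores_per_vm       # 6,875 instances
    print(f"{vms:,} VMs")
    print(f"~${total_per_hour / vms:.2f}/hr per VM")
    print(f"~${total_per_hour / cores:.3f} per core-hour")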


The fuel bill per hour for the 737 going over my house right now is easily double $2200. I can visualise what is going on with a plane burning through fuel/money but with 220000 CPU cores it is all imaginary.

How big a house would be needed for such an array of 'cores'? How much electricity is required, and again can that be compared to aeroplanes, Teslas or even toasters?


Let's say you're using 2x 10 core CPUs per 1U PC, in a 42U rack. That's 840 cores per rack, 262 racks. You won't fit that in a house, you need a moderately large datacenter.

At 100W per CPU that's 22kW, which feels low; about a tenth of a Tesla and a fraction of a plane. That doesn't account for the cooling you'll need!

(You can probably do a lot better, but that's my piano-tuner estimate.)


22kW of CPU heat would need 22kW of cooling. Refrigerated air conditioning has about 1:3 thermal efficiency, so your cooling system will require ~7.5kW of electricity to provide 22kW of cooling capacity.

That makes about 30kW. At $0.15 / kWh we're talking about $4.5 per hour for the electricity.

Other costs dwarf the energy costs.


If it were 10 cores/CPU then 220,000 cores is 22,000 CPUs. At 100W per CPU that's 2,200,000W = 2.2MW. So more like a train than a Tesla!


Yes, and based on the $0.15/kWh comment, about $450/hour of electricity.

This is more in line with what I would have guessed for 220k cores.


So ~$300/hour in electricity.
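
Putting the corrected numbers in one place (all round assumptions taken from this subthread: 10 cores/CPU, 100W/CPU, ~1:3 cooling efficiency, $0.15/kWh):

    cores = 220_000
    cores_per_cpu = 10
    watts_per_cpu = 100
    dollars_per_kwh = 0.15

    cpus = cores / cores_per_cpu              # 22,000 CPUs
    cpu_kw = cpus * watts_per_cpu / 1000      # 2,200 kW = 2.2 MW
    total_kw = cpu_kw * (1 + 1 / 3)           # add ~1:3 refrigeration overhead

    print(f"{cpu_kw:,.0f} kW of CPU power, ~{total_kw:,.0f} kW with cooling")
    print(f"~${cpu_kw * dollars_per_kwh:,.0f}/hr without cooling, "
          f"~${total_kw * dollars_per_kwh:,.0f}/hr with")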


Note that 22kW is right about what a Tesla will consume cruising at highway speeds. So you could label that as "an efficient cruising car."


I would be shocked if Google wasn't getting 2x 32 core CPUs per 1U at a minimum, which brings down your rack count a bit.


Even if those were ARM cores (currently 192 cores/U @ 1.8 GHz), it would still take about 30 racks. ARM sucks at single-core performance, but it handles parallel workloads very well.


Actually, a 737-700 burns less than $2200 worth of fuel per hour, depending on operating weight, and how good a deal on the fuel the airline receives.

A 737-700 burns approximately 750 gallons per hour at approximately $2.50/gallon for Jet-A1 purchased wholesale: about $1,875/hr for the fuel.


That's really nice info to share; I was thinking almost an order of magnitude higher. Thank you.


I guess it must have heated up the environment a little bit in the respective data centre. All servers in all racks running at full CPU utilisation ought to be probabilistically rare.


So that's why my preemption rates are so high lately.


Nope.

First, preemptible VMs don't preempt each other. So if you got shot it wasn't Drew directly.

However, when we're full and someone needs to get shot, that can be you or it could be someone else. Drew being there actually makes it more likely that he would take the heat. But Drew runs all-out well off peak (weekend mornings, like our docs encourage!), so unless you had a bad weekend it wasn't him :).

Disclosure: I work on Google Cloud (and have gone back and forth with Drew over email).


Do you get refunded if you get preempted?


You do if you get shot in the first 10 minutes. Otherwise, no. We happen to preferentially "shoot the young" to minimize work lost:

https://cloud.google.com/compute/docs/instances/preemptible#...


Oh I can see the headlines now...


Yeah, nobody likes that I phrase it this way ;).


Google Cloud uses per-minute billing IIRC, so I'd assume you just don't pay any more after being preempted.


This is really cool!


How much? :)


Do you always disclose customer information?


We detached this subthread from https://news.ycombinator.com/item?id=14159177 and marked it off-topic.


None of that information is private. The number of cores is in the article, the type of machine he used to use is in the article and the hourly cost of those machines is public knowledge.

At worst, the revelation here is that someone at Google thinks the customer might be using what they've publicly said they used before. And that instance type is the largest high-cpu version available that's not in beta.


There was an article written about him, which he obviously had to approve. How is disclosing what product he used "customer information"?


Not to mention the article already says the machine he started with was these n1-highcpu-32s, so it's a pretty safe bet that he continued using it for this.


For those that don't know google's list of instance types, this is the largest high-cpu instance that's not in beta.


I should clarify: one of the reasons Drew stuck with those instead of larger (I had recommended customs with 48s at one point) is so that he can use the same config in all regions. We have some older zones still with Ivy Bridges and such that won't do >32.

I fully expect that Drew will add a simple override map to say "use 64 threads here, 48 here, 32 here and so on" but with a goal to minimize preemptions.


I've done bigger on AWS.



