HEPCloud formation: handling troves of high-energy physics data (doe.gov)
57 points by seycombi on Dec 3, 2016 | 7 comments



Disclosure: I work on Google Cloud, and helped out with this effort.

We're hoping to share a more detailed "How Fermilab did it" write-up later (maybe at NEXT?), but our blog post has a little more information [1]. End-to-end, from "Let's do this!" to Supercomputing was about 3 weeks (including some bug fixes to the GCE support in HTCondor).

For people asking, Fermilab put about 500 TB into a GCS Regional bucket (in us-central1). We have great peering at Google, so I think this was streaming at about 100 Gbps from Fermilab to us.
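
If you want to play with the same pattern at toy scale, something like this works with the google-cloud-storage Python client (bucket and file names here are made up, and the real transfer used far heavier tooling on Fermilab's side):

    # Hypothetical sketch: stage local files into a Regional bucket in us-central1.
    # Bucket/file names are illustrative only.
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("hypothetical-hep-input")  # assumed to already exist in us-central1

    def upload(path):
        blob = bucket.blob(f"run-2016/{path.name}")
        blob.upload_from_filename(str(path))

    files = list(Path("./staging").glob("*.root"))
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(upload, files))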

As surmised, the science we ran was lots of individual tasks, each needing about 2 GB of RAM per vCPU, so during Supercomputing we right-sized the VMs using Custom Machine Types [2]. IIRC, several tasks needed up to 20 GB per task, so we sized the PD-HDD at 20 GB per vCPU. All of the input was read from GCS in parallel via gcsfuse [3]; we chose a Regional bucket to minimize cost per byte and maximize throughput (no reason to replicate elsewhere for this processing).
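
To make the sizing concrete: GCE custom machine types are named custom-<vCPUs>-<memory in MB>, so "2 GB of RAM per vCPU" and "20 GB of PD per vCPU" translate into shapes like this. A toy helper, not anything we actually ran:

    # Toy helper: derive a GCE custom machine type name and scratch-disk size
    # from a vCPU count, using the ratios mentioned above. Purely illustrative.
    def shape_for(vcpus):
        memory_mb = vcpus * 2 * 1024   # 2 GB of RAM per vCPU
        disk_gb = vcpus * 20           # 20 GB of PD-HDD per vCPU
        return f"custom-{vcpus}-{memory_mb}", disk_gb

    print(shape_for(32))  # ('custom-32-65536', 640)
    print(shape_for(16))  # ('custom-16-32768', 320)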

All the data after processing went straight back to Fermilab over that pipe. The output data size was pretty small IIRC, though, and I don't think we were ever much over 10 Gbps on output.

HTCondor was used to submit work from Fermilab onto GCE directly (the submission / schedd boxes were at Fermilab) and to spin up Preemptible VMs. We used a mix of custom-32vCPU-64GB in us-central1-b and us-central1-c, as well as custom-16vCPU-32GB in us-central1-a and us-central1-f. You can see a graph of Fermilab's monitoring here [4] from when it was all set up. 160k vCPUs for $1400/hr!
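
For the curious, the job side looks roughly like any other HTCondor submission. Here's a hedged sketch using the (2016-era) htcondor Python bindings; the executable, file names, and counts are made up, and the actual spin-up of the preemptible GCE workers is handled by HTCondor's GCE support rather than shown here:

    # Rough sketch: queue a batch of single-core jobs sized at ~2 GB of RAM each.
    # Illustrative only; provisioning of the preemptible GCE workers is separate.
    import htcondor

    sub = htcondor.Submit({
        "executable": "run_simulation.sh",   # hypothetical wrapper script
        "arguments": "$(ProcId)",
        "request_cpus": "1",
        "request_memory": "2048",            # MB, i.e. 2 GB per core
        "request_disk": "20 GB",
        "output": "out/$(ProcId).out",
        "error": "out/$(ProcId).err",
        "log": "batch.log",
    })

    schedd = htcondor.Schedd()
    with schedd.transaction() as txn:
        sub.queue(txn, count=1000)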

[Edit: Newlines, I always forget two newlines]

[1] https://cloudplatform.googleblog.com/2016/11/Google-Cloud-HE...

[2] https://cloud.google.com/custom-machine-types/

[3] https://github.com/GoogleCloudPlatform/gcsfuse

[4] https://twitter.com/googlecloud/status/798293201681457154


DOE and CERN have done a really nice job with HEP: the jobs can basically flow to any compute backend without a lot of effort. Instead of saying "we have a private cloud" or "we use the public cloud", they just say "we use all the clouds".


Absolutely. Burt and the rest of the HEPCloud team are doing awesome work. And see above about how we went from 0 to 160k cores in just a few weeks. That's a testament to how prepared those guys were, and points the way to "Where's the best place for me to (re)run this experiment? At home, on Azure, on AWS, on Google?". Now they just have to solve the government budgeting problem! ;).

Disclosure: I work on Google Cloud, and worked with Fermilab on the mentioned Google work.


How does the data get efficiently moved to the spun-up cloud nodes? The article mentions parceling out jobs to nodes with independent storage, but how does that help data-intensive work? Are they doing something besides just copying data to the local storage of the cloud nodes? It seems like that would be a bottleneck.


See my note above about our peering with Fermilab; one thing I failed to mention there is that us-central1 (in Council Bluffs, Iowa) is also quite close to Fermilab (less than 500 miles), so the latency was pretty good too.

[Edit: And I should add that by copying into GCS, they were able to grab compute anywhere in us-central1 without worrying about it. Similarly, on preemption, you don't need to transfer all the data from home again; you just grab it from GCS.]
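
In code, that "just grab it from GCS again" pattern is basically a cache check. A minimal sketch with hypothetical bucket and object names, assuming the google-cloud-storage client:

    # Restart-friendly fetch: if the local scratch copy is missing (e.g. the VM
    # was preempted and replaced), pull the object from GCS again instead of
    # transferring it from the home site.
    import os
    from google.cloud import storage

    def ensure_local(bucket_name, object_name, local_path):
        if not os.path.exists(local_path):
            client = storage.Client()
            blob = client.bucket(bucket_name).blob(object_name)
            blob.download_to_filename(local_path)
        return local_path

    path = ensure_local("hypothetical-hep-input", "run-2016/events-0001.root",
                        "/scratch/events-0001.root")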

Disclosure: I work on Google Cloud and helped with this effort.


Copying and moving data is the bottleneck, but considering the magnitude of the traffic and the type of infrastructure they're using to pass this data, it's basically negligible here.

This is big data, and the analysis of such high-dimensional information is the real issue. (It took my team and me 21 hours to clean/train on Wikipedia data; that is, each iteration was 21 hours long. It took us weeks to deliver results, using 3.7 GHz / 8 GB i5 machines.)

Performance gains come from carefully designing the data stack so that it's splittable, and from making sure your routines implement sane message passing with little waste (a toy sketch of the idea is below).

The rest of it is having football-field-sized computer clusters crunch the hell out of it, which saves you a fuck-ton of time eventually.
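
To make the "splittable" point concrete, here's a toy Python sketch: shard the input up front so independent workers can chew on pieces in parallel (names and sizes are made up; real pipelines shard at the storage layer, but the principle is the same).

    # Toy sharding sketch: break a big list of records into fixed-size chunks
    # and process them in parallel worker processes.
    from multiprocessing import Pool

    def chunks(items, size):
        for i in range(0, len(items), size):
            yield items[i:i + size]

    def process(chunk):
        # stand-in for the expensive clean/train step
        return sum(len(str(record)) for record in chunk)

    if __name__ == "__main__":
        records = list(range(1_000_000))   # pretend corpus
        with Pool(processes=8) as pool:
            results = pool.map(process, list(chunks(records, 50_000)))
        print(sum(results))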


They used gcsfuse. It lets you mount GCS buckets as filesystems and do reads/writes on objects from GCE hosts.

gcsfuse is quite fast. It also caches locally, if you ask it to, so presumably subsequent jobs access input files via a cache hit.
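
For anyone who hasn't used it, the basic flow is just "mount the bucket, then treat it like a directory." A minimal illustration with a hypothetical bucket and path, driving gcsfuse from Python (caching options omitted):

    # Illustration only: mount a GCS bucket with gcsfuse and read an object
    # through ordinary file I/O. Bucket and file names are hypothetical.
    import os
    import subprocess

    bucket = "hypothetical-hep-input"
    mountpoint = "/mnt/gcs"
    os.makedirs(mountpoint, exist_ok=True)

    subprocess.run(["gcsfuse", bucket, mountpoint], check=True)

    with open(f"{mountpoint}/run-2016/events-0001.root", "rb") as f:
        header = f.read(1024)   # reads are translated into GCS object requests
    print(len(header))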



