
I'm privileged to be an early-science user of Sierra (and Lassen) to pursue lattice QCD calculations while the machine is still in commissioning and acceptance. It's completely absurd. A calculation that took about a year on Titan last year we reproduced and blew past with a week on Sierra.

To add: Titan has 1 GPU per node, and we were getting about 300 GFlop/sec/GPU sustained. Sierra has 4 GPUs per node, and we get about 1.5 TFlop/sec/GPU sustained. (Summit has 6 GPUs per node, also at about 1.5 TFlop/sec/GPU sustained.) So performance went up by about a factor of 20 on a per-node basis. The large memory is not just luxurious but essential too---in our applications it has really helped compensate for the comparatively minor improvements in the communication fabric.
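The per-node factor follows directly from the figures quoted above. A minimal sketch of the arithmetic, counting GPU throughput only (the variable and function names are illustrative, not from any real tool):

```python
# Sustained per-GPU rates and GPU counts as quoted in the comment (GFlop/s).
TITAN = {"gpus_per_node": 1, "gflops_per_gpu": 300.0}
SIERRA = {"gpus_per_node": 4, "gflops_per_gpu": 1500.0}
SUMMIT = {"gpus_per_node": 6, "gflops_per_gpu": 1500.0}


def node_gflops(machine):
    """Sustained GFlop/s per node, counting the GPUs only."""
    return machine["gpus_per_node"] * machine["gflops_per_gpu"]


sierra_speedup = node_gflops(SIERRA) / node_gflops(TITAN)
summit_speedup = node_gflops(SUMMIT) / node_gflops(TITAN)

print(f"Sierra vs Titan, per node: {sierra_speedup:.0f}x")
print(f"Summit vs Titan, per node: {summit_speedup:.0f}x")
```

By the same arithmetic, Summit's 6 GPUs per node would give about a factor of 30 over Titan per node.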

What’s it like to program for it? Do you need to manage the distribution and utilization per node yourself, like a normal HPC farm? Or is it more like a single computer, like a Cray?

It actually doesn't use SLURM, even though SLURM is originally an LLNL development. It uses LSF. I do not know why the shift.


So your process is limited to the resources of a node right? And coordinating data between jobs is via a shared file system or network messaging between nodes?

Depending on the step a LQCD calculation might run on 4, 8, 16, 32, or even more nodes (linear solves and tensor contractions, for example). It's coordinated with OpenMPI and MPI (or equivalent, see my other comment on the software stack). The results from solves are typically valuable and are stored to disk for later reuse. That may prove impractical at this scale.

I'm not sure how big the HMC jobs get on these new machines---it depends on the size of the lattice (which gets optimized for physics but also algorithmic speed / sitting in a good spot for the efficiency of the machine).

Any idea how it compares to Bridges?

I have read about some difficulties of running SLURM on POWER9 systems, so maybe IBM proposed/insisted to run their own POWER9-tested scheduler?

Slurm is endian and word-size agnostic; there's nothing about the POWER9 platform that would be insurmountable for it to run on. There are occasionally some teething issues with NUMA layout and other Linux kernel differences on newer platforms, but this tends to affect everyone, and get resolved quickly.

My understanding is that Spectrum (formerly Platform) LSF was included as part of their proposal.

It could be---I really don't know. SLURM ran the BG/Q very successfully, which was PowerPC.

I'm curious -- is there something special about the calculations which requires bespoke supercomputers rather than large cloud installations?

Likely fast interconnects and higher performance (bare metal vs VMs). Cloud instances tend to have much less consistent performance due to oversubscription of the physical hardware, so you have high "jitter". High performance computing systems (like this) tend to have a better grasp of the required resources and can push the hardware to the max without oversubscribing it too much.

The biggest difference, however, is that the goal for supercomputers such as this is as high an average usage rate as feasible. The cloud is abysmal for this; it is better suited to bursting to, say, 100k-CPU-core jobs. For systems like this, they'd want the average utilization to be 80%+ all the time. The cost of constant cloud computing like this, even using reserved instances, would be a multiple of the 162 million USD it cost to build this. Also, serving the IO patterns you'll see for data at this scale (almost certainly many petabytes) from the cloud isn't nearly as cost-effective as hiring a team to build the storage yourself.

Not being in industry, I forget that AWS doesn't have 100% utilization 100% of the time. One of the reasons the lab likes us lattice QCD people is that we always have more computing to do, and are happy to wait in the queue if we get to run at all. So we really help keep the utilization very high. If the machine ever sits empty, that's a waste of money.

You're right that the IO tends to be very high performance and high throughput, too.

Yup totally understood. We have a much smaller (but still massive) supercomputer for $REAL_JOB where there is always a queue of work to do with embarrassingly parallel jobs or ranks and ranks of MPI work to do. When we add more resources, the users simply can run their work faster, but it never really stops no matter how much hardware we add.

As much as people love to hate them, I'd love to see you get IO profiles remotely similar to what you can get with Lustre or Spectrum Scale (gpfs). They're simply in an entirely different ballpark compared to anything in any public cloud.

We're lucky in the sense that the IO for LQCD is small (compared to other scientific applications), in that we're usually only reading or writing gigabytes to terabytes. But also our code uses parallel HDF5 and it's someone else's job to make sure that works well :)

To add to this, the jitter becomes more and more important the larger the scale you are running. The whole calculation (potentially the whole machine) is sitting around waiting for the slowest single task to finish up. Thus you get counterintuitive architectures, like a dedicated core just to handle networking operations. It seems like a bad deal to throw away 6% of your computation, but the alternative is even worse utilization. Highly coupled calculations are a very different beast because they cannot be executed out of sequence.
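A toy model makes the scaling argument concrete. The sketch below assumes each rank's step takes a fixed base time plus an independent, exponentially distributed delay; that delay distribution and all the parameter values are assumptions chosen for illustration, not measurements from any machine. Because every rank waits for the slowest one, the step time is the maximum over ranks, and that maximum grows with the rank count even though the per-rank jitter is unchanged.

```python
import random


def step_time(n_ranks, rng, base=1.0, mean_jitter=0.05):
    """One bulk-synchronous step: all ranks wait for the slowest rank."""
    # expovariate takes a rate, so 1 / mean_jitter gives delays averaging mean_jitter.
    return max(base + rng.expovariate(1.0 / mean_jitter) for _ in range(n_ranks))


def mean_step_time(n_ranks, n_steps=200, seed=0):
    """Average step time over n_steps synchronized steps."""
    rng = random.Random(seed)
    return sum(step_time(n_ranks, rng) for _ in range(n_steps)) / n_steps


small = mean_step_time(16)
large = mean_step_time(4096)

print(f"16 ranks:   mean step time {small:.3f}")
print(f"4096 ranks: mean step time {large:.3f}")
```

In this model the average rank is only delayed by 5% of the base time, but the larger run pays for the worst delay among thousands of ranks on every step, which is why spending a core on taming jitter can be a net win.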

Ha, I just replied to your other message asking for an AWS comparison. The answer is yes, the communication has to be substantially better than anything you can order up.

LLNL also has national security concerns that are unparalleled by most AWS applications ;)

Supercomputers tend to need very high bisectional bandwidth.

With the Clos network style topologies that are commonplace in large data centers today, I'm not sure one couldn't achieve decent results in the public cloud.

AWS networking is pretty terrible, but in GCP, I can get 2 Gbps per core up to 16 Gbps for an 8-core instance. For any bare metal deployment, I'm going to be maxed out around 100 Gbps, which will be close to saturating an x16 PCIe bus.

It's hard to find a dual-CPU, frequency-optimized processor with fewer than 8 cores, and I'm not sure that'd be cost effective. With hyperthreading, that yields 32 usable cores, or around 3.125 Gbps per core.
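The per-core bandwidth figures can be checked with a few lines. This sketch uses only the numbers quoted in the two comments above; treating the GCP figure as a simple per-core rate with a cap is my reading of the comment, not a statement of GCP's actual policy.

```python
# Bare metal: one 100 Gbps NIC shared by a dual-socket, 8-core, hyperthreaded box.
nic_gbps = 100.0
sockets = 2
cores_per_socket = 8
threads_per_core = 2

hw_threads = sockets * cores_per_socket * threads_per_core
bare_metal_gbps_per_thread = nic_gbps / hw_threads

# Cloud: 2 Gbps per core, capped at 16 Gbps (figures as quoted in the comment).
GCP_GBPS_PER_CORE = 2.0
GCP_CAP_GBPS = 16.0


def gcp_instance_gbps(cores):
    """Instance bandwidth under the assumed per-core rate and cap."""
    return min(cores * GCP_GBPS_PER_CORE, GCP_CAP_GBPS)


print(f"Bare metal: {hw_threads} hardware threads, "
      f"{bare_metal_gbps_per_thread} Gbps each")
print(f"GCP 8-core instance: {gcp_instance_gbps(8)} Gbps total")
```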

Even still, I wager they'd go for better density.

Also, I can get 8 GPUs along with that 8 core/16gbps instance in GCP. Sounds totally doable to me.

My back-of-the-napkin calculation says that I can get 270 petaflops with 2160 n1-highmem-16 instances, each with 8 V100 GPUs, on preemptible instances costing roughly $13k/hr or about $10m/mo.

So with $120m/y in less than 2 years you'd exceed the price of the whole thing and also likely get worse interconnect speed and possibly raw computational speed.
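The napkin numbers in this exchange can be reproduced as follows. Everything here comes from the thread (2160 instances, 8 GPUs each, 270 petaflops, $13k/hr, and the 162 million USD build cost mentioned earlier); the only added assumption is running around the clock for 8760 hours a year.

```python
instances = 2160
gpus_per_instance = 8
total_pflops = 270.0
cost_per_hour_usd = 13_000.0
build_cost_usd = 162e6

HOURS_PER_YEAR = 8760  # assumption: the cluster runs 24 hours a day, 365 days a year

gpus = instances * gpus_per_instance
tflops_per_gpu = total_pflops * 1000.0 / gpus

yearly_usd = cost_per_hour_usd * HOURS_PER_YEAR
monthly_usd = yearly_usd / 12
breakeven_years = build_cost_usd / yearly_usd

print(f"{gpus} GPUs at {tflops_per_gpu:.1f} TFlop/s each")
print(f"${monthly_usd / 1e6:.1f}m per month, ${yearly_usd / 1e6:.1f}m per year")
print(f"Cloud spend passes the build cost after {breakeven_years:.2f} years")
```

Note that the implied per-GPU rate is roughly ten times the 1.5 TFlop/sec/GPU sustained figure quoted earlier in the thread, so the 270 petaflops reads as a peak number rather than what an application would actually achieve.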

AWS looks like a bad deal here.

If you have a constant load, on premise is always cheaper. Scientific computation has, for a large enough organization, a 100%

If you have a variable load, cloud infrastructure may make sense if you can easily auto-scale.

In my experience, most business real world applications are multi tiered applications with variable loads hence are a good fit for cloud infrastructure.

However, attaining the required application flexibility and KPIs for efficient auto scaling is quite hard and requires strong functional & technical expertise.

My experience totally reflects this. Most enterprise IT infrastructure is idle the majority of the time.

I'm running infrastructure for a SaaS app in k8s. I feel like I'm doing well sustaining >50% efficiency, i.e. all cores running >50% all the time and more than half the memory consumed for things that aren't page cache. Hard to get better efficiency without creating hot spots.

That's GCP on preemptible nodes and you are correct.

Not a great deal.

Cloud is great when you have variable usage. These machines are probably driven near 100% all the time. In that scenario, they are probably more cost-effective than cloud infrastructure.

So, Lassen looks to be a 'mini Sierra' with ~20% performance but can be used for unclassified jobs, whereas Sierra is only for classified work, right?

I've heard the software stack includes MPI. Can you tell us more about the software? Is the software pretty efficient?

Edit: It sure isn't popular or easy to talk about from the looks of things.

There's a large stack of community software that was developed with funding from the DOE's SciDAC program.


People use the stack in a variety of different ways---I'll describe my own usage.

There's a message-passing abstraction layer, QMP, sitting over MPI or SMP or what have you (you can compile for your laptop for development purposes, for example). This keeps most of the later layers relatively architecture agnostic.

Over that sits QDP, the data parallel library. Here's where the objects we discuss in quantum field theory are defined. We almost always work on regular lattices. QDP also contains things like "shove everybody one site over in the y direction" (for example).
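To illustrate what "shove everybody one site over in the y direction" means, here is a toy version on a small 2D lattice stored as nested lists. This is not QDP's API; the function name, the `field[y][x]` layout, and the periodic boundary conditions are all assumptions made for the sketch.

```python
def shift_y(field, step=1):
    """Move every site's value `step` sites in +y, wrapping around periodically.

    `field[y][x]` is the value at site (x, y). After the shift, the value that
    was at row y sits at row (y + step) % Ly.
    """
    ly = len(field)
    # New row y is a copy of old row (y - step), taken modulo the lattice extent.
    return [list(field[(y - step) % ly]) for y in range(ly)]


# A lattice with Ly = 3 rows and Lx = 2 columns.
field = [
    [0, 1],
    [2, 3],
    [4, 5],
]

shifted = shift_y(field)
for row in shifted:
    print(row)
```

Shifting by +1 and then by -1 returns the original field, which is a quick sanity check on the wrap-around. In a real data-parallel library the same operation also has to move data between nodes at the boundaries of each node's sub-lattice, which is where the message-passing layer underneath comes in.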

Finally, there's the physics/application layer, where the physics algorithms live. I am most familiar with chroma. QUDA is the GPU library and can talk to most application-layer libraries and has at least simple solvers for most major discretizations people want to use (it also has fancier solvers such as multigrid methods for some discretizations). Code in chroma by and large looks like physics equations, if you had a pain-in-the-ass pedantic student who didn't understand any abuse of notation.

Chroma can be used as a library, so that for your particular project you can do nonstandard things while leveraging everything it can already do.

Other physics layers include CPS, which grew out of the effort at Columbia with QCDSP/QCDOC, MILC (really optimized code for staggered fermions), and others.

The USQCD stack isn't the only one. Another modern lattice field theory package is Grid, developed by a tight collaboration between Intel and the University of Edinburgh: https://github.com/paboyle/grid. There's also openQCD: http://luscher.web.cern.ch/luscher/openQCD/

As for efficiency: it depends on exactly what code you use and what your problem is. BAGEL, which included hand-coded assembly http://usqcd-software.github.io/bagel_qdp/ was getting something like 25% of peak, sustained, for doing linear solves on the BG/Q.

On a POWER8/NVIDIA P100 machine I know QUDA gets 20% of peak, sustained.

Have you been able to use the V100s' 16-bit tensor cores to speed up QCD at all?

I know some Gordon Bell finalists have used these tensor cores for a dramatic speedup. As far as I know, there isn't any QCD code yet. NVIDIA has a small team of developers who write libraries to make QCD screamingly fast https://github.com/lattice/quda/ ; I don't know if/where the tensor cores may be most effectively leveraged.
