I'm not sure how big the HMC jobs get on these new machines---it depends on the lattice size, which is chosen for the physics but also for algorithmic speed and for sitting in a sweet spot for the machine's efficiency.
My understanding is that Spectrum (formerly Platform) LSF was included as part of their proposal.
The biggest difference, however, is that the goal for supercomputers like this is as high an average utilization rate as feasible---they'd want 80%+ all the time. The cloud is a poor fit for that; it shines at bursting to, say, 100k-CPU-core jobs. At constant load, even with reserved instances, the cloud cost would be a multiple of the 162 million USD it cost to build this machine. Also, the IO patterns you'll see for this much data (almost certainly many petabytes) are far less cost-effective in the cloud than hiring a team to build the storage yourself.
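Rough, fully hypothetical arithmetic illustrating the point---the GPU count, price per GPU-hour, and amortization period below are all assumptions, not actual LLNL or cloud-provider figures:

```python
# All numbers below are illustrative assumptions, not actual LLNL or
# cloud-provider figures.
gpus = 17_000              # assumed accelerator count for a machine of this class
price_per_gpu_hour = 1.00  # assumed reserved-instance USD price
hours_per_year = 24 * 365
utilization = 0.80         # the 80%+ sustained-usage target from above

cloud_per_year = gpus * price_per_gpu_hour * hours_per_year * utilization
build_per_year = 162e6 / 5  # $162M build cost amortized over 5 years

print(f"cloud: ~${cloud_per_year / 1e6:.0f}M/year")  # ~$119M/year
print(f"build: ~${build_per_year / 1e6:.0f}M/year")  # ~$32M/year
```

Under these assumptions the sustained cloud bill runs several times the amortized build cost, before even counting storage.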
You're right that the IO tends to be very high performance and high throughput, too.
As much as people love to hate them, I'd love to see you get IO profiles remotely similar to what you can get with Lustre or Spectrum Scale (GPFS). They're simply in an entirely different ballpark from anything in any public cloud.
LLNL also has national security concerns that are unparalleled by most AWS applications ;)
With the Clos network style topologies that are commonplace in large data centers today, I'm not sure one couldn't achieve decent results in the public cloud.
AWS networking is pretty terrible, but in GCP I can get 2 Gbps per core, up to 16 Gbps for an 8-core instance. For any bare-metal deployment, I'm going to be maxed out around 100 Gbps, which will be close to saturating an x16 PCIe bus.
It's hard to find a dual-CPU, frequency-optimized configuration with fewer than 8 cores per socket, and I'm not sure that'd be cost-effective. With hyperthreading, two 8-core CPUs yield 32 usable cores, or around 3.125 Gbps per core out of that 100 Gbps link.
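A quick sanity check of that arithmetic (assuming a dual-socket node with two 8-core hyperthreaded CPUs sharing a single 100 Gbps NIC):

```python
# Per-core bandwidth on an assumed dual-socket, 8-core-per-socket node
# with hyperthreading, feeding one 100 Gbps NIC.
nic_gbps = 100
sockets = 2
cores_per_socket = 8
threads_per_core = 2  # hyperthreading

usable_cores = sockets * cores_per_socket * threads_per_core
print(usable_cores)             # 32
print(nic_gbps / usable_cores)  # 3.125
```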
Even still, I wager they'd go for better density.
Also, I can get 8 GPUs along with that 8-core/16 Gbps instance in GCP. Sounds totally doable to me.
AWS looks like a bad deal here.
If you have a variable load, cloud infrastructure may make sense if you can easily auto-scale.
In my experience, most real-world business applications are multi-tiered with variable loads, and hence a good fit for cloud infrastructure.
However, attaining the application flexibility and KPIs needed for efficient auto-scaling is quite hard and requires strong functional and technical expertise.
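As a sketch of the basic idea, the target-tracking rule used by auto-scalers such as the Kubernetes Horizontal Pod Autoscaler boils down to one line; the utilization numbers in this example are hypothetical:

```python
import math

def desired_replicas(current: int, observed_util: float,
                     target_util: float = 0.6,
                     lo: int = 1, hi: int = 100) -> int:
    """Target-tracking scaling: desired = ceil(current * observed / target)."""
    return max(lo, min(hi, math.ceil(current * observed_util / target_util)))

print(desired_replicas(4, 0.9))  # overloaded -> scale out to 6
print(desired_replicas(4, 0.3))  # underloaded -> scale in to 2
```

The hard part isn't this formula---it's making the application tolerate replicas appearing and disappearing, and picking a target metric that actually tracks load.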
I'm running infrastructure for a SaaS app in k8s. I feel like I'm doing well sustaining >50% efficiency, i.e. all cores running >50% all the time and more than half the memory consumed by things that aren't page cache. It's hard to do better without creating hot spots.
Not a great deal.
Edit: It sure isn't popular or easy to talk about from the looks of things.
People use the stack in a variety of different ways---I'll describe my own usage.
There's a message-passing abstraction layer, QMP, sitting over MPI or SMP or what have you (you can compile for your laptop for development purposes, for example). This keeps most of the later layers relatively architecture agnostic.
Over that sits QDP, the data parallel library. Here's where the objects we discuss in quantum field theory are defined. We almost always work on regular lattices. QDP also contains things like "shove everybody one site over in the y direction" (for example).
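A toy, single-node illustration of that shift operation---this is NumPy, not QDP's actual API; QDP performs the equivalent across MPI ranks via QMP, with periodic wrap-around at the lattice boundary:

```python
import numpy as np

# A tiny 4x4 lattice with one value per site.
lattice = np.arange(16).reshape(4, 4)

# "Shove everybody one site over in the y direction", treating axis 1
# as y, with periodic boundary conditions (values wrap around).
shifted = np.roll(lattice, shift=1, axis=1)

print(lattice[0].tolist())  # [0, 1, 2, 3]
print(shifted[0].tolist())  # [3, 0, 1, 2]
```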
Finally, there's the physics/application layer, where the physics algorithms live. I am most familiar with Chroma. QUDA is the GPU library; it can talk to most application-layer libraries and has at least simple solvers for most major discretizations people want to use (it also has fancier solvers, such as multigrid methods, for some discretizations).
Code in Chroma by and large looks like physics equations, as if written for a pain-in-the-ass pedantic student who didn't tolerate any abuse of notation.
Chroma can be used as a library, so that for your particular project you can do nonstandard things while leveraging everything it can already do.
Other physics layers include CPS (which grew out of the Columbia effort on QCDSP/QCDOC), MILC (heavily optimized code for staggered fermions), and others.
The USQCD stack isn't the only one. Another modern lattice field theory package is Grid, developed in a tight collaboration between Intel and the University of Edinburgh: https://github.com/paboyle/grid. There's also openQCD: http://luscher.web.cern.ch/luscher/openQCD/
On a POWER8/NVIDIA P100 machine I know QUDA gets 20% of peak, sustained.