
Lawrence Livermore National Lab's powerful new supercomputer - rbanffy
https://www.mercurynews.com/2018/10/27/meet-sierra-livermores-powerful-new-supercomputer/
======
evanb
I'm privileged to be an early-science user of Sierra (and Lassen) to pursue
lattice QCD calculations while the machine is still in commissioning and
acceptance. It's completely absurd. A calculation that took about a year on
Titan last year, we reproduced and blew past in a week on Sierra.

~~~
JackFr
I'm curious -- is there something special about the calculations which
requires bespoke supercomputers rather than large cloud installations?

~~~
SEJeff
Likely fast interconnects and higher performance (bare metal vs VMs). Cloud
instances tend to have a lot less consistent performance due to
oversubscription of the physical hardware, so you have high "jitter". High
performance computing systems (like this) tend to have a better grasp of the
required resources and can push the hardware to the max without
oversubscribing it too much.

The biggest difference, however, is that the goal for supercomputers such as
this is as high an average usage rate as feasible. The cloud is abysmal for
this; it's more for bursting to, say, 100k CPU core jobs. For systems like
this, they'd want the average utilization to be 80%+ all the time. The cost of
constant cloud computing like this, even using reserved instances, would be a
multiple of the 162 million USD it cost to build this. Also, serving the IO
patterns you'll see for large amounts of data like this (almost certainly many
petabytes) isn't nearly as cost effective in the cloud as it is to hire a team
and build it yourself.

~~~
evanb
Not being in industry, I forget that AWS doesn't have 100% utilization 100% of
the time. One of the reasons the lab likes us lattice QCD people is that we
always have more computing to do, and are happy to wait in the queue if we get
to run at all. So we really help keep the utilization very high. If the
machine ever sits empty, that's a waste of money.

You're right that the IO tends to be very high performance and high
throughput, too.

~~~
SEJeff
Yup totally understood. We have a much smaller (but still massive)
supercomputer for $REAL_JOB where there is always a queue of work to do with
embarrassingly parallel jobs or ranks and ranks of MPI work to do. When we add
more resources, the users can simply run their work faster, but it never
really stops no matter how much hardware we add.

As much as people love to hate them, I'd love to see you get IO profiles
remotely similar to what you can get with Lustre or Spectrum Scale (gpfs).
They're simply in an entirely different ballpark compared to anything in any
public cloud.

~~~
evanb
We're lucky in the sense that the IO for LQCD is small (compared to other
scientific applications), in that we're usually only reading or writing
gigabytes to terabytes. Also, our code uses parallel HDF5, and it's someone
else's job to make sure that works well :)
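For the curious, a parallel HDF5 write boils down to something like the sketch
below (plain C with MPI; the file name, dataset shape, and sizes here are made
up for illustration and aren't from our actual code):

    /* Minimal parallel HDF5 sketch: every MPI rank writes its own slice of
     * one shared dataset. Build with the MPI-enabled HDF5, e.g. `h5pcc demo.c`.
     * File name and sizes are illustrative only. */
    #include <hdf5.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Open one file collectively across all ranks. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("lattice.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One global 1-D dataset; each rank owns a contiguous slice of it. */
        const hsize_t per_rank = 1024;
        hsize_t global = per_rank * (hsize_t)nranks;
        hid_t filespace = H5Screate_simple(1, &global, NULL);
        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Select this rank's slice and write it collectively. */
        hsize_t offset = per_rank * (hsize_t)rank;
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                            &per_rank, NULL);
        hid_t memspace = H5Screate_simple(1, &per_rank, NULL);

        double buf[1024];
        for (hsize_t i = 0; i < per_rank; i++) buf[i] = rank;  /* dummy payload */

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
        H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }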

------
erickj
"It’s not just powerful, it has a stunning memory. There’s enough storage
space to hold every written work of humanity, in all languages – twice."

I really wish they just said it has XX TB of memory

Maybe I'm grossly underestimating how much data all human written works would
actually occupy... but that sounds like the amount of data I could put on a
home NAS (frankly I would have guessed my laptop hard drive until reading that
comparison).

~~~
JackFr
"every written work of humanity X times" and "weighs as much as X elephants"?
What is this 1895? How about Y hours of HD video and weighs as much as Y cars.

------
rdtsc
The article is a bit light on technical details. I found some specs here:

[https://hpc.llnl.gov/hardware/platforms/sierra](https://hpc.llnl.gov/hardware/platforms/sierra)

Summary:

* 190,080 IBM Power9 CPU cores

* 17,280 NVIDIA V100 (Volta) GPUs

* 125,626 peak TFLOPS (CPUs+GPUs)

~~~
JackFr
It would be nice if the article had indicated 1) the incremental improvement
over renting ~200K cores and similar memory from AWS, and 2) the cost of this
behemoth. I assume there is a significant advantage -- it would be nice to
know how much and at what cost.

~~~
SEJeff
Amazon's cost for this type of thing only makes sense for bursting to 200K
cores; this is a supercomputer, which will be heavily used all the time. From
a pure economics standpoint, the cloud makes zero sense for this sort of
thing.

Also, the performance of 200K cores on Amazon in VMs is a lot different from
200K physical cores. These HPC systems are designed to eke every last 3-5% of
performance out of the entire thing, something that you simply cannot do even
if you try using virtual machines or the cloud.

~~~
scrooched_moose
I maintain a much (much much much much.....much) smaller HPC system at my
company and this seems to be a quarterly battle I'm beyond sick of.

Every time a new IT manager sees our costs, they immediately declare 'I can
get you that on the cloud for a fraction of the cost' and start trying to
decommission the system. Never mind the 20-100x increase in solve time and the
multi-TB-a-day data transfer required. Waste 10 hours in meetings, stave it
off once again, and gear up for the next round in January...

------
erickj
Well I just built a 4 node array of Raspberry Pis... so eat your heart out
Lawrence Livermore National Lab

~~~
0xdeadbeefbabe
They have more than one heart btw. It's like a super heart.

------
twtw
Interesting thing about Summit and Sierra: the CPUs and GPUs are connected via
NVLink rather than PCIe, which reduces the well-known cost of moving data to
the GPU.

------
nuguy
I know it’s stupid but I wonder what it would be like to play a video-game
that fully utilized this super-computer. I imagine an open world rpg where
there are no loading screens and where every npc ai is running all the time.
It would be closer to an interactive simulation than to a game. Fun to think
about.

~~~
rjplatte
I'm not even sure what would be possible with this kind of power. Gamedevs:
What's your dream, limited only by lack of power?

------
grkvlt
The compute platform overview pages [0] list _all_ the HPC systems at LLNL.
It's an incredible list, with a huge number of TFLOPS, GBs of memory, and CPU
and GPU cores available. The systems range from small 50 TFLOP, 20-node Xeon
CPU Linux clusters up to the multi-PetaFLOP POWER9 and Nvidia GPU monsters with millions
of cores. Not sure what the total compute power available across all lab
systems would be, although I guess it will be dominated by recent
installations like Sierra - maybe 250 PFLOPS? Anyway, impressive
engineering....

0\.
[https://hpc.llnl.gov/hardware/platforms](https://hpc.llnl.gov/hardware/platforms)

------
onetimemanytime
Is there a bragging rights thing to this? Not wanting to be left behind by
China? Seriously, it seems like an arms race -- which is cool with me; the USA
has the money, and computers are where it's at.

Will this computer be used close to 100%, or was the old supercomputer just
not enough?

~~~
dragontamer
Supercomputers are used for:

* Weather modeling

* Nuclear Research

* Car crash simulations

* Electronic Design Automation (ex: mathematically proving chips are correct)

* Protein folding: looking for new chemicals for medicine.

Etc. etc.

There's a huge need to grow our supercomputer capacity. I bet you that every
major field has a use for a super-computer.

I think there's an element of bragging rights. But the USA buys "practical"
supercomputers most of the time. There are designs that push out more FLOPs
but are less useful to scientists.

The hugely powerful interconnect and CAPI / NVLink connections on this
supercomputer demonstrate how "practical" the device is. Most people are RAM
constrained, or message-constrained, and these are the biggest and best
interconnects available in 2018.

Interconnects are NOT a "bragging" metric; very few people look at them. Most
people look at the Linpack benchmark (a pure FLOPs measurement). However,
experts can tell when a supercomputer is built with a poor interconnect and
purely for "bragging rights" reasons.

------
antpls
So, it runs RHEL according to the link from another comment.

How many instances of the Linux kernel are running in total? Is it only one
kernel instance for the entire machine?

~~~
tanderson92
These machines typically have many thin clients ('nodes') each running their
own kernel with network-attached storage (but high-performance
storage/interconnect/filesystem technology)

~~~
antpls
According to [https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-sys...](https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-system#coral),
there are 4,320 of such "nodes", so there are 4,320 Linux kernels running?
That would mean some commanding kernels are scheduling jobs on other kernels,
which doesn't sound optimal.

According to [https://access.redhat.com/articles/rhel-limits](https://access.redhat.com/articles/rhel-limits),
RHEL 7 on POWER systems can "only" manage 32TB of memory (the whole system has
more than a petabyte of memory), assuming they don't run a modified version.
So there is definitely not just one OS running, I guess.

Still, the goal of an OS, especially a Unix-like "Time Sharing" OS, is to
manage resources. I wonder how hard it would be for the Linux kernel to manage
the entire system (with some virtual devices to aggregate all the nodes into
one system, even if they are on the opposite side of the room), and whether
all the existing scheduler code could be reused at that scale.

~~~
tanderson92
The problem is less one of scheduling and more one of communication. Typically
all nodes run the same executable (launched by a program like mpirun or
mpiexec), started simultaneously under each OS instance. This is done inside a
job scheduler, and the resulting host list defines the communication pattern.
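
To make the "same executable everywhere" point concrete, here's a minimal SPMD
sketch in C (not Sierra's actual software stack, just generic MPI that you'd
launch with something like mpirun -n 4 ./a.out):

    /* Minimal SPMD sketch: mpirun starts N copies of this same binary, one or
     * more per node, and each copy learns its identity from the MPI runtime. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I?  */
        MPI_Comm_size(MPI_COMM_WORLD, &nranks); /* how many copies?  */

        char host[MPI_MAX_PROCESSOR_NAME];
        int len;
        MPI_Get_processor_name(host, &len);

        printf("rank %d of %d running on %s\n", rank, nranks, host);

        /* Real codes branch on `rank` here, then exchange data with
         * MPI_Send/MPI_Recv or collectives like MPI_Allreduce. */
        MPI_Finalize();
        return 0;
    }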

By the way, even systems completely across the room may be "close" to each
other, depending on the topology of the system. You can imagine this being
important for solving a physical system which is periodic in certain
dimensions, so there should be little interconnect distance between
physically-distant nodes.

~~~
antpls
At that scale, I find it odd that people didn't come up with an optimized OS
that sees and manages all the resources and also plans the communication
between them.

It looks more like a private data center with 4,000 dedicated machines on the
same network running distributed algorithms than a "single supercomputer". Are
we just "wow-ing" at what is basically a data center here?

~~~
philipkglass
For HPC applications that actually need low-latency coordination between
nodes, the application code itself manages communication. The communication
can't be better optimized by the OS.

If you have an embarrassingly parallel problem, it will run well on this
machine but it will also be a waste of the machine's expensive design.
Embarrassingly parallel problems run just as well on generic data center
hardware. This machine is built for problems that only parallelize effectively
with low-latency coordination between nodes. Such problems come up a lot in
scientific/engineering simulations but are comparatively rare in general
purpose computing environments. General purpose nodes in a cloud computing
environment cannot run some of the harder problems this machine runs, _at any
price_. For any non-trivial parallel computing job there comes a crossover
point where adding more nodes makes the total time-to-solution longer rather
than shorter. This point comes a lot sooner if you don't have dedicated high-
bandwidth, low-latency interconnects between nodes.
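
A toy model makes that crossover easy to see (all numbers below are made up
for illustration, not measurements from any real machine): per-step time is
compute that shrinks as you add nodes plus communication that slowly grows
with them.

    /* Toy strong-scaling model: t(n) = compute/n + comm(n).
     * All constants are invented illustrative values, not measurements. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double compute_s = 1000.0; /* total compute time on 1 node (s)  */
        const double comm_s    = 2.0;    /* per-step communication cost factor */

        for (int n = 1; n <= 4096; n *= 2) {
            /* Compute shrinks with node count; communication (modeled here as
             * a tree-depth term) grows, so time-to-solution eventually rises. */
            double t = compute_s / n + comm_s * log2((double)n + 1);
            printf("%5d nodes -> %8.3f s per step\n", n, t);
        }
        return 0;
    }

With these made-up constants the per-step time bottoms out around a few
hundred nodes and then climbs again; a better interconnect effectively shrinks
the communication term and pushes that minimum out to far more nodes.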

~~~
antpls
Precisely. If we are talking about low latency, why would we let the
communication go from the application (in user-land) to the kernel, then
through the network stack, then be received by the kernel on another node, and
only then be received by the application in user-land again? As a first guess,
I'd imagine that bypassing the Linux kernels and directly accessing remote
hardware would be mandatory for the lowest latency.

If there is some info on the internet about the software stack/architecture of
the entire system, I'd like to read up on it. I haven't explored all the links
I posted above yet.

I'm nowhere near an expert, and HPC is a really specific use case, but there
are surely interesting bits to learn from it.

~~~
tanderson92
One place to start might be
[http://infiniband.sourceforge.net/](http://infiniband.sourceforge.net/)

------
buttslasher69
So is it filled with Nvidia's Tesla GPUs?

~~~
qubax
Yes, and Power9 CPUs.

