
Accelerated Computing Powering World’s Fastest Supercomputer - Poalopat
https://blogs.nvidia.com/blog/2018/06/08/worlds-fastest-exascale-ai-supercomputer-summit/
======
ahelwer
"If every computation were represented by a single grain of sand, you could
fill up the Houston Astrodome with sand 350 times in a single second."

There's a new one. Can anyone comment on the utilization of these
megacomputers? Do they have somewhere near 100% usage, with a queue extending
weeks? Also, is all this computational power really... necessary? I've seen
some intensely inefficient simulation code in my time.

~~~
mutagen
Several years ago I had the opportunity to attend a non-classified project
coordination conference / catch-up meeting with some DoE scientists, among them
attendees from LLNL and Los Alamos. I overheard discussion, and some bragging,
about the cycle counts they had gotten their simulation kernels down to. At
least some of them are dedicated to extracting every last bit of performance
out of their code.

~~~
darkmighty
I wonder what portion of time on those supercomputers is spent optimizing the
code itself, and how much more could be done.

With those massive clusters you could afford to test trillions of mutations of
your simulation kernels in parallel and prove the correctness of the fastest
ones -- pruning should be extremely fast by finding counterexamples. Or even
higher-level architectural optimizations. Surely they could afford to spend at
least a few % of the total time on this pre-optimization (although the tools
to achieve it automatically would need to be quite sophisticated!).

~~~
hedora
[http://fftw.org](http://fftw.org) is optimized by doing massive parameter
sweeps on each architecture (it only considers correct implementations).

There are also a few “software synthesis” and “sketching” approaches that use
a constraint solver to find all correct implementations of a high level spec,
subject to some implementation pattern. Then they either try them all with
brute force or pick the one that optimizes some objective function.
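
In miniature, the FFTW-style selection (time only the candidates that produce
correct output, keep the fastest) looks something like this toy Python sketch,
where the dot-product variants stand in for real kernel candidates:

```python
import timeit
import numpy as np

# Three correct candidate "kernels" for the same operation (a dot product).
def dot_naive(a, b):
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

def dot_numpy(a, b):
    return float(np.dot(a, b))

def dot_einsum(a, b):
    return float(np.einsum("i,i->", a, b))

a = np.random.rand(10_000)
b = np.random.rand(10_000)
reference = dot_numpy(a, b)

best = None
for fn in (dot_naive, dot_numpy, dot_einsum):
    # Only time correct implementations, as FFTW's planner does.
    assert abs(fn(a, b) - reference) < 1e-6
    t = timeit.timeit(lambda: fn(a, b), number=100)
    if best is None or t < best[1]:
        best = (fn.__name__, t)

print("fastest correct candidate:", best)
```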

~~~
stochastic_monk
FFTW isn't exactly developing an optimal kernel from scratch. It's testing a
range of different methods and simply choosing the best one for your
parameters.

Facebook's Tensor Comprehensions framework, which tunes its generated CUDA
kernels with a genetic algorithm, is closer to the sort of approach that would
take greatest advantage of the hardware it runs on.
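
The genetic-algorithm part can be sketched in a few lines. A toy in Python,
where runtime() is a stand-in for actually benchmarking a generated kernel and
the single tile-size parameter is hypothetical:

```python
import random

# Toy stand-in for benchmarking a generated kernel: pretend runtime is
# minimized at a tile size of 64, plus some measurement noise.
def runtime(tile):
    return (tile - 64) ** 2 + random.random()

population = [random.randrange(1, 257) for _ in range(16)]
for _ in range(20):
    population.sort(key=runtime)
    parents = population[:4]  # keep the fastest candidates
    children = [
        max(1, min(256, random.choice(parents) + random.randrange(-16, 17)))
        for _ in range(12)    # mutate the parents to refill the pool
    ]
    population = parents + children

print("best tile size found:", min(population, key=runtime))
```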

------
NKosmatos
The next revision of the TOP500 will come out this month; let’s wait and see
how it stands against the competition. The previous supercomputer mentioned in
the article, “Titan”, is currently the 5th fastest on the list:
[https://www.top500.org/lists/2017/11/](https://www.top500.org/lists/2017/11/)

------
ttul
Forgive me for being naive, but what OS runs on this beast? What framework do
people code against to make use of the environment?

~~~
quadruplebond
Probably something that looks like Red Hat. All these codes will use some form
of message passing, with MPI being the most common. Finally, to get on-node
parallelism, some codes will hand-write CUDA for their problem, while others
can get away with calling NVIDIA libraries or using something like the newer
versions of OpenMP.
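
For a minimal taste of the MPI side, here is a sketch using Python's mpi4py
(the real codes are mostly C/C++/Fortran, and the script name in the comment
is hypothetical):

```python
# Run under an MPI launcher, e.g.: mpirun -n 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # this process's ID across the whole job
size = comm.Get_size()  # total number of MPI processes

# Each rank contributes a value; the reduction sums them onto rank 0.
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print(f"{size} ranks, sum of ranks = {total}")
```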

~~~
ttul
I want to see the output of “top”

~~~
quadruplebond
Unfortunately you would be disappointed. top is only going to display what's
running on your current node, not the whole machine. There is probably some
sort of global top, but just logging in and running it in your shell will only
show the head node. That head node might still be beefy, but only as beefy as
a large shared-memory machine can be.

~~~
ikeyany
Why not just fork a child top for each node?
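
In spirit that's roughly what cluster monitoring tools do. A rough sketch in
Python, with hypothetical node names; a real cluster would use something like
pdsh or the scheduler's own monitoring rather than a raw ssh loop:

```python
import subprocess

# Hypothetical node names; a real cluster would get these from the scheduler.
nodes = [f"node{i:03d}" for i in range(4)]

for node in nodes:
    # One "child top" per node, via ssh; uptime keeps the output short.
    result = subprocess.run(
        ["ssh", node, "uptime"],
        capture_output=True, text=True, timeout=10,
    )
    print(node, result.stdout.strip())
```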

------
martinpw
If you look at the statistics over time for accelerators here:

[https://www.top500.org/statistics/overtime/](https://www.top500.org/statistics/overtime/)

it appears the number of TOP500 systems with accelerators is not really
increasing much; it has been sitting at around 20% of the top 500 machines
for the past ~4 years.

Can anyone working with these systems comment on why that might be? Are
accelerators still tough to apply to a lot of the problems these machines are
used for?

~~~
quadruplebond
Depends on what the computer is designed for. Imagine your code parallelizes
up to about 16 nodes, or 1,000-2,000 cores of a new Xeon, and is not optimized
for accelerators. You might build a large machine with hundreds of nodes, but
with the intent of running many 16-node jobs at a time. The largest machines,
though, are mostly aiming for jobs that use a significant fraction of the
total machine and are chasing peak performance, so accelerators it is. Also,
I didn't look too closely at the link, but some machines that technically
don't have accelerators, like the K computer or the old IBM Blue Gene
machines, are closer to general-purpose accelerators than they are to fat
Xeon nodes.

------
tntn
As a point of interest, there has been some cool work done recently regarding
accelerating high precision dense system solvers using tensor cores +
iterative refinement. See [http://on-demand.gputechconf.com/gtc/2018/presentation/s8478...](http://on-demand.gputechconf.com/gtc/2018/presentation/s8478-new-frontiers-for-dense-linear-solvers-towards-extreme-performance-and-energy-efficiency.pdf)
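
The core trick is easy to sketch: solve in low precision, then correct using
residuals computed in high precision. A minimal numpy version, with float32
standing in for the tensor cores' low-precision arithmetic (a real
implementation would reuse one low-precision factorization instead of
re-solving each time):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b with low-precision solves plus high-precision refinement."""
    A32 = A.astype(np.float32)
    # Initial solve entirely in float32 (stand-in for a tensor-core solve).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x  # residual in float64: this is where the accuracy comes from
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d         # correct the low-precision solution
    return x

A = np.random.rand(500, 500) + 500 * np.eye(500)  # well-conditioned test matrix
b = np.random.rand(500)
x = mixed_precision_solve(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))
```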

------
noahdesu
The only up-to-date information I've found so far regarding the storage system
for this beast is:

* 250 PB GPFS

* 2.5 TB/s

I'd like to find some information on the burst buffer that is sitting between
compute and the capacity storage system.

------
neurologic
This supercomputer is surprisingly underwhelming, since it's equivalent to
just two of Google's TPU v3 pods.

~~~
staticfloat
The numbers I've seen for a TPU v3 pod are ~100 PFLOPs, whereas this article
claims over 3 EFLOPs, so that's at least 30 TPU v3 pods. Leaving aside
arguments about how useful FLOPs actually are as a measure, that's still
quite a lot of computing power.

~~~
neurologic
No, it's 200 PFLOPs:

> Built for the U.S. Department of Energy, this is a machine designed to
> tackle the grand challenges of our time. It will accelerate the work of the
> world’s best scientists in high-energy physics, materials discovery,
> healthcare and more, with the ability to crank out 200 petaflops of
> computing power to high-precision scientific simulations.

EDIT: OK, it's 200 PFLOPs of high precision math, 3 EFLOPs of lower precision
math. I take my comment back.

------
eeks
I will certainly get dozens of downvotes for obvious fanboyism, but I'm
prepared to take the heat.

For the doubters and the disbelievers who have been wondering what the
relevance of IBM is in this day and age: this. This is what IBM is all about.

And it's not just about PFLOPS; each node has 1/2 terabyte of memory, globally
addressable across _the entire cluster_ using RDMA over Mellanox 200Gb/s EDR.

It's also P9: 44 cores per node; but most importantly, each node drives a
couple of V100s through NVLink, which allows the GPUs to share the system's
main memory.

~~~
Keyframe
NVLink allows GPUs to share the system's memory?

~~~
tntn
Summit node CPUs can access GPU memory coherently, and unified memory allows
for a single pointer across all processors. On most systems that involves page
faults migrating pages, but Summit has something called ATS (address
translation services) that allows the GPU to directly access all system
memory.

[https://vimeo.com/262870773/recommended](https://vimeo.com/262870773/recommended)

~~~
Keyframe
IBM integrated NVLink into their Power CPUs; well, I'll be damned.

------
patelster
But can it run Crysis?

~~~
aqzman
Please don't post if you don't have anything to add or a valid question to
ask. If you feel like posting memes, Reddit is a better medium for that.

------
senatorobama
NVIDIA is my dream company. Does _anyone_ have an in?

