

U.S. To Build Two Flagship Supercomputers for National Labs - eslaught
http://nvidianews.nvidia.com/News/U-S-to-Build-Two-Flagship-Supercomputers-for-National-Labs-c0d.aspx

======
cwal37
I spent some time at Livermore, and I work at Oak Ridge now. It's been
interesting to see the difference in how the HPC assets are referred to, or
not, and I think it reflects the culture and layouts of the labs.

At LLNL it was constant, and everyone I interacted with had something to say
about being at or near the #1 spot on the Top500 list (with Sequoia at the
time), along with advice on projects to try to get time. I went on a tour of
the facility too, and it was really neat to get some perspective on the
physical aspects of it all.

I've been at ORNL since February, and haven't heard Titan mentioned once.
Partially I think this is due to the more cohesive overall mission of LLNL vs.
ORNL, but I think geography plays a role as well.

The labs have fairly similar numbers of employees, but with one major
difference: ORNL is spread out over a pretty large area, while LLNL is a
single square mile. Perhaps as a result of that, groups at ORNL feel a bit
more insular. Heck, I've been to the "main" cafeteria here exactly once,
because I work on an edge of the campus, but at LLNL I went every day because
it was easy to get to.

I wonder if a more compressed area results in more effective utilization of
huge assets like HPC, because there are a lot more connections to be made
between different departments. Then again, I also got the feeling that at LLNL
it was a carrot they could dangle to try to draw people out of the valley.
ORNL doesn't really have that same local competition for talent.

EDIT: And the article even specifically mentions the difference in lab
missions by noting the new computers' uses: security vs. open science.

~~~
marktangotango
There was some discussion in Dewar's book "To the End of the Solar System"[1]
about the different cultures at the national labs related to the development
of nuclear thermal rockets. Something about the Los Alamos guys being all
about experimentation, blowing radioactive material out of the tail of the
rocket, until an Oak Ridge director was brought in and mandated more modelling
of internal behavior. I may have those labs mixed up; it's been a long time
since I read the book.

[1][http://books.google.com/books/about/To_the_End_of_the_Solar_...](http://books.google.com/books/about/To_the_End_of_the_Solar_System.html?id=zmpxV1ygjvsC)

~~~
elektronjunge
The nuclear rocket stuff was in the 50s and 60s. With the shift to stockpile
stewardship at the end of the Cold War, all of the labs became fairly focused
on modeling. It also depends on what group you work for. I worked for one of
the modeling groups, so we were obviously talking about it all the time. Many
other groups did too. Some were more experiment-focused. Most of the
physicists treated it as a third branch alongside the classic
experimental/theoretical split in physics.

~~~
ganzuul
Why do nuclear stockpiles require HPC?

~~~
icegreentea
Since actually testing nuclear weapons is banned, the only way to verify (ha!
verify with simulation) that the current nuclear stockpile is reliable is with
simulation. In brief, the idea is to model degradation of current warheads and
how that affects their performance/reliability.

For example, the National Ignition Facility (the warp core in Into Darkness)
was created partly to provide a source of fusion that could be used to verify
the computer models used to simulate nuclear weapon stability. I.e., write a
general enough model to see what happens in the fusion bit of a nuke, apply
the model to something like what the NIF does, and then actually test it in
the NIF. If the NIF experimental outcomes are in agreement with the model
predictions, then we have higher confidence that the model's predictions about
the actual nukes are useful.

[http://en.wikipedia.org/wiki/Stockpile_stewardship](http://en.wikipedia.org/wiki/Stockpile_stewardship)
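
A loose caricature of that validation loop (every number and formula below is
made up purely for illustration; the real codes are enormous multiphysics
simulations, and the quantities compared are far richer than a single yield):

```python
def model_yield(drive_energy_mj):
    # Stand-in for a real simulation code; the actual models are huge
    # radiation-hydrodynamics solvers, not a one-line formula.
    return 0.03 * drive_energy_mj ** 1.5

measured_yield = 0.075              # hypothetical NIF measurement (MJ)
predicted_yield = model_yield(1.8)  # model run at NIF's ~1.8 MJ laser drive

# If prediction and experiment agree within the error bars, confidence in
# applying the same model to conditions we can no longer test goes up.
agrees = abs(predicted_yield - measured_yield) / measured_yield < 0.2
print(agrees)
```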

------
Xcelerate
This is really exciting! (I'm actually running a molecular dynamics simulation
on Titan right now.)

What I would like to see, though, is also a return to increases in absolute
processing speed. GPUs and more nodes are great for simulating _larger_
systems of molecules and atoms, but they are actually worse for simulating
_longer_ timespans. For example, one of my projects was a small carbon
crystallite of 136 atoms. I ran that one on my laptop because it would have
taken just as long on Titan.

Problems like protein folding require a sequential series of operations where
each step depends on the last one. Right now, the solution to this is
purpose-built ASIC systems (like Anton), but that is a lot of money invested
in a machine that can only be used for one purpose.
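
To make the size-vs-time distinction concrete, here is a minimal sketch (toy
forces, not a real MD integrator) of why a longer trajectory can't simply be
bought with more nodes: the work _inside_ a timestep parallelizes, but the
timesteps themselves do not.

```python
import numpy as np

def forces(pos):
    # Toy pairwise repulsion, standing in for real LJ/bonded force terms.
    # This inner loop over atoms is the part that spreads nicely across
    # many GPUs/nodes.
    disp = pos[:, None, :] - pos[None, :, :]
    dist2 = (disp ** 2).sum(axis=-1) + np.eye(len(pos))  # avoid self-interaction
    return (disp / dist2[..., None] ** 2).sum(axis=1)

def run(pos, vel, dt=1e-3, steps=10_000):
    # The time loop is inherently sequential: step i+1 needs the positions
    # and velocities produced by step i, so adding more nodes cannot make a
    # *longer* trajectory finish sooner.
    for _ in range(steps):
        vel += dt * forces(pos)
        pos += dt * vel
    return pos, vel

# A toy system, e.g. 136 atoms like the crystallite mentioned above.
rng = np.random.default_rng(0)
pos = rng.normal(size=(136, 3))
vel = np.zeros_like(pos)
run(pos, vel, steps=100)
```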

Regardless, most of my work is size bound rather than time bound, so Summit
will be great!

~~~
eslaught
> GPUs and more nodes are great for simulating larger systems of molecules and
> atoms, but they are actually worse for simulating longer timespans.

For a number of reasons, a return to big gains in absolute processing speed is
basically not going to happen. CPU single-threaded performance is still
increasing, but slowly, and probably not by enough to satisfy your simulation
needs.

In a lot of these cases, the only practical solution (assuming you can't spend
the money on custom hardware) is to go back to the code and optimize the hell
out of it. Partly this means clever low-level optimizations, but it might
involve switching to programming models that help make better use of the
hardware. For example, S3D, a combustion simulation which was one of the
acceptance tests for Titan, runs about 2x faster under Legion (my research
project) compared to the previous OpenACC code hand-tuned by Cray and NVIDIA
engineers [1].

If that sounds interesting to you, feel free to contact me, and if you'll be
at SC next week maybe we can meet up.

[1]: [http://legion.stanford.edu/pdfs/legion-fields.pdf](http://legion.stanford.edu/pdfs/legion-fields.pdf)

------
acadien
The joke is that the current #1 spot is occupied by a broken machine that has
never run at full capacity and spends most of its time OFF or hamstrung to
save on power. Tianhe-2 was a huge mistake, poorly planned out and poorly
implemented.

The American national labs seem to have the experience and patience to
implement new systems that work well enough and deliver a reasonable
dollar-per-flop ratio.

~~~
bane
Like lots of big programs in China, Tianhe-2 was probably more the fulfillment
of a national prestige program than a serious scientific undertaking.

------
Someone1234
Ignorant question: Why are supercomputers still popular? I would have thought
that the examples they give could equally be accomplished with a much more
flexible array of less powerful nodes (see Google's search engine as an
example).

That way, instead of doing a "big bang" upgrade such as this, you just upgrade
individual nodes as the technology allows and are almost always "current."

PS - I find their usage of the term "energy independence" hilarious. That was
a term coined largely to justify fracking and other environmentally damaging
practices in the US. I'm glad to see it has been over-used so much that third
parties are now using it to justify other projects...

~~~
rtkwe
The general reason comes down to inter-node bandwidth. Things like Google's
search or SETI are able to use standard interconnects (or the open internet in
SETI's case) because the problems, link counting and signal analysis, can be
broken down into individual pieces that don't interact very much between
sections of the computation.

Things like physics simulations, weather, etc., have a lot of interaction
across any divisions you could try to draw in order to split the work between
nodes. To work with these problems you need a faster interconnect between
processors than Ethernet provides; InfiniBand seems a popular choice, but I'm
no expert on the details of these architectures. These problems also need to
move in relative lockstep, meaning loosely connected systems don't work as
well, or at least don't provide any real advantage. Node homogeneity also
makes these tightly coupled processing nodes easier to manage.
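
A toy illustration of the difference (purely a sketch, not how any real code
is structured): the first function below could be sharded across loosely
connected machines with no cross-talk, while the second needs its neighbours'
boundary values before every single step, which is where the interconnect
comes in.

```python
import numpy as np

# Embarrassingly parallel: each node counts links in its own chunk of pages;
# no communication is needed until a tiny final reduction.
def count_links(chunk_of_pages):
    return sum(page.count("href=") for page in chunk_of_pages)

# Tightly coupled: one step of a 1-D heat equation. Each node owns a slice of
# the rod, but updating its edge cells needs the neighbouring node's boundary
# values, so every timestep forces a "halo exchange" over the interconnect.
def heat_step(local, left_ghost, right_ghost, alpha=0.1):
    padded = np.concatenate(([left_ghost], local, [right_ghost]))
    return local + alpha * (padded[:-2] - 2 * local + padded[2:])

# Two "nodes" simulated in one process, just to show the dependency:
rod = np.linspace(0.0, 1.0, 10)
left, right = rod[:5].copy(), rod[5:].copy()
for _ in range(100):
    new_left = heat_step(left, left[0], right[0])       # needs right's first cell
    new_right = heat_step(right, left[-1], right[-1])   # needs left's last cell
    left, right = new_left, new_right
```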

Occasionally there will be upgrades to a supercomputer, but generally they're
run 24/7 until they fall so far behind the technology, in both speed and power
consumption, that it becomes cheaper to replace the machine completely than to
attempt an upgrade that would end up swapping out more than half the cluster's
electronics anyway.

~~~
epistasis
>The general reason comes down to inter-node bandwidth

And sometimes even more than bandwidth, inter-node latency, which is where
InfiniBand really shines over Ethernet. Calculations with high levels of
node-to-node dependencies are pretty much the difference between
supercomputers and the type of infrastructure that large internet companies
compute on.

I hear (but don't personally know) that Google-style large data center
installs are moving towards the Clos-style networks that have been popular in
HPC for a long time. These network topologies give equal bandwidth between any
pair of nodes, as well as nearly equal latency.
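
As a back-of-the-envelope illustration (the latency and bandwidth figures
below are rough assumptions, not measurements): for the small messages that
tightly coupled solvers exchange every step, the fixed per-message latency
dominates the transfer time, which is why InfiniBand's microsecond-scale
latency matters more than raw bandwidth.

```python
def transfer_time_us(msg_bytes, latency_us, bandwidth_gbps):
    # time = fixed per-message latency + payload / bandwidth
    return latency_us + msg_bytes * 8 / (bandwidth_gbps * 1e3)

msg = 8 * 1024  # an 8 KiB message, typical of a small halo exchange

# Assumed ballpark figures: tens of microseconds for commodity TCP/Ethernet,
# a microsecond or two for InfiniBand RDMA; both at 10 Gbit/s to isolate
# the latency effect.
print(transfer_time_us(msg, latency_us=50.0, bandwidth_gbps=10))  # ~56.6 us
print(transfer_time_us(msg, latency_us=1.5, bandwidth_gbps=10))   # ~8.1 us
```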

~~~
mscman
Not just Google. Amazon's "Enhanced networking" uses a feature in 10GbE that's
been used in smaller HPC clusters for around 5-10 years now. And MS Azure has
InfiniBand backing their highest-tier instance types.

Many datacenters are adopting HPC technologies to reach the scale they need.

------
eslaught
See also: NVLink, which is getting rid of the PCI-E bus between the CPU and
GPU:
[https://news.ycombinator.com/item?id=8609071](https://news.ycombinator.com/item?id=8609071)

------
higherpurpose
This is interesting to see, POWER8 and Volta together. I wonder if they'll
push POWER8's memory bandwidth to match Volta's 1TB/s by 2017 as well.
Currently it's around 230GB/s, I think.

