
Supercomputers: Obama orders world's fastest computer - m-i-l
http://www.bbc.co.uk/news/technology-33718311
======
davegardner
10 years ago the fastest supercomputer was BlueGene/L which was rated at 136.8
TFlop/s. The current fastest supercomputer is rated at 33,862.7 TFlop/s, or
247 times faster.

It seems to me that the aim of taking 10 years to build a supercomputer that
is only 20 times faster than the current one might fall a little short if it's
aiming to take the top spot.
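
A quick back-of-the-envelope check of those numbers (the 2005 leader vs. the 2015 leader, Tianhe-2):

    ratio = 33_862.7 / 136.8            # TFlop/s: 2015 leader vs BlueGene/L
    print(round(ratio, 1))              # ~247.5x over the last decade
    print(round(ratio ** 0.1, 2))       # ~1.73x per year, far steeper than 20x per decade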

~~~
vmarsy
This isn't only about FLOPS; the big trend among these countries* ordering
new supercomputers for 2020/2025 is a strong focus on power. Current
supercomputers consume a lot of it.

Also, the FLOPS measurement is a bit broken: it focuses on a dense linear
algebra problem, for which GPUs or other accelerators boost the results
easily. If all you plan to do is run simulations that are easily parallelized
on a GPU, that's fine; for other types of programs it is hard to tell which
supercomputer is the _fastest_.

* France is also ordering a would-be top 10 supercomputer: [http://www.hpcwire.com/off-the-wire/the-cea-agency-and-atos-...](http://www.hpcwire.com/off-the-wire/the-cea-agency-and-atos-team-to-deliver-exaflop-supercomputer-by-2020/)

~~~
amelius
> Also the FLOPS measurement is a bit broken

Does anybody have a better measurement?

Perhaps it could be the size of the matrix that can be inverted on it in an
hour of time, with IEEE double precision floats, using some standard
algorithm.
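
A rough single-node sketch of that metric (numpy/LAPACK only; a real ranking
would need a distributed solver and an agreed "standard algorithm"):

    import time
    import numpy as np

    def largest_matrix_inverted_within(budget_seconds):
        n = 1024
        while True:
            a = np.random.rand(n, n)            # IEEE double precision by default
            t0 = time.perf_counter()
            np.linalg.inv(a)
            if time.perf_counter() - t0 > budget_seconds:
                return n                         # first size that blows the budget
            n *= 2                               # keep doubling until it does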

~~~
ajdecon
The "High Performance Conjugate Gradients" benchmark was proposed a couple
years ago as an alternative metric for ranking supercomputers. Its proponents
claim its behavior is more similar to real applications (irregular access
patterns, lower ratio of computation to memory access, etc), compared to
linear algebra problems like the "High Performance Linpack" benchmark
currently used by the Top500.

The different performance numbers for top systems on HPCG vs HPL are pretty
striking: [http://www.hpcg-
benchmark.org/custom/index.html?lid=155&slid...](http://www.hpcg-
benchmark.org/custom/index.html?lid=155&slid=279)

Original proposal to use HPCG as an alternative to HPL for supercomputer
rankings: [http://www.sandia.gov/~maherou/docs/HPCG-
Benchmark.pdf](http://www.sandia.gov/~maherou/docs/HPCG-Benchmark.pdf)

~~~
jedbrown
HPCG basically measures STREAM and has many technical flaws that make it
scale-dependent and difficult to adjudicate. As a co-developer of a different
benchmark, I'll just cite this paper from a third party.
[https://hpgmg.org/static/MarjanovicGraciaGlass-
PerformanceMo...](https://hpgmg.org/static/MarjanovicGraciaGlass-
PerformanceModelHPCG-2014.pdf)

The reality is that there are many dimensions to supercomputing performance
and it's impossible for one number to capture the utility of the machine. Our
HPGMG benchmark ([https://hpgmg.org](https://hpgmg.org)) attempts to strike a
balance and give useful supplementary information. I do think it's better than
any other single benchmark for evaluating today's machines and will also prove
to be more durable over time.

~~~
dekhn
How would you use a benchmark like this to predict the performance of a well-
designed asynchronous parallel conjugate gradient solver, like most modern
deep learning neural networks that run on Internet HPC machines?

~~~
jedbrown
CG isn't truly asynchronous due to its reductions. It can be pipelined in
various ways (we have several implementations in PETSc), but performance
requires a quality implementation of asynchronous reduction (e.g.,
MPI_Iallreduce) which the vendors have been slow about developing (I've been
working with some on fixing this and Cray has made recent progress).

With respect to deep learning and other applications using CG or related
algorithms, the bottlenecks depend on the scale, the ability to expose
locality, and the operator/preconditioner representation. If there is no locality,
then matrix-vector products require all-to-all communication which tend to
dwarf the cost of the reductions in CG. Even with locality in the matrix-
vector product, preconditioners often need to communicate globally in a
scalable way similar to HPGMG. Operators need not be represented as a table of
numbers or a sparse matrix format, but could use a tensor product, fast
transform, or other information to compute the action using less storage. If
they are represented explicitly (sparse or dense), then matrix-vector product
performance (thus CG as a whole) is dominated by memory bandwidth for problem
sizes that do not fit in cache. HPGMG tries to strike a balance between memory
bandwidth demands and compute using a matrix-free representation. HPGMG also
reports dynamic range expressed as Performance versus Time-to-solution as the
problem size is varied, which allows applications to see performance barriers
that might be relevant to them (e.g., see how Titan cannot do a solve in less
than 200 ms while Edison can do 50 ms, and how that relates to climate
simulation performance targets; see slide 7 of
[https://jedbrown.org/files/20150624-Versatility.pdf](https://jedbrown.org/files/20150624-Versatility.pdf)).
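
For the curious, the overlap pattern that pipelined variants rely on looks
roughly like this (a minimal mpi4py sketch, not PETSc's implementation;
A_local, p_local, and r_local are hypothetical per-rank arrays):

    import numpy as np
    from mpi4py import MPI

    def overlapped_dot_and_spmv(comm, A_local, p_local, r_local):
        local_dot = np.array([r_local @ r_local])
        global_dot = np.zeros(1)
        # Start the global reduction without blocking (MPI_Iallreduce)...
        req = comm.Iallreduce(local_dot, global_dot, op=MPI.SUM)
        # ...and hide its latency behind the local matrix-vector product.
        w_local = A_local @ p_local
        req.Wait()  # only pays off if the MPI library progresses the reduction asynchronously
        return global_dot[0], w_local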

~~~
manjunaths
Is it possible to calculate the theoretical performance of a cluster under
HPGMG and then do a practical run and come up with an efficiency number, like
in HPL?

One of the biggest reasons for the use of HPL is that many sizing
considerations can be based on the theoretical calculations.

But anyway this is very interesting. I definitely need to check this out.

~~~
jedbrown
HPL has an abundance of flops at all scales (N^{1.5} flops on N data), so one
can expect a decent fraction of peak flop/s on any architecture with enough
memory and adequate cache performance. This is a problem because architectural
tricks like doubling the vector registers without commensurate improvements in
bandwidth, cache sizes, load/store/gather/scatter produce huge (nearly 2x)
benefit for HPL and little or no benefit to a large fraction of real
applications.
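
To make the contrast concrete, a rough arithmetic-intensity comparison (flops
per byte of memory traffic; the stencil numbers are assumed, order-of-magnitude
only):

    def hpl_intensity(n):
        flops = (2.0 / 3.0) * n ** 3        # LU factorization flop count
        data = 8.0 * n ** 2                 # one n x n double-precision matrix
        return flops / data                 # grows with n: compute-bound

    def stencil_intensity():
        # e.g. a 7-point stencil: ~13 flops per point against a few
        # 8-byte loads/stores per point (illustrative assumption)
        return 13.0 / (3 * 8.0)             # roughly constant, well under 1

    print(hpl_intensity(100_000))           # thousands of flops per byte
    print(stencil_intensity())              # ~0.5 flops per byte: bandwidth-bound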

HPGMG is representative of most structure-exploiting algorithms in that it
does not have this abundance of flops, thus theoretical performance is
actively constrained by both memory bandwidth and flop/s. We see many active
constraints in practice; e.g., improving any of peak flop/s, memory bandwidth,
network latency, or network bandwidth produces a tangible improvement in HPGMG
performance. Depending on the fidelity of the performance model, these
dimensions can be a fairly accurate predictor of performance, but ILP,
compiler quality, on-node synchronization latency, cache sizes, and similar
factors also matter (more for HPGMG-FE than HPGMG-FV).
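
As a toy illustration of what "many active constraints" means, here is a
simple model where the predicted time is set by whichever resource binds (all
parameters are hypothetical machine/application numbers):

    def predicted_time(flops, bytes_moved, messages, message_bytes,
                       peak_flops, mem_bw, net_latency, net_bw):
        compute = flops / peak_flops
        memory = bytes_moved / mem_bw
        network = messages * net_latency + message_bytes / net_bw
        # Improving any term that is (near-)binding shows up in the total.
        return max(compute, memory) + network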

I think it is actually quite undesirable for benchmark performance to be
trivially computed from one parameter in machine provisioning. No computing
center has a mission statement asking for a place on a benchmark ranking list
(like Top500). Instead, they have a scientific or engineering mandate. Press
releases tend to overemphasize the ranking and I think it is harmful to the
science any time the benchmark takes precedence over the expected scientific
workload. HPGMG is intended to be representative in the sense that if you
build an "HPGMG Machine", you'll get a balanced, versatile machine that
scientists and engineers in most disciplines will be happy with. I'd still
rather the centers focus on their workload instead of HPGMG.

------
phreeza
A bit more informative is the actual fact sheet put out by the White House
[1]. What they are really aiming for is exa_scale_ computing, which they
define as being capable of applying exaflops to exabytes. From my limited
knowledge, the latter will actually be the bigger deal. As pointed out
elsewhere, an exaflop supercomputer will probably arrive before that.

[1]
[https://www.whitehouse.gov/sites/default/files/microsites/os...](https://www.whitehouse.gov/sites/default/files/microsites/ostp/nsci_fact_sheet.pdf)

~~~
trsohmers
The real problem is getting 1 exaflop (or around it) within a reasonable power
budget. The DOE's power budget for all of their supercomputing resources is 20
Megawatts, so at a full system level we would need to be at 50 GFLOPs per
watt, while the best system right now is at 5.
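
That 50 GFLOPs-per-watt figure falls straight out of those two numbers:

    target_flops = 1e18          # one exaflop/s
    power_budget = 20e6          # DOE's 20 MW budget, in watts
    print(target_flops / power_budget / 1e9)   # 50.0 GFLOPs per watt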

~~~
tim333
Nvidia's Pascal GPU, due next year, is supposed to do 28 teraflops with 1300
watts, which I make 21 GFLOPs per watt.

~~~
trsohmers
That's single precision, which the DOE doesn't care about. The latest NVIDIA
GPUs do 8 to 16 times better at single precision (32-bit) floating point than
at double precision (64-bit).

The best next-generation DP GFLOPs/watt from one of the big players will most
likely be the 2016 Xeon Phi, at ~10-12 GFLOPs/watt... You are also forgetting
that GPUs have a ~100W+ CPU sitting next to them, which brings down total
efficiency significantly.

Shameless self promotion: My startup
([http://rexcomputing.com](http://rexcomputing.com)) is aiming for 64 double
precision GFLOPs/watt, and 128 GFLOPs/watt single precision for its first
chip next year.

~~~
tim333
Your chip looks cool. I guess it may be tricky to adapt software to run on the
thing? Or else you could try to sell Obama 4 million of them for his new
computer.

------
dekhn
It looks like they are explicitly saying they want to make a machine that
works for both types of HPC -- classic low-latency, high-bandwidth internode
communication (physics simulations) and modern Internet-driven high-bandwidth
storage/node communication.

This is because the supercomputer community has long ignored the Internet-
style of computation (MapReduce etc). But most of the new generation of
scientists are adapting their codes to this new style, because dollar-for-
dollar they can get more throughput than the classic style machines. Classic
machines invest heavily in low-latency communication and typically require
APIs like MPI to achieve it, while Internet HPC just uses well-designed TCP-
based socket communications.

Building dual-design systems like this, especially when the community has
little or no skill at building NG Internet HPC systems, is likely to produce a
system that is good at few things.

Instead, build two systems. One is the largest (but not necessarily exaflop)
machine you can afford, built as a classic supercomputer. Then, for the second,
hire some datacenter designers from Google/Facebook and have them build a
modern HPC cloud design.

The biologists will flock to the second one; they have long been underserved
by the DOE supercomputing community.

~~~
sevensor
> especially when the community has little or no skill at building NG Internet
> HPC systems

I would argue that the community of people who actually have the skills to
take advantage of the interconnects in a classic HPC system is vanishingly
small, and in consequence we've overbuilt them on an epic scale.

Allow me to vent. I had the good fortune to have a login on a "petascale" HPC
system, and access to an allocation of hours.

The /scratch filesystem would fail weekly, which killed everybody's jobs. If
you had a big run going when /scratch failed, you lost everything. Scratch
failed so much because the models that were being used often did wildly
inappropriate amounts of file IO --- debugging print statements, detailed
intermediate calculations, excessively verbose output --- that worked all
right in development but when run in parallel brought the filesystem to its
knees.

Furthermore, the login nodes were almost unusably slow because of all the
Python and Perl post-processing scripts running on them. This isn't even a
matter of users being cheap with their hours --- post-processing would have
been a tiny fraction of their allocations. Instead, it's that many of them
gave no thought at all to how the post-processing might be structured and run
through the batch scheduler, and saw no downside to abusing the login nodes
for that purpose.

In conclusion, I can attest to at least one HPC system that was badly
mismatched to its users' needs and level of sophistication, despite
allocations of hours being awarded only to a small number of researchers from
across the country through a highly competitive process. Building these things
serves national and institutional pride far more than any utilitarian
interest.

~~~
dekhn
You're describing an exceptionally poorly built and used system. That said,
it's not inconsistent with what I've seen as well.

My claim is that the design of classic interconnects is a big waste of money,
because only a few codes need them, yet the interconnect dominates the cost of
the cluster (>50%). I've learned, from years of studying Google's papers, that
there are better ways to build code that communicates, and those mechanisms
are much easier to teach to scientists and computer scientists than MPI.

~~~
rwallace
That's true for many purposes, but is it true for physical simulations? Don't
they need communication on every time step?

~~~
dekhn
I would say no.

Here is my argument: when I worked for the DOE, everybody told me I had to run
my MD simulations on a supercomputer using all the processors, and that I
would be judged on my parallel efficiency. This meant using a code that used
MPI to communicate at every timestep (or every N timesteps). I asked, instead,
"Why not just run N independent simulations and pool the results?" In this
case, you run an M-thread simulation on each machine (where M = the number of
cores on the machine) with no internode communication at all except to read
input files and write output files.
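
A minimal local sketch of that pattern (in practice it would be a job array
with one replica per node; run_md is a hypothetical function that runs one
M-thread simulation and returns its results):

    from concurrent.futures import ProcessPoolExecutor

    def run_ensemble(run_md, n_replicas):
        # Each replica is completely independent: no communication beyond
        # reading input files and writing output files.
        with ProcessPoolExecutor() as pool:
            return list(pool.map(run_md, range(n_replicas)))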

The short answer is that this approach works just fine, but the DOE
supercomputer people won't let you run embarrassingly parallel codes because
they already spent money on the interconnect to run tightly coupled codes.

In response to this, I went to Google, built Exacycle (loosely coupled HPC),
and published this well-cited paper:
[http://www.ncbi.nlm.nih.gov/pubmed/24345941](http://www.ncbi.nlm.nih.gov/pubmed/24345941)
which in my opinion put the last nail in the coffin of DOE-style physics
simulations for molecular dynamics.

That said, there are systems which are so large you can't practically simulate
a single instance of the system on a single machine, so you have to partition.
Simulating the ribosome is a nice example. However, simulating the ribosome
currently provides no valuable scientific data except to tell us that we have
major problems with our simulation systems (force field errors, missing QM,
electrostatic approximations, etc.).

~~~
rwallace
Interesting! Would it be accurate to say that as the amount of computing power
and memory per CPU has increased over the years, so also has the percentage of
scientific problems where a single simulation instance will fit on a single
CPU? Certainly if you can do so, it's more efficient (in both machine and
human resources) to partition by one job per CPU.

~~~
dekhn
Yes. For example, when I did my PhD work (~2001) with a T3E, I could run a
simulation of a duplex DNA in a box of water by running it in parallel. This
was true for both memory and CPU reasons. It limited me to studying a single
sequence at a time, or 2-3, which was the practical limit on the number of
concurrent jobs. This used the well-balanced design of the T3E, which had a
great MPI system.

Eventually it reached the point (~2007) where I could fit the whole simulation
on a single 4-core Intel box with similar performance. Then I ran one "task"
per machine and scaled to the number of available machines. This uses only
intra-node communication, which goes over a hub or crossbar on the
motherboard. Much faster.

Now, I can fit many copies of DNA on a single machine (one task per core).
This is far and away the best, because each processor just accesses its own
memory, greatly reducing motherboard traffic, so the problem is basically
CPU-bound instead of communication-bound (this also now applies to GPUs: a
single GPU can run one large simulation within its own RAM without having to
spill data back and forth over the CPU/GPU communication path).

This moves the challenge to the IO subsystem: I generate so much simulation
data that I need a fat MapReduce cluster to analyze the trajectories.

~~~
markhahn
none of this is news - what you're describing is really just strong scaling.
and sure, most systems already have subsets of nodes set aside for post-
simulation cleanup.

~~~
dekhn
Here is the news: the Jupiter paper is now published.
[http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183....](http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf)

I'm not just describing strong scaling. I'm describing a cost-effective way to
achieve it; that's what really matters.

Why have subsets of nodes for post-simulation cleanup? Why not just run that
cleanup on the same nodes you used for simulation? Or other general nodes?
Otherwise, you've got two sets of nodes which are used at lower utilization
than they would normally be.

------
bjacobel
> The supercomputer would be 20 times quicker than the current leading
> machine, which is in China.

So given Moore's Law, by the time it's finished in 2025, it will be 50 times
slower than 2025's fastest?

Yes, yes, Moore's Law is slowing, transistors on a chip =/= flops, etc. Still
seems like they'd want to aim higher than 20x in 10 years.
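
A rough sanity check, leaving the leader's doubling period as a free parameter
rather than a fact:

    for months_per_doubling in (12, 18, 24):
        leader_growth = 2 ** (120 / months_per_doubling)   # 10 years = 120 months
        trailing_factor = leader_growth / 20               # vs a machine 20x today's leader
        print(months_per_doubling, round(leader_growth), round(trailing_factor, 1))
    # A 12-month doubling gives ~1024x and the "~50x slower" above;
    # 18-24 month doubling gives only ~1.5-5x.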

~~~
unfamiliar
You're assuming the goal is to build the fastest supercomputer in the world. I
think the goal is probably closer to "get the computing resources we need at
the lowest cost".

~~~
mathetic
Well, the title literally says "world's fastest computer."

~~~
_delirium
Probably fluff added by the headline writer to sound more exciting. The White
House press release doesn't make world's-fastest claims:
[https://www.whitehouse.gov/blog/2015/07/29/advancing-us-
lead...](https://www.whitehouse.gov/blog/2015/07/29/advancing-us-leadership-
high-performance-computing)

------
codewithcheese
Can anyone provide back-of-the-napkin calculations on this proposed
supercomputer's computing power vs Google's compute farm?

~~~
graphene
You could compare raw FLOPS (Floating point operations per second) but that
would only tell part of the story. These supercomputers are highly engineered
for low network latency between nodes, which is necessary for many scientific
workloads. Google and other companies are generally able to express their
algorithms in highly parallel ways, which greatly reduces the need for
communication between nodes.

Therefore, even if the raw performance in terms of FLOPS sounds similar, the
two systems will have widely differing performance on real workloads.

~~~
oaktowner
Depends on what you mean by a "real" workload.

Capturing and indexing the entire web is certainly a real workload, even if it
is massively parallelizable, so it would probably run about as well on Google's
infrastructure as on a supercomputer, because those fast interconnects wouldn't
provide much advantage, right?

However, when simulating a nuclear explosion or a weather system (maybe that's
what you mean by "real" workloads?), the heavy node-to-node communication
makes the supercomputer much, much better suited.

------
tdicola
Don't we already have an exabyte-scale supercomputer in Utah run by the NSA?

~~~
striking
Yeah, but everyone knows that thing's doing illegal stuff (illegal now or
soon-to-be illegal), and Obama doesn't want that on his record.

~~~
acaloiar
Do you have a source for that information?

~~~
striking
[http://www.usnews.com/news/politics/articles/2015/07/27/nsa-...](http://www.usnews.com/news/politics/articles/2015/07/27/nsa-
will-stop-looking-at-old-phone-records)

The NSA's currently being sued over its metadata collection. I wrote that
comment slightly tongue-in-cheek, but repurposing that supercomputer for
civilian use would actually be a great way to recoup some of your losses.

------
angdis
I suppose that "supercomputers" are all multi-processor these days, so the
colossal FLOP numbers are counted as an aggregation over many processors and
one has to coordinate these processors in any application that takes advantage
of the FLOP specs.

Now I am curious what is the fastest single processor?

~~~
rayiner
On a per-core basis, probably Power 8:
[http://www.anandtech.com/show/9193/the-
xeon-e78800-v3-review...](http://www.anandtech.com/show/9193/the-
xeon-e78800-v3-review/7)

~~~
StillBored
If you're memory-bandwidth limited, then yes. Otherwise you're better off with
a decently clocked Intel part. Anything that spends a good portion of its time
running out of L2 or better will be significantly faster on the Xeon.

------
sadgit
Press: "What's it for?" Obama: "Uuuh... NASA."

~~~
jazzyk
I think he added an extra "A" there at the very last minute. :-)

~~~
agumonkey
Quite a large NAS, don't you think?

------
varelse
I'd really love to see speed measured by the performance of a single
collective computation of an O(n) or O(n log n) algorithm. This would
emphasize the importance of balancing communication performance with
computation. Not holding my breath, though; the LINPACK is strong with these
people...

------
mturk
Reading over the list of priorities in the PDF linked from the White House
blog, the one that I was most pleased about was improving HPC productivity.

~~~
onalark
Let's hope this is not just bluster :(

------
simi_
IMHO if you need to break down a task well enough to run on a supercomputer,
there isn't a lot more to do to make it run on a regular server farm.

edit: Actually, in the scenarios you'd use a supercomputer for, the added
latency and overhead (shoddy servers, network, etc.) would most likely make
the run time orders of magnitude higher.

~~~
apawloski
Maybe for embarrassingly parallel tasks, but if you require nontrivial
interprocess communication, a server farm can't compete with the interconnect
of a modern supercomputer.

------
bitwize
NSA director be making that smug frog face right about now

------
ytdht
Too bad for him that he didn't order it earlier; maybe he could have figured
out a way to stay for more than 8 years as President...

------
nickpsecurity
Step 1: Order exascale computer. Step 2: ??? Step 3: Profit.

The U.S. and other countries have been in a race for exascale. The thing
holding us back isn't funding or political will: exascale is so ridiculously
hard that it requires fundamentally different architectures. The main issues
are making our CPUs do more work, eliminating memory bottlenecks, and
dramatically improving the energy efficiency of both. These are very tough
technical challenges, which may also have to be solved on process nodes that
are themselves challenging.

REX Computing is one attempt whose founder posts here a lot [except in one
thread dedicated to it, lol]. I'm curious whether any other exascale
researchers read HN and can post their concepts, as it's probably interesting
stuff. Here are some links for readers interested in this area.

LLNL gives data on exascale and its challenges
[https://asc.llnl.gov/content/assets/docs/exascale-
white.pdf](https://asc.llnl.gov/content/assets/docs/exascale-white.pdf)

Also describes problems but skip to Venray's TOMI approach
[http://www.edn.com/design/systems-design/4368705/The-
future-...](http://www.edn.com/design/systems-design/4368705/The-future-of-
computers--Part-1-Multicore-and-the-Memory-Wall)

REX Computing's approach [http://www.theplatform.net/2015/03/12/the-little-
chip-that-c...](http://www.theplatform.net/2015/03/12/the-little-chip-that-
could-disrupt-exascale-computing/)

Intel's relatively conventional approach [http://www.exascale-computing.eu/wp-
content/uploads/2012/02/...](http://www.exascale-computing.eu/wp-
content/uploads/2012/02/Exascale_Onepager.pdf)

Architecture from Univ of Texas and NVIDIA
[https://www.cs.utexas.edu/users/skeckler/pubs/SC_2014_Exasca...](https://www.cs.utexas.edu/users/skeckler/pubs/SC_2014_Exascale.pdf)

Boise exploring non-von-Neumann architectures with ParalleX [http://cswarm.nd.edu/news-
events/assets/PSAAP_II_Kick-off_CS...](http://cswarm.nd.edu/news-
events/assets/PSAAP_II_Kick-off_CS.pdf)

The same group goes into the details that everyone is fighting with
[http://sites.ieee.org/boise-cs/files/2015/04/Thomas-
Sterling...](http://sites.ieee.org/boise-cs/files/2015/04/Thomas-Sterling-
Exascale-Computing-BSU-talk-20150409.pdf)

Bonus: a 1,000-core, cache-coherent design with an optical interconnect. The sort of thing that might
be useful in exascale. [http://dspace.mit.edu/openaccess-
disseminate/1721.1/67490](http://dspace.mit.edu/openaccess-
disseminate/1721.1/67490)

Have fun with these. Submit a link if I left out any cool chip architecture
in the exascale race.

~~~
RhysU
Step 1: Nuclear test ban treaty. Step 2: Avoid nuclear test disasters. Step 3:
Maintain military status quo.

Notice step 3 implies international stability and hence profit. NNSA stresses
computation so heavily because stockpile stewardship cannot be done by
noncomputational means.

~~~
nickpsecurity
Interesting point. Exascale is about a lot more than that, though: there are
many stakeholders. And even if there were none, it's still going to get funded
as another international pissing contest (see the Top 500). ;)

------
scottcanoni
I wonder how many bitcoins can be mined with it :)

------
RA_Fisher
Why not just use commodity spot instances?

~~~
chatman
Haha.. AWS?

~~~
RA_Fisher
Sure, I'm asking sincerely.

~~~
bstamour
One of the issues in supercomputing isn't just raw horsepower or more CPUs;
it's latency. It's not enough to just connect a ton of machines via Ethernet:
you need specialized hardware to provide low-latency, high-throughput sharing
of data.

~~~
RA_Fisher
Thanks for the insight!

------
neonbat
Thanks Obama (no sarcasm).

------
88e282102ae2e5b
It kills me every time I read about how the world's nth-fastest computer is
just used to simulate nuclear explosions, so it was a delight to see that
they're planning on using this one for some good.

~~~
graphene
I don't know if this is what you're referring to, but a common application for
government-owned supercomputers is simulating the degradation of nuclear
warheads. The degradation of the fissile material, as well as of its
surroundings, is highly critical to a nation's security, and also very hard to
model well.

Of course in an ideal world those cycles would be used to help cure cancer,
but given that these warheads exist, it's probably a good idea to invest
resources into getting an idea of what shape they're in.

~~~
twoodfin
The old way to do that was to blow one of them up every so often. The Nuclear
Test Ban Treaty put a stop to that.

------
chatman
I think one of the primary motivations for doing this is to break
cryptographic keys, and to use that capability for surveillance and to hack
into Chinese websites.

~~~
_nedR
Is this in the realm of feasibility for modern crypto? And what about
past/current transmissions using older crypto?

