
Inside the Titan Supercomputer: 299K AMD x86 Cores and 18.6K Nvidia GPU Cores - sciwiz
http://www.anandtech.com/show/6421/inside-the-titan-supercomputer-299k-amd-x86-cores-and-186k-nvidia-gpu-cores
======
hendzen
This is pretty awe-inspiring but as a programmer I know it would be fairly
difficult to use this machine for existing workloads because so much code
would have to be rewritten from typical x86 code to CUDA/OpenCL to use all
those GPUs.

Personally, I'm more excited for the next wave of supercomputers built with
racks of Xeon Phis [1].

[1] - [http://www.intel.com/content/www/us/en/high-performance-
comp...](http://www.intel.com/content/www/us/en/high-performance-
computing/xeon-phi-for-researchers-infographic.html.html)

~~~
tmurray
(full disclosure: used to work for NV on CUDA and did very extensive work on
Titan, so I am probably biased)

If you think your existing MPI app is going to automatically scale to a
heterogeneous architecture (high-power x86 on the main CPU, Xeon Phi cores on
the accelerator) and get acceptable performance, sorry, it's not going to
happen.

The fundamental constraints on 2012/2013 Xeon Phi performance that determine
how apps should be written are exactly the same as current desktop GPUs
(small, high-latency local memory that is not coherent with the rest of the
system; relatively slow, high-latency link to CPU; ugly interactions with
network cards in most environments; fundamental need to hide memory latency at
all times). For any sort of performance beyond a standard Xeon, you're going
to want to run a Xeon Phi as a targeted accelerator rather than offloading
entire processes to it and using a standard MPI stack. This means you're going
to be running in a hybrid host/device mode and using compiler directives or a
specific parallel language and API to deal with on-chip execution and data
transfer, which puts you in exactly the same solution space as with GPUs.
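
To make that concrete, here's a minimal sketch of the hybrid host/device
pattern I mean, written as plain CUDA (the kernel and names are made up for
illustration, not anything from Titan's codebase): allocate on the
accelerator, copy across the slow link, launch work on-chip, copy results
back.

    // Illustrative only: explicit device allocation, transfer, and launch.
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= a;              // each thread handles one element
    }

    void scale_on_device(float *host_x, float a, int n) {
        float *dev_x;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&dev_x, bytes);                                 // device-side memory
        cudaMemcpy(dev_x, host_x, bytes, cudaMemcpyHostToDevice);  // cross the slow link
        scale_kernel<<<(n + 255) / 256, 256>>>(dev_x, a, n);       // on-chip execution
        cudaMemcpy(host_x, dev_x, bytes, cudaMemcpyDeviceToHost);  // bring results back
        cudaFree(dev_x);
    }

Whether you write that by hand or let compiler directives generate it for
you, the data-movement structure is the same, and that's what the fast path
on a Xeon Phi ends up looking like too.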

In other words: the Phi of today is not a panacea. You get better tools and
more flexibility in terms of the programming model, but the fast path that any
of its intended market would actually use looks identical to the one for GPUs.

~~~
batgaijin
To my understanding, GPUs basically suck at anything with decision paths, i.e.
anything that moves away from straight matrix manipulation/signals analysis,
right?

~~~
vilya
GPUs are SIMD machines, so they're executing the same instruction
simultaneously on all the active cores. That means if you have code which
branches, it has to mask out the cores which follow branch B while it executes
branch A, and then mask out all the cores which follow branch A while it
executes branch B. In other words, if at least one core follows each side of
the branch, it has to execute both branches.

If all cores branch in the same direction, you don't get that penalty. A large
part of optimising for the GPU comes down to arranging your data and code so
that this can happen.
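
A rough sketch of what that looks like in practice (toy CUDA kernels I'm
making up here, not anything from Titan): in the first kernel, odd and even
threads in the same warp take different branches, so the warp executes both
paths with half its lanes masked each time; in the second, the branch splits
the data into contiguous halves, so whole warps go the same way and the
penalty mostly disappears.

    // Divergent: neighbouring threads in a warp take opposite branches.
    __global__ void divergent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i & 1)
            x[i] = x[i] * 2.0f;   // odd lanes run this while even lanes idle...
        else
            x[i] = x[i] + 1.0f;   // ...then even lanes run this while odd lanes idle
    }

    // Coherent: whole warps fall on one side of the branch.
    __global__ void coherent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i < n / 2)
            x[i] = x[i] * 2.0f;   // warps in the first half all take this path
        else
            x[i] = x[i] + 1.0f;   // warps in the second half all take this one
    }

The work per element is the same in both; only which threads land on which
side of the branch changes, and that's the kind of data/code rearrangement I
mean.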

------
caf
The hexadecimal numbers in the design on the front panels of the racks appear
to say in part:

    
    
      ...Computing Oak Ridge National Laboratory Le...
    

(not too surprising, I suppose ;)

------
asdfs
Title should be "nVidia GPUs", not "nVidia GPU cores".

------
paulsutter
Does anyone know why they have a separate disk I/O system when they could more
easily just plug drives into each node/motherboard for higher aggregate
throughput, less complexity, and a lower overall cost?

EDIT: Blade systems or no, the drives have to physically be placed somewhere.
Having a separate subsystem can only take up more space, not less. Two reasons
I can think of: (1) independent scaling of compute and storage, and (2) lack
of software for a distributed filesystem. Most likely (2) plus inertia is the
real reason; all the others seem like rationalizations. For example, they are
either able to take nodes offline or they aren't. The need exists whether or
not the disks are attached there.

~~~
rbabich
There are a couple of (related) reasons. The first thing to recognize is that
systems such as this one are designed for parallel workloads where all
processes are running in lockstep, communicating via MPI with frequent
barriers. This is very different from MapReduce and other asynchronous or
"embarrassingly parallel" workloads where GFS, HDFS, etc. tend to be used.
Distributed filesystems used in high-performance computing (such as Lustre,
IBM's GPFS, etc.) also have to be able to handle both reads and writes with
high throughput, whereas GFS is mostly optimized for reads and appends.

Why not just install disks in the compute nodes and run Lustre there? Since
all the nodes are working together in lockstep, system jitter is a major
problem. Imagine that you have a job running across 10,000 nodes and 160,000
cores, and a process on one of those cores gets preempted for a millisecond
while a disk I/O request is being serviced. Everyone waits, and you've
suddenly wasted 160 core-seconds. Now, if this happens only 1000 times per
second across the whole machine, that's 160,000 core-seconds of stall every
second, roughly the entire machine, so you're not going to make much forward
progress, and the whole system is going to run at very low efficiency. For
this reason, Crays and similar large machines run a very minimal OS on the
compute nodes (a Linux-based "compute node kernel" in the case of Cray).
Introducing local disks would go against the whole philosophy.

There's also the issue of network contention. The network is typically the
bottleneck, and you want to minimize the extent to which file I/O competes
with your MPI traffic.

As someone else mentioned, the solution is to have a dedicated storage system
(often Lustre running on a semi-segregated cluster). This approach is used
almost universally by the 500 systems on the Top 500 list
(<http://top500.org>), for example. It's not just inertia :-).

~~~
paulsutter
Disk I/O has negligible CPU overhead. Preempted for a millisecond? A
millisecond is millions of instructions. You're off by orders of magnitude. No
matter where the disk is located, the disk I/O has to go across the network. If
network capacity truly is the bottleneck, you have a different design problem
and you can't exploit the CPUs.

EDIT: I still don't buy it, but I will give some thought to the
synchronous/lockstep nature of the environment.

~~~
rbabich
That was a straw man (sorry). There's also the overhead of maintaining
consistency, synchronizing metadata, etc. I don't think assuming 0.1% CPU
overhead for Lustre is a terrible estimate, but even if it were much lower,
the argument would still hold (at least at the scale of Titan).

------
mtgx
How can Anandtech make such a big mistake? It's 46 million GPU cores; 18.6K is
the number of GPUs, not GPU cores.

------
Cogito
The full-page (print) version of the article is at
<http://www.anandtech.com/print/6421>

------
tjaerv
177 trillion transistors in total.

~~~
zspade
More transistors than there are synapses in the human brain (I'm aware they
are not a direct analogue).

