
Cray, AMD to Extend DOE’s Exascale Frontier - arcanus
https://www.hpcwire.com/2019/05/07/cray-amd-exascale-frontier-at-oak-ridge/
======
tntn
IMO the coolest thing about Summit/Sierra is that the GPUs and CPUs have a
fully coherent single address space with all memory available to the GPUs by
default, meaning that your stack- and malloc-allocated variables can be used
directly from the GPUs.

I wonder if that will be the case on Frontier.
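
For anyone who hasn't used it, here's a minimal sketch of what that coherence buys you on a Summit-style node (POWER9 + V100 over NVLink with address translation services). This is hypothetical example code, not anything from OLCF: the kernel dereferences an ordinary malloc'd pointer, with no cudaMalloc, cudaMemcpy, or cudaMallocManaged; on a typical PCIe-attached x86 box the same launch would fault.

    // Hypothetical sketch: host memory accessed directly from the GPU,
    // which works on Summit-class nodes where CPU and GPU share one
    // coherent address space. Compile with nvcc.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    __global__ void scale(double *x, int n, double a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;          // plain malloc'd host memory
    }

    int main() {
        const int n = 1 << 20;
        double *x = (double *)malloc(n * sizeof(double));   // ordinary malloc
        for (int i = 0; i < n; ++i) x[i] = 1.0;

        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);          // pointer passed as-is
        cudaDeviceSynchronize();

        printf("x[0] = %f\n", x[0]);                         // prints 2.000000
        free(x);
        return 0;
    }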

~~~
vvanders
That's basically the entire state of mobile (minus a few weird SoCs) and many
game consoles; it makes for a much more convenient development platform.

~~~
tntn
I'm not familiar with mobile / APUs, but I was under the impression that it
was still necessary to do clCreateBuffer (or similar). I could only find old
slides
([https://developer.amd.com/wordpress/media/2013/06/1004_final...](https://developer.amd.com/wordpress/media/2013/06/1004_final.pdf)),
though.

Would you mind pointing me in the right direction to learn how to do this
(something equivalent to slide 2 of [https://www.olcf.ornl.gov/wp-content/uploads/2018/12/summit_...](https://www.olcf.ornl.gov/wp-content/uploads/2018/12/summit_workshop_UVM.pdf),
but for mobile/APUs)?

~~~
vvanders
[https://www.khronos.org/registry/OpenGL/extensions/OES/OES_E...](https://www.khronos.org/registry/OpenGL/extensions/OES/OES_EGL_image_external.txt)
is the basic entry point on most of the OpenGL ES platforms; elsewhere you get
into platform-specific APIs.

Mostly my point was that unified system/GPU memory is pretty common outside of
the desktop space.

------
foobard
> In a media briefing ahead of today’s announcement at Oak Ridge, the partners
> revealed that Frontier will span more than 100 Shasta supercomputer
> cabinets, each supporting 300 kilowatts of computing.

So 30 megawatts of computing, plus cooling and other supporting services. How
do you power something like this? Does ORNL have their own power station
(given they have reactor(s) on site)? If power comes from an external station
do they coordinate with the station operator when bringing a system like this
online?

~~~
kincl
As has been noted in other comments, we do not have a power station at ORNL.
We buy power from TVA at about 5.5 cents per kilowatt-hour, which is in part
because of the lab's proximity to TVA power plants.

TVA recently completed a 210 MW substation on ORNL's campus to better serve
our needs. We do not need to coordinate with them for large runs on the
machines.

~~~
W-Stool
With that much gear and those kind of loads do you still have a traditional
UPS / transfer switch / genset arrangement for everything in the room? If not,
how do you manage short duration power outages?

~~~
kincl
Yep, we have battery-backed generators for UPS and a transfer switch at the
480-V feed that comes into the room, but it is not enough to power the compute
nodes. The UPS allows the cluster management nodes and the parallel filesystem
(which is a small cluster by itself) to ride through full outages and other
power quality events (PQE).

------
cr0sh
So - on a more "applies to ordinary mortals" level - the fact that they are
going to use all AMD components is intriguing.

In reference to AI, NVidia has things "locked up" with CUDA, versus 2nd cousin
AMD's OpenCL.

From what I understand, it is possible to recompile TensorFlow (for instance -
not that ORNL will be using TF) for OpenCL - but I don't know how well it
works. Personally, I've only used TF with CUDA.

Does this mean we might see greater/better support for OpenCL in the AI realm?
Might we see it become on par with CUDA because of this collaboration on this
HPC system?

Or will things stay as-is, at least "down here" in the consumer/business realm
of AI hardware and applications? Do things like this trickle down, or are
things so customized and/or proprietary for the needs of HPC at ORNL (or
elsewhere) that anything to do with AI on this machine will have little to no
bearing outside of the lab?

Ultimately, I'd just like to see another choice (a lower-cost choice!) for GPUs
in the world of consumer/enthusiast/hobbyist AI/DL/ML. While today's higher-end
GPUs, no matter the manufacturer, tend to be fairly expensive, AMD still has an
edge here that makes them attractive to users (not to mention that their Linux
drivers are open source, which is also a plus).

~~~
petschge
I doubt they are going to run much AI on that machine. The national labs
mostly run "traditional HPC" workloads such as fluid codes that simulate
(magneto)hydrodynamics in one way or another.

~~~
timClicks
The DOE is responsible for the USA's nuclear arsenal, so I expect a few
simulations of that nature.

~~~
dekhn
This system is mostly for non-classified work; it's not clear just how much
stockpile stewardship will occur on it.

~~~
kincl
Yep, actually all of our user projects are unclassified at the OLCF.

------
arcanus
"greater than 1.5 exaflops" of performance will likely correspond to greater
than 1 Exaflop of sustained performance on HPL (used for the top-500 ranking),
making this a likely candidate for the first 'true' exascale computer.

~~~
shifto
But will it run Crysis?

~~~
azhenley
This is just a few miles down the road from me. I can try to see if they will
let me run it...

~~~
kincl
The call for proposals for INCITE (one of the programs we provide cycles for)
is open for 2020, but this would be for Summit, not Frontier, this time around :)

[http://www.doeleadershipcomputing.org/proposal/call-for-prop...](http://www.doeleadershipcomputing.org/proposal/call-for-proposals/)

------
BooneJS
Looks like Frontier will use Cray’s Slingshot network.
[https://www.cray.com/products/computing/slingshot](https://www.cray.com/products/computing/slingshot)

[https://www.anandtech.com/show/14302/us-dept-of-energy-annou...](https://www.anandtech.com/show/14302/us-dept-of-energy-announces-frontier-supercomputer-cray-and-amd-1-5-exaflops)

------
berbec
What are the advantages of Infinity Fabric over PCIe 4 for CPU/GPU?

What interconnects do these sorts of machines use? I assume even 100GbE isn't
enough?

Just curious. It's interesting what exists in the "so far beyond my price
range as to be ludicrous" category.

~~~
Symmetry
PCIe provides communication but isn't intended to provide memory coherency.
There's a lot of work that goes into tracking which cache(s) have a copy of
which cache line and resolving conflicting access needs.

------
ksec
I wonder if this will help the adoption of ROCm. It seems like everything I
read has settled on CUDA.
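
For what it's worth, ROCm's porting path is HIP, which mirrors the CUDA runtime API closely enough that a lot of code moves over with a rename pass (the hipify tools automate it). A rough sketch of the mapping, using an illustrative CUDA snippet with the HIP counterparts noted in comments:

    // Illustrative CUDA snippet; under ROCm the same structure compiles
    // with hipcc once the runtime calls are renamed as noted below.
    #include <cuda_runtime.h>   // HIP: hip/hip_runtime.h
    #include <cstdio>

    __global__ void scale(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // same built-ins in HIP
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1024;
        float h_x[n];
        for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));                  // HIP: hipMalloc
        cudaMemcpy(d_x, h_x, n * sizeof(float),
                   cudaMemcpyHostToDevice);                   // HIP: hipMemcpy
        scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);        // same launch syntax under hipcc
        cudaMemcpy(h_x, d_x, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_x);                                        // HIP: hipFree
        printf("h_x[0] = %f\n", h_x[0]);                      // prints 2.000000
        return 0;
    }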

------
gok
1.5 exaflops, 30 megawatts, around 50 GFLOPS per watt? Impressive if true;
that's 3x more efficient than the current top of the Green500.

~~~
tntn
It's probably not a great comparison to put the theoretical numbers for
Frontier against the achieved numbers on the Green500. Achieved FLOPS are
pretty much always considerably lower than theoretical FLOPS: Titan is a
27 PFLOPS machine that achieves 17.6 PFLOPS, Sequoia is a 20 PFLOPS machine
that achieves 17, Summit is a 200 PFLOPS machine that achieves 143, ...

~37 GFLOPS/W is probably a better projection if we assume (out of nowhere)
that the theoretical/achieved ratio for Frontier is comparable to Summit's (75%).
Still very impressive.
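
A quick back-of-the-envelope check of those numbers (the 75% ratio is just the Summit-like assumption above, not anything published about Frontier):

    // Rough arithmetic behind the projection above (plain C++ host code).
    #include <cstdio>

    int main() {
        double peak_flops = 1.5e18;   // "greater than 1.5 exaflops" peak
        double power_w    = 30e6;     // ~30 MW
        double hpl_ratio  = 0.75;     // assumed Summit-like achieved/theoretical ratio
        printf("peak:      %.1f GFLOPS/W\n", peak_flops / power_w / 1e9);              // ~50
        printf("projected: %.1f GFLOPS/W\n", peak_flops * hpl_ratio / power_w / 1e9);  // ~37.5
        return 0;
    }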

~~~
gok
Well, they don't usually hit their peak power usage either; Summit is rated at
13 MW but used less than 10 during its LINPACK run. But fair point.

