I worked with a Cray back in 2012 and the amount of engineering that goes into each rack and down to each blade is incredible. The biggest thing that got me was the geometry of the heatsinks that were designed to slightly compress air as it left each blade into the one above it.
The average Google/AWS datacenter doesn't have the networking to support the kinds of high-adjacency work that supercomputers do. The biggest ones may be able to score higher on Linpack if they tried, but at much higher power.
What kinds of problems is this used to solve? Is it a different class of problems that couldn’t be solved with an equivalent number of commodity servers?
These essentially are commodity servers that have been hyper-optimized for maximum bandwidth and minimum latency. Many types of large-scale computation are effectively bandwidth limited.
This implies needing to pack all of that compute as physically close together as possible to minimize latency from speed-of-light limits, which becomes substantial in massive compute clusters. It takes a long time for light to get from one side of the cluster to the other, from the perspective of a computer.
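As a rough back-of-envelope illustration (the distance, fiber factor, and clock speed below are assumptions picked for the example, not measurements from any particular machine):

    # Back-of-envelope: signal propagation delay across a machine room.
    # All numbers below are assumed, purely for illustration.
    C = 299_792_458           # speed of light in vacuum, m/s
    fiber_fraction = 0.67     # signals in fiber/copper travel at roughly 2/3 c
    distance_m = 60           # assumed cable run from one end of the room to the other
    clock_hz = 3e9            # assumed ~3 GHz core clock

    one_way_s = distance_m / (C * fiber_fraction)
    print(f"one-way propagation: {one_way_s * 1e9:.0f} ns")       # roughly 300 ns
    print(f"core cycles spent waiting: {one_way_s * clock_hz:.0f}")  # roughly 900 cycles

And that is pure propagation time, before any switch hops or software overhead are added on top.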
This, in turn, implies needing sophisticated custom cooling since the power density is very high due to packing so much silicon into such a small footprint.
Another issue is that systems this large are at high risk of producing erroneous results due to the quantity of silicon involved creating a large attack surface for bit flips. It is very expensive to run a large supercomputing job only to find out that the output is incorrect and you'll need to start over, so aggressive mitigation, detection, and correction of sporadic data and compute errors is important.
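To give a feel for the scale effect, here is a toy failure-rate calculation; the FIT rate and component count are made-up placeholders for illustration, not figures for any real machine:

    # Toy illustration of why rare per-part errors become routine at system scale.
    # Both numbers below are assumed placeholders, not vendor or Frontier figures.
    fit_per_part = 50                 # assumed failures per 10^9 device-hours, per part
    parts_in_system = 60_000 * 10     # assumed: tens of thousands of nodes x ~10 parts each

    system_fit = fit_per_part * parts_in_system     # failures per 10^9 system-hours
    hours_between_events = 1e9 / system_fit
    print(f"expected time between events: {hours_between_events:.0f} hours")
    # ~33 hours with these assumptions, i.e. a week-long job should plan on
    # several such events -- hence ECC, checksums, and checkpoint/restart.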
You could run all of these codes on a stack of beige boxes. It would take so long for the computation to complete that it would be extraordinarily expensive in both time and money. These systems focus on codes that would be effectively computationally intractable if you tried to run them in, say, AWS.
Large-scale problems that look like computational fluid dynamics, graph analysis, spatiotemporal analysis, various kinds of computational science and physics models, etc. Just about anything that involves modeling and analyzing behaviors in the physical world. All of these involve high-bandwidth parallel orchestration at the data structures and algorithms level. The quality of the network is everything if your code is competently parallel. They don't run these things on supercomputers for fun; it would be much easier if you could run these codes on AWS.
The cloud has many advantages, but high-quality inter-node bandwidth and topology isn’t one of them. In HPC, the network is the most important part of the system.
To expand a bit, I did R&D for a supercomputing company on software latency-hiding many years ago. The idea was that we could run many of these codes on commodity hardware by taking ideas from exotic latency-hiding silicon, which had no direct software analogues, and inventing something similar in software. This was surprisingly successful!
Unfortunately, this had two practical problems. First, it turns out that developers are quite poor at reasoning about control flow in latency-hiding architectures generally; it is analogous to reasoning about very complex lock graphs, but worse. Second, for typical HPC problems you still need prodigious quantities of high-quality network bandwidth and topology, even if latency matters much less. At which point you are halfway to a traditional HPC network anyway.
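For the curious, the flavor of software latency hiding I mean is roughly "issue the next remote request before you need it, and compute on data you already have while it's in flight." A minimal, hypothetical sketch; fetch_block here is a stand-in for whatever remote access a real runtime would provide (RDMA get, nonblocking receive, etc.):

    # Minimal sketch of software latency hiding via double buffering.
    # fetch_block() is a hypothetical stand-in for a remote read.
    from concurrent.futures import ThreadPoolExecutor
    import time

    def fetch_block(i):
        time.sleep(0.01)              # pretend this is network latency
        return list(range(i * 4, i * 4 + 4))

    def process(block):
        return sum(block)             # pretend this is useful compute

    def run(n_blocks):
        total = 0
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(fetch_block, 0)              # prefetch block 0
            for i in range(n_blocks):
                block = pending.result()                       # blocks only if prefetch isn't done
                if i + 1 < n_blocks:
                    pending = pool.submit(fetch_block, i + 1)  # overlap next fetch with compute
                total += process(block)
        return total

    print(run(8))

The hard part in practice isn't this toy loop; it's reasoning about correctness when dozens of such in-flight operations interact, which is the control-flow problem mentioned above.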
There is still a space for this research in non-HPC applications (like join parallelization in databases), but for HPC the cost-benefit ratio pushes everyone to purpose-built networks: less of a learning curve for the devs, and you get the exceptional bandwidth and topology the codes need anyway.
Totally agree with your last point. But based on my understanding, over the past 5-ish years commodity multicores with multi-terabyte memories have made a big dent in the supremacy of supercomputing, at least for some of the topics you mention (thinking about graph analysis in particular).
Graph analysis was one of my core research areas in my supercomputing days. You can do it on commodity hardware with some caveats. This problem is intrinsically cache-unfriendly, which makes it interesting but also slow even when it fits in memory on a single machine. We made a lot of progress on approaching the theoretical throughput bound on real hardware and even made it work reasonably well when storage-backed, with some caveats.
That said, multi-terabyte memories won't solve the interesting problems; we already had that. When I was working on this 15-ish years ago, real-world data models already had trillions of vertices, never mind edges, and that has only gotten larger with time. A lot of the research ended up focusing on how to boil the ocean selectively and incrementally to optimize throughput.
There is no way to trivially throw hardware at the problem; graph-cutting is hard, and you have to do it even within single servers. Even with sophisticated latency-hiding, it ends up being about effective bandwidth in a context where caches are almost useless.
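A quick way to see the "caches are almost useless" effect on an ordinary machine is to compare a sequential sweep with a random gather over the same array, which is roughly what chasing edges looks like. The sizes here are arbitrary and the exact ratio will vary by machine; this is only meant to show the shape of the problem:

    # Illustrative micro-benchmark: sequential sweep vs. random gather.
    # Edge traversal in graph codes looks much more like the second pattern.
    import time
    import numpy as np

    n = 20_000_000                          # ~160 MB of float64, well past cache
    data = np.ones(n)
    idx = np.random.randint(0, n, size=n)   # random "neighbor" indices

    t0 = time.perf_counter()
    s1 = data.sum()                         # streaming, prefetch-friendly
    t1 = time.perf_counter()
    s2 = data[idx].sum()                    # gather, cache- and TLB-hostile
    t2 = time.perf_counter()

    print(f"sequential: {t1 - t0:.3f} s, random gather: {t2 - t1:.3f} s")

The random gather is limited by effective memory (or network) bandwidth rather than compute, which is why the interconnect and memory system dominate graph workloads.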
For graph analysis specifically, we could do a lot with big servers, this is true. But it would require a completely different software architecture from the way most graph analysis is done now. This is perpetually on my "copious spare time" list of projects because there is a big gap here.
Makes sense. I hope you keep working on this. There is a big gap between what people are doing in the academic literature (and even the publicly described industrial literature) and what you describe. So I think there is a lot of opportunity to push on this front if you have some ideas.
Every time a commodity multicore machine gets better, the supercomputer folks just switch to a larger problem that wouldn't fit on a single machine. Their goal is to engineer a system/build a code that reaches peak performance, limited primarily by the physical constraints of the biggest systems.
You can find the list of 2023 project allocations from the DOE INCITE program here [1]. INCITE allocates the majority of compute hours at the Leadership Computing Facilities, which include the systems at Argonne and Oak Ridge.
Most of the top supercomputers are used to model nuclear weapons, weather and climate, and various sciencey things.
I don't know if the top ones are still used for oil and gas exploration (crunching data to provide higher resolution and higher accuracy oil field maps), but they have been in the past.
They are commodity chips. But the interconnect is extremely beefed up.
The bandwidth and communications are always huge on these supercomputers. That's really the difference between a typical cluster and a supercomputer... The interconnect.
Generally, supercomputers are used for problems where you need fast access to all (or a lot) of a large dataset. I think a common example is weather simulation, where each grid cell needs to update based on its neighbors at each timestep.
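As a cartoon of that dependency structure (not a real weather code), each cell's next value depends on its neighbors, so when the grid is split across nodes the boundary ("halo") rows and columns have to be exchanged every single timestep, which is exactly what stresses the interconnect:

    # Cartoon of a neighbor-dependent update (a 2D diffusion-style stencil).
    # In a distributed run the grid is split across nodes, and each node must
    # exchange its boundary rows/columns with its neighbors every step.
    import numpy as np

    grid = np.random.rand(512, 512)

    def step(u):
        new = u.copy()
        # each interior cell moves toward the average of its 4 neighbors
        new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                  u[1:-1, :-2] + u[1:-1, 2:])
        return new

    for _ in range(100):
        grid = step(grid)
    print(grid.mean())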
From what I've heard, it's the high-speed, high-throughput, low-latency interconnect that's the differentiator compared to thousands of commodity machines. (Weather and bomb explosions being two big ones that are hard to split across loosely coupled machines, as the results depend on each other.)
Depends: if your application strong-scales well, then just throw more hardware at it, whichever way you like.
Tightly coupled HPC systems aim to improve execution speed (or application resolution) for weak-scaling applications.
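Roughly, the usual definitions: strong scaling means a fixed-size problem gets faster as you add processors; weak scaling means the problem grows with the processor count and you try to hold runtime constant. A tiny illustration of the two limits, with an assumed serial/communication fraction of 5% picked only for the example:

    # Amdahl's law: speedup for a FIXED problem size with serial fraction s.
    def amdahl_speedup(s, p):
        return 1.0 / (s + (1.0 - s) / p)

    # Gustafson's law: scaled speedup when the problem GROWS with p (weak scaling).
    def gustafson_speedup(s, p):
        return p - s * (p - 1)

    s = 0.05  # assumed serial fraction, illustrative only
    for p in (8, 64, 512, 4096):
        print(p, round(amdahl_speedup(s, p), 1), round(gustafson_speedup(s, p), 1))

In practice, much of that "serial fraction" is really communication, which is a big part of why the tightly coupled machines put so much into the interconnect.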
Politely, this computer exists because the DoE performs stockpile stewardship.
Less politely, it exists because of The Bombs. Everything else it will do is a hobby. It is specifically designed for whatever compute needs their classified program has.
All of this capital is being spent because the test ban treaty prohibits the detonation of these weapons, but the US stockpile and its efficacy must be maintained and new weapons must be designed. So how do you know if an old weapon or a new design will work or not?
Some part of the answer is real data + simulations. Lots and lots of simulations. It's why Frontier exists and why the DoE has pursued exascale compute for quite some time.
(As for the real data part, it's why places like the National Ignition Facility exist, and why there's an active experimental program that studies dummy warheads in unexpected ways, https://www.youtube.com/watch?v=FYdAT0v4DHs )
———
Addendum, re: comments arguing that the NNSA computers are separate etc.: the points raised are both true in the specific sense, but not true holistically. (The relevant quotes from the Stockpile Stewardship and Management Plan, the budget request, and the El Capitan press release are reproduced in my reply further down.)
It does seem to me that it's an accurate assessment to say that this project exists because of the nuclear weapons research mandate.
They're the same computer design, replicated. Are they using these to test/mature their software architecture? Hoping that many eyes will make bugs shallow? Or, is it something else?
Perhaps it's seeing patterns where there are none, but for me the link between these projects and stockpile stewardship seems to be undeniable.
This is really not true; the stockpile stewardship thing is a separate part of DoE. Frontier does no nuclear weapons-related simulations at all, and it does zero classified work (the space it's in and the network it's on can't support it). Frontier is things like materials research, nuclear reactor simulations (for power generation), machine learning, etc... whatever people propose to do on it. Compute time on Frontier is zero cost and open to almost any US researcher, though it's extremely competitive, so actually getting any time isn't easy. You basically have to make a really compelling case that your proposal is a) exceptionally valuable and b) not feasible on any more modest system.
DoE is building (and has built) similar leadership-class computers for stockpile stewardship but it's a completely separate program that actually usually lags behind the open science machines they build. El Capitan is the system they're building at Livermore for stockpile work.
I am unsure how tangled the relationships are; what you're saying can be strictly true, but I doubt that this has nothing to do with weapons-related simulation.
Here's the most recent report to congress, published in March 2022 about the Stockpile Stewardship and Management Plan,
Near-Term and Out-Year Mission Goals:
◼ Advance the innovative experimental platforms, diagnostic equipment, and computational capabilities necessary to ensure stockpile safety, security, reliability, and effectiveness:
– Achieve exascale computing by delivering an exascale-capable machine and modernizing the nuclear weapons code base
– Develop an operational enhanced capability (advanced radiography and reactivity measurements) for subcritical experiments
– Quantify the effects of plutonium aging on weapon performance over time
– Assure an enduring, trusted supply of strategic radiation-hardened microsystems
and,
The weapons comprising the U.S. nuclear stockpile are assessed to be safe, reliable, effective, and secure. DOE/NNSA’s scientific infrastructure is currently adequate to support stockpile actions. The DOE/NNSA plans to address near-term gaps in required scientific capabilities by deploying the DOE/NNSA’s first exascale computing platform in fiscal year (FY)2023 and by improving capabilities to conduct subcritical experiments through the Enhanced Capabilities for Subcritical Experiments project by FY 2026.
It seems they're building three of these computers (Frontier is one of them), based on the budget requests and a press release,
The budget request for Advanced Simulation and Computing increased to support pursuing new validated integrated design codes and advanced high-performance computing capabilities, including the El Capitan exascale system procurement.
The press release,
Featuring advanced capabilities for modeling, simulation and artificial intelligence (AI), based on Cray’s new Shasta architecture, El Capitan is projected to run national nuclear security applications at more than 50 times the speed of LLNL’s Sequoia system. Depending on the application, El Capitan will run roughly 10 times faster on average than LLNL’s Sierra system, currently the world’s second most powerful supercomputer at 125 petaflops of peak performance. Projected to be at least four times more energy efficient than Sierra, El Capitan is expected to go into production by late 2023, servicing the needs of NNSA’s Tri-Laboratory community: Lawrence Livermore National Laboratory, Los Alamos National Laboratory and Sandia National Laboratories
El Capitan will be DOE’s third exascale-class supercomputer, following Argonne National Laboratory’s "Aurora" and Oak Ridge National Laboratory’s "Frontier" system. All three DOE exascale supercomputers will be built by Cray utilizing their Shasta architecture, Slingshot interconnect and new software platform.
> It does seem to me that it's an accurate assessment to say that this project exists because of the nuclear weapons research mandate.
It's not an accurate assessment, in that it's an over-simplification. DoE is a big entity. DoE does weapons research. DoE also does general science research. They are separate responsibilities and have to be, because weapons research is classified and is legally required to be kept separate. The funding pathway that pays for the computing needs of each mission likewise is legally separate and jealously guarded from each other by program managers. The computers are housed, as noted by your quoted press release, in separate laboratories. Those laboratories are run independently of each other, and have distinct responsibilities. Frontier cannot run weapons-related simulations because it's not approved for classified work, which has some really strenuous requirements that the government takes extremely seriously.
It is true to say DoE exists because of nuclear weapons research; that's a big part of their role and history. The distinction is that it's not their only role, and their general science mission is not purely in service of the weapons program, legally or practically. It's true to say that the programs coexist and operate in parallel, but they are separate, with their own goals. You're citing the goals of the NNSA computing program, which absolutely are weapons-related, but the Office of Science that funds and operates Frontier, Aurora, etc. is a separate entity with independent funding and governance within DoE.
The best analogy I can think of is that it's like saying PowerPoint wouldn't exist without Word. Taken very literally there is an element of truth, but it's a gross oversimplification of the actual situation and history.
Note that DoE publicly releases almost all of its unclassified publications on osti.gov for free, including publications using Frontier and the other open supercomputers. OLCF also specifically aggregates publications from researchers using their machines if you wanna get an idea of what is done with them https://www.olcf.ornl.gov/publications/ . You should be able to get the full texts on OSTI.
I appreciate your points, and objectively you are right. I think the place where we are diverging is in how we slice the history/pie. At this point, at least for me, our discussion is becoming extremely interesting and I'm thinking/learning more in the process.
Would you mind humoring me?
—
For me, this is the fruit of a tree and a part of a long chain of causality. We start with von Neumann fiddling with machines at Los Alamos, wind up at the Accelerated Strategic Computing Initiative from the 90s https://www.ncbi.nlm.nih.gov/books/NBK44974/ and then fast forward to the exascale program in the early-to-mid 2010s.
I am not a professional, but this is an area of interest for me, and I've been tracking it for some time. In the earlier communications, ORNL etc. seemed to be fairly clear on who was cutting the checks. From 2013,
The Department of Energy’s (DOE) Office of Science and the National Nuclear Security Administration (NNSA) have awarded $25.4 million in research and development contracts to five leading companies in high-performance computing (HPC) to accelerate the development of next-generation supercomputers.
Under DOE’s new DesignForward initiative, AMD, Cray, IBM, Intel Federal and NVIDIA will work to advance extreme-scale, on the path to exascale, computing technology that is vital to national security, scientific research, energy security and the nation's economic competitiveness.
“Exascale computing is key to NNSA’s capability of ensuring the safety and security of our nuclear stockpile without returning to underground testing,” said Robert Meisner, director of the NNSA Office of Advanced Simulation and Computing program. “The resulting simulation capabilities will also serve as valuable tools to address nonproliferation and counterterrorism issues, as well as informing other national security decisions.”
The program was funded fully in 2016 after many years of partial funding. From ORNL's report,
The mission of the Exascale Computing Project (ECP) is the accelerated delivery of a capable exascale computing ecosystem to provide breakthrough solutions addressing our most critical challenges in scientific discovery, energy assurance, economic competitiveness, and national security.
As a multi-lab effort, sponsored by the DOE’s Office of Science and National Nuclear Security Administration, the ECP is chartered with the following tasks:
Developing exascale-ready applications and solutions that address currently intractable problems of strategic importance and national interest.
Creating and deploying an expanded and vertically integrated software stack on DOE HPC pre-exascale and exascale systems.
Delivering US HPC vendor technology advances and deploying ECP products to DOE HPC pre-exascale and exascale systems.
Delivering exascale simulation and data science innovations and solutions to national problems that enhance US economic competitiveness, change our quality of life, and strengthen our national security.
I think the other applications are far more important than the weapons research, fwiw. Just as ARPANET and the internet it spawned ended up being far more valuable than their initial strategic use case. Even if the important stuff is a side project, just like fusion as an eventual source of energy seems to be for NIF, it's great that it's happening at all.
NNSA does pay into the broader ECP (which refers to the whole supercomputer program not just Frontier) but they aren't the only contributor. Like I said, DoE does build supercomputers for stockpile maintenance missions. Those computers will be built with basically all NNSA money and probably little to no Office of Science money. AFAIK some amount of NNSA money went into building Frontier specifically but it was earmarked under their non-stockpile missions. NNSA doesn't only design/build/maintain weapons they also are responsible for nuclear security missions. For example, they provide the R&D that backs US support to the IAEA. They are funding a limited amount of unclassified, non-weapons work being done on Frontier in support of their other missions. The key point is that stockpile maintenance is firewalled, both in terms of funding and operations, from the other stuff and they are not the only contributor.
DoE has always been a "big tent" of separate missions and I totally understand how it gets confusing. Its predecessor the AEC also had a weird mix of practically orthogonal goals. The non-weapons and non-nuclear missions have gotten a lot bigger in the last 30-40 years and are more or less operated independently of each other. NNSA is also functionally an independent agency within DoE, but it's important to note that they do fund pure science work too, with non-weapons related goals.
Believe you're thinking of the HPC facilities at LLNL and LANL. Frontier is run under DOE ASCR as one of the Leadership Computing Facilities (LCFs), which focus mainly on open science use cases. Most of the compute hours at the LCFs are allocated via the DOE INCITE and ALCC programs.
With AWS's Elastic Fabric Adapter (EFA) providing a fast interconnect, there's not much practical difference, assuming you can purchase the total amount of compute.