TOP-500 » June 2022 : https://www.top500.org/lists/top500/2022/06/
HN discussion of TOP-500 » 67 days ago » 71 comments : https://news.ycombinator.com/item?id=32002830
Aurora exascale Supercomputer – Planned to be completed in late 2022 : https://en.m.wikipedia.org/wiki/Aurora_(supercomputer)
There are so many buzzwords you would think they were shilling crypto.
Science at ORNL : https://www.olcf.ornl.gov/leadership-science/
OLCF researchers win R&D 100 award
Team honored for work on Flash-X software simulation package : https://www.olcf.ornl.gov/2022/09/08/olcf-research-team-wins...
Not much! These devices are vanity projects and prey upon people's intellectual blindness in the face of giant numbers.
And while some of the time the entire cluster will be given to a single large scale project, most of the time it will be acting as a massive GPU farm for all sorts of research. A win-win for everyone.
These computer platforms are drastically inefficient on a flops/$ basis. They exist to funnel money into the pockets of the companies that assemble them. They never achieve more than a tiny fraction of their peak rated flops on any calculation with scientific meaning.
Meanwhile, Defense and closed-science systems of similar scale continue to be used at very good efficiency on problems that are strictly non-feasible on smaller clusters. The leadership-class systems are prestigious, and that prestige helps drive needed technological advances, even if the places that need them aren't in the university system.
Asking as a private user of commercially non-trivial compute, but one very short on the research depth required to translate optimal thinking into efficiency.
Edit: we're similarly bound by, e.g., PDE solutions. We've found the greatest improvements in rolling our own storage: not purely capex improvements, but orders of magnitude in ingest.
(While I am picking on Electron, the truth is, if compute power exists, it seems devs never need to rein in code or worry about efficiency.
Where I work, we purposefully set devs loose in VMs with minimal RAM and minimal CPU. If your app can't work with small RAM and a small CPU, how on Earth will it scale to 100s of requests per second? Compute costs.)
Because these supercomputers also need communication networks so that their nodes can actually work on such a large problem together. Per unit of work, any "large computer" is going to be slower than a "small personal computer", because communication costs grow with the size of the machine.
A computer with a million cores needs more than a million times the communication of a single-core computer. That's just complexity theory, Amdahl's law, and other such fundamental limits of parallel computing at work.
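The point about Amdahl's law can be made concrete with a toy calculation: even when 99% of the work parallelizes perfectly, the remaining serial fraction (sync, communication) caps the speedup no matter how many cores you add. A minimal sketch:

```python
# Toy illustration of Amdahl's law: a small serial (communication)
# fraction caps the speedup of an arbitrarily large machine.
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when `parallel_fraction` of the work scales
    perfectly across `cores` and the rest stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

if __name__ == "__main__":
    for cores in (8, 1_000, 1_000_000):
        # 99% parallel sounds great, but the 1% serial part dominates at scale
        print(f"{cores:>9} cores -> {amdahl_speedup(0.99, cores):6.1f}x speedup")
```

At 99% parallel, a million cores still tops out at roughly 100x, which is why the interconnect (the part that shrinks the effective serial fraction) is worth paying for.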
But it is *impossible* for the "small personal computer" to work on a larger problem. The small computer doesn't have enough RAM to even hold a problem these supercomputers work on, let alone the time or energy needed to finish it within two months.
At a minimum, supercomputers are needed to solve and verify the models of the next generation of computers. It's not like chips with 8 billion transistors in them are correct on their first design. The design is iterated upon, simulated, and verified before hardware is made. These simulation steps happen on a computer, and a rather large one at that.
If they wanted a GPU farm they could probably have built it for 1/10 the cost or less by throwing out the interconnect and infrastructure that makes it a “supercomputer”.
It's fun to have big toys, sure, and it's fun to make big GPU clusters. But if they spent what they spend on this computer on just funding students to solve problems and tossed each a 3090, 100000000x more scientific breakthroughs would happen. This machine is 60% paperweight, at best, with a hefty budget for good old-fashioned contract pork.
If the people in charge of this thing had an epiphany (or a blackout, take your pick) and left you with the keys for a year, what could you do with this inefficient but impressively large cluster, besides anchoring your paperwork?
The scientific codes that parallelize poorly often do so because they are written in ancient languages, with support for whatever supercomputer interconnect bolted on as an afterthought, whereas TPUs + JAX have beautiful functional abstractions for distributed tensor computations.
Just funding rewrites of the basic math/physics stack into a language with a PORTABLE parallel functional design, and perhaps a compilation layer, would definitely get more basic science done than this thing.
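To illustrate what that "functional abstraction" buys you: JAX expresses a distributed computation as a pure function mapped over data shards (`pmap`) whose partial results are combined by a collective (`psum`), with no interconnect plumbing in user code. Below is a dependency-free, plain-Python stand-in for that shard → map → reduce pattern; the function names are only analogies, not JAX's API, and real JAX would JIT-compile this and run it across accelerators:

```python
from functools import reduce

def shard(xs, n_devices):
    """Split a list into n_devices equal shards (toy stand-in for
    distributing data across devices; assumes len divides evenly)."""
    k = len(xs) // n_devices
    return [xs[i * k:(i + 1) * k] for i in range(n_devices)]

def local_dot(a_shard, b_shard):
    # Pure per-device work: partial dot product over one shard.
    return sum(x * y for x, y in zip(a_shard, b_shard))

def distributed_dot(a, b, n_devices=4):
    partials = map(local_dot, shard(a, n_devices), shard(b, n_devices))  # like pmap
    return reduce(lambda s, p: s + p, partials, 0)                       # like psum

print(distributed_dot(list(range(8)), [1] * 8))  # same answer as a plain dot product: 28
```

Because `local_dot` is pure, the same program runs on 1 shard or 10,000; that portability is what the comment argues the legacy Fortran+MPI stack lacks.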
> (at 1000x the cost)
Not as true. Summit, the previous machine, hit 148 petaflops (Rmax) at a cost of $325 million. Frontier has already hit 1102 petaflops (Rmax) at a cost of just under $600 million.
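The flops-per-dollar arithmetic behind that reply, using only the figures quoted above (Summit: 148 PF Rmax at $325M; Frontier: 1102 PF Rmax at just under $600M, rounded to $600M):

```python
# Rmax-per-dollar comparison using the figures quoted in the thread.
summit_pflops, summit_cost_m = 148, 325        # petaflops, $ millions
frontier_pflops, frontier_cost_m = 1102, 600   # "just under $600M" rounded up

summit_eff = summit_pflops / summit_cost_m         # ~0.46 PF per $M
frontier_eff = frontier_pflops / frontier_cost_m   # ~1.84 PF per $M

print(f"Summit:   {summit_eff:.2f} PF/$M")
print(f"Frontier: {frontier_eff:.2f} PF/$M")
print(f"Improvement: {frontier_eff / summit_eff:.1f}x")  # ~4x better flops/$
```

So on these numbers Frontier is roughly 4x more cost-efficient than Summit on delivered (Rmax) flops, the opposite of a 1000x cost penalty.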