Perlmutter seems like an awesome system. But I think the “AI exaflops” figure is just “X GPUs times NVIDIA’s peak rate”. The new sparsity features on the A100 are promising, but they haven’t been demonstrated to be nearly that awesome in practice (yet).
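Quick back-of-the-envelope on that, assuming the 6,159-GPU count quoted further down the thread and NVIDIA’s published A100 peak rates:

```python
# Sanity check: is "AI exaflops" just GPU count times NVIDIA's peak number?
# GPU count is the figure cited elsewhere in this thread; peak rates are
# NVIDIA's published A100 specs.
gpus = 6159
fp16_dense = 312e12    # A100 FP16 tensor-core peak, dense (FLOPS)
fp16_sparse = 624e12   # doubled by the 2:4 structured-sparsity feature

print(f"dense:  {gpus * fp16_dense / 1e18:.2f} EFLOPS")   # ~1.92
print(f"sparse: {gpus * fp16_sparse / 1e18:.2f} EFLOPS")  # ~3.84, i.e. "nearly 4 AI exaflops"
```

The headline number only works out if you count the sparsity doubling.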
It also all comes down to workloads: large-scale distributed training is a funny workload! It’s not like LINPACK. If you make your model compute-intensive enough, the networking need mostly reduces to bandwidth (for which multi-hundred-Gbps worth of NICs is handy), but even without that there are lots of ways to max out your compute.
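To make “compute intensive enough” concrete, here’s a rough sketch; the model size, batch size, and NIC speed are all assumptions for illustration, not measurements of any real cluster:

```python
# When does data-parallel training stop being network-bound?
# Everything here is an illustrative assumption.
params = 1.5e9             # model parameters (roughly GPT-2 scale, assumed)
tokens_per_step = 32768    # tokens processed per GPU per step (assumed)

# Rough transformer rule of thumb: ~6 FLOPs per parameter per token
# across the forward and backward passes.
flops_per_step = 6 * params * tokens_per_step

# Data parallelism exchanges one gradient per parameter per step;
# fp16 gradients are 2 bytes each.
grad_bytes = 2 * params

gpu_peak = 312e12          # A100 FP16 tensor-core peak (FLOPS)
nic_rate = 200e9 / 8       # one 200 Gbps NIC, in bytes/s

print(f"compute:    {flops_per_step / gpu_peak:.2f} s/step")  # ~0.94
print(f"all-reduce: {grad_bytes / nic_rate:.2f} s/step")      # ~0.12
# Compute dominates, and the gradient exchange can overlap the backward
# pass anyway, so the network mostly just needs bandwidth, not exotic latency.
```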
Similarly, storage is a serious need for, say, giant video corpora, but not for things like text! GPT-2 trained on a corpus of only about 40 GiB.
For those asking about the largest-scale cloud GPU runs, there are basically three examples (in chronological order):
- the work OpenAI did on Five (thousands of V100s on GCP)
- the IceCube science work on many clouds (51,000 GPUs at peak!)
- OpenAI’s 10,000-V100 cluster used for GPT-3
The A100s used here were recently released and are another step-change in performance per part (and in memory). All the major providers now offer them, though with different density and networking configurations (GCP went with 16 in a single box, most folks went with 8, some have lots of networking, etc.).
What everyone should be asking is: what awesome stuff is NERSC / LBL going to do with Perlmutter? You can’t just rent one for a few hours on GCP or any other provider :). (But, fwiw, most usage will be small slices: this is the sad fate of giant supercomputers!)
Ultimately this is a machine meant to solve everybody’s needs, not just ML needs (although it must serve those needs very well in any case).
I’m not sure Cori will be remembered particularly fondly, but Perlmutter probably will be, since it seems versatile enough to meet everyone’s needs.
For those wondering why not cloud: most simulations generally make sense in the cloud, because there’s nowhere near the same data movement/storage involved, and anything in particle physics is embarrassingly parallel. The data isn’t moving there, though: it is very cheap to keep the data at a place like NERSC, especially with tape in the mix.
Tape is worthless. Tape is where you put data you never want to retrieve again.
Yes, all of that happens, and most project computing (Office of Science) usually gets pushed to NERSC, but it really depends on many factors. NERSC has a lot of users (far more than 8), probably the most in the DOE complex, since it’s a user facility. Big projects (>$500M) don’t often have this problem, and most of them will push for clusters they own at their home lab. Small projects are the most likely to have trouble, either in procuring resources or in effectively using supercomputing resources.
A dirty secret is that many users just use MPI as a job queue/coordination layer so their jobs “scale” when large reservations are required, but InfiniBand is not really exercised in those situations. Once you realize that, it’s easy to see that most workflows scale (as long as disk is accounted for), though it’s very annoying from a development perspective.
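A minimal sketch of that pattern with mpi4py (the task names and work function are placeholders): each rank grabs its slice of independent jobs, and the interconnect carries essentially nothing.

```python
# "MPI as a job queue": ranks split a list of independent tasks and never
# talk to each other, so InfiniBand sits idle. Task names are hypothetical.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

tasks = [f"input_{i}.dat" for i in range(1000)]  # placeholder inputs

# Static round-robin assignment: no coordination at all after startup.
for task in tasks[rank::size]:
    pass  # run_simulation(task) -- the embarrassingly parallel work goes here

comm.Barrier()  # often the only collective in the whole job
```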
Tape isn’t worthless, because you have to have a disaster-recovery or long-term data-preservation plan, and tape is usually just a part of that. But some labs/projects have historically pushed it, as part of the integral data management plan, for active/warm data, and that use is indeed worthless.
Nope, tape is just useless. Use live-but-cold hard drives instead.
> namesake of Saul Perlmutter, an astrophysicist at Berkeley Lab who shared the 2011 Nobel Prize in Physics for his contributions to research showing that the expansion of the universe is accelerating
Opticks: GPU Optical Photon Simulation for Particle Physics using NVIDIA® OptiX
Projected sensitivities of the LUX-ZEPLIN (LZ) experiment to new physics via low-energy electron recoils
Ah not to be confused with OpenAI Rapid described at https://openai.com/blog/openai-five/ ^_^
Disclaimer disclaimer: Language religious wars are dumb, unless we're talking about Rust. I can't find too much bad to say about it other than it doesn't look like Pony. :D Let me go build a programming language with implicit lifetimes and no GC as something no one will ever use. Hang on... gimme a minute.
Perlmutter’s 120-petaflops peak would place it very favorably in the TOP500 rankings, within the top five, if those flops weren’t apples-to-oranges with what the rankings measure. Can anyone shed more light on the distinctions involved?
> Note: Perlmutter’s “AI performance” is based on Nvidia’s half-precision numerical format (FP16 Tensor Core) with Nvidia’s sparsity feature enabled.
FP16 is a 16-bit floating-point format. FLOPS for the TOP500 are measured with LINPACK HPL, which is defined over 64-bit floating-point values (I think):
> HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.
This isn’t totally disingenuous, though. These FP16 operations are very useful for some kinds of calculations.
ML, for example. If you use 16-bit precision you can fit the model in half the memory, and your memory accesses are twice as fast. Newer GPU models offer a “mixed precision” mode, but it takes some doing to get that working in your tooling.
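For the curious, a minimal sketch of what that looks like with PyTorch’s AMP (the model and data here are throwaway placeholders):

```python
# Mixed-precision training with torch.cuda.amp: matmuls run in FP16 on
# tensor cores while weights stay in FP32, and the GradScaler rescales
# the loss so FP16 gradients don't underflow.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(64, 1024, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():    # ops inside run in FP16 where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # backward pass on the scaled loss
    scaler.step(opt)                   # unscales gradients, then steps
    scaler.update()                    # adapts the scale factor over time
```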
The trend is more to call these things 'superpods' than 'supercomputers'.
Also, I wouldn't want their bills for chilled water or electricity. My puny 96-thread EPYC ranges from 300 - 850 W and heats my room. :B
(Nothing remotely against Dr. Perlmutter, whom I know nothing about. It just seems we live in a scandal-filled age.)
Here’s a fun comparison of total silicon wafer space used.
A Cerebras die is 46,225 mm^2.
1,500 Milan 64-core CPUs * 8 compute chiplets each is 1,004,832 mm^2. (Not including the I/O chiplet).
6,159 Nvidia A100 dies total 5,087,334 mm^2.
These are all made on TSMC 7nm, funnily enough.
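The arithmetic, for anyone who wants to check it (the per-die areas are the commonly cited figures; the Zen 3 chiplet area in particular is an assumption):

```python
# Reproducing the wafer-area comparison above.
zen3_ccd = 83.736   # mm^2 per Zen 3 compute chiplet (assumed; I/O die excluded)
a100_die = 826.0    # mm^2, NVIDIA's published A100 die size

print(1500 * 8 * zen3_ccd)  # 1,004,832 mm^2 of Milan compute chiplets
print(6159 * a100_die)      # 5,087,334 mm^2 of A100 silicon
```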
If you mean network bandwidth, the level of batching controls the bandwidth bound and is configurable. If you mean on-chip bandwidth, that relies on advanced compilation that is pretty darn hard to get right.
For a chip like Cerebras’s to win, it’d have to have a bandwidth-bound workload and deliver on the software to eliminate the bottlenecks.
1) Does it run Far Cry?
2) Are the GPUs limited in their ETH hashrate?
2) No, it has the extremely expensive server GPUs, so their drivers should be normal.
Myself: What does 42 mean?
Source: Berkeley PhD student in supercomputing, have access to Perlmutter.
Many people on HN have probably read this already, but maybe some of them could be part of the lucky 10,000!
For me, the ability to ask a question that is unclear to some people and reads as literary humour to others (a Douglas Adams reference) would be telling about the AI. Some humans would pick up on the reference and the humour; some would be flummoxed. How an intelligence answers questions it does not fully understand is, at least for me, insightful into how intelligent it actually is.
Right… but there’s no point wondering, because the answer is that it can’t answer things like that.