Berkeley Lab Debuts Perlmutter, World’s Fastest AI Supercomputer (hpcwire.com)
124 points by infodocket 8 months ago | 58 comments

Disclosure: I used to work on Google Cloud.

Perlmutter seems like an awesome system. But I think the “AI exaflops” figure is just “X GPUs times NVIDIA’s peak per-GPU rate”. The new sparsity features on the A100 are promising, but haven’t (yet) been demonstrated to be nearly as awesome in practice.

It also all comes down to workloads: large-scale distributed training is a funny workload! It’s not like LINPACK. If you make your model compute-intensive enough, then the networking requirement is mostly bandwidth (for which multi-hundred-Gbps NICs are handy), but even without that there are lots of ways to max out your compute.

Similarly, storage is a serious need for, say, giant video corpora, but not for things like text! GPT-2 had something like a 40 GiB corpus.

For those asking about the largest-scale cloud GPU runs, there are basically three examples (in chronological order):

- the work OpenAI did on Five (thousands of V100s on GCP)

- the IceCube science work [1] on many clouds (51000 GPUs at peak!)

- OpenAI’s 10000-V100 cluster they used for GPT-3

The A100s used here are recently released and another step-change in perf per part (and memory). All major providers now offer them, though with different density and networking configurations (GCP went with 16 in a single box, most folks went with 8, some have lots of networking, etc.).

What everyone should be asking is: what awesome stuff is NERSC / LBL going to do with Perlmutter? You can’t just rent one for a few hours on GCP or any other provider :). (But, fwiw, most usage will be small slices: this is the sad fate of giant supercomputers!)

[1] https://insidehpc.com/2019/11/sdsc-conducts-50000-gpu-cloudb...

Perlmutter is a new kind of supercomputer: it’s a mini-cloud disguised as a supercomputer, for smaller DOE projects to funnel their compute to. Cori was kind of that too, but IIRC somewhat by accident, because the Xeon Phi wasn’t available at delivery time, so the first “phase” was Haswell nodes. I’m not sure what this means for whole-machine reservations, but hopefully those are a thing of the past in any case.

Ultimately this is a machine to solve everybody’s needs, not just ML needs (although it must solve those needs very well in any case).

I’m not sure Cori will be remembered particularly fondly, but Perlmutter probably will be, because it seems versatile enough to meet everyone’s needs.

For those wondering why not cloud: most simulations would generally make sense in the cloud, because there’s nowhere near the data movement/storage involved, and anything in particle physics is always embarrassingly parallel. The data isn’t moving there, though: it’s very cheap to keep the data at a place like NERSC, especially with tape in the mix.

One of the reasons I left DOE for industry many years ago was that so much of the funding went to large supercomputers with about 8 users (the folks who had made their code scale to the largest sizes), while the vast majority of people with compute needs went unsupported.

Tape is worthless. Tape is where you put data you never want to retrieve again.

I understand most of what you’re saying, and it isn’t pleasant in many cases, but I think the situation is improving.

Yes, all of that happens, and most project computing (Office of Science) is usually pushed to NERSC, but it really depends on many factors. NERSC has a lot of users (far more than 8), probably the most in the DOE complex, since it’s a user facility. Big projects (>$500M) don’t often have this problem, and most of them will push for clusters they own at their home lab. Small projects are the most likely to have trouble, either in procuring resources or in using supercomputing resources effectively.

A dirty secret is that many users just use MPI as a job queue/coordination layer so their jobs “scale” when large reservations are required, but the InfiniBand is not really exercised in those situations. Once you realize that, it’s clear that most workflows “scale” (as long as disk is accounted for), though it’s very annoying from a development perspective.
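The MPI-as-job-queue pattern described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual code: in a real program the rank and size would come from `MPI_Comm_rank`/`MPI_Comm_size` (e.g. via mpi4py); here they're faked from environment variables so the sketch stays stdlib-only.

```python
import os

def my_tasks(all_tasks, rank, size):
    """Static round-robin partition: rank r takes every size-th task.

    After startup no communication happens at all, so the interconnect
    sits idle: the "scaling" is just independent jobs wearing an MPI
    costume, and the real bottleneck is usually the shared filesystem.
    """
    return all_tasks[rank::size]

if __name__ == "__main__":
    # Stand-ins for comm.Get_rank()/comm.Get_size(); the env var names
    # are hypothetical, launchers differ.
    rank = int(os.environ.get("PMI_RANK", "0"))
    size = int(os.environ.get("PMI_SIZE", "1"))
    for task in my_tasks(list(range(12)), rank, size):
        pass  # run_simulation(task) would go here; each task hits disk, not the network
```

The upside is that this "scales" to any reservation size; the downside, as the comment says, is that it tells you nothing about whether the code could use the fabric it is nominally benchmarked on.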

Tape isn’t worthless, because you need a disaster-recovery or long-term data-preservation plan, and tape is usually just one part of that. But some labs/projects have historically pushed it as an integral part of the data-management plan for active/warm data, and that use is indeed worthless.

I really doubt people are running embarrassingly parallel codes on NERSC supercomputers. I was a PI at NERSC and I was told explicitly, repeatedly, that I couldn't do that. That's why I moved to the Open Science Grid, and eventually Google's compute platform.

Nope, tape is just useless. Live but cold hard drives.

Not just for perl!

> namesake of Saul Perlmutter, an astrophysicist at Berkeley Lab who shared the 2011 Nobel Prize in Physics for his contributions to research showing that the expansion of the universe is accelerating

Aptly named, as among the first simulations will be dark matter detection. What's incredible is that optical simulation in high energy physics can be reduced to ray tracing in computer graphics, with photon generation and propagation mapping directly onto NVIDIA's OptiX framework ;)

Opticks: GPU Optical Photon Simulation for Particle Physics using NVIDIA OptiX


Projected sensitivities of the LUX-ZEPLIN (LZ) experiment to new physics via low-energy electron recoils


Are there hard objectives that get reached or achieved with the simulations, or whatever's being run? Or is it some amount of data that may or may not show something significant, and is up to interpretation?

Dude is getting stuff named after him while he's still alive!

> Python programmers will be able to use RAPIDS, Nvidia’s open software suite for GPU-enabled data science

Ah not to be confused with OpenAI Rapid described at https://openai.com/blog/openai-five/ ^_^

Would Ruby be out-of-the-question on religious or technical grounds? ;-P

Disclaimer disclaimer: Language religious wars are dumb, unless we're talking about Rust. I can't find too much bad to say about it other than it doesn't look like Pony. :D Let me go build a programming language with implicit lifetimes and no GC as something no one will ever use. Hang on... gimme a minute.

It doesn't really matter what language you use at the high level, as long as it is calling a highly-optimized, aggressively-compiled, native-instruction-set kernel for most of the work.

How does this compare with one of Google's TPU v4 pods? https://www.hpcwire.com/2021/05/20/google-launches-tpu-v4-ai... ?

From reading the two press releases, Perlmutter is 3.5 Exaflops vs Google TPUv4 at 1 Exaflop. Who knows how that compares in real life on real problems.

One pod is one exaflop, and that release says they have dozens.

Its EOL date isn't set for a year after its first press release.

I'm curious what exactly distinguishes the "AI supercomputer" category from whatever the rest of the TOP500 are.

Perlmutter's 120 petaflops peak would place it very favorably in the TOP500 rankings, within the top five, assuming these flops aren't apples-to-oranges with what those rankings measure. Can anyone shed more light on the distinctions involved?

I believe there's a hint towards the end of the article:

> Note: Perlmutter’s “AI performance” is based on Nvidia’s half-precision numerical format (FP16 Tensor Core) with Nvidia’s sparsity feature enabled.

FP16 is a 16-bit floating-point format. FLOPS for the TOP500 are measured with LINPACK HPL, which says it uses 64-bit (double precision) floating-point values:

> HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

(from https://www.netlib.org/benchmark/hpl/)

This isn't totally disingenuous, though. These FP16 operations are very useful for some kinds of calculations.
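For what it's worth, the headline number can be reproduced with back-of-the-envelope math. The per-GPU peak (312 FP16 Tensor Core TFLOPS, doubled by 2:4 structured sparsity) is an assumption taken from NVIDIA's published A100 specs, not from this article; the GPU count appears elsewhere in the thread.

```python
# Back-of-envelope check of the "AI exaflops" figure: GPU count times
# NVIDIA's advertised peak, exactly as the top comment suspects.
A100_FP16_TENSOR_TFLOPS = 312   # dense FP16 Tensor Core peak per GPU (NVIDIA spec)
SPARSITY_SPEEDUP = 2            # 2:4 structured sparsity doubles the advertised peak
NUM_GPUS = 6159                 # Perlmutter's A100 count, per the thread

peak_eflops = NUM_GPUS * A100_FP16_TENSOR_TFLOPS * SPARSITY_SPEEDUP * 1e12 / 1e18
print(f"{peak_eflops:.2f} EFLOPS")
```

This lands at roughly 3.8 EFLOPS, matching the marketing number, which shows it is a theoretical peak rather than anything LINPACK-like.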

"These FP16 operations are very useful for some kinds of calculations."

ML, for example. If you use 16-bit precision you can fit the model in half the memory, and your memory accesses are effectively twice as fast. Newer GPU models offer a "mixed precision" mode, though it takes some doing to get that working in your tooling.

Less than it used to! Apex now offers automatic mixed precision on cuBLAS operations, so you get a good bit of the benefit without significant code changes.
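The memory-halving claim above is easy to see without a GPU at all: Python's stdlib `struct` module supports IEEE 754 half precision via the `"e"` format code. A quick sketch of both the size win and the precision cost:

```python
import struct

# Byte sizes of the three formats in play: FP16 is what the "AI exaflops"
# peak is quoted in, FP64 is what LINPACK/TOP500 measures.
half   = struct.calcsize("e")   # 2 bytes
single = struct.calcsize("f")   # 4 bytes
double = struct.calcsize("d")   # 8 bytes
assert (half, single, double) == (2, 4, 8)

# The catch: FP16 has ~3 decimal digits and a max normal value of 65504,
# fine for scaled ML gradients, useless for most HPL-style solves.
print(struct.unpack("e", struct.pack("e", 65504.0))[0])  # largest FP16 round-trips
print(struct.unpack("e", struct.pack("e", 1e-8))[0])     # underflows to 0.0
```

So "half the memory, twice the effective bandwidth" is real, but it's also why loss scaling and mixed-precision tooling exist in the first place.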

AI supercomputers report their performance in mixed-precision (bfloat16 or another format) FLOPs, and it's based on the peak performance of the machine, rather than realistic performance (like LINPACK). It's a terrible marketing approach that juices numbers in ways that are irrelevant to the existing supercomputer community.

The trend is more to call these things 'superpods' than 'supercomputers'.

GPFS, Lustre, Panasas (PanFS / pNFS), BeeGFS are just about the only games in town for HPC pFSes. OrangeFS maybe.

Also, I wouldn't want their bills for chilled water or electricity. My puny 96-thread EPYC ranges from 300 - 850 W and heats my room. :B

I work on WekaFS (proprietary), which can definitely handle the relevant workloads and is in use in fairly large installations.

I am not sure about LBNL overall but I would guess computing use is a fraction of what the ALS uses.

I found this video overview of the system to be pretty good: https://youtu.be/hm7NbBnEJqk

I wish I could try running the DQVM, the distributed quantum simulator written in Common Lisp [0], on this thing. It would be awesome to show that Lisp is a viable “peta-scale” language.

[0] https://github.com/quil-lang/qvm/tree/master/dqvm

Doing back-of-the-napkin math, you could match this with 770 Google Cloud instances, each with 8 attached GPUs. But running Lustre in the cloud is a PITA, so you would have to figure out how to take advantage of cloud-native storage. Running RAPIDS should be possible using Dataproc. Anyone have the budget to kick off the largest cloud-native HPC workload?

The interconnect on a tightly-coupled machine like this means it performs substantially better than the naively-equivalent distributed compute resources.

Google Cloud does not have high-performance networking. You want Azure, or maybe AWS, if you want high-performance networking.

Maybe this is off topic and a matter of taste, but wouldn't you hesitate to have a massive computer cluster named after you while you're alive? Either out of modesty, or, in the worst case, in case something goes terribly wrong with it?

Modesty is a very personal preference. I imagine a scientist with a Nobel Prize and world renown is comfortable with something like this. And not much can go terribly wrong with a computing cluster. It may have bugs I guess? Or the whole building could catch fire? Not really reasons to weigh too heavily.

For the naming institution, the prospect that the eponymous person would be found to have done something dreadful is probably a bigger worry.

(Nothing remotely against Dr. Perlmutter, about whom I know nothing. It just seems we live in a scandal-filled age.)

Does anyone have the power consumption of the beast at its processing peak?

Why don’t press releases like this ever show one of the nodes opened up!? Let me see the water-cooled mobo with all the A100s and the Cray interconnect!

Doesn't the Cerebras chip just outperform this?

I don’t think they are very comparable. This is still very much a conventional supercomputer despite the marketing.

Here’s a fun comparison of total silicon wafer space used.

A Cerebras die is 46,255 mm^2.

1,500 Milan 64-core CPUs * 8 compute chiplets each is 1,004,832 mm^2 (not including the I/O chiplets).

6,159 NVIDIA A100 dies is 5,087,334 mm^2.

These are all made on TSMC 7nm, funnily enough.
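The totals above can be reproduced from per-die areas. The per-die numbers here are inferred by dividing the thread's own totals by the counts given (an A100 die of 826 mm^2 and a Zen 3 chiplet of ~83.7 mm^2 are assumptions consistent with those totals, not figures stated in the thread):

```python
# Re-deriving the wafer-area comparison from per-die sizes.
cerebras = 46_255               # one wafer-scale die, mm^2 (figure from the thread)
milan    = 1_500 * 8 * 83.736   # 12,000 Zen 3 compute chiplets at ~83.7 mm^2 each
a100     = 6_159 * 826          # A100 GPU dies at 826 mm^2 each

print(round(milan))                # matches the 1,004,832 mm^2 total above
print(a100)                        # matches the 5,087,334 mm^2 total above
print(round(a100 / cerebras, 1))  # the GPU silicon alone is ~110 Cerebras dies' worth
```

Which puts the comparison in perspective: one Cerebras die is under 1% of the 7nm silicon in this machine.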

AI needs bandwidth. Those 1500 chips might as well be orbiting the Earth.

There’s more to performance than peak theoretical performance. The architectures that stick around have tended to be ones with more software support (e.g., x86).

If you mean network bandwidth, the level of batching controls the bandwidth bound and is configurable. If you mean chip bandwidth, that relies on advanced compilation that is pretty darn hard to get right.

For a chip like Cerebras's to win, it'd need a bandwidth-bound workload and have to deliver on the software to eliminate the bottlenecks.

Can it run Crysis?

How do we know this is the largest ML facility in the field? Can anybody rule out the idea that I could go to GCP right now and launch 1500 nodes with 4 GPUs each?

1,500 compute nodes on GCP don't make a supercomputer, the same way all the computers on campus don't. Table stakes are a high-speed interconnect, shared data storage, and a job scheduler. I know of a few machines built out of DGX boxes with InfiniBand interconnects that are about half this size, and they were massive projects as well. You can't get there from generic cloud compute.

Do it! You'll make the news, and then fifteen minutes later NERSC will still be the largest ML facility in the field. The HPC community is accustomed to stunt machines, and accustomed to disregarding the ones that don't wind up enabling meaningful research.

That wasn't really much of a quantitative argument. Tons of meaningful research has been done in clouds. Granted, it was often by the owner of the cloud, but there it is. So is there anything we can compare? Does it train resnet-50 in a split second or what?

My understanding is the title is for "fastest," which here means 3.8 exaflops of sparse FP16. I assume these are theoretical peak or "benchmark" peak numbers. Replicating their hardware in the cloud would get you the same title, until a concrete workload is used.

The difference is probably the interconnect, no? What does GCP have between nodes?

According to the article, HPE/Cray 200 Gb/s Ethernet. GCP A2 instances with 16 GPUs have 100 Gb/s Ethernet. I'm not sure how important that is; it would depend on the workload.

Now the important questions:

1) does it run FarCry?

2) are the GPUs limited in their ETH hashrate?

1) It runs Linux, so you're limited to 0 A.D.

2) No, it has the extremely expensive server GPUs, so their drivers should be normal.

I've run Far Cry 2 on Proton so the ceiling is open. My question is can we do Octree sims at scale???

What would be the first question you would ask an AI supercomputer?

Myself: What does 42 mean?

Dude it's just a bunch of GPUs with a fast network...

Source: Berkeley PhD student in supercomputing, have access to Perlmutter.

Be honest. Has anyone used it to mine crypto yet?

That's very strictly forbidden. The fastest-way-to-get-fired kind of forbidden.

Asimov wrote a short story[0] about this exact scenario!

Many people on HN have probably read it already, but maybe some could be part of the lucky 10,000![1]

[0] https://templatetraining.princeton.edu/sites/training/files/...

[1] https://xkcd.com/1053/

I don't think that's the sort of question a supercomputer like this can answer. It has additional compute power - it doesn't have any breakthrough in the AI itself.

True, but how we define AI has many more answers now than it did back in Asimov's day.

For me, the ability to ask a question that is unclear to some and carries literary humour for others (a Douglas Adams reference) would be telling about the AI. Some humans would pick up on the reference and the humour; some would be flummoxed. How an intelligence answers questions it does not fully understand is, at least for me, insightful into how intelligent it is.

> would be telling about the AI

Right… but there’s no point wondering because the answer is it can’t answer things like that.

Are they gonna use the supercomputer for their NFTs?

[1] https://www.berkeleyside.org/2021/05/27/cal-will-auction-nft...
