I was talking with a friend in HPC lately who said that AMD is actually quite competitive in the HPC space these days. For example, Frontier (https://docs.olcf.ornl.gov/systems/frontier_user_guide.html) is an all-AMD installation. Do scientists actually use ROCm in their code or does AMD have another programming framework for their Instinct chips?
I'm not sure that's the right question to ask. Afaik ROCm is the name of the entire tech stack, and HIP is AMD's equivalent to CUDA C++: they basically replicated the API and replaced every "cuda" with "hip", so you get functions like "hipMalloc" and "hipMemcpy".
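To make that concrete, here's a minimal sketch of the HIP side (the kernel and variable names are made up, but the API calls are real HIP ones). It compiles with hipcc; the CUDA version is essentially the same file with every hip* renamed to cuda*, and AMD also ships hipify tools that do that renaming mechanically:

    // Minimal HIP sketch: same shape as the CUDA version, with cuda* swapped for hip*.
    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // same built-ins as CUDA
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);

        float *d_x = nullptr, *d_y = nullptr;
        hipMalloc(&d_x, n * sizeof(float));                                   // cudaMalloc
        hipMalloc(&d_y, n * sizeof(float));
        hipMemcpy(d_x, x.data(), n * sizeof(float), hipMemcpyHostToDevice);   // cudaMemcpy
        hipMemcpy(d_y, y.data(), n * sizeof(float), hipMemcpyHostToDevice);

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);                   // same launch syntax
        hipDeviceSynchronize();                                               // cudaDeviceSynchronize

        hipMemcpy(y.data(), d_y, n * sizeof(float), hipMemcpyDeviceToHost);
        std::printf("y[0] = %f\n", y[0]);                                     // expect 4.0

        hipFree(d_x);
        hipFree(d_y);
        return 0;
    }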
My project uses ROCm (torch, more or less), and working with OLCF staff I've never heard of HIP being used, but based on their training series it is supported[0].
Of course my personal experience isn't exhaustive, and it can be inferred from the ongoing training series that HIP is in use in some cases.
Speaking from personal experience, ROCm itself is... challenging (which I already knew from prior endeavors). We've taken to running dev and staging workloads on more typical MI2xx hardware and then porting the work over to Frontier.
We currently have 20k node hours on Frontier via a Director's Discretion Project[1]. It's a relatively simple application, and at the end of the day you have access to significant compute, so depending on the workload the extra effort for ROCm etc. is still worth it.
National labs sign "cost-effective" deals. NVIDIA isn't cost-effective. Aurora (at Argonne) is all Intel GPU. Aurora is also a clusterfuck so that just tells you these decisions aren't made by the most competent people.
They are competent people, just not in the fields techies want.
When you're a national laboratory and your wallet is taxes from fellow Americans, it is very important that you find a balance between bang and buck. Lest you get your budget slashed or worse.
Nvidia absolutely gives deals to national labs and universities: see Crossroads @ LANL, Isambard in the UK, Perlmutter @ LBL. While AMD is being deployed at LLNL and ORNL, Nvidia isn't done with its HPC game. Maybe not at the leadership level, but we'll see how Oak Ridge and LANL decide their next round of procurements.
"Winning" a national lab definitely confers benefits far beyond just financial ones – these are, by definition, the biggest deployments in the world. Both the technical experience setting these up, and the reputational benefit associated with this, is worth a great, great deal. (I don't know how much money HPE Cray makes, for example, but I'm sure it's not the money it makes that's stopped HPE from quietly sunsetting the brand.)
An interesting alternative question: "how necessary is ROCm when working with an APU?"
CUDA's advantage seemed to me to come mostly from memory management and task scheduling being so poor on AMD cards. If AMD has engineered that problem out of the system, we might be able to get away with using 3rd party libraries instead of these vendor-promoted frameworks.
This is a great question. In the sense that ROCm is pure userspace it's never necessary - make the syscalls yourself and the driver in the Linux kernel will do the same things ROCm would have done.
In practice if you go down that road on discrete GPU systems, allocating "fine grain" memory so you can talk to the GPU is probably the most tedious part of the setup. I gave up around there. An APU should be indifferent to that though.
There will be some setup to associate your CPU process with the GPU. Permissions style, since Linux doesn't let processes stomp on each other. That might be rather minimal and should be spelled out in roct.
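As a rough illustration of how small that setup can be (this is from memory of hsakmt.h, so treat every function and field name here as something to verify against the header in your ROCm install rather than gospel), the roct-level handshake is basically "open /dev/kfd, read the topology":

    // Hedged sketch of the process <-> GPU association at the roct (libhsakmt)
    // level. Opening /dev/kfd is what registers this process with the kernel
    // driver; the topology query just proves we can talk to it.
    #include <cstdio>
    #include "hsakmt.h"   // from ROCT-Thunk-Interface; install path varies by ROCm version

    int main() {
        if (hsaKmtOpenKFD() != HSAKMT_STATUS_SUCCESS) {
            std::fprintf(stderr, "could not open /dev/kfd (driver loaded? group permissions?)\n");
            return 1;
        }

        HsaVersionInfo ver{};
        hsaKmtGetVersion(&ver);
        std::printf("KFD interface %u.%u\n",
                    ver.KernelInterfaceMajorVersion, ver.KernelInterfaceMinorVersion);

        HsaSystemProperties sys{};
        hsaKmtAcquireSystemProperties(&sys);   // walks the topology exposed under sysfs
        std::printf("%u topology nodes (CPUs and GPUs)\n", sys.NumNodes);

        hsaKmtReleaseSystemProperties();
        hsaKmtCloseKFD();
        return 0;
    }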
Launching a kernel involves finding the part of the address space the GPU is watching, writing 64 bytes to it and then "ringing a doorbell" which is probably writing to a different magic address. There's a lot of cruft in the API from earlier generations where these things involved a lot of work.
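To make the "64 bytes plus a doorbell" concrete, here's roughly what that looks like through hsa.h, assuming the boring parts (agent discovery, loading the code object, allocating kernarg memory, creating the queue and the completion signal) have already happened. The helper name and its arguments are mine, not part of the API:

    // Rough sketch of one kernel dispatch at the hsa.h level: reserve a slot in
    // the queue's ring buffer, write a 64-byte AQL packet into it, publish the
    // header last, then ring the doorbell. kernel_object, kernarg_address and
    // completion_signal come from the elided setup code.
    #include <hsa/hsa.h>
    #include <cstdint>
    #include <cstring>

    void dispatch(hsa_queue_t* queue, uint64_t kernel_object,
                  void* kernarg_address, hsa_signal_t completion_signal,
                  uint32_t grid_size, uint16_t workgroup_size) {
        // Reserve the next packet slot (queue->size is a power of two).
        uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
        auto* slot = reinterpret_cast<hsa_kernel_dispatch_packet_t*>(queue->base_address)
                     + (index & (queue->size - 1));

        // Build the 64-byte packet locally, zero-initialized.
        hsa_kernel_dispatch_packet_t pkt = {};
        pkt.setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;   // 1-D grid
        pkt.workgroup_size_x = workgroup_size;
        pkt.workgroup_size_y = 1;
        pkt.workgroup_size_z = 1;
        pkt.grid_size_x = grid_size;
        pkt.grid_size_y = 1;
        pkt.grid_size_z = 1;
        pkt.kernel_object = kernel_object;       // from hsa_executable_symbol_get_info
        pkt.kernarg_address = kernarg_address;   // kernel arguments in kernarg memory
        pkt.completion_signal = completion_signal;

        // Copy the body into the ring slot, then publish the header last so the
        // packet processor never sees a half-written packet.
        std::memcpy(reinterpret_cast<char*>(slot) + sizeof(pkt.header),
                    reinterpret_cast<const char*>(&pkt) + sizeof(pkt.header),
                    sizeof(pkt) - sizeof(pkt.header));
        uint16_t header = (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
                          (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
                          (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
        __atomic_store_n(&slot->header, header, __ATOMIC_RELEASE);

        // "Ring the doorbell": tell the GPU's command processor new work is queued.
        hsa_signal_store_screlease(queue->doorbell_signal, index);
    }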
Game plan for finding out goes something like:
1. Compile some GPU code and put it in the host process
2. Make the calls into hsa.h to run that kernel
3. Delete everything unused from hsa to get an equivalent that only uses roct
4. Delete everything unused from roct to get the raw syscalls
Roct is a small C library that implements the userspace side of the kernel driver. I'd be inclined to link it into your application instead of dropping it entirely, but ymmv. Rocr / HSA is a larger C++ library that has a lot more moving parts and is more tempting to drop from the dependency graph.
Going beyond that, you could build a simplified version of the kernel driver that drops support for all the other hardware. Might make things better, might not. Beyond that there's the firmware on the GPU, which might be getting more accessible soon, but iiuc is written in assembly so might not be that much fun to hack on. And beyond that you're down to the silicon, where changing anything really means making a different chip.
APUs for HPC are going to be a wild ride. Accelerated computing in shared memory. CPU-focused folks will actually get access to some high-throughput compute on the sort of timescales that we can actually reason about (the GPU is so far away).
APUs are very cool for GPU programming in general. Explicitly copying data to/from GPUs is a definite nuisance. I'm hopeful that the MI300A will have a positive knock-on effect on the low-power APUs in laptops and similar.
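To show what "no explicit copies" buys you, a small sketch with HIP managed memory (names made up). On an APU like the MI300A this can genuinely be the same physical memory; on discrete cards the runtime migrates pages behind your back, so how cheap it is depends on the hardware:

    // Sketch: with managed/unified memory there are no hipMemcpy calls; the same
    // pointer is used on host and device.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void scale(int n, float a, float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float* x = nullptr;
        hipMallocManaged(reinterpret_cast<void**>(&x), n * sizeof(float));   // visible to CPU and GPU

        for (int i = 0; i < n; ++i) x[i] = 1.0f;    // plain CPU writes, no staging buffer

        scale<<<(n + 255) / 256, 256>>>(n, 3.0f, x);
        hipDeviceSynchronize();

        std::printf("x[0] = %f\n", x[0]);           // read the result directly
        hipFree(x);
        return 0;
    }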
All video game consoles use APUs, and it does make memory-related operations potentially faster, but at least for video games it's not the bottleneck. I suppose for HPC it might have more significance.
If you're doing simulations, or poking big matrices continuously on CPUs, you can saturate the memory controller pretty easily. If you know what you're doing, your FPU or vector units are saturated at the same time, so "whole system" becomes the bottleneck while it tries to keep itself cool.
Games move that kind of data at the beginning and don't stream much new data after the initial textures and models are loaded. If you are working on HPC with GPUs, you may need to constantly stream new data into the GPU while streaming out the results. This is why datacenter/compute GPUs have multiple independent DMA engines.
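As a sketch of that streaming pattern (chunk sizes and the kernel are made up), double-buffering across two HIP streams lets the copy engines move the next chunk in and the previous results out while the compute units chew on the current one:

    // Double-buffered streaming: while the GPU computes on one chunk, the DMA
    // engines stream the next chunk in and the previous results out. Pinned host
    // memory (hipHostMalloc) is what lets the copies run asynchronously.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void process(int n, float* data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // stand-in for real work
    }

    int main() {
        const int chunk = 1 << 20;   // elements per chunk
        const int nchunks = 8;

        float* host = nullptr;
        hipHostMalloc(reinterpret_cast<void**>(&host),
                      size_t(chunk) * nchunks * sizeof(float), hipHostMallocDefault);
        for (long long i = 0; i < (long long)chunk * nchunks; ++i) host[i] = 1.0f;

        hipStream_t streams[2];
        float* dev[2];
        for (int s = 0; s < 2; ++s) {
            hipStreamCreate(&streams[s]);
            hipMalloc(&dev[s], chunk * sizeof(float));
        }

        for (int c = 0; c < nchunks; ++c) {
            int s = c % 2;                              // alternate between the two streams
            float* h = host + size_t(c) * chunk;
            hipMemcpyAsync(dev[s], h, chunk * sizeof(float), hipMemcpyHostToDevice, streams[s]);
            process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(chunk, dev[s]);
            hipMemcpyAsync(h, dev[s], chunk * sizeof(float), hipMemcpyDeviceToHost, streams[s]);
        }
        hipDeviceSynchronize();

        std::printf("host[0] = %f\n", host[0]);         // expect 3.0
        for (int s = 0; s < 2; ++s) { hipStreamDestroy(streams[s]); hipFree(dev[s]); }
        hipHostFree(host);
        return 0;
    }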
Afaik those unified memory architectures are mostly neither cache coherent nor do they support virtual addresses efficiently (you have to trap into privileged code to pin/unpin the mappings), which means the relative cost is lower than for a dedicated GPU accessed over PCIe, but still too high. Only the "boring" old Bobcat-based AMD APUs supported accessing unpinned virtual memory from the L3 (aka system-level) cache, and nobody bothered with porting code to them.
> Only the "boring" old Bobcat-based AMD APUs supported accessing unpinned virtual memory from the L3 (aka system-level) cache, and nobody bothered with porting code to them.
Other way around: Bobcat was the era of the "onion bus"/"garlic bus", and today things like Apple silicon don't need to be explicitly accessed in special ways, afaik.
I think OpenMP is nice because it supports both C and Fortran, and they use the same runtime, so you can do things like pin threads to cores or avoid oversubscription. Stuff like calling a Fortran library that uses OpenMP from C code that also uses OpenMP doesn't require anything clever.
OpenMP has been around for a long time. People know how to use it, and it has gained many features that are useful for scientific computing.
The consortium behind OpenMP consists mostly of hardware companies and organizations doing scientific computing. Software companies are largely missing. That may contribute to the popularity of OpenMP, as the interests of scientific computing and software development are often different.
I use it sometimes with C++ because it is super easy to make "embarrassingly parallel" code actually run in parallel. And since it's nothing but #pragma statements, the code will still compile single-threaded if you don't have OpenMP enabled, because the pragmas are simply ignored.
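A trivial illustration of what that looks like (toy example, names made up): the pragma is the only OpenMP-specific line, so the same file builds and runs serially if you drop -fopenmp.

    // Embarrassingly parallel loop: built with -fopenmp the iterations are split
    // across threads; without it the pragma is ignored and it runs single-threaded.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> v(1 << 24, 2.0);

        #pragma omp parallel for
        for (long long i = 0; i < (long long)v.size(); ++i)
            v[i] = std::sqrt(v[i]) * std::log(v[i]);   // independent per-element work

        std::printf("v[0] = %f\n", v[0]);
        return 0;
    }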
Funny how we only get LoC comparisons between the different versions, but not the performance...
Of course the parallel algorithms are shorter; they're a more high-level interface. But being explicit gives you more control and potentially more performance.
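For concreteness, this is the trade-off in miniature (a toy comparison of my own, not taken from whatever was being measured): the C++17 parallel algorithm says "run this in parallel somehow", while the explicit OpenMP loop exposes knobs at the cost of a little more ceremony.

    // Same per-element work two ways. The parallel algorithm is shorter but leaves
    // scheduling entirely to the implementation; the explicit OpenMP loop exposes
    // knobs (schedule, num_threads, chunk sizes, ...) in exchange for more ceremony.
    #include <algorithm>
    #include <cmath>
    #include <execution>
    #include <vector>

    void with_parallel_algorithm(std::vector<double>& v) {
        std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                      [](double& x) { x = std::sqrt(x); });
    }

    void with_openmp(std::vector<double>& v) {
        #pragma omp parallel for schedule(static)
        for (long long i = 0; i < (long long)v.size(); ++i)
            v[i] = std::sqrt(v[i]);
    }

Note that with libstdc++ the parallel execution policies need TBB at link time, and which version actually runs faster depends on the implementation, which is presumably the point of the comment above.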