Porting HPC Applications to AMD Instinct MI300A Using Unified Memory and OpenMP (arxiv.org)
95 points by arcanus 7 months ago | 27 comments



I was talking with a friend in HPC lately who said that AMD is actually quite competitive in the HPC space these days. For example, Frontier (https://docs.olcf.ornl.gov/systems/frontier_user_guide.html) is an all-AMD installation. Do scientists actually use ROCm in their code or does AMD have another programming framework for their Instinct chips?


I currently have a project with ORNL OLCF (on Frontier). The short answer is yes. Happy to answer any questions I can.


ROCm or HIP? Does it start out with porting a lot from CUDA etc. or starting fresh on top of the AMD APIs?

How much of the project time is spent on that compute API stuff in comparison to "payload" work?


>ROCm or HIP?

I'm not sure that's the right question to ask. Afaik ROCm is the name of the entire tech stack, and HIP is AMD's equivalent to CUDA C++ (they basically replicated the API and replaced every "cuda" with "hip"; there are functions called "hipMalloc" and "hipMemcpy").

The repository is located at https://github.com/ROCm/HIP.
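
To give a flavor of how mechanical the rename is, here's a minimal (untested) sketch using the standard HIP runtime calls, error checking omitted:

  #include <hip/hip_runtime.h>
  #include <vector>

  int main() {
    const size_t n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    float* dev = nullptr;

    // cudaMalloc / cudaMemcpy / cudaFree become hipMalloc / hipMemcpy / hipFree
    hipMalloc((void**)&dev, n * sizeof(float));
    hipMemcpy(dev, host.data(), n * sizeof(float), hipMemcpyHostToDevice);
    // ... launch a __global__ kernel here, e.g. via hipLaunchKernelGGL ...
    hipMemcpy(host.data(), dev, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(dev);
    return 0;
  }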


My project uses ROCm (torch, more or less), and working with OLCF staff I've never heard of HIP being in use, but based on their training series it is supported[0].

Of course my personal experience isn't exhaustive, and it can be inferred from the ongoing training series that HIP is in use in some cases.

Speaking from personal experience, ROCm itself is... challenging (which I already knew from prior endeavors). We've taken to running dev and staging workloads on more typical MI2xx hardware and then porting things over to Frontier.

We currently have 20k node hours on Frontier via a Director's Discretion project[1]. It's a relatively simple application, and at the end of the day you have access to significant compute, so depending on the workload the extra effort for ROCm, etc. is still worth it.

[0] - https://www.olcf.ornl.gov/hip-training-series/

[1] - https://www.olcf.ornl.gov/for-users/documents-forms/olcf-dir...


National labs sign "cost-effective" deals. NVIDIA isn't cost-effective. Aurora (at Argonne) is all Intel GPU. Aurora is also a clusterfuck so that just tells you these decisions aren't made by the most competent people.


LANL might disagree given that they just unveiled a new supercomputer with NVIDIA chips [1]. NVIDIA CEO Jensen Huang was even at the unveiling.

[1] https://ladailypost.com/los-alamos-national-laboratory-unvei...


Both Frontier and Aurora bet on unproven future chips. Sometimes it pays off and sometimes it doesn't.


They are competent people, just not in the fields techies want.

When you're a national laboratory and your wallet is taxes from fellow Americans, it is very important that you find a balance between bang and buck. Lest you get your budget slashed or worse.


Nvidia absolutely gives deals to national labs and universities. See Crossroads @ LANL, Isambard in the UK, Perlmutter @ LBL. While AMD is being deployed at LLNL and ORNL, Nvidia isn't done with their HPC game. Maybe not at the leadership level, but we'll see how Oak Ridge and LANL decide their next round of procurements.


"Winning" a national lab definitely confers benefits far beyond just financial ones – these are, by definition, the biggest deployments in the world. Both the technical experience setting these up, and the reputational benefit associated with this, is worth a great, great deal. (I don't know how much money HPE Cray makes, for example, but I'm sure it's not the money it makes that's stopped HPE from quietly sunsetting the brand.)


AMD has pretty much always been competitive in HPC; in AI not so much, because of software.


An interesting alternative question: "how necessary is ROCm when working with an APU?".

CUDA's advantage seemed to me to come mostly from memory management and task scheduling being so poor on AMD cards. If AMD has engineered that problem out of the system, we might be able to get away with using 3rd party libraries instead of these vendor-promoted frameworks.


This is a great question. In the sense that ROCm is pure userspace it's never necessary - make the syscalls yourself and the driver in the Linux kernel will do the same things ROCm would have done.

In practice if you go down that road on discrete GPU systems, allocating "fine grain" memory so you can talk to the GPU is probably the most tedious part of the setup. I gave up around there. An APU should be indifferent to that though.

There will be some setup to associate your CPU process with the GPU. Permissions style, since Linux doesn't let processes stomp on each other. That might be rather minimal and should be spelled out in roct.

Launching a kernel involves finding the part of the address space the GPU is watching, writing 64 bytes to it and then "ringing a doorbell" which is probably writing to a different magic address. There's a lot of cruft in the API from earlier generations where these things involved a lot of work.

Game plan for finding out goes something like:

  1. Compile some GPU code and put it in the host process
  2. Make the calls into hsa.h to run that kernel
  3. Delete everything unused from hsa to get an equivalent that only uses roct
  4. Delete everything unused from roct to get the raw syscalls
Roct is a small C library that implements the userspace side of the kernel driver. I'd be inclined to link it into your application instead of dropping it entirely, but ymmv. Rocr / HSA is a larger C++ library that has a lot more moving parts and is more tempting to drop from the dependency graph.
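
As a very rough sketch of step 2 above, finding the GPU agent through hsa.h looks something like this (untested, no error handling; the actual dispatch, i.e. writing the 64-byte AQL packet and ringing the doorbell, is more code):

  #include <hsa/hsa.h>
  #include <cstdio>

  // Callback for hsa_iterate_agents: remember the first GPU agent we see.
  static hsa_status_t find_gpu(hsa_agent_t agent, void* data) {
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
    if (type == HSA_DEVICE_TYPE_GPU) {
      *static_cast<hsa_agent_t*>(data) = agent;
      return HSA_STATUS_INFO_BREAK;  // stop iterating
    }
    return HSA_STATUS_SUCCESS;
  }

  int main() {
    hsa_init();  // brings up rocr, which opens roct and the kernel driver underneath
    hsa_agent_t gpu = {};
    hsa_iterate_agents(find_gpu, &gpu);
    std::printf("gpu agent handle: %llu\n", (unsigned long long)gpu.handle);
    // next: hsa_queue_create, load a code object, write the AQL packet, ring the doorbell
    hsa_shut_down();
    return 0;
  }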

Going beyond that, you could build a simplified version of the kernel driver that drops all the other hardware. Might make things better, might not. And beyond that there's the firmware on the GPU, which might be getting more accessible soon, but iiuc is written in assembly so might not be that much fun to hack on. And beyond that you're on the silicon, where changing it really means making a different chip.


APUs for HPC are going to be a wild ride. Accelerated computing in shared memory: CPU-focused folks will actually get access to some high-throughput compute on the sort of timescales that we can actually reason about (the GPU is so far away).


APUs are very cool for GPU programming in general. Explicitly copying data to/from GPUs is a definite nuisance. I'm hopeful that the MI300A will have a positive knock-on effect on the low-power APUs in laptops and similar.


>Explicitly copying data to/from GPUs is a definite nuisance.

CXL allows fine grained shared memory, but people look at the shiny high bandwidth NVLink and talk about how much better it is for... AI.


All video game consoles use APUs, and that does make memory-related operations potentially faster, but at least for video games it's not the bottleneck. I suppose for HPC it might have more significance.


If you're doing simulations, or poking big matrices continuously on CPUs, you can saturate the memory controller pretty easily. If you know what you're doing, your FPU or vector units are saturated at the same time, so "whole system" becomes the bottleneck while it tries to keep itself cool.

Games move that kind of data at the start and don't stream much new data after the initial textures and models are loaded. If you are working on HPC with GPUs, you may need to constantly stream new data into the GPU while streaming the results out. This is why datacenter/compute GPUs have multiple independent DMA engines.
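
The textbook illustration of the first point is a STREAM-style triad: a couple of flops per iteration against roughly 24 bytes of memory traffic, so the memory controller saturates long before the vector units do (a generic sketch, not from the thread):

  #include <cstddef>
  #include <vector>

  // Two loads and one store per iteration vs. one multiply and one add:
  // on most CPUs this loop is limited by memory bandwidth, not FLOPS.
  void triad(std::vector<double>& a, const std::vector<double>& b,
             const std::vector<double>& c, double scalar) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < a.size(); ++i)
      a[i] = b[i] + scalar * c[i];
  }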


Afaik those unified memory architectures are mostly neither cache coherent nor do they support virtual addresses efficiently (you have to trap into privileged code to pin/unpin the mappings), which means the relative cost is lower than for a dedicated GPU accessed via PCIe, but still too high. Only the "boring" old Bobcat-based AMD APUs supported accessing unpinned virtual memory from the L3 (aka system-level) cache, and nobody bothered with porting code to them.


> Afaik those unified memory architectures are mostly neither cache coherent nor do they support virtual addresses efficiently (you have to trap into privileged code to pin/unpin the mappings) which means that the relative cost is lower than a dedicated GPU accessed via PCIe slots, but still to high. Only the "boring" old Bobcat based AMD APUs supported accessing unpinned virtual memory from the L3 (aka system level) cache and nobody bothered with porting code to them.

Other way around: Bobcat was the era of the "onion bus"/"garlic bus", and today things like Apple silicon don't need to be explicitly accessed in certain ways, afaik.

https://www.realworldtech.com/fusion-llano/3/

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...


I’ve been waiting for something like that in the HPC space for years - that’s what I wanted when HSA first came out.

https://en.wikipedia.org/wiki/Heterogeneous_System_Architect...


Having looked briefly at the code, I still think C++17 parallel algorithms are more ergonomic than OpenMP: https://rocm.blogs.amd.com/software-tools-optimization/hipst...
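
For a side-by-side feel (a generic sketch, not taken from the linked post), the same scaling loop in both styles:

  #include <algorithm>
  #include <cstddef>
  #include <execution>
  #include <vector>

  // C++17 parallel algorithm; with ROCm's hipstdpar support the compiler
  // can offload this to the GPU.
  void scale_stdpar(std::vector<double>& y, const std::vector<double>& x) {
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                   [](double v) { return 2.0 * v; });
  }

  // OpenMP target-offload version; on MI300A with unified memory the
  // explicit map clauses can usually be dropped.
  void scale_openmp(double* y, const double* x, std::size_t n) {
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i)
      y[i] = 2.0 * x[i];
  }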


Is language support why people like OpenMP?

I think it is nice because it supports both C and Fortran, and they use the same runtime, so you can do things like pin threads to cores or avoid oversubscription. Stuff like calling a Fortran library that uses OpenMP, from a C code that also uses OpenMP, doesn’t require anything clever.


OpenMP has been around for a long time. People know how to use it, and it has gained many features that are useful for scientific computing.

The consortium behind OpenMP consists mostly of hardware companies and organizations doing scientific computing. Software companies are largely missing. That may contribute to the popularity of OpenMP, as the interests of scientific computing and software development are often different.


>> Is language support why people like OpenMP?

I use it sometimes with C++ because it is super easy to make "embarrassingly parallel" code actually run in parallel. And because it uses nothing but #pragma statements, the code will still compile single-threaded if you don't have OpenMP, as the pragmas are simply ignored.
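
A minimal example of that property: built with -fopenmp this runs across all cores; built without it the pragma is skipped and the loop runs serially.

  #include <cstddef>
  #include <vector>

  // The pragma is the only OpenMP-specific line; compilers that don't
  // understand it just ignore it and emit ordinary single-threaded code.
  void saxpy(std::vector<float>& y, const std::vector<float>& x, float a) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < y.size(); ++i)
      y[i] += a * x[i];
  }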


Funny how we only get the LoC counts for the different versions, but not the performance...

Of course the parallel algorithms are shorter; it's a higher-level interface. But being explicit gives you more control and potentially more performance.



