Hacker News
The AMD “Aldebaran” GPU that won exascale (nextplatform.com)
127 points by jonbaer 73 days ago | 60 comments

I would sincerely hope for a competitive AMD GPU for deep learning. But as long as it's a week-long journey with unknown ending to try to recompile TensorFlow to support ROCm, everyone I know in AI will firmly stick with NVIDIA and their production-proven drivers and CUDA APIs.

I wish AMD would offer something like NVIDIA's Inception program to gift some accelerators and GPUs to suitable C++ coders (like me) so that there's at least a few tutorials on the internet on how other people managed to successfully use AMD + ROCm for deep learning.

EDIT: And it seems ROCm doesn't even support any of those new RDNA2 accelerators or gaming GPUs: https://github.com/RadeonOpenCompute/ROCm/issues/1344

So this is great hardware, but absolutely useless unless you are big enough to write your own GPU drivers from scratch ~_~

ROCm 4.5 is also the last release to support their own Vega 10 based accelerator. (Radeon Instinct MI25)


aka AMD doesn't care... they just want the supercomputer contracts where the customers are savvy enough to build their own very specific SW stack.

Sounds like AMD might still be using the 'tesla roadster' strategy, of selling fewer, more lucrative contracts for the time being. Probably not that they don't care, just that for now, they have to focus.

That sounds reasonable on the surface, but cards don't cost a lot, and sending them out for free to leverage community interest for the open source stack could be extremely cost effective.

If they generate demand they can't keep up with (without messing up their strategy), it's entirely possible that the negative perception would hurt them more than the 'don't care' perception they have now.

Even better, you can provide support for gaming cards and then the community pays you for them instead of you having to pay for it out of dev-rel budget.

And yet AMD is actively moving in the opposite direction and dropping support for older hardware from both their Windows driver, as well as from ROCm and their other compute software.

It's hard to overstate what a positive impact gaming card support and donations of hardware to educational institutions and other devrel have been for CUDA - those are the people who are writing the next 10 years of your software, that sells the actual enterprise hardware. And at this point it's not just "can't afford to support it", AMD is doing fine these days, they just don't want to. Kind of baffling.

GPGPU has been a thing for two decades; they should have had this solved years ago. Let's hope SYCL gets wide support; Intel seems to bet on it, and there's a CUDA backend available. If AMD wants to make themselves irrelevant for anything but gaming and HPC, that's their choice.

Years ago AMD was almost bankrupt and put everything into one last attempt at a CPU. Now AMD is hiring like crazy, and I hope they are able to catch up to their larger competitors in tooling and software.

Here's what I think they'll do for the near future. They can't build an all-encompassing library ecosystem that supports all the hardware, all the existing software, has really good tooling, etc. in a reasonable amount of time. In consumer they can mostly get away with just supporting Microsoft's APIs, which they already do pretty well. If they design a really good compute GPU, they probably won't be able to make enough of them. They can sell all of their production with a few big deals to important customers, and they can attach engineers to those deals to help support the specific needs of those customers. That is much more practical than trying to catch up to CUDA in its entirety. Basically, status quo.

As someone who isn't a big customer and who is mad at Nvidia I don't hope for this to happen, but it seems the most likely path.

I have friends here in the Austin area who have worked for AMD. From what I’ve heard from them, it’s not that AMD doesn’t care, it’s that AMD is clueless and hopelessly disorganized, and they’re constantly doing whatever they can to chase the latest supercomputing contract, to the exclusion of all else.

It’s a target-rich environment if you want to learn all the bad anti-patterns, so that you can avoid doing them in the rest of your career.

So, it’s not that they don’t care. It’s that they don’t have enough hours in the every-day-is-a-hair-on-fire-day that they would be capable of caring.

Is there even a single machine in the supercomputer top 10 that uses AMD GPUs?

I see NVIDIA all over the place there but I'm not aware of any of them using AMD GPUs, though a couple do use AMD CPUs.

Yeah, frankly it's a little misleading to frame this as "AMD won this"; this is a gimme contract to keep them in the game. The DoE is throwing gimme contracts to Intel too for Xe, and they haven't even produced a working product yet.

Their CPUs pretty much fall into the same boat too - is it justifiable to buy Intel CPUs right now for HPC applications, especially with AMD supporting AVX-512 on their Zen 4 chips (which are the counterparts to the Sapphire Rapids the DoE is buying)? Not really, but their interest is in keeping Intel in the game; an AMD monoculture doesn't benefit anyone any more than an Intel monoculture did.

Although of course this is not fab-related, it's the same basic strategy - the US wants as diverse and thriving a tech ecosystem as they can get in the west, and particularly in the US, to counterbalance a rising China. Not that China is anywhere close today, but in the 20-year timeframe it's a major concern.

None today.

those HPC machines will be the first ones

ROCm support for Gaming cards has been poor (and not advertised) but it's part of the tech stack they are selling with these accelerators:


It's clearly a real problem that AMD's ML software stack isn't quite there and lacks support for the non-specialized cards, but that's not really an issue for these HPC use cases.

Interesting, thanks for sharing this.

Apparently they are going to use “HIP” to convert CUDA applications to be able to run on AMD:

> The OLCF plans to make HIP available on Summit so that users can begin using it prior to its availability on Frontier. HIP is a C++ runtime API that allows developers to write portable code to run on AMD and NVIDIA GPUs. It is essentially a wrapper that uses the underlying CUDA or ROCm platform that is installed on a system. The API is very similar to CUDA so transitioning existing codes from CUDA to HIP should be fairly straightforward in most cases. In addition, HIP provides porting tools which can be used to help port CUDA codes to the HIP layer, with no loss of performance as compared to the original CUDA application. HIP is not intended to be a drop-in replacement for CUDA, and developers should expect to do some manual coding and performance tuning work to complete the port.
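The porting tools mentioned in the quote (hipify-perl / hipify-clang, in ROCm's tooling) are largely source-to-source translators that rename CUDA runtime identifiers to their HIP equivalents. Here's a minimal sketch of that idea in Python; the mapping table is a small illustrative subset, not the real tools' (much larger) table:

```python
import re

# Illustrative subset of the CUDA -> HIP renames the hipify tools apply.
CUDA_TO_HIP = {
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
}

def hipify(source: str) -> str:
    """Rename CUDA runtime API identifiers to their HIP equivalents.

    Longest names are matched first so that e.g. cudaMemcpyHostToDevice
    is not partially rewritten by the cudaMemcpy rule.
    """
    pattern = re.compile("|".join(sorted(CUDA_TO_HIP, key=len, reverse=True)))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)

cuda_line = "cudaMalloc(&d_x, n); cudaMemcpy(d_x, h_x, n, cudaMemcpyHostToDevice);"
print(hipify(cuda_line))
# -> hipMalloc(&d_x, n); hipMemcpy(d_x, h_x, n, hipMemcpyHostToDevice);
```

The real tools also handle kernel launch syntax and library names, which is why the quote still expects some manual porting and tuning afterward.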

Would this one day lead to a HIP based PyTorch? I hope so!

> Would this one day lead to a HIP based PyTorch? I hope so!


You can think of HIP as being a fancy batch script to convert CUDA to OpenCL, and it works about as well.

If you're going to have to re-debug it anyway, what's the point? AMD's focus certainly should have been on something like GPU Ocelot or compiling PTX to GCN/RDNA instead.

AMD's not nowhere. https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learni... shows what should be a followable happy path to getting TensorFlow going (a 2-year-old TF 1.15, and a 2.2 beta). I'm curious what is prickly or hard about it.

IMO the deep learning folk need to be working more actively towards the future. The CUDA free ride is amazing, but AMD's HIP already does a good job being CUDA compliant in a general sense. But CUDA also sort of encompasses the massive collection of libraries that Nvidia has written to accelerate a huge amount of use cases. Trying to keep pace with that free-ride is hard.

My hope is eventually we start to invest in Vulkan Compute. Vulkan is way way way harder than CUDA, but it's the only right way I can see to do things. Getting TensorFlow & other libraries ported to run atop Vulkan is a herculean feat, but once there's a start, I tend to believe most ML practitioners won't have to think about the particulars, and I think the deep engineering talent will be able to come, optimize, improve the Vulkan engines very quickly, rapidly be able to improve whatever it is we start with.

It's a huge task, but it just seems like it's got to happen. I don't see what alternative there is, long term, to starting to get good with Vulkan.

> I'm curious what is prickly or hard about it.

I don't want to presume, but from that "what should be" it sounds like you haven't actually tried using ROCm yourself.

My experience with it was an absolute nightmare; I've never gotten ROCm working. Just as well, since it turns out my systems never would have supported it anyway for various reasons (lacking PCIe atomics, for one), but I never actually got far enough to run into the driver problem: I never got the whole custom LLVM fork / ROCm software stack to build.

Caveat, I'm not professionally involved in deep learning or HPC, and as others have mentioned, the framework was only intended for a few specific cards running on very specific hardware for HPC cases.

But pretending like this is even a fraction as useful for the "average" person trying to experiment or even work at a low-medium level in machine learning feels off to me.

I don't think people will be swayed by platitudes about creating a competitive open-systems ecosystem to use plainly inferior software. Companies aren't going to spend oodles of money (and individuals won't volunteer tons of time) to suffer porting frameworks to target bare-bones APIs for the sake of being good sports.

Until either nvidia screws over everyone so much that using AMD cards becomes the path of least resistance, or AMD/Intel offers products at significantly lower prices than nvidia, I don't see the status quo changing much.

> My hope is eventually we start to invest in Vulkan Compute.

Vulkan is for graphics. Khronos' compute standard that's most similar to Cuda is SYCL. Both compile shaders to SPIR-V though.

> Vulkan is for graphics.

Incorrect. Quoting the spec:

> Vulkan is an API (Application Programming Interface) for graphics and compute hardware.

Vulkan has compute shaders[1], which are generally usable. Libraries like VkFFT[2] demonstrate basic signal processing in Vulkan. There are plenty of other non-graphical compute shader examples, and this is part of the design of Vulkan (and also WebGPU). Further, there is a Vulkan ML TSG (Technical Subgroup)[3], which is supposed to be working on ML. Even Nvidia is participating, with extensions like VK_NV_cooperative_matrix, which specifically target the ML tensor cores. As a more complex & current example, Google's IREE allows inference / TensorFlow Lite execution on a variety of drivers, including Vulkan[4], with broad portability across hardware & fairly decent performance, even on mobile chips.

Other people could probably say this better or more specifically, but I'll give it a try: Vulkan is, above all, a general standard for modelling, dispatching & orchestrating work, usually on a GPU. Right now that usage is predominantly graphics, but that is far from a limit. The ideas of representing GPU resources and dispatching/queueing work are generic, apply fairly reasonably to all GPU systems, and can model any workload done on a GPU.

A good general introduction to Vulkan Compute is this great write up here[5]: https://www.duskborn.com/posts/a-simple-vulkan-compute-examp...

> Khronos' compute standard that's most similar to Cuda is SYCL.

SYCL is, imo, the opposite of where we need to go. It's the old historical legacy that CUDA has, of writing really dumb ignorant code & hoping the tools can make it run well on a GPU. Vulkan, on the other hand, asks us to consider deeply what near-to-the-metal resources we are going to need, and demands that we define, dispatch, & complete the actual processing engines on the GPU that will do the work. It's a much much much harder task, but it invites in fantastic levels of close optimization & tuning, allows for far more advanced pipelining & possibilities. If the future is good, it should abandon lo-fi easy options like SYCL and CUDA, and bother to get good at Vulkan, which will allow us to work intimately with the GPU. This is a relationship worth forging, and no substitutes will cut it.

[1] https://vkguide.dev/docs/gpudriven/compute_shaders/

[2] https://github.com/DTolm/VkFFT

[3] https://www.khronos.org/assets/uploads/developers/presentati...

[4] https://google.github.io/iree/deployment-configurations/gpu-...

[5] https://www.duskborn.com/posts/a-simple-vulkan-compute-examp...

History has shown that it's not necessarily the most performant option that wins; it's usually the most convenient. Otherwise we wouldn't use Python for data science and ML, or JavaScript for IDEs. Remember AMD's Close To Metal? It wasn't a big hit. Take a look at the Blender announcement that's on the front page right now: Blender is removing OpenCL code and replacing it with HIP/CUDA to get a single codebase [0].

There's room for both solutions, but I think it's important to have a relatively easy way to use accelerators like GPUs in a cross platform way, without being an expert or having to rewrite code for new architectures.

It is my understanding that because both Vulkan and SYCL use SPIR-V, the work done on drivers and compilers for one of them benefits the other as well.

[0] https://news.ycombinator.com/item?id=29234671

AMD's Close To Metal was the blueprint for Vulkan, or at least its biggest contributor (ed: oops, I'm thinking of AMD Mantle, a little bit later & also extremely super low level). It won. Its ideas power almost all GPU technology today.

Vulkan's not relatively easy to do from scratch, but it is cross platform, and again, most folks don't need to write Vulkan. They're using ML frameworks that abstract that away.

Vulkan will enable use of countless great extensions & capabilities that tools like SYCL won't be smart enough to use. Maybe the SPIR-V can be optimized well by the drivers, but the code SYCL spits out, I'd wager, will be worlds worse than what we'd get if we tried. This is what CUDA does: it allows bad/cheap SYCL-like code, but most folks use NV's vast, ultra-complete libraries, which are far far far better, written in much lower-level code & tweaked for every last oz of performance. CUDA is basically a scripting language + a close-to-the-metal library. SYCL might eventually become similar, but only if we do the hard work of making really good Vulkan libraries. Otherwise it'll be pretty much trash.

There are a lot of arguments in favor of being bad, of not going for the good stuff. But at the end of the day, we should just do the right thing. Everyone other than NV should make a real start, should make a bid for their own survival & everyone else's. Vulkan is the only real bid I see.

It won where?

Consoles use their own APIs; even the Switch has an alternative to Vulkan.

Apple has Metal and Windows DirectX.

The only places where Vulkan has "won" are Linux, irrelevant given the 1%, and Android, where most games are still GL ES if one cares about reaching everyone.

There is a competitive port of TensorFlow for AMD GPUs. The only issue is that it is from Apple and works only on Macs. TensorFlow fully supports AMD GPUs on the Mac and can max out the graphics card as well. Makes me wonder: if Apple could do it, why can't AMD do the same with Vulkan instead of Metal?


> I wish AMD would offer something like NVIDIA's Inception program to gift some accelerators and GPUs to suitable C++ coders (like me) so that there's at least a few tutorials on the internet on how other people managed to successfully use AMD + ROCm for deep learning.

Why not go a step further and pay some people to integrate good first-party support into the most popular libraries? It would probably be quite cheap in comparison and kickstart adoption.

NVIDIA, Intel, Apple, Google (TPU) have hundreds of engineers working on squeezing the performance out of their chips. This is not trivial.

Sometimes it's cheaper to pay a third party to implement something than going through the hiring process and extraneous legalities of contribution.

Sponsoring a third party to do something, or contracting them, is very common.

It will be expensive no matter who does it. Probably more for a contractor, given that they’d have less experience and internal knowledge of the hardware.

ROCm doesn't support RDNA at all https://github.com/RadeonOpenCompute/ROCm/issues/887

"might expect good news with 5.0" is not a promise.

And they've been delaying this for months. Back in April they said on GitHub that 5700 XT support should be available in roughly 2~4 months, and it's already November.

Today a Blender beta version with HIP support has been released. This is working on RDNA hardware (RDNA2 officially supported, RDNA1 enabled but not supported). I guess a release date for ROCm is approaching after all.

https://github.com/RadeonOpenCompute/ROCm/issues/1617 appears to be the issue to track if you are interested in ROCm support for RDNA2.

Around 1980 my family got their first computer. I've followed this business ever since, and I was amazed that a CRAY could do MFLOPS. My MS basic interpreter could do hundreds or even thousands of FLOPs on its 8080A. I watched as the high end went to hundreds of MFLOPS, then GIGAFLOPS which seemed insane. There were national efforts to reach TFLOPS, and reading about the challenges (IIRC at the time interconnect was a huge deal) made it seem like the end was near. Moore's law was always in danger. Then came PETAFLOPS consuming megawatts of power.

And now I play VR on a battery powered gizmo doing about 1 TFLOP strapped to my head, and EXAFLOPS are basically here. This is all with at least TSMC 5nm, 3nm, 2nm, and multi-layer left on the table. After watching this relentless advance for 4 decades I'm pretty sure it will go beyond even that, but we just don't know what it will look like yet.

It's become everyday tech to me, but if I look back the progression is absolutely astounding.

> After watching this relentless advance for 4 decades I'm pretty sure it will go beyond even that, but we just don't know what it will look like yet.

Just like people in the 1950s, having seen the rise of the car and the airplane in their lifetimes, thought we would have flying cars by the 2000s.

Things will eventually plateau, and we will see improvements elsewhere.

It is astounding. What is more astounding to me is that we burn so many of these cycles on eye candy, and that we waste so many of them on bloat. If not for that your battery powered gizmo would run for many days on one charge instead of having to be connected to its umbilical for 8 hours every night.

If that’s astounding think about what most of us spend our time doing with the insane amount of processing power available between our ears.

Sure, but we didn't engineer that with performance in mind, and with computers it was enough to declare a previous generation obsolete. Still, productivity for computer-based applications was actually pretty good on the first generation of those machines; pretty much every cycle counted. Some people used them to play games, and there was a recreational element to programming in its own right, but it wasn't as though anybody would burn cycles to give a UI the texture of the real thing. It was a computer that worked and produced results, which was all that mattered.

Most of that processing power is unfortunately special-purpose and can't be easily repurposed for something else. We have very good vision processing, but it's not like you can entirely hijack that for something else.


The article points out this CDNA2 whitepaper, which has the juicy technical details.

CDNA1 is here: https://www.amd.com/system/files/documents/amd-cdna-whitepap...


CDNA2 / MI200 is a chiplet strategy with two "GCDs", each functioning as a classic GPU. These two GCDs can access each other's memory, but only at a lower 400GBps speed (page 8 whitepaper).

The actual HBM RAM is designed for 1600 GBps (article), x2 since two GCDs exist. AMD says it's like 3200 GBps, but in actuality any one block/workgroup can only get 2000 GBps (1600 GBps local RAM + 400 GBps from Infinity Fabric / the partner GCD). So it's really a bit complicated and will likely be very workload specific.

If your data can be cloned / split efficiently, then the RAM probably will look like 3200GBps. But if you have to communicate with both parts of RAM to see all the data, you'll see a clear slowdown.
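Spelling out the arithmetic above as a sketch (numbers as quoted from the whitepaper/article; actually achievable bandwidth is workload dependent):

```python
# Quoted figures: 1600 GBps of local HBM per GCD, 400 GBps of
# Infinity Fabric bandwidth to the partner GCD, two GCDs per MI200.
LOCAL_HBM_GBPS = 1600
FABRIC_GBPS = 400
NUM_GCDS = 2

# The headline aggregate assumes both GCDs stream only from local HBM.
aggregate_gbps = LOCAL_HBM_GBPS * NUM_GCDS              # 3200

# A single workgroup that must see all of memory tops out at its
# local HBM plus whatever the fabric link to the other GCD provides.
single_gcd_visible_gbps = LOCAL_HBM_GBPS + FABRIC_GBPS  # 2000

print(aggregate_gbps, single_gcd_visible_gbps)
```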

I've been playing a little of the small-arena-survival game Warhammer40k: Dawn of War 2 (2009), and when starting, the AMD "The Future is Fusion" logo shows full screen. For the longest time this was kind of a sad memento to something lost, a future that never happened: Fusion was a ~2009 campaign for their APUs, their GPU+CPU chips, and other possible shared-memory systems.

Well, it's happening. Sort of. AMD is finally getting into the post-PCIe game for reals this time. Only at the very high end of the market, though. (Perhaps upcoming consumer GPUs might have such capabilities, but AMD seems to be shipping literally only dozens/hundreds of high-end GPUs a month atm.) Fusion is happening... for the very big. Oh, and also Apple, whose 200 / 400 GBps M1 Pro/Max chips are performing true wonders via fused/unified memory. The Steam Deck, with ~66 GBps and an integrated AMD APU/GPU, will be the next test. I'm not sure how consoles are doing these days, which is another strongly AMD corner.

In some ways, the Infinity Fabric 3 news makes me a bit sad. In its past life, Infinity Fabric was known as HyperTransport (HTX), an open standard backed by a HyperTransport Consortium, with roots supposedly going all the way back to DEC & Jim Keller (whom Apple got some experience from too, via the 2007 PA Semi acquisition) & other engineers. FPGAs, network cards, and storage could all work closely with the CPU via HTX. In this new regime, Infinity Fabric is finally catching up with the equally closed/proprietary GPU-plus-CPU capability of Nvidia (only available on NV + POWER architecture computers AFAIK). But outside players aren't really able to experiment with these new, faster, closer-to-memory architectures, unlike with HTX. For that, folks need to use some of the various other more-open fabrics/interconnects, which are often lower latency than PCIe but usually not faster: CXL, OpenCAPI, Gen-Z, and others.

OAM is an open standard that Intel + AMD seem to be supporting though.

And with AMD now the owner of Xilinx, there's a good chance that this technology will be across Xilinx FPGAs + AMD GPUs + AMD CPUs.

From the same source two weeks ago: https://www.nextplatform.com/2021/10/26/china-has-already-re...

Do they read their own news?

China won Exascale. Twice. Before anybody else.

Are there any MI200 systems in the Top500 yet?

Supercomputing 2021 is running and the updated November 2021 Top500 list was announced.

There is only one system in the top 10, and that's an NVIDIA A100 system from Microsoft.

The only 2 systems with > 100 PFLOPS are Summit and Fugaku.

If AMD wanted to sell more in this space, wouldn't it pay to support the code which runs in this space? Intel and nvidia are masters of funding compiler and tool chain and even applications stacks to work on their dice. Reading this article and here, I get the impression AMD hasn't entirely mastered how you need to sell the gateway drugs as well as the hard stuff, to sell more of the hard stuff in the end.

AMD was a much smaller company than Intel / NVidia just a few years ago.

Even today, their market cap is a bit misleading, because AMD has much lower profits / revenue than either company. It only makes sense for AMD to work with HPE / Cray on these details for now.

ROCm / HIP seems like the path AMD is taking moving forward. But without Windows support and with few GPUs supporting HIP, it's a bit of a hard sell.

That being said: Polaris 580X was good for a while (until ROCm 3.x or so). Vega64 was good through ROCm 4.5. And now Blender is saying RDNA2 is supported.

So it seems like AMD is keeping at least some consumer GPUs available to try out HIP. You've gotta be careful; it's not like NVidia, where PTX cross-compiles the CUDA into all the different architectures. Only specific cards seem to have good results (such as the Vega 64 or RX 580, but the RX 550 had issues).


Just 7 years ago in 2014, AMD was laying off entire teams as it was risking bankruptcy. The loss of good engineers like that has reverberating effects on the company. They're beginning to recover now (ROCm development really kicked into high gear this past year), but a lot of their earlier stuff (HSA, Fusion, Bobcat cores, etc. etc.) were curtailed in the fight for survival.

Intel had similar scaling/layoff issues. Wasn't there a story in 2018/19 timeframe about Intel having to re-hire shedloads of greybeards because of their production issues?

AMD cut back to survive. There were bankruptcy concerns in 2014.

Intel cut back for no reason. So it was more of a strategic mistake. Intel was making so much money with Sandy Bridge (and the other i7 processors: the 2nd through 6th generations were a one-sided domination of the market).

I think this was less than 6 months ago, when Pat came back.

> If you want to know how and why AMD motors have been chosen for so many of the pre-exascale and exascale HPC and AI systems...



Probably bad translation.

The author lives in North Carolina and has a BA in English. It's just a metaphor.

What software will they use to, for instance, train large deep learning models? Nvidia has CUDA, AMD has what? Are they writing new software from scratch? Maybe they have a lot of frameworks to solve problems in the “traditional” HPC space (eg weather forecasts), but in the ML space I only heard of ROCm which seems to be poorly supported.

AMD seems such an odd choice for “AI supercomputers”.

ROCm works with Tensorflow, Pytorch, etc. This generation they added more matrix throughput for AI workloads but CDNA is still compute focused, hence the name.

Who actually cares about FP64 for AI? All I've seen has been going the other way, ie. bfloat16 or even 8-bit floats.

Since neural transition functions are largely characterized by a single point of inflection, it seems hard to see how you'd make 52 bits of mantissa pay rent even in principle.
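For context on why bfloat16 is cheap to support: it is just the top 16 bits of an FP32 value (1 sign bit, 8 exponent bits, 7 mantissa bits), so it keeps FP32's range while giving up precision. A sketch in Python; real hardware typically rounds to nearest-even rather than truncating as done here:

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Truncate an FP32 value to bfloat16 by keeping its top 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bfloat16_bits(b: int) -> float:
    """Widen bfloat16 back to FP32 by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

print(from_bfloat16_bits(to_bfloat16_bits(1.0)))      # 1.0 (exactly representable)
print(from_bfloat16_bits(to_bfloat16_bits(3.14159)))  # ~3.14; only 2-3 decimal digits survive
```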

I think they market those FP64 things for the simulation market, which is basically pretty much every supercomputer's mission (I'm simplifying). The A100 also has specific silicon for FP64 (so-called DP tensor cores) that you can't find in lower-end datacenter stuff.

FP64 really stings even on (e.g.) midrange Nvidia hardware, what with the 1:32 handbrake. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.... for reference

These aren't for AI workloads, they're for things like simulating atomic blasts.

Or space simulations: https://spaceengine.org/news/blog210611/

> FP64 is extremely slow on Nvidia and AMD GPU, thanks to market segmentation: full performance with FP64 is enabled only on “professional” cards like Quadro. So for consumer hardware, we implemented FP64 emulation using two FP32 numbers. Performance is still not great, but it’s much better than using hardware FP64.
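The two-FP32 emulation described in the quote is the classic "double-float" technique built on error-free transformations (Dekker/Knuth). A sketch in Python; note that Python's native float is FP64, so here the pair extends *beyond* FP64, whereas the GPU implementation would run the same algorithm on pairs of FP32:

```python
def two_sum(a: float, b: float):
    """Knuth's error-free addition: returns (s, err) with s + err == a + b exactly."""
    s = a + b
    b_virtual = s - a            # the portion of b that made it into s
    a_virtual = s - b_virtual    # the portion of a that made it into s
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

def df_add(x, y):
    """Add two double-float numbers, each represented as a (hi, lo) pair."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)         # renormalize so |lo| is tiny relative to hi

# 1e16 + 1 is not representable as a single float here (the ulp at 1e16 is 2),
# but the (hi, lo) pair preserves it exactly:
hi, lo = df_add((1e16, 0.0), (1.0, 0.0))
print(hi, lo)  # 1e+16 1.0
```

Multiplication needs the analogous error-free product (via FMA or Dekker splitting), which is where most of the emulation's cost goes.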

I'd be selling my NVidia stocks tomorrow if I had them
