I wish AMD would offer something like NVIDIA's Inception program to gift some accelerators and GPUs to suitable C++ coders (like me) so that there's at least a few tutorials on the internet on how other people managed to successfully use AMD + ROCm for deep learning.
EDIT: And it seems ROCm doesn't even support any of those new RDNA2 accelerators or gaming GPUs: https://github.com/RadeonOpenCompute/ROCm/issues/1344
So this is great hardware, but absolutely useless unless you are big enough to write your own GPU drivers from scratch ~_~
aka AMD doesn't care... they just want the supercomputer contracts where the customers are savvy enough to build their own very specific SW stack.
And yet AMD is actively moving in the opposite direction and dropping support for older hardware from both their Windows driver, as well as from ROCm and their other compute software.
It's hard to overstate what a positive impact gaming card support and donations of hardware to educational institutions and other devrel have been for CUDA - those are the people who are writing the next 10 years of your software, that sells the actual enterprise hardware. And at this point it's not just "can't afford to support it", AMD is doing fine these days, they just don't want to. Kind of baffling.
Here's what I think they'll do for the near future. They can't build an all-encompassing library ecosystem that supports all the hardware, all the existing software, has really good tooling, etc. in a reasonable amount of time. In consumer they can mostly get away with just supporting Microsoft's APIs, which they already do pretty well. If they design a really good compute GPU, they probably won't be able to make enough of them. They can sell all of their production with a few big deals to important customers, and they can attach engineers to those deals to help support the specific needs of those customers. That is much more practical than trying to catch up to CUDA in its entirety. Basically, status quo.
As someone who isn't a big customer and who is mad at Nvidia, I hope this doesn't happen, but it seems the most likely path.
It’s a target-rich environment if you want to learn all the bad anti-patterns, so that you can avoid doing them in the rest of your career.
So, it’s not that they don’t care. It’s that they don’t have enough hours in the every-day-is-a-hair-on-fire-day that they would be capable of caring.
I see NVIDIA all over the place there but I'm not aware of any of them using AMD GPUs, though a couple do use AMD CPUs.
Their CPUs pretty much fall into the same boat too - is it justifiable to buy Intel CPUs right now for HPC applications, especially with AMD supporting AVX-512 on their Zen 4 chips (the counterparts to the Sapphire Rapids parts the DoE is buying)? Not really, but their interest is in keeping Intel in the game; an AMD monoculture doesn't benefit anyone any more than an Intel monoculture did.
Although of course this is not fab-related, it's the same basic strategy - the US wants as diverse and thriving a tech ecosystem as they can get in the west, and particularly in the US, to counterbalance a rising China. Not that China is anywhere close today, but in the 20-year timeframe it's a major concern.
those HPC machines will be the first ones
It's clearly a real problem that AMD's ML software stack isn't quite there, and that it lacks support for the non-specialized cards, but that's not really an issue for these HPC use cases....
Apparently they are going to use "HIP" to let CUDA applications run on AMD:
> The OLCF plans to make HIP available on Summit so that users can begin using it prior to its availability on Frontier. HIP is a C++ runtime API that allows developers to write portable code to run on AMD and NVIDIA GPUs. It is essentially a wrapper that uses the underlying CUDA or ROCm platform that is installed on a system. The API is very similar to CUDA so transitioning existing codes from CUDA to HIP should be fairly straightforward in most cases. In addition, HIP provides porting tools which can be used to help port CUDA codes to the HIP layer, with no loss of performance as compared to the original CUDA application. HIP is not intended to be a drop-in replacement for CUDA, and developers should expect to do some manual coding and performance tuning work to complete the port.
Would this one day lead to a HIP based PyTorch? I hope so!
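To make the "porting tools" part of the quote concrete: hipify-perl really is mostly mechanical source translation, since the HIP runtime API mirrors CUDA's almost name-for-name. Here's a toy sketch of that idea (the table and function names are mine, illustrative only; the real tools cover far more of the API, and hipify-clang does proper compiler-based translation):

```python
import re

# A tiny slice of the CUDA -> HIP rename table. The real hipify tools
# cover hundreds of runtime, driver, and library entry points.
RENAMES = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
}

def hipify(source: str) -> str:
    # Match whole identifiers only, so "cudaMemcpy" doesn't clobber the
    # front of "cudaMemcpyHostToDevice" or a user symbol like "mycudaMemcpy2".
    pattern = re.compile(r"\b(" + "|".join(RENAMES) + r")\b")
    return pattern.sub(lambda m: RENAMES[m.group(1)], source)

cuda_src = "cudaMalloc(&d_a, n); cudaMemcpy(d_a, a, n, cudaMemcpyHostToDevice);"
print(hipify(cuda_src))
# hipMalloc(&d_a, n); hipMemcpy(d_a, a, n, hipMemcpyHostToDevice);
```

Kernel launch syntax and performance tuning are where the manual work the quote mentions comes in; the API surface itself is largely a rename.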
If you're going to have to re-debug it anyway, what's the point? AMD's focus certainly should have been on something like GPU Ocelot or compiling PTX to GCN/RDNA instead.
IMO the deep learning folk need to be working more actively towards the future. The CUDA free ride is amazing, but AMD's HIP already does a good job being CUDA compliant in a general sense. But CUDA also sort of encompasses the massive collection of libraries that Nvidia has written to accelerate a huge amount of use cases. Trying to keep pace with that free-ride is hard.
My hope is eventually we start to invest in Vulkan Compute. Vulkan is way way way harder than CUDA, but it's the only right way I can see to do things. Getting TensorFlow & other libraries ported to run atop Vulkan is a herculean feat, but once there's a start, I tend to believe most ML practitioners won't have to think about the particulars, and I think the deep engineering talent will be able to come, optimize, improve the Vulkan engines very quickly, rapidly be able to improve whatever it is we start with.
It's a huge task, but it just seems like it's got to happen. I don't see what alternative there is, long term, to starting to get good with Vulkan.
I don't want to presume, but it sounds like you're describing ROCm as it should be, and haven't actually tried using it.
My experience with it was an absolute nightmare; I've never gotten ROCm working. Just as well, since it turns out my systems never would have supported it anyway for various reasons (lacking PCIe atomics, for one), but I never even got far enough to hit the driver problem: I never got the whole custom LLVM fork / ROCm software stack to work.
Caveat, I'm not professionally involved in deep learning or HPC, and as others have mentioned, the framework was only intended for a few specific cards running on very specific hardware for HPC cases.
But pretending like this is even a fraction as useful for the "average" person trying to experiment or even work at a low-medium level in machine learning feels off to me.
I don't think people will be swayed by platitudes about creating a competitive open-systems ecosystem to use plainly inferior software. Companies aren't going to spend oodles of money (and individuals won't volunteer tons of time) to suffer porting frameworks to target bare-bones APIs for the sake of being good sports.
Until either nvidia screws over everyone so much that using AMD cards becomes the path of least resistance, or AMD/Intel offers products at significantly lower prices than nvidia, I don't see the status quo changing much.
Vulkan is for graphics. Khronos' compute standard that's most similar to Cuda is SYCL. Both compile shaders to SPIR-V though.
Incorrect. Quoting the spec:
> Vulkan is an API (Application Programming Interface) for graphics and compute hardware.
Vulkan has compute shaders, which are generally usable. Libraries like VkFFT demonstrate basic signal processing in Vulkan. There are plenty of other non-graphical compute shader examples, and this is part of the design of Vulkan (and also WebGPU). Further, there is a Vulkan ML TSG (Technical Subgroup), which is supposed to be working on ML. Even Nvidia is participating, with extensions like VK_NV_cooperative_matrix, which specifically target the ML tensor cores. As a more complex and current example, there's Google's IREE, which allows inference / TensorFlow Lite execution on a variety of drivers, including Vulkan, with broad portability across hardware and fairly decent performance, even on mobile chips.
Other people could probably say this better/more specifically, but I'll give it a try: Vulkan is, above all, a general standard for modelling, dispatching & orchestrating work, usually on a GPU. Right now that usage is predominantly graphics, but that's far from a hard limit. The ideas of representing GPU resources and dispatching/queueing work are generic, apply fairly reasonably to all GPU systems, and can model any workload done on a GPU.
A good general introduction to Vulkan Compute is this great write up here: https://www.duskborn.com/posts/a-simple-vulkan-compute-examp...
> Khronos' compute standard that's most similar to Cuda is SYCL.
SYCL is, imo, the opposite of where we need to go. It's the old historical legacy that CUDA has, of writing really dumb ignorant code & hoping the tools can make it run well on a GPU. Vulkan, on the other hand, asks us to consider deeply what near-to-the-metal resources we are going to need, and demands that we define, dispatch, & complete the actual processing engines on the GPU that will do the work. It's a much much much harder task, but it invites in fantastic levels of close optimization & tuning, allows for far more advanced pipelining & possibilities. If the future is good, it should abandon lo-fi easy options like SYCL and CUDA, and bother to get good at Vulkan, which will allow us to work intimately with the GPU. This is a relationship worth forging, and no substitutes will cut it.
There's room for both solutions, but I think it's important to have a relatively easy way to use accelerators like GPUs in a cross platform way, without being an expert or having to rewrite code for new architectures.
It is my understanding that because both Vulkan and SYCL use SPIR-V, the work done on drivers and compilers for one of them benefits the other as well.
Vulkan's not exactly easy to do from scratch, but it is cross-platform, and again, most folks don't need to write Vulkan. They're using ML frameworks that abstract that away.
Vulkan will enable use of countless great extensions & capabilities that tools like SYCL won't be smart enough to use. Maybe the SPIR-V can be optimized well by the drivers, but the code SYCL spits out, I'd wager, will be worlds worse than what we'd do if we try. This is what CUDA does: it allows bad/cheap SYCL-like code, but most folks use NV's vast, ultra-complete libraries, which are far far far better written in much lower-level code & tweaked for every last oz of performance. CUDA is basically a scripting language + a close-to-the-metal library. SYCL might eventually become similar, but only if we do the hard work of making really good Vulkan libraries. Otherwise it'll be pretty much trash.
There are a lot of arguments in favor of settling, of not going for the good stuff. But at the end of the day, we should just do the right thing. Everyone other than NV should make a real start, should make a bid for their own survival & everyone else's. Vulkan is the only real bid I see.
Consoles use their own APIs; even the Switch has an alternative to Vulkan.
Apple has Metal, and Windows has DirectX.
The only places where Vulkan has "won" are Linux, irrelevant given its ~1% share, and Android, where most games are still GL ES if one cares about reaching everyone.
Why not go a step further and pay some people to integrate good first-party support into the most popular libraries? It would probably be quite cheap in comparison and kickstart the adoption.
Sponsoring a third party to do something or contracting them, is very common.
And now I play VR on a battery powered gizmo doing about 1 TFLOP strapped to my head, and EXAFLOPS are basically here. This is all with at least TSMC 5nm, 3nm, 2nm, and multi-layer left on the table. After watching this relentless advance for 4 decades I'm pretty sure it will go beyond even that, but we just don't know what it will look like yet.
It's become everyday tech to me, but if I look back the progression is absolutely astounding.
Just like people in the 1950s, having seen the rise of the car and the airplane in their lifetimes, thought we would have flying cars by the 2000s.
Things will eventually plateau, and we will see improvements elsewhere.
The article points out this CDNA2 whitepaper, which has the juicy technical details.
CDNA1 is here: https://www.amd.com/system/files/documents/amd-cdna-whitepap...
CDNA2 / MI200 is a chiplet strategy with two "GCDs", each functioning as a classic GPU. These two GCDs can access each other's memory, but only at a lower 400 GBps speed (page 8 of the whitepaper).
The actual HBM RAM is designed for 1600 GBps (per the article), x2 since there are two GCDs. AMD says it's like 3200 GBps, but in actuality any one block/workgroup can only get 2000 GBps (1600 GBps local RAM + 400 GBps over Infinity Fabric from the partner GCD). So it's really a bit complicated and will likely be very workload-specific.
If your data can be cloned / split efficiently, then the RAM probably will look like 3200GBps. But if you have to communicate with both parts of RAM to see all the data, you'll see a clear slowdown.
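The arithmetic above can be written down as a back-of-the-envelope model (the GBps figures are the ones quoted from the whitepaper/article; the model itself is my simplification):

```python
# MI200 bandwidth figures as quoted above, in GBps.
LOCAL_HBM_BW = 1600   # each GCD to its own HBM stacks
XGCD_LINK_BW = 400    # Infinity Fabric link to the partner GCD's memory

def per_gcd_visible_bw():
    # A workgroup pinned to one GCD can at best stream from its own HBM
    # and pull from the partner GCD's memory over the link concurrently.
    return LOCAL_HBM_BW + XGCD_LINK_BW

def aggregate_bw():
    # The marketing aggregate: both GCDs streaming only from local HBM,
    # i.e. the "data split cleanly in two" best case.
    return 2 * LOCAL_HBM_BW

print(per_gcd_visible_bw(), aggregate_bw())  # 2000 3200
```

So 3200 GBps is reachable only when each GCD touches its own half of the data; anything that needs the remote half is bounded by the 2000 GBps figure.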
Well, it's happening. Sort of. AMD is finally getting into the post PCIe game for reals this time. Only at the very high end of the market though. (Perhaps upcoming consumer GPUs might have such capabilities, but AMD seems to be only shipping literally dozens/hundreds of high end GPUs a month atm.) Fusion is happening... for the very big. Oh, and also, Apple, whose 200 / 400GBps M1 Pro/MAX chips are performing true wonders via fused/unified memory. The Steam Deck, with ~66GBps and integrated AMD APU/GPU, will be a next test. I'm not sure how consoles are doing these days, which is another strongly AMD corner.
In some ways, the Infinity Fabric 3 news makes me a bit sad. In its past life, Infinity Fabric was known as HyperTransport (HTX), an open standard, backed by a HyperTransport Consortium, with roots supposedly going all the way back to DEC & Jim Keller (who Apple got some experience from too via the 2008 PA Semi acquisition) & other engineers. FPGAs, network cards, and storage could all work closely with the CPU via HTX. In this new regime, Infinity Fabric is finally catching up with the equally closed/proprietary capability of linking Nvidia GPUs directly to CPUs (only available on NV+Power architecture computers AFAIK). But outside players aren't really able to experiment with these new faster closer-to-memory architectures, unlike with HTX. For that, folks need to use some of the various other more-open fabrics/interconnects, which are often lower latency than PCIe but usually not faster: CXL, OpenCAPI, Gen-Z, and others.
And with AMD now the owner of Xilinx, there's a good chance that this technology will be across Xilinx FPGAs + AMD GPUs + AMD CPUs.
Do they read their own news?
China won Exascale.
Before anybody else.
Are there any MI200 systems in the Top500 yet?
Supercomputing 2021 is running, and the updated November 2021 Top500 list was announced.
There is only one system in the top 10, and that's an NVIDIA A100 system from Microsoft.
The only 2 systems with > 100 PFLOPS are Summit and Fugaku.
Even today, their market cap is a bit misleading, because AMD has much lower profits / revenue than either company. It only makes sense for AMD to work with HPE / Cray on these deals for now.
ROCm / HIP seems like the path AMD is taking moving forward. But without Windows support and with few GPUs supporting HIP, it's a bit of a hard sell.
That being said: Polaris 580X was good for a while (until ROCm 3.x or so). Vega64 was good through ROCm 4.5. And now Blender is saying RDNA2 is supported.
So it seems like AMD is keeping at least some consumer GPUs available to try out HIP. You've gotta be careful though; it's not like NVIDIA, where PTX lets CUDA cross-compile to all the different architectures. Only specific cards seem to have good results (such as the Vega 64 or RX 580, but the RX 550 had issues).
Just 7 years ago in 2014, AMD was laying off entire teams as it was risking bankruptcy. The loss of good engineers like that has reverberating effects on the company. They're beginning to recover now (ROCm development really kicked into high gear this past year), but a lot of their earlier stuff (HSA, Fusion, Bobcat cores, etc. etc.) were curtailed in the fight for survival.
Intel cut back for no reason, so it was more of a strategic mistake. Intel was making so much money with Sandy Bridge (and the other i7 processors: the 2nd through 6th generations were a one-sided domination of the market).
Probably bad translation.
AMD seems such an odd choice for “AI supercomputers”.
Since neural activation functions are largely characterized by a single point of inflection, it seems hard to see how you'd make 32 bits of mantissa pay rent even in principle.
FP64 really stings even on (e.g.) midrange Nvidia hardware, what with the 1:32 handbrake. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.... for reference
> FP64 is extremely slow on Nvidia and AMD GPU, thanks to market segmentation: full performance with FP64 is enabled only on “professional” cards like Quadro. So for consumer hardware, we implemented FP64 emulation using two FP32 numbers. Performance is still not great, but it’s much better than using hardware FP64.