PyTorch 1.8, with AMD ROCm support (github.com/pytorch)
313 points by lnyan on March 5, 2021 | 138 comments



PyTorch is the most impressive piece of software engineering that I know of. So yeah, it's a nice interface for writing fast numerical code. And for zero effort you can change between running on CPUs, GPUs and TPUs. There's some compiler functionality in there for kernel fusing and more. Oh, and you can autodiff everything. There's just an incredible amount of complexity being hidden behind a very simple interface there, and it just continues to impress me how they've been able to get this so right.
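
A minimal sketch of the interface being praised (my own example, not from the comment): the same few lines autodiff arbitrary numerical code and run on CPU or GPU just by changing the device string.

    import torch

    # pick a device; the rest of the code is identical either way
    device = "cuda" if torch.cuda.is_available() else "cpu"

    x = torch.randn(1000, device=device, requires_grad=True)
    y = (x.sin() ** 2).sum()   # arbitrary numerical code, not a neural network
    y.backward()               # autodiff through the whole computation

    print(x.grad[:3])          # dy/dx, computed on whichever device was chosen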


> and TPUs

BS. There's so much effort getting Pytorch working on TPUs, and at the end of it it's incredibly slow compared to what you have in Tensorflow. I hate this myth and wish it would die.

Old thread on this, detailing exactly why this is true: https://news.ycombinator.com/item?id=24721229


OTOH PyTorch seems to be highly explosive if you try to use it outside its mainstream use (i.e. neural networks). There's sadly no performant autodiff system for general purpose Python. Numba is fine for performance, but does not support autodiff. JAX aims to be sort of general purpose, but in practice it is quite explosive when doing something other than neural networks.

A lot of this is probably due to supporting CPUs and GPUs with the same interface. There are quite profound differences in how CPUs and GPUs are programmed, so the interface tends to especially restrict the more "CPU-oriented" approaches.

I have nothing against supporting GPUs (although I think their use is overrated and most people would do fine with CPUs), but Python really needs a general purpose, high performance autodiff.


> I have nothing against supporting GPUs (although I think their use is overrated and most people would do fine with CPUs), but Python really needs a general purpose, high performance autodiff.

As someone who works with machine learning models day-to-day (yes, some deep NNs, but also other stuff) - GPUs really seem unbeatable to me for anything gradient-optimization-of-matrices (i.e. like 80% of what I do) related. Even inference in a relatively simple image classification net takes an order of magnitude longer on CPU than GPU on the smallest dataset I'm working with.
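
For what it's worth, a rough benchmark sketch of the kind of comparison being made here (the model choice and batch size are arbitrary picks of mine, and the exact ratio will vary by hardware):

    import time
    import torch
    import torchvision

    model = torchvision.models.resnet18().eval()
    batch = torch.randn(32, 3, 224, 224)

    def bench(device):
        m, x = model.to(device), batch.to(device)
        with torch.no_grad():
            m(x)                           # warm-up pass
            if device == "cuda":
                torch.cuda.synchronize()
            start = time.time()
            m(x)
            if device == "cuda":
                torch.cuda.synchronize()   # wait for the GPU to finish
        return time.time() - start

    print("cpu :", bench("cpu"))
    if torch.cuda.is_available():
        print("cuda:", bench("cuda"))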

Was this a comment about specific models that have a reputation as being more difficult to optimize on the GPU (like tree-based models - although Microsoft is working in this space)? Or am I genuinely missing some optimization techniques that might let me make more use of our CPU compute?


For gradient-optimization-of-matrices, for sure. Just make sure that you don't use gradient-optimization-of-matrices just because it runs well on GPUs. There may well be more efficient approaches to your problems that are infeasible on the GPUs' wide SIMD architecture, and you may miss them if you tie yourself to GPUs.

In general it's more that some specific models are easy for GPUs. Most models probably are not.


I really don't understand the "GPUs are overrated" comment. As someone who uses PyTorch a lot and GPU compute almost every day, there is an order of magnitude difference in the speeds involved for most common CUDA / OpenCL accelerated computations.

PyTorch makes it pretty easy to get large GPU-accelerated speed-ups with a lot of code we traditionally limited to NumPy. And this is for things that have nothing to do with neural networks.
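
A small sketch of what that looks like in practice (the computation here is just an example I picked, not from the comment): plain array math, no neural network, running on the GPU when one is available.

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # the kind of thing that would traditionally be a NumPy one-liner:
    # all pairwise distances between a cloud of random 3-D points
    points = torch.rand(10_000, 3, device=device)
    dists = torch.cdist(points, points)        # (10000, 10000) distance matrix

    print(dists.mean().item())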


For a lot of cases you don't really need that much performance. Modern processors are plenty fast. It seems that the current push to use GPUs also pushes people towards GPU-oriented solutions, such as using huge NNs for more or less anything, while other approaches would in many cases be orders of magnitude more efficient and robust.

GPUs (or "wide SIMDs" more generally) have quite profound limitations. Branching is very limited, recursion is more or less impossible and parallelism is possible only for identical operations. This makes for example many recursion-based time-series methods (e.g. Bayesian filtering) very tricky or practically impossible. From what I gather, running recurrent networks is also tricky and/or hacky on GPU.

GPUs are great for some quite specific, yet quite generally applicable, solutions, like tensor operations etc. But being tied to GPUs' inherent limitations also limits the space of approaches that are feasible to use. And in the long run this can stunt the development of different approaches.
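
To make the recursion point concrete, here's a toy example of my own (a 1-D Kalman-style filter): each step depends on the result of the previous one, so the time loop can't be spread across thousands of GPU lanes the way a big matrix multiply can.

    import numpy as np

    def kalman_1d(observations, q=1e-3, r=1e-1):
        mean, var = 0.0, 1.0
        means = []
        for z in observations:        # inherently sequential: step t needs step t-1
            var += q                  # predict
            k = var / (var + r)       # Kalman gain
            mean += k * (z - mean)    # update with the new observation
            var *= 1.0 - k
            means.append(mean)
        return np.array(means)

    obs = np.sin(np.linspace(0, 10, 100)) + 0.1 * np.random.randn(100)
    print(kalman_1d(obs)[-1])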


> For a lot of cases you don't really need that much performance. Modern processors are plenty fast. It seems that the current push to use GPUs also pushes people towards GPU-oriented solutions, such as using huge NNs for more or less anything, while other approaches would in many cases be orders of magnitude more efficient and robust.

for instance?


I still don't get the criticism of PyTorch. If anything, you get the best of both worlds in many ways, with their API supporting GPU and CPU operations in exactly the same way.


What do you mean by “seems to be highly explosive”? I have used Pytorch to model many non-dnn things and have not experienced highly explosive behavior. (Could be that I have become too familiar with common footguns though)


I get what you mean by the "GPUs are overrated" comment, which is that they're thought of as essential in many cases when they're probably not, but in many domains like NLP, GPUs are a hard requirement for getting anything done.


Have you tried using Enzyme* on Numba IR?

* https://enzyme.mit.edu


Wait, what? JAX and also PyTorch are used in a lot more areas than NNs. JAX is even considered to do better in that department in terms of performance than all of Julia, so what are you talking about?


GP makes a fair point about JAX still requiring a limited subset of Python though (mostly control flow stuff). Also, there's really no in-library way to add new kernels. This doesn't matter for most ML people but is absolutely important in other domains. So Numba/Julia/Fortran are "better in that department in terms of performance" than JAX, because the latter doesn't even support said functionality.
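
As a concrete illustration of the control-flow restriction (my own toy example, not from the comment): under jit, a Python `if` on a traced value fails and you have to reach for jax.lax.cond instead.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def relu_python_if(x):
        if x > 0:                     # raises an error under jit: the branch
            return x                  # depends on a traced (abstract) value
        return jnp.zeros_like(x)

    @jax.jit
    def relu_lax_cond(x):
        # the structured control-flow primitive that does work under jit
        return jax.lax.cond(x > 0, lambda v: v, lambda v: jnp.zeros_like(v), x)

    print(relu_lax_cond(jnp.array(3.0)))   # 3.0
    # relu_python_if(jnp.array(3.0))       # would raise a concretization error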


> JAX is even considered to do better in that department in terms of performance than all of Julia, so what are you talking about?

Please provide sources for this claim


> There's sadly no performant autodiff system for general purpose Python.

Like there is for general purpose Julia code? (https://github.com/FluxML/Zygote.jl)

> I have nothing against supporting GPUs (although I think their use is overrated and most people would do fine with CPUs),

Do you run much machine learning code? All those matrix multiplications run a good bit faster on the GPU.


> Oh, and you can autodiff everything.

Well, not everything. Julia's Zygote AD system can autodiff most Julia code (currently with the exception of code that mutates arrays/matrices).


And you didn't even talk about data and model parallelism, which often just works out of the box.


It's Python wrappers on top of the existing THTensor library, which was already provided by Torch. But yes, great engineering nonetheless.


I don't think this is a particularly accurate description of PyTorch in 2021. Yeah, the original C++ backend came from Torch, but I think most of that has been replaced. AFAIK, all the development of the C++ backend for PyTorch over the last several years has been done as part of the PyTorch project; it's not just Python wrappers at this point.


What I like about PyTorch is that most of the functionality is actually available through the C++ API as well, which has 'beta API stability' as they call it. So, there are good bindings for some other languages as well. E.g., I have been using the Rust bindings in a larger project [1], and they have been awesome. A precursor to the project was implemented using Tensorflow, which was a world of pain.

Even things like mixed-precision training are fairly easy to do through the API.

[1] https://github.com/tensordot/syntaxdot


This is pretty neat since it is the first time in years that a top-tier deep learning framework has official support for any training accelerator with open source kernel drivers.

I guess the TPU also doesn't require kernel drivers because you talk to it over the network instead of PCIe. But you cannot buy a TPU; only the int8 edge TPU is for sale. (And I've heard that the edge TPU's are absolutely top-notch for performance per $ and Watt right now, as an aside.)


I believe TensorFlow is a top-tier deep learning framework, and it has had ROCm support since 2018.

> edge TPU's are absolutely top-notch for performance per $ and Watt right now

Do you mean "aren't"? The performance per $ and Watt was not awesome even when it was released. I was hoping for great toolchain support, but that also didn't happen.


Tensorflow doesn't seem to officially support ROCm, only unofficial community projects do. This is official support from PyTorch.


Tensorflow does officially support ROCm. The project was started by AMD and later upstreamed.

https://github.com/tensorflow/tensorflow/tree/master/tensorf...

https://github.com/tensorflow/tensorflow/blob/master/tensorf...

It is true that it is not Google who are distributing binaries compiled with ROCm support through PyPI (tensorflow and tensorflow-gpu are uploaded by Google, but tensorflow-rocm is uploaded by AMD). Is this what you meant by "not officially supporting"?


What you describe sounds a lot like the PyTorch support before this announcement: You could download PyTorch from AMD's ROCm site or build it yourself for >= 2 years now and this worked very reliably. (Edit: The two years (Nov 2018 or so) are the ones I can attest to from using it personally, but it probably didn't start then.)

The news here is that the PyTorch team and AMD are confident enough about the quality that they're putting it on the front page. This has been a long time in the making, and finally achieving official support is a great step for the team working on it.


Oh, interesting. I do wonder if Google puts the same quality control and testing into the ROCm version, though. Otherwise it would really be a lower tier of official support.

Granted, I don't know anything about the quality of PyTorch's support either.


Jetson Nano: 1.4 TOPS/W, Coral TPU: 2 TOPS/W ?

Of course it doesn't really help that Google refuses to release a more powerful TPU that can compete with e.g. Xavier NX or a V100 or RTX3080 so for lots of applications there isn't much of a choice but to use NVIDIA.


Sorry, should have mentioned "if you have access to Shenzhen" in my post :)

What I have in mind is something like the RK3399Pro; it has a proprietary NPU at roughly 3 TOPS / 1.5 W (on paper). But its toolchain is rather hard to use. HiSilicon had similar offerings. There's also the Kendryte K210, which claims 1 TOPS @ 0.3 W, but I haven't gotten a chance to try it.

I was already playing with the RK3399Pro when the Edge TPU was announced; life is tough when you have to feed your model into a blackbox "model converter" from the vendor. That's the part I hoped the Edge TPU would excel at. But months later I was greeted by... "to use the Edge TPU, you have to upload your TFLite model to our online model optimizer", which is worse!


There's now a blackbox compiler that doesn't have to run on their service, but it's basically the same as all the others now because of that.


On Xavier, the dedicated AI inference block is open source hardware.

Available at http://nvdla.org/


Are there any <10W boards that have better performance/watt for object detection?

If it exists, I wanna buy it


I used the Intel neural compute sticks [1] for my porn detection service [2] and they worked great. Could import and run models on a Pi with ease

[1] https://blog.haschek.at/2018/fight-child-pornography-with-ra... [2] https://nsfw-categorize.it/


Very interesting project and article. You should consider submitting it to HN for its own post (if you have not done so already).


Jetson Xavier NX, but that comes with a high price tag. It’s much more powerful however.


I'll also add a caveat that toolage for Jetson boards is extremely incomplete.

They supply you with a bunch of sorely outdated models for TensorRT like Inceptionv3 and SSD-MobileNetv2 and VGG-16. WTF, it's 2021. If you want to use anything remotely state-of-the-art like EfficientDet or HRNet or Deeplab or whatever you're left in the dark.

Yes you can run TensorFlow or PyTorch (thankfully they give you wheels for those now; before you had to google "How to install TensorFlow on Jetson" and wade through hundreds of forum pages) but they're not as fast at inference.


> I'll also add a caveat that toolage for Jetson boards is extremely incomplete.

A hundred times this. I was about to write another rant here but I already did that[0] a while ago, so I'll save my breath this time. :)

Another fun fact regarding toolage: Today I discovered that many USB cameras work poorly on Jetsons (at least when using OpenCV), probably due to different drivers and/or the fact that OpenCV doesn't support ARM64 as well as it does x86_64. :(

> They supply you with a bunch of sorely outdated models for TensorRT like Inceptionv3 and SSD-MobileNetv2 and VGG-16.

They supply you with such models? That's news to me. AFAIK converting something like SSD-MobileNetv2 from TensorFlow to TensorRT still requires substantial manual work and magic, as this code[1] attests to. There are countless (countless!) posts on the Nvidia forums by people complaining that they're not able to convert their models.

[0]: https://news.ycombinator.com/item?id=26004235

[1]: https://github.com/jkjung-avt/tensorrt_demos/blob/master/ssd... (In fact, this is the only piece of code I've found on the entire internet that managed to successfully convert my SSD-MobileNetV2.)


They provide some SSD-Mobilenet-v2 here:

https://github.com/dusty-nv/jetson-inference

Yeah, it works. I get 140 fps on a Xavier NX. It's super impressive for the wattage and size of the device. But they want you to train it using their horrid "DIGITS" interface, and it doesn't support any more recent networks.

I really wish Nvidia would stop trying to reinvent the wheel in training and focus on keeping up with being able to properly parse all the operations in the latest state-of-the-art networks which are almost always in Pytorch or TF 2.x.


> They provide some SSD-Mobilenet-v2 here: https://github.com/dusty-nv/jetson-inference

I was aware of that repository but from taking a cursory look at it I had thought dusty was just converting models from PyTorch to TensorRT, like here[0, 1]. Am I missing something? (EDIT: Oh, never mind. You probably meant the model trained on COCO[2]. Now I remember that I ignored it way back when because I needed much better accuracy.)

> I get 140 fps on a Xavier NX

That really is impressive. Holy shit.

[0]: https://github.com/dusty-nv/jetson-inference/blob/master/doc...

[1]: https://github.com/dusty-nv/jetson-inference/issues/896#issu...

[2]: https://github.com/dusty-nv/jetson-inference/blob/master/doc...


You have https://github.com/NVIDIA-AI-IOT/torch2trt as an option for example to use your own models on TensorRT just fine.

And https://github.com/tensorflow/tensorrt for TF-TRT integration.


TF-TRT doesn't work nearly as well as pure TRT. On my Jetson Nano a 300x300 SSD-MobileNetV2 with 2 object classes runs at 5 FPS using TF, <10 FPS using TF-TRT and 30 FPS using TensorRT.


This. Try any recent network with TF-TRT and you'll find that memory is constantly being copied back and forth between TF and TRT components of the system every time it stumbles upon an operation not supported in TRT.

As such, I often got slower results with TF-TRT than with just pure TF, and at most a marginal improvement, even though what TRT does is conceptually awesome from a deployment standpoint; if it only supported all the operations in TF, it could be a several-fold speed-up in many cases.


> even though what TRT does is conceptually awesome from a deployment standpoint

I thought the same until, earlier this week, I realized that if I convert a model to TensorRT, serialize it and store it in a file, that file is specific to my device (i.e. my specific Jetson Nano), meaning that my colleagues can't run that file on their Jetson Nano. What the actual fuck.

Do you happen to have found a workaround for this? I really don't want to have to convert the model anew every single time I deploy it. There are just too many moving parts involved in the conversion process, dependency-wise.


https://github.com/wang-xinyu/tensorrtx has a lot of models implemented for TensorRT. They test on GTX1080 not jetson nano though, so some work is also needed.

TVM is another alternative for getting models to run inference fast on the Nano.


How does TVM compare to TensorRT performance-wise?



Now that major frameworks finally started supporting ROCm, AMD has half-abandoned it (IIRC the last consumer cards supported were the Vega ones, cards from 2 generations ago). I hope this will change.


I work for AMD, but this comment contains exclusively my personal opinions and information that is publicly available.

ROCm has not been abandoned. PyTorch is built on top of rocBLAS, rocFFT, and Tensile (among other libraries) which are all under active development. You can watch the commits roll in day-by-day on their public GitHub repositories.

I can't speak about hardware support beyond what's written in the docs, but there are more senior folks at AMD who do comment on future plans (like John Bridgman).


"Future plan"... it's been years and it's still "future plan".

I want to buy AMD because they are more open than Nvidia. But Nvidia supports CUDA day one for all their graphics cards, and AMD still doesn't have ROCm support on most of their products even years after their release [0].

Given AMD's size and budget, why they don't hire a few more employees to work full time on making ROCm work with their own graphics cards is beyond me.

The worst is how they keep people waiting. It's always vague phrases like "not currently", "may be supported in the future", "future plan", "we cannot comment on specific model support", etc.

AMD doesn't want ROCm on consumer cards? Then say it. Stop making me check the ROCm repos every week only to get more disappointed.

AMD plans to support it on consumer cards? Then say it and give a release date: "In May 2021, the RX 6800 will get ROCm support; thanks for your patience and your trust in our products."

I like AMD for their openness and support of standards, but they are so unprofessional when it comes to Compute

[0] https://github.com/RadeonOpenCompute/ROCm/issues/887


> why they don't hire a few more employees

The reason is that it would take more than a few more employees to provide optimized support across all GPUs. If nothing else, it's a significant testing burden.


I really don't want to buy Nvidia, but AMD really isn't an alternative if you do anything outside of gaming...


Hardware support is key, though.

CUDA works with basically any card made in the last 5 years, consumer or compute. ROCm seems to work with a limited set of compute cards only.

There's a ROCm team in Debian [1] trying to push ROCm forward (ROCm being open), but just getting supported hardware alone is already a major roadblock, which stalls the effort, and hence any contributions Debian could give back.

[1] https://salsa.debian.org/rocm-team


I didn't know about the Debian team. Thank you very much for informing me! I'm not sure how much I can do, but I would be happy to discuss with them the obstacles they are facing and see what I can help with. I'm better positioned to help with software issues than hardware matters, but I would love to hear about ROCm from their perspective. What's the best way to contact the team?


The team's communication channel is the Debian AI mailing list [1], and they'd absolutely appreciate someone reaching out. Having ROCm succeed is a mutual interest.

From what I recall, on the software side, I think the difficulties stemmed from ROCm being distributed (naturally, of course) as a vendor would do it, for example: targeting specific kernels, compilers, distribution versions. The problem with this is that all these may become out of date, which means that the GitHub instructions may no longer work for anyone wanting to give it a try. This was the case for some of the team members attempting to rebuild some of the packages.

The Debian team would prefer to package all elements of the ROCm ecosystem and include them in the official Debian archive, so that users can just run `apt-get install rocm-something` without worrying about the kernel version etc., just as they can already do a trivial `apt-get install nvidia-cuda-toolkit`.

I think collaboration with AMD here would be mutually beneficial, as on the one hand, Debian wants FOSS accelerated computing, which AMD is pushing, and on the other hand, Debian is naturally experienced in distributing things for Debian, Ubuntu, etc., something AMD could benefit from.

Thank you for engaging!

[1] https://lists.debian.org/debian-ai/


I have no affiliation with AMD. I just own a few pieces of their hardware.

The following is my speculation. From what I've noticed, Polaris and Vega had a brute force approach; you could throw a workload on them and they would brute force through it. The Navi generation is more gamer oriented; it does not have the hardware for the brute force approach, and relies more on optimizations from the software running on it and using it in specific ways. That provides better performance for games, but worse compute - and that's why it is not and is not going to be supported by ROCm.

The downside, of course, is that you can no longer buy Vega. If only I had known the last time it was on sale...


Can you comment perhaps on what you guys have compared to nVidia’s DGX? I’d rather buy a workhorse with open drivers.


The MI100 is our server compute product. The marketing page is probably a better source of information than I am: https://www.amd.com/en/products/server-accelerators/instinct...


> PyTorch is built on top of rocBLAS, rocFFT, and Tensile (among other libraries) which are all under active development.

That's not very helpful if we can't use them on our own computers... Not many devs are able to put their hands on a datacentre-class card...


Yes, you can ask bridgmanAMD on Reddit.


It’s worse....

It’s Linux only so no Mac, Windows or WSL.

No support whatsoever for APUs, which means if you have a laptop without a dedicated GPU you're out of luck (though discrete mobile GPUs aren't officially supported either and often do not work).

Not only have they not been supporting any of their consumer-based R“we promise it's not GCN this time”DNA GPUs, but since December last year (2020) they've dropped support for all pre-Vega GCN cards, which means that Polaris (the 400/500 series), not only the most prolific of AMD GPUs released in the past 5 years or so but also the most affordable, is no longer supported.

That’s on top of all the technical and architectural software issues that plague the framework.


> It’s worse....

> It’s Linux only so no Mac, Windows or WSL.

I don't really see a problem here. You want Office or Photoshop? It runs on Mac and Windows only, so you'd better get one. You want ROCm? Get Linux, for exactly the same reason.


No, I don't want ROCm, I want GPGPU, which is why the entire planet is running CUDA.

Every decision AMD made with ROCm seems to boil down to “we never want to be actually competitive with CUDA”.

Everything from being Linux only (and yes, that is a huge, huge, huge limitation, because most people don't want ROCm, but many would like Photoshop or Premiere or Blender to be able to use it, especially with the OpenCL issues AMD has)...

...through them not supporting their own hardware, to other mind-boggling decisions like the fact that ROCm binaries are hardware specific, so you have no guarantee of interoperability and, worse, no guarantee of future compatibility (and in fact it does break).

The point is that I can take a 6-year-old CUDA binary and still run it today on everything from an embedded system to a high-end server; it maintains its compatibility across multiple generations of GPU hardware, multiple operating systems and multiple CPU architectures.

And if you can’t understand why that is not only valuable but more or less mandatory for every GPGPU application you would ever ship to customers, perhaps you should stick to photoshop.

AMD is a joke in this space and they are very much actively continuing in making themselves one with every decision they make.

It's not that they've abandoned ROCm; it's that its entire existence is telegraphing “Don't take us seriously, because we don't”.


It's a big problem if you want to use GPGPU in a product shipped to customers, or if you want to do some GPU programming on your own computer in your free time.

Keep in mind that not only is it Linux only, it doesn't work on consumer hardware either.


> It's a big problem if you want to use GPGPU in a product shipped to customers,

So ship linux version and let the customers decide whether they want it or not.

If you are doing server product, it is even handy; just ship docker image or docker-compose/set of k8s pods, if it needs them.

> or if you want to do some GPU programming on your own computer in your free time.

You can install Linux very easily; it's not that it is some huge expense.

> it doesn't work on consumer hardware either.

It does, just not on the current gen one or even currently purchasable one. I agree, this is a problem.


> So ship linux version and let the customers decide whether they want it or not.

What if I want to use GPGPU in Photoshop, or a game with more than two users? Or really anything aimed at consumers?

> If you are doing server product

That's irrelevant, server products can choose their own hardware and OS.

> You can install Linux very easily

"For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem"

Also, their main competitor with a huge market lead have this: https://docs.nvidia.com/cuda/cuda-installation-guide-microso...

> It does, just not on the current gen one or even currently purchasable one. I agree, this is a problem.

Only on old hardware, and you can't expect users' computers to be compatible. In fact, you should expect users' computers to be incompatible. That's like saying consumer hardware supports Glide.


> What if I want to use GPGPU in Photoshop, or a game with more than two users? Or really anything aimed at consumers?

Then use an API supported on your target platform. It's not ROCm then. Maybe Vulkan Compute/DirectCompute/Metal Compute?

> That's irrelevant, server products can choose their own hardware and OS.

It is relevant for ROCm.

> > You can install Linux very easily

> "For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem"

I'm quite surprised to see comparisons ad absurdum after suggesting the proper tool for the job. Installing a suitable operating system - for a supposed hacker - is a convoluted, nonsensical action nowadays?

> Only on old hardware, and you can't expect users' computers to be compatible. In fact, you should expect users' computers to be incompatible. That's like saying consumer hardware supports Glide.

Vega is not that old; the problem is that it is not procurable anymore, which, if you read my comment again, I agreed is a problem.


> Then use an API supported on your target platform. It's not ROCm then.

That's the point, ROCm isn't suitable for things outside datacenters or maybe some workstations. Cuda is however, and that's what AMD should be aiming for. Their best bet is SYCL, but that uses ROCm as backend...

> Installing a suitable operating system - for a supposed hacker - is a convoluted, nonsensical action nowadays?

Again, if all you need is to run ROCm on your own computer Linux isn't a hurdle. If you want to ship software to customers you can't just say "switch OS", they're probably already using their computers for other things.

> Vega is not that old; the problem is that it is not procurable anymore, which, if you read my comment again, I agreed is a problem.

The fact that they used to support something isn't a relevant argument and I just don't see the point in bringing it up, other than to underline the fact that AMD doesn't care about compute support for the mass market anymore. At least we agree on one thing.

The split between graphics-specific and compute-specific hardware is an even bigger issue than being Linux only. ROCm stands for Radeon Open Compute, and their Radeon DNA hardware can't run it, so streamers can't use AMD hardware to play games and improve their mic sound, while it's trivial to do with Nvidia's OptiX. And what good are all the ML models if you can't ship them to customers?


GPGPU with the graphics APIs isn’t a comparable developer experience at all, and comes with quite some major limitations.

About DirectCompute, C++ AMP is in practice dead on Windows, stuck at a DX11 feature level with no significant updates since 2012-13, staying present just for backwards compatibility.


Then probably SYCL is what you are looking for. But that one is still a WIP.


We all started as noobs once. A sizeable part of the GPGPU market is academics who don't yet have their *nix chops. I had to walk an undergrad through basic git branching the other day.


I understand, but look at it from another POV: ROCm is a USP of the Linux platform. Windows has its own USPs (DirectX or Office, for example), and macOS the same (iOS development, for example).

Why should the Linux platform give up its competitive advantages against the others? It would only diminish the reasons for running it. The other platforms won't do the same -- and nobody is bothered by that. In fact, it is generally considered to be an advantage of a platform and a reason for getting it.

And it's not like getting Linux to run could be a significant expense (like purchasing an Apple computer is for many) or you need to get a license (like getting Windows). You can do it for free, and you will only learn something from it.

I got my *nix chops in the early '90s exactly this way: I wanted a 32-bit flat-memory C compiler, didn't have the money for Watcom and the associated DOS extenders, and djgpp wasn't a thing yet (or I didn't know about it yet). So I got Slackware at home, as well as access to DG-UX at uni. It was different, I had to learn something, but ultimately, I'm glad I did.


ROCm isn't a USP of the Linux platform; it's vendor locked.

It provides no competitive advantage to Linux, not to mention that that entire concept is a bit laughable as far as FOSS/OS goes.

I don’t understand why AMD seems adamant at not wanting to be a player in the GPU compute market.

Close to Metal was aborted.

OpenCL was abandoned.

HSA never got off the ground.

ROCm made every technical decision possible to ensure it won’t be adopted.

You can't get GPUs that could run it; you can't ship a product to customers because good luck buying the datacenter GPUs from AMD (the MI100 is unavailable for purchase unless you make a special order, and even then AMD apparently doesn't want anyone to actually buy it); and you can't really find GPU cloud instances that run it on any of the major providers.

So what exactly is it? It's been 5 years, and if you want to develop anything on the platform today you have to build a desktop computer, find overpriced hardware on the second-hand market and pay a premium for it, and hope that AMD won't deprecate it within months like they did with GCN 2/3 without even a heads-up notice, all so you can develop something that only you can run, with no future compatibility or interoperability.

If this is the right tool for the job then the job is wrong.


> or WSL

Does WSL not support PCIe passthrough?


It does not.

And even then, almost all AMD consumer GPUs are not supported. (Everything but Vega.)

APUs? Nope too.


Oh oof. Thanks for saving me time not having to look up ROCm benchmarks. I find it really surprising that they don't wanna compete on performance/$ at all by not supporting consumer cards.


No worries, for future reference you can check here (hopefully that page will report improved support in the future)

https://github.com/RadeonOpenCompute/ROCm#supported-gpus


I think the issue is more that they have compute-optimised and graphics-optimised cards, and Vega is their last compute-optimised card.

It would be very nice for them to refresh their compute cards as well.


Going for market segmentation like that sounds like a pretty bad idea if you are already the underdog in the game.


“Making science accessible to everyone is important to us. That's one of the reasons why GeForce cards can run CUDA.” — Bryce Lelbach at Nvidia

AMD meanwhile considers GPU compute as a premium feature... which is a different approach.

Not having a shippable IR like PTX but explicitly targeting a given GPU ISA, making this unshippable outside of HPC and supporting Linux only also points in that direction.

Intel will end up being a much better option than AMD for GPU compute once they ship dGPUs... in their first gen.


Especially since more and more consumer workloads use compute as well. DLSS makes real time ray tracing at acceptable resolutions possible, your phone camera uses neural nets for every photo you take, Nvidias OptiX noise reduction is very impressive, and so on.

AMD doesn't seem to want to be part of future computer usages, it's a shame. "Armies prepare to fight their last war, rather than their next war".


They made new compute cards, but they aren’t available to customers. (Only businesses, under the Radeon Instinct brand)

With the price to match for those...


They aren't available for businesses either. You can buy a Tesla card in a Micro Center or on Newegg, and through a million partners ranging from small system builders to Dell and HP.

Good luck getting an instinct card.


Part of the problem is that Vega GPUs are too good for crypto currency mining.

Example: I bought a 16GB Radeon VII for €550 (including 19% VAT). That would be a decent value proposition for Deep Learning even today. However, the card appears to trade second hand(!) for €1300+ on e-bay, because you apparently can run 90-100 MH/sec or so (this is an estimate I found on the web, I don't engage in this type of thing). As a result, you would not be able to get consumer Vega with these performance characteristics for a good price even if AMD produced them.


So, for someone not familiar, how far is AMD behind Nvidia's CUDA? I ask because AMD clearly has better linux driver support than Nvidia, and it would be awesome if their AI/ML libs were catching up.


At my previous employer, we bought two Radeon VIIs (in addition to NVIDIA GPUs). The last time I tried it (just over ~6 months ago), there were still many bugs. Things would just crash and burn very frequently (odd shape errors, random crashes, etc.). Two colleagues reported some of those bugs in ROCm, but the bug reports were largely ignored.

Maybe out-of-the-box support for PyTorch will result in more polish. Who knows? Another issue is that AMD has not yet implemented support for consumer GPUs after Vega. You'd think that targeting researchers on a budget (most of academia) would help them improve the ecosystem and weed out bugs. But they only seem interested in targeting large data centers.

It's all a shame, because as you say, AMD graphics on Linux is awesome. And I think many Linux enthusiasts and researchers would be happy to use and improve ROCm to break NVIDIA's monopoly. If only AMD actually cared.


Yeah I think having support for consumer level cards is really important because it makes it easier to have a pipeline of students who are familiar with your tech. These same folks advocate for AMD in the workplace and contribute to OSS on behalf of AMD. Forcing this to be a 'datacenter only' solution is really short sighted.


So, just up front: these are my personal opinions. I do not speak on behalf of AMD as a company. I'm just a software developer who works on ROCm. I joined AMD specifically because I wanted to help ROCm succeed.

If the problems you encountered are related to a particular ROCm software library, I would encourage you to open an issue on the library's GitHub page. You will get the best results if you can get your problem directly in front of the people responsible for fixing it. Radeon VII is one of the core supported platforms and bugs encountered with it should be taken seriously. In fact, if you want to point me at the bug reports, I will personally follow up on them.

ROCm has improved tremendously over the past 6 months, and I expect it to continue to improve. There have been growing pains, but I believe in the long-term success of this project.


As someone who bought an RX 580 for playing with deep learning with ROCm (it was supported at the time): after posting to one or two bug threads, I had the same experience as the GP -- our issues were ignored. The issues have recently been closed as the RX 580 is no longer supported.

As for long term success, good luck, but once bitten twice shy.


I empathize. There have been plenty of mistakes. Trust is difficult to earn, and clearly we have not acted in a manner deserving of yours.

When I first encountered ROCm, my impressions were mixed. The idea thrilled me, but the execution did not. I planned to ignore ROCm until it was clear that it would meet my needs. Obviously, my plans changed, but I haven't forgotten the perspective I had as a potential user.

There are still rough edges and I know that nobody gets third chances, so it's good that you're cautious. We are steadily improving and I believe we will do better in supporting our users going forward, but to earn that trust back, we will have to prove it through our actions.


The supported ROCm version is 4.0, which only the latest AMD Instinct cards support. There's still a long way to go before it's supported on the latest consumer RDNA 2 GPUs (RX 6000 series).


Lack of ROCm support in consumer RDNA 2 GPUs really makes it impossible for regular people to use ROCm. As an owner of an AMD Radeon RX 6800 I'm pretty salty about it.


I am getting some conflicting messages about support. There is a small group of people working on ROCm for Julia (in AMDGPU.jl) and while that work is still alpha quality, they seem to expect even devices as old as RX 580 to work.

Are all of these support issues something that is on the ROCm/AMD/driver side, or are they on the side of libraries like pytorch?


RX580 had a pro version which got ROCm support. It's the newer cards and older cards which aren't supported.


The RX 580 is out of support for ROCm nowadays; ROCm only supports Vega and Instinct now.

https://github.com/RadeonOpenCompute/ROCm/issues/1353#issuec...


Can confirm, I made an 8x rx580 rig for an r&d project specifically for that reason.


Time to re-sell into the hot GPU market now? :P


Don't work there anymore; even so that server is probably buried under loads of failed ideas.


This is an issue with AMD not wanting to long-term support the code paths in ROCm components necessary to enable ROCm on these devices. My hope is that Polaris GPU owners will step up to the plate and contribute patches to ROCm components to ensure that their cards keep working, when AMD is unwilling to do the leg work themselves (which is fair, they aren't nearly as big or rich as NVidia).


It's the last thing that keeps me on Nvidia with proprietary Linux drivers. I wouldn't mind ML training on an AMD card being slower, but I need my workload to be at least GPU-accelerated.


I mean I wouldn't worry too much about it, I think if something big like PyTorch supports it AMD might rethink their strategy here. They have a lot to gain by entering the compute market.


AMD only cares about data center. Anything below that, they don't care. They don't care about supporting anything other than linux. God forbid some person try to experiment with their hardware/software as a hobby before putting it to use in a work setting.


Eh, not sure if that's correct. Their Ryzen consumer CPUs work amazingly well on Windows. The gaming graphics cards also still target Windows.

And if they do care about the data center that much and ROCm becomes a thing for compute, they will want as many people as possible to be able to experiment with ROCm at home. So that they demand data centers with ROCm.


> And if they do care about the data center that much and ROCm becomes a thing for compute, they will want as many people as possible to be able to experiment with ROCm at home. So that they demand data centers with ROCm.

The engineers know this and have tried to explain this to upper management, but upper management only listens to "customer interviews" which consists of management from data centers. Upper management does not care about what engineers hear from customers because those are not the customers they care about.


ROCm has been a thing for four years now, Cuda had been around almost a decade before ROCm was conceived, and AMD still doesn't show any interest in an accessible compute platform for all developers and users. I think they should focus on OpenCL and Sycl.


Well it's a chicken and egg problem.

Without major library support nobody cares about ROCm, if nobody cares about ROCm AMD management will focus on other things.

But PyTorch just laid that egg.


As far as ROCm support on consumer products goes, it is strictly a management problem, which will take years for them to figure out since they do not listen to their engineers and do not view users of consumer graphics cards as compute customers.

I would love an alternative to Nvidia cards. After waiting for so long for ROCm support for RDNA cards and reading an engineer's comments about why there is no support yet, I've given up on AMD for compute support on their graphics cards. I'm hoping Intel's graphic cards aren't garbage and get quick support. I probably will buy an Nvidia card before then if I have the opportunity since I'm tired of waiting for an alternative.


At least they promised support.

They're actively refusing to comment on 5000-series.


Indeed. Yet, the amdgpu-pro driver supports the 5000 series.

I managed to get it to work for my 5500 XT, but only in Ubuntu and for the 5.4 kernel and it wasn’t straightforward.

It would be nice if a project like Mesa would exist for OpenCL.

In the future, I suppose Vulkan will be used instead for compute though.


It also looks like they are in the process of adding Apple Metal support, possibly for the M1. Part of 1.8 is this PR: https://github.com/pytorch/pytorch/pull/47635


What's the point? If you have enough money to buy a brand new Apple M1 laptop, you can afford a training rig or cloud credits. Any modern discrete GPU will blow away any M1 laptop for training.

Is anyone training ML models on their ultra-thin laptop?


I don't understand this logic. If someone has $1000 for an entry-level M1 machine, does that mean they also have enough money for a separate rig with a GPU, which is probably another $600-1000 for something decent? Cloud GPUs are also pretty expensive.

I don’t think anyone is seriously training their ML models on their ultra thin laptop but I think the ability to do so would make it easier for lots of people to get started with what they have. It might be that they don’t have lots of extra cash to throw around or it’s hard to justify on something they are just starting.

Cloud training is also just less convenient and an extra hassle compared to doing things locally, even if it's something like Google Colab, which is probably the best option right now for training on thin laptops.


If someone is shelling out for a brand new, early adopter product, then they probably have a decent amount of money.

Even when TensorFlow and PyTorch implement training support on the M1, it will be useless for practically anything except training 2-3 layer models on MNIST.

So why should valuable engineering time be spent on this?


This is just patently false. Most of the folks I know that have an M1 are students that were saving up to upgrade from a much older computer and got the M1 Air. I can assure you that they don't have a decent amount of money.

Tensorflow has a branch that's optimized for metal with impressive performance. [1] It's fast enough to do transfer learning quickly on a large resnet, which is a common use-case for photo/video editing apps that have ML-powered workflows. It's best for everyone to do this locally: maintains privacy for the user and eliminates cloud costs for the developer.

Also, not everyone has an imagenet sized dataset. A lot of applied ML uses small networks where prototyping is doable on a local machine.

[1] https://blog.tensorflow.org/2020/11/accelerating-tensorflow-...


Because with support for the M1 you can prototype your network on your local machine with "good" performance. There are many cloud solutions etc., but for convenience nothing beats your local machine. You can use an IDE you like, etc.


Because contrary to what you believe, the M1 simply is not performant enough to be used to "prototype" your network. NNs can't simply be scaled up and down. It is *NOT* like those web apps which you can run on potatoes just fine as long as nobody is hitting them heavily.


Not so sure about that. Here are two things you can do (assuming you're not training huge transformers or something).

1. Test your code with super low batch size. Bad for convergence, good for sanity check before submitting your job to a super computer.

2. Post-training evaluation. I’m pretty sure the M1 has enough power to do inference for not-so-big models.

These two reasons are why I’m sometimes running stuff on my own GTX 1060, even though it’s pretty anemic and I wouldn’t actually do a training run there.

There's quite a bit of friction to training in the cloud, especially if it's on a shared cluster (which is what I have access to). You have a quota, and wait times when the supercomputer is under load. Sometimes you just need to quickly fire up something!
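
Point 1 in practice looks something like this (a sketch with made-up shapes and a throwaway model, just to show the idea): if a tiny model can't overfit a handful of samples locally, there's no point submitting the full job.

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # tiny fake batch: enough to catch shape bugs, NaNs, or a loss that never moves
    x, y = torch.randn(4, 32), torch.randint(0, 10, (4,))

    for step in range(200):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    print(loss.item())   # should be near zero if the sanity check passes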


> 1. Test your code with super low batch size. Bad for convergence, good for sanity check before submitting your job to a super computer.

Or you can buy a desktop machine for the same price as an M1 MacBook with 32GB or 64GB RAM and an RTX2060 or RTX3060 (which support mixed-precision training) and you can actually finetune a reasonable transformer model with a reasonable batch size. E.g., I can finetune a multi-task XLM-RoBERTa base model just fine on an RTX2060, model distillation also works great.

Also, there are only so many sanity checks you can do on something as weak (when it comes to neural net training). Sure, you can check if your shapes are correct, loss is actually decreasing, etc. But once you get at the point your model is working, you will have to do dozens of tweaks that you can't reasonably do on an M1 and still want to do locally.

tl;dr: why make your life hard with an M1 for deep learning, if you can buy a beefy machine with a reasonable NVIDIA GPU at the same price? Especially if it is for work, your employer should just buy such a machine (and an M1 MacBook for on the go ;)).
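
For reference, mixed-precision training of the kind mentioned above is only a few extra lines with torch.cuda.amp (this is a generic sketch with a placeholder model, not the actual XLM-RoBERTa setup):

    import torch
    from torch import nn

    device = "cuda"
    model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 2)).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
    scaler = torch.cuda.amp.GradScaler()
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(16, 256, device=device)
    y = torch.randint(0, 2, (16,), device=device)

    for step in range(10):
        opt.zero_grad()
        with torch.cuda.amp.autocast():   # run the forward pass in fp16 where safe
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()     # scale the loss to avoid fp16 underflow
        scaler.step(opt)
        scaler.update()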


Absolutely agree! My points were more about the benefits of running code on your own machine rather than in the cloud or on a cluster. I don’t own an M1, but if I did I wouldn’t want to use it to train models locally... When on my laptop I still deploy to my lab desktop; this adds little friction compared to a compute cluster, and as you mention we’re able to do interesting stuff with a regular gaming GPU. When everything works great and I now want to experiment at scale, I then deploy my working code to a supercomputer.


Sure, you probably don't want to do full training runs locally, but there's a lot you can do locally that has a lot of added friction on a GPU cluster or other remote compute resource.

I like to start a new project by prototyping and debugging my training and config code, setting up the data loading and evaluation pipeline, and hacking around with some baseline models, making sure they can overfit some small subset of my data.

After all that’s done it’s finally time to scale out to the gpu cluster. But I still do a lot of debugging locally

Maybe this kind of workflow isn’t as necessary if you have a task that’s pretty plug and play like image classification, but for nonstandard tasks I think there’s lots of prototyping work that doesn’t require hardware acceleration


Coding somewhat locally is a must for me too because the cluster I have access to has pretty serious wait times (up to a couple of hours on busy days). Imagine only being able to run the code you're writing a few times a day at most! Iterative development and making a lot of mistakes is how I code; I don't want to go back to punch-card days where you waited and waited before you ended up with a silly error.


This is false. You can prototype a network on an M1 [1] and teacher-student models are a de facto standard for scaling down.

You can trivially run transfer-learning on an M1 to prototype and see if a particular backbone fits well to a small dataset, then kickoff training on some cloud instance with the larger dataset for a few days.

[1] https://blog.tensorflow.org/2020/11/accelerating-tensorflow-...


You can just run your training with a lot less data.


I think you underestimate just how many MacBooks Apple sells. There’s only so many affluent early adopters out there. And the M1 MacBooks are especially ideal for mainstream customers.


> Any modern discrete GPU will blow away any M1 laptop for training.

Counter example: none of the AMD consumer GPUs can be used for training. So no, not any modern discrete GPU blows away the M1.

The M1 might not be a great GPGPU system, but it is much better than many systems with discrete GPUs.


I use PyTorch for small-ish models, not really the typical enormous deep learning workload where you let it train for days.

I still prefer to work on a GPU workstation because it's the difference between running an experiment in minutes vs hours, makes it easier to iterate.

Some of that speed on a low-power laptop would be great. Much less friction.


This still won't work with a RX 580 right? It seems like ROCm 4 doesn't support that card.


Ryzen APUs - the embedded GPUs inside laptops - are not supported by AMD for running ROCm.

There is an unofficial Bruhnspace project... but it is really sad that AMD has made a management decision that prevents a perfectly viable Ryzen laptop from making use of these libraries.

Unlike...say a M1


Might have to do with the fact that AMD just doesn't seem to have the resources (see the common complaints about their drivers' quality) to fully support every chip.

Another reason is certainly that they simply don't need to - just as with Intel's iGPUs, people working with deep learning opt for discrete GPUs (either built-in or external), neither of which is an option (yet?) for M1-based systems.

The audience would be a niche within a niche and the cost-benefit-ratio doesn't seem to justify the effort for them.


> AMD just doesn't seem to have the resources

AMD's net income for 2020 was about $2.5B. If it was a management priority, they would fund more people to focus on this.

I would love to support open-source drivers, but AMD's efforts with ROCm on consumer hardware are a joke. It's been said in other comments that AMD only cares about the datacenter. That certainly seems to be the case. So until AMD takes this seriously and gets a legitimate developer story together, I'm spending my money elsewhere.


> So until AMD takes this seriously and gets a legitimate developer story together, I'm spending my money elsewhere.

Fair enough. Thing is, AMD's market share in the mobile market has been below 15% over the past years [1] and only last year increased to about 20%.

Of these 20%, how many notebooks are (think realistically for a second) intended to be used for DL while also not featuring an NVIDIA dGPU?

ROCm on consumer cards isn't a priority for AMD, since profits are small compared to the datacentre market and there's not that many people actually using consumer hardware for this kind of work.

I always feel there's a ton of bias going on and one should refer to sales data and market analysis to find out what the actual importance of one's particular niche really is.

AMD's focus w.r.t. consumer hardware is on gaming and CPU performance. That's just how it is and it's not going to change anytime soon. On the notebook side of things, an AMD APU + NVIDIA dGPU is the best you can get right now.

[1] https://www.tomshardware.com/news/amd-vs-intel-q3-2020-cpu-m...


> ROCm on consumer cards isn't a priority for AMD, since profits are small compared to the datacentre market and there's not that many people actually using consumer hardware for this kind of work.

I think causality runs the other way: Profits are small and there aren't many people using AMD cards for this _because_ the developer experience for GPGPU on AMD is terrible (and that because it's not a priority for AMD).


That would imply that there even was a laptop market for AMD in the first place. As the market numbers show, up until last year, AMD simply wasn't relevant at all in the notebook segment, so what developer experience are you even talking about if there were no developers on AMD's platform?


I agree that AMD on mobile is a wasteland. But AMD has shipped over 500m desktop GPUs in the last 10 years. Surely some of those could/would have been better used for GPGPU dev if there was a decent developer experience.


I disagree with you. AMD is chasing developers to get them to use AMD for GPU-based training. Just a month or so back, it announced a partnership with AWS to get its GPUs in the cloud.

https://aws.amazon.com/blogs/aws/new-amazon-ec2-g4ad-instanc...

So I would disagree with your claim about market share being the reason for its inability to create a superior developer-laptop experience.

Cluelessness? Sure. But not helplessness. If it wants the developer market (versus the gamer market), then it had better start acting like a developer tools company... which includes cosying up to Google/Facebook/AWS/Microsoft and throwing money at ROCm. Education is one way - https://developer.nvidia.com/educators/existing-courses ... and giving developers a generally superior development experience on the maximum number of machines is another.


Well, NVIDIA view themselves as a software company that also builds hardware.

> "NVIDIA is a software-defined company today," Huang said, "with rich software content like GeForce NOW, NVIDIA virtual workstation in the cloud, NVIDIA AI, and NVIDIA Drive that will add recurring software revenue to our business model." [1]

Not sure that's the case with AMD. AMD lags behind massively when it comes to software and developer support, and I doubt that a few guys who insist on ROCm running on their APUs are in their focus.

It's just not a relevant target demographic.

[1] https://www.fool.com/investing/2020/08/21/nvidia-is-more-tha...


I don't believe so... and it looks like neither does Apple or Google/TensorFlow:

- https://developer.apple.com/documentation/mlcompute

- https://blog.tensorflow.org/2020/11/accelerating-tensorflow-...

For a more personal take on your answer - do consider the rest of the world. For example, Ryzen is very popular in India. Discrete GPUs are unaffordable for the college student who wants to train a non-English NLP model on a GPU.


How does that contradict my point - Apple has to support the M1, since their M1-based SoCs don't support dGPUs (yet), so the M1 is all there is.

Besides, Apple is the most valuable company in the world and has exactly the kind of resources AMD doesn't have.

> For example, Ryzen is very popular in India. Discrete GPUs are unaffordable for the college student who wants to train a non-English NLP model on a GPU.

Well, Google Colab is free and there are many affordable cloud-based offers as well. Training big DL models is not something you'd want to do on a laptop anyway. Small models can be trained on CPUs, so that shouldn't be an issue.

Inference is fast enough on CPUs anyway, and if you really need to train models on your Ryzen APU, there are always other libraries, such as TensorFlow.js, which is hardware agnostic as it runs on top of WebGL.

That's why I don't think this is a big deal at all - especially given that Intel still holds >90% of the integrated GPU market and doesn't even have an equivalent to ROCm. Again, niche within a niche, no matter where you look.


Curious to see how this performs in a real world setting. My understanding is that Nvidia's neural network libs and other proprietary foo would still hold an edge over a standard AMD card.

If this is not the case then this is a really big deal.



