Tensorflow 2.0 AMD Support (github.com)
296 points by montalbano 14 days ago | 83 comments

We've been experimenting with TF 1.14 on a Radeon VII. When you manage to properly install it, it does work at about 90% the speed of an NVIDIA RTX 2080 Ti, but is much cheaper, which can be a good trade-off.

However, the tooling and installation procedure are still rough around the edges and the documentation is lacking.

Last time I tried to upgrade the ROCm packages and the drivers, I bumped up against an issue where one of the packages tried to write to /data/jenkins_workspace during installation. We have /data as an NFS mount that is not writable even for the root user, so the installation broke down. I worked around it, but there are other issues as well. The order in which you install packages can matter too; I had to purge packages and reinstall them a few times. It worked with TF 1.14.1 but not with 1.14.0, if I recall correctly. The first few iterations are very slow; it seems it needs to find the optimal conv implementation for each kernel size separately.
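
Since the first few iterations are slow, any throughput figure should probably leave them out. A minimal sketch of that idea, assuming you have per-step wall-clock timings; the function name, the timings and the warm-up count are all hypothetical, not taken from any benchmark tool:

```python
from statistics import mean

def steady_state_throughput(step_seconds, batch_size, warmup_steps=10):
    """Images/sec computed only over the steps that follow the warm-up phase."""
    measured = step_seconds[warmup_steps:]
    if not measured:
        raise ValueError("need more steps than warmup_steps")
    return batch_size / mean(measured)

# Hypothetical timings: three slow steps (kernel search), then steady state.
step_seconds = [4.0, 3.5, 2.0] + [0.25] * 20

naive = 64 / mean(step_seconds)  # warm-up steps drag the average down
steady = steady_state_throughput(step_seconds, batch_size=64, warmup_steps=3)
print(f"naive: {naive:.1f} images/sec, steady state: {steady:.1f} images/sec")
```

With a long enough run the difference shrinks, but for short runs the warm-up steps can dominate the average.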

Also, the card does not have anything like NVIDIA's tensor cores for half-precision (float16/FP16) computation. It can do FP16, but that won't speed things up, and it delivers worse results (in terms of prediction accuracy) than NVIDIA cards running FP16. There is no prediction accuracy difference in FP32.

And some other small issues here and there. So be warned and test it yourself on one card before investing in big multi-GPU machines.

I do hope that AMD entering the field will benefit us through increased competition in the long run. But at the moment ROCm seems like just a side project for a small team at AMD, and they can't yet deliver the streamlined experience we're used to from CUDA.

Interesting. What benchmark are you running? Getting it within 10% on such drastically different hardware looks wrong to me, and it sounds like the GPU is not the bottleneck in that particular case. Could it be that the system is CPU bound (with some horrid thread contention in TensorFlow or something)?

LambdaLabs' benchmark: https://github.com/lambdal/lambda-tensorflow-benchmark

It's certainly around 10-15% and it's not CPU bound, as the benchmark uses synthetic images without the need for data loading and preprocessing.

I cannot rerun them now for precise numbers because my ROCm packages are broken and I don't have the spare hours to fix things. Surely it can be fixed but needs fiddling.

You can still be CPU bound, even if the benchmark is using synthetic images. A quick check would be to see whether your 2080 Ti is 100% utilized (with nvidia-smi), and to calculate the percentage of theoretical peak FP32 throughput that you are getting on both cards.
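
The "percentage of theoretical peak" check is simple arithmetic. A rough sketch; the FLOPs-per-image and peak-FLOPS values below are placeholder assumptions to be replaced with figures for your own model and card, and the throughput is a made-up example:

```python
def fraction_of_peak(images_per_sec, flops_per_image, peak_flops):
    """Achieved FLOP rate divided by the card's theoretical peak."""
    return images_per_sec * flops_per_image / peak_flops

# All three inputs are illustrative assumptions, not measurements.
FLOPS_PER_TRAINING_IMAGE = 12e9   # assumed forward + backward cost per image
PEAK_FP32_FLOPS = 13.4e12         # assumed peak FP32 rate of the card
MEASURED_IMAGES_PER_SEC = 300     # hypothetical benchmark result

utilization = fraction_of_peak(
    MEASURED_IMAGES_PER_SEC, FLOPS_PER_TRAINING_IMAGE, PEAK_FP32_FLOPS
)
print(f"{utilization:.1%} of theoretical peak")
```

If two cards with similar peak numbers both land at a similar fraction, that suggests they are hitting the same bottleneck rather than the limits of the silicon.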

It is just that the "more normal" result you would expect is a drastic difference in performance. Note Cvikli's comment from "May 12" on that GitHub page:

* The result with RNN networks on 1 Radeon VII and 1080ti was close to the same

* Comparing convolutional performance the 4AMD and 4Nvidia, difference got really huge because of cuDNN for Nvidia cards. We can get more than 10x performance from the 1080Ti than the Radeon VII card. We find this difference in speed a little too big at image recognition cuDNN, I can't believe that this should happen and the hardware shouldn't be able to achieve the same.

Note also that in that comment, it looks like the 4xGPU system just didn't work.

It is hard to fully utilize GPU resources. Getting the same result normally means that the same bottleneck is being hit. Algorithmic differences can result in drastic performance differences (10x). And as a final point about small things making a difference, note that it is a 1080 Ti in that comment.

I fixed the installation and ran a benchmark again: https://news.ycombinator.com/item?id=21666411

The benchmark is not CPU bound, as evidenced by looking at rocm-smi, nvidia-smi (GPU usage at 100%) and htop (low CPU usage).
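
For anyone who wants to log this instead of eyeballing it, here is a small sketch that polls nvidia-smi from Python. It relies on the `--query-gpu=utilization.gpu` and `--format=csv,noheader,nounits` options of nvidia-smi; the function names are my own, and an equivalent for rocm-smi is not shown because I'm not sure of its output format:

```python
import shutil
import subprocess

def parse_utilization(csv_text):
    """Parse one utilization percentage per line (one line per GPU)."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def query_nvidia_utilization():
    """Return per-GPU utilization in percent, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        return parse_utilization(result.stdout)
    except (subprocess.CalledProcessError, OSError, ValueError):
        # Driver not loaded, tool failed, or output was not numeric.
        return None

print(query_nvidia_utilization())
```

Run it in a loop alongside the benchmark; sustained values near 100 together with low CPU load in htop point to a GPU-bound run.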

Nice. Your results would be more reproducible if you included the cuDNN version. Turing architecture support (and optimizations specific to DNN training) is still relatively recent, and there are differences in performance between versions [1]. Multiplying matrices is kinda tricky ;) [2]

[1] https://developer.nvidia.com/cudnn

[2] https://scholar.google.com/scholar?as_ylo=2019&q=nvidia+gemm

We (Lambda) are thinking about offering AMD virtual workstation instances. One concern we had was that AMD's drivers have in the past had poor stability. Did you do burn in testing as well?

> Did you do burn in testing as well?

Not sure what that means. I did this: https://news.ycombinator.com/item?id=21666411

I run the fan at 100% and even after several thousand iterations it keeps its speed at 274-275 im/sec and 68 C temperature with the stock fan and a standard desktop chassis kept open.

Cool! Hey if you want to get together and chat about amd GPUs, I'm i@<lambda url>, happy to get you a coffee or beer or something.

What's drastically different between Radeon VII and 2080 Ti? They have basically the same peak fp32 tflops.

> it does work at about 90% the speed of an NVIDIA RTX 2080 Ti, but is much cheaper, which can be a good trade-off.

Is this 90% of fp32 performance, or 90% of fp16 performance?

FP32 on both.

What card/brand are you using?

I'd have to look up the brand, someone else ordered it.

Anyway, my complaints may be partially related to me using the xenial (Ubuntu 16.04) repo on bionic (18.04), because there is no support for 18.04 LTS, even though it came out more than one and a half years ago. I also had to downgrade the kernel from 5.0 to 4.15 as 5.0 is not supported. I also had to force the installation of several 32-bit libraries on this 64-bit system, which took quite some googling and trial and error; otherwise the installation didn't succeed. Little things like that.

float16 or bfloat16?

I wasn't aware of the difference, but now I looked. I've always used float16, not bfloat16 (tf.float16 is the usual float16), and it causes a large speed boost on Turing NVIDIA cards with tensor cores as opposed to float32.

But certainly bfloat16 sounds attractive with its larger dynamic range, as float16 requires workarounds like loss scaling to avoid rounding down to zero when the exponent gets too small.
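
To make the underflow and loss-scaling point concrete, here is a stdlib-only sketch. It uses the `struct` module's half-precision format (`e`) to round values to float16, and emulates bfloat16 by keeping only the top 16 bits of a float32. Truncating like that is a simplification I'm assuming for illustration (real hardware typically rounds), and the gradient value and scale factor are made up:

```python
import struct

def to_float16(x):
    """Round x to IEEE 754 half precision and convert back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bfloat16(x):
    """Emulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

tiny_grad = 1e-8     # hypothetical gradient value
loss_scale = 1024.0  # hypothetical loss-scaling factor

# Stored directly in float16, the gradient is flushed to zero.
print(to_float16(tiny_grad))

# With loss scaling, the scaled value survives in float16 and is
# unscaled afterwards in higher precision.
recovered = to_float16(tiny_grad * loss_scale) / loss_scale
print(recovered)

# bfloat16 keeps float32's exponent range, so no scaling is needed,
# at the cost of fewer mantissa bits.
print(to_bfloat16(tiny_grad))
```

This is only about representable range; it says nothing about which cards accelerate either format.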

Seems like bfloat16 is only available for TPUs in TensorFlow[0-1].

[0] https://github.com/tensorflow/tensorflow/issues/21612

[1] https://github.com/tensorflow/tensorflow/issues/21317

>it does work at about 90% the speed of an NVIDIA RTX 2080 Ti

Minor nitpick, but it looks like the Radeon VII is in the same tier as the 1080 Ti, and actually performs worse [1] in benchmarks.

1. https://gpu.userbenchmark.com/Compare/Nvidia-GTX-1080-Ti-vs-...

The two compare differently depending on the benchmark that is used.

Do you have any benchmarks which show comparable performance? I did two quick searches just now, but I can only find video-game-based benchmarks, and in every case the 1080 Ti outperforms the VII, which is kind of surprising and IMO isn't a good sign for ML performance.

Video games are never a great comparison for raw compute performance. AMD's GCN based architectures have all been very good at compute relative to their graphics performance (recall how it was nearly impossible to find an MSRP AMD GPU during the mining boom). The Radeon VII is pretty ridiculous for compute, aided by the 1TB/s bandwidth due to its HBM2 stacks.

I just ran LambdaLabs' ResNet50 training benchmark (I could do more models, but this is it for now) on a few machines I have access to:

TensorFlow 1.14.4

    git clone https://github.com/lambdal/lambda-tensorflow-benchmark.git --recursive
    python lambda-tensorflow-benchmark/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --optimizer=sgd --model=resnet50 --num_gpus=1 --batch_size=64 --variable_update=replicated --distortions=false --num_batches=10000 --data_name=imagenet

FP32:

  - NVIDIA GTX 1080 Ti: ~215 images/sec
  - NVIDIA RTX 2080 Ti: ~300 images/sec
  - NVIDIA TITAN RTX: ~320 images/sec
  - NVIDIA Tesla V100: ~383 images/sec
  - AMD Radeon VII: ~275 images/sec

FP16:

  - NVIDIA GTX 1080 Ti: ~277 images/sec
  - NVIDIA RTX 2080 Ti: ~495 images/sec
  - NVIDIA TITAN RTX: ~518 images/sec
  - NVIDIA Tesla V100: ~725 images/sec
  - AMD Radeon VII: ~373 images/sec

So in FP32 for this ResNet50 benchmark the Radeon VII is about 9-10% slower than the 2080 Ti and about 25-30% faster than the 1080 Ti.

In FP16 it is 25% slower than the 2080 Ti and 35% faster than the 1080 Ti.

Sure, it's just one benchmark, but ResNet is quite representative of deep learning architectures in vision.
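
The percentages above can be recomputed from the listed throughputs. A quick sketch, assuming the first list is the FP32 run and the second the FP16 run (as the summary sentences suggest); since the inputs are rounded "~" figures, the output can differ by a point or so from the numbers quoted:

```python
# Images/sec as listed in this comment (rounded figures).
fp32 = {"GTX 1080 Ti": 215, "RTX 2080 Ti": 300, "TITAN RTX": 320,
        "Tesla V100": 383, "Radeon VII": 275}
fp16 = {"GTX 1080 Ti": 277, "RTX 2080 Ti": 495, "TITAN RTX": 518,
        "Tesla V100": 725, "Radeon VII": 373}

def percent_vs(results, card, baseline):
    """Signed percentage by which `card` is faster (+) or slower (-) than `baseline`."""
    return (results[card] / results[baseline] - 1.0) * 100.0

for name, results in (("FP32", fp32), ("FP16", fp16)):
    for baseline in ("RTX 2080 Ti", "GTX 1080 Ti"):
        delta = percent_vs(results, "Radeon VII", baseline)
        print(f"{name}: Radeon VII vs {baseline}: {delta:+.1f}%")
```

The same helper works for any other pair of cards in the two tables.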


Here it performs VERY well for machine learning. Look at PlaidML.

That's using OpenCL on the NVidia card ("Once hitting PlaidML for various machine learning benchmarks with OpenCL").

No one does that for machine learning because the CUDA path is much better.

I posted this because I have noticed many people aren't aware TensorFlow can run on AMD GPUs. When informed, a common response is "how well does it work?"

This GitHub issue provides answers to both those questions with respect to the latest major release of TensorFlow. (Admittedly just a single benchmark data point, but better than nothing.)

That's because it really doesn't work on AMD GPUs (at least on Linux). Sure, if you spend a lot of time setting up the ROCm driver you might get lucky, but it's alpha quality at best, and in my case (and the case of several people in this thread and the linked GitHub one) it doesn't work at all.

From skimming the thread I get the impression it's a beta quality / work-in-progress kind of deal. Is that the case?

I was following the old GH issue / thread related to OpenCL support for 1.x of TensorFlow. I tinkered and got it working with open source kernel drivers on GCN1 (HD 7970): https://github.com/tensorflow/tensorflow/issues/22

You could do it, and GCN was good at compute (it still is with the Radeon VII), but I must admit it was far more complicated than any AI researcher would have liked compared to the closed source binary approach of NVidia / CUDA.

My tf or pytorch code usually uses custom CUDA extensions, so sadly it will never be plug and play on AMD gpus for me. I hate that scientific computing and ML settled on a (admittedly superior) proprietary API.

Google is working on integrating CUDA directly into TensorFlow with the end goal being that you can use any GPU to do your computation. Basically a leapfrog over the inherently closed system that Nvidia has tried to implement.

I’m going to be glad when that finally happens. It’s not healthy for machine learning to be so dependent on a single provider like that.

> Google is working on integrating CUDA directly into TensorFlow

How is it going to untie it from Nvidia? The only way to do it, is not to use CUDA, but something portable that can target any GPU.

> Google is working on integrating CUDA directly into TensorFlow

What does this mean? And how does having CUDA in TF make it better for vendors that don't support CUDA?

Could you post a link to this work?

It's going to be oracle v Google all over again.

Oracle v Google is still working its way through the court system.

By the time Google v NVidia hits the highest courts, those will be stacked with AI overlords. I wonder what they'd think about being stuck in a proprietary API. :D

IIRC ROCm basically takes PTX bytecode and runs them on AMD cards, so this shouldn't (theoretically) be an issue.

That wasn't the case when I last looked at it.

IIRC ROCm (in the form of HIP) defines a new C/C++ API that maps to either AMD intrinsics or CUDA depending on a compile-time flag.

It required converting your CUDA source code to ROCm code, though there was a code translation tool to help you with that.

To be honest: I don't really understand what ROCm stands for. AMD has been redefining their GP compute platforms so many times that it's easy to lose track.


Yeah, I’m sure stuff like this can work without a code rewrite, but my guess is that it’s far from plug and play. I couldn’t use that to run my model on an AMD card tomorrow without some effort.

Google adopted AMD GPUs for Stadia. As I understand it, this is part of their cloud infrastructure, so it will be interesting to see if they start supporting AMD in their cloud compute engine instances. If this happens, we should see a price difference and "probably" a push from their customer base to also support AMD in their Deep Learning VM. There is a company out there called LogicalClocks which has already integrated AMD and TensorFlow.

Yes, we added support in our Resource Manager (a fork of YARN) for AMD GPUs and ROCm. It works fine with TensorFlow, and our benchmarks showed that at ROCm 2.10 the Radeon VII's performance for ResNet-50 is just a few percent lower than the 2080 Ti's. If you install our open-source platform, Hopsworks, using TensorFlow on Nvidia or AMD is exactly the same experience. We handle the complexity of installing/configuring drivers/libraries.

With NVIDIA not supporting macOS beyond the current version, and Google pushing Swift for TensorFlow, I assume both Apple and Google have an increasing incentive to improve the adoption of AMD GPUs for deep learning.

Correct me if I'm wrong but I believe one of the advantages of macOS is that the GPU can use the system ram for deep learning. Which makes it possible to use cheaper GPUs. One of the reasons I'm looking forward to better AMD Tensorflow support.

But you don't want to do that. The bandwidth between GPU and system RAM is 10x less than the bandwidth between GPU and GPU RAM, so the GPU would spend > 90% of time waiting for data. In this scenario, using CPU would most likely be faster.

> Correct me if I'm wrong but I believe one of the advantages of macOS is that the GPU can use the system ram for deep learning. Which makes it possible to use cheaper GPUs.

I've never heard of this (and I do deep learning and use Macs). Do you have a reference?

Oh really? As an outsider, that’s super interesting to know. Why can’t system RAM be used in Linux/Windows, or why is it able to be on macOS?

You can do that with OpenCL or CUDA on any platform, it's just ridiculously slow because of the much lower bandwidth so nobody does it. GPUs can access their memory (GDDR or HBM) at 300-800 GB/s, but system memory at only 16 GB/s.
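
A back-of-the-envelope version of that argument, using the bandwidth figures from this comment. The 8 GB working-set size is a made-up example, and the model assumes a purely memory-bound workload whose compute would otherwise keep pace with on-board memory:

```python
def transfer_seconds(gigabytes, bandwidth_gb_per_s):
    """Time to stream `gigabytes` of data at the given bandwidth."""
    return gigabytes / bandwidth_gb_per_s

def stalled_fraction(slow_gb_per_s, fast_gb_per_s):
    """Share of time spent waiting when data arrives at the slower rate."""
    return 1.0 - slow_gb_per_s / fast_gb_per_s

SYSTEM_RAM_GB_S = 16.0            # figure quoted in this comment
GPU_RAM_GB_S = (300.0, 800.0)     # GDDR/HBM range quoted in this comment
WORKING_SET_GB = 8.0              # hypothetical amount of data per pass

print(f"system RAM: {transfer_seconds(WORKING_SET_GB, SYSTEM_RAM_GB_S):.3f} s per pass")
for bw in GPU_RAM_GB_S:
    t = transfer_seconds(WORKING_SET_GB, bw)
    stall = stalled_fraction(SYSTEM_RAM_GB_S, bw)
    print(f"GPU RAM at {bw:.0f} GB/s: {t:.3f} s per pass, "
          f"{stall:.0%} stalled if fed from system RAM instead")
```

Under those assumptions the GPU sits idle well over 90% of the time, which matches the earlier estimate in this subthread.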

Why would anyone care about macOS for Tensorflow though? Even without CUDA, it doesn't support anything portable that targets GPUs either (i.e. OpenCL is in bit rot there, and Vulkan isn't supported to begin with). MacOS is a dead end, and Apple can blame their own extreme lock-in and NIH syndrome for it.

I don't know of a way to get any TensorFlow GPU implementation to work on macOS. I know that Keras runs with PlaidML+Metal on mac, but not TF.

`tensorflow-rocm` assumes Linux/ROCm, and the ROC (Radeon Open Compute) stuff builds Linux kernel extensions (rocm-dkms).

How do you mean Google is pushing Swift for Tensorflow? I knew the fast.ai folks were working on it, but I didn't know Google was pushing it at all.

Google is the company developing it, with Chris Lattner in charge.

And thanks to it they decided on a language that is barely workable outside Apple's eco-system, instead of more mature alternatives.

I don't have big hopes for the project.

I’m definitely not disagreeing with you, but the Linux story isn’t completely horrible mostly due to IBM’s work. Though I’ve no idea if they’ve kept pace with the macOS version and tooling, it’s been a couple years since I looked. A shame, I quite like it as a language.

IBM seems to have lost interest.

Kitura's last commit was in June 2018.

And even with all that support, it is not possible to write OS independent Swift without kludges (import darwin vs glibc, really?).

Also, many enterprises impose Windows desktops, so none of their dev divisions would even consider TensorFlow's Swift variant, regardless of how well it would be doing on GNU/Linux.

Of course many HN readers know that TF has been available for AMD for a while, but I still think that this Github thread is newsworthy, specifically this comment[1]:

> The RADEON VII's performance is crazy with tensorflow 2.0a. In our tests, we reached close to the same speed like our 2080ti(about 10-15% less)! But the Radeon VII has more memory which was a bottleneck in our case. On this price this videocard has the best value to do machine learning we think that in our company!

The comment above echoes the findings of HN user bonoboTP, see comment currently at the top of this thread[2]

[1] comment link on Github: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/...

[2] comment link on HN: https://news.ycombinator.com/item?id=21657173

Does anyone know if AMD intends to bring ROCm to Navi? I don't see any Vega successor on their roadmap, and Navi doesn't seem to be focused on GPU compute.

Phoronix states that Arcturus is the next compute acceleration chipset after Vega20, and thus Navi support may be unofficial if it does work:


Thank you. I wasn't aware of Arcturus. I have always thought that a GPU compute focus and a GPU gaming focus are at odds with each other and that the trade-offs do not make any sense. So I am glad that after so many years we have GPUs specifically for compute and GPUs for gaming.

It's being done, albeit not quickly.


I've checked the prices and it seems an XFX-branded Radeon VII with 16GB RAM can be had for a similar price to an Nvidia RTX 2070 Super with 8GB. This is very interesting.

I hope someone publishes benchmarks comparing those cards in training most frequently used network architectures. I would be heading to buy one of those Radeon cards myself if not for their lack of performance with conv networks.

Also, current Nvidia RTX cards support automatic mixed precision training in TensorFlow, which can significantly speed up training of some networks.

I think we're still quite far from having a serious competitor to nvidia in general purpose machine learning.

> I hope someone publishes benchmarks comparing those cards in training most frequently used network architectures.

I just did a haphazard, not very scientific, but ballpark-useful comparison: https://news.ycombinator.com/item?id=21666411

> Can't bloody nvidia just open source everything. Those managers should be hanged. Wasted way too much time on this.

As much as I want NVIDIA to open-source their drivers, they won't because of the same reasons the Raspberry Pi guys won't open-source their bootloader; they have to protect their IP.

To us it would be a triumphant revelation for the 'open-source' community if they open-sourced their drivers, but in reality, to them it is absolutely 'treasonous' for someone in the company to open-source them, as it essentially risks their IP being stolen or fully reverse-engineered from a 1:1 reference to the original code.

Not as simple as it sounds to just 'open-source' something like drivers does it?

The claims about their IP being stolen are not credible.

For one, AMD has an open source graphics stack and none of their IP has been stolen.

Furthermore, Nvidia actually has pretty good support for nouveau (open source driver) on the Tegra line. Nvidia's real IP has not been compromised. Nvidia had to provide good open source support because the automotive industry wants guarantees that hardware will be able to run modern kernels many years in the future.

Open sourcing/upstreaming the device drivers has business value because it means the actual devices sold will be supported by Linux until they are truly obsolete. (Think i386.)

Device drivers are not sexy, all they do is setup/shutdown hardware and provide hooks for the operating system to make system calls work. This is not core IP, kernel drivers are the bare minimum to enable hardware to be useful.

The real reason Nvidia doesn't want to open source its device drivers is that it would cost money to do so. It would likely require a total rewrite, and Nvidia's current driver presumably shares code between platforms with a shim that only Nvidia knows about. If everyone used Linux, Nvidia would probably have open source drivers for workstations (like they do for embedded).

> all they do is setup/shutdown hardware and provide hooks for the operating system

NVidia GPU driver does more than that, at least on Windows. Ever wondered why NVidia ships a new version of GPU driver at the same time an AAA game is released?

Same reason AMD does: the driver includes hacks, for either stability or performance.

AMD is playing catch up, being far away from NVidia; the strategy is to open its stack so that people like you jump on the bandwagon and do exactly what you are doing, i.e. complaining about closed nature of the GPGPU computing NVidia created and cultivated. The same complaints as with Intel on MKL, i.e. AMD ignoring that market completely and then fans suddenly waking up and complaining everywhere. As we see with new Threadripper, once they are dominant they change their attitudes quickly. The only thing we as customers can do is to vote with our wallets and strategically support whoever does more sensible stuff, playing companies against each other.

What did AMD do with Threadripper? Raise prices to 70% of Intel's after Intel lowered theirs? Sure AMD is "just another company" but they have years to go until they can afford to get lazy and use exclusion tactics.

I’m genuinely uncertain as to what you’re complaining about re. AMD

They had my sympathies as they were on the verge of bankruptcy multiple times in the past decade; it's logical CUDA and MKL passed by and they couldn't react due to a lack of funds. However all the fanboys have to realize that building CUDA or MKL takes many talented/expensive developers/other personnel and obviously companies want to keep markets they created with them. And now with TR3 basically aiming for total high end, completely bypassing existing x399 and offering no "lower core number" versions of TR3, it's just a CPU for elite, not for regular folks, which never happened with AMD before and I view it as a worrying shift.

You can kind of see how they ended up where they did with TR3. With Zen 1 they had Ryzen going up to 8 cores, Threadripper between 8 and 16 and Epyc picking up from there.

For Zen 2 they've got the AM4 Ryzen with up to 16 cores and PCIe 4.0, and Epyc now goes up to 64 but it still starts at 8. Where does that leave TR3? They got questions about whether they were even going to produce it. If a 16-core Ryzen 9 3950X is $749 and a 16-core Epyc 7302P is $825 then where would you put a 16-core TR3?

It's also possible the new I/O die pushed them to change the socket. Chopping the socket in half to create Threadripper from Epyc is one thing when the socket is already a 4-die package and you can just take two out, but what happens with the new Epyc I/O die which is designed for the whole socket, and could use it even for Threadripper? AM4 and Epyc have all the volume, reworking things like that specifically for compatibility with existing Threadripper boards may not have been worth it. Meanwhile a new socket lets them actually take advantage of the new I/O die to add more PCIe lanes etc.

It stinks that it's not compatible but it smells more like hard trade offs than malice.

Frankly, TR3 is a bit schizophrenic. So they introduced a new socket/platform for supporting up to 64 cores, yet still restricted memory to 256GB UDIMMs, instead of switching over to LRDIMM and allowing up to 4TB of RAM. And the whole platform is dead the moment DDR5 shows up (2021?). They restrict TR3 to the top end only, but limit RAM to low-to-mid-end workstation levels. It simply doesn't make any sense to go through all those steps to bypass x399 while not bringing much to the table. Also, no single TRX40 board has more than 4 PCIe slots...

> So they introduced a new socket/platform for supporting up to 64 cores, yet still restricted memory to 256GB UDIMMs, instead of switching over to LRDIMM and allowing up to 4TB of RAM.

The only reason all desktops don't support RDIMMs in addition to UDIMMs is market segmentation. The Athlon XP supported both because they were really the same die with the same memory controller as the Athlon MP. They're playing the same stupid game as Intel now but it's nothing new with TR3. They've been doing it since the Athlon 64 or so.

Though it's especially annoying because the "workstation" models (FX/TR) have higher clocks than Opteron/Epyc, so there is no way to combine the higher clocked processors with the higher memory capacity boards.

> And the whole platform is dead the moment DDR5 shows up (2021?).

That happens either way.

You're regarding this as a whole new platform they had to design just for this, but it's more like a little tweak on the existing design with the unfortunate side effect of breaking compatibility.

> Also, no single TRX40 board has more than 4 PCIe slots...

There are boards with four real x16 slots, which is 64 lanes. The other 8 are needed for the on-board devices.

I don't think you know what you're talking about, at all.

and AMD has no IP?

Genuine question: how would open sourcing help with that compatibility issue? Open source software is nice, but doesn't magically make all compatibility problems go away.

I am so excited for this. I have some AMD GPUs that I used for ethereum mining that I would LOVE to turn into an AI machine.

You can run AMD's port of Tensorflow 1.3 today if you don't want to wait for main-lined support in TF 2.0.

Just ordered a 16" MacBook Pro with a AMD Radeon Pro 5000M. Will that now work with TF 2.0? How does it compare with the Radeons mentioned here?

I’m pretty sure it won’t, at least not tensorflow-gpu. Tensorflow-cpu will though. There’s no ROCm support for MacOS and obviously no CUDA support.
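
If you want to check what a given TensorFlow install actually sees, something like the sketch below should do. It uses `tf.config.experimental.list_physical_devices`, and the helper name is mine; it returns None when TensorFlow isn't installed at all:

```python
def visible_gpus():
    """Names of GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [d.name for d in tf.config.experimental.list_physical_devices("GPU")]

gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed")
elif not gpus:
    print("TensorFlow is installed but sees no GPU (CPU only)")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

On a CPU-only build this prints the "no GPU" message regardless of what hardware is in the machine.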

I'd like to point out that bfloat16 != traditional float16

and forms of bfloat16 are not available for hardware acceleration on all GPUs yet.

Surprise. Can this support the Vega GPU in the 2019 15-inch MBP?

Yes, the Vega architecture is supported by ROCm (basically AMD's CUDA). Your best bet would be to use the pre-built Docker images though.

On macOS? Is it possible to use a GPU inside of Docker on macOS with HyperKit? Is there PCIe passthrough and/or SR-IOV support? ROCm requires Linux kernel extensions.
