
Tensorflow 2.0 AMD Support - montalbano
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/362
======
bonoboTP
We've been experimenting with TF 1.14 on a Radeon VII. Once you manage to
install it properly, it does work, at about 90% of the speed of an NVIDIA RTX
2080 Ti, but the card is much cheaper, which can be a good trade-off.

However, the tooling and installation procedure is still rough around the
edges and the documentation is lacking.

Last time I tried to upgrade the ROCm packages and the drivers, I bumped up
against an issue where one of the packages tried to write to
/data/jenkins_workspace during installation. We have /data as an NFS mount
that is not writable even for the root user, so the installation broke down.
I worked around it, but there are other issues as well. The order in which you
install packages can matter too; I had to purge packages and reinstall them a
few times. It worked with TF 1.14.1 but not with 1.14.0, if I recall
correctly. Also, the first few iterations are very slow: it seems the library
needs to find the optimal conv implementation for each kernel size separately.
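
If you benchmark, it's worth keeping that warm-up out of your timings. A
minimal sketch, assuming a TF 2.x Keras setup (the model and batch size are
placeholders for illustration, not what we actually ran):

```python
import time
import tensorflow as tf

# Placeholder model/batch; any conv net shows the autotuning effect.
model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
images = tf.random.normal([32, 224, 224, 3])
labels = tf.random.uniform([32], maxval=1000, dtype=tf.int32)

# The first steps trigger the per-kernel-size conv algorithm search
# (MIOpen find on ROCm, cuDNN autotune on CUDA); exclude them from timing.
for _ in range(5):
    model.train_on_batch(images, labels)

start = time.time()
for _ in range(20):
    model.train_on_batch(images, labels)
print("images/sec:", 20 * 32 / (time.time() - start))
```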

Also, the card has nothing like NVIDIA's tensor cores for half-precision
(float16/FP16) computation. It can do FP16, but that doesn't speed it up, and
it delivers worse results (in terms of prediction accuracy) than NVIDIA cards
running FP16. There is no prediction-accuracy difference in FP32.

And some other small issues here and there. So be warned and test it yourself
on one card before investing in big multi-GPU machines.

I do hope that AMD entering the field will benefit us through increased
competition in the long run. But at the moment ROCm seems like just a side
project for a small team at AMD, and they can't yet deliver the streamlined
experience we're used to from CUDA.

~~~
dchichkov
Interesting. What benchmark are you running? Getting within 10% on such
drastically different hardware looks wrong to me, and it sounds like the GPU
is not the bottleneck in that particular case. Could it be that the system is
CPU bound (with some horrid thread contention in TensorFlow or something)?

~~~
bonoboTP
LambdaLabs' benchmark: [https://github.com/lambdal/lambda-tensorflow-benchmark](https://github.com/lambdal/lambda-tensorflow-benchmark)

The gap is around 10-15%, and it's not CPU bound, as the benchmark uses
synthetic images with no need for data loading and preprocessing.

I can't rerun them now for precise numbers because my ROCm packages are
broken and I don't have the spare hours to fix things. Surely it can be fixed,
but it needs fiddling.

~~~
dchichkov
You can still be CPU bound even if the benchmark is using synthetic images. A
quick check would be to see whether your 2080 Ti is 100% utilized (with
nvidia-smi), and to calculate the percentage of theoretical peak FP32
throughput you are getting on both cards.
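
For example, a rough back-of-envelope check (the throughput number below is
an assumption for illustration; the per-image FLOP count and peak figures are
the usual published ballpark specs):

```python
# Fraction of theoretical FP32 peak achieved during training.
imgs_per_sec = 300.0          # assumed measured training throughput
flop_per_image = 3 * 3.9e9    # ResNet-50: ~3.9 GFLOP forward, backward ~2x that

# Advertised FP32 peaks (boost clocks), in FLOP/s.
peaks = {"Radeon VII": 13.8e12, "RTX 2080 Ti": 13.4e12}
for gpu, peak in peaks.items():
    pct = 100 * imgs_per_sec * flop_per_image / peak
    print(f"{gpu}: {pct:.0f}% of FP32 peak")
```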

It's just that the "more normal" result you would expect to get would be a
drastic difference in performance. Note Cvikli's comment of "May 12" on that
GitHub page:

* _The result with RNN networks on 1 Radeon VII and 1080ti was close to the same_

* _Comparing convolutional performance the 4AMD and 4Nvidia, difference got really huge because of cuDNN for Nvidia cards. We can get more than 10x performance from the 1080Ti than the Radeon VII card. We find this difference in speed a little too big at image recognition cuDNN, I can't believe that this should happen and the hardware shouldn't be able to achieve the same._

Note also that, in that comment, it looks like the 4xGPU system just didn't work.

It is hard to fully utilize GPU resources. Getting the same result normally
means that the same bottleneck is being hit, whereas algorithmic differences
could produce drastic (10x) performance differences. And as a final example of
small things making a difference, note that it is a 1080 Ti in that comment,
not a 2080 Ti.

~~~
bonoboTP
I fixed the installation and ran a benchmark again:
[https://news.ycombinator.com/item?id=21666411](https://news.ycombinator.com/item?id=21666411)

The benchmark is not CPU bound, as evidenced by looking at rocm-smi, nvidia-
smi (GPU usage at 100%) and htop (low CPU usage).

~~~
dchichkov
Nice. Your results would be more reproducible if you included the cuDNN
version. Turing architecture support (and optimizations specific to DNN
training) is still relatively recent, and there are performance differences
between versions [1]. Multiplying matrices is kinda tricky ;) [2]

[1] [https://developer.nvidia.com/cudnn](https://developer.nvidia.com/cudnn)

[2] [https://scholar.google.com/scholar?as_ylo=2019&q=nvidia+gemm](https://scholar.google.com/scholar?as_ylo=2019&q=nvidia+gemm)

~~~
bonoboTP
I can't edit that comment any more, but it's driver version 430, CUDA 10.1,
cuDNN 7.5.
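
For anyone who wants to query that programmatically: cudnnGetVersion is part
of cuDNN's public C API, so a minimal check via ctypes works (the soname
suffix may differ per install):

```python
import ctypes

# Load whichever libcudnn the dynamic linker finds (e.g. libcudnn.so.7).
cudnn = ctypes.CDLL("libcudnn.so")
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN", cudnn.cudnnGetVersion())  # e.g. 7500 for cuDNN 7.5.0
```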

------
montalbano
I posted this because I have noticed many people aren't aware that TensorFlow
can run on AMD GPUs. When informed, a common response is: how _well_ does it
work?

This GitHub issue provides answers to both of those questions with respect to
the latest major release of TensorFlow. (Admittedly just a single benchmark
data point, but better than nothing.)

~~~
edgyquant
That's because it really doesn't work on AMD GPUs (at least on Linux). Sure,
if you spend a lot of time setting up the ROCm driver you might get lucky, but
it's alpha quality at best, and in my case (and the case of several people in
this thread and the linked GitHub one) it doesn't work at all.

------
jefft255
My TF or PyTorch code usually uses custom CUDA extensions, so sadly it will
never be plug-and-play on AMD GPUs for me. I hate that scientific computing
and ML settled on an (admittedly superior) proprietary API.

~~~
partingshots
Google is working on integrating CUDA support directly into TensorFlow, with
the end goal that you can use any GPU to do your computation. Basically a
leapfrog over the inherently closed system that Nvidia has tried to implement.

I’m going to be glad when that finally happens. It’s not healthy for machine
learning to be so dependent on a single vendor like that.

~~~
minxomat
It's going to be Oracle v. Google all over again.

~~~
monocasa
Oracle v. Google is still working its way through the court system.

~~~
black_puppydog
By the time Google v. Nvidia hits the highest courts, those will be stacked
with AI overlords. I wonder what they'd think about being stuck with a
proprietary API. :D

------
spicyramen
Google chose AMD GPUs for Stadia, and as I understand it this is part of their
cloud infrastructure. It will be interesting to see if they start supporting
AMD in their Compute Engine instances. If that happens, we should see a price
difference and "probably" a push from their customer base to also support AMD
in their Deep Learning VM images. There is also a company out there called
LogicalClocks which has already integrated AMD and TensorFlow.

~~~
jamesblonde
Yes, we added support for AMD GPUs and ROCm in our resource manager (a fork of
YARN). It works fine with TensorFlow, and our benchmarks show that as of ROCm
2.10, the Radeon VII's performance on ResNet-50 is just a few percent lower
than the 2080 Ti's. If you install our open-source platform, Hopsworks, using
TensorFlow on Nvidia or AMD is exactly the same experience; we handle the
complexity of installing and configuring the drivers/libraries.

------
mastazi
Of course many HN readers know that TF has been available for AMD for a while,
but I still think that this GitHub thread is newsworthy, specifically this
comment [1]:

> The RADEON VII's performance is crazy with tensorflow 2.0a. In our tests, we
> reached close to the same speed like our 2080ti(about 10-15% less)! But the
> Radeon VII has more memory which was a bottleneck in our case. On this price
> this videocard has the best value to do machine learning we think that in
> our company!

The comment above echoes the findings of HN user bonoboTP; see the comment
currently at the top of this thread [2].

[1] comment link on GitHub:
[https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/...](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/362#issuecomment-479366031)

[2] comment link on HN:
[https://news.ycombinator.com/item?id=21657173](https://news.ycombinator.com/item?id=21657173)

------
zapnuk
With Nvidia not supporting macOS beyond the current version, and Google
pushing Swift for TensorFlow, I assume both Apple and Google have an
increasing incentive to improve the adoption of AMD GPUs for deep learning.

~~~
Vomzor
Correct me if I'm wrong, but I believe one of the advantages of macOS is that
the GPU can use system RAM for deep learning, which makes it possible to use
cheaper GPUs. That's one of the reasons I'm looking forward to better AMD
TensorFlow support.

~~~
girvo
Oh really? As an outsider, that’s super interesting to know. Why can’t system
RAM be used on Linux/Windows, and why can it be on macOS?

~~~
rrss
You can do that with OpenCL or CUDA on any platform; it's just ridiculously
slow because of the much lower bandwidth, so nobody does it. GPUs can access
their own memory (GDDR or HBM) at 300-800 GB/s, but system memory only at
about 16 GB/s (roughly the limit of a PCIe 3.0 x16 link).
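
To put the gap in perspective, a quick back-of-envelope (the HBM2/GDDR6
figures are the advertised specs for the cards discussed here; ~16 GB/s is
roughly PCIe 3.0 x16):

```python
# Time to stream 1 GiB of tensor data at each bandwidth.
GIB = 2**30
for name, bytes_per_sec in [("HBM2 (Radeon VII)", 1.0e12),
                            ("GDDR6 (RTX 2080 Ti)", 616e9),
                            ("System RAM over PCIe 3.0 x16", 16e9)]:
    print(f"{name}: {GIB / bytes_per_sec * 1e3:7.1f} ms per GiB")
```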

------
ksec
Does anyone know if AMD intends to bring ROCm to Navi? I don't see any Vega
successor on their roadmap, and Navi doesn't seem to be focused on GPU
compute.

~~~
rarecoil
Phoronix states that Arcturus is the next compute acceleration chip after
Vega 20, so Navi support may be unofficial if it works at all:

[https://www.phoronix.com/scan.php?page=news_item&px=Radeon-R...](https://www.phoronix.com/scan.php?page=news_item&px=Radeon-ROCm-2.9-Released)

~~~
ksec
Thank you, I wasn't aware of Arcturus. I had always thought that a GPU-compute
focus and a gaming focus were at odds with each other and that the trade-offs
didn't make any sense. So I am glad that after so many years we are getting
GPUs specifically for compute and GPUs for gaming.

------
Roark66
I've checked the prices, and it seems an XFX-branded Radeon VII with 16GB of
RAM can be had for a similar price to an Nvidia RTX 2070 Super with 8GB. This
is very interesting.

I hope someone publishes benchmarks comparing those cards on training the most
frequently used network architectures. I would be heading out to buy one of
those Radeon cards myself if not for their lack of performance with conv
networks.

Also, the current version of TensorFlow on Nvidia RTX cards supports automatic
mixed-precision training, which can significantly speed up training of some
networks.
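
For reference, a minimal sketch of turning it on through Keras (the
mixed-precision API has moved between tf.keras.mixed_precision.experimental
and tf.keras.mixed_precision across TF releases; this uses the newer
spelling):

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compute in float16 where safe, keep variables in float32.
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation="relu", input_shape=(784,)),
    # Keep the final layer in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```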

I think we're still quite far from having a serious competitor to Nvidia in
general-purpose machine learning.

~~~
bonoboTP
> I hope someone publishes benchmarks comparing those cards in training most
> frequently used network architectures.

I just did a haphazard, not very scientific, but ballpark-useful comparison:
[https://news.ycombinator.com/item?id=21666411](https://news.ycombinator.com/item?id=21666411)

------
dnpp123
Finally.....

Me, yesterday :
[https://github.com/tensorflow/tensorflow/issues/30638#issuec...](https://github.com/tensorflow/tensorflow/issues/30638#issuecomment-559201634)

~~~
rvz
> Can't bloody nvidia just open source everything. Those managers should be
> hanged. Wasted way too much time on this.

As much as I want NVIDIA to open-source their drivers, they won't, for the
same reasons the Raspberry Pi guys won't open-source their bootloader: they
have to protect their IP.

To us in the 'open-source' community it would be a triumphant revelation if
they open-sourced their drivers, but to them it would be absolutely
'treasonous' for someone in the company to do so, as it essentially risks
their IP being stolen or fully reverse-engineered from a 1:1 reference to the
original code.

Not as simple as it sounds to just 'open-source' something like drivers, is
it?

~~~
bgorman
The claims about their IP being stolen are not credible.

For one, AMD has an open source graphics stack and none of their IP has been
stolen.

Furthermore, Nvidia actually has pretty good support for nouveau (the open
source driver) on the Tegra line, and Nvidia's real IP has not been
compromised there. Nvidia had to provide good open source support because the
automotive industry wants guarantees that the hardware will be able to run
modern kernels many years into the future.

Open sourcing/upstreaming the device drivers has business value because it
means the actual devices sold will be supported by Linux until they are truly
obsolete (think i386).

Device drivers are not sexy; all they do is set up/shut down hardware and
provide hooks for the operating system to make system calls work. This is not
core IP; kernel drivers are the bare minimum needed to make hardware useful.

The real reason Nvidia doesn't want to open source its device drivers is that
it would cost money to do so. It would likely require a total rewrite, as
Nvidia's current driver presumably shares code between platforms via a shim
that only Nvidia knows about. If everyone used Linux, Nvidia would probably
have open source drivers for workstations (like they do for embedded).

~~~
bitL
AMD is playing catch-up, being far behind NVidia; the strategy is to open its
stack so that people like you jump on the bandwagon and do exactly what you
are doing, i.e. complaining about the closed nature of the GPGPU computing
NVidia created and cultivated. It's the same as the complaints about Intel and
MKL: AMD ignored that market completely, and then the fans suddenly woke up
and started complaining everywhere. As we see with the new Threadripper, once
they are dominant they change their attitude quickly. The only thing we as
customers can do is vote with our wallets and strategically support whoever
does the more sensible thing, playing the companies against each other.

~~~
girvo
I’m genuinely uncertain as to what you’re complaining about re: AMD.

~~~
bitL
They had my sympathies, as they were on the verge of bankruptcy multiple times
in the past decade; it's logical that CUDA and MKL passed them by, since they
couldn't react due to a lack of funds. However, all the fanboys have to
realize that building CUDA or MKL takes many talented/expensive developers and
other personnel, and obviously companies want to keep the markets they created
with them. And now, with TR3 basically aiming at the total high end,
completely bypassing the existing X399 platform and offering no "lower core
count" versions, it's just a CPU for the elite, not for regular folks, which
never happened with AMD before; I view it as a worrying shift.

~~~
zrm
You can kind of see how they ended up where they did with TR3. With Zen 1 they
had Ryzen going up to 8 cores, Threadripper between 8 and 16 and Epyc picking
up from there.

For Zen 2 they've got AM4 Ryzen with up to 16 cores and PCIe 4.0, and Epyc now
goes up to 64 cores but still starts at 8. Where does that leave TR3? They got
questions about whether they were even going to produce it. If a 16-core Ryzen
9 3950X is $749 and a 16-core Epyc 7302P is $825, then where would you put a
16-core TR3?

It's also possible the new I/O die pushed them to change the socket. Chopping
the socket in half to create Threadripper from Epyc is one thing when the
package already holds four dies and you can just take two out, but what
happens with the new Epyc I/O die, which is designed for the whole socket?
Could they use it even for Threadripper? AM4 and Epyc have all the volume;
reworking things like that specifically for compatibility with existing
Threadripper boards may not have been worth it. Meanwhile, a new socket lets
them actually take advantage of the new I/O die to add more PCIe lanes, etc.

It stinks that it's not compatible, but it smells more like hard trade-offs
than malice.

~~~
bitL
Frankly, TR3 is a bit schizophrenic. They introduced a new socket/platform to
support up to 64 cores, yet still restricted memory to 256GB of UDIMMs instead
of switching over to LRDIMMs and allowing up to 4TB of RAM. And the whole
platform is dead the moment DDR5 shows up (2021?). So they restrict TR3 to the
top end only, but limit its RAM to low-to-mid-end workstation levels. It
simply doesn't make any sense to go through all those steps to bypass X399
while bringing so little to the table. Also, not a single TRX40 board has more
than 4 PCIe slots...

~~~
zrm
> So they introduced a new socket/platform for supporting up to 64 cores, yet
> still restricted memory to 256GB UDIMMs, instead of switching over to LRDIMM
> and allowing up to 4TB of RAM.

The only reason all desktops don't support RDIMMs in addition to UDIMMs is
market segmentation. The Athlon XP supported both because it was really the
same die, with the same memory controller, as the Athlon MP. They're playing
the same stupid game as Intel now, but it's nothing new with TR3; they've been
doing it since the Athlon 64 or so.

Though it's especially annoying because the "workstation" models (FX/TR) have
higher clocks than Opteron/Epyc, so there is no way to combine the
higher-clocked processors with the higher-memory-capacity boards.

> And the whole platform is dead the moment DDR5 shows up (2021?).

That happens either way.

You're regarding this as a whole new platform they had to design just for
this, but it's more like a small tweak to the existing design, with the
unfortunate side effect of breaking compatibility.

> Also, no single TRX40 board has more than 4 PCIe slots...

There are boards with four real x16 slots, which is 64 lanes. The other 8 are
needed for the on-board devices.

------
ralphc
Just ordered a 16" MacBook Pro with an AMD Radeon Pro 5000M. Will that now
work with TF 2.0? How does it compare with the Radeons mentioned here?

~~~
DreGPU
I’m pretty sure it won’t, at least not tensorflow-gpu; tensorflow-cpu will,
though. There’s no ROCm support for macOS, and obviously no CUDA support.

------
rjkennedy98
I am so excited for this. I have some AMD GPUs that I used for ethereum mining
that I would LOVE to turn into an AI machine.

~~~
sangnoir
You can run AMD's port of TensorFlow 1.3 _today_ if you don't want to wait for
mainlined support in TF 2.0.
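
If memory serves, the port ships as a pip wheel called tensorflow-rocm; after
installing it, a quick sanity check (TF 1.x API) might look like:

```python
# pip install tensorflow-rocm
import tensorflow as tf
from tensorflow.python.client import device_lib

# On a working ROCm setup this should report True and list the AMD GPU.
print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])
```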

------
vkaku
I'd like to point out that bfloat16 != traditional float16,

and forms of bfloat16 are not yet available for hardware acceleration on all
GPUs.
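
For the curious: bfloat16 keeps float32's 8-bit exponent but only a 7-bit
mantissa, while float16 has a 5-bit exponent and a 10-bit mantissa. A small
illustration (assuming a TF build with bfloat16 support):

```python
import tensorflow as tf

x = tf.constant(1.0 + 2**-10)   # exactly representable in float16
print(tf.cast(x, tf.float16))   # ~1.001: the 10-bit mantissa keeps it
print(tf.cast(x, tf.bfloat16))  # 1.0: the 7-bit mantissa rounds it away

big = tf.constant(70000.0)      # above float16's max finite value (~65504)
print(tf.cast(big, tf.float16))   # inf: 5-bit exponent overflows
print(tf.cast(big, tf.bfloat16))  # ~70000: 8-bit exponent, like float32
```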

------
ngcc_hk
Surprised. Can this support the MBP 15 2019 Vega?

~~~
yold__
Yes, the Vega architecture is supported by ROCm (basically AMD's CUDA). Your
best bet would be to use the pre-built Docker images, though.

~~~
rarecoil
On macOS? Is it possible to use a GPU inside Docker on macOS with Hyperkit? Is
there PCIe passthrough and/or SR-IOV support? ROCm requires Linux kernel
extensions.

