However, the tooling and the installation procedure are still rough around the edges, and the documentation is lacking.
Last time I tried to upgrade the ROCm packages and the drivers, I bumped up against an issue where one of the packages tried to write to /data/jenkins_workspace during installation. We have /data as an NFS mount that isn't writable even by root, so the installation broke down. I worked around it, but there are other issues as well. The order in which you install packages can matter too; I had to purge packages and reinstall them a few times. It worked with TF 1.14.1 but not with 1.14.0, if I recall correctly. The first few iterations are very slow; it seems it needs to find the optimal conv implementation for each kernel size separately.
Also, the card doesn't have anything like NVIDIA's tensor cores for half-precision (float16/FP16) computation. It can do FP16, but it doesn't get a speedup from it, and it delivers worse results (in terms of prediction accuracy) than NVIDIA cards running FP16. There is no prediction accuracy difference in FP32.
And some other small issues here and there. So be warned and test it yourself on one card before investing in big multi-GPU machines.
That said, I do hope AMD entering the field will benefit us through increased competition in the long run. But at the moment ROCm seems like just a side project for a small team at AMD, and they can't yet deliver the streamlined experience we're used to from CUDA.
It's certainly around 10-15% and it's not CPU bound, as the benchmark uses synthetic images without the need for data loading and preprocessing.
I cannot rerun them now for precise numbers because my ROCm packages are broken and I don't have the spare hours to fix things. Surely it can be fixed but needs fiddling.
It's just that the "more normal" result you would expect to get would be a drastic difference in performance. Note Cvikli's comment from "May 12" on that GitHub page:
* The results with RNN networks on one Radeon VII and one 1080 Ti were close to the same.
* Comparing convolutional performance between the 4x AMD and 4x Nvidia setups, the difference got really huge because of cuDNN on the Nvidia cards. We can get more than 10x the performance from the 1080 Ti compared to the Radeon VII. We find this speed difference for image recognition with cuDNN a little too big; I can't believe this should happen and that the hardware shouldn't be able to achieve the same.
Note also that in that comment it looks like the 4x GPU system just didn't work.
It is hard to fully utilize GPU resources. Getting the same result normally means that the same bottleneck is being hit. Algorithmic differences could result in drastic (10x) performance differences. And as a final point about small things making a difference, note that it is a 1080 Ti in that comment.
The benchmark is not CPU bound, as evidenced by looking at rocm-smi, nvidia-smi (GPU usage at 100%) and htop (low CPU usage).
https://developer.nvidia.com/cudnn
Not sure what that means. I did this: https://news.ycombinator.com/item?id=21666411
I run the stock fan at 100%, and even after several thousand iterations it keeps its speed at 274-275 im/sec and 68 °C, in a standard desktop chassis kept open.
Is this 90% of FP32 performance, or 90% of FP16 performance?
Anyway, my complaints may be partially related to me using the xenial (Ubuntu 16.04) repo on bionic (18.04), because there is no support for 18.04 LTS even though it came out more than a year and a half ago. I also had to downgrade the kernel from 5.0 to 4.15, as 5.0 is not supported. I also had to force the installation of several 32-bit libraries on this 64-bit system, which took quite some googling and trial and error; otherwise the installation didn't succeed. Little things like that.
But certainly bfloat16 sounds attractive with its larger dynamic range, as float16 requires workarounds like loss scaling to avoid rounding down to zero when the exponent gets too small.
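For reference, this is roughly what the loss-scaling workaround looks like in TF 2.x Keras (a minimal sketch; the toy model and optimizer are made up, and it assumes a TF version where tf.keras.mixed_precision.set_global_policy exists, i.e. 2.4+):

import tensorflow as tf

# Compute in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Toy model, just a placeholder to show where the pieces plug in.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])

# Loss scaling: gradients get multiplied by a (dynamically adjusted) factor
# before backprop and divided back out before the weight update, so small
# gradients don't underflow to zero in float16. bfloat16 has the same
# exponent range as float32, which is why it can skip this workaround.
opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD(0.01))

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)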
Minor nitpick, but it looks like the Radeon VII is in the same tier as the 1080 Ti, and actually performs worse in benchmarks.
git clone https://github.com/lambdal/lambda-tensorflow-benchmark.git --recursive
python lambda-tensorflow-benchmark/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --optimizer=sgd --model=resnet50 --num_gpus=1 --batch_size=64 --variable_update=replicated --distortions=false --num_batches=10000 --data_name=imagenet
FP32:
- NVIDIA GTX 1080 Ti: ~215 images/sec
- NVIDIA RTX 2080 Ti: ~300 images/sec
- NVIDIA TITAN RTX: ~320 images/sec
- NVIDIA Tesla V100: ~383 images/sec
- AMD Radeon VII: ~275 images/sec
FP16:
- NVIDIA GTX 1080 Ti: ~277 images/sec
- NVIDIA RTX 2080 Ti: ~495 images/sec
- NVIDIA TITAN RTX: ~518 images/sec
- NVIDIA Tesla V100: ~725 images/sec
- AMD Radeon VII: ~373 images/sec
In FP16 it is 25% slower than the 2080 Ti and 35% faster than the 1080 Ti.
Sure, it's just one benchmark, but ResNet is quite representative of deep learning architectures in vision.
Here it performs VERY well for machine learning. Look at PlaidML.
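If anyone is curious what that looks like in practice, PlaidML plugs in as a Keras backend, so any OpenCL/Metal GPU gets used without touching the model code. A rough sketch (the toy layer sizes are made up):

# pip install plaidml-keras, then run plaidml-setup once to pick the device
import plaidml.keras
plaidml.keras.install_backend()  # must be called before importing keras

import keras
from keras import layers

# From here on it's ordinary Keras code, with PlaidML doing the GPU work.
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")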
No one does that for machine learning because the CUDA path is much better.
This GitHub issue provides answers to both those questions with respect to the latest major release of TensorFlow. (Admittedly just a single benchmark data point, but better than nothing.)
You could do it, and GCN was good at compute (still is with the Radeon VII), but I must admit it was far more complicated than any AI researcher would have liked compared to the closed-source binary approach of NVidia/CUDA.
I’m going to be glad when that finally happens. It’s not healthy for machine learning to be so dependent on a single provider like that.
How is it going to untie it from Nvidia? The only way to do that is not to use CUDA but something portable that can target any GPU.
What does this mean? And how does having CUDA in TF make it better for vendors that don't support CUDA?
Could you post a link to this work?
IIRC ROCm (in the form of HIP) defines a new C/C++ API that maps to either AMD intrinsics or CUDA depending on a compile-time flag.
It required converting your CUDA source code to ROCm code, though there was a code translation tool to help you with that.
To be honest: I don't really understand what ROCm stands for. AMD has been redefining their GP compute platforms so many times that it's easy to lose track.
> The Radeon VII's performance is crazy with TensorFlow 2.0a. In our tests we reached close to the same speed as our 2080 Ti (about 10-15% less)! But the Radeon VII has more memory, which was a bottleneck in our case. At this price we think this card has the best value for machine learning at our company!
The comment above echoes the findings of HN user bonoboTP; see the comment currently at the top of this thread:
Comment link on GitHub: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/...
Comment link on HN: https://news.ycombinator.com/item?id=21657173
I've never heard of this (and I do deep learning and use Macs). Do you have a reference?
`tensorflow-rocm` assumes Linux/ROCm, and the ROC (Radeon Open Compute) stuff builds Linux kernel extensions (rocm-dkms).
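For what it's worth, on a Linux box with ROCm set up, the sanity check is the same as with the CUDA build (a sketch, assuming a reasonably recent tensorflow-rocm):

# pip install tensorflow-rocm   (needs the ROCm userspace and rocm-dkms modules)
import tensorflow as tf

# The Radeon card shows up as an ordinary "GPU" device, so the usual
# TensorFlow device queries and device placement work unchanged.
print(tf.config.list_physical_devices("GPU"))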
I don't have big hopes for the project.
Kitura's last commit was in June 2018.
And even with all that support, it is not possible to write OS-independent Swift without kludges (import Darwin vs. Glibc, really?).
Also, many enterprises impose Windows desktops, so none of their dev divisions would even consider TensorFlow's Swift variant, regardless of how well it might do on GNU/Linux.
I hope someone publishes benchmarks comparing those cards on training the most frequently used network architectures. I would be heading out to buy one of those Radeon cards myself if not for their lack of performance with conv networks.
Also, the current version of TensorFlow on Nvidia RTX cards supports automatic mixed-precision training, which can significantly speed up training of some networks.
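If I remember right, in TF 1.14 that's exposed as a one-line wrapper around the optimizer; a rough sketch (graph mode, with a placeholder optimizer):

import tensorflow as tf  # TF 1.x

# Any ordinary optimizer for an existing training graph.
opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

# The graph rewrite casts eligible ops to float16 (Tensor Cores on
# Volta/Turing) and handles dynamic loss scaling automatically.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

# ...then build the graph and call opt.minimize(loss) as usual.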
I think we're still quite far from having a serious competitor to nvidia in general purpose machine learning.
I just did a haphazard, not very scientific, but ballpark-useful comparison: https://news.ycombinator.com/item?id=21666411
Me, yesterday : https://github.com/tensorflow/tensorflow/issues/30638#issuec...
As much as I want NVIDIA to open-source their drivers, they won't, for the same reasons the Raspberry Pi guys won't open-source their bootloader: they have to protect their IP.
To us in the open-source community it would be a triumphant revelation if they open-sourced their drivers, but in reality, to them it would be absolutely 'treasonous' for someone in the company to do so, as it essentially risks their IP being stolen or fully reverse-engineered from a 1:1 reference to the original code.
Not as simple as it sounds to just 'open-source' something like drivers, is it?
For one, AMD has an open source graphics stack and none of their IP has been stolen.
Furthermore, Nvidia actually has pretty good support for nouveau (open source driver) on the Tegra line. Nvidia's real IP has not been compromised. Nvidia had to provide good open source support because the automotive industry wants guarantees that hardware will be able to run modern kernels many years in the future.
Open-sourcing/upstreaming the device drivers has business value because it means the actual devices sold will be supported by Linux until they are truly obsolete. (Think i386.)
Device drivers are not sexy; all they do is set up/shut down hardware and provide hooks for the operating system to make system calls work. This is not core IP; kernel drivers are the bare minimum to make hardware useful.
The real reason Nvidia doesn't want to open-source its device drivers is that it would cost money to do so. It would likely require a total rewrite, and Nvidia's current driver presumably shares code between platforms with a shim that only Nvidia knows about. If everyone used Linux, Nvidia would probably have open-source drivers for workstations (like they do for embedded).
The NVidia GPU driver does more than that, at least on Windows. Ever wondered why NVidia ships a new driver version at the same time an AAA game is released?
For Zen 2 they've got AM4 Ryzen with up to 16 cores and PCIe 4.0, and Epyc now goes up to 64 cores but still starts at 8. Where does that leave TR3? They got questions about whether they were even going to produce it. If a 16-core Ryzen 9 3950X is $749 and a 16-core Epyc 7302P is $825, where would you put a 16-core TR3?
It's also possible the new I/O die pushed them to change the socket. Chopping the socket in half to create Threadripper from Epyc is one thing when the socket is already a 4-die package and you can just take two out, but what happens with the new Epyc I/O die, which is designed for the whole socket and which they could use even for Threadripper? AM4 and Epyc have all the volume; reworking things like that specifically for compatibility with existing Threadripper boards may not have been worth it. Meanwhile, a new socket lets them actually take advantage of the new I/O die to add more PCIe lanes, etc.
It stinks that it's not compatible but it smells more like hard trade offs than malice.
The only reason all desktops don't support RDIMMs in addition to UDIMMs is market segmentation. The Athlon XP supported both because they were really the same die with the same memory controller as the Athlon MP. They're playing the same stupid game as Intel now but it's nothing new with TR3. They've been doing it since the Athlon 64 or so.
Though it's especially annoying because the "workstation" models (FX/TR) have higher clocks than Opteron/Epyc, so there is no way to combine the higher clocked processors with the higher memory capacity boards.
> And the whole platform is dead the moment DDR5 shows up (2021?).
That happens either way.
You're regarding this as a whole new platform they had to design just for this, but it's more like a little tweak on the existing design with the unfortunate side effect of breaking compatibility.
> Also, no single TRX40 board has more than 4 PCIe slots...
There are boards with four real x16 slots, which is 64 lanes. The other 8 are needed for the on-board devices.
Also, bfloat16 is not available for hardware acceleration on all GPUs yet.