I recently got an eGPU for my MacBook Pro for playing games in whatever little off-time I have, on Windows. I also wanted to use it on the Mac, though, and Macs only support AMD graphics, so I got a Vega 64. The other half of the reason was that I wanted to play with deep learning.
Turns out TensorFlow just does not work on AMD. There is a fork maintained by AMD, but they only support it on Linux, and the code in question reaches deep enough into the hardware layer that it cannot be run in a VM. It's also always at least a few versions behind.
With NVIDIA GPUs more expensive and worse-performing per dollar than they have ever been, TF2 could have been the moment to make a major improvement. Sad to see that's not the case.
Here's the thing: in reality, it doesn't matter if a Radeon VII or Vega 64 has better theoretical TFLOPS than a 1080 Ti or whatever if the card runs 50W hotter(!!!) and the software is bad enough that it runs at half the performance of the competitor. 50W hotter at a 10% perf loss and a 40% decrease in price is attractive. Hotter at half the perf is a non-starter (almost entirely ignoring price factors). Free drivers do not make up for this. DL is a fast enough field that, for most practitioners, your time is better spent as an enthusiast/participant by just paying the upcharge on an NVidia card and getting on with your life, so you can actually train models and do your work.
There is only one viable option if you want to "do your work" today -- not for the better, and almost certainly for the worse. Ceterum censeo.
DL moves fast enough that new frameworks can always come around, so perhaps a competitor designed for multiple vendors from the ground up can break through. But then you have to ask whether many of these frameworks can even succeed/tread water in such an environment without Google/Facebook propping them up, which is a separate problem... And before that, AMD has to make sure HIP/HCC work reliably, every time, everywhere, and that they have the libraries to back it up. I'm hoping they succeed at this.
NVidia made the decision to hop on the Clang bandwagon years ago, and that has clearly allowed them to optimize their code and move things forward more easily. AMD has finally hopped on Clang with ROCm (though unfortunately there are still some kinks to work out, especially with OpenCL)... and the compiler output seems to be superior as a result.
It seems unlikely you'll get a good value proposition out of AMD if you're going for deep learning. NVidia's libraries are simply superior at the moment, and they get superior performance as a result.
But when you look at lower-level software that is written for both GPUs, AMD's compute girth really shows. Cryptocoins are perhaps the best example, although not every workload looks like mining.
Luxray and Blender are more pragmatic: software raytracers which demonstrate AMD's incredible GPU girth and HBM2 memory performance. The AMD Vega chips can outperform NVidia's equivalents in these cases.
All in all, AMD's ROCm environment seems like it is ready for programmers who are willing to stick to the C (or HCC) level, coding their own libraries and maybe using a bit of inline assembly here or there to maximize performance.
CUDA has more libraries (BLAS for linear algebra, better TensorFlow support for deep learning, etc.), and that's definitely a major selling point, especially because of how well those libraries are written.
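To see why those libraries matter so much, here's a minimal sketch (plain Python, names mine): a dense neural-net layer is essentially one GEMM (matrix multiply), so the framework's real speed is mostly the speed of the vendor's tuned BLAS kernel underneath it.

```python
# Naive GEMM: the exact operation that a tuned vendor BLAS library replaces
# with a heavily optimized GPU kernel. Frameworks spend most of their time here.
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), returning m x n."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

# A 1x2 activation through a 2x3 weight matrix: one dense layer, one GEMM call.
x = [[1.0, 2.0]]
w = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
print(matmul(x, w))  # [[1.0, 2.0, 3.0]]
```

Swap this loop for a well-written GPU kernel and the whole layer speeds up; ship a mediocre one and no amount of framework polish saves you.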
I guess what I'm trying to say is... AMD's hardware is fine. It's their software that's clearly holding them back.
Coincidentally, I was just digging through the TensorFlow code to find where it delegates to CUDA -- an enormous task given that it's a huge volume of dependencies many layers deep -- because ultimately most neural nets boil down to a significant amount of simple math. It seems, and I say this in the nonsensical, overly confident way we all do when looking on from the outside, that it should be trivial to make it work with OpenCL, or even just to add a branch for AMD.
...but incredibly time-consuming. Developing bug-free, high-performance op implementations takes time and elbow grease. It's worth comparing the gradual progress of TF Lite (inference on ARM devices) and the TPU architectures: rolling out a full set of ops for a new platform is a multi-year project. When it's only partially done, you get inconsistent support and weird performance bottlenecks; maybe a minimal MNIST model is fast, but many other things pass back to the CPU for lots of intermediate ops, killing performance.
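A toy sketch of that failure mode (all names hypothetical, not TensorFlow's actual dispatch code): a framework places each op on a backend if a kernel exists, otherwise falls back to the CPU -- and every fallback implies a device-to-host transfer, which is what quietly kills performance on a half-ported backend.

```python
# Hypothetical op sets for a fully-supported CPU backend and a
# partially-ported GPU backend (a new platform mid-rollout).
CPU_KERNELS = {"conv2d", "relu", "matmul", "add", "softmax"}
GPU_KERNELS = {"matmul", "add"}  # only some ops have GPU kernels so far

def dispatch(op, backend_kernels):
    """Return (device, needs_transfer) for one op."""
    if op in backend_kernels:
        return "gpu", False
    # Fallback: copy tensors to the host, run on CPU, copy the result back.
    return "cpu", True

graph = ["conv2d", "relu", "matmul", "add", "softmax"]
placements = [dispatch(op, GPU_KERNELS) for op in graph]
transfers = sum(1 for _, needs in placements if needs)
print(placements)
print(f"{transfers} of {len(graph)} ops bounce back to the CPU")
```

Three of the five ops in this tiny graph round-trip through host memory, so the "GPU-accelerated" model spends most of its time copying tensors -- which is why partial op coverage shows up as weird, workload-dependent bottlenecks rather than a uniform slowdown.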
AMD can't even write a comparable library unless it's built on top of ROCm / HCC. Like, how would you even start implementing TensorFlow 2.0.0 for AMD graphics cards otherwise?
OpenCL was theoretically an option. But the old AMDGPU-pro driver had the OpenCL compiler as part of the driver stack, which meant OpenCL compiler issues would appear and disappear as you updated your graphics driver. Uhhhh... not good for deployment, to say the least. Lots of OpenCL code is littered with comments like "This causes the compiler to enter an infinite loop on AMD Driver 12.5.whatever".
There are serious issues with OpenCL's fundamental design. You really shouldn't have a full OpenCL compiler hidden inside your device drivers. It's a nightmare to test and deploy, especially in a world of changing device drivers.
ROCm is a proper compiler stack and a proper development environment. If anything, AMD's mistake was taking so long to make ROCm usable. Even today there are some issues (it doesn't work with Blender 2.79's OpenCL code yet), but at least it's a proper development environment that you can build libraries on top of.
AMD needs to get its compute stack in order if it wants to be taken seriously. Fortunately, it seems like they understand the issue, and they're ramping up ROCm development, so hopefully the next year or two will bring improvement.
There are a variety of reasons for this -- anticompetitive behavior from those competitors, chip designs like Bulldozer that focused on many wide execution units at the expense of single-threaded performance, etc. -- but what it comes down to is that they just don't have a comparable amount of resources dedicated to ROCm compared to what nVidia has invested in CUDA.
I'm quite impressed with their current and near-future designs in the CPU space, but GPU design is hard and beginning to hit a performance wall -- witness the lackluster reaction to nVidia's current RTX chips in many areas.
However, the issue with NVIDIA is that they have adopted the antics of the game-development space they cooperate with, and they have a poor attitude towards their customers as a result -- going so far as to outright say that their customers are stuck, so the pricing doesn't matter: you'll just pay for it and get on with your life anyway. One major example is banning their 'gaming' cards from data centres via an EULA, because they couldn't sell their 10x-priced 'enterprise' cards when the two performed within 10% of each other.
Another example is RTX, again. They've stopped producing their 10xx-series cards to sell more RTXes at a higher margin, because, boy, it turns out people still want and prefer the older cards over the RTX ones, considering the price bloat NVIDIA has saddled RTX with.
Can you explain this more? All the tests of RTX in actual games have been very positive and showed significant improvements in visual quality.
RTX is clearly a compute hog: it's barely usable even if you pay for the best-of-the-best GPUs. I mean, not to knock NVidia down or anything; raytracing is one of the most computationally difficult problems in existence right now.
But from a practical, pragmatic perspective, you suffer a major loss in frame rates and have to drop the resolution to make the jump to raytracing. Even if you spent $1200 on the card...
I'm personally excited to see offline renderers use the RTX features to accelerate offline raytracing; that's probably the more important use of the technology. As it is, RTX isn't quite fast enough for "real time" yet -- just grossly accelerated offline raytracing (which is still impressive).
Also last I read, RTX really isn't designed for offline raytracing and doesn't really bring much to the table. Its use is in realtime.
Battlefield V at Ultra 1080p with RTX on averages ~63 FPS (average, not minimum FPS), which means it will regularly dip below 60 in practice.
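The arithmetic behind that claim, with made-up but plausible frame times: an *average* near 63 FPS still hides frame-time spikes that drop individual frames well under 60.

```python
# Hypothetical frame-time trace: 90% of frames at a smooth 15 ms,
# 10% spiking to 22 ms (e.g., heavy raytraced scenes).
frame_times_ms = [15.0] * 90 + [22.0] * 10

# Average FPS is total frames over total time, not the mean of per-frame FPS.
avg_fps = 1000.0 * len(frame_times_ms) / sum(frame_times_ms)
worst_fps = 1000.0 / max(frame_times_ms)
print(f"average: {avg_fps:.1f} FPS, worst frame: {worst_fps:.1f} FPS")
```

Here the average works out to roughly 63.7 FPS while the spiked frames run at about 45 FPS -- so a "63 FPS average" benchmark is entirely consistent with regular, visible dips below 60.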
> Also last I read, RTX really isn't designed for offline raytracing and doesn't really bring much to the table. Its use is in realtime.
On the contrary! RTX Cores do hardware-accelerated BVH traversal. That has HUGE implications for the offline rendering scene.
See NVidia Optix for details: https://devblogs.nvidia.com/nvidia-optix-ray-tracing-powered...
IMO, this is the killer feature of RTX: accelerating those Hollywood renders from hours-per-frame to minutes-per-frame. NVidia Optix takes industry-standard scene trees and can use the RTX cores to traverse them for hardware-accelerated raytracing.
Or more specifically: coarse AABB bounds checking, which is a very compute- and memory-heavy portion of the raytracing algorithm.
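For the curious, here's that bounds check in miniature (a plain-Python sketch, not NVidia's implementation): the classic "slab" ray-vs-AABB test that BVH traversal runs at every tree node, and the inner loop the RT hardware accelerates.

```python
# Slab test: intersect the ray with the box's x/y/z slabs and check that
# the three entry/exit intervals overlap. BVH traversal does this millions
# of times per frame, which is why doing it in hardware pays off.
def ray_hits_aabb(origin, inv_dir, box_min, box_max):
    tmin, tmax = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t1 = (lo - o) * inv
        t2 = (hi - o) * inv
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax  # intervals overlap on all axes -> hit

# Ray from the origin along +x (inv_dir holds 1/direction; a huge value
# stands in for 1/0 on the y and z axes).
inv = (1.0, 1e9, 1e9)
print(ray_hits_aabb((0, 0, 0), inv, (2, -1, -1), (3, 1, 1)))  # True: box ahead
print(ray_hits_aabb((0, 0, 0), inv, (2, 2, 2), (3, 3, 3)))    # False: off-axis
```

Note the test is pure arithmetic (a few multiplies, min/max compares) on memory fetched per node -- compute- and memory-heavy, but with no branching complexity, exactly the kind of fixed-function work that maps well to dedicated silicon.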
Even if RTX is too slow for video games, any improvement to offline rendering is a huge advantage.
What NVIDIA is trying to sell people is a denoiser.
And are you sure RT Cores even exist, or are they just tensor cores by a different name?
Here is a comparison of regular global illumination vs. RTX real-time raytracing; in particular, take a look at the noise levels generated by RTX in the background. https://www.youtube.com/watch?v=CuoER1DwYLY
NVidia has that temporal denoising algorithm, though, so noise is smoothed across space and time. When you have 60 frames per second, you can "recycle" the previous frame's raytracing data for the current frame.
As long as the raytraced lights rarely change, it'd work out pretty well. Fast-moving lights would probably still have to use rasterization techniques, but the demos of NVidia's temporal-denoising algorithm are very impressive.
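A sketch of the accumulation idea only (this is *not* NVidia's actual denoiser, which is far more sophisticated; all constants here are made up): blend each frame's noisy 1-sample-per-pixel estimate into a running average, so a pixel lit by a static light converges over a handful of frames.

```python
import random
random.seed(0)

TRUE_RADIANCE = 0.5   # the value the pixel should converge to
ALPHA = 0.2           # blend weight given to the newest frame

def noisy_sample():
    """One jittered raytracing sample per pixel per frame."""
    return TRUE_RADIANCE + random.uniform(-0.3, 0.3)

def accumulated(frames=60):
    """Recycle history: exponential blend of the last `frames` estimates."""
    est = noisy_sample()
    for _ in range(frames - 1):
        est = ALPHA * noisy_sample() + (1 - ALPHA) * est
    return est

# Average absolute error over many "pixels": one raw sample vs. one second
# of temporal accumulation at 60 FPS.
pixels = 1000
err_single = sum(abs(noisy_sample() - TRUE_RADIANCE) for _ in range(pixels)) / pixels
err_temporal = sum(abs(accumulated() - TRUE_RADIANCE) for _ in range(pixels)) / pixels
print(f"1 sample: {err_single:.3f}  temporal: {err_temporal:.3f}")
```

The accumulated estimate lands much closer to the true radiance than any single frame's sample -- and it also shows the failure mode from above: when the light moves, the recycled history is stale and the pixel lags or ghosts until the average catches up.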
FYI: Blender Cycles' denoising is still kinda bad even at 1000 or even 5000 samples (at least when you look at low-light areas close to walls). You've gotta zoom in and look, but artistic CGI requires many, many samples before raytracing looks smooth. It's a major advancement for NVidia to come up with a raytracing denoising algorithm that works well for video games on only ~1 sample per pixel.
EDIT: lots of theories and discussion here: https://www.reddit.com/r/MachineLearning/comments/7tf4da/d_w...
That said, I'm still surprised Apple gave up on NVIDIA support, especially since there's no it-just-works option for deep learning on a Mac.
2012 Apple was pretty much at the apex of their quality.
What's the it-just-works consumer machine learning use case that requires GPGPU on end user hardware?
A developer can do the ugly hacking to get an NVIDIA eGPU working on macOS, but it's a hassle.
Theano used to work with OpenCL, and therefore AMD, but the support was never that great, and I don't know that many people used it. And Theano is now EOL.
AMD took the minimal-effort route. They just design graphics cards with some added functionality, and if someone wants to build libraries for them, great.
Is TF the dominant tool in commercial or startup DNN projects?
I previously tried using TF, but it was super painful, so I used Keras instead.