
ROCm (AMD) Now Upstreamed into TensorFlow - jamesblonde
https://medium.com/tensorflow/community-supported-amd-rocm-build-for-tensorflow-e8e9ac258369
======
dragontamer
Okay, so I like ROCm but... I just don't see how AMD can strategically win
here.

NVidia has Tensor Cores. NVidia GPUs literally have an assembly instruction that
performs a 4x4 FP16 matrix multiplication, while AMD hardware at best can
perform FP16 dot products (which... is still greatly accelerated, pushing
20+ TFLOPS, but nowhere near the 100 TFLOPS that some NVidia GPUs
put out).
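
For the curious, here's a minimal sketch of what that path looks like through
CUDA's wmma API (the underlying HMMA instructions chew through small 4x4
slices; the C++ API exposes them as a warp-wide 16x16x16 multiply-accumulate;
kernel name is just for illustration):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp multiplies a 16x16 FP16 tile pair on the Tensor Cores.
    // Requires sm_70 or newer (Volta/Turing).
    __global__ void tiny_mma(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

        wmma::fill_fragment(fc, 0.0f);
        wmma::load_matrix_sync(fa, a, 16);  // leading dimension = 16
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(fc, fa, fb, fc);     // fc = fa * fb + fc, one warp-level op
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    }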

What's cool about AMD GPUs is that they have better general-purpose compute
numbers than NVidia. An AMD Vega64 is over 10 TFLOPS with 480 GB/s of HBM2 RAM
and is priced under $400 now. Vega64 is a few years old, but it's still a
behemoth in compute power.

Radeon VII is in the same compute class (specified at 13.8 TFLOPS FP32), but
has 1 TB/s of HBM2 bandwidth and 16GB of RAM. Radeon VII is an absolute
monster in parallel compute and only costs $700. Since many problems are
memory-bandwidth bound (163,840 hardware SIMD threads of execution use up a
LOT of memory bandwidth), the extra memory bandwidth on the Radeon VII is
quite practical (YMMV of course: it always depends on your specific workload).

Moving a terabyte per second is a huge amount of memory bandwidth, no matter
how you look at it. Memory-hard problems will almost certainly want to execute
on the Radeon VII.
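
As a sketch of why that matters, take a trivially simple streaming kernel and
count bytes against flops:

    // SAXPY moves 12 bytes per element (read x, read y, write y in FP32)
    // for 2 flops, i.e. about 0.17 flops per byte. At ~1 TB/s that caps
    // out near 170 GFLOPS no matter how many TFLOPS the ALUs can do --
    // the memory bus sets the speed limit for kernels shaped like this.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }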

ROCm is a decent programming environment to unlock this "general purpose"
compute power... but TensorFlow is simply going to be in the bag for NVidia
due to those 4x4 FP16 matrix multiplication instructions.

Hmm... it's still nice, I guess, for AMD to catch up. But I feel like AMD
should maybe focus on other compute problems where they can better demonstrate
their advantages? For example, OpenCV would probably run better on AMD
hardware than on equivalently priced NVidia hardware.

I understand that there's a lot of hype around Deep Learning, but the compute
world is bigger than just that. There are a lot of problems out there, and AMD
GPUs seem better (!!) for a lot of these other problems. But there just isn't
very much discussion or marketing pointing this fact out.

If AMD worked on optimizing OpenCV, or maybe some other libraries, for the
ROCm platform, surely they'd beat NVidia in price/performance. AMD's hardware
is pretty kick-ass; the main issue is the amount of programming effort you
need to put in, since AMD's software stack is just very far behind NVidia's
CUDA environment.

-----------

Maybe AMD could show how memory bandwidth can become a unique asset in some
workloads? The specs are downright amazing, and the benchmarks prove it's
possible. It's just a matter of playing with the hardware and figuring out
which problems 4096 SIMD ALUs + 1 TB/s of memory bandwidth can crush.
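
Concretely, here's the roofline crossover for those specs (a back-of-envelope
sketch using the spec-sheet numbers):

    #include <cstdio>

    // Back-of-envelope roofline for a Radeon VII-class part: ~13.8
    // TFLOPS FP32 peak against ~1 TB/s of HBM2 bandwidth. A kernel needs
    // more flops per byte than this ratio before the ALUs, rather than
    // the memory bus, become the bottleneck.
    int main() {
        const double peak_flops = 13.8e12;  // FP32 peak, flops/s
        const double bandwidth  = 1.0e12;   // bytes/s
        printf("compute-bound above %.1f flops/byte\n",
               peak_flops / bandwidth);     // ~13.8 flops per byte moved
        return 0;
    }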

~~~
0-_-0
> 100TFlops that some NVidia GPUs are putting out...

Theoretically. In practice you need to make sure you're not bandwidth
bottlenecked, which is heavily dependent on workload. E.g. large fully
connected layers are bandwidth bound, so the FLOPS difference doesn't matter
(rough numbers below). Besides, nothing prevents AMD from implementing its own
matrix-multiply op in the next generation of GPUs.
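
Back-of-envelope for why big FC layers end up bandwidth bound (sizes are
illustrative, not from any particular model):

    #include <cstdio>

    // A 4096x4096 fully connected layer at batch size 1 is a
    // matrix-vector product: each FP16 weight is read once and used for
    // exactly one multiply-add, so arithmetic intensity is ~1 flop/byte.
    // Runtime is set by weight streaming, not by ALU or Tensor Core peak.
    int main() {
        const double n = 4096, m = 4096;
        const double flops = 2 * n * m;      // one MAC per weight
        const double bytes = 2 * n * m;      // FP16 weights read once
        printf("intensity: %.1f flop/byte\n", flops / bytes);     // 1.0
        printf("time at 1 TB/s: %.0f us\n", bytes / 1e12 * 1e6);  // ~34 us
        return 0;
    }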

~~~
dragontamer
> large fully connected layers are bandwidth bound

Hmm, well that might be useful for the Radeon VII and its 1TBps bandwidth.

The #1 network of interest to me is the Leela-Zero network, which is basically
a "deep" neural net with ~50 layers of tiny 3x3 convolutions. Something like
Leela-Zero would definitely benefit from those Tensor Cores, and it'd be small
enough to fit inside of L1 cache (or maybe even shared memory; I haven't
looked into it too deeply though).
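
For scale, a quick sizing sketch, assuming the common 256-filter Leela-Zero
configuration (the real network size varies):

    #include <cstdio>

    // Weight footprint of one 3x3 conv layer with 256 input and 256
    // output channels -- about 590K parameters, ~1.13 MB in FP16.
    // Whether that lands in L1, shared memory, or L2 is per-chip.
    int main() {
        const double params = 3.0 * 3 * 256 * 256;   // 589,824 weights
        printf("FP16 weights per layer: %.2f MB\n",
               params * 2 / (1024 * 1024));
        return 0;
    }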

> Besides, nothing prevents AMD from implementing its own matrix multiply OP
> in the next generation of GPUs.

While that's true, it's not really something this generation of hardware
buyers can think about.

I think the large, fully connected layers you mention may really benefit from
the Radeon VII, however. Someone should probably benchmark that kind of
network on a Radeon VII vs an RTX 2080 (both $700-class GPUs).

------
hammeiam
So does this mean that I no longer need an nvidia gpu for doing ML at home?

~~~
0-_-0
"We are motivated by the results of MLIR and XLA, and we are working towards
enabling and optimizing these technologies for AMD GPUs."

So the answer is "not yet".

