
A new GPU backend for the TVM stack - ziheng
http://www.tvmlang.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
======
muli
Some quick calculations about TFLOPS per dollar on GEMM.

Since NVIDIA's Volta consumer card is not out yet, I used the Titan Xp as the
reference card. I grabbed prices from Wikipedia, and assume TVM reaches 65%
of peak perf on Vega and 90% of peak perf on the Titan Xp:

Radeon RX Vega 64: 12.6 TFLOPS * 65% / $499 = 0.0164 TFLOPS/$
Pascal Titan Xp: 12 TFLOPS * 90% / $1200 = 0.009 TFLOPS/$

So Vega comes out well ahead on this metric.
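The back-of-the-envelope calculation above is easy to reproduce; the efficiency factors (65% and 90%) and Wikipedia prices are the parent comment's assumptions, not measurements:

```python
# TFLOPS per dollar = peak TFLOPS * assumed achieved fraction / card price.
# Efficiency factors and prices are the parent comment's assumptions.
def tflops_per_dollar(peak_tflops, efficiency, price_usd):
    return peak_tflops * efficiency / price_usd

vega64 = tflops_per_dollar(12.6, 0.65, 499)    # Radeon RX Vega 64
titan_xp = tflops_per_dollar(12.0, 0.90, 1200)  # Pascal Titan Xp

print(f"Vega 64:  {vega64:.4f} TFLOPS/$")   # ~0.0164
print(f"Titan Xp: {titan_xp:.4f} TFLOPS/$")  # 0.0090
```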

~~~
dragontamer
I'd assume that Video RAM is important though. The Titan XP has 12GB of RAM,
while the $500 Vega 64 only has 8GB of RAM.

A better comparison is probably Vega Frontier with 16GB of RAM for $1000. If
you're doing heavy compute, you're probably gonna need a ton of RAM to go with
it.

~~~
microcolonel
Vega has half-precision floats as well, though, which (with a seemingly
negligible loss in precision), in combination with its HBCC (transparent main
memory DMA), should more than make up for the Vega 64's smaller memory, and
all the more so on the Frontier Edition.
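To put a number on "seemingly negligible": float16 has a 10-bit mantissa, so merely storing a well-scaled value in half precision costs at most about 0.05% relative error. A quick NumPy check (the test values here are arbitrary, just chosen to stay in float16's normal range):

```python
import numpy as np

# float16 machine epsilon is ~9.8e-4, so rounding a well-scaled value to
# half precision introduces at most ~epsilon/2 relative error.
fp64 = np.linspace(0.1, 1.0, 256)   # arbitrary well-scaled test values
fp16 = fp64.astype(np.float16)      # round each value to half precision

rel_err = np.max(np.abs(fp16.astype(np.float64) - fp64) / fp64)
print(f"max relative rounding error: {rel_err:.2e}")  # well under 1e-3
```

Whether that error stays negligible through a long chain of accumulations is workload-dependent, which is why mixed-precision schemes keep accumulators in fp32.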

------
railgun2space
A new use for RX 480/580 mining cards?

------
0xbear
How’s the perf per dollar? It’s not enough to “bring” it to AMD, it must be
competitive as well.

~~~
dragontamer
Hmm, with the "Tensor Cores" of NVidia's next-generation Volta coming in, I'd
bet that NVidia cards will be faster in machine learning tasks.

But then again: AMD Vega 56 / 64 have HBM2 and are under $1000. IIRC, the Vega
Frontier Edition is $999 with 16GB of HBM2 at 480GB/s theoretical bandwidth.

NVidia also has an offering with high-speed HBM2 RAM, the Tesla P100, but it's
way more expensive: $7000 each.

I dunno if there are major benefits of HBM2 over GDDR5X, however; I'm just
listing off numbers here. The Titan Xp apparently gets more bandwidth from its
GDDR5X RAM, for example, although the Titan Xp is still more expensive than
the Vega 64.

------------

If a problem is constrained by global memory bandwidth, then it might be
better to run it on the AMD Vega 64. After all, you can afford roughly seven
AMD Vega Frontier Editions for the price of one NVidia P100.

Obviously, this very much depends on your workload.

~~~
dharma1
Yep, Nvidia is quoting 125 TFLOPs mixed precision on V100, boosted by Tensor
Cores.

Vega 64 can in theory do 25 TFLOPs half precision.

But as you say there's a large price difference too.

For the market segment that needs 1-8 GPU rigs for ML on a low budget, AMD
could kill it if they invested in software support and kernel optimisation.

For servers and large scale training, unless AMD has some ML specialised cores
in the pipeline, Nvidia Volta and Google TPUs have a serious lead.
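Running the quoted peaks through the same TFLOPS-per-dollar metric as upthread illustrates the budget trade-off. Note the V100 price below is a placeholder assumption (no consumer pricing was quoted in this thread); only the Vega 64's $499 is from the discussion above:

```python
# Half/mixed-precision peak TFLOPS per dollar.
# V100_PRICE is an assumed placeholder for illustration, not a quoted figure.
V100_PRICE = 8000.0

v100 = 125.0 / V100_PRICE   # V100 mixed-precision Tensor Core peak
vega64 = 25.0 / 499.0       # Vega 64 half-precision peak, $499 launch price

print(f"V100 (assumed price): {v100:.4f} TFLOPS/$")
print(f"Vega 64:              {vega64:.4f} TFLOPS/$")
```

At any price in that ballpark, the raw peak-per-dollar still favors the cheap card, which is exactly why the low-budget segment looks interesting for AMD; the catch is that achieved (not peak) throughput depends on the software support discussed above.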

~~~
mtgx
Aren't the Tensor Cores mostly for inference, not training?

~~~
dragontamer
The Volta Tensor Cores aren't released yet, and I haven't played with Google's
TPUs at all.

But NVidia markets Tensor cores as:

> New Tensor Cores designed specifically for deep learning deliver up to 12x
> higher peak TFLOP/s for training, and 6x higher peak TFLOP/s for inference

[https://devblogs.nvidia.com/parallelforall/cuda-9-features-r...](https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/)

I wouldn't be surprised if they were inflating the numbers slightly, as is
common in a lot of marketing material.

~~~
dharma1
You can use Tensor Cores on AWS p3 instances (V100) now. There just doesn't
seem to be a whole lot of ML framework support for mixed-precision training
with CUDA 9 yet.

[https://devblogs.nvidia.com/parallelforall/programming-tenso...](https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/)
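For reference, the mixed-precision recipe those docs describe (fp16 compute, fp32 master weights, loss scaling) can be sketched without any framework. This is a toy NumPy stand-in for the pattern, not any library's actual API; a real framework would run the fp16 math on Tensor Cores:

```python
import numpy as np

# Toy mixed-precision training loop: keep an fp32 "master" copy of the
# weights, do the forward/backward math in fp16, and scale the loss so
# small gradients don't underflow in half precision.
LOSS_SCALE = 1024.0

w_master = np.array([0.5, -0.3], dtype=np.float32)  # fp32 master weights
x = np.array([1.0, 2.0], dtype=np.float16)          # toy input
target = np.float32(1.0)                            # fit w @ x to 1.0

for _ in range(100):
    w16 = w_master.astype(np.float16)               # cast down for compute
    pred = np.float32(w16 @ x)                      # fp16 forward pass
    err = pred - target                             # loss = err**2
    grad16 = ((2.0 * err * LOSS_SCALE) * x).astype(np.float16)  # scaled grad
    grad32 = grad16.astype(np.float32) / LOSS_SCALE  # unscale in fp32
    w_master -= 0.05 * grad32                        # fp32 weight update

result = float(w_master @ x)
print(result)  # converges toward the target, ~1.0
```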

Google's TPUs also do low precision training (with some special version of
TF). [https://cloud.google.com/tpu/](https://cloud.google.com/tpu/)

