
2080 Ti TensorFlow GPU Benchmarks - sabalaba
https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/
======
sabalaba
Lower in this thread, bitL pointed out that the prices we used in our analysis
are not exactly in line with current market prices. bitL hit the nail on the
head in terms of the biggest weakness of our post: choosing the price. So,
we've decided to make the spreadsheet that generated our graphs and
(performance / $) tables public. You can view it here:

[https://docs.google.com/spreadsheets/d/1La55B-AVHSv9LiQcs6GM...](https://docs.google.com/spreadsheets/d/1La55B-AVHSv9LiQcs6GMhUyackNdhJ4WHprUnWfQOxI/edit#gid=0)

You can copy that spreadsheet and insert whatever system price (in
kilodollars) you want into B15:F15. Hope this makes everybody's decision
making easier.
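For anyone who doesn't want to open the sheet, the calculation it performs is just throughput divided by system cost. Here's a rough sketch with made-up placeholder numbers (the real figures live in the spreadsheet):

```python
# Rough sketch of the performance-per-dollar calculation the spreadsheet
# performs. Throughput and price values here are illustrative placeholders,
# not the actual benchmark data.

# images/sec on some training workload (illustrative values)
throughput = {"1080 Ti": 100.0, "2080 Ti": 137.0, "Titan V": 143.0}

# total system price in kilodollars (the spreadsheet's B15:F15 row)
price_kusd = {"1080 Ti": 1.2, "2080 Ti": 1.5, "Titan V": 3.0}

def perf_per_dollar(gpu):
    """Images/sec per dollar of total system cost."""
    return throughput[gpu] / (price_kusd[gpu] * 1000)

for gpu in throughput:
    print(f"{gpu}: {perf_per_dollar(gpu):.4f} images/sec/$")
```

Plug in your own prices the same way you would edit B15:F15.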

As a system builder and AI research company, we're trying to make benchmarks
that are scientific, reproducible, correlate with real world training
scenarios, and have accurate prices.

------
shaklee3
The fp16 results versus the 1080 Ti are somewhat surprising. The author
specifically pointed out that they are using tensor cores where possible. I
would have expected fp16 to be more than 100% faster than the 1080 Ti if
they were using tensor cores. Can anyone explain that?

~~~
pavanky
Not everything in the compute pipeline is going to be converted to fp16
operations. Any time you are doing accumulation or exponentials, you have to
do it in fp32.

There was a good talk from NVIDIA at last year's GTC: [http://on-
demand.gputechconf.com/gtc/2018/presentation/s8923...](http://on-
demand.gputechconf.com/gtc/2018/presentation/s8923-training-neural-networks-
with-mixed-precision-theory-and-practice.pdf)

Here is another relevant blog post: [https://devblogs.nvidia.com/mixed-
precision-training-deep-ne...](https://devblogs.nvidia.com/mixed-precision-
training-deep-neural-networks/)

EDIT: Also not everything in the training loop is a matrix multiplication
where tensor cores are useful.
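A quick way to see the accumulation problem (a toy NumPy sketch, not a real training loop; the values are made up):

```python
import numpy as np

# Why sums stay in fp32 even when weights/activations are stored in fp16:
# accumulating many small fp16 values loses precision, because once the
# running sum is large enough, each small addend falls below fp16 resolution
# and rounds away entirely.

x = np.full(10000, 0.0001, dtype=np.float16)

sum_fp16 = np.float16(0)
for v in x:
    sum_fp16 = np.float16(sum_fp16 + v)  # accumulate in fp16: stalls early

sum_fp32 = x.astype(np.float32).sum()    # accumulate in fp32: accurate

print(sum_fp16, sum_fp32)  # fp16 total ends up far below the true ~1.0
```

This is exactly the mixed-precision recipe from the NVIDIA talk: fp16 storage and multiplies, fp32 accumulation.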

~~~
shaklee3
Right, but the tensor cores should be about 10x faster on the compute side,
and about 2x the memory bandwidth. GEMM is usually constrained by compute,
which is why the tensor cores exist.

These benchmarks are for training, so the expectation is that they are running
them in fp16 all the way through. Also, tensor cores can accumulate in fp32
registers with a slight hit to performance.

~~~
sabalaba
Is there a benchmark you've seen that matches the claimed 10x increase in
performance for the Tensor Cores? The NVIDIA hype train can sometimes make it
difficult to find hard numbers.

~~~
shaklee3
I've done my own benchmarks where I've hit over 100 TFLOPS on the V100, and
that's about 85% of their peak theoretical throughput. Granted, the matrix
size needs to be large enough, but it's definitely doable. Anandtech
also showed similar results in their V100 review. I haven't yet seen a
comparable SGEMM done on the 2080Ti, so I don't know how it'll compare. I have
some coming in at the end of the month though, so I should know soon.
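For anyone who wants to check numbers like that themselves, the TFLOPS accounting is simple. A rough sketch (CPU-side NumPy, so the absolute number will be tiny, but the same formula applies to a GPU or tensor-core GEMM):

```python
import time
import numpy as np

# Sketch of the accounting behind numbers like "100 TFLOPS on the V100":
# a GEMM of (m x k) by (k x n) does roughly 2*m*n*k floating point ops
# (each multiply-add counted as 2). Achieved TFLOPS = ops / seconds / 1e12.

m = n = k = 1024
a = np.random.rand(m, k).astype(np.float32)
b = np.random.rand(k, n).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flops = 2.0 * m * n * k          # total floating point operations
tflops = flops / elapsed / 1e12
print(f"{tflops:.3f} TFLOPS achieved")
```

Comparing that achieved number against the card's theoretical peak gives the ~85%-of-peak efficiency figure mentioned above.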

------
std_throwawayay
For comparison the Phoronix benchmarks:
[https://www.phoronix.com/scan.php?page=article&item=nvidia-r...](https://www.phoronix.com/scan.php?page=article&item=nvidia-
rtx2080ti-tensorflow&num=1)

This one is more oriented toward the lower end.

------
sabalaba
The TL;DR on this is that the 2080 Ti is the most cost effective GPU on the
market today for deep learning.

It is:

\- 37% faster than the 1080 Ti with FP32, 62% faster with FP16, and 25% more
expensive.

\- 35% faster than the 2080 with FP32, 47% faster with FP16, and 25% more
expensive.

\- 96% as fast as the Titan V with FP32, 3% faster with FP16, and ~1/2 of the
cost.

\- 80% as fast as the Tesla V100 with FP32, 82% as fast with FP16, and ~1/5 of
the cost.
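Those ratios directly imply the perf/$ ranking. A quick back-of-the-envelope check, with everything normalized to the 1080 Ti and prices approximated from the "~1/2" and "~1/5" figures above:

```python
# Back-of-the-envelope check of the TL;DR using the relative numbers above.
# Speeds and prices are normalized to the 1080 Ti (speed = price = 1.0).
speed = {
    "1080 Ti": 1.00,
    "2080 Ti": 1.37,          # 37% faster than the 1080 Ti (FP32)
    "Titan V": 1.37 / 0.96,   # 2080 Ti is 96% as fast as the Titan V
    "V100":    1.37 / 0.80,   # 2080 Ti is 80% as fast as the V100
}
price = {
    "1080 Ti": 1.00,
    "2080 Ti": 1.25,          # 25% more expensive than the 1080 Ti
    "Titan V": 1.25 * 2,      # 2080 Ti is ~1/2 the Titan V's cost
    "V100":    1.25 * 5,      # 2080 Ti is ~1/5 the V100's cost
}
perf_per_dollar = {g: speed[g] / price[g] for g in speed}
best = max(perf_per_dollar, key=perf_per_dollar.get)
print(best, perf_per_dollar)
```

The 2080 Ti comes out on top: 1.37x the speed for 1.25x the price beats every other option on this list.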

~~~
The_rationalist
Actually, AMD GPUs are the most cost-effective GPUs and have supported full
FP16/FP8 for two years. But people are locked into the Nvidia proprietary jail
and no one seems to care... That is really sad and against the consumer's
interest. Also, deep learning will never become mainstream if it can only run
on 5% of the world's hardware (Nvidia).

~~~
kbumsik
> Actually AMD gpus are the most cost effective gpus.

No, AMD GPUs have zero cost effectiveness, because TensorFlow does not support
AMD GPUs.

> But people are locked in the Nvidia proprietary jail and no one seems to
> care...

Sounds like you want to blame the users, but this is because Nvidia has
invested heavily in GPGPU and CUDA for more than 10 years, while AMD focused
on other things like HSA. It is AMD's fault.

~~~
sabalaba
I agree with kbumsik here. AMD only has themselves to blame. They have great
hardware and fantastic theoretical benchmarks. Heck, even their SGEMMs are
really fast and in line with the 15 TFLOPS of FP32 on the VEGA 64s that we've
benchmarked. However, it comes down to software ecosystem and optimizations
for common deep learning subroutines.

MIOpen[1] is a step in this direction but still causes the VEGA 64 + MIOpen to
be 60% of the performance of a 1080 Ti + CuDNN based on benchmarks we've
conducted internally at Lambda. Let that soak in for a second: the VEGA 64
(15 TFLOPS theoretical peak) is 0.6x of a 1080 Ti (11.3 TFLOPS theoretical
peak). MIOpen is very far behind CuDNN.
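Putting the quoted numbers together: on paper the VEGA 64 has ~1.33x the 1080 Ti's FP32 throughput, yet delivers 0.6x in practice. A quick sanity check of how much of its relative potential the software stack actually realizes:

```python
# Quantifying the software gap using the numbers quoted above.
vega64_peak = 15.0      # TFLOPS FP32, theoretical (VEGA 64)
gtx1080ti_peak = 11.3   # TFLOPS FP32, theoretical (1080 Ti)

theoretical_ratio = vega64_peak / gtx1080ti_peak   # ~1.33x on paper
measured_ratio = 0.6                               # MIOpen vs cuDNN result above

# Fraction of the VEGA 64's relative potential the software stack realizes
software_efficiency = measured_ratio / theoretical_ratio
print(f"{software_efficiency:.0%} of relative potential realized")
```

That works out to under half of the hardware's relative potential, which is the whole point about MIOpen vs CuDNN.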

Lisa Su, if you're reading this, please give the ROCm team more budget!

1\.
[https://github.com/ROCmSoftwarePlatform/MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen)

~~~
lostmsu
Or, maybe, a new team.

------
sp332
For one GPU, sure. But the cooling design is going to be awful if you try to
pack a bunch of cards in a 4U or 5U box.

~~~
ydau
As others have stated, blower cards resolve this by pushing air directly
outside of the case. 2080 Ti blowers are in the works:

[https://www.pny.com/GeForce-RTX-2080-Ti-11GB-
Blower](https://www.pny.com/GeForce-RTX-2080-Ti-11GB-Blower)

~~~
_Wintermute
Pretty sure the ASUS TURBO models are already out, which are blower style
cards.

------
londons_explore
It would be good to see it compared to a TPU...

I feel like that's becoming TensorFlow's 'native platform'...

~~~
zitterbewegung
TPUv1 was designed for good performance per watt on inference. [1]

TPUv2 was benchmarked against NVIDIA's K80 at 30 times the performance and
had a peak of 92 TOPS [2], while the 2080 Ti is at 440 TOPS [3].

[1] [https://www.anandtech.com/show/11749/hot-chips-google-tpu-pe...](https://www.anandtech.com/show/11749/hot-chips-google-tpu-performance-analysis-live-blog-3pm-pt-10pm-utc)

[2] [https://www.tomshardware.com/news/tpu-v2-google-machine-lear...](https://www.tomshardware.com/news/tpu-v2-google-machine-learning,35370.html)

[3] [https://www.microway.com/knowledge-center-articles/compariso...](https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/)

~~~
shaklee3
Why would anyone compare a modern tpu against an Nvidia card that is four
generations old? The tpu didn't even exist when the K80 came out.

~~~
deepnotderp
Marketing.

Nvidia is destroying the TPUs right now, and Google is desperate to maintain
its public perception as the king of AI (which, to be fair, they probably
are, compute capabilities aside).

~~~
matt4077
You should try the assumption of good faith some time. It generally leads to a
happier place.

In this case the data was from the time Google introduced the TPU
_internally_, when the K80 was very much up to date. It also makes sense
because the K80 was the only GPU offered on GCP.

~~~
shaklee3
This is the TPU v2, and it was most certainly not released internally when
the K80 was.

------
howscrewedami
What about int8 performance? Would it be somehow improved by using tensor
cores?

------
jaytaylor

      The RTX 2080 Ti, on the other
      hand, is like a Porsche 911.
      It's very fast, handles well,
      expensive but not ostentatious
    

Since when is a Porsche 911 not ostentatious? It's a sports car in $100K+
territory and functionally impractical / limited.

Only compared to a car which costs millions could it be considered reasonable
and not ostentatious.

Guess I'll be sticking to my "poor man's" GTX 970 :p

~~~
Analemma_
Compared to e.g. a Lamborghini, it's not nearly as loud in appearance and can
almost pass for a normal car in a grocery store parking lot, as long as it's
not fire-engine red or something. But I think this analogy breaks down a
little when applied to graphics cards.

~~~
dragonwriter
> Compared to e.g. a Lamborghini, it's not nearly as loud in appearance and
> can almost pass for a normal car in a grocery store parking lot

Kinda depends on which Lamborghini; an Urus probably does a better job of
passing for normal in a grocery store parking lot (and is better for actually
carrying groceries) than a 911.

