
Which GPUs to get for deep learning - dsr12
https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/
======
tbenst
I appreciate the time and care that went into this post, and there’s a nice
discussion of various features.

Unfortunately the performance charts are completely divorced from reality, and
in particular the discussion of tensor cores may be true from an instruction
count perspective but does not reflect any third-party benchmark I’ve seen.
For example: [https://lambdalabs.com/blog/2080-ti-deep-learning-
benchmarks...](https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/).
Nvidia has a history of straight-up-lying about tensorcore and other
benchmarks (for example, see this thread from right after Nvidia announced an
8x improvement in speed on imagenet in tensorflow for V100:
[https://github.com/tensorflow/benchmarks/issues/77](https://github.com/tensorflow/benchmarks/issues/77))

In general, fp16 is only 30-40% faster than fp32, and occasionally 2x in
really optimal conditions.

~~~
nl
> Unfortunately the performance charts are completely divorced from reality, and
> in particular the discussion of tensor cores may be true from an instruction
> count perspective but does not reflect any third-party benchmark I’ve seen.
> For example: [https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks...](https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/)

The performance numbers posted here appear to almost exactly reflect the
LambdaLabs numbers.

Lambda Labs: the RTX 2080 Ti is 96% as fast as the Titan V and 73% as fast as the
Tesla V100 (32 GB), i.e. with the 2080 Ti normalized to 1, the Titan V is about
1.04 and the V100 about 1.37.

timdettmers: RTX 2080 Ti normalized to 1, Titan V looks about 1.1 to 1.2, V100
is just below 1.5

> for example, see this thread from right after Nvidia announced an 8x
> improvement in speed on imagenet in tensorflow for V100

Well, a non-NVidia person managed to get it up to just above a 4x improvement
without "using unreleased libraries from NVIDIA". From the same thread:
[https://github.com/tensorflow/benchmarks/issues/77#issuecomm...](https://github.com/tensorflow/benchmarks/issues/77#issuecomment-394541623)

In my experience NVidia benchmark numbers in deep learning are rarely lies -
they are highly optimised, in optimal conditions and rarely achievable in the
real world. About what you'd expect from a vendor benchmark.

~~~
tbenst
Thank you for cross-referencing that; the data does look accurate and my
statement now seems exaggerated. I do think we need skepticism on the A100
charts, though, until third-party benchmarks appear.

> In my experience NVidia benchmark numbers in deep learning are rarely lies -
> they are highly optimised, in optimal conditions and rarely achievable in
> the real world.

Right, but Nvidia claimed 1360 images/sec for ResNet-50 on ImageNet. To my
knowledge this still hasn’t been realized by a third party. It also isn’t a 4x
improvement of fp16 over fp32; that figure compares against the previous
generation. The improvement is more like 1.5x: [https://lambdalabs.com/blog/best-gpu-
tensorflow-2080-ti-vs-v...](https://lambdalabs.com/blog/best-gpu-
tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/)

Even in very simple synthetic benchmarks the speedup is only 2x:
[https://github.com/tensorflow/benchmarks/issues/77#issuecomm...](https://github.com/tensorflow/benchmarks/issues/77#issuecomment-349838985)

I have not seen any benchmarks showing an 8x speedup. Have you? If not ->
Nvidia lied.

~~~
nl
> Right, but Nvidia claimed 1360 images/sec for resnet-50 on imagenet.

On
[https://images.nvidia.com/content/technologies/volta/pdf/vol...](https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-
update-us-1165301-r5.pdf) they claim 1,525 images/sec (!)

Dell hit 5,243 images/sec with one of their four-V100 servers, which comes to
1,310 images/sec per V100. I find it very believable that NVidia would get
~200 images/sec more, since Dell jumped 50% with a change in their CPU/GPU
connection topology.

See [https://www.dell.com/support/article/en-au/sln317397/deep-
le...](https://www.dell.com/support/article/en-au/sln317397/deep-learning-
performance-on-v100-gpus-with-resnet-50-model?lang=en)

~~~
nl
Also I just noticed that Google's XLA gets 1278 images/second on a single V100
in FP16 mode.

[https://www.tensorflow.org/xla](https://www.tensorflow.org/xla)
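
For anyone curious how that mode is enabled, here is a minimal sketch in TF 2.x
(hedged: the exact mixed-precision API has moved between TF versions, and this
is not the configuration Google used for that number):

    import tensorflow as tf

    # Enable XLA JIT compilation globally (assumption: a TF 2.x release).
    tf.config.optimizer.set_jit(True)

    # Run compute in fp16 while keeping variables in fp32
    # (this is the TF >= 2.4 spelling of the mixed-precision API).
    tf.keras.mixed_precision.set_global_policy("mixed_float16")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1024,)),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(10, dtype="float32"),  # keep logits in fp32 for stability
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")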

~~~
tbenst
Thank you for finding that! I’m glad to see the situation has improved since I
last looked. A 3x improvement over fp32 is impressive for sure. Their marketing
claims of 8x still bother me, though.

~~~
nl
What exactly is the 8x claim?

The thing I've seen is very limited ( _NVIDIA GPUs offer up to 8x more half
precision arithmetic throughput when compared to single-precision, thus
speeding up math-limited layers._ [1]) which is probably true.

The problem with performance improvements is the diminishing returns part of
Amdahl's Law [2]: an 8x improvement in math performance just means the math part
becomes less important in terms of absolute performance.
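
To make the diminishing-returns point concrete, a rough back-of-the-envelope
sketch (the 70% math fraction is an illustrative assumption, not a measured
number):

    # Amdahl's Law: overall speedup when only part of the work is accelerated.
    math_fraction = 0.7   # assume 70% of runtime is tensor-core-eligible math
    math_speedup = 8.0    # the claimed 8x fp16 throughput

    overall = 1.0 / ((1.0 - math_fraction) + math_fraction / math_speedup)
    print(f"{overall:.1f}x overall")  # ~2.6x, nowhere near 8x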

In any case, I've found NVidia's claims in the machine learning area to be
pretty good. Like most claims you have to read carefully to see exactly what
the claim is, but that's not uncommon with performance claims.

[1] [https://docs.nvidia.com/deeplearning/performance/mixed-
preci...](https://docs.nvidia.com/deeplearning/performance/mixed-precision-
training/index.html)

[2]
[https://en.wikipedia.org/wiki/Amdahl%27s_law#Relation_to_the...](https://en.wikipedia.org/wiki/Amdahl%27s_law#Relation_to_the_law_of_diminishing_returns)

~~~
tbenst
> NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when
> compared to single-precision, thus speeding up math-limited layers

Right, but I’ve benchmarked the best-case scenario, i.e. a large GEMM call in
C++, and still not seen anywhere close to 8x. I’ve never seen a code example,
no matter how limited, showing an 8x speedup.
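
For anyone who wants to reproduce that kind of best-case comparison, here is a
rough sketch of an equivalent benchmark in PyTorch (not the original C++/cuBLAS
code; matrix size and iteration count are arbitrary):

    import time
    import torch

    def bench(dtype, n=8192, iters=50):
        # Large square GEMM; fp16 inputs let cuBLAS dispatch to tensor cores
        # on Volta/Turing hardware.
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return (time.time() - t0) / iters

    t32 = bench(torch.float32)
    t16 = bench(torch.float16)
    print(f"fp32 {t32*1e3:.1f} ms, fp16 {t16*1e3:.1f} ms, ratio {t32/t16:.1f}x")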

~~~
llukas
[https://developer.nvidia.com/blog/programming-tensor-
cores-c...](https://developer.nvidia.com/blog/programming-tensor-cores-
cuda-9/)

See cuBLAS mixed-precision GEMM.

------
esquire_900
Bought a second-hand GTX 1060 a while ago to play around a bit more seriously
with neural networks. It's a good balance: cheap, 6GB of memory, but still
serious enough to get some work done. If you are a professional researcher or
do multiple Kaggles per month, then yes, get the best card. But I suspect a lot
of people are one category "below", where a 1060 is sufficient most of the
time, and you go to the cloud for the actual big workloads.

This strategy can keep you going for a couple of years. With models becoming
as big as they are, I doubt how much SOTA work an RTX 3070 is going to do in
2-3 years (actually, none of these cards come close to GPT-3). By that time you
can pick up a second-hand RTX 30-series card and still get the latest offerings
in the cloud.

Buying a second-hand GPU comes with a bit of risk, by the way; someone
suggested only buying if the price is less than half of the original price
(can't remember the link).

~~~
mastazi
Interestingly, the 1060 was mentioned among the suggested cards in one of the
old versions of this same article (it gets updated every time Nvidia releases
a new series).

------
coredog64
I’m confused:

> Do not buy GTX 16s series cards. These cards do not have tensor cores and,
> as such, provide relatively poor deep learning performance. I would choose a
> used RTX 2070 / RTX 2060 / RTX 2060 Super any day over a GTX 16s series card

...a few paragraphs later...

> If that is too expensive, a used GTX 980 Ti (6GB $150) or a used GTX 1650
> Super ($190).

~~~
timdettmers
This is good feedback. Will note this down and incorporate it in a small
update.

~~~
timdettmers
I just did a small fix to address this. Thanks again.

------
ablekh
Kudos to the author on producing an excellent, comprehensive yet readable
post (based on my initial very brief review)! Much appreciated. One thing that
jumped out at me, though, is the recommendation for the _" I want to try deep
learning, but I am not serious about it"_ scenario. Advising the purchase of a
physical GPU (unless one is already part of the system at hand), especially an
RTX 2060 Super, IMO does not make any sense in this case. Using cheap cloud GPU
instances is the optimal way to _try_ deep learning.

~~~
timdettmers
Thank you! You have a good point. I think I would agree with you that if
somebody already has cloud computing skills, then the cloud is a much more
powerful way to learn deep learning than your own GPU.

I figured that most people who start with deep learning might also lack cloud
computing skills. Learning one thing at a time is easier and, as such, just
sticking a GPU into your desktop and focusing on deep learning software /
programming might yield a better experience.

I might update my blog post in the future with this detail.

~~~
ablekh
You're welcome! While I understand your rationale now, I'm afraid I still
disagree with it. :-) Simply because I find it very unlikely that people
interested in and having enough skills to embark on any reasonable deep
learning journey (even if just to try) would lack enough cloud computing
skills to use cloud GPU instances. After all, using GPUs in the cloud is not
much different from (and, thus, not more complex than) using physical GPUs in
your local machine.

Fun fact: I’ve saved your post as a PDF for offline reading and it clocked in at
649(!) pages at the time of saving (32 pages for the post per se and the rest
for the blog post comments). Combining that with the feedback here at HN, it is
clear that there is quite a lot of interest in the topic ...

~~~
timdettmers
Haha, 649 pages! Thanks for the discussion. I can understand your perspective.
Maybe it would be best to add something to the blog post that discusses it
from both perspectives and readers can then choose which perspective suits
them better.

I should also say a bit more in general about cloud computing; it seems some
people agree that the post ran a bit short on that. At some point I just
wanted to be done with it, though; editing 10k-word blog posts is not so much
fun anymore!

~~~
ablekh
It's my pleasure. I agree - it's a good idea to present both perspectives and
allow readers to decide what works best for them.

I can certainly understand you being hesitant to add more content to an
already sizeable post. Perhaps several small paragraphs on the important
relevant aspects might still be worth considering (take it with a grain of
salt, since I haven't actually read your post in detail, including the
cloud-related parts).
Anyway, thank you very much, again, for your time and effort. Keep it up!

~~~
timdettmers
Thank you so much!

~~~
ablekh
You're very welcome! BTW, do you have any interest in and time for potential
consulting or advisory work for a frontier/deep tech startup (ambitious goals,
challenging tasks, great impact)? Not an immediate need, but, hopefully, it
will become a real opportunity in the not-so-distant future.

------
nl
Just noting that you can (still) do very well on non top-of-the-line cards.

I've won multiple silver Kaggle medals on a 1070. It's true that more power
would be helpful, but I feel it's a lack of technique (and time!) rather than
compute that has held me back from gold medals.

~~~
edude03
Not disagreeing with you, but from what I've heard, the people who win Kaggles
are the people who can try more things; having a faster card presumably would
allow you to try more things because each attempt would take less time.

~~~
nl
There is some truth in this, but it's not the entire picture.

I've thought about it a lot, and talked to lots of really good Kagglers about
it. Most of them use multiple machines, rather than having maximum performance
in a single machine.

This lets them run multiple completely different experiments at once, and
then put extra compute onto the ones that seem good.

That is a big difference from my situation, where I have to experiment
sequentially. The parallelism is more important than absolute speed a lot of
the time.

------
liuliu
Good article! But the comment on NVLink / PCIe 4.0 (and that PCIe 3.0 x4 is good
enough) doesn’t fit my experience. PCIe 3.0 x4 can really impact your all-reduce
performance for models such as Transformers. For ResNet, it can have an impact
too, but only in the 5% to 10% range the author mentioned.

I am also interested in whether PCIe 4.0 can help unified memory for larger
models. Guess I have to wait for the actual RTX 3090 release.
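
For context, "unified memory" here means CUDA managed memory, which can be
oversubscribed beyond device RAM and paged over the interconnect on demand. A
minimal illustration of the mechanism using CuPy (just a sketch, not the
framework discussed below):

    import cupy as cp

    # Route all CuPy allocations through CUDA managed (unified) memory
    # (assumption: a CUDA-enabled CuPy install).
    cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

    # Managed allocations can exceed physical GPU RAM; pages migrate over the
    # PCIe/NVLink interconnect on demand, so link bandwidth matters a lot here.
    x = cp.zeros((40_000, 40_000), dtype=cp.float32)  # ~6.4 GB
    x += 1.0
    print(float(x.sum()))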

~~~
p1esk
_unified memory for larger models_

What are you talking about? Are you writing custom CUDA code to run those
large models? Because there's zero support for unified memory in any of the
existing DL frameworks.

~~~
liuliu
Yes. I have an alternative DL framework to do whatever I want with :)

~~~
p1esk
Link?

~~~
lunixbochs
Probably this:

[https://libnnc.org](https://libnnc.org)

[https://github.com/liuliu/ccv/tree/unstable/lib/nnc](https://github.com/liuliu/ccv/tree/unstable/lib/nnc)

------
fxtentacle
Sadly, this is mostly purchase advice.

In short: You need lots of RAM.

And stay away from overclocked (Founders Edition) cards and from datacenter
models, due to heat or price problems.

~~~
timdettmers
What else would you like to see?

~~~
fxtentacle
The HN submission was initially called "advice on using GPU for deep learning",
so I was hoping for optimization tips, too.

------
dade_
I want an external 3090 chassis on Thunderbolt, similar to this box. No muss,
no fuss. AC, liquid cooling, all in an engineered box. Gaming or ML with my
laptop or tablet PC.

[https://www.gigabyte.com/Graphics-
Card/GV-N208TIXEB-11GC#kf](https://www.gigabyte.com/Graphics-
Card/GV-N208TIXEB-11GC#kf)

~~~
numpad0
Just buy a good ITX PC. eGPU boxes often cost more than a PC and are larger as
well, while performing worse due to the TB3 link and overall clunkiness.

~~~
ChuckNorris89
Thunderbolt 4 is on its way with Intel Tiger Lake, which doubles the bandwidth,
but that still leaves limitations such as being limited to x8 PCIe lanes (which
means you're leaving performance on the table with high-end GPUs) and being
locked into the Intel ecosystem, so no Ryzen CPUs for you :(

It also costs $$$, which makes it a luxury solution only suitable for users
who absolutely want a thin, portable notebook on the go and a powerful GPU
at home for AI/gaming.

------
jtflynnz
Out of curiosity: there are now enterprise sellers offloading old Tesla cards
relatively cheap (e.g. K40 ~$100-$150, K80/M40 ~$150-$200); are these worth
looking at on a budget versus the 900 or 1600 series cards? Especially given
the memory options.

------
djaque
Is anyone else trying to decide whether they should upgrade from 1080 Tis?

I have a box with four of them which have served me well for a while now, and
based on just raw performance it looks like upgrading to two RTX 3080s would
exceed the performance of my current system.

I'm wondering if I should rush to sell off the cards on the used market before
the prices crash and then use that money to swap over to Ampere.

Then there's also the question of whether there will be an RTX 3080 Ti that
blows away the RTX 3080 and is a viable card for the next five years like the
1080 Ti.

I'm really uncertain about what to do and wonder what calculus other people
have done on this decision.

~~~
p1esk
4x 1080 Ti are probably about as fast as 2x 3080 if you're able to use FP16.
They are most likely faster for FP32. And in any case they provide almost 2x
the memory.

------
spicyramen
Hmm, I don't see any mention of cloud offerings or the T4. Specifically, the
Google Cloud team published a more detailed and comprehensive blog post which
covers more important aspects: inference, training, cost, and time.
[https://cloud.google.com/blog/products/ai-machine-
learning/y...](https://cloud.google.com/blog/products/ai-machine-
learning/your-ml-workloads-cheaper-and-faster-with-the-latest-gpus/)

------
lostdog
Really great to publish these builds and GPU suggestions. Putting together a
system that works can be really frustrating, and knowing where to start is
really helpful.

I've gone with GPU spot instances for my personal experiments. The key is to
be able to bring up a machine in a couple of minutes so you're OK with always
tearing one down. A combination of Ansible and some scripts that push code
around helped a lot to create a useful environment for experimenting.
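
As an illustration of that workflow, a minimal sketch of requesting a GPU spot
instance with boto3 (the AMI ID, instance type, and key name are placeholders,
and the Ansible/code-push part isn't shown):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    # Request a single one-time spot instance with a GPU (placeholder AMI/key).
    resp = ec2.request_spot_instances(
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-0123456789abcdef0",  # hypothetical deep learning AMI
            "InstanceType": "p3.2xlarge",        # single V100
            "KeyName": "my-key",                 # hypothetical key pair
        },
    )
    print(resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"])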

------
SloopJon
One thing I did after the RTX 30-series announcement was a back-of-the-
envelope comparison of performance per dollar and performance per watt, taking
NVIDIA's numbers at face value. The 3070 and 3080 are surprisingly close on
both metrics. You pay a substantial premium for the 3090, but it does have the
best performance per watt.

~~~
oxygenz
It looks like most are thinking the 3080 might be the sweet spot for value?

~~~
acidbaseextract
Definitely, especially from a memory bandwidth perspective:
[https://youtu.be/KpnIx1kLq9w](https://youtu.be/KpnIx1kLq9w)

------
tanilama
For NLP applications, the deciding factor is actually GPU memory, so the
choice is limited (the V100 32GB is the best, and nothing below 16GB is worth
considering).

~~~
p1esk
The Quadro RTX 8000 has 48GB and costs ~$5,300, so it's a much better value than the V100.

------
aborsy
Thanks for the post, it was very good!

It would be great if you could add something about the hardware requirements
of reinforcement learning and video prediction.

------
RedComet
Are any of the bigger frameworks supporting AMD yet?

~~~
figomore
PyTorch has support for ROCm.
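
For what it's worth, ROCm builds of PyTorch expose the usual torch.cuda API, so
existing CUDA-targeted code typically runs unchanged (assuming a ROCm-enabled
PyTorch install and a supported AMD GPU):

    import torch

    # On ROCm builds, HIP is surfaced through the torch.cuda namespace,
    # so "cuda" here actually targets the AMD GPU; otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(2048, 2048, device=device)
    y = x @ x
    print(device, y.shape)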

~~~
mkl
Do you know how performance/$ compares?

------
leoh
Why not just use a cloud offering? The AWS analysis is interesting, but there
are other great offerings like Google's Colab, which offers a free GPU:
[https://colab.research.google.com/notebooks/intro.ipynb](https://colab.research.google.com/notebooks/intro.ipynb).

~~~
pjmlp
Not everyone gladly puts private stuff on other people's computers.

~~~
jesterson
You can remove the word "private": not everyone gladly puts stuff in general on
other people's computers.

~~~
leoh
You just did when you posted this comment

------
Apofis
Answer: whatever your current-generation mid-level Nvidia GeForce is. It's
been this way for a while.

Though you should probably use AWS ML Compute, since they even have Nvidia
Ampere A100s, which cost $10,000 each, and it'll probably be more cost-effective
for heavier workloads.

------
zmmmmm
Would be interested to know how the Tesla T4s fit in, if at all. They seem to
be by far the most "affordable" option if your goal is to get in at the low end
of the data center space. But I'm not at all sure if they represent good value
in terms of $/compute?

~~~
ptheywood
T4s are essentially a 2070 Super with twice the device memory (plus minor
changes to clock speeds to account for power/cooling). ~5x the price, but
suitable for the data centre and larger networks.

------
aqohn123
Does anyone here know of an article comparing GPU architectures for
non-deep-learning workloads with a similar depth to this article, addressing
stuff like memory latencies, cycles, caches, etc.? Kudos to the author of the
above article, I really enjoyed reading it!

------
ngcc_hk
Before I push that button, I have to say that I bought two 1080 Tis based on
this site 3 years ago. I still remember the concern about memory.

Hope it will not ... the 3090 has 24GB, and the price is not as unreachable as
the Titan was then.

~~~
oxygenz
agreed!

------
shmerl
Looks very Nvidia-centric. What about using other GPUs for deep learning?

~~~
nanagojo
CUDA owns the industry. AMD has to step up with a good alternative to disrupt
it. Any serious work is being done in CUDA and there are _lots_ of resources
for CUDA.

~~~
shmerl
You don't need to use CUDA for GPU programming, no? It's simply lock-in. But
I'm sure Nvidia was pushing it quite a lot to make sure it's hard not to use it.
But it clearly has to go. It's not healthy for the industry.

~~~
fluffything
> You don't need to use CUDA for GPU programming, no?

No, you don't _need_ to use CUDA for GPU programming, you can use OpenCL or
Vulkan or probably even PHP instead.

I do, however, _want_ to use the best programming language for the task at
hand. If that task is GPU programming, CUDA is the best language I know for
that, much better than SyCL, OpenCL, Vulkan / OpenGL + shaders, etc.

If these other technologies would be better, I would use them instead.

~~~
shmerl
CUDA can't be the best if it's tied to a single GPU vendor. It's DOA by
definition. This idea of "a language that only works on this hardware" is out
of some dinosaur lock-in handbook from the last century.

~~~
fluffything
CUDA compiles to CPUs, and AMD has support for CUDA via HIP.

Not that this matters because your argument is flawed.

The claim that CUDA is not worth using because it lacks portability only
holds if there is hardware worth using that's not supported by CUDA.

The only GPUs worth buying for compute are from nvidia and support CUDA, so
your claim isn't true.

The only thing you achieve today by not using CUDA is paying a huge price in
development quality for portability that you can't use.

The startup cemetery is filled with companies that made this trade-off and
picked OpenCL just in case they wanted to use non-nvidia hardware. They were
all killed by the velocity of their competitors, who were using CUDA to
deliver better products that paid the bills.

The only people for whom it might make sense to avoid CUDA are
"non-professionals" (hobbyists, etc.). If you only want to use OpenCL to "learn
OpenCL", then OpenCL is the right choice. But if you want to make money, then
CUDA was the right choice 15 years ago and still is the right choice today.

If that makes you angry, direct your anger properly. It isn't NVIDIA's fault
that CUDA is really good. It is, however, AMD's, Intel's, Apple's, Qualcomm's,
ARM's... fault that everything else _sucks hard_. Being angry at nvidia for
delivering good products is just stupid. It's the other companies' fault that
they can't seem to be able to get their sh__* together when it comes to GPU
computing.

~~~
shmerl
_> The only GPUs worth buying for compute are from nvidia_

That sounds like marketing koolaid to me. AMD's GCN was more compute-oriented
than Nvidia's architectures for years, and only lately has AMD increased its
focus on gaming with RDNA.

~~~
fluffything
> That sounds like koolaid marketing to me.

That's a fact: check HPL, MLPerf, Spec, etc. results. MLPerf is the perfect
example, where your results are only accepted if they can be verified by
others. Where is AMD in there? (Nowhere; their products suck for compute.)

> AMD GCN was more compute oriented than Nvidia for years

No, the only thing AMD GCN was good for was as a very expensive stove.

AMD GCN had a lot of compute, on paper, and higher numbers than nvidia GPUs of
the time. Unfortunately, AMD GCN's memory subsystem sucked, and it was
impossible to deliver data fast enough to actually be able to use the compute.

So nvidia's hardware essentially destroyed GCN for any useful practical
application.

IIRC, the only application for which GCN got some use was bitcoin mining,
which avoided hitting GCN's issues because it just requires doing a ton of
useless work on a tiny amount of memory. Perfect for GCN, right? Nope, nvidia's
hardware was still better, but sold out, and GCN wasn't horrible at this, so
it got some use.

AMD actually fired the architect of GCN over this. Yet this still perfectly
summarizes AMD's GPGPU strategy of the last 15 years: higher numbers on paper
that cannot be achieved in practice, and lower than the numbers that nvidia's
hardware achieves in practice.

------
k12sosse
Wish they would make dedicated ML cards so the gamers don't have to fight with
the deepfakers and miners.

------
ngcc_hk
No comment on the direct transfer from SSD to GPU bypassing the CPU ... any
relevance?

------
nsriv
Site down apparently

~~~
PNWChris
Same situation for me; luckily, it looks like archive.org snagged a backup
while it was up!

[https://web.archive.org/web/20200907164516/https://timdettme...](https://web.archive.org/web/20200907164516/https://timdettmers.com/2020/09/07/which-
gpu-for-deep-learning/)

This post looks very thorough, and came just in time for me. I'm looking to
snag an upgrade from my GTX 970 for a mix of flight sim 2020 and digging into
Fast.ai's course part 2.

The 970 has been my big hold-up; right now even simple models take a really
long time to work with.

~~~
dgellow
Have you tried using Google Colab or other online platforms? I started the
course a few days ago and so far Colab works well (I don’t really like Jupyter
but that’s a detail...). It’s free and you have the choice between a CPU, a
GPU, and a TPU.

~~~
PNWChris
Disclosure (since Colab is a Google product): I work at Google, but everything
I say is my personal opinion and experience.

I really dig the overall idea of cloud notebooks. Back when I did fast.ai part
1, I used Paperspace Gradient. It was a pretty good experience, but moving
files around was a bit of a hassle. For example, getting the images for the
Planet Labs exercise took a round trip of downloading from Kaggle to my
computer and re-uploading into Jupyter to do analysis.

Because of all those moving parts, I decided to give running things locally a
try. To my surprise, setup was super easy and I was quickly productive! I
really dig how customizable a local Jupyter server is, too.

I do use Colab; it's particularly great for collaboration/sharing notebooks,
but my past experience has me hooked on the idea of a capable ML machine at
home.

Plus: I can pitch it to myself and my spouse as an investment in personal
development that happens to be able to game :D

~~~
fxtentacle
Go for it! I have my own Jupyter docker image that I run on a server in the
basement. PyCharm can even do code completion for TensorFlow inside a remote
docker container. So it's instant, reproducible, and I don't hear a thing in
my office :)

