
Which GPU(s) to Get for Deep Learning - laktak
http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/
======
boulos
Disclaimer: I work on Google Cloud.

I saw a lot of "should we use Cloud? No, it's crazy, a GPU only costs $X". The
key is that if you believe GPUs are going to get updated every year, and/or
that the best thing for ML may change (see the TPU and plenty of startups with
custom hardware), then suddenly buying hardware for 24? 36? months isn't as obvious.

We (and AWS and Microsoft) have K80s because Maxwell wasn't a sufficiently
friendly all-around part. We're all going to offer Pascal P100s and, in the
future, V100s. The challenge with buying your own is that P100s are available
now-ish and V100s may be available in less than 12 months.

Buying a P100 or similar part today doesn't mean it won't still be working in
a year, but it does mean that in just N months you'll own a part with much
worse !/$. If you have an accounting team that is spreading your $Xk over
those 36 months, the reality is that you have two options: tell everyone they
have to make do with the old parts ("we're not buying new GPUs until it's been
36 months!") or accept that you're going to get a lot less use out of them.

To be clear, the progress in this space is really impressive. _And_ the same
problem above applies to us (the cloud providers). Despite my obvious bias, if
I were fired today, I'd be renting to do deep learning just based on the
roadmap alone (not to mention the ability to suddenly spin up and down).

Again, Disclosure: I work on Google Cloud and want to sell you things that
train ML models :).

~~~
tensor
The problem is that a month of GPU time on a cluster can buy you the hardware
itself. If you are doing serious deep learning work it's just not cost
effective at this point. If the costs come down by half or more, it may start
looking viable for people who need a lot of resources.

~~~
boulos
Can you explain your math? A K80 on GCE is $.7/hr x 730 => $511/month if you
were really 24x7. A K80 (and really we sell them by the die not the board) is
more than $1000.
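
For concreteness, here's a rough rent-vs-buy sketch with the numbers above (a
back-of-the-envelope estimate, not an official comparison; the card price and
the utilization figure are my assumptions):

    hourly_rate = 0.70   # $/hr for one K80 die on GCE, as above
    card_price = 1000.0  # rough price of one K80 die; full Tesla boards cost more
    utilization = 0.5    # assumed fraction of the month the GPU is actually busy

    monthly_rent = hourly_rate * 730 * utilization  # ~$256/month
    print(card_price / monthly_rent)                # ~4 months of rent to match the card price
    # Ignores the host machine, power, cooling, and depreciation of the owned card.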

I don't disagree that a consumer board is about that price, but they're not
apples to apples. (Either in memory size, reliability or both). I'm fine with
that being the real complaint: (major) cloud providers only sell the Tesla
class boards, and they're really expensive ;).

~~~
marmaduke
> but they're not apples to apples

It would be nice to see Nvidia or someone expand on this, so that users who
have to make this choice can do so without guessing. If Google or AWS or M$
could publish reliability information, that'd be cool too.

Illustrative case: I run Monte Carlo work on GPUs and administer a local
compute cluster. I tested a workload on a 16 GB P100 and a GTX 1080. A 12 GB
P100 costs (academic, EU) 5000 euros while the GTX 1080 costs 700 euros, but
the performance difference is about 2x. Still, when we ask Nvidia reps, they
say not to bother installing GTX cards in our cluster, because they aren't
designed for 24/7 work, aren't meant for commercial deployment, etc. Even so,
the GTX would have to burn out 3 times before the choice of P100 breaks even.
Burning out three times means GTX 1080, then 1180, 1280 etc.
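
To make the break-even point concrete, a tiny sketch with the numbers from
this comment (academic EU prices and the ~2x speed difference, both rough):

    p100_price, gtx_price = 5000.0, 700.0  # euros, academic pricing as above
    p100_speedup = 2.0                     # P100 roughly 2x faster on our workload

    # euros per "GTX-equivalent unit of throughput"
    p100_cost_per_perf = p100_price / p100_speedup  # 2500
    gtx_cost_per_perf = gtx_price                   # 700

    # how many burned-out GTX cards you could replace before matching the P100
    print(p100_cost_per_perf / gtx_cost_per_perf)   # ~3.6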

~~~
technics256
Thank you for the answer. When you say performance difference is 2x, I presume
the P100 is 2x faster than the 1080 in training epoch time?

~~~
marmaduke
yep

------
zeptomu
There is an interesting offer by Hetzner that's been available for some months now.

They provide a dedicated server with a GTX 1080 for ~€99/month (~$111/month)
with an adequate CPU (i7-6700), 64 GB of memory, 500 GB of disk space and 50 TB
of traffic. There are also on-demand offerings from GCP and AWS, but I do not
think they can match the Hetzner offer:
[https://www.hetzner.de/us/hosting/produkte_rootserver/ex51ss...](https://www.hetzner.de/us/hosting/produkte_rootserver/ex51ssd-gpu).
Keep in mind that I am talking about R&D, in particular training of networks,
which has lower availability expectations than e.g. inference later on.

Disclosure: I am _not_ affiliated with Hetzner and have not tested it yet, but
I have a ~€40/month dedicated machine there for occasional number crunching and
everything has worked so far (no availability or hardware issues).

Furthermore, I am not sure about the exact specifications (GPU memory size) or
whether there are different types of 1080s that differ significantly in deep
learning performance (see also: [https://github.com/jcjohnson/cnn-benchmarks](https://github.com/jcjohnson/cnn-benchmarks)).

~~~
throwaway32131
The electricity will likely cost a significant chunk of that, if one were
running it full time in Germany.

0.5 kW * 24 h * 30 days * (30 c/kWh) ≈ $108

However, it'd be about a third of that in the US (it may be cheaper in WA).

~~~
dahauns
Industrial consumers in Germany pay nowhere near 30 c/kWh, more like 10-15 c
(depending on size and exemptions).

------
AlphaSite
Interestingly, AMD is claiming that their upcoming Vega Frontier Edition chips
outperform Nvidia's current P100 deep learning chip.

[http://hexus.net/media/uploaded/2017/5/30f5633b-1bbf-49b7-9f...](http://hexus.net/media/uploaded/2017/5/30f5633b-1bbf-49b7-9f31-7890e20db28f.jpg)

~~~
sp332
I think signs are pointing to Vega being an HPC beast. But what they really
need is the software ecosystem and support, and so far that hasn't been there.
So while the new hardware looks cool, I'm really waiting for an announcement
that OpenCL tooling got a lot better, or that CUDA is getting first-party
support from AMD, anything to tempt those customers away from Nvidia.

~~~
dragandj
CUDA is getting support from AMD in the form of HIP. In my opinion, the
problem is different: even if AMD supported CUDA proper, that still wouldn't
get them anywhere, because the reason people use Nvidia is not so much CUDA
itself (although it has good tools) but Nvidia's proprietary libraries: cuBLAS,
cuDNN, cuFFT etc.

If AMD provided good implementations of similar libraries with OpenCL and
shipped them with their drivers (similarly to Nvidia's CUDA toolkit), that
would be much better than supporting CUDA but leaving the ecosystem bare-bones
again...

~~~
gbrown
[https://github.com/clMathLibraries](https://github.com/clMathLibraries)

~~~
dragandj
There are even better independent open-source equivalents for some of those,
but that is not the point. Good luck compiling some of AMD's clMath libraries
without issues. Then, you're "just" left with packaging them properly.
Performance is also not stellar compared to Nvidia. Meanwhile, when you
install Nvidia's toolkit, you're set with everything.

------
ktta
If anyone here doesn't want to spend money on a $500+ GPU (or the $1k+ ones!),
then I'd suggest getting the lowest-tier Nvidia GPU for ~$100 [1].

If that GPU is a real bottleneck for you, then you're much better off spending
money on GCP/AWS's GPU offerings. That's because consumer GPUs get superseded
every year and the online offerings' prices will only go down.

So you can spend 10% of that $1k every year and keep getting a better
compute-per-dollar return.

[1]:[https://www.newegg.com/Product/Product.aspx?Item=N82E1681448...](https://www.newegg.com/Product/Product.aspx?Item=N82E16814487296)

~~~
solomatov
>If that GPU is a real bottleneck for you, then you're much better off
spending money on GCP/AWS's GPU offerings. That's because consumer GPUs get
superseded every year and the online offerings' prices will only go down.

GCP and AWS have old GPUs and they are really, really expensive. If you expect
to run workloads for a long time, it would be more cost-efficient to buy your
own hardware.

~~~
ktta
There are very, very few cases where getting an expensive GPU - anything more
than an Nvidia 1060 - makes financial sense.

The 1050 is a beginner card and is perfectly fine for learning and running
small nets. More importantly, you can decide whether machine learning is for
you. Then comes the second investment, which is actually running real-world
models.

Although the online GPU offerings are expensive (you can also look around for
cloud GPUs with lower SLAs for a lot cheaper), you'd be using them for a lot
less time.

Even if you go with the big three, you can get good pricing if you look
around. AWS has a single-GPU instance with 1,536 CUDA cores and 4 GB of RAM.
At the spot price[1] of $0.25 an hour, you can get 3,000 hours of compute for
$750, which is the price of a 1080 Ti (the most cost-effective card on the
market today).

Now, I would say 3,000 hours is more than enough time for most people to run
whatever they want. You can get a lot more if you go with a service other than
the big three.

If your usage exceeds that, then you are probably using it for commercial
purposes, in which case you need to account for a lot more variables (downtime,
maintenance, etc.). Then you should also consider the depreciating perf/$ your
GPU gives you every year compared to the new ones on the market - especially
since the perf jump is a lot bigger than what we're seeing with CPUs.

If you are doing it for learning and still exceed that, I'd say you're doing
something wrong, because you shouldn't need that much power. Also, consider the
electricity costs and the possible extra wear on your computer.

[1]:[https://aws.amazon.com/ec2/spot/pricing/](https://aws.amazon.com/ec2/spot/pricing/)
(check US East N. Virginia)

~~~
solomatov
There are two problems with your approach:

* Setting up a machine on AWS is more complicated than doing it locally, and requires some admin skills.

* If you use spot instances, you need to handle checkpointing, which requires persistent storage, and all of this requires even more admin skills.

The goal of a person who starts working with deep learning is to learn deep
learning, not how to set up machines, manage them, work with checkpoints, etc.

Also, don't forget that there's a large market for used GPUs and you can get
real bargains.

~~~
ktta
>Setting up a machine on AWS is more complicated than doing it locally, and
requires some admin skills.

There are tons of guides online where you can learn how to do so in <5 minutes.
Setting up the computer yourself is more complicated, I would say, and would
take a first-timer days rather than hours.

>If you use spot instances, you need to handle checkpointing

Again, not a big deal to learn.

>The goal of a person who starts working with deep learning is to learn deep
learning, not how to set up machines, manage them, work with checkpoints, etc.

I mean, if you're buying a GPU and setting it up, you're more than likely
assembling the computer yourself. You'll also have to _maintain_ it properly.
Then you'll have to look for the correct drivers and other software, which can
get frustrating (it did for me).

On the other hand, I could just use a step-by-step process for the AWS
instances, since they had a few specific types of GPUs and I didn't even have
to think, just copy/paste the commands from the webpage to the terminal. There
are even AMIs which set up everything for you, which would require _even_ less
effort, but I don't trust them so I go with a clean disk.

Moreover, learning how to use AWS is a much more valuable skill than putting
together computers, so it's time well invested, I would say.

------
atarian
Google cached version:

[http://webcache.googleusercontent.com/search?q=cache:04az_uB...](http://webcache.googleusercontent.com/search?q=cache:04az_uBF27EJ:timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/)

~~~
godelski
Thank you, I was getting Error 500 - Hug Of Death

------
tanilama
My recommendation: GTX 1080 Ti or GTX 1080.

There is no reason not to get a Pascal GPU at this point; the performance is
simply superior. For a lot of models we are talking about days of training
time, so a 20%-30% time saving is significant, and that's before considering
that the larger VRAM allows bigger batch sizes, which you won't get on
mid/low-end GPUs.

------
jacquesm
tldr: GTX 1080 Ti

They're fairly easy to get now too; in the beginning it was rather hard to get
them.

~~~
astrodust
You can get an 8GB GTX 1080 for 2/3rds the price and it offers 80% of the
performance. If you don't need the 11GB of memory, it's a steal. If you're
building a rig with multiple cards, 1080 might be an economical way to get
great performance. If you're looking for maximal compute density, 1080 Ti is
the way to go.

~~~
ex3ndr
It will also require much more power and cooling, which will raise the cost
and reduce the stability of the system.

~~~
astrodust
The power consumption of a 1080 is actually lower than that of a 780 or 980,
and a 1080 Ti only uses about 50 W more when pinned, so I'm not sure where
you're coming from here.

------
dkobran
GTX drivers become crippled when the card detects the presence of a virtual
environment. This means you can't run GTX cards in the cloud; otherwise, cloud
GPU prices would be much lower. Without the availability of GTX, we've been
trying just about everything at Paperspace to bring prices down and make the
cloud a viable option for GPU compute. The argument being, there are real
benefits to running in the cloud, like on-demand scalability, lack of upkeep,
minimal upfront costs, and of course, running a production application :)
There are other indirect cost savings, e.g. power consumption, which can be
quite significant when training models for long periods of time. Would love to
see a total cost of ownership figure added to this post.

~~~
colejohnson66
Citation please for the crippled drivers?

~~~
dkobran
The citation is we're building a GPU cloud and have tested almost every GPU in
existence :) Just kidding, here are a few examples:

[http://vfio.blogspot.com.au/2014/08/vfiovga-
faq.html](http://vfio.blogspot.com.au/2014/08/vfiovga-faq.html)
[https://www.reddit.com/r/linux/comments/2twq7q/nvidia_appare...](https://www.reddit.com/r/linux/comments/2twq7q/nvidia_apparently_pulls_some_shady_shit_to_do/)
[https://www.redhat.com/archives/libvirt-
users/2014-October/m...](https://www.redhat.com/archives/libvirt-
users/2014-October/msg00029.html)

I just quickly googled this so there are probably better sources. Some of
these are old but I can tell you firsthand that this is still the case.

There are workarounds for certain hypervisors (KVM mainly) but it's very
unlikely that this would be deployed in a production environment.

------
Sephr
Here's a much more cost effective answer: Use
[https://cloud.google.com/tpu/](https://cloud.google.com/tpu/)

Unless you have an unlimited supply of free electricity and don't care about
the increased hardware management overhead, it's a waste of money to buy
Pascal GPUs for large-scale deep learning.

The following cards have much more optimized deep learning silicon and are
publicly available /right now/:

\- Nvidia Tesla V100 (Tensor cores only: 120 TFLOPS FP16)

\- Google TPU2 (180 TFLOPS FP16)

Additionally, Intel Xeon chips with the Nervana deep learning accelerator
built-in will probably be available early next year.

If you must control the physical hardware yourself and can't use cloud
services, go buy Tesla V100s or wait for the Nervana Xeons.

~~~
KaoruAoiShiho
That makes no sense, as the TPU2 is not really out yet for actual consumers.
AFAIK it's only in closed alpha right now, so if you're actually doing stuff
right now it's not a real option. There's also no pricing, so the "cost
effective" claim can't really be evaluated.

~~~
netheril96
If and when it is for sale, I wonder how many problems with the hardware and
drivers customers will face over the years.

------
jonathanpoulter
What impact does Google's Tensor Processing Unit have on the answer to this
question?

~~~
throwaway91111
Well, AFAIK, it isn't for sale; you have to go through their cloud offering.
So it has no impact at all on which GPU to get.

~~~
ktta
I mean, it can have an effect if the answer turns out to be not to get a GPU
at all and use the cloud service instead.

------
peeb
Mirror:
[https://web.archive.org/web/20170522190506/http://timdettmer...](https://web.archive.org/web/20170522190506/http://timdettmers.com/2017/04/09/which-
gpu-for-deep-learning/)

------
KerrickStaley
Question: if I'm learning about neural networks and want to e.g. train a
network to recognize MNIST digits, do I need a discrete graphics card
(probably attached to a VPS that I would rent)? Or can I use the i5 Kaby Lake
(which has an integrated GPU) in my laptop to train my network?

~~~
marvy
As other people have said, for MNIST, CPU is fine. One core. You can get over
90% accuracy with a network with just 1 hidden layer with only a few seconds
of training. Or something like that. If you've never played with MNIST before,
it's kind of amazing how easy it is. For instance, the following idea "works",
in that you get results that are pretty bad, but much better than chance.
(Even more than 50% right, I think.) Suppose you have 10 arrays:

    zeros: each element of this array is a picture of a "0"
    ones:  each element is a picture of a "1"
    ...
    nines: each element is a "9"

Compute the average of each array: the "average" 0, the average 1, and so on.
Then classify new digits based on which average element they are closest to,
using Euclidean distance. This sounds way too dumb to do any good, but the
MNIST digits are normalized so well that this actually does something.

Even better: you can have a neural "network" that has zero hidden layers. This
actually achieves almost respectable performance, believe it or not.
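
In case anyone wants to try it, here's a minimal numpy sketch of that "average
digit" idea (loading MNIST is left out; x_train/x_test are assumed to be
flattened float images and y_train/y_test integer labels):

    import numpy as np

    def fit_means(x_train, y_train):
        # one "average image" per digit class 0-9
        return np.stack([x_train[y_train == d].mean(axis=0) for d in range(10)])

    def predict(means, x):
        # label each image with the class whose mean is closest (Euclidean distance)
        dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
        return dists.argmin(axis=1)

    means = fit_means(x_train, y_train)
    accuracy = (predict(means, x_test) == y_test).mean()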

~~~
huac
A 0 hidden layer NN is just a linear (or logistic) regression.

On MNIST I think you get something like 60% accuracy.

~~~
marvy
Yes, it is just logistic regression. I just dug up my old code from a few
weeks (months?) ago, and it seems that somewhere around 90% is not too hard.
60% is way too low, unless I'm measuring wrong somehow.
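
For reference, a minimal scikit-learn sketch of the zero-hidden-layer
(multinomial logistic regression) baseline; with a recent scikit-learn it
typically lands in the low 90s, in line with the above:

    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # downloads the 70k-image MNIST set from OpenML on first use
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0  # scale pixels to [0, 1]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10000, random_state=0)
    clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)  # softmax, no hidden layers
    print(clf.score(X_te, y_te))  # typically ~0.92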

------
sabalaba
Short answer is that you can save weeks of work putting together a machine by
just buying from Lambda (we of course use 1080 Tis):

[https://lambdal.com/deep-learning-devbox](https://lambdal.com/deep-learning-
devbox)

On average it takes a SWE or DL engineer a few days to set up a unit from
scratch. Your company probably burns over $2,000/day so every day your DL
engineer or SWE isn't up and running costs you money.

------
ericfrederich
I'm guessing you don't... you use the cloud, right?

~~~
antognini
For a hobbyist, it probably makes sense to just use some GPU instances on AWS.
But it doesn't really take that long for buying a GPU to become cost-effective:
only a couple of months of full-time use. Even if you're building the entire
machine from scratch it takes under a year. (At least, that was my estimate
when I was building a rig earlier this year.) If you're doing any serious deep
learning projects it's pretty easy to have a model training nearly 100% of the
time.

~~~
nightski
Not to mention, I can imagine it's frustrating uploading many gigs of data to
Amazon.

~~~
zeptomu
It is not that bad. Either you have your training data locally (unlikely) or
it is already available in the "cloud" (i.e. on a public-facing service). Let's
say a typical training set (raw data) is 100 GB.

Assuming 2.5 MB/s of upload capacity with local data, you've uploaded it to
your deep learning machine in half a day (100,000 / 2.5 / 3600 ≈ 11 hours) -
which is not that much, as most of your time will be spent on development and
fine-tuning of your deep learning tool chain anyway. In most cases the data is
accessible via a public-facing service, and assuming 1 Gbit/s of bandwidth
you've downloaded 100 GB in about 13 minutes (100,000 / 125 / 60 ≈ 13 minutes).
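
The same arithmetic, parameterized so you can plug in your own numbers (the
100 GB / 2.5 MB/s / 1 Gbit/s figures are just the assumptions above):

    dataset_mb = 100 * 1000      # ~100 GB of raw training data
    home_upload_mb_s = 2.5       # typical home upload speed
    cloud_mb_s = 125.0           # 1 Gbit/s between cloud services

    print(dataset_mb / home_upload_mb_s / 3600)  # ~11 hours from home
    print(dataset_mb / cloud_mb_s / 60)          # ~13 minutes cloud-to-cloud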

~~~
chronic940
You forgot the AWS Internet egress charge per GB..

~~~
zeptomu
What do you mean?

I know egress is one of the more expensive cloud services (e.g. compared to
compute and storage) at AWS, GCP, etc., but if I upload data to my learning
system, that's ingress AFAIK, which is mostly free or less expensive. Btw.
current egress is about $0.10/GB, so 100 GB ≈ $10.

Don't get me wrong, I am not saying you should always train in the cloud, but I
do not think slow upload or ingress is the limiting factor.

------
fest
I'm wondering about two things:

1\. Can laptops with, say, Nvidia 1070 or 1080 GPUs keep them cool at their
stock frequencies for a few hours? My work laptop with a non-U i7 (Thinkpad
T440p) starts thermal throttling in just a few minutes when I compile something
large.

2\. Wouldn't two Nvidia 1060s or 1070s outperform a single 1080 for training,
assuming the batch size is kept low enough that each batch fits in a single
card's memory?

~~~
mmusc
1\. Depends on the laptop really. Some have better cooling than others. I'm
training models for a few hours every day with an Asus ROG that has a 1060, and
I'm not experiencing any throttling.

------
ScottBurson
Site is slammed, but I think this is from a couple years ago. -- Ah, it's been
updated.

Is there anything better now than the GTX 1080 Ti?

~~~
Analemma_
The Titan Xp is better, but it's not so much better that it's worth almost
twice the price.

~~~
paulsutter
Two 1080 Tis are way better than one Titan Xp and cost about the same (2x$700
vs $1200). Each Ti has 11 GB of RAM vs the Titan Xp's 12 GB, and each Ti is
nearly as fast.

At the NVIDIA conference all the second-tier hosting companies were promoting
the P100 (at NVIDIA's insistence) but, when pressed, admitted that their big
customers now deploy 1080 Tis. Paying for P100s is sort of a clown move even if
you're spending someone else's money. The P100 starts at like 10x the price of
the 1080 Ti, isn't much faster, and again has just 12 GB of RAM (16 GB if you
pay $4,000 more, which is ludicrous).

~~~
mattnewton
Usually, but the caveat is that they are only better for workloads that are
easily parallelizable (like hyperparameter search). Multi-GPU models are still
very complicated in my experience. And the Xp is a bit faster and has 1 GB more
memory, which is useful in some edge cases (like large or 3D convnets).

Also, if you do build with multiple GPUs, make sure you can give both cards a
full 16 PCIe lanes, which is not a given on a lot of motherboards.

------
DrNuke
The GTX 1070 8 GB has the best performance/cost as of May 2017, and a good
laptop costs about as much as an equivalent desktop, at below $2k.

------
ThePhysicist
I recently bought an Alienware 13 R3 for deep learning (okay, and gaming...)
and I'm quite satisfied so far. In addition to the built-in GTX 1060 with 6 GB
of GDDR5, it's possible to get an external graphics accelerator for additional
performance as well (I haven't tried this yet though).

------
throwaway32131
The trouble with getting Pascal now is that the card may end up depreciating
very, very sharply in a year, when Volta with its tensor cores comes out.

------
lazylizard
[http://ambermd.org/gpus/benchmarks.htm](http://ambermd.org/gpus/benchmarks.htm)

------
locusm
How hard does the GPU work on these tasks? Is it Bitcoin mining level power
consumption?

~~~
jacquesm
All out. This is typical:

    
    
      nvidia-smi
      Tue May 23 01:41:58 2017       
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |===============================+======================+======================|
      |   0  Graphics Device     Off  | 0000:01:00.0      On |                  N/A |
      | 57%   84C    P2   257W / 250W |  11010MiB / 11171MiB |     95%      Default |
      +-------------------------------+----------------------+----------------------+
    

That's during training: power pinned at 257 W against the 250 W cap, 95% GPU
utilization, and nearly all of the 11 GB of memory in use. I'm running a
minimal desktop to keep as much GPU memory as possible free for the minibatch
(and it still isn't enough, but it will have to do).

------
sgt101
No mention of K80s, K40s...

~~~
chronic940
K80s and K40s have a K. That's two generations old. We are currently on P and
about to be on V. If you think K80s are good, you are far, _far_ behind the
times in machine learning. A single 1080 Ti, even in a 4U server, outperforms
two K80s.

~~~
arnon
If you're running a rackmount server, you need the Tesla series.

We found that the GeForces tend to burn out under heavy load, whereas we've
never had a single Tesla-series card burn out.

~~~
maksimum
Ooh ooh tell us more. Which GeForces and which server chassis? Adequate power
supply?

~~~
sgt101
Yes - more details please.

We've been using Titan-X and 1080 Maxwells in some Broadberry 4u chassis for
the last year/18mths and we've had no burnouts so far.

I'm buying replacement pascals, and I can't justify teslas when I can get 8
geforces for the price of 1...

~~~
arnon
Dell R720 / R730s, dual GPU (typically K40m or K80) with 1100 W dual redundant
PSUs and the GPU enablement kit. We also set our fans to a constant 70%
minimum, to keep the airflow good.

On some servers we introduced the GTX 1080, either alongside a K40/K80 or two
per chassis (see [http://arnon.dk/how-does-the-nvidia-gtx-1080-stack-up-
agains...](http://arnon.dk/how-does-the-nvidia-gtx-1080-stack-up-against-the-
nvidia-tesla-k40/)).

They actually work about 15% faster on average compared to the Tesla K series
(remember, it's a 5-year-old card), but they just stop working after a few
months, or return inconsistent results for some operations.

Now, we're not doing graphics with them. We have a GPU database called SQream
DB - and we depend on the results to be correct. In the end, they didn't make
a lot of sense for us to deploy in a production environment, so back to the
Tesla series we went.

------
strin
The link is not working for me now.

