
Which GPUs to Get for Deep Learning - etiam
https://timdettmers.wordpress.com/2014/08/14/which-gpu-for-deep-learning/
======
etiam
"TL;DR advice

 _I work with data sets > 250GB:_ GTX Titan eBay

 _I have no money:_ 3GB GTX 580

 _I do Kaggle:_ GTX 980 or GTX 580

 _I am a researcher:_ 1-4x GTX 980

 _I am a researcher with data sets > 250GB:_ 1-4x GTX Titan eBay

 _I never used deep learning before:_ 3GB GTX 580 eBay

 _I started deep learning and I am serious about it:_ Start with one GTX 580
and buy more GTX 580s as you feel the need for them; buy new Volta GPUs in
2016 Q2/Q3

 _I want to build a GPU cluster:_ This is really complicated, I will write
some advice about this soon, but you can get some ideas here"

~~~
ajtulloch
I don't understand the claim that you should get a large-RAM GPU if you have a
large dataset. It's the model and your training procedure that dictate the
amount of GPU RAM consumed (e.g. FFT convolution kernels like fbcunn use a
decent amount of RAM compared to cuDNN's fused gemm, larger batch sizes mean
more RAM for storing activations, etc.).

~~~
nshepperd
With a larger dataset a larger model is more useful though, since you are less
likely to suffer from overfitting.

------
NhanH
I've been curious about the Xeon Phi. It seems like it would fit the same
purpose as a GPU (especially with the Christmas sale last year, where you could
get a 31S1P for ~$120-170), and it has more memory and a similar FLOPS/price
ratio. Why are they not as competitive as GPUs?

~~~
varelse
Bluntly, it's a great processor on paper, but for the most part, its
performance falls severely short of its paper specs.

While I was really looking forward to Intel shipping a processor competitive
with NVIDIA's GPUs, reality disappointed. Intel put two engineers on porting a
piece of code of mine for a year. That code is roughly 20x faster on a
Maxwell-class GPU than on a Haswell-class CPU (all cores firing). Two man-years
later, instead of the Xeon Phi performing on par with a C1060/GTX 285 as it
initially did, it managed to accelerate the CPU code by ~10%. In that time,
NVIDIA shipped GK104, GK110 and then GM204, accelerating my code by a factor of
2 relative to GK104. When I tried to help one of the Intel engineers catch up
to at least Fermi-class GPUs, he got mad at my algorithmic choices and stormed
out of the room.

Intel IMO will resort to dumping these things at rock bottom prices into data
centers in order to spoof just enough low information government wage HPC
administrators into believing these things are good for anything more than
playing Jenga with them. And that will keep NVIDIA from achieving total HPC
dominance in the short term.

No idea what the long-term will bring. Intel ought to be able to deliver a
compelling part one day, but I suspect that will be a CPU with wider SIMD and
more cores.

------
amelius
I wonder if there will be a time when we can stop calling these things "GPUs",
because right now, with all these new applications, it really doesn't make much
sense anymore. Also, I hope it will become possible to chain more of these
cards together, because currently they hardly fit inside a regular machine
(these cards are quite bulky with all their fans and heatsinks and stuff), and
the PCI connectors are commonly quite closely spaced (and there are few of
them). Adding or removing a single card is no joy. A server-rack unit has even
less space. I'd rather see these cards made into their own (stackable) boxes,
to be placed next to or on top of the main computer.

~~~
foxhill
Using cables for high-speed links (~6 GB/s) is hard, and whilst there are a few
problems that do not require much host/device communication, there are a
significant number that do: any increase in latency or decrease in bandwidth
will be very noticeable in codes that otherwise perform well on accelerators
(GPUs, etc.).
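
As a toy sketch of why (the latency and bandwidth figures here are assumptions
for illustration, not measurements of any real link):

    # Toy transfer-time model: time = latency + size / bandwidth.
    # The latency and bandwidth numbers are illustrative assumptions.
    def transfer_time_us(nbytes, latency_us, bandwidth_gb_s):
        return latency_us + nbytes / (bandwidth_gb_s * 1e9) * 1e6

    payload = 4 * 2**20  # 4 MiB of data shipped host <-> device
    for name, lat_us, bw in [("in-box link (~6 GB/s)", 10.0, 6.0),
                             ("hypothetical external cabled link", 30.0, 3.0)]:
        print("%s: ~%.0f us per transfer" % (name, transfer_time_us(payload, lat_us, bw)))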

------
ceduic
Weird that there's not a single mention of AMD cards.

~~~
timdettmers
Why I did not mention AMD cards: NVIDIA's standard libraries made it very easy
to establish the first deep learning libraries in CUDA, while there are no
such powerful standard libraries for AMD's OpenCL. Right now, there are just
no deep learning libraries for AMD cards -- so you just cannot use AMD cards
for deep learning.

~~~
foxhill
I feel that you should have at least mentioned it; otherwise it feels like the
article implicitly asserts that only NVIDIA GPUs exist, or perhaps even that
you are in some way not impartial.

~~~
timdettmers
Thanks for your feedback. I updated the blog post with a small NVIDIA vs. AMD
section.

------
touristtam
> Another important factor to consider however, is that the Maxwell and Fermi
> architecture (Maxwell 900 series; Fermi 400 and 500 series) are quite a bit
> faster than the Kepler architecture (600 and 700 series);

Where is the citation for this? I have heard about the disappointing benchmarks
when the 600 series launched, but as far as I know it is uncommon in the GPU
market to have sub-par drivers at launch. Plus, if you are following a
tick-tock strategy like Intel does, you should rarely go for the tick
iteration, in my opinion. So is this just a claim from the author, or is there
something to back this statement up?

~~~
timdettmers
There are several cryptocurrency mining benchmarks which are bandwidth-bound;
those benchmarks reflect the deep learning performance of a GPU almost
one-to-one. You can see this, for example, for litecoin mining:
[https://litecoin.info/Mining_hardware_comparison](https://litecoin.info/Mining_hardware_comparison)
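
As a rough sketch of why memory bandwidth is the number to compare for a
bandwidth-bound workload (the bytes-per-pass figure is an assumption; the
bandwidths are approximate published specs):

    # For a bandwidth-bound kernel, runtime ~ bytes moved / memory bandwidth,
    # so relative bandwidth predicts relative speed. Bandwidths are
    # approximate published specs; bytes_per_pass is an assumed workload.
    def bandwidth_bound_ms(nbytes, bandwidth_gb_s):
        return nbytes / (bandwidth_gb_s * 1e9) * 1e3

    bytes_per_pass = 2 * 2**30  # assume ~2 GiB read+written per training pass
    for gpu, bw in [("GTX 580", 192.0), ("GTX 980", 224.0), ("GTX Titan", 288.0)]:
        print("%s: ~%.1f ms per pass" % (gpu, bandwidth_bound_ms(bytes_per_pass, bw)))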

------
shpx
I bought a 970 when they first came out, because this guy said in a comment
that they were probably going to be the card to get.

Completely my fault, but still disappointing.

~~~
jensnockert
It's probably still one of the most cost-effective cards, even if the last
0.5GB of memory is less useful.

~~~
wtallis
Less useful is an understatement: your working set simply cannot touch that
memory region at all without crippling performance.

~~~
dragontamer
That's true of any system. Even a card with a "real" 4GB whose working set
touches a 4.1GB region will be forced to use CPU main memory over the PCIe bus.

So the 0.5GB region is still faster than PCIe, IIRC. Not ideal, of course, but
still better than a bunch of PCIe transfers.
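
A rough back-of-the-envelope comparison (the bandwidth figures are approximate
numbers reported around the 970 launch, used only for illustration):

    # Where 256 MiB of spilled data could live on a GTX 970, and how long one
    # pass over it would take. Bandwidths are approximate reported figures.
    spill = 256 * 2**20  # 256 MiB that did not fit in the fast 3.5 GB segment
    for where, bw_gb_s in [("fast 3.5 GB segment", 196.0),
                           ("slow 0.5 GB segment", 22.0),
                           ("host memory over PCIe 3.0 x16", 12.0)]:
        ms = spill / (bw_gb_s * 1e9) * 1e3
        print("%s: ~%.1f ms to stream 256 MiB" % (where, ms))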

------
yzh
To some extent, NVIDIA GPUs' and CUDA's success comes from the better toolchain
and the recently released cuDNN. Why hasn't AMD released any alternative deep
neural network library yet?

------
zitterbewegung
What about an NVIDIA Tesla, which seems to be made for this kind of thing?

~~~
wtallis
Tesla and GeForce products are made from the same chips. The GeForce products
have several features turned off, and are then sold for a fraction of the
price and often with more aggressive clock speeds. For any given workload, you
have to assess whether you need the extra reliability of the Tesla products
and whether the features disabled in the GeForce matter to the problem at
hand. For many uses, the high-end consumer parts come out _way_ ahead in this
comparison; the Tesla parts are never an automatic win.

