
A Full Hardware Guide to Deep Learning - etiam
https://timdettmers.wordpress.com/2015/03/09/deep-learning-hardware-guide/
======
benanne
The article recommends getting a 580 as the cheapest, most cost-effective
option. One thing the 580 has going against it is that the cuDNN library does
not support it. Only Kepler and Maxwell cards (600, 700 and 900 series) are
supported. Since many of the popular libraries for deep learning (Theano,
Caffe, Torch7) support using cuDNN as a backend now, I think this is worth
mentioning. For many configurations cuDNN provides some of the fastest
convolution implementations available right now. Even if the 580 is a great
card for CUDA, a more recent model may actually be a better choice in light of
this.

I agree 100% with the 980 recommendation – it's a great card in terms of
performance, power usage and price point.

~~~
timdettmers
Thanks for your feedback. This is an important point and I will update my
blog post accordingly.

------
kyzyl
Overall a good article with some insightful points. One thing strikes me as a
bit off, though. The recommendation for one or two cores per GPU seems not
quite right. Examining only the CPU<->GPU performance, this might be
reasonable. Like the author mentions, you can use the other core to prep the
next mini-batch and all those sorts of tasks. However, training the model is
only one part of the system, and I tend to value the overall performance of
the system more than any one facet.

For example, despite training on GPUs being very computationally intensive, I
find one of the most onerous tasks to be the custom data
prep/transformation/augmentation pipeline. Because these types of things are
usually pretty application specific, there often isn't a ready-made toolkit
that does all the heavy lifting for you (unlike the GPU training, which has
Torch, Caffe, pylearn, cuda-convnet, lasagne, cxxnet...), so you end up having
to roll it yourself. You also end up running this code often, and with large
data that isn't trivial. Usually you won't invest--at least, I don't--in writing
custom CUDA code for this type of thing, if it's even possible, so having lots
of fast CPU cores is a win. I usually write multi-threaded routines for my
processing steps and run them on 8-32 cores for huge gains. So my point is
that "one or two cores per GPU" is a bit of a narrow recommendation.
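The kind of parallel preprocessing described above can be sketched roughly like
this (the `augment` step is a hypothetical stand-in for whatever
application-specific transformation you need; the real work is usually NumPy or
image ops that scale well across cores):

```python
from multiprocessing.pool import ThreadPool

def augment(sample):
    # Hypothetical per-sample transformation standing in for crop/flip/
    # normalize steps; the real step is application specific.
    return [x * 0.5 for x in sample]

def preprocess_all(samples, workers=8):
    # Fan the per-sample work out across worker threads; with 8-32 cores
    # this kind of embarrassingly parallel step scales well when the work
    # releases the GIL (NumPy/PIL ops), or can use processes instead.
    with ThreadPool(workers) as pool:
        return pool.map(augment, samples)

print(preprocess_all([[1.0, 2.0], [3.0, 4.0]], workers=2))
```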

The same applies if you want to do 'real time' data augmentation (this is
hinted at later in the article) and/or if you want to deploy with CPU only.
Sure you need the GPU to do the training in a reasonable amount of time, but
once you've fit your model, it might not be worth it to deploy to GPU-enabled
computers if all you're doing is forward passes.

PS: This is also a place where running on EC2 can be a win. Maybe it's more
economical to build a workstation, but once you're in the cloud you can spin
up a few 32 core boxes to run your preprocessing really quickly, shut them
down and spin up some GPU instances for training, then shut those down and
spin up some mid-tier boxes to run the models through a bunch of data without
breaking the bank. All in 'one place'.

~~~
timdettmers
Thanks for sharing your experience – this is a fair point. Often it is
possible to pre-process your data and save it to disk so that you can skip
this decompression/conversion/transformation step once you start training your
net, but I can imagine applications where this is impractical or just does not
work. I will add a small note to my blog about this.
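That preprocess-once-then-train pattern is simple to set up; a minimal sketch
(the `transform` step and cache path are made up for illustration):

```python
import os
import pickle

def transform(record):
    # Stand-in for an expensive decompression/conversion/transformation.
    return [x * 2 for x in record]

def load_or_preprocess(raw_records, cache_path):
    # Run the expensive step once, save the result to disk, and reload
    # the cached file on every later training run instead of recomputing.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    processed = [transform(r) for r in raw_records]
    with open(cache_path, "wb") as f:
        pickle.dump(processed, f)
    return processed
```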

------
dchichkov
I was using a few different cards a few years back to work on deep learning,
including 2x GTX 580 and a few others. A checklist:

      1. motherboard form factor.
      2. cooling.
      3. power supply.
      4. memory.

It is important to choose a motherboard with the right form factor which would
actually fit your cards physically. The fact that the motherboard has 3x PCIe
x16 slots doesn't necessarily mean that it will fit your two(!) cards. Nothing
is more frustrating than not being able to fit the cards. Cooling cannot be
overstated. Also note that the box makes a lot of noise during operation;
ideally you'd want to put it far away from your workplace. Power supply: note
that the spec you read on the power supply is usually overrated. If you have
4x200W cards, I advise a 2kW PSU. And GPU memory: if you can fit your dataset
in, rather than loading/unloading it in batches, that will save you a lot of
time and effort. Well worth the money.
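The power-supply point can be made concrete with a back-of-the-envelope budget.
This sketch assumes you only want to load the PSU to a fraction of its label
rating (the fraction and the rest-of-system wattage are illustrative guesses,
chosen to reproduce the "4x200W cards -> 2kW PSU" rule of thumb above):

```python
def psu_watts_needed(gpu_watts, num_gpus, rest_of_system_watts=200,
                     usable_fraction=0.5):
    # PSU labels are usually optimistic, and efficiency drops near the
    # limit, so budget to draw only a fraction of the rated wattage.
    # All numbers here are illustrative, not measured.
    peak_draw = gpu_watts * num_gpus + rest_of_system_watts
    return peak_draw / usable_fraction

print(psu_watts_needed(200, 4))  # -> 2000.0
```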

~~~
timdettmers
These are some good points. I heard from another person that he had problems
with the form factor and I will add that to the post tomorrow. I think a 2kW
PSU is overkill, but you are right that more is better for PSUs.

If you want memory, a good option is to wait for the GTX Titan X, which will be
released in the coming weeks: 12 GB RAM and it will be the fastest card by far.
Overall, however, I think the GTX 980 will still be better in many cases – it
is just very cost effective.

------
choppaface
This is cool, but I'm wondering about how the prices compare to EC2 (including
spot prices and depreciation of personal hardware). How much training do you
need to do before EC2 becomes too expensive?

~~~
timdettmers
If you have no desktop PC or no money for a GPU, it might be a better choice
to use an EC2 instance instead of buying the hardware. You pay about $11 a week
for an EC2 instance, which is quite good once you compare it against the
electricity costs that come on top of running a personal computer.

The downside is that you have a slow EC2 GPU with 4 GB RAM. Conv nets that
take 3 weeks on EC2 will take less than 2 weeks on a GTX 980. If you run
large conv nets, the 4 GB can be limiting (for example on ImageNet or
similarly sized data sets).
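To see how quickly 4 GB disappears, here is a rough activation-memory estimate
for a single conv layer, assuming float32 values (the batch and layer sizes are
illustrative, not taken from any specific net):

```python
def activation_mb(batch, maps, height, width, bytes_per_value=4):
    # Memory for one layer's float32 output feature maps. A deep net
    # holds many such layers at once, plus weights and gradients, so
    # the total is a multiple of any single-layer figure.
    return batch * maps * height * width * bytes_per_value / 1024**2

# A 128-image batch through a single 64-map 224x224 conv layer:
print(activation_mb(128, 64, 224, 224))  # -> 1568.0 MB
```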

Another point is that it is more convenient to work on your own desktop, and
you can run multi-GPU nets, which is not possible on EC2 because the
virtualization kills the memory bandwidth between GPUs.

If you think about it, over the long term a personal system will just be more
cost efficient (you can keep a good system for years). So for deep learning
researchers and those who apply deep learning, this is just the most cost
effective option.

An example calculation: You can buy a faster system than an EC2 instance for
roughly $400 (GTX 580 + other parts from eBay). Together with electricity
costs, that's about 1 year's worth of EC2, or 2 years' worth if you use deep
learning sporadically. A high-end deep learning system will be about
$1000-1400, which is about 3 years' worth of EC2. So EC2 makes good sense if
you use deep learning only sporadically and work with small data sets. If you
use deep learning heavily, want a faster system or want to use multiple GPUs,
a personal system will be better.
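That comparison can be written out explicitly. This sketch uses the $11/week
EC2 figure and the system prices from the comment above; the weekly electricity
cost is an illustrative guess, not a measured number:

```python
def breakeven_weeks(system_cost, ec2_weekly=11.0, electricity_weekly=3.0):
    # Weeks of use after which owning beats renting: the personal system
    # pays for itself out of the weekly EC2 fee minus its own running cost.
    return system_cost / (ec2_weekly - electricity_weekly)

print(round(breakeven_weeks(400)))   # cheap GTX 580 build: ~50 weeks (~1 year)
print(round(breakeven_weeks(1200)))  # high-end build: ~150 weeks (~3 years)
```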

~~~
kyzyl
Putting aside the question of hardware+electricity vs. g2.2xlarge service
charges, I think it's worth mentioning that there's a lot more to putting
these models together than just getting the hardware and paying to operate it.
I tend to spend quite a while mucking with configurations, writing data
preprocessing/formatting code, and doing component-wise checking of each piece
of the giant ball of software it inevitably becomes. For these tasks, it can
be a LOT more convenient to be running locally.

As soon as you're dealing with EC2, you have to take on the mental overhead of
making sure that all your configuration persists between restarts (especially
if you're using spot instances!), running start up tasks, mounting EBS and
paying for volumes, etc. and in my experience this all really adds up. That
said, I still do use EC2 for some things. If I have five similar models I want
to run in parallel, it's as simple as spinning up five identical instances.
Also, once I start training new models I can continue to use my workstation
without any slowdowns.

~~~
timdettmers
A very valuable comment – this is an important perspective, thanks! I think I
will add an EC2 section to my blog post.

------
fuchsvomwalde
Awesome article! Totally agree with the author.

------
deeviant
The GTX 980 outperforms the K80?

~~~
akosednar
No, but a K80 isn't really meant for workstations (from what I understand).
Plus, it's $7k while a 980 is sub $1k typically.

~~~
deeviant
Well, first of all, just to clarify, I was asking a question, not making a
statement. I recently ordered an HPC server for my company; we're using Caffe
to train/detect on very large data sets.

I went with the K80, the company we ordered it from charged us $4400 for the
card, so there must be a good amount of markup that can be negotiated out of
it.

I have since read some material comparing the K40 to the 980, giving a slight
edge to the 980, which is surprising considering the price points, but I have
not yet found any good benchmarks/posts about the K80 vs the 980. The K80 is
_not_ just two K40s glued together, as it uses the GK210 Tesla chips rather
than the GK110. The GK210 is a more advanced chip with more cache and better
energy efficiency, but I'm really not too sure how that translates into
real-world performance.

If anybody has any data or perspective on this, I would appreciate it.

~~~
bombita
Just my 2 cents. I've been running my scientific computing code (QM/MM, not
ML) on a cluster using various configs (6xK40, 4xK80, 6xK20, etc.), and the
K80 performance I've seen is quite strange. I've been using CUDA devices
0,1,2,3 in that config, and if I try to use more than one logical GPU, the
scaling is not 1:1, but more like 1:0.6.

The only conclusion I've been able to reach is that the K80 presents itself as
2 different devices (0,1 or 2,3 in that config), but the performance is not 2x
at all. There is quite a lot of PCI bus contention, which badly hurts the
performance of my code (as it is just running many <10ms kernels at a time).
So far, 2xK40 seems to be a better value and performance proposition than
1xK80 on the same bus, but the flops/watt side of that equation greatly favors
the K80.
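One quick way to quantify that kind of result is per-device scaling efficiency;
the timings below are illustrative numbers chosen to match the reported 1:0.6
ratio, not actual K80 benchmarks:

```python
def scaling_efficiency(time_one_device, time_n_devices, n):
    # 1.0 means perfect scaling; lower values indicate contention,
    # e.g. the two K80 halves fighting over the same PCIe bus.
    speedup = time_one_device / time_n_devices
    return speedup / n

# Illustrative: both K80 halves together finish a fixed workload only
# 1.2x faster than one half alone.
print(scaling_efficiency(100.0, 100.0 / 1.2, 2))  # -> ~0.6
```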

