
How to Build and Use a Multi GPU System for Deep Learning - rbanffy
http://timdettmers.wordpress.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/
======
dave_sullivan
The direct-to-network stuff is very cool and useful in this scenario. I should
point out that for those of you experimenting with deep learning, you probably
won't be writing your own code from scratch. There are various open source
libraries (pylearn2, torch, caffe, others) that make things _a lot_ easier
when you're getting started. They still have something of a learning curve,
though.

I should also caution that not all of the libraries work equally well with the
newest or oldest GPUs, so the model of GPU you buy still makes a big
difference. And it should be NVIDIA--the deep learning community has largely
standardized on their hardware, though this is a state of affairs that is
constantly changing.

Pertinent self promotion: my company
([http://www.ersatzlabs.com](http://www.ersatzlabs.com)) provides a cloud GPU
deep learning solution, which I'd argue is an even easier way to get started
with deep learning, particularly in visualization and prototyping phases.

But anyway, if anyone's curious about deep learning and just getting their
feet wet, I'm always happy to talk about it, my email is in my profile.

------
nemonemo
This article contains good tips for building a GPU cluster with RDMA. One
thing I would like to add is that there are two types of GPUDirect, depending
on the CUDA version. Earlier CUDA versions supported GPUDirect staged through
CPU memory, while newer versions support "true" GPUDirect RDMA directly
between the RDMA device and GPU memory. However, some chipsets do not support
"true" GPUDirect very well: two of our old machines showed up to 20x
throughput asymmetry with GPUDirect (that is, send was much slower than recv).
There are several papers that discuss this limitation. Our work, GPUnet[1],
avoided this performance issue with GPUDirect by using fairly recent chipsets,
but you can probably imagine our pain when we saw around 150MB/s of GPUDirect
throughput when 3GB/s was the expected figure.

[1] GPUnet: Networking Abstractions for GPU Programs, OSDI 2014
[https://sites.google.com/site/silbersteinmark/GPUnet](https://sites.google.com/site/silbersteinmark/GPUnet)

------
jeffreyrogers
I'm curious about the author's experience with ML. He mentioned one of the
Kaggle competitions, and from my understanding most of the people doing Kaggle
are using R, Python, or some other language that provides strong support for
ML-type tasks.

I wonder if the author also uses those and CUDA/GPU makes up a relatively
small part of his solutions, or whether it's largely done at such a low level.
It'd also be interesting to see how some of the other people who place highly
in Kaggle competitions do their coding.

~~~
timdettmers
I mainly use Python and sklearn for my initial models in Kaggle competitions.
Once I understand the problem better, I use some of my own deep learning
solutions in Python (built on gnumpy and cudamat). However, sometimes my own
C++/CUDA implementations come in handy, especially if the data set is large.

Other Kaggle competitors who use deep learning mostly rely on libraries like
pylearn2 and Torch7 for their models (which are also built on CUDA/C++).

In general it is not so easy to use deep learning on problems other than
object recognition. So yes, I do not use deep learning in all of my Kaggle
competitions, simply because it is hard to get it to work well. Using several
different simple models and then ensembling them often yields better results
for the time invested.
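
A minimal sketch of that "ensemble of simple models" idea, using sklearn (the
library mentioned above). The data, model choices, and weighting here are
illustrative placeholders, not the author's actual Kaggle pipeline -- the
point is just that averaging the predicted probabilities of a few cheap
models is often a strong, fast baseline:

```python
# Illustrative only: synthetic data and arbitrary model choices,
# not the author's actual competition setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Two simple, fast models rather than one complex one.
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
]

# Ensemble by averaging each model's predicted class probabilities.
proba = np.mean(
    [m.fit(X_tr, y_tr).predict_proba(X_te) for m in models], axis=0
)
pred = proba.argmax(axis=1)
print("ensemble accuracy: %.3f" % accuracy_score(y_te, pred))
```

In practice one would tune the per-model weights (or stack a meta-model on
top), but even a plain average like this is hard to beat for the time spent.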

~~~
jeffreyrogers
Thanks for posting that. It gave me some interesting things to look into.

------
ansible
I was recently surprised to learn that they make server systems with up to
eight PCIe x16 slots.

We were looking at this particular beastie [1] to host some Nvidia Tesla K40s
for some simulation software. It would be a very expensive box, but the sim
software costs a lot more.

[1]
[http://www.supermicro.com/products/system/4U/4027/SYS-4027GR-TRT.cfm](http://www.supermicro.com/products/system/4U/4027/SYS-4027GR-TRT.cfm)

