
DeepLearning10: The 8x Nvidia GTX 1080 Ti GPU Monster - EvgeniyZh
https://www.servethehome.com/deeplearning10-the-8x-nvidia-gtx-1080-ti-gpu-monster-part-1/
======
highd
I've been evaluating this space a fair bit recently. If you want to optimize
FLOPS/$, especially for research-workstation sorts of setups, there are
unfortunately not a lot of options for getting more than 6 GPUs on a
motherboard without going to server-grade gear, where you're spending
basically $3-4K more for unclear benefit - maybe a factor of two in
GPU-to-GPU bandwidth.

The bitcoin miners have figured out one way to handle this, which is by using
a variety of PCIe splitting systems. I've seen examples of people putting
8 GPUs in 4 slots with these splitters. The problem is that the majority of
these splitters take your x16 connection and turn it into 2 to 4 x1 PCIe
links, which wastes a lot of bandwidth. This is fine for the miners, since
the cards run mostly independently. If I could find compatible PCIe splitters
that could split x16 into 2 x8 links, that would be a really sweet spot in
performance/$, but unfortunately I've yet to find them. So right now I'm
going to stick with 6 GPUs, which you can get on a $500 consumer motherboard
with just a few riser cables.

See for example: [http://amfeltec.com/products/flexible-x4-pci-express-4-way-s...](http://amfeltec.com/products/flexible-x4-pci-express-4-way-splitter-gpu-oriented/)
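
If it helps anyone evaluating risers and splitters: you can sanity-check the
link width each GPU actually negotiated. A rough sketch, assuming nvidia-smi
is on the PATH (these are standard nvidia-smi query fields):

    # Print negotiated vs. maximum PCIe link width for each GPU.
    import subprocess

    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,name,pcie.link.width.current,pcie.link.width.max",
        "--format=csv,noheader",
    ]).decode()
    for line in out.strip().splitlines():
        idx, name, cur, mx = [f.strip() for f in line.split(",")]
        print(f"GPU {idx} ({name}): running x{cur} of x{mx}")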

~~~
brigade
Not sure if you already knew about them, but there are consumer X99
motherboards that include PLX8747 switches to mux 32 PCIe lanes from the CPU
into 7 PCIe x8 slots, at a premium of maybe $200 for the motherboard and $300
for a compatible CPU. (ASUS X99-E-10G WS)

The catch is that the 7 slots are next to each other, so you either have to
make a custom water loop with single-slot GPUs or use simple risers on half
of them. But that's probably the best bandwidth you can currently get between
>4 GPUs on consumer parts.

~~~
highd
That sounds amazing - do you know of any in particular? I can't seem to find
any motherboards with 8 PCIe slots. Or do they need some sort of additional
splitter?

~~~
brigade
Sorry, I was wrong, it's only 7 slots - the ASUS X99-E-10G WS is the board I
was thinking of.

~~~
highd
Still pretty good - thanks!

------
mippie_moe
The author wasn't joking about the noise levels. This machine sounds like an
F1 race car.

If you don't require a rack-mounted server, a cluster of workstations like
NVIDIA's DIGITS DevBox is far more cost-efficient (and less noisy). I run a
compute-intensive business (Dreamscopeapp.com) and we opted to build a
cluster of desktop-like machines instead of using a rack-mounted solution.
Another benefit is that you don't run into the power issues mentioned in the
post.

My start-up actually sells the machine described in this post:
[https://lambdal.com/nvidia-gpu-server](https://lambdal.com/nvidia-gpu-server)

And a machine inspired by the NVIDIA DIGITS DevBox:
[https://lambdal.com/nvidia-gpu-workstation-devbox](https://lambdal.com/nvidia-gpu-workstation-devbox)

~~~
dgacmu
So - tried your quote form. The options are 4x 1080ti, 4x titan Xp, or 8x P100
--- but no 8x 1080ti? Or is the quote form wrong?

$16.5K seems pretty reasonable for 8x 1080ti with a bit of profit for
building it, but unreasonable for only 4x 1080ti. My home-built 4x 1080ti box
(without quite enough PCIe bandwidth, admittedly) is under $6K. I'm
assuming/hoping there's an error there. :)

Screenshot of the order form:
[https://www.dropbox.com/s/2nm00w1rd6du6ey/Screenshot%202017-...](https://www.dropbox.com/s/2nm00w1rd6du6ey/Screenshot%202017-06-07%2022.17.15.png?raw=1)

Oh, also - if I want a quote on both the big server and the little workstation
I have to enter my contact info twice? Not particularly customer-friendly.

~~~
p1esk
For a quad GPU config you should look at the dev box type option. It's $8,750
for a machine with 4 1080 Tis, 64GB of RAM, and a 1TB SATA SSD. Quite a steep
margin if you ask me, considering a 128GB RAM machine you build yourself
would cost at most $5,700 (taxes included) if you get everything from Amazon,
and probably under $5K if you're willing to shop around a little.

------
astrodust
It seems odd that they insist on putting the power connectors on the top of
the card instead of the back, which would avoid a lot of space-constraint
issues.

It also suggests there might be a market for a specialized 90° connector that
can squeeze into tight spaces like that.

~~~
zeta0134
I'm not sure I agree here; a lot of consumer cases I've worked with are just
barely deep enough to house a card of this size. Putting the power connector
on the end of the card would perhaps let cases be slimmer, but most consumer
cases tend to be generously wide while not having a lot of extra depth, so
wires plugging into the end of the card would compete with the hard drive /
CD-ROM / media card bays.

I agree with Nvidia's choice here, but you also raise a valid point: certain
cases and configurations would benefit from the added flexibility of that
adapter, so there may well be a market.

~~~
astrodust
It can be a squeeze in some of the Mini-ATX cases, but I've never had a
problem with anything bigger. There's always at least four inches to spare.

~~~
seanp2k2
I had an Antec case (Solo or Sonata or something like that) where I had to
cut into the drive bays with tin snips to fit a 1080ti. Modern cases have
more room for GPUs.

------
Drdrdrq
Curious: can one train a single NN over multiple GPUs? Or is this useful
mainly for parallel training of multiple NNs?

~~~
randyrand
Most definitely.

Simplified: Training works by taking an input sample (say an image), running
it through the network, seeing if your answer is right, then updating the
weights.

If you had 4 GPUs, each GPU would process 1/4 of the input images. Then,
after they are done, they would all pool their updates and apply them to a
global view of the network. Repeat.
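
In numpy terms, here's a toy sketch of that synchronous scheme (the model,
data, and learning rate are made up for illustration; real frameworks do this
with far more machinery):

    # Toy synchronous data parallelism: each "GPU" computes a gradient on
    # its own shard of the batch; the gradients are then averaged and
    # applied once to the shared weights. Linear model, squared loss.
    import numpy as np

    rng = np.random.default_rng(0)
    w = np.zeros(3)                              # global (shared) weights
    X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
    lr, n_gpus = 0.1, 4

    for step in range(100):
        grads = []
        for xs, ys in zip(np.array_split(X, n_gpus),
                          np.array_split(y, n_gpus)):
            err = xs @ w - ys                        # forward on one "GPU"
            grads.append(2 * xs.T @ err / len(xs))   # backward on one "GPU"
        w -= lr * np.mean(grads, axis=0)             # pool updates, apply once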

~~~
Smerity
In practice, images are not particularly large and a batch of them would
easily fit on a single GPU. What's more common is either (a) performing the
forward and backward passes on 4 GPUs where each GPU has its own batch, then
collecting the gradients from all 4 backward passes, or (b) splitting the
computation for individual layers across multiple GPUs.

Both (a) and (b) have various trade-offs. Some models perform worse with
large batch sizes, so (a) is not preferred, and others are hard or impossible
to parallelize at the layer level, ruling out (b). Google NMT did (b), though
it required many trade-offs and restrictions (see my blog post[1]), while
many image-based tasks are happy with large batch sizes, so they go with (a).

[1]:
[http://smerity.com/articles/2016/google_nmt_arch.html](http://smerity.com/articles/2016/google_nmt_arch.html)
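
For a concrete picture of (b), here's a hedged PyTorch sketch that splits a
two-layer model across two devices (the layer sizes are arbitrary; assumes
two visible GPUs):

    # Layer-level model parallelism: fc1 lives on cuda:0, fc2 on cuda:1,
    # and the activations hop between devices in the forward pass.
    import torch
    import torch.nn as nn

    class TwoDeviceNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(1024, 4096).to("cuda:0")
            self.fc2 = nn.Linear(4096, 10).to("cuda:1")

        def forward(self, x):
            h = torch.relu(self.fc1(x.to("cuda:0")))
            return self.fc2(h.to("cuda:1"))      # device-to-device copy

    model = TwoDeviceNet()
    out = model(torch.randn(32, 1024))           # output lands on cuda:1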

------
skizm
Random question (which I am sure has been asked before): is there any way to
harness the power of the blockchain to, say, fold proteins? Or something
"useful" like that.

I'm not saying that securing the blockchain isn't useful in and of itself,
I'm just wondering if we could set up the blockchain to swap in/out problems
that are "hard to solve, easy to verify, _and also provide other benefits to
humanity_". Example: say we swap the current proof of work for a protein
folding problem instead, and then when we've "folded all the proteins" (or
just decide it isn't a useful problem, or whatever) in the future, we just
revert to the current proof of work. Then maybe we find other similar
problems and we could swap them in and out as needed.

I'm guessing the current miners are hyper-optimized for whatever the current
proof of work is, which would be the main roadblock (outrage at a "wasted"
investment in SHA-256-specific machines).

I'm not really up to date on all the tech / politics that would go into a
change like that, but I'm curious whether it's technically possible.
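
For concreteness, the "hard to solve, easy to verify" property in the current
proof of work is just hash grinding - a toy Python sketch (the difficulty and
encoding here are arbitrary):

    # Hashcash-style proof of work: finding a nonce takes brute force,
    # but checking a claimed nonce is a single hash.
    import hashlib

    def solve(data: bytes, difficulty: int = 20) -> int:
        target = 2 ** (256 - difficulty)
        nonce = 0
        while True:                          # hard: ~2^difficulty attempts
            h = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(h, "big") < target:
                return nonce
            nonce += 1

    def verify(data: bytes, nonce: int, difficulty: int = 20) -> bool:
        h = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(h, "big") < 2 ** (256 - difficulty)  # easy

    n = solve(b"block header")
    assert verify(b"block header", n)

Any swapped-in problem would need to keep that asymmetry, which is where most
"useful work" candidates run into trouble.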

~~~
AlexCoventry
It's not clear how to build a proof-of-work function that is both flexible
enough to be of practical value and rigid enough to be secure against tasks
designed with hostile intent. Primecoin is the closest I've seen. Gridcoin
and Foldcoin aren't serious from a security perspective.

------
1024core
Slightly OT: why are we still limited to 256GB of memory? Why isn't memory
capacity increasing like Moore's Law?

~~~
jpalomaki
Quick glance to the article "There are 24 DIMM slots and you can use LRDIMMs".
64GB dimms at least seem to be available, some news from 2015 also mentioned
128GB dimm from Samsung.

------
fizixer
I seriously doubt you need to spend more than 100% of the cost of the 8 GPUs
on the rest of the system.

If your 8 GPUs cost ~$6K, you should be able to build the whole system for
under ~$10K (even ~$8K). Any extra money you spend is more out of a desire to
"max out" your specs than for any real performance boost.

------
sabman
Nice write-up, thanks for sharing! We have been building and selling
similarly spec'd boxes in the EU - if anyone is interested, check out
[http://deeplearningbox.com/](http://deeplearningbox.com/). They come
preconfigured with all the major deep learning libs.

------
Cacti
Thanks for posting this. I put together a similar build recently for home and
ran into many of the same issues. I ended up going with a regular MB and 3x
Ti cards, which is enough for what I'm doing now and avoids many of the
problems with bumping out to 4+ cards.

------
bwasti
Do 1080 Tis have fp16 support? Seems like a waste if the model can be trained
in fp16 and you're using full 32-bit.

Similarly, you should probably try a bunch of other frameworks (Caffe2, CNTK,
MXNet) as they might be better at handling this non-standard configuration.

~~~
dharma1
No double-speed fp16 on the 1080 Ti.

~~~
shaklee3
It does, however, have int8 support.

~~~
dharma1
Yep, 4x int8 (44 TOPS) on the 1080 Ti. Is the framework support there for
inference at 4x-speed int8 on the 1080 Ti? How about training - I thought you
need fp16 minimum for training. I've seen some research into lower-precision
training (XNOR) but I'm unsure how mature it is.

Being able to use 44 TOPS for training on a single 1080ti would be pretty
awesome.
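
For what it's worth, the 4x figure comes from Pascal's DP4A instruction - a
four-element int8 dot product accumulated into int32 - so roughly 11 fp32
TFLOPS becomes ~44 int8 TOPS. A numpy sketch of the semantics (not the actual
CUDA intrinsic):

    # DP4A semantics: multiply four int8 pairs, accumulate into an int32.
    # One such op does 4 multiply-adds, hence ~4x the fp32 rate.
    import numpy as np

    def dp4a(a, b, c):
        """a, b: 4-vectors of int8; c: int32 accumulator."""
        return c + int(np.dot(a.astype(np.int32), b.astype(np.int32)))

    a = np.array([1, -2, 3, 4], dtype=np.int8)
    b = np.array([5, 6, -7, 8], dtype=np.int8)
    print(dp4a(a, b, 0))   # 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4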

~~~
dgacmu
Yes - here's a doc about doing quantized inference in TensorFlow, for example:
[https://www.tensorflow.org/performance/quantization](https://www.tensorflow.org/performance/quantization)

AFAIK, there's still a bit of a performance gap between just using TF and
using the specialized gemmlowp library on Android, but that part's getting
cleaned up.

Haven't seen much in generalized results on training using lower precision.
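
For anyone curious, the arithmetic behind that doc is a simple affine map
from float32 to uint8 using a min/max-derived scale. A minimal round-trip
sketch (illustrative only, not TensorFlow's actual API):

    # Affine quantization round trip: float32 -> uint8 -> float32.
    import numpy as np

    def quantize(x):
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / 255.0 or 1.0   # guard against constant input
        q = np.round((x - lo) / scale).astype(np.uint8)
        return q, scale, lo

    def dequantize(q, scale, lo):
        return q.astype(np.float32) * scale + lo

    x = np.random.randn(1000).astype(np.float32)
    q, scale, lo = quantize(x)
    err = np.abs(dequantize(q, scale, lo) - x).max()
    print(f"max round-trip error: {err:.4f}")   # bounded by ~scale/2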

~~~
dharma1
Does that work with Pascal's CUDA 8 int8 out of the box?

~~~
dgacmu
I'm not sure - I believe it depends on getting cuDNN6 working, and from this
bug, I can't quite tell if it works or not (but it's probably not officially
supported yet):
[https://github.com/tensorflow/tensorflow/issues/8828](https://github.com/tensorflow/tensorflow/issues/8828)

