The bitcoin miners have figured out one way to handle this, which is by using a variety of PCIe splitting systems. I've seen examples of people putting 8 GPUs in 4 slots with these splitters. The problem is that the majority of these splitters take your x16 connection and turn it into 2 to 4 x1 PCIe lanes, which is a lot of wasted bandwidth. That's fine for the miners, since the cards run mostly independently. If I could find compatible PCIe splitters that could split x16 into two x8 channels, that would be a really sweet spot in performance/$, but unfortunately I've yet to find them. So right now I'm going to stick to 6 GPUs, which you can get on a $500 consumer motherboard with just a few riser cables.
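For a rough sense of what that x16-to-x1 split gives up, here's a back-of-the-envelope calculation (assuming PCIe 3.0 at roughly 985 MB/s usable per lane; actual throughput also depends on the splitter and motherboard):

    # Rough PCIe 3.0 bandwidth per link width.
    # Assumption: ~985 MB/s usable per lane after 128b/130b encoding;
    # splitters and risers can only make this worse.
    PER_LANE_MB_S = 985

    for lanes in (16, 8, 4, 1):
        print(f"x{lanes:<2} link: ~{PER_LANE_MB_S * lanes / 1000:.1f} GB/s each way")

    # x16 link: ~15.8 GB/s each way
    # x8  link: ~7.9 GB/s each way
    # x4  link: ~3.9 GB/s each way
    # x1  link: ~1.0 GB/s each way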
See for example: http://amfeltec.com/products/flexible-x4-pci-express-4-way-s...
The catch is that the 7 slots are right next to each other, so you either have to build a custom water loop with single-slot GPUs or use simple risers on half of them. But that's probably the best bandwidth you can currently get across more than 4 GPUs on consumer parts.
If you don't require a rack mounted server, a cluster of workstations like NVIDIA's DIGITS DevBox is far more cost efficient (and less noisy). I run a compute intensive business (Dreamscopeapp.com) and we opted to build a cluster of desktop-like machines instead of using a rack mounted solution. Another benefit is you don't run into the power issues mentioned in the post.
My start-up actually sells the machine described in this post:
And a machine inspired by the NVIDIA DIGITS DevBox:
That's one of the reasons 3D artists doing GPU rendering generally go for liquid cooling (that, and it makes the cards single slot).
16.5K seems pretty reasonable for 8x 1080ti with a bit of profit for building it, but unreasonable for only 4x 1080ti. My home-built 4x1080ti box (without quite enough PCIe bandwidth, admittedly) is under $6k. I'm assuming/hoping there's an error there. :)
Screenshot of the order form: https://www.dropbox.com/s/2nm00w1rd6du6ey/Screenshot%202017-...
Oh, also - if I want a quote on both the big server and the little workstation I have to enter my contact info twice? Not particularly customer-friendly.
The server we sell is packaged with software we wrote that makes administering it significantly easier. We also provide technical support and even a limited amount of free machine learning consulting. The customers who purchase this server want a headache-free solution and aren't as price sensitive as a lone researcher.
Notice that the custom parts box accounts for two more GPUs; I'm not sure why the site doesn't let you add 4 to the GPU section.
This setup ranges from $5250 with 4 GPUs down to $3240 with 1 GPU. You might want to bump up the PSU for 4 GPUs; it's currently 1500 watts, which may or may not be enough at max load. The article shows a max of ~2800 watts with 8 GPUs.
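As a sanity check on that 1500 W figure, here's a rough power budget (assuming ~250 W per 1080 Ti at stock, ~150 W for the CPU, and ~100 W for everything else; real draw varies and can spike above TDP):

    # Back-of-the-envelope power budget for an N-GPU box.
    # Assumptions: ~250 W per GPU, ~150 W CPU, ~100 W for board/RAM/
    # drives/fans, plus ~20% headroom so the PSU isn't pegged at its limit.
    GPU_W, CPU_W, OTHER_W = 250, 150, 100

    def psu_estimate(num_gpus, headroom=0.2):
        load = num_gpus * GPU_W + CPU_W + OTHER_W
        return load, load * (1 + headroom)

    for n in (1, 4):
        load, rec = psu_estimate(n)
        print(f"{n} GPU(s): ~{load} W at full load, ~{rec:.0f} W PSU recommended")

    # 1 GPU(s): ~500 W at full load, ~600 W PSU recommended
    # 4 GPU(s): ~1250 W at full load, ~1500 W PSU recommended

So with 4 cards the 1500 W unit sits right at the margin, which matches the "may or may not be enough" above.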
Mobo does not support 4x16 PCIe lanes (that's why they didn't want you to add 4 GPU cards).
Mobo is limited to 64GB RAM.
$520 for two 256GB 840 Pro SSDs? Seriously?
Here's a better mobo: https://www.amazon.com/Motherboards-X99-E-WS-USB-3-1/dp/B00X...
Also, you can literally double the RAM for the same money: http://www.ebay.com/itm/128gb-DDR4-8-Crucial-16gb-DDR4-2400m... (keep in mind that speed of RAM is irrelevant for DL tasks).
Dreamscope doesn't actually make money :) It's a little under break even. It brings in revenue through a $9.99/mo premium subscription, which gives customers higher resolution images.
Bottleneck depends on the workload. If you're training a small/fast network, data bandwidth is a real problem.
That being said, for most cases, a workstation build that provides every GPU with 16 lanes is far less cost effective.
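To put some (assumed, illustrative) numbers on that: the amount of decoded input data a single GPU chews through scales directly with how fast the network steps, so small/fast nets are the ones that hit the input pipeline and the PCIe link first.

    # Illustrative input-bandwidth estimate; the samples/sec figures are
    # made-up assumptions, not benchmarks.
    BYTES_PER_SAMPLE = 224 * 224 * 3              # one decoded 224x224 RGB image

    for name, samples_per_sec in (("big slow net", 200), ("small fast net", 3000)):
        mb_s = samples_per_sec * BYTES_PER_SAMPLE / 1e6
        print(f"{name}: ~{mb_s:.0f} MB/s of decoded input per GPU")

    # big slow net:   ~30 MB/s  -> easy to feed
    # small fast net: ~452 MB/s -> multiply by several GPUs and you start
    #                  stressing disk, JPEG decoding, and narrow PCIe links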
It also indicates there might be a market for a specialized 90° connector that can squeeze into tight spaces like that.
I agree with NVIDIA's choice here, but you also raise a valid point; certain cases and configurations would benefit from the added flexibility of that adapter, so there may well be a market.
"At Google, one has relatively unbounded access to GPUs and CPUs. So part of this project was figuring out how to scale the training—because even with these restricted datasets training would take weeks on a single GPU.
The most ideal way to distribute training is Asynchronous SGD. In this setup you start N machines, each independently training the same model and sharing weights at each step. The weights are hosted on separate "parameter servers", which are RPC'd at each step to get the latest values and to send gradient updates. Assuming your data pipeline is good enough, you can increase the number of training steps taken per second roughly linearly by adding workers, since they don't depend on each other. However, as you increase the number of workers, the weights they use become increasingly out-of-date or "stale", due to peer updates. In classification networks, this doesn't seem to be a huge problem; people are able to scale training to dozens of machines. However, PixelCNN seems particularly sensitive to stale gradients—more workers with ASGD provided little benefit.
The other method is Synchronous SGD, in which the workers synchronize at each step, and gradients from each are averaged. This is mathematically the same as SGD. More workers increase the batch size. But Sync SGD allows individual workers to use smaller and faster batch sizes, and thus increase the steps/sec. Sync SGD has its own problems. First, it requires many machines to synchronize often, which inevitably leads to increased idle time. Second, beyond having each machine do batch size 1, you can't increase the steps taken per second by adding machines. Ultimately I found the easiest setup was to provision 8 GPUs on one machine and use Sync SGD—but this still took days to train.
The other way you can take advantage of lots of compute is by doing larger hyperparameter searches. Not sure what batch size to use? Try all of them! I tried hundreds of configurations before arriving at the one we published."
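For readers who want the synchronous version in code form, here's a minimal single-process sketch of the idea (plain NumPy, toy linear model; a real setup would use a framework's distributed all-reduce rather than a Python loop):

    import numpy as np

    # Toy synchronous SGD: each "worker" computes a gradient on its own
    # shard, the gradients are averaged, and one shared weight vector is
    # updated. Illustrative only; workers here run sequentially.
    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])
    w = rng.normal(size=3)                      # shared model weights
    workers, lr = 4, 0.05

    def grad(w, x, y):
        # gradient of the squared error (x.w - y)^2 for a linear model
        return 2 * (x @ w - y) * x

    for step in range(500):
        grads = []
        for _ in range(workers):                # in reality these run in parallel
            x = rng.normal(size=3)              # this worker's (tiny) shard
            y = x @ true_w                      # synthetic target
            grads.append(grad(w, x, y))
        w -= lr * np.mean(grads, axis=0)        # synchronize: average, then step

    print(np.round(w, 3))                       # ~ [ 1. -2.  0.5], the true weights

The asynchronous variant drops the averaging step: each worker pushes its own gradient to the parameter servers as soon as it's computed, which is exactly where the staleness described above comes from.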
Simplified: Training works by taking an input sample (say an image), running it through the network, seeing if your answer is right, then updating the weights.
If you had 4 GPUs, each GPU would process 1/4 of the input images. Then, after they are done, they would all pool their updates and update a global view of the network. Repeat.
Both (a) splitting the batch across GPUs like this (data parallelism) and (b) splitting the model itself across GPUs (model parallelism) have various trade-offs. Some models perform worse with large batch sizes, so (a) is not preferred, and others are hard or impossible to parallelize at the layer level, ruling out (b). Google NMT did (b), though it required many trade-offs and restrictions (see my blog post), while many image-based tasks are happy with large batch sizes so go with (a).
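A toy contrast between the two splits, with NumPy standing in for the GPUs (the device placement lives only in the comments; this just shows which tensor gets divided):

    import numpy as np

    # (a) data parallelism vs (b) model parallelism on a tiny 2-layer net.
    rng = np.random.default_rng(0)
    batch = rng.normal(size=(8, 4))            # 8 samples, 4 features
    W1 = rng.normal(size=(4, 16))              # layer 1 weights
    W2 = rng.normal(size=(16, 2))              # layer 2 weights

    # (a) data parallelism: each "GPU" gets a slice of the batch and a full
    #     copy of the model; results (and, in training, gradients) are merged.
    shards = np.split(batch, 4)                # 4 "GPUs", 2 samples each
    out_a = np.concatenate([np.maximum(s @ W1, 0) @ W2 for s in shards])

    # (b) model parallelism: one "GPU" holds layer 1, another holds layer 2,
    #     and activations are shipped between them every step.
    hidden = np.maximum(batch @ W1, 0)         # "GPU 0"
    out_b = hidden @ W2                        # "GPU 1"

    print(np.allclose(out_a, out_b))           # True: same math, different split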
I'm not saying that securing the block-chain isn't useful in and of itself, I'm just wondering if we could sort of set up the block-chain to swap in/out problems that are "hard to solve easy to verify and also provide other benefits to humanity". Example: say we swap the current proof of work with a protein folding problem instead, and then when we've "folded all the proteins" (or just decide it isn't a useful problem or whatever) in the future, we just revert it back to the current proof of work. Then maybe we find other similar problems and we could swap them in and out as needed.
I'm guessing the current miners are hyper-optimized for whatever the current proof of work is, which would be the main roadblock (outrage at a "wasted" investment in SHA-256-specific machines).
I'm not really up to date on all the tech / politics that would go into a change like that, but I'm curious whether it's technically possible.
If your 8 GPUs cost ~$6k USD, you should be able to build a system for under ~$10k (even ~$8k). Any extra money you spend is more out of a desire to "max out" your specs and less about an actual performance boost.
Similarly, you should probably try a bunch of other frameworks (Caffe2, CNTK, MXNet), as they might be better at handling this non-standard configuration.
Being able to use 44 TOPS for training on a single 1080ti would be pretty awesome.
AFAIK, there's still a bit of a performance gap between just using TF and using the specialized gemmlowp library on Android, but that part's getting cleaned up.
I haven't seen much in the way of generalized results on training with lower precision.