
GPU-Oriented PCIe Expansion Cluster - minxomat
http://amfeltec.com/products/gpu-oriented-cluster/
======
krautsourced
For those doubting the usefulness of these rigs, I can say that for GPU
rendering many of these are in use by small studios and freelancers. The
bottleneck of the slow PCIe connection is nothing compared to the render times
per frame in most cases. It's only when a frame takes less than a second to
render that you notice. They are very popular with e.g. Octane users. The only
thing to be aware of is that some motherboards do not play well with this and
will not recognize all cards, or will not run stably with it attached.

------
varelse
The right way to build this sort of thing is with a hub of 8796 PCIe switches
for each group of 4 GPUs, such that they could all form a continuous 16x PCIe
bidirectional ring suitable for O(n) collectives like
gather/reduce/allGather/allReduce.
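
As a rough illustration of that ring idea, here is a minimal pure-Python
simulation of a ring allreduce (the peer data is made up and no real GPU or
PCIe transfers are involved): each of N peers owns one chunk of the vector,
and after 2*(N-1) neighbour-to-neighbour passes every peer holds the
elementwise sum.

    def ring_allreduce(buffers):
        """buffers: one equal-length list per peer; returns the reduced copies."""
        n = len(buffers)
        chunk = len(buffers[0]) // n          # each peer "owns" one chunk of the vector
        data = [list(b) for b in buffers]

        def seg(idx):
            lo = (idx % n) * chunk
            return lo, lo + chunk

        # Reduce-scatter: after n-1 rounds, peer i holds the full sum of chunk (i+1) % n.
        for step in range(n - 1):
            sends = []                        # snapshot outgoing chunks so the
            for i in range(n):                # "simultaneous" sends don't interfere
                lo, hi = seg(i - step)
                sends.append((i, lo, data[i][lo:hi]))
            for i, lo, payload in sends:
                dst = (i + 1) % n             # send to the next peer on the ring
                for k, v in enumerate(payload):
                    data[dst][lo + k] += v

        # Allgather: pass the fully reduced chunks once more around the ring.
        for step in range(n - 1):
            sends = []
            for i in range(n):
                lo, hi = seg(i + 1 - step)
                sends.append((i, lo, hi, data[i][lo:hi]))
            for i, lo, hi, payload in sends:
                data[(i + 1) % n][lo:hi] = payload

        return data

    peers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    print(ring_allreduce(peers))   # every peer ends up with [1111, 2222, 3333, 4444]

Each peer only ever talks to its ring neighbour, which is what makes the
switch-per-group-of-4 topology attractive: total data moved per peer is
independent of the number of peers.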

This is more or less useless, or if one wishes to be kind, an amazing
emulation of building distributed GPU code over the craptastic bandwidths
brought to you by AWS, Microsoft, and Google datacenters for now.

Also PCIe Gen 2? WTF? This might not even post with current GPUs.

~~~
kobeya
This is for mining or password cracking.

~~~
varelse
Sure, great, but if they could build this to support Deep Learning and
renegade HPC, I suspect they could sell 10,000+ units/year on the down-low.
That would be $10M in revenue, maybe twice that if it also worked with Macs.

NVIDIA _will_ do their _worst_ to shut this down because it's a direct threat
to DGX-1 running neural networks that aren't entirely communication-limited
(long story), but if they could throw this together, I think they could make a
great quick buck before the axe falls.

~~~
minxomat
If you take a look at their clients, it's mostly massive companies that
probably order large lots of OEM solutions. I don't think Amfeltec needs that
scale, or rather, they're probably already at the scale you're talking about
(in margin, not turnover) from the enterprise tax and support for OEM builds
alone.

The 4x splitters go for about $200. But if you want guaranteed compatibility
(i.e. a full build), the price (and margin for them) will skyrocket.

------
angry_octet
Unfortunately this setup would be very slow for most GPGPU applications,
because the CPU-GPU and GPU-GPU bandwidth is very low. To use this
effectively, the data transfer requirements have to be very low.

Even with a dedicated 16x PCIe 3 connection there is a latency overhead
compared to inter-CPU buses like HyperTransport or QPI, which is why NVIDIA
and IBM have scaled up the NVLink inter-GPU bus to become a memory-speed
interconnect.

[https://www.ibm.com/blogs/systems/ibm-power8-cpu-and-nvidia-pascal-gpu-speed-ahead-with-nvlink/](https://www.ibm.com/blogs/systems/ibm-power8-cpu-and-nvidia-pascal-gpu-speed-ahead-with-nvlink/)

------
EthanV2
I'm not as clued up as I used to be about this stuff, but wouldn't this have a
pretty serious impact on the performance of the individual cards? Seems like
splitting 4 16x cards off one 4x bus would limit the available bandwidth
somewhat.
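
For a rough sense of the numbers (using the usual per-lane PCIe rates; the
exact link widths of this product aren't stated in the thread), a
back-of-the-envelope comparison:

    PCIE2_PER_LANE_GBS = 0.5    # PCIe 2.0: 5 GT/s, 8b/10b encoding -> ~500 MB/s per lane per direction
    PCIE3_PER_LANE_GBS = 0.985  # PCIe 3.0: 8 GT/s, 128b/130b encoding -> ~985 MB/s per lane

    uplink = 4 * PCIE2_PER_LANE_GBS     # the shared Gen2 4x uplink: ~2 GB/s
    native = 16 * PCIE3_PER_LANE_GBS    # a native Gen3 16x slot: ~15.8 GB/s

    per_card = uplink / 4               # four cards transferring at once share the uplink
    print(f"~{per_card:.1f} GB/s per card vs ~{native:.1f} GB/s native")  # ~0.5 vs ~15.8 GB/s

So each card gets roughly 30x less host bandwidth than a native Gen3 16x slot,
which only matters to the extent the workload actually moves data over the
link.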

~~~
mschuster91
That depends on whether the workload is data-transfer-bound or compute-bound.

If the former, yes, you will suffer a massive performance hit even with just
one GPU - but if the latter, it's an easy way to upgrade your system.
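
A quick way to see which regime you're in, sketched here with PyTorch
(assuming a CUDA device is present; the sizes are arbitrary): time the
host-to-device copy against the compute it pays for.

    import time
    import torch

    x = torch.randn(4096, 4096)               # ~64 MB of fp32 data on the host

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    xg = x.to("cuda")                          # host -> device, i.e. across the PCIe link
    torch.cuda.synchronize()
    t_copy = time.perf_counter() - t0

    t0 = time.perf_counter()
    y = xg @ xg                                # compute that never leaves the device
    torch.cuda.synchronize()
    t_compute = time.perf_counter() - t0

    # If t_copy dominates, a 4x (or 1x) uplink will hurt badly;
    # if t_compute dominates, the slow link barely matters.
    print(f"copy {t_copy * 1e3:.1f} ms, compute {t_compute * 1e3:.1f} ms")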

~~~
EthanV2
I suppose if you're just working on a data set that's already stored in memory
on the GPU(s), the initial work involved in getting that data to the card
would be impacted, but everything after that benefits from having an absurd
amount of computing power.

------
mrbill
I saw a lot of this when GPUs were being used to mine bitcoin/altcoin.

~~~
DennisP
They're still being used for some blockchains with ASIC-resistant mining,
including Zcash and Ethereum.

------
kierank
If there are any reasonably-priced ways of doing this, especially with more
PCIe lanes, please let me know. In my experience it's often easier to just buy
more motherboards and CPUs than to invest in PCIe expansion.

~~~
angry_octet
Depends on your bisection bandwidth requirement and software cost. If you want
to do MPI it is going to be cheaper to buy some InfiniBand cards. But if you
have a small cluster and tightly coupled code, or large I/O requirements, then
a PCIe switch might be the way to go. These get used in some video-on-demand
systems and signal processing architectures:

[https://www.microsemi.com/products/drivers-interfaces-and-pcie-switches/pcie-switches/pcie-fanout-switches/pm8536-pfx-96xg3](https://www.microsemi.com/products/drivers-interfaces-and-pcie-switches/pcie-switches/pcie-fanout-switches/pm8536-pfx-96xg3)

The Dolphin systems are tuned better for computation:
[http://www.dolphinics.com/products/IXS600.html](http://www.dolphinics.com/products/IXS600.html)

I expect small PCIe NVMe external storage systems to become quite common in
the near future, because enterprise systems need multipath storage for
reliability; 8Gb/s FC is too slow for SSDs, let alone NVMe, and the same goes
for SAS bus expanders.

[https://events.linuxfoundation.org/sites/events/files/slides...](https://events.linuxfoundation.org/sites/events/files/slides/LinuxVault2015_KeithBusch_PCIeMPath.pdf)

~~~
kierank
The Dolphin systems seem more expensive than just buying 4U servers.

~~~
angry_octet
You only buy them when you can't fit your problem into one server (i.e. not
enough memory or I/O) or you need redundancy (multipath I/O, synchronised
memory). In those cases they can be a lot quicker/cheaper/more reliable than
trying to solve it with more computers, protocols and ethernet.

------
chx
This needs a (2014) in the title.

------
westmeal
This is interesting, but would different GPU architectures play nicely
together? E.g. 3 AMD GPUs and one NVIDIA GPU?

~~~
throwawayish
If you think regular GPU drivers are shitty, then wait until you try to
install both AMD and nVidia drivers at the same time. It wasn't very stable,
to say the least (this was on Windows 7). Never tried it since. Not worth the
hassle.

~~~
dragandj
Easy to install both nvidia and AMD and works like a charm for GPGPU on Arch
Linux, though.
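
For GPGPU specifically, mixed vendors mostly just show up as separate OpenCL
platforms. A small sketch with pyopencl (assuming both vendors' OpenCL ICDs
are installed):

    import pyopencl as cl

    # Each vendor's driver registers its own OpenCL platform (e.g. one NVIDIA, one AMD).
    for platform in cl.get_platforms():
        for dev in platform.get_devices(device_type=cl.device_type.GPU):
            print(f"{platform.name}: {dev.name}")
            ctx = cl.Context(devices=[dev])   # one context per device, regardless of vendor
            queue = cl.CommandQueue(ctx)      # kernels for that card get queued here as usual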

------
StavrosK
At those speeds, doesn't the cable/bus length play a huge role? What's the
maximum length that can support them? I'd imagine it'd be a few centimeters at
most...

~~~
detaro
Shorter is obviously better, but from what I've read 30 cm PCIe riser cables
work, and there are 50 cm examples.

~~~
kierank
We have problems with signal integrity on cheap risers and ribbon cables.

