
The System Bottleneck Shifts to PCI-Express - rbanffy
https://www.nextplatform.com/2017/07/14/system-bottleneck-shifts-pci-express/
======
faragon
The bottleneck is not PCI-Express, not even the 3.0 version. The bottleneck is
shipping processors with just 16 PCIe 3.0 lanes (!). Things are changing with
the new high-end desktop processors with 64 PCIe 3.0 lanes (e.g. AMD
Threadripper [1]), which is massive: ~128 GB/s bidirectional, and hardly a
bottleneck. The 4.0/5.0 versions will allow reducing costs by requiring fewer
lanes, and thus fewer pins, making chips cheaper to produce.

[1] http://www.tomshardware.com/news/amd-threadripper-vega-pcie-lanes,34581.html
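
For reference, the back-of-the-envelope arithmetic behind those numbers, as a
small Python sketch (per-direction rates from the PCIe line rates and 128b/130b
encoding; real links lose a few more percent to protocol overhead):

    # Rough PCIe throughput per direction, ignoring packet/protocol overhead
    GEN_RATE_GT_S = {"3.0": 8, "4.0": 16, "5.0": 32}  # line rate per lane

    def gb_per_s(gen: str, lanes: int) -> float:
        # 128b/130b encoding: 128 payload bits per 130 bits on the wire
        return GEN_RATE_GT_S[gen] * (128 / 130) / 8 * lanes

    print(gb_per_s("3.0", 64))  # ~63 GB/s each way (~126 GB/s bidirectional)
    print(gb_per_s("4.0", 32))  # the same throughput from half the lanes/pins
    print(gb_per_s("5.0", 16))  # ...or from a quarter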

~~~
shaklee3
The article is directed at the HPC/server market. Xeons have had 40 PCIe lanes
each for a long time. PCIe bandwidth on desktops is only a problem for a very
small number of people.

~~~
faragon
They could make 32-lane PCI-E slots, then.

------
neilmovva
Taking a more skeptical look at the issue, we might have Intel to blame for
the stagnant system interconnect.

As it gets harder each year to improve CPU performance, the HPC community has
shifted to more workload-specific accelerators, most notably GPUs for high-
throughput parallel data processing. These accelerators still rely on the host
CPU to dispatch commands, but in recent years workloads have increasingly
targeted the accelerator itself (e.g. neural networks on the GPU).

If one wants to build a multi-GPU cluster, then the CPU quite plainly "gets in
the way": performance can very quickly be bottlenecked by weak inter-device
bandwidth (PCIe 3.0 x16 = ~16 GB/s, vs. the GTX 1080 Ti's ~500 GB/s of onboard
DRAM bandwidth). Not to mention that the PCIe controller is on the CPU die,
meaning inter-node bandwidth and latency also strictly favor the CPU.
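
To put that gap in perspective, a quick sketch using nominal peak numbers (real
transfers see less; 484 GB/s is the 1080 Ti's rated memory bandwidth):

    # Time to move the GTX 1080 Ti's full 11 GB of VRAM, host link vs. local DRAM
    vram_gb = 11
    pcie3_x16_gb_s = 16   # host link, per direction
    gddr5x_gb_s = 484     # onboard memory bandwidth

    print(vram_gb / pcie3_x16_gb_s)  # ~0.7 s over the host link
    print(vram_gb / gddr5x_gb_s)     # ~0.023 s locally -- a ~30x gap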

For large-scale systems, the value proposition of multi-GPU is greatly
neutered by reliance on the PCIe bus, and so Intel stays relevant for many
applications. And for the last 5 years, Intel's utter dominance of the
HPC/server market meant that they could limit PCIe lanes without much pressure
from their customers. With Ryzen/EPYC (128 PCIe 3.0 lanes!!) that old order
looks set to change.

~~~
jchw
I don't think that's the whole story, though. With some advances in storage,
PCIe is not far from being a bandwidth bottleneck even in consumer PCs (NVMe,
for example).

~~~
noinsight
> NVMe

Even there the CPU can be the bottleneck. I have a Samsung 960 Pro that's
theoretically capable of 3 GB/s reads, but with disk encryption, even with
AES-NI, the processor can only do ~2 GB/s.

~~~
zacmps
What about with SIMD?

~~~
noinsight
That would probably help. Not sure what Linux LUKS supports specifically, but
it's testable via "cryptsetup benchmark".
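
A minimal single-core throughput check along those lines, sketched in Python
with the "cryptography" package (an assumption on my part -- LUKS defaults to
AES-XTS inside the kernel, so userspace AES-CTR is only a ballpark stand-in,
not a dm-crypt measurement):

    import os, time
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key, nonce = os.urandom(32), os.urandom(16)  # AES-256 key, CTR nonce
    buf = os.urandom(256 * 1024 * 1024)          # 256 MiB test buffer

    enc = Cipher(algorithms.AES(key), modes.CTR(nonce), default_backend()).encryptor()
    t0 = time.perf_counter()
    enc.update(buf)
    dt = time.perf_counter() - t0
    print(f"~{len(buf) / dt / 1e9:.2f} GB/s on one core")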

------
bhouston
NVIDIA has their own bus to interconnect cards, called NVLink:
https://en.wikipedia.org/wiki/NVLink
Seems like an attempt to get around this.

Memory is really slow too, but AMD's 6+ memory controllers will help.

~~~
loeg
NVLink is called out specifically in the article as a workaround for the
delayed creation of PCIe 4.0.

------
davidmr
I certainly don't have much to add on any of the technical discussion here
about the host side of things, but speaking as someone who in a previous life
had to run miles and miles of IB cable under the floors, rip out mysteriously
broken cables and run their replacements back in, these high-capacity ports
are a godsend for switch interconnects. I can use more ports on my edge
switches and run way fewer shitty $1500 Mellanox ISL cables if I just want my
hosts running FDR IB.

------
wmf
Specifically, a PCIe 3.0 x16 slot can't even support a 200 Gbps NIC.
Apparently the Mellanox ConnectX-6 NIC takes up two slots.

~~~
faragon
PCI-E 3.0 x16 has ~16 GB/s of bandwidth per direction, while 200 Gbps is 25
GB/s, so yes, it is not enough.

Even with enough bus bandwidth, handling 25 GB/s of data without hardware-
accelerated network processing is very hard when current processors have ~60
GB/s of RAM bandwidth (DDR4, dual channel); even with zero-copy, a modern OS
is going to have a hard time processing packets at continuous full load.
That's mind-blowing, and amazing :-)
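
The per-packet budget that implies is brutal; a sketch (ignoring Ethernet
preamble/framing overhead, which makes it slightly worse):

    # CPU time budget per packet at a 200 Gbps line rate
    line_rate_bits = 200e9
    for frame_bytes in (64, 1500):                 # min / typical Ethernet frames
        pps = line_rate_bits / (frame_bytes * 8)   # packets per second
        print(f"{frame_bytes}B frames: {pps / 1e6:.0f} Mpps, {1e9 / pps:.1f} ns/packet")

At 1500-byte frames that is ~60 ns per packet; at minimum-size frames, under
3 ns.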

~~~
shaklee3
DDR has much higher bandwidth than this. The old v4 Xeons could get close to
100 GB/s, and the new Skylakes out now get a 50% increase by going to six
channels. On top of that, Intel has a feature (Data Direct I/O) where packets
coming off the wire are written into the last-level cache first, so you could
see much higher packet-processing performance, depending on the type of
application.
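
The channel arithmetic behind that, as a sketch (peak figures; sustained
STREAM numbers come in lower):

    # Peak DDR4 bandwidth = channels x transfer rate (MT/s) x 8 bytes per transfer
    def ddr4_gb_s(channels: int, mt_s: int) -> float:
        return channels * mt_s * 8 / 1000

    print(ddr4_gb_s(4, 2400))  # Xeon E5 v4, quad-channel DDR4-2400: ~77 GB/s/socket
    print(ddr4_gb_s(6, 2666))  # Skylake-SP, six-channel DDR4-2666: ~128 GB/s/socket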

~~~
snnn
Have you tested the numbers?

~~~
shaklee3
http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell

https://www.pugetsystems.com/labs/hpc/Memory-Performance-for-Intel-Xeon-Haswell-EP-DDR4-596/

------
hirsin
No mention of power requirements/allowances? I recall some GPUs a while back
being a big deal because they just squeezed into the power envelope allowed by
PCIe. I don't know if this governing body also controls the power specs or if
they're even related.

~~~
SAI_Peregrinus
High-end GPUs have a second or even third power connector, direct to the PSU.
The power supplies to support them have dedicated 12V rails for each. The case
you're remembering was due to trying to build cards down to a lower price by
not including this connector and running very near the max allowed power that
can be taken from the PCIe slot connector. IIRC, when combined with cheap
motherboards and power supplies, it resulted in failures.

The 6-pin PCIe power connectors can handle 75 W, the 8-pin 150 W, according to
the specs. Some video cards have multiples of these connectors, to allow for
very high power consumption. If more power is needed they'll just add more and
require a bigger power supply.
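
The budget math from those spec numbers, as a sketch (a card can draw up to
the slot's 75 W plus whatever its auxiliary connectors are rated for):

    # Maximum board power from the PCIe power-delivery ratings
    SLOT_W, SIX_PIN_W, EIGHT_PIN_W = 75, 75, 150

    # e.g. a high-end card with one 6-pin and one 8-pin connector:
    print(SLOT_W + SIX_PIN_W + EIGHT_PIN_W)  # 300 W ceiling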

~~~
blattimwind
GPU servers nowadays use proprietary form factors for GPUs anyway (like SXM2).

~~~
cameldrv
I'd be curious to hear who is using servers with the proprietary form factor
GPUs. The prices for datacenter vs. gamer versions of the same GPUs have
gotten really out of whack, where a 1080ti is $700, and a P100 is $7000, and
at least on FP32, they have the same performance.

From what I hear, almost all of the major DL research labs are using 4U
servers with 8 of either the 1080 Ti or the Titan Xp rather than the SXM2
Teslas. On
the other hand, all of the cloud providers seem to be sticking with the Tesla
products, and pricing GPU instances commensurately. I really wish there were a
cloud provider that offered instances with the gamer cards in them so that
companies in the DL field weren't effectively forced to buy and manage their
own hardware.

~~~
verall
A P100 has a GP100 with NVLink, HBM2, and a 610 mm² die, while a 1080 Ti or
Titan Xp has a GP102 with GDDR5X and a 471 mm² die. It is not like Quadro vs.
GeForce, where both run on a GP102 with different drivers; the GP100 is
considerably larger physically and has a much smaller userbase. I would be
very surprised if they had the same FP32 performance in real-world
situations.

~~~
wmf
OK, so let's compare eight GP102s against one GP100. Is there any metric where
GP100 wins?

~~~
ori_b
Power consumption and space consumption.

------
sliken
Seems kind of silly, especially since PCI-Express is not cache coherent.

Why not just connect high-performance devices directly to HyperTransport,
Infinity Fabric, QPI, or whatever the fast cache-coherent serial interface of
the day is?

~~~
greglindahl
The first generation of InfiniPath did exactly that. We had to create a slot
standard for Hypertransport (HTX), convince motherboard makers to build boards
with that slot, and then at the end of the day we just ended up convincing
everyone that they never wanted to go that route ever again.

These days Intel's putting Omni-Path on-package for Xeon Phi and Skylake.
Likely it's still a PCI-Express connection, but it doesn't count against the
total available lanes for external cards.

~~~
jabl
> and then at the end of the day we just ended up convincing everyone that
> they never wanted to go that route ever again.

Hmm, why? Intel executed well with Nehalem while AMD stumbled, leaving HTX
even more niche than it already was, or do you mean there was some fundamental
(technical) problem with it?

> These days Intel's putting Omni-Path on-package for Xeon Phi and Skylake.
> Likely it's still a PCI-Express connection, but it doesn't count against the
> total available lanes for external cards.

Assuming the parts with integrated Omni-Path use the same socket, some pins
will be needed for the OPA connection, no? Presumably pins that were reserved
for PCIe in the normal chips, I'd guess...

~~~
greglindahl
Look at photos -- Omni-Path Skylakes have an extra connector.

------
jnordwick
My problem with PCIe has always been latency. While more lanes will help with
throughput, they still don't help me get data on and off my NIC any quicker.

------
Razengan
Do buses matter in devices with a single SoC, like modern phones and tablets?
If not, then perhaps laptops/desktops could follow suit?

~~~
joenathanone
Yes. Storage, for example, isn't on the SoC, so bus speed will limit storage
speed.

~~~
frahs
Is PCIe really the bottleneck for storage devices? I haven't measured it
experimentally, but just off the top of my head, it seems like an x4 PCIe 3.0
slot should be able to keep up with everything but the latest SSDs (and for
those, I'm not really sure... it depends on the workload as well, probably).
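
A quick sanity check with nominal numbers (a sketch; 3.5 GB/s is the 960
Pro's rated sequential read, mentioned earlier in the thread):

    # Does PCIe 3.0 x4 keep up with a fast NVMe SSD?
    lane_gb_s = 8 * (128 / 130) / 8     # ~0.985 GB/s per 3.0 lane, per direction
    x4_gb_s = 4 * lane_gb_s             # ~3.9 GB/s
    ssd_read_gb_s = 3.5                 # rated sequential read
    print(x4_gb_s - ssd_read_gb_s)      # only ~0.4 GB/s of headroom left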

~~~
greglindahl
It's a bit of a self-fulfilling prophecy -- the more flash chips you've got,
the more bandwidth is possible, but there's not much incentive to go faster
than x4 allows if you know the device is going to plug into x4.

------
mcraiha
The topic is overgeneralized. For general computing, PCI Express is not the
bottleneck, nor will it be.

~~~
greglindahl
This blog "offers in-depth coverage of high-end computing at large
enterprises, supercomputing centers, hyperscale data centers, and public
clouds."

It's not about general purpose computing.

~~~
kbaker
Yeah, they even link directly to PDF datasheets of the PCB laminates and
prepreg material (the core layers that the copper adheres to on a printed
circuit board) in an effort to show why they are having issues.

Pretty specific. Good article.

------
omgtehlion
This article pits PCIe against Ethernet/InfiniBand multiple times.

I wonder, does the author know how network cards are attached to the CPU?
There is no magic, apparently. And on most desktops you cannot even have a
full-size video card and a 10GbE adapter working simultaneously at full
speed. And this bottleneck is in the CPU (its interfaces).
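
A sketch of the lane accounting on a 16-lane desktop CPU (assuming PCIe 3.0
and ignoring the chipset/DMI lanes, which are themselves shared):

    import math

    lane_gb_s = 0.985                            # PCIe 3.0, per direction
    nic_gb_s = 10 / 8                            # 10GbE = 1.25 GB/s
    nic_lanes = math.ceil(nic_gb_s / lane_gb_s)  # needs at least 2 lanes
    print(16 - nic_lanes)                        # leaves 14; since link widths come
                                                 # in powers of two, the GPU drops to x8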

