
Servers for an Accelerated Future - jgrahamc
https://blog.cloudflare.com/cloudflares-gen-x-servers-for-an-accelerated-future/
======
chadmeister
_" Notably, for the first time, Intel is not inside. We are not using their
hardware for any major server components such as the CPU, board, memory,
storage, network interface card (or any type of accelerator)."_

~~~
Proven
Who cares? What matters is price performance. They need to deliver value.
What's inside, I don't care.

~~~
oh_sigh
From the next paragraph of the article:

> This time, AMD is inside. We were particularly impressed by the 2nd Gen AMD
> EPYC processors because they proved to be far more efficient for our
> customers’ workloads. Since the pendulum of technology leadership swings
> back and forth between providers, we wouldn’t be surprised if that changes
> over time. However, we were happy to adapt quickly to the components that
> made the most sense for us.

------
jagger27
> We selected the AMD EPYC 7642 processor in a single-socket configuration for
> Gen X. This CPU has 48-cores (96 threads), a base clock speed of 2.4 GHz,
> and an L3 cache of 256 MB. While the rated power (225W) may seem high, it is
> lower than the combined TDP in our Gen 9 servers and we preferred the
> performance of this CPU over lower power variants. Despite AMD offering a
> higher core count option with 64-cores, the performance gains for our
> software stack and usage weren’t compelling enough.

I find this a bit puzzling for density reasons. I can definitely appreciate
the clock speed benefits. One 64-core part (AMD EPYC 7742) has the same TDP of
225W, so power should be in the same ballpark. There are also lower-clocked
64-core SKUs with a 200W TDP. I can't imagine price would be a major factor for
a company of Cloudflare's size, but it's definitely true that the 48-core part
is much cheaper. There's also the 7H12 with a higher base clock than the
48-core part, but its TDP is 280W.

All of these EPYC chips have the same monstrous 256MB of L3, so maybe part of
Cloudflare's workloads maxes out the cache before being able to feed all 64
cores, but that's a bit wishy-washy. Maybe, since they all have the same
PCIe lane capacity, 48 cores is the sweet spot.

The 64-core parts still seem like a no-brainer.

~~~
spamizbad
If you look at how the chiplets are organized, you technically have 4 cores
sharing a bank of L3 (2 of these 4-core groups per chiplet). In the 48-core
model, 1 core from each 4-core group is disabled, so you have 3 cores sharing
the same quantity of L3. So you now have 25% more L3 cache per core. You also
have 25% more per-core memory and PCIe bandwidth.

If your workload is cache or memory bandwidth sensitive you might recover some
performance despite having 25% fewer cores. You can probably run fewer cores
at a higher sustained clock speed. This may reduce a 25% deficit to something
more modest, like 5-10%, at which point the 64-core parts are harder to
justify.

~~~
JoeAltmaier
33% more per core?

~~~
spamizbad
Ah right, I'm bad at math.
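
For the record, here's the corrected arithmetic (a quick sketch; it assumes
the full 256 MB of shared L3 is equally usable on both SKUs):

    #include <stdio.h>

    int main(void) {
        const double l3_mb = 256.0;          /* total L3 on both SKUs */
        double per_core_64 = l3_mb / 64.0;   /* 4.00 MB per core */
        double per_core_48 = l3_mb / 48.0;   /* ~5.33 MB per core */

        /* 48-core part: 25% fewer cores, but ~33% more L3 per core. */
        printf("64c: %.2f MB/core, 48c: %.2f MB/core (+%.0f%%)\n",
               per_core_64, per_core_48,
               (per_core_48 / per_core_64 - 1.0) * 100.0);
        return 0;
    }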

------
rasengan
This major shift is well received by the tech community and appears to be an
industry-wide movement [1]. Looks like Meltdown [2] and Spectre [3] have run
their course.

Bravo to Cloudflare!

[1] https://www.amd.com/system/files/documents/linode-case-study.pdf

[2] https://en.m.wikipedia.org/wiki/Meltdown_(security_vulnerability)

[3] https://en.m.wikipedia.org/wiki/Spectre_(security_vulnerability)

~~~
wbl
(Disclosure: I work at Cloudflare but wasn't involved in this decision)
Spectre variant 1 still affects patched chips with any sort of speculation,
and the cost of fixing it via the compiler is substantial. It's a multi-vendor
structural issue. The hardware industry is stuck with some very unpalatable
decisions to fix this. In some cases there are tricks to protect particular
arrays, but it's very finicky, especially if arrays aren't a power of two in
length.
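
To make the array trick concrete, here is a minimal sketch of index masking (a
generic illustration, not Cloudflare's or any particular compiler's
mitigation). The mask keeps a speculative out-of-bounds load inside the array,
and it's only this tidy because the length is a power of two:

    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_LEN 256  /* power of two, so (TABLE_LEN - 1) is a valid mask */
    static uint8_t table[TABLE_LEN];

    uint8_t lookup(size_t idx) {
        if (idx < TABLE_LEN) {
            /* Even if the CPU speculates past the bounds check with an
               out-of-range idx, the mask clamps the load to the array. */
            return table[idx & (TABLE_LEN - 1)];
        }
        return 0;
    }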

~~~
britmob
I wonder when these issues will be far behind us... I can’t imagine it’s very
near.

------
tyingq
Cloudflare's Workers and KV store are pretty popular. I wonder if they ever
plan to expand that to include more cloud compute. The article seems to
indicate they have a lot of the needed underpinnings already.

~~~
dx034
I'd guess they'd need to provision massively more to offer compute. They could
offer it for some major data centres but I don't believe offering compute at
all data centres would be feasible.

------
a-ve
https://blog-cloudflare-com-assets.storage.googleapis.com/2020/02/AMD_EPYC-35.jpg

Looks like they're using Open Compute Project-inspired designs, going by
those high-visibility green screws.

~~~
Hardops
Hey, this is Rami. I lead the hardware team at Cloudflare.

Great observation! These chassis are designed for 19" racks, not the OCP spec.
But who doesn't love green thumb screws!

~~~
jcdmacleod
Can I ask the chassis and motherboard combination? I know previous generations
have been based on QCT chassis and motherboard components. I am curious what
the non-hotswap, shallow depth, 1S is as it probably saves a fair amount of
money dropping the HPC/density and hot swap premiums.

~~~
Hardops
We're still using ODM-built servers that are publicly available. Yes, moving
off the shared-infra chassis helps us improve reliability & cost through
simplicity, and enables us to scale our infrastructure using single nodes vs.
X nodes as the smallest scaling unit. Our machines are stateless, as our
globally interconnected infra is widely distributed and replicated. Hence we
moved to no-hotplug.

------
rajnathani
> For memory, we continued to use 256GB of RAM, as in our prior generation,
> but rated higher at 2933MHz.

Is ECC memory being used?

Also, great write-up! The recent benchmark numbers from Netflix [0] comparing
dual socket Intel Xeons with AMD EPYC Rome (64 cores) also showed really
impressive results!

[0] https://news.ycombinator.com/item?id=22229431

------
nodesocket
Are they running 2x 25Gbps NICs per server over standard Cat 6? I thought
10Gbps over Cat 6 was generally accepted as the fastest it can reliably push?

~~~
Hardops
This is Rami from Cloudflare... we actually use DAC cables w/ SFP+ for all of
our 25G NICs.

------
bluedino
>> For storage, we continue to have ~3TB, but moved to 3x1TB form factor using
NVME flash

Are they not using additional storage for caching? I also kind of expected
some GPUs for some tasks.

~~~
sudhirj
What cache tasks are possible on GPUs? Maybe computing a hash, but that
doesn’t seem like a big problem.

~~~
wmf
I could imagine CloudFlare using GPUs for image optimization or video
encoding. They do a lot more than caching these days.

~~~
elithrar
Neither is “great” on a GPU:

• Image optimization is still mostly software: JPEG, PNG, WebP. GPUs do not
bring much to the table here at all, and consume a lot of power, which is
likely to upset the hosting ISP of that cache node.

• Video encoding is better, but still average. NVENC (Nvidia’s encode engine)
has some quality challenges and limitations.

Further, doing this at the edge makes less sense: if you have to fetch the
source material from the origin, do the heavy lifting centrally and store the
results at the edge. Storing 4-5 renditions is cheap compared to GPU running
costs here.

------
DenseComet
In the last picture, the server seems to be 1U, but also squarish. Is there a
benefit or a reason for not using the entire depth of the rack, or was there
just nothing more to include?

------
porker
With the LUKS full disk encryption, how are you managing the encryption keys?

I assume the keys are not password protected so the server automatically
decrypts on boot?

------
markroseman
Immense kudos for the Douglas Coupland reference.

