> This time, AMD is inside. We were particularly impressed by the 2nd Gen AMD EPYC processors because they proved to be far more efficient for our customers’ workloads. Since the pendulum of technology leadership swings back and forth between providers, we wouldn’t be surprised if that changes over time. However, we were happy to adapt quickly to the components that made the most sense for us.
I find this a bit puzzling for density reasons. I can definitely appreciate the clock speed benefits. The 64-core part (AMD EPYC 7742) has the same 225W TDP as the 48-core 7642, so power should be in the same ballpark, and there are also lower-clocked 64-core SKUs with a 200W TDP. I can't imagine price would be a major factor for a company of Cloudflare's size, but it's definitely true that the 48-core part is much cheaper. There's also the 7H12, with a higher base clock than the 48-core part, but its TDP is 280W.
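For a rough sense of how those SKUs stack up, here's a back-of-envelope comparison using spec-sheet base clocks and TDPs (figures from AMD's public specs; all-core boost behavior will differ in practice, so treat this as a sketch, not a benchmark):

```python
# Back-of-envelope comparison of the SKUs discussed above.
# Tuples are (cores, base GHz, TDP W) from AMD's public spec sheets.
skus = {
    "EPYC 7742": (64, 2.25, 225),
    "EPYC 7642": (48, 2.3, 225),
    "EPYC 7H12": (64, 2.6, 280),
}

for name, (cores, base_ghz, tdp_w) in skus.items():
    agg_ghz = cores * base_ghz  # crude aggregate-throughput proxy
    print(f"{name}: {agg_ghz:6.1f} core-GHz, {tdp_w / cores:.2f} W/core")
```

Aggregate core-GHz is a very crude proxy (it ignores IPC, boost, and memory effects), but it shows why the 48-core part doesn't look as far behind as the core count alone suggests.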
All of these EPYC chips have the same monstrous 256MB of L3, so maybe part of Cloudflare's workload maxes out the cache before it can feed all 64 cores, but that's a bit wishy-washy. Maybe, since they all have the same PCIe lane capacity as well, 48 cores is the sweet spot.
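The cache-per-core arithmetic behind that guess is simple: the same 256MB of L3 spread over fewer active cores leaves each core more headroom. A minimal sketch:

```python
# Same 256 MB of L3 split across fewer active cores.
l3_mb = 256
for cores in (64, 48):
    print(f"{cores} cores: {l3_mb / cores:.1f} MB of L3 per core")
```

That's 4.0 MB/core at 64 cores versus ~5.3 MB/core at 48, a one-third bump for cache-bound work.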
The 64-core parts still seem like a no-brainer.
My experience buying CPUs for data-intensive servers has typically been that optimizing the performance-density-cost curve often points to a mid-range core count at a lower-middle clock rate. These CPUs are inexpensive relative to their product families while still having enough horses to drive your memory, PCIe, etc. to saturation with highly optimized code. Just enough resources, but no more.
We have two more blogs that will come out probably today, which will shed more light on why AMD worked better for us.
Stay tuned for 3 more blogs this week ... they will give deeper analysis on the perf % gains we saw and why.
If your workload is cache- or memory-bandwidth sensitive, you might recover some performance despite having 25% fewer cores: you can probably run fewer cores at a higher sustained clock speed. That may reduce a 25% deficit to something more modest, like 5-10%, at which point the 64-core parts are harder to justify.
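As a worked example of that argument (the sustained all-core clocks here are illustrative assumptions, not measured figures):

```python
# Illustrative sketch: fewer cores at a higher sustained clock
# narrow the raw core-count deficit. Clocks below are assumptions.
cores_64, cores_48 = 64, 48
clock_64 = 2.5   # assumed sustained all-core GHz, 64-core part
clock_48 = 3.0   # assumed sustained all-core GHz, 48-core part

core_deficit = 1 - cores_48 / cores_64                     # 25% fewer cores
throughput_ratio = (cores_48 * clock_48) / (cores_64 * clock_64)
print(f"core deficit: {core_deficit:.0%}")                 # 25%
print(f"throughput deficit: {1 - throughput_ratio:.0%}")   # 10%
```

With those (made-up) clocks, a 25% core deficit shrinks to a 10% throughput deficit before you even account for cache and bandwidth per core.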
Given that, having more memory bandwidth per core seems like it could easily improve CF's performance a lot.
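To put rough numbers on that: EPYC Rome has 8 channels of DDR4-3200, which works out to about 204.8 GB/s of peak theoretical bandwidth per socket (a ceiling, not an achievable figure), so the per-core share looks like this:

```python
# Peak theoretical DDR4 bandwidth for one EPYC Rome socket:
# 8 channels * 3200 MT/s * 8 bytes per transfer.
peak_gbs = 8 * 3200e6 * 8 / 1e9  # ~204.8 GB/s
for cores in (64, 48):
    print(f"{cores} cores: {peak_gbs / cores:.2f} GB/s per core")
```

That's 3.2 GB/s per core at 64 cores versus ~4.27 GB/s at 48, the same one-third bump as the cache math.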
If so, maybe AMD doesn’t have high-enough yield on their 64-core part (i.e. 8-core chiplet sub-part) to satisfy huge bulk orders for them, without also generating huge numbers of the 48-core-binned SKU (i.e. 6-core chiplets, really 6-out-of-8-enabled-core chiplets) in the process.
And I would suspect that their production process is such that they have a real, explicit 6-core chiplet part as well, which can be mixed and matched within a single CPU with the flawed, re-binned 6-of-8-core chiplets. That gives them a powerful hedge on their own logistics (in about the same way that SPAM has flexibility in its ratio of chicken to ham, letting it ride out turbulence in either market and making the end product cheaper than either input), but it further requires that people consume the SKUs containing "6"-core chiplets.
I would bet that AMD very much wants to sell large buyers the lower-core-count CPUs, since their yield guarantees that—at least for now—they have so very much more of them, and attempting to make more of the highest-end part ensures that they end up with even more of the less-than-highest-end chiplets lying around.
AMD probably ideally wants order-flow of CPUs in a ratio, e.g. “1x 7742 : 8x 7642”, and offers both better deals monetarily, and far faster delivery (/less contention on orders with other clients) when you take them up on it; or when you buy huge numbers of 7642s alone, such that you’re consuming the cast-off from bullheaded clients who wanted pure 7742s.
Curiously, TSMC seemingly published their N7 defect densities and they're low enough that most chiplets would not have outright dead cores. Specifically, they said 0.09 defects per square cm in a slide you can see at https://fuse.wikichip.org/news/2879/tsmc-5-nanometer-update/ . If that's saying what it appears to, lower SKUs must use a lot of chiplets where all the cores turn on, but (say) might not hit the top chip's performance spec within its TDP.
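Plugging that figure into the standard Poisson yield model supports the point. The ~74 mm² Zen 2 chiplet (CCD) area is an estimate from public die-shot measurements, so treat the result as approximate:

```python
import math

# Poisson yield model: the fraction of dice with zero defects is
# exp(-area * defect_density). D0 = 0.09 /cm^2 is from TSMC's slide;
# the ~74 mm^2 Zen 2 CCD area is an assumption from die-shot estimates.
d0_per_cm2 = 0.09
ccd_area_cm2 = 0.74

defect_free = math.exp(-ccd_area_cm2 * d0_per_cm2)
print(f"defect-free chiplets: {defect_free:.1%}")
print(f"chiplets with at least one defect: {1 - defect_free:.1%}")
```

Under those assumptions roughly 93-94% of chiplets come out with zero defects, which is why lower SKUs would have to draw heavily on fully-working dice that merely miss the top bin's clock/power spec.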
The 7742 needs all cores to run at a 2.25GHz base clock averaging ~3.1W apiece if you leave 25W of the 225W for the I/O die. The 7642 is looser: 2.3GHz averaging ~4.2W apiece, and that's after dropping the "worst" core from any CCX where they all work. (For non-obsessives, a CCX is a four-core group sharing a 16MB chunk of L3 cache.)
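That per-core power arithmetic, with the same 25W I/O-die budget (the comment's assumption, not an AMD figure), can be checked quickly:

```python
# Per-core power budget, leaving 25 W of the 225 W TDP for the I/O die.
# The 25 W I/O-die figure is the comment's assumption, not an AMD number.
io_die_w = 25
tdp_w = 225
per_core_w = {cores: (tdp_w - io_die_w) / cores for cores in (64, 48)}
for cores, watts in per_core_w.items():
    print(f"{cores} cores: {watts:.2f} W per core")
```

That's ~3.12 W per core for the 64-core part versus ~4.17 W for the 48-core part, a roughly 33% looser budget per core.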
Note lower SKUs like the 3700X/3800X and 7232P use 8-core chiplets. You can figure the chiplet count for a SKU by dividing its L3 capacity by 32MB, and from there you can figure out how many cores each chiplet has enabled.
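A quick sketch of that counting rule, applied to a few parts mentioned in this thread (core and L3 figures are spec-sheet values; the 32MB-per-chiplet rule is the one described above):

```python
# Chiplet count = L3 capacity / 32 MB per Zen 2 chiplet (CCD);
# enabled cores per chiplet = total cores / chiplet count.
# Tuples are (cores, L3 MB) from spec sheets.
parts = {
    "EPYC 7742": (64, 256),
    "EPYC 7642": (48, 256),
    "EPYC 7232P": (8, 32),
    "Ryzen 7 3700X": (8, 32),
}

for name, (cores, l3_mb) in parts.items():
    chiplets = l3_mb // 32
    print(f"{name}: {chiplets} chiplet(s), {cores // chiplets} cores enabled each")
```

So the 7742 is 8 chiplets with all 8 cores on, the 7642 is 8 chiplets with 6 cores on, and the 7232P/3700X pack a single fully-enabled chiplet.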
There's also plain market segmentation, i.e. enabling/disabling stuff on identical chips to sell at different prices. In this gen I doubt it's good strategy for AMD to hold back much performance like that, though, since they really want to get some market share right now.
(If turned-off cores generally work but below spec, that suggests there could be some way to make them useful for extremely-threaded workloads. Split hardware threads onto two lower-clocked physical cores when it looks like a net win, say. I can see enough potential thorns not to bother trying, but thinking of possibly-useful silicon sitting there turned off makes it just so tempting, heh.)
I have worked at companies with 5 employees struggling day to day to keep afloat, highly profitable companies with employee counts in the 6 figures and multiple billions in revenue and just about everything in between. I have never worked at one where cost was not a major factor.
In Cloudflare's case, revenues for 2019 were $287 million with a net loss of $105.8 million. They are competing with the market leader, Akamai, whose 2019 revenues were 10 times theirs with a profit of a few hundred million dollars, so I don't suspect Cloudflare is the exception to the rule of cost being a major factor.
Bravo to Cloudflare!
Intel is still selling as many Xeons as they can make. And AMD's EPYC 2 / datacenter revenue hasn't moved up much, relatively speaking.
Not only is it not major by any means, I wouldn't even agree with the word "shift".
And that is speaking as someone who really wants AMD to do better.
Looks like they're using Open Compute Project inspired designs, looking at those high-visibility green screws.
Great observation! These chassis are designed for 19" racks, not OCP spec. But who does not love green thumb screws!
If a server can take OCP mezz cards and the BMC can be managed in the same manner and with the same tools as a fully OCP-spec system, does it really matter that it's not in an OCP-spec chassis and rack?
Yes, we're using OCP NICs ... and many other components that are used by other servers/clouds. Nothing proprietary except our SW stack.
Is ECC memory being used?
Also, great write-up! The recent benchmark numbers from Netflix comparing dual-socket Intel Xeons with AMD EPYC Rome (64 cores) also showed really impressive results!
Are they not using additional storage for caching? I also kind of expected some GPUs for some tasks.
Yes, the 3TB is 2/3 used for caching, and 1/3 as local storage for our SSL/DNS/firewall/etc.
• Image optimization is still mostly software: JPEG, PNG, WebP. GPUs do not bring much to the table here at all, and consume a lot of power, which is likely to upset the hosting ISP of that cache node.
• Video encoding is a better fit, but still middling: NVENC (Nvidia's encode engine) has some quality challenges and limitations.
Further, doing this at the edge makes less sense: if you have to fetch the source material from the origin, do the heavy lifting centrally and store the results at the edge. Storing 4-5 renditions is cheap compared to GPU running costs here.
I assume the keys are not password protected so the server automatically decrypts on boot?