I'm not on the hardware team, so I don't have the cost breakdown. But my understanding is that flash storage is the most expensive line item, and it matters little whether it's consolidated into one box or spread over a rack; you still have to pay for it. By serving more from fewer boxes, you can reduce component duplication (cases, mobos, RAM, PSUs) and, more importantly, the power & cooling required.
The real risk is that we introduce a huge blast radius if one of these machines goes down.
I almost fell for the hype of PCIe gen4 after reading https://news.ycombinator.com/item?id=25956670, and it is quite interesting that PCIe gen3 NVMe drives can still do the job here. What would be the worst-case disk I/O throughput while serving 400Gb/s?
If you look at just the pci-e lanes and ignore everything else, the NICs are x16 (gen4) and there's two of them. The NVMes are x4 (gen3) and there are 18 of them. Since gen4 is about twice the bandwidth of gen3, it's 32 lanes of gen4 NIC vs about 36 lanes of gen4 equivalent NVMe.
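The lane comparison above can be sketched in a few lines of Python. This only restates the arithmetic from the comment; the function name and the flat 2x gen4-to-gen3 factor are assumptions for illustration, not anything taken from Netflix's setup.

```python
# Approximation used in the comment above: one gen4 lane carries about
# as much as two gen3 lanes.
GEN4_PER_GEN3 = 2.0


def gen4_equivalent_lanes(devices: int, lanes_per_device: int, gen: int) -> float:
    """Total lanes for a group of devices, expressed in gen4-equivalent lanes."""
    lanes = devices * lanes_per_device
    if gen == 4:
        return float(lanes)
    if gen == 3:
        return lanes / GEN4_PER_GEN3
    raise ValueError(f"unsupported PCIe generation: {gen}")


nic_lanes = gen4_equivalent_lanes(devices=2, lanes_per_device=16, gen=4)
nvme_lanes = gen4_equivalent_lanes(devices=18, lanes_per_device=4, gen=3)

print(f"NIC:  {nic_lanes:.0f} gen4-equivalent lanes")
print(f"NVMe: {nvme_lanes:.0f} gen4-equivalent lanes")
print("storage has more lane bandwidth than network:", nvme_lanes > nic_lanes)
```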
If we're only worried about throughput, and everything works out with queueing, there's no need for gen4 NVMes because the storage has more bandwidth than the network. That doesn't mean gen4 is only hype; if my math is right, you need gen4 x16 to have enough bandwidth to run a dual 100G Ethernet NIC at line rate, and you could use fewer gen4 storage devices if reducing device count were useful. I think Netflix would like more storage, so given the storage that fits in their systems, there's no need for gen4 storage; gen4 would probably make sense for their 800Gbps prototype though.
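A rough way to check the dual 100G claim, as a sketch: gen3 runs at 8 GT/s per lane and gen4 at 16 GT/s, both with 128b/130b encoding. The helper names here are made up for illustration, and the calculation ignores packet and protocol overhead, so real usable bandwidth is somewhat lower than these figures.

```python
# Raw per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0}
# gen3 and gen4 both use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def slot_gbps(gen: int, lanes: int) -> float:
    """Approximate one-direction slot bandwidth in Gb/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes


def fits_line_rate(gen: int, lanes: int, ports: int, port_gbps: float) -> bool:
    """True if the slot can carry all NIC ports at line rate (overhead ignored)."""
    return slot_gbps(gen, lanes) >= ports * port_gbps


for gen in (3, 4):
    bandwidth = slot_gbps(gen, 16)
    ok = fits_line_rate(gen, 16, ports=2, port_gbps=100)
    print(f"gen{gen} x16: {bandwidth:.1f} Gb/s, dual 100G at line rate: {ok}")
```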
In terms of disk I/O, either in the thread or the slides, drewg123 mentioned only about 10% of requests were served from page cache, leaving 90% served from disk. 400Gb/s is 50GB/sec, so that would make the worst case look something like 45GB/sec (switching to bytes because that's how storage throughput is usually measured). From previous discussions and presentations, Netflix doesn't do bulk cache updates during peak times, so they won't have a lot of reads at the same time as a lot of writes.
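The worst-case estimate works out like this, as a small sketch. The function name is hypothetical; the 400Gb/s and 10% page cache figures are the ones quoted in this thread, and the per-drive split assumes reads spread evenly over the 18 NVMe drives, which is an idealization.

```python
def disk_read_gbytes_per_s(network_gbits_per_s: float, cache_hit_fraction: float) -> float:
    """Disk read rate in GB/s needed to feed the network, given the page cache hit rate."""
    network_gbytes_per_s = network_gbits_per_s / 8  # bits to bytes
    return network_gbytes_per_s * (1 - cache_hit_fraction)


NVME_DRIVES = 18  # drive count mentioned earlier in the thread

worst_case = disk_read_gbytes_per_s(400, cache_hit_fraction=0.10)
# Assumption: load is spread evenly across all drives.
per_drive = worst_case / NVME_DRIVES

print(f"total disk reads: {worst_case:.1f} GB/s")
print(f"per drive:        {per_drive:.2f} GB/s")
```

Under that even-spread assumption, the per-drive figure stays under what a gen3 x4 link can carry (roughly 3.9GB/sec before protocol overhead).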
Thanks for the numbers. Perhaps hype is not the right word. It is just interesting to see that some older hardware can still be used to achieve state-of-the-art performance, as the bottleneck may lie elsewhere.
It's always balancing bottlenecks. Here, the bottleneck is memory bandwidth, which limits them to (more or less) 32 lanes of network; the platform has 128 lanes, so using more lanes than needed at a slower rate works and saves a bit of cost (probably). Their Intel Ice Lake test machine only had 64 lanes, which is also a bottleneck, so they used Gen4 NVMe to get the needed storage bandwidth into the lanes available.