I want to build an epic rig that will last a long time with professional grade hardware (with ECC memory for instance) and would love to get a lot of the bleeding-edge stuff without compromising on durability. Where do these people hang out online?
Previously you could only get this CPU when buying the Lenovo ThinkStation P620 machine. I'm pretty happy with Lenovo ThinkStations though (I bought a P920 with dual Xeons 2.5 years ago).
I guess I should submit this on HN as well.
Edit: I was getting ahead of myself - I thought these were for TR Pro with Zen 3. Turns out they are not out yet.
GamersNexus (despite the name, they include a good amount of non-gaming benchmarks, and they have great content on cases and cooling): https://www.youtube.com/user/GamersNexus https://www.gamersnexus.net/
Level1Techs (mentioned in another reply): https://www.youtube.com/c/Level1Techs https://www.level1techs.com/
r/homelab (and all the subreddits listed in its sidebar): https://www.reddit.com/r/homelab/
Even LinusTechTips has some decent content for server hardware, though they stay fairly superficial. And the forum definitely has people who can help out: https://linustechtips.com/
And the thing is, depending on what metric you judge performance by, the enthusiast hardware may very well outperform the server hardware. For a memory-sensitive workload, for example, you can get much faster RAM in enthusiast SKUs (https://www.crucial.com/memory/ddr4/BLM2K8G51C19U4B) than you'll find in server hardware. Similarly, the HEDT SKUs out-clock the server SKUs for both Intel and AMD.
I have a Threadripper system that outperforms most servers I work with on a daily basis, because most of my workloads, despite being multi-threaded, are sensitive to clock speed.
No one's using "gamer NICs" for high speed networking. Top of the line "gaming" networking is 802.11ax or 10GbE. 2x200Gb/s NICs are available now.
Gaming parts are strictly single socket - software that can take advantage of >64 cores will need server hardware - either one of the giant Ampere ARM CPUs or a 2+ socket system.
If something must run in RAM and needs TB of RAM, well then it's not even a question of faster or slower. The capability only exists on server platforms.
Some workloads will benefit from the performance characteristics of consumer hardware.
There's some folks in /r/homelab who are into this kind of thing, and I used their advice a fair bit in my build. While it is kind of mixed (there's a lot of people who build pi clusters as their homelab), there's still plenty of people who buy decommissioned "enterprise" hardware and make monstrous-for-home-use things.
The barrier to entry isn't quite as low as with consumer desktops, but I suppose that's the point. Still, it would be nice if there were a guide that could help me make good decisions to start.
Loud though - most of them run pretty quiet if not doing anything.
Compute is so cheap second hand.
Of course that is nothing compared to what you’d pay at Google/Azure/AWS for the AMD machine of this news item :-)
12V-only PSUs like OEMs use, or ATX12VO, in combination with a motherboard without IPMI (similar to the German Fujitsu motherboards), have significantly lower power consumption at rest - somewhere around 8-10 watts without HDDs. Much better for home use IMHO.
Regardless of electricity cost, all that electricity usage winds up as a lot of heat in a dwelling. To help offset the energy consumption, in the future I plan to use a hybrid water heater that can act as a heat pump and dehumidifier and capture the excess heat, as a way to reduce energy consumption for hot water.
I’ve got a 16-bay 3.5” Gooxi chassis that I’ve put a Supermicro motherboard + Xeon in.
Something like this:
I got this specific NAS chassis because it has a fan wall with 3x 120 mm fans, not because I need the bays.
With a few rather cool SSDs for storage and quiet Noctua fans it is barely a whisper.
Also - vertical rack mounting behind a closet door!
I can have a massive chassis that takes up basically no space at all. Can’t believe I didn’t figure that one out earlier...
It’s not likely that a silent 2W fan will move a similar amount of air as the stock 14W fans. The enterprise gear from HPE is pretty well engineered; I’m skeptical that they over-designed the fans by a 7x factor.
Operating voltage tells you “this fan won’t burn up when you plug it in”. It doesn’t tell you “will keep the components cool”.
HardForum is cool too.
haha yeah, I bought a whole computer from someone and was wondering why the RAM looked like rupees from Zelda
apparently that is common now
but at least I'm not cosplaying as a karate day trader for my Wall Street Journal exposé
I’m not trying to be snarky here but you can always just turn off the lights or set it to be a solid color of your preference.
I remember WhatsApp used to serve its 500M users with only a dozen or so large FreeBSD boxes. (Only to be taken apart by Facebook.)
So thank you for raising awareness. Hopefully the pendulum is swinging back to conceptually simple design.
>I also have a 380 GB Intel Optane 905P SSD for low latency writes
I would love to see that, although I am waiting for someone to do a review of the Optane SSD P5800X: random 4K IOPS up to 1.5M with less than 6 µs latency.
With 1TB of RAM you can have 256 bytes for every person on earth live in memory. With SSD either as virtual memory or keeping an index in RAM, you can do meaningful work in real time, probably as fast as the network will allow.
Depending on how you define a TB (memory tends to favour the latter definition, but YMMV):
1,000,000,000,000 / 7.8billion = 128.21 bytes per human.
1,099,511,627,776 / 7.8billion = 140.96 bytes per human.
population source via Wikipedia.
The new P5800X should be sick.
I rolled with it, but really wondered if they knew I could get 2x the hardware (and have a computer at home and at work) for less money than the MBP... Most of the people didn't seem to understand that laptop CPUs are not the same as desktop/workstation ones, especially once they hit thermal throttling.
Which is how I ended up with an absolute monster of a work machine, these days I WFH and while work issued me a Macbook Pro it sits on the shelf behind me.
Fedora on a (still fast) Ryzen/2080 and 2x4K 27" screens vs a Macbook Pro is a hilarious no brainer for me.
Upgrading soon, but can't decide whether I need the 5950X or merely want it - realistically, except for gaming, I'm nowhere near tapping out this machine (and it's still awesome for that and VR, which is why the step-son is about to get, in his words, a "sick" PC).
I used to work for a VFX company in 2008. At that point we used Lustre to get high-throughput file storage.
From memory we had something like 20 racks of servers/disks to get 3-6 gigabytes/s (sustained) throughput on a 300 TB filesystem.
It is hilarious to think that a 2U box can now theoretically saturate 2x 100 Gb NICs.
I was thinking of doing something like that. Weirdly I got sustained throughput differences when I killed & restarted fio. So, if I got 11M IOPS, it stayed at that level until I killed fio & restarted. If I got 10.8M next, it stayed like it until I killed & restarted it.
This makes me think that I'm hitting some PCIe/memory bottleneck, dependent on process placement (which process happens to need to move data across infinity fabric due to accessing data through a "remote" PCIe root complex or something like that). But then I realized that Zen 2 has a central IO hub again, so there shouldn't be a "far edge of I/O" like on current gen Intel CPUs (?)
But there's definitely some workload placement and I/O-memory-interrupt affinity that I've wanted to look into. I could even enable the NUMA-like-mode from BIOS, but again with Zen 2, the memory access goes through the central infinity-fabric chip too, I understand, so not sure if there's any value in trying to achieve memory locality for individual chiplets on this platform (?)
https://access.redhat.com/documentation/en-us/red_hat_enterp... tells you how to tweak irq handlers.
You usually want to change both: pinning each fio process + each interrupt handler to specific CPUs will reach the highest performance.
You can even use the isolcpus kernel parameter to reduce jitter from things you don't care about, to minimize latency (it won't do much for bandwidth).
Make sure you get as many numa domains as possible in your BIOS settings.
I recommend using numactl with the cpu-exclusive and mem-exclusive flags. I have noticed a slight performance drop when the RAM cache fills beyond the sticks local to the CPUs doing work.
One last comment is that you mentioned interrupts being "striped" among CPUs. I would recommend pinning the interrupts from one disk to one NUMA-local CPU and using numactl to run fio for that disk on the same CPU.
An additional experiment, if you have enough cores, is to pin interrupts to CPUs local to the disk, but use other cores on the same NUMA node for fio. That has been my most successful setup so far.
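To make that concrete, here's a rough sketch of what the pinning could look like (device names, sysfs paths and the fio options are just examples to adjust for your drives - and note numa_node can be -1 on a single-socket box):
NODE=$(cat /sys/class/nvme/nvme0/device/numa_node)
# Pin all of nvme0's interrupts to the CPUs of its local NUMA node
for irq in $(grep nvme0q /proc/interrupts | cut -d: -f1); do
  cat /sys/devices/system/node/node${NODE}/cpulist > /proc/irq/${irq}/smp_affinity_list
done
# Run fio for that disk bound to the same node's CPUs and memory
numactl --cpunodebind=${NODE} --membind=${NODE} \
  fio --name=nvme0 --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring \
      --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based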
What is your mental model like? How much experimentation do you do versus reading kernel code? How do you know what questions to start asking?
*edit, btw I understand that a response to these questions could be an entire book, you get the question-space.
As far as mindset goes - I try to apply the developer mindset to system performance. In other words, I don't use much of what I call the "old school sysadmin mindset", from a time where better tooling was not available. I don't use systemwide utilization or various get/hit ratios for doing "metric voodoo" of Unix wizards.
The developer mindset dictates that everything you run is an application. The JVM is an application. The kernel is an application. Postgres and Oracle are applications. All applications execute one or more threads that either run on CPU or do not run on CPU. There are only two categories of reasons why a thread does not run on CPU (is sleeping): the OS put the thread to sleep (involuntary blocking), or the thread voluntarily wanted to go to sleep (for example, it realized it can't get some application-level lock).
And you drill down from there. Your OS/system is just a bunch of threads running on CPU, sleeping and sometimes communicating with each other. You can directly measure all of these things easily nowadays with profilers, no need for metric voodoo.
I have written my own tools to complement things like perf, ftrace and BPF stuff - as a consultant I regularly see 10+ year old Linux versions, etc - and I find sampling thread states from /proc file system is a really good (and flexible) starting point for system performance analysis and even some drilldown - all this without having to install new software or upgrading to latest kernels. Some of the tools I showed in my article too:
https://tanelpoder.com/psnapper & https://0x.tools
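(For the curious: the core idea can be hacked together with nothing but /proc - the snippet below is only an illustration of the concept, psnapper/0x.tools do this properly and with much more detail.)
# Crude thread-state sampler: count all threads by state, once per second
while true; do
  grep -h '^State:' /proc/[0-9]*/task/[0-9]*/status 2>/dev/null | sort | uniq -c | sort -rn
  echo ---
  sleep 1
done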
At the end of my post I mentioned that I'll do a webinar "hacking session" next Thursday; I'll show more of how I work there :-)
I have one question / comment: did you use multiple jobs for the BW (large IO) experiments? If yes, then did you set randrepeat to 0? I'm asking this because fio by default uses the same sequence of offsets for each job, in which case there might be data re-used across jobs. I had verified that with blktrace a few years back, but it might have changed recently.
edit: fixed typo
I mean, currently OLTP RDBMS engines tend to use 4k, 8k (and some) 16k block size and when doing completely random I/O (or, say traversing an index on customer_id that now needs to read random occasional customer orders across years of history). So you may end up reading 1000 x 8 kB blocks just to read 1000 x 100B order records "randomly" scattered across the table from inserts done over the years.
Optane persistent memory can do small, cache line sized I/O I understand, but that's a different topic. When being able to do random 512B I/O on "commodity" NVMe SSDs efficiently, this would open some interesting opportunities for retrieving records that are scattered "randomly" across the disks.
edit: to answer your question, I used 10 separate fio commands with numjobs=3 or 4 for each and randrepeat was set to default.
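If I rerun these, something like the following should force each job to use its own random offset sequence (an illustrative command, not the exact one from the article):
# randrepeat=0 seeds every job/run differently, so multiple jobs
# shouldn't end up re-reading each other's blocks
fio --name=bw --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring \
    --rw=randread --bs=1M --iodepth=32 --numjobs=4 --randrepeat=0 \
    --group_reporting --runtime=60 --time_based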
And Ethernet (unless using jumbo frames on the LAN) is about 1.5 kB per frame (not 4 kB).
One such PC should be able to do 100k simultaneous 5 Mbps HD streams.
Testing this would be fun :)
This is all FreeBSD, and is the evolution of the work described in my talk at the last EuroBSDCon in 2019: https://papers.freebsd.org/2019/eurobsdcon/gallatin-numa_opt...
I still remember the post about breaking the 100 Gbps barrier - that was maybe in 2016 or '17? And it wasn't that long ago it was 200 Gbps, and if I remember correctly it was hitting a memory bandwidth barrier as well.
And now 350Gbps?!
So what's next? Wait for DDR5? Or moving to some memory controller black magic like POWER10?
The current bottleneck is IO related, and it's unclear what the issue is. We're working with the hardware vendors to try to figure it out. We should be getting about 390 Gb/s.
For a while now I had operated under the assumption that CPU-based crypto with AES-GCM was faster than most hardware offload cards. What makes the Mellanox NIC perform better?
I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
> We're working with the hardware vendors to try to figure it out. We should be getting about 390Gb/s
Something I explained to a colleague recently is that a modern CPU gains or loses more computing power from a 1 °C temperature difference in the room's air than my first four computers had combined.
You're basically complaining that you're unable to get a mere 10% of the expected throughput. But put in absolute terms, that's 40 Gbps, which is about 10x more than what a typical server in 2020 can put out on the network. (Just because you have 10 Gbps NICs doesn't mean you can get 10 Gbps! Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice)
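(If anyone wants to check their own boxes, a quick multi-stream test is enough to see where you actually land - the hostname below is a placeholder:)
# receiver
iperf3 -s
# sender: 8 parallel TCP streams for 30 seconds
iperf3 -c server.example.com -P 8 -t 30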
> I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
It may depend on what you're sending. Netflix's use case is generally sending files. If you're doing software encryption you load the plain-text file into memory (via the filesystem/unified buffer cache), then write the (session-specific) encrypted text into separate memory, then hand that memory to the NIC to send out.
If the NIC can do the encryption, you would load the plain text into memory, then tell the NIC to read from that memory to encrypt and send out. That saves at least a write pass, and probably a read pass. (256 MB of L3 cache on latest EPYC is a lot, but it's not enough to expect cached reads from the filesystem to hit L3 that often, IMHO)
If my guesstimate is right, a cold file would go from hitting memory 4 times to hitting it twice. And a file in disk cache would go from 3 times to once; the CPU doesn't need to touch the memory if it's in the disk cache.
Note that this is a totally different case from encrypting dynamic data that's necessarily touched by the CPU.
> You're basically complaining that you're unable to get a mere 10% of the expected throughput. But put in absolute terms, that's 40 Gbps, which is about 10x more than what a typical server in 2020 can put out on the network. (Just because you have 10 Gbps NICs doesn't mean you can get 10 Gbps! Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice)
I had no problem serving 10 Gbps of files on a dual Xeon E5-2690 (v1; a 2012 CPU), although that CPU isn't great at AES, so I think it only did 8 Gbps or so with TLS; the next round of servers for that role had 2x 10G and 2690 v3 or v4 (2014 or 2016; but I can't remember when we got them) and thanks to better AES instructions, they were able to do 20 G (and a lot more handshakes/sec too). If your 2020 servers aren't as good as my circa 2012 servers were, you might need to work on your stack. OTOH, bulk file serving for many clients can be different than a single connection iperf.
You're spot on. I have a slide that I like to show NIC vendors when they question why TLS offload is important.
See pages 21 and 22 of: https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf
I assume NF's software pipeline is zero copy, so if TLS is done in the NIC data only gets read from memory once when it is DMA'd to the NIC. With software TLS you need to read the data from memory (assuming it's not already in cache, which given the size of data NF deals with is unlikely), encrypt it, then write it back out to main memory so it can be DMA'd to the NIC. I know Intel has some fancy tech that can DMA directly to/from the CPU's cache, but I don't think AMD has that capability (yet).
Easy line rate if you crank the MTU all the way to 9000 :D
> modern CPU gains or loses more computer power from a 1° C temperature difference in the room's air
If you're using the boost algorithm rather than a static overclock, and when that boost is thermally limited rather than current limited. With a good cooler it's not too hard to always have thermal headroom.
In my experience jumbo frames provide at best an improvement of about 20%, and only in rare cases such as ping-pong UDP protocols like TFTP or Citrix PVS streaming.
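If you do want to test jumbo frames, both ends (and every switch port in between) need the larger MTU; a quick sanity check looks roughly like this (interface name and address are placeholders):
ip link set dev enp1s0f0 mtu 9000
# 8972 + 8 (ICMP header) + 20 (IP header) = 9000; -M do forbids fragmentation
ping -M do -s 8972 192.0.2.1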
And - do such cards even allow direct "cross" connection without a switch in between?
For a cheap solution, I'd get a pair of used Mellanox ConnectX4 or Chelsio T6, and a QSFP28 direct attach copper cable.
They all seem to offer/suggest daisy-chain connectivity at least for those with two ports per card as one potential topology.
As for directly connecting them: absolutely, works great. I'd recommend a cheap DAC off fs.com to connect them in that case.
$ lsblk -t /dev/nvme0n1
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
nvme0n1 0 512 0 512 512 0 none 1023 128 0B
$ sudo nvme id-ns -H /dev/nvme0n1 | grep Size
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
On a Samsung SSD 970 EVO 1TB it seems only 512-byte LBAs are supported:
# nvme id-ns /dev/nvme0n1 -n 1 -H|grep "^LBA Format"
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
Some consumer SSD vendors do enable 4kB LBA support. I've seen it supported on consumer drives from WD, SK hynix and a variety of brands using Phison or SMI SSD controllers (including Kingston, Seagate, Corsair, Sabrent). But I haven't systematically checked to see which brands consistently support it.
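If a drive does advertise a 4 kB format, switching is a one-liner with nvme-cli - just note that formatting wipes the namespace, and the lbaf index below is only an example (use whatever index your id-ns output reports for the 4096-byte data size):
nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format'
# WARNING: destroys all data on the namespace
nvme format /dev/nvme0n1 --namespace-id=1 --lbaf=1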
But this thread gets into details that are more esoteric than what I cover in most reviews, which are written with a more Windows-oriented audience in mind. Since I do most of my testing on Linux and have an excess of SSDs littering my office, I'm well-equipped to participate in a thread like this.
I highly recommend reddit.com/r/NewMaxx as the clearinghouse for consumer SSD news and Q&A. I'm not aware of a similarly comprehensive forum for enterprise storage, where this thread would probably be a better fit.
As in, what ashift value do you use with zfs?
matching the page size?
> the underlying media page size is usually on the order of 16kB
I'd say that's a good reason to set ashift=14, as 2^14 = 16 kB.
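For example (pool and device names are made up; whether 12 or 14 is the better trade-off depends on the drive and workload):
# ashift is per-vdev and can't be changed later, so set it explicitly at creation
zpool create -o ashift=14 fastpool /dev/nvme0n1
# verify what the pool actually got
zdb -C fastpool | grep ashift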
1) Learning & researching capabilities of modern HW
2) Running RDBMS stress tests (until breaking point), Oracle, Postgres+TimescaleDB, MySQL, probably ScyllaDB soon too
3) Why? As a performance troubleshooter consultant+trainer, I regularly have to reproduce complex problems that show up only under high concurrency & load - stuff that you can't just reproduce in a VM in a laptop.
4) Fun - seeing if the "next gen" hardware's promised performance is actually possible!
FYI I have some videos from my past complex problem troubleshooting adventures, mostly Oracle stuff so far and some Linux performance troubleshooting:
Any chance you could post somewhere the output of:
lstopo --of ascii
RocksDB, and LSM algorithms in general, seem to be designed with the assumption that random block I/O is slow. It appears that, for modern hardware, that assumption no longer holds, and the software only slows things down.
 - https://github.com/BLepers/KVell/blob/master/sosp19-final40....
Saturating an NVMe drive with a single x86 thread is trivial if you change how you play the game. Using async/await and yielding to the OS is not going to cut it anymore. Latency with these drives is measured in microseconds. You are better off doing microbatches of writes (10-1000 µs wide) and pushing these to disk with a single thread that monitors a queue in a busy-wait loop (sort of like the LMAX Disruptor, but even more aggressive).
Thinking about high-core-count parts, sacrificing an entire thread to busy waiting so you can write your transactions to disk very quickly is not a terrible prospect anymore. This same ideology is also really useful for ultra-precise execution of future timed actions. Approaches in managed languages like Task.Delay or even Thread.Sleep are insanely inaccurate by comparison. The humble while(true) loop is certainly not energy efficient, but it is very responsive and predictable as long as you don't ever yield. What's one core when you have 63 more to go around?
I'm not an expert in this area, but wouldn't it be just as lightweight to have your async workers pushing onto a queue, and then have your async writer only wake up when the queue is at a certain level to create the batched write? Either way, you won't be paying the OS context switching costs associated with blocking a write thread, which I think is most of what you're trying to get out of here.
For context and to put numbers around this, the average read latency of the fastest, latest generation PCI 4.0 x4 U.2 enterprise drives is 82-86µs, and the average write latency is 11-16µs.
I can't find it now. I think they were trying to say that cassandra can't keep up because of the JVM overhead and you need to be close to metal for extreme performance.
This is similar. Huge amounts of flooding I/O from modern PCIe SSDs really closes the traditional gap between CPU and "disk".
The biggest limiter in the cloud right now is EBS/SAN. Sure, you can use local storage in AWS if you don't mind it disappearing, but while gp3 is an improvement, it pales next to stuff like this.
Also, this is fascinating:
"Take the write speeds with a grain of salt, as TLC & QLC cards have slower multi-bit writes into the main NAND area, but may have some DIMM memory for buffering writes and/or a “TurboWrite buffer” (as Samsung calls it) that uses part of the SSDs NAND as faster SLC storage. It’s done by issuing single-bit “SLC-like” writes into TLC area. So, once you’ve filled up the “SLC” TurboWrite buffer at 5000 MB/s, you’ll be bottlenecked by the TLC “main area” at 2000 MB/s (on the 1 TB disks)."
I didn't know controllers could swap between TLC/QLC and SLC.
1. Async everywhere - We use AIO and io_uring to make sure that your inter-core communications are non-blocking.
2. Shard-per-core - It also helps if specific data is pinned to a specific CPU, so we partition on a per-core basis. Avoids cross-CPU traffic and, again, less blocking.
3. Schedulers - Yes, we have our own IO scheduler and CPU scheduler. We try to get every cycle out of a CPU. Java is very "slushy" and though you can tune a JVM it is never going to be as "tight" performance-wise.
4. Direct-attached NVMe > networked-attached block storage. I mean... yeah.
We're making Scylla even faster now, so you might want to check out our blogs on Project Circe:
• Introducing Project Circe: https://www.scylladb.com/2021/01/12/making-scylla-a-monstrou...
• Project Circe January Update: https://www.scylladb.com/2021/01/28/project-circe-january-up...
The latter has more on our new scheduler 2.0 design.
I wish I could control the % of SLC. Even dividing a QLC space by 16 makes it cheaper than buying a similarly sized SLC
To me, what the original article shows is an opportunity to remove - not add.
This is using SPDK to eliminate all of the overhead the author identified. The hardware is far more capable than most people expect, if the software would just get out of the way.
When I have more time again, I'll run fio with the SPDK plugin on my kit too. I'd also be interested in seeing what happens when doing 512B random I/Os.
But while SPDK does have an fio plug-in, unfortunately you won't see numbers like that with fio. There's way too much overhead in the tool itself. We can't get beyond 3 to 4 million with that. We rolled our own benchmarking tool in SPDK so we can actually measure the software we produce.
Since the core is CPU bound, 512B IO are going to net the same IO per second as 4k. The software overhead in SPDK is fixed per IO, regardless of size. You can also run more threads with SPDK than just one - it has no locks or cross thread communication so it scales linearly with additional threads. You can push systems to 80-100M IO per second if you have disks and bandwidth that can handle it.
For at least reads, if you don't hit a CPU limit you'll get 8x more IOPS with 512B than you will with 4KiB with SPDK. It's more or less perfect scaling. There's some additional hardware overheads in the MMU and PCIe subsystems with 512B because you're sending more messages for the same bandwidth, but my experience has been that it is mostly negligible.
The benchmark builds to build/examples/perf and you can just run it with -h to get the help output. Random 4KiB reads at 32 QD to all available NVMe devices (all devices unbound from the kernel and rebound to vfio-pci) for 60 seconds would be something like:
perf -q 32 -o 4096 -w randread -t 60
You can specify only test specific devices with the -r parameter (by BUS:DEVICE:FUNCTION essentially). The tool can also benchmark kernel devices. Using -R will turn on io_uring (otherwise it uses libaio), and you simply list the block devices on the command line after the base options like this:
perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1
You can get ahold of help from the SPDK community at https://spdk.io/community. There will be lots of people willing to help.
Excellent post by the way. I really enjoyed it.
But SPDK has a problem you don't have with bypasses and io_uring, in that it needs the IOMMU enabled, and that can itself become a bottleneck. There are also issues for some applications that want to use interrupts rather than poll everything.
What's really nice about io_uring is that it sort of standardizes a large part of what people were doing with bypasses.
I may have missed using the right unit in some other sections too. At least I hope that I've conveyed that there's a difference!
I was knocking up some profiling code and measured the performance of gettimeofday as a proof-of-concept test.
The performance difference between running the test on my personal desktop Linux VM versus running it on a cloud instance Linux VM was quite interesting (the cloud was worse).
I think I read somewhere that cloud instances cannot use the VDSO code path because your app may be moved to a different machine. My recollection of the reason is somewhat cloudy.
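One way to check is to look at the active clocksource - my understanding is that gettimeofday() only stays on the vDSO fast path with a vDSO-capable clocksource (tsc; some paravirtual clocksources historically weren't), otherwise it falls back to a real syscall:
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# if the vDSO path is used, strace should show (almost) no clock syscalls here
strace -c -e trace=clock_gettime,gettimeofday date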
Anyone have advice on optimizing a Windows 10 system? I have a Haswell workstation (E5-1680 v3) that I find reasonably fast and that works very well under Linux. In Windows, I get lost. I tried to run the UserBenchmark suite, which told me I'm below median for most of my components. Is there any good advice on how to improve that? Which tools give good insight into what the machine is doing under Windows?
I'd like first to try to optimize what I have, before upgrading to the new shiny :).
In particular, submitting multiple requests at once can amortize the cost of ringing the NVMe doorbell (the expensive part, as far as I understand it) across multiple requests.
edit: I ran a quick test with various IO batch sizes and it didn't make a difference - I guess because thanks to using io_uring, my bottleneck is not in IO submission, but deeper in the block IO stack...
root@awork3:~# echo 4 > /sys/module/nvme/parameters/poll_queues
root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
root@awork3:~# dmesg -c
[749717.253101] nvme nvme1: 12/0/4 default/read/poll queues
root@awork3:~# echo 8 > /sys/module/nvme/parameters/poll_queues
root@awork3:~# dmesg -c
root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
root@awork3:~# dmesg -c
[749736.513102] nvme nvme1: 8/0/8 default/read/poll queues
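Once the poll queues are there, fio can exercise them with the hipri flag (as far as I know this makes the io_uring engine use IOPOLL-style completion polling instead of interrupts) - an illustrative run:
fio --name=polltest --filename=/dev/nvme1n1 --direct=1 --ioengine=io_uring --hipri \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=30 --time_based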
I think that should be independent.
> edit: I ran a quick test with various IO batch sizes and it didn't make a difference - I guess because thanks to using io_uring, my bottleneck is not in IO submission, but deeper in the block IO stack...
It probably won't get you drastically higher speeds in an isolated test - but it should help reduce CPU overhead. E.g. on one of my SSDs
fio --ioengine io_uring --rw randread --filesize 50GB --invalidate=0 --name=test --direct=1 --bs=4k --numjobs=1 --registerfiles --fixedbufs --gtod_reduce=1 --iodepth 48
uses about 25% more CPU than when I add --iodepth_batch_submit=0 --iodepth_batch_complete_max=0. But the resulting iops are nearly the same as long as there are enough cycles available.
This is via filesystem, so ymmv, but the mechanism should be mostly independent.
I was quite surprised to hear in that thread that AMD's Infinity Fabric was so oversubscribed. There's 256 GB/s of PCIe on a 1P system, but it seems like this 66 GB/s is all the fabric can do. A little under a 4:1 oversubscription!
I'm on the same page with your thesis that "hardware is fast and clusters are usually overkill," and disk I/O was a piece that I hadn't really figured out yet despite making great strides in the software engineering side of things. I'm trying to make a startup this year and disk I/O will actually be a huge factor in how far I can scale without bursting costs for my application. Good stuff!
Anyone have a story to share about their company doing just this? "Scale out" has basically been the only acceptable answer across most of my career. Not to mention High Availability.
Another way of achieving HA, together with satisfying disaster recovery requirements, is replication (either app-level or database log replication, etc.). So no distributed system is necessary unless you have legit scaling requirements.
If you work on ERP-like databases for traditional Fortune 500-like companies, few people run such "sacred monolith" applications on modern distributed NoSQL databases, it's all Oracle, MSSQL or some Postgres nowadays. Data warehouses used to be all Oracle, Teradata too - although these DBs support some cluster scale-out, they're still "sacred monoliths" from a different era (they are still doing - what they were designed for - very well). Now of course Snowflake, BigQuery, etc are taking over the DW/analytics world for new greenfield projects, existing systems usually stay as they are due to lock-in & extremely high cost of rewriting decades of existing reports and apps.
I would call this a distributed system. To me HA means 0 downtime deploys, are there SQL/RDBMS that offer that even for schema changes?
U.2 means more NAND to parallelize over, more spare area (and higher overall durability), potentially larger DRAM caches, and a far larger area to dissipate heat. Plus it has all the fancy bleeding-edge features you aren't going to see on consumer-grade drives.
The big issue with U.2 for "end user" applications like workstations is you can't get drivers from Samsung for things like the PM1733 or PM9A3 (which blow the doors off the 980 Pro, especially for writes and $/GB, plus other neat features like Fail-In-Place) unless you're an SI, in which case you've also co-developed the firmware. The same goes for SanDisk, KIOXIA and other makers of enterprise SSDs.
The kicker is enterprise U.2 drives are about the same $/GB as SATA drives but, being NVMe PCIe 4.0 x4, blow the doors off just about everything. There's also the EDSFF, NF1 and now E1.L form factors, but U.2 is very prevalent. Enterprise SSDs are attractive as that's where the huge volume is (hence the low $/GB), but end-user support is really limited. You can use "generic drivers", but you won't see anywhere near the peak performance of the drives.
The good news is both Micron and Intel have great support for end-users, where you can get optimized drivers and updated firmware. Intel has the D7-P5510 probably hitting VARs and some retail sellers (maybe NewEgg) within about 60 days. Similar throughput to the Samsung drives, far more write IOPS (especially sustained), lower latencies, FAR more durability (with a big warranty), far more capacity, and not too bad a price (looking like ~$800USD for 3.84TB with ~7.2PB of warrantied writes over 5 years).
My plan once Genesis Peak (Threadripper 5XXX) hits is four 3.84TB Intel D7-P5510s in RAID10, connected to a HighPoint SSD7580 PCIe 4.0 x16 controller. Figure ~$4,000 for a storage setup of ~7.3TB usable space after formatting, ~26GB/sec peak reads, ~8GB/sec peak writes, 2.8M 4K read IOPS, 700K 4K write IOPS, and ~14.3PB of warrantied write durability.
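If that ends up as plain Linux software RAID rather than the HighPoint driver doing it, the layout would be something like this sketch (device names are placeholders):
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/fast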
Here is the standalone AVX-512 ResNet50 code (C99 .h and .c files):
Oops, AMD doesn't support AVX-512 yet. Even Zen 3? Incredible
$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
If that is the case, then maybe it could be emitted again while masking the instruction sets Ryzen doesn't support yet.
It's been less than a quarter century - 1997 - since Microsoft and Compaq launched the TerraServer, which was a wordplay on terabyte: it stored a terabyte of data and it was a Big Deal. Today that's not storage, that's main RAM, unencumbered by NUMA.
When I think about Optane, I think about optimizing for low latency where it's needed and not that much about bandwidth of large ops.
Anyway, thanks for the inspiring post!
Btw, even the DIMMs have dedicated fans and enclosure (one per 4 DIMMs) on the P620.
PCIe switch chips were affordable in the PCIe 2.0 era when multi-GPU gaming setups were popular, but Broadcom decided to price them out of the consumer market for PCIe 3 and later.
This stuff is all fascinating to me. I have a zfs NAS but I feel like I've barely scratched the surface of SSDs
One popular example is HFT.
And from my experience on a desktop PC it is better to disable swap and have the OOM killer do its work, instead of swapping to disk, which makes my system noticeably laggy, even with a fast NVMe drive.
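(For reference, the knobs I mean - either kill swap entirely or just make the kernel very reluctant to use it:)
swapoff -a                 # disable all swap for the running system
sysctl vm.swappiness=0     # or keep swap, but strongly prefer reclaiming page cache over swapping
# remove/comment the swap entries in /etc/fstab to make it permanent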
Two big players in this space are Aerospike and ScyllaDB.
I'm not really in the market anymore, but Epyc looks like 1P is going to solve a lot of needs, and 2P will be available at a reasonable premium, but 4P will probably be out of reach.
But users of 16-socket machines will just step down to 4-socket Epyc machines with 512 cores (or whatever). And someone else will realize that moving their "web scale" cluster from 5k machines down to a single machine with 16 sockets results in lower latency and less cost (or whatever).
Turn those sleds into blades though, put 'em on their side, and go even denser. Density should be a way to scale costs down, but alas it's a huge upsell.
If you purchase a server and stick it in a co-lo somewhere, and your business plans to exist for 10+ years — well, is that server still going to be powering your business 10 years from now? Or will you have moved its workloads to something newer? If so, you'll probably want to decommission and sell the server at some point. The time required to deal with that might not be worth the labor costs of your highly-paid engineers. Which means you might not actually end up re-capturing the depreciated value of the server, but instead will just let it rot on the shelf, or dispose of it as e-waste.
Hardware leasing is a lot simpler. When you lease servers from an OEM like Dell, there's a quick, well-known path to getting the EOLed hardware shipped back to Dell and the depreciated value paid back out to you.
And, of course, hardware renting is simpler still. Renting hardware from the co-lo (i.e. "bare-metal unmanaged server" hosting plans) means never having to worry about the CapEx of the hardware in the first place. You just walk away at the end of your term. But, of course, that's when you start paying premiums on top of the hardware.
Renting VMs, then, is like renting hardware on a micro-scale; you never have to think about what you're running on, as — presuming your workload isn't welded to particular machine features like GPUs or local SSDs — you'll tend to automatically get migrated to newer hypervisor hardware generations as they become available.
When you work it out in terms of "ten years of ops-staff labor costs of dealing with generational migrations and sell-offs" vs. "ten years of premiums charged by hosting rentiers", the pricing is surprisingly comparable. (In fact, this is basically the math hosting providers use to figure out what they can charge without scaring away their large enterprise customers, who are fully capable of taking a better deal if there is one.)
Which, if you have even the remotest fiscal competence, you'll have funded by using the depreciation of the book value of the asset after 3 years.
(it's in my TODO list too)