Hmmh, there might be room for improvement, but I appreciate the article and the link to it here. It's been a while since I last tried to get the most out of an NFS server, and I wasn't aware that it's still used for high-performance applications. 25 GB/s certainly impressed me.
Yeah, this whole process took 2 years in between usual work. This article was testing the waters to see whether people are interested, and to get direction on where to focus follow-up articles.
Future areas of focus could be: diving into modern NUMA architecture, how to set expectations for hardware performance limits, network architecture, and the pros/cons of S3 vs. object in terms of perf.
What he said was not wrong. Why do better? I also wasted my time and learned nothing about how they achieved 25 GB/s with NFS. What were the bottlenecks? What did they fine-tune? Was it a plug-and-play type of benchmark?
I did something similar (~2015), but with the kernel NFS client, using multiple mounts to the same volume via different IP addresses.
Using vectored IO and spreading across multiple connections greatly improved throughput. However, metadata operations cannot be parallelized easily without application-side changes.
In more modern kernels, NFS supports the 'nconnect' mount option to open multiple network connections for a single mount. I wonder whether the approach of using libnfs for multiple connections is even necessary.
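Something like this (server name, export path, and the connection count are placeholders; nconnect needs a reasonably recent kernel and is capped at 16 connections):

    mount -t nfs -o vers=3,nconnect=8 nfs-server:/export /mnt/data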
I've tried to drive NFS to reasonable levels of performance in the past, and the bottleneck has never seemed like storage or network or the NFS server; it always seems like the combination of the built-in NFS client in the Linux kernel, the implementation of filesystem semantics, and the behavior of common workloads ends up making NFS much slower than the network would allow. I'm impressed with these benchmarks and the approach to collecting them, but I'm wondering if the Linux kernel NFS client can get anywhere close to the theoretical limits.
I've tried this with a read-only NFS server, across an AWS multi-gigabit connection, and I still found that I couldn't get anywhere near this level of performance for the workload of "make -j$(nproc)" in a Linux kernel tree. As a quick baseline, some numbers from the last time I tested this, with a c5.12xlarge (48 CPU) client and server: a defconfig local build was 40s, and a defconfig build with a Linux kernel tree in read-only NFS (with a tmpfs overlay on top for writability) was 6m55s. That's a 10x slowdown. System stats during the build showed 4-5MBps net recv and net send, and 1-4MBps disk write.
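For reference, the setup was roughly the following (server name, export, and paths here are made up):

    mount -t nfs -o ro nfs-server:/export/linux /mnt/linux-ro
    mount -t tmpfs tmpfs /mnt/scratch
    mkdir -p /mnt/scratch/upper /mnt/scratch/work /mnt/build
    mount -t overlay overlay \
        -o lowerdir=/mnt/linux-ro,upperdir=/mnt/scratch/upper,workdir=/mnt/scratch/work \
        /mnt/build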
Is there some well-known method to getting reasonable performance out of off-the-shelf NFS servers and clients?
(PM-T for Amazon EFS, AWS's native NFSv4.1 file system)
Performance tuning NFS is difficult, mostly because the information on how to do it isn't readily available. The two things that most people run into:
- 'Close to open' cache consistency. In practical terms this means that open() is a round trip to the server to validate any data that might already be cached (unless you use delegations), write() goes into the page cache (as a writeback cache), and close() flushes all dirty data. Building a kernel tree means reading and creating tons of small files, each of which requires two serial round trips over the network. Compare that to a local fs, where open(O_CREAT), write(), and close() don't actually go to disk and therefore run at memory speed (unless you use things like O_DIRECT or fsync()/fdatasync()).
- Per-TCP-flow throughput limitations. On the AWS network the per-flow limit is 5 Gbit/s in general and 10 Gbit/s within a placement group. To work around this, people use the 'nconnect' mount option (which currently does not work with EFS). Local networks may have different limitations, but a single TCP stream will typically have some bandwidth limit lower than the physical network bandwidth. I believe this (very cool!) fio plugin works around it by using multiple connections.
The actual data write latency of NFS servers isn't terribly different from local file systems.
Today, the best way to get the most performance out of NFS is to use large files, keep files open, and/or use high concurrency. By default, the 4.1 client will issue up to 64 concurrent requests; this can be raised via the 'max slots' NFS kernel module parameter. In your example of a kernel build, you could run with -j much higher than the number of CPUs, because the compile jobs will be I/O bound on reading input and writing output. This amortizes the round trips over more threads and, in theory (barring any other bottlenecks), reduces your build times.
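For example (assuming the module parameter referred to above is nfs.max_session_slots; the numbers are starting points, not recommendations):

    # allow more concurrent requests per NFSv4.1 session (default is 64);
    # takes effect the next time the nfs module is loaded
    echo "options nfs max_session_slots=180" > /etc/modprobe.d/nfs-slots.conf
    # oversubscribe the build so there are always requests in flight
    make -j$((4 * $(nproc)))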
> - 'Close to open' cache consistency. In practical terms this means that open() is a round trip to the server to validate any data that might already be cached (unless you use delegations), write() goes into the page cache (as a writeback cache), and close() flushes all dirty data. Building a kernel tree means reading and creating tons of small files, each of which requires two serial round trips over the network. Compare that to a local fs, where open(O_CREAT), write(), and close() don't actually go to disk and therefore run at memory speed (unless you use things like O_DIRECT or fsync()/fdatasync()).
That definitely sounds like a concern for writable NFS filesystems, but I was benchmarking reads from a read-only NFS mount.
Related: Is there some option I can pass to make it clear that the data on the server will never change and thus no possible write-to-read or close-to-open consistency issues can arise?
> - Per-TCP-flow throughput limitations. On the AWS network the per-flow limit is 5 Gbit/s in general and 10 Gbit/s within a placement group. To work around this, people use the 'nconnect' mount option (which currently does not work with EFS). Local networks may have different limitations, but a single TCP stream will typically have some bandwidth limit lower than the physical network bandwidth. I believe this (very cool!) fio plugin works around it by using multiple connections.
Interesting! I've never seen the per-flow limit mentioned before. Is that documented somewhere?
I'd be concerned about that if I were getting anywhere close to that limit, but I was experiencing 4-5MBps network throughput. It seemed like individual file operations (like stat) were taking an excessive amount of time.
> In your example of a kernel build, you could run with -j much higher than the number of CPUs, because the compile jobs will be I/O bound on reading input and writing output.
I'm writing output to a local tmpfs (via overlayfs), not to NFS. And I'd love to tune the NFS setup to the point that reads (and stats) from NFS aren't causing a 10x slowdown.
> Related: Is there some option I can pass to make it clear that the data on the server will never change and thus no possible write-to-read or close-to-open consistency issues can arise?
As far as I know the NFS client does not support such a mount option today. I should have mentioned this, but there /is/ a way to eliminate the 'close to open' cache check for repeated open() operations, which is to use NFS delegations. NFS read delegations are supported by both nfsd and the NFS client. They are not perfect, as they are best effort, but can typically keep the core data set of your workload fully local. This would not work for your first build but would work for the second.
Any word on the IOPS numbers vs. the in-kernel NFS client? The throughput is impressive, but IME it ends up being the stat/fd activity of NFS clients that's the limiting factor (try running `ls -l` in an NFS directory with lots of files in it; even worse if there are lots of symlinks involved).
Presumably, with async you could queue all the `stat` ops at once after the dir walk, leaving the total latency in the ballpark of "dir walk" + "ping for file stat" + "ping for remaining symlink stats if present". But I don't think `ls` does this.
Otherwise, yeah, you incur network latency on every file, plus, as you say, symlink "pings" if those are present. So it's "dir walk" + "number of files" * "ping" + "symlinks" * "ping", which adds up.
Batching high-latency ops is one of the only cases where I like async.
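Rough sketch of what I mean (Python, hypothetical path; over a kernel NFS mount each stat can still be its own round trip, but they overlap instead of serializing):

    import asyncio, os

    async def list_with_stats(path):
        names = os.listdir(path)                 # the dir walk: one or a few round trips
        async def stat_one(name):
            full = os.path.join(path, name)
            # fan the stats out concurrently instead of one at a time
            return name, await asyncio.to_thread(os.lstat, full)
        return await asyncio.gather(*(stat_one(n) for n in names))

    for name, st in asyncio.run(list_with_stats("/mnt/nfs/bigdir")):
        print(st.st_size, name)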
NFS version 3 already has the READDIRPLUS operation which returns the directory contents and all of the stats together in one call to improve performance for this case. Sometimes the kernel client doesn't use READDIRPLUS and falls back to issuing a bunch of requests - usually because it's being super conservative about security (as if there is such a thing in NFS). It's also not that straightforward for a utility like ls to tell the kernel that it's reading both the directory and the stats for each file - they're separate system calls and the kernel has to figure out what the program is trying to do and optimise.
I wrote a tool that issues raw READDIRPLUS requests to list a directory:
Wow, thanks! I'm definitely going to try this tool. I have use cases at work where I need to look at files on an NFS server on the other side of the country from my home desktop (don't ask… it's a much bigger PITA to get SSH access to a VM near the NFS server than it should be), and it takes ages to run ls in a directory.
Usually, going via the kernel NFS client will use more memory bandwidth, so I would expect lower per-client numbers. From what I've read, you go from 3 memcopies in userspace to 4 with kernel NFS.
I haven't yet instrumented memory bandwidth on my AMD machines, but it feels like I'm at the limit.
The nice thing about NFS (v3) is that it's stateless, so you can just keep doing the same thing over and over. Messing around with metadata sucks, but once you have the filehandle of the file you want to read or write, you don't need the metadata operations anymore.
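Through an ordinary kernel mount the same idea looks roughly like this (path and chunk size made up): resolve the path once, then reuse the handle for every read.

    import os

    fd = os.open("/mnt/nfs/data/big.bin", os.O_RDONLY)   # lookups happen here, once
    try:
        chunk, offset = 1 << 20, 0
        while True:
            buf = os.pread(fd, chunk, offset)            # plain reads against the same handle
            if not buf:
                break
            offset += len(buf)
    finally:
        os.close(fd)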
I wrote a suite of tools that does all this and dumps the NFS transactions as JSON, maybe it can be useful:
Emily does fantastic work. The task I described above is more of a plumbing test, whereas her post is more for 'real world'. You need both to test your storage, but they don't overlap.
"To make use of multiple NICs I needed multiple NFS connections per NIC."
What? I presume "multiple NFS connections per host" was meant. Not sure what those NFS connections are supposed to be, though. NFSv3 is a (stateless) request/response protocol on top of TCP/IP connections. NFSv4 introduced sessions, but NFSv3 was used here, wasn't it?
I learned nothing about FIO, libnfs, NFS, or your patch.
No feature comparison with other efforts. No benchmark comparison with other efforts.
I wasted my time here.