Also, if you use fallocate(2) instead of posix_fallocate(3), you don't have to worry about glibc trying to emulate fallocate() for those file systems which don't support it.
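For example, something like this rough sketch (Linux-specific; the length is arbitrary and error handling is minimal) lets you see the EOPNOTSUPP yourself instead of having glibc paper over it with a write-zeroes fallback:

    /* Sketch: preallocate space with the Linux-specific fallocate(2).
     * Unlike posix_fallocate(3), glibc will not fall back to writing
     * zeroes when the filesystem lacks support; you get EOPNOTSUPP
     * and can decide for yourself what to do. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <errno.h>
    #include <stdio.h>

    static int preallocate(int fd, off_t len)
    {
        if (fallocate(fd, 0, 0, len) == 0)
            return 0;
        if (errno == EOPNOTSUPP) {
            /* Filesystem doesn't support fallocate; handle it explicitly
             * instead of letting glibc silently emulate it. */
            fprintf(stderr, "fallocate not supported here\n");
            return -1;
        }
        return -1;  /* some other error (ENOSPC, EBADF, ...) */
    }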
Finally, it's a little surprising the author didn't try using O_DIRECT writes.
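For reference, an O_DIRECT write looks roughly like this on Linux (a sketch: 4096 is an assumed alignment, in practice you'd query the device's logical block size, and error handling is minimal):

    /* Sketch: opening and writing with O_DIRECT.  Buffer address, file
     * offset and length generally all have to be aligned to the logical
     * block size (4096 here is an assumption). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGN 4096

    int main(void)
    {
        int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        void *buf;
        if (posix_memalign(&buf, ALIGN, ALIGN) != 0)
            return 1;
        memset(buf, 'x', ALIGN);

        /* Each write must be a multiple of the alignment. */
        if (write(fd, buf, ALIGN) != ALIGN)
            return 1;

        free(buf);
        close(fd);
        return 0;
    }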
There are cases where we are 50% faster than O_DIRECT without any "caching". Furthermore, in high bandwidth applications (>4GB/sec) without O_DIRECT it's easy to become CPU limited in the blk/midlayer, so again we win.
Now that said, I haven't tried the latest blk-mq, scsi-mq, etc. patches which are tuned for higher IOP rates. These patches were driven by people plugging in high performance flash arrays and discovering huge performance issues in the kernel. Still, I expect if you plug in a couple of high-end flash arrays the kernel is going to be the limit rather than the IO subsystem on a modern Xeon.
The number of people who are sustaining 4GB/sec (on a single machine/device array) is pretty small, and they have a reason to go beyond the straightforward approaches the kernel makes available through a simple API (everything you described, like bypass, puts you in a rare category).
Anyway, when I was swapping to SSD, the kswapd process was using 100% of one core while swapping at 500MB/sec. I suspect many kernel threads haven't been CPU-optimized for high throughput.
Also if the data is being copied into userspace anyway, then it's quite fast to check that memory is zero. There's no C "primitive" for this, but all C compilers can turn a simple loop into relatively efficient assembler.
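Something like this rough sketch (block size and function names are arbitrary) is all it takes; if the block is zero, skip the write and seek forward instead, leaving a hole:

    /* Sketch: detect an all-zero block in userspace and skip the write,
     * leaving a hole in the output file instead. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <unistd.h>

    static bool all_zero(const char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (buf[i] != 0)
                return false;
        return true;
    }

    /* Either write the block or seek past it. */
    static int write_or_skip(int fd, const char *buf, size_t len)
    {
        if (all_zero(buf, len))
            return lseek(fd, (off_t)len, SEEK_CUR) < 0 ? -1 : 0;
        return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
    }

If the file ends in zero blocks you'd still need an ftruncate() to the full length at the end, since seeking past EOF without writing doesn't extend the file.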
If you're using an API that never copies the data into userspace and you have to read from a pipe, then yes sparse detection will be much more expensive.
In either case it should save disk space for core files which are highly sparse.
This could be done quickly in the kernel. The throughput numbers the RAID subsystem (which does pass the data through multiple transformations) prints at boot demonstrate that.
I've frequently observed sustained 500MB/sec writes and reads on my cheap ($250) 250GB SSDs. One of my favorite instances was running out of RAM while assembling a gigapan in Hugin. I added a swap file on my SSD and continued; it ran overnight with nearly 500MB/sec reads and writes more or less continuously, but the job finished fine.
Nope, it's MB not Mb.
I would never do XFS benchmarks because in my experience, if XFS is writing during a powerdown, it trashes the FS (maybe this was fixed in the past 6 years, but after it happened 3 times I haven't touched the OS again).
of course that depends on the amount of RAM the system has, and how the kernel VM parameters are tuned (sysctl vm.dirty_*)
just add an fdatasync() call and you will take into account the time it takes to flush all dirty pages to the disk.
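Roughly like this sketch (error handling elided, and the file descriptor and buffer are placeholders):

    /* Sketch: bracket the benchmark with fdatasync() so the timing
     * includes flushing dirty pages, not just copying into the page
     * cache. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    void timed_write(int fd, const char *buf, size_t len)
    {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);

        write(fd, buf, len);   /* lands in the page cache */
        fdatasync(fd);         /* ...and is now actually on the device */

        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("%.3f s\n", elapsed(start, end));
    }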
At least for Linux, I think that's dangerously untrue. On my machine, 'man write' even includes an explicit warning:
A successful return from write() does not make any
guarantee that data has been committed to disk.
In fact, on some buggy implementations, it does not
even guarantee that space has successfully been reserved
for the data. The only way to be sure is to call
fsync(2) after you are done writing all your data.
He does say:
> in a real program you’d have to do real error handling instead of assertions, of course
But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program. Especially when contrasting it with a "wrong way" I think it wouldn't hurt to include real error handling. And that means something that doesn't fall into an infinite loop when the disk fills up.
The point is to retry on EINTR and to abort completely in case of other IO failures; something like the sketch below.
assert(errno == EINTR);
if (errno == EINTR)
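A minimal sketch (not the author's code) of what that looks like, which also handles short writes and reports everything else, including ENOSPC when the disk fills, to the caller:

    /* Sketch: a full-write loop with error handling instead of an assert. */
    #include <errno.h>
    #include <unistd.h>

    static int write_all(int fd, const char *buf, size_t count)
    {
        size_t written = 0;
        while (written < count) {
            ssize_t r = write(fd, buf + written, count - written);
            if (r < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted: just retry */
                return -1;         /* real failure: let the caller decide */
            }
            written += (size_t)r;
        }
        return 0;
    }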
Even if they do, it likely will not actually do any harm; it'll just kill the program instead of handling the error gracefully.
Using an assert in place of real error checking or otherwise relying on its side effects is consequently a huge wtf in C.
> I don’t care about a “disk full” that I could catch and act on
I dare say that would be their fault for blindly copying and pasting without taking the time to understand the context. (He even gives an explicit disclaimer!) Robust error handling would just be more noise to filter through for people actually reading the article, and I don't think it's the author's responsibility to childproof things for people who aren't.
The fact that I got a reply based on a misunderstanding of how asserts work tells me it's a point that needs to be made.
I'm just a bystander, but I think you may be jumping to unfounded conclusions here. Based on previous comment history, I presume that 'masklinn' understands perfectly well how assert() works. Yes, if you define NDEBUG your error handling will go away. So don't define NDEBUG unless you want your error handling to go away!
By contrast, your assertion that "some compilers will set that for you in an optimized build" strikes me as unlikely. Some program-specific build systems do this, and if you use one of them you should be aware that your assert() calls may drop out. But I don't think I've ever used a compiler that drops the assert() based on optimization level.
I don't particularly disagree with your conclusion, just your argument. I think 'awda' gets closer to the truth: the default assert() from <assert.h> with its negative reliance on NDEBUG is tricky and probably best avoided -- not just for error handling but altogether. Personally, I use two distinct macros: ERROR_ASSERT() and DEBUG_ASSERT(). ERROR_ASSERT() cannot be disabled, and DEBUG_ASSERT() only runs if DEBUG was defined at compile time.
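A minimal version of those two macros might look something like this (the behaviour on failure is just one choice; adjust to taste):

    /* Sketch: ERROR_ASSERT() always fires; DEBUG_ASSERT() compiles away
     * unless DEBUG is defined at compile time. */
    #include <stdio.h>
    #include <stdlib.h>

    #define ERROR_ASSERT(cond)                                     \
        do {                                                       \
            if (!(cond)) {                                         \
                fprintf(stderr, "%s:%d: assertion failed: %s\n",   \
                        __FILE__, __LINE__, #cond);                \
                abort();                                           \
            }                                                      \
        } while (0)

    #ifdef DEBUG
    #define DEBUG_ASSERT(cond) ERROR_ASSERT(cond)
    #else
    #define DEBUG_ASSERT(cond) ((void)0)
    #endif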
Uhh, I didn't make it up. I remember now what I was thinking of: the defaults for Visual Studio (not the compiler, the IDE) are to have -DNDEBUG in release mode. So lots of Windows projects end up having it without the authors explicitly asking.
(I thought I also once used a machine, maybe some obscure Unix, where cc would add it if you specified -O. I don't remember the details of that, or if I might be confusing it with what VS did.)
FWIW I don't think it's weird that assert has this quirk, I think some people in this discussion just disagree about what an assert is. If you think of it as an extra debug check that might not be evaluated and should not have side effects, and are fine with that conceptually, no problems.
Async I/O avoids this. You can tell the I/O subsystem what you want to read next even while doing a write. The I/O is posted to the disk in modern systems, and the disk will begin seeking to the read site in parallel with informing the OS that the write has completed. Posting I/O even helps for SSDs to avoid the idle time on the SSD media between write done and read start.
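With POSIX AIO that looks roughly like this sketch (io_uring or libaio would be the more modern Linux routes, but the idea is the same; the file descriptors, buffers, and length are placeholders):

    /* Sketch: queue a read on one fd while a write on another is still
     * in flight, using POSIX AIO. */
    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    int overlap_io(int in_fd, int out_fd, char *rbuf, char *wbuf, size_t len)
    {
        struct aiocb rd, wr;
        memset(&rd, 0, sizeof rd);
        memset(&wr, 0, sizeof wr);

        wr.aio_fildes = out_fd;
        wr.aio_buf    = wbuf;
        wr.aio_nbytes = len;

        rd.aio_fildes = in_fd;
        rd.aio_buf    = rbuf;
        rd.aio_nbytes = len;

        if (aio_write(&wr) < 0 || aio_read(&rd) < 0)
            return -1;

        /* Both requests are now posted; the disk can start seeking for
         * the read without waiting for the write-completion round trip. */
        const struct aiocb *list[2] = { &wr, &rd };
        while (aio_error(&wr) == EINPROGRESS || aio_error(&rd) == EINPROGRESS)
            aio_suspend(list, 2, NULL);

        return (aio_return(&wr) < 0 || aio_return(&rd) < 0) ? -1 : 0;
    }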
write(out, buf, (r - w)) should be write(out, buf + w, r - w)