Linux file write patterns: So you want to write to a file fast

tytso · on May 1, 2014

It's 2014; why was the author using ext3 instead of ext4? Ext4 does have fallocate support.

Also, if you use fallocate(2) instead of posix_fallocate(3), you don't have to worry about glibc trying to emulate fallocate() for those file systems which don't use it.
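
A minimal sketch of calling fallocate(2) directly, assuming Linux with glibc. The helper name `preallocate` is hypothetical and not from the article. Unlike posix_fallocate(3), which glibc may emulate by writing to the file, this reports EOPNOTSUPP to the caller, who can then decide what to do.

```c
#define _GNU_SOURCE             /* exposes fallocate(2) in <fcntl.h> on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: reserve `len` bytes for `fd`, starting at offset 0.
 * Returns 0 on success, 1 if the filesystem cannot preallocate (nothing is
 * emulated), and -1 on any other error with errno set by fallocate(2). */
int preallocate(int fd, off_t len)
{
    int rc;

    do {
        rc = fallocate(fd, 0, 0, len);  /* mode 0: allocate and extend the size */
    } while (rc == -1 && errno == EINTR);

    if (rc == 0)
        return 0;
    if (errno == EOPNOTSUPP)
        return 1;   /* caller decides: skip preallocation or fall back */
    return -1;
}
```

A caller that gets 1 back can simply carry on without preallocation, which avoids the hidden cost of an emulated fallback.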

Finally, it's a little surprising the author didn't try using O_DIRECT writes.

dekhn · on May 1, 2014

Most people who use O_DIRECT writes stop quickly, thinking it's "slow". What's actually happening is you're seeing what the system is actually capable of in terms of write bandwidth, without any of the 'clever' optimizations like write caching.
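
For anyone who wants to try this, here is a rough sketch of a single O_DIRECT write on Linux. The helper name `direct_write_block` and the 4096-byte alignment are assumptions: the real requirement depends on the device and filesystem, and some filesystems reject O_DIRECT outright.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DIO_ALIGN 4096          /* assumed alignment; check your device */

/* Hypothetical helper: write one DIO_ALIGN-sized block filled with `fill`
 * to `path`, bypassing the page cache. Returns the number of bytes written,
 * or -1 if allocation, open(2), or write(2) failed. */
ssize_t direct_write_block(const char *path, int fill)
{
    void *buf = NULL;
    ssize_t n;
    int fd;

    /* O_DIRECT generally needs the buffer address, the length, and the
     * file offset to be multiples of the logical block size. */
    if (posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN) != 0)
        return -1;
    memset(buf, fill, DIO_ALIGN);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0) {
        free(buf);
        return -1;              /* e.g. EINVAL where O_DIRECT is unsupported */
    }

    n = write(fd, buf, DIO_ALIGN);
    close(fd);
    free(buf);
    return n;
}
```

Timing a loop of such writes against the same loop without O_DIRECT is one way to see how much of the apparent speed comes from the page cache.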

StillBored · on May 1, 2014

I don't think this is accurate. We have a kernel bypass for disk operations. We use our own memory buffers, and bypass the filesystem, block, and SCSI midlayers. Our stuff is basically what O_DIRECT should be.

There are cases where we are 50% faster than O_DIRECT without any "caching". Furthermore, in high bandwidth applications (>4GB/sec) without O_DIRECT it's easy to become CPU limited in the blk/midlayer so again we win.

Now that said, I haven't tried the latest blk-mq, scsi-mq, etc patches which are tuned for higher IOP rates. These patches were driven by people plugging in high performance flash arrays and discovering huge performance issues in the kernel. Still, I expect if you plug in a couple high end flash arrays the kernel is going to be the limit rather than the IO subsystem on a modern xeon.

dekhn · on May 2, 2014

Sure, if you're dealing with extremely high bandwidth apps (4GB/sec is pretty high bandwidth!) what I said doesn't apply.

The number of people who are sustaining 4GB/sec (on a single machine/device array) is pretty small and they have a reason to go beyond the straightforward approaches the kernel makes available through a simple API (everything you described, like bypass, puts you in a rare category).

Anyway, when I was swapping to SSD, the kswapd process was using 100% of one core while swapping at 500MB/sec. I suspect many kernel threads haven't been CPU-optimized for high throughput.

zobzu · on May 1, 2014

But that's also why his tests are unreliable in this case.

Nican · on May 1, 2014

From a Boston Linux Usergroup discussion: https://www.mail-archive.com/discuss@blu.org/msg08490.html

tlb · on May 1, 2014

The code in the second example is wrong. If a write partially succeeds, instead of writing the remaining part it writes again from the beginning of the buffer. The resulting file will be incorrect. That doesn't normally happen on disk writes, but it does when writing to a pipe.
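
A sketch of the corrected loop, under the assumption that the caller wants all-or-error semantics. `write_all` is a hypothetical name, not the article's code. It resumes after the bytes already written, retries only on EINTR, and returns every other error instead of looping.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all `len` bytes of `buf` to `fd`.
 * Returns 0 on success, -1 on error with errno set by write(2). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte went out: retry */
            return -1;          /* ENOSPC, EIO, EPIPE, ...: report, don't spin */
        }
        done += (size_t)n;      /* short write: continue after what was written */
    }
    return 0;
}
```

Because the error path is an ordinary return rather than an assert(), it behaves the same whether or not NDEBUG is defined.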

mtdewcmu · on May 1, 2014

It's a little bit reassuring that there weren't any clear winners and losers. In a perfect world, the OS and hardware would figure out what your intent is and carry it out the fastest way possible, right? Ideally, you'd write the code the most convenient way and it would run the most performant way. Maybe the future is now.

rwmj · on May 1, 2014

A bit surprising (considering he started off talking about coredumps) that he doesn't mention sparse files. Core dumps can be very sparse, and you might save time and definitely will save space by not writing out the all-zeroes parts.

lukesandberg · on May 1, 2014

To do that, wouldn't you have to look at every byte just to detect the runs of zeroes? That would mean pulling the whole file through the memory hierarchy of your system (rather than just passing chunks from syscall to syscall). Wouldn't that alone slow you down significantly?

rwmj · on May 1, 2014

It depends. If the data is coming from a pipe (like core_pattern) then yes you have to check for runs of zeroes. If it's coming from a filesystem, then there are various system calls that let you skip them (specifically SEEK_HOLE and SEEK_DATA flags of lseek(2)).

Also if the data is being copied into userspace anyway, then it's quite fast to check that memory is zero. There's no C "primitive" for this, but all C compilers can turn a simple loop into relatively efficient assembler[1].

If you're using an API that never copies the data into userspace and you have to read from a pipe, then yes sparse detection will be much more expensive.

In either case it should save disk space for core files which are highly sparse.

[1] https://stackoverflow.com/a/1494021
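
Putting the pieces together for the pipe case: a sketch that reads from any descriptor, checks each chunk with a plain loop, and seeks over all-zero chunks instead of writing them. The names `all_zero` and `copy_sparse` and the 4096-byte chunk size are assumptions, and the output is assumed to be a fresh, seekable regular file.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096              /* assumed chunk size; real block sizes vary */

/* Return 1 if the first `n` bytes at `p` are all zero, else 0. */
static int all_zero(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/* Hypothetical helper: copy `in` to `out` until EOF, leaving holes where
 * the input is all zero. Returns 0 on success, -1 on error (errno set). */
int copy_sparse(int in, int out)
{
    unsigned char buf[CHUNK];
    off_t end;

    for (;;) {
        ssize_t r = read(in, buf, sizeof buf);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            break;                              /* end of input */

        if (all_zero(buf, (size_t)r)) {
            if (lseek(out, r, SEEK_CUR) < 0)    /* skip ahead: leaves a hole */
                return -1;
            continue;
        }
        for (ssize_t w = 0; w < r; ) {
            ssize_t n = write(out, buf + w, (size_t)(r - w));
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            w += n;
        }
    }

    /* A trailing hole only becomes part of the file once the size is set. */
    end = lseek(out, 0, SEEK_CUR);
    if (end < 0)
        return -1;
    return ftruncate(out, end);
}
```

Whether a skipped range actually saves disk blocks depends on the filesystem and on the chunk lining up with its block size, but the file contents are the same either way.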

mtdewcmu · on May 2, 2014

The easiest way to handle things like sparse files correctly is to invoke a program like GNU dd that already has this feature built in. GNU cp handles it, too, but it doesn't accept input from stdin.

dekhn · on May 1, 2014

Right. Sparse files are normally written by applications or kernel threads that specifically know the defined byte ranges, and define new allocated parts of the file. Further, file allocations are probably block-sized, so you would need to ensure the byte regions of blocks were all zero.

This could be done quickly in the kernel. RAID (which does pass the data through multiple transformations) subsystem metrics printed at boot demonstrate that.

flogic · on May 1, 2014

The first hit for "SSD write speeds" says ~500Mb/s (I hope I got the right b). I didn't bother clicking the links; that's just the blurb under it. He's dumping 128Mb in ~200 ms. I'm not sure there is much room for improvement.

dekhn · on May 1, 2014

MB. Bytes. nobody quotes disk speeds in bits (or if they do I typically ignore them).

I've frequently observed sustained 500MB/sec writes and reads on my cheap ($250) 250GB SSDs. One of my favorite instances was running out of RAM while assembling a gigapan in Hugin. I added a swap file on my SSD and continued; it ran overnight with nearly 500MB/sec reads and writes more or less continuously, but the job finished fine.

Phlarp · on May 1, 2014

I weep for the memory sectors that got re-written continuously for an entire night.

pushedx · on May 1, 2014

If the controller on the SSD was working properly those writes were distributed evenly over the flash.

dekhn · on May 1, 2014

I used to, but given not a single one of my SSD drives (I have 4 deployed in my house) has so much as balked once in a year of continuous deployment, I am cautiously optimistic.

ef47d35620c1 · on May 1, 2014

I've had very good reliability from my SSD drives as well. Some have been running almost continuously since 2009.

vacri · on May 2, 2014

I ran a 60GB SSD as my Windows machine's system drive (with pagefile) for four years before it started showing problems, and that machine saw a few hours use almost every day. It was >90-95% full for most of that time.

masklinn · on May 1, 2014

SSD controllers do wear leveling: the blocks a filesystem writes to are virtual and remapped (think virtual memory).

Phlarp · on May 1, 2014

wouldn't an ~8GB page/swap file being continuously rewritten on a 250GB drive still consume a non-negligible number of write cycles over several days / weeks at most?

awda · on May 1, 2014

It depends on how much free space you have on the SSD. But yes, especially because the swap file isn't ssd-aware, you get a high degree of write amplification which wears the disk more than necessary. That being said, newish SSDs can take a beating, even under these kinds of workloads.

_quasimodo · on May 1, 2014

I think swap files are written page-wise, so as long as the start of the page file is aligned, all the writes should be aligned. (assuming memory page size and SSD page size are the same)

awda · on May 1, 2014

Does writes being aligned on page boundaries have anything to do with them being 13 base 2 orders of magnitude smaller than flash erase blocks? An unaligned page means you erase (mostly) one or (very rarely) two flash blocks. An aligned page means you erase one flash block. The biggest amplification is the 1<<13 4k -> 32MB amplification...

dekhn · on May 1, 2014

BTW I already have 16GB RAM on the machine. The swap file was 32GB.

rsynnott · on May 1, 2014

Depends on the SSD. The PCIe SSD in a 2013 Retina MBP can approach 1GB/sec, and of course high-end PCIe server stuff can do better again; you may also have a striped RAID setup.

masklinn · on May 1, 2014

> The first hit for "SSD write speeds" says ~500Mb/s (I hope I got the right b).

Nope, it's MB not Mb.

dekhn · on May 1, 2014

Am I correct in noting that none of his benchmarks actually timed how long it took for the data to be durably committed to disk, to the limit that the OS can report that?

I would never do XFS benchmarks because in my experience if XFS is writing during powerdown, it trashes the FS (maybe this was fixed in the past 6 years, but after it happened 3 times I haven't touched the OS again).

Wilya · on May 1, 2014

One of the most catastrophic failure modes of XFS was fixed in 2007 [0]. Or at least that's what is said. I never dared touch XFS again after losing a fs to it, so I can't really confirm what it looks like today.

[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux....

maxhou · on May 1, 2014

Actually, since none of his tests ask to sync the data onto the disk, he might just be measuring each method's efficiency in creating dirty pages.

of course that depends on the amount of RAM the system has, and how the kernel VM parameters are tuned (sysctl vm.dirty_*)

just add a fdatasync() call and you will take into account the time it takes to flush all dirty pages into the disk.
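
A sketch of what such a measurement could look like, with the clock stopped only after fdatasync(2) returns. The helper name `timed_durable_write` is hypothetical, and a real benchmark would repeat the run and control the cache state.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes of `buf` to `fd`, flush them, and
 * return the elapsed time in seconds including the flush.
 * Returns a negative value on any error. */
double timed_durable_write(int fd, const void *buf, size_t len)
{
    struct timespec t0, t1;
    const char *p = buf;
    size_t done = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1.0;
        }
        done += (size_t)n;
    }

    /* Without this call the timer mostly measures copying into dirty pages. */
    if (fdatasync(fd) != 0)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Comparing this figure with the same loop minus the fdatasync() call gives a rough idea of how much work was still sitting in the page cache.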

on May 1, 2014

[deleted]

nkurz · on May 1, 2014

> write() guarantees that the bits have been sent to the disk, and the disk reports that they have been written (or, if a nonvolatile cache is available, it is in the cache).

At least for Linux, I think that's dangerously untrue. On my machine, 'man write' even includes an explicit warning:

    A successful return from write() does not make any
    guarantee that data has been committed to disk.
    In fact, on some buggy implementations, it does not
    even guarantee that space has successfully been reserved
    for the data. The only way to be sure is to call
    fsync(2) after you are done writing all your data.

There are ways to configure a file system so this is not the case, but they are rare. Is there a reference you could point to that would clarify what you are saying?

nhaehnle · on May 1, 2014

I am >99% confident that this is false for default setups. write() itself may return even before a command has been sent to the physical disk, and definitely before the disk reports that it is done. Waiting for the disk on every write() would kill the performance for most programs.

asveikau · on May 1, 2014

Assuming all I/O failures are EINTR is really, really odd, as if to say disks never fill up or fail and sockets never disconnect.

He does say:

> in a real program you’d have to do real error handling instead of assertions, of course

But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program. Especially when contrasting it with a "wrong way" I think it wouldn't hurt to include real error handling. And that means something that doesn't fall into an infinite loop when the disk fills up.

masklinn · on May 1, 2014

> Assuming all I/O failures are EINTR is really, really odd, as if to say disks never fill up or fail and sockets never disconnect.

The point is to retry on EINTR and to abort completely in case of other IO failures.

    assert(errno == EINTR);
    continue;

is equivalent to

    if (errno == EINTR)
        continue;
    abort();

> But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program.

Even if they do, it likely will not actually do any harm; it'll just kill the program instead of gracefully handling the error.

asveikau · on May 1, 2014

You are wrong. assert is a no-op when NDEBUG is defined. Some compilers will set that for you in an optimized build.

Using an assert in place of real error checking or otherwise relying on its side effects is consequently a huge wtf in C.

awda · on May 1, 2014

More like, assert() from assert.h is a huge wtf in C, because turning asserts off in optimized builds produces exactly these kinds of scenarios.

maxhou · on May 1, 2014

ENOSPC ?

masklinn · on May 1, 2014

Is something the author very specifically noted he does not care about in his examples a bit later:

> I don’t care about a “disk full” that I could catch and act on

Niten · on May 1, 2014

> But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program.

I dare say that would be their fault for blindly copying and pasting without taking the time to understand the context. (He even gives an explicit disclaimer!) Robust error handling would just be more noise to filter through for people actually reading the article, and I don't think it's the author's responsibility to childproof things for people who aren't.

asveikau · on May 1, 2014

I'd agree, except that I've seen too many examples where people, particularly those who still have things to learn, cite blog snippets as authoritative. IMO we have something of a duty to those people to get it right. In this case it's not a lot of effort to get it right. Relying on the side-effects of an assert() is not getting it right.

The fact that I got a reply based on a misunderstanding of how asserts work tells me it's a point that needs to be made.

nkurz · on May 1, 2014

> The fact that I got a reply based on a misunderstanding of how asserts work tells me it's a point that needs to be made.

I'm just a bystander, but I think you may be jumping to unfounded conclusions here. Based on previous comment history, I presume that 'masklinn' understands perfectly well how assert() works. Yes, if you define NDEBUG your error handling will go away. So don't define NDEBUG unless you want your error handling to go away!

By contrast, your assertion that "Some compilers will set that for you in an optimized build" strikes me as unlikely. Some program-specific build systems do this, and if you use one of them you should be aware that your assert() functions may drop out. But I don't think I've ever used a compiler that drops the assert() based on optimization level.

I don't particularly disagree with your conclusion, just your argument. I think 'awda' gets closer to the truth: the default assert() from <assert.h> with its negative reliance on NDEBUG is tricky and probably best avoided -- not just for error handling but altogether. Personally, I use two distinct macros: ERROR_ASSERT() and DEBUG_ASSERT(). ERROR_ASSERT() cannot be disabled, and DEBUG_ASSERT() only runs if DEBUG was defined at compile time.
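
A guess at what such a pair of macros might look like. The comment gives only the names, so the bodies below are an assumption: ERROR_ASSERT() is unaffected by NDEBUG, and DEBUG_ASSERT() exists only when DEBUG is defined at compile time.

```c
#include <stdio.h>
#include <stdlib.h>

/* Always-on check: not affected by NDEBUG or the optimization level. */
#define ERROR_ASSERT(cond)                                          \
    do {                                                            \
        if (!(cond)) {                                              \
            fprintf(stderr, "%s:%d: check failed: %s\n",            \
                    __FILE__, __LINE__, #cond);                     \
            abort();                                                \
        }                                                           \
    } while (0)

/* Debug-only check: compiled out unless DEBUG is defined, so the
 * condition must not have side effects the program relies on. */
#ifdef DEBUG
#define DEBUG_ASSERT(cond) ERROR_ASSERT(cond)
#else
#define DEBUG_ASSERT(cond) ((void)0)
#endif
```

With this split, something like `ERROR_ASSERT(errno == EINTR);` keeps aborting in a release build, which is the behavior the retry loop discussed above was relying on.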

asveikau · on May 1, 2014

> your assertion that Some compilers will set that for you in an optimized build strikes me as unlikely

Uhh, I didn't make it up. I remember now what I was thinking of: the defaults for Visual Studio (not the compiler, the IDE) are to have -DNDEBUG in release mode. So lots of Windows projects end up having it without the authors explicitly asking.

(I thought I also once used a machine, maybe some obscure Unix, where cc would add it if you specified -O. I don't remember the details of that, or if I might be confusing it with what VS did.)

FWIW I don't think it's weird that assert has this quirk, I think some people in this discussion just disagree about what an assert is. If you think of it as an extra debug check that might not be evaluated and should not have side effects, and are fine with that conceptually, no problems.

cjensen · on May 1, 2014

The author has failed to account for command latency. If you write some bytes, there are a bunch of hardware buffering delays in getting bytes to disk including seek and rotational latency.

Async I/O avoids this. You can tell the I/O subsystem what you want to read next even while doing a write. The I/O is posted to the disk in modern systems, and the disk will begin seeking to the read site in parallel with informing the OS that the write has completed. Posting I/O even helps for SSDs to avoid the idle time on the SSD media between write done and read start.

mtdewcmu · on May 1, 2014

I think there is some amount of write back caching in the kernel so that the application doesn't have to wait for each individual chunk to go to disk before it can submit the next chunk. I believe there's a sync on either file close or process termination, or some combination.

angry_octet · on May 2, 2014

"For simplicity I’ll try things on my laptop computer with Ext3+dmcrypt and an SSD. This is “read a 128MB file and write it out”, repeated for different block sizes, timing each" The whole thing is completely invalid for measuring actual I/O hierarchy efficiencies because of (a) write sizes too small, would be in buffer cache of unknown hotness, (b) dmcrypt introduces a whole layer of indirection and timing variability and (c) on an SSD, almost anything could be happening regarding cache and syncs. Also, mount options, % disk used, small sample sizes, unknown contention effects, etc. This is a good example of how to convince yourself of something and yet be less accurate than a divining rod.

callesgg · on May 1, 2014

I would assume the encryption is such a big overhead that most optimizations in the upper levels will be useless.

petermonsson · on May 1, 2014

The Intel AES instructions help a lot. According to Wikipedia we use 3.5 cycles/byte. That gives us 128*3.5/3 = 149 ms (128 MB at 3.5 cycles/byte on a 3 GHz core). If the write speed is around 500MB/s as another poster stated, then the disk encryption is probably not a bottleneck. Still, it is better to actually measure the performance without encryption to see if there is any effect.

dar8919 · on May 1, 2014

The second code snippet looks wrong,

write(out, buf, (r - w)) should be write(out, buf + w, r - w)

zobzu · on May 1, 2014

Actually, despite the author's claims, the fs stuff is RAM-cached a lot, hence the differences in the tests (especially for a single-file write).