
Linux file write patterns: So you want to write to a file fast - noqqe
http://blog.plenz.com/2014-04/so-you-want-to-write-to-a-file-real-fast.html
======
tytso
It's 2014; why was the author using ext3 instead of ext4? Ext4 does have
fallocate support.

Also, if you use fallocate(2) instead of posix_fallocate(3), you don't have to
worry about glibc trying to emulate fallocate() for those file systems which
don't use it.
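
A minimal sketch of the direct call (the file name and size are made up for illustration); on a filesystem without support it fails with EOPNOTSUPP instead of falling back to glibc's write-zeroes emulation:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    
    int main(void)
    {
        int fd = open("dump.bin", O_WRONLY | O_CREAT, 0644);
        if (fd == -1) { perror("open"); return 1; }
    
        /* reserve 128 MiB of extents up front; no zero-filling happens */
        if (fallocate(fd, 0, 0, 128 << 20) == -1) {
            if (errno == EOPNOTSUPP)
                fprintf(stderr, "filesystem lacks fallocate support\n");
            else
                perror("fallocate");
        }
        close(fd);
        return 0;
    }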

Finally, it's a little surprising the author didn't try using O_DIRECT writes.

~~~
dekhn
Most people who use O_DIRECT writes stop quickly, thinking it's "slow". What's
actually happening is that you're seeing what the system is _really_ capable
of in terms of write bandwidth, without any of the 'clever' optimizations like
write caching.
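
For anyone curious, a rough sketch of an O_DIRECT write (the 4096-byte alignment is illustrative; the real requirement depends on the device's logical block size):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    
    int main(void)
    {
        /* O_DIRECT requires the buffer, offset and length to be aligned,
           typically to the device's logical block size (512 or 4096) */
        void *buf;
        if (posix_memalign(&buf, 4096, 1 << 20))
            return 1;
        memset(buf, 0xab, 1 << 20);
    
        int fd = open("test.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd == -1) { perror("open"); return 1; }
    
        /* goes to the disk directly, no page cache in between */
        if (write(fd, buf, 1 << 20) == -1)
            perror("write");    /* EINVAL usually means misalignment */
        close(fd);
        return 0;
    }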

~~~
StillBored
I don't think this is accurate. We have a kernel bypass for disk operations.
We use our own memory buffers, and bypass the filesystem, block, and SCSI
midlayers. Our stuff is basically what O_DIRECT should be.

There are cases where we are 50% faster than O_DIRECT without any "caching".
Furthermore, in high-bandwidth applications (>4GB/sec) without O_DIRECT it's
easy to become CPU-limited in the blk/midlayer, so again we win.

Now that said, I haven't tried the latest blk-mq, scsi-mq, etc patches which
are tuned for higher IOP rates. These patches were driven by people plugging
in high performance flash arrays and discovering huge performance issues in
the kernel. Still, I expect if you plug in a couple high end flash arrays the
kernel is going to be the limit rather than the IO subsystem on a modern xeon.

~~~
dekhn
Sure, if you're dealing with extremely high bandwidth apps (4GB/sec is pretty
high bandwidth!) what I said doesn't apply.

The number of people who are sustaining 4GB/sec (on a single machine/device
array) is pretty small, and they have a reason to go beyond the straightforward
approaches the kernel makes available through a simple API (everything you
described, like bypass, puts you in a rare category).

Anyway, when I was swapping to SSD, the kswapd process was using 100% of one
core while swapping at 500MB/sec. I suspect many kernel threads haven't been
CPU-optimized for high throughput.

------
tlb
The code in the second example is wrong. If a write partially succeeds,
instead of writing the remaining part it writes again from the beginning of
the buffer. The resulting file will be incorrect. That doesn't normally happen
on disk writes, but it does when writing to a pipe.
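
A sketch of the corrected loop, keeping the article's assert-style checking (out, buf and r are the article's variables):

    size_t w = 0;
    while (w < r) {
        ssize_t n = write(out, buf + w, r - w);  /* note: buf + w */
        assert(n >= 0);
        w += (size_t)n;       /* resume after the bytes already written */
    }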

------
mtdewcmu
It's a little bit reassuring that there weren't any clear winners and losers.
In a perfect world, the OS and hardware would figure out what your intent is
and carry it out the fastest way possible, right? Ideally, you'd write the
code in the most convenient way and it would run in the most performant way.
Maybe the future is now.

------
rwmj
A bit surprising (considering he started off talking about coredumps) that he
doesn't mention sparse files. Core dumps can be very sparse, and you might
save time and definitely will save space by not writing out the all-zeroes
parts.

~~~
lukesandberg
To do that, wouldn't you have to look at every byte just to detect the runs of
0s? That would mean pulling the whole file through the memory hierarchy of
your system (rather than just passing chunks from syscall to syscall).
Wouldn't that alone slow you down significantly?

~~~
rwmj
It depends. If the data is coming from a pipe (like core_pattern) then yes,
you have to check for runs of zeroes. If it's coming from a filesystem, then
there are system calls that let you skip them (specifically the SEEK_HOLE and
SEEK_DATA whence values of lseek(2)).
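
Rough sketch of that extent walk (error handling elided; needs Linux >= 3.1):

    #define _GNU_SOURCE
    #include <unistd.h>
    
    /* visit each data extent of fd, skipping the holes entirely */
    static void walk_extents(int fd)
    {
        off_t data = lseek(fd, 0, SEEK_DATA);        /* first data byte */
        while (data != (off_t)-1) {
            off_t hole = lseek(fd, data, SEEK_HOLE); /* end of extent */
            /* copy [data, hole) to the output; seeking the output file
               forward instead of writing zeroes leaves a matching hole */
            data = lseek(fd, hole, SEEK_DATA);       /* next extent */
        }
    }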

Also if the data is being copied into userspace anyway, then it's quite fast
to check that memory is zero. There's no C "primitive" for this, but all C
compilers can turn a simple loop into relatively efficient assembler[1].
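
The loop in question is just something like this, and compilers vectorize it reasonably well:

    #include <stddef.h>
    
    /* returns nonzero if all len bytes at p are zero */
    static int all_zeroes(const char *p, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (p[i] != 0)
                return 0;
        return 1;
    }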

If you're using an API that never copies the data into userspace and you have
to read from a pipe, then yes sparse detection will be much more expensive.

In either case it should save disk space for core files which are highly
sparse.

[1] [https://stackoverflow.com/a/1494021](https://stackoverflow.com/a/1494021)

~~~
mtdewcmu
The easiest way to handle things like sparse files correctly is to invoke a
program like GNU dd that already has this feature built in. GNU cp handles it,
too, but it doesn't accept input from stdin.
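
For example (file names made up; GNU dd's conv=sparse seeks over all-zero input blocks instead of writing them out):

    dd if=core.in of=core.out bs=64K conv=sparse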

------
flogic
The first hit for "SSD write speeds" says ~500Mb/s (I hope I got the right b).
I didn't bother clicking the links, just read the blurb under them. He's
dumping 128Mb in ~200 ms. I'm not sure there is much room for improvement.

~~~
dekhn
M _B_. Bytes. Nobody quotes disk speeds in bits (or if they do, I typically
ignore them).

I've frequently observed sustained 500MB/sec writes and reads on my cheap
($250) 250GB SSDs. One of my favorite instances was running out of RAM while
assembling a gigapan in Hugin. I added a swap file on my SSD and continued; it
ran overnight with nearly 500MB/sec reads and writes more or less
continuously, and the job finished fine.

~~~
Phlarp
I weep for the memory sectors that got re-written continuously for an entire
night.

~~~
dekhn
I used to, but given that not a single one of my SSD drives (I have 4 deployed
in my house) has so much as balked once in a year of continuous deployment, I
am cautiously optimistic.

~~~
ef47d35620c1
I've had very good reliability from my SSD drives as well. Some have been
running almost continuously since 2009.

~~~
vacri
I ran a 60GB SSD as my Windows machine's system drive (with pagefile) for four
years before it started showing problems, and that machine saw a few hours use
almost every day. It was >90-95% full for most of that time.

------
dekhn
Am I correct in noting that none of his benchmarks actually timed how long it
took for the data to be durably committed to disk, to the limit that the OS
can report that?

I would never do XFS benchmarks because in my experience, if XFS is writing
during powerdown, it trashes the FS (maybe this has been fixed in the past 6
years, but after it happened 3 times I haven't touched that filesystem again).

~~~
Wilya
One of the most catastrophic failure modes of XFS was fixed in 2007 [0]. Or
at least that's what is said. I never dared touch XFS again after losing a
filesystem to it, so I can't really confirm what it looks like today.

[0]
[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux....](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ba87ea699ebd9dd577bf055ebc4a98200e337542)

------
maxhou
Actually, since none of his tests ask to sync the data onto the disk, he might
just be measuring each method's efficiency at creating dirty pages.

Of course, that depends on the amount of RAM the system has, and on how the
kernel VM parameters are tuned (sysctl vm.dirty_*).

Just add an fdatasync() call and you will take into account the time it takes
to flush all the dirty pages to disk.
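
Sketch, with a made-up file name:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    
    int main(void)
    {
        int fd = open("test.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) { perror("open"); return 1; }
    
        /* ... the write() loop being benchmarked goes here ... */
    
        if (fdatasync(fd) == -1)  /* wait for dirty pages to hit disk */
            perror("fdatasync");
        close(fd);                /* stop the clock only after the sync */
        return 0;
    }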

------
asveikau
Assuming all I/O failures are EINTR is really, really odd, as if to say disks
never fill up or fail and sockets never disconnect.

He does say:

> in a real program you’d have to do real error handling instead of
> assertions, of course

But somebody somewhere is reading this and thinking this is a "semantically
correct pattern" (as it is introduced) and may just copy-paste it into their
program. Especially when contrasting it with a "wrong way" I think it wouldn't
hurt to include real error handling. And that means something that doesn't
fall into an infinite loop when the disk fills up.
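
Something along these lines, as a sketch (the helper name is made up):

    #include <errno.h>
    #include <unistd.h>
    
    /* Write len bytes of buf to fd. Retry only on EINTR; fail on
       everything else (ENOSPC, EIO, EPIPE, ...) instead of spinning. */
    static int write_all(int fd, const char *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(fd, buf + done, len - done);
            if (n == -1) {
                if (errno == EINTR)
                    continue;   /* interrupted by a signal: retry */
                return -1;      /* real error: report it, don't loop */
            }
            done += (size_t)n;  /* short write: resume where we left off */
        }
        return 0;
    }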

~~~
masklinn
> Assuming all I/O failures are EINTR is really, really odd, as if to say
> disks never fill up or fail and sockets never disconnect.

The point is to retry on EINTR and to abort completely in case of other IO
failures.

    
    
        assert(errno == EINTR);
        continue;
    

is equivalent to

    
    
        if (errno == EINTR)
            continue;
        abort();
    

> But somebody somewhere is reading this and thinking this is a "semantically
> correct pattern" (as it is introduced) and may just copy-paste it into their
> program.

Even if they do, it likely will not actually do any harm; it'll just kill the
program instead of gracefully handling the error.

~~~
asveikau
You are wrong. assert is a no-op when NDEBUG is defined. Some compilers will
set that for you in an optimized build.

Using an assert in place of real error checking or otherwise relying on its
side effects is consequently a huge wtf in C.
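
Concretely, sketching with the article's loop: compile with -DNDEBUG and the check vanishes, so any persistent error becomes an infinite loop:

    /* under -DNDEBUG, assert() expands to ((void)0); this loop then
       retries on *every* error and spins forever once the disk fills
       up and write() keeps returning -1 with ENOSPC */
    while (w < r) {
        ssize_t n = write(out, buf + w, r - w);
        if (n == -1) {
            assert(errno == EINTR);  /* compiled out with NDEBUG */
            continue;
        }
        w += n;
    }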

~~~
awda
More like, assert() from assert.h is a huge wtf in C, because turning asserts
off in optimized builds produces exactly these kinds of scenarios.

------
cjensen
The author has failed to account for command latency. If you write some bytes,
there are a bunch of hardware buffering delays in getting those bytes to disk,
including seek and rotational latency.

Async I/O avoids this. You can tell the I/O subsystem what you want to read
next even while doing a write. The I/O is posted to the disk in modern
systems, and the disk will begin seeking to the read site in parallel with
informing the OS that the write has completed. Posting I/O even helps for SSDs
to avoid the idle time on the SSD media between write done and read start.
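
A rough sketch using POSIX AIO, which is one such interface (error handling mostly elided; link with -lrt):

    #include <aio.h>
    #include <string.h>
    
    /* post a write without blocking, so the caller can queue a read
       or build the next buffer while the device works on this one */
    static ssize_t async_write(int fd, void *buf, size_t len, off_t off)
    {
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = off;
        if (aio_write(&cb) == -1)    /* queued; returns immediately */
            return -1;
    
        /* ... issue a read here, overlapping it with the write ... */
    
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);  /* block until it completes */
        return aio_return(&cb);
    }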

~~~
mtdewcmu
I think there is some amount of write-back caching in the kernel, so the
application doesn't have to wait for each individual chunk to go to disk
before it can submit the next chunk. I believe there's a sync on file close or
process termination, or some combination.

------
angry_octet
"For simplicity I’ll try things on my laptop computer with Ext3+dmcrypt and an
SSD. This is “read a 128MB file and write it out”, repeated for different
block sizes, timing each" The whole thing is completely invalid for measuring
actual I/O hierarchy efficiencies because of (a) write sizes too small, would
be in buffer cache of unknown hotness, (b) dmcrypt introduces a whole layer of
indirection and timing variability and (c) on an SSD, almost anything could be
happening regarding cache and syncs. Also, mount options, % disk used, small
sample sizes, unknown contention effects, etc. This is a good example of how
to convince yourself of something and yet be less accurate than a divining
rod.

------
callesgg
I would assume the encryption is such a big overhead that most optimizations
in the upper levels will be useless.

~~~
petermonsson
The Intel AES instructions help a lot. According to Wikipedia, they run at
about 3.5 cycles/byte; assuming a 3 GHz clock, that gives
128MB × 3.5 cycles/byte ÷ 3 GHz ≈ 149 ms. If the write speed is around
500MB/s, as another poster stated, then the disk encryption is probably not
the bottleneck. Still, it would be better to actually measure the performance
without encryption to see if there is any effect.

------
dar8919
The second code snippet looks wrong:

write(out, buf, (r - w)) should be write(out, buf + w, r - w)

------
zobzu
Actually, despite the author's claims, the FS data is heavily RAM-cached,
hence the differences in the tests (especially for a single file write).

