Hacker News
Files Are Hard (danluu.com)
451 points by pyb on Dec 13, 2015 | 151 comments

Pretty much spot on. Local in-kernel file systems are hard, partly because of their essential nature and partly because of their history. A lot of the codebases involved still show their origins on single-core systems and pre-NCQ SATA disks, and the development/testing methods are from the same era. The developers always have time to improve the numbers on some ancient micro-benchmark, but new features often get pushed to the LVM layer (snapshots), languish for ages (unions/overlays), or are simply ignored (anything alternative to the fsync sledgehammer).

The only way a distributed file system such as I work on can provide sane behavior and decent performance to our users is to use local file systems only for coarse-grained space allocation and caching. Sometimes those magic incantations from ten-year-old LKML posts don't really work, because they were never really tested for more than a couple of simple cases. Other times they have unexpected impacts on performance or space consumption. Usually it's easier and/or safer just to do as much as possible ourselves. Databases - both local and distributed - are in pretty much the same boat.

Some day, I hope, all of this scattered and repeated effort will be combined into a common library that Does All The Right Things (which change over time) and adds features with a common API. It's not quite as good as if the stuff in the kernel had been done right, but I think it's the best we can hope for at this point.

What distributed filesystem do you work on?

I'm on the Gluster team.

Could you provide any recommendations on reading materials for

1. File systems (single machine)

2. Distributed file systems

AIUI, ZFS was explicitly designed to deal with this sort of data corruption - one of the descriptions of the design I've heard is "read() will return either the contents of a previous successful write() or an error". That would (in principle) prevent the file containing "a boo" or "a far" at any point.

It looks like one of the authors cited in this article has written a paper analysing ZFS - though they admittedly don't test its behaviour on crashes. Citation here, in PDF form:


(edited to add: This only deals with the second part of this article. The first part would still be important even on ZFS)

Right, copy-on-write filesystems (ZFS, Btrfs) are explicitly designed to prevent that kind of corruption by never editing blocks in place, but rather copying the contents to a new block and using a journaled metadata update to point the file at its new block.

ZFS also includes features around checksumming of the metadata. "Silent" write errors become loud the next time data is accessed and the checksums don't match. This can't prevent all errors, but has some very nice data integrity properties - combined with its RAID format, you can likely recover from most any failure, and with RAIDZ2, you can recover from scattered failures across all drives even if one drive has completely died. This is actually fairly common - modern drives are very large, and rust is more susceptible to 'cosmic rays' than one might think.

There is an easy way to write data without corruption. First, copy your file-to-be-changed to a temporary file, or create a new temporary file. Then modify the temporary file and write whatever you want into it. Finally, use rename() to atomically replace the old file with the temporary one.

The same logic also applies to directories, although you will have to use links or symlinks to have something truly atomic.

It may not work on strangely configured systems, like if your files are spread over different devices over the network (or maybe with NFS). But in those cases you will be able to detect it if you check the errors returned by rename() and co. (and you should check them, of course). So no silver bullet here, but still a good shot.

I'm surprised rename() wasn't mentioned in the article, it's a well known technique to atomically update a file, which is very practical for small-ish files.

Note that in the general case, you should fsync() the temporary file before you rename() it over the original - but ext3 and ext4 in writeback mode added a heuristic to do that automatically, because ext3 in the default ordered mode would effectively do that and many applications came to assume it.

rename is atomic, but it is not guaranteed to be durable. In order for rename to be durable, I've learned that you have to fsync the parent directory.

I was saddened when I learned this. I used to use this old trick for a lot of my systems. I learned it from reading the code of old well-written unix systems programs.
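Putting the whole thread's advice together, the durable atomic-replace recipe looks roughly like this in Python (a sketch; the helper name is mine, and error handling is minimal):

```python
import os

def atomic_replace(path: str, data: bytes) -> None:
    """Atomically and durably replace `path` with `data`.

    Sketch of the temp-file + rename pattern discussed in this thread;
    the helper name is mine, not from any standard library.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    tmp = os.path.join(dirname, ".tmp." + os.path.basename(path))
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)       # flush file data BEFORE the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)   # atomic on POSIX filesystems
    # For durability (not just atomicity), fsync the parent directory
    # so the rename itself survives a crash.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Readers either see the complete old contents or the complete new contents, never a mix; without the final directory fsync, the rename itself may still be lost on power failure.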

I guess that's not viable if your files are huge monolithic databases.

It also doesn't work if you want to support a non-trivial update rate, or if there's any possibility of contention with another process trying to do the same thing. It's the sort of thing that app writers get addicted to, because it does work in the super-easy case, but it doesn't help at all when you need to solve harder storage problems than updating a single config file once in a while.

"How is it that desktop mail clients are less reliable than gmail...?"

Made me chuckle. I've been told off by a former Googler colleague enough times now to have learned that Gmail is more complex than anyone imagines on a first guess, in order to be "reliable".

It is certainly the Google service that I use the most. In a decade of quite heavy usage I remember one outage of 1-2 hours (with no data loss). To me this is the gold standard that the rest of us should aspire to. :)

Lately (last year or so) I've started to notice substantial data loss. Either old mails completely missing or large mails being truncated (destroying inline images f.ex.)

So to anyone relying on gmail for safe keeping: Don't.

> Lately (last year or so) I've started to notice substantial data loss. Either old mails completely missing or large mails being truncated (destroying inline images f.ex.)

Please write to support and ask them to investigate. I used to work on the Gmail backend team, so I know bug reports regularly make it through support to an engineer. I also know they take data integrity quite seriously and have a variety of tools to investigate (potential) problems, both proactively and in response to user complaints such as yours. They also keep redundant copies of everything.

No offense (I have no idea who you are and how much you know), but...

Do you feel you are competent enough (in e.g. SMTP and MIME) to distinguish between MIME-encoded base64 inline images (super rare) and references to images on external web sites (a lot more common)? Which, in the case of old web sites, are quite likely to stop working if you revisit old mails.

Did you make the effort to verify that it was the first type when you witnessed this?

You really could have just worded that as, 'are you sure those images weren't linked to external websites' without resorting to questioning his competence and trying to cover it with a 'no offense' (laff).

That's a good point - that would have been less loaded. (The question remains the same though, doesn't it?)

Also - your suggestion would have been a lot less douchey without that "laff".

Actually, yes. However, to be fair, it didn't extend further than noticing that I couldn't rescue it by copying over the data, because when viewing the "original" the base64 text simply stopped. Whether the data was lost on the server, or simply never transmitted for other reasons (size limit in the UI?), I don't know.

It was inlined by means of a data URL in an img tag, in case you want the details ;)

And no offense taken. It's a fair question.

I get the opposite problem: stuff I delete periodically comes back. Google has less incentive to fix that due to their business model. ;)

Well, they are pretty open that they don't delete it anyway.

Keeping your "deleted" stuff hidden wouldn't hurt them.

> they don't delete it anyway


Interesting that none of the cited software uses maildir.

Breaking a mbox is an extremely simple thing, as the format leaves no possibility of error checking, parallel writing, rewriting lost data, or anything else.

Outlook's mail folders are marginally better, allowing for error detection, but really, that's a lame first paragraph for introducing a great article.

I've recently been playing with nbdkit, which is basically FUSE but for block devices rather than file systems.

I was shocked to discover that mke2fs doesn't check the return value of its final fsync call. This is compounded by the fact that pwrite calls don't fail across NBD (the writes are cached, so the caller's stack is long gone by the time they get flushed across the network and fail...)

As a test, I created an nbdkit plugin which simply throws away every write. Guess what? mke2fs will happily create a file system on such a block device and not report failure. You only discover a problem when you try to mount the file system.

The article's table of filesystem semantics is missing at least one X: Appends on ext4-ordered are not atomic. When you append to a file, the file size (metadata) and content (data) must both be updated. Metadata is flushed every 5 seconds or so, data can sit in cache for more like 30. So the file size update may hit disk 25s before the data does, and if you crash during that time, then on recovery you'll find the data has a bunch of zero bytes appended instead of what you expected.

(I like to test my storage code by running it on top of a network block device and then simulating power failure by cutting the connection randomly. This is one problem I found while doing that.)

Wow, 5 and 30 seconds before metadata and data flush? It sounds unbelievably long. If it's true, almost every power loss results in a data loss of whatever was written in the last 15 seconds, on average? Is it so bad?

I'd expect more "smartness" of Linux, like, as soon as there is no "write pressure" to flush earlier.

> If it's true, almost every power loss results in a data loss of whatever was written in the last 15 seconds, on average? Is it so bad?

No, because correct programs use sync() and/or fsync() to force timely flushes.

A good database should not reply successfully to a write request until the write has been fully flushed to disk, so that an "acknowledged write" can never be lost. Also, it should perform write and sync operations in such a sequence that it cannot be left in a state where it is unable to recover -- that is, if a power outage happens during the transaction, then on recovery the database is either able to complete the transaction or undo it.

The basic way to accomplish this is to use a journal: each transaction is first appended to the journal and the journal synced to disk. Once the transaction is fully on-disk in the journal, then the database knows that it cannot "forget" the transaction, so it can reply successfully and work on updating the "real" data at its leisure.

Of course, if you're working with something that is not a database, then who knows whether it syncs correctly. (For that matter, even if you're working with a database, many have been known to get it wrong, sometimes intentionally in the name of performance. Be sure to read the docs.)

For traditional desktop apps that load and save whole individual files at a time, the "write to temporary then rename" approach should generally get the job done (technically you're supposed to fsync() between writing and renaming, but many filesystems now do this implicitly). For anything more complicated, use sqlite or a full database.
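The journal idea described above can be sketched in a few lines of Python (the length+CRC framing format here is illustrative, not any real database's):

```python
import os, struct, zlib

def journal_append(fd: int, payload: bytes) -> None:
    """Append one length+CRC framed record and force it to disk.

    Only after fsync() returns may the database acknowledge the write.
    """
    header = struct.pack("<II", len(payload), zlib.crc32(payload))
    os.write(fd, header + payload)
    os.fsync(fd)

def journal_replay(path: str):
    """Yield intact records, stopping at the first torn/corrupt one.

    On recovery after a crash, a partially-written final record fails
    the length or CRC check and is simply discarded.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return  # clean end of journal (or truncated header)
            length, crc = struct.unpack("<II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                return  # torn write from a crash: discard the tail
            yield payload
```

A real journal also needs periodic checkpointing and truncation, but the invariant is the same: a transaction is either fully in the journal or ignored.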

> I'd expect more "smartness" of Linux, like, as soon as there is no "write pressure" to flush earlier.

Well, this would only mask bugs, not fix them -- it would narrow the window during which a failure causes loss. Meanwhile it would really harm performance in a few ways.

When writing a large file to disk sequentially, the filesystem often doesn't know in advance how much you're going to write, but it cannot make a good decision on where to put the file until it knows how big it will be. So filesystems implement "delayed allocation", where they don't actually decide where to put the file until they are forced to flush it. The longer the flush time, the better. If we're talking about a large file transfer, the file is probably useless if it isn't fully downloaded yet, so flushing it proactively would be pointless.

Also flushing small writes rather than batching might mean continuously rewriting the same sector (terrible for SSDs!) or consuming bandwidth to a network drive that is shared with other clients. Etc.

> this would only mask bugs, not fix them

If I get a corruption once in 100 outages instead of on every one, I'm satisfied. That it "masks" anything is not an argument at all.

The writes happen in bursts. The behavior of a burst won't change if, for example, one more flush is done a second after the burst is over instead of waiting 30.

The "delayed allocation" is a red herring: in the optimal case, the software can instruct the filesystem to preallocate the whole file size without having to actually fill the content. If it's not common by some specific applications on Linux, that's the place to fix it.
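For what it's worth, that preallocation mechanism does exist on Linux as fallocate(2), exposed in Python as os.posix_fallocate (a sketch; the filename and size here are arbitrary):

```python
import os

# Preallocate space for a file of known final size before writing it,
# so the filesystem can pick a contiguous extent up front (fallocate(2)).
# Linux-specific; the name and size are just for illustration.
fd = os.open("download.part", os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.posix_fallocate(fd, 0, 16 * 1024 * 1024)  # reserve 16 MiB now
    # ... write the actual content as it arrives ...
finally:
    os.close(fd)
```

Note that posix_fallocate can fail on filesystems without native fallocate support, where glibc falls back to writing zeros.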

I worked at a storage company and the scariest thing I learned is that your data can be corrupt even though the drive itself says that the data was written correctly. The only way to really be sure is to check your files after writing them that they match. Now whenever I do a backup, I always go through them one more time and do a byte-by-byte comparison before being assured that it's okay.
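A minimal version of that read-back verification, as I'd sketch it in Python (streaming hashes rather than a literal byte-by-byte loop; the helper name is mine):

```python
import hashlib

def files_match(path_a: str, path_b: str, chunk: int = 1 << 20) -> bool:
    """Verify a backup by comparing streaming SHA-256 digests.

    Reads each file in 1 MiB chunks so arbitrarily large files fit in
    memory; for a literal byte-by-byte compare you could diff the
    chunks directly instead of hashing.
    """
    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.digest()
    return digest(path_a) == digest(path_b)
```

One caveat, raised further down this thread: a read-back may be served from cache rather than the medium, so this catches many but not all corruption modes.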

This is true. Which is why we really, really need checksummed filesystems. I am very worried that this hasn't made its way into mainstream computing yet, especially given the growing drive sizes and massive CPU speed increases.

I run a 10x3TB ZFS raidz2 array at home. I've seen 18 checksum errors at the device level in the last year - these are corruption from the device that ZFS detected with a checksum, and was able to correct using redundancy. If you're not checksumming at some level in your system, you should be outsourcing your storage to someone else; consumer level hardware with commodity file systems isn't good enough.

You know, I'm not really sure I buy this. I worked for a storage company in the past, and I put a simple checksumming algorithm in our code sort of like the zfs one. Turns out that two or three obscure software bugs later that thing stopped firing randomly, and started picking out kernel bugs. Once we nailed a few of those the errors became "harder". By that I mean that, we stopped getting data that the drives claimed was good but we thought was bad.

Modern drives are ECC'ed to hell and back; on enterprise systems (aka ones with ECC RAM and ECC'ed buses) a sector that comes back bad is likely the result of a software/firmware bug somewhere, and in many cases was written bad (or simply not written at all).

Put another way, if you read a sector off a disk and conclude that it was bad, and a reread returns the same data, it was probably written bad. The fun part is then taking the resulting "bad" data and looking at it.

Reminds me of early in my career: a Linux machine we were running a CVS server on reported corrupted CVS files once or twice a year, and when looking at them, I often found data from other files stuck in the middle, often in 4k-sized regions.

How does checksumming help if the data is in cache and waiting to be written? For example: I have 1MB of data; I write it, but it stays in the buffer cache after being written, so when you compute the checksum, you are computing it over the buffer cache.

On Linux you have to drop_caches and then read back to get the checksum to be sure. As far as I know, a per-buffer or per-file drop_cache isn't available. If you do a system-wide drop_caches, you are invalidating the good and bad ones alike.

What if the device maintains its own cache as well, in addition to the buffer cache?

Can someone clarify ?

How do you know you put good data into the cache in the first place?

There's always going to be a place where errors can creep in. There are no absolute guarantees; it's a numbers game. We've got more and more data, so the chance of corruption increases even if the per-bit probability stays the same. Checksumming reduces the per-bit probability across a whole layer of the stack - the layer where the data lives longest. That's the win.

Agree wholeheartedly.

I was asking this thinking of open(<file>, O_DIRECT|O_RDONLY); that bypasses the buffer cache and reads directly from the disk, which at least solves the buffer cache problem, I guess. The disk cache is another thing, i.e. if we disable it we are good, at the cost of performance.

I was pointing out that tests can do these kinds of things.
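As a middle ground between full O_DIRECT and a system-wide drop_caches, Linux does offer a per-file hint: posix_fadvise with POSIX_FADV_DONTNEED. A hedged sketch (the helper name is mine; the hint is advisory only, and the drive's own cache can still serve the data):

```python
import os

def reread_from_disk(path: str) -> bytes:
    """Re-read a small file after asking the kernel to drop its pages.

    posix_fadvise(POSIX_FADV_DONTNEED) is a per-file alternative to a
    system-wide drop_caches. Purely advisory: dirty pages must be
    fsynced first, the kernel may keep pages anyway, and the device's
    internal cache can still answer the read, so this is best-effort
    verification, not a guarantee the bytes came off the medium.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # make sure no dirty pages remain for this file
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return os.read(fd, os.path.getsize(path))
    finally:
        os.close(fd)
```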

I advocate taking it further with clustered filesystems on inexpensive nodes. Good design of those can solve problems on this side plus system-level. Probably also need inexpensive, commodity knockoffs of things like Stratus and NonStop. Where reliability tops speed, could use high-end embedded stuff like Freescale's to keep the nodes cheap. Some of them support lock-step, too.

GlusterFS now supports its own (out of band) checksumming. So you could have a Btrfs brick and an XFS brick to hedge your fs bets, and also set up GlusterFS volumes to favor XFS for things like VM images and databases, and Btrfs for everything else.

Neat idea. Appreciate the tip on GlusterFS.

> you should be outsourcing your storage to someone else

Well, I'd need to be sure that "someone else" does things properly. My experience with various "someone elses" so far hasn't been stellar — most services I've tried were just a fancy sticker placed onto the same thing that I'm doing myself.

As a counterpoint I have 6x3TB zfs raidz2 on freebsd at home. I resilver every month and only had one checksum error that turned out to be a cable going bad given it hasn't repeated.

Still agree that we need checksumming filesystems though. That and ECC RAM to make the data written more trustworthy.

had one checksum error that turned out to be a cable going bad given it hasn't repeated

I wouldn't assume that it was a cable error. The SATA interface has a CRC check. So the odds are quite high that a single error would simply result in a retransmission.

Of course, a plethora of detected SATA CRC errors and resulting retransmissions means that an undetected error could readily slip thru. There should be error logs reporting on the occurrence of retransmission, but I'm not enough of a software person to know how possible / easy it is to get that information from a drive or operating system.

OTOH, as you mention later in your post, a single bit error in non-ECC RAM could easily result in a single block having a checksum error. Exactly what you saw!

Hard drives also have spare sectors, so if a defect is detected at one spot in the disk, it will probably never touch that spot again.

Simply observing that an error only occurred once does almost nothing to narrow down the possible causes. You have to also be keeping track of all the error reporting facilities (SMART, PCIe AER, ECC RAM, etc.).

I simply don't have the upstream bandwidth necessary to back up 1TB (my estimate of essential data) offsite - it'd take months and take my ADSL line out of use for anything else.

I also have expensive power costs so running something like a Synology DS415 would cost $50 in power a year while barely using it - although that's better than older models.

Did you get any details on these 18 errors? Were they single bit flips?

No, unfortunately. I can't rule out the possibility of physical bus errors (like cable going bad or poor physical connection - in my case, there is one fairly expensive SAS cable per 4 drives, as I'm using a bunch of SAS/SATA backplanes with hotswap caddies); I do think that's probably more likely (or non-ECC RAM bitflip) than on-disk corruption.

But the exact nature of the problem is a distinction without a huge amount of difference to me. If I was copying those files, the copies would be silently corrupt. If I was transcoding or playing videos, the output would have glitches. Etc.

With this many HDDs, there are necessarily more components in the setup, and more things that can go wrong. Meanwhile, I'm not a business customer with profitable clients I can sell extra reliability to, so it's not the most expensive kit I could buy. I went as far as getting WD Red drives, and even then they were misconfigured by default, with an overly aggressive idle timer (8 seconds!) that needed tweaking.

The main thing is: more and bigger drives means increased probability of corruption.

Fortunately, ZFS on Linux is excellent, and is a two-liner on modern Ubuntu LTS. (Add the PPA, install zfs.)

Is it? I've heard several complaints about bugs in FUSE.

ZFS on Linux[0] doesn't use FUSE.

[0] http://zfsonlinux.org/

Thanks, I was unaware. Apparently there is both native ZFS, and FUSE-backed ZFS for Linux:


I have a friend who uses it. Can't say it's not buggy; he hit the bug where unlinked files weren't removed. He got to 95% usage before finding out he had to reboot/unmount to clean things up.

But that means his pool performance is now shit.

    > The only way to really be sure is to check your files after
    > writing them that they match.
This is assuming that the underlying block device would forcibly flush those queued writes to disk and then re-read them again, rather than just serving them up from the pending write queue without flushing them first.

You generally can't make that assumption about a black box, so reading back your writes guarantees nothing.

Unless you're intimately familiar with your underlying block device you really can't guarantee anything about writes going to physical hardware. All you can do is read its documentation and hope for the best.

If you need a general hack that's pretty much guaranteed to flush your writes to a physical disk, it would be something like:

    After your write, append X random bytes to a file where X is
    greater than your block device's advertised internal memory, then
    call fsync().
Even then you have no guarantees that those writes wouldn't be flushed to the medium while leaving the writes you care about in the block device's internal memory.

This is why end-to-end data integrity with something like T10-PI is a necessity. The kernel block-layer already generates and validates the integrity for us, if the underlying drive supports it, but all major filesystems really need to start supporting it as well.

I don't think that's a necessity for all workflows. Just think about it, that would require all of us buying enterprise 520 or 528 byte sector drives to store the extra checksum information, and a whole new API up to the application level to confirm, point to point, that the data in the app is the data on the drive on writes, and the data on the drive is the data in the app on reads. It's not like T10/PI comes for free just by doing any one thing, it implies changes throughout the chain.

Great write-up, and it probably explains some issues in my apps a while back. I like that my long-time favorite, XFS, totally kicks ass in the comparisons. I agree on using static analysis and other techniques, as I regularly push that here. What threw me is this line:

"that when they came up with threads, locks, and conditional variables at PARC, they thought that they were creating a programming model that anyone could use, but that there’s now decades of evidence that they were wrong. We’ve accumulated a lot of evidence that humans are very bad at reasoning at these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems."

There were quite a few ways of organizing systems, including objects and functions, that the old guard came up with. UNIX's popularity, and the organization style of some others, pushed the file-oriented approach from mainstream into outright dominance. However, many of us long argued it was a bad idea, and alternatives exist that just need work on the implementation side. We actually saw some of those old ideas put into practice in data warehousing, NoSQL, "the cloud," and so on. Just gotta do more, as we don't have a technical reason for dealing with the nonsense in the write-up: just avoiding getting our hands dirty with the replacements.

    > Like that my long-time favorite, XFS, totally kicks ass in the
    > comparisons.
I think you'll find this an interesting read then: http://teh.entar.net/~nick/mail/why-reiserfs-is-teh-sukc

It's written in 2004 so I don't know how current it is, but it makes the point that XFS makes certain performance & safety guarantees essentially assuming that you're running on hardware that has a UPS with the ability to interrupt the OS saying "oops, we're out of power".

It was designed by SGI for high-end workstations and supercomputers with long-running jobs (esp. render farms). So that doesn't surprise me. However, it's nice to have all the assumptions in the open, and preferably in user/admin guides. Another issue was it zeroing out stuff on occasion, but they fixed that.

2004 is not current for XFS, that is a decade ago! However, disks finishing writes and not lying about having done it is a critical need for all FS. For some like ext3 you would notice it less as it was flush happy.

XFS is becoming the sane default filesystem for servers, as it allocates nodes more consistently than the other current mainstream Linux options on multidisk systems. Basically, small servers now have more disk space and performance than the large systems of 2004. So XFS stayed put where it starts to make sense, but systems grew to meet its sweet spot much more often.

Pretty good analysis.

In Plan 9, mailbox files, like many others, are append-only.

All files are periodically (default daily) written to a block coalescing worm drive and you can rewind the state of the file system to any date on a per process basis, handy for diffing your codebase etc.

For a while the removal of the "rm" command was considered to underline the permanence of files but the removal of temporary data during the daytime hours turned out to be more pragmatic.

How does Plan 9 deal with the equivalent of this append-only pattern, which on Unix can cause corruption if you have multiple writers and the writes are larger than PIPE_BUF (4k by default on Linux)?

Most users of this pattern (concurrent updates to log files) get away with it because their updates are smaller than 4k, but if you're trying to write something as big as an E-Mail with this pattern you can trivially get interleaved writes resulting in corruption.

Exclusive locking
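A sketch of that answer using flock(2), assuming every writer cooperates (advisory locking; the function name is mine):

```python
import fcntl, os

def locked_append(path: str, record: bytes) -> None:
    """Append one record under an exclusive flock().

    Serializes writers so records larger than PIPE_BUF cannot
    interleave. Advisory only: every writer must use the same locking
    protocol, and flock is historically unreliable over NFS.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until we own the lock
        os.write(fd, record)
        fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```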

Surely filesystems are going to go through a massive change when SSDs push standard spinning disks into the history books? They must carry a lot of baggage for dealing with actual spinning disks, much of which is just overhead for super-fast solid state drives. Hopefully this will allow interesting features not possible on spinning disks, like better atomic operations.

"IotaFS: Exploring File System Optimizations for SSDs"

Our hypothesis in beginning this research was simply that the complex optimizations applied in current file system technology doesn’t carry over to SSDs given such dramatic changes in performance characteristics. To explore this hypothesis, we created a very simple file system research vehicle, IotaFS, based on the incredibly simple and small Minix file system, and found that with a few modifications we were able to achieve comparable performance to modern file systems including Ext3 and ReiserFS, without being nearly as complex.


Yeah, the btrfs 'ssd' mount option does less for the same reasoning, but it still includes checksums for metadata and data, because SSDs have at least as much likelihood of non-deterministically returning your data as spinning rust. So even if it doesn't fix the corruptions (which requires additional copies or parity), at least there's an independent way of being informed of problems.

I wonder how this approach (single file + log) compares to the other usual approach (write second file, move over first):

1. Write the changed data into a temporary file in the same directory (don't touch the original file)

2. Move new file over old file

Does this lead to a simpler strategy that is easier to reason about, where it is less likely for a programmer to get it wrong? At least I see this strategy being applied more often than the "single file + log" approach.

The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place.

That's not guaranteed to work in the face of crashes. The problem is that the directory update could get flushed to disk before the file data.

This is the fundamental problem: When you allow the OS (or the compiler, or the CPU) to re-order operations in the name of efficiency you lose control over intermediate states, and so you cannot guarantee that these intermediates states are consistent with respect to the semantics that you care about. And this is the even more fundamental problem: our entire programming methodology has revolved around describing what we want the computer to do rather than what we want to to achieve. Optimizations then have to reverse-engineer our instructions and make their best guesses as to what we really meant (e.g. "This is dead code. It cannot possibly affect the end result. Therefore it can be safely eliminated.) Sometimes (often?) those guesses are wrong. When they are, we typically only find out about it after the mismatch between our expectations and the computer's have manifested themselves in some undesirable (and often unrecoverable) way.

"That's not guaranteed to work in the face of crashes. The problem is that the directory update could get flushed to disk before the file data."

No, it can work, provided that the temporary file is fsynced before being renamed, the parent directory is fsynced after renaming the file, and that the application only considers the rename to have taken place after the parent directory is fsynced (not after the rename call itself).

FSYNC is not enough. You also have to make sure that NCQ is disabled:


OS can use Force Unit Access flag with NCQ to control disk buffering: https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_...

Good summary of the situation. It's why I fought out-of-order execution at hardware and OS levels as much as I could. Even went out of way to use processors that didn't do it. Market pushed opposite stuff into dominance. Then came the inevitable software re-writes to get predictability and integrity out of what shouldn't have problems in the first place. It's all ridiculous.

It bugged me that Sublime Text used to do these so-called atomic saves by default since it screwed with basic unix expectations like fstat and fseek meaningfully working (like a tail -f implementation could boil down to[0]). A concurrently running process using those calls would be off in lala-land as soon as the text file was modified and saved: it would never pick up any modifications, because it and the editor weren't even dealing with the same file any more.

[0] Here's follow(1) in my homemade PL:

    #!/usr/bin/env imp
    (when (not-eq? (length script-args) 1)
          (write (sprintln "usage: " script-name " <path>") stderr)
          (exit 1))

    (let path (car script-args)
         f (open path)
         ((proc (stat-then)
                (let stat-now (stat f)
                     (seq (when (and? stat-then
                                      (lt? (find 'size stat-now)
                                           (find 'size stat-then)))
                                # The file was truncated. Read from the start.
                                (seek s-set 0 f))
                           (let bytes (read default-chunk f)
                                (seq (when bytes (print bytes))
                                     (sleep 1000)
                                      (recur stat-now))))))
          nil))  # initial call: no previous stat yet

GNU tail lets you say "--follow=descriptor" to follow the content of the file no matter how it gets renamed, or "--follow=name" to re-open the same filename when a different file gets renamed onto it.

That's the difference between 'tail -f' and 'tail -F' and is implemented in every tail I know of.
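Roughly, `--follow=name` re-stats the path and reopens when the inode changes. A toy Python sketch of that loop (simplified; a real tail also handles the race between stat and open):

```python
import os
import time

def follow_name(path, interval=1.0):
    """Yield appended chunks, reopening `path` when it is replaced
    (rotated) or truncated; roughly 'tail -F'."""
    f = None
    inode = None
    while True:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            st = None
        if st is not None and (f is None or st.st_ino != inode):
            # A different file was renamed into place: reopen by name.
            if f:
                f.close()
            f = open(path, "rb")
            inode = st.st_ino
        if f:
            if st is not None and st.st_size < f.tell():
                f.seek(0)      # file was truncated; read from the start
            chunk = f.read()
            if chunk:
                yield chunk
        time.sleep(interval)
```

Plain `-f` (follow the descriptor) is the branch of this loop that never reopens: it keeps reading from the original inode even after a rename.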

indeed. `tail -qF` has been muscle memory for me for as long as I can remember.

if you have long lived sessions tail'ing logs that get rotated (or truncated), forgetting -F is a good way to eventually end up confused, ctrl-c'ing, and finally cursing.

Functional versus imperative concurrent shared data approaches provide a good analogy:

* Single file + log: fine grained locking in a shared C data structure. Yuck!

* Write new then move: transactional reference to a shared storage location, something like Clojure's refs. Easy enough.

The latter clearly provides the properties we'd like; the former may, but it's a lot more complicated to verify and there are tons of corner cases. So I think moving the new file over the old one is the simpler strategy, and way easier to reason about.

The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place.

Clojure's approach again provides an interesting solution to saving space. Taking the idea of splitting data into multiple files to the logical conclusion, you end up with the structure sharing used in efficient immutable data structures.

Your solution is slower; also, you need to fsync()/fdatasync() the new file before moving, at least on some systems (http://lwn.net/Articles/322823/ is the best reference I can find right now), and you need to fsync() the directory if you wish the rename to be durable (as opposed to just all-or-nothing.)

In general this approach should work fine, but the devil is in the details:

1. You have to flush the change to the temporary file before the move, because otherwise you may get an empty file: the OS may reorder the move and the writes.

2. After the move, you have to flush the parent directory of the destination file on POSIX. Windows has a special flag for MoveFileEx() to ensure the operation is done, or you have to call FlushFileBuffers() on the destination file.

The linked paper mentions that many popular programs forget about (1).

This is an interesting article but the examples given are seriously dated:

Mail servers and clients (MTAs and MUAs) have been using Maildir format to escape this problem since 1996.

Filesystems have evolved. ZFS has been available for ten years.

If you take your data seriously, you do need to make informed choices. But this article isn't targeted at people who won't know about Maildir and ZFS.

The section about relative data loss failures between various applications is great. Again, careful choices: SQLite and Postgres.

Wasn't part of the need for Maildir indexability rather than reliability?

It provides both. mbox was a known evil 20 years ago, so it's a bad example for the article.

The qmail spec describes queuing and delivery in a filesystem-safe manner.

> Filesystems have evolved. ZFS has been available for ten years.

Perhaps you should examine the section discussing the various problems with file systems. ZFS is hardly immune from these problems.

Perhaps you should take a closer look. ZFS wasn't part of that discussion, and none of the file systems which were discussed have ZFS's feature set.

I wonder how close btrfs is to ZFS. And how exactly it differs.

The similarity is specious. That is, if you look at a feature listing, they look similar. But dig into details and they clearly differ.

Origins: one starts out with proprietary developed functionality and with tons of money in a skunkworks project, using a consolidated team of developers mainly within one company through the fs development and stabilization; the other starts with an idea in a research paper (not called btrfs then because it wasn't an implementation), slow recruitment that'd eventually involve dozens of major companies, and is on a split stabilize and feature development track.

And then if you go back to look at features (commands in particular) in more detail, there are some big differences that may or may not affect specific workflows. E.g. btrfs snapshots are unique and independent file trees; in effect there's no distinction between the original and the snapshot: both are read-writable, modifying one doesn't affect the other, and either can be deleted without affecting the other (ZFS has a distinct parent-child relationship between filesystems, clones, and snapshots). Growing btrfs volumes by adding devices (and at some point balancing, which may be optional depending on the utilization and layout) is a lot easier and faster than in ZFS. ZFS offers raidz3 (3 parity drives); Btrfs does not. Btrfs offers online defragmentation; ZFS does not. ZFS is more stable for $reasons. Really, even ext4 and XFS are considered more stable still, although I hesitate to call Btrfs unstable. I'd say Btrfs is earning trust, where the others have earned it (with very edge-case exceptions, and there simply are more such edge cases for Btrfs still being found and fixed).

So they're actually quite different.

The article is a troll with some good information.

I've been a heavy mail user for years... Never encountered data loss due to file system problem, and honestly I can't think of a time in the last decade where anyone I'm acquainted with has. (And I ran a very large mail system for a long time)

Hell, I've been using OSX predominately for years now, and that garbage file system hasn't eaten any data yet!

There are problems, even fundamental problems, but if someone is literally unable to use any mail client, you need to look at the user before the file system.

Where I think you're seeing "that garbage file system" not eating your data, it has a lot to do with the absence of crashes or power losses. It has evolved a good deal since the HFS and even HFS+ days; no one uses either of those anymore. It's all HFSJ, with a scant number using HFSX.

20 years ago Mac OS crashed often, and had a file system not designed to account for that. OS X even shipped with non-journaled HFS+. It was only into the 3rd major release of OS X that journaling appeared. Data corruptions, I feel, dropped massively, because the OS didn't crash nearly as often, but did still crash. In the last 4-5 years I'd say I get maybe one or two kernel panics per year on OS X, which is a lot less than I get on Linux desktops. But even still on Linux desktops, I can't say when I've seen file system corruption not attributable to pre-existing hardware issues.

Cough, cough. MH.

What, you don't think MH was doing better file handling than mbox, including most of maildir's techniques, a decade before maildir?

The article only mentions Linux/Posix systems, are the same problems also present in Windows/NTFS? I was under the impression that, for example, renames on ntfs were crash safe and atomic, which would make the "write temp file then rename to target" work even if the power is cut?

NTFS is not very safe. No data integrity checksums. I think it's at about the same level as ext4, mostly. Meaning: not very good. One shouldn't trust any critical data to NTFS without checksums and duplication.

NTFS consistency checks and recovery are pretty good. But they won't bring your data back.

Microsoft's ReFS (Resilient File System) might give storage reliability to Windows one day. On Linux you can use ZFS or btrfs (with some reservations) today.


> ...work even if the power is cut?

If power is cut during NTFS logfile update, all bets are off. Hard disks can and will do weird things when losing power unexpectedly. That includes writing incorrect data, corrupting nearby blocks, corrupting any blocks, etc. That includes earlier logfile data, including checkpoints.

The article makes me wonder whether there's enough abstraction being done via the VFS layer, because all this fsync business that application developers seem to have to do can be so workload and file system specific. And I think that's asking too much from application developers. You might have to fsync the parent dir? That's annoying.

I wonder if the article and the papers it's based on account for how the VFS actually behaves, and whether someone wanting to do more research in this area could investigate this accounting for the recent VFS changes. On Linux in particular I think this is a bigger issue because there are so many file systems the user or distro could have picked, totally unbeknownst to and outside the control of the app developer.

That's definitely asking too much of app developers. Every time someone complains about any of this, the filesystem developers come back with a bit of lore about a non-obvious combination of renameat (yes that's a real thing) and fsync on the parent directory, or some particular flavor of fallocate, or just use AIO and manage queues yourself, or whatever depending on exactly which bogus behavior you're trying to work around. At best it's an unnecessary PITA. At worst it doesn't even do what they claimed, so now you've wasted even more time. Most often it's just non-portable (I'm so sick of hearing about XFS-specific ioctls as the solution to everything) or performs abominably because of fsync entanglement or some other nonsense.

We have libraries to implement "best practices" for network I/O, portable across systems that use poll or epoll or kqueues with best performance on each etc. What we need is the same thing for file I/O. However imperfect it might be, it would be better than where we are now.

Very rudimentary, but: a way for an application developer to specify categories of performance/safety tradeoffs for operations. An app developer might have a simple app that only cares about performance, or only cares about safety; there'd be a default in between. Another app developer might have mixed needs depending on the type of data the app is generating. But this way, if they write the app with category A (say, highest safety at the expense of performance) and their benchmarking determines this is crap, and they have to go to category B for writes, that's a simpler change than going back through their code and refactoring a pile of fsyncs or FUA writes.

I mean, I thought this was a major reason for VFS abstraction between the application and kernel anyway. It's also an example of the distinction between open source and free (libre). If as an application developer you have to know such esoterics to sanely optimize, you in fact aren't really free to do what you want. You have to go down a particular rabbit hole and optimize for that fs, at the expense of others. That's not fair choice to have to make.

The inherent issue is that there's a huge performance benefit to be gained by batching updates. FS safety will always come at the cost of performance.

The article doesn't say but I suspect most of the issues it mentions can be mitigated by mounting with the "sync" and "dirsync" options, but that absolutely kills performance.

The APIs involved could definitely be friendlier, but the app developer is using an API that's explicitly performance oriented by default at the cost of safety, and needs to opt-in to get safer writes. Whether the default should be the other way around is one matter, but ultimately someone has to pick which one they want and live with the consequences.

One of the naive assumptions that most of us make is that if there's a power failure none of the data that was fsynced successfully before will be corrupted.

Unfortunately, this is not the case for SSDs.

This issue is completely fixed by Maildir, and has been for many years. Many clients, including Mutt, support Maildir mailboxes.

One thing I've always wanted to try and never had time is to build SQlite as a kernel module, talking directly to the block device layer, and then implement a POSIX file system on top of it.

It wouldn't solve problems with the block device layer itself, but it'd be interesting to see how robust and/or fast it was in real life.

SQLite's still too big of a pig, and still unreliable. But we can do LMDB in-kernel, and an LMDBfs is in the works. Based on BDBfs.


SQLite itself relies on a filesystem for the rollback journal or write-ahead log. So you'd need some kind of abstraction between the block device and SQLite. Might as well just use the existing filesystem and keep SQLite in user space, since that works.

It all sounded so good until the end of the header where it's described as a demo vfs. :-(

This article sheds some light on a problem I've had for years:

The software product I develop stores its data in a local database using H2, stored in a single file. Users tend to have databases gigabytes in size.

After reading this article, I start to understand why my users occasionally get corrupted databases. A hard problem to solve.

An easy problem to solve. LMDB never corrupts.

While I experienced this pain first hand, I'm not sure the FS deserves 100% of the blame. There is enough blame to go around for userspace, filesystems, the block device layer, and the disk controllers, hardware & firmware.

If you start at the bottom of the stack you have all sorts of issues with drives, their firmware and controllers doing their own levels of caching & re-ordering. Over the years the upper layer developers (block layer / fs) had to deal with hardware that simply lies about what happened or is just plain buggy.

It appears that SQLite could be a good basis for a decrapified filesystem.

then you didn't read the article closely. SQLite has plenty of crash vulnerabilities.

I just saw zeros in the tables.

I don't program much in C, or use direct system calls for files. Mostly I use Java.

Does anyone know if any of this applies to Java's IO operations? I'm sure you can force some of this behaviour, but for instance: will the flush method on OutputStream ensure a proper sync, or is that again dependent on the OS and file system, as described in the article for syscalls?

This answers your question:


It's only logical if you think about the bigger picture. Does Java have access to the underlying disk device or does it work with the filesystem? Which component is responsible for the filesystem?

You can force writes to disk with NIO, but I don't think that really solves any of the problems detailed in this article.

I do agree with you, this is my story about the usual stuff: http://www.sami-lehtinen.net/blog/chaos-monkey-bit-me-a-shor... As well as this classic from SQLite3 guys: https://www.sqlite.org/howtocorrupt.html - As you can see, there are multiple ways to ruin your day.

Ah, fond memories of async ext2 corrupting root filesystems beyond recognition... I think we 'invented' 'disposable infrastructure' back in 2002 because the filesystem forced us to... The MegaRAID cards that would eat their configuration along with all your data didn't help either...

Can't remember if we switched to ext3 in ordered or data-journaled mode but it made an immense difference...

IIRC I have seen this discussion before, and the answer was: do an fsync. But for the sake of performance we want to be able to issue write barriers to a group of file handles, so we know the commands will be run in order, and in no other order.

Correct, we want to be able to write in deterministic order. SCSI has supported this natively for decades. Unfortunately SATA doesn't, and the Linux kernel pretty much doesn't, because it can't rely on the storage devices to support it.

Which is just lame; it's mostly because the block layer doesn't support file-based fencing. A mistake made a decade and a half ago, and no one has the will/political power to fix it.

If the block layer supported it, solving the problem of fencing an ATA device would be as simple as issuing a whole device flush instead of a SYNC_CACHE with range. Which for soft raid devices would make a huge impact because only the devices with dirty data need to be flushed.

Of course the excuse today, is that most scsi devices can't do range based sync either, and just fall back to whole device because no one uses the command. Chicken/egg.



I think it's mostly a "why are you telling us? The author of the article might not ever even visit HN. Email them." reaction.

FWIW, the author is on Hacker News at https://news.ycombinator.com/user?id=luu.

Moreover, it is non-obvious how to contact the author via private email. The "about" section of the blog just links to a twitter account:


This is especially strange as the author then mentions email there, without providing an actual email address:

| Please ping me on twitter if it looks like your email got eaten by my spam filter

Maybe the author expects a reader to brute-force lots of addresses "something [at] danluu.com"?

I'd prefer if he just give his address, but in Dan's defence he does offer some pretty clear hints in his profile page: "Email is my full name @ gmail." https://news.ycombinator.com/user?id=luu

His email address is also given prominently on his linked resume: https://github.com/danluu/tex-resume (periods are optional in Gmail addresses)

Isn't block level duplication/checksumming like RAID supposed to solve this hardware unreliability? I understand that by default RAID is not used on end user desktops.

AFAIK, most RAID systems do not have checksums. I may be wrong, but I think RAID5/RAID6 even amplifies error frequency.

It gets more "fun" when you consider many (most? all?) hard disks can get corrupted without checksum failures.

Layer "violators" like ZFS and btrfs do have checksums.

Maybe conventional block / filesystem layering itself is faulty.

The layers that ZFS violates were created years before the failure modes were well understood that filesystem-based checksums address. I'm not sure how you _can_ solve these issues without violating the old layers.

In particular: checksumming blocks alongside the blocks themselves (as some controllers and logical volume managers do) handles corruption within blocks, but it cannot catch dropped or misdirected writes. You need the checksum stored elsewhere, where you have some idea what the data _should_ be. Once we (as an industry) learned that those failure modes do happen and are important to address, the old layers no longer made sense. (The claim about ZFS is misleading: ZFS _is_ thoughtfully layered -- it's just that the layers are different, and more appropriate given the better understanding of filesystem failure modes that people had when it was developed.)
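A toy illustration of the difference (all helper names hypothetical): simulate a dropped write against a dict-backed "disk", once with the checksum stored inline with the block, and once with it stored in the parent, as ZFS does.

```python
import zlib

disk = {}          # toy "disk": block number -> bytes
parent_sums = {}   # checksums held by the parent, out of band

# Scheme 1: checksum stored alongside the block itself.
def write_inline(n, data):
    disk[n] = data + zlib.crc32(data).to_bytes(4, "big")

def inline_ok(n):
    data, csum = disk[n][:-4], disk[n][-4:]
    return zlib.crc32(data).to_bytes(4, "big") == csum

# Scheme 2: checksum stored in the parent block.
def write_parented(n, data):
    disk[n] = data
    parent_sums[n] = zlib.crc32(data)

def parented_ok(n):
    return zlib.crc32(disk[n]) == parent_sums[n]

# Simulate a dropped write: the new data never reaches the disk,
# which keeps the old (internally consistent) contents.
write_inline(7, b"old")
stale = disk[7]
write_inline(7, b"new")
disk[7] = stale               # the drop
assert inline_ok(7)           # inline checksum can't see the problem

write_parented(8, b"old")
stale = disk[8]
write_parented(8, b"new")
disk[8] = stale               # same drop
assert not parented_ok(8)     # parent checksum catches it
```

The stale block carries its own stale-but-valid checksum, so only the checksum stored where the data "should be" known, i.e. in the parent, can detect the dropped or misdirected write.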

Dragonflybsd's HAMMER is a non layer "violator" with checksums (not that I mind the violations, they're great).


Good article. You have a typo: looks like you're using Pandoc or something similar, and left out the closing parenthesis in the link after [undo log].

Yeah, that's a common error among Markdown writers. It's too easy to forget the last bracket, especially if you are putting a link inside a parenthetical comment in the first place.

Fortunately, it's easy to detect programmatically. I have a little shell script which flags problems in my Markdown files: http://gwern.net/markdown-lint.sh

In this case, you can use Pandoc | elinks -dump to get a rendered text version, and then simply grep in the plain text for various things like "-e '(http' -e ')http' -e '[http' -e ']http'"

For me, Vim's highlighting and concealing prevents this class of typos. It also makes it more pleasant to read the source, as it hides the links unless I'm editing the line.

This problem can be fixed. We need to rethink file system semantics. Here's an approach:

Files are of one of the following types:

Unit files

For a unit file, the unit of consistency is the entire file. Unit files can be created or replaced, but not modified. Opening a unit file for writing means creating a new file. When the new file is closed successfully, the new version replaces the old version atomically. If anything goes wrong between create and successful close (a system crash, a program abort, anything), the old version remains and the new version is deleted. File systems are required to maintain that guarantee.

Opens for read while updating is in progress reference the old version. Thus, all readers always see a consistent version.

They're never modified in place once written. It's easy for a file system to implement unit file semantics. The file system can cache or defer writes. There's no need to journal. The main requirement is that the close operation must block until all writes have committed to disk, then return a success status only if nothing went wrong.

In practice, most files are unit files. Much effort goes into trying to get unit file semantics - ".part" files, elaborate file renaming rituals to try to get an atomic rename (different for each OS and file system), and such. It would be easier to just provide unit file semantics. That's usually what you want.

Log files

Log files are append-only. The unit of consistency is one write. The file system is required to guarantee that, after a program abort or crash, the file will end cleanly at the end of a write. A "fsync" type operation adds the guarantee that the file is consistent to the last write. A log file can be read while being written if opened read-only. Readers can seek, but writers cannot. Append is always at the end of the file, even if multiple processes are writing the same file.

This, of course, is what you want for log files.
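The closest approximation of these semantics in today's POSIX API is O_APPEND plus fsync. A sketch, with the caveat that POSIX does not actually guarantee the file ends cleanly at a write boundary after a crash, which is exactly what this proposal would add:

```python
import os

def append_record(path, record):
    """Append one record and make it durable: an approximation of
    the proposed log-file semantics with today's POSIX API."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)   # O_APPEND: always at end of file
        os.fsync(fd)           # "consistent to the last write"
    finally:
        os.close(fd)
```

Under the proposed semantics, the fsync would become optional for consistency and only needed for durability.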

Temporary files

Temporary files disappear in a crash. There's no journaling or recovery. Random read/write access is allowed. You're guaranteed that after a crash, they're gone.

Managed files

Managed files are for databases and programs that care about exactly when data is committed. A "write" API is provided which returns a status when the write is accepted, and then makes an asynchronous callback when the write is committed and safely on disk. This allows the database program to know which operations the file system has completed, but doesn't impose an ordering restriction on the file system.

This is what a database implementor needs - solid info about if and when a write has committed. If writes have to be ordered, the database program can wait for the first write to be committed before starting the second one. If something goes wrong after a write request was submitted, the caller gets status info in the callback.

This would be much better than the present situation of trying to decide when a call to "fsync" is necessary. It's less restrictive in terms of synchronization - "fsync" waits for everything to commit, which is often more than is needed just to finish one operation.
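A user-space toy model of what such a managed-file API might look like (all names hypothetical): write() returns once the request is accepted, and a worker thread invokes the callback only after the data has been fsync'd.

```python
import os
import queue
import threading

class ManagedFile:
    """Toy model of the proposed 'managed file' API: write() is
    asynchronous; on_committed fires once the data is durable."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        self.q = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def write(self, offset, data, on_committed):
        # Accepted, but not yet durable.
        self.q.put((offset, data, on_committed))

    def _run(self):
        while True:
            item = self.q.get()
            if item is None:
                return
            offset, data, cb = item
            os.pwrite(self.fd, data, offset)
            os.fsync(self.fd)   # a real implementation would batch these
            cb()                # the commit notification

    def close(self):
        self.q.put(None)
        self.worker.join()
        os.close(self.fd)
```

A database needing ordered writes would chain the second write inside the first one's callback, rather than blocking everything on a global fsync.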

This could be retrofitted to POSIX-type systems. If you start with "creat" and "O_CREAT" you get a unit file by default, unless you specify "O_APPEND", in which case you get a log file. Files in /tmp are temporary files. Managed files have to be created with some new flag. Only a small number of programs use managed files, and they usually know who they are.

This would solve most of the problems mentioned in the original post.

Does anyone know if there's a recording of the usenix talk given? (Referenced slides)

PG was quite proud of having just used the file system as a database with Viaweb, claiming that "The Unix file system is pretty good at not losing your data, especially if you put the files on a Netapp." His entire response is worth reading in light of the above article, if only to get a taste of how simultaneously arrogant and clueless PG can be: http://www.paulgraham.com/vwfaq.html

PG and company still think this is a great idea, because the software this forum runs on apparently also does the file system-as-database thing.

No wonder Viaweb was rewritten in C++.

EDIT: to those downvoting, my characterization of PG's response is correct. Its arrogance is undeniable, with its air of "look how smart I am; I see through the marketing hype everyone else falls for," as is its cluelessness, with PG advocating practices that will cause data loss. Viaweb was probably a buggy, unstable mess that needed a rewrite.

You're spending your Sunday ragging on someone with a throwaway and being hypervigilant about downvotes on your throwaway. Get a grip, mate.

being hypervigilant about downvotes on your throwaway

Well, FWIW, it's working. At this instant his total karma is +10, he only has 2 posts, and the 2nd post is slightly greyed. So that means that his original comment is now in the range of +10.

Sadly, the fact that I'm replying here probably means that I need to get a life!

It's clear you're ignorant of the capabilities netapp filers had, even back then.

WAFL uses copy on write trees uniformly for all data and metadata. No currently live block is ever modified in place. By default a checkpoint/snapshot is generated every 10 seconds. NFS operations since the last checkpoint are written into NVRAM as a logical recovery log. Checkpoints are advanced and reclaimed in a double buffering pattern, ensuring there's always a complete snapshot available for recovery. The filer hardware also has features dedicated to operating as a high availability pair with nearly instant failover.

The netapp appliances weren't/aren't perfect, but they are far better than you're assuming. They were designed to run 24/7/365 on workloads close to the hardware bandwidth limits. For most of the 2000's, buying a pair of netapps was a simple way to just not have issues hosting files reliably.

Perhaps you should take your own advice and dial back the arrogance a bit.

> No wonder Viaweb was rewritten in C++

Or maybe this massive non-sequitur is reason enough for downvotes. Also note this from the link you provided:

> But when we were getting bought by Yahoo, we found that they also just stored everything in files

Trusting Netapp is way better than 99% of status quo. It has had ZFS style integrity for a pretty long time.

I think PG was pretty much right in his judgement. Any file system is going to be pretty good on a reliable block store, such as Netapp.

Unless the file IO is done correctly, even the best file system won't save you from data loss, such as the kind that can result from sudden power failure, which is what the article talks about.

PG obviously thinks RDBMSs are just unnecessary middlemen between you and the same file system. He doesn't realize that even if they ultimately use the same file system you do, they likely don't use it the way you do. Maybe Viaweb used something like the last snippet of code in the article, but I doubt it.

here's some summary slides of what you're missing: http://community.netapp.com/fukiw75442/attachments/fukiw7544...

Or you can read the patent itself: https://www.google.com/patents/US5819292


I was curious why pg did it that way. Here's a brief comment from him:

>What did Viaweb use?

> pg, 3160 days ago: Keep everything in memory in the usual sort of data structures (e.g. hash tables). Save changes to disk, but never read from disk except at startup.

So similar to what Redis does today but a decade before Redis and likely faster than the databases of the day. Could have been important with loads of users editing their stores on slow hardware. Anyway it worked, they beat their 20 odd competitors and got rich. I'm sceptical that it was a poor design choice.

I downvoted you for lowering the tone of the conversation with personal insults. If you had just said something like 'note that PG's advice about using the Unix file system as a database is not now considered best practice,' that would have been fine.

The article implies very little about the aptness of 'using the file system as a database' for a specific application.

Who's PG?

Paul Graham, the boss of Ycombinator.

Your comment begs the question: Do you fall for the marketing hype? And if not, then do you think you should keep quiet about stuff that works?

At the time, IMHO, PG was indeed smart to be one of the few using FreeBSD as opposed whatever the majority were using.

But he has admitted they struggled with setting up RAID. They were probably not too experienced with FreeBSD. I am sure they had their fair share of troubles.

PG's essays and his taste in software are great and the software he writes may be elegant, but that does not necessarily mean it is robust.

Best filesystem I have experienced on UNIX is tmpfs. Backing up to permanent storage is still error-prone, even in 2015.

> At the time, IMHO, PG was indeed smart to be one of the few using FreeBSD as opposed whatever the majority were using.

Why was it a better OS choice at the time than, say, Solaris or IRIX or BSDI?

I would have if it served my needs, because those others required special hardware and/or hefty licensing costs.


All of this is great, except the first two sentences:

> I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox.

If I were to get so many emails that they would corrupt my mailbox, I'd first ask myself why, and how to stop that.

I wouldn't. If your daily workflow includes a high volume of email, then it does.

What I would ask is: is there a way I can solve this problem without having to totally rearrange how I use email? I'm curious whether the author looked at, say, running a personal MDA that served mail over IMAP, so that it could be interacted with via a desktop email client, without requiring that client to serve as the point of truth. Not to say that corruption couldn't still happen that way, but Thunderbird (for example) can be configured to store only a subset of messages locally, or none at all. With a reasonably fast connection to the MDA, this seems like a possibly workable solution.

> If your daily workflow includes a high volume of email, then it does.

Not necessarily. For example, if you have monitoring software sending you all email notifications, you could change that to just write records to a database instead.

I suppose I'm giving the person in question credit for being able to see the blindingly obvious.
