Unix’s file durability problem (utoronto.ca)
287 points by robinhouston on Apr 16, 2016 | 159 comments



I've suggested an approach to this before. There should be several types of files.

- "Unit" files commit when closed properly (this does not include a program exit without close), and then replace the old version of the file. The file system should guarantee that, after a crash, you have either the old version or the new complete version. This should be the default when a file is opened via "creat()"

- "Temp" files always go away completely after a crash. This should be the default for designated temp directories.

- "Log" files can only be appended. Writers cannot seek. The file system guarantees that after a crash, the end of the file is at the end of some write; the file may not tail off into junk. This should be the default for files opened for append.

- "Managed" files are for structured databases. They have an additional API. "writemanaged()" has a callback parameter. It returns when the data has been queued, but in addition, the writer gets a callback when the write has committed to disk. The file system must guarantee that a write for which the callback has been made will survive a crash. This provides fine-grained information of when data has been committed to disk, which is what a database needs. It improves performance by not blocking waiting for disk commit to take place. The database can have several I/O operations going at once in different parts of the file without blocking.


> - "Unit" files commit when closed properly (this does not include a program exit without close), and then replace the old version of the file. The file system should guarantee that, after a crash, you have either the old version or the new complete version.

Incidentally, this is the way it currently is. From rename(2): If "newpath already exists, it will be atomically replaced".

Just don't forget that "replacing" means creating a _new_ file, writing it, syncing it, and then replacing the old version with rename(2) (and fsyncing the directory).
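
A minimal sketch of that sequence, with error handling elided and made-up file names:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write the new contents under a temporary name, make them durable,
       atomically replace the old file, then make the new directory entry
       durable too. */
    void replace_file(const void *buf, size_t len)
    {
        int fd = open("data.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, buf, len);          /* a real program loops on short writes */
        fsync(fd);                    /* contents + inode on disk */
        close(fd);

        rename("data.tmp", "data");   /* readers see old or new, never neither */

        int dfd = open(".", O_RDONLY | O_DIRECTORY);
        fsync(dfd);                   /* the rename itself is now durable */
        close(dfd);
    }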

> - "Temp" files always go away completely after a crash. This should be the default for designated temp directories.

You can have that: Just don't link the inode of the file. On Linux, you can use O_TMPFILE. Alternatively, just create the file, and keep it open but immediately unlink it (it's a little hack but nothing dramatic).
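
A sketch of the O_TMPFILE variant (Linux-specific; the directory is just an example):

    #define _GNU_SOURCE        /* O_TMPFILE is Linux-specific (3.11+) */
    #include <fcntl.h>
    #include <unistd.h>

    int make_scratch_file(void)
    {
        /* The file gets an inode but no name, so after a crash (or the last
           close()) there is nothing left on disk to clean up. */
        return open("/var/tmp", O_TMPFILE | O_RDWR, 0600);
    }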

You can also write to /tmp. Normally that gets cleaned at reboot, which might be enough.

> - "Log" files can only be appended.

In open(2), look for O_APPEND.

> Writers cannot seek.

That's not how you write a kernel API. If you don't want to seek, then don't.

> The file system guarantees that after a crash, the end of the file is at the end of some write

How do you intend to implement that? Disks don't support arbitrarily large atomic commits. Making such a guarantee in software is necessarily inefficient. The kernel will not choose your tradeoffs for you. You can do it from userland. Decide yourself what is the least bad way to do it.

> "Managed" files are for structured databases. [..] The writer gets a callback when the write has committed to disk. The file system must guarantee that a write for which the callback has been made will survive a crash.

Fsync does just that, except it isn't asynchronous. Maybe there's something behind aio(7)?


Rename isn't quite an atomic replacement. If you crash before the rename, the new file hangs around. (Hence unwanted .part files.)

O_APPEND isn't airtight on all systems. On some older UNIX systems, multiple writers created with "open()" (not "dup()") do not share a file position. NTFS doesn't do append correctly.

How do you guarantee that, after a crash, the end of the file is at the end of some write? By updating the file size after the write. The file size update can be deferred during heavy write traffic, but you should always get a file size that ends at a write boundary.

"fsync" synchs the whole file, not just one I/O, which can take a while. Databases such as MySQL's InnoDB, which puts multiple tables in one file, can have independent I/O going on in different parts of a file.

aio(7) has the right mechanism, a callback/signal on completion. But it's not clear if the file system guarantees the data is safely on disk when the completion signal comes in.

The original article complains that UNIX/Linux file system semantics aren't well enough defined for database safety. He's right. They're close, but not quite there, because behavior after a crash is unspecified.


> How do you guarantee that, after a crash, the end of the file is at the end of some write? By updating the file size after the write.

Note that Linux ext4 does not do this. On a power outage, you can get bogus trailing zeros on a file which you were appending, because the file size was updated before the data was written. I asked Ted T'so about this and he said it was working as intended.

https://plus.google.com/+KentonVarda/posts/JDwHfAiLGNQ


ZFS does do this. Changes are atomic, so they either happen or do not. There is no in-between.


I don't know about ZFS, but the usual write(2) API does not support this: It might for example return early with a short write because some interrupt occurred. Can happen for all "slow devices" (see signal(7)). And I think that's a good thing and am sure many programs expect this.


The signal(7) man page also states clearly that a local disk is not a "slow" device, so this seems moot.

    read(2), readv(2), write(2), writev(2), and ioctl(2) calls on "slow" devices.  A "slow" device is one where the I/O call  may  block  for  an
    indefinite  time, for example, a terminal, pipe, or socket.  If an I/O call on a slow device has already transferred some data by the time it
    is interrupted by a signal handler, then the call will return a success status (normally, the number of  bytes  transferred).   Note  that  a
    (local) disk is not a slow device according to this definition; I/O operations on disk devices are not interrupted by signals.


oops, had that wrong. Thanks for noticing!


Oh! So that's where all the random nulls in my shell history came from. Zsh would sometimes complain about a corrupted history after a bad crash, and I always wondered what caused maybe 20 nulls to be added to the end of the file. Thanks!


It does happen with the default setting, but if you journal everything, then it doesn't happen. Performance on spinning rust, however, drops 50%.


> On some older UNIX systems, multiple writers created with "open()" (not "dup()") do not share a file position.

I would never expect that. It's not how it works. But what has to work is the dup() case, or the inherited descriptors case like

$ ( do_this & do_that & ) > /some/file

because here they share the same file descriptor (as opposed to having only the same file opened on stdout). Note this still doesn't guarantee that writes won't be interleaved. I think there is only a guarantee that writes up to 512 bytes won't be interleaved. The answer here is: Don't do simultaneous writes. Or implement your own synchronization mechanism. It's just not a requirement with a universally satisfying solution that should be implemented in the kernel.

> NTFS doesn't do append correctly.

If you meant NFS: It deliberately violates POSIX file system semantics. I guess it's just too hard to implement. For example, an NFS server can't know whether the machine which owned a lock crashed or is just temporarily disconnected from the network (and still has a process running that thinks it owns the lock, unaware that the server is wondering).

In any case, not the API to blame.

> How do you guarantee that, after a crash, the end of the file is at the end of some write? By updating the file size after the write. The file size update can be deferred during heavy write traffic, but you should always get a file size that ends at a write boundary.

How could we possibly implement these "transactions" if the write is _not_ at the end of some file?

> It's not clear if the file system guarantees the data is safely on disk when the completion signal comes in.

What the disk does is in the end controlled only by the disk itself (and of course environmental forces). So it's always best effort.


sync_file_range(2) allows you to specify ranges to sync, instead of the whole file, but it's not that useful anyway.

In most DBs, you have separate data and log. The log is always only appended to, and the nature of disks is such that grouping concurrent transactions together into a single commit is a big win regardless of how syncing works. So syncing all outstanding I/O to the log file is generally what you want to do anyway.

While the data store is always being written to, its writes do not need to be synced to disk until a log checkpoint is made (and the old log subsequently discarded). And since that is an intermittent asynchronous process, there is not much to gain from finer-grained syncing here either.

See e.g. http://www.postgresql.org/message-id/4A51CB76.5020407@enterp...


Actually, we made postgres use sync_file_range() for checkpointing if available in 9.6. Not for durability - there's a few to many caveats in the manpage - but to control how much work a later fsync() has to do. In many workloads checkpointing can generate a lot of writes, and the OS's writeback caching of those can generate a lot of dirty buffers in the kernel's page cache. If the kernel decides to flush those (on it's own or due to an fsync), latency for every other FS operation can skyrocket. We've seen stalls in the 10s of minutes. So we now regularly sync_file_range(SYNC_FILE_RANGE_WRITE), unless the feature is disabled of course, to control how much dirty data the kernel has.

See http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;... and http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;...
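
For reference, the call in question looks roughly like this (a sketch, not PostgreSQL's actual code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    /* Kick off writeback of a range we just dirtied, without waiting and
       without any durability guarantee; a later fsync() then has less to do. */
    void start_writeback(int fd, off_t offset, off_t len)
    {
        sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
    }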


Thanks for the info, as a heavy user of Postgres it's great to know that option will be available!


> NTFS doesn't do append correctly.

Care to elaborate on that? I haven't seen issues with the two mechanisms I know of in NT:

1. CreateFile with FILE_APPEND_DATA

2. WriteFile with the OVERLAPPED offset set to ~0

Both of these get you writes that are as far as I know safe to have multiple processes writing to a common log file. Not sure about flushes to disk but you could probably get away with FlushFileBuffers for this.


NTFS handles append just fine. As another comment speculated, the parent may have meant NFS, which does notoriously have lots of problems with append, and many other POSIX semantics.


Sorry, meant NFS.


> Rename isn't quite an atomic replacement. If you crash before the rename, the new file hangs around. (Hence unwanted .part files.)

It seems O_TMPFILE cannot be used to atomically replace existing files either. If only one could use renameat2() on a raw fd or force linkat() to an atomic overwrite.


Your response pretty aptly demonstrates the problem in the original article.

If you write to a temp file and then use rename, you need to pick a suitable temporary name (without a race condition) that other programs will ignore, and now you've probably created a garbage file if the program crashes before you get to the rename. You also need to make sure you've preserved the permissions in the original file, and hope that nobody else is using the same trick to update the same file.

There are APIs and standard tricks for these things too. But it's difficult to be sure that you've covered all the corner cases and impossible to be sure that everyone else has covered all the corner cases.

This all adds complexity compared to having a simple, standard, guaranteed way to atomically update a file and makes it less likely that most programs will do it right.


As I said, if it's too unbearable a hack for you to unlink() the file directly after creat()ing it (to get an anonymous file handle that nobody else can see), for the unlikely case that the system crashes just in between, then just go ahead and use O_TMPFILE. It was made for you!

If you don't want to use O_TMPFILE (not super-portable) then just open(path, O_CREAT|O_EXCL). That fails should there be another file lying around, and you can try again with another random name.

Btw. you can also just mmap(2) some anonymous non-disk-backed memory, and possibly open_memstream(3) on it to get a FILE * if that suits you better.


Even easier than open(2) with a random file name is mkstemp(3).
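
For example (a sketch; the directory and final name are made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* mkstemp() fills in the XXXXXX, creates the file exclusively (so there
       is no race on the temporary name) and returns an open fd. */
    void update_file(const void *buf, size_t len)
    {
        char path[] = "/some/dir/.update.XXXXXX";
        int fd = mkstemp(path);
        write(fd, buf, len);
        fsync(fd);
        close(fd);
        rename(path, "/some/dir/final");   /* publish atomically */
    }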


Exactly. It takes a whole collection of workarounds to get well behaved file semantics across crashes, and the workarounds are different for Linux, BSD, and various other UNIX variants, and may differ depending on the file system type and system version.


Rename doesn't always do what we want. The one serious investigation I've seen [1] unhelpfully states that "[W]hen a file is appended, and then renamed ... many file systems recognize it and allocate blocks immediately." Suggesting that some file systems don't, but failing to name them. In general, appends followed by directory operations are not order-preserved.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14...


There's no order preserving guarantee. Code should do

    fd <- open(tmpfile)
    write(fd, data)
    fsync(fd)
    rename(tmpfile, finalpath)
    fsync(parent(finalpath))
(Strictly speaking there is a problem in the rename because we can't be sure "tmpfile" still refers to the same inode. It would be nice to have a way to link fds, but I'm not aware of any).

According to the paper, existing file systems recognize the (open,) write, rename pattern and don't reorder write and rename. That would break too much existing code. But these are filesystem-specific design decisions.

Btw. it's just the same with CPUs writing to memory. Generally writes can be reordered if the behaviour is not visible to the executing thread. However if there are multiple threads you might need to insert (CPU-specific) memory barriers to prevent reorderings that are visible to other threads. x86 is like those file systems here in that most incorrect code still works, because x86 makes some ordering guarantees.


Don't know why you're downvoted. Everything you said is correct.

Additionally, in Linux, you can get asynchronous fdatasync(2) with sync_file_range(2).

aio(7) is weird. It really only works with direct I/O (which is necessary much less often than people think it is), and IIRC aio_fsync(2) isn't implemented on Linux, but that doesn't matter so much because generally direct I/O implies synchronous I/O (but not always). See http://lse.sourceforge.net/io/aio.html


AIO does not require O_DIRECT. Most Linux file systems do not implement aio_fsync because they put the writes into the page cache and block until it is flushed to disk before signaling completion. That means aio_fsync can be implemented as a no-op.


Unfortunately that is not documented in any man page I've read, because that is very useful (crucial) to know and not at all obvious. I reiterate that aio is weird :)

Also, I'm pretty sure that issuing IOCB_CMD_FSYNC from io_submit(2) has given me an error in the past. Not sure about aio_fsync(2), because last time I checked, POSIX aio only had a pthread-based backend, rather than the native support that libaio (i.e. io_submit(2)) provides.


rename(2) atomic replacement only means atomic with respect to processes running on the system at the same time: anyone looking up the newpath will get either the old file or the new file (and won't, for example, see a spurious ENOENT).

It doesn't mean atomic with respect to a system crash.

The easiest way to get asynchronous fsync(2) is to do the fsync(2) in a different thread.
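
A minimal sketch of that, assuming fire-and-forget is acceptable and the fd stays open until the worker finishes:

    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Run fsync() on a worker thread so the caller isn't blocked; the
       result is simply dropped in this sketch. */
    static void *fsync_worker(void *arg)
    {
        fsync((int)(intptr_t)arg);
        return NULL;
    }

    void fsync_async(int fd)
    {
        pthread_t t;
        pthread_create(&t, NULL, fsync_worker, (void *)(intptr_t)fd);
        pthread_detach(t);
    }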


> The easiest way to get asynchronous fsync(2) is to do the fsync(2) in a different thread.

However you would need to create a new thread for each simultaneous asynchronous fsync to get the benefits of asynchronicity.


> "Log" files can only be appended. Writers cannot seek.

I think the semantic you want for log files is: write atomically at offset X, but only if the size of the file is currently X.

If the write fails due to a file size mismatch, you are racing with someone else. You can either read the other writer's data before you try writing again or just fail immediately if races shouldn't happen.


I don't mean to sound dismissive, but do you have any idea of how O_APPEND interacts with PIPE_BUF? The problem you're describing is basically solved on modern POSIX OS's.

You're just proposing moving to userspace semantics that could be better handled in kernel space (e.g. by resizing PIPE_BUF).

The way this works now is that you open a logfile with O_APPEND; any write(2) you make to the file will be atomic up to PIPE_BUF, so it doesn't matter that you have a "race" with someone else: as long as everyone's write is < PIPE_BUF, which is the common case, the writes won't interleave with each other.

Now of course if your write is bigger than PIPE_BUF you might have interleaving writes, but at the point where you're making >4K writes you're usually better off having some custom log system anyway, rather than relying on the kernel to serialize things for you.

If you want to emulate this proposed mechanism of yours now without any syscall changes you can simply flock() the file for the duration of your write. The solution you're proposing already exists, but locking sucks more than O_APPEND + <PIPE_BUF sized writes.
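
The flock() emulation would look roughly like this (a sketch; fd is assumed to be the log descriptor, opened with O_APPEND):

    #include <sys/file.h>
    #include <unistd.h>

    /* Hold an exclusive lock for the duration of the append so concurrent
       writers can't interleave, regardless of the write size. */
    void locked_append(int fd, const void *buf, size_t len)
    {
        flock(fd, LOCK_EX);
        write(fd, buf, len);
        flock(fd, LOCK_UN);
    }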


> I don't mean to sound dismissive, but do you have any idea of how O_APPEND interacts with PIPE_BUF?

Nope. I was replying to one proposed semantic and arguing that the ideal semantic is slightly different. I haven't looked deeply into how such a semantic could or could not be mapped onto current APIs.

> at the point where you're making >4K writes you're usually better off having some custom log system anyway and not rely on the kernel serializing things for you

I disagree. The kernel has access to the I/O queue, and I don't. The kernel can fundamentally do this more efficiently and robustly than user space can.

So it sounds like my options are to limit my write size, serialize writers in user space (impossible generally unless you can physically prohibit processes from running that don't follow the serialization protocol) or flock(), which "sucks." Sounds like there is room for an API that does what is desired but doesn't suck.

EDIT: I just read about O_APPEND and it doesn't sound like what I'm talking about at all. It always appends to the end. It sounds like if two writers race, both chunks of new data are appended. That's not as useful as what I described. I'm basically talking about compare-and-swap, except instead of compare-and-swap it's compare-and-append. If something changed in the meantime you don't want to just blindly append your new block that is unaware of the concurrently appended log data.


    > I just read about O_APPEND and it doesn't sound like what I'm
    > talking about at all. It always appends to the end. It sounds like
    > if two writers race, both chunks of new data are appended. That's
    > not as useful as what I described. I'm basically talking about
    > compare and swap, except instead of compare and swap its compare and
    > append
Correct. If you want to only write to the end of a file if that file is size X you'd either need your proposed syscall of "only write if size is X" or have every writer cooperate via flock() to achieve that via current semantics.

That's not at all what the grandparent is talking about though. They just want "log" files where you can only append and the write() is atomic. That doesn't mean "my write shouldn't work if there is a concurrent write" which is a semantic you're making up, and which I don't really see making sense for log files.

I'm really finding it difficult to imagine a plausible use case for your proposed semantics. Imagine this: You have 100 concurrent web serving processes all writing to an access.log; currently they can just write() O_APPEND to an access.log with a string below PIPE_BUF and their writes will end up on disk, but may not be interleaved in "real time", but who cares?

What you're proposing means that only 1 out of those 100 processes will succeed in their write() call. The rest will start a retry loop that's trying to write out some data whose order didn't even matter in the first place to try to flush it to disk because they all want to only write the data if the size of the file is N, and every single write updates N.

If you have a use-case for strictly serializing writes to disk like this that's fine, but it's not the common case with log files, and the way it's done is not to try to add a new API to the kernel, you just send messages to some userspace thread that queues them up and does the I/O for you.


> You have 100 concurrent web serving processes all writing to an access.log

That makes no sense. Have each process write to its own log. This is making your life hard for no reason.

> If you have a use-case for strictly serializing writes to disk like this that's fine, but it's not the common case with log files

This isn't about access.log, it's about transaction logs for database systems. Like the commit log for Postgres or SQLite. You can think of their commit logs as like a block chain. Imagine a block chain that didn't have a pointer to the previous block, but just a timestamp. It wouldn't work. If you have two conflicting blocks you need some central arbiter to decide which wins.


    > That makes no sense. Have each process write to its own log. This is
    > making your life hard for no reason.
This is literally how pretty much every daemon process works on Unix systems. You don't log in and tail /var/log/apache/access.log.{1..N} where N is the number of threads/processes. You just tail /var/log/apache/access.log which all of them write to by virtue of PIPE_BUF semantics.

Having to tail N logs where N is some unknown number of processes is what's making your life harder for no reason. That's what O_APPEND and PIPE_BUF are for.

    > This isn't about access.log, it's about transaction logs for
    > database systems. [...] think of their commit logs as like a
    > block chain.
I really don't see how any of this is relevant to the original point of having "log" files. Now you're talking about DB transaction logs and blockchains, which really don't need to have their nuances implemented in terms of POSIX file semantics. They can just have a single process that manages those expectations.

Everyone's expectations don't have to be implemented on the filesystem level, particularly when those expectations inherently involve locking writes and retry loops.


> You just tail /var/log/apache/access.log which all of them write to by virtue of PIPE_BUF semantics.

I can see how this would be convenient. That's fine and I have nothing against it. It's just not a problem I consider interesting, because there is no inherent need for all 100 writers to write to the same file. You could just as easily have each write to a separate file and have a separate process that merges them together if you want a single file for convenience.

> I really don't see how any of this is relevant to the original point of having "log" files.

Maybe you're not aware: every major database keeps a commit log, and writes it to a file called a log file (sometimes "write-ahead log file"). That is what I think of when I hear about "log" files in the context of file durability, because this is the case where file consistency and durability actually affect the consistency of the system. Here are some examples:

https://www.sqlite.org/tempfiles.html#walfile

http://www.postgresql.org/docs/9.1/static/wal-intro.html

https://leveldb.googlecode.com/svn/trunk/doc/impl.html

This is a much more interesting problem (to me) because it is harder and relevant to the consistency of any database system.

> Now you're talking about DB transaction logs and blockchains, which really don't need to have their nuances implemented in terms of POSIX file semantics. They can just have a single process that manages those expectations.

You do realize that SQLite can't just corrupt your database because you ran two SQLites concurrently, right?

SQLite has chosen to support concurrent SQLite processes all writing to the same database. Maybe you think they shouldn't do that, but they find it useful. And even in your world where they shouldn't do that, it's still not ok to just corrupt the database because the user didn't follow the rules.

> Everyone's expectations don't have to be implemented on the filesystem level

One of the kernel's key responsibilities is to arbitrate concurrent access to shared resources. It provides primitives that make it possible to build higher-level abstractions. What I am describing is a primitive that allows for lock-free appends to consistent transaction logs. It could be useful for a lot of things, and makes at least as much sense as plenty of other things that are already in syscall interfaces.


>> I really don't see how any of this is relevant to the original point of having "log" files.

> Maybe you're not aware: every major database keeps a commit log, and writes it to a file called a log file (sometimes "write-ahead log file").

> You do realize that SQLite can't just corrupt your database because you ran two SQLites concurrently, right?

Sure, but, so? I don't know sqlite's WAL logging code in detail, but I do know PostgreSQL's fairly intimately. I don't see how such an interface[1] would be relevant for WAL logging. Such logs usually have checksums and pointers to previous records in their format. For those to be correct each writing process needs to know about previous records (or at least their starting point). Thus you need locking and coordination in userspace anyway - kernel level append mechanics aren't that interesting.

In addition to that, if you care about performance, you'll want to pre-allocate the WAL files and possibly re-use them after a checkpoint. For many filesystems overwriting files is a lot more efficient than allocating new blocks. It avoids the need for fs-internal metadata journaling, avoids fragmentation, etc. With pre-allocated files you can then use fdatasync() instead of fsync(), which can be a considerable performance benefit in our experience.
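
A sketch of that pre-allocation idea (the 16 MB segment size is made up; zero-filling the file is another common way to pre-allocate):

    #include <fcntl.h>
    #include <unistd.h>

    /* One-time setup: reserve the whole segment so later appends don't
       change the file's size or block allocation. */
    void prepare_segment(int fd)
    {
        posix_fallocate(fd, 0, 16 * 1024 * 1024);
        fsync(fd);                 /* make the allocation itself durable */
    }

    /* Per commit: only the data needs flushing, so fdatasync() suffices. */
    void append_record(int fd, const void *rec, size_t len)
    {
        write(fd, rec, len);
        fdatasync(fd);             /* size/allocation metadata is unchanged */
    }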

There are things that'd make it easier and more efficient to write correct and efficient journaling, but imo not what you were talking about. Querying, and actually getting guarantees about, what size of writes is atomic, for example; otherwise you need to use rather expensive workarounds like WAL-logging full page contents after checkpoints, or double-write buffers.

Proper asynchronous fsync(), fdatasync() would also be rather useful.

[1] > I'm basically talking about compare and swap, except instead of compare and swap its compare and append


A good solution is to use SQLite. It addresses the issues (pretty much by doing all the fsync etc mentioned including on directories) and has a very comprehensive test suite. It is also used very widely on desktops, mobile devices, applications etc. https://www.sqlite.org/whentouse.html

A notable quote: SQLite does not compete with client/server databases. SQLite competes with fopen().


SQLite just (two months ago) got a new synchronous level EXTRA which fsyncs the directory after deleting the journal. The default is FULL (https://www.sqlite.org/pragma.html#pragma_synchronous ). So you still have to think about those problems.


> So you still have to think about those problems.

That is the case, but the advantage of the SQLite approach is that they have already done so, and hence their code is far more likely to be "correct" than any new code you write. The many billions of instances of SQLite already deployed also help with maturity and finding corner case issues.

And as a bonus it also lets you do rollbacks, transactions etc which aren't easy if trying to implement them yourself.


The concept was proven before w/ RMS in OpenVMS:

https://en.wikipedia.org/wiki/Files-11

It could work. It's just best to have hybrids with different types of files, including ones that bypass the RDBMS functionality, so one can select the proper reliability vs. performance tradeoff. SQLite might have cross-platform, FS-style APIs too that I don't know about; not sure if that's already there or would be extra development.


I really miss OpenVMS. It's a damn shame it doesn't run on x86-64. I might try to set up one of the Alpha AXP emulators that are floating around out there, if only to feed my nostalgia.


This company has licensed it from HP to update it and port it to Xeon:

http://www.vmssoftware.com/index.html

They're working on it as we speak. :)


Do you know if they address I/O reordering within the scheduler? For example transaction implementations often require that writes (distinct file system calls) hit the disk in a particular order to guarantee a sane state for the database. Prime example is the GNU bug for gzip:

http://bugs.gnu.org/22768

The writes to the `foo.gz` file have to hit the disk before the unlink, but the I/O scheduler can potentially reorder these, and a badly timed crash could result in data loss. Note that journaling doesn't fix this issue because the transactions are distinct too.


SQLite doesn't care about I/O re-ordering, but does care that the various fsync style calls work. SQLite uses a separate journal so it can rollback/forward changes.


I wonder what the implications of making an SQLite filesystem would be.


I'm not sure whether this actually helps with the problem described in the post, but such a thing does exist:

https://github.com/guardianproject/libsqlfs


If you then expose that file system through a POSIX file system API, you have all of the same issues of underspecified or unclear behaviour that the article mentions.


If I were designing a userland from the ground up, I'd probably give processes a transactional MVCC object store, and make guarantees about that; and then implement a "POSIX compatibility layer" file system API in terms of that, but explicitly say that none of the same guarantees from the object-store layer apply.

Some days I really do wish we weren't so inured to the particular 50-year-old systems-programming abstractions.


How would it be different than the filesystem API?

There was a great lightning talk a few years ago that I can't seem to find where the author described an API for storing blobs in a hierarchical namespace. Of course, halfway through, it became clear that it's just the POSIX API: you can "open" handles to objects, "rename" them, remove them, and so on. You'd end a transaction with "fsync()". (Okay, that one's a little more complicated, but I don't think it's as hard as the OP claims, at least for single files. Multiple files are more complicated, but that problem is intrinsically more complicated.)


The problem is that fsync() runs outside of the control of the program. There's no way for an application to start a transaction, perform steps x, y, and z, and then end the transaction, rolling back to before step x if there are any failures. For example, suppose you're rotating an audit log file at the same moment your backup program is running. Your backup program reads the directory at the exact moment your rotate script has renamed the old file and created the new one, but before any data has been written to it. What does the backup program see? Does it see the old file you just renamed? Does it see the new zero-length file? Your backup now has an indeterminable state, and potentially lost data, because the backup never received a consistent view of the overall data. Were there a way of creating a transaction, a second program looking at the same data would either see the old file, or would see the new file and the rotated file. This is where a transactional file system would be of great benefit, because it reduces the amount of indeterminate state to a minimum, even while multiple programs are operating on the same files.


Realistically, backups need to be based on atomic snapshots, like `zfs`.


Yep. I'll admit my example was a bit convoluted, but I was trying to show a way in which common tasks can race and create undesirable situations. ZFS snapshots would definitely make things quite a bit more predictable.


Two key differences:

1. "MVCC": the ability to effectively get a handle on a static copy of the entire filesystem, perform mutations to many objects, and then submit the transaction, at which point it will either commit or rollback depending on whether any of those objects have been modified.

Windows actually has this: https://en.wikipedia.org/wiki/Transactional_NTFS allows for exactly the sort of MVCC I'm talking about—and presents a very different API than the POSIX filesystem one.

2. "Object store": as in, you don't interact with the filesystem by getting writable handles to the file's underlying backing store, where processes can effectively treat a file as persistent shared memory. Instead, you ask for a floating unnamed "buffer" object, fill it up, and then submit it to the filesystem to be atomically stored; or, vice-versa, you ask to retrieve a file, and are passed a buffer handle of the representation of the file at the point in time you asked for it (presumably, for optimization's sake, backed by copy-on-write mmap'ed pages from the latest MVCC-linearized copy of the file.)

We actually have this today as well, but in an implicit and half-assed way. Files below some size threshold, in modern filesystems, don't use any FS extents for backing, but are instead stored as part of their filesystem directory-entry node. This basically makes the filesystem into an object store for those files—but without actually exposing any guarantee to userland that those files will be read/written as atomic operations.

---

To be clearer, there's a problem with this fs-atop-object-store design, as I've stated it so far:

An object store doesn't support every use-case a filesystem does, and a filesystem implemented only in terms of an object-store wouldn't be good for some things because of that. It would be terribly expensive to emulate a block device on top of an object store, for example—you'd have to make each emulated block an object, and mutate blocks by transactionally adding an updated block and removing the old block. This means that it wouldn't make sense to keep mountable read-write disk images on such a filesystem; nor would it make sense to keep database backing stores there.

But both of those things are effectively things that take no advantage of existing on top of a filesystem in the first place—they do their own index-building, their own sparse-allocation, their own journaling, etc. Making the filesystem an object store just makes it obvious that for those use-cases, the filesystem itself is pure overhead.

So, along with the MVCC object-store, the other part of my hypothetical system is a "buffer store": basically like a logical-volume manager, an API that consumes physical block devices and exposes handles to (cheap) logical resizable block devices. The object store could be implemented in terms of one of those logical block devices, but otherwise would completely ignore the existence of it. The filesystem compatibility layer, on the other hand, would allow you to request that a given filesystem object be backed by a persistent buffer (newly-created logical block device) instead of an object; and then whenever you requested that object from the filesystem, you'd get an IO handle to the logical block device, instead of an IO handle to a transactionally-resubmittable copy-on-write mmap of some object.


It surprised me to read that Microsoft is considering removing transactional NTFS because "there has been extremely limited developer interest in this API platform since Windows Vista primarily due to its complexity and various nuances which developers need to consider as part of application development" (https://msdn.microsoft.com/en-us/library/windows/desktop/hh8... linked-to from that Wikipedia page)


Microsoft actually tried something similar with WinFS [0] in the 00's, but the project failed.

[0] - https://en.wikipedia.org/wiki/WinFS


I was a developer on WinFS. It turns out to be hard to have a performant system that also has transactional commits between something changing a regular file, and a change in a database. It was also stupid to write OS components in .net, when it would just crash if it ran out of memory, among many other problems. Many people working on it wrote reports about how this was infeasible or impractical, yet they just said keep going, it will be all right.

I think we could have built something that worked well enough, but they grew cowardly after the general failure of the entire Longhorn project. Don't try to change the compiler, the language, the OS, and the office application to use the same "in motion" components, plus the database.

It's probably little known outside of Microsoft, but they tried to build many different higher end object filesystems, including JAWS (Jim Alchin Windows Filesystem), and others I can't even remember.


Oddly, Apple's Core Data framework achieves pretty much all the goals WinFS set out to achieve—but does so sitting (for no good reason, really) on top of a regular filesystem. It'd actually be pretty easy to flip things around: give Core Data a raw block device persistence backend, and then write a filesystem driver on top of it.

One of my recently-planned hobby projects was to write a FUSE filesystem for OSX that would take all the shell-level glop (LaunchServices UTI-associated verbs, NSFileManager attributes, Spotlight metadata, etc.) and stick it onto the files themselves as plain xattrs, so POSIX tools could effectively be made to work with shell-namespace "Documents" (including treating directory-packages as files, and traversing archive files and container-file-formats as if they were directory-packages.) It'd be quite interesting to combine the two experiments, actually; it'd result in something very much like a POSIX compatibility layer on top of a WinFS-like filesystem.


> but does so sitting (for no good reason, really) on top of a regular filesystem

You don't like files? Files are great.

Running Core Data over SQLite instead of a "real database" gets you backups for free, the iOS file-based security model, lets you delete an app's data when it goes away, network mounted home directories… really, the filesystem abstraction is pretty good still.


You might look at probably the most popular system to ship with an object store, the Apple Newton. I would probably start there and work my way up API-wise.


> A notable quote: SQLite does not compete with client/server databases. SQLite competes with fopen().

So if I need to create a file with a timestamp, you are suggesting creating a SQLite database, a table, and inserting my timestamp there?

It's nonsensical to "fix" a Unix low-level syscall issue by using SQLite.


I'll bite. Please post the C code required to portably and safely create a file and write a timestamp to it in a way that beats the safety provided by SQLite.

Most C programmers can't even write that piece of code. Let's see if you can.


I'm pretty sure any OS or API developed continuously over 25-40 years (depending how you count) would suffer from the same problems. Technology changes. Early assumptions prove to be wrong. Any OS developed from scratch now would suffer the same problem 25 years from now.

It's worthwhile to use libraries which embody the knowledge, trials and errors of years, such as SQLite.


SQLite is an SQL database engine. Using it for storing a single value is like attacking a fly with a nuclear weapon.

Unless you are talking about a function in the SQLite library that implements the durability calls properly and can be called from other programs.


If you do not want to use SQLite to solve this problem then download the SQLite source code and copy the way that they achieve these guarantees.

Stand on the shoulders of the giants who have come before us.


You should use SQLite whenever you are tempted to write some data and you have any expectations to be able to read back that data. That applies even for a single value although most programs will store more than just a single value.


SQLite stores data on disk in a file.


Er, yes, that's the point.


I think his point was that SQLite still has to work around the durability problem described in the article. As such, SQLite isn't a solution, it's a middleware API. I'm not trying to knock it, but solving a problem by adding a new layer isn't good design.

Also, middleware shouldn't be needed in open source because you can simply patch the underlying API. That the middleware is the best solution in this case says a lot about the problem.


SQLite does solve the problem because it knows exactly what fsync()/fdatasync()/whatever calls are necessary to persist a file, and it also knows that for the many operating systems that SQLite is implemented on. It's also tested regularly.

I can't believe the level of ignorance in this thread.


So the reliability and storage semantics bottom out at the filesystem after all. It's not as simple as 'I'll use SQLite: problem solved'. Not for the authors of SQLite, for sure.


The point is: the authors of SQLite solved that problem. Properly. For many more OSes than you or I will ever see. And they test it.


"competes with" does not mean that it's better in all use cases.

But even in an extreme example like storing a single timestamp, there are situations where you want various guarantees that SQLite makes easy.


One of the engineers at Google took the time to figure this out, and updated the page out code in the Linux kernel they were using so that the "correct" steps were known if you had to know your data was on disk. It was discussed on LKML as I recall but considered "not generally useful" and I doubt it made it into the main sources.

One of the interesting things about writes is that they disrupt reads more significantly than you might expect. Greg Lindahl characterized the impact at Blekko when we were crawling so that we could optimize writes from the crawl to not disrupt latency on the search engine side. Later we completely separated those functions for similar reasons. I believe every disk I've evaluated over the years is slowest on the random read/write 50% test.


Even DRAM will be slowest on a random read/write 50% test (it has read-to-write and write-to-read penalties; admittedly these will be dwarfed by precharge/activate penalties, but still, mixing reads and writes will make things worse.)


I think the man page http://linux.die.net/man/2/fsync answers the questions the article asks about fsync().

fsync() = YES metadata, YES data, NO dir entry

fdatasync() = NO metadata, YES data, NO dir entry

fsync() of dir = NO metadata, NO data, YES dir entry


For Linux, but it's not portable across all Unixes. fsync is allowed to do nothing (1) and does not need to work on directories (2). On Mac OSX, fsync will not flush the disk cache, which can lead to data loss (3).

(1) http://pubs.opengroup.org/onlinepubs/9699919799/functions/fs...

(2) http://austingroupbugs.net/view.php?id=672

(3) https://developer.apple.com/library/mac/documentation/Darwin...


fsync(2) on OS X and other Darwins also does not sync metadata. You need to use fcntl(2) with F_FULLFSYNC for that.
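
i.e. roughly:

    #include <fcntl.h>

    /* Darwin only: request a flush all the way to stable storage, rather
       than just handing the data to the drive. */
    void full_fsync(int fd)
    {
        fcntl(fd, F_FULLFSYNC);
    }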


It looks like doing `sysctl -w kern.always_do_fullfsync=1` will produce sane behavior:

https://opensource.apple.com/source/xnu/xnu-2422.1.72/bsd/hf...

Alternatively, you could just use the ZFS driver for Mac OS X. It will do fsync properly.



I ran across a nice paper a few years ago: Rethink the Sync.

https://www.usenix.org/legacy/event/osdi06/tech/nightingale/...

The basic idea is that the file system provides two guarantees, one boring, and one interesting. I'll illustrate using code instead of words. The boring guarantee is:

    FILE* f = fopen("autosave.bak", "w");
    fwrite(buffer, 1, length, f); // save current document
    fclose(f);
The system will try to make the write durable within 5 seconds of executing the second line. The interesting guarantee is this:

    FILE* f = fopen("autosave.bak", "w");
    fwrite(buffer, 1, length, f); // save current document
    puts("Saving complete!");
    fclose(f);
The system guarantees that the user will not see the message until the data is safely on disk! The way they do this is to implement a fancy dependency tracking mechanism that makes sure that the computer never generates output until the writes that the output depends on have completed.

They do a bunch of benchmarks that show that their system is almost as fast as mounting ext3 _asynchronously_. (In fact, not much worse than a RAM disk even.) Of course, they also show that in the case of power loss, their system behaves well, whereas ext3 does not, unless you turn up all the paranoia knobs to 11, and then the performance is WAY worse than their system.

I'm oversimplifying quite a bit, since this comment is already pretty long. Read the paper for details. Or ask questions here, but I'll probably forget to check the comments, because HN doesn't remind me :(


I misread this comment and it could be dangerous if someone who doesn't know misreads it the same way. So I want to put a big warning here.

Warning! This comment is _not_ saying that your operating system provides these guarantees. In fact, it almost certainly does not. This is a novel suggestion (and implementation?) presented in this particular paper.


Yes, definitely! If the edit timer hadn't expired, I would rewrite this. I wrote this in a horribly unclear way.


Looks like I remembered, but there aren't any comments.


Disks can buffer, disks can have firmware bugs, disks can fail both catastrophically and subtly. Even the fanciest battery-backed RAID controllers have firmware bugs and dead batteries. Those are just ways your storage hardware can fail you even if the kernel and libraries are bug free and you follow the right mystic incantations to sync data.

While there's no excuse for bad documentation or poor APIs, you can never consider data written to a single local disk "safe". It never is.

It's a shame making a best effort at safety is nontrivial, but it does force developers to write more defensive and crash-safe code which is all that can save you in the end.


Yeah, from an SRE perspective, the last N writes are always purely probabilistic. The real quest is to have enough redundancy that that probability curve stays fairly close to 1, and enough failure warning that your system will fix itself before it droops. That means battery backup, SMART detection, ECC memory, etc.


It's even worse than that. All writes have some probability of being incorrect, even after having been written to disk properly (the programmer did the right thing, the OS did the right thing, the filesystem did the right thing, and the disk did the right thing). After writing, your data can be modified by actions on other cells or simply by being left alone completely (see read disturbance and charge leakage). Some of this can be fixed by error correcting codes, but there is still a chance of losing data simply by doing nothing wrong.

So yes, agreed++. You need to have a level of redundancy appropriate to the criticality of the data.


EBS, Riak, etc. Lots of easy solutions to this problem of putting the data in several diverse places with one write call.

Use what works and don't complain that broken stuff is broken.


Stupid question: why don't all computers (irrespective of the OS) have a built in power loss mechanism? It seems to be such a common and obvious problem.

1. The PSU would have a big enough capacitor to keep the computer running for a few seconds at its stated output power

2. The PSU would notify the OS of a power loss

3. The OS would immediately flush all caches and adopt a "brace position".

4. Events are spread system wide so that apps can also flush and brace.

It should work even if it is the PSU that fails (as long as the capacitor is there).

Surely the problem cannot be the cost. Why don't modern desktops have that feature?


You'll find this old mailing list post a good answer to your question: http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc

The answer is cost. We used to have the computers you're describing. They were made by SGI, and the hardware & OS made all sorts of guarantees like the ones you're describing, to the point that when XFS was initially ported to Linux x86 it had all sorts of bugs that simply couldn't happen on SGI machines.

The x86 boxes were cheaper, nobody really cared about hardware reliability enough to pay the price for the likes of SGI, and the rest is history. Today we have a "worse is better" hardware architecture and software has to be able to handle it.


Very interesting link. Thanks


You have just described a UPS. Or a laptop. A battery just makes more sense than a capacitor in this application.


Why? Wouldn't a large enough capacitor last longer than batteries? UPS batteries I've seen are only rated for 3 years, and half of my recent laptop batteries have physically swelled after 3 years of use.


It would, but the size would be impractical and it's too costly. Quick estimate: let's assume we need 10W for 10 seconds; that's 100 Joules of energy. The energy stored in a capacitor is 0.5·C·V^2, so if we use a 10V capacitor, then C = 2 Farads. They exist, but they are very large (look them up on Amazon, for instance). You'll probably need more like twice the capacity though, because it's impossible to extract all the energy from the capacitor, and it's lossy to convert it to a constant +5V / +12V.


Instead of 10 V and 2 F, how about going for 2.5 V and 50 F? A capacitor with those specs is only 40 mm long and 18 mm diameter [1]. That shouldn't be too hard to fit in a typical server or desktop. That's under $4 in quantity.

[1] http://www.mouser.com/ds/2/257/Maxwell_HCSeries_DS_1013793-9...


Theoretically it's possible, but in practice the lower input voltage makes it harder to convert it to +5/+12V. It becomes increasingly lossy and expensive: converting 0.5V to 12V at 10W is not trivial to begin with. Voltage across a capacitor drops continuously while discharging (unlike a battery), so with a 2.5V capacitor, being able to use it between 1.25V and 2.5V is already pushing it. On average, the discharge current at 10W is around 8A. The internal resistance of the capacitor (ESR) had better be very low (it probably isn't) at this low voltage: even if it's 0.1 Ohm, at 8A we already lose 0.8V from our meager 2.5V, and now the usable range is only 1.7V - 1.25V = 0.45V, i.e. roughly 0.5·C·(0.45V)^2 ≈ 5J of usable energy, just enough for 500ms at 10W.


Interesting. 500 ms would probably not be enough time to save everything (although maybe on an SSD-based system it would be...), but it would probably be enough time to save select information that would allow ensuring that the disk is in a consistent state.

The 10 F capacitor has an ESR of 0.075 ohm, but that's at 1 A. They have a 100 F model that is 0.015 ohm at 10 A, and a 150 F that is 0.015 at 15 A. Based on your calculations, these look like they would have a good chance of giving enough time to save everything (especially on an SSD system).

Those are physically bigger but should still fit in a normal desktop or server.

(There is another manufacturer that has up to 630 F!)


The idea is rather to have it built in. A retail consumer shouldn't have to worry about these things. And the idea would not be to sustain a long power interruption like a UPS, just to give time to flush caches and avoid corruption. Which should be much cheaper than a UPS.


I would guess because no promise is much better than a broken promise. No one wants to take blame.


We have to accept that it wouldn't save the day in all scenarios. If there is a 60MB excel file to save on a slow network drive, the file will be lost. But this mechanism would still reduce file/disk/database corruptions by an order of magnitude.


1 is not possible on anything with more than one >100MHz chip


You mean there is no capacitor powerful enough?


It's one thing to add a $0.50 supercap and a diode to an SSD (and even then the likes of OCZ didn't bother, for cost reasons). It's entirely another to slap a bank of boost/buck converters, a 100F cap and a spider web of control circuitry all over the place.

Adding $20-50 in BOM, the size of two D batteries, just in case something _very rare_ happens is not economically viable. There are better alternatives, like banks of batteries a la Google's boxes. Turning a server into an oversized laptop gives you more than a couple of seconds of buffer.


There are worse problems at a lower level: hardware caches on the disk do not really guarantee flushes are honored either. There's a discussion of this here; honoring writes is an enterprise feature... http://serverfault.com/questions/460864/safety-of-write-cach...


I use something akin to djb's Maildir delivery procedure when I want to be as close to sure as I can be that a file 'has been saved'.

1. Create a temp file on the same mount

2. Write data to file, checking return of each write(), then a final fsync(), and close()

3. link() to filename we actually want
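
In rough C, that procedure looks something like this (the Maildir-style paths are made up, error checking is elided; link() fails instead of clobbering if the final name already exists):

    #include <fcntl.h>
    #include <unistd.h>

    void deliver(const void *buf, size_t len)
    {
        int fd = open("tmp/msg.12345", O_WRONLY | O_CREAT | O_EXCL, 0600);
        write(fd, buf, len);       /* check each write() in real code */
        fsync(fd);
        close(fd);
        link("tmp/msg.12345", "new/msg.12345");  /* publish under the final name */
        unlink("tmp/msg.12345");                 /* drop the temporary name */
    }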

I still haven't figured out how to be this safe on Windows, because AFAICT there's no atomic way to link() or 'move' a file - and they canned the transactional API for the filesystem. Any pointers to how to write files 'safely' on Windows would be much appreciated!


`ReplaceFile` is supposed to be atomic, at least according to the documentation: https://msdn.microsoft.com/en-us/library/windows/desktop/hh8...


UNIX has "a standard way to deal with the file durability problem." If you fsync the file, then fsync the containing directory, that should be durable on all properly behaved setups.

Of course this comes with a lot of footnotes. There was a bug on some older Linux kernels where they didn't tell the hard drive to flush the data after an fsync. This is why a lot of people still incorrectly believe that enabling hard disk write cache is unsafe. But this was a bug, and the bug was fixed. There are also reports of hard drives that don't honor flush requests. There's not much UNIX or any other OS can do about this-- if the hardware lies, you are in trouble.

You could also imagine a much richer durability API. Empirically, databases need such a richer API rather than simple fsync. The POSIX async I/O standard was supposed to standardize all this, but Linux's glibc just implements it as a thread pool making blocking system calls. If you want real async I/O on Linux, you need to use an OS-specific API.


This is kinda old now, 2013, but I found it a useful quick read that's related to some of these issues, at least on Linux. Many of the comments are also interesting. Atomic I/O operations in Linux https://lwn.net/Articles/552095/


I don't see why you would have to fsync the "parent of the directory", too. That may not be explicitly specified, but it works just this way:

Everything is a file ("object"), whether it's a standard file, or a directory. If you create a new file, you want two things synced: the contents of the file, and the linking of the file (that is the pointer from the directory in which you created the file, to the file object). If you don't sync the link you may not be able to find the file again, even though it was synced to the disk just fine. (It's just how git works btw.)

The link to the file is part of the directory object's contents. That's why the directory needs to be synced.

There is no need to sync the "parent of the directory" because that was never modified.
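
Concretely, "syncing the directory" is just an fsync() on a file descriptor for the directory itself, e.g.:

    #include <fcntl.h>
    #include <unistd.h>

    /* Make the directory object's contents, i.e. the new link, durable. */
    void sync_dir(const char *dirpath)
    {
        int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        fsync(dfd);
        close(dfd);
    }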


Most programmers, and especially the hipsterish HN crowd, simply cannot be trusted to write correct file manipulation code. In general, anything they produce will be riddled with race conditions and erroneous assumptions (e.g., that rename works cross-device, that close cannot fail) and that breaks in rare but possibly catastrophic circumstances.

The solution is copy-on-write file systems such as ZFS and Btrfs that ensure neither data nor meta-data are ever altered in-place and the reuse of correct file manipulation code (written by adults) rather than rolling your own, either from a library or something higher-level like SQLite.


Related link with a very detailed writeup: http://danluu.com/file-consistency/ (surprised no one posted it yet).

TL;DR: files are hard; filesystems differ a lot; SQLite is very robust, most other software (including git and mercurial) not so much.


> One issue is that unlike many other Unix API issues, it's impossible to test to see if you got it all correct and complete. If your steps are incomplete, you don't get any errors; your data is just silently sometimes at risk.

There is no problem with the API. The issue is in an underlying assumption: that data is sometimes not at risk. Sadly, or maybe luckily, there is nothing you can do to guarantee 100% durability, so there is no point trying to do the impossible. Your data is always silently at risk. Accept it, deal with it, and minimize the risk if you need to, for example by replicating your data synchronously across continents.


Rule of thumb for OS disk I/O: write as soon as possible, i.e. as soon as you can without hurting performance, using in-memory buffers to amortize the cost of the slow device and spreading the operations out over longer periods.


The article wasn't about the decision of when to write to disk, but how to actually do it. Say you have a point in your program where you have made the decision that it is necessary to write to disk. How do you actually do that for sure? It's not write() or even necessarily fsync().

The author is finding this frustrating because there are several things to do that can seem arbitrary, random, and counterintuitive. Their trust in the OS was damaged, and now they are understandably grumpy about it.


> How do you actually do that for sure? It's not write() or even necessarily fsync().

Right. For the decision of "when", you can configure a few kernel parameters, namely the dirty-write thresholds. That assumes you just do write()s from your process and the kernel decides when to flush that data out. This can be configured as a function of time and/or the amount of unwritten data.

I had to do this a few times. Once it was a realtime-ish system which was recording data to disk, and I noticed the recording thread would time out. (Timeouts were in the tens of seconds and they were noticed by a watchdog system.) I thought 10 seconds should be enough for that process. But it turned out that, because of how priorities were set up and how fast writes went, dirty page flushing was not keeping up with the writes. Periodically it would hit the top threshold, and at that point any process doing disk writes would be blocked.

I was able to get it under control by essentially doing what the gp suggested: writing a little bit at a time, but more often. The total throughput probably went down, but performance was smoothed out quite a bit.
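
For reference, the knobs are the vm.dirty_background_ratio / vm.dirty_ratio sysctls (and their *_centisecs relatives). And a sketch of the "write a little, flush a little" idea using the Linux-specific sync_file_range(2) - not necessarily what we did, and it only paces writeback, it is not a durability guarantee; the 1 MiB chunk size is an arbitrary assumption:

    #define _GNU_SOURCE          /* sync_file_range() is Linux-specific */
    #include <fcntl.h>
    #include <unistd.h>

    /* Write in chunks and push each chunk into writeback as we go,
       so dirty pages never pile up to the blocking threshold. */
    static int write_smoothly(int fd, const char *buf, size_t len)
    {
        const size_t chunk = 1 << 20;
        size_t done = 0, flushed = 0;

        while (done < len) {
            size_t want = (len - done < chunk) ? len - done : chunk;
            ssize_t n = write(fd, buf + done, want);
            if (n < 0) return -1;
            done += (size_t)n;
            if (done - flushed >= chunk) {
                if (sync_file_range(fd, (off_t)flushed, (off_t)(done - flushed),
                        SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE) < 0)
                    return -1;
                flushed = done;
            }
        }
        return 0;
    }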


That's still missing the point. Say you're writing a program that stores a log of transactions. People are trading resources and you log those transactions so you can find out who owns which resources. User A just gave 100 units to user B; you need to store this information and tell both users that the transaction is complete. How do you do that? Your program will do something like "write(t_log_fd, t_entry, t_bytes);". Can you now tell the users that the transaction is complete?
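
(The answer, of course, is no, not yet. The bare minimum before acking looks something like the sketch below, using the same hypothetical names; and even that assumes the log file's directory entry already exists and that the drive honours flushes.)

    #include <unistd.h>

    /* Returns 0 only once the appended record should survive a crash,
       modulo lying hardware. Error handling abbreviated. */
    static int commit_entry(int t_log_fd, const void *t_entry, size_t t_bytes)
    {
        if (write(t_log_fd, t_entry, t_bytes) != (ssize_t)t_bytes)
            return -1;                /* error or short write: not committed */
        if (fdatasync(t_log_fd) != 0) /* force the record (and file size) to disk */
            return -1;
        return 0;                     /* only now tell users A and B it's done */
    }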


I guess I misunderstood your point that "The article wasn't about the decision of when to write to disk".

I thought you were asking how to control, in general, when your system writes to disk.

Well, if you don't use fsync, then the kernel decides at some point, as I mentioned above. If you want to talk about transactions, then do a write and an fsync; does that not work for you sometimes? If you are worried about the new file appearing in the directory, then fsync the directory as well.

If you want more guarantees, you'll have to dig deeper and find out about your specific device: does it have battery backing, does it lie about having written data, and so on.


Not sure if I understand the problem. In most cases you don't want to flush file changes or dirty memory to disk immediately, so you can batch those operations or write only the final change and not all the intermediate ones. But you don't want to wait forever, since RAM is volatile. This is why databases have settings for checkpoint intervals: you can set the percentage or age of dirty pages to be written (I thought the default was 30 seconds). Doesn't the same idea apply to file buffers?


FWIW this uncertainty in fsync behavior is essentially the cause of the recent PostgreSQL bug on ext4 (fixed in the last round of minor releases).

Basically, on XFS and all other tested filesystems it seems to work just fine; on ext4 we may lose the effects of the last rename().

So this makes it difficult to get the durability right, even when the project is as careful about it as PostgreSQL.


We needed this recently when doing some custom filesystem work, and ended up patching our kernel with something we found in a discussion on lkml.


Is it just "Unix"? What if fsync returns when the hardware indicates that it has completed a write, but it's actually in some drive controller cache for a few moments more?


If you care about these types of things, you can tell the drives to not cache writes.


What about O_DIRECT | O_SYNC options in linux's open()?


O_SYNC has abysmal performance. O_DIRECT has very underspecified semantics, but particularly it demands that you only read and write whole "blocks" (where "block" depends on the filesystem type, underlying block device and phase of the moon).
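
For the record, the alignment dance looks something like this sketch (4096 is only a guess at the right block size, which is exactly the problem):

    #define _GNU_SOURCE          /* O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Both the buffer address and the I/O size (and the file offset)
       must be aligned to the assumed block size. */
    static int write_direct(const char *path, const void *data, size_t len)
    {
        const size_t align = 4096;
        size_t padded = (len + align - 1) / align * align;

        void *buf;
        if (posix_memalign(&buf, align, padded) != 0)
            return -1;
        memset(buf, 0, padded);
        memcpy(buf, data, len);

        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
        if (fd < 0) { free(buf); return -1; }

        ssize_t w = write(fd, buf, padded);
        close(fd);
        free(buf);
        return w == (ssize_t)padded ? 0 : -1;
    }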


See also Linus Torvalds' opinion on O_DIRECT:

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances.

https://lkml.org/lkml/2002/5/11/58


fdatasync() followed by fsync() "should" do it although on very recently created/mounted directories you'll want to use sync() as well. Luckily sync() is synchronous on Linux (sanity!) even though traditional Unix doesn't require it to be.

So perhaps the article can be shortened to "sync() is not guaranteed to be synchronous on non-Linux Unix."


Yet another example about how the filesystem, and its POSIX realization, is in fact a horrible abstraction.


I don't think there is anything wrong with the POSIX file system API. Two things:

- I think it's mainly the modern file systems like Btrfs, and I think partly also ext4, that introduced a paradigm shift which broke old applications (or at least broke their performance, dpkg for example).

- We're talking about a hierarchical file system, meaning it's easy for humans to find data but terrible for machines, because they must chase pointers to get to files. There's a similar problem for syncing a bunch of logically connected files if they are not stored in a single directory. (How do you atomically commit multiple changes across directories?)

What would be a better abstraction or non-abstraction (that's still a hierarchical file system)?


A friend of mine had the realization that all optimization boils down to making lower layers understand higher layers' concerns, or making higher layers understand lower layers' concerns. Here are some examples off the top of my head:

POSIX fails as a high level interface:

- No (at all powerful) notion of transactions. Mutation without transactions is extraordinarily primitive.

- No multiple FS roots to indicate boundary across which data will never be synchronized. (This is also good for how to spread data across multiple devices, a low-level concern.)

- Overall pushes people to maintain their own structure within files rather than use FS's trees.

POSIX fails as a low level interface:

- No way to hint or control caching along the memory hierarchy.

- Block size and locality are out of your control.

- Can't statically disallow various non-free actions which are costly (the ability to expand a file, say); you can only hope that not using them incurs no penalty.

Perhaps ZFS and Btrfs made other approaches more viable/obvious, but these weaknesses are inherent to the API itself, and the high level stuff was noticeably missing from the get-go.


> - No (at all powerful) notion of transactions. Mutation without transactions is extraordinarily primitive.

There is nothing that prevents you from implementing these in userland. It does not belong in the kernel, since the kernel can't know which of your writes form the atomic unit that must be committed together. Research how databases do it.

> - No multiple FS roots to indicate boundary across which data will never be synchronized. (This is also good for how to spread data across multiple devices, a low-level concern.)

You can check what device a file belongs to with stat(2). It's the st_dev member. You can also check it from the shell with the stat command.
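
For example (the paths here are placeholders):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat a, b;
        if (stat("/srv/data", &a) != 0 || stat("/tmp/staging", &b) != 0)
            return 1;
        puts(a.st_dev == b.st_dev
                 ? "same filesystem: rename() between them will work"
                 : "different filesystems: rename() would fail with EXDEV");
        return 0;
    }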

> - Overall pushes people to maintain their own structure within files rather than use FS's trees.

And that's entirely ok. Hierarchies are not for databases. Database-y problems are not the problems that the POSIX fs solves.

> - Block size, locality out of control.

I actually heard that Unix filesystems have traditionally been quite good at preserving locality. That's why I don't know of any defrag tool for e.g. ext3.

Overall, if the FS does not solve your problems, implement your own abstraction. That's perfectly ok.


I think this is a bad option from a concurrency point of view, but why not one of the relational database alternatives for maintaining a hierarchy? http://stackoverflow.com/questions/4048151/what-are-the-opti...


It wasn't ext4 that introduced a paradigm shift but ext3: the fsync() implementation in ext3 by design wrote all buffered writes for the whole filesystem, not just a single file, hence it was so horribly slow that it taught a generation of application programmers never to call fsync() to avoid multi-second pauses.


You can't say something like that without giving at least one link to another of the examples you mention!


Actually make that about how most layers / abstractions for IO in mainstream OSes are garbage.


But what if, behind the scenes, the hardware is performing similar tricks as the OS?


Even if you solve all these problems, if your RAID array completely loses its mind, then you can still lose/corrupt data.

The best solution here is to replicate data across multiple machines, take good backups that you can restore from, and plan on corruption happening.

You also need to assess exactly how much you really need those transactions. If they're financial transactions where each one could be millions of dollars of stock, then you probably need to care about this a bit. For most web transactions I think the risk/cost analysis is that you don't need to worry about being perfect. You might lose the last one or two transactions in the case of a hard crash, but your customer service team should be able to handle that and fix it for the customer out of band. You really should consider whether you have a business justification for worrying about perfection.

On the other hand, I did live through the ext2 era before ext3 was production ready, and fully async super-duper fast filesystems really are bad. They would regularly corrupt the disk and require rebuilds. That wasn't a data availability problem, but one or two of those a week was a problem for the operational load (out of the 400-800 servers that we managed at the time). We later scaled out ext3 to 30,000 servers, with some sets of servers having 2,500 hosts of basically the same type of webserver, and while kernel crashes were a daily issue, corruption and rebuilding were relatively rare. If you're not at Google/Amazon/etc and dealing with server counts an order of magnitude higher than this, you don't really need to worry about it. ext3 or ext4 and fdatasync should be fine, and then apply proper levels of engineering principles to ensure that you don't stay offline for too long or lose too much data.

You are dealing with free commodity hardware and software that isn't ever going to be perfect. If you really needed to never lose a transaction you'd probably be buying some kind of awfully expensive mainframe system.

Oh, and I do recall one case of filesystem corruption leading to a service being down for over a week and probably the loss of a multi-million dollar business deal. But in that case the software ran on a single box. The dev team responsible for it never saw that as a problem, even though the ops/sysadmin teams kinda yelled at them about it. Then one day the RAID array lost its mind and the server was unrecoverable. When we attempted to rebuild it, we discovered that over the years the software devs had tweaked the versions of libraries their software linked against, and with the crash all that information had been lost, so it took ages to debug and find the right incantations to get it all back up again. Huge business risk there, but nothing that could be mitigated by navel-gazing analysis of filesystems and fdatasync -- backups, documentation, replication, proper config management practices, etc. were what was needed.


Can't someone smart just read the source code and figure out exactly under which conditions files get written to the disk?


Which source code? There's more than one implementation of all of the following: OS kernel, disk driver, and filesystem.


It only takes one combination to start the chain reaction. If someone identifies one combination of kernel, disk driver, file system, hardware, and syscalls that results in reliable durability then people who care about reliable durability will start using it, and that will eventually turn into a de facto standard which will eventually turn into an actual standard.


Until someone who doesn't care changes some code and that pattern is no longer durable.


All of it - these things aren't straightforward, and you need to go knee-deep into this area of programming to have a good understanding of it.


The problem isn't just the mainline source code; you also have drivers, which of course vary for each device.

Then even completely beyond control of the kernel, devices have their own RAM caches, which means that even if the device reports that something was written, it might not have been. So there's no way to absolutely be sure.


Even assuming you can look at the source code for your filesystem/kernel, the result of a given write still depends on the conditions and order in which your writes hit the disk relative to other processes' writes.

For example, if two processes issue writes to adjacent disk sectors, but one of the processes' writes also affects a different sector farther away on the physical disk, the I/O scheduler may reorder the processes' sector-level writes. Normally journalling addresses this issue, but if the journalling transactions are reordered then it's no help at the application level.


The Linux Documentation Project has this to say about flushing of buffer cache: "In traditional UNIX systems, there is a program called update running in the background which does a sync every 30 seconds, so it is usually not necessary to use sync. Linux has an additional daemon, bdflush, which does a more imperfect sync more frequently to avoid the sudden freeze due to heavy disk I/O that sync sometimes causes."


Wow, I'd love to know what they mean by "imperfect". In Linux, specifically, the sync_dirty_buffers syncs all the dirty write buffers (nope - see edit) and isn't imperfect or incomplete in any particular way.

Edit: Never mind, I'm remembering now that there is a dirty_background_ratio separate from dirty_ratio. It will sync some of the dirty buffers as long as it's above dirty_background_ratio and under dirty_ratio. And it was actually bdflush that determined all of this; the sync_dirty_buffer function just took a specific buffer object and synced that.

Edit 2: And I'm finding out that information seems to be outdated. Here's the old documentation about bdflush [1]. That was either moved or removed in the latest version, but I did find this [2], which suggests that there was an old bdflush vs something else now. Looks like I missed something important and have some research to do.

[1] http://lxr.free-electrons.com/source/Documentation/sysctl/vm...

[2] http://lxr.free-electrons.com/source/Documentation/vm/active...


I've been told to type sync into terminal whenever I want writes to complete.


The sync(1M) command provided on many UNIX systems is often a thin wrapper around the sync(2) system call. The sync(2) entry point does not, as specified, guarantee that all inflight data _has_ been written; it's more of a guideline than an actual rule. The user must subsequently wait for the data to actually make it to disk, often for an arbitrarily long amount of time.


"I've been told to type sync into terminal whenever I want writes to complete."

If you'd really like a guarantee, you can always unmount the filesystem, or alternatively, mount it read-only:

mount -ur /mnt/blah

That will guarantee, at least at the filesystem/OS level, that the writes are committed.


My understanding is that, after that point, the hardware drivers could lie and say they've written it and we'd never know, correct?

And even after that, the hardware could return a "written OK" value but not actually do the job, right?

So the point is that without complete transparency from end-to-end, there's no way to tell.

Is that correct?


Yes, and there's also the problem that magnetization of material isn't permanent so your data may disappear after N days even if it were written correctly (typically, N > 5000). Should your "complete transparency" model also include expected longevity of the written bits?


Can I solve most of this on Linux, like Ubuntu 14, by doing "# sync"?


I'll admit that one reason I'm unusually grumpy about this is that I feel rather unhappy not knowing what I need to do to safeguard data that I care about.

...backups?

This issue is not unsolvable at a technical level, but it probably is at a political level. Someone would have to determine and write up what is good enough now (on sane setups), and then Unix kernel people would have to say 'enough, we are not accepting changes that break this de facto standard'. You might even get this into the Single Unix Specification in some form if you tried hard, because I really do think there's a need here.

Or we could just have everyone perform regular backups, like they already are (or should be) doing, and decide that if systems are crashing frequently enough to lose data often, then trying to "fix" this "problem" by adding what would likely be another mass of design-by-committee complexity to filesystems is not addressing the cause but only its symptoms.

Then again, I have over two decades of experience using the FAT filesystem without a single instance of unrecoverable data loss despite sudden crashes, while hearing countless tales of others corrupting their data frequently even when using far more complex and "robust" filesystems, and it makes me wonder why I don't seem to suffer quite the same problems...


Backups don't help if writes don't make it to disk in the order and manner expected by the application programmer. There's an emerging consensus that there are crash protocol bugs lurking everywhere due to I/O scheduler reordering. For example this bug in gzip:

http://bugs.gnu.org/22768


This bug report is truly surreal. A filesystem could easily just write to disk directly, skipping write buffers; the whole reason they don't is "because performance".

In fact the "file system mathematically guaranteed not to lose data" is too slow to be used in practice and will implement fsync/fdatasync in the future to regain performance (and in the process it will stop being "mathematically guaranteed not to lose data").

Clearly the solution is to add fsync/fdatasync calls to every single program so as to negate the performance gains of file system write buffers entirely.

Clearly, the next step after that is for filesystem to start ignoring fsync/fdatasync entirely, because otherwise they would be too slow.


Backups do not solve the problem of not knowing whether data you just wrote to disk will still be there in the event of a power outage or system crash. You know, before a backup has a chance to run. Sure, those events should be extremely rare but that doesn't mean that we can or should just ignore it, at a large enough scale even extremely improbable events are guaranteed to happen.


Backups do not solve the problem of not knowing whether data you just wrote to disk will still be there in the event of a power outage or system crash.

It's not a "problem", it's just reality. You can't predict when exactly the crash will occur relative to the disk write or backups.


The OP has nothing to do with backups. fsync(2) and friends are also necessary to ensure correct ordering of writes to disk, which in some ways is more important than ensuring that the data is actually committed. If certain writes aren't written in the order expected, then applications lose crash consistency and now you have corrupt data, not just missing data.


In many applications, crashes are less of a problem than falsely believing a write happened persistently. Consider a mail server: if it can wait until the message is really saved before returning the final success response, then a crash means a retry later (and maybe a duplicated message if the crash happened in the small window between the write committing and the response going out) rather than a lost message.


Not a good answer for application servers. You can lose data in between the application issuing the write and the disk persisting it, and in between persistence and backup.

The absence of data corruption doesn't necessarily mean the presence of all data either. Imagine a fleet of application servers that write several times per second and there are hundreds of them. How would you know if one write had gotten lost in a power failure?

The best available answer among folks who need one is to pass all writes through a distributed system that can commit the data durably: data replicated in real time as part of the process of committing a write, the metadata tracked with a quorum or lock managed by quorum.



