- "Unit" files commit when closed properly (this does not include a program exit without close), and then replace the old version of the file. The file system should guarantee that, after a crash, you have either the old version or the new complete version. This should be the default when a file is opened via "creat()"
- "Temp" files always go away completely after a crash. This should be the default for designated temp directories.
- "Log" files can only be appended. Writers cannot seek. The file system guarantees that after a crash, the end of the file is at the end of some write; the file may not tail off into junk. This should be the default for files opened for append.
- "Managed" files are for structured databases. They have an additional API. "writemanaged()" has a callback parameter. It returns when the data has been queued, but in addition, the writer gets a callback when the write has committed to disk. The file system must guarantee that a write for which the callback has been made will survive a crash. This provides fine-grained information of when data has been committed to disk, which is what a database needs. It improves performance by not blocking waiting for disk commit to take place. The database can have several I/O operations going at once in different parts of the file without blocking.
Incidentally, this is the way it currently is. From rename(2): "If newpath already exists, it will be atomically replaced".
Just don't forget that "replacing" means to create a _new_ file, writing it, syncing it, and then replacing the old version with rename(2) (and fsyncing the directory).
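A minimal sketch of that pattern in C (error handling is abbreviated and the function name is illustrative, not a drop-in implementation):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Atomically replace `path` with `data`: write a new file, sync it,
 * rename over the old one, then fsync the containing directory. */
int replace_file(const char *path, const char *tmp_path,
                 const char *data, size_t len, const char *dir_path)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) < 0) { close(fd); return -1; }  /* data on disk first */
    if (close(fd) < 0) return -1;
    if (rename(tmp_path, path) < 0) return -1;    /* the atomic swap */
    int dfd = open(dir_path, O_RDONLY);           /* make the rename durable */
    if (dfd < 0) return -1;
    int r = fsync(dfd);
    close(dfd);
    return r;
}
```

After a crash you should see either the complete old file or the complete new one, never a mix.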
> - "Temp" files always go away completely after a crash. This should be the default for designated temp directories.
You can have that: Just don't link the inode of the file. On Linux, you can use O_TMPFILE. Alternatively, just create the file, and keep it open but immediately unlink it (it's a little hack but nothing dramatic).
You can also write to /tmp. Normally that gets cleaned at reboot, which might be enough.
> - "Log" files can only be appended.
In open(2), look for O_APPEND.
> Writers cannot seek.
That's not how you write a kernel API. If you don't want to seek, then don't.
> The file system guarantees that after a crash, the end of the file is at the end of some write
How do you intend to implement that? Disks don't support arbitrarily large atomic commits. Making such a guarantee in software is necessarily inefficient. The kernel will not choose your tradeoffs for you. You can do it from userland. Decide yourself what is the least bad way to do it.
> "Managed" files are for structured databases. [..] The writer gets a callback when the write has committed to disk. The file system must guarantee that a write for which the callback has been made will survive a crash.
Fsync does just that, except it isn't asynchronous. Maybe there's something behind aio(7)?
O_APPEND isn't airtight on all systems. On some older UNIX systems, multiple writers created with "open()" (not "dup()") do not share a file position. NTFS doesn't do append correctly.
How do you guarantee that, after a crash, the end of the file is at the end of some write? By updating the file size after the write. The file size update can be deferred during heavy write traffic, but you should always get a file size that ends at a write boundary.
"fsync" synchs the whole file, not just one I/O, which can take a while. Databases such as MySQL's InnoDB, which puts multiple tables in one file, can have independent I/O going on in different parts of a file.
aio(7) has the right mechanism, a callback/signal on completion. But it's not clear if the file system guarantees the data is safely on disk when the completion signal comes in.
The original article complains that UNIX/Linux file system semantics aren't well enough defined for database safety. He's right. They're close, but not quite there, because behavior after a crash is unspecified.
Note that Linux ext4 does not do this. On a power outage, you can get bogus trailing zeros on a file you were appending to, because the file size was updated before the data was written. I asked Ted Ts'o about this and he said it was working as intended.
read(2), readv(2), write(2), writev(2), and ioctl(2) calls on "slow" devices. A "slow" device is one where the I/O call may block for an indefinite time, for example, a terminal, pipe, or socket. If an I/O call on a slow device has already transferred some data by the time it is interrupted by a signal handler, then the call will return a success status (normally, the number of bytes transferred). Note that a (local) disk is not a slow device according to this definition; I/O operations on disk devices are not interrupted by signals.
I would never expect that. It's not how it works. But what has to work is the dup() case, or the inherited descriptors case like
$ ( do_this & do_that ) > /some/file
because here they share the same open file description (as opposed to merely having the same file opened on stdout). Note this still doesn't guarantee that writes won't be interleaved. I think there is only a guarantee that writes up to 512 bytes won't be interleaved. The answer here is: don't do simultaneous writes, or implement your own synchronization mechanism. It's just not a requirement with a universally satisfying solution that should be implemented in the kernel.
> NTFS doesn't do append correctly.
If you meant NFS: It deliberately violates POSIX file system semantics. I guess it's just too hard to implement. For example, an NFS server can't know whether the machine which owned a lock crashed or is just temporarily disconnected from the network (and has still a process running thinking it owns a lock and not knowing the server is wondering).
In any case, not the API to blame.
> How do you guarantee that, after a crash, the end of the file is at the end of some write? By updating the file size after the write. The file size update can be deferred during heavy write traffic, but you should always get a file size that ends at a write boundary.
How could we possibly implement these "transactions" if the end of the file is _not_ at the end of some write?
> It's not clear if the file system guarantees the data is safely on disk when the completion signal comes in.
What the disk does is in the end controlled only by the disk itself (and of course environmental forces). So it's always best effort.
In most DBs, you have separate data and log. The log is always only appended to, and the nature of disks is such that grouping concurrent transactions together into a single commit is a big win regardless of how syncing works. So syncing all outstanding I/O to the log file is generally what you want to do anyway.
While the data store is always being written to, its writes do not need to be synced to disk until a log checkpoint is made (and the old log subsequently discarded). And since that is an intermittent asynchronous process, there is not much to gain from finer-grained syncing here either.
See e.g. http://www.postgresql.org/message-id/4A51CB76.5020407@enterp...
Care to elaborate on that? I haven't seen issues with the two mechanisms I know of in NT:
1. CreateFile with FILE_APPEND_DATA
2. WriteFile with the OVERLAPPED offset set to ~0
Both of these get you writes that are as far as I know safe to have multiple processes writing to a common log file. Not sure about flushes to disk but you could probably get away with FlushFileBuffers for this.
It seems O_TMPFILE cannot be used to atomically replace existing files either. If only one could use renameat2() on a raw fd or force linkat() to an atomic overwrite.
If you write to a temp file and then use rename, you need to pick a suitable temporary name (without a race condition) that other programs will ignore, and now you've probably created a garbage file if the program crashes before you get to the rename. You also need to make sure you've preserved the permissions in the original file, and hope that nobody else is using the same trick to update the same file.
There are API's and standard tricks for these things too. But it's difficult to be sure that you've covered all the corner cases, and impossible to be sure that everyone else has covered all the corner cases.
This all adds complexity compared to having a simple, standard, guaranteed way to atomically update a file and makes it less likely that most programs will do it right.
If you don't want to use O_TMPFILE (not super-portable) then just open(path, O_CREAT|O_EXCL). That fails should there be another file lying around, and you can try again with another random name.
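That retry loop might look like this in C (mkstemp(3) does essentially the same thing for you; the naming scheme here is illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Open a uniquely named temp file; writes the chosen path into path_out. */
int open_unique(const char *dir, char *path_out, size_t path_len)
{
    for (int attempt = 0; attempt < 100; attempt++) {
        snprintf(path_out, path_len, "%s/.tmp.%ld.%d",
                 dir, (long)getpid(), rand());
        int fd = open(path_out, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd >= 0) return fd;    /* we own this name exclusively */
        /* presumably EEXIST: someone beat us to it; try another name */
    }
    return -1;
}
```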
Btw. you can also just mmap(2) some anonymous non-disk-backed memory, and possibly open_memstream(3) on it to get a FILE * if that suits you better.
fd <- open(tmpfile)
According to the paper, existing file systems recognize the (open,) write, rename pattern and don't reorder write and rename. That would break too much existing code. But these are filesystem-specific design decisions.
Btw. it's just the same with CPUs writing to memory. Generally writes can be reordered if the behaviour is not visible to the executing thread. However if there are multiple threads you might need to insert (CPU-specific) memory barriers to prevent reorderings that are visible to other threads. x86 is like those file systems here in that most incorrect code still works, because x86 makes some ordering guarantees.
Additionally, in Linux, you can get asynchronous fdatasync(2) with sync_file_range(2).
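A sketch of how sync_file_range(2) can split the work into a non-blocking kickoff and a later wait. Caveat: it is Linux-specific, and it does not flush the disk's write cache or the file's metadata, so it is weaker than a real fdatasync(2):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Start asynchronous writeback of [offset, offset+nbytes); returns 0 on success. */
int start_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Later: block until writeback of that range has completed. */
int finish_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```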
aio(7) is weird. It really only works with direct I/O (which is necessary much less often than people think it is), and IIRC aio_fsync(2) isn't implemented on Linux, but that doesn't matter so much because generally direct I/O implies synchronous I/O (but not always). See http://lse.sourceforge.net/io/aio.html
Also, I'm pretty sure that issuing IOCB_CMD_FSYNC from io_submit(2) has given me an error in the past. Not sure about aio_fsync(2), because last time I checked, POSIX aio only had a pthread-based backend, rather than the native support that libaio (i.e. io_submit(2)) provides.
It doesn't mean atomic with respect to a system crash.
The easiest way to get asynchronous fsync(2) is to do the fsync(2) in a different thread.
However you would need to create a new thread for each simultaneous asynchronous fsync to get the benefits of asynchronicity.
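A minimal sketch of the one-thread-per-fsync approach (struct and function names are made up for illustration):

```c
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

struct fsync_job {
    int fd;
    int result;             /* fsync(2) return value */
    int done;               /* set to 1 when the fsync has finished */
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

static void *fsync_worker(void *arg)
{
    struct fsync_job *job = arg;
    int r = fsync(job->fd);        /* blocks here, not in the caller */
    pthread_mutex_lock(&job->lock);
    job->result = r;
    job->done = 1;
    pthread_cond_signal(&job->cond);
    pthread_mutex_unlock(&job->lock);
    return NULL;
}

/* Fire and continue working; call fsync_wait() when you need the result. */
void fsync_async(struct fsync_job *job, int fd, pthread_t *tid)
{
    job->fd = fd;
    job->done = 0;
    pthread_mutex_init(&job->lock, NULL);
    pthread_cond_init(&job->cond, NULL);
    pthread_create(tid, NULL, fsync_worker, job);
}

int fsync_wait(struct fsync_job *job, pthread_t tid)
{
    pthread_mutex_lock(&job->lock);
    while (!job->done)
        pthread_cond_wait(&job->cond, &job->lock);
    pthread_mutex_unlock(&job->lock);
    pthread_join(tid, NULL);
    return job->result;
}
```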
I think the semantic you want for log files is: write atomically at offset X, but only if the size of the file is currently X.
If the write fails due to a file size mismatch, you are racing with someone else. You can either read the other writer's data before you try writing again or just fail immediately if races shouldn't happen.
You're just proposing moving to userspace semantics that could be better handled in kernel space by resizing PIPE_BUF.
The way this works now is that you open a logfile with O_APPEND, any write(2) you make to the file will be atomic up to PIPE_BUF, so it doesn't matter that you have a "race" with someone else if everyone's write is <PIPE_BUF, which is the common case, they won't interleave with each other.
Now of course if your write is bigger than PIPE_BUF you might have interleaving writes, but at the point where you're making >4K writes you're usually better off having some custom log system anyway and not rely on the kernel serializing things for you.
If you want to emulate this proposed mechanism of yours now without any syscall changes you can simply flock() the file for the duration of your write. The solution you're proposing already exists, but locking sucks more than O_APPEND + <PIPE_BUF sized writes.
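The flock() emulation mentioned above, sketched in C (the function name is illustrative; flock(2) is advisory, so it only serializes writers that cooperate):

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Append `len` bytes under an exclusive lock, so no other
 * flock-cooperating writer can interleave with this write. */
ssize_t locked_append(int fd, const void *buf, size_t len)
{
    if (flock(fd, LOCK_EX) < 0)   /* blocks until we own the file */
        return -1;
    ssize_t n = write(fd, buf, len);   /* fd should be opened with O_APPEND */
    flock(fd, LOCK_UN);
    return n;
}
```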
Nope. I was replying to one proposed semantic and arguing that the ideal semantic is slightly different. I haven't looked deeply into how such a semantic could or could not be mapped onto current APIs.
> at the point where you're making >4K writes you're usually better off having some custom log system anyway and not rely on the kernel serializing things for you
I disagree. The kernel has access to the I/O queue, and I don't. The kernel can fundamentally do this more efficiently and robustly than user space can.
So it sounds like my options are to limit my write size, serialize writers in user space (impossible generally unless you can physically prohibit processes from running that don't follow the serialization protocol) or flock(), which "sucks." Sounds like there is room for an api that does what is desired but doesn't suck.
EDIT: I just read about O_APPEND and it doesn't sound like what I'm talking about at all. It always appends to the end. It sounds like if two writers race, both chunks of new data are appended. That's not as useful as what I described. I'm basically talking about compare and swap, except instead of compare and swap it's compare and append. If something changed in the meantime, you don't want to just blindly append your new block that is unaware of the concurrently appended log data.
> I just read about O_APPEND and it doesn't sound like what I'm talking about at all. It always appends to the end. It sounds like if two writers race, both chunks of new data are appended. That's not as useful as what I described. I'm basically talking about compare and swap, except instead of compare and swap it's compare and append

That's not at all what the grandparent is talking about though. They just want "log" files where you can only append and the write() is atomic. That doesn't mean "my write shouldn't work if there is a concurrent write", which is a semantic you're making up, and which I don't really see making sense for log files.

I'm really finding it difficult to imagine a plausible use case for your proposed semantics. Imagine this: you have 100 concurrent web serving processes all writing to an access.log. Currently they can just write() with O_APPEND to an access.log with a string below PIPE_BUF and their writes will end up on disk, but may not be interleaved in "real time". But who cares?

What you're proposing means that only 1 out of those 100 processes will succeed in its write() call. The rest will start a retry loop, trying to flush out data whose order didn't even matter in the first place, because they all want to write the data only if the size of the file is N, and every single write updates N.

If you have a use case for strictly serializing writes to disk like this, that's fine, but it's not the common case with log files, and the way to do it is not to add a new API to the kernel; you just send messages to some userspace thread that queues them up and does the I/O for you.
That makes no sense. Have each process write to its own log. This is making your life hard for no reason.
> If you have a use-case for strictly serializing writes to disk like this that's fine, but it's not the common case with log files
This isn't about access.log, it's about transaction logs for database systems. Like the commit log for Postgres or SQLite. You can think of their commit logs as like a block chain. Imagine a block chain that didn't have a pointer to the previous block, but just a timestamp. It wouldn't work. If you have two conflicting blocks you need some central arbiter to decide which wins.
> That makes no sense. Have each process write to its own log. This is making your life hard for no reason.
Having to tail N logs where N is some unknown number of processes is what's making your life harder for no reason. That's what O_APPEND and PIPE_BUF are for.
> This isn't about access.log, it's about transaction logs for database systems. [...] think of their commit logs as like a block chain.
Everyone's expectations don't have to be implemented on the filesystem level, particularly when those expectations inherently involve locking writes and retry loops.
I can see how this would be convenient. That's fine and I have nothing against it. It's just not a problem I consider interesting, because there is no inherent need for all 100 writers to write to the same file. You could just as easily have each write to a separate file and have a separate process that merges them together if you want a single file for convenience.
> I really don't see how any of this is relevant to the original point of having "log" files.
Maybe you're not aware: every major database keeps a commit log, and writes it to a file called a log file (sometimes "write-ahead log file"). That is what I think of when I hear about "log" files in the context of file durability, because this is the case where file consistency and durability actually affect the consistency of the system. Here are some examples:
This is a much more interesting problem (to me) because it is harder and relevant to the consistency of any database system.
> Now you're talking about DB transaction logs and blockchains, which really don't need to have their nuances implemented in terms of POSIX file semantics. They can just have a single process that manages those expectations.
You do realize that SQLite can't just corrupt your database because you ran two SQLites concurrently, right?
SQLite has chosen to support concurrent SQLite processes all writing to the same database. Maybe you think they shouldn't do that, but they find it useful. And even in your world where they shouldn't do that, it's still not ok to just corrupt the database because the user didn't follow the rules.
> Everyone's expectations don't have to be implemented on the filesystem level
One of the kernel's key responsibilities is to arbitrate concurrent access to shared resources. It provides primitives that make it possible to build higher-level abstractions. What I am describing is a primitive that allows for lock-free appends to consistent transaction logs. It could be useful for a lot of things, and makes at least as much sense as plenty of other things that are already in syscall interfaces.
> You do realize that SQLite can't just corrupt your database because you ran two SQLites concurrently, right?
Sure, but, so? I don't know sqlite's WAL logging code in detail, but I do know PostgreSQL's fairly intimately. I don't see how such an interface would be relevant for WAL logging. Such logs usually have checksums and pointers to previous records in their format. For those to be correct each writing process needs to know about previous records (or at least their starting point). Thus you need locking and coordination in userspace anyway - kernel level append mechanics aren't that interesting.
In addition to that, if you care about performance, you'll want to pre-allocate the WAL files and possibly re-use them after a checkpoint. For many filesystems overwriting files is a lot more efficient than allocating new blocks. It avoids the need for fs-internal metadata journaling, avoids fragmentation etc.. With pre-allocated files you then can use fdatasync() instead of fsync(), which can be considerable performance benefit in our experience.
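A sketch of that pre-allocation idea (the segment size and function names are illustrative; 16 MiB happens to be PostgreSQL's default WAL segment size):

```c
#include <fcntl.h>
#include <unistd.h>

#define WAL_SIZE (16 * 1024 * 1024)   /* illustrative 16 MiB segment */

/* Reserve all blocks up front, so later writes never extend the file. */
int wal_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (posix_fallocate(fd, 0, WAL_SIZE) != 0) { close(fd); return -1; }
    return fd;
}

/* Overwrite in place; fdatasync(2) can skip the file-size metadata update. */
int wal_write(int fd, off_t offset, const void *rec, size_t len)
{
    if (pwrite(fd, rec, len, offset) != (ssize_t)len) return -1;
    return fdatasync(fd);
}
```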
There are things that'd make it easier and more efficient to write correct and efficient journaling, but imo not what you were talking about. Querying and actually getting guarantees about which size of writes are atomic, for example; otherwise you need to use rather expensive workarounds like WAL logging full page contents after checkpoints, or double-write buffers.
Proper asynchronous fsync(), fdatasync() would also be rather useful.
> I'm basically talking about compare and swap, except instead of compare and swap it's compare and append
A notable quote: SQLite does not compete with client/server databases. SQLite competes with fopen().
That is the case, but the advantage of the SQLite approach is that they have already done so, and hence their code is far more likely to be "correct" than any new code you write. The many billions of instances of SQLite already deployed also help with maturity and finding corner case issues.
And as a bonus it also lets you do rollbacks, transactions etc which aren't easy if trying to implement them yourself.
It could work. It's just best to have hybrids with different types of files, including some that bypass the RDBMS functions, so one can select the proper reliability-vs-performance tradeoff. SQLite might have cross-platform, FS-style API's too that I don't know about. Not sure whether that already exists or would be extra development.
They're working on it as we speak. :)
The writes to the `foo.gz` file have to hit the disk before the unlink, but the I/O scheduler can potentially reorder these, and a badly timed crash could result in data loss. Note that journaling doesn't fix this issue, because the write and the unlink are distinct transactions too.
Some days I really do wish we weren't so inured to the particular 50-year-old systems-programming abstractions.
There was a great lightning talk a few years ago that I can't seem to find where the author described an API for storing blobs in a hierarchical namespace. Of course, halfway through, it became clear that it's just the POSIX API: you can "open" handles to objects, "rename" them, remove them, and so on. You'd end a transaction with "fsync()". (Okay, that one's a little more complicated, but I don't think it's as hard as the OP claims, at least for single files. Multiple files are more complicated, but that problem is intrinsically more complicated.)
1. "MVCC": the ability to effectively get a handle on a static copy of the entire filesystem, perform mutations to many objects, and then submit the transaction, at which point it will either commit or rollback depending on whether any of those objects have been modified.
Windows actually has this: https://en.wikipedia.org/wiki/Transactional_NTFS allows for exactly the sort of MVCC I'm talking about—and presents a very different API than the POSIX filesystem one.
2. "Object store": as in, you don't interact with the filesystem by getting writable handles to the file's underlying backing store, where processes can effectively treat a file as persistent shared memory. Instead, you ask for a floating unnamed "buffer" object, fill it up, and then submit it to the filesystem to be atomically stored; or, vice-versa, you ask to retrieve a file, and are passed a buffer handle of the representation of the file at the point in time you asked for it (presumably, for optimization's sake, backed by copy-on-write mmap'ed pages from the latest MVCC-linearized copy of the file.)
We actually have this today as well, but in an implicit and half-assed way. Files below some size threshold, in modern filesystems, don't use any FS extents for backing, but are instead stored as part of their filesystem directory-entry node. This basically makes the filesystem into an object store for those files—but without actually exposing any guarantee to userland that those files will be read/written as atomic operations.
To be clearer, there's a problem with this fs-atop-object-store design, as I've stated it so far:
An object store doesn't support every use-case a filesystem does, and a filesystem implemented only in terms of an object-store wouldn't be good for some things because of that. It would be terribly expensive to emulate a block device on top of an object store, for example—you'd have to make each emulated block an object, and mutate blocks by transactionally adding an updated block and removing the old block. This means that it wouldn't make sense to keep mountable read-write disk images on such a filesystem; nor would it make sense to keep database backing stores there.
But both of those things are effectively things that take no advantage of existing on top of a filesystem in the first place—they do their own index-building, their own sparse-allocation, their own journaling, etc. Making the filesystem an object store just makes it obvious that for those use-cases, the filesystem itself is pure overhead.
So, along with the MVCC object-store, the other part of my hypothetical system is a "buffer store": basically like a logical-volume manager, an API that consumes physical block devices and exposes handles to (cheap) logical resizable block devices. The object store could be implemented in terms of one of those logical block devices, but otherwise would completely ignore the existence of it. The filesystem compatibility layer, on the other hand, would allow you to request that a given filesystem object be backed by a persistent buffer (newly-created logical block device) instead of an object; and then whenever you requested that object from the filesystem, you'd get an IO handle to the logical block device, instead of an IO handle to a transactionally-resubmittable copy-on-write mmap of some object.
 - https://en.wikipedia.org/wiki/WinFS
I think we could have built something that worked well enough, but they grew cowardly after the general failure of the entire Longhorn project. Don't try to change the compiler, the language, the OS, and the office applications to use the same "in motion" components, plus the database, all at once.
It's probably little known outside of Microsoft, but they tried to build many different higher-end object filesystems, including JAWS (Jim Allchin's Windows Filesystem), and others I can't even remember.
One of my recently-planned hobby projects was to write a FUSE filesystem for OSX that would take all the shell-level glop (LaunchServices UTI-associated verbs, NSFileManager attributes, Spotlight metadata, etc.) and stick it onto the files themselves as plain xattrs, so POSIX tools could effectively be made to work with shell-namespace "Documents" (including treating directory-packages as files, and traversing archive files and container-file-formats as if they were directory-packages.) It'd be quite interesting to combine the two experiments, actually; it'd result in something very much like a POSIX compatibility layer on top of a WinFS-like filesystem.
You don't like files? Files are great.
Running Core Data over SQLite instead of a "real database" gets you backups for free, the iOS file-based security model, lets you delete an app's data when it goes away, network mounted home directories… really, the filesystem abstraction is pretty good still.
So if I need to create a file with a timestamp, you are suggesting creating a SQLite database, a table, and inserting my timestamp there?
It's nonsensical to "fix" a Unix low-level syscall issue by using SQLite.
Most C programmers can't even write that piece of code. Let's see if you can.
It's worthwhile to use libraries which embody the knowledge, trials and errors of years, such as SQLite.
Unless you are talking about a function in the SQLite library that implements the durability calls properly and can be called from other programs.
Stand on the shoulders of the giants who have come before us.
Also, middleware shouldn't be needed in open source because you can simply patch the underlying API. That the middleware is the best solution in this case says a lot about the problem.
I can't believe the level of ignorance in this thread.
But even in an extreme example like storing a single timestamp, there are situations where you want various guarantees that SQLite makes easy.
One of the interesting things about writes is that they disrupt reads more significantly than you might expect. Greg Lindahl characterized the impact at Blekko when we were crawling so that we could optimize writes from the crawl to not disrupt latency on the search engine side. Later we completely separated those functions for similar reasons. I believe every disk I've evaluated over the years is slowest on the random read/write 50% test.
fsync() = YES metadata, YES data, NO dir entry
fdatasync() = NO metadata, YES data, NO dir entry
fsync() of dir = NO metadata, NO data, YES dir entry
Alternatively, you could just use the ZFS driver for Mac OS X. It will do fsync properly.
The basic idea is that the file system provides two guarantees, one boring, and one interesting. I'll illustrate using code instead of words. The boring guarantee is:
FILE* f = fopen("autosave.bak", "w");
fwrite(buffer, 1, length, f); // save current document
fclose(f);
They do a bunch of benchmarks that show that their system is almost as fast as mounting ext3 _asynchronously_. (In fact, not much worse than a RAM disk even.) Of course, they also show that in the case of power loss, their system behaves well, whereas ext3 does not, unless you turn up all the paranoia knobs to 11, and then the performance is WAY worse than their system.
I'm oversimplifying quite a bit, since this comment is already pretty long. Read the paper for details. Or ask questions here, but I'll probably forget to check the comments, because HN doesn't remind me :(
Warning! This comment is _not_ saying that your operating system provides these guarantees. In fact, it almost certainly does not. This is a novel suggestion (and implementation?) presented in this particular paper.
While there's no excuse for bad documentation or poor APIs, you can never consider data written to a single local disk "safe". It never is.
It's a shame making a best effort at safety is nontrivial, but it does force developers to write more defensive and crash-safe code which is all that can save you in the end.
So yes, agreed++. You need to have a level of redundancy appropriate to the criticality of the data.
Use what works and don't complain that broken stuff is broken.
1. The PSU would have a big enough capacitor to keep the computer running for a few seconds at its stated output power
2. The PSU would notify the OS of a power loss
3. The OS would immediately flush all caches and adopt a "brace position".
4. Events are spread system wide so that apps can also flush and brace.
It should work even if it is the PSU that fails (as long as the capacitor is there).
Surely the problem cannot be the cost. Why don't modern desktops have that feature?
The answer is cost. We used to have the computers you're describing. They were made by SGI, and the hardware & OS made all sorts of guarantees like the ones you're describing, to the point that when XFS was initially ported to Linux on x86 it had all sorts of bugs that simply couldn't happen on SGI machines.
The x86 boxes were cheaper, nobody really cared enough about hardware reliability enough to pay the price for the likes of SGI, and the rest is history. Today we have a "worse is better" hardware architecture and software has to be able to handle it.
The 10 F capacitor has an ESR of 0.075 ohm, but that's at 1 A. They have a 100 F model that is 0.015 ohm at 10 A, and a 150 F that is 0.015 at 15 A. Based on your calculations, these look like they would have a good chance of giving enough time to save everything (especially on an SSD system).
Those are physically bigger but should still fit in a normal desktop or server.
(There is another manufacturer that has up to 630 F!)
Adding $20-50 in BOM cost, the size of two D batteries, just in case something _very rare_ happens is not economically viable. There are better alternatives, like banks of batteries a la Google's boxes. Turning a server into an oversized laptop gives you more than a couple of seconds of buffer.
1. Create a temp file on the same mount
2. Write data to file, checking return of each write(), then a final fsync(), and close()
3. link() to filename we actually want
I still haven't figured out how to be this safe on Windows, because AFAICT there's no atomic way to link() or 'move' a file - and they canned the transactional API for the filesystem. Any pointers to how to write files 'safely' on Windows would be much appreciated!
Of course this comes with a lot of footnotes. There was a bug on some older Linux kernels where they didn't tell the hard drive to flush the data after an fsync. This is why a lot of people still incorrectly believe that enabling hard disk write cache is unsafe. But this was a bug, and the bug was fixed. There are also reports of hard drives that don't honor flush requests. There's not much UNIX or any other OS can do about this-- if the hardware lies, you are in trouble.
You could also imagine a much richer durability API. Empirically, databases need such a richer API rather than simple fsync. The POSIX async I/O standard was supposed to standardize all this, but Linux's glibc just implements it as a thread pool making blocking system calls. If you want real async I/O on Linux, you need to use an OS-specific API.
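The thread-pool approach glibc takes can be sketched like this: blocking pwrite()+fsync() calls run on worker threads, with a completion callback once the data is durable. (`write_durable` is a hypothetical helper for illustration, not a real libc API.)

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# A pool of workers standing in for glibc's AIO helper threads.
pool = ThreadPoolExecutor(max_workers=4)

def write_durable(fd, data, offset, on_committed):
    def work():
        os.pwrite(fd, data, offset)   # blocks this worker, not the caller
        os.fsync(fd)                  # data is durable once this returns
        on_committed(offset, len(data))
    return pool.submit(work)

# Usage: queue a write, keep doing other work, block only when needed.
fd, path = tempfile.mkstemp()
done = []
future = write_durable(fd, b"hello", 0, lambda off, n: done.append((off, n)))
future.result()                       # wait only when we must
print(done)                           # [(0, 5)]
os.close(fd)
os.unlink(path)
```

This gives the callback-on-commit semantics a database wants, but every "async" operation still occupies a thread, which is exactly why it doesn't scale like true kernel async I/O.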
Everything is a file ("object"), whether it's a standard file, or a directory. If you create a new file, you want two things synced: the contents of the file, and the linking of the file (that is the pointer from the directory in which you created the file, to the file object). If you don't sync the link you may not be able to find the file again, even though it was synced to the disk just fine. (It's just how git works btw.)
The link to the file is part of the directory object's contents. That's why the directory needs to be synced.
There is no need to sync the "parent of the directory" because that was never modified.
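In code, that means two fsyncs, one per modified object (a sketch; error handling omitted):

```python
import os

def create_durably(path, data):
    # Sync the file object's contents...
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    # ...then the directory object, whose contents include the new link.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)   # without this, the name -> inode link may be lost
    finally:
        os.close(dfd)
```

Note that only the directory containing the new name is synced; its own parent was never modified.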
The solution is copy-on-write file systems such as ZFS and Btrfs, which ensure neither data nor metadata are ever altered in place, combined with reusing correct file-manipulation code (written by adults) rather than rolling your own, whether from a library or something higher-level like SQLite.
TL;DR: files are hard; filesystems differ a lot; SQLite is very robust, but most other software (including git and mercurial) is not.
There is no problem with the API. The issue is the underlying assumption that data is sometimes not at risk. Sadly, or maybe luckily, there is nothing you can do to guarantee 100% durability, so there's no point trying to do the impossible. Your data is always silently at risk. Accept it, deal with it, and minimize the risk if you need to, e.g. by replicating your data synchronously across continents.
The author finds this frustrating because there are several things to do that can seem arbitrary, random, and counterintuitive. Their trust in the OS was damaged, and now they are understandably grumpy about it.
Right. For the decision of "when", you can configure a few kernel parameters, namely the dirty-write thresholds. That assumes you just do write()s from your process and the kernel decides when to flush that data out. It can be configured as a function of time and/or the amount of unwritten data.
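The knobs in question (names from the kernel's vm sysctl documentation; defaults vary by distribution and kernel version):

```shell
# Start background writeback once dirty pages exceed this % of memory:
sysctl vm.dirty_background_ratio
# Block writing processes once dirty pages exceed this %:
sysctl vm.dirty_ratio
# How old (in centiseconds) dirty data may get before it must be flushed:
sysctl vm.dirty_expire_centisecs
# How often the flusher threads wake up (centiseconds):
sysctl vm.dirty_writeback_centisecs
```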
I had to do it a few times. Once it was a realtime-ish system which was recording data to disk, and we noticed the recording thread would time out. (Timeouts were in the tens of seconds, and they were noticed by a watchdog system.) We thought 10 seconds should be enough for that process. But it turned out that, because of how priorities were set up and how fast the writes went, dirty page flushing was not keeping up with the writes. Periodically it would hit the top threshold, and at that point any process doing disk writes would be blocked.
We were able to get it under control by essentially doing what the GP suggested: writing a little bit at a time, but more often. The total throughput probably went down, but performance was smoothed out quite a bit.
I thought you were asking how to control, in general, when your system writes to disk.
Well, if you don't use fsync then it decides at some point, like I mentioned above. If you want to talk about transactions, then do a write and an fsync; does that not work for you sometimes? If you are worried about the new file appearing in the directory, then fsync the directory as well.
If you want more guarantees, you'll have to dig deeper and find out about your specific device, does it have battery power and does it tell lies about it writing data and so on.
Basically, on XFS and all other tested filesystems it seems to work just fine; on ext4 we may lose the effects of the last rename().
So this makes it difficult to get the durability right, even when the project is as careful about it as PostgreSQL.
The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances.
So perhaps the article can be shortened to "sync() is not guaranteed to be synchronous on non-Linux Unix."
- I think it's mainly modern file systems like Btrfs (and partly also ext4) that introduced a paradigm shift which broke old applications, or at least broke their performance; dpkg, for example.
- We're talking about a hierarchical file system, meaning it's easy for humans to find data, but terrible for machines because they must hunt pointers to get to files. Similar problem for syncing a bunch of logically connected files if they are not stored in a single directory. (How do you atomically commit multiple changes across directories?)
What would be a better abstraction or non-abstraction (that's still a hierarchical file system)?
POSIX fails as a high level interface:
- No notion of transactions that is at all powerful. Mutation without transactions is extraordinarily primitive.
- No multiple FS roots to indicate boundary across which data will never be synchronized. (This is also good for how to spread data across multiple devices, a low-level concern.)
- Overall pushes people to maintain their own structure within files rather than use FS's trees.
POSIX fails as a low level interface:
- No way to hint or control caching along the memory hierarchy.
- Block size, locality out of control.
- Can't statically disallow various costly non-free actions (e.g. the ability to expand a file); you can only hope that not using them doesn't incur a penalty.
Perhaps ZFS and Btrfs made other approaches more viable/obvious, but these weaknesses are inherent to the API itself, and the high-level stuff was noticeably missing from the get-go.
There is nothing that prevents you from implementing these in userland. Does not belong in the kernel, since the kernel can't know what are your atoms that must be atomically committed. Research how databases do it.
> - No multiple FS roots to indicate boundary across which data will never be synchronized. (This is also good for how to spread data across multiple devices, a low-level concern.)
You can check which device a file belongs to with stat(2): it's the st_dev member. You can also check it from the shell with the stat command.
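For example, before relying on an atomic rename between two paths, you can compare st_dev to confirm they live on the same filesystem (a sketch):

```python
import os

def same_filesystem(a, b):
    # rename(2) is atomic only within a single filesystem, so check st_dev.
    return os.stat(a).st_dev == os.stat(b).st_dev

print(same_filesystem(".", "."))   # True: same path, same device
```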
> - Overall pushes people to maintain their own structure within files rather than use FS's trees.
And that's entirely ok. Hierarchies are not for databases. Database-y problems are not the problems that the POSIX fs solves.
> - Block size, locality out of control.
I actually heard that Unix filesystems have traditionally been quite good at preserving locality. That's why I don't know of any defrag tool for e.g. ext3.
Overall, if the FS does not solve your problems, implement your own abstraction. That's perfectly ok.
The best solution here is to replicate data across multiple machines, take good backups that you can restore from, and plan for corruption to happen.
You also need to assess exactly how much you really need those transactions. If they're financial transactions where each one could be millions of dollars of stock, then you probably need to care about this a bit. For most web transactions, I think the risk/cost analysis is that you don't need to worry about being perfect. You might lose the last one or two transactions in the case of a hard crash, but your customer service team should be able to handle that and fix it for the customer out of band. You really should consider whether you have a business justification for worrying about perfection.
On the other hand, I did live through the ext2 era before ext3 was production-ready, and fully async, super-duper-fast filesystems really are bad. They would regularly corrupt the disk and require rebuilds. That wasn't a data availability problem, but one or two of those a week was problematic when it came to the operational load (out of the 400-800 servers that we managed at the time). We later scaled out ext3 to 30,000 servers, with some sets of servers having 2,500 hosts of basically the same type of webserver, and while kernel crashes were a daily issue, corruption and rebuilds were relatively rare. If you're not at Google/Amazon/etc. dealing with server counts an order of magnitude higher than this, you don't really need to worry about it. ext3 or ext4 and fdatasync should be fine; then apply proper levels of engineering principles to ensure that you don't stay offline for too long or lose too much data.
You are dealing with free commodity hardware and software that isn't ever going to be perfect. If you really needed to never lose a transaction you'd probably be buying some kind of awfully expensive mainframe system.
Oh, and I do recall one case of filesystem corruption leading to a service being down for over a week and probably the loss of a multi-million-dollar business deal. But in that case the software ran on a single box. The dev team responsible for it never saw that as a problem, even though the ops/sysadmin teams kinda yelled at them about it. Then one day the RAID array lost its mind and the server was unrecoverable. When we attempted to rebuild it, it was discovered that over the years the software devs had tweaked the versions of the libraries their software linked against, and with the crash all that information had been lost, so it took ages to debug and find the right incantations to get it all back up again. Huge business risk there, but nothing that could be mitigated by navel-gazing analysis of filesystems and fdatasync -- backups, documentation, replication, proper config management practices, etc. were what was needed.
Then even completely beyond control of the kernel, devices have their own RAM caches, which means that even if the device reports that something was written, it might not have been. So there's no way to absolutely be sure.
For example, if two processes issue writes to disk sectors that are adjacent, but one of the processes' writes also affects a different sector farther away on the physical disk, the I/O scheduler may re-order the processes' sector-level writes. Normally journalling addresses this issue, but if the journalling transactions are reordered then it's no help at the application level.
Edit: Nevermind, I'm remembering now that there is a dirty_background_ratio separate from dirty_ratio. It will sync some of the dirty buffers as long as the amount is above dirty_background_ratio and under dirty_ratio. And it was actually bdflush that determined all of this; the sync_dirty_buffer function just took a specific buffer object and synced it.
Edit 2: And I'm finding out that this information seems to be outdated. Here's the old documentation about bdflush.
That was either moved or removed in the latest version, but I did find this, which suggests the old bdflush has since been replaced by something else. Looks like I missed something important and have some research to do.
If you'd really like a guarantee, you can always unmount the filesystem, or alternatively, mount it read-only:
mount -ur /mnt/blah
That will guarantee, at least at the filesystem/OS level, that the writes are committed.
And even after that, the hardware could return a "written OK" value but not actually do the job, right?
So the point is that without complete transparency from end-to-end, there's no way to tell.
Is that correct?
This issue is not unsolvable at a technical level, but it probably is at a political level. Someone would have to determine and write up what is good enough now (on sane setups), and then Unix kernel people would have to say 'enough, we are not accepting changes that break this de facto standard'. You might even get this into the Single Unix Specification in some form if you tried hard, because I really do think there's a need here.
Or we could just have everyone perform regular backups like they already are/should be doing, and decide that if systems are crashing so frequently as to lose data often enough, trying to "fix" this "problem" by adding what would likely be another mass of design-by-committee complexity to filesystems is not addressing the cause but only its symptoms.
Then again, with over two decades of experience using the FAT filesystem and never a single instance of unrecoverable data loss despite sudden crashes while hearing countless tales of others corrupting their data frequently even when using far more complex and "robust" filesystems, it makes me wonder why I don't seem to suffer quite the same problems...
In fact, the "file system mathematically guaranteed to not lose data" is too slow to be used in practice and will implement fsync/fdatasync in the future to regain performance (and in the process it will stop being "mathematically guaranteed not to lose data").
Clearly the solution is to add fsync/fdatasync calls to every single program so as to negate the performance gains of file system write buffers entirely.
Clearly, the next step after that is for filesystem to start ignoring fsync/fdatasync entirely, because otherwise they would be too slow.
It's not a "problem", it's just reality. You can't predict when exactly the crash will occur relative to the disk write or backups.
The absence of data corruption doesn't necessarily mean the presence of all data either. Imagine a fleet of application servers that write several times per second and there are hundreds of them. How would you know if one write had gotten lost in a power failure?
The best available answer among folks who need one is to pass all writes through a distributed system that can commit the data durably: data replicated in real time as part of the process of committing a write, the metadata tracked with a quorum or lock managed by quorum.