I believe the solution is actually pretty simple, though maybe not easily implementable:
Provide new API calls with precisely defined semantics.
Rather than going through this rigamarole with fsync and rename, provide an actual syscall with the actual effect the userspace developers are looking for. E.g., an atomic_replace() syscall that ensures that either a file is replaced with a new, fully-written-to-disk version, or nothing happens.
The main problem I see is that this of course would be Linux specific, so of course somebody would build a library to either invoke the syscall or do the fsync/rename mess underneath, and this would of course run into the same exact problem on those systems.
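For reference, the userspace dance that such a syscall would subsume looks roughly like this (a sketch only; `atomic_replace` is just the hypothetical name from above, and error handling is abbreviated):

```python
import os
import tempfile

def atomic_replace(path, data):
    """Replace `path` with `data` so that a crash leaves either the old
    or the new contents -- the dance a dedicated syscall would hide."""
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirpath)  # temp file on the same fs
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # data blocks on disk first...
        os.replace(tmp, path)          # ...then the atomic rename
        dfd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dfd)              # make the rename itself durable
        finally:
            os.close(dfd)
    except BaseException:
        try:
            os.unlink(tmp)             # don't leave the temp file behind
        except FileNotFoundError:
            pass
        raise
```

Note the three separate steps (fsync the data, rename, fsync the directory) that the application must get right in the correct order; that is the complexity the proposed syscall would absorb.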
I've proposed this informally before, along these lines:
- Unit files. You create, you write, you close, and then others can read the new version. Until the original writer closes and gets a good close status, nobody else can read it. If the program aborts or the system crashes before closing cleanly, the file reverts. All readers see a fully written file. This is the default. It's what most programs need. Replacing a file by creating a new one on top of it is both permitted and an atomic operation. UCLA-LOCUS and some IBM systems that followed worked that way.
- Temporary files. Disappear on system crash.
- Log files. Append-only. All readers are guaranteed to see an end of file position that corresponds to the end of a previous write. Usually the most recent write, but buffering may make log file reading run a little behind.
- Managed files. Read, write, share. Write operations return two completions, probably via some async mechanism. The first completion means "buffer contents taken". The second completion means "committed to storage that will survive a crash". Database systems would use this, but few other programs would bother.
This tells the database what it really needs to know - when is the data safe? That tells the database when it can commit a transaction. The database can do other things, including more I/O, while waiting for commitment.
"fsync" is really a clunky way of getting that second completion.
I think most of those things are implementable in userspace with the right combination of O_TMPFILE, rename, fsync, fsync(dir), sync_file_range and io_uring.
So all that's needed is encapsulating it in a library that provides those different abstractions.
Well ok, we could also use an additional flag for linkat() that allows atomically replacing the target to close that tiny window where a temporary file might get left behind after a crash.
Actually, a good chunk of that is available via memfd + linkat.
- Unit files: memfd, fdatasync, linkat, fsync(dir)
- Temporary files. memfd, preferably within /tmp to avoid any disk I/O unless you have a particularly large file to buffer
- Log files. This is trickier. Probably requires running on a fs that supports reflinks and creating a new memfd that’s a true clone of the original file, appending, & then replacing it. The one piece that is incompatible with this is that other file descriptors are stale and you have to reopen to see new data but that’s at odds with APIs available today - either readers can see partial writes or you have to have a new file. There have been proposals to allow buffering writes but delaying publishing the metadata acking the data but that hasn’t gone anywhere unfortunately.
- Managed files. I’m not sure that completions are the right contract here, because technically the kernel could defer that completion indefinitely, whereas databases typically need some kind of latency bound: “OK - I really want this to commit to disk nowish”.
memfd is for files in RAM. linkat cannot get them onto disk. What you seem to have in mind is what the grandparent post already described: "O_TMPFILE + linkat".
It's not, though, because fsync has no relation whatsoever to what other processes see.
Durability is totally different from concurrency or multiprocess consistency.
There are already guarantees about how we can order reads and writes in a shared file. Very well defined guarantees -- ones we've built entire RDBMS systems on top of.
Guaranteeing durability (persistence across a crash) is a totally separate problem from understanding write visibility between threads.
> It's not, though, because fsync has no relation whatsoever to what other processes see.
Ah, the SQLite people. I was thinking in terms of databases such as MySQL and Postgres where one process owns the files and client programs communicate with it.
It's the same for sqlite, postgres, mysql, or any other file.
Syncing data to durable storage is totally unrelated to visibility guarantees between multiple threads or processes working within the same file (be it using write, mmap, whatever)
SQLite relies on file system visibility guarantees, because there are multiple processes connected only via the file system. But Postgres and MySQL have one process that touches the underlying file, so they can coordinate internally.
Not really. It's often all just shared memory via mmap. The files are never closed, and sometimes never operated on with read/write because they're just mapped. This behavior is common between (eg) both sqlite and mysql. (Postgres defaults otherwise, but architecturally it is not different)
The visibility issues you're raising, basically implementing views and transactions, would not be applicable to any of these systems.
You're confusing durability guarantees (fsync, msync) with visibility guarantees -- the latter of which are just standard memory access issues.
>This is the default. It's what most programs need.
Is it though? I think single writer+many readers is a very common use case. For example I want a compiler to read a file I have open in my IDE. Or I want to be able to open a log file while the application is still running.
It's fine to read an append-only file. But if the file is a unit file, you can't read the version being written until it is done. So you have file integrity.
I don't think there's really any issue with it being Linux specific. There are plenty of Linux specific APIs that people happily use already.
Seems to me the bigger issue is the deification of POSIX and UNIX. There's a stupidly large contingent of people that think that they are flawless and must be followed unthinkingly.
I've noticed that this week. After spending a little time just reading on how to correctly write posix compliant code that was also guaranteed to do what I want, it's hard to come to any other conclusion than "posix is a last ditch attempt to slap a bandaid on Unix fragmentation by attempting to retcon some lowest common denominator behavior from a couple popular unixish operating systems and calling that a spec".
It's not what I would actually call designed. The closest analogy I can think of for non-c programmers is that posix is like if we decided that people should only make websites using javascript that was mutually interpretable by IE6 and Netscape navigator and we occasionally made updates every decade or two.
There's really nothing particularly noble or good or correct or elegant about all the API calls with implementation defined semantics. At best it's a necessary evil for compatibility. You can admire the cleverness required to get a single simple c codebase that works correctly on multiple operating systems and future operating systems that conform to posix in creative ways, but only in the way I admire those old school zines that are simultaneously a PDF and a jpeg and a shell script. Clever and ingenious but not actually good engineering design.
There's that, and there's a lot of UNIX actually sucks these days and was made for different times.
Like signals, for instance. Okay idea when users are writing C from scratch to implement simple things. Horrible pain in modern times, with threads and libraries not playing well with signals.
The problem is every filesystem would implement a different version of fsync_that_works or rename_for_safety and then applications will have to call all of them, and then people will start building a compat framework that tries to assemble the optimal sequence of calls for each filesystem except it won't work in every case and so a new fsync_but_no_cheating function will be proposed.
The trick is not allowing such a thing. Don't have something vague like "fsync_that_works" that could be implemented arbitrarily.
Have a function that is documented to implement a highly specific contract, like "writes all pending blocks of this file to disk, and ensures they'll be there if there's a crash after the function returns".
That's exactly what people expect from fsync(). The problem explained in the article is that, due to the difficulties of tracing every possible related write, it is (was?) often implemented conservatively and waited for more writes to land than were strictly necessary.
I don't know how much this tracing has improved in various kernels since 2009, but I do know that I would want them to benefit fsync(), not to make a new function with a subtly different contract.
Does this suggest that the idea of an overarching fsync/rename is perhaps being proposed at the wrong level of abstraction? If it can’t be reliably implemented at every subsequent level, what’s the point in having it at this level?
Different filesystems have different on disk structures, with varying degrees of complexity and cost in getting those things on disk. Everything can reliably get everything written eventually, but if you want to fully utilize your disk bandwidth, you only want to flush the data that can get there fast, which is going to vary by filesystem. And different applications have different durability requirements.
This is still no argument to not define syscalls for the semantics that applications typically care about. If it is slow on a particular file system, then so be it, it’s still the semantics the application requires.
I don't see how new APIs will help here. The problem is that the semantics that application developers want is expensive (performance-wise), so they use a similar API that doesn't guarantee those semantics but provides them in practice 99% of the time. You could create a new set of API's, but it's really hard to implement any API in a way that doesn't give extra, unspecified semantics in practice, so what prevents application developers from making the same choice?
They aren't that expensive. ZFS provides them by default, in that no operations are ever reordered in a user-visible fashion even without fsync. That's a stronger guarantee than we're asking for here.
I think that was the problem with how Linux implemented pselect() circa 10 years ago. I think pselect fixes the race condition you get when you set a timer then call select(). Linux's implementation was a function that set a timer and then called select(). ... slam head on keyboard.
So I think that fear is valid that they'll implement the API without the guarantees.
Myself I'm annoyed with the reckless push always to remove guarantees in return for 'performance'.
I'm starting to come round to the belief that the only robust remedy is to release the chaos monkey: have the kernel delete all the state of the filesystem driver at randomly-chosen intervals, a few hours apart on average, and force the driver to recover itself.
That is pretty close to what those of us building robust storage systems in the real world have to do. At a previous employer, we had test suites that exercised all kinds of corner cases by triggering failover and system reboots in the middle of heavy persistent messaging workloads.
Filesystems have other horrors that you learn about during testing. I had one test case where it would take ext4 80 seconds to write out an 8MB file after a fresh mount of an 8TB filesystem. Free space was fragmented in just the right way that we hit the single threaded reading of block groups and bitmaps in the kernel and it took forever to get that data off the disk array.
I've done the same with embedded drivers, randomly flip bits in the drivers state and see what happens. Does it recover, throw fault, hang, or explode like a bomb. I got that from a hardware designer talking about robust state machines. Any random illegal state should sequence back to a known good state.
> Any random illegal state should sequence back to a known good state.
What benefit does that property provide? Other than helping to deal with random memory corruption bit-flips, I'm having trouble understanding why state machines (in hardware, or any level of software) should ever be expected to handle arbitrary memory manipulation outside of the rules of the state machine.
You often don't have memory protection in embedded and coding errors, running over stack space could cause such problems. Usual remedy is to detect these with diagnostics and issue a reboot to clean state ASAP.
Okay, so why not create a single syscall to encompass all that? So that the developer can clearly say "this is what I'm trying to achieve", and accomplish it simply, and reliably?
I think this would be a benefit: first, userspace developers wouldn't need to dig into the gory details of dirent metadata durability, and second, making this explicit helps put pressure on filesystem developers to optimize what users actually want.
The phrase "atomic across crashes" is meaningless. Atomicity deals with the running system state at the moment a system call executes. It has nothing to do with crashes (which are not atomic in and of themselves), or what bits actually end up on disk.
Atomic simply means indivisible: you don't observe half of an atomic operation. Of course it strongly depends on where you are observing the operation from, and that does not necessarily just mean 'a running system', hence OP specifying that they wish for the operation to be atomic even when observed from a system before and after a crash.
Yes, but as I said above that doesn't make sense. The atomicity we're discussing here is with respect to time, and sequencing (not say size, like an atom).
We say that rename() is atomic because you will either see file A or file B. There are no other possible states.
With a crash, writes might not be committed. You might get an earlier filesystem state.
> With a crash, writes might not be committed. You might get an earlier filesystem state.
And the whole point of TFA is that this is perfectly fine as long as the metadata writes aren't committed before the data writes. That is:
Precondition: File at path A exists with data X
1. Create file at path B
2. Write data Y to file B.
3. Close file B
4. Rename B to A
5. Crash
---
As long as the contents of the file at path A are either X or Y, then we have achieved atomicity in the sense used in TFA. This is what we mean by rename acting "atomically across crashes".
Note that we only care about the file at path A; all of these are fine:
This is fundamentally not how unix filesystems work. No file data is written in the above scenario with rename(). Rename changes links - not files. Let me rewrite this for you using more correct language:
Given a link to a file at path A exists. The file contains data X. There may be other links at paths C, D and E.
1. Acquire a filesystem-global lock on link ops (link/unlink et al, but not read/write).
2. Create a link to X at path B
3. Remove the link at path A
4. Release the filesystem-global lock.
Note: The data in X is not relevant. The data in X may be undergoing active modification while the above 4 steps are performed. The data in X may be memory mapped and may change many times during the "atomic" rename() syscall.
The atomic properties of rename() are only vis a vis link/unlink semantics within a single filesystem. Other files can't be created or deleted while the rename() is in progress. File data however is under no such restriction -- and in fact it cannot be because the file may be memory-mapped and modified without any syscall interactions. There is no system call sequence point to gate reads and writes.
What you are describing is fundamentally incompatible with how the system actually works. What you describe cannot be made to work, ever. It is architecturally invalid at a fundamental level.
It works in ZFS, XFS, and even ext4 with data=ordered. It worked well enough in ext3 as well, that I didn't see issues with it despite crashing the kernel a lot.
> File data however is under no such restriction -- and in fact it cannot be because the file may be memory-mapped and modified without any syscall interactions. There is no system call sequence point to gate reads and writes.
Note a complete lack of mmap() in the list of operations above; you are trying to twist what I am asking for into something that is impossible, when what I'm asking for is definitely possible.
[Edit]
I'm also well aware of how rename works; the ask is that the link is not committed before the data. This is possible to do at the expense of some performance, but possibly less performance than using existing commit primitives (e.g. fsync or fdatasync).
There are all kinds of cases where filesystems might issue an extraneous sync, but it is not atomic, merely ordered. You've switched from talking about one to talking about the other and they are not the same. This is the core of our disagreement.
Because it is not an issue of atomic behavior but merely ordering there's no reason to modify the kernel or touch syscalls. It would be far more reliable to provide this in a library -- rename(3) -- which would then automatically provide the benefit on every posix compatible filesystem.
It would also allow for a use case of wanting to rename a file without forcing a full data sync -- which could be performance critical behavior in some scenarios.
To recap:
No filesystems provide atomic syncing of data during rename(). The words don't even make sense. Ext doesn't do it, zfs doesn't do it, xfs doesn't do it. Achieving this would require mechanisms that don't exist.
Some filesystems do guarantee that an fsync will occur alongside a rename. There is no specific guarantee as to what file state will be synced. These filesystems do this to mask the impact of folks mis-using the API, paying an efficiency cost as a result.
In ext4, for example, the sync does NOT occur within the same transaction as the rename. The sync is NOT atomic (it would be wild if this were the case, as noted above -- architecturally inconsistent and extremely non-performant)
It would be ideal for this to happen in libc rather than in the kernel.
Yes everyone in the discussion agrees on this fact
> ...but it is not atomic, merely ordered. You've switched from talking about one to talking about the other and they are not the same
If the operations we are talking about are ordered, then this provides the atomicity that TFA is asking for. Properly ordering operations is a very common way of making a series of steps atomic.
> "If the operations we are talking about are ordered, then this provides the atomicity that TFA is asking for."
No, it doesn't. I think we're reaching the heart of the disagreement here - around what "atomic" means.
Atomic means indivisible. It means that other things cannot happen at the same time - that the individual (possibly ordered) steps of an operation cannot be observed while in progress.
This is why it's absolute nonsense to talk about atomicity across a crash.
The fundamental design of unix files and memory preclude this. It can't be done.
Specifically (to hopefully clarify your point), the operation we wish to be atomic is 'replacing the mapping path A->data X with path A->data Y'. This is hoped to be done by placing data Y in path B (which does not need to be atomic) and then using the atomicity of the rename to move around the path. This then puts an ordering constraint that any observer (whether after a crash and looking at the written filesystem, or during the normal operation of the system) sees data Y going into path B before the rename B to A, and this is what POSIX does not guarantee.
It blatantly can work, it's just not guaranteed by the standard (at least without an fsync). I don't think anyone is talking about situations where B is being actively written to while the rename happens.
You can sync to durable storage at any point, yes. But you cannot do it in a semantically useful fashion.
"I don't think anyone is talking about situations where B is being actively written to while the rename happens."
Well, you specifically are ignoring this. I'm not.
Filesystems are free to generate extra syncs whenever they like. You could even have a filesystem sync every single write operation to durable storage - why not?
This article is just plain wrong. The author doesn't seem to understand how filesystems work - specifically the difference between writing data and renaming directory entries. LWN should either issue a correction or take it down.
First: The author is confusing atomicity techniques with durability techniques. rename() isn't at all relevant to durability. Never has been.
Second: Nothing about using rename() is relevant to what happens when pages are lost on crash. Using rename() or link() has no bearing on dirty writes. This whole section should just be deleted.
Third: When pages are lost, they're not back-filled with random unallocated space (and if they were, see the second point above). Filesystems are designed to ensure these regions are zeroed out before the area can be used - either proactively or in fsck.
Finally, this quote: "Which brings us to the present day fsync/rename/O_PONIES controversy, in which many file systems developers argue that applications should explicitly call fsync() before renaming a file if they want the file's data to be on disk before the rename takes effect"
There is no such controversy because rename() is utterly unrelated to the durability of data on a disk. This is a pure hallucination.
I don't think any other unix filesystems do this. It's not posix, and it's not behavior that any apps should be depending on if they care about their data.
I would suggest updating the LWN article to clarify that this is a piece specifically about ext4 -- not POSIX, or even Linux in general.
What you say is correct, and ext4's current default behaviour can be quite confusing:
* ext4 turns rename() into an invisible fsync(), if the target file existed.
* ext4 turns close() into an invisible fsync(), if the opened file existed.
This leads to surprising results:
If you unzip something the first time, it's nice and fast.
If you unzip the same file again, force-overwriting existing files, suddenly it's 10x slower. You now spend hours finding out why, and start cargo-culting voodoo that doing `rm` beforehand makes it inexplicably fast again.
Invisible fsync()s that the application programmer cannot turn off are quite bad.
My speculation is that Ted Ts'o agrees with that in principle, but that he got tired of having to explain to people that if they don't fsync(), there are no guarantees; so he added this hack that makes 95% of the cases go away with surprising special-case behaviour.
That’s not quite true. There used to be a performance bug in ext3 that caused it to order the writes before the journal flush that created the files.
Then, gnome(?) and only gnome(?) relied on it, leading to massive filesystem corruption, so the kernel team gave up and made the old braindead behavior the default.
(The current default behavior is braindead because it makes rename extremely slow for correct programs that don’t care about durability across crashes, and therefore don’t call fsync).
Yeah, all filesystems are allowed to synchronize to disk as often as they want, in whatever way they want in /addition/ to what's specified. Filesystems can even make their own guarantees, above and beyond generalized specifications like fsync().
I could write a filesystem that synchronized files after every 42 megabytes, or once every 69 seconds. Apps might even come to depend on this behavior. All filesystems will have predictable, implementation dependent idiosyncrasies.
But these aren't specified behaviors. They're not part of an interface and they shouldn't be depended on. The article implying that they ought to be is, I think, shortsighted. Other filesystems don't do this and your code will break if you assume it.
> However, the ordering effect of rename() turns out to be a file system specific implementation side effect. It only works when changes to the file data in the file system are ordered with respect to changes in the file system metadata. In ext3/4 [...]
Seems pretty clear to me that this is talking about an ext3/4 implementation detail that people have started to rely on...? I also pretty clearly remember that from back in 2009.
> This article is just plain wrong.
Articles like this helped me understand the distinction between theoretical POSIX semantics and what Linux was actually doing at the time.
It seems like you are reading more into this article than at least I got out of it – maybe it's because I still remember that 2009 controversy, but I wouldn't have drawn any of the (incorrect, and I agree on that!) conclusions you're listing above.
> The author doesn't seem to understand how filesystems work
At the time of writing (2009) the author had 10 years of experience under Sun Microsystems, IBM, Intel, and Red Hat -- all working on filesystems. Including ZFS, ext2/3, and ChunkFS. (It's literally on her Wikipedia page)
So I'm more likely to regard her comments in high regard versus a driveby post written by a throwaway account.
It's pretty common for people with some experience to not understand filesystem nuance - there's quite a bit of it right here in the hn comment section. You'll note below I also corrected someone significantly more credentialed than the author of this LWN post.
My suggestion to you: Worry less about measuring resumes and more about who is actually correct as a matter of demonstrable fact.
If you have actionable questions, ask them. Use facts, not appeals to (frankly not very substantial) authorities.
In POSIX, file content data and metadata (directory entries) are separate. Atomicity and durability are separate.
It is likely that the LWN post author understands this very well, but omitted this important info from the article.
--
I can also understand throwaway's criticism, e.g. on sections like this:
> Given this situation, application developers came to rely on what is, on the face of it, a completely reasonable assumption: rename() of one file over another will either result in the contents of the old file, or the contents of the new file as of the time of the rename().
No. There is nothing reasonable about this. This is programming. When you're programming against a spec (POSIX), you don't "make assumptions". You rely only on what it says in the spec.
It would be "reasonable" to wish that somebody creates spec that ensures the mentioned rename semantics. But assuming that a spec says something it doesn't is wishful thinking.
People are quick to lazy it out, assume, copy-paste, etc, instead of critically thinking "wait, does the code I write here really guarantee the desired effect, e.g. to write my file to disk".
Reject assumption-based programming. Go check. Read the docs.
That makes good programs.
Or, as the throwaway says, rely on "demonstrable fact".
> In POSIX, file content data and metadata (directory entries) are separate. Atomicity and durability are separate.
> It is likely that the LWN post author understands this very well, but omitted this important info from the article.
1. If the author understands this, then the throwaway is wrong, since they claimed the author didn't know this.
2. The entire point of TFA was that POSIX is underspecified here, so I think they covered it sufficiently; something that worked on FFS didn't work on ext4, and the suggestion is that it should work going forwards, regardless of what POSIX requires.
> No. There is nothing reasonable about this. This is programming. When you're programming against a spec (POSIX), you don't "make assumptions". You rely only on what it says in the spec.
Are you sure 100% of Linux application developers first and foremost think about POSIX, and never rely on an implementation-defined oddity to get the behavior they need?
Hyrum's Law is powerful: "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody."
> People are quick to lazy it out, assume, copy-paste, etc, instead of critically thinking "wait, does the code I write here really guarantee the desired effect, e.g. to write my file to disk".
Exactly. That is a fact of life that Linux kernel developers have to balance with POSIX.
> The author is confusing atomicity techniques with durability techniques.
No, the author isn't confusing anything. The author is describing a controversy in the linux community, and presenting the arguments being made in that controversy for you, the reader, to evaluate. It's called journalism.
I happen to agree with you that it was a sort of silly controversy... but the controversy was very real.
> There is no such controversy. This is a pure hallucination. What a bizarre article.
There factually was controversy on the mailing list. It's history, it happened. That's what the article is about.
Would that more PM's had had ponies as children: then they might understand that some shiny must-have features are the sort where (a) one occasionally has to call in expensive consultants, just to return them to the (b) steady state where they merely require constant feeding and cleaning-up-after.
On modern systems, crashes happen maybe once a year, while outright disk failures or undetected memory errors corrupting disk blocks happen maybe every 100 years. So if the chance of corruption from a crash is 1%, the overall risks are comparable.
Given that, I’d probably choose performance over guaranteed crash recovery for most systems.
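The comparison above works out like this (all numbers are the comment's stated assumptions, not measurements):

```python
# Assumed rates from the comment above:
crashes_per_year = 1.0            # "crashes happen maybe once a year"
p_corruption_per_crash = 0.01     # "chance of corruption from a crash is 1%"
hw_corruption_per_year = 1 / 100  # "disk failures ... maybe every 100 years"

crash_corruption_per_year = crashes_per_year * p_corruption_per_crash
# Both risks come out to 0.01 events/year -- hence "comparable".
print(crash_corruption_per_year, hw_corruption_per_year)  # → 0.01 0.01
```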
fsync doesn't have to "perform", because it doesn't have to do anything. It just has to wait for some certain events to occur that will eventually occur anyway.
Now some people might not like how long it takes; maybe it's waiting for some unnecessary events that are not relevant to this file.
Surely fsync() should at least initiate syncing the data to the disk? Otherwise, everything FS-related could theoretically still fit in the memory cache forever, and so fsync() will never return.