Hacker News
Files Are Fraught with Peril (danluu.com)
240 points by tambourine_man 88 days ago | 78 comments

I worked on backup solutions and supported Dropbox, and I think the author used Dropbox mainly as a springboard for his own concerns about filesystems; those concerns don't really apply to the Dropbox case.

From the filesystem point of view, I don't think Dropbox is all that concerned about you losing your data to corruption, since your files are supposed to be safe and versioned in the cloud anyway. And their app is fairly simple; there's no filesystem-specific 'driver' hackery involved.

Their really costly problem is metadata! Different filesystems support different naming rules, encodings, and character sets for file names. Each has its own quirks and limitations around extended metadata and ownership/permission info. Some support file-change notification and some don't; some report a valid last-modification time, but with varying sub-second precision; and so on.

So, since Dropbox's ambition is that its app can fully back up and restore your data, supporting backup from one filesystem and restore to another, across the full combinatorial matrix of possible cases, is a nightmare.

Just a simple example: suppose you have a filename containing the character "ü". Under Unicode normalization this can be stored as a single code point or as two, "u" plus a combining diaeresis (letter + combining mark). Nearly everyone uses the first form, but Mac HFS+ uses the second. The crazy thing is that if you try to save a file under the first form on HFS+, it will accept it but silently convert the filename. So suppose Dropbox restores such a filename with the single-code-point form: later, when listing your files to check whether they're in sync, it will see a different filename (comparing raw bytes) rather than the one it expects. If it weren't smart about this, it might re-download the file again and again and again.
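The two representations can be demonstrated with Python's unicodedata module; a sync client has to compare names in a canonical form rather than by raw bytes (a sketch — `same_name` is my own helper, not Dropbox's actual logic):

```python
import unicodedata

single = "\u00fc"       # "ü" as one precomposed code point (NFC)
decomposed = "u\u0308"  # "u" + COMBINING DIAERESIS (NFD, as HFS+ stores it)

def same_name(a, b):
    # Compare filenames in a canonical normalization form -- otherwise
    # a round trip through an NFD filesystem looks like a brand-new file.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Both strings render as "ü", but they differ byte-for-byte, which is exactly the re-download trap described above.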

Yeah, the article used comments on Dropbox as a kicking off point to talk more generally about filesystem issues. The author is aware that in this specific case it was more about metadata:

> I believe their hand was forced by the way they want to store/use data, which they can only do with ext, but even if that wasn't the case...

The Dropbox examples seem to have clouded the issues actually being discussed.

But Dropbox doesn’t even attempt to support the more common cases (pun intended). For example, you put in a file whose name contains upper-case characters and it might come out all lower-case on another (Linux) system.

They claim it’s because not all file systems are case sensitive, but they could at least keep the original case and use it when possible.
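The "keep the original case" suggestion amounts to matching names case-insensitively while storing them case-preservingly. A minimal sketch (helper names are mine, not anything Dropbox ships):

```python
def build_index(names):
    # Key by the casefolded name for matching, keep the original
    # spelling as the value so it can be restored when possible.
    return {name.casefold(): name for name in names}

def lookup(index, name):
    # Case-insensitive match that returns the originally-cased name,
    # or None if absent.
    return index.get(name.casefold())
```

This is how case-insensitive-but-case-preserving filesystems (NTFS, APFS by default) behave internally.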

Then again, OneDrive syncing breaks on a whole new level. If you upload a docx file, they’ll open it and edit the metadata, so the file you added will differ from the one you get back (they actually do it to images and PDFs too).

>From the filesystem point of view, I don't think that Dropbox is so much concerned about you losing your data because of corruption. As, anyway they are supposed to be safe and versioned in the cloud.

They could get to the Cloud corrupted, and then continue replicating in your devices, so no...

That's where the "and versioned" comes in handy

Versioning won't help if it's the first copy that got into Dropbox.

If it's the first copy, it will just be read by Dropbox. There will not be 'write' from Dropbox that can corrupt it...

If it's the first copy going to Dropbox, then it is not Dropbox that wrote it and so not Dropbox that corrupted it by writing in a wrong way...

Also to note, Dropbox was likely inspired by the open source Unison file-synchronization tool. So they have to handle all of the shortcomings and caveats that Unison does not handle out of the box:

Caveats: https://www.cis.upenn.edu/~bcpierce/unison/download/releases...

Symbolic links: https://www.cis.upenn.edu/~bcpierce/unison/download/releases...

Permissions: https://www.cis.upenn.edu/~bcpierce/unison/download/releases...

Cross platform issues (unsupported characters across operating systems and file systems): https://www.cis.upenn.edu/~bcpierce/unison/download/releases...

Edit: Inspired by the Unison spec (published in 2004), which may be why we have conflict files etc. when using Dropbox: http://www.cis.upenn.edu/~bcpierce/papers/unisonspec.pdf

Fairly sure this is not true, unless you have a source. Drew did all the client side work in python/librsync to get going.

Ok, may have been inspired by. Here's some relevant backstory from 2011: https://www.wired.com/2011/12/backdrop-dropbox/

This is fascinating, a bit in the same way that looking at accidents is interesting.

A good synthesis is: filesystem API design is obviously a problem, given that even the people who specialize in using these APIs can't do it correctly:

"Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope would write to files safely, like databases and version control systems ... they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug ... programmers who work on things like LevelDB, LMDB, etc., know more about filesystems than the vast majority of programmers ... they still can't use files safely every time"

I write database engines for Linux and the filesystem situation really is a train wreck. It isn't anyone's fault per se, design choices and standards were accreted over many decades that in isolation made sense in some context but which in aggregate have many poorly defined interactions and create conflicting requirements. And you can't change any of it easily because there are several decades of software built using the existing design. Guaranteeing precise, consistent behavior is nearly impossible with the standard filesystem API infrastructure and the implementation details change invisibly in important ways.

This is evident in database engines if you look at the number of lines of code dedicated to storage. Working with raw block devices requires the fewest lines of code, working through the filesystem requires the most, and direct I/O (partial filesystem bypass) is somewhere in the middle. And even if you design for comparable guarantees across these models for working with storage, the consistency of behavior across Linux environments also significantly improves as you bypass more of the filesystem.

Ironically, I now tend to borrow the low-level storage interface from database kernels, which abstracts the filesystem mess, for all code that needs to work with storage even if it isn't a database. It provides a saner interface with more consistent guarantees and often better performance. In my ideal world, someone properly designs a completely new filesystem API from scratch that sits alongside the legacy APIs that applications could start migrating to. But it would probably require adverse changes in the way the Linux kernel works for the legacy path, which means it will never happen.

Very few applications need random read/write access to files. Most of the time, you need to read an entire file in, or write an entire file out via the streaming access APIs. This core fact of typical usage is why I think so many application developers have naive expectations about filesystem behavior.

Read-only random-access is well served by mmap() and pread().

For random write access within a file, preadv2() and pwritev2() could be augmented with additional flags RWF_ACQUIRE and RWF_RELEASE. That's Linux-specific, but it could give database developers the ability to separate ordering (barriers) from flushing (fsync). But perhaps I'm being naive. My assumption is that database developers are issuing flushes in order to get the barriers they really want.
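For context, RWF_ACQUIRE/RWF_RELEASE are the commenter's hypothetical barrier flags; they don't exist in any kernel today. The closest existing mechanism is pwritev2(2)'s RWF_DSYNC flag, which makes a single positional write durable. Python exposes it via os.pwritev (3.7+); a sketch:

```python
import os

def durable_pwrite(fd, data, offset):
    # Per-write durability via pwritev2's RWF_DSYNC (Linux 4.7+);
    # on platforms without the flag this degrades to a plain
    # positional write. RWF_ACQUIRE/RWF_RELEASE, as proposed above,
    # do not exist -- this is the closest real equivalent.
    flags = getattr(os, "RWF_DSYNC", 0)
    return os.pwritev(fd, [data], offset, flags)
```

This flushes every write, which is exactly the "flush to get a barrier" conflation the comment describes: there is no way to ask for ordering alone.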

I would only add that mmap should be used with care[0] and pread should be preferred.

[0]: https://www.sublimetext.com/blog/articles/use-mmap-with-care

It should be noted that much of the problems with mmap cited there aren't actually mmap's fault, but the fact that I/O errors become POSIX signals, and the API for POSIX signals really blows. (POSIX signals are probably even more ripe than filesystems for needing a different approach).

I'm surprised that page didn't mention the other issue with mmap, which is concurrent file access.

How would you signal errors accessing virtual memory?

In the context of a VM error, the only deficiency with POSIX signals I can think of off-hand is that POSIX only permits global signal handlers--not per thread--but that's only an issue for the multithreaded case.

I'm partial to something more akin to SEH for the synchronous signals (SIGILL, SIGFPE, SIGSEGV, SIGBUS, SIGTRAP). Basically, define that synchronous signals are handled per-thread in a manner similar to try/catch, although you need an extra catch type that amounts to "retry the operation" in addition to "rethrow" and "swallow the exception".

Can you restart execution at the point of the fault using SEH? A brief skim of the documentation doesn't seem to suggest it's possible as a general matter. If I wanted to implement a dynamically growable stack structure in a way that let me resume at the point of the fault, preserving program state, how would I do that?

I ask because I think sometimes people conflate POSIX signals offering poor semantics (i.e. reentrancy issues) with POSIX signals being too low-level. I can imagine how I might implement SEH using POSIX signals (though a per-thread handler would be really nice), but not vice-versa (though maybe it is possible).

As I understand it, there's a long history behind signals relating to interrupt vs polling software system models. Signals as they exist in Unix were a very early implementation of the interrupt driven model in the context of a kernel<->user space interface. But Unix's process and I/O concepts were perhaps too convenient so the value-add of the signals model was minimal; Unix ended up evolving in directions that didn't require the interrupt abstraction (at least, not until decades later). This history explains why POSIX signals are so low-level and the lack of comprehensive runtime treatment.

Note that they're only low-level by today's standards. At the time they were very high-level--a signal interrupt magically preserved your process state (stack and program counter) and would magically resume the process when returning from the handler, which could be a simple C function. Even better, this could occur recursively! And it's worth mentioning that all kernel interfaces were and remain async-signal safe (e.g. dup2 is atomic even from the perspective of an async signal).[1] The lack of consistent and convenient treatment by the runtime comes from the fact that much of the runtime we're most familiar with came later; threads came way later. When they came about people had already moved away from signals, perhaps because they saw that it was too much work to make the interrupt driven model work well at a high-level.

[1] In classic Unix style it did all this with the most minimal of kernel and user space code, pushing process state onto the user space stack and relying on an in-process trampoline to restore program state (which is how recursion could be supported without any complexity in kernel space).

I'm curious, what low-level storage interfaces do you recommend?

I use the same one I use in the database engines I work on, so not open source. It provides a flexible storage abstraction on top of Linux with a lot of control over performance and behavior. It is a bit like an explicit mmap() implementation. Underneath the hood it uses direct (i.e. no kernel caching) pread/pwrite.

> filesystem API design is obviously a problem

It's an API that's so critical it's almost impossible to change or rewrite. The API itself has barely moved in 30 years, whether it's POSIX or Win32. Possibly the only widely adopted change in filesystem API has been "S3" and compatibles, which provide a completely different set of atomicity semantics as well as being network-native.

> Possibly the only widely adopted change in filesystem API has been "S3" and compatibles, which provide a completely different set of atomicity semantics as well as being network-native.

I would love if operating systems exposed a local object storage syscall API (with object versioning and all that good stuff.)

It could be implemented on top of the filesystem for all I care, as long as it's an abstraction with safe object-storage semantics, enforced by the kernel and exposed to all processes without a need for library support.

I believe that processes that wanted to "base" themselves entirely on object-storage would still need to touch the fileystem abstraction as well, mostly to allocate temporary on-disk "buffers" to gradually write to before submitting them to the object-store ABI as new object bodies. But:

1. Most such buffers would be small enough that you could get away with using an anonymous mmap(2) instead, keeping the file "on disk" in the page file rather than in the filesystem itself.

2. For objects you're writing to by streaming, with an unbounded eventual size, you could do the same trick Google Cloud Storage does: allocate a series of fixed-size "chunk" objects to receive the stream data, closing one and opening the next as each previous chunk gets "filled"; and then expose an API call to concatenate such chunk objects together on the object-storage kernel "backend" into single files (probably in O(1) time, because at a low level it's just concatenating disk extent lists.)

3. For other unbounded-size buffers, you could also have the object-store-kernel-daemon provide an API where it manages "large durable working copy" files for you, sort of "checking out" objects into file descriptors (probably using copy-on-write file clones on the backend), then "checking in" file descriptors to become new versions of those objects (maybe even "helpfully" avoiding doing so if the buffer hasn't been touched.)
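The chunking trick in (2) can be sketched in a few lines. Here a plain dict stands in for the object store, and the "compose" step is a byte concatenation; the point is that each chunk is only ever written whole, and a real backend could implement compose in O(1) by splicing extent lists:

```python
CHUNK = 4  # tiny for illustration; real systems use megabytes

def write_stream(store, name, stream):
    # Spill an unbounded stream into fixed-size chunk objects,
    # closing each as it fills, then compose them into one object
    # and delete the chunks -- the GCS-style trick described above.
    chunks = []
    while True:
        buf = stream.read(CHUNK)
        if not buf:
            break
        chunk_name = f"{name}.chunk{len(chunks)}"
        store[chunk_name] = buf          # each chunk written whole
        chunks.append(chunk_name)
    store[name] = b"".join(store[c] for c in chunks)  # the compose step
    for c in chunks:
        del store[c]
```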

What's the difference between an 'object' and a file?

"Objects" in S3 are basically immutable. You can't overwrite the middle of an object or append to it - you can only overwrite the object as a a whole. This makes concurrent access much easier to reason about by making certain operations impossible. You couldn't build a database on an S3 object, for example. So you wouldn't want S3 as your only filesystem. But a combination of S3 semantics and block storage semantics covers the use cases of virtually all applications, with a bit of extra work to cover streaming writes.
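The difference is easiest to see in code. A toy model of the semantics (a sketch, not any real API): there is no seek, no append, and no partial overwrite, only whole-object put/get, and every put creates a new immutable version:

```python
class ObjectStore:
    def __init__(self):
        self._versions = {}  # key -> list of immutable byte strings

    def put(self, key, data):
        # The object is replaced as a whole; prior versions survive.
        self._versions.setdefault(key, []).append(bytes(data))
        return len(self._versions[key]) - 1   # new version id

    def get(self, key, version=None):
        versions = self._versions[key]
        return versions[-1] if version is None else versions[version]
```

With only these operations, readers never observe a half-written object, which is what makes concurrency easy to reason about.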

The typical solution is to provide a new, safe API while supporting the broken stuff forever.

Except if the API's have different semantics and you can't reconcile them. Then your new API can't be significantly different from your old API or your old API gets broken on the edge cases.

That's the situation we are in with file systems API's now.

Does anyone happen to know if ZFS also suffers from renames being non-atomic? Since ZFS is the only candidate for a reasonable file system we have anyway[+], I'd be a lot less sad if it turns out that the way out of this dumpster fire is just to tell people to use ZFS.

[+] The absolutely minimal requirement for a reasonable file system is that it works on multiple popular platforms and performs checksumming and other data integrity measures. And that's before we even get to still highly desirable stuff like snapshotting, encryption, etc.

ZFS never reorders metadata operations in an observable manner, so it's really well behaved. If only every filesystem was like that...

You still need fsync for the data, but you can set sync=disabled on the filesystem, which turns it into a barrier. Alas, you can't do anything more granular than per-filesystem.

Does this idiom (which turns out to be broken on rename op failure for most FSs) work reliably with ZFS (on the same filesystem)?

    def atomically_write(filename, data):
        with open(filename + '.tmp', 'w') as fh:
            fh.write(data)
        os.rename(filename + '.tmp', filename)
In addition to ordering guarantees, you also need to guarantee that even in case of failure the rename operation leaves the to-be-renamed file as it is and the to-be-renamed-to file non-existent – is that the case?

> If only every filesystem was like that..

I wonder if the solution for databases and other server software that needs to persist data reliably is just to pretend they are and insist that end users stick to non-broken filesystems (most probably ZFS, assuming that set is currently non-empty). The alternative, wasting endless time and resources on Sisyphean quests to placate crap filesystems, seems to have been an utter failure so far.

I'm actually not sure if that code would work, but I don't think it would. You need at least one fsync in there, before the rename.
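For reference, a sketch of the idiom with the missing fsyncs added (my own version, not from the thread): one fsync on the file before the rename, and one on the containing directory afterward so the rename itself becomes durable. Even this is still subject to each filesystem's rename-on-crash behavior, as discussed elsewhere in the thread:

```python
import os

def atomically_write(filename, data):
    tmp = filename + '.tmp'
    with open(tmp, 'w') as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())   # make the data durable before the rename
    os.rename(tmp, filename)
    # fsync the directory so the rename itself survives a crash (Linux).
    dirfd = os.open(os.path.dirname(os.path.abspath(filename)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```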

I love ZFS but the licence situation means it'll always be a mess to use.

Yes, it's sad. Even more so since there is not even a potential competitor in sight.

> [ext4 data= mount option] writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

Not just old data from the same file, either. In the case of file appending, "old" can actually mean "whatever junk happened to be in that allocated disk block earlier".

Where can I learn more about the rename() trick not being safe?

I was also (incorrectly) assuming that renaming a file after having fsynced the contents is a valid mechanism to perform "atomic" file writes where the write either appears completely at the destination path, creates a zero byte file or does not happen at all. But it should never produce an "incorrect" non-zero byte file - even in the face of crashes.

It is definitely something that is used in the wild so it was surprising to me that it is not correct. I even found a recent LWN article suggesting otherwise, so there seems to be a lot of confusion on this topic.

Would love to learn more if anybody has detailed infos or a link to the relevant mailing list thread or similar.

EDIT: Glibc documentation also implies that the "rename trick" is indeed safe:

From https://www.gnu.org/software/libc/manual/html_node/Renaming-...

> One useful feature of rename is that the meaning of newname changes “atomically” from any previously existing file by that name to its new meaning (i.e., the file that was called oldname). There is no instant at which newname is non-existent “in between” the old meaning and the new meaning. If there is a system crash during the operation, it is possible for both names to still exist; but newname will always be intact if it exists at all.

Another relevant source that suggests the trick is safe is https://lwn.net/Articles/327601/. Keep in mind that Ted Tso is the ext4 maintainer.

> For the longer term, Ted asked: should the above-described fixes become a part of the filesystem policy for Linux? In other words, should application developers be assured that they'll be able to write a file, rename it on top of another file, omit fsync(), and not encounter zero-length files after a crash? The answer turns out to be "yes," but first Ted presented his other long-term ideas.

We wrote a program where atomicity was mission critical, and tested the hell out of power loss recovery. I can confirm experimentally that the rename trick is not good enough.

It was nearly good enough on Linux. Corruption was very rare, but we still caught issues. On Windows corruption happened something like 1 in 10 poweroffs.

For low performance ACID we write the whole file out twice and fsync between each write. First write is to the backup file, second is to the primary file. This method passed our tests.

For high performance, we use another strategy entirely that's a lot more involved.
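A guess at the shape of the low-performance double-write scheme described above (all names and the checksum framing are mine, not the commenter's actual implementation): each copy is written with a digest prefix so recovery can tell an intact copy from a torn one:

```python
import hashlib
import os

def _frame(data):
    # Prefix the payload with its SHA-256 so torn writes are detectable.
    return hashlib.sha256(data).digest() + data

def double_write(path, data):
    # Backup copy first, fsync, then the primary, fsync again --
    # at every instant at least one intact copy exists on disk.
    for target in (path + '.bak', path):
        with open(target, 'wb') as fh:
            fh.write(_frame(data))
            fh.flush()
            os.fsync(fh.fileno())

def recover(path):
    # Prefer the primary; fall back to the backup if it fails the check.
    for target in (path, path + '.bak'):
        try:
            raw = open(target, 'rb').read()
        except FileNotFoundError:
            continue
        digest, data = raw[:32], raw[32:]
        if hashlib.sha256(data).digest() == digest:
            return data
    raise IOError("both copies corrupt")
```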

> For high performance, we use another strategy entirely that's a lot more involved.

It sounds interesting. Is there a write-up with more details to read?

I appreciate the desire for stripped-down websites, but "body { max-width: 800px; }" makes a world of difference for readability.

This is where em units, rather than px (or rem), are preferred.

   body {
      margin: 2rem auto;
      padding: 2rem;
      width: auto;
      max-width: 45em;
   }
Works in most instances, even without @media queries.


Firefox reader mode is pretty useful in these cases, and can also save you from websites with the opposite problem.

True, but sadly Chrome still lacks a reader mode for whatever reason


I've been using it for a while and it seems to be solid.

I don't really trust any browser extensions at all these days

Oh god, I didn't realize how broken filesystems are. Shit.

If it helps, just about everything is like that if you look closely. Processors have side channel attacks, RAM has rowhammer (which recently turned out to be a real thing), digital electronics in general turn out to have analog side effects, and time and space are both basically impossible for computers to represent precisely (see: falsehoods programmers believe about *). We should do what we can, but life goes on :)

I'm amazed every time a machine boots.

Given the thousands of things that could cause it not to, it's incredible it makes it down the happy path each time.

Yep - as soon as they moved away from ROMs I am too.

Amiga, Atari ST and early Macs were sort of in a twilight - they ran the core OS from ROM but loaded utilities and sometimes patches from disk (-ette).

> we should expect to see data corrupiton all the time.

I wonder if the typo was deliberate?

> If we look at a worn out drive, one very close to end-of-life, it's specced to retain data for one year to three months, depending on the class of drive.

What happens once the data expires? Does the SSD return an error when the data is read, or does it read the bogus data without knowing?

I would kind of prefer the drive bricking itself rather than risking silently backing up bogus data. The earlier comment about ECC and bit error rates suggests bad reads are identified as such, but I'm not sure how far to trust that given that, as mentioned, I/O is hard.

Error detection is inherently probabilistic, so there's always the chance that bogus data is read without detection. SSDs use multiple levels of error correction where each level is slower but more reliable than the previous. Such a scheme could only work if the error detection ability of each level were much greater than its error correction ability. I wouldn't rely on SSD firmware to do anything in particular, though. Your best bet is to monitor SMART stats about error rates. If there's a high error rate some of your sectors may be reading bogus data without you being able to detect it.

Memory gets corrupted. Hard disks at least would slowly die over time, because sectors of the disk would stop working and have to be marked bad. I’m pretty sure that with most things, the data you read back from memory is a random variable (one with very, very low variance, but still random at the end of the day).

You can checksum backup data. Might be more difficult with high speed data like databases, but if it's just backups a cryptographic checksum is really reasonable.
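A chunked cryptographic checksum over a backup file is a few lines; it catches silent corruption anywhere in the storage stack, independent of what the drive firmware reports:

```python
import hashlib

def file_digest(path, algo="sha256"):
    # Read in 1 MiB blocks so arbitrarily large backups hash in
    # constant memory; compare digests across backup generations.
    h = hashlib.new(algo)
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()
```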

The article is a great survey of gotchas across the storage stack.

However, I think what motivated Dropbox to drop so many Linux filesystems was the need for a certain flavor of xattrs, specifically to detect renames, i.e. to merge create/delete events into renames.

Perhaps Dropbox adds internal UIDs via xattrs to every file you add to a Dropbox folder. If you move the file around, no problem, Dropbox can detect the rename via the xattr UID. Relying on heuristics alone to do rename detection can be brittle if you don't do it right, and missing rename events means file version histories get lost, which is a terrible shock for users. Imagine you suddenly can't find last month's version of a file...
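A guess at what such an xattr scheme could look like (the attribute name and helpers are hypothetical, not Dropbox's actual mechanism; requires a filesystem with user-xattr support, e.g. ext4):

```python
import os
import uuid

ATTR = "user.sync_uid"  # hypothetical attribute name

def tag(path):
    # Stamp the file with a stable UID; a later delete+create pair
    # carrying the same UID can then be recognized as a rename.
    if not hasattr(os, "setxattr"):
        return None  # xattr API is Linux-only in Python's os module
    uid = uuid.uuid4().hex
    try:
        os.setxattr(path, ATTR, uid.encode())
    except OSError:
        return None  # filesystem doesn't support user xattrs
    return uid

def read_tag(path):
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR).decode()
    except OSError:
        return None
```

The "performant xattr access" guess above is then about how fast `read_tag` is across hundreds of thousands of files.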

Just guessing here, but probably those Linux filesystems lacked performant xattr access and were too slow for Dropbox to do hundreds of thousands of xattr lookups when scanning a folder at startup, every N minutes, or whenever inotify says something changed.

They didn't want the complexity of heuristics, or they found a way to exploit EXT on-disk data structures for faster xattr lookups, something like that. Who knows? Perhaps the real reason is less complicated than that.

iOS 13 changes a lot about how files work, also internally, so implementing a correct File Provider is going to be quite hard (even the iCloud one doesn't work properly yet in the beta ...). I hope Dropbox and the others follow suit and implement this correctly. Just another example of how files are difficult to get right.

The suggestion to "use SQLite" to write files should come with a caveat. Such files can still become corrupted, and they cannot be inspected or amended with a text editor.

Any truly robust file-on-local-disk storage scheme for a non-cloud system should allow for manual diagnostics and repair.

Text-based formats are the best for repairability but not so efficient for structuring, querying and storing information. It’s a tradeoff, as always!

A search index can be maintained in binary format; it can be rebuilt in the event of corruption.

That’s not a good counterexample if that is what you attempted, for 2 reasons:

- An index is redundant information but dropping it and recreating it is not “repairability” as in “I read and amend a text”. You compare apples to oranges.

- What I wrote still applies to an index, so that’s orthogonal: a binary index will be more efficient but harder to repair than a text-based index.

Also, schema updates are quite hopeless with sqlite. One cannot even rename a column last time I checked.

I'm not sure I've ever tried to rename one, but the docs assert support for ALTER TABLE RENAME COLUMN (https://www.sqlite.org/lang_altertable.html).

There's also this from 2015: https://news.ycombinator.com/item?id=10725859

From reading the first paragraph, it sounds like the author will explain at great length why Dropbox found it hard to support filesystems other than ext4.

But then you read the article and it explains how dealing with files is hard and how data corruption and loss sometimes occur (fair enough), but nothing filesystem-specific that could explain why ext4 is superior (or special) and had to be chosen.

So it reads kind of like an excuse for Dropbox, only that it isn't.

ext4 is by far the most popular filesystem so supporting it is a necessity for that reason alone. Nothing to do with it being superior or special.

The argument this article is making is that supporting additional filesystems is hard. This is meant to refute the allegation that it's trivial to add whatever other filesystems the OS supports.

No, that's the argument this article is supposedly making. But it's not actually making it. That's the problem. Filesystems are an abstraction and I expected to see some problems regarding a leaky abstraction or something, but the article doesn't mention anything like that.

It mentions that dealing with hard disks is hard (which I believe, since they are flaky hardware). But dropbox didn't say "we won't be supporting this kind of hard disk/hard disk controller", but "we won't be supporting these filesystems". Where's the proof that those filesystems have problems that ext4 doesn't have?

From the article: "Large parts of the file API look like this, where behavior varies across filesystems or across different modes of the same filesystem. For example, if we look at mainstream filesystems, appends are atomic, except when using ext3 or ext4 with data=writeback, or ext2 in any mode and directory operations can't be re-ordered w.r.t. any other operations, except on btrfs."

That doesn't mean ext4 is problem-free, but it does mean that other filesystems have different problems that are not fixed by mitigations for ext4's quirks.

Okay and? Why does userspace care about this? Should Emacs not run on btrfs?

This is rather like the bluescreen problem: if your application tries to open a file from the normal filesystem and it's corrupt, the user blames the application. If the user opens a file in the Dropbox folder, they blame Dropbox. So they end up engaging in heroics to not be blamed for it.

(Windows has gone to increasing lengths to accomodate and contain badly written drivers, since most bluescreens are caused by drivers. There is now a subsystem to allow the video drivers to crash and entirely restart without bluescreening.)

Which prevents any CUDA kernel from running longer than about 5 seconds. Which means you can't use the GPU to spawn its own kernels with no PCIe/driver latency in between, because the master kernel has to finish before Windows kills it.

Last I checked, providing callback function pointers to binary vendor libraries (read/write adapters for FFT come to mind, allowing on-the-fly metric computation or skipping an intermediate storage for FFT convolution) was only possible on Linux, and with statically linking said vendor library into the software (incidentally breaking binary distributability for GPL).

I’d have much more sympathy if they didn’t already support more filesystems. Now I’m no expert, but I’m guessing dropping support mostly amounted to only e2e-testing on ext4 and refusing to operate (or warning) on everything else.

This all feels a little silly. Yes sure, files are hard. If they weren’t I probably wouldn’t need to pay someone to solve the problem. But I think solving the file syncing problem for just ext4 is worse than bad for your customers because now you’re taking the whole ecosystem with you. Imagine users ditching btrfs or zfs because of Dropbox.

Now nothing and nobody is perfect and I wish Dropbox the best, but I miss the days when it felt like they cared about Linux and truly focused on solving the core problem effectively. They differentiated on sheer quality. I’ve moved on from file syncing to a more elaborate local NAS configuration and I could never go back, but I did get many good years out of Dropbox.

I’m glad Dropbox did at least walk back the filesystem compatibility issue a bit. Hopefully it’s a sign of better times to come for the Linux client.

This is hard only on GNU/Linux because it's inherently broken because solving this once and for all proved too hard of a problem for volunteers working on the kernel and filesystem code. That's a hard to swallow fact if one's favorite operating system is GNU/Linux. There is one group of people who did solve it and they solved it correctly: the ZFS team under Jeff Bonwick.

If you care about data integrity, if you need an operating system where filesystem operations aren't leaky abstractions and where fsync(2) works as POSIX specifies, if correctness of operation is important to you, use an illumos-based operating system like SmartOS or any other based on the illumos codebase. Put some effort in and learn real UNIX and leave these long ago solved problems where they belong, back in the past century.

At the risk of sounding like those /r/programming replies: why would an application worry about journaling/logs at the filesystem level? As an application developer, all I’m usually told is that the only atomic operations are creates, deletes, and renames. So to update a file, you always write a second file and then rename it to the destination.


    Copy file.txt to file.txt.new
    Update file.txt.new
    Rename file.txt to file.txt.old
    Rename file.txt.new to file.txt

This is ”safe” in the sense that in case of a terminated process, the point where it failed can be determined and the application itself can resume or roll back the update on its next run.

What it doesn’t guarantee is that the OS/filesystem provides this rollback independently of the application.

My question is: does Dropbox have any reason to want to work on a lower level than the “normal” high level where you only manually rollback or resume transactions? Do other applications also do this? I always felt that trying to pierce the abstraction of high level FS APIs was unnecessary unless you are writing drivers or file systems.

Some low level programs (antivirus, backups) I can see why they would need to peek under the hood, but to me Dropbox is a pretty dumb file sync program that shouldn’t need complex fs operations like, say, a backup program. Is it more complex than I give it credit for?

This is specifically addressed at one point:

> This trick doesn't work. People seem to think that this is safe because the POSIX spec says that rename is atomic, but that only means rename is atomic with respect to normal operation, that doesn't mean it's atomic on crash. This isn't just a theoretical problem; if we look at mainstream Linux filesystems, most have at least one mode where rename isn't atomic on crash. Rename also isn't guaranteed to execute in program order, as people sometimes expect.

> The most mainstream exception where rename is atomic on crash is probably btrfs, but even there, it's a bit subtle -- as noted in Bornholt et al., ASPLOS’16, rename is only atomic on crash when renaming to replace an existing file, not when renaming to create a new file. Also, Mohan et al., OSDI’18 found numerous rename atomicity bugs on btrfs, some quite old and some introduced the same year as the paper, so you would not want to rely on this without extensive testing, even if you're writing btrfs specific code.

(also, this trick is almost completely useless for databases because the amount of data to be rewritten is too large)
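The usual userspace hardening of the rename trick adds explicit fsyncs on both the file and its directory entry, which is the most an application can do from above the abstraction. A sketch (function name is mine; even this is not guaranteed crash-atomic on every filesystem/mode, per the quote above):

```python
import os

def atomic_write(path, data):
    """Best-effort crash-safe replace of `path` with `data`.

    fsync of the file pushes the data to stable storage before the
    rename; fsync of the directory persists the new directory entry.
    Neither step makes the rename itself atomic on crash everywhere.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # data durable before it becomes visible
    os.replace(tmp, path)             # atomic w.r.t. normal operation (POSIX)
    try:
        dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dir_fd)          # persist the rename in the directory
        finally:
            os.close(dir_fd)
    except OSError:
        pass                          # directory fsync unsupported on some platforms
```

Without the directory fsync, the rename can sit in the filesystem journal and be lost on power failure even though the file's data made it to disk.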

Unfortunately, piercing the abstraction is completely essential if you want to achieve high reliability from userspace.

My own experience of this was in a Windows embedded environment on writing to either internal Flash or SD cards, and trying to construct a database (we should have used sqlite, but didn't have room) that was resilient against sudden power off. I discovered all sorts of odd failure modes - "delete file A, write to file B, crash" could result in the write to B persisting but A not being deleted, for example.

Wow I can’t believe I missed the section on this. My bad. Rename not being atomic on crash sounds absolutely terrifying. Luckily for NTFS there is an atomic rename (deprecated but still working).

That said, what kind of non-atomicness occurs is also important. Leaving both the source and destination files around is at least something that can be recovered from, but if there are other failure modes that would be scary!

This is how nearly all “document-based” applications prevent corruption under POSIX too, because almost no application uses per-filesystem logic. I suppose ”prevent corruption with a crash at any point” is simply too high a bar for most apps; they resort to backups to prevent data loss from untimely crashes.

With network or removable media, I know all sorts of madness will happen.

The author seems to have a bit of an unreasonable animosity to that technique, from this article and others of theirs.

It seems to me they could have written much the same article the other way round: starting with a rename-based method, observing that the naive implementation isn't good enough, and going through the steps you need to make it robust in practice. Then they could have put a naive undo log in their "one weird trick" section and claimed it doesn't work.

I think the best rule of thumb is that if your outputs depend only on your inputs, you should aim for some kind of rewrite and atomic replace. If your outputs depend on both new inputs and the previous state, you need the database-like techniques.

A number of applications could be made more reliable if rename was atomic, an acquire barrier, and a release barrier all together.

Incidentally the copy/rename process can run into issues on Windows due to Windows Defender holding a lock on the file which can prevent it being renamed. This is an issue for the Rust updater utility:


File locks and virus checkers. Two things that I'm not missing since moving from Windows.
