
Files Are Fraught with Peril - tambourine_man
https://danluu.com/deconstruct-files/
======
greatgib
I worked on backup solutions and supported Dropbox, and I think the
author used Dropbox as a reference to support his own concerns about
filesystems, concerns that don't really apply to the Dropbox case.

From the filesystem point of view, I don't think Dropbox is that
concerned about you losing your data to corruption, since your files are
supposed to be safe and versioned in the cloud anyway. And their app is
fairly simple; there's no special 'driver' hack involved.

Their real, costly problem is metadata! Different filesystems support
different naming rules, encodings, and character sets for file names.
Each has its own quirks and limitations around extended metadata and
ownership/permission info. Some support file change notification and
some don't, and last-modification timestamps come with different levels
of precision, etc.

So, since Dropbox's ambition is for their app to fully back up and
restore your data, supporting backup from one filesystem and restore to
another, across every possible combination, is a nightmare.

Just a simple example: suppose you have a filename containing the
character "ü". Under Unicode normalization, this can be stored as a
single precomposed character (NFC) or as two code points, "u" plus a
combining diaeresis (NFD). Almost everyone uses the first form, but Mac
HFS+ uses the second. The crazy thing is that if you try to save a file
using the first form on HFS+, it will accept it but silently convert the
filename. So suppose Dropbox restores such a filename in the
single-character form; later, when listing your files to check whether
they are in sync, it will see a different filename (comparing bytes),
not the one it expects. If it weren't smart about this, it might
download the file again and again and again.
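
In Python, the mismatch looks like this (a minimal sketch of the two
normalization forms, not of Dropbox's actual sync logic):

    import unicodedata
    
    # "ü" as one precomposed code point (NFC) vs. "u" plus a combining
    # diaeresis (NFD): same rendered text, different bytes on disk.
    nfc = unicodedata.normalize("NFC", "ü")
    nfd = unicodedata.normalize("NFD", "ü")
    
    print(nfc == nfd)                  # False
    print(nfc.encode(), nfd.encode())  # b'\xc3\xbc' b'u\xcc\x88'
    
    # A sync client comparing names byte-for-byte must normalize first:
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True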

~~~
coldtea
> _From the filesystem point of view, I don't think Dropbox is that
> concerned about you losing your data to corruption, since your files
> are supposed to be safe and versioned in the cloud anyway._

They could reach the cloud already corrupted, and then keep replicating
to your devices, so no...

~~~
moolcool
That's where the "and versioned" comes in handy

~~~
coldtea
Versioning won't help if it's the first copy that got into Dropbox.

~~~
greatgib
If it's the first copy, Dropbox only reads it. There is no write from
Dropbox that could corrupt it...

------
ovi256
This is fascinating, a bit in the same way that looking at accidents is
interesting.

A good synthesis is: filesystem API design is obviously a problem, given
that even people who specialize in using them can't do it correctly:

"Pillai et al., OSDI’14 looked at a bunch of software that writes to files,
including things we'd hope would write to files safely, like datbases and
version control systems ... they found that every single piece of software
they tested except for SQLite in one particular mode had at least one bug ...
programmers who work on things like Leveldb, LBDM, etc., know more about
filesystems than the vast majority of programmers ... they still can't use
files safely every time"

~~~
jandrewrogers
I write database engines for Linux and the filesystem situation really
is a train wreck. It isn't anyone's fault per se; design choices and
standards accreted over many decades, each making sense in isolation in
some context, but in aggregate they have many poorly defined
interactions and create conflicting requirements. And you can't change
any of it easily because there are several decades of software built on
the existing design. Guaranteeing precise, consistent behavior is nearly
impossible with the standard filesystem API infrastructure, and the
implementation details change invisibly in important ways.

This is evident in database engines if you look at the number of lines of code
dedicated to storage. Working with raw block devices requires the fewest lines
of code, working through the filesystem requires the most, and direct I/O
(partial filesystem bypass) is somewhere in the middle. And even if you design
for comparable guarantees across these models for working with storage, the
consistency of behavior across Linux environments also significantly improves
as you bypass more of the filesystem.
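
For illustration, the direct I/O path from Python on Linux looks roughly
like this (a sketch; a real engine queries the actual alignment
requirements instead of assuming 4096 bytes):

    import mmap
    import os
    
    BLOCK = 4096  # assumed alignment; real code queries the device/filesystem
    
    # O_DIRECT bypasses the page cache; buffer, offset, and length must
    # all be block-aligned. An anonymous mmap gives a page-aligned buffer.
    fd = os.open("data.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    buf = mmap.mmap(-1, BLOCK)
    buf.write(b"x" * BLOCK)
    os.write(fd, buf)  # fails with EINVAL if the alignment is wrong
    os.fsync(fd)       # O_DIRECT skips the page cache, not the drive cache
    os.close(fd)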

Ironically, I now tend to borrow the low-level storage interface from database
kernels, which abstracts the filesystem mess, for all code that needs to work
with storage even if it isn't a database. It provides a saner interface with
more consistent guarantees and often better performance. In my ideal world,
someone properly designs a completely new filesystem API from scratch that
sits alongside the legacy APIs that applications could start migrating to. But
it would probably require adverse changes in the way the Linux kernel works
for the legacy path, which means it will never happen.

~~~
brandmeyer
Very few applications need random read/write access to files. Most of the
time, you need to read an entire file in, or write an entire file out via the
streaming access APIs. This core fact of typical usage is why I think so many
application developers have naive expectations about filesystem behavior.

Read-only random-access is well served by mmap() and pread().
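
For example, with os.pread in Python (the file name and offsets here are
arbitrary):

    import os
    
    fd = os.open("data.bin", os.O_RDONLY)
    # pread reads at an absolute offset without moving the file position,
    # so concurrent readers sharing the fd don't race on seek + read.
    chunk = os.pread(fd, 4096, 8192)  # 4 KiB at offset 8 KiB
    os.close(fd)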

For random write access within a file, preadv2() and pwritev2() could be
augmented with additional flags RWF_ACQUIRE and RWF_RELEASE. That's Linux-
specific, but it could give database developers the ability to separate
ordering with barriers from flushing with fsync. But perhaps I'm being naive.
My assumption is that database developers are using flushes in order to get
the barriers that they really want.

~~~
ChrisSD
I would only add that mmap should be used with care[0] and pread should be
preferred.

[0]: [https://www.sublimetext.com/blog/articles/use-mmap-with-care](https://www.sublimetext.com/blog/articles/use-mmap-with-care)

~~~
jcranmer
It should be noted that many of the problems with mmap cited there
aren't actually mmap's fault, but rather the fact that I/O errors become
POSIX signals, and the API for POSIX signals really blows. (POSIX
signals are probably even more ripe than filesystems for needing a
different approach.)

I'm surprised that page didn't mention the other issue with mmap, which is
concurrent file access.

~~~
wahern
How would you signal errors accessing virtual memory?

In the context of a VM error, the only deficiency with POSIX signals I can
think of off-hand is that POSIX only permits global signal handlers--not per
thread--but that's only an issue for the multithreaded case.

~~~
jcranmer
I'm partial to something more akin to SEH for the synchronous signals (SIGILL,
SIGFPE, SIGSEGV, SIGBUS, SIGTRAP). Basically, define that synchronous signals
are handled per-thread in a manner similar to try/catch, although you need an
extra catch type that amounts to "retry the operation" in addition to
"rethrow" and "swallow the exception".

~~~
wahern
Can you restart execution at the point of the fault using SEH? A brief skim of
the documentation doesn't seem to suggest it's possible as a general matter.
If I wanted to implement a dynamically growable stack structure in a way that
let me resume at the point of the fault, preserving program state, how would I
do that?

I ask because I think sometimes people conflate POSIX signals offering poor
semantics (i.e. reentrancy issues) with POSIX signals being too low-level. I
can imagine how I might implement SEH using POSIX signals (though a per-thread
handler would be really nice), but not vice-versa (though maybe it is
possible).

As I understand it, there's a long history behind signals relating to
interrupt vs polling software system models. Signals as they exist in Unix
were a _very_ _early_ implementation of the interrupt driven model in the
context of a kernel<->user space interface. But Unix's process and I/O
concepts were perhaps too convenient so the value-add of the signals model was
minimal; Unix ended up evolving in directions that didn't require the
interrupt abstraction (at least, not until decades later). This history
explains why POSIX signals are so low-level and the lack of comprehensive
runtime treatment.

Note that they're only low-level by today's standards. At the time they were
very high-level--a signal interrupt magically preserved your process state
(stack and program counter) and would magically resume the process when
returning from the handler, which could be a simple C function. Even better,
this could occur recursively! And it's worth mentioning that all kernel
interfaces were and remain async-signal safe (e.g. dup2 is atomic even from
the perspective of an async signal).[1] The lack of consistent and convenient
treatment by the runtime comes from the fact that much of the runtime we're
most familiar with came later; threads came _way_ later. When they came about
people had already moved away from signals, perhaps because they saw that it
was too much work to make the interrupt driven model work well at a high-
level.

[1] In classic Unix style it did all this with the most minimal of kernel and
user space code, pushing process state onto the user space stack and relying
on an in-process trampoline to restore program state (which is how recursion
could be supported without any complexity in kernel space).

------
patrec
Does anyone happen to know if ZFS also suffers from renames being
non-atomic? Since ZFS is the only candidate for a reasonable file system
we have anyway[+], I'd be a lot less sad if it turns out that the way
out of this dumpster fire is just to tell people to use ZFS.

[+] The absolutely minimal requirement for a reasonable file system is
that it works on multiple popular platforms and performs checksumming
and other data integrity measures. And that's before we even get to
still highly desirable stuff like snapshotting, encryption, etc.

~~~
Filligree
ZFS never reorders metadata operations in an observable manner, so it's
_really_ well behaved. If only every filesystem was like that...

You still need fsync for the data, but you can set sync=disabled on the
filesystem, which turns it into a barrier. Alas, you can't do anything more
granular than per-filesystem.

~~~
patrec
Does this idiom (which turns out to be broken on rename op failure for most
FSs) work reliably with ZFS (on the same filesystem)?

    
    
    import os
    
    def atomically_write(filename, data):
        with open(filename + '.tmp', 'w') as fh:
            fh.write(data)
        os.rename(filename + '.tmp', filename)

In addition to ordering guarantees, you also need to guarantee that even in
case of failure the rename operation leaves the to-be-renamed file as it is
and the to-be-renamed-to file non-existent – is that the case?

> If only every filesystem was like that...

I wonder if the solution for databases and other _server_ software that
needs to persist data reliably is to just pretend all filesystems are
well behaved and insist on end-users sticking to non-broken ones (most
probably ZFS, assuming that set is currently non-empty). The
alternative, wasting endless time and resources on Sisyphean quests to
placate crap filesystems, seems to have been an utter failure so far.

~~~
Filligree
I'm actually not sure whether that code would work, but I don't think
it would. You need at least one fsync in there, before the rename.
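
A sketch of the idiom with the commonly recommended fsyncs added
(whether the final directory fsync is also required still varies by
filesystem):

    import os
    
    def atomically_write(filename, data):
        tmp = filename + '.tmp'
        with open(tmp, 'w') as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # data is durable before the rename
        os.rename(tmp, filename)
        # Persist the rename itself by fsyncing the containing directory:
        dirfd = os.open(os.path.dirname(os.path.abspath(filename)), os.O_RDONLY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)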

------
brandmeyer
> [ext4 data= mount option] writeback: Data ordering is not preserved – data
> may be written into the main filesystem after its metadata has been
> committed to the journal. This is rumoured to be the highest-throughput
> option. It guarantees internal filesystem integrity, however it can allow
> old data to appear in files after a crash and journal recovery.

Not just old data from the same file, either. In the case of file appending,
"old" can actually mean "whatever junk happened to be in that allocated disk
block earlier".

------
quickquestion42
Where can I learn more about the rename() trick not being safe?

I was also (incorrectly) assuming that renaming a file after having fsynced
the contents is a valid mechanism to perform "atomic" file writes where the
write either appears completely at the destination path, creates a zero byte
file or does not happen at all. But it should never produce an "incorrect"
non-zero byte file - even in the face of crashes.

It is definitely something that is used in the wild so it was surprising to me
that it is not correct. I even found a recent LWN article suggesting
otherwise, so there seems to be a lot of confusion on this topic.

Would love to learn more if anybody has detailed info or a link to the
relevant mailing list thread or similar.

EDIT: Glibc documentation also implies that the "rename trick" is indeed safe:

From [https://www.gnu.org/software/libc/manual/html_node/Renaming-Files.html](https://www.gnu.org/software/libc/manual/html_node/Renaming-Files.html)

> One useful feature of rename is that the meaning of newname changes
> “atomically” from any previously existing file by that name to its new
> meaning (i.e., the file that was called oldname). There is no instant at
> which newname is non-existent “in between” the old meaning and the new
> meaning. If there is a system crash during the operation, it is possible for
> both names to still exist; but newname will always be intact if it exists at
> all.

Another relevant source that suggests the trick is safe is
[https://lwn.net/Articles/327601/](https://lwn.net/Articles/327601/).
Keep in mind that Ted Ts'o is the ext4 maintainer.

> For the longer term, Ted asked: should the above-described fixes become a
> part of the filesystem policy for Linux? In other words, should application
> developers be assured that they'll be able to write a file, rename it on top
> of another file, omit fsync(), and not encounter zero-length files after a
> crash? The answer turns out to be "yes," but first Ted presented his other
> long-term ideas.

~~~
tfha
We wrote a program where atomicity was mission critical, and tested the hell
out of power loss recovery. I can confirm experimentally that the rename trick
is not good enough.

It was nearly good enough on Linux. Corruption was very rare, but we still
caught issues. On Windows corruption happened something like 1 in 10
poweroffs.

For low performance ACID we write the whole file out twice and fsync between
each write. First write is to the backup file, second is to the primary file.
This method passed our tests.
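
My reconstruction of that pattern as described (a sketch; a real
implementation also needs a checksum or validity marker so recovery can
tell which copy is intact):

    import os
    
    def write_durable(path, data):
        # Backup first, primary second, with an fsync after each write.
        # Whatever the crash point, at least one complete copy survives,
        # and startup code restores the primary from the backup if needed.
        for target in (path + '.bak', path):
            with open(target, 'wb') as fh:
                fh.write(data)
                fh.flush()
                os.fsync(fh.fileno())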

For high performance, we use another strategy entirely that's a lot more
involved.

~~~
d0mine
> For high performance, we use another strategy entirely that's a lot more
> involved.

It sounds interesting. Is there a write-up with more details to read?

------
_bxg1
I appreciate the desire for stripped-down websites, but "body { max-width:
800px; }" makes a world of difference for readability.

~~~
NoodleIncident
Firefox reader mode is pretty useful in these cases, and can also save you
from websites with the opposite problem.

~~~
_bxg1
True, but sadly Chrome still lacks a reader mode for whatever reason

~~~
kencausey
[https://add0n.com/chrome-reader-view.html](https://add0n.com/chrome-reader-view.html)

I've been using it for a while and it seems to be solid.

~~~
_bxg1
I don't really trust any browser extensions at all these days

------
GaurVimen
Oh god, I didn't realize how broken filesystems are. Shit.

~~~
yjftsjthsd-h
If it helps, just about everything is like that if you look closely.
Processors have side channel attacks, RAM has rowhammer which recently turned
out to be a real thing, digital electronics in general turn out to have analog
side effects, time and space are both basically impossible for computers to
represent precisely (see: falsehoods programmers believe about *). We
should do what we can, but life goes on :)

~~~
noir_lord
I'm amazed every time a machine boots.

Given the thousands of things that could cause it not to, it's
incredible that it makes it down the happy path each time.

~~~
jacobush
Yep - ever since they moved away from ROMs, I am too.

Amiga, Atari ST and early Macs were sort of in a twilight: they ran the
core OS from ROM but loaded utilities and sometimes patches from
disk(ette).

------
mark-r
> we should expect to see data corrupiton all the time.

I wonder if the typo was deliberate?

------
Jwarder
> If we look at a worn out drive, one very close to end-of-life, it's specced
> to retain data for one year to three months, depending on the class of
> drive.

What happens once the data expires? Does the SSD return an error when the data
is read, or does it read the bogus data without knowing?

I would kind of prefer the drive bricking itself rather than risking silently
backing up bogus data. The earlier comment about ECC and bit error rates
suggests bad reads are identified as such, but I'm not sure how far to trust
that given that, as mentioned, I/O is hard.

~~~
mastax
Error detection is inherently probabilistic, so there's always the chance that
bogus data is read without detection. SSDs use multiple levels of error
correction where each level is slower but more reliable than the previous.
Such a scheme could only work if the error detection ability of each level
were much greater than its error correction ability. I wouldn't rely on SSD
firmware to do anything in particular, though. Your best bet is to monitor
SMART stats about error rates. If there's a high error rate some of your
sectors may be reading bogus data without you being able to detect it.

------
jorangreef
The article is a great survey of gotchas across the storage stack.

However, I think what motivated Dropbox to drop so many Linux
filesystems was the need for a certain flavor of xattrs, specifically to
detect renames, i.e. to merge create/delete events into renames.

Perhaps Dropbox adds internal UIDs via xattrs to every file you add to a
Dropbox folder. If you move the file around, no problem, Dropbox can detect
the rename via the xattr UID. Relying on heuristics alone to do rename
detection can be brittle if you don't do it right, and missing rename events
means file version histories get lost, which is a terrible shock for users.
Imagine you suddenly can't find last month's version of a file...
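
On Linux, that scheme would look something like this (a guess at the
approach; the attribute name is made up):

    import os
    import uuid
    
    ATTR = 'user.sync_uid'  # hypothetical attribute name
    
    def file_uid(path):
        try:
            return os.getxattr(path, ATTR)
        except OSError:  # not tagged yet
            uid = uuid.uuid4().bytes
            os.setxattr(path, ATTR, uid)
            return uid
    
    # A file appearing at a new path with an already-known UID is a
    # rename, not a delete plus a create of unrelated content.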

So, just guessing here, but probably those Linux filesystems lacked
performant xattr access and were too slow for Dropbox to do hundreds of
thousands of xattr lookups when scanning a folder at startup or every N
minutes, or whenever inotify tells them something changed.

They didn't want the complexity of heuristics, or they found a way to exploit
EXT on-disk data structures for faster xattr lookups, something like that. Who
knows? Perhaps the real reason is less complicated than that.

------
auggierose
iOS 13 changes a lot about how files work, also internally, so implementing a
correct File Provider is going to be quite hard (even the iCloud one doesn't
work properly yet in the beta ...). I hope Dropbox and the others follow suit
and implement this correctly. Just another example of how files are difficult
to get right.

------
networkimprov
The suggestion to "use SQLite" to write files should come with a caveat.
Such files can still become corrupted, and they cannot be inspected or
amended with a text editor.

Any truly robust file-on-local-disk storage scheme for a non-cloud system
should allow for manual diagnostics and repair.

~~~
harperlee
Text-based formats are the best for repairability but not so efficient for
structuring, querying and storing information. It’s a tradeoff, as always!

~~~
networkimprov
A search index can be maintained in binary format; it can be rebuilt in the
event of corruption.

~~~
harperlee
That’s not a good counterexample, if that is what you attempted, for
two reasons:

- An index is redundant information, and dropping and recreating it is
not “repairability” in the sense of “I read and amend a text”. You’re
comparing apples to oranges.

- What I wrote still applies to an index, so that’s orthogonal: a binary
index will be more efficient but harder to repair than a text-based
index.

------
dang
There's also this from 2015:
[https://news.ycombinator.com/item?id=10725859](https://news.ycombinator.com/item?id=10725859)

------
bjhkx
From reading the first paragraph it sounds like the author will explain
at great length why Dropbox found it hard to support filesystems other
than ext4.

But then you read the article, and it goes on to explain how dealing
with files is hard and how data corruption and loss sometimes occur
(fair enough), but nothing filesystem-specific that could explain why
ext4 is superior (or special) and had to be chosen.

So it reads kind of like an excuse for Dropbox, only it isn't one.

~~~
ChrisSD
ext4 is by far the most popular filesystem so supporting it is a necessity for
that reason alone. Nothing to do with it being superior or special.

The argument this article is making is that supporting additional
filesystems is hard. This is meant to refute the allegation that it's
trivial to add whatever other filesystems the OS supports.

~~~
bjhkx
No, that's the argument this article is supposedly making. But it's not
actually making it. That's the problem. Filesystems are an abstraction and I
expected to see some problems regarding a leaky abstraction or something, but
the article doesn't mention anything like that.

It mentions that dealing with hard disks is hard (which I believe,
since they are flaky hardware). But Dropbox didn't say "we won't be
supporting this kind of hard disk/hard disk controller"; they said "we
won't be supporting these filesystems". Where's the proof that those
filesystems have problems that ext4 doesn't have?

~~~
yorwba
From the article: "Large parts of the file API look like this, where behavior
varies across filesystems or across different modes of the same filesystem.
For example, if we look at mainstream filesystems, appends are atomic, except
when using ext3 or ext4 with data=writeback, or ext2 in any mode and directory
operations can't be re-ordered w.r.t. any other operations, except on btrfs."

That doesn't mean ext4 is problem-free, but it does mean that other
filesystems have _different_ problems that are not fixed by mitigations for
ext4's quirks.

~~~
bjhkx
Okay and? Why does userspace care about this? Should Emacs not run on btrfs?

~~~
pjc50
This is rather like the bluescreen problem: if your application tries to open
a file from the normal filesystem and it's corrupt, the user blames the
application. If the user opens a file in the Dropbox folder, they blame
_Dropbox_. So they end up engaging in heroics to not be blamed for it.

(Windows has gone to increasing lengths to accommodate and contain
badly written drivers, since most bluescreens are caused by drivers.
There is now a subsystem that allows the video driver to crash and
restart entirely without bluescreening.)

~~~
namibj
Which prevents any CUDA kernel from running longer than about 5
seconds. Which means you can't use the GPU to spawn its own kernels with
no PCIe/driver latency in between, because the master kernel has to
finish before Windows kills it.

Last I checked, providing callback function pointers to binary vendor
libraries (read/write adapters for FFT come to mind, allowing on-the-fly
metric computation or skipping intermediate storage for FFT convolution)
was only possible on Linux, and only with said vendor library statically
linked into the software (incidentally breaking binary distributability
for GPL).

------
jchw
I’d have much more sympathy if they didn’t already support more
filesystems. Now I’m no expert, but I’m guessing dropping support mostly
amounted to e2e testing only on ext4 and refusing to operate, or
warning, on everything else.

This all feels a little silly. Yes sure, files are hard. If they weren’t I
probably wouldn’t need to pay someone to solve the problem. But I think
solving the file syncing problem for just ext4 is worse than bad for your
customers because now you’re taking the whole ecosystem with you. Imagine
users ditching btrfs or zfs because of Dropbox.

Now nothing and nobody is perfect and I wish Dropbox the best, but I miss the
days when it felt like they cared about Linux and truly focused on solving the
core problem effectively. They _differentiated_ on sheer quality. I’ve moved
on from file syncing to a more elaborate local NAS configuration and I could
never go back, but I did get many good years out of Dropbox.

I’m glad Dropbox did at least walk back the filesystem compatibility issue a
bit. Hopefully it’s a sign of better times to come for the Linux client.

------
Annatar
This is hard only on GNU/Linux, which is inherently broken because
solving this once and for all proved too hard a problem for the
volunteers working on the kernel and filesystem code. That's a hard fact
to swallow if one's favorite operating system is GNU/Linux. There is one
group of people who did solve it, and they solved it correctly: the ZFS
team under Jeff Bonwick.

If you care about data integrity, if you need an operating system where
filesystem operations aren't leaky abstractions and where fsync(2) works
as POSIX specifies, if correctness of operation is important to you, use
an illumos-based operating system like SmartOS or any other based on the
illumos codebase. Put some effort in and learn real UNIX, and leave
these long-ago-solved problems where they belong: back in the past
century.

------
alkonaut
At the risk of sounding like those /r/programming replies: why would an
application be worried about journaling/logs at the filesystem level? As
an application developer, all I’m usually told is that the only atomic
operations are creates, deletes, and renames. So to update a file you
always write a second file and then rename it to the destination.

So:

1. Copy file.txt to file.txt.new
2. Update file.txt.new
3. Rename file.txt to file.txt.old
4. Rename file.txt.new to file.txt

This is "safe" in the sense that, if the process is terminated, the
point where it failed can be determined and the application itself can
resume or roll back the update _on its next run_.
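
A sketch of that application-level recovery (fsyncs omitted, which is
exactly the gap the article is about):

    import os
    import shutil
    
    def update(path, new_data):
        shutil.copy2(path, path + '.new')      # 1. copy
        with open(path + '.new', 'w') as fh:   # 2. update the copy
            fh.write(new_data)
        os.rename(path, path + '.old')         # 3. move original aside
        os.rename(path + '.new', path)         # 4. move update into place
        os.remove(path + '.old')               # cleanup
    
    def recover(path):
        # On the next run, the surviving files show where we crashed.
        if not os.path.exists(path) and os.path.exists(path + '.old'):
            os.rename(path + '.old', path)   # died between steps 3 and 4
        elif os.path.exists(path + '.old'):
            os.remove(path + '.old')         # died after step 4; finish cleanup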

What it doesn’t guarantee is that the OS/filesystem provides this
rollback independently of the application.

My question is: does Dropbox have any reason to want to work on a lower level
than the “normal” high level where you only manually rollback or resume
transactions? Do other applications also do this? I always felt that trying to
pierce the abstraction of high level FS APIs was unnecessary unless you are
writing drivers or file systems.

Some low-level programs (antivirus, backups) I can see why they would
need to peek under the hood, but to me Dropbox is a pretty dumb file
sync program that shouldn’t need complex FS operations the way, say, a
backup program might. Is it more complex than I give it credit for?

~~~
pjc50
This is specifically addressed at one point:

> This trick doesn't work. People seem to think that this is safe
> because the POSIX spec says that rename is atomic, but that only means
> rename is atomic with respect to normal operation; it doesn't mean
> it's atomic on crash. This isn't just a theoretical problem; if we
> look at mainstream Linux filesystems, most have at least one mode
> where rename isn't atomic on crash. Rename also isn't guaranteed to
> execute in program order, as people sometimes expect.

> The most mainstream exception where rename is atomic on crash is
> probably btrfs, but even there, it's a bit subtle -- as noted in
> Bornholt et al., ASPLOS’16, rename is only atomic on crash when
> renaming to replace an existing file, not when renaming to create a
> new file. Also, Mohan et al., OSDI’18 found numerous rename atomicity
> bugs on btrfs, some quite old and some introduced the same year as the
> paper, so you do not want to rely on this without extensive testing,
> even if you're writing btrfs-specific code.

(also, this trick is almost completely useless for databases because the
amount of data to be rewritten is too large)

Unfortunately, piercing the abstraction is completely essential if you want to
achieve high reliability from userspace.

My own experience of this was in a Windows embedded environment on writing to
either internal Flash or SD cards, and trying to construct a database (we
should have used sqlite, but didn't have room) that was resilient against
sudden power off. I discovered all sorts of odd failure modes - "delete file
A, write to file B, crash" could result in the write to B persisting but A not
being deleted, for example.

~~~
alkonaut
Wow, I can’t believe I missed the section on this. My bad. Rename not
being atomic on crash sounds absolutely terrifying. Luckily, NTFS has an
atomic rename (deprecated but still working).

That said, what kind of non-atomicity occurs is also important. Leaving
both the source and destination files around is at least something that
can be recovered from, but if there are other failure modes, that would
be scary!

This is how nearly all “document-based” applications prevent corruption
under POSIX too, since almost no application uses per-filesystem logic.
I suppose “prevent corruption given a crash at any point” is simply too
high a bar for most apps; they resort to backups to prevent data loss
from untimely crashes?

With network or removable media, I know all sorts of madness will
happen.

