From the filesystem point of view, I don't think Dropbox is all that concerned about you losing your data to corruption, since the data is supposed to be safe and versioned in the cloud anyway.
And I think their app is very simple in that respect, i.e. no specific 'driver'-level hacks.
Their real costly problem is related to metadata!
Different file systems support different naming rules, encodings, and character sets for file names.
They also each have their own quirks and limitations around support for extended metadata and ownership/permission information.
Some support file change notification and some don't; some report a valid last-modification timestamp but with varying sub-second precision; etc...
So, since Dropbox's ambition is that their app can fully back up and restore your system data, supporting backup from one filesystem and restore to another, i.e. the full combinatorial explosion of cases, is a nightmare.
Just a simple example: suppose you have a filename containing the character "ü". Under Unicode normalization, this can be stored either as a single precomposed character or as two code points, "u" plus a combining diaeresis. Almost everyone uses the first form, but Mac HFS+ uses the second.
The crazy thing is that if you try to save a file under the first form on HFS+, it will accept the name but silently convert it.
So, suppose Dropbox restores such a filename using the single-character form. Later, when listing your files to check whether they are in sync, it will see a different filename (comparing filename bytes), not the one it expects. A client that wasn't smart about this might download the file again and again and again.
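The two representations are easy to see with Python's `unicodedata` module (a minimal illustration, not Dropbox's actual code):

```python
import unicodedata

# U+00FC is the precomposed "ü"; NFD decomposes it into
# "u" (U+0075) plus a combining diaeresis (U+0308).
nfc = unicodedata.normalize("NFC", "\u00fc")
nfd = unicodedata.normalize("NFD", "\u00fc")

print(nfc == nfd)            # False: different code-point sequences
print(len(nfc), len(nfd))    # 1 2
print(nfc.encode("utf-8"))   # b'\xc3\xbc'
print(nfd.encode("utf-8"))   # b'u\xcc\x88'
```

A sync client that compares names byte-for-byte will treat these as two different files unless it normalizes first.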
> I believe their hand was forced by the way they want to store/use data, which they can only do with ext, but even if that wasn't the case...
The Dropbox examples seem to have clouded the issues actually being discussed.
They claim it’s because not all file systems are case sensitive, but they could at least keep the original case and use it when possible.
Then again, OneDrive syncing breaks on a whole new level. If you upload a docx file, they'll open it and edit the metadata, so the file you added will be different from the one you get back (they actually do it to images and PDFs too).
They could get to the cloud corrupted, and then continue replicating to your devices, so no...
Cross platform issues (unsupported characters across operating systems and file systems):
Edit: Inspired by the Unison spec (published in 2004), which may be why we get conflict files etc. when using Dropbox
A good synthesis is: filesystem API design is obviously a problem, given that people who specialize in using them can't do it correctly:
"Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope would write to files safely, like databases and version control systems ... they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug ... programmers who work on things like LevelDB, LMDB, etc., know more about filesystems than the vast majority of programmers ... they still can't use files safely every time"
This is evident in database engines if you look at the number of lines of code dedicated to storage. Working with raw block devices requires the fewest lines of code, working through the filesystem requires the most, and direct I/O (partial filesystem bypass) is somewhere in the middle. And even if you design for comparable guarantees across these models for working with storage, the consistency of behavior across Linux environments also significantly improves as you bypass more of the filesystem.
Ironically, I now tend to borrow the low-level storage interface from database kernels, which abstracts the filesystem mess, for all code that needs to work with storage even if it isn't a database. It provides a saner interface with more consistent guarantees and often better performance. In my ideal world, someone properly designs a completely new filesystem API from scratch that sits alongside the legacy APIs that applications could start migrating to. But it would probably require adverse changes in the way the Linux kernel works for the legacy path, which means it will never happen.
Read-only random-access is well served by mmap() and pread().
For random writes access within a file, preadv2() and pwritev2() could be augmented with additional flags RWF_ACQUIRE and RWF_RELEASE. That's Linux-specific, but it could give database developers the ability to separate ordering with barriers from flushing with fsync. But perhaps I'm being naive. My assumption is that database developers are using flushes in order to get the barriers that they really want.
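Linux already exposes per-write flags through this interface; the existing `RWF_DSYNC` flag (kernel 4.7+) makes a single write durable without penalizing every other write on the descriptor, which hints at how the proposed barrier flags would slot in. A sketch via Python's `os.pwritev` wrapper around `pwritev2()` (the `RWF_ACQUIRE`/`RWF_RELEASE` flags above are hypothetical and do not exist):

```python
import os

fd = os.open("demo.bin", os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
try:
    # os.pwritev wraps pwritev2(); RWF_DSYNC makes this one write
    # synchronous (as if the fd were opened with O_DSYNC).
    # The proposed RWF_ACQUIRE/RWF_RELEASE ordering barriers would be
    # passed through this same flags argument.
    flags = getattr(os, "RWF_DSYNC", 0)
    try:
        n = os.pwritev(fd, [b"record-1"], 0, flags)
    except OSError:
        # Some filesystems reject per-write flags; fall back plainly.
        n = os.pwritev(fd, [b"record-1"], 0)
    print("wrote", n, "bytes")
finally:
    os.close(fd)
    os.unlink("demo.bin")
```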
I'm surprised that page didn't mention the other issue with mmap, which is concurrent file access.
In the context of a VM error, the only deficiency with POSIX signals I can think of off-hand is that POSIX only permits global signal handlers--not per thread--but that's only an issue for the multithreaded case.
I ask because I think sometimes people conflate POSIX signals offering poor semantics (i.e. reentrancy issues) with POSIX signals being too low-level. I can imagine how I might implement SEH using POSIX signals (though a per-thread handler would be really nice), but not vice-versa (though maybe it is possible).
As I understand it, there's a long history behind signals relating to interrupt vs polling software system models. Signals as they exist in Unix were a very early implementation of the interrupt driven model in the context of a kernel<->user space interface. But Unix's process and I/O concepts were perhaps too convenient so the value-add of the signals model was minimal; Unix ended up evolving in directions that didn't require the interrupt abstraction (at least, not until decades later). This history explains why POSIX signals are so low-level and the lack of comprehensive runtime treatment.
Note that they're only low-level by today's standards. At the time they were very high-level--a signal interrupt magically preserved your process state (stack and program counter) and would magically resume the process when returning from the handler, which could be a simple C function. Even better, this could occur recursively! And it's worth mentioning that all kernel interfaces were and remain async-signal safe (e.g. dup2 is atomic even from the perspective of an async signal). The lack of consistent and convenient treatment by the runtime comes from the fact that much of the runtime we're most familiar with came later; threads came way later. When they came about people had already moved away from signals, perhaps because they saw that it was too much work to make the interrupt driven model work well at a high-level.
 In classic Unix style it did all this with the most minimal of kernel and user space code, pushing process state onto the user space stack and relying on an in-process trampoline to restore program state (which is how recursion could be supported without any complexity in kernel space).
It's an API that's so critical it's almost impossible to change or rewrite. The API itself has barely moved in 30 years, whether it's POSIX or Win32. Possibly the only widely adopted change in filesystem API has been "S3" and compatibles, which provide a completely different set of atomicity semantics as well as being network-native.
I would love if operating systems exposed a local object storage syscall API (with object versioning and all that good stuff.)
It could be implemented on top of the filesystem for all I care, as long as it's an abstraction with safe object-storage semantics, enforced by the kernel and exposed to all processes without a need for library support.
I believe that processes that wanted to "base" themselves entirely on object-storage would still need to touch the filesystem abstraction as well, mostly to allocate temporary on-disk "buffers" to gradually write to before submitting them to the object-store ABI as new object bodies. But:
1. Most such buffers would be small enough that you could get away with using an anonymous mmap(2) instead, keeping the file "on disk" in the page file rather than in the filesystem itself.
2. For objects you're writing to by streaming, with an unbounded eventual size, you could do the same trick Google Cloud Storage does: allocate a series of fixed-size "chunk" objects to receive the stream data, closing one and opening the next as each previous chunk gets "filled"; and then expose an API call to concatenate such chunk objects together on the object-storage kernel "backend" into single files (probably in O(1) time, because at a low level it's just concatenating disk extent lists.)
3. For other unbounded-size buffers, you could also have the object-store-kernel-daemon provide an API where it manages "large durable working copy" files for you, sort of "checking out" objects into file descriptors (probably using copy-on-write file clones on the backend), then "checking in" file descriptors to become new versions of those objects (maybe even "helpfully" avoiding doing so if the buffer hasn't been touched.)
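Point 1 is easy to demonstrate: Python's `mmap` module exposes anonymous mappings directly (a toy sketch of such a swap-backed staging buffer, not any real object-store API):

```python
import mmap

# fd -1 requests an anonymous mapping: memory backed by swap /
# the page cache, with no file ever created in the filesystem.
buf = mmap.mmap(-1, 4096)

buf.write(b"staged object body")  # stage data for a future object submit
buf.seek(0)
print(buf.read(18))               # b'staged object body'
buf.close()
```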
That's the situation we are in with filesystem APIs now.
[+] The absolutely minimal requirement for a reasonable file system is that it works on multiple popular platforms and performs checksumming and other data integrity measures. And that's before we even get to still highly desirable stuff like snapshotting, encryption, etc.
You still need fsync for the data, but you can set sync=disabled on the filesystem, which turns it into a barrier. Alas, you can't do anything more granular than per-filesystem.
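For reference, that per-filesystem knob is the ZFS `sync` dataset property (the dataset name here is made up):

```shell
# Disable synchronous write semantics for one dataset only; with
# sync=disabled, fsync() returns immediately and effectively acts
# as an ordering barrier rather than a flush to stable storage.
zfs set sync=disabled tank/scratch

# Inspect the current setting
zfs get sync tank/scratch
```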
import os

def atomically_write(filename, data):
    with open(filename + '.tmp', 'w') as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())
    os.rename(filename + '.tmp', filename)
> If only every filesystem was like that..
I wonder if the solution for databases and other server software that needs to persist data reliably is to just pretend they are and insist on end users sticking to non-broken filesystems (most probably ZFS, assuming the set is currently non-empty). The alternative of wasting endless time and resources on Sisyphean quests to placate crap filesystems seems to have been an utter failure so far.
Not just old data from the same file, either. In the case of file appending, "old" can actually mean "whatever junk happened to be in that allocated disk block earlier".
I was also (incorrectly) assuming that renaming a file after having fsynced the contents is a valid mechanism to perform "atomic" file writes where the write either appears completely at the destination path, creates a zero byte file or does not happen at all. But it should never produce an "incorrect" non-zero byte file - even in the face of crashes.
It is definitely something that is used in the wild so it was surprising to me that it is not correct. I even found a recent LWN article suggesting otherwise, so there seems to be a lot of confusion on this topic.
Would love to learn more if anybody has detailed infos or a link to the relevant mailing list thread or similar.
EDIT: Glibc documentation also implies that the "rename trick" is indeed safe:
> One useful feature of rename is that the meaning of newname changes “atomically” from any previously existing file by that name to its new meaning (i.e., the file that was called oldname). There is no instant at which newname is non-existent “in between” the old meaning and the new meaning. If there is a system crash during the operation, it is possible for both names to still exist; but newname will always be intact if it exists at all.
Another relevant source that suggests the trick is safe is https://lwn.net/Articles/327601/. Keep in mind that Ted Tso is the ext4 maintainer.
> For the longer term, Ted asked: should the above-described fixes become a part of the filesystem policy for Linux? In other words, should application developers be assured that they'll be able to write a file, rename it on top of another file, omit fsync(), and not encounter zero-length files after a crash? The answer turns out to be "yes," but first Ted presented his other long-term ideas.
It was nearly good enough on Linux. Corruption was very rare, but we still caught issues. On Windows corruption happened something like 1 in 10 poweroffs.
For low performance ACID we write the whole file out twice and fsync between each write. First write is to the backup file, second is to the primary file. This method passed our tests.
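A minimal sketch of that double-write scheme (the `.bak` naming and the lack of a recovery/checksum step are my simplifications, not the commenter's exact code):

```python
import os

def write_sync(path: str, data: bytes) -> None:
    """Write one complete copy and force it to stable storage."""
    with open(path, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())

def acid_write(primary: str, data: bytes) -> None:
    # Backup first: if we crash while rewriting the primary,
    # the backup already holds a complete copy of the new data.
    write_sync(primary + ".bak", data)
    write_sync(primary, data)

acid_write("state.dat", b"v2")
print(open("state.dat", "rb").read())  # b'v2'
```

On startup, a recovery pass would detect a torn primary (e.g. via a checksum) and restore it from the backup.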
For high performance, we use another strategy entirely that's a lot more involved.
It sounds interesting. Is there a write-up with more details to read?
I've been using it for a while and it seems to be solid.
Given the thousands of things that could cause it not to, it's incredible it makes it down the happy path each time.
Amiga, Atari ST and early Macs were sort of in a twilight - they ran the core OS from ROM but loaded utilities and sometimes patches from disk (-ette).
I wonder if the typo was deliberate?
What happens once the data expires? Does the SSD return an error when the data is read, or does it read the bogus data without knowing?
I would kind of prefer the drive bricking itself rather than risking silently backing up bogus data. The earlier comment about ECC and bit error rates suggests bad reads are identified as such, but I'm not sure how far to trust that given that, as mentioned, I/O is hard.
However, I think what motivated Dropbox to drop so many Linux filesystems was the need for a certain flavor of xattrs, specifically to detect renames, i.e. to merge create/delete events into renames.
Perhaps Dropbox adds internal UIDs via xattrs to every file you add to a Dropbox folder. If you move the file around, no problem, Dropbox can detect the rename via the xattr UID. Relying on heuristics alone to do rename detection can be brittle if you don't do it right, and missing rename events means file version histories get lost, which is a terrible shock for users. Imagine you suddenly can't find last month's version of a file...
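On Linux that tagging idea looks roughly like the following (the attribute name `user.sync.uid` is made up, and the call is guarded because not every filesystem supports user xattrs):

```python
import os
import uuid

path = "tracked.txt"
open(path, "w").close()

uid = uuid.uuid4().hex.encode()
try:
    # Stamp the file with a stable identity.
    os.setxattr(path, "user.sync.uid", uid)
    os.rename(path, "tracked-renamed.txt")
    # The tag travels with the inode, so a create/delete pair seen by
    # the watcher can be matched back up into a single rename.
    assert os.getxattr("tracked-renamed.txt", "user.sync.uid") == uid
except OSError:
    pass  # e.g. a filesystem without user.* xattr support
print("ok")
```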
So, just guessing here, but probably those Linux filesystems lacked performant xattr access and were too slow for Dropbox to do hundreds of thousands of xattr lookups when scanning a folder at startup, every N minutes, or whenever inotify tells them something changed.
They didn't want the complexity of heuristics, or they found a way to exploit EXT on-disk data structures for faster xattr lookups, something like that. Who knows? Perhaps the real reason is less complicated than that.
Any truly robust file-on-local-disk storage scheme for a non-cloud system should allow for manual diagnostics and repair.
- An index is redundant information but dropping it and recreating it is not “repairability” as in “I read and amend a text”. You compare apples to oranges.
- What I wrote still applies to an index, so that's orthogonal: a binary index will be more efficient but harder to repair than a text-based index.
But then you read the article and it goes on to explain how dealing with files is hard and data corruption and loss sometimes occur (fair enough), but nothing filesystem-specific that could explain why ext4 is superior (or special) and had to be chosen.
So it reads kind of like an excuse for Dropbox, only that it isn't.
The argument this article is making is that supporting additional filesystems is hard. It is meant to refute the allegation that it's trivial to add whatever other filesystems the OS supports.
It mentions that dealing with hard disks is hard (which I believe, since they are flaky hardware). But Dropbox didn't say "we won't be supporting this kind of hard disk/hard disk controller", but "we won't be supporting these filesystems". Where's the proof that those filesystems have problems that ext4 doesn't have?
That doesn't mean ext4 is problem-free, but it does mean that other filesystems have different problems that are not fixed by mitigations for ext4's quirks.
(Windows has gone to increasing lengths to accommodate and contain badly written drivers, since most bluescreens are caused by drivers. There is now a subsystem that allows video drivers to crash and restart entirely without bluescreening.)
Last I checked, providing callback function pointers to binary vendor libraries (read/write adapters for FFT come to mind, allowing on-the-fly metric computation or skipping intermediate storage for FFT convolution) was only possible on Linux, and only by statically linking said vendor library into the software (incidentally breaking binary distributability under the GPL).
This all feels a little silly. Yes sure, files are hard. If they weren’t I probably wouldn’t need to pay someone to solve the problem. But I think solving the file syncing problem for just ext4 is worse than bad for your customers because now you’re taking the whole ecosystem with you. Imagine users ditching btrfs or zfs because of Dropbox.
Now nothing and nobody is perfect and I wish Dropbox the best, but I miss the days when it felt like they cared about Linux and truly focused on solving the core problem effectively. They differentiated on sheer quality. I’ve moved on from file syncing to a more elaborate local NAS configuration and I could never go back, but I did get many good years out of Dropbox.
I’m glad Dropbox did at least walk back the filesystem compatibility issue a bit. Hopefully it’s a sign of better times to come for the Linux client.
If you care about data integrity, if you need an operating system where filesystem operations aren't leaky abstractions and where fsync(2) works as POSIX specifies, if correctness of operation is important to you, use an illumos-based operating system like SmartOS or any other based on the illumos codebase. Put some effort in and learn real UNIX and leave these long ago solved problems where they belong, back in the past century.
Copy file.txt to file.txt.new
Rename file.txt to file.txt.old
Rename file.txt.new to file.txt
This is “safe” in the sense that, in case of a terminated process, the point where it failed can be determined, and the application itself can resume or roll back the update on its next run.
What it doesn't guarantee is that the OS/filesystem provides this rollback independently of the application.
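A sketch of that three-step recipe plus the startup recovery it enables (the helper names, cleanup step, and fsync placement are my assumptions):

```python
import os

def safe_update(path: str, new_data: bytes) -> None:
    new, old = path + ".new", path + ".old"
    with open(new, "wb") as fh:        # step 1: write the new version aside
        fh.write(new_data)
        fh.flush()
        os.fsync(fh.fileno())
    os.rename(path, old)               # step 2: retire the current version
    os.rename(new, path)               # step 3: promote the new version
    os.remove(old)                     # cleanup once the update completed

def recover(path: str) -> None:
    """Run at startup: finish or undo an interrupted safe_update."""
    new, old = path + ".new", path + ".old"
    if not os.path.exists(path) and os.path.exists(old):
        # Crashed between steps 2 and 3: roll forward if the new
        # version exists, otherwise roll back to the old one.
        os.rename(new if os.path.exists(new) else old, path)
    for leftover in (new, old):        # discard any remaining debris
        if os.path.exists(leftover):
            os.remove(leftover)

with open("f.txt", "wb") as fh:
    fh.write(b"v1")
safe_update("f.txt", b"v2")
print(open("f.txt", "rb").read())      # b'v2'
```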
My question is: does Dropbox have any reason to want to work on a lower level than the “normal” high level where you only manually rollback or resume transactions? Do other applications also do this? I always felt that trying to pierce the abstraction of high level FS APIs was unnecessary unless you are writing drivers or file systems.
Some low-level programs (antivirus, backup) I can see why they would need to peek under the hood, but to me Dropbox is a pretty dumb file-sync program that shouldn't need complex fs operations the way, say, a backup program might. Is it more complex than I give it credit for?
> This trick doesn't work. People seem to think that this is safe because the POSIX spec says that rename is atomic, but that only means rename is atomic with respect to normal operation; it doesn't mean it's atomic on crash. This isn't just a theoretical problem; if we look at mainstream Linux filesystems, most have at least one mode where rename isn't atomic on crash. Rename also isn't guaranteed to execute in program order, as people sometimes expect.
> The most mainstream exception where rename is atomic on crash is probably btrfs, but even there, it's a bit subtle -- as noted in Bornholt et al., ASPLOS’16, rename is only atomic on crash when renaming to replace an existing file, not when renaming to create a new file. Also, Mohan et al., OSDI’18 found numerous rename atomicity bugs on btrfs, some quite old and some introduced the same year as the paper, so you may not want to rely on this without extensive testing, even if you're writing btrfs-specific code.
(also, this trick is almost completely useless for databases because the amount of data to be rewritten is too large)
Unfortunately, piercing the abstraction is completely essential if you want to achieve high reliability from userspace.
My own experience of this was in a Windows embedded environment on writing to either internal Flash or SD cards, and trying to construct a database (we should have used sqlite, but didn't have room) that was resilient against sudden power off. I discovered all sorts of odd failure modes - "delete file A, write to file B, crash" could result in the write to B persisting but A not being deleted, for example.
That said, what kind of non-atomicness occurs is also important. Leaving both the source and destination files around is at least something that can be recovered from, but if there are other failure modes that would be scary!
This is how nearly all “document-based” applications prevent corruption under POSIX as well, because nearly no application would use per-filesystem logic. I suppose “preventing corruption with a crash at any point” is simply too high a bar for most apps; they'd resort to backups to guard against data loss from untimely crashes?
With network or removable media, I know all sorts of madness will happen.
It seems to me they could have written much the same article the other way round: starting with a rename-based method, observing that the naive implementation isn't good enough, and going through the steps you need to make it robust in practice. Then they could have put a naive undo log in their "one weird trick" section and claimed it doesn't work.
I think the best rule of thumb is that if your outputs depend only on your inputs, you should aim for some kind of rewrite and atomic replace. If your outputs depend on both new inputs and the previous state, you need the database-like techniques.