Files Are Hard (2015) (danluu.com)
170 points by signa11 13 days ago | 50 comments

I strongly recommend using SQLite for your own document format. SQLite is one of the most well-tested pieces of software on earth and is ACID-compliant. You can even make it safer than the default configuration if you don't need maximum performance and only need it for storing documents. It is very resistant to crashes and corruption, especially with the full-sync option and if you use it from one thread only.
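A minimal sketch of what that can look like, using Python's built-in sqlite3 module. The schema, pragmas, and key names here are illustrative choices of mine, not anything the comment prescribes:

```python
# Sketch: a SQLite file as an application document format, tuned for
# durability over speed. Assumes Python's stdlib sqlite3 module.
import sqlite3

def open_document(path):
    con = sqlite3.connect(path)
    # WAL keeps readers and the writer from blocking each other and is
    # robust across crashes; FULL syncs at the critical moments.
    con.execute("PRAGMA journal_mode=WAL")
    con.execute("PRAGMA synchronous=FULL")
    con.execute("""CREATE TABLE IF NOT EXISTS document (
                       key   TEXT PRIMARY KEY,
                       value TEXT
                   )""")
    return con

con = open_document(":memory:")  # a real app would pass a file path
with con:  # transaction: either both rows land or neither does
    con.execute("INSERT INTO document VALUES ('title', 'My file')")
    con.execute("INSERT INTO document VALUES ('body', 'Hello')")
```

The `with con:` block is the key part: a half-finished save is rolled back instead of leaving a half-written document on disk.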

It has been a while, so maybe things are better now, but I have seen SQLite be a disaster when the time comes to upgrade the schema. Not all of the common operations, like deleting or renaming a column, are supported; since it has been a while, I don't remember exactly which ones. So someone writes an SQL script to provide this functionality. Then it turns out that a freshly created database using the latest schema is subtly different from one that was created from an older schema and then upgraded: things like a column that has a default, but only in one of these two cases. Lots of fun, but not really.

I think what the GP meant is the fopen-style use of SQLite, not a full schema: basically a tXML/tJSON singleton table plus a tBlob(id, data) table for embedded blobs. That way you get all the journaling, syncing, checksums, and other filesystem wisdom that the article mentions.
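A sketch of that layout; I've used my own table and column names standing in for tXML/tJSON and tBlob, and an in-memory database standing in for the document file:

```python
# Sketch of the "fopen-style" layout: one singleton row holding the
# document as JSON, plus a table for embedded binary blobs.
import json, sqlite3

con = sqlite3.connect(":memory:")  # a real app would use a file path
con.executescript("""
    CREATE TABLE doc  (id INTEGER PRIMARY KEY CHECK (id = 1), body TEXT);
    CREATE TABLE blob (id INTEGER PRIMARY KEY, data BLOB);
""")
with con:
    con.execute("INSERT INTO doc VALUES (1, ?)", (json.dumps({"title": "hi"}),))
    con.execute("INSERT INTO blob (data) VALUES (?)", (b"\x89PNG...",))

body = json.loads(con.execute("SELECT body FROM doc").fetchone()[0])
```

The CHECK constraint is one way to enforce the singleton; SQLite does the crash-safe write ordering so the application never has to.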

For most uses of SQLite it's acceptable (or even advantageous) to copy the entire database for doing an upgrade, instead of the more RDBMS-y way of using DDL.

Which brings you the additional challenge of how to overwrite the old database file with the new one atomically in the face of crashes.


> If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing.

Didn't the original article specifically make the point that Linux's `rename` is atomic only in the happy case, not when a crash happens during the rename?

Rename is the most atomic operation you can get on Unix. The original article talks about partial file updates. A new SQLite file + rename should be as failproof as possible. And in any case nothing is 100% safe, so redundancy is a hard requirement for real safety.

I've encountered one gotcha there: SQLite operations may create "-journal" files, and you may have to be careful exactly what files you're copying or moving around.

More: https://sqlite.org/howtocorrupt.html , https://sqlite.org/tempfiles.html

I was referring to the last section of the article, the part where it says rename isn’t atomic on crashes. Why not migrate traditionally via DDL in a transaction?

It's not clear to me if they're correct about that. The way it's written, it seems to be saying "my read of the POSIX standard suggests rename may not be atomic on crashes", rather than "there are POSIX implementations in common use that have been observed to have non-atomic renames on crashes".

On Windows you will at least have an old version and a new version lying around intact. Would take this over a single corrupted file any day.

Unless you call FlushFileBuffers after all writes are finished, Windows makes no guarantee that the physical copy of a file on-disk will contain all previously written data, even if the file has been subsequently closed and renamed.

Typical (mis)behavior includes correctly-sized files that consist partially or entirely of zero bytes (because NTFS does guarantee newly-allocated space will be zero-filled).

Nevertheless, I agree with you, and, therefore, when dealing with application-managed files (as opposed to user-visible documents) on local NTFS filesystems, I typically

1. Write to a temporary file filename.temporary, say.

2. FlushFileBuffers

3. Rename the original file to e.g., filename.original.

4. Rename the temporary file to the original filename.

5. Delete filename.original.

Then, in some thread-safe manner at some point before accessing any such files — at application startup, say, when performance permits — I remove lingering ".temporary" files and restore any orphaned ".original" files to their original filenames.

This, of course, assumes FlushFileBuffers works properly, so, in particular, still fails in the presence of storage devices, drivers, and configurations (e.g., "Turn off Windows write-cache buffer flushing") that break buffer flushing, which are not nearly as rare as they should be.
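The five steps plus the recovery pass might look like this as a portable Python sketch, with os.fsync standing in for FlushFileBuffers and os.replace for the renames. The suffixes follow the comment; everything else (function names, error handling being elided) is my own:

```python
# Sketch of the write-temp / flush / swap / clean-up protocol described
# above, plus the startup recovery pass.
import os, tempfile

def safe_write(path, data):
    tmp, backup = path + ".temporary", path + ".original"
    with open(tmp, "wb") as f:         # 1. write to a temporary file
        f.write(data)
        f.flush()
        os.fsync(f.fileno())           # 2. flush buffers to stable storage
    if os.path.exists(path):
        os.replace(path, backup)       # 3. move the original aside
    os.replace(tmp, path)              # 4. rename temp to the original name
    if os.path.exists(backup):
        os.remove(backup)              # 5. delete the backup

def recover(path):
    # Startup pass: restore an orphaned ".original" if the main file
    # vanished mid-update, and discard any lingering ".temporary".
    backup, tmp = path + ".original", path + ".temporary"
    if os.path.exists(backup) and not os.path.exists(path):
        os.replace(backup, path)
    if os.path.exists(tmp):
        os.remove(tmp)

d = tempfile.mkdtemp()
p = os.path.join(d, "settings.dat")
safe_write(p, b"version 1")
safe_write(p, b"version 2")
recover(p)  # nothing orphaned here, so a no-op
```

At every point in the sequence, either the old file or the new file survives a crash intact; the recovery pass just decides which one wins.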

> Unless you call FlushFileBuffers after all writes are finished, Windows makes no guarantee that the physical copy of a file on-disk will contain all previously written data, even if the file has been subsequently closed and renamed.

On *nix you need to sync before renaming as well.

Do you have any example of what you're describing? I've used SQLite for years, written many migrations and never had such issues. A reasonably popular app I'm working on has so far 25 migration steps [0] and I've never encountered something like a new schema being different from an updated one. I'd expect whatever mistake was made to get to that point could be made in any other RDBMS.

0: https://github.com/laurent22/joplin/blob/805a5399b5288b4a282...

After writing the above comment, memory comes back a bit. The problem is that SQLite does not support renaming a column directly. One can find procedures for that on the internet, e.g., https://tableplus.com/blog/2018/04/sqlite-rename-a-column.ht.... However, we also wanted to be able to rename columns where there are foreign key references, for instance references to the table in which a column was renamed. A generic function was written for this purpose, not in SQL but in the programming language of our project (in this case C++), something like rename_column(table_name, old_column_name, new_column_name). Needless to say, the first version of this function was not quite sufficient, leading to the problems that I mentioned. In the end I had to improve this function. I remember being quite pleased with myself at the time for figuring out how to do it without temporarily switching off foreign key checking, which the first version of the script had done.
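For reference, the classic rebuild-the-table workaround looks roughly like this (table and column names are invented; note that SQLite 3.25 later added ALTER TABLE ... RENAME COLUMN, so this is only needed on older versions, and the foreign-key handling the comment describes is beyond this sketch):

```python
# Sketch: renaming a column on old SQLite by rebuilding the table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE notes (id INTEGER PRIMARY KEY, titel TEXT);
    INSERT INTO notes VALUES (1, 'hello');
""")
# Rebuild: create the table under the new schema, copy the rows,
# drop the old table, and take over its name.
con.executescript("""
    CREATE TABLE notes_new (id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO notes_new SELECT id, titel FROM notes;
    DROP TABLE notes;
    ALTER TABLE notes_new RENAME TO notes;
""")
```

The subtle-difference trap from the earlier comment lives exactly here: if the rebuilt CREATE TABLE doesn't reproduce every default, constraint, and index of the "fresh" schema, upgraded databases silently diverge from new ones.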

Could you go a little more in depth about your setup?

Do you mean you add the SQLite library, load the file, and then use SQLite to read/write data to a database instead of a text or binary file? I like it...

I do simulations in physics and we're really not familiar with these types of things.

Probably more than 50% of all the Android apps I have decompiled use SQLite for data storage.

It takes some extra time to learn SQL, but it will save you a lot of time in the long run.

Another benefit compared to plain ASCII files is that you get file compression, easy access to random parts of the file, and fast (random) inserts.

I do the same to manage a few thousand records with Node.js. I like that I don't depend on a complex DB engine. It's just one file, very compact. I even commit the file to GIT :)

Totally agree. To make a file system crash-proof you really need an ACID-compliant underlying storage layer, or you have to implement one inside the file system itself. In fact, file systems and databases are quite similar in terms of atomicity and consistency. If an embedded ACID component is too heavy, using a reliable database as storage, for example SQLite, might be a good idea. That's exactly what I did in ZboxFS (https://github.com/zboxfs/zbox), which can use SQLite as an underlying storage to achieve ACID-compliant behaviour.

As a result of the paper linked in this article, SQLite has a new setting:

PRAGMA synchronous=EXTRA;

It is not the default (the default is usually FULL).
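Setting it from application code is a one-liner; a quick sketch (per the SQLite docs, EXTRA is FULL plus syncing the directory containing a rollback journal after the journal is unlinked, which closes one of the crash windows):

```python
# Sketch: opting into the strongest sync level from Python.
import sqlite3

con = sqlite3.connect(":memory:")  # a real app would open a file
con.execute("PRAGMA synchronous=EXTRA")
# Reading the pragma back returns the numeric level:
# 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA
level = con.execute("PRAGMA synchronous").fetchone()[0]
```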

I have no idea how many mails the author receives, but anyway I would suggest taking a look at Claws Mail. https://www.claws-mail.org/

I have all my mail since forever, that is, tens of thousands of messages including attachments, in multiple inboxes belonging to multiple accounts: everything since my internet day one (20+ years), the earlier messages imported from Eudora before moving to Linux. It's all in my $HOME, and searches are damn fast: fractions of a second for header searches, which are indexed, and tens of seconds to minutes for searches into the body. As far as I recall, I have never lost a single file.

  ~$ du -chs Mail
  3.7G Mail
  3.7G total
Might post some spam from the late 90s if anyone is interested:)

> Might post some spam from the late 90s if anyone is interested:)

Would love to see a few (:

I've had dana@realms.org since 1995 but sadly I haven't kept the stuff older than 15 years.

There were very few of them at that time. About once a month I received a newsletter from Programmer's Paradise, which I probably signed up for somewhere, then some "adults only" links (the message was empty) from a members.xoom.com address, then the usual "enlarge it" spam. Here's an example (addresses probably fake or long dead, but redacted anyway):


  From: XXXX@123india.com
  To: "XXXX@yesitsmail.net"<>
  Subject: YeS It Works!.....Gotta Have  It          15643
  Date: Wed, 18 Nov 1998 02:53:45 -0500
  X-Mailer: QUALCOMM Windows Eudora Pro Version 4.1

Try This POTENT Pheromone Formula That Helps Men and Women To Attract Members of The Opposite Sex Click here to learn more:

Ever wonder why some people are always surrounded by members of the opposite sex?

Now YOU Can................... Attract Members of The Opposite Sex Instantly Enhance Your Present Relationship * Meet New People Easily Give yourself that additional edge Have people drawn to you, they won't even know why

Click Here For Information Read What Major News Organizations Are Saying About Pheromones!

To be removed from our mailing list, Click Here

That spam also contained some quoted text from a yahoogroups list I was part of with a bunch of old friends, so either some of them or the server was compromised and spitting out our addresses.

Spamming was occasional at that time, and I probably deleted most of it, but within a few years it became unsustainable and I installed IpCop to identify junk and redirect it to its own folder.

One more example:


  From: "InvestTXXXX@uasc.com.kw" <InvestXXXX@uasc.com.kw>
  To: "XXXX@prodigy.com" <XXXX@prodigy.com>
  Reply-To: <XXXXXGreatPicksDaily@trk.kht.ru>
  Subject: Fwd: Investor's Alert
  Date: Sun, 9 Jun 2002 21:07:41 -0400
Immediate Release

Cal-Bay (Stock Symbol: CBYI) Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI. CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exhange. CBYI is trading around $.30¢ and should skyrocket to $2.66 - $3.25 a share in the near future. Put CBYI on your watch list, acquire a postion TODAY.

REASONS TO INVEST IN CBYI • A profitable company, NO DEBT and is on track to beat ALL earnings estimates with increased revenue of 50% annually! • One of the FASTEST growing distributors in environmental & safety equipment instruments. • Excellent management team, several EXCLUSIVE contracts. IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.

RAPIDLY GROWING INDUSTRY Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billion from "smell technology" by the end of 2003.

ALL removes HONERED. Please allow 7 days to be removed and send ALL address to: XXXXXAgain@btamail.net.cn

Certain statements contained in this news release may be forward-looking statements within the meaning of The Private Securities Litigation Reform Act of 1995. These statements may be identified by such terms as "expect", "believe", "may", "will", and "intend" or similar terms. We are NOT a registered investment advisor or a broker dealer. This is NOT an offer to buy or sell securities. No recommendation that the securities of the companies profiled should be purchased, sold or held by individuals or entities that learn of the profiled companies. We were paid $27,000 in cash by a third party to publish this report. Investing in companies profiled is high-risk and use of this information is for reading purposes only. If anyone decides to act as an investor, then it will be that investor's sole risk. Investors are advised NOT to invest without the proper advisement from an attorney or a registered financial broker. Do not rely solely on the information presented, do additional independent research to form your own opinion and decision regarding investing in the profiled companies. Be advised that the purchase of such high-risk securities may result in the loss of your entire investment. The owners of this publication may already own free trading shares in CBYI and may immediately sell all or a portion of these shares into the open market at or about the time this report is published. Factual statements are made as of the date stated and are subject to change without notice. Not intended for recipients or residents of CA,CO,CT,DE,ID, IL,IA,LA,MO,NV,NC,OK,OH,PA,RI,TN,VA,WA,WV,WI. Void where prohibited. Copyright c 2001 *

edit: formatting to make headers readable.

What kind of email volume are you dealing with? Middle- to higher-level execs routinely receive 500+ legitimate emails, and an equal amount of spam, every day, and they use Outlook without issues.

I do desktop support for a medium-size office; we use O365. Outlook sucks at managing emails, and I regularly find that people who moved emails to a folder can't find them months or years later.

When I worked in marketing, I had Outlook search flat out refuse to find emails in my Inbox, through both header searches and body searches.

Totally nondeterministic behaviour, because it would turn around and work on other searches.

The sane old-school way to store mail is using directories and files. Mozilla Thunderbird does this and I've never had a corruption issue. If need be, you can index the files or open individual ones in a text editor, as they're all plain text. I would hope for a revitalization of the application, but it works for my personal use cases.

Yeah, files, are hard.

I don't think Thunderbird is a good example here...

It stores inboxes in an mbox-like format, then some metadata in a very weird .msf format, and on top of that an SQLite database for search indexing. And it has some bewildering bugs, like mixing up titles and content from two different messages, or slapping on messages date pulled from thin air..

> or slapping on messages date pulled from thin air..

IME, this happens when the message doesn't have a date (that is, Date header missing or invalid). There's an open request on bugzilla to use the last Received date in these cases.

The Maildir format, however, barely uses any tricky filesystem features. In order, all the semantics you need to support are:

- atomic file create with content

- atomic directory create

- atomic file move

- atomic file delete

- single-level file enumeration

No in-place file modification, which is the actually hard part. Not even appending to existing files.

Incidentally, reliable databases tend to also not rely on file modification, opting for a log-and-compact strategy instead.

It's when you start modifying existing data that 'just use a filesystem' becomes unviable.
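A sketch of Maildir-style delivery using only the primitives listed above: write into tmp/, then atomically rename into new/. The unique-name scheme here is simplified; real Maildir names encode time, PID, and hostname:

```python
# Sketch: Maildir delivery with only atomic create, mkdir, and rename.
import os, tempfile

def maildir_deliver(maildir, message, name):
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    tmp = os.path.join(maildir, "tmp", name)
    dst = os.path.join(maildir, "new", name)
    with open(tmp, "wb") as f:     # invisible to readers while in tmp/
        f.write(message)
        f.flush()
        os.fsync(f.fileno())       # make the contents durable first
    os.rename(tmp, dst)            # atomic move: message appears whole

d = tempfile.mkdtemp()
maildir_deliver(d, b"Subject: hi\n\nhello\n", "1234.5678.examplehost")
```

Readers only ever enumerate new/ and cur/, so a crash mid-write leaves at worst a stale file in tmp/ to be garbage-collected, never a truncated message.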

> - atomic file create with content

No FS guarantees this. You can emulate it by writing to a temporary location, syncing, and performing a move.

When you use rename to overwrite a file, it guarantees that at any point in time, newpath will refer to either the original file or the new file. This is atomic enough to give you effectively atomic file creation with contents.

It is not technically atomic, because there is a period where oldpath and newpath both refer to the same file, but I don't think that is an issue for this use case.

Linux has the O_TMPFILE flag to open(2), for which this is one of the use cases described in the man page. I can't say whether it's "guaranteed", though. I assume Microsoft's Transactional NTFS (TxF) also had (or has) this capability, but it's deprecated.

TxF is a cornerstone of Windows Update; I can't really imagine them removing it. I always assumed they marked it deprecated to avoid having to support it for customers.

Yes, just use Thunderbird/Netscape and be done with it. I still have my old installation running, upgraded and migrated as needed since Netscape times. I still have mails from 1999 on there.

At work they were using Outlook for mail and wondering why it crashed now and then (past 2 GB it starts writing over the start of the file, or some such). We get a lot of big documents by email, so we switched to Thunderbird and all has been well since. We had to switch mail servers though, since that one also only handled 2 GB. No idea what they were running, but most likely something MS-related; most of their stuff was MS at the time.

I wish there were a native e-mail client that utilized an RDBMS properly. I want really fast full-text search, I want fast filters. The current clients have all been very unsatisfactory, mostly because the file-based approach doesn't allow for this. It really isn't sane, in my opinion.

Have you tried notmuch? https://notmuchmail.org/

It uses a Xapian index, not an RDBMS, but it's impressive. RDBMSes aren't known for producing good natural language search results.

I have not, I am tempted however, I really dislike the power Google has over me.

"fast full text search" is something better provided by software that specializes in fast full text search.

Rdbms for the email use case is kind of orthogonal to fast full text search.

An RDBMS can provide fast full-text search alongside other useful features; can you elaborate on how it's orthogonal?

Oh, and I also dislike even the idea that I have to find some "specialized software" to have basic functionality in my e-mail client. That's why most still suck a*.
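For what it's worth, SQLite itself ships a full-text extension, FTS5, which is compiled into most builds of Python's sqlite3 module. A quick sketch of mail-like search, with invented table and column names:

```python
# Sketch: full-text search inside an RDBMS via SQLite's FTS5 extension.
# Assumes the sqlite3 build includes FTS5 (most standard builds do).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE VIRTUAL TABLE mail USING fts5(subject, body);
    INSERT INTO mail VALUES ('lunch?', 'meet at noon by the deli');
    INSERT INTO mail VALUES ('quarterly report', 'numbers attached');
""")
# MATCH searches the full-text index across all indexed columns.
hits = con.execute(
    "SELECT subject FROM mail WHERE mail MATCH 'noon'").fetchall()
```

So a mail client could keep messages, filters, and a tokenized search index in the same transactional file, which is roughly what the wish above amounts to.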

You probably want something that checks for similarly spelled words, maybe natural language processing, whole or partial word matching, indices on all this, etc.

If all you want is to grep your email, that's different, I think you can accomplish that with pine or other text mode email clients.

Oh yeah, you want a lot :)

links in my stack: https://github.com/jakeogh/gpgmda-client

> With that tool, they find that most filesystems drop a lot of error codes:

I recently switched my work notebook to XFS, so looking at this table makes me kinda happy, even though I had bad experiences with XFS in the past (a forced power-off after a crash in X11 was enough to leave my system unbootable, and trying to recover it with xfs_repair broke it completely; I also suffered from 0-size files randomly appearing after forced reboots quite regularly).

I have known for a long time that XFS is probably one of the best-written filesystems for Linux, even if its use case seems to be more focused on servers with uninterrupted power supplies than on desktops and notebooks.

Well, I don't have very good experiences with XFS in the presence of system crashes; you are not alone in getting an irreparable FS. And even without crashes, the tooling is not very good.

And the thing is that even with a UPS, you had better handle crashes more gracefully than XFS apparently does, because kernel panics can occasionally happen.

So for my part, I'm moving from XFS to ext4. I'm not even sure it will be better; I'll see...

Too much of the storage industry has been consumed by the performance-at-any-cost metric, even when that means making engineering decisions that put data at risk.

Back when the original POSIX specs were being worked on, a common assumption was that anyone serious about their data ran all disks/RAID controllers/filesystems/etc. in write-through (or equivalent) mode. Combined with what the spec doesn't guarantee, this leaves us in a world where it's pretty much impossible to make real data-retention guarantees. A large part of this is that write()-style APIs should never have been allowed to be anything but synchronous, with the data committed to non-volatile storage on return. If you move error handling to fsync() or close() (or pick your place) and one of the writes fails, it is impossible to report to the application what failed with enough accuracy to know how to recover.

This goes beyond just filesystems these days. There is a major RAID chipset vendor, shipped by a couple of tier-1 vendors, whose controller only ran in write-through/FUA mode if it wasn't fitted with a battery; the performance was so abysmally bad that nearly everyone ended up buying the battery to enable write-back. The problem with this controller in write-back mode is that it didn't honor any kind of fencing or FUA, instead depending on an intermittent timer-based flush to force the data to disk. If there was a disk failure on write, it was impossible to know what had actually been flushed, and depending on firmware it would either report the error or silently drop the data. Neither helps the end application, which might have exited, and had definitely passed its close/rename/flush/etc. sequences, by the time the error occurred.

Bottom line: there is an API mismatch from top to bottom of the storage stack, starting with the simple idea that if a filesystem operation doesn't provide an async completion notification, it should be forced to be consistent at completion. Anything else _WILL_ create the opportunity for data loss on even a simple power loss, let alone the more complex cases of delayed writes in RAID controllers, async replication, etc.

Put another way: as this article points out and others linked here reference, there are a ton of "bugs" in most OSes' storage stacks; enough that they generally behave badly in the face of actual failures.

Related to the post, some disk and filesystem reliability research in chronological order, including things from the author.


An Analysis of Latent Sector Errors in Disk Drives (2007) https://research.cs.wisc.edu/adsl/Publications/latent-sigmet...

Failure Trends in a Large Disk Drive Population (2007) https://static.googleusercontent.com/media/research.google.c...

Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs (2012) https://www.microsoft.com/en-us/research/wp-content/uploads/...

Flash Reliability in Production: The Expected and the Unexpected (2016) https://www.usenix.org/system/files/conference/fast16/fast16...


IRON File Systems (how filesystems behave on disk errors) (2005) https://research.cs.wisc.edu/wind/Publications/iron-sosp05.p...

EIO: Error Handling is Occasionally Correct (2008) https://www.usenix.org/legacy/event/fast08/tech/full_papers/...

SQCK: A Declarative File System Checker (2008) https://www.usenix.org/legacy/events/osdi08/tech/full_papers...

All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications (2014) https://www.usenix.org/system/files/conference/osdi14/osdi14...

Filesystem error handling (2017) https://danluu.com/filesystem-errors/

Files are fraught with peril (2019) https://danluu.com/deconstruct-files/

And a DRAM reliability paper, for a more complete picture:

DRAM Errors in the Wild: A Large-Scale Field Study (2009) https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

I'd be interested to see a current version of ZoL (ZFS on Linux) in that table. I suspect it would hold up quite well.


  write(/dir/log, "2,3,foo", 7);
  pwrite(/dir/orig, "bar", 3, 2);
and not?

  write(/dir/new, "foo", 3);
  rename(/dir/new, /dir/orig);

Check the update at the bottom:

> Update: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file to modified to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that's probably the most common comment I've gotten on this post. If you think this solves the problem, I'm going to ask you to pause for five seconds and consider the problems this might have.

> The main problems this has are:

> - rename isn't atomic on crash. POSIX says that rename is atomic, but this only applies to normal operation, not to crashes.
>
> - even if the technique worked, the performance is very poor
>
> - how do you handle hardlinks?
>
> - metadata can be lost; this can sometimes be preserved, under some filesystems, with ioctls, but now you have filesystem-specific code just for the non-crash case, etc.

> The fact that so many people thought that this was a simple solution to the problem demonstrates that this problem is one that people are prone to underestimating, even when they're explicitly warned that people tend to underestimate this problem!
