
Show HN: AtomicWrite: cross platform library for atomically writing data to disk - jeffreyrogers
https://gitlab.com/jeffreyrogers27/AtomicWrite#readme
======
Yoric
Unfortunately, this technique is not guaranteed to work. Even if you write,
flush then rename, the OS and disk may decide to apply these operations in a
different order, breaking your guarantee.

A few years ago, I had to fix this behavior on Firefox, because it was
actually causing data loss. The only techniques I found that seem to work are
journaling and rolling backups + transparent recovery (which can still lose
data, just one order of magnitude less often).

If you're interested, I wrote about the latter in this blog post:
[https://dutherenverseauborddelatable.wordpress.com/2014/06/2...](https://dutherenverseauborddelatable.wordpress.com/2014/06/26/firefox-the-browser-that-has-your-backup/)

~~~
pjc50
Ah, thank you for fixing session restore. For years this was a really expensive
(although useful) feature that caused huge disk write pressure. I used to have
to move the Firefox directory to a ramdisk in order to have a usable system.

~~~
Yoric
My pleasure :)

Other developers are currently working on fixing other aspects of Session
Restore, including both performance issues and plugging other possible sources
of data loss, but I'm not following this closely.

------
drfuchs
Two issues:

Your technique potentially changes the owner, group, and protection bits of
the file, which may come as an unwelcome surprise to anyone trying to use it
as a drop-in replacement for directly over-writing the old file. (Worse, there
seems to be no way to completely fix this in Unix-land as of the last time I
looked back in the 90’s, when we had to pull a feature in FrameMaker over
this).
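
A minimal Python sketch of how far you can get on a POSIX system: copy the old file's mode bits and, when permitted, its owner and group onto the temporary file before renaming it. The helper name `replace_preserving_metadata` is made up for illustration and is not part of the library. It ignores ACLs, extended attributes and hard links to the old inode, and an unprivileged process generally cannot hand a file to another owner, so this narrows the problem rather than fixing it.

```python
import os
import stat
import tempfile


def replace_preserving_metadata(path, data):
    """Replace path with data, keeping the old mode and, if allowed, owner/group."""
    directory = os.path.dirname(os.path.abspath(path))
    old = os.stat(path)  # metadata of the file about to be replaced
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # mkstemp creates it with mode 0600
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            try:
                # Changing the owner needs privilege unless it is already ours.
                os.fchown(tmp.fileno(), old.st_uid, old.st_gid)
            except PermissionError:
                pass  # best effort: the new file keeps our own uid/gid
            # chmod after chown, because chown may clear setuid/setgid bits.
            os.fchmod(tmp.fileno(), stat.S_IMODE(old.st_mode))
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave the temporary file behind
        raise
```

Calling `replace_preserving_metadata("settings.conf", b"new")` on a file with mode 0640 leaves the replacement with mode 0640 as well, instead of the 0600 that `mkstemp` would otherwise give it.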

Also, be aware that (many? some?) disk controllers routinely lie about whether
their internal caches are actually flushed to physical media, and there’s no
way for fsync to be 100% sure that the data has really made it all the way to
the spinning platter of rust. Try testing your code by putting it in a loop
and repeatedly pulling the power plug out of the wall, and see if you always
end up with an uncorrupted file. (Yes, people do this sort of testing on
highly fault-tolerant systems.)

~~~
matthewaveryusa
Is there anything you can do in software (just thinking Unix) if the hardware
lies to you about fsync? Or at the very least detect such devices, or are you
at the mercy of simply testing the disk?

~~~
zaarn
You're at the mercy of the disk.

The best you can do is assume that anyone who uses your application in
production will have enterprise hardware, which (usually) doesn't lie about
fsync (consumer HDDs are somewhat likely to do this).

I've also not yet run across an SSD that lies about fsync.

The disappointing thing is that most filesystems also blindly assume this.
Ext4 is somewhat robust by design, but I've heard of at least one case where a
(very) cheap consumer hard disk trashed any filesystem on crash, including ZFS,
to the point of becoming unmountable.

~~~
michaelmior
Did you mean _lie_ about fsync?

~~~
zaarn
hmm, yes, thanks.

My spelling can go bad when I type a bit fast.

------
joetucek
This is a shockingly hard problem, which the technique still doesn't get right
even if the hardware cooperates.

1) If the filename provided is just a filename, then dirname (and parent) may
not work as expected. You ought to run it through realpath first.

2) What if the parent's parent hasn't been flushed? You need to recurse all
the way up to "/".

3) Extending support for other POSIX systems will be tricky. The Austin group
(keepers of POSIX development) had trouble with this. Some systems consider
using open() on a directory a complete sin (you _must_ use opendir()), some
allow it read only but not with write, fsync() may require write permissions
instead of just read, and opendir() may not produce an FD which you can ever
convince fsync() to operate on (e.g. because no write permission). Indeed, the
initial proposal from the Austin group was that all operations on directory
entries should be atomic if the inode is also synced (which matches e.g. HPUX,
one of the systems which strictly disallows any attempt at calling fsync() on
a directory). Many other systems (Linux especially) obviously don't fit that
model, it has some unpleasant performance implications, and overall no
agreement was ever reached as to how to standardize it.
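
For what it's worth, points 1 and 2 can be sketched in Python for systems such as Linux that let you open() a directory read-only and fsync() the descriptor. The names `fsync_directory` and `atomic_write` are invented here and are not the library's API. As point 3 says, the directory fsync is not portable, and none of this helps if the hardware ignores the flush.

```python
import os
import tempfile


def fsync_directory(directory):
    """fsync a directory so that entries created or renamed in it are durable."""
    fd = os.open(directory, os.O_RDONLY)  # allowed on Linux, not on every POSIX system
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def atomic_write(path, data):
    # Point 1: resolve the path first so a bare filename still has a usable parent.
    target = os.path.realpath(path)
    parent = os.path.dirname(target)
    fd, tmp_path = tempfile.mkstemp(dir=parent)  # same directory, same filesystem
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # contents first, before the name changes
        os.replace(tmp_path, target)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    fsync_directory(parent)  # make the rename itself durable
    # Point 2: keep going towards "/" in case an ancestor was only just created.
    directory = parent
    while os.path.dirname(directory) != directory:
        directory = os.path.dirname(directory)
        try:
            fsync_directory(directory)
        except OSError:
            break  # e.g. no read permission on an ancestor; stop here
```

Syncing ancestors is best effort in this sketch: a directory you cannot open cannot be synced, which is one more reason the approach is hard to make watertight.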

As for testing, you might look at "Torturing Databases for Fun and Profit"
([https://www.usenix.org/conference/osdi14/technical-sessions/...](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai))
and "All File Systems Are Not Created Equal: On the Complexity of Crafting
Crash-Consistent Applications"
([https://www.usenix.org/conference/osdi14/technical-sessions/...](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai))
for detailed approaches.

Good catch on Apple lying on fsync, btw. Not too many people know that one.

------
gumby
The ITS operating system (at the MIT AI lab) had a system call called .RENWO
-- "rename while open". Basically you could open a file, write into it (or
read/write) and then in an atomic operation give it a new name and close it.
So it was actually "atomic close and supersede".

It's really unfortunate the POSIX link() system call takes a pair of paths. It
should really take an inode. Then you could open a temp file and immediately
unlink it, and keep writing to it until you are finished (well, you can do
this today as well). Then when you're done you could give it a name. If you
crashed, you'd leave no detritus behind.

~~~
jstanley
Perhaps you could do this by creating a hardlink to /proc/$pid/fd/$n ?

EDIT: Doesn't look like it, gives EXDEV ("invalid cross-device link")

~~~
deathanatos
While I think the grandparent is right about POSIX not supporting this, I
think this should be possible in Linux:

1\. Open the file with open(2), passing the O_TMPFILE flag. This creates a
temporary, unnamed file. (As if you had opened it, then immediately unlinked
it, but atomically.)

2\. Write to the file.

3\. Link the file into the filesystem with linkat(2), passing the
AT_EMPTY_PATH flag (which tells linkat(2) to target a file descriptor that we
pass it, not a path.)

(I'm not sure if step 3 helps in the case that you want to atomically replace
the contents of an existing named file; I suspect that linkat(2) will error
out because the file exists. Perhaps someday linkat(2) will grow an "atomic
replace" flag.)
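
A rough Python sketch of those three steps, Linux only. One deliberate difference: instead of AT_EMPTY_PATH, which according to the linkat(2) man page has historically required CAP_DAC_READ_SEARCH, it uses the unprivileged variant described in open(2), linking through /proc/self/fd with AT_SYMLINK_FOLLOW. The function name `write_then_name` is made up, and the filesystem has to support O_TMPFILE.

```python
import os


def write_then_name(directory, name, data):
    """Write data to an unnamed file in directory, then link it in as name."""
    # Step 1: an unnamed temporary file on the same filesystem as directory.
    fd = os.open(directory, os.O_TMPFILE | os.O_WRONLY, 0o644)
    try:
        # Step 2: write everything and flush it before the file gets a name.
        view = memoryview(data)
        while len(view) > 0:
            view = view[os.write(fd, view):]
        os.fsync(fd)
        # Step 3: link the descriptor into the directory. Passing src_dir_fd
        # makes Python call linkat(2) with AT_SYMLINK_FOLLOW, so the
        # /proc/self/fd/<n> magic link is followed to the unnamed file.
        proc_fds = os.open("/proc/self/fd", os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.link(str(fd), os.path.join(directory, name),
                    src_dir_fd=proc_fds, follow_symlinks=True)
        finally:
            os.close(proc_fds)
    finally:
        os.close(fd)
```

If the name already exists, the link step fails with EEXIST (FileExistsError in Python), so this covers creating a file but not replacing one, in line with the suspicion above.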

------
agwa
This code is buggy because it assumes that a single write() will write out the
entire file:

[https://gitlab.com/jeffreyrogers27/AtomicWrite/blob/8a40d050...](https://gitlab.com/jeffreyrogers27/AtomicWrite/blob/8a40d050341f5d35947f1b47e73e884e5d7509ef/AtomicWrite.cpp#L31-37)

In reality, write() may write a partial amount, so you need to loop and call
write() again until the entire file has been written or write() returns an
error.
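
The library itself is C++, but the shape of the loop is the same in any language. A small sketch in Python, where `os.write` is a thin wrapper over write(2) and returns the number of bytes actually written (the helper name `write_all` is mine):

```python
import os


def write_all(fd, data):
    """Call os.write() until every byte of data has been written to fd."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)  # may write fewer bytes than requested
        view = view[written:]         # retry with whatever is left
```

An error from write(2) surfaces as an OSError, which ends the loop by propagating to the caller.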

~~~
jeffreyrogers
Thanks, you're right. I'll fix this.

------
dmitrygr
This library does not come even close to doing what it claims to. This is
excusable because it is actually a very difficult problem with no simple
solution. Depending on your hardware and software, it may have no guaranteed
solution at all.

Let's start with the fact that modern disc controllers cache megabytes of data
and there exists no way to force them to write it out. There exist a few ways
to suggest this. Some will listen, some will lie. SSDs are even worse in this
respect, they cache hundreds of megabytes, and while they're doing garbage
collection, even more data may be touched and modified. Depending on how good
their algorithms are, you may lose the data that you are writing if they're
powered off suddenly, or entirely unrelated other data, or perhaps a little of
each. There is no standard way to tell an SSD to flush its data to actual
Flash either, of course. In fact, depending on the logic in the FTL in use,
this may not even be possible to do quickly (say, you discover a few bad
blocks during garbage collection and need to relocate a lot of data suddenly).
This changes even across firmware updates.

And then we arrive at modern file systems. The default mount options generally
only sync metadata to disc no matter how many times you call the sync system
call. All the dirty pages containing actual file data will be written whenever
the system damn well feels like it. And of course, as mentioned above, they're
only written to the volatile RAM inside the disk controller.

This is actually an insanely complicated problem. If you want to see how it is
solved professionally, take a look at databases. People who work on sqlite
spent countless hundreds of hours making stuff like this work reliably, or
break in ways that are predictable and recoverable. Generally this involves
extreme amounts of journaling, sometimes multiple levels of journals.
Oftentimes there is a whole lot of OS-specific quirk handling to fix the
differences between what the OSs promise functions do and the reality.

This is actually why, if you ever start needing to "write things reliably", or
"store more than a few of something in a file, and update it", and you can
spare the code size, just use sqlite for your storage. Because there's a whole
lot of problems you're going to hit that they have already solved for you.

------
waynecochran
The pain I had with atomic writing on Windows is that Windows rename does
_not_ behave like the POSIX standard:

[https://msdn.microsoft.com/en-us/library/zw5t957f.aspx](https://msdn.microsoft.com/en-us/library/zw5t957f.aspx)

    
    
          "The old name must be the path of an existing 
           file or directory. The new name must not be 
           the name of an existing file or directory."
    

I ended up having to use ReplaceFile and rename if ReplaceFile failed:

    
    
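        /* ReplaceFile takes the file being replaced first, then its replacement. */
        if (ReplaceFile(fileName, tmpFileName, NULL,
                        REPLACEFILE_IGNORE_ACL_ERRORS, NULL, NULL) == 0) {
            /* Fall back to rename, e.g. when fileName does not exist yet. */
            rename(tmpFileName, fileName);
        }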
    

[https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365512\(v=vs.85\).aspx)

Why Windows why?

~~~
Goopplesoft
I usually don't defend Windows, but, rename not overwriting (without an
argument) seems safer as a low level API, no? Given it takes a 2 line helper
to achieve the same thing in C++ [1], doesn't seem that bad.

[1]
[https://gitlab.com/jeffreyrogers27/AtomicWrite/blob/master/A...](https://gitlab.com/jeffreyrogers27/AtomicWrite/blob/master/AtomicWrite.cpp#L144)

~~~
fh973
Safety is usually not a design goal for low level APIs. You rather want it to
be minimal, functions to be orthogonal, clean abstractions and only promise
necessary semantics.

And this might be already the reason. Win32 is geared towards the broad range
of developers whereas Unix comes more from systems thinking.

------
abcd_f
As another poster said - syncing is unreliable.

On Windows even if you use native API to open a file in a write-through mode
with no caching (FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING), you will
still end up with zero-filled files if the machine is power cycled or blue-
screened in the right moment. Disk controllers cache aggressively and there's
not much an OS can do about it.

------
stagbeetle
A reminder that Reiser4 exists and is still getting updated. Fast, stable, and
most of all: atomic (only one core though).

You'll need to patch your kernel[0] and install the mkfs.reiser* util[1]. Take
a look at this Gentoo-forums tweaks post[2] for troubleshooting and
performance tweaks. If you find reiser is lagging, your filesystem was likely
built incorrectly and you'll need to run a simple

    
    
         fsck.reiser4 -y --fix --build-fs --build-sb
    

on your partition and it should be fixed. I only made the switch recently
because Ext4 doesn't have intelligent inode allocation and drops the ball at
1mil files.

[0][https://sourceforge.net/projects/reiser4/files/reiser4-for-l...](https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/)

[1][https://sourceforge.net/projects/reiser4/files/reiser4-utils...](https://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/)

[2][https://forums.gentoo.org/viewtopic-t-707465-postdays-0-post...](https://forums.gentoo.org/viewtopic-t-707465-postdays-0-postorder-asc-start-0.html?sid=253eb8003e3ad8a80066f4d24d634c4a)

------
Upvoter33
There has been a lot of work on this at U. Wisconsin, starting with:
[https://www.usenix.org/system/files/conference/osdi14/osdi14...](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf)
The testing framework there is something you need to help
establish confidence that what you are doing actually works...

------
alkonaut
I’d be very interested in any way that would let me (quickly) write small
changes to large files atomically, on Windows. The naive solution is to copy
the original to a temp file, then proceed as in this library (write then
atomic rename).

I’m thinking there may exist some lower level magic API with e.g. CoW pages
that lets the caller appear to copy the original without actually doing it?
“Shadow copy” or similar?

~~~
huhtenberg
There's something called Transactional NTFS, which does exactly what you want.
But Microsoft decided to deprecate it in favor of other (and IMO far less
elegant) options.

~~~
trentnelson
That's a funny one. The kernel changes they needed to make in order to support
transactional stuff were pervasive, and it's used extensively under the
covers. The MSDN commentary makes it sound like they're going to completely
pull the functionality one day, which is misleading.

------
dicroce
I'm curious why it writes the whole file... Can't you do something like:

    
    
          if rollback log exists
             recover
          create log
          for each write(),
             write old contents of region to log (along w/ location)
          sync log
          for each write(),
             write new bits
          sync db file
          remove(log)
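
Spelled out as a Python sketch, with assumptions flagged: the log format (original file size, then offset/length/old-bytes records, then a commit marker) is invented for this example, as are the names `recover` and `apply_writes`. It has no checksums and does not fsync the directory holding the log, both of which a real implementation would want.

```python
import os
import struct

RECORD = struct.Struct("<QI")  # offset into the db file, length of the saved bytes
COMMIT = b"COMMIT"             # trailing marker: the log was written completely


def recover(db_path, log_path):
    """If a complete rollback log exists, put the old bytes back, then drop the log."""
    if not os.path.exists(log_path):
        return
    with open(log_path, "rb") as log:
        blob = log.read()
    if blob.endswith(COMMIT) and len(blob) >= 8 + len(COMMIT):
        body = blob[:-len(COMMIT)]
        (original_size,) = struct.unpack_from("<Q", body, 0)
        with open(db_path, "r+b") as db:
            pos = 8
            while pos < len(body):
                offset, length = RECORD.unpack_from(body, pos)
                pos += RECORD.size
                db.seek(offset)
                db.write(body[pos:pos + length])  # restore the old contents
                pos += length
            db.truncate(original_size)  # undo any growth of the file
            db.flush()
            os.fsync(db.fileno())
    # An incomplete log means the db was never touched, so it is simply discarded.
    os.remove(log_path)


def apply_writes(db_path, log_path, writes):
    """Apply a list of (offset, new_bytes) pairs to db_path with a rollback log."""
    recover(db_path, log_path)  # finish undoing any earlier interrupted update
    with open(db_path, "r+b") as db:
        with open(log_path, "wb") as log:
            log.write(struct.pack("<Q", os.fstat(db.fileno()).st_size))
            for offset, new in writes:
                db.seek(offset)
                old = db.read(len(new))  # short if the write extends the file
                log.write(RECORD.pack(offset, len(old)) + old)
            log.write(COMMIT)
            log.flush()
            os.fsync(log.fileno())  # log must be durable before the db changes
        for offset, new in writes:
            db.seek(offset)
            db.write(new)
        db.flush()
        os.fsync(db.fileno())  # db must be durable before the log goes away
    os.remove(log_path)
```

The ordering is the point: old contents are synced to the log before any new byte touches the db file, and the log is removed only after the db file has been synced.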

~~~
jeffreyrogers
That is probably a better approach for large files. My initial use case was an
application that would write an encrypted file containing passwords to disk.
The files are relatively small, so that method didn't seem worth the
additional complexity of something like what you describe. I believe sqlite
does something similar at least in some configurations.

I might add something like this in the future.

------
dis-sys
The title is pretty misleading. As the author already makes clear, the library
is all about atomically _creating_ a file with specified content on disk; it
is not about how to atomically append/update/delete files on disk.

~~~
jeffreyrogers
That's a much harder problem and I don't think you can do it portably. In
fact, I'm not sure of any filesystem that can make those guarantees.

------
setheron
At work, where we needed atomic writes in Java, I believe we rely on an
undocumented use of the file API to force the flush. (It's a top hit on
Stack Overflow.)

------
osrec
Interesting. I wonder how much effort it would take to make this asynchronous
as well. I can think of a few situations where this could be pretty useful.

------
nightcracker
I wouldn't touch this problem with a 10 foot pole. Use SQLite and stop
worrying about it.

~~~
jeffreyrogers
I wrote this because I have an application that requires this but where SQLite
isn't suitable.

------
adontz
Windows implementation misses TxN, which makes it pointless.

~~~
ygra
TxN? Do you mean Transactional NTFS (which is shortened to TxF in the
documentation)? And isn't that deprecated with a warning that it might not
exist in future Windows versions?

