
Things UNIX can do atomically - blasdel
http://rcrowley.org/2010/01/06/things-unix-can-do-atomically?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+rcrowley+%28Richard+Crowley%27s+blog%29
======
dasht
I must brag (because hopefully it's interesting):

About 22 or 23 years ago my mentor at the time, a unix guru, beat more or less
that same list into my thick skull with a heavy schtick. It was considered
part of "basic competency" for unix. The list, back then, came with a list of
bitter exceptions (e.g., vendor X had broken "rename(2)" and it was not
reliably atomic on those systems).

Why was it important? The notion was that with a unix file system, and "tiny
tools" connected by pipes, you really didn't need an SQL-based relational
database for most tasks. You had shell programs like "cut(1)", "join(1)",
"sort(1)", "grep(1)", and "sed(1)". In a pinch, you could reach for "awk(1)"
but don't overdo it because the joke went "Some people, whenever they have a
problem to solve, reach for AWK. Then they have two problems."

A serious relational database would give you transactions and isolation - the
ACID properties. (Obtaining "durability" (the D in ACID) meant learning when
and how to flush and sync the file system.) Back then, on some historic
versions of unix, you could also count on it to be atomic if you appended a
text file with a modest length, single line, in a single "write(2)". But even
back then, that trick wasn't portable - just sorely missed from the good 'ol
days.

Around about 2001 I set out to implement the GNU Arch revision control system.
I was steamed at how Subversion was large, complicated, hard to get working,
and used a centralized server rather than a distributed and decentralized
system. I thought "I can do better than that, easily!" So I set out to prove
it.

I wrote the initial self-hosting release of Arch as... wait for it ... _shell
scripts_. Initially, there wasn't a single line of original C code in the
system. It was all coded in /bin/sh, /bin/awk, /bin/find, with heavy use of
sed, grep, cut, join, and sort. It was sluggish but usable and, was indeed,
one of the very first truly distributed and decentralized revision control
systems. It also delivered on the "smart merging" algorithms that Subversion
had promised but that, to this day, has not succeeded in implementing.

One of the challenges of a revision control system is the critical importance
of database integrity. When you check your code in, when you make back-ups,
etc -- and when multiple users may be doing the same thing at the same time on
the same repository -- it's absolutely critical to keep the database
consistent. You need ACID updates to the database.

So, here is a fun puzzle: Using just the atomic unix file system operations,
you have to implement (as a shell script), a "write once" list-of-directories
structure. I'll illustrate what I mean:

We have a directory called "the-list". Initially, it may contain arbitrary
"control files/directories" that you design, but it is empty of "data
directories". The data directories we add will be named as numbers. So the
first one is "the-list/0", then "the-list/1", then "the-list/2" and so forth.

When a directory like "the-list/1" is installed, it should already contain
data files. E.g., "the-list/1/patches.tar.gz". You aren't allowed to just
"mkdir the-list/1" and then, after that, install "patches.tar.gz". If "the-
list/1" is there at all, it must already contain "patches.tar.gz".

The data directories must be created in order. If two users both try to create
"the-list/1" at the same time, one should succeed, and the other be forced to
retry, perhaps creating "the-list/2" instead (after "the-list/1" already
exists).

Finally, a program that _starts_ to create "the-list/1" but then crashes must
not harm the database. It is OK if a user has to type a command like "cleanup-
after-crash" but it must be possible to write such a command (as a shell
script). It is important that if someone types "cleanup-after-crash" while the
program trying to create "the-list/1" is still running, that either the
"cleanup" program does nothing, or the programing making "the-list/1" is
forced to fail (and know that it failed, and report that to the user).

It sounds like something that should be trivially easy, right? Well, it
doesn't take a huge amount of code to solve but solutions to the problem are
not obvious. And many obvious attempts to solve the problem have subtle bugs
that lose the ACID properties of the database.

GNU Arch source code (and perhaps the documentation, I forget) contains an
answer key. It might be more fun, though, if you've nothing better to do, to
try to solve it yourself, first.

~~~
dasht
Something to keep in mind if you are inclined to try solving the puzzle is
that solutions ought to work on NFS file systems. Why does that matter?

Well, consider a system call like rename(2) or mkdir(2). Suppose that on a
local file system, two programs both try to rename "a" to "b" at roughly the
same time. At most one will succeed.

Not so, on NFS! An NFS implementation is permitted to behave this way: (1)
Client x issues the rename request; (2) Client y issues the same rename
request; (3) The server reports "success" to x; (4) The server reports "fail"
to y, but the server's report packets get lost so y doesn't get the message.
(5) Client y asks the server "Hey, whatever happened to my rename request?";
(6) Server notices that "a" doesn't exist but "b" does and replies "Success".
Then both x and y think that they renamed "a" to "b".

The rename itself took place atomically. It's just unclear to each client
which client actually triggered the rename. Similarly for other operations.

In other words, return values from file system calls don't reliably tell you
when your process was able to perform some atomic action.

------
Skriticos
I have a copy of "Advanced Programming in the UNIX Environment, Second
Edition" (featuring very good description of C/POSIX/SuS/XSI protocols/api's
and programming techniques for Linux/BSD/OSX/Solaris/other UNIX systems).
Anyone who seriously wants to do thread safe multiprocessing or other complex
low level stuff on UNIX platforms should have a copy of it. It's basically
"the book" on the topic.

~~~
sophacles
This is one of a very small number of books older than 5 years which I still
regularly reference. Amusingly, most of the books on that list were written by
W Richcard Stevens as well.

------
ajross
The very first example ("mv -T" to retarget a symlink, something I didn't know
about and that the manpage is silent on) is _not_ atomic! The symlink(2)
syscall will not overwrite an existing directory entry. You have to call
unlink(2) first, which opens a race.

That doesn't mean this trick is useless, but calling it "atomic" is wrong, and
opens up the possibility of bugs (likely security holes, even) if someone
actually needs atomicity.

~~~
bretthoerner
> something I didn't know about and that the manpage is silent on

It's specific to GNU Coreutils mv(1).

------
geocar
Beware: rename() isn't atomic on Mac OS X:

<http://www.weirdnet.nl/apple/rename.html>

~~~
jacquesm
What's weird is the timeline on that, if you see when it was reported and when
it was still present. Surely it can't be that hard to fix this in a proper way
?

~~~
rbanffy
Could be something with HFS+

~~~
geocar
[Apparently
not]([http://www.reddit.com/r/programming/comments/amo3x/on_how_re...](http://www.reddit.com/r/programming/comments/amo3x/on_how_rename_is_broken_in_mac_os_x/c0ie89p)).

That's pretty embarrassing.

------
prodigal_erik

      pipe(&pipefd);
      assert(len <= PIPE_BUF);
      write(pipefd[1], buf, len);
    

Also, if a file is opened O_APPEND, the seek before every write is atomic, but
since writes in general aren't atomic I'm not sure what value that has.

~~~
geocar
This is a great one: Under POSIX, any write() to a fifo that has at least
PIPE_BUF free is atomic.

There's a neat trick I'd like to add that takes advantage of this: Multiple
clients can use a message whose size divides evenly into PIPE_BUF without any
possibility of interleave, and so they do not need to lock.

    
    
        int n=sizeof(struct message);
        struct message m[n/PIPE_BUF];
        assert(sizeof(m)==PIPE_BUF);
        if(pipe(fds)==-1)abort();
        for(i=0;i<cpus;i++){if(!fork())child(fds[1]);}
        while(r=read(fds[0],m,PIPE_BUF)){for(i=r/n-1;i>=0;--i)handle(m+i);}
    

Each child can write entire messages, without locking and without
interleaving, so long as sizeof(struct message)*n is exactly PIPE_BUF.

Btw, your pipe(&pipefd) should've been pipe(pipefd) if pipefd is an array of 2
int

------
JeffJenkins
I just recently discovered fcntl for creating file-based locks (and the python
function which calls it, fcntl.flock) and was pretty excited. It seems like a
much safer way to do multi-process locking than creating lock files on my own

~~~
carey
There is a catch, though; if you have the same file open multiple times, even
with different names, then closing one file descriptor will release _all_
locks on that file in the same process, even in different threads.

SQLite goes to great lengths to work around this. See:
[http://www.google.com/search?q=broken+by+design+site:sqlite....](http://www.google.com/search?q=broken+by+design+site:sqlite.org)

~~~
cperciva
_There is a catch, though; if you have the same file open multiple times, even
with different names, then closing one file descriptor will release all locks
on that file in the same process, even in different threads._

It's even worse than that. A lock is released as soon as _any_ of its owning
processes closes any descriptor to that file. As a result, if a child forks
and exits, the parent no longer holds any locks...

~~~
JeffJenkins
That's definitely worth keeping in mind! Does that apply to sibling processes
as well (like two apache child processes)?

------
snorkel
IPC semaphore ops using the semop() function are also supposed to be atomic at
the kernel level.

------
samuel
One more: mkdir(2)

~~~
jrockway
How would mkdir not be atomic? It is a logical operation that only
encapsulates a single physical operation.

Rename is an interesting case, because it encapsulates two operations - a copy
and a delete. Doing two operations in one step is interesting atomicity. Doing
one operation in one step (mkdir) is not as interesting.

~~~
geocar
See Phrack 25:
[http://www.phrack.com/issues.html?issue=25&id=5](http://www.phrack.com/issues.html?issue=25&id=5)
for what it was like when mkdir was not atomic.

------
sid0
For what it's worth, Windows can do any arbitrary sequence of file operations
atomically, using transactional NTFS.

~~~
briansmith
Transactional NTFS only works on Vista, Server 2003, and later. It only works
on local drives. There are many other limitations. Because of these
limitations, I have not yet been able to use Transacitonal NTFS in any real
program. Somebody always wants to store all their stuff--even critical stuff--
on some network share.

~~~
gaius
And on iSCSI LUNs IIRC, which is how most people would use consolidated
storage (SAN vs NAS).

~~~
briansmith
I mean doing things like accessing an Access database stored on a network
share from multiple clients. I don't think many people are using iSCSI as a
replacement for SMB yet.

