The amusing thing, apart from the fact that we are still discussing the unexpected interactions of locks and NFS after 35 years, is that recommending not using flock() is the reverse of the manual pages and widespread existing practice.
The Linux manual calls the lockf()/fcntl(F_SETLK) mechanism "unfortunate". The BSD manual, across all flavours, is not so restrained and calls them "completely stupid semantics of System V". The BSD manual calls the flock() mechanism "much more rational" and "recommended". The Linux manual likewise recommends using its per-open-file-description mechanism of fcntl(F_OFD_SETLK) as "a Linux-specific alternative with better semantics".
Thus we are in the unenviable situation that the Linux NFS client/server implementation breaks flock(), by translating into POSIX locks under the covers some of the time; but the non-POSIX flock() is the "better", "rational", mechanism that is widely recommended, and that has been long used by the BSD lockf(1) utility and the Bernstein setlock(1) utility and the Pape chpst(8) utility.
For all the dangerous qualities of classic POSIX/SysV locks, they're in service of one key characteristic[1]: only a single process can hold a lock, which means you can easily identify and communicate with the lock-holding process, for example by querying for its PID and sending it a signal (ignoring NFS, which came along later). With BSD descriptor locks, multiple processes can, and often do, hold the lock, by holding dup'd references to the same open file table entry.
[1] Maybe this could have been achieved with better semantics, but my guess is the semantics follow from a very straightforward implementation given the classic 3-level Unix file model of per-process descriptor tables, global open file table, and global open inode table: you just hang the lock index off of the global open inode table, with no back-referencing to the higher-level tables beyond recording the lock owner's PID. SysV mandatory locking perhaps also figured into this design and the semantics.
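To make that "you can find and signal the owner" property concrete, here is a minimal C sketch (mine, not from the article or the comments above) that uses fcntl(F_GETLK) to report the PID holding a conflicting POSIX lock; the file name is made up and error handling is minimal.

/* Sketch: ask the kernel who holds a conflicting POSIX lock on
 * "some.lock", using fcntl(F_GETLK). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("some.lock", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {0};
    fl.l_type = F_WRLCK;      /* the lock we would like to take */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;             /* 0 = the whole file */

    if (fcntl(fd, F_GETLK, &fl) < 0) { perror("F_GETLK"); return 1; }

    if (fl.l_type == F_UNLCK) {
        printf("no conflicting lock; we could take it\n");
    } else {
        /* The "one identifiable owner" property: we could now signal
           the holder, e.g. kill(fl.l_pid, SIGTERM).  (Ignoring NFS,
           where l_pid may not refer to a local process.) */
        printf("conflicting POSIX lock held by pid %ld\n", (long)fl.l_pid);
    }
    return 0;
}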
That's certainly the implementation, v_filocks. But I'd be wary of saying that the implementation drove the design. Remember that people have historically disagreed about synchronization mechanisms and which ones are the good ones. It's quite possible that the design was simply intentionally that way, not following from any implementation considerations.
For example: Dave Cutler, architect of Windows NT, called the OS/2 mutex mechanism a "mutant" because xe thought the semantics were bad. But the semantics that xe didn't like were designed so that a locking thread was told whenever the preceding mutex owner had lost the lock because it ABENDed, and could thus know that whatever the mutex was guarding was probably not in a consistent unlocked state.
One has only to cast one's mind back to how file sharing flags worked, or just didn't work, on MS/PC/DR-DOS when they first came along (hands up if you remember compatibility-mode sharing, or Netware's added spin on the whole thing) to be reminded that sometimes design was just odd by today's standards. (-:
By the way: NFS began in 1984 and was available for AT&T Unix System 5 Release 2. File locking with lockf()/fcntl() dates from AT&T Unix System 5 Release 3 which was actually later, not earlier. Also remember that System 5 Release 3 at the same time introduced RFS, which did do locking and statefully maintained open file descriptions on the server, so the idea of the PID being accessible is a bit of a red herring.
IIRC the initial implementation of what became POSIX locks was cobbled together in a weekend just so that the author could submit something to the POSIX group; but then apparently nobody else submitted anything, or bothered to read and argue about it, so it got standardized as-is, with all of its unfortunate implications.
"...the flock command uses flock() and it's pretty much your best bet for locking in shell scripts."
I find that mkdir is the easiest and most reliable lock between shell scripts under my control, and I use this approach for several variants of inotify (incron, path units, and inotifywait).
#!/bin/sh
if mkdir /tmp/dirlock 2> /dev/null
then
    trap 'rmdir /tmp/dirlock' EXIT
    # ... critical section ...
fi
This also does not require a subshell, so it is more efficient as it avoids a fork() - unlike the flock shell utility.
I don't try to do this over NFS, but mkdir is guaranteed to be atomic, and I have never seen this fail. It's also more secure to lock in a private directory rather than /tmp, unless you want the lock mechanism to be promiscuous for some reason.
True, but it doesn't have to be you who does the cleanup. You could create a background process and establish a temporary UNIX socket connection, so when the parent dies, the socket closes and the child process cleans up everything.
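A minimal C sketch of that parent-death-triggers-cleanup idea, reusing the /tmp/dirlock directory from the script above: the cleanup child just blocks until its end of a socketpair reads EOF (meaning the parent exited or closed its end) and then removes the lock directory. Paths and error handling are illustrative only.

/* Sketch: a helper child that removes /tmp/dirlock when the parent
 * exits, using EOF on a socketpair as the "parent died" signal. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }

    if (mkdir("/tmp/dirlock", 0700) < 0) { perror("mkdir"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                 /* cleanup child */
        char buf;
        close(sv[0]);               /* keep only our end open */
        /* read() returns 0 (EOF) once every descriptor for the other
           end is gone, i.e. the parent has exited for any reason. */
        while (read(sv[1], &buf, 1) > 0)
            ;
        rmdir("/tmp/dirlock");
        _exit(0);
    }

    close(sv[1]);                   /* parent keeps sv[0] open until it exits */

    /* ... critical section ... */
    sleep(5);                       /* stand-in for real work */

    return 0;                       /* child sees EOF and removes the lock */
}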
I wonder if the fact that you can delete a file and still access its contents, as long as a file descriptor is still open, would allow some dirty magic. Probably not, because IIRC you can create a file with the same name right after deletion, even if an fd is open to the old one.
SQLite will open temporary files (for sorting and sundry functions), then immediately unlink() them. This is allowed behavior under POSIX.
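That open-then-unlink pattern, as a minimal sketch (any program can use it, not just SQLite): the directory entry disappears immediately, but the data stays reachable through the descriptor and the storage is reclaimed on the last close.

/* Sketch: an anonymous scratch file via mkstemp() + unlink(). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char path[] = "/tmp/scratch-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) { perror("mkstemp"); return 1; }
    unlink(path);                      /* name gone; fd still works */

    const char msg[] = "temporary data\n";
    write(fd, msg, sizeof msg - 1);

    char buf[64];
    lseek(fd, 0, SEEK_SET);
    ssize_t n = read(fd, buf, sizeof buf - 1);
    if (n > 0) { buf[n] = '\0'; fputs(buf, stdout); }

    close(fd);                         /* storage reclaimed here */
    return 0;
}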
The NFS trick was documented in "Why NFS Sucks" by Olaf Kirch.
"Some other cases are a bit stranger. One such case is the ability to write to an open unlinked file. POSIX says an application can open a file for reading and writing, unlink it, and continue to do I/O on it. The file is not supposed to go away until the last application closes it.
"This is difficult to do over NFS, since traditionally, the NFS server has no concept of “open” files (this was added in NFSv4, however). So when a client removes a file, it will be gone for good, and the file handle is no longer valid—any attempt to read from or write to that file will result in a “Stale file handle” error.
"The way NFS traditionally kludges around this is by doing what has been dubbed a “silly rename.” When the NFS client notices during an unlink call that one or more applications still hold an open file descriptor to this file, it will not send a REMOVE call to the server. Instead, it will rename the file to some temporary file name, usually .nfsXXX where XXX is some hex number. This file will stay around until the last application closes its open file descriptor, and only then will the NFS client send the final REMOVE call to the server that gets rid of this renamed file.
"This sounds like a rather smart sleight of hand, and it is—up to a point. First off, this does not work across different clients. But that should not come as a surprise given the lack of cache consistency."
We used to use this sort of locking in (frequently running) system cron jobs and the like. Then these jobs started getting killed off by Linux OOM on some systems and we ran into the downsides of locks that don't automatically clear if something goes wrong, and switched to flock(1) based locks (fortunately on the local system, so we're not affected by NFS issues there).
Oops, my fallible memory bit me. We weren't specifically running into OOM, but into strict overcommit (which we had turned on on some machines). OOM will only kill big things, so it would be weird for bash (as /bin/sh) or small Python programs to get killed off. But strict overcommit favors already running things (who've already claimed their memory) over newly started cron jobs.
(You could criticize the shell for needing to do dynamic allocation in failure paths and so being exposed to malloc() failing, but this is a hard area and lots of code assumes it can malloc() on demand, some of it in the C library.)
> Unfortunately, this change creates another surprising situation, which is that the NFS server and a NFS client can both obtain an exclusive flock() lock on the same file. Two NFS clients trying to exclusively flock() the same file will conflict with each other and only one will succeed, but the NFS server and an NFS client won't, and both will 'win' the lock (and everyone loses). This is the inevitable but surprising consequence of client side flock() locks being changed to POSIX locks on the NFS server, and POSIX locks not conflicting with flock() locks. From the NFS server's perspective, it's not two flock() exclusive locks on a file; it's one exclusive POSIX lock (from a NFS client) and one exclusive local flock() lock, and that's nominally fine.
One might say "right, so don't run apps on the file server", but if the file server is running multiple file server protocols, some of which want to locally flock(), then you lose unless flock() locks and POSIX locks are made to conflict.
So either rip out this feature of the NFS client on Linux, or make local flock() locks and POSIX locks conflict, at the very least when the filesystem is shared via NFS.
Sort of. Suppose you want to run Samba or WebDAV or whatever to export the same filesystems over something other than NFS. You could export remote shares, I suppose, but the obvious thing to do is to export local filesystems on the server, but now if smbd or whatever wants to use flock(), you have a problem.
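The underlying non-conflict is easy to see locally on Linux: an exclusive flock() lock and an exclusive POSIX write lock on the same file are both granted, even within a single process. A minimal sketch (the file name is made up):

/* Sketch: flock() locks and POSIX fcntl() locks do not conflict with
 * each other on the same local file.  Both "exclusive" locks below
 * are granted. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("demo.lock", O_RDWR | O_CREAT, 0600);  /* one open file description */
    int fd2 = open("demo.lock", O_RDWR);                  /* another one */
    if (fd1 < 0 || fd2 < 0) { perror("open"); return 1; }

    if (flock(fd1, LOCK_EX | LOCK_NB) < 0)
        perror("flock");                  /* does not happen */

    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };  /* whole file */
    if (fcntl(fd2, F_SETLK, &fl) < 0)
        perror("F_SETLK");                /* does not happen either */

    puts("both 'exclusive' locks granted");
    return 0;
}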
Does anybody know how Linux NFS handles new OFD-owned file locks? Are they also mapped to classic POSIX locks?
OFD-owned locks are a cross between BSD flock() and POSIX fcntl()/lockf() locks. The original proposal is described at https://lwn.net/Articles/586904/, though the naming convention settled on OFD-owned (open file description-owned) rather than file-private. I believe Linux was the first to get an implementation. They're also included in the latest POSIX/SUSv5 draft; "POSIX locks" will become ambiguous in the next year or two.
Not quite answering your question, but the man page says
Conflicting lock combinations (i.e., a read lock and a write lock or two write locks) where one lock is an open file description lock and the other is a traditional record lock conflict even when they are acquired by the same process on the same file descriptor.
I think that answers it. And looking back at the specification in the SUSv5 draft, it seems this and analogous behavior has been codified: an OFD lock conflicts with a traditional process-associated record lock and vice versa. So there's only one reasonable option for mapping OFD locks to NFS locks, which is that they map to the same NFS server locks in the same manner as classic POSIX locks (unlike BSD flock(), both support byte-range record locking).
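For reference, taking an OFD lock is the same struct flock dance as a classic POSIX lock, just with the F_OFD_* commands and l_pid forced to 0. A minimal sketch (made-up file name):

/* Sketch: acquire an exclusive open file description (OFD) lock on the
 * whole file.  Unlike classic F_SETLK, the lock belongs to the open
 * file description rather than the process, so closing some *other*
 * descriptor for the same file does not drop it. */
#define _GNU_SOURCE           /* for F_OFD_SETLKW on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.lock", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,        /* whole file */
        .l_pid    = 0,        /* must be 0 for OFD locks */
    };

    if (fcntl(fd, F_OFD_SETLKW, &fl) < 0) {   /* blocking acquire */
        perror("F_OFD_SETLKW");
        return 1;
    }

    /* ... critical section ... */

    close(fd);                /* last reference: releases the lock */
    return 0;
}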
I guess the truth is in the middle. They’re associated with file descriptors, but cloning a file descriptor (e.g. through fork or dup) gives you true clones that both hold the lock.
Or are the Linux and OpenBSD functions subtly different?
The Linux man page also is confusing in that it says
“Only one process may hold an exclusive lock for a given file at a given time”
while, if I understand things correctly, also claiming that a forked process retains the locks of the parent process. So, what happens if that’s an exclusive lock? Does fork fail?
The statements are not contradictory, just confusingly worded. A lock is on a file, held by an open file description.
Also note that, "open file description" != "file descriptor". An open file description exists only in the kernel, and is created on open(2). A file descriptor is the userland-visible handle to this open file description. dup(2) creates a new handle (file descriptor) pointing to this same open file description. Therefore, dup(2), fork(2), etc., do not violate the mutex property of the lock, since they are duplicating the file descriptors (handles), not the open file descriptions (which actually hold the locks).
What's not clear from the man pages is what happens to a locked open file description which is passed to another process via a Unix domain socket. I suspect at that point, both "processes" "hold" the lock -- which in reality simply means they both have a file descriptor referring to the same single open file description which holds the lock, thus not violating the mutex property. Though this contradicts the extensive verbiage in flock(2) which refers to "processes" as being subject to the mutex property, I suspect that is a simplification, as later text in flock(2) indicates that open file descriptions are indeed the subject of the mutex property.
Thanks! Completely overlooked that they talked about ‘description’, not ‘descriptor’ in several places.
It would have helped me if they pointed out that difference in a clearer way. The Linux man page says ‘Locks created by flock() are associated with an open file description (see open(2))’, and that says:
“The term open file description is the one used by POSIX to refer to the entries in the system-wide table of open files. In other contexts, this object is variously also called an "open file object", a "file handle", an "open file table entry", or—in kernel-developer parlance—a struct file.”
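A small experiment makes the descriptor-vs-description distinction visible: a dup()'d descriptor refers to the same open file description, so it already "holds" the flock() lock, while a second open() of the same file creates a new description and gets EWOULDBLOCK. Sketch only, with a made-up file name:

/* Sketch: flock() locks live on open file descriptions.  fd1 and fd2
 * (a dup) share one description; fd3 (a fresh open) has its own and
 * therefore conflicts. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("demo.lock", O_RDWR | O_CREAT, 0600);
    int fd2 = dup(fd1);                       /* same description */
    int fd3 = open("demo.lock", O_RDWR);      /* new description */
    if (fd1 < 0 || fd2 < 0 || fd3 < 0) { perror("open/dup"); return 1; }

    if (flock(fd1, LOCK_EX | LOCK_NB) < 0) { perror("flock fd1"); return 1; }

    if (flock(fd2, LOCK_EX | LOCK_NB) == 0)
        puts("fd2: 'acquired' -- same description, same lock");

    if (flock(fd3, LOCK_EX | LOCK_NB) < 0 && errno == EWOULDBLOCK)
        puts("fd3: EWOULDBLOCK -- a different description conflicts");

    close(fd1);               /* lock NOT released: fd2 still references
                                 the locked description */
    if (flock(fd3, LOCK_EX | LOCK_NB) < 0 && errno == EWOULDBLOCK)
        puts("fd3: still blocked after close(fd1)");

    return 0;
}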
rename() is an atomic and portable method of file locking that has worked everywhere for 30 years. Just rename a file. If you can stat the file, it hasn't been renamed yet. If you try to stat the file and get an error that it doesn't exist, someone already has the lock. When you can stat it, try to rename it; of all the processes racing to rename it, only one can succeed, and if that's you, you have the lock. If rename fails, the lock was taken. Rename the file back to unlock it.
Works on every filesystem because you can't just cache a rename (wouldn't make much sense to try to open a file that doesn't exist after a rename) and you can't have a rename race condition.
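A sketch of that scheme, with made-up names: the lock is free while "lockfile" exists, taking it means renaming it away, and putting it back releases it. Of several racing rename() calls, exactly one can win.

/* Sketch: rename()-based locking.  "lockfile" present == free.
 * Acquire by renaming it away; release by renaming it back. */
#include <stdio.h>

static int acquire(void)
{
    /* Atomic: of N racing processes, exactly one rename() succeeds. */
    return rename("lockfile", "lockfile.held") == 0;
}

static void release(void)
{
    rename("lockfile.held", "lockfile");
}

int main(void)
{
    if (!acquire()) {
        fprintf(stderr, "lock is taken\n");
        return 1;
    }
    /* ... critical section ... */
    release();
    return 0;
}

(Using a per-holder destination name, say suffixed with the PID, would also let you see who took the lock.)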
Wherever possible it is better to use filelocks (a named file or directory) or SQLite (which deals with the arcana for you and has been both unit and battle tested).
Using file locking APIs is just a minefield of pain otherwise.
You can get effectively the same behavior as an mkdir-based lockfile by using BEGIN IMMEDIATE instead of just BEGIN with a WAL-based SQLite db (other journal modes should use BEGIN EXCLUSIVE). Using this, statements will immediately fail if another process is writing to the db, and you can use sqlite3_busy_timeout() to manage automatic retries for you (if desired).
The benefits over the lockfile approach include:
- transactional semantics & rollbacks
- optional concurrent read only queries in WAL mode if you initiate them with BEGIN DEFERRED that operate on a snapshot of data and don't impede the writing process
- simple online backups via VACUUM INTO.
Cons are that it is heavier weight than a mkdir call, and if all you want to do is elect a master process from a group and don't have structured data to operate on transactionally, then it's not worth it.
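A minimal sketch of that pattern, assuming the SQLite C API (the same idea works from any binding): BEGIN IMMEDIATE takes the write lock up front, and sqlite3_busy_timeout() turns "lock is taken" into a bounded wait. The database path is made up.

/* Sketch: use a WAL-mode SQLite database as a mutex.  BEGIN IMMEDIATE
 * fails with SQLITE_BUSY if another connection is already writing;
 * busy_timeout retries that for up to 5 seconds. */
#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("app.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);
    sqlite3_busy_timeout(db, 5000);           /* retry for up to 5 s */

    int rc = sqlite3_exec(db, "BEGIN IMMEDIATE;", NULL, NULL, NULL);
    if (rc == SQLITE_BUSY) {
        fprintf(stderr, "another process holds the write lock\n");
        sqlite3_close(db);
        return 1;
    }

    /* ... do the work that must be exclusive, with real SQL ... */

    sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);   /* releases the lock */
    sqlite3_close(db);
    return 0;
}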
“To accelerate searching the WAL, SQLite creates a WAL index in shared memory. This improves the performance of read transactions, but the use of shared memory requires that all readers must be on the same machine [and OS instance]. Thus, WAL mode does not work on a network filesystem.”
“It is not possible to change the page size after entering WAL mode.”
“In addition, WAL mode comes with the added complexity of checkpoint operations and additional files to store the WAL and the WAL index.”
SQLite does not guarantee ACID consistency with ATTACH DATABASE in WAL mode: “Transactions involving multiple attached databases are atomic, assuming that the main database is not ":memory:" and the journal_mode is not WAL. If the main database is ":memory:" or if the journal_mode is WAL, then transactions continue to be atomic within each individual database file. But if the host computer crashes in the middle of a COMMIT where two or more database files are updated, some of those files might get the changes where others might not.”
I just don't think that SQLite is an appropriate alternative to mkdir() locking.
All the peculiarities of NFS actually made it so operationally it is optimal for us to just have the NFS server be incredibly dumb. All the action is on clients. Modern Linux permits xattr and flock over NFS so everything will work out so long as you use precisely one kind of lock over NFS and do your xattrs over NFS as well.
Honestly, it would be pretty nice to have some sort of path locking and an ability to request delegation on NFS rather than have it be this automatic conflict-resolution style. But whatever, ultimately it works.
There is a process I have that uses flock(1) with an exclusive read lock. The intent was that other processes could read the data but not be allowed to change it.
It "works", but I noticed that many newer utilities no longer check for these locks. The original vi and nvi check, so the file is opened read-only. But vim, and surprisingly emacs, do not check. That means people can change that file with vim/emacs while the process is executing.
fantastic article on flocks and NFS! and short and sweet
I was bitten many years ago by a few of the gotchas he covers in it. Great to see clear and precise treatment like this. I like to keep an accurate model of the system in my mind, and pieces like this help me to do that a little better.
Speaking of locks: if you are writing something that can run a long time and produces a log file, it might be nice to lock the log file before appending a new log entry.
If you do that, someone who wants to rotate logs can do it by (1) locking the log file, (2) copying it to the rotated file, (3) truncating the log file, and (4) releasing the lock.
Here's a C program, cat-trunc, that does that. Usage is "cat-trunc logfile".
That's a little more complicated than is necessary, because it tries to minimize the time it holds the lock.
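Not the author's cat-trunc (which, as noted, is more careful about how long it holds the lock), just a minimal sketch of the four-step rotation: lock, copy, truncate, unlock. It assumes the writer also takes the flock() around each append and has the log open with O_APPEND, so writes after the truncation land at the new end of file.

/* Sketch: rotate a flock()-protected log file.
 * Usage: rotate logfile rotatedfile */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: rotate log rotated\n"); return 2; }

    int in = open(argv[1], O_RDWR);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    if (flock(in, LOCK_EX) < 0) { perror("flock"); return 1; }    /* (1) lock */

    char buf[65536];                                              /* (2) copy */
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0)
        if (write(out, buf, n) != n) { perror("write"); return 1; }

    if (ftruncate(in, 0) < 0) { perror("ftruncate"); return 1; }  /* (3) truncate */

    flock(in, LOCK_UN);                                           /* (4) unlock */
    close(out);
    close(in);
    return 0;
}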
If you've got a long-running program whose stdout you want to log, here's a C++ program (but written like it is C plus std::string and std::list, because I am lazy), lock-save, to help with that. Usage is "lock-save [-t] file". It copies stdin to file, writing a line at a time, locking the file for those writes. You can then use cat-trunc to rotate the file. The -t option makes it also copy stdin to stdout, making it essentially tee with locks.
...what's wrong with writing to stderr? If you do that, then whatever is on its reading side can route the logs wherever and however it wants to, without any need for locking/truncating shenanigans. Or is it way too simple?
Whereas rotating logs as aforegiven, locked or otherwise, has an unfixable race condition where logs can grow without bounds, because it divorces the log rotator from the log writer. The aforegiven tooling is merely wallpapering over huge cracks and not fixing the actual problems, which were, ironically, fixed with better tooling as long ago as the 1990s.