File Locks on Linux (utcc.utoronto.ca)
100 points by miohtama on May 9, 2023 | 46 comments



The amusing thing, apart from the fact that we are still discussing the unexpected interactions of locks and NFS after 35 years, is that recommending not using flock() is the reverse of the manual pages and widespread existing practice.

The Linux manual calls the lockf()/fcntl(F_SETLK) mechanism "unfortunate". The BSD manual, across all flavours, is not so restrained and calls them "completely stupid semantics of System V". The BSD manual calls the flock() mechanism "much more rational" and "recommended". The Linux manual likewise recommends using its per-open-file-description mechanism of fcntl(F_OFD_SETLK) as "a Linux-specific alternative with better semantics".

Thus we are in the unenviable situation that the Linux NFS client/server implementation breaks flock(), by translating into POSIX locks under the covers some of the time; but the non-POSIX flock() is the "better", "rational", mechanism that is widely recommended, and that has been long used by the BSD lockf(1) utility and the Bernstein setlock(1) utility and the Pape chpst(8) utility.
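For reference, taking one of those "better semantics" F_OFD_SETLK locks looks roughly like this. A minimal sketch, Linux-specific (glibc wants _GNU_SOURCE for the F_OFD_* constants); the lock file path is hypothetical and error handling is trimmed:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      struct flock l = { 0 };    /* l_pid must be 0 for OFD locks */
      int fd = open("/tmp/demo.lock", O_RDWR | O_CREAT, 0666);

      l.l_type = F_WRLCK;        /* exclusive */
      l.l_whence = SEEK_SET;
      l.l_start = 0;
      l.l_len = 0;               /* zero length = the whole file */
      if (fcntl(fd, F_OFD_SETLKW, &l) == -1)   /* blocking variant */
          perror("F_OFD_SETLKW");
      /* ... critical section ... */
      close(fd);    /* dropping the last reference releases the lock */
      return 0;
  }

The lock belongs to the open file description rather than to the process, so it survives fork() in the child and goes away only when the last descriptor referring to it is closed.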


For all the dangerous qualities of classic POSIX/SysV locks, they're in service of one key characteristic[1]: only a single process can hold a lock, which means you can easily identify and communicate with the lock-holding process, such as by querying for and sending a signal to the PID (ignoring NFS, which came along later). With BSD descriptor locks, multiple processes can, and routinely do, hold the lock at once, by holding dup'd references to the same open file table entry.

[1] Maybe this could have been achieved with better semantics, but my guess is that the semantics follow from a very straightforward implementation given the classic 3-level Unix file model of per-process descriptor tables, a global open file table, and a global open inode table: you just hang the locks off of the global open inode table, with no back references to the higher-level tables beyond recording the lock owner's PID. SysV mandatory locking perhaps also figured into this design and the semantics.
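To make the descriptor-lock sharing concrete: a quick sketch (hypothetical lock file, error handling omitted) of a fork'd child finding that it already "holds" the parent's flock() lock, because both descriptors refer to the same open file table entry:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/file.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/tmp/demo.lock", O_RDWR | O_CREAT, 0666);

      flock(fd, LOCK_EX);            /* parent takes the lock */
      if (fork() == 0) {
          /* Same open file description, same lock: re-asserting
             it from the child succeeds immediately. */
          printf("child: %d\n", flock(fd, LOCK_EX | LOCK_NB));  /* 0 */
          _exit(0);                  /* child exit does not unlock */
      }
      wait(NULL);
      close(fd);        /* last reference gone: lock released */
      return 0;
  }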


That's certainly the implementation, v_filocks. But I'd be wary of saying that the implementation drove the design. Remember that people have historically disagreed about synchronization mechanisms and which ones are the good ones. It's quite possible that the design was simply intentionally that way, not following from any implementation considerations.

For example: Dave Cutler, architect of Windows NT, called the OS/2 mutex mechanism a "mutant" because xe thought the semantics were bad. But the semantics that xe didn't like were designed so that a locking thread was told whenever the preceding mutex owner had lost the lock because it ABENDed, and that thus whatever the mutex was guarding was probably left in an inconsistent state.

One has only to cast one's mind back to how file sharing flags worked, or just didn't work, on MS/PC/DR-DOS when they first came along (hands up if you remember compatibility-mode sharing, or Netware's added spin on the whole thing) to be reminded that sometimes design was just odd by today's standards. (-:

By the way: NFS began in 1984 and was available for AT&T Unix System 5 Release 2. File locking with lockf()/fcntl() dates from AT&T Unix System 5 Release 3 which was actually later, not earlier. Also remember that System 5 Release 3 at the same time introduced RFS, which did do locking and statefully maintained open file descriptions on the server, so the idea of the PID being accessible is a bit of a red herring.


IIRC the initial implementation of what became POSIX locks was cobbled together over a weekend just so that the author could submit something to the POSIX group; apparently nobody else submitted anything, or bothered to read and argue about it, so it got standardized as-is, with all of its unfortunate implications.


I thought that I was done when I augmented https://jdebp.uk/Softwares/nosh/guide/commands/setlock.xml to discuss this in more detail than it previously did.

Then along came Joseph Holsten, author of another flock(1) utility, contacting me to throw the cat amongst the pigeons. (See https://mstdn.social/@josephholsten/110343362815736310 .)


"...the flock command uses flock() and it's pretty much your best bet for locking in shell scripts."

I find that mkdir is the easiest and most reliable lock between shell scripts under my control, and I use this approach for several variants of inotify (incron, path units, and inotifywait).

  #!/bin/sh
  if mkdir /tmp/dirlock 2> /dev/null
  then trap 'rmdir /tmp/dirlock' EXIT
       # ... critical section ...
  fi
This also does not require a subshell, so it is more efficient as it avoids a fork() - unlike the flock shell utility.

I don't try to do this over NFS, but mkdir is guaranteed to be atomic, and I have never seen this fail. It's also more secure to lock in a private directory rather than /tmp, unless you want the lock mechanism to be promiscuous for some reason.


Unlike real locks, your approach has the downside that after an unexpected exit (e.g., kill -9) the lock stays held until someone manually intervenes.


Sometimes that's precisely what you want, if the unfinished script should run again without being checked.


This was actually a problem on dash, so I added several more signals to the trap (INT, TERM, etc.).

Nothing saves you from kill -9, as we all know.

I did add a reboot entry to cron (@reboot with Vixie cron) to remove the directory.

I can say that this method has processed millions of Oracle archived logs with no errors.


> Nothing saves you from kill -9, as we all know.

True, but it doesn't have to be you who does the cleanup. You could create a background process and establish a temporary UNIX socket connection, so when the parent dies, the socket closes and the child process cleans up everything.
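A sketch of that janitor pattern, reusing the mkdir lock from upthread (hypothetical paths, error handling omitted). The child sees EOF on its end of the socketpair no matter how the parent dies, kill -9 included:

  #include <stdio.h>
  #include <sys/socket.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
      int sv[2];
      char c;

      if (mkdir("/tmp/dirlock", 0700) == -1)
          return 1;                  /* someone else holds the lock */
      socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
      if (fork() == 0) {
          /* Janitor: read() returns 0 (EOF) once every copy of the
             parent's end is closed, i.e. once the parent is gone. */
          close(sv[0]);
          while (read(sv[1], &c, 1) > 0)
              ;
          rmdir("/tmp/dirlock");
          _exit(0);
      }
      close(sv[1]);
      /* ... critical section; keep sv[0] open the whole time ... */
      return 0;                      /* exit closes sv[0]: cleanup runs */
  }

One caveat: any further children that inherit the parent's end will delay the EOF, so in real use you would mark sv[0] close-on-exec.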


I wonder if the fact that you can delete a file and still access its contents as long as a file descriptor is still open would allow some dirty magic. Probably not, because IIRC you can create a file with the same name right after deletion even if an fd is open to the old one.


SQLite will open temporary files (for sorting and sundry functions), then immediately unlink() them. This is allowed behavior under POSIX.
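The pattern, sketched with a hypothetical path:

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/tmp/scratch.tmp", O_RDWR | O_CREAT | O_EXCL, 0600);

      unlink("/tmp/scratch.tmp");   /* the name is gone immediately... */
      write(fd, "spill", 5);        /* ...but the inode is still ours */
      lseek(fd, 0, SEEK_SET);
      /* read it back as needed; storage is reclaimed at close()/exit */
      close(fd);
      return 0;
  }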

The NFS trick was documented in "Why NFS Sucks" by Olaf Kirch.

"Some other cases are a bit stranger. One such case is the ability to write to an open unlinked file. POSIX says an application can open a file for reading and writing, unlink it, and continue to do I/O on it. The file is not supposed to go away until the last application closes it.

"This is difficult to do over NFS, since traditionally, the NFS server has no concept of “open” files (this was added in NFSv4, however). So when a client removes a file, it will be gone for good, and the file handle is no longer valid— and attempt to read from or write to that file will result in a “Stale file handle” error.

"The way NFS traditionally kludges around this is by doing what has been dubbed a “silly re- name.” When the NFS client notices during an unlink call that one or more applications still hold an open file descriptor to this file, it will not send a REMOVE call to the server. Instead, it will rename the file to some temporary file name, usually .nfsXXX where XXX is some hex number. This file will stay around until the last application closes its open file descriptor, and only then will the NFS client send the final REMOVE call to the server that gets rid of this renamed file.

"This sounds like a rather smart sleight of hand, and it is—up to a point. First off, this does not work across different clients. But that should not come as a surprise given the lack of cache consistency."

https://www.kernel.org/doc/ols/2006/ols2006v2-pages-59-72.pd...


> Nothing saves you from kill -9, as we all know.

Making a syscall that gets stuck somewhere in the kernel, leaving your process in the D state, will keep you alive ;D.


We used to use this sort of locking in (frequently running) system cron jobs and the like. Then these jobs started getting killed off by the Linux OOM killer on some systems, and we ran into the downsides of locks that don't automatically clear if something goes wrong, and switched to flock(1) based locks (fortunately on the local system, so we're not affected by NFS issues there).

(I'm the author of the linked-to entry.)


Switch the shell from bash to dash (which is the default on Ubuntu) if that is part of the memory problem.

Dash is nearly 10x smaller than bash, and rumored to be 4x faster.

Otherwise, have you compiled any object code with -Os?


Oops, my fallible memory bit me. We weren't specifically running into OOM, but into strict overcommit (which we had enabled on some machines). OOM will only kill big things, so it would be weird for bash (as /bin/sh) or small Python programs to get killed off. But strict overcommit favors already running things (which have already claimed their memory) over newly started cron jobs.

(You could criticize the shell for needing to do dynamic allocation in failure paths and so being exposed to malloc() failing, but this is a hard area and lots of code assumes it can malloc() on demand, some of it in the C library.)


I spent some time formalizing mkdir/noclobber locks (with lots of help from #bash): https://github.com/jakeogh/commandlock

Long term work in progress... needs a simple Makefile


> Unfortunately, this change creates another surprising situation, which is that the NFS server and a NFS client can both obtain an exclusive flock() lock on the same file. Two NFS clients trying to exclusively flock() the same file will conflict with each other and only one will succeed, but the NFS server and an NFS client won't, and both will 'win' the lock (and everyone loses). This is the inevitable but surprising consequence of client side flock() locks being changed to POSIX locks on the NFS server, and POSIX locks not conflicting with flock() locks. From the NFS server's perspective, it's not two flock() exclusive locks on a file; it's one exclusive POSIX lock (from a NFS client) and one exclusive local flock() lock, and that's nominally fine.

One might say "right, so don't run apps on the file server", but if the file server is running multiple file server protocols some of which want to locally flock(), then you lose unless flocks and POSIX locks are made to conflict.

So either rip out this feature of the NFS client on Linux, or make local flock() locks and POSIX locks conflict, at the very least when the filesystem is shared via NFS.


Doesn't it all just solve itself if you leave the original data alone and mount the root of all those file servers through NFS?

You don't need the extreme of not running anything else on the server. You just need to not give it access to the original files.


Sort of. Suppose you want to run Samba or WebDAV or whatever to export the same filesystems over something other than NFS. You could export remote shares, I suppose, but the obvious thing to do is to export local filesystems on the server, but now if smbd or whatever wants to use flock(), you have a problem.


Does anybody know how Linux NFS handles new OFD-owned file locks? Are they also mapped to classic POSIX locks?

OFD-owned locks are a cross between BSD flock and POSIX fcntl/lockf locks. An original proposal is described at https://lwn.net/Articles/586904/, but the naming convention settled on OFD-owned (open file description-owned) rather than file-private. I believe Linux was the first to get an implementation. They're also included in the latest POSIX/SUSv5 draft; "POSIX locks" will become ambiguous in the next year or two.


Not quite answering your question, but the man page says

  Conflicting lock combinations (i.e., a read lock and a write lock or two write locks) where one lock is an open file description lock and the other is a traditional record lock conflict even when they are acquired by the same process on the same file descriptor.


I think that answers it. And looking back at the specification in the SUSv5 draft, it seems this and analogous behavior has been codified: an OFD lock conflicts with a process-private lock, and vice versa. So there's only one reasonable option for mapping OFD locks to NFS locks, which is that they both map to the same NFS server lock in the same manner (unlike BSD flock, both support record locking).
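A minimal sketch of that codified conflict (Linux, _GNU_SOURCE, hypothetical path): one process takes an OFD write lock and then fails to take a traditional record lock on the very same descriptor:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>

  int main(void)
  {
      int fd = open("/tmp/demo.lock", O_RDWR | O_CREAT, 0666);
      struct flock ofd = { 0 }, rec = { 0 };

      ofd.l_type = rec.l_type = F_WRLCK;     /* whole-file write locks */
      ofd.l_whence = rec.l_whence = SEEK_SET;

      fcntl(fd, F_OFD_SETLK, &ofd);          /* OFD lock: succeeds */
      /* A traditional record lock now conflicts, per the man page,
         even though it is the same process and the same descriptor: */
      if (fcntl(fd, F_SETLK, &rec) == -1)
          perror("F_SETLK");                 /* EAGAIN expected */
      return 0;
  }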


Reading the Linux and BSD man pages for flock confuses me.

https://man7.org/linux/man-pages/man2/flock.2.html says:

“Locks created by flock() are associated with an open file description”

On the other hand https://man.openbsd.org/flock.2 says:

“Locks are on files, not file descriptors.”

I guess the truth is in the middle. They’re associated with file descriptors, but cloning a file descriptor (e.g. through fork or dup) gives you true clones that both hold the lock.

Or are the Linux and OpenBSD functions subtly different?

The Linux man page also is confusing in that it says

“Only one process may hold an exclusive lock for a given file at a given time”

while, if I understand things correctly, also claiming that a forked process retains the locks of the parent process. So, what happens if that’s an exclusive lock? Does fork fail?


The statements are not contradictory, just confusingly worded. A lock is on a file, held by an open file description.

Also note that "open file description" != "file descriptor". An open file description exists only in the kernel, and is created on open(2). A file descriptor is the userland-visible handle to this open file description. dup(2) creates a new handle (file descriptor) pointing to this same open file description. Therefore, dup(2), fork(2), etc., do not violate the mutex property of the lock, since they duplicate the file descriptors (handles), not the open file descriptions (which actually hold the locks).

What's not clear from the man pages is what happens to a locked open file description which is passed to another process via a Unix domain socket. I suspect at that point, both "processes" "hold" the lock -- which in reality simply means they both have a file descriptor referring to the same single open file description which holds the lock, thus not violating the mutex property. Though this contradicts the extensive verbiage in flock(2) which refers to "processes" as being subject to the mutex property, I suspect that is a simplification, as later text in flock(2) indicates that open file descriptions are indeed the subject of the mutex property.
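For anyone who hasn't seen the mechanics: the passing is done with an SCM_RIGHTS control message over the socket. A sketch of the sending side (send_fd is a hypothetical helper; error handling trimmed). The receiver ends up with a new descriptor number referring to the same open file description, so "both processes hold the lock" in exactly the sense described above:

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Ship fd over a connected AF_UNIX socket. */
  int send_fd(int sock, int fd)
  {
      char byte = 0;
      struct iovec iov = { &byte, 1 };    /* must send at least a byte */
      union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
      struct msghdr msg = { 0 };
      struct cmsghdr *cm;

      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = u.buf;
      msg.msg_controllen = sizeof u.buf;

      cm = CMSG_FIRSTHDR(&msg);
      cm->cmsg_level = SOL_SOCKET;
      cm->cmsg_type = SCM_RIGHTS;         /* "pass these descriptors" */
      cm->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cm), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }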


Thanks! Completely overlooked that they talked about ‘description’, not ‘descriptor’ in several places.

It would have helped me if they pointed out that difference in a clearer way. The Linux man page says ‘Locks created by flock() are associated with an open file description (see open(2))’, and that says:

“The term open file description is the one used by POSIX to refer to the entries in the system-wide table of open files. In other contexts, this object is variously also called an "open file object", a "file handle", an "open file table entry", or—in kernel-developer parlance—a struct file.”


> Thanks! Completely overlooked that they talked about ‘description’, not ‘descriptor’ in several places.

The classic "close reading" that man pages require...


In Windows land, those things are indeed called "file objects" and "file handles" respectively, which is a much less confusing terminology.


Not mentioned: “Open file description locks” (https://man7.org/linux/man-pages/man2/fcntl.2.html), which are the locks you usually actually want.


rename() is an atomic and portable method of file locking that has worked everywhere for 30 years. Just rename a file: if you can stat the file under its original name, the lock is free; if stat fails because the file doesn't exist, someone already has the lock. When you can stat it, try to rename it; rename() succeeds for exactly one caller, and if it succeeded for you, you have the lock. If it fails, the lock was taken. Rename the file back to unlock it.

Works on every filesystem because you can't just cache a rename (wouldn't make much sense to try to open a file that doesn't exist after a rename) and you can't have a rename race condition.
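Sketched in C, with hypothetical lock names (the "free" file has to be created once beforehand):

  #include <stdio.h>

  int take_lock(void)
  {
      /* rename(2) is atomic: of N racing contenders, exactly one
         rename succeeds; the rest fail with ENOENT because the
         source name is already gone. */
      return rename("/tmp/mylock.free", "/tmp/mylock.held");
  }

  int drop_lock(void)
  {
      return rename("/tmp/mylock.held", "/tmp/mylock.free");
  }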


Wherever possible it is better to use filelocks (a named file or directory) or SQLite (which deals with the arcana for you and has been both unit and battle tested).

Using file locking APIs is just a minefield of pain otherwise.


I find that SQLITE_BUSY causes too many failures when used in this capacity.

Better to use a guaranteed atomic call; my preference is mkdir.


You can get effectively the same behavior as an mkdir-based lockfile by using BEGIN IMMEDIATE instead of plain BEGIN with a WAL-mode SQLite db (other journal modes should use BEGIN EXCLUSIVE). With this, statements fail immediately if another process is writing to the db, and you can use sqlite3_busy_timeout to manage automatic retries for you (if desired); there's a sketch at the end of this comment.

The benefits over the lockfile approach include:

- transactional semantics & rollbacks

- optional concurrent read only queries in WAL mode if you initiate them with BEGIN DEFERRED that operate on a snapshot of data and don't impede the writing process

- simple online backups via VACUUM INTO.

Cons are that it is heavier weight than a mkdir call, and if all you want to do is elect a master process from a group and don't have structured data to operate on transactionally, then it's not worth it.
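Here's the promised sketch, using the C API (hypothetical database path; link with -lsqlite3):

  #include <stdio.h>
  #include <sqlite3.h>

  int main(void)
  {
      sqlite3 *db;

      if (sqlite3_open("/tmp/jobs.db", &db) != SQLITE_OK)
          return 1;
      sqlite3_exec(db, "PRAGMA journal_mode=WAL;", 0, 0, 0);
      sqlite3_busy_timeout(db, 5000);    /* auto-retry for up to 5s */

      /* BEGIN IMMEDIATE grabs the write lock up front, so contention
         surfaces here (SQLITE_BUSY after the timeout) instead of
         halfway through the transaction. */
      if (sqlite3_exec(db, "BEGIN IMMEDIATE;", 0, 0, 0) == SQLITE_OK) {
          /* ... transactional critical section ... */
          sqlite3_exec(db, "COMMIT;", 0, 0, 0);
      } else {
          fprintf(stderr, "another process holds the write lock\n");
      }
      sqlite3_close(db);
      return 0;
  }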


Oh my, you must understand the limitations of WAL mode.

You cannot use WAL mode on networked filesystems, because the processes cannot all see the shared memory.

Below are well-known limitations of WAL mode.

https://www.vldb.org/pvldb/vol15/p3535-gaffney.pdf

“To accelerate searching the WAL, SQLite creates a WAL index in shared memory. This improves the performance of read transactions, but the use of shared memory requires that all readers must be on the same machine [and OS instance]. Thus, WAL mode does not work on a network filesystem.”

“It is not possible to change the page size after entering WAL mode.”

“In addition, WAL mode comes with the added complexity of checkpoint operations and additional files to store the WAL and the WAL index.”

https://www.sqlite.org/lang_attach.html

SQLite does not guarantee ACID consistency with ATTACH DATABASE in WAL mode: “Transactions involving multiple attached databases are atomic, assuming that the main database is not ":memory:" and the journal_mode is not WAL. If the main database is ":memory:" or if the journal_mode is WAL, then transactions continue to be atomic within each individual database file. But if the host computer crashes in the middle of a COMMIT where two or more database files are updated, some of those files might get the changes where others might not.”

I just don't think that SQLite is an appropriate alternative to mkdir() locking.

It's just the wrong tool.


All the peculiarities of NFS mean that, operationally, it is optimal for us to just have the NFS server be incredibly dumb. All the action is on the clients. Modern Linux permits xattrs and flock over NFS, so everything will work out so long as you use precisely one kind of lock over NFS and do your xattrs over NFS as well.

Honestly, it would be pretty nice to have some sort of path locking and an ability to request delegation on NFS rather than have it be this automatic conflict-resolution style. But whatever, ultimately it works.


NFS has so many stupid warts that I often just use samba to share between Linux servers and feel bad about it.


The NFSv4 client is not dumb: it implements delegations, which the server can recall from a client after they are granted.

Unlinking a file over NFS has also been implemented as renaming to a temporary file, so sleight-of-hand intelligence was present in earlier versions.


The locks that never actually lock unless all the gotchas are taken care of, and that even then might break on other UNIX platforms.

Hence why I rather prefer the locks on non-UNIX platforms; they actually lock.


There is a process I have that uses flock(1) with an exclusive read lock. The intent was that other processes can read the data but not be allowed to change it.

It "works", but I noticed many newer utilities no longer checks for these locks. The original vi and nvi checks, so the file is opened read-only. But vim and surprisingly emacs do not check. That means people can change that file while the process is executing using vim/emacs.


What's an "exclusive read lock"?


If you are reading a file and put a read lock on it, vi(1) will refuse to write to it, but can read it.


Fantastic article on flock() locks and NFS! And short and sweet.

I was bitten many years ago by a few of the gotchas he covers in it. Great to see clear and precise treatment like this. I like to keep an accurate model of the system in my mind, and pieces like this help me do that a little better.


Speaking of locks: if you are writing something that can run a long time and produces a log file, it might be nice to lock the log file before appending each new log entry.

If you do that, someone who wants to rotate logs can do it by (1) locking the log file, (2) copying it to the rotated file, (3) truncating the log file, and (4) releasing the lock.

Here's a C program, cat-trunc, that does that. Usage is "cat-trunc logfile".

  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  char buf[65536];

  int main(int argc, char *argv[])
  {
    struct flock l;
    ssize_t got;
    ssize_t copied = 0;
    off_t eof;
    int finished = 0;
    int status = 0;

    if (argc != 2)
    {
        fprintf(stderr, "%s: error: no log file specified!\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd == -1)
    {
        perror(argv[0]);
        return 2;
    }

    l.l_whence = SEEK_SET;
    l.l_start = 0;
    l.l_len = 0;

    while (! finished) {
        got = read(fd, buf, sizeof buf);
        if (got > 0) {
            copied += got;
            if (write(1, buf, got) != got) {
                status = -1;
                finished = 1;
                perror(argv[0]);
            }
        } else if (got == 0) {
            /* Hit EOF: take the lock, then re-check where EOF really
               is, since a writer may have appended in the meantime. */
            l.l_type = F_WRLCK;
            fcntl(fd, F_SETLKW, &l);
            eof = lseek(fd, 0, SEEK_END);
            if (eof == copied) {
                ftruncate(fd, 0);   /* nothing new: safe to truncate */
                finished = 1;
            } else {
                /* more was appended; copy the rest before truncating */
                lseek(fd, copied, SEEK_SET);
            }
            l.l_type = F_UNLCK;
            fcntl(fd, F_SETLKW, &l);
        } else {
            status = -1;
            finished = 1;
            perror(argv[0]);
        }
    }
    return status;
  }
That's a little more complicated than is necessary, because it tries to minimize the time it holds the lock.

If you've got a long-running program whose stdout you want to log, here's a C++ program (but written like it is C plus std::string and std::list, because I am lazy), lock-save, to help with that. Usage is "lock-save [-t] file". It copies stdin to file, writing a line at a time, locking the file for those writes. You can then use cat-trunc to rotate the file. The -t option makes it also copy stdin to stdout, making it essentially tee with locks.

  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <string>
  #include <list>

  std::list<std::string>  lines;

  int main(int argc, char *argv[])
  {
    char * prog_name = argv[0];
    char * file_name = 0;
    int t_flag = 0;
    struct flock l;
    int c;
    std::string line;

    if (argc > 1 && strcmp(argv[1], "-t") == 0) {
        t_flag = 1;
    }
    if (argc != 2+t_flag) {
        fprintf(stderr, "Usage: %s [-t] file\n", prog_name);
        return 1;
    }
    file_name = argv[1+t_flag];

    int fd = open(file_name, O_RDWR | O_CREAT | O_APPEND, 0666);
    if (fd == -1) {
        perror(prog_name);
        return 2;
    }

    l.l_whence = SEEK_SET;
    l.l_start = 0;
    l.l_len = 0;
    while ((c = getchar()) != EOF) {
        if (t_flag) {
            putchar(c);
        }
        line += c;
        if (c == '\n') {
            lines.push_back(line);
            l.l_type = F_WRLCK;
            /* Non-blocking lock: if the rotator holds it right now,
               just keep queueing lines and retry on the next one. */
            if (fcntl(fd, F_SETLK, &l) != -1) {
                while (lines.size())
                {
                    line = lines.front();
                    lines.pop_front();
                    write(fd, line.c_str(), line.length());
                }
                l.l_type = F_UNLCK;
                fcntl(fd, F_SETLK, &l);
                line = "";
            }
        }
    }
    return 0;
  }


...what's wrong with writing to stderr? If you do that, then whatever is on its reading side can route the logs wherever and however it wants to, without any need for locking/truncating shenanigans. Or is it way too simple?


Exactly. Nothing at all is wrong with it.

Whereas rotating logs as aforegiven, locked or otherwise, has an unfixable race condition where logs can grow without bounds, because it divorces the log rotator from the log writer. The aforegiven tooling is merely wallpapering over huge cracks and not fixing the actual problems, which were, ironically, fixed with better tooling as long ago as the 1990s.

* https://jdebp.uk/FGA/do-not-use-logrotate.html


Luckily everyone is using journald nowadays and does not need to worry about this (:



