
Mandatory File Locking for the Linux Operating System (2007) - luu
https://www.kernel.org/doc/Documentation/filesystems/mandatory-locking.txt
======
Animats
As a retrofit, that never worked out.

I've suggested before that files should be divided into several types - unit,
log, temp, and managed. This also addresses the locking problem.

Unit files are always updated as a unit - once written and closed, they can't
be changed, only replaced. Using "creat" creates a new file, which replaces
the old one on clean close. A crash or a program abort reverts to the old
file. So they implicitly have a form of mandatory locking. Should two
processes be able to create new updates at the same time? Probably not. This
is the default kind of file.

Log files should be append-only. Multiple users can write, but only at the
end. That deals with the locking problem there. That's what open for append
should do.
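
For what it's worth, plain POSIX `O_APPEND` already gets part of the way there. A minimal sketch, assuming a local filesystem and one `write()` per record; `log_append` is a made-up helper name, not an existing API. It does not stop some other program from opening the file without `O_APPEND` and rewriting the middle, which is the mandatory part of the proposal.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Append one record to a log file (hypothetical helper).
 * With O_APPEND the kernel moves the offset to end-of-file before each
 * write(), so cooperating writers do not overwrite each other.
 * Returns 0 on success, -1 on error or short write. */
int log_append(const char *path, const char *record)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(record);
    ssize_t n = write(fd, record, len);   /* one write() per record */
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```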

Temp files disappear at reboot, and should have N-readers or 1 writer locking.
Anything in /tmp gets this treatment.
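
The closest thing available today is advisory: `flock()` gives N readers or 1 writer, but only among processes that bother to ask. A rough sketch under that assumption; `open_locked` is a hypothetical helper, and nothing here is enforced against a process that just calls `open()`.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open path and take an advisory flock() lock without blocking
 * (hypothetical helper). exclusive != 0 asks for the single-writer
 * lock (LOCK_EX), otherwise a shared reader lock (LOCK_SH).
 * Returns a locked descriptor, or -1 if the open or the lock failed;
 * errno is EWOULDBLOCK when a conflicting lock is already held. */
int open_locked(const char *path, int exclusive)
{
    int fd = open(path, exclusive ? (O_RDWR | O_CREAT) : O_RDONLY, 0600);
    if (fd < 0)
        return -1;

    int op = (exclusive ? LOCK_EX : LOCK_SH) | LOCK_NB;
    if (flock(fd, op) < 0) {
        int saved = errno;   /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;               /* the lock is dropped when fd is closed */
}
```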

Managed files are for databases. Shared access and partial file locking are
supported. Managed files support an API where you get a callback when the data
has been safely committed to disk, something ACID databases care about. Only a
few programs use managed files, and you know which ones they are. Managed file
mode has to be explicitly turned on for a file.

That's how locking ought to work.

~~~
sargun
Have you tried proposing this to be included in POSIX? Or tried implementing
this in the VFS layer in Linux? I imagine all of these could be xattrs.

~~~
Animats
The usual objections are:

- Unit file updates are not needed because of (complex non-portable
workaround for atomic file replacement).

- It's not backwards compatible with NFS.

- Lock files are good enough.

- If I want to seek while adding to a log file and rewrite in the middle, I
should be allowed to do so.

Database people like the idea of a callback when data is safely on disk. They
work hard to get the effect of that, and struggle with "flush" calls either
flushing too much, or returning when the data reaches the drive controller, not
the disk surface. Doing this right needs file system and driver support,
though.
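
For reference, the "lock files are good enough" convention from the list above usually looks something like this. A sketch only; `lockfile_acquire` and `lockfile_release` are made-up names. It is purely cooperative, a crashed holder leaves a stale lock behind, and according to the Linux open(2) man page `O_EXCL` on NFS is only supported with NFSv3 or later.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to take a lock by creating lockpath exclusively (hypothetical helper).
 * O_CREAT | O_EXCL makes the existence check and the creation one step.
 * Returns 1 if the lock was taken, 0 if someone else holds it, -1 on error. */
int lockfile_acquire(const char *lockpath)
{
    int fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return errno == EEXIST ? 0 : -1;
    close(fd);               /* the file's existence is the lock */
    return 1;
}

/* Release the lock by removing the file. Returns 0 on success, -1 on error. */
int lockfile_release(const char *lockpath)
{
    return unlink(lockpath);
}
```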

~~~
daurnimator
> - Unit file updates are not needed because of (complex non-portable
> workaround for atomic file replacement).

I agree with all your points except this.

What is wrong with creating the file somewhere else and using rename() to
atomically overwrite?

~~~
FooBarWidget
Because that is not portable and/or has performance problems.

I'm not kidding you. I know the conventional wisdom is that a rename is
enough. It is not. Depending on the filesystem, you may have to fsync() the
directory entry too, or fsync the _parent_ directory entry too. Or on other
filesystems, _don't_ do these because doing these may cause problems. And do
all this in the right order, because if you get it slightly wrong then it
doesn't do what you want in the event of a crash.

The problem is maddening. There is an article out there that explains all this
in much more detail. I _think_ it is this one but I don't quite remember
whether it is the same as the one I read:
http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/

The performance problem comes from the fact that you need _two_ fsyncs.
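
To make the "two fsyncs" concrete, here is one commonly described ordering: write a temp file, fsync it, rename, then fsync the parent directory. This is a sketch under the assumption of a Linux-like local filesystem, not a portable recipe; as said above, the right steps depend on the filesystem. `replace_file` and `write_all` are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Replace path with data via temp file + rename() (hypothetical helper).
 * Returns 0 on success, -1 on error. */
int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[PATH_MAX], dir[PATH_MAX];

    /* Temp file next to the target so rename() stays on one filesystem. */
    if (snprintf(tmp, sizeof tmp, "%s.tmp.%ld", path, (long)getpid())
            >= (int)sizeof tmp)
        return -1;
    snprintf(dir, sizeof dir, "%s", path);  /* fits: path is shorter than tmp */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    /* First fsync: the contents must be on disk before the rename. */
    if (write_all(fd, data, len) < 0 || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0 || rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }

    /* Second fsync: flush the parent directory so the new name is durable. */
    int dfd = open(dirname(dir), O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

Readers that open `path` see either the old contents or the new ones, never a half-written file; what survives a crash is the part that varies by filesystem.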

------
Asooka
What I would really want is just support for the common Windows-like case:
Disallow opening the file by another process for reading or writing, paired
with a flag you can pass to open(). Just these two:

    
    
      /* O_LOCK_RW and O_LOCK_W are proposed flags; they do not exist today. */
      open("file1", O_CREAT | O_EXCL | O_LOCK_RW);
      open("file2", O_CREAT | O_EXCL | O_LOCK_W);
    

The first call atomically creates and locks a file if it doesn't already
exist, and forbids other processes from opening it for either reading or
writing. The second does the same, but permits other processes to open it for
reading. If you want to release the lock, maybe you can do it with fcntl, or
maybe you just open the file again and close the original descriptor. Locks
are always tied to descriptors, not like epoll.

You aren't required to combine the lock with O_CREAT | O_EXCL as in the
example; you can open and lock an existing file as long as the lock won't
conflict with another process's open fd. That is, if process A has opened file
X for reading, process B can open file X with O_LOCK_W, after which no other
process can open file X for writing, but process B can't open it with
O_LOCK_RW. Multiple processes can have O_LOCK_W but only a single one can have
O_LOCK_RW. If the file system can't support locking (NFS et al.), return -1
and set errno to EINVAL. You can pair this with fcntl to take a lock with an
optional timeout on an already open file.

This covers the usual case of "I don't want someone else to mess with this
file while I do things with it", while avoiding the explosion of intricate
states that region locking brings, and lets you create and lock a file
atomically.

~~~
jo909
Why prevent others from reading the file? I guess mainly to avoid them reading
mid-update in an inconsistent state.

But preventing the read completely until you are done writing the file is
arguably also not "consistent", the reader wants a file and gets an error
instead.

So now the reader has to either error out or implement some error handling,
like waiting a bit which is rather impractical. But wouldn't it be nicer if he
just got the "old" version of the file until you are finished with the update
and "commit" the new file?

~~~
Asooka
Yes, it would be quite nice. But this version is simpler, and you can
implement your version by writing to a temporary file, then doing an atomic
rename.

My operating model assumes a user creating some kind of content using an
application. E.g. Photoshop is running and the user is editing a file.
Photoshop can very well write partial updates to the file, so it's important
that nobody else write to it, or you end up with a corrupted file.

Maybe just opening the file with no locking mode should always succeed though,
or there can be an additional LOCK_NONE for that case (I don't want to
participate in the locking scheme and yes I want to read whatever data is
there). In that case you're saying "I just want to read what's there _even if
it is not consistent right now_ ". The LOCK constants should probably also be
named FORBID, in the sense that you're opening the file and FORBIDding the
operation from happening while you have it open. So you can forbid writing
(because inconsistent data is nonsensical for this case) or reading and
writing (because you'll be updating the file). But I think the LOCK names are
easier to use, since you're specifying the operation you'll be doing and can
be understood analogous to read/write locks.

~~~
jo909
You can also "forbid" others from reading and writing your file if you use a new
temporary file they don't know about, and weakly enforce that by removing the
file system entry with unlink while keeping the fd open, then link it again
later.

Part of the goal is IMHO not having to write out the entire file every time,
only because you changed a few bytes. The atomic renaming can't do that, to my
knowledge.

I'm really only wondering about the forbidding reads part, and if that gains
anything useful. For a process that knows about and participates in the
locking process that's already possible, but mandatory file locks are also for
processes that did not participate. Like for example cat, the backup process
etc.

------
tedunangst
Mostly solved this problem with other solutions. Maildir for mail, SQLite for
everything else.

~~~
unixhero
What's maildir?

~~~
tedunangst
A directory of mail files.

~~~
emmelaich
> A directory of mail files.

...which avoids (user-level) locking.

(to answer the question fully)
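
Roughly how delivery avoids the lock: each message is written under a unique name in `tmp/` and then linked into `new/`, so readers never see a partial file. A simplified sketch; `maildir_deliver` is a hypothetical helper, and the caller is assumed to supply a unique name (real implementations derive one from things like the time, pid and hostname).

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Deliver msg into maildir under the given unique name (hypothetical helper).
 * The maildir is assumed to already contain tmp/ and new/ subdirectories.
 * Returns 0 on success, -1 on error. */
int maildir_deliver(const char *maildir, const char *unique, const char *msg)
{
    char tmp[PATH_MAX], dst[PATH_MAX];

    if (snprintf(tmp, sizeof tmp, "%s/tmp/%s", maildir, unique)
            >= (int)sizeof tmp ||
        snprintf(dst, sizeof dst, "%s/new/%s", maildir, unique)
            >= (int)sizeof dst)
        return -1;

    /* Write the whole message where no reader is looking. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    size_t len = strlen(msg);
    int ok = write(fd, msg, len) == (ssize_t)len && fsync(fd) == 0;
    if (close(fd) < 0)
        ok = 0;

    /* link() publishes the finished file and refuses to overwrite
     * an existing name, so no lock is needed. */
    if (ok && link(tmp, dst) < 0)
        ok = 0;

    unlink(tmp);             /* drop the tmp/ name in every case */
    return ok ? 0 : -1;
}
```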

~~~
emmelaich
See also dirq, a queuing system using similar principles, with implementations
in Perl, Python and Go.

------
emmelaich
When UNIX started its commercial success, some mainframers sneered at the
newcomer because it didn't even have mandatory locking.

Turns out, neither does MVS. The locking was done by the DB system, not the
OS.

