
Appending to a file from multiple processes - signa11
http://nullprogram.com/blog/2016/08/03/
======
danbruc
Why not just use a lock shared between all participating processes and call it
a day? That seems a far more robust solution, and it is probably even quicker
to implement than digging up all the edge cases from all the (file) system
specifications and then hoping you did not miss anything and that all systems
will continue to honor the least common denominator you just figured out in
the future.

~~~
mwcremer
I suspect shared lock performance is the sticking point.

Personally, I agree with tlb: it seems simpler to generate one file per
thread, and combine them at the end.

~~~
danbruc
I just tried it on Windows, 4 processes each acquiring the lock 100,000 times
and doing no work while holding the lock, and it took about 2000 ms. I don't
think that would become the bottleneck for a lot of applications.

EDIT: Corrected the numbers, I had a bug, previously it said 4 processes with
10 million locks per process and 750 ms. It is still pretty fast even if 250
times slower than initially claimed.

~~~
altano
The issue is the contention caused by all the processes waiting on file IO
behind the lock, not the locking itself.

~~~
danbruc
If I/O were the bottleneck then that would be the case with or without a lock,
wouldn't it?

~~~
altano
It depends on what has to wait for the IO to complete. ETW, for example, is a
Windows-wide event framework and it would be untenable for everything using it
to wait on the file IO of everything else making use of it. So, it has a
solution that makes use of locks but nothing ever has to wait on file IO.

File IO is always expensive, the trick is always in how you work around that.

------
tlb
It's really much better for each process to write to its own file, and the
consumer can merge-sort the records.

~~~
altano
This is basically how Event Tracing for Windows (ETW) solves the problem very
efficiently. It uses InterlockedIncrement() to sequence records quickly and
with minimal contention, so that all the records can be merged at the end.

If you're on Windows you should use ETW instead of reinventing it when
possible :)

~~~
adzm
ETW is awesome. Too bad it's been plagued by such an opaque API and all the
registration junk. I hear it can be easier to use now, but I'm still not sure
if an exe can emit ETW events without having to register every possible event
or whatever.

~~~
hirsin
The new ETW format is self-documenting at the expense of some overhead. You
can set up hello world in ETW in about three lines of code [1], just create
the logging channel and log a string or property bag to it.

1\. [https://blogs.windows.com/buildingapps/2016/06/10/using-
devi...](https://blogs.windows.com/buildingapps/2016/06/10/using-device-
portal-to-view-debug-logs-for-uwp/)

------
herge
This is a great interview question. It's very easy to formulate and explain,
and yet the interviewer can gauge the depth of a candidate's OS knowledge by
digging deeper into any solution, with no easy 'perfect solution'.

~~~
Someone1234
I'd argue this has the same problems as many existing interview questions: It
relies on arcane knowledge that won't be used day to day at the job in
question, and rewards the candidate who is lucky to know the answer or
answers.

Can we stop with the silly/trick/arbitrary interview questions? I know we
cannot, but I dream for the day.

~~~
avar
It's a good question because even supposing that you have no prior knowledge
of the problem at hand you should be able to think of some obvious problems if
you have any programming experience.

You have N number of writers, those writers can write X sized data to location
A.

How would those writers go about writing to "A" in a performant manner without
causing issues? Whether those issues are interleaved data, slowing down the N
writers or having to keep outsized buffers on the OS level.

Even if you have no knowledge of POSIX filesystems you should be able to
reason about this in an intelligent way in an interview.

Do we go for locks? That has its own set of problems; what are those? Do we go
for fixed-size non-interleaved guarantees? What problems does that cause? Do
we buffer arbitrarily sized output and merge it later, etc.?

------
catern
Atomic writes are pretty easy - as this article points out, PIPE_BUF sized
writes to pipes are atomic.

It seems like atomic reads are harder. You want to read a record from a shared
input, but you don't know how big a record might be. Any thoughts?

~~~
rpcope1
If you have the ability to (minimally) parse what is coming out of your pipe,
and you know that your writes are not interleaved, and you know the size of
the record you're writing before you write it, you can prefix each record with
its size, and use this on the reader side to fetch only up to the record
length (the read(2) syscall and friends allow you to specify a maximum number
of bytes to fetch) for each record. Does this answer your question?
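A sketch of that length-prefix framing, assuming a single reader and a 4-byte little-endian length header (the field layout and function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Write one record as a 4-byte little-endian length followed by
 * the payload. */
int write_record(int fd, const void *data, uint32_t len)
{
    unsigned char hdr[4] = { len & 0xff, (len >> 8) & 0xff,
                             (len >> 16) & 0xff, (len >> 24) & 0xff };
    if (write(fd, hdr, 4) != 4)
        return -1;
    return write(fd, data, len) == (ssize_t)len ? 0 : -1;
}

/* Read exactly n bytes, retrying on short reads. */
static int read_full(int fd, void *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, (char *)buf + got, n - got);
        if (r <= 0)
            return -1;
        got += r;
    }
    return 0;
}

/* Read the length header, then exactly that many payload bytes.
 * Returns the record length, or -1 on error or overflow. */
ssize_t read_record(int fd, void *buf, size_t cap)
{
    unsigned char hdr[4];
    if (read_full(fd, hdr, 4) < 0)
        return -1;
    uint32_t len = (uint32_t)hdr[0] | (uint32_t)hdr[1] << 8
                 | (uint32_t)hdr[2] << 16 | (uint32_t)hdr[3] << 24;
    if (len > cap)
        return -1;
    return read_full(fd, buf, len) < 0 ? -1 : (ssize_t)len;
}
```

As the reply below the parent comment notes, this only works when a single reader drains the stream; with multiple readers the two reads are not atomic.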

~~~
catern
I don't believe what you're describing works. You're suggesting reading first
the size of a record (stored itself in a fixed-size field), then reading the
rest of it?

But unfortunately that is not atomic. Another process could read data from the
pipe between your first and second reads.

------
gpderetta
Posix seems to guarantee that writes with O_APPEND are atomic:

"If the O_APPEND flag of the file status flags is set, the file offset shall
be set to the end of the file prior to each write and no intervening file
modification operation shall occur between changing the file offset and the
write operation."

but then it goes on and says:

"This volume of IEEE Std 1003.1-2001 does not specify behavior of concurrent
writes to a file from multiple processes. Applications should use some form of
concurrency control."

Also, it does seem to allow a signal to interrupt file I/O.

Anyway, it seems that historical Unix behaviour is that signals never
interrupt file I/O and O_APPEND can be used for atomic appends. Of course,
this is not documented anywhere, but you can find discussions on the topic. In
particular, here [1] Linus goes on one of his rants when commenting on a patch
to change this behaviour.

[1]
[http://yarchive.net/comp/linux/wakekill.html](http://yarchive.net/comp/linux/wakekill.html)
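Under that historical behaviour, an atomic append is just an O_APPEND open plus a single write(2) per record; a minimal sketch (the function name, path, and error handling are illustrative):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append one record relying on O_APPEND: the kernel moves the file
 * offset to end-of-file and performs the write as one step, so
 * concurrent appenders never overwrite each other. Whether the
 * record itself can be split is the undocumented part discussed
 * above. Returns 0 on success, -1 on error. */
int append_record(const char *path, const char *rec)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, rec, strlen(rec));  /* one write per record */
    close(fd);
    return n == (ssize_t)strlen(rec) ? 0 : -1;
}
```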

------
adenner
Couldn't you use a named pipe to send data from each process to a single
writer process?

~~~
herge
How would the single writer process know that one 'atom' of writing has been
received from each process? The named pipe might be empty if the sending
process is still filling its buffer with the second half of the message.

~~~
kbenson
Presumably you would have some protocol for passing the messages that handles
this, such as, at the simplest level, wrapping the message with a preceding
length and a following end-of-record identifier, which in conjunction should
work.

But that's the (implicit) point of all this: now you've added a bunch of logic
for an accumulating writer, a protocol to pass messages between the processes,
and state so that partially received messages can be continued when you get
back to that process. Now the application has a non-trivial amount of logic
and code to deal with logging, which might be a source of problems itself. As
another commenter noted, individual files per process with an aggregation step
(or specialized reader) is much simpler and easier to reason about, and
mirrors the actual case in the article, where the data was eventually inserted
into SQLite, which essentially enforces this read ordering as needed (if there
is an ordering field).

------
pjc50
The classic application of appending from multiple sources, while potentially
modifying the file and potentially having NFS in the way, is a mail spool
using mbox.

While there are various hard-won implementations that work doing this, it's
also one of the reasons maildir was invented.

It's _much_ more portable, comprehensible, and reliable to have a single
process writing to the file and all the other producers sending records to it
using sockets or named pipes. The consolidator needs to be aware of record
boundaries so you don't have to worry about pipe atomicity.

------
ridiculous_fish
We have this issue in the fish shell. There's a single shared history file,
and multiple shell instances may append to it at once. We use fcntl() locking
but we do in fact have users who have $HOME on nolock NFS, where locking
fails.

Our strategy is to keep the history append-only, with periodic "vacuums"
that rewrite the file into a temporary and move it into place. Even the
appends could in principle be split, as NFS provides no guarantees here, but
in practice writing individual items seems to be atomic when done with
O_APPEND.
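The vacuum step can be sketched as a copy into a temporary file followed by an atomic rename() (the file naming and the trivial copy loop are illustrative; the real rewrite would trim or deduplicate history items):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrite a history file via a temporary, then rename() it into
 * place. rename() is atomic on POSIX, so readers see either the
 * old file or the new one, never a partially written file. */
int vacuum_history(const char *path)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *in = fopen(path, "r");
    FILE *out = fopen(tmp, "w");
    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }

    char line[4096];
    while (fgets(line, sizeof line, in))
        fputs(line, out);   /* a real vacuum would filter items here */

    fclose(in);
    fclose(out);
    return rename(tmp, path);   /* atomically replace the old file */
}
```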

------
DougWebb
It's fairly simple to write a little server that accepts socket connections
from multiple processes, reads a line of data from each connection whenever
data is available, and appends it to the log file. I wrote a server like that
years ago: a write-only socket service which hundreds of separate processes
could write high-volume statistics-tracking data to. The heart of it was only
a few dozen lines.
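A minimal sketch of such a consolidator over a Unix-domain socket (it handles a single connection; a real server would loop over accept() and multiplex clients with poll(); all names and paths are illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Accept one client on a Unix-domain socket and append everything
 * it sends to the log file. Only this process ever touches the
 * file, so no file locking is needed. */
int serve_one(const char *sock_path, const char *log_path)
{
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = {0};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, sock_path, sizeof addr.sun_path - 1);
    unlink(sock_path);                       /* clear a stale socket */
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(srv, 8) < 0)
        return -1;

    int c = accept(srv, NULL, NULL);
    FILE *log = fopen(log_path, "a");
    if (c < 0 || !log) {
        close(srv);
        return -1;
    }
    char buf[4096];
    ssize_t n;
    while ((n = read(c, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, log);      /* single writer appends */
    fclose(log);
    close(c);
    close(srv);
    return 0;
}
```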

~~~
ww520
Something like syslogd?

~~~
DougWebb
Similar, but different requirements. All of the messages were grouped by
sending process, and when the sending process said its session was over its
messages were fed into a post-processing queue. There was support for holding
messages in memory or on disk, and data-preserving fallbacks if the post-
processing halted or got too backed up. The message rate varied quite a bit;
during the US business day the rate was at least double. We used that to give
the post-processing queue time to catch up during off-hours.

This was designed about 16 years ago, so well before the big-data tools, ssds,
or powerful machines we have today.

------
JoeAltmaier
how about

    mkfifo foo
    tee -a log.txt < foo &

and then multiple processes write to the pipe foo.

------
ArkyBeagle
I fully expect to be disappointed at some point, but I've always seen the "use
fflush()" approach work, on both Linux and Windows.

TCP Sockets to a server over localhost might also work ( although the flush
IOCTL makes no guarantees w/ sockets), then have the server serialize out. One
of the nice things about Tcl is that such a server is relatively easy to do.
But you may need to mind newlines - don't write until an entire "line" of text
has been read.

~~~
pjc50
fflush() over networked drives?

~~~
ArkyBeagle
Good question - I tended to create a TCP server and use that to write to
drives local to the server machine. I never really did a "trust fall" with
network mounted drives because the logger is generally not a standard part of
the system - it was an add-on.

------
known
Check how Apache implements its log
[http://httpd.apache.org/](http://httpd.apache.org/)

------
Nuffinatall
My approach would rely on database locking. Seems like a good fit.

------
leni536
Is it possible to call lockf on an O_APPEND file?

~~~
rleigh
It should work just fine. But since the current offset will always be
changing, you might want to lock the entire file, and fcntl has a more
flexible interface for doing that.
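A sketch of that whole-file fcntl lock around an O_APPEND write (function and path names are illustrative):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Lock the whole file around an O_APPEND write using fcntl. With
 * l_whence = SEEK_SET, l_start = 0, and l_len = 0 the lock covers
 * the entire file regardless of the current offset, which works
 * even though O_APPEND keeps moving the offset. */
int locked_append(const char *path, const char *rec)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    struct flock lk = {0};
    lk.l_type = F_WRLCK;       /* exclusive write lock */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;              /* 0 = through end of file */
    if (fcntl(fd, F_SETLKW, &lk) < 0) {   /* block until acquired */
        close(fd);
        return -1;
    }

    ssize_t n = write(fd, rec, strlen(rec));

    lk.l_type = F_UNLCK;       /* release; close() would also drop it */
    fcntl(fd, F_SETLK, &lk);
    close(fd);
    return n == (ssize_t)strlen(rec) ? 0 : -1;
}
```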

