Hacker News new | comments | show | ask | jobs | submit login
What's So Bad About Posix I/O? (nextplatform.com)
160 points by rbanffy on Sept 11, 2017 | hide | past | web | favorite | 96 comments


"POSIX I/O is simply not what HPC needs." but it's ok for almost any other use case and "There are many ways to move beyond the constraints imposed by POSIX I/O without rewriting applications, and this is an area of active research in both industry and academia." The problem is the lack of a standard solution yet.

The post does a good job at explaining why HPC doesn't need the strong guarantees provided by POSIX I/O and why they slow down those kind of applications. They are perfectly fine and desirable for most workloads on single user workstations, such as web development (save a file in the editor and the compiler gets its bytes exactly as you wrote them.)

I'd have thought MPI-IO is the relevant standard for HPC i/o (perhaps under something like HDF5). I'd expect HPC applications with serious i/o requirements to be using that anyway.

It is and people do use it, but it ultimately still exists as software that sits on top of POSIX in the vast majority of cases. It can clean up really bad POSIX I/O by buffering and reordering, but doing so induces latency, memory overheads, and consumes other resources that don't always pay dividends at all scales.

ROMIO unfortunately requires global locking on Lustre, but I don't see how that's relevant to the use of standard or other high-level (HDF, NetCDF, ...) non-POSIX interfaces. You can swap in PVFS, or something else with a different architecture. (So what if object stores are on _local_ POSIX-y filesystems?)

I'd have thought the salient feature is collective i/o, not cleaning up silly POSIX i/o of some sort. Nothing's perfect, but what should replace the non-blocking collective MPI-IO under5, for instance? I wonder what are the major NERSC applications that don't do sane i/o and should. The published workload characterization doesn't help.

Applications using MPI-IO are still typically run on top of a parallel file system. The MPI-IO API is, in simple terms, intended to create I/O workloads that should perform well on top of the POSIX I/O interface (e.g. large, sequential I/O). But my understanding is that MPI-IO can have some major drawbacks, like a huge amount of compute-side network utilization used to reorganize and stage I/O.

The PFS under MPI-IO clearly doesn't have to be a POSIX one; PVFS is stateless. I can't comment on that specific intention, which isn't in the standard's rationale. Where is it documented? You'd have to argue with Rob Latham about whether "MPI is a good setting for parallel i/o".

The author brings up a lot of good points, but they're mostly good within the context of HPC. Out in the broader universe where most data lives, many people do require the full suite of POSIX namespace, metadata, and permission semantics. Yes, even locking, no matter how many times we tell them that relying on that in a distributed system is Doing It Wrong. I know because I support their crazy antics at one of the biggest internet companies.

The author's on much firmer ground when he talks about POSIX consistency semantics. While we can't control misguided applications' (ab)use of locks, we can certainly offer models that don't require serializing things that the user never wanted or expected to be serialized, or making them synchronous likewise. As I've also written before[1], we need more ways to describe the user's desired level of ordering and durability, separately, giving the implementation maximum flexibility to optimize I/O while preserving the guarantees that users really care about. The setstream primitive in CCFS[2] is a good example of the sorts of things we need here. I'm not at all convinced that object-store-centric approaches are the way to go here, but some evolution is clearly needed.

[1] http://pl.atyp.us/2016-05-updating-posix.html section under "fsync"

[2] https://www.usenix.org/system/files/conference/fast17/fast17...

The context was definitely HPC, but that's mostly because it's the last place where this problem persists in a very big way. The WORM workloads that serve up most of the data we consume globally just rejected POSIX entirely and created the scalable object interfaces we know and love. It was very obvious that having every Roku in the world mount a Netflix file system over the WAN was insane, so nobody had to write an article calling for the end of that bad behavior.

Updating POSIX, whether it be through more expressive consistency options or just giving middleware developers the knobs they need to implement such mechanisms themselves, has come up before[1], as you say. My worry is that many of the serious efforts from years past were ahead of their time, and hyperscale has gone their own way so the ship has sailed on POSIX I/O.

[1] http://www.pdl.cmu.edu/posix/

This reminded me of another attempt to embed ordering in a filesystem in a usable and performant way, Featherstitch:


I see the CCFS paper cites it to say that CCFS is better, but i haven't dug into why.

I'm surprised there is no mention of the lack of non-blocking filesystem reads and opens. In other words, I should be able to submit a read request, and then select()/poll()/epoll() for when the data is available like network I/O.

Most unixes (including Linux) always report files to be readable under select()/poll()/epoll() so this doesn't actually help read from disks (even when they're backed by network/shared storage). That's why you often see this kind of seeming "nonsense" in HPC startup scripts:

    for x in *; do cat "$x" & done | pipeline ...
Furthermore, in a networked-environment, select()/poll()/epoll() actually decrease concurrency compared to threads because now you need another system call to go fetch the data from the kernel. If I've got the CPUs I'd rather use threads.

In the Windows model (in contrast), the programmer says "when this data is available (from this socket, file, etc), put it here (into this memory buffer)" and then waits for the kernel to do something. As a result, you skip that extra syscall -- but you also skip any other read you might have needed to do for data that's arrived at (roughly) the same time. AIO[1] is potentially similar, but it never seems to benchmark well in my experience.

[1]: http://man7.org/linux/man-pages/man7/aio.7.html

> Most unixes (including Linux) always report files to be readable under select()/poll()/epoll()

There is no other choice really. Let's say you want to read a specific page from a file. No page is currently in the kernel buffers. If poll where to return non-ready (and remember that poll has no idea which pages you are interested in), what action, within posix, would you take to change it?

poll and friends just do not make sense for plain file descriptors. In linux, splice add an interesting twist, but it is still awkward to use.

> Let's say you want to read a specific page from a file.

pread() is an unusual case. I'd be happy if poll() simply indicated read() would succeed, and I'm happy doing an lseek() before every read().

However a pread() with O_NONBLOCK could return EAGAIN and poll() would then know what pages I'm interested in.

> what action, within posix, would you take to change it?

The general (POSIX-compatible) strategy (as I indicated) has been to fork a thread off to handle reading so that my application is notified when the data is in the fifo. This works for me when I'm reading (say) five files from a slow (or remote) storage, but it's not ideal since it causes a copy.

poll can't guarantee that the next read will succeed because first of all doesn't know how many pages you are interested in reading and any relevant page in the page cache might be evicted between the poll and the read.

Also having pread, which is designed for concurrency and scalability, carry state around for poll (an unbounded amount of state btw: you can have multiple threads issuing concurrent preads on the same fd) makes it completely pointless.

There are linux patches floating to add flags to preadv2 so that on a page not present it returns EAGAIN, so that you can optimistically do a sync read an fall back to a thread pool in the slow case. It still doesn't involve poll and friend.

poll() can't guarantee that the next read will succeed anyway, because something else might drain the socket or fifo (e.g. another thread). Programmers simply use O_NONBLOCK anyway and use poll() is wait until the process should try again.

I agree completely. Pedantically, there is one other option: return an error like windows select does (WSAENOTSOCK). But then apps need special cases to deal with files and life is horrible.

> If poll where to return non-ready (and remember that poll has no idea which pages you are interested in), what action, within posix, would you take to change it?

readahead(2) ? posix_fadvise(2) ?

From man page of readahead(2), "readahead() blocks until the specified data has been read." The other one claims async functionality. But says it is not reliable.

An idea for how it could be implemented in linux: kernel would offer a new kind of file descriptor. It would have a limited, in-kernel buffer, much like a socket has an arrival space. Once you have it, call something like,

   newkind_page_copy(newkind_fd, disk_fd, page_count, timeout) 
Timeout could be unlimited.

The buffer would behave like socket for the purpose of the read and error lists in select/poll/epoll. If you closed either descriptor, it goes to error-state. Within the kernel, it would watch real_file_fd, using code to similar to whatever is underneath inotify. Like inotify, this would not work for NFS.

> From man page of readahead(2), "readahead() blocks until the specified data has been read."

? What man page are you reading? The goal of readahead is to prime the cache and return immediately. http://man7.org/linux/man-pages/man2/readahead.2.html

I quoted from this one, https://linux.die.net/man/2/readahead

There is a second obstacle I have in mind, around the robustness of this. Imagine if we had an implementation of that which reliably returned quickly. But we take our time doing the subsequent read. By the time we do, the value has already been chased out of the cache. It is an edge-case, and you could probably avoid it in practice, but it would be significant effort to to verify your mechanism, and would create cognitive load.

I think this would be pretty straightforward if O_NONBLOCK was allowed to have an effect for regular files. The rules would be:

1. If a read for a range of bytes on a non-blocking file descriptor can be wholly or partly satisfied from the cache, put that data in the read buffer, and return the amount of it

2. If a read on a non-blocking file descriptor can't be satisfied from the cache, return -1, set errno to EAGAIN, and kick off an asynchronous task to load that range of bytes into the cache (an 'asynchronous load')

2a. If there is already an asynchronous load for that range running, don't start another one

3. When an asynchronous load finishes, mark that range as available

1a (surprise!). If that range of bytes was marked as available, remove the mark

4. If there are a nonzero number of available ranges for a file descriptor, return POLLIN from poll()

The way a program with lightweight threads (Node, Go, etc) would use this:

1. Open files as non-blocking

2. Make reads as usual

3. If a read succeeds, or fails other than with EAGAIN, pass the result to the thread and carry on

4. If a read fails with EAGAIN, suspend the thread, and remember the range it was reading

Meanwhile, in a background IO thread ...

5. Repeatedly poll all file descriptors for which there are remembered reads

6. If a file descriptor shows up as POLLIN, run through all the remembered reads for it, and make them again

7. If a read fails with EAGAIN, cool, leave it remembered and carry on

8. If a read succeeds, or fails other than with EAGAIN, pass the result to the thread and resume it

Would that work?

I haven't really thought about partial results of asynchronous reads. If a read asks for two blocks, and one gets loaded, should poll return POLLIN or not? I lean towards yes, because the program can make use of that data. That raises the question of whether to remove the availability mark after the following read. Again, i lean towards yes, because it means asynchronous reads have the same behaviour as ones which partially hit cache. The OS can carry on reading the rest of the range into the cache, but it won't signal the arrival of that data to the program. That should be fine, as the program will probably try to read it soon anyway.

If a program takes too long to get round to reading data once it is available, it might have been pushed out of the cache by then. Too bad. The read will fail with EAGAIN and the whole cycle repeats.

If two threads in a program independently read the same range, only one asynchronous load will be run, and only one thread will end up removing the mark. Programs will need to be careful not to trip up over this; they probably need to coalesce reads on their side of the API.

It sucks to have to retry every outstanding read for a file when it shows POLLIN, but i can't see a way to return information about which ranges are available through the existing calls. A program could keep its list of outstanding reads in time order, scan it from earliest to latest, and stop when it gets a non-EAGAIN result. If there is more than one available range, poll will return POLLIN again and the program will go back and pick it up. If asynchronous loads complete one at a time in order, this reduces the cost to O(1), with a smooth increase in cost as that is less and less the case.

Basically, what i'm saying is that everyone should just implement the kqueue API from BSD.

  2. If a read on a non-blocking file descriptor can't be
  satisfied from the cache, return -1, set errno to EAGAIN,
  and kick off an asynchronous task to load that range of
  bytes into the cache (an 'asynchronous load')
The way Windows and Unix operating systems are written, that "asynchronous load" will always be implemented by using a pool of kernel-scheduled threads. The filesystem is a stack of software, several layers deep, with tentacles reaching all over the place (e.g. modern file buffer and VM page caches are unified). Existing APIs and implementations simply do not support truly asynchronous operation throughout the entire stack. This is one reason why AIO on Linux is limited to direct I/O--to cut out all the middle layers--and even then it's not truly asynchronous (e.g. file opens). Windows asynchronous I/O just uses a kernel thread pool. So does FreeBSD. And so does every Linux proposal for buffered AIO.

If you want pollable file reads, just use eventfd and roll your own. Linux threads are lightweight enough that there's not much reason to care whether they're "kernel" or "user" threads. But pollable file I/O will really only be a win for streaming reads and writes, so you could just use sendfile(), in which case you're reading or writing to a pollable socket; you use a dedicated thread pool to pump data from sockets to the file descriptors, or vice-versa, without any copying.

BTW, Go doesn't need this solution (timeouts notwithstanding) because Go schedules multiple goroutines across multiple kernel-scheduled threads.

If you're really worried about file I/O latency, you should be at least as worried about VM pagefault latency. I disable swap on all my Linux servers. And I also disable overcommit. But my languages of choice (C and Lua) are capable of gracefully handling allocation failure (contrast Python or Go), and I implement my solutions to handle allocation failure without crashing the entire service. Too many other languages make the assumption that memory is infinite. (Alas, large parts of the Linux kernel do this as well, even with overcommit disabled :( Solaris and Windows are the only major OSs I'm familiar with that implement strict and rigorous memory accounting, permitting the system to stay alive (and consistent) under memory exhaustion. FreeBSD might as well, I'm not sure.

In theory FreeBSD supports a sane aio_read/write/fsync() implementation by extending sigevent to include the generation of kqueue events, but there are no aio_open/close/fstat/rename() syscalls and if you need a thread pool to deal with those remaining blocking file system calls there are few advantages to using the in-kernel thread pool implementing the aio_*() syscalls. An other problem is that filesystem aio calls are limited to local file systems, because accesses to network file systems could block indefinitely starving local file system accesses by blocking all threads in the pool.

What is surprising to me is that it is 2017 and nobody has attempted to solve these problems. Maybe the current mechanisms, with all their faults, are considered good enough?

I think that's unfair: A lot of people have attempted to solve these problems(), and the Windows Way™ has it's own problems. The real answer is probably between elastic infrastructure and the fact that computers got faster and cheaper, and so -- at least for a while the number of people needing solutions here went down.


* https://lwn.net/Articles/223899/

* https://lwn.net/Articles/221913/

* https://lwn.net/Articles/316806/

* https://lwn.net/Articles/612038/

* https://lwn.net/Articles/671658/

* http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt

* http://ogris.de/howtos/splice.html

And so on.

However a few companies have seen real commercial benefit from actually analysing their data, so we're seeing a lot of "big data" interest. Some of these companies have noticed that a bunch of hadoop/mongo/whatever boxes can't even approach the performance of one "big" box tuned right, so we're seeing somewhat of a resurgence in interest here.

> and the Windows Way™ has it's own problems

Could you expand on that?

(I'm assuming that by the "Windows Way", you mean I/O completion ports.)

Other systems (Windows and I think Solaris, but some more obscure systems likely do this as well) can do asynchronous disk IO, but it's not perfect, either.

I think the main reason Linux doesn't really do it is that asynchronous IO is an intrusive change and there would be just too much work to implement it in all the file systems etc.; i.e. "not worth it".

Many applications which are not OK with synchronous disk I/O seem to find thread pools good enough: reasonably easy to implement, reasonably portable, usually performs ok.

I only half buy the "not worth it".

Traditional unix, including Linux pretends to support poll() etc for file I/O (i.e. it supports the interfaces, but the reality is synchronous). The reason for this is that an application that expects async I/O can kinda-sorta-work with real files -- and "kinda sorta" will not be too bad because local disks are fast enough to paper over any problem.

But then if the disks aren't local, why not actually make non-blocking I/O work as advertised?

Actually I can guess the answere: there are 101 corner cases that mean I can't neatly separate out the apps design for noblocking I/O. And that's why I half buy the argument.

But only half.

Generally epoll and event io are good enough for millions of simultaneous files or sockets. Nobody tells you to use direct read or write instead of mmap either.

Mmap - I can't think of a scenario where it improves over read or write in the context discussed here: async access. Mmap is synchronous, and you can't do select/poll on it.

Not sure why you were downvoted.

That's some seriously hairy stuff there. I reckon I'd move to building my app as a kernel module more readily than trying to make a robust async arrangement using that mechanism :) Call me demanding, but I think we should can reasonably expect easier access to async from our OS than this mechanism offers.

On that topic of which.. how's the future of operating systems coming along?

Those problems are already solved by async I/O in platforms other than Linux.

You can't do that with normal read/write and select/poll/epoll, but you can either use aio (posix) or io_submit (linux-specific).

In my opinion, a big issue with these is that asynchronous i/o requests can not be used with select/poll/epoll, but rely on other notification methods instead. AIO can use unix signals (ugh) or thread notification (see sigevent(7)) to signal completion. io_submit uses io_getevents to read events about completed I/O requests.

From a userspace programmer's point of view, it would make much more sense if io_setup() would return a file descriptor that could be used with select/poll, and io_events would be read() from the fd.

Currently it's impossible (?) to have filesystem i/o in the same select/poll loop with your network i/o because of this issue. This is why libuv falls back to using threads and synchronous i/o to provide "async" i/o.

> Currently it's impossible (?) to have filesystem i/o in the same select/poll loop with your network i/o because of this issue.

Hacky workaround: don't do filesystem i/o in the first place. Have the kernel serve an iSCSI target, then consume it over the loopback interface entirely in userspace. Your disk now has network semantics! (But you have to bring your own filesystem driver.)

If you don't mind some extra abstraction overhead, you can skip past iSCSI and instead build a userspace NFS client lib into your app, consuming a loopback NFS share. s/NFS/9P/ if your disk server has in-kernel 9P.

(Is there any use to these approaches? Yes: they're actually nearly-optimal ways to persist state from a unikernel running on a hypervisor, because all the disks are remote one way or another anyway. Erlang on Xen uses the 9P approach, for example.)

This aligns all semantics and removes file io as special case? Unikernels appear to be a good way to evolve past Unix.

you can do io_submit from one thread and then have some dedicated threads do io_getevent + event signalling (comes at a price though).

Still all this event goodness is not good enough for things like infiniband and the brave world of zero copy networking.

Yeah, sure that kinda works but it won't play nice with having an event loop on network events. A typical high performance server has multiple worker threads waiting on events on a select/epoll and if you add in another thread (or several) waiting on io_getevent you end up in a rather contrived solution.

You could use a pipe or eventfd or some other fd-based mechanism to wake up a worker thread on a filesystem event, but then you get two context switches (one to wake up io_getevent thread and another to wake up a worker). This carries a lot of indirect costs from jumping back and forth to the kernel.

I am unaware of a solution that would make filesystem i/o and network i/o play nice in one event loop without adding some overhead.

Threads are less of a problem with file i/o, because usually a process doesn't need to have 10000 outstanding disk io requests.

(Windows async disk i/o is thread-backed too behind the scenes, though)

Windows async I/O ("overlapped" in Windows parlance) is backed by a thread pool (completion ports).

Windows does not spin up or allocate a thread just to wait for an IO operation to complete. Rather, when an IO operation completes, a thread is allocated from the thread pool (hence the name completion ports) on which the callback happens.

So there is no excessive allocation of threads. Everything is event-driven and kernel managed until the operation completes.

Simultaneously, people overestimate the costs of threads, and underestimate the efficiencies they provide. At extremely high loads, they become an issue, but it's quite rare. And at normal loads, threading is (IME) often simply more efficient than typical async solutions in practice, probably since a thread gets synchronized access to its internal state and consistent ordering of execution (almost) for free, whereas async code needs to (usually) reimplement that in user mode.

People have taken the notion that threads are slow in some circumstances and made the conclusion that threads are slow, period. But that's simply not true.

The rate of computation is obviously very similar for threads and async, but the rate of multiplexing is not. Switching between processes or threads takes more time than switching between fibers/coroutines. So the general observation that for massively multiplexed workloads fibers/coroutines are lower overhead is generally correct, but the more interesting question is of course "what qualifies as a massively multiplexed workload?" ... I'd argue most kinds of web application servers don't.

Yeah, and the devil is really in the details there; not all context switches are created equal. If you're dealing with less than (rough ballpark) 1000 requests per core per second, there's just no way context switching is going to be anywhere near a bottleneck. Depending on all those details (app/OS/processor/lunar cycle), you may be able to deal with 10 to 100 times as many before context switching is an issue.

Those kind of servers are simply quite rare (IME). Nothing wrong with interesting problems in niche spaces, but it's not something you should be worrying about by default.

(No idea what "big" servers provide public stats, but e.g. https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar... ) lists a random day for stackoverflow having 209,420,973 http reqests, i.e. a little less than 3000 per second. I doubt context switching is going to matter for them (in the hypothetical world all this was served by one HTTP server, which of course it isn't).

I largely agree with the above and I would say that 1000 reqs/sec is probably a decent threshold for considering when async IO is going to matter for performance. That said, the details of your particular workload may benefit from async IO at significantly lower levels. As an example, one message routing application on which I worked with a typical workload of ~100 reqs/sec increased its performance by about 5x when switched from a blocking thread per request model to async IO. The application typically maintained a larger number (500-1000) of open but usually idle network connections. With that particular workload and on that particular platform, the overhead of thread context switching became a significant factor at much fewer than 1000 reqs/sec. One hint that this was the case was relatively high percentage (30%+) of CPU time spent in kernel mode. Switching to async IO dropped kernel time to about 5% on this particular application.

Yeah, that's a neat case in point (and illustrates nicely that there are cases where async is super-valueable!).

You can do ~ 1 million thread context switches per second, per core. That's a lot of disk operations in flight before the context switch overhead becomes a large component in your performance equation. Multiply that by 10-20 cores for a server CPU, too. If you still haven't pegged your storage hardware IOPS, I'd say you have a "happy problem" :)

I never imagined that the cost of threads was predominantly a performance cost (though obviously that exists too in some parameter regimes).

The real cost of threads is that threaded code is far more difficult to get right than non-threaded code. The closest thing I know of to an excuse for using threads is that we are already using threads, so all the complixity is already there.

After all, once every line of code is already a potential race condtion, it hardly matters if you turn a few more of them into real ones.

the problem with threads is that unless you spawn an ungodly amount of them, you might still not be able to issue enough pending blocking disk IO operations to saturate the disk bandwidth.

AFAIK, there's no reason why an OS can't make filesystem I/O non-blocking with the standard read/write/select (or poll or whatever) system calls.

The only tricky part (API-side) is that write() is defined to confirm that some data got written, or not. If you let write() become non-blocking, and deferred the I/O, perhaps by returning immediately with EAGAIN or some such response, the write is is an unknown state to the rest of the application. What data can then be expected from other read() calls on that file from the same process? The old data? The new? A mix? You'd have to drop the POSIX consistency guarantee. For most programs, this is never likely to be an issue, as they don't tend to be re-reading bits of a file that they are also writing to, so it's a shame there's no option to let the write() be non-blocking if requested.

Issue 2 is failures: If the write, when it eventually happens, fails, you've got to have some way of passing that information back to the process. The solution for this seems easy enough; just let the FD become writeable again and the program will probably re-attempt the same write, at which time you can report the previous error.

Finally, making open() calls non-blocking would be the real API changer. If you defer all of the file opening work and give the application a file descriptor before doing any I/O, then you can hit all kinds of errors (permissions problems, file not found, and so on). How do you pass them back to the program? You can make any future read() or write() calls fail but these syscalls would now have to return error codes that they aren't specified as actually being able to return (read can't return things like EEXIST, EACCESS and so on). So the API is inadequate for these situations. You can see how they could be extended to cover these cases though.

File write could work the same as socket send. If the data cannot definitely be written now, write nothing and allow the application to try again, probably after waiting for readiness via poll or select.

See gpderetta's comment elsewhere in this thread: the page-cache is different than socket mbufs in some pretty significant ways.

I thought you could, at least on Linux.

You can on (more or less) any modern OS, but it's not part of POSIX, it's platform specific.

It is part of posix, it's called posix AIO, but working with it is a massive pain.

Note that the title is a little click-baity. Specifically, this article is about what is wrong with POSIX I/O in high-performance, exascale computing. This article is about how the POSIX I/O model breaks down in a pretty extreme use case, not that it's fundamentally broken.

I can see how it might seem click-baity but I am pretty sure it wasn't intentional or malicious. Nextplatform is a site about high-performance, exascale computing. If you're on the site then it's not click-baity since the article is exactly what you would expect. The issue is that if you're on a link aggregator like HN then you lose that context.

A more pedantic title would be "What's so bad about POSIX I/O for scalable performance on networked file systems". As others have said, it's an old problem that predates exascale by decades. I never asserted that POSIX is fundamentally broken; it's just that other communities with I/O issues at extreme scales decided to not even bother with the idea of POSIX and went straight for stateless, REST-like APIs and got atomicity through immutability.

It's a pronounced problem in high-performance computing because that community evolved from a much more primitive era and is dragging decades of legacy applications (and, to some degree, thinking) behind it.

Oh I agree with you fully, when I posted this comment there was nothing here. Just wanted to clarify a bit for people.

the atomic write business is pretty painful in any distributed setting, and only comes into play with concurrent writers with overlapping byte offset regions.

I've always kind of struggled to imagine a case where an application would be correct with this guarantee and faulty otherwise. i guess some sort of cleverly designed structure with fixed length records where a compound mutation could be expressed in a contiguous byte range?

There are very few cases in practice where it's needed. The bigger issue is deciding what consistency semantics you _do_ want to have once you decide that strong isn't necessary. Every decision exposes another corner case where someone's application breaks.

Imagine two processes appending rows to a csv file. If writes were non-atomic, they might insert partial rows, or overwrite each other.

the append case is actually different, it requires serializing the end of file pointer.

in the write atomicity case we're asking the file system to order all the writes, and make sure the resulting file shows each of them being completed in their entirety before the next is applied, without any interleaving even if they refer to the same byte offsets in the file

Aren't these atomic writes only a reflection of the fact that data is written out as blocks?

well, yeah, it wouldn't otherwise be necessary for the writes to be fully serialized (because blocks and the cache traditionally, but really anything)

edit: its also worth noting that even in the traditional implementation since the block interface doesn't provide atomicity on faults this is kinda probabilistic anyways

Regardless of the article, the issues apply long before you reach exascale, e.g. on typical university-scale systems. PVFS' design compromises on POSIX go back quite a long way, and MPI-IO is about 20(?) years old. We've been running non-POSIX-compliant NFS for decades too.

> the guarantee that data has been committed to somewhere durable when a write() call returns without an error is a semantic aspect of the POSIX write() API

No, that's just flat out wrong.

> POSIX I/O is stateful

This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.

The title of the article is really bold. POSIX I/O solves a common problem just fine (it's not perfect, but not for the reasons given in the article, and we don't know how to do it much better). I don't know anything about the domain the author talks about (HPC), but it seems what he needs is basically direct access to the block device. Or writing away through network sockets / using a database.

>> the guarantee that data has been committed to somewhere durable when a write() call returns without an error is a semantic aspect of the POSIX write() API

> No, that's just flat out wrong.

from "man 3p write":

       After a write() to a regular file has successfully returned:

        *  Any successful read() from each byte position in the file that  was
           modified  by  that  write  shall  return  the data specified by the
           write() for that position until such byte positions are again modi‐

        *  Any  subsequent successful write() to the same byte position in the
           file shall overwrite that file data.

>> POSIX I/O is stateful

> This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.

This is very true, but in the workloads the author is talking about, there are often times that a stateless API would enable a more efficient implementation. Think about what is going on in your file server when you have 100k clients all accessing the same open file.

> . I don't know anything about the domain the author talks about (HPC), but it seems what he needs is basically direct access to the block device. Or writing away through network sockets / using a database.

The author is talking about (possibly distributed) networked filesystems backing clusters with extreme levels of parallelism (minimum 100s of nodes with 10s of processors on each node, and it gets much bigger). As far as "using a database" that falls under the category of a user-space I/O stack, where the (userspace) database is proxying the I/O to reduce state.

The title of the article isn't at all bold in context, because it is well accepted in HPC that POSIX I/O is the bottleneck for certain types of loads, and the author is clarifying to those not familiar with the details why this is true.

For this ...

    Any successful read() from each byte position in the file that  was
    modified  by  that  write  shall  return  the data specified by the write()
... perhaps I can re-open() and re-read() the same byte value written by another process for the same file, but the file contents may not have been fully flushed all the way to disk. The file contents may be "durable" across processes on the same running OS that mount the same filesystem ... but if the OS happens to die before the data is flushed, then perhaps after reboot the open()/read() will return an older value previously written.

The semantics of "durability" are a squishy concept.

yes, the term "durable" was perhaps a poor choice of words, but the paragraph that followed made it clear that they were aware of the specific requirements (specifically mentioning making dirty caches available to all processes)

Durable has a very specific meaning for I/O and is just not correct here.

POSIX does intentionally not specify any durability at all (e.g. a no-op fsync is explicitly permitted).

And in practice it is close to that, due to cheating hardware with a cache.

You can actually do that with mmap and if you want shm. (Mmap works on files well.) Of course then you have to deal with all the dirt yourself.

>>> POSIX I/O is stateful

>> This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.

> This is very true, but in the workloads the author is talking about, there are often times that a stateless API would enable a more efficient implementation. Think about what is going on in your file server when you have 100k clients all accessing the same open file.

Kernel can already cache such checks, I suppose. If you open a file 100k for 100k users you definitely have other scaling problems to solve first. Even if that becomes a bottleneck, a simple userspace LRU can solve the problem.

I don't understand how having the permission check performed upon every operation is faster than performing it only once at open.

But you don't need to keep the state of "this file is opened".

What if the task that is being processed is idempotently writing something to the disk and then fails (and we assume we should be able to repeat it)?

Statefulness would make us write the code to process the closure of the file and also to process all of the failures if we failed to close the file for proper RAII.

You don't keep the state "this file is opened", the kernel does. All you have is a ticket, the file descriptor.

The problem is that the bookkeeping also includes the position in the file. I guess an API with a position argument could work, when you can leave it null for network operation (or unseekable file streams).

But again, this does not make the API completely stateless and for good performance reasons.

>direct access to the block device

And that's more or less a thing that is in widespread use right now! https://en.m.wikipedia.org/wiki/Filesystem_in_Userspace

I'm not sure FUSE is all that useful for that. It still goes through the kernel and additionally roundtrips through userspace again. I think the only advantage is the convenience of not having to do kernel development to make a filesystem. I.e. anything you could do with FUSE, you could do in the kernel with fewer copies and context switches.

I'm also not aware of any FUSE filesystems that directly access block devices. They usually have network or other filesystem backends.

Ntfs-3g uses FUSE and directly accesses block devices.

> A typical application might open() a file, then read() the data from it, then seek() to a new position, write() some data, then close() the file. File descriptors are central to this process; one cannot read or write a file without first open()ing it to get a file descriptor, and the position where the next read or write will place its data is generated by where the last read, write, or seek call ended.

pwrite() doesn't have this problem at alll, and my copy of the "POSIX.1-2008 with the 2013 Technical Corrigendum 1 applied" man pages (type "man 3p pwrite" on your local up-to-date Linux box) mentions a function called pwrite(). I'm pretty sure that this makes pwrite() an example of POSIX I/O.

> Because the operating system must keep track of the state of every file descriptor–that is, every process that wants to read or write–this stateful model of I/O provided by POSIX becomes a major scalability bottleneck as millions or billions of processes try to perform I/O on the same file system.

This makes no sense. If you have millions or billions of processes on one machine trying to perform I/O on the same file system, you probably have scalability problems, and those problems have nothing whatsoever to do with POSIX I/O.

If, on the other hand, you have millions or billions of processes performing I/O to the same networked filesystem, then you certainly need to think carefully about scaling, and POSIX semantics may well get in the way, but the problem you face here has nothing to do with the fact that POSIX has file descriptors. This is because file descriptors are local to each machine, and they very much don't synchronize their offsets with each other.

(Unless you use O_APPEND. Having millions of processes O_APPENDing to the same file is nuts, unless your filesystem is designed for this, in which case it's probably just fine.)

type "man 3p pwrite" on your local up-to-date Linux box)

Or, of course, type "man 3p write" into Google, there are many man-page sites.

I typically use die.net [1] which is among my top search hits. It could do with some cleaning of the web versions, but it's fine.

[1] https://linux.die.net/man/3/pwrite

I’ve spent most of my career specifically dealing with HPC I/O from a systems and applications perspective. I worked in academia for a while, then in the DOE (almost indistinguishable from academia except worse bureaucracy; shocking, I know), and finally in private industry.

I don’t think I’ve ever read a more spot-on description of the problem. Once I left the DOE, I realized that the problem was far more acute than I had thought. At least on the real big systems, most of our work was with a small number of research groups basically doing the same workflow: start your job, read in some data, crunch, every 15 minutes or something slam the entire contents of system memory (100s of TB) out to spinning disk, crunch, slam, your job gets killed when your time slice expires, get scheduled again, load up the last checkpoint, rinse, repeat. Because of the sheer amount of data, it’s an interesting problem, but you could generally work with the researchers to impose good I/O behavior that gets around the POSIX constraints peculiarities of the particular filesystems. You want 100,000,000 cpu hours on a $200M computer? You can do the work to make the filesystem writes easier on the system.

Coming into private industry was a real eye-opener. You’re in-house staff and you don’t get to say who can use the computer. People use the filesystems for IPC, store 100M files of 200B each, read() and write() terabytes of data 1B at a time, you name it. If I had $100 for every job in which I saw 10,000 cores running a stat() in a while loop waiting for some data to get written to it by one process that had long since died, I’d be retired on a beach somewhere.

The problem with POSIX I/O is that it’s so, so easy and it almost always works when you expect it to. GPFS (what I’m most familiar with) is amazing at enforcing the consistency. I’ve seen parallel filesystems and disk break in every imaginable way and in a lot of ways that aren’t, but I’ve never seen GPFS present data inconsistently across time where some write call was finished and it’s data didn’t show up to a read() started after the write got its lock or a situation where some process opened a file after the unlink was acknowledged. For a developer who hasn’t ever worked with parallel computing and whose boss just wants them to make it work, the filesystems is an amazing tool. I honestly can’t blame a developer who makes it work for 1000 cores and then gets upset with me when it blows up at 1500. I get grouchy with them, but I don’t blame them. (There’s a difference!)

But as the filesystems get bigger, the amount of work the filesystems have to do to maintain that consistency isn’t scaling. The amount of lock traffic flying back and forth between all the nodes is a lot of complexity to keep up with, and if you have the tiniest issue with your network even on some edge somewhere, you’re going to have a really unpleasant day.

One of the things that GCE and AWS have done so well is to just abandon the concept of the shared POSIX filesystem, and produce good tooling to help people deal with the IPC and workflow data processing without it. It’s a hell of a lot of work to go from an on-site HPC environment to GCE though. There’s a ton of money to be made for someone who can make that transition easier and cheaper (if you’ve got it figured out, you know, call me. I want to on it!), but people have sunk so much money into their parallel filesystems and disk that it’s a tough ask for the C-suite. Hypothetically speaking, someone I know really well who’s a lot like me was recently leading a project to do exactly this that got shut down basically because they couldn’t prove it would be cheaper in 3 years.

It seems like the OP really wants to use a database instead of a file system.

I don't think so; HPC is a different beast There are very few if any organizations on earth that have a database that can write a sustained 1TByte/second+ from as few as 300-400 node to 100s of PB of stateful media for years on end.

Ken Batcher (maybe of OSU?) wrote a quote that the rest of us have been using for years: "Supercomputers are a tool for converting a CPU-bound problem into a HPC-bound problem."

The filesystems start to look more like databases over time, but it's not like they can throw down a nice Cassandra cluster and have it pick up the slack. I'm not saying it will never happen, but I don't think it's am option at the moment.

Rather "into an I/O-bound problem". Apologies for ruining Ken's great joke.

Exactly. Layer a system on top of the usual I/O and you get the ratio of performance / statefulness you want.

They do, more or less. It's called MPI-IO, but it's still tough.

Now working on my second parallel file system, I am still amazed how the (POSIX) file system interface has stood the test of time and has been reimplemented in so diverse forms, from local file systems to file systems that are essentially modern large-scale fault-tolerant distributed systems. Completely different languages, platforms, technologies.

Sure, it is not easy to implement a file system and the details are subtle, but the abstraction gives you the necessary freedom to build very scalable and high performance systems and at the same time provides applications with a well-defined set of mechanisms to solve persistence.

It might be the most stable and versatile interface that we have in computing?

Basic message: big parallel I/O systems need to shift responsibility for maintaining file state, moving away from the operating system and towards the application.

POSIX API has pretty simple view of file state with open(), read(), write(), and close(), but this interface does not scale well concurrency-wise.

That part scales fine. There are other problems with the POSIX open/read/write model, such as being too strict wrt consistency and too loose wrt ordering and durability, but it scales OK. It's the metadata/namespace operations that are hard to scale. I've been working on distributed filesystems for a long time. If all I had to deal with was reads and writes I'd be a happy (OK, less angry) man.

I trust your experience over my own in this area; just reading this piece with curiosity. That makes sense, metadata and namespace operations would be difficult when distributed. But isn't the graph in the article specifically about file handle state maintained by the OS? I guess the graph is weird though it wants to be logarithmic but the power of the series keeps decreasing...what type of graph is that?

What's so bad about POSIX IO? People not knowing that async write APIs exist. Additionally, that mmap exists to bypass all the atomicity guarantees. Both are in POSIX.

Some of the claims in this article are total nonsense.

The article points out that file descriptors are stateful - they have a current offset associated with them, and I/O syscalls update that offset - then claims that this "stateful model" could be a "scalability bottleneck" with "billions of processes" in parallel filesystems. Except that each of those billions of processes will have its own file descriptor with its own offset! The only case where this could cause contention is if multiple threads in a single process are trying to read from the same file descriptor, which would be a really bad idea in the first place, precisely because file descriptors are stateful. Just open multiple descriptors - or use pread/pwrite, which have existed for a long time. Perhaps the process-wide statefulness of many POSIX APIs is a bad design in a world with threads, but it has nothing to do with the concurrent-file-open benchmark in the article, or really any other performance problems with parallel filesystems.

Anyway, file descriptors are just per-process handles used for communicating with the kernel. At least in principle, there's no reason that remote filesystems should know or care about file descriptors on the client end, unless clients are using file locks (well, except those aren't file-descriptor-based anyway, although they should be).

Later, the article claims:

> While the POSIX style of metadata certainly works, it is very prescriptive and inflexible; for example, the ownership and access permissions for files are often identical within directories containing scientific data (for example, file-per-process checkpoints), but POSIX file systems must track each of these files independently.


> Supporting the prescriptive POSIX metadata schema at extreme scales is a difficult endeavor; anyone who has tried to ls -l on a directory containing a million files can attest to this.

POSIX access bits are literally 15 bits per file. uid and gid are a few bytes. The overhead of storing these for each file definitely isn't what's making ls -l slow. Perhaps there's some system where time spent checking them all, at access time, is a bottleneck, but I'd be very surprised if modern Linux was such a system; that kind of problem sounds easy to solve with some basic caching.

The article calls out created and modified times as another part of the metadata, thereby amusingly missing the only one of the three POSIX file timestamps - access time - that actually can cause big scalability issues (if left enabled).

Also, apparently the author has not heard of either ACLs, which are in POSIX, or xattrs, which are pseudo-POSIX (multiple systems have roughly compatible implementations based on an old POSIX draft) - both of which try to improve flexibility compared to classic POSIX metadata. There are problems with both of them, but you'd think they'd at least deserve a mention in the list of alternatives.

> Except that each of those billions of processes will have its own file descriptor with its own offset!

The offset is only a tiny part of how POSIX is stateful. The very fact that each read or write is associated with a particular fd, therefore with a particular authorization and lock context, is more of an issue at the servers. Even more of an issue is the possibility of still-buffered writes, which POSIX does require be visible to reads on other fds.

> At least in principle, there's no reason that remote filesystems should know or care about file descriptors

Untrue, and please don't try to "correct" others with your own inaccurate information. As I just said, each file descriptor (or file handle in NFS) has its own authorization and lock context, which must be enforced at the server(s) so knowledge of them can't be limited to the client.

> POSIX access bits are literally 15 bits per file. uid and gid are a few bytes.

Also mtime and atime, and xattrs which can add up to kilobytes, but more importantly what the author was really talking about was namespace information rather than per-file metadata. It's a common mistake. Even as someone who writes code to handle both of these separate concerns, I'm not enough of a pedant to whine every time an application programmer gets my domain's terminology wrong.

> the only one of the three POSIX file timestamps - access time - that actually can cause big scalability issues (if left enabled).

Untrue yet again. Mtime can be a problem too, as can st_size and st_blocks. In an architecture where clients issue individual possibly-extending writes directly to one of several data servers for a file but other clients can then query these values through a separate metadata server, that creates a serious aggregation problem. That's why I think the separate ODS/MDS model (as in Lustre) sucks. People resort to it because it makes the namespace issue easier, but it makes metadata issues harder. In the particular use cases where people have to stick with a filesystem instead of switching to an object store, it's a net loss.

Relevant quotes from http://github.com/globalcitizen/taoup ...

Optimization considered harmful: In particular, optimization introduces complexity, and as well as introducing tighter coupling between components and layers. - RFC3439

Design up front for reuse is, in essence, premature optimization. - @AnimalMuppet

To speed up an I/O-bound program, begin by accounting for all I/O. Eliminate that which is unnecessary or redundant, and make the remaining as fast as possible. - David Martin

The fastest I/O is no I/O. - Nils-Peter Nelson

The cheapest, fastest and most reliable components of a system are those that aren't there. - Gordon Bell

Safety first. In allocating resources, strive to avoid disaster rather than to attain an optimum. Many years of experience with virtual memory, networks, disk allocation, database layout, and other resource allocation problems has made it clear that a general-purpose system cannot optimize the use of resources. - Butler W. Lampson (1983)

Crowley's 4th rule of distributed systems design: Failure is expected. The only guaranteed way to detect failure in a distributed system is to simply decide you have waited 'too long'. This naturally means that cancellation is first-class. Some layer of the system (perhaps plumbed through to the user) will need to decide it has waited too long and cancel the interaction. Cancelling is only about reestablishing local state and reclaiming local resources - there is no way to reliably propagate that cancellation through the system. It can sometimes be useful to have a low-cost, unreliable way to attempt to propagate cancellation as a performance optimization.

Optimization: Prototype before polishing. Get it working before you optimize it. - Eric S. Raymond, The Art of Unix Programming (2003)

Before optimizing, using a profiler. - Mike Morton

Spell create with an 'e'. - Ken Thompson (referring to design regrets on the UNIX creat(2) system call and the fallacy of premature optimization)

The No Free Lunch theorem: Any two optimization algorithms are equivalent when their performance is averaged across all possible problems (if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems).

An efficient program is an exercise in logical brinkmanship. - Edsger Dijkstra

Choose portability [high level] over efficiency [low-level]. - Mike Garcanz: The Unix Philosophy

Laziness is the mother of efficiency. - Marian Propp

Jevons Paradox: As technology progresses, the increase in efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource.

It brings everything to a certainty, which before floated in the mind indefinitely. - Samuel Johnson, on counting

... and the kicker...

Those who don't understand Unix are condemned to reinvent it, poorly. - Henry Spencer

And this is how we get incredibly inefficient programs like atom.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact