"POSIX I/O is simply not what HPC needs." but it's ok for almost any other use case and "There are many ways to move beyond the constraints imposed by POSIX I/O without rewriting applications, and this is an area of active research in both industry and academia." The problem is the lack of a standard solution yet.
The post does a good job of explaining why HPC doesn't need the strong guarantees provided by POSIX I/O and why they slow down those kinds of applications. They are perfectly fine and desirable for most workloads on single-user workstations, such as web development (save a file in the editor and the compiler gets its bytes exactly as you wrote them).
I'd have thought the salient feature is collective I/O, not cleaning up silly POSIX I/O of some sort. Nothing's perfect, but what should replace the non-blocking collective MPI-IO under5, for instance? I wonder what the major NERSC applications are that don't do sane I/O and should. The published workload characterization doesn't help.
The author's on much firmer ground when he talks about POSIX consistency semantics. While we can't control misguided applications' (ab)use of locks, we can certainly offer models that don't require serializing things that the user never wanted or expected to be serialized, or making them synchronous likewise. As I've also written before, we need more ways to describe the user's desired level of ordering and durability, separately, giving the implementation maximum flexibility to optimize I/O while preserving the guarantees that users really care about. The setstream primitive in CCFS is a good example of the sorts of things we need here. I'm not at all convinced that object-store-centric approaches are the way to go here, but some evolution is clearly needed.
 http://pl.atyp.us/2016-05-updating-posix.html section under "fsync"
Updating POSIX, whether through more expressive consistency options or just giving middleware developers the knobs they need to implement such mechanisms themselves, has come up before, as you say. My worry is that many of the serious efforts from years past were ahead of their time, and the hyperscalers have since gone their own way, so the ship has sailed on POSIX I/O.
I see the CCFS paper cites it to say that CCFS is better, but I haven't dug into why.
for x in *; do cat "$x" & done | pipeline ...
In the Windows model (in contrast), the programmer says "when this data is available (from this socket, file, etc), put it here (into this memory buffer)" and then waits for the kernel to do something. As a result, you skip that extra syscall -- but you also skip any other read you might have needed to do for data that's arrived at (roughly) the same time. AIO is potentially similar, but it never seems to benchmark well in my experience.
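For reference, the POSIX AIO flavour of that model looks roughly like this (untested sketch; the file name and buffer size are made up, and glibc implements these requests with a pool of user-space threads behind the scenes, which is part of why it rarely benchmarks well):

    /* build with: cc aio_example.c -lrt */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/example.dat", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        static char buf[4096];
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;            /* "put the data here" */
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

        /* Wait for the request to complete instead of retrying a read(). */
        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);

        printf("read %zd bytes\n", aio_return(&cb));
        close(fd);
        return 0;
    }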
There is no other choice, really. Let's say you want to read a specific page from a file, and no page is currently in the kernel buffers. If poll were to return not-ready (and remember that poll has no idea which pages you are interested in), what action, within POSIX, would you take to change that?
poll and friends just do not make sense for plain file descriptors. In Linux, splice adds an interesting twist, but it is still awkward to use.
pread() is an unusual case. I'd be happy if poll() simply indicated read() would succeed, and I'm happy doing an lseek() before every read().
However a pread() with O_NONBLOCK could return EAGAIN and poll() would then know what pages I'm interested in.
> what action, within posix, would you take to change it?
The general (POSIX-compatible) strategy (as I indicated) has been to fork a thread off to handle reading so that my application is notified when the data is in the fifo. This works for me when I'm reading (say) five files from a slow (or remote) storage, but it's not ideal since it causes a copy.
Also having pread, which is designed for concurrency and scalability, carry state around for poll (an unbounded amount of state btw: you can have multiple threads issuing concurrent preads on the same fd) makes it completely pointless.
There are Linux patches floating around to add flags to preadv2 so that, if a page is not present, it returns EAGAIN; that way you can optimistically do a synchronous read and fall back to a thread pool in the slow case. It still doesn't involve poll and friends.
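If I remember right, those patches eventually landed as RWF_NOWAIT on preadv2 (Linux 4.14+, recent glibc). A rough sketch of the optimistic-read pattern, with the thread-pool handoff left as a stub:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Try to satisfy the read from the page cache without blocking; if the
     * data isn't resident, hand the request to a worker thread instead. */
    static void read_optimistically(int fd, void *buf, size_t len, off_t off)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

        if (n >= 0)
            printf("fast path: %zd bytes straight from the cache\n", n);
        else if (errno == EAGAIN)
            printf("slow path: not cached, would queue a blocking pread() "
                   "on a thread pool\n");
        else
            perror("preadv2");
    }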
readahead(2) ? posix_fadvise(2) ?
An idea for how it could be implemented in linux: kernel would offer a new kind of file descriptor. It would have a limited, in-kernel buffer, much like a socket has an arrival space. Once you have it, call something like,
newkind_page_copy(newkind_fd, disk_fd, page_count, timeout)
The buffer would behave like a socket for the purposes of the read and error lists in select/poll/epoll. If you close either descriptor, it goes into an error state. Within the kernel, it would watch real_file_fd, using code similar to whatever is underneath inotify. Like inotify, this would not work for NFS.
? What man page are you reading? The goal of readahead is to prime the cache and return immediately. http://man7.org/linux/man-pages/man2/readahead.2.html
There is a second obstacle I have in mind, around the robustness of this. Imagine we had an implementation of that which reliably returned quickly, but we take our time doing the subsequent read. By the time we do, the value has already been chased out of the cache. It is an edge case, and you could probably avoid it in practice, but it would take significant effort to verify your mechanism, and it would create cognitive load.
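For what it's worth, the prime-then-read pattern that readahead()/posix_fadvise() give you looks something like this (sketch only; both calls are hints, so the eviction race described above still applies and the later pread() may block anyway):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static void prime_then_read(int fd, off_t off, size_t len, char *buf)
    {
        /* Ask the kernel to start pulling the range into the page cache. */
        posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);  /* portable hint */
        readahead(fd, off, len);                           /* Linux-specific */

        /* ... go do other work, come back later ... */

        ssize_t n = pread(fd, buf, len, off);  /* may still block on I/O */
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes\n", n);
    }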
1. If a read for a range of bytes on a non-blocking file descriptor can be wholly or partly satisfied from the cache, put that data in the read buffer, and return the amount of it
2. If a read on a non-blocking file descriptor can't be satisfied from the cache, return -1, set errno to EAGAIN, and kick off an asynchronous task to load that range of bytes into the cache (an 'asynchronous load')
2a. If there is already an asynchronous load for that range running, don't start another one
3. When an asynchronous load finishes, mark that range as available
1a (surprise!). If that range of bytes was marked as available, remove the mark
4. If there are a nonzero number of available ranges for a file descriptor, return POLLIN from poll()
The way a program with lightweight threads (Node, Go, etc) would use this:
1. Open files as non-blocking
2. Make reads as usual
3. If a read succeeds, or fails other than with EAGAIN, pass the result to the thread and carry on
4. If a read fails with EAGAIN, suspend the thread, and remember the range it was reading
Meanwhile, in a background IO thread ...
5. Repeatedly poll all file descriptors for which there are remembered reads
6. If a file descriptor shows up as POLLIN, run through all the remembered reads for it, and make them again
7. If a read fails with EAGAIN, cool, leave it remembered and carry on
8. If a read succeeds, or fails other than with EAGAIN, pass the result to the thread and resume it
Would that work?
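To make the background-IO-thread half concrete, here's a rough sketch of steps 5-8 in C. To be clear, this depends entirely on the hypothetical semantics proposed above; on today's kernels a regular file opened O_NONBLOCK never returns EAGAIN from a read, and poll() always reports it readable:

    #include <errno.h>
    #include <poll.h>
    #include <unistd.h>

    /* One suspended read per lightweight thread (illustrative names). */
    struct pending_read {
        int    fd;
        void  *buf;
        size_t len;
        off_t  off;
        /* plus a handle to the suspended thread to resume, not shown */
    };

    /* HYPOTHETICAL: relies on regular-file reads being able to return
     * EAGAIN and on poll() signalling cache availability, per the proposal. */
    void io_loop(struct pending_read *pending, int npending,
                 struct pollfd *pfds, int nfds)
    {
        for (;;) {
            poll(pfds, nfds, -1);                        /* step 5 */
            for (int i = 0; i < nfds; i++) {
                if (!(pfds[i].revents & POLLIN))
                    continue;
                for (int j = 0; j < npending; j++) {     /* step 6 */
                    struct pending_read *r = &pending[j];
                    if (r->fd != pfds[i].fd)
                        continue;
                    ssize_t n = pread(r->fd, r->buf, r->len, r->off);
                    if (n < 0 && errno == EAGAIN)
                        continue;                        /* step 7 */
                    /* step 8: success or hard error -- hand n back to the
                     * suspended thread and resume it (not shown). */
                }
            }
        }
    }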
I haven't really thought about partial results of asynchronous reads. If a read asks for two blocks, and one gets loaded, should poll return POLLIN or not? I lean towards yes, because the program can make use of that data. That raises the question of whether to remove the availability mark after the following read. Again, I lean towards yes, because it means asynchronous reads have the same behaviour as ones which partially hit cache. The OS can carry on reading the rest of the range into the cache, but it won't signal the arrival of that data to the program. That should be fine, as the program will probably try to read it soon anyway.
If a program takes too long to get round to reading data once it is available, it might have been pushed out of the cache by then. Too bad. The read will fail with EAGAIN and the whole cycle repeats.
If two threads in a program independently read the same range, only one asynchronous load will be run, and only one thread will end up removing the mark. Programs will need to be careful not to trip up over this; they probably need to coalesce reads on their side of the API.
It sucks to have to retry every outstanding read for a file when it shows POLLIN, but I can't see a way to return information about which ranges are available through the existing calls. A program could keep its list of outstanding reads in time order, scan it from earliest to latest, and stop when it gets a non-EAGAIN result. If there is more than one available range, poll will return POLLIN again and the program will go back and pick it up. If asynchronous loads complete one at a time in order, this reduces the cost to O(1), with a smooth increase in cost as that becomes less and less the case.
Basically, what I'm saying is that everyone should just implement the kqueue API from BSD.
> 2. If a read on a non-blocking file descriptor can't be satisfied from the cache, return -1, set errno to EAGAIN, and kick off an asynchronous task to load that range of bytes into the cache (an 'asynchronous load')
If you want pollable file reads, just use eventfd and roll your own. Linux threads are lightweight enough that there's not much reason to care whether they're "kernel" or "user" threads. But pollable file I/O will really only be a win for streaming reads and writes, so you could just use sendfile(), in which case you're reading or writing to a pollable socket; you use a dedicated thread pool to pump data from sockets to the file descriptors, or vice-versa, without any copying.
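The eventfd version of that is pretty small. A sketch (struct and function names are made up, error handling is minimal); the eventfd is what you drop into your poll/epoll set:

    /* build with: cc pollable_read.c -pthread */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    /* Illustrative request object. */
    struct file_req {
        int     file_fd;
        int     event_fd;   /* add this fd to your poll/epoll set */
        void   *buf;
        size_t  len;
        off_t   off;
        ssize_t result;
    };

    static void *worker(void *arg)
    {
        struct file_req *req = arg;
        req->result = pread(req->file_fd, req->buf, req->len, req->off);

        /* Completion notification: makes event_fd readable to poll/epoll. */
        uint64_t one = 1;
        if (write(req->event_fd, &one, sizeof one) != sizeof one)
            perror("eventfd write");
        return NULL;
    }

    /* When event_fd fires, read() its 8-byte counter and pick up req->result. */
    int submit_pollable_read(struct file_req *req)
    {
        req->event_fd = eventfd(0, 0);
        if (req->event_fd < 0)
            return -1;

        pthread_t t;
        if (pthread_create(&t, NULL, worker, req) != 0)
            return -1;
        pthread_detach(t);
        return 0;
    }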
BTW, Go doesn't need this solution (timeouts notwithstanding) because Go schedules multiple goroutines across multiple kernel-scheduled threads.
If you're really worried about file I/O latency, you should be at least as worried about VM page-fault latency. I disable swap on all my Linux servers, and I also disable overcommit. But my languages of choice (C and Lua) are capable of gracefully handling allocation failure (contrast Python or Go), and I implement my solutions to handle allocation failure without crashing the entire service. Too many other languages make the assumption that memory is infinite. (Alas, large parts of the Linux kernel do this as well, even with overcommit disabled.)
Solaris and Windows are the only major OSs I'm familiar with that implement strict and rigorous memory accounting, permitting the system to stay alive (and consistent) under memory exhaustion. FreeBSD might as well, I'm not sure.
And so on.
However a few companies have seen real commercial benefit from actually analysing their data, so we're seeing a lot of "big data" interest. Some of these companies have noticed that a bunch of hadoop/mongo/whatever boxes can't even approach the performance of one "big" box tuned right, so we're seeing somewhat of a resurgence in interest here.
Could you expand on that?
(I'm assuming that by the "Windows Way", you mean I/O completion ports.)
I think the main reason Linux doesn't really do it is that asynchronous IO is an intrusive change and there would be just too much work to implement it in all the file systems etc.; i.e. "not worth it".
Many applications which are not OK with synchronous disk I/O seem to find thread pools good enough: reasonably easy to implement, reasonably portable, usually performs ok.
Traditional Unix, including Linux, pretends to support poll() etc. for file I/O (i.e. it supports the interfaces, but the reality is synchronous). The reason for this is that an application that expects async I/O can kinda-sorta work with real files, and "kinda sorta" will not be too bad because local disks are fast enough to paper over any problem.
But then if the disks aren't local, why not actually make non-blocking I/O work as advertised?
Actually, I can guess the answer: there are 101 corner cases that mean I can't neatly separate out the app's design for non-blocking I/O. And that's why I half buy the argument.
But only half.
That's some seriously hairy stuff there. I reckon I'd move to building my app as a kernel module more readily than trying to make a robust async arrangement using that mechanism :) Call me demanding, but I think we can reasonably expect easier access to async I/O from our OS than this mechanism offers.
On the topic of which: how's the future of operating systems coming along?
In my opinion, a big issue with these is that asynchronous I/O requests cannot be used with select/poll/epoll, but rely on other notification methods instead. AIO can use Unix signals (ugh) or thread notification (see sigevent(7)) to signal completion. io_submit uses io_getevents to read events about completed I/O requests.
From a userspace programmer's point of view, it would make much more sense if io_setup() would return a file descriptor that could be used with select/poll, and io_events would be read() from the fd.
Currently it's impossible (?) to have filesystem i/o in the same select/poll loop with your network i/o because of this issue. This is why libuv falls back to using threads and synchronous i/o to provide "async" i/o.
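For anyone who hasn't used it, the kernel-AIO path looks roughly like this with the libaio wrappers (sketch; file name is made up, and the kernel path is only genuinely asynchronous with O_DIRECT). Note that the completion comes back through io_getevents, not through the select/poll loop, which is exactly the complaint above:

    /* build with: cc kernel_aio.c -laio */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/example.dat", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096))          /* O_DIRECT alignment */
            return 1;

        io_context_t ctx = 0;
        if (io_queue_init(8, &ctx) < 0) {
            fprintf(stderr, "io_queue_init failed\n"); return 1;
        }

        struct iocb cb, *list[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, list) != 1) {
            fprintf(stderr, "io_submit failed\n"); return 1;
        }

        /* Completion arrives here, via io_getevents -- not via poll/select. */
        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
            printf("read completed: %lld bytes\n", (long long)ev.res);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }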
Hacky workaround: don't do filesystem i/o in the first place. Have the kernel serve an iSCSI target, then consume it over the loopback interface entirely in userspace. Your disk now has network semantics! (But you have to bring your own filesystem driver.)
If you don't mind some extra abstraction overhead, you can skip past iSCSI and instead build a userspace NFS client lib into your app, consuming a loopback NFS share. s/NFS/9P/ if your disk server has in-kernel 9P.
(Is there any use to these approaches? Yes: they're actually nearly-optimal ways to persist state from a unikernel running on a hypervisor, because all the disks are remote one way or another anyway. Erlang on Xen uses the 9P approach, for example.)
Still, all this event goodness is not good enough for things like InfiniBand and the brave world of zero-copy networking.
You could use a pipe or eventfd or some other fd-based mechanism to wake up a worker thread on a filesystem event, but then you get two context switches (one to wake up io_getevent thread and another to wake up a worker). This carries a lot of indirect costs from jumping back and forth to the kernel.
I am unaware of a solution that would make filesystem i/o and network i/o play nice in one event loop without adding some overhead.
(Windows async disk i/o is thread-backed too behind the scenes, though)
Windows does not spin up or allocate a thread just to wait for an IO operation to complete. Rather, when an IO operation completes, a thread is allocated from the thread pool (hence the name completion ports) on which the callback happens.
So there is no excessive allocation of threads. Everything is event-driven and kernel managed until the operation completes.
People have taken the notion that threads are slow in some circumstances and made the conclusion that threads are slow, period. But that's simply not true.
Those kinds of servers are simply quite rare (IME). Nothing wrong with interesting problems in niche spaces, but it's not something you should be worrying about by default.
(No idea what "big" servers provide public stats, but e.g. https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar... ) lists a random day for stackoverflow having 209,420,973 http reqests, i.e. a little less than 3000 per second. I doubt context switching is going to matter for them (in the hypothetical world all this was served by one HTTP server, which of course it isn't).
The real cost of threads is that threaded code is far more difficult to get right than non-threaded code. The closest thing I know of to an excuse for using threads is that we are already using threads, so all the complexity is already there.
After all, once every line of code is already a potential race condition, it hardly matters if you turn a few more of them into real ones.
The only tricky part (API-side) is that write() is defined to confirm that some data got written, or not. If you let write() become non-blocking, and deferred the I/O, perhaps by returning immediately with EAGAIN or some such response, the write is in an unknown state to the rest of the application. What data can then be expected from other read() calls on that file from the same process? The old data? The new? A mix? You'd have to drop the POSIX consistency guarantee. For most programs, this is never likely to be an issue, as they don't tend to be re-reading bits of a file that they are also writing to, so it's a shame there's no option to let the write() be non-blocking if requested.
Issue 2 is failures: If the write, when it eventually happens, fails, you've got to have some way of passing that information back to the process. The solution for this seems easy enough; just let the FD become writeable again and the program will probably re-attempt the same write, at which time you can report the previous error.
Finally, making open() calls non-blocking would be the real API changer. If you defer all of the file-opening work and give the application a file descriptor before doing any I/O, then you can hit all kinds of errors (permissions problems, file not found, and so on). How do you pass them back to the program? You can make any future read() or write() calls fail, but these syscalls would now have to return error codes that they aren't specified as actually being able to return (read can't return things like EEXIST, EACCES and so on). So the API is inadequate for these situations. You can see how they could be extended to cover these cases, though.
It's a pronounced problem in high-performance computing because that community evolved from a much more primitive era and is dragging decades of legacy applications (and, to some degree, thinking) behind it.
I've always kind of struggled to imagine a case where an application would be correct with this guarantee and faulty otherwise. I guess some sort of cleverly designed structure with fixed-length records, where a compound mutation could be expressed in a contiguous byte range?
In the write-atomicity case, we're asking the file system to order all the writes and make sure the resulting file shows each of them being completed in its entirety before the next is applied, without any interleaving, even if they refer to the same byte offsets in the file.
Edit: it's also worth noting that, even in the traditional implementation, since the block interface doesn't provide atomicity on faults, this is kinda probabilistic anyway.
No, that's just flat out wrong.
> POSIX I/O is stateful
This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.
The title of the article is really bold. POSIX I/O solves a common problem just fine (it's not perfect, but not for the reasons given in the article, and we don't know how to do it much better). I don't know anything about the domain the author talks about (HPC), but it seems what he needs is basically direct access to the block device. Or writing away through network sockets / using a database.
> No, that's just flat out wrong.
from "man 3p write":
    After a write() to a regular file has successfully returned:

    * Any successful read() from each byte position in the file that was
      modified by that write shall return the data specified by the write()
      for that position until such byte positions are again modified.

    * Any subsequent successful write() to the same byte position in the
      file shall overwrite that file data.
> This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.
This is very true, but in the workloads the author is talking about, there are often times that a stateless API would enable a more efficient implementation. Think about what is going on in your file server when you have 100k clients all accessing the same open file.
> I don't know anything about the domain the author talks about (HPC), but it seems what he needs is basically direct access to the block device. Or writing away through network sockets / using a database.
The author is talking about (possibly distributed) networked filesystems backing clusters with extreme levels of parallelism (minimum 100s of nodes with 10s of processors on each node, and it gets much bigger). As far as "using a database" that falls under the category of a user-space I/O stack, where the (userspace) database is proxying the I/O to reduce state.
The title of the article isn't at all bold in context, because it is well accepted in HPC that POSIX I/O is the bottleneck for certain types of loads, and the author is clarifying to those not familiar with the details why this is true.
> Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write()
The semantics of "durability" are a squishy concept.
POSIX intentionally does not specify any durability at all (e.g. a no-op fsync is explicitly permitted).
>> This is fundamental to the authorization model. Authorization happens at file open time. It's also what enables the stream abstraction.
> This is very true, but in the workloads the author is talking about, there are often times that a stateless API would enable a more efficient implementation. Think about what is going on in your file server when you have 100k clients all accessing the same open file.
The kernel can already cache such checks, I suppose. If you open a file 100k times for 100k users, you definitely have other scaling problems to solve first. Even if that becomes a bottleneck, a simple userspace LRU cache can solve the problem.
I don't understand how having the permission check performed upon every operation is faster than performing it only once at open.
What if the task being processed is idempotently writing something to disk and then fails (and we assume we should be able to repeat it)?
Statefulness makes us write code to handle closing the file, and also to handle all of the failure cases where we fail to close it, for proper RAII.
The problem is that the bookkeeping also includes the position in the file. I guess an API with a position argument could work, where you can leave it null for network operation (or unseekable file streams).
But again, this does not make the API completely stateless and for good performance reasons.
And that's more or less a thing that is in widespread use right now! https://en.m.wikipedia.org/wiki/Filesystem_in_Userspace
I'm also not aware of any FUSE filesystems that directly access block devices. They usually have network or other filesystem backends.
pwrite() doesn't have this problem at all, and my copy of the "POSIX.1-2008 with the 2013 Technical Corrigendum 1 applied" man pages (type "man 3p pwrite" on your local up-to-date Linux box) mentions a function called pwrite(). I'm pretty sure that this makes pwrite() an example of POSIX I/O.
> Because the operating system must keep track of the state of every file descriptor–that is, every process that wants to read or write–this stateful model of I/O provided by POSIX becomes a major scalability bottleneck as millions or billions of processes try to perform I/O on the same file system.
This makes no sense. If you have millions or billions of processes on one machine trying to perform I/O on the same file system, you probably have scalability problems, and those problems have nothing whatsoever to do with POSIX I/O.
If, on the other hand, you have millions or billions of processes performing I/O to the same networked filesystem, then you certainly need to think carefully about scaling, and POSIX semantics may well get in the way, but the problem you face here has nothing to do with the fact that POSIX has file descriptors. This is because file descriptors are local to each machine, and they very much don't synchronize their offsets with each other.
(Unless you use O_APPEND. Having millions of processes O_APPENDing to the same file is nuts, unless your filesystem is designed for this, in which case it's probably just fine.)
Or, of course, type "man 3p write" into Google, there are many man-page sites.
I typically use die.net, which is among my top search hits. It could do with some cleaning of the web versions, but it's fine.
I don’t think I’ve ever read a more spot-on description of the problem. Once I left the DOE, I realized that the problem was far more acute than I had thought. At least on the really big systems, most of our work was with a small number of research groups basically doing the same workflow: start your job, read in some data, crunch, every 15 minutes or something slam the entire contents of system memory (100s of TB) out to spinning disk, crunch, slam, your job gets killed when your time slice expires, get scheduled again, load up the last checkpoint, rinse, repeat. Because of the sheer amount of data, it’s an interesting problem, but you could generally work with the researchers to impose good I/O behavior that gets around the POSIX constraints and the peculiarities of the particular filesystems. You want 100,000,000 CPU hours on a $200M computer? You can do the work to make the filesystem writes easier on the system.
Coming into private industry was a real eye-opener. You’re in-house staff and you don’t get to say who can use the computer. People use the filesystems for IPC, store 100M files of 200B each, read() and write() terabytes of data 1B at a time, you name it. If I had $100 for every job in which I saw 10,000 cores running a stat() in a while loop waiting for some data to get written to it by one process that had long since died, I’d be retired on a beach somewhere.
The problem with POSIX I/O is that it’s so, so easy and it almost always works when you expect it to. GPFS (what I’m most familiar with) is amazing at enforcing the consistency. I’ve seen parallel filesystems and disk break in every imaginable way and in a lot of ways that aren’t, but I’ve never seen GPFS present data inconsistently across time, where some write call finished and its data didn’t show up to a read() started after the write got its lock, or a situation where some process opened a file after the unlink was acknowledged. For a developer who hasn’t ever worked with parallel computing and whose boss just wants them to make it work, the filesystem is an amazing tool. I honestly can’t blame a developer who makes it work for 1000 cores and then gets upset with me when it blows up at 1500. I get grouchy with them, but I don’t blame them. (There’s a difference!)
But as the filesystems get bigger, the amount of work the filesystems have to do to maintain that consistency isn’t scaling. The amount of lock traffic flying back and forth between all the nodes is a lot of complexity to keep up with, and if you have the tiniest issue with your network even on some edge somewhere, you’re going to have a really unpleasant day.
One of the things that GCE and AWS have done so well is to just abandon the concept of the shared POSIX filesystem, and produce good tooling to help people deal with the IPC and workflow data processing without it. It’s a hell of a lot of work to go from an on-site HPC environment to GCE though. There’s a ton of money to be made for someone who can make that transition easier and cheaper (if you’ve got it figured out, you know, call me. I want in on it!), but people have sunk so much money into their parallel filesystems and disk that it’s a tough ask for the C-suite. Hypothetically speaking, someone I know really well who’s a lot like me was recently leading a project to do exactly this that got shut down basically because they couldn’t prove it would be cheaper in 3 years.
Ken Batcher (maybe of OSU?) wrote a quote that the rest of us have been using for years: "Supercomputers are a tool for converting a CPU-bound problem into an I/O-bound problem."
The filesystems start to look more like databases over time, but it's not like they can throw down a nice Cassandra cluster and have it pick up the slack. I'm not saying it will never happen, but I don't think it's an option at the moment.
Sure, it is not easy to implement a file system and the details are subtle, but the abstraction gives you the necessary freedom to build very scalable and high performance systems and at the same time provides applications with a well-defined set of mechanisms to solve persistence.
It might be the most stable and versatile interface that we have in computing?
The POSIX API has a pretty simple view of file state with open(), read(), write(), and close(), but this interface does not scale well concurrency-wise.
The article points out that file descriptors are stateful - they have a current offset associated with them, and I/O syscalls update that offset - then claims that this "stateful model" could be a "scalability bottleneck" with "billions of processes" in parallel filesystems. Except that each of those billions of processes will have its own file descriptor with its own offset! The only case where this could cause contention is if multiple threads in a single process are trying to read from the same file descriptor, which would be a really bad idea in the first place, precisely because file descriptors are stateful. Just open multiple descriptors - or use pread/pwrite, which have existed for a long time. Perhaps the process-wide statefulness of many POSIX APIs is a bad design in a world with threads, but it has nothing to do with the concurrent-file-open benchmark in the article, or really any other performance problems with parallel filesystems.
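To make the pread point concrete, here's a sketch of several threads sharing one descriptor and reading disjoint ranges concurrently; because the offset is an argument rather than fd state, there is no shared seek position to contend on (file name and sizes are made up):

    /* build with: cc pread_threads.c -pthread */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct job { int fd; off_t off; size_t len; };

    /* Each thread reads its own byte range through the *same* descriptor. */
    static void *read_range(void *arg)
    {
        struct job *j = arg;
        char *buf = malloc(j->len);
        ssize_t n = pread(j->fd, buf, j->len, j->off);
        printf("offset %lld: read %zd bytes\n", (long long)j->off, n);
        free(buf);
        return NULL;
    }

    int main(void)
    {
        int fd = open("/tmp/example.dat", O_RDONLY);   /* hypothetical input */
        if (fd < 0) { perror("open"); return 1; }

        pthread_t t[4];
        struct job jobs[4];
        for (int i = 0; i < 4; i++) {
            jobs[i] = (struct job){ .fd = fd, .off = (off_t)i * 4096, .len = 4096 };
            pthread_create(&t[i], NULL, read_range, &jobs[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        close(fd);
        return 0;
    }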
Anyway, file descriptors are just per-process handles used for communicating with the kernel. At least in principle, there's no reason that remote filesystems should know or care about file descriptors on the client end, unless clients are using file locks (well, except those aren't file-descriptor-based anyway, although they should be).
Later, the article claims:
> While the POSIX style of metadata certainly works, it is very prescriptive and inflexible; for example, the ownership and access permissions for files are often identical within directories containing scientific data (for example, file-per-process checkpoints), but POSIX file systems must track each of these files independently.
> Supporting the prescriptive POSIX metadata schema at extreme scales is a difficult endeavor; anyone who has tried to ls -l on a directory containing a million files can attest to this.
POSIX access bits are literally 15 bits per file. uid and gid are a few bytes. The overhead of storing these for each file definitely isn't what's making ls -l slow. Perhaps there's some system where time spent checking them all, at access time, is a bottleneck, but I'd be very surprised if modern Linux was such a system; that kind of problem sounds easy to solve with some basic caching.
The article calls out created and modified times as another part of the metadata, thereby amusingly missing the only one of the three POSIX file timestamps - access time - that actually can cause big scalability issues (if left enabled).
Also, apparently the author has not heard of either ACLs, which are in POSIX, or xattrs, which are pseudo-POSIX (multiple systems have roughly compatible implementations based on an old POSIX draft) - both of which try to improve flexibility compared to classic POSIX metadata. There are problems with both of them, but you'd think they'd at least deserve a mention in the list of alternatives.
The offset is only a tiny part of how POSIX is stateful. The very fact that each read or write is associated with a particular fd, therefore with a particular authorization and lock context, is more of an issue at the servers. Even more of an issue is the possibility of still-buffered writes, which POSIX does require be visible to reads on other fds.
> At least in principle, there's no reason that remote filesystems should know or care about file descriptors
Untrue, and please don't try to "correct" others with your own inaccurate information. As I just said, each file descriptor (or file handle in NFS) has its own authorization and lock context, which must be enforced at the server(s) so knowledge of them can't be limited to the client.
> POSIX access bits are literally 15 bits per file. uid and gid are a few bytes.
Also mtime and atime, and xattrs which can add up to kilobytes, but more importantly what the author was really talking about was namespace information rather than per-file metadata. It's a common mistake. Even as someone who writes code to handle both of these separate concerns, I'm not enough of a pedant to whine every time an application programmer gets my domain's terminology wrong.
> the only one of the three POSIX file timestamps - access time - that actually can cause big scalability issues (if left enabled).
Untrue yet again. Mtime can be a problem too, as can st_size and st_blocks. In an architecture where clients issue individual possibly-extending writes directly to one of several data servers for a file but other clients can then query these values through a separate metadata server, that creates a serious aggregation problem. That's why I think the separate ODS/MDS model (as in Lustre) sucks. People resort to it because it makes the namespace issue easier, but it makes metadata issues harder. In the particular use cases where people have to stick with a filesystem instead of switching to an object store, it's a net loss.
Optimization considered harmful: In particular, optimization introduces complexity, and as well as introducing tighter coupling between components and layers. - RFC3439
Design up front for reuse is, in essence, premature optimization. - @AnimalMuppet
To speed up an I/O-bound program, begin by accounting for all I/O. Eliminate that which is unnecessary or redundant, and make the remaining as fast as possible. - David Martin
The fastest I/O is no I/O. - Nils-Peter Nelson
The cheapest, fastest and most reliable components of a system are those that aren't there. - Gordon Bell
Safety first. In allocating resources, strive to avoid disaster rather than to attain an optimum. Many years of experience with virtual memory, networks, disk allocation, database layout, and other resource allocation problems has made it clear that a general-purpose system cannot optimize the use of resources. - Butler W. Lampson (1983)
Crowley's 4th rule of distributed systems design: Failure is expected. The only guaranteed way to detect failure in a distributed system is to simply decide you have waited 'too long'. This naturally means that cancellation is first-class. Some layer of the system (perhaps plumbed through to the user) will need to decide it has waited too long and cancel the interaction. Cancelling is only about reestablishing local state and reclaiming local resources - there is no way to reliably propagate that cancellation through the system. It can sometimes be useful to have a low-cost, unreliable way to attempt to propagate cancellation as a performance optimization.
Optimization: Prototype before polishing. Get it working before you optimize it. - Eric S. Raymond, The Art of Unix Programming (2003)
Before optimizing, use a profiler. - Mike Morton
Spell create with an 'e'. - Ken Thompson (referring to design regrets on the UNIX creat(2) system call and the fallacy of premature optimization)
The No Free Lunch theorem: Any two optimization algorithms are equivalent when their performance is averaged across all possible problems (if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems).
An efficient program is an exercise in logical brinkmanship. - Edsger Dijkstra
Choose portability [high-level] over efficiency [low-level]. - Mike Gancarz: The Unix Philosophy
Laziness is the mother of efficiency. - Marian Propp
Jevons Paradox: As technology progresses, the increase in efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource.
It brings everything to a certainty, which before floated in the mind indefinitely. - Samuel Johnson, on counting
... and the kicker...
Those who don't understand Unix are condemned to reinvent it, poorly. - Henry Spencer