Why not mmap? (useless-factor.blogspot.com)
94 points by nkurz on May 19, 2011 | 25 comments



The main gist:

"mmap() interface is missing a non-blocking way to access memory."

At the end of the day your process will still be suspended when the page is brought in from disk the first time it is read. In some situations, when you can't afford to have your thread suspended waiting for disk IO (say if you could do something else in the meantime, like service network requests), it would be better to use, for example, AIO to schedule a file read into a buffer and be notified later when it has finished ( http://linux.die.net/man/3/aio_read )
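For instance, a minimal sketch of that aio_read() pattern (the file name and buffer size are placeholders; on Linux, link with -lrt):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);   /* placeholder file */
        if (fd < 0) return 1;

        static char buf[4096];
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) return 1;      /* read is now in flight */

        /* The thread is free to do other work here, e.g. service
           network requests, polling the request between tasks. */
        while (aio_error(&cb) == EINPROGRESS)
            ;                                  /* ... run the event loop ... */

        ssize_t n = aio_return(&cb);           /* bytes read, or -1 */
        printf("read %zd bytes\n", n);
        close(fd);
        return 0;
    }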


The mechanism suggested by the author seems a bit absurd. Code would have to pre-warm regions of memory before accessing them to get non-blocking behavior. One of the major attractions of mmap() is that this is unnecessary: the OS handles all of the read-ahead and caching for you.
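(For concreteness: the closest existing primitive for that pre-warm check is mincore(), which only reports whether pages happen to be resident at the moment you ask. A rough sketch, race and all:)

    #include <stdbool.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns true if every page in [addr, addr+len) is resident, so
       touching it "should" not fault. addr must be page-aligned, and
       nothing stops the kernel from evicting a page between this
       check and the actual access -- that race is the whole problem. */
    bool region_resident(void *addr, size_t len) {
        long page = sysconf(_SC_PAGESIZE);
        size_t pages = (len + page - 1) / page;
        unsigned char *vec = malloc(pages);
        if (vec == NULL || mincore(addr, len, vec) != 0) {
            free(vec);
            return false;
        }
        for (size_t i = 0; i < pages; i++) {
            if (!(vec[i] & 1)) { free(vec); return false; }
        }
        free(vec);
        return true;
    }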

It also seems silly to go through all of this only to end up using a thread pool to perform the page faults, which brings back the thread starvation, context switches, and lock overhead that cooperative concurrency models try to avoid.

I'm trying to think through all the scenarios, but it doesn't seem as if it's really practically possible (in C at least) to make mmap() I/O work in a cooperative concurrency model.


Not everything is C. Anything from event-based libraries like Twisted or Node.js up through Erlang or Haskell could potentially make very good and very transparent use of this, if the implementation was slick enough, without changing much code.

I'd say that generally, when the kernel can do something such that a cooperating VM can get a speed boost without doing very much, that's a good thing, all else being equal. Which it never is, when it comes to the kernel. But it's worth thinking about, even if it's too complicated for C to use reasonably. (It's time to let that bottleneck go anyhow. There's all kinds of very good things that are simply too complicated to reasonably use in C, and the solution is not to limit ourselves to C for the rest of eternity.)


What does not being in C have to do with that? Whatever tricks Twisted and Node.js and friends use can be done in C, just with a different syntax, but at the end of the day it's the same machine code. I don't know how Twisted handles file I/O, but Node.js uses libeio, which performs all file I/O in a thread pool. It's no better than what you can do in C.


rbranson said: "but it doesn't seem as if it's really practically possible (in C at least) to make mmap() I/O work in a cooperative concurrency model."

In that context, it ought to make sense.

Further, there's a set of technologies that are good, but range from difficult to effectively impossible to use in C, and the list is growing, not shrinking. We can't afford to bind ourselves to what will work in C forever. For instance: Garbage collection, possible but hard and certainly a bit inelegant, you're fighting C. Software transactional memory: Don't even think about doing it in C. But it may still be a useful thing in some places. And so on, for a mix of things. C is what C is, but "it doesn't practically work in C" can't be allowed to be a fatal objection if we want to progress.

Sure, you can type anything you want in C. But you won't. And I know it. Don't even try to argue that you will, it's obvious that you won't. Given the current state of the programming language environment, it is an impossibly small window to sail through to claim that Java, Erlang, Haskell, Javascript, C#, C++, SQL, etc, all have no advantages over C because we can always do the same thing in C (and I deliberately picked a wide range), yet C has some sort of advantage over assembler that makes it worth using. I reject that C is a Platonic default language that all others must justify their existence against; it's merely one that got some popular OSes written in it for certain good reasons, but that doesn't gold-plate it against criticism. This argument hasn't been sensible for about 30 years now, and now it's just gibberish; we know better. Cost factors matter a lot.


Definitely agree with you that abstraction matters. I use C as an example because it's the canonical way in which we as developers can reasonably converse about directly dealing with the machine. C is still important and will continue to be important because it's arguably the best abstraction we have to write code directly against the machine. GC and STM would be orthogonal to the purpose of C.

It's really sort of pointless to argue what's possible in environments lacking real, unsafe pointers, because the whole idea with mmap()'d I/O is that it lets apps access files as if they were a region of memory. Java wraps up the behavior in FileChannel, and most other platforms have a way to access mmap() in some fashion, though the practical use cases are vanishingly small. In C, it's done mostly for squeezing out bits of extra performance by shrinking memory requirements/allocation cost and avoiding copies. Many of the safety mechanisms used by VM environments invalidate or make unavailable the shortcuts that are used in tandem with mapped I/O to achieve higher performance.
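For reference, the basic pattern under discussion, sketched in C: the file shows up as a plain byte array and page faults do the reading (error handling trimmed, nonempty file assumed, file name a placeholder):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);   /* placeholder, nonempty */
        if (fd < 0) return 1;
        struct stat st;
        fstat(fd, &st);

        const unsigned char *p =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        /* Touching p[i] may fault and suspend this thread while the
           kernel pages the data in -- the blocking discussed above. */
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];
        printf("checksum: %lu\n", sum);

        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }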

In the end, it's all still triggering the page fault that will load the page from disk if it's not in RAM, and the kernel is still going to suspend the calling thread while that happens. No way around that. Alleviating this through thread pools would just re-introduce the issues of inter-thread locking and context switch overhead, and present an additional bottleneck; in the end it just piles on extra complexity. I don't see that as a net win at all.

While the post has some great information on mapped I/O, it seems as if the author is dogmatic about the use of mapped I/O and is trying to find a way to use this hammer in a situation where it might be inappropriate.

EDIT: To the author who replied: awesome, awesome. Ultimately this is great stuff to talk about and these types of posts are what HN should really be about, even though it's turned into startup TMZ.


Broad agreement in general, but I would observe there are still use cases for non-bare-metal languages for mmap. It doesn't conflict with all VMs. Haskell could use it (it actually has surprising interfacing capabilities on this front), Google says OCaml can use mmap though AFAIK it has a less useful threading story, Lua can probably productively use it though again I'm unclear on the threading. Erlang can't use it out-of-the-box but conceptually it could be modelled as a port, though again whether you could get a performance win I don't know. Mono and the JVM could use it, though again, primitive threading story. Python and Perl have interfaces to them but you sacrifice so much performance simply by using them that yeah, it doesn't much matter. But at least in theory there are VMs that can productively use it.


Good point about C. You misunderstand the intent of my post though.

First, as a small point, I was not seriously suggesting using thread pools to implement mlock_eventfd. That's just a way that we could play around with it short of a kernel implementation, which would probably be more efficient.

Second, I don't think I'm dogmatic about mmap. It's just a cool interface that would be nice to see more widely used, and I was suggesting that it might be more usable with a small addition. Commenters here, at my blog and on Reddit have pointed out other problems, and I'm happy to learn about these. If mmap isn't good and we should keep using DIO, then I think there are still possible improvements that we could make to the interface to allow better cooperation with the kernel.


I don't think you've actually tried to write re-entrant or evented code in C before without help from libev.

There's a reason my day job is Python.


You don't have to use the proposed mechanism: it's mostly intended for programs that already implement their own caching layer.

I'd expect the limited address space on 32-bit processors to be a bigger issue than the complexity.


Actually, in my experience, read-ahead into cache is what the OS does NOT handle for you. To get file-level read ahead in *nix you need to use normal IO syscalls. mmap is only suitable for completely random IO.


In instances where you have many processes reading the same files in some sort of access pattern, and you know they need to be in the disk cache, you can use fadvise[1]. It lets you tell the kernel, "I am going to need that file, so why don't you go ahead and get it into the disk cache."

For many applications this is not the right way to go, but for completely predictable access patterns on a machine that is devoted to it, it can save the day.

[1] http://linux.die.net/man/2/fadvise
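A minimal sketch of the call (the actual entry point is posix_fadvise(); the path is a placeholder):

    #include <fcntl.h>
    #include <unistd.h>

    /* Hint the kernel to start pulling a file into the page cache
       ahead of use. Returns 0 on success, like posix_fadvise itself. */
    int prefetch_file(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        /* offset 0, len 0 means "the whole file" */
        int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        close(fd);   /* readahead, once queued, survives the close */
        return rc;
    }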


There is an madvise() syscall that roughly translates into an fadvise() for memory-mapped regions as well.
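Sketched, for symmetry with the fadvise example above:

    #include <stddef.h>
    #include <sys/mman.h>

    /* The madvise() equivalent for an existing mapping: ask the
       kernel to fault in [addr, addr+len). addr must be page-aligned. */
    int prefetch_region(void *addr, size_t len) {
        return madvise(addr, len, MADV_WILLNEED);
    }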


mmap is how the Varnish web cache works - the cache file is mmap'ed in, and the kernel does the rest.

Given the awesome performance of Varnish, I am surprised more applications haven't taken this approach.


Varnish is good because the size of a web page fits nicely with the size of memory pages... if you have very many small objects, you'll have to get a bit more clever about how you place them.


For what it's worth, I've written several projects that entirely rely on mmap()ed memory:

Localmemcache (key-value database): http://localmemcache.rubyforge.org/

clispy (Port of PNorvig's lispy to C): https://github.com/sck/clispy


The big problem with mmap seems to be handling I/O errors. The most transparent uses of mmap are for executables. Failing to page in part of the program can probably reasonably result in a crash. But what of a server process handling a pile of different data on unreliable disks? Destroying the entire process on a single I/O error isn't ideal.

On Windows it seems like the best way to handle these errors is by either embedding SEH into the surrounding code or adding a vectored exception handler globally. On Unix, you have to set a SIGBUS signal handler. But then, mmap is apparently not guaranteed to be async-signal-safe if you want to remap a zero page over the broken one, and longjmp out of a signal handler is its own pile of potatoes; both seem to work on various modern Unixoids, but I haven't been able to find documentation saying that they'll continue to work. And with longjmp, or on Windows (where you can't remap pages over other pages directly, that I know of), any surrounding code that accesses the map needs to be abortable all the way up to a suitable error point rather than just having to handle bogus values. Much code assumes that a simple memory access will not cause a recoverable exception that may result in reëntering the code later.

And if you're in a library on Unix, good luck getting permission from the main process to alter signal handlers. The hook mechanism isn't as rich as that in Windows, so with the exception of large application framework libraries that are expected to take over the process anyway, it's an invasive and possibly irreversible activity.

This is all sad, because I love the idea of mmap. I was tinkering with a C library for accessing certain kinds of files, and I want to do it with all mmap, but I'm not sure I can overcome the I/O error problem adequately. (The blocking and address space problems are not too bad here; they impact performance and capacity, but not correctness.)
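For anyone curious, a rough sketch of the Unix recipe described above, hedged exactly as stated: it works in practice on common systems, but nothing documents that it will keep working:

    #include <setjmp.h>
    #include <signal.h>
    #include <stddef.h>

    static sigjmp_buf io_error_jmp;

    static void on_sigbus(int sig) {
        (void)sig;
        siglongjmp(io_error_jmp, 1);   /* abandon the faulting access */
    }

    /* Sum a mapped region; returns 0 on success, -1 if the read hit
       an I/O error. One jump buffer, no nesting, not thread-safe --
       exactly the limitations complained about above. */
    int sum_mapping(const unsigned char *p, size_t len, unsigned long *out) {
        struct sigaction sa = { .sa_handler = on_sigbus };
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        /* savemask=1 so the longjmp also restores the signal mask */
        if (sigsetjmp(io_error_jmp, 1) != 0)
            return -1;                 /* arrived here from the handler */

        unsigned long sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += p[i];
        *out = sum;
        return 0;
    }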


On top of all of that, using mapped I/O effectively means embracing it directly. Adding an abstraction layer on top that would allow a quick switch to stdio would negate the benefits. Sucks.


Another problem with mmap not mentioned is that not all file systems support it. Filesystems like jffs2 which access NAND flash don't support mmap.


mmap is great, but it often lures developers into thinking that in-memory and on-disk data structures can be reasonably unified. It's still spinning media, and even SSDs are orders of magnitude slower than regular RAM. Optimized data structures should be used accordingly.

It's also not a hard rule that the OS cache is always better. It's probably very good at block-level caching, so don't rewrite that, but that's not very fine grained. The OS can't collect much more than some basic access statistics and madvise()s to figure out what to keep resident. It's kind of dishonest to make it seem like "advanced databases" like PostgreSQL should abandon their own caches entirely. In fact, most PostgreSQL tuners suggest that only a fraction of available RAM should be used as a buffer cache, and that it's prudent to just let the OS manage most of it. The query planner can even be advised as to how much effective cache space is available, including the OS disk cache.


Another common use of mmap is allocating blocks of memory, e.g. by mapping /dev/zero in private mode. A couple of convenience functions to use mmap can be found here: https://github.com/ndevilla/mmapi
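The pattern, sketched (on modern systems MAP_ANONYMOUS achieves the same thing without the file descriptor):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Allocate a zero-filled block by privately mapping /dev/zero. */
    void *alloc_block(size_t len) {
        int fd = open("/dev/zero", O_RDWR);
        if (fd < 0) return NULL;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        close(fd);   /* the mapping stays valid after the fd is closed */
        return p == MAP_FAILED ? NULL : p;
    }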


I don't get the point of things like this. If you remove every single blocking part in your code then you will always use up your scheduling quantum and become the lowest priority process (the mother of all blocking) anyway.


If database throughput is that important to you then you will put it on an idle machine that has no other services running, or assign a dedicated CPU core to it.


Referring to comments below the original article:

So when epoll doesn't cope with regular files the way the author expected, why not use a better interface: kqueue?


Linux doesn't have kqueue. But kqueue is only an API; you could emulate it with io_submit() and io_getevents().

A bigger obstacle is that not all file systems support asynchronous I/O, the io_*() syscalls won't help you there.
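A rough sketch of those calls via libaio (link with -laio); the path is a placeholder, and per the caveat above, without O_DIRECT many file systems will just fall back to blocking:

    #define _GNU_SOURCE   /* for O_DIRECT */
    #include <fcntl.h>
    #include <libaio.h>
    #include <unistd.h>

    /* Submit one read and (for brevity) wait for it; in real code the
       io_getevents() call would sit in the event loop instead. buf
       must be suitably aligned for O_DIRECT (e.g. posix_memalign'd). */
    long read_async(const char *path, void *buf, size_t len) {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) return -1;

        io_context_t ctx = 0;
        if (io_setup(1, &ctx) != 0) { close(fd); return -1; }

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, len, 0);   /* len bytes at offset 0 */
        io_submit(ctx, 1, cbs);

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);    /* reap the completion */

        io_destroy(ctx);
        close(fd);
        return (long)ev.res;                   /* bytes read, or -errno */
    }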



