Use mmap with care (sublimetext.com)
607 points by ingve 6 months ago | 213 comments

The first serious bug I ever dealt with professionally was a result of the hazards of mmap(). This was 1995, and I was working on AIX with a system that used a series of shared memory buffers for IPC. It was originally written with shmat(), and on AIX (at least in those days), shmat was limited to three shared segments, so we had a lot of performance-wrecking blocking going on while waiting for the buffers to be cleared.

My first piece of original-idea development was rewriting it with mmap so I could use an arbitrary (and programmable) number of buffers, tuning blocking against memory consumption, with logging to track performance for tuning. It was very cool. Worked great!

Until it went to production.

In production, it crashed every time it ran, shortly after starting up. Since we were doing seasonal production, backing out my change also backed out other necessary changes. It was very embarrassing and frustrating. Worse, I could not replicate the problem in testing! It only happened on the prod servers. And, as a wet behind the ears junior programmer, everyone assumed I'd just screwed up and was too dumb to understand how.

So I wrote a test program, divorced from our regular code, to test mmap() itself. Turned out that it ran fine on dev/test servers, but on the prod servers, it would randomly overwrite 1k memory pages with nulls. Yeah. Once I convinced the senior engineers and my manager, I got to report an OS bug to IBM. Who were like "What's wrong with your code, really?" I wound up sending them my C test code and the compiled executable, along with results.

It turned out the bug was caused by the order in which OS patches had been applied on the servers.

One of the first rules in the marvelous book The Pragmatic Programmer is "Select() isn't broken". Yeah, but sometimes it is.

And if a junior programmer came to me today reporting a bug in OS memory management, my first response would be "What's wrong with your code, really?"

AIX 3.2.5, around the same time, had a different mmap issue.

If you mmap a file and write a byte into the mapped region, but beyond the end of the file, you get a signal; in handling the signal, you can extend the file. You also have

1. Made an error, as documented in the mmap man page,

2. Done something that works on JFS,

3. In fact, done what JFS does itself under the covers, and

4. Made a huge, huge mistake.

If you did this on a NFS filesystem, it deadlocks the filesystem. If you are using an automounter, it turns into a tar-baby: IBM's automountd was single threaded. Any attempt to touch that filesystem deadlocks. Any attempt to mount a new filesystem blocks waiting on automountd. The only solution was to reboot the system.

This is a bit of a problem if you have a database implementation class all trying this at the same time.

I reported this to IBM and their response was, "it's documented as not working, whaddya want?" I considered releasing it on BUGTRAQ, but I didn't. The AIX 4 multithready rewrite fixed it, BTW.

NFS deadlocks are not unique to AIX. I've seen NFS in the state where any attempt to touch it locks up the process doing it (in the state where it can't even be killed - yes, the 'D' state) and if you need something from that process, good luck - reboot is pretty much the only way out. Nowadays, I think, NFS drivers in Linux have become a bit saner, but before that, if you got D's on NFS, you were out of luck.

Wow! Glad I missed that one!

Yeah. It's sometimes hard to remember just How Bad software was even in living memory. These days, we all just assume that the basics all work and are surprised to see bugs, which we vote up to the top of HN. But it wasn't always like that, things just failed in crazy ways, at all levels of the stack.

At the start of the dotcom boom, a server uptime measured in months was considered notable, and the idea of a client system lasting a week between some failure or voodoo reboot was laughable. Now... I dunno, when was the last time I bounced my phone? Genuinely can't remember.

To me in the 90s uptime was mostly limited by the fact that every month or so there was a new Linux kernel with a feature, driver, or fix that actually mattered to me. And I'd do `make oldconfig` to answer only new questions.

Now, it just works... mostly. systemd came in and made it like it's the 90s all over again with just plain basic things no longer working. It's pretty much stabilized in my experience in the last year, though, but it took years. The latest disaster in sound systems was well into the '10s, and actually not quite "just works" yet.

As for phone, I bounce it about once a month when the bluetooth stack just stops working.

And that's all for Linux. *BSD tends to go "Yeah, we don't do that here" about POSIX, so it can be unpredictable. And Solaris is just madness.

So no, I wouldn't say we're quite into "basic stuff works" just yet.

Yep, this is part of the reason why it drives me crazy when people say the '90s were the heyday of computing. Computers were awful back then.

I have a whole lightning talk about this, called "Why Software Sucks". The premise is that we are always dealing with the boundary between "barely works" and "doesn't work". Adding more stability, more features, etc just changes where that boundary is. After 25+ years in this industry, it sucks as bad as it ever did. Except now instead of a buggy OS, I'm dealing with, say, bugs in Openshift. Something like Openshift was almost unimaginable back when I started. Kinda sucks now.

And the improvements are less than you might think. On that same job, we were managing a couple dozen servers in two different data centers, and 200+ client machines, all remotely, able to deploy different versions of the software to specific machines. It was like "cloud", except we did it all without so much as ssh. We had a configuration management database in Sybase, and rsh/perl scripts, along with SMIT commands (I miss SMIT).

I think they still have a point. It was the heyday of computing precisely because they were awful; there was always something to work on or improve. Fixing bugs in operating systems offered clear improvements to the livelihoods of thousands or millions.

It's no longer that easy now. Most critical things don't suck that hard anymore, so it's hard to find a way to feel special, or feel like it's a special time to be a part of.

Even in startup land most work done is "build this website" or "build this app" - what the software does could be cool and easy enough to get passionate about, but the hope that "we're going to use this code to revolutionize the entire world for the better" that was omnipresent in '90s hacker culture has pretty much been murdered by the sins of Facebook and others in Big Tech. Now it's "what can I do to make sure the stuff I work on hurts as few people as possible?"

On the flip side, while OpenSSH may be great and high quality now, and wasn't in the '90s, in the '90s people used telnet on the public Internet, and telnet worked.[1]

IOW: It wasn't basic tech and great opportunity at the same time.

Today you can work on immature things like OpenBSD's MPLS implementation and who knows, maybe in 20 years people will say that back in 2019 there were so many things that could be improved about that basic tech! (cue replies from anti-labelswitching dweebs)

[1] Not me, I used ssh since before OpenSSH.

Telnet and rlogin were both very common. I remember installing ssh on my Slackware Linux box back in 1996. I had to build it from source... fun times!

On the other hand, there are some things that are decidedly worse today than on some old machines. For example keyboard input latency on modern computers can be horrendous, and the physical keyboard hardware is utter garbage compared to what was available in the 80s.

For a single-purpose writing machine (and as long as you don’t mind sitting at a desk), my parents’ Mac SE from 1989 running a version of MS Word from the early 1990s is IMO significantly more usable than a top-of-the-line 2019 computer running the latest version of MS Word.

Given the amount of research and implementation effort/investment that has gone into improving computers over the past few decades, the state of user-facing software (including browsers and web apps) for the average person today is in my opinion shamefully bad.

> my parents’ Mac SE from 1989 running a version of MS Word from the early 1990s is IMO significantly more usable than a top-of-the-line 2019 computer

This is a bit spun. I can't tell you what to like, but it's easy to romanticize older machines and forget what life was like with a 512x342 monochrome 60 Hz CRT, no backbuffer, and a CPU that took multiple frames to do something as simple as scroll.

I actually have a still-working Mac Plus I like to show off to friends. And yeah -- it was a great writing machine and in lots of ways current MS Office is a disaster of complexity. It was elegant in a historic way, but.. no, it's not "better" by the definition used by pretty much any working writer.

My father continued to use it for the majority of his work until at least 2005, and the computer still boots and works just fine today. He had far more frequent problems with a succession of Windows laptops in the subsequent 15 years than from the 15 years of using the Mac SE.

Having a small monochrome screen is not an insurmountable problem in practice for people used to doing most of their work by spreading physical paper around on the floor. Yes it could be better if it took advantage of 30 years of CPU, etc. improvements, but it’s nonetheless a fantastic tool.

Your “working writers” probably also need their computers for web browsing or whatever. And probably waste a ton of time on computer-related or computer-aided bullshit.

My favorite “how did this go unnoticed” bug in that era was in Windows 95 (so … target rich) when working on an IDE where the File Open dialog showed the C: drive as empty, but only on one PC at a single customer site and one test box in our QA lab – every other available system worked, as did all of the Windows NT systems. Someone wrote a test program and confirmed that it was immediately exiting with the normal no-more-entries status, exactly as in the documentation except for the lack of entries.

I had a lucky guess remembering that the test box wasn’t on the LAN and discovered that the problem depended on whether you’d enabled network sharing. You didn’t need to have shared anything or be using a share – simply having it enabled meant that a file system filter was installed and the local hard drive magically started returning results.

It turns out that there were multiple APIs which did this and apparently the original developer had unluckily picked the less common one. Switching was easy and had no apparent drawbacks. The best theory we had was that Microsoft’s QA group disliked copying builds around on floppies as much as we did and didn’t have many completely offline systems around.

> One of the first rules in the marvelous book The Pragmatic Programmer is "Select() isn't broken". Yeah, but sometimes it is.

poll() was broken on early versions of Mac OS X for quite a while[1] :-). And I thought I recalled select() being broken too, but might be mistaken. Looks like some reports of kqueue() being broken too.[2][3]

[1]: https://daniel.haxx.se/blog/2016/10/11/poll-on-mac-10-12-is-...

[2]: http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#OS_X_AN...

[3]: https://discussions.apple.com/thread/4783301?tstart=0

I had a similarly maddening issue (circa 2014) once where small chunks of memory would randomly get overwritten on PV-virtualized AWS instances, but not on HVM-virtualized instances. We spent an enormous amount of time chasing this problem; the most senior engineer on the team wrote a test case that would have to write 100TB of data before reliably reproducing the issue.

We never did figure out what caused it. Eventually we migrated all our instances to be HVM-only, and the problem went away.

>>> One of the first rules in the marvelous book The Pragmatic Programmer is "Select() isn't broken". Yeah, but sometimes it is.

How come you give this example just now? This can't be a coincidence. select() just caused us a major production outage.

FYI: select() is broken on pretty much all Linux kernels up to very recent ones. It doesn't work once file descriptor numbers reach 1024 (FD_SETSIZE), which isn't a lot for a server app. Side effects include crashing the application.

>"FYI: select() is broken on pretty much all Linux kernels up to very recent ones."

Can you elaborate or might you have some links? What was the cause and resolution?

See first paragraph in section BUGS. http://man7.org/linux/man-pages/man2/select.2.html

If you google for select 1024 file descriptors, you will find a ton of issues affecting a ton of libraries. The correction seems to be using poll(), but in our case was to remove the routine that wasn't needed anymore.

No, it's the least broken event notification mechanism on all systems. And it doesn't have a 1024 FDs limit on most systems either, as long as it is used directly via syscall or through an event loop library that does that.

But in general for all event notification mechanism on all systems, it's incorrect to assume that reported FDs are actually ready, because that could cause a busy waiting loop, and incorrect to assume that unreported FDs are not ready, because that could cause timeouts.

Was your problem something other than the FD_SETSIZE limitation of select?

> One of the first rules in the marvelous book The Pragmatic Programmer is "Select() isn't broken". Yeah, but sometimes it is. And if a junior programmer came to me today reporting a bug in OS memory management, my first response would be "What's wrong with your code, really?"

In my years of software development I had the fun but unfortunate experience of coming across not one, but two compiler bugs. The program would segfault on a seemingly innocuous line, but after looking a the assembly, it was clear the compiler was doing something screwy. These types of bugs are exceedingly rare, and they can be maddening to debug, but when you catch them you get a nice sense of satisfaction.

this is an example of where experience matters. kids these days don’t get to experience deep bugs like this. and they are the worse for it.

Even without NFS, using mmap requires being real careful about signals - SIGBUS can be raised any time the underlying file operation fails, including because someone else truncated the file, or because the underlying storage had an error (disk error, removed media, network storage). And, as this post so eloquently illustrates (and through my personal experience), handling SIGBUS/SIGSEGV cleanly in a multithreaded program on POSIX is incredibly painful.

Honestly, pread is just a much better solution for 90% of use cases, and it works for large files on 32-bit systems (mmap does not!). If you're doing largely sequential things, fread/fseek often work remarkably well as they handle all the caching for you.

mmap tends to shine performance-wise if you need random access to a file but access certain parts of the file frequently (for example, accessing the index in a header + contents of the file), because the page cache is literally designed for this type of usage. But the performance improvement is rarely worth the technical complexity.

Alternatively one can run a separate process that does mmap and runs the calculations or whatever needs to access the file as quickly as possible, and do the straightforward recovery in the parent process when the child process dies. The drawback is the need for some form of RPC, but there are a lot of libraries to do that without much hassle.

If you are already copying things around (between address spaces even), you might as well just use pread or lseek+read, which at that point is likely a much better choice overall.

Couldn’t you mmap a shared, read-only page so that your OS doesn’t copy it?

It would be better to mmap the file into N single-threaded processes instead of mmapping the file into 1 N-threaded process. This is exactly why signals are process-global instead of thread-local. The intent is that they were used for inter-process communication. The whole notion of processes and signals predates the notion of a thread, and threads are essentially a performance optimization for 30 or 40 year old hardware. If you use single-threaded processes then all of the processes can MAP_SHARED map the file and then they all get access to the file contents as each process is updating the file. And they all also have their own address space and can crash separately from each other.

That's because mmap() predates multithreading (at least as a usable production approach rather than a lab experiment). It's best for IPC, not sharing between threads.

I forget sometimes that the young'ns never grew up in a world without threads, and often never learned how unix really works.

>This is exactly why signals are process-global instead of thread-local.

Signals can be sent either to a process or to a specific thread under linux. Signals sent to the process are handled by an arbitrary thread. There's a bunch of gotchas and whatnot past that.

Mixing signals and threads is pretty much trash.

> Mixing signals and threads is pretty much trash.

Agreed, except for SIGSEGV, SIGBUS, and SIGPIPE, which are guaranteed to be delivered, if at all, to the thread which triggered the condition. (Unless, of course, they were raised explicitly.) Still trash in the sense that to do it right you're depending on installing global signal handlers and (very likely) arranging for thread-local storage (actual TLS isn't even async-signal safe; see my post elsethread regarding the sigaltstack trick), so these tricks are really only practical for the core application and not something you can stash in a library to perform transparently.

FWIW, none of this behavior is specific to Linux. Linux follows POSIX rather well in this regard.

I guess if you are using RPC you could forget about NFS and just run your "editor server" on the server where the file is actually stored :-)

In many cases, the server where the file is actually stored will not allow you to execute random code.

You can do that with a MAP_ANONYMOUS | MAP_SHARED mapping too: that kind of mapping is writable by both parent and child, but isn't backed by a disk file and so can't be truncated or surprise-removed. The article's points about mmap infelicity applies mostly to mappings of disk files. Anonymous mappings don't have the same problems.

Anonymous mappings are backed by swap and may be overcommitted, it's still possible to catch signals in a wide variety of circumstances

There is probably enough evidence in this thread to use it as a reference for why typical apps should avoid mmap whenever possible -- it's clear almost nobody fully understands it

> Anonymous mappings are backed by swap and may be overcommitted, it's still possible to catch signals in a wide variety of circumstances

Anonymous mappings won't cause signals, they'll trigger the OOM killer. Remember that malloc() is just a fancy wrapper for mmap() (and sbrk()).

If a process was swapped out and a fault fails to bring a page back due to an IO error, you can at least catch (I think) SIGBUS. But this just reinforces the point: nobody really understands virtual memory, even people like us that think they do

So should we extend your conclusion above to the following?

"There is probably enough evidence in this thread to use it as a reference for why typical apps should avoid virtual memory whenever possible -- it's clear almost nobody fully understands it"

I'd suggest that is ludicrous, and for the same reason your original conclusion is also excessive.

It is not constructive to form a sweeping generalization from a statement and then claim the sweeping generalization is ludicrous, implying the original statement is ludicrous. :)

My first comment was in reply to one claiming anonymous memory did not have the same problems as file-backed memory, indicating the parent did not understand they are the same thing. The subsequent reply was to another comment continuing to claim anonymous memory was somehow safer, both instances supporting the notion that most people in this thread don't seem to understand mmap at all.

What we're examining is a powerful (and consequently hazardous) OS feature that often provides only marginal performance improvement, yet introduces many exotic error paths into a program that have their own exotic problems (memory access in thread A can raise SEGV in thread B, async-signal safety), that 7 hours' commenting has not been sufficient to fully capture. This thread is full of upvoted miscomprehension, bad advice (spawn a child to deal with SEGV!?), obviously incorrect solutions (signalfd), and yet still manages to completely omit some critical characteristics of mmap, for example, that faults take a VM-global semaphore -- mmap can easily destroy multithreaded app performance in a way read() is immune to, because nobody expects file IO in one thread to cause malloc() latency in another.

If this isn't evidence for "avoid this feature wherever possible", I really don't know what is.

As if "avoid virtual memory" is substantially more of a "sweeping generalization" than "avoid mmap."

If you apply the same reasoning that you've used to conclude that everyone should avoid mmap, than you are led directly to the conclusion that everyone should avoid virtual memory.

The "same problems" that you are pointing out are possible with anonymous memory aren't unique to memory you get directly from mmap, they also apply to all memory, period. mmap'ed anonymous memory might have the same problems as file-backed memory, but those are the same problems that .text and .bss have.

The "powerful (and consequently hazardous) OS feature" here isn't mmap, it's literally virtual memory. At the moment you concede that memory may be backed by something besides physical memory at any point in time, you get the possibility of all those "exotic error paths."

This is spot on.

The exotic error paths are always there, but you don't always need to handle them. You can push some error handling to other systems, such as clients and supervisors / orchestrators. The reason why mmap with files is tractable is because we have a good error handling strategy (remap with zeroes, mark error) and we have a few understandable reasons why we might expect the error (IO errors, lost media/network failures, or even truncate). In general the problem that IO is done outside of direct syscalls like write() can be difficult even when you're not using mmap, like when Postgres was losing errors when calling fsync.

But when you have an IO error in your swap file, go ahead, eat the SIGBUS and die. This is fine.

Let's try and simplify things here: the post we're both currently commenting on relates to doing file IO via mmap. Of course 'avoiding all use of virtual memory' is ridiculous, but nobody except you is suggesting that, and you continue to suggest it even after a long reply.

The "powerful (and consequently hazardous) OS feature" here is using mmap for general file IO, it:

- introduces resources leaks many developers can't profile

- introduces VM bottlenecks 99% of developers can't profile

- introduces random segfaults delivered to arbitrary threads in the running process, leading to crashes many developers can't diagnose

- the mmap() interface itself is fundamentally unsafe in that it allows partially overwriting random bits of VM (MAP_FIXED) with file views, and worse still, allows those mappings to be read-write

Once again, nobody has ever suggested avoiding virtual memory except you -- once again, that is impossible in a modern environment, but it is more than possible, and ultimately incredibly sensible, not to mention entirely on topic with regards to this thread and the article it is attached to, to suggest avoiding use of mmap for general file IO

- What resource leaks are introduced by mmap?

- Nobody ever said mmap was always faster than the alternative. If you care about performance then you should do whole-application performance testing with and without features enabled (like mmap IO). This is not unusual, there are plenty of aspects of performance that are counter-intuitive, where speeding up one part of your program causes a seemingly unrelated part of your program to slow down.

- The signals are SIGBUS, and they can be intercepted, mapped with zeroes, and the errors can be propagated back to your app code later. This is not trivial but neither is it outrageous.

- You can overwrite arbitrary memory with read(), too, you just have to pass it a pointer to something you want to overwrite. mmap() is not any less safe. Recall that typical use of MAP_FIXED is so you can overwrite an existing mmap() region with something else, not so you can nuke random parts of your address space.

Keep in mind that you are, if nothing else, an indirect user of mmap(). The question is whether using mmap() directly is advantageous for your application. "Yes" is not an unreasonable nor outrageous answer for some applications.

> here is using mmap for general file IO

> nobody has ever suggested avoiding virtual memory except you

Do you know what the "anonymous" in "anonymous mapping" means? You are the one that started asserting that anonymous mapping from mmap have the same difficulties as mapping of normal files and therefore too dangerous to use.


I hate to do this, but:

> paragraph, n.: a distinct section of a piece of writing, usually dealing with a single theme and indicated by a new line, indentation, or numbering

In the original comment you will find two of these, the former correcting an error in the parent comment, the latter making an observation based on the obvious brainwrong riddled throughout this thread

I'm done replying, you're of course free to continue checking in hazardous and suspect file IO code, as the rest of us are free to giggle at such things before ripping them out

> nobody except you is suggesting [avoiding all use of virtual memory]


> Anonymous mappings

> typical apps should avoid mmap whenever possible

All virtual memory is either a user-mode wrapper around mmap, or sbrk (which is functionally a kernel-mode wrapper around mmap).

This simply won't die, will it? I mean, while we're at it, let's advocate abandonment of all higher level languages because essentially they all boil down to machine code, and nobody could recommend working directly with machine code any more, could they.

(But that would be a sweeping generalization)

> This simply won't die, will it?

Please clarify how you think that statement is incorrect. As far as I'm concerned, it won't die because that's how virtual memory works, and you have something going on in your head that is either wrong or massive hairsplitting.

I would have expected a better understanding from someone bloviating about how the rest of the commenters are too thick to understand mmap and virtual memory.

Okay, but I'd say the correct thing to do is let SIGBUS kill the process. You can at least expect to handle SIGBUS for a file you've mapped.

> Anonymous mappings are backed by swap and may be overcommitted,

So is normal memory. Many allocators today even use mmap internally.

All* allocation is mmap, really. All mmap does is dedicate a region of address space to some kind of backing storage. The particular kind of backing storage makes all the difference. The problem is that people colloquially use "mmap" to mean "mmap of a conventional disk file" and don't mean all the other kinds of mmap out there, so discussions can become confusing.

* There's sbrk too, but it's just a fancy legacy path that amounts to the same thing as anonymous mmap

The article points out the infelicity as applied to mappings of files into multi-threaded programmes. What fpoling suggests is a solution to the POSIX signals problem where anonymous mmaps are not an option.

Yes, but if the file is on local disk its exactly the same situation as the page file.

OP seems to be trying to cope with NFS errors....but if you are using mmap on NFS you have bigger problems...

The mistake here is using longjmp / siglongjmp. This is a possible way to handle SIGBUS, but in practice it will be intractable in larger programs written in C or C++. The compiler is generally free to move loads and stores around, and you might be completely blindsided by how the compiler has reordered your memory operations once you add side effects to one of the operations. Theoretically, if accessing a memory location can siglongjmp out then at least the memory location should be volatile.

A better way to handle SIGBUS is to just map zeroes over the offending pages using MAP_FIXED and then setting a flag. After every operation that works on a file, you check the flag.

This sounds interesting. Do you know of an example that does this?

So if I have multiple threads reading the same mmap'd file, I use si_addr in the signal handler to know which page to call MAP_FIXED on?

I don’t know an open-source example off the top of my head. But that’s the gist of it… you keep an array somewhere with all the address ranges mapped. If you get SIGBUS, find the corresponding map, replace it with zeroes, and mark it as having an error. If there is no corresponding map, uninstall the signal handler and return—the thread will SIGBUS immediately after return and the default action will kill the process.

I solved the problem this way recently as well. This solution would also work with multiple threads (you can probably set the flag in thread local storage, though I haven't done that yet - currently using a global map with thread id as key). Though still not with multiple different signal handlers.

That said, if I could do it over I wouldn't use mmap again. Especially since io_uring is around the corner (on Linux) that allows zero-copy reads/writes with no syscall overhead.

Thread-local storage is not async-signal safe. Some systems (including glibc!) will lazily initialize TLS slots on the first access, which would result in a call to malloc and/or locking of a mutex, neither of which are async-signal safe. Some slots may be initialized at link time but others lazily, depending on how and when the code was compiled and loaded, so just because it seems to work for you doesn't mean the behavior won't change.

The safest (and also portable) way I've found to set up per-thread async-signal-safe local storage is to install an alternate signal stack for each thread using sigaltstack. Allocate a larger buffer than needed (and reported) for the stack and use the remaining memory for storage and guard pages. For example, allocate (PAGE_SIZE * 3) + roundup(MINSIGSTKSZ, PAGE_SIZE): two guard pages, one page for your local storage, and the remainder for the stack.

Honestly, this reads like a thorough indictment of signals in user space.

* Signal handlers are process-global

* Signal handlers need to be re-entrant safe

Re-entrancy is painful but can be done, but process-global signal handlers means that pulling in a totally unrelated library can break your code. Moreover, it makes the combined use of certain libraries straight-up impossible. Similarly, it means that the use of library precludes you from using certain features.

Combined, this means that the use of signal handlers is simply toxic. Which makes them an anti-feature. It feels to me like having the restrictions of kernel-space, with all of the downsides of user-space.

Are there any plans in linux to replace signals?

The real issue is that people are using threads for things that processes were originally intended for. Having one thread for the UI, one thread for network code, one thread for program logic, etc. is actually a mis-use of threads in POSIX-land. POSIX is designed for you to use multiple processes where in Windows you would use multiple threads. This gets you a lot of robust interprocess communication mechanisms and also memory isolation on systems with an MMU. It also had performance implications on the state of the art hardware at the time, which is why threads were invented. And finally it requires you to expand your usage of the operating system.

Being multi-process would solve the issues with a process-global signal handler because there would no longer be a question of which thread generated the signal.

This seems like an ivory-tower argument: people shouldn't want shared mutable state, so anyone suffering is doing so at their own hand.

At the same time, shared mutable state, specifically a fully shared and transparent (ignoring caches) address space is the least effort way to take advantage of multi-core CPUs. Hence, people will be using it. There is essentially no getting around that, and there are even some reasons for wanting it.

Being multi-process would solve a lot of issues, but so would settling on a single convention for endianness, rewriting C++ to use unique_ and shared_ptr where applicable, and many other nice to haves.

At the end of the day, the easiest road is going to be taken more often. If that road happens to be lined with bandits and barely visible traps, that is a problem. No matter how often you tell people not to take that road and to climb a slippery mountain path instead.

It's much more feasible to fix the signals API to work well in a multithreaded world than to roll back the clock to the 1980s and get people to stop using threads.

There's nothing wrong with signals per se. The problem is the signals API. Why do signals handlers need to be process-global? Why do we need to live with global signal handler registration clobbering any previous registration? We can change these things!

These problems are all fixable without "replac[ing] signals" as the mechanism. Ultimately, as long as processors have traps (which they will, as long as we have virtual memory) and as long as you want to give userspace to do something in response to these traps other than immediately die (an ability that's tremendously useful) you need some kind of stop-the-thread-that-trapped-and-call-a-callback mechanism, and whatever shape that mechanism takes, it's going to end up looking at least somewhat like signals.

Instead of just saying "signals are awful" and burying our heads in the sand, we should talk about what a better signals API should look like. I've already written a detailed proposal that I've linked elsewhere.

The real problem here is that the glibc people are completely uninterested in actually improving the signals API. Instead, they've taken the radical, unhelpful, and unrealistic stance that nobody should be using signals. As long as they think that way, people will keep using sigaction(2), and the world will remain in a half-broken and awful state.

For others: [1] is the proposal being referred to, suggested in HN comment [2]

[1] https://www.facebook.com/notes/daniel-colascione/toward-shar... [2] https://news.ycombinator.com/item?id=19806448

Normal libraries should never register signal handlers. Google Breakpad is a crash-reporting system and as such requires signal handling to function.

Also since we're on the topic, here's a nice vulnerability caused by bad signal handling: https://news.ycombinator.com/item?id=16753013

Pretty sure there are no plans to replace signals, but maybe there are libraries that make signal handling easier?

> Normal libraries should never register signal handlers.

That statement alone should be enough to re-examine a feature. Code composition is a rather nice thing to have, making that very complicated is not a good thing.

> Pretty sure there are no plans to replace signals, but maybe there are libraries that make signal handling easier?

Non-portable, but sigprocmask() SIG_BLOCK plus signalfd() (Linux) or kqueue() EVFILT_SIGNAL (BSDs). Neither is a good solution for handling mmap SIGBUSes, but they're generally good for handling most signals (USRn, TERM, HUP, CHLD, etc.) more similarly to other kinds of events.

signalfd does NOT work the same as kqueue EVFILT_SIGNAL. On Linux signalfd basically behaves the same as installing a signal handler--only a single signalfd object will receive a signal, so whether installing a signal handler or a signalfd listener you're changing behavior globally. It's even messier than that, because signalfd only works if there are no other signal handlers installed.

By contrast, with kqueue all EVFILT_SIGNAL listeners will be notified of a signal, even if the signal was also delivered to a handler. Big difference.

This is one among many reasons why Linux's event syscalls are widely considered inferior to BSD kqueue. It's ridiculous considering that kqueue not only predated epoll by several years and signalfd by nearly a decade, with ample use cases, it was well documented in a paper that described the rationale for all the semantics. Somehow the authors of epoll, signalfd, etc. couldn't even be bothered to do basic research; they just spitballed semantics without having much real-world experience about what was needed and most useful.

Signals are fundamental to the unix design. That'd be like "I want a car, but without any wheels, because wheels are often implicated in crashes".

Depending on OS internals within threading is a problem. Libraries that implement their own signal handling are a problem. Trying to get thread-local signal handling behavior when signals are process-global is a problem.

Threads are not a wholesale substitute for processes. Signals work just fine, as long as you plan for ways they can affect your code's behavior.

A feature that

1. cannot be used if we compose two systems that use it,
2. precludes using other mainstream features, and
3. is an essential part of the interface between userspace and kernelspace

is a bad feature. I am not saying we should drop the idea of allowing userspace to respond to things the kernel raises. I am saying we need a way that will 1. allow program composition and 2. play nice with threads.

I am not a unix/linux expert, but those requirements seem like they could be met.

Signals play nice with threads.

It's sloppy thread programming that doesn't play nice with signals.

Consider the following scenario:

Two threads both want to mmap a file / catch SIGSEGV for different pieces of code. Moreover, these threads come from different modules of a system maintained by different people. All of a sudden, these modules become coupled because we need some system for delegation of signal handlers between them. Or, we need to introduce a custom signal-handler module to deal with our delegation.

I would not call that 'playing nice with threads' nor would I call that sloppy thread programming.

You're really just talking about poor practice in writing signal handlers. The signal API always tells you the previous value of a handler. Your own handlers should remember the previous value and call it if your own handler doesn't fully handle a signal it receives.

Previously: https://news.ycombinator.com/item?id=11899385

"I found a paragraph in this article http://www.linusakesson.net/programming/tty/ very apt at describing what Unix signals are like:

  In *The Hitchhiker's Guide to the Galaxy*, Douglas Adams 
  mentions an extremely dull planet, inhabited by a bunch of 
  depressed humans and a certain breed of animals with sharp
  teeth which communicate with the humans by biting them very
  hard in the thighs. This is strikingly similar to UNIX, in
  which the kernel communicates with processes by sending
  paralyzing or deadly signals to them."

Really cool talk by Bryan Cantrill about the joys of mmap in a "simple" use-case: https://m.youtube.com/watch?v=vm1GJMp0QN4#

Such a great talk - thanks a lot! Here is the direct link as it is the last of the lightning talks: https://www.youtube.com/watch?time_continue=2462&v=vm1GJMp0Q...

Things are so much better if you are not writing apps for general public (mine are trading-related). You can tell your few clients — make sure that the access to mmapped file is exclusive — and get away with it.

And yes, mmap is the awesomest thing out there.

I’ve moved away from mmap for trading in favor of a separate write thread.

While mmap is fast, the combination of factors that can make it decide to stall your thread while it commits to disk is difficult to manage from an operational standpoint. A slight misconfiguration is all it takes to introduce a rare and hard to notice multi-millisecond delay.

Whereas with a spinlocked sized-reserved vector, the fail state performance is however long it takes to allocate more space which is on the level of microseconds. You do pay 50-150ns for that spinlock though.

Or better yet, mmap in a read-write-thread?

Ya sure, go nuts on the other thread. It’s not critical to trading. I normally fprintf all sorts of time stamps, log level info, and handle formatting over there.

Just make sure you put it on another core to protect your cache.

But if you want mmap, you do you.

Writable mappings are problematic though, because of the unpredictable pauses caused by the TLB shutdowns required to maintain coherence of the dirty flag.

Author here, if anyone has any questions in relation to me or Sublime HQ please feel free to ask.

Is there a post where it's covered why Sublime Merge implements things like packfile reading on its own, rather than using git's own plumbing? E.g. in this case presumably keeping a "git cat-file --batch" process around would do the trick.

I contribute to git.git, and it would be interesting to know if there's inherent issues stopping you from doing that, or if it's implementation problems in some cases (e.g. missing plumbing commands or features). There's definitely interest from upstream in reviewing patches / helping if there's missing or inadequate plumbing.

You'd get upstream features for free as they come along. E.g. presumably you haven't implemented the new MIDX format, but that speeds up pack file access by a lot for some use-cases, and presumably the boring bits of low-level git operations aren't much of a selling point in and of themselves.

Aside from whether you'd use "git" itself, such a trick of using a slave process you'd talk to over IPC of some sort would cover some of the issues you wrote about, e.g. the issue of sharing global state with libraries like Breakpad.

We do defer to Git for all write operations, but for reading, we do it ourselves partly for efficiency, and partly to get the right data.

In terms of getting the right data, one example is that we need to know the full set of non-ignored sub-directories in the working directory, so we can watch them for changes. It's easy enough to generate this ourselves as we calculate the status output, but I don't believe that git will emit it.

In terms of performance, we rely on being able to read objects efficiently. For example, to show a commit, we can't just use the output of "git diff", as we need the full file contents to be able to calculate syntax highlighting correctly. You could go a long way with "git cat-file --batch", but there are plenty of contexts where you can't practically batch requests, and process creation costs + the lack of caching across requests (which can be quite significant due to the delta encoding of objects) would be quite significant.

Thanks. That makes perfect sense, I'd be very surprised if all the current plumbing was able to serve your current use-cases.

There's going to be cases where it sucks, e.g. what you point out with wanting both raw blobs and their diffs, you'd need to do that in two plumbing commands now.

But just on that example: Having poked at some of the diff code recently I can tell you there's no big technical hurdle to just exposing that sort of thing. I.e. spewing out machine-readable raw blobs and their diffs, it just happens not to be exposed now.

I think what a program like Sublime Merge would want/need short of C API access (which is unlikely to happen) is a git version of an open-ended "plumbing" IPC protocol of the sort that Common Lisp VMs tend to expose. I.e. being able to have one (or few) "git command-server" processes spawned, and ask them questions like "look up this blob" or "diff these two blobs" (where the previous blob lookup would be cached).

Obviously patching/coordinating/upstreaming those sorts of changes is going to take work, but so is duplicating and keeping up-to-date with the diff, pack, status etc. code.

I'm not trying to tell you what to do, just saying that the git project is definitely friendly to "we're a commercial product and need this missing plumbing for our editor" (unlike say, GCC).

The plumbing that's there now is mostly in the state it's in because it's what git itself needed in the past when it was more of a collection of shellscripts, as well as being biased towards what git server operators like GitHub needed (because they sent more patches), which is why plumbing for say batch blob operations tends to be better than the one for "status".

In any case it would be very interesting to have some post about the sort of read-only operations Sublime Merge is doing with its own custom git code.

The license of git (GPL2) might be an issue for a commercial product. libgit2 is also GPL.

(Also, IPC and fork+exec has overhead that mmap or thread in the same program does not.)

There are no license issues with bundling the git command-line tools in a commercial product, there's multiple existing proprietary commercial applications built on top of git that do so.

The libgit2 code is GPL with a linking exception, so you can use it (unlike "git" itself) as a C library in a proprietary commercial product.

> IPC and fork+exec has overhead[...]

The "git cat-file --batch" command is something you'd invoke once, and then as your program runs you keep feeding it SHA-1s on stdin and it spews out their content on stdout. So even on Windows the overhead of that should be fine.

It's clear from ben-schaaf's other comments (which I read later) that one concern was the simplicity of downstream APIs being able to read the data using a normal C variable.

But that just leaves more questions. People in this thread are mentioning pack files, assuming that a multi-GB "git object" must be in a pack, but I notice the original post doesn't say anything about it.

If they're reading packs with this they'll need to parse it, resolve deltas etc. So likely the code that deals with the mmap()'d variable is small in any sane codebase (they're surely not doing delta resolution repeatedly all over the place...).

If they're very large loose objects those will most likely be zlib compressed, so wouldn't this need to go through some intermediary API layer anyway? I guess if SM itself is adding them it could add them uncompressed.

Since ben-schaaf mentioned this not being about performance, but about saving memory I thought this might be something like wanting to extract a small part of a 1GB object from git for display. That seems like a thing an editor might want to do.

In that case "git cat-file --batch" would suck, but not for some intrinsic reason. An API could be added that could take the start/end of an object to print out.

Certainly some commercial products can make GPL2 components work, but in general, it is a barrier to commercial adoption. I don't think you can totally dismiss GPL licensing concerns as "there aren't any."

Oh yeah, there's definitely concerns in general. It's not as easy for shippers of proprietary software as say the BSD license, you've got to keep a clear interface separation between your proprietary code and the GPL code etc.

You might also have lawyers who are lazy about it and don't want to deal with the liability, "we heard Apple banned GPL code..." or "the FSF sued Cisco...".

But there's no license reason for why you can't use that GPL code in some way, and everyone from Google with Android to Oracle with Oracle Linux and their DB bundles GPL code that's directly used by some accompanying proprietary piece of software.

But what I was more going for is that there's also a non-legal aspects to it that go beyond the license, which is that some maintainers of free software are actively hostile to their software being used as a smaller component in some proprietary product.

The GCC project is probably the most famous example of this, I think this has changed somewhat in recent years with LLVM+Clang, but they used to jealously guard things like their AST format. So e.g. someone with a proprietary editor (or Emacs for that matter...) could never hope to use GCC for spewing out parsing information for some C code.

I think it's fair to say that the Git project isn't like that. If someone maintaining proprietary software needs some plumbing interface to hook their stuff up and is willing to submit patches it'll be received as well as any other change (subject to review, maintenance & backwards-compatibility concerns etc.). If they find it useful it's likely that other people will too...

Small side note for completeness:

The effects of oh-noes-my-file-is-gone can be somewhat mitigated by using the heuristics built into NSData (instead of using mmap directly).

For example, you call NSData’s `dataWithContentsOfFile:options:error:` with the `NSDataReadingMappedIfSafe` option [1]. The framework will then transparently mmap the file unless it believes there’s an elevated risk of the file going away.

Apple doesn’t disclose how NSData exactly makes that decision; however, I’ve found a few reports that say it uses mmap internally when the file is on the root filesystem, and falls back on an in-memory copy otherwise.

It’s a rather dumb heuristic though, and may not solve the issue entirely.

[1] https://developer.apple.com/documentation/foundation/nsdatar...

Wouldn't that mean that networked files would be loaded entirely into RAM, negating the whole reason we started using mmap in the first place?

The method also checks for things like whether the file is in /dev and whether filesystem compression is enabled.

Did you consider emulating mmap yourselves?

  "Memory mapped files work by mapping the full file into a virtual address space and then using page faults to determine which chunks to load into physical memory. In essence it allows you to access the file as if you had read the whole thing into memory, without actually doing so."
I feel like this could be done in c++ directly, by maintaining an internal cache for each file that keeps track of which parts of the file are loaded and uses read() to load chunks on demand. Error handling would be a lot simpler (no signals, just a failed read()) and there would be less OS-specific code.

This is essentially how databases like PostgreSQL work, but in essence it only avoids the sys-call overhead. The OS is already caching the file, regardless of mmap, so using pread would have likely been enough for us.

It totally would have been simpler overall, but each incremental step we made was significantly less work than the refactoring required for pread.

> The OS is already caching the file

Not necessarily. With O_DIRECT, pread() doesn't put pages into page cache: it just DMAs them directly into your process. Using O_DIRECT and the process-private caching we've been discussing, sophisticated programs (like databases) can (and do!) implement their own "page cache" systems. And because databases have access pattern information that the generic kernel VM subsystem doesn't, such a database can frequently do a better job doing this caching on its own.

I might have undersold the performance advantage of writing your own cache, but let me reiterate the point I was trying to make: The reason we didn't consider doing so was because we weren't having a performance issue. Writing our own cache would be strictly more work than just using pread and accomplished the same thing.

Yeah. For your application, you did the right thing. I was speaking more abstractly.

It totally would have been simpler overall, but each incremental step we made was significantly less work than the refactoring required for pread.


In 10 years will you be saying this about the next incremental problem that you run into? If you think this likely, then the next incremental problem is an excuse to do it right.

If it's less work to solve that problem than to refactor all the related code, and the impact on maintainability is minimal, likely yes. But considering the amount of users we have and the current lack of any crashes relating to mmap, there are unlikely to be any future unforeseen issues.

Mmap is right, though. Pread would also be right. There's a tradeoff and the complexity argument would only win if they knew all this when they started.

Well, then you have to implement some kind of plan for efficient caching - some kind of LRU scheme, for example, to prevent the cache from ballooning to unusable sizes - at which point you're reinventing the kernel page cache (poorly). mmap does have a big advantage here if you really need a lot of random accesses.

It’s easy enough to read a file in chunks, parsing out the information as you go. This limits memory use as long as you release the chunks when you no longer need them. The operating system can swap out memory as-needed, even if you didn’t get the memory from mmap, so it’s irrelevant where you store the parsed data.

Unless you actually need to read the file multiple times (compared to looking at the parsed in-memory data multiple times), this should be fast enough.

Thank you for the write-up. As an occasional scripter (a coder I am not), I found it very useful - a very nice and structured presentation of both coding practices, and of the challenges of supporting multiple environments with their own way of doing things.

Showing the scope of change within the editor is a rather nice touch. Visualization of complexity, if you will.

You've used thread_local for the sigjmp_buf.

Are you sure the mechanism used by thread_local is safe to use in a signal handler?

Relevant bug from Rust: "TLS accesses aren't async-signal-safe", https://github.com/rust-lang/rust/issues/43146

I'm thinking even if it works on certain OSes, it's not guaranteed, because a signal handler's context is not a thread context - or is it?

E.g. the thread_local mechanism might depend on compile-time options (affecting how thread_local is implemented: whether it allocates memory on demand, and how it relocates the memory block when loading a shared library), whether it's the main program or an -fPIC shared library, and the type of thread library (different ways of implementing pthreads).

Considering we haven't had any mmap-related crashes anymore, I'd say it's safe to assume that it's good enough for our application.

Since I am a really big fan of Sublime Text I tried to switch from Gitkraken to Sublime Merge a few times but I just don't enjoy working with Sublime Merge.

Is there any chance that merging and rebasing via drag-n-drop is coming to Sublime Merge? For me that's the one big feature which keeps me from switching from Gitkraken to Sublime Merge.

I was using Sublime Merge on the weekend and after doing a few changes with my remotes using the normal command line, it exited (well crashed I assume). When things like this happen do you automatically get an error report? Because you mentioned that you use the error reporting library from Google in that article.

We do get crash reports. If you have some more details in relation to this it would be great to hear from you on our forums or on the issue tracker: https://github.com/SublimeTextIssues/Core

I believe the correct link for SublimeMerge GH issues is: https://github.com/sublimehq/sublime_merge

I'm just going to leave this right here: http://man7.org/linux/man-pages/man2/signalfd.2.html

I know about signalfd, how would it help dealing with mmap?

> The signalfd mechanism can't be used to receive signals that are synchronously generated, such as the SIGSEGV

So a common way to handle synchronous signals is via a mechanism much like signalfd, where your signal handler just appends to a pipe that can be read at a convenient time.

Just use signalfd and refactor into an event loop without jumping around.

This does not work - signalfd cannot catch synchronous signals, per the man page:

> Limitations

> The signalfd mechanism can't be used to receive signals that are synchronously generated, such as the SIGSEGV signal

This is because synchronous signals are fired at the thread that caused them, and signalfd read() calls can't be used to read signals fired at other threads.

About Sublime Merge, is there any plan for plugins?

We do want to support plugins at some point, but we don't have a timeline yet.

"In hindsight it's difficult to justify using mmap over pread"

This needs a stronger justification. mmap allows reading and writing large data structures without copying, which can be a huge benefit depending on the use case.

Using pread would have been more work from the start but provides a more robust solution and doesn't have problems on Windows. I would argue that the incrementally built mmap-based solution is strictly worse, thus difficult to justify doing again.

"strictly worse" despite no-copy memory access? Again, this needs proof.

Strictly worse because it requires more work to maintain and write new code, there's no guarantee we haven't missed any access points in our codebase so it is less robust, it still locks files while in use on Windows, and it requires maintaining patches to Breakpad. Performance is not an issue here; the program working correctly is, and in the long run mmap makes that strictly harder.

Do you have a list of these strictly-better things to do and a plan on when to implement them? Do you let them pile up? Mostly asking about your decision-making process.

I'm not sure what you mean. The blog post discusses solutions to all the problems I listed. They wouldn't have been problems using pread, which has a straight forward implementation.

The original post quantifies it. Around 50% better performance for the mmap version.

They are saying that if they somehow knew up front what the performance gains would be, and what the cost in bugs and complexity would be, they wouldn’t have used mmap at all.

"some quick benchmarks for the way Sublime Merge reads git object files" is in not strong evidence.

I can’t tell what answer you’re looking for here.

You said above that “mmap allows reading and writing large data structures without copying, which can be a huge benefit depending on the use case.”

Yes, of course that’s true, and the Sublime Text authors are clearly well aware it’s true. That’s why they decided to use mmap in the first place. They agreed with you.

This is them reporting, with hindsight, that for their use case mmap introduced a lot of tricky bugs that required complex platform-dependent fixes, and that the performance gains were real but modest. Therefore, in hindsight, it probably wasn’t a good choice.

Which part are you arguing with?

I think it’s reasonable to say “it’s difficult to justify using mmap when other programs are expected to manipulate, and potentially even delete, the file while you are working with it.” But, honestly, that case will always be hard to handle, and mmap doesn’t make it any worse.

I think the problem here is that mmap does make it worse - with (p)read, if you get an errno, you flag an error. with mmap, you have to handle a signal and jump back to an appropriate point in the code (and you can't do this in a cross-platform way), and then flag an error. Obviously, what the upper levels do with the error still has to be worked out, but you can hardly argue that sigsetjmp/siglongjmp + signal handlers + Windows SEH handlers is no worse than "ret = read(...); if (ret < 0) ERROR;"

If the file changes out from under you, all you can do is reread the file. mmap gives you a pointer that you can read from, but you still have to loop through the file’s bytes.

I’m pretty sure most programmers are capable of writing a signal handler that sets a flag (volatile sig_atomic_t) for “parsing failed,” and a loop that checks that flag in addition to checking whether the loop is finished for other reasons. Signal handling doesn’t have to be complicated.

It’s not quite as simple as setting a flag - if the signal is SIGBUS/SIGSEGV then returning from the handler normally will retry the faulting instruction. So, you either have to map dummy memory into place so the fault doesn’t reoccur, or use sigsetjmp/siglongjmp to jump out to a previously configured exception handler (as Sublime Text did here).

I think you’ve mixed up synchronous signals - SEGV, BUS, ABRT - with asynchronous signals like QUIT, INT, USR1. Asynchronous signals can be handled easily with a flag and a loop as you mentioned; synchronous signals are much trickier and much more complicated.

> without copying

While it's true that memory bandwidth is sometimes a limiting factor in performance, I've found that much more frequently people overestimate the cost of memory operations and don't check their estimates against benchmarks.

Indeed. It depends on the use case.

...and why the downvotes?

...keep going...

The other big problem with mmap is what happens when your file changes out from under you. This seems to be mostly for git packfiles, which I think can be treated as immutable by convention, but that's not strongly enforced anywhere. For reading, eg, program source files, I think mmap is hugely problematic.

I've been arguing for a long time that operating systems should provide read only snapshots of files as a primitive, but that's a pretty big ask; it's especially hard to do when the file system is network-mounted. There are a couple of copy-on-write filesystems on Linux (btrfs and ZFS if memory serves) which can do this locally, but it's not mainstream.

You can assume that git packfiles are immutable. They may disappear from under you as a repack happens, but they will not be changed.

They even have a name like pack-<SHA-1>.pack where that <SHA-1> is a SHA-1 of the contents of the pack (minus the last 20 bytes, the checksum SHA-1 is also part of the pack itself).

True for this workload, iff you assume the filesystem and/or media protects you from corruption (it probably doesn't). I would guess OP is commenting on mmap IO in general, rather than TFA's specific git use case.

I feel like this is kind of a dumb design.

You want to abstract two different kinds of file reader: an mmap reader and a regular reader. (And I would add a gz reader, personally).

Then by inspecting the properties of the file, you can determine if it is local when opening, and if so, mmap the file.

I say this because if the file is coming via the network or a FAT32 partition you’re not going to save much time with mmap relative to the read speed anyways.

> You want to abstract two different kinds of file reader: an mmap reader and a regular reader.

We already do this. Small files aren't mmap'd in Sublime Merge and instead copied into memory.

> I say this because if the file is coming via the network or a FAT32 partition you’re not going to save much time with mmap relative to the read speed anyways.

Speed in this case was the absolute least of our concerns. mmap was used to deal with large files without refactoring large parts of the codebase. Not using it for networked files doesn't fix the memory usage issue.

But now you have two completely independent code paths. Both of which will need to go through the same maturation phase that the ST folks evidently went through with mmap. And if the code needs to evolve for other reasons, potentially both of these paths will need some love too.

Seems like the worst choice in a situation like this!

Butler Lampson talks about the general concept of an overlay in operating system design. It's a slightly different case, on the surface, but the possible application in this case is that if mmap fails, you probably should have a fallback code path that tries to access the file normally.

That's obviously not going to work for all applications. But it's something you consciously have to do anyway if mmap may not be available (e.g. when working across the network). The OS design should make it easy to move up and down in this hierarchy of access methods, but the application programmer still needs to know which "level" of access they have to the file.

There's work to be done here in improving the design of the OS interface. That would lift part of the burden of maintaining multiple paths in the application code.


slides: https://bwlampson.site/Slides/Hints%20and%20principles%20(HL...

see also: https://ocw.mit.edu/courses/electrical-engineering-and-compu...

The read() API is so basic, mainly because it is synchronous and has clear return values, that I have a hard time believing it would present nearly as many issues.

There's also the matter of taking an implicit "system call" (via page fault) the first time your program touches a page that hasn't yet been faulted. This old myth that mmap is the fast and efficient way to do IO just won't die. mmap does have perfectly legitimate use cases (e.g., reducing anonymous commit charge) but you should try to make regular reads work first.

That said, there's nothing wrong with mmap or SIGBUS error reporting in concept. I think I've said it before, but you can think of mmap of a disk file as basically adding a temporary dedicated swap file to the system, with all the pages backed by that "swap file" already "swapped" out unless already cached. Sometimes that's exactly what you want!

The author's signal handler registration difficulties come from the POSIX signal API being awful, not from the idea that catching CPU traps is somehow bad. It's possible to do much better than sigaction(2). I wrote up a detailed proposal for improvement in [1].

When I ran [1] by the glibc people, though, the response from them was pretty stark and unwarranted hostility toward any use of signals at all, even in perfectly legitimate scenarios, like mmap SIGBUS or certain kinds of important JVM-style pointer check [2] optimizations. It's this "we won't change anything anywhere despite our views being out of step with real user needs" attitude that's responsible for a lot of weird friction in the Unix API surface.

Practically every program that mmaps disk files has trouble recovering from IO errors. If it were easier for multiple components in a process to share responsibility for handling a signal, we'd more often get subtleties like this right. A better signal API would improve the user experience. Tilting at signal-shaped windmills won't.

[1] https://www.facebook.com/notes/daniel-colascione/toward-shar...

[2] Relying on CPU traps lets you perform certain checks for free. Why write "if (myptr != nullptr) var = myptr" when you can just write "var = myptr" and handle the SIGSEGV when myptr ends up being nullptr? The latter option is zero-overhead in the non-nullptr case. It's also much more complicated, but if you're a VM, and you generate millions and millions of these "if (myptr != nullptr)" checks in JITed code, the size and speed win of eliminating the check starts to justify the complexity.

> This old myth that mmap is the fast and efficient way to do IO just won't die.

Well... because it's not a myth in all cases?

    $ time rg zqzqzqzq OpenSubtitles2016.raw.en --mmap

    real    1.167
    user    0.815
    sys     0.349
    maxmem  9473 MB
    faults  0

    $ time rg zqzqzqzq OpenSubtitles2016.raw.en --no-mmap

    real    1.748
    user    0.506
    sys     1.239
    maxmem  9 MB
    faults  0
The OP's adventures with mmap mirror my own, which is why ripgrep includes this in its man page:

    > ripgrep may abort unexpectedly when using
    > default settings if it searches a file that
    > is simultaneously truncated. This behavior
    > can be avoided by passing the --no-mmap flag
    > which will forcefully disable the use of
    > memory maps in all cases.
mmap has its problems. But on Linux for a simple sequential read of a large file, it generally does measurably better than standard `read` calls. ripgrep doesn't even bother with madvise.

Changing the workload can dramatically alter these conclusions. For example, on a checkout of the Linux kernel:

    $ time rg zqzqzqzq --mmap

    real    1.661
    user    1.603
    sys     3.128
    maxmem  41 MB
    faults  0

    $ time rg zqzqzqzq --no-mmap

    real    0.126
    user    0.702
    sys     0.586
    maxmem  20 MB
    faults  0
Performance of mmap can also vary depending on the platform.

FWIW, I don't generally disagree with your broader point, but it's important to understand that there's actually good reason to believe that using mmaps can be faster in some circumstances.

It’s not a myth at all; mmap is faster, because you save on straight copies of data and the syscalls to do them. It should be faster in nearly all circumstances, faster by at least a copy. In exchange you pick up a lot of complexity dealing with faults, and you potentially put stress on the VM system. If you are doing the ‘O’ part of I/O, then mmap starts to be really complex, fast. rg is kind of a special case: it’s not writing, and it’s going to do mostly (maybe only, I assume it backtracks on matches) sequential reads of mostly static files. It’s really the easy case; it’s not clear that madvise would help, and your brand is speed, so saving on those copies is worth it. What might be interesting: on certain memory-constrained systems you can slide a smaller map space through the file rather than mapping the whole thing. It’s been a while since I looked at it all, but mapping the smaller pieces gives huge hints to the vmm, and it would probably slow down rg incrementally but speed up overall system performance.

I agree with you it's faster on the systems I've tried (modern x86 systems with local drives, mostly). The drive type or IO speed doesn't matter much because it's cached IO we are interested in: if actual IO needs to occur, that is the time that will dominate, not the memcpy to/from user space (except perhaps with very, very fast devices, i.e., in the 5 GB/s range).

That said, the case isn't as obvious as you make it: you apparently save on copies and explicit system calls, but mmap replaces those with page table manipulation and "hidden" system calls (i.e., page faults).

These page faults have as much per-call overhead as regular system calls, and so if mmap actually faulted in every page, I'm pretty sure it would actually be slower than read(), since read with a 16K buffer (for example) would make only 25% as many syscalls as mmap bringing in every 4K page.

On modern Linux, by default, mmap doesn't fault in every page, due to "faultaround" which tries to map in additional nearby pages every time it faults (16 by default), so the number of faults is 1/16th what you'd expect if it faulted on every page. You can avoid additional mapping on access with MAP_POPULATE or madvise (? maybe) calls, but then this introduces the same kind of window management problem as read: you lose the abstraction of the entire file just mapped into memory.

Beyond that, mmap has to do "per page" work to map the file: adjusting VM and OS structures to map the page into the process address space, and then undoing that work on munmap (which is the more expensive half, since it includes TLB invalidation and possibly a cross-core shootdown). You'd think that this work would be much faster than copying 4 KiB of memory, but it isn't - and on some systems with small page sizes and/or slow TLB invalidations it can be slower overall.

Yes... I know it's not a myth. :-) I was responding to someone who was saying that it was a myth.

> It should be faster in nearly all circumstances

As my previous comment showed, that's definitely not true. If you're searching a whole bunch of small files in a short period of time, then it appears that the overhead of memory mapping leads to a significant performance regression when compared to standard `read` calls.

> it’s really the easy case

I know. :-) That's why ripgrep has both modes. It chooses between them based on the predicted workload. It uses memory maps automatically if it's searching a file or two, but otherwise falls back to standard read calls.

Moreover, if ripgrep aborts once in a while because of a SIGBUS, then it's usually not a big deal. It's fairly rare for it to happen. And if it does happen to you a lot or you never want it to happen, then you just need to `alias rg="rg --no-mmap"`.

I love ripgrep, btw, great work.

I was pondering this some more in the shower: the mmap-for-rg case is also sort of naturally cache oblivious. Copies consume hardware cache for the write, and while there is a ton of cache on modern hardware, it’s a noticeable cost on some tests. If you’re searching through something big, avoiding the copies is like doubling the hardware cache, which is probably really noticeable on smaller devices.

The small files case is interesting: copying the data is faster than patching up the page table tree. I bet there is a strong correlation between the hardware cache size and the average file size in that case. The files probably need to be N pages in size for it to be worth it; might be an interesting heuristic to use.

You can surely win some laptop benchmarks by mmaping some files on certain close to the metal filesystems.

But for general production case mmap shouldn't even be considered a solution to the syscall and memory copy overhead problem. If that overhead is too big for you, other approaches work better, like buffering, application level caching, etc.

Wow, this is so wrong in so many ways.

Buffering and application level caching mean you're wasting memory, and also wasting code space and CPU time because you're duplicating work that the OS already does.

Did you do each of these after a clean reboot, or are we looking at possible caching effects from the kernel? If any part was in cache, then we might be just comparing shared memory against IPC, and that's an obvious performance win, but not really what's intended to be examined here.

The first numbers seem to imply that it takes as long for pread to copy bytes from memory as it does to fetch them from the disk. For a quick back-of-the-napkin attempt at checking this, let's assume that disk IO accounts for 100% of this workload, and that local memory is one order of magnitude faster. In that case, I would expect the difference for an optimized implementation to be at most 10%.

I do think it is true that there are scenarios where the file mmap is faster, or that certain operations on each kernel might fall off a cliff. I just find it hard to believe that `mmap` must be as much faster as shown here in a typical situation (e.g. after a clean reboot, doing about the same amount of work, issuing optimal syscalls, with the OS/kernel not doing anything foolish).

Yes, the file is in cache, and that was my intent. That's why my `time` output says `faults 0` for both runs. That is, no page faults occurred. Everything is in RAM.

That is indeed a very common case for ripgrep, where you might execute many searches against the same corpus repeatedly. Optimizing that use case is important.

For cases where the file isn't cached, then it's much less interesting, because you're going to just be bottlenecked on disk I/O for the most part.

> then we might be just comparing shared memory against IPC, and that's an obvious performance win, but not really what's intended to be examined here.

Please don't take my comment out of context. I was specifically responding to this fairly broad sweeping claim with actual data:

> This old myth that mmap is the fast and efficient way to do IO just won't die.

You might think the fact that this isn't a myth is "obvious," but clearly, not everyone does. The right remedy to that is data, not back of the napkin theorizing. :-)

If you want to try your own benchmarks in your own environment, then you can: https://github.com/BurntSushi/ripgrep/

On Linux at least, you do not need to do a clean reboot to measure something without cache. You can drop the file cache with `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`.

Signal handling in POSIX breaks multithreading in practice because it is so crazy hard to get right (I would be surprised if more than 50% of the code out there does it right).

Can you link to the libc maintainer responses? I would like to know how they countered you exactly.

> Signal handling in POSIX breaks multithreading in practice because it is so crazy hard to get right

In my experience, it's pretty easy to write async-signal-safe code --- which is almost always also thread-safe --- if you follow a few simple rules. I don't understand why people have so much difficulty writing correct signal handlers.

> It's possible to do much better than sigaction(2). I wrote up a detailed proposal for improvement in [1].

Thanks for the interesting read. That said, I can see why the glibc people didn't appreciate the proposal.

The first part of the article doesn't mention async-signal safety at all. The second part papers over async-signal safety as if it were a non-issue. There are some dangerous-sounding paragraphs too:

> It’s occasionally useful to longjmp out of a signal handler. It’s reasonable to want to return non-locally from a shared signal handler too --- that is, to resume program execution after SIGNAL_CONTINUE_EXECUTION in a different state from the state the program had when we entered the shared signal handler. Since the signal system probably wants to maintain some kind of state to track its progress through its shared signal handler list, a plain longjmp out of a shared signal handler will likely leave the system in an unspecified state.

Are you sure that we should worry about the "signal system maintaining some kind of state"? Not about the rest of the application being in a completely unspecified state?!

The article proposes a primitive system for setting signal priorities, but stops at a half-baked solution. There is a mention of banning longjmp, but individual handlers can still bail via SIGNAL_CONTINUE_EXECUTION. What if I want my handler to always run regardless of registration order?

The proposal does not offer a way to retrieve a list of already-installed handlers, which makes that part of it even worse than the existing POSIX signal API.

The proposed API does not address the challenges of using signals in multithreaded programs.

The article mentions that signal handlers can't be reliably unloaded, but the proposed API does not address it.

Overall the proposed interface brings little to the table, does not work well alongside the existing sigaction() API, and creates the false illusion that signal handlers are safe and okay to use. I imagine that if it had more technical "meat" (more like robust mutexes or FD_CLOEXEC) it would have seen a lot more constructive discussion and less hostility from glibc maintainers.

> The first part of article doesn't mention async-signal safety at all. Second part papers over async-signal safety, as if it were a non-issue.

If you're writing a signal handler, the signal handler needs to be async-signal-safe. You can't just wave your arms and make the problem of async signal safety disappear, because CPU traps themselves are async-signal-unsafe. Even userfaultfd has to deal with async signal safety issues, since a thread causing a fault can, in principle, be anywhere!

> Are you sure that we should worry about the "signal system maintaining some kind of state"? Not about the rest of the application being in a completely unspecified state?!

If your handlers play by the rules and are async-signal-safe, there's no problem.

> The article proposes a primitive system for setting signal priorities, but stops at a half-baked solution.

What's half-baked about it?

> What if I want my handler to always run regardless of registration order?

That's a logically nonsensical request. What if two components both want their handlers to be the highest priority?

> The proposal does not offer a way to retrieve a list of already-installed handlers, which makes that part of it even worse than the existing POSIX signal API.

The whole point of the facility is to let different components share a signal without stepping on each other. Why would you need to retrieve the list of handlers?

> The proposed API does not address the challenges of using signals in multithreaded programs.

This claim is too vague to rebut. What specific "challenges" are you talking about?

> The article mentions that signal handlers can't be reliably unloaded, but the proposed API does not address it.

The proposed API works fine with library unloading: a library can unregister whatever handlers it has installed just before being unloaded (e.g., in a static destructor), and this unregister operation is guaranteed to be safe no matter in what order handlers are unloaded.

> Overall the proposed interface brings little to the table

It allows multiple components to safely share signals. The objections you've mentioned are based on your misunderstanding my proposal.

> does not work well alongside the existing sigaction()

Yes it does. The article talks about this interaction specifically.

> I imagine, that if it had more technical "meat"

The glibc people literally think that nobody should be using signals. That's their objection, not anything you've talked about.

> The glibc people literally think that nobody should be using signals. That's their objection, not anything you've talked about.

I don't believe that everyone holds that opinion. Even if they did, the world does not revolve around Red Hat's team; there are still kernel mailing lists and other venues for discussion. But if proposed improvements aren't well thought out, would anyone there back them up?

In my opinion, async-signal safety in itself is a much bigger problem than robust registration of signals. The latter is mostly solved by chaining signal handlers, while the former is mostly unsolved (and keeps getting worse): new libraries and async-signal-unsafe conventions keep proliferating, people keep using printf() in signal handlers, fork() keeps showing up in multi-threaded apps, and there is still no async-signal-safe malloc() (some Googlers tried, but the idea didn't get much traction). And then you come and propose a new interface for registering signal handlers, and say that "it's okay for two functions to be async-signal-unsafe". If your proposed API is async-signal-unsafe, how would it deal with signals arriving during dispatch of the signal handler list?

> But if proposed improvements aren't well thought-out, would anyone there back them up?

I think the proposal is thought-out. It's the objections I've seen that suggest a lack of thorough consideration.

> the world does not revolve around Red Hat's team

The facility I'm describing needs to be in libc to be useful, and for better or worse, if it's not in glibc and isn't something you can use as a raw system call, it might as well not exist.

> In my opinion, async-signal safety in itself is much bigger problem than robust registration of signals

Async signal safety concerns are inherent in any approach that exposes CPU traps to userspace, since traps occur at instruction granularity. As I've said elsewhere, writing async-signal-safe code isn't that hard if you follow a few basic rules. You can't solve the async signal safety problem, and we shouldn't need an all-singing, all-dancing async-signal-safety-requirement-avoidance system just to improve on the signal API.

The alternative to what I'm proposing isn't that everyone abandons signals. The alternative is that everyone keeps using sigaction, which sucks.

> The latter is mostly solved by chaining signal handlers,

No, it isn't. I go into great detail in my note, in which I explain why current solutions to this problem are lacking and describe ways we can do better.

> Proliferation of new libraries and async-signal unsafe conventions. People keep using printf() in signal handlers. Occurrences of fork() in multi-threaded apps.

Okay, so don't do those things. The problems you're describing all come from people ignorant of safe signal-handler programming practices writing bad code. The problems disappear with education. The problems I'm describing don't disappear with education, since the current signal API imposes unavoidable limitations that even the best code can't work around, even in principle.

It's as if we have a car with hand-cranked windows and no brakes and you're annoyed that we would fix the brakes before adding power windows. The brakes are necessary to drive the car properly. Power windows are just an ergonomic feature.

> If your proposed API is async-signal unsafe, how would it deal with signals arriving during dispatch of signal handler list?

I don't understand this objection. Registration doesn't have to be async-signal-safe. Dispatch must be. There are multiple ways of implementing such a system --- e.g., CAS on a word pointing to a signal control structure.

If you roughly know your access patterns in advance you can reduce the page fault costs with madvise.

Sure, but you still end up with double-caching and frequent entry in the kernel. madvise will never give you as much control and flexibility as just getting the system out of the way and doing the work yourself. (DB systems can also use fun tricks like compressing their cached pages.)

Why would mmap lead to double caching? I can't follow.

It's not that mmap per se leads to double caching, but that combining the page cache with application-level caching leads to double caching. Say you're reading hugecactus.png into your image processing program. Whether you use mmap(2) or ordinary read(2), the first step in reading hugecactus.png is the kernel DMAing the bytes into the page cache. In the mmap case, the kernel maps the page cache into your application's address space. In the read case, the kernel copies from the page cache to the application read buffer. Now suppose your application PNG-decodes hugecactus.png into RGB raster data. Now, whether you used mmap or read, the kernel has both the decoded RGB data blob and the original PNG data in memory. That's usually wasteful.

(Yes, you can reduce the severity of this problem with MADV_DONTNEED and friends.)

Well, the kernel side cache is not much of a problem. The kernel is free to evict those pages at any time to respond to memory pressure etc. Linux treats its file system cache almost like unused memory in that it is normally the biggest pool from which memory allocations for processes are drawn. Essentially, keeping the pages around in the cache is an optimization, because explicitly overwriting them too aggressively is just unnecessary work.

Interesting, but how will you determine that not-null is the hot path? Getting it wrong could cost you.

In the case of VMs, null means "throw NullPointerException". How often do you see that happening?

And that's just for reading. Writing, especially if you want to be sure when the writes hit the backing file, or in what order, or if you run out of disk space, is another kettle of problems.

I wonder how Multics dealt with all this, since AIUI in that system everything was effectively an mmapped file.

Did Multics have anything like NFS? Everything is easier if your kernel has control of the underlying device.

Well the problematic example given in the article was NTFS where the whole filesystem can disappear, but the problem applied to local files too, eg if their size is changed by another process.

Where did you see mention of NTFS? I was referring to "As it turns out, the ticket comes from someone using a networked drive."

Sorry, it was a typo, I meant NFS or more generally some type of networked drive.

> Using setjmp and longjmping from a signal handler is actually unsafe. It seems to cause undefined behaviour, especially on MacOS.

Have you considered making a dispatch_source_t of type DISPATCH_SOURCE_TYPE_SIGNAL and handling all signals in a dispatch queue, instead of trying to figure out what kind of behavior is legal in a signal handler?

> If a library such as Breakpad registers for Mach exception messages, and handles those, it will prevent signals from being fired. This is of course at odds with our signal handling. The only workaround we've found so far involves patching Breakpad to not handle SIGBUS.

Would it be possible to install your own handler before Breakpad does?

> Have you considered making a dispatch_source_t

I think that would have been considerably more work than finding the SO answer that says you need to use sigsetjmp, and would probably still conflict with Breakpad ;)

> Would it be possible to install your own handler before Breakpad does?

I may be wrong, but I think you can only register one exception handler per "task" (process), so Breakpad would override ours.

When you install a mach exception handler you can get the port of the previous exception handler, which you can use to forward the messages your newly installed exception handler receives. Of course (as with all raw mach APIs) it is poorly documented and error prone.

Interestingly, it seems like Breakpad does a task_get/set_exception_ports dance instead of using task_swap_exception_ports: https://chromium.googlesource.com/breakpad/breakpad/+/refs/h.... Doesn't this break if someone registers a handler between the two lines?

Note that some BSDs have a MAP_ZERO flag which makes invalid accesses read zeros instead of triggering SIGBUS.

Which ones? I don't see it in manual pages for any of the ones I know about (Free, Net, Open, Dragonfly).

Oh I see you didn't get to caveat 5: you can't read anything more complicated than raw bytes, i.e. chars, because of unaligned memory access errors. Let's say you mmap a file and do something like this:

  char *fileContents=...mmap etc...;
  int headerOffset=*(int*)fileContents;
  int *someListOfNumbers=(int*)(fileContents+headerOffset);
  int importantSum=
    someListOfNumbers[0] +
    someListOfNumbers[1] +
    someListOfNumbers[2];
If headerOffset is not a multiple of 4, bad things will happen. On x86 you'll get away with it... Unless you were summing many more ints and the compiler decides to use aligned SSE loads for speed[1]. On ARM you'll get a SIGBUS just for trying to read unaligned ints (IIRC). You can fix that by wrapping your ints or whatever you're trying to read in a packed struct, but it is yet another thing to keep in mind.

[1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-...

You should not directly use structs for data serialization. If you end up having to support a big-endian platform, the byte order of your fields will change, not to mention potential packing issues, as you said.


Wasting a good performance optimization on the rare chance that you might one day have to support a different endian architecture is IMO a poor tradeoff. The number of big-endian machines in use today is continually shrinking, and the number that are active on a heterogeneous network is even smaller.

In LMDB we simply document "don't use this with remote filesystems" and avoid the issue - if you're never sharing files with other machines, there's never a question of mixed endian accessors.

I still think you shouldn't be directly sending structs over the wire or to disk. The alternatives are so much better - SQLite or Cap’n Proto.

I'm a bit shell shocked from supporting both big and little endian in structs from previous jobs. I've had nightmare situations with it twice. I do embedded systems and while little endian is winning there too, you still have legacy things like the LEON (SPARC) that is big endian. I've heard lots of network embedded hardware is big endian too for obvious reasons.

To disk is probably a bad idea but over the wire has some legitimate applications. Being reliant on same endian systems can be alright if for example you're building a distributed computing system where the same binary will be executed on a bunch of systems and all you're doing is sharing computation results between those instances. You do get unrivaled serialization speed that way. Timely Dataflow [1] works that way by default, but it also has an option for "real" serialization if that is required. Admittedly that's a fairly specific application but it's real and sometimes it's a good tradeoff.

[1] https://github.com/TimelyDataflow/timely-dataflow

Yeah that's true for the distributed application you mentioned with the wire. Certainly advantages there, especially when you control all the nodes.

I hate seeing the fields of a communication protocol listed in two structs - an ifdef BIG_ENDIAN and LITTLE_ENDIAN. They will invariably be inconsistent, someone will change something in little but not big, or even worse, update one incorrectly and not test it.

LMDB is orders of magnitude faster than SQLite, partly because it doesn't need to do fancy ser/deserialization.

The alternatives are not better. In fact, SQLite with its B+tree engine replaced by LMDB is still far better than vanilla SQLite - smaller, faster, and more reliable (SQLightning).

In a read-only workload, complex deserialization will become your limiting bottleneck, after you've eliminated all other bottlenecks from your code. Our profiling runs of OpenLDAP demonstrated this already, which is why LMDB was written, to allow structs to be persisted in in-memory format and used on read with no deserialization.

Why not use a memcpy here and not rely on undefined behavior? I’m sure the compiler will optimize out the call anyways, especially if you enforce alignment.

> Oh I see you didn't get to caveat 5

> On x86 you'll get away with it...

Guess we'll have to wait for the proliferation of a different architecture to encounter this one :)

Different architectures like ARM? Plenty of code will only ever run on an x86 chip, but a non-trivial amount will run on x86 and ARM at some point. I've certainly been bitten by this before.

Sure, but my point is that both Sublime products only support x86 currently. There's likely a fair amount of other stuff that would break if we ported to ARM.

Didn't realize who I was replying to. Oops. If you're confident that the code is going to stay x86 only then sure. Personally I've found my code has an annoying habit of growing legs and moving to other projects, often without my knowledge until much later. Of course when this happens the person doing it usually has little regard for any assumptions I made when writing the original code.

It's invalid for the compiler to vectorize scalar operations with unknown alignment if those vector operations require alignment. (That said, on x86 many AVX operations work fine unaligned, and are just as fast as the aligned versions on recent microarchitectures.)

If you dereference pointers with arbitrary offsets read from a file, alignment issues are the least of your issues.

Now I'm curious how Vim deals with large files. I'm assuming the went the pread route.

Vim needs to do a lot more special stuff, because it has to handle random insertions and deletions. Doing those things to a file (mmaped or otherwise) requires moving all data after the edit.

If I recall correctly, vim/vi uses a linked list of 'chunks' that is dynamically merged. For long files I would expect some form of lazy loading of chunks.

That's right. The key data structure is the rope: https://en.wikipedia.org/wiki/Rope_(data_structure)

Nice! I thought xi was the only editor that used ropes.

It would be really nice if there was a POSIX equivalent to Foundation's NSDataReadingMappedIfSafe, which uses mmap() when the file isn't backed by NFS and falls back to read() when unsafe or not worth it.

Is there a way for a process to know what kind of filesystem something lives on, or is this just a developer-provided clue?

I've successfully used mmap() a few times in the last few years... Luckily for me the use cases I've had weren't really subject to the same problems I've since read about here and elsewhere:

1) I'm always mmap()ing the whole file (and the files are power-of-2 sized).

2) The files I'm mapping are stored on file systems I control (and so are never on NFS).

3) In one case, my use of mmap() is limited to read only.

mmap seems to be specifically designed to trigger the differences between NFS and Unix file system semantics.

Really any networked or distributed filesystem will struggle to implement mmap semantics well.

The problem here was wanting to parse a large file at all. You need to be able to do a) online parsing (so you don't need to read the whole file into memory first), and b) stream parsing (so you don't need to build a parsed representation of the whole thing before you can do anything with it).

How does widely-used code that relies on mmap (LMDB maybe being one of the most prominent cases) handle this?

Or you could stream parse the file. eg. parse it one chunk at a time.

That’s what they were doing before?

Before they were reading the whole file into memory and parsing it. Now, they are using mmap to read parts of the file into memory when they are used. If you're only doing a single linear pass through a file there is a third option; you could read part of the file in a single chunk at a time, parse that chunk, then read the next part of the file into the same chunk of memory.

I don't know enough about the process being discussed here to know whether it needs to do lookups at random offsets in the file, but iff it doesn't then the chunked read solution could be a simpler way to reduce memory usage.

TRWTF is Breakpad, trying to deal with the process crashing from within the process itself. Even requires PTRACE to work.

I'd go a step further and broadly recommend not using mmap to access files, unless there's a really good overriding reason. (E.g., if the file is some sort of special virtual device/filesystem that by nature cannot have errors.)

Mmap is not good for writing to files — pages may be persisted to disk in arbitrary order, which makes it harder for filesystems to coalesce adjacent writes into fewer larger (and faster) IOs. (This is still an issue on SSDs, although not as bad as on HDDs.) This is a performance issue and can result in bad file layout (making future reads slower).

As this article describes, the mmap() model sort of assumes no errors happen. This breaks when files are truncated, even on local POSIX filesystems, and you get SIGBUS. It can also break if files disappear, such as failing media, a removed USB stick, or a network filesystem. It doesn't mesh well with distributed filesystems either, for obvious reasons. If a page is to be writeable, you must take an exclusive data lock on that page's region across your distributed filesystem (and read access requires a shared data lock, to prevent corruption from concurrent writers). What if you lose quorum / availability?

TFA's trick to use thread-specific longjmps around specific virtual memory accesses probably works out ok on POSIX platforms[1] but it requires wrapping all of your mmap'd regions carefully. You can't just cast portions to a struct and access directly, except in small critical regions protected by the sigsetjmp. And as they point out, SIGBUS is global — it can conflict with error catchers (mentioned in TFA) but also can be raised for reasons other than mmap IO failure, such as attempting to access a non-canonical virtual address, and thus a long-lived global handler may mask other bugs. (Also, if you mmap many files and install a single long-lived handler in a multi-threaded program, it can become difficult to determine which file-access raised the signal.)

rtorrent, for example, used to have a ton of reports of SIGBUS due to mmap'd file access failure. I don't know if they've addressed that in some way (perhaps by simply masking SIGBUS) or continue to ignore it.

TFA claims pread was about 2/3 as fast as mmap'd access; some slightly clever application-specific use of caching, prefetch, or larger IOs might help eliminate that gap by reducing syscall overhead and/or disk wait. The best thing about pread/pwrite is that they have error reporting in the interface, so you can actually check that your IO did what you wanted.

[1]: http://man7.org/linux/man-pages/man7/signal-safety.7.html :

    If a signal handler interrupts the execution of an unsafe
    function, and the handler terminates via a call to longjmp(3) or
    siglongjmp(3) and the program subsequently calls an unsafe
    function, then the behavior of the program is undefined.

Really well-written, easy to follow piece.

I feel like you should use some kind of library for this instead of handling it all yourself.

A library for mmap would have not helped with any of the Windows or Breakpad issues. It would have also been more work up-front than mmap, since none of the problems of mmap were known to us at the time.

Which library would you propose?

That one is amazing...!
