My first original-idea development was rewriting it with mmap so I could use an arbitrary (and programmable) number of buffers, tuning blocking against memory consumption, with logging to track performance for tuning. It was very cool. Worked great!
Until it went to production.
In production, it crashed every time it ran, shortly after starting up. Since we were doing seasonal production, backing out my change also backed out other necessary changes. It was very embarrassing and frustrating. Worse, I could not replicate the problem in testing! It only happened on the prod servers. And, as a wet-behind-the-ears junior programmer, everyone assumed I'd just screwed up and was too dumb to understand how.
So I wrote a test program, divorced from our regular code, to test mmap() itself. Turned out that it ran fine on dev/test servers, but on the prod servers, it would randomly overwrite 1k memory pages with nulls. Yeah. Once I convinced the senior engineers and my manager, I got to report an OS bug to IBM. Who were like "What's wrong with your code, really?" I wound up sending them my C test code and the compiled executable, along with results.
It turned out the bug was caused by the order in which OS patches had been applied on the servers.
One of the first rules in the marvelous book The Pragmatic Programmer is "Select() isn't broken". Yeah, but sometimes it is.
And if a junior programmer came to me today reporting a bug in OS memory management, my first response would be "What's wrong with your code, really?"
If you mmap a file and write a byte into the mapped region, but beyond the end of the file, you get a signal; in handling the signal, you can extend the file. You also have
1. Made an error, as documented in the mmap man page,
2. Done something that works on JFS,
3. In fact, done what JFS does itself under the covers, and
4. Made a huge, huge mistake.
If you did this on a NFS filesystem, it deadlocks the filesystem. If you are using an automounter, it turns into a tar-baby: IBM's automountd was single threaded. Any attempt to touch that filesystem deadlocks. Any attempt to mount a new filesystem blocks waiting on automountd. The only solution was to reboot the system.
This is a bit of a problem if you have a database implementation class all trying this at the same time.
I reported this to IBM and their response was, "it's documented as not working, whaddya want?" I considered releasing it on BUGTRAQ, but I didn't. The AIX 4 multithreaded rewrite fixed it, BTW.
At the start of the dotcom boom, a server uptime measured in months was considered notable, and the idea of a client system lasting a week between some failure or voodoo reboot was laughable. Now... I dunno, when was the last time I bounced my phone? Genuinely can't remember.
Now, it just works... mostly. systemd came in and made it like the '90s all over again, with just plain basic things no longer working. In my experience it's pretty much stabilized in the last year, but it took years. The latest disaster in sound systems was well into the '10s, and it's actually not quite "just works" yet.
As for phone, I bounce it about once a month when the bluetooth stack just stops working.
And that's all for Linux. *BSD tends to go "Yeah, we don't do that here" about POSIX, so it can be unpredictable. And Solaris is just madness.
So no, I wouldn't say we're quite into "basic stuff works" just yet.
And the improvements are less than you might think. On that same job, we were managing a couple dozen servers in two different data centers, and 200+ client machines, all remotely, able to deploy different versions of the software to specific machines. It was like "cloud", except we did it all without so much as ssh. We had a configuration management database in Sybase, and rsh/perl scripts, along with SMIT commands (I miss SMIT).
It's no longer that easy now. Most critical things don't suck that hard anymore, so it's hard to find a way to feel special, or feel like it's a special time to be a part of.
Even in startup land most work done is "build this website" or "build this app" - what the software does could be cool and easy enough to get passionate about, but the hope that "we're going to use this code to revolutionize the entire world for the better" that was omnipresent in '90s hacker culture has pretty much been murdered by the sins of Facebook and others in Big Tech. Now it's "what can I do to make sure the stuff I work on hurts as few people as possible?"
IOW: It wasn't basic tech and great opportunity at the same time.
Today you can work on immature things like OpenBSD's MPLS implementation and who knows, maybe in 20 years people will say that back in 2019 there were so many things that could be improved about that basic tech! (cue replies from anti-label-switching dweebs)
Not me; I've used ssh since before OpenSSH.
For a single-purpose writing machine (and as long as you don’t mind sitting at a desk), my parents’ Mac SE from 1989 running a version of MS Word from the early 1990s is IMO significantly more usable than a top-of-the-line 2019 computer running the latest version of MS Word.
Given the amount of research and implementation effort/investment that has gone into improving computers over the past few decades, the state of user-facing software (including browsers and web apps) for the average person today is in my opinion shamefully bad.
This is a bit spun. I can't tell you what to like, but it's easy to romanticize older machines and forget what life was like with a 512x342 monochrome 60 Hz CRT, no backbuffer, and a CPU that took multiple frames to do something as simple as scroll.
I actually have a still-working Mac Plus I like to show off to friends. And yeah -- it was a great writing machine and in lots of ways current MS Office is a disaster of complexity. It was elegant in a historic way, but.. no, it's not "better" by the definition used by pretty much any working writer.
Having a small monochrome screen is not an insurmountable problem in practice for people used to doing most of their work by spreading physical paper around on the floor. Yes it could be better if it took advantage of 30 years of CPU, etc. improvements, but it’s nonetheless a fantastic tool.
Your “working writers” probably also need their computers for web browsing or whatever. And probably waste a ton of time on computer-related or computer-aided bullshit.
I had a lucky guess remembering that the test box wasn’t on the LAN and discovered that the problem depended on whether you’d enabled network sharing. You didn’t need to have shared anything or be using a share – simply having it enabled meant that a file system filter was installed and the local hard drive magically started returning results.
It turns out that there were multiple APIs which did this and apparently the original developer had unluckily picked the less common one. Switching was easy and had no apparent drawbacks. The best theory we had was that Microsoft’s QA group disliked copying builds around on floppies as much as we did and didn’t have many completely offline systems around.
poll() was broken on early versions of Mac OS X for quite a while :-). And I thought I recalled select() being broken too, but might be mistaken. Looks like some reports of kqueue() being broken too.
We never did figure out what caused it. Eventually we migrated all our instances to be HVM-only, and the problem went away.
How come you give this example just now? This can't be a coincidence. select() just caused us a major production outage.
FYI: select() is broken on pretty much all Linux kernels up to very recent ones. It doesn't work when there are more than 1024 open file descriptors, which is not a lot for a server app. Side effects include crashing the application.
Can you elaborate or might you have some links? What was the cause and resolution?
If you google for select 1024 file descriptors, you will find a ton of issues affecting a ton of libraries. The correction seems to be using poll(), but in our case was to remove the routine that wasn't needed anymore.
But in general for all event notification mechanism on all systems, it's incorrect to assume that reported FDs are actually ready, because that could cause a busy waiting loop, and incorrect to assume that unreported FDs are not ready, because that could cause timeouts.
In my years of software development I had the fun but unfortunate experience of coming across not one, but two compiler bugs. The program would segfault on a seemingly innocuous line, but after looking at the assembly, it was clear the compiler was doing something screwy. These types of bugs are exceedingly rare, and they can be maddening to debug, but when you catch them you get a nice sense of satisfaction.
Honestly, pread is just a much better solution for 90% of use cases, and it works for large files on 32-bit systems (mmap does not!). If you're doing largely sequential things, fread/fseek often work remarkably well as they handle all the caching for you.
mmap tends to shine performance-wise if you need random access to a file but access certain parts of the file frequently (for example, accessing the index in a header + contents of the file), because the page cache is literally designed for this type of usage. But the performance improvement is rarely worth the technical complexity.
I forget sometimes that the young'ns never grew up in a world without threads, and often never learned how unix really works.
Signals can be sent either to a process or to a specific thread under linux. Signals sent to the process are handled by an arbitrary thread. There's a bunch of gotchas and whatnot past that.
Mixing signals and threads is pretty much trash.
Agreed, except for SIGSEGV, SIGBUS, and SIGPIPE, which are guaranteed to be delivered, if at all, to the thread which triggered the condition. (Unless, of course, they were raised explicitly.) Still trash in the sense that to do it right you're depending on installing global signal handlers and (very likely) arranging for thread-local storage (actual TLS isn't even async-signal safe; see my post elsethread regarding the sigaltstack trick), so these tricks are really only practical for the core application and not something you can stash in a library to perform transparently.
FWIW, none of this behavior is specific to Linux. Linux follows POSIX rather well in this regard.
There is probably enough evidence in this thread to use it as a reference for why typical apps should avoid mmap whenever possible -- it's clear almost nobody fully understands it
Anonymous mappings won't cause signals, they'll trigger the OOM killer. Remember that malloc() is just a fancy wrapper for mmap() (and sbrk()).
"There is probably enough evidence in this thread to use it as a reference for why typical apps should avoid virtual memory whenever possible -- it's clear almost nobody fully understands it"
I'd suggest that is ludicrous, and for the same reason your original conclusion is also excessive.
My first comment was in reply to one claiming anonymous memory did not have the same problems as file-backed memory, indicating the parent did not understand they are the same thing. The subsequent reply was to another comment continuing to claim anonymous memory was somehow safer, both instances supporting the notion that most people in this thread don't seem to understand mmap at all.
What we're examining is a powerful (and consequently hazardous) OS feature that often provides only marginal performance improvement, yet introduces many exotic error paths into a program that have their own exotic problems (memory access in thread A can raise SEGV in thread B, async-signal safety), that 7 hours' commenting has not been sufficient to fully capture. This thread is full of upvoted miscomprehension, bad advice (spawn a child to deal with SEGV!?), obviously incorrect solutions (signalfd), and yet still manages to completely omit some critical characteristics of mmap, for example, that faults take a VM-global semaphore -- mmap can easily destroy multithreaded app performance in a way read() is immune to, because nobody expects file IO in one thread to cause malloc() latency in another.
If this isn't evidence for "avoid this feature wherever possible", I really don't know what is.
If you apply the same reasoning that you've used to conclude that everyone should avoid mmap, then you are led directly to the conclusion that everyone should avoid virtual memory.
The "same problems" that you are pointing out are possible with anonymous memory aren't unique to memory you get directly from mmap, they also apply to all memory, period. mmap'ed anonymous memory might have the same problems as file-backed memory, but those are the same problems that .text and .bss have.
The "powerful (and consequently hazardous) OS feature" here isn't mmap, it's literally virtual memory. At the moment you concede that memory may be backed by something besides physical memory at any point in time, you get the possibility of all those "exotic error paths."
The exotic error paths are always there, but you don't always need to handle them. You can push some error handling to other systems, such as clients and supervisors / orchestrators. The reason why mmap with files is tractable is because we have a good error handling strategy (remap with zeroes, mark error) and we have a few understandable reasons why we might expect the error (IO errors, lost media/network failures, or even truncate). In general the problem that IO is done outside of direct syscalls like write() can be difficult even when you're not using mmap, like when Postgres was losing errors when calling fsync.
But when you have an IO error in your swap file, go ahead, eat the SIGBUS and die. This is fine.
The "powerful (and consequently hazardous) OS feature" here is using mmap for general file IO, it:
- introduces resources leaks many developers can't profile
- introduces VM bottlenecks 99% of developers can't profile
- introduces random segfaults delivered to arbitrary threads in the running process, leading to crashes many developers can't diagnose
- the mmap() interface itself is fundamentally unsafe in that it allows partially overwriting random bits of VM (MAP_FIXED) with file views, and worse still, allows those mappings to be read-write
Once again, nobody has ever suggested avoiding virtual memory except you -- once again, that is impossible in a modern environment, but it is more than possible, and ultimately incredibly sensible, not to mention entirely on topic with regards to this thread and the article it is attached to, to suggest avoiding use of mmap for general file IO
- Nobody ever said mmap was always faster than the alternative. If you care about performance then you should do whole-application performance testing with and without features enabled (like mmap IO). This is not unusual; there are plenty of aspects of performance that are counter-intuitive, where speeding up one part of your program causes a seemingly unrelated part of your program to slow down.
- The signals are SIGBUS, and they can be intercepted, mapped with zeroes, and the errors can be propagated back to your app code later. This is not trivial but neither is it outrageous.
- You can overwrite arbitrary memory with read(), too, you just have to pass it a pointer to something you want to overwrite. mmap() is not any less safe. Recall that typical use of MAP_FIXED is so you can overwrite an existing mmap() region with something else, not so you can nuke random parts of your address space.
Keep in mind that you are, if nothing else, an indirect user of mmap(). The question is whether using mmap() directly is advantageous for your application. "Yes" is not an unreasonable nor outrageous answer for some applications.
> nobody has ever suggested avoiding virtual memory except you
Do you know what the "anonymous" in "anonymous mapping" means? You are the one who started asserting that anonymous mappings from mmap have the same difficulties as mappings of normal files and are therefore too dangerous to use.
> paragraph, n.: a distinct section of a piece of writing, usually dealing with a single theme and indicated by a new line, indentation, or numbering
In the original comment you will find two of these, the former correcting an error in the parent comment, the latter making an observation based on the obvious brainwrong riddled throughout this thread
I'm done replying, you're of course free to continue checking in hazardous and suspect file IO code, as the rest of us are free to giggle at such things before ripping them out
> Anonymous mappings
> typical apps should avoid mmap whenever possible
All virtual memory is either a user-mode wrapper around mmap, or sbrk (which is functionally a kernel-mode wrapper around mmap).
(But that would be a sweeping generalization)
Please clarify how you think that statement is incorrect. As far as I'm concerned, it won't die because that's how virtual memory works, and you have something going on in your head that is either wrong or massive hairsplitting.
I would have expected a better understanding from someone bloviating about how the rest of the commenters are too thick to understand mmap and virtual memory.
So is normal memory. Many allocators today even use mmap internally.
* There's sbrk too, but it's just a fancy legacy path that amounts to the same thing as an anonymous mmap
OP seems to be trying to cope with NFS errors....but if you are using mmap on NFS you have bigger problems...
A better way to handle SIGBUS is to just map zeroes over the offending pages using MAP_FIXED and then setting a flag. After every operation that works on a file, you check the flag.
So if I have multiple threads reading the same mmap'd file, I use si_addr in the signal handler to know which page to call MAP_FIXED on?
That said, if I could do it over I wouldn't use mmap again. Especially since io_uring is around the corner (on Linux) that allows zero-copy reads/writes with no syscall overhead.
The safest (and also portable) way I've found to set up per-thread async-signal-safe local storage is to install an alternate signal stack for each thread using sigaltstack. Allocate a larger buffer than needed (and reported) for the stack and use the remaining memory for storage and guard pages. For example, allocate ((PAGE_SIZE * 3) + roundup(MINSIGSTKSZ, PAGE_SIZE)): two guard pages, one page for your local storage, and the remainder for the stack.
* Signal handlers are process global
* Signal handlers need to be re-entrant safe
Re-entrancy is painful but can be done, but process-global signal handlers means that pulling in a totally unrelated library can break your code. Moreover, it makes the combined use of certain libraries straight-up impossible. Similarly, it means that the use of library precludes you from using certain features.
Combined, this means that the use of signal handlers is simply toxic. Which makes them an anti-feature. It feels to me like having the restrictions of kernel-space, with all of the downsides of user-space.
Are there any plans in linux to replace signals?
Being multi-process would solve the issues with a process-global signal handler because there would no longer be a question of which thread generated the signal.
At the same time, shared mutable state, specifically a fully shared and transparent (ignoring caches) address space is the least effort way to take advantage of multi-core CPUs. Hence, people will be using it. There is essentially no getting around that, and there are even some reasons for wanting it.
Being multi-process would solve a lot of issues, but so would settling on a single convention for endianness, rewriting C++ to use unique_ and shared_ptr where applicable, and many other nice to haves.
At the end of the day, the easiest road is going to be taken more often. If that easiest road is lined with bandits and barely visible traps, that is a problem, no matter how often you tell people not to take that road and to climb a slippery mountain path instead.
These problems are all fixable without "replac[ing] signals" as the mechanism. Ultimately, as long as processors have traps (which they will, as long as we have virtual memory) and as long as you want to give userspace to do something in response to these traps other than immediately die (an ability that's tremendously useful) you need some kind of stop-the-thread-that-trapped-and-call-a-callback mechanism, and whatever shape that mechanism takes, it's going to end up looking at least somewhat like signals.
Instead of just saying "signals are awful" and burying our heads in the sand, we should talk about what a better signals API should look like. I've already written a detailed proposal that I've linked elsewhere.
The real problem here is that the glibc people are completely uninterested in actually improving the signals API. Instead, they've taken the radical, unhelpful, and unrealistic stance that nobody should be using signals. As long as they think that way, people will keep using sigaction(2), and the world will remain in a half-broken and awful state.
Also since we're on the topic, here's a nice vulnerability caused by bad signal handling: https://news.ycombinator.com/item?id=16753013
Pretty sure there are no plans to replace signals, but maybe there are libraries that make signal handling easier?
That statement alone should be enough to re-examine a feature. Code composition is a rather nice thing to have, making that very complicated is not a good thing.
Non-portable, but sigprocmask() SIG_BLOCK plus signalfd() (Linux) or kqueue() EVFILT_SIGNAL (BSDs). Neither is a good solution for handling mmap SIGBUSes, but they're generally good for handling most signals (USRn, TERM, HUP, CHLD, etc) more similarly to other kinds of events.
By contrast, with kqueue all EVFILT_SIGNAL listeners will be notified of a signal, even if the signal was also delivered to a handler. Big difference.
This is one among many reasons why Linux's event syscalls are widely considered inferior to BSD's kqueue. It's ridiculous considering that kqueue not only predated epoll by several years and signalfd by nearly a decade, with ample use cases; it was also well documented in a paper that described the rationale for all the semantics. Somehow the authors of epoll, signalfd, etc. couldn't even be bothered to do basic research; they just spitballed semantics without having much real-world experience of what was needed and most useful.
Depending on OS internals within threading is a problem. Libraries that implement their own signal handling are a problem. Trying to get thread-local signal handling behavior when signals are process-global is a problem.
Threads are not a wholesale substitute for processes. Signals work just fine, as long as you plan for ways they can affect your code's behavior.
A feature that 1. cannot be used if we compose two systems that use it, 2. precludes using other mainstream features, and 3. is an essential part of the interface between userspace and kernelspace, is a bad feature. I am not saying we should drop the idea of allowing userspace to respond to things the kernel raises. I am saying we need a way that will 1. allow program composition and 2. play nice with threads.
I am not a unix/linux expert, but those requirements seem like they could be met.
It's sloppy thread programming that doesn't play nice with signals.
Two threads both want to MMAP a file / catch SIGSEGV for different pieces of code.
Moreover, these threads come from different modules of a system maintained by different people.
All of a sudden, these modules become coupled because we need some system for delegation of signal handlers between them.
Or, we need to introduce a custom signal-handler module to deal with our delegation.
I would not call that 'playing nice with threads' nor would I call that sloppy thread programming.
"I found a paragraph in this article http://www.linusakesson.net/programming/tty/ very apt at describing what Unix signals are like:
In *The Hitchhiker's Guide to the Galaxy*, Douglas Adams
mentions an extremely dull planet, inhabited by a bunch of
depressed humans and a certain breed of animals with sharp
teeth which communicate with the humans by biting them very
hard in the thighs. This is strikingly similar to UNIX, in
which the kernel communicates with processes by sending
paralyzing or deadly signals to them."
And yes, mmap is the awesomest thing out there.
While mmap is fast, the combination of factors that can make it decide to stall your thread while it commits to disk is difficult to manage from an operational standpoint. A slight misconfiguration is all it takes to introduce a rare and hard to notice multi-millisecond delay.
Whereas with a spinlocked sized-reserved vector, the fail state performance is however long it takes to allocate more space which is on the level of microseconds. You do pay 50-150ns for that spinlock though.
Just make sure you put it on another core to protect your cache.
But if you want mmap, you do you.
I contribute to git.git, and it would be interesting to know if there's inherent issues stopping you from doing that, or if it's implementation problems in some cases (e.g. missing plumbing commands or features). There's definitely interest from upstream in reviewing patches / helping if there's missing or inadequate plumbing.
You'd get upstream features for free as they come along. E.g. presumably you haven't implemented the new MIDX format, but that speeds up pack file access by a lot for some use-cases, and presumably the boring bits of low-level git operations aren't much of a selling point in and of themselves.
Aside from whether you'd use "git" itself, such a trick of using a slave process you'd talk to over IPC of some sort would cover some of the issues you wrote about, e.g. issue with sharing global state with libraries like Breakpad.
In terms of getting the right data, one example is that we need to know the full set of non-ignored sub-directories in the working directory, so we can watch them for changes. It's easy enough to generate this ourselves as we calculate the status output, but I don't believe that git will emit it.
In terms of performance, we rely on being able to read objects efficiently. For example, to show a commit, we can't just use the output of "git diff", as we need the full file contents to be able to calculate syntax highlighting correctly. You could go a long way with "git cat-file --batch", but there are plenty of contexts where you can't practically batch requests, and process creation costs + the lack of caching across requests (which can be quite significant due to the delta encoding of objects) would be quite significant.
There's going to be cases where it sucks, e.g. what you point out with wanting both raw blobs and their diffs, you'd need to do that in two plumbing commands now.
But just on that example: Having poked at some of the diff code recently I can tell you there's no big technical hurdle to just exposing that sort of thing. I.e. spewing out machine-readable raw blobs and their diffs, it just happens not to be exposed now.
I think what a program like Sublime Merge would want/need short of C API access (which is unlikely to happen) is a git version of an open-ended "plumbing" IPC protocol of the sort that Common Lisp VMs tend to expose. I.e. being able to have one (or few) "git command-server" processes spawned, and ask them questions like "look up this blob" or "diff these two blobs" (where the previous blob lookup would be cached).
Obviously patching/coordinating/upstreaming those sorts of changes is going to take work, but so is duplicating and keeping up-to-date with the diff, pack, status etc. code.
I'm not trying to tell you what to do, just saying that the git project is definitely friendly to "we're a commercial product and need this missing plumbing for our editor" (unlike say, GCC).
The plumbing that's there now is mostly in the state it's in because it's what git itself needed in the past when it was more of a collection of shellscripts, as well as being biased towards what git server operators like GitHub needed (because they sent more patches), which is why plumbing for say batch blob operations tends to be better than the one for "status".
In any case it would be very interesting to have some post about the sort of read-only operations Sublime Merge is doing with its own custom git code.
(Also, IPC and fork+exec has overhead that mmap or thread in the same program does not.)
The libgit2 code is GPL with a linking exception, so you can use it (unlike "git" itself) as a C library in a proprietary commercial product.
> IPC and fork+exec has overhead[...]
The "git cat-file --batch" command is something you'd invoke once, and then as your program runs you keep feeding it SHA-1s on stdin and it spews out their content on stdout. So even on Windows the overhead of that should be fine.
It's clear from ben-schaaf's other comments (which I read later) that one concern was the simplicity of downstream APIs being able to read the data using a normal C variable.
But that just leaves more questions. People in this thread are mentioning pack files, assuming that a multi-GB "git object" must be in a pack, but I notice the original post doesn't say anything about it.
If they're reading packs with this they'll need to parse it, resolve deltas etc. So likely the code that deals with the mmap()'d variable is small in any sane codebase (they're surely not doing delta resolution repeatedly all over the place...).
If they're very large loose objects those will most likely be zlib compressed, so wouldn't this need to go through some intermediary API layer anyway? I guess if SM itself is adding them it could add them uncompressed.
Since ben-schaaf mentioned this not being about performance, but about saving memory I thought this might be something like wanting to extract a small part of a 1GB object from git for display. That seems like a thing an editor might want to do.
In that case "git cat-file --batch" would suck, but not for some intrinsic reason. An API could be added that could take the start/end of an object to print out.
You might also have lawyers who are lazy about it and don't want to deal with the liability, "we heard Apple banned GPL code..." or "the FSF sued Cisco...".
But there's no license reason for why you can't use that GPL code in some way, and everyone from Google with Android to Oracle with Oracle Linux and their DB bundles GPL code that's directly used by some accompanying proprietary piece of software.
But what I was more going for is that there's also a non-legal aspects to it that go beyond the license, which is that some maintainers of free software are actively hostile to their software being used as a smaller component in some proprietary product.
The GCC project is probably the most famous example of this, I think this has changed somewhat in recent years with LLVM+Clang, but they used to jealously guard things like their AST format. So e.g. someone with a proprietary editor (or Emacs for that matter...) could never hope to use GCC for spewing out parsing information for some C code.
I think it's fair to say that the Git project isn't like that. If someone maintaining proprietary software needs some plumbing interface to hook their stuff up and is willing to submit patches it'll be received as well as any other change (subject to review, maintenance & backwards-compatibility concerns etc.). If they find it useful it's likely that other people will too...
The effects of oh-noes-my-file-is-gone can be somewhat mitigated by using the heuristics built into NSData (instead of using mmap directly).
For example, you can call NSData’s `dataWithContentsOfFile:options:error:` with the `NSDataReadingMappedIfSafe` option. The framework will then transparently mmap the file unless it believes there’s an elevated risk of the file going away.
Apple doesn’t disclose exactly how NSData makes that decision; however, I’ve found a few reports that say it uses mmap internally when the file is on the root filesystem, and falls back on an in-memory copy otherwise.
It’s a rather dumb heuristic though, and may not solve the issue entirely.
"Memory mapped files work by mapping the full file into a virtual address space and then using page faults to determine which chunks to load into physical memory. In essence it allows you to access the file as if you had read the whole thing into memory, without actually doing so."
It totally would have been simpler overall, but each incremental step we made was significantly less work than the refactoring required for pread.
Not necessarily. With O_DIRECT, pread() doesn't put pages into page cache: it just DMAs them directly into your process. Using O_DIRECT and the process-private caching we've been discussing, sophisticated programs (like databases) can (and do!) implement their own "page cache" systems. And because databases have access pattern information that the generic kernel VM subsystem doesn't, such a database can frequently do a better job doing this caching on its own.
In 10 years will you be saying this about the next incremental problem that you run into? If you think this likely, then the next incremental problem is an excuse to do it right.
Unless you actually need to read the file multiple times (compared to looking at the parsed in-memory data multiple times), this should be fast enough.
Showing the scope of change within the editor is a rather nice touch. Visualization of complexity, if you will.
Are you sure the mechanism used by thread_local is safe to use in a signal handler?
Relevant bug from Rust: "TLS accesses aren't async-signal-safe", https://github.com/rust-lang/rust/issues/43146
I'm thinking even if it works on certain OSes, it's not guaranteed, because a signal handler's context is not a thread context - or is it?
E.g. the thread_local mechanism might depend on compile-time options (affecting how thread_local is implemented, whether it allocates memory on demand, and how it relocates the memory block when loading a shared library), on whether it's the main program or an -fPIC shared library, and on the type of thread library (different ways of implementing pthreads).
Is there any chance that merging and rebasing via drag-n-drop is coming to Sublime Merge? For me that's the one big feature which keeps me from switching from Gitkraken to Sublime Merge.
> The signalfd mechanism can't be used to receive signals that are synchronously generated, such as the SIGSEGV signal
This is because synchronous signals are fired at the thread that caused them, and signalfd read() calls can't be used to read signals fired at other threads.
This needs a stronger justification. mmap allows reading and writing large data structures without copying, which can be a huge benefit depending on the use case.
They are saying that if they somehow knew up front what the performance gains would be, and what the cost in bugs and complexity would be, they wouldn’t have used mmap at all.
You said above that “mmap allows reading and writing large data structures without copying, which can be a huge benefit depending on the use case.”
Yes, of course that’s true, and the Sublime Text authors are clearly well aware it’s true. That’s why they decided to use mmap in the first place. They agreed with you.
This is them reporting, with hindsight, that for their use case mmap introduced a lot of tricky bugs that required complex platform-dependent fixes, and that the performance gains were real but modest. Therefore, in hindsight, it probably wasn’t a good choice.
Which part are you arguing with?
I’m pretty sure most programmers are capable of writing a signal handler that sets a flag (volatile sig_atomic_t) for “parsing failed,” and a loop that checks that flag in addition to checking whether the loop is finished for other reasons. Signal handling doesn’t have to be complicated.
I think you’ve mixed up synchronous signals - SEGV, BUS, ABRT - with asynchronous signals like QUIT, INT, USR1. Asynchronous signals can be handled easily with a flag and a loop as you mentioned; synchronous signals are much trickier and much more complicated.
While it's true that memory bandwidth is sometimes a limiting factor in performance, I've found that much more frequently people overestimate the cost of memory operations and don't check their estimates against benchmarks.
I've been arguing for a long time that operating systems should provide read only snapshots of files as a primitive, but that's a pretty big ask; it's especially hard to do when the file system is network-mounted. There are a couple of copy-on-write filesystems on Linux (btrfs and ZFS if memory serves) which can do this locally, but it's not mainstream.
They even have a name like pack-<SHA-1>.pack, where that <SHA-1> is a SHA-1 of the contents of the pack (minus the last 20 bytes; the checksum SHA-1 is itself part of the pack).
You want to abstract two different kinds of file reader: an mmap reader and a regular reader. (And I would add a gz reader, personally).
Then, by inspecting the file's properties when opening it, you can determine whether it is local, and if so, mmap the file.
I say this because if the file is coming via the network or a FAT32 partition you’re not going to save much time with mmap relative to the read speed anyways.
We already do this. Small files aren't mmap'd in Sublime Merge and instead copied into memory.
> I say this because if the file is coming via the network or a FAT32 partition you’re not going to save much time with mmap relative to the read speed anyways.
Speed in this case was the absolute least of our concerns. mmap was used to deal with large files without refactoring large parts of the codebase. Not using it for networked files doesn't fix the memory usage issue.
Seems like the worst choice in a situation like this!
That's obviously not going to work for all applications. But it's something you consciously have to do anyway if mmap may not be available (e.g. when working across the network). The OS design should make it easy to move up and down in this hierarchy of access methods, but the application programmer still needs to know which "level" of access they have to the file.
There's work to be done here in improving the design of the OS interface. That would lift part of the burden of maintaining multiple paths in the application code.
see also: https://ocw.mit.edu/courses/electrical-engineering-and-compu...
That said, there's nothing wrong with mmap or SIGBUS error reporting in concept. I think I've said it before, but you can think of mmap of a disk file as basically adding a temporary dedicated swap file to the system, with all the pages backed by that "swap file" already "swapped" out unless already cached. Sometimes that's exactly what you want!
The author's signal handler registration difficulties come from the POSIX signal API being awful, not from the idea that catching CPU traps is somehow bad. It's possible to do much better than sigaction(2). I wrote up a detailed proposal for improvement.
When I ran the proposal by the glibc people, though, the response from them was pretty stark and unwarranted hostility toward any use of signals at all, even in perfectly legitimate scenarios, like mmap SIGBUS or certain kinds of important JVM-style pointer check optimizations. It's this "we won't change anything anywhere despite our views being out of step with real user needs" attitude that's responsible for a lot of weird friction in the Unix API surface.
Practically every program that mmaps disk files has trouble recovering from IO errors. If it were easier for multiple components in a process to share responsibility for handling a signal, we'd more often get subtleties like this right. A better signal API will improve the user experience. Tilting at signal-shaped windmills won't.
 Relying on CPU traps lets you perform certain checks for free. Why write "if (myptr != nullptr) var = myptr" when you can just write "var = myptr" and handle the SIGSEGV when myptr ends up being nullptr? The latter option is zero-overhead in the non-nullptr case. It's also much more complicated, but if you're a VM, and you generate millions and millions of these "if (myptr != nullptr)" checks in JITed code, the size and speed win of eliminating the check starts to justify the complexity.
Well... because it's not a myth in all cases?
$ time rg zqzqzqzq OpenSubtitles2016.raw.en --mmap
maxmem 9473 MB
$ time rg zqzqzqzq OpenSubtitles2016.raw.en --no-mmap
maxmem 9 MB
> ripgrep may abort unexpectedly when using
> default settings if it searches a file that
> is simultaneously truncated. This behavior
> can be avoided by passing the --no-mmap flag
> which will forcefully disable the use of
> memory maps in all cases.
Changing the workload can dramatically alter these conclusions. For example, on
a checkout of the Linux kernel:
$ time rg zqzqzqzq --mmap
maxmem 41 MB
$ time rg zqzqzqzq --no-mmap
maxmem 20 MB
FWIW, I do generally disagree with your broader point, but it's important
to understand that there's actually good reason to believe that using mmaps
can be faster in some circumstances.
That said, the case isn't as obvious as you make it: you apparently save on copies and explicit system calls, but mmap replaces those with page table manipulation and "hidden" system calls (i.e., page faults).
These page faults have as much per-call overhead as regular system calls, and so if mmap actually faulted in every page, I'm pretty sure it would actually be slower than read(), since read with a 16K buffer (for example) would make only 25% as many syscalls as mmap bringing in every 4K page.
On modern Linux, by default, mmap doesn't fault in every page, due to "faultaround" which tries to map in additional nearby pages every time it faults (16 by default), so the number of faults is 1/16th what you'd expect if it faulted on every page. You can avoid additional mapping on access with MAP_POPULATE or madvise (? maybe) calls, but then this introduces the same kind of window management problem as read: you lose the abstraction of the entire file just mapped into memory.
Beyond that, mmap has to do "per page" work to map the file: adjusting VM and OS structures to map the page into the process address space, and then undoing that work on munmap (which is the more expensive half since it includes TLB invalidation, and possibly a cross-core shootdown). You'd think that this work would be much faster than copying 4 KiB of memory, but it isn't - and on some systems with small page sizes and/or slow TLB invalidations it can be slower overall.
> It should be faster in nearly all circumstances
As my previous comment showed, that's definitely not true. If you're searching a whole bunch of small files in a short period of time, then it appears that the overhead of memory mapping leads to a significant performance regression when compared to standard `read` calls.
> it’s really the easy case
I know. :-) That's why ripgrep has both modes. It chooses between them based on the predicted workload. It uses memory maps automatically if it's searching a file or two, but otherwise falls back to standard read calls.
Moreover, if ripgrep aborts once in a while because of a SIGBUS, then it's usually not a big deal. It's fairly rare for it to happen. And if it does happen to you a lot or you never want it to happen, then you just need to `alias rg="rg --no-mmap"`.
I was pondering this some more in the shower: the mmap-for-rg case is also sort of naturally cache oblivious. Copies consume hardware cache for the write, and while there is a ton of cache on modern hardware, it's a noticeable cost on some tests. If you're searching through something big, it'd be like doubling the hardware cache, which is probably really noticeable on smaller devices.
The small files case is interesting: copying the data is faster than patching up the page table tree. I bet there is a strong correlation between hardware cache size and average file size in that case. The files probably need to be N pages in size for mmap to be worth it, which might be an interesting heuristic to use.
But for general production case mmap shouldn't even be considered a solution to the syscall and memory copy overhead problem. If that overhead is too big for you, other approaches work better, like buffering, application level caching, etc.
Buffering and application level caching mean you're wasting memory, and also wasting code space and CPU time because you're duplicating work that the OS already does.
The first numbers seem to imply that it takes equally long for pread to copy bytes from memory as it does to fetch them from the disk. For a quick back-of-the-napkin attempt at checking this, let's assume that disk IO accounts for 100% of this workload, and that local memory is one order of magnitude faster. In that case, I would expect the difference for an optimized implementation to be at most 10%.
I do think it is true that there are scenarios where the file mmap is faster, or that certain operations on each kernel might fall off a cliff. I just find it hard to believe that `mmap` must be as much faster as shown here in a typical situation (e.g. after a clean reboot, doing about the same amount of work, issuing optimal syscalls, with the OS/kernel not doing anything foolish).
That is indeed a very common case for ripgrep, where you might execute many searches against the same corpus repeatedly. Optimizing that use case is important.
For cases where the file isn't cached, then it's much less interesting, because you're going to just be bottlenecked on disk I/O for the most part.
> then we might be just comparing shared memory against IPC, and that's an obvious performance win, but not really what's intended to be examined here.
Please don't take my comment out of context. I was specifically responding to this fairly broad sweeping claim with actual data:
> This old myth that mmap is the fast and efficient way to do IO just won't die.
You might think the fact that this isn't a myth is "obvious," but clearly, not everyone does. The right remedy to that is data, not back of the napkin theorizing. :-)
If you want to try your own benchmarks in your own environment, then you can: https://github.com/BurntSushi/ripgrep/
On Linux at least, you do not need to do a clean reboot to measure something without cache. You can drop the file cache with `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`.
Can you link to the libc maintainer responses? I would like to know how they countered you exactly.
In my experience, it's pretty easy to write async-signal-safe code --- which is almost always also thread-safe --- if you follow a few simple rules. I don't understand why people have so much difficulty writing correct signal handlers.
Thanks for the interesting read. That said, I can see why the glibc people didn't appreciate the proposal.
The first part of the article doesn't mention async-signal safety at all. The second part papers over async-signal safety as if it were a non-issue. There are some dangerous-sounding paragraphs too:
> It’s occasionally useful to longjmp out of a signal handler. It’s reasonable to want to return non-locally from a shared signal handler too --- that is, to resume program execution after SIGNAL_CONTINUE_EXECUTION in a different state from the state the program had when we entered the shared signal handler. Since the signal system probably wants to maintain some kind of state to track its progress through its shared signal handler list, a plain longjmp out of a shared signal handler will likely leave the system in an unspecified state.
Are you sure that we should worry about the "signal system maintaining some kind of state"? Not about the rest of the application being in a completely unspecified state?!
The article proposes a primitive system for setting signal priorities, but stops at a half-baked solution. There is a mention of banning longjmp, but individual handlers can still bail out via SIGNAL_CONTINUE_EXECUTION. What if I want my handler to always run regardless of registration order?
The proposal does not offer a way to retrieve a list of already installed handlers, which makes that part of it even worse than the existing POSIX signal API.
The proposed API does not address the challenges of using signals in multi-threaded programs.
The article mentions that signal handlers can't be reliably unloaded, but the proposed API does not address that.
Overall the proposed interface brings little to the table, does not work well alongside the existing sigaction() API, and creates the false illusion that signal handlers are safe and OK to use. I imagine that if it had more technical "meat" — more like robust mutexes or FD_CLOEXEC — it would have seen a lot more constructive discussion and less hostility from the glibc maintainers.
If you're writing a signal handler, the signal handler needs to be async-signal-safe. You can't just wave your arms and make the problem of async signal safety disappear, because CPU traps themselves are async-signal-unsafe. Even userfaultfd has to deal with async signal safety issues, since a thread causing a fault can, in principle, be anywhere!
> Are you sure that we should worry about the "signal system maintaining some kind of state"? Not about the rest of the application being in a completely unspecified state?!
If your handlers play by the rules and are async-signal-safe, there's no problem.
> The article proposes a primitive system for setting signal priorities, but stops at a half-baked solution.
What's half-baked about it?
> What if I want my handler to always run regardless of registration order?
That's a logical nonsense request. What if two components want their handlers to be the highest priority?
> The proposal does not offer a way to retrieve a list of already installed handlers, which makes that part of it even worse than the existing POSIX signal API.
The whole point of the facility is to let different components share a signal without stepping on each other. Why would you need to retrieve the list of handlers?
> The proposed API does not address the challenges of using signals in multi-threaded programs.
This claim is too vague to rebut. What specific "challenges" are you talking about?
> The article mentions that signal handlers can't be reliably unloaded, but the proposed API does not address that.
The proposed API works fine with library unloading: a library can unregister whatever handlers it has installed just before being unloaded (e.g., in a static destructor), and this unregister operation is guaranteed to be safe regardless of the order in which handlers are unregistered.
> Overall the proposed interface brings little to the table
It allows multiple components to safely share signals. The objections you've mentioned are based on a misunderstanding of my proposal.
> does not work well alongside the existing sigaction()
Yes it does. The article talks about this interaction specifically.
> I imagine that if it had more technical "meat"
The glibc people literally think that nobody should be using signals. That's their objection, not anything you've talked about.
I don't believe that everyone there holds that opinion. Even if they did, the world does not revolve around Red Hat's team; there are still kernel mailing lists and other venues for discussion. But if the proposed improvements aren't well thought out, would anyone there back them up?
In my opinion, async-signal safety is in itself a much bigger problem than robust registration of signal handlers. The latter is mostly solved by chaining signal handlers, while the former is mostly unsolved (and keeps getting worse): the proliferation of new libraries and async-signal-unsafe conventions, people still using printf() in signal handlers, occurrences of fork() in multi-threaded apps, and still no async-signal-safe malloc() (some Googlers tried, but the idea didn't get much traction). And then you come and propose a new interface for registering signal handlers, and say that "it's okay for two functions to be async-signal-unsafe". If your proposed API is async-signal-unsafe, how would it deal with signals arriving during dispatch of the signal handler list?
I think the proposal is thought-out. It's the objections I've seen that suggest a lack of thorough consideration.
> the world does not revolve around Red Hat's team
The facility I'm describing needs to be in libc to be useful, and for better or worse, if it's not in glibc and isn't something you can use as a raw system call, it might as well not exist.
> In my opinion, async-signal safety in itself is much bigger problem than robust registration of signals
Async signal safety concerns are inherent in any approach that exposes CPU traps to userspace, since traps occur at instruction granularity. As I've said elsewhere, writing async-signal-safe code isn't that hard if you follow a few basic rules. You can't solve the async signal safety problem, and we shouldn't need an all-singing, all-dancing async-signal-safety-requirement-avoidance system just to improve on the signal API.
The alternative to what I'm proposing isn't that everyone abandons signals. The alternative is that everyone keeps using sigaction, which sucks.
> The latter is mostly solved by chaining signal handlers,
No, it isn't. I go into great detail in my note, in which I explain why current solutions to this problem are lacking and describe ways we can do better.
> Proliferation of new libraries and async-signal unsafe conventions. People keep using printf() in signal handlers. Occurrences of fork() in multi-threaded apps.
Okay, so don't do those things. The problems you're describing all come from people ignorant of safe signal-handler programming practices writing bad code. The problems disappear with education. The problems I'm describing don't disappear with education, since the current signal API imposes unavoidable limitations that even the best code can't work around, even in principle.
It's as if we have a car with hand-cranked windows and no brakes and you're annoyed that we would fix the brakes before adding power windows. The brakes are necessary to drive the car properly. Power windows are just an ergonomic feature.
> If your proposed API is async-signal unsafe, how would it deal with signals arriving during dispatch of signal handler list?
I don't understand this objection. Registration doesn't have to be async-signal-safe. Dispatch must be. There are multiple ways of implementing such a system --- e.g., CAS on a word pointing to a signal control structure.
(Yes, you can reduce the severity of this problem with MADV_DONTNEED and friends.)
I wonder how Multics dealt with all this, since AIUI in that system everything was effectively an mmapped file.
Have you considered making a dispatch_source_t of type DISPATCH_SOURCE_TYPE_SIGNAL and handling all signals in a dispatch queue, instead of trying do figure out what kind of behavior is legal in a signal handler?
> If a library such as Breakpad registers for Mach exception messages, and handles those, it will prevent signals from being fired. This is of course at odds with our signal handling. The only workaround we've found so far involves patching Breakpad to not handle SIGBUS.
Would it be possible to install your own handler before Breakpad does?
I think that would have been considerably more work than finding the SO answer that says you need to use sigsetjmp, and would probably still conflict with Breakpad ;)
> Would it be possible to install your own handler before Breakpad does?
I may be wrong, but I think you can only register one exception handler per "task" (process), so Breakpad would override ours.
In LMDB we simply document "don't use this with remote filesystems" and avoid the issue - if you're never sharing files with other machines, there's never a question of mixed endian accessors.
I'm a bit shell shocked from supporting both big and little endian in structs from previous jobs. I've had nightmare situations with it twice. I do embedded systems and while little endian is winning there too, you still have legacy things like the LEON (SPARC) that is big endian. I've heard lots of network embedded hardware is big endian too for obvious reasons.
I hate seeing the fields of a communication protocol listed in two structs - an ifdef BIG_ENDIAN and LITTLE_ENDIAN. They will invariably be inconsistent, someone will change something in little but not big, or even worse, update one incorrectly and not test it.
The alternatives are not better. In fact, SQLite with its B+tree engine replaced by LMDB is still far better than vanilla SQLite - smaller, faster, and more reliable (SQLightning).
In a read-only workload, complex deserialization will become your limiting bottleneck once you've eliminated all other bottlenecks from your code. Our profiling runs of OpenLDAP demonstrated this already, which is why LMDB was written: to allow structs to be persisted in their in-memory format and used on read with no deserialization.
> On x86 you'll get away with it...
Guess we'll have to wait for the proliferation of a different architecture to encounter this one :)
If I recall correctly, vim/vi uses a linked list of 'chunks' that is dynamically merged. For long files I would expect some form of lazy loading of chunks.
1) I'm always mmap()ing the whole file (and the files are power of 2 sized).
2) The files I'm mapping are stored on file systems I control (and so are never on NFS).
3) In one case, my use of mmap() is limited to read only.
I don't know enough about the process being discussed here to know whether it needs to do lookups at random offsets in the file, but iff it doesn't then the chunked read solution could be a simpler way to reduce memory usage.
Mmap is not good for writing to files — pages may be persisted to disk in arbitrary order, which makes it harder for filesystems to coalesce adjacent writes into fewer larger (and faster) IOs. (This is still an issue on SSDs, although not as bad as on HDDs.) This is a performance issue and can result in bad file layout (making future reads slower).
As this article describes, the mmap() model sort of assumes no errors happen. This breaks when files are truncated, even on local POSIX filesystems, and you get SIGBUS. It can also break if files disappear, such as a failing media or removed USB stick, or network filesystem. It doesn't mesh well with distributed filesystems either, for obvious reasons. If a page is to be writeable, you must take an exclusive data lock on that page's region across your distributed filesystem (and read access requires a shared data lock, to prevent corruption from concurrent writers). What if you lose quorum / availability?
TFA's trick to use thread-specific longjmps around specific virtual memory accesses probably works out ok on POSIX platforms but it requires wrapping all of your mmap'd regions carefully. You can't just cast portions to a struct and access directly, except in small critical regions protected by the sigsetjmp. And as they point out, SIGBUS is global — it can conflict with error catchers (mentioned in TFA) but also can be raised for reasons other than mmap IO failure, such as attempting to access a non-canonical virtual address, and thus a long-lived global handler may mask other bugs. (Also, if you mmap many files and install a single long-lived handler in a multi-threaded program, it can become difficult to determine which file-access raised the signal.)
rtorrent, for example, used to have a ton of reports of SIGBUS due to mmap'd file access failure. I don't know if they've addressed that in some way (perhaps by simply masking SIGBUS) or continue to ignore it.
TFA claims pread was about 2/3 as fast as mmap'd access; some slightly clever application-specific use of caching, prefetch, or larger IOs might help eliminate that gap by reducing syscall overhead and/or disk wait. The best thing about pread/pwrite is that they have error reporting right in the interface, and you can actually check that your IO did what you wanted.
From http://man7.org/linux/man-pages/man7/signal-safety.7.html:
If a signal handler interrupts the execution of an unsafe
function, and the handler terminates via a call to longjmp(3) or
siglongjmp(3) and the program subsequently calls an unsafe
function, then the behavior of the program is undefined.