
Use mmap with care - ingve
https://www.sublimetext.com/blog/articles/use-mmap-with-care
======
beat
The first serious bug I ever dealt with professionally was a result of the
hazards of mmap(). This was 1995, and I was working on AIX with a system that
used a series of shared memory buffers for IPC. It was originally written with
shmat(), and on AIX (at least in those days), shmat was limited to three
shared segments, so we had a lot of performance-wrecking blocking going on
while waiting for the buffers to be cleared.

My first piece of original-idea development was rewriting it with mmap so I could
use an arbitrary (and programmable) number of buffers, tuning blocking against
memory consumption, with logging to track performance for tuning. It was very
cool. Worked great!

Until it went to production.

In production, it crashed every time it ran, shortly after starting up. Since
we were doing seasonal production, backing out my change also backed out other
necessary changes. It was very embarrassing and frustrating. Worse, I could
not replicate the problem in testing! It only happened on the prod servers.
And, as a wet behind the ears junior programmer, everyone assumed I'd just
screwed up and was too dumb to understand how.

So I wrote a test program, divorced from our regular code, to test mmap()
itself. Turned out that it ran fine on dev/test servers, but on the prod
servers, it would randomly overwrite 1k memory pages with nulls. Yeah. Once I
convinced the senior engineers and my manager, I got to report an OS bug to
IBM. Who were like "What's wrong with your code, really?" I wound up sending
them my C test code and the compiled executable, along with results.

It turned out the bug was caused by the order in which OS patches had been
applied on the servers.

One of the first rules in the marvelous book _The Pragmatic Programmer_ is
"Select() isn't broken". Yeah, but sometimes it is.

And if a junior programmer came to me today reporting a bug in OS memory
management, my first response would be "What's wrong with your code, really?"

~~~
ajross
Yeah. It's sometimes hard to remember just How Bad software was even in living
memory. These days, we all just assume that the basics all work and are
surprised to see bugs, which we vote up to the top of HN. But it wasn't always
like that, things just failed in crazy ways, at all levels of the stack.

At the start of the dotcom boom, a server uptime measured in months was
considered notable, and the idea of a client system lasting a week between
some failure or voodoo reboot was laughable. Now... I dunno, when was the last
time I bounced my phone? Genuinely can't remember.

~~~
pcwalton
Yep, this is part of the reason why it drives me crazy when people say the
'90s were the heyday of computing. Computers were _awful_ back then.

~~~
nexuist
I think they still have a point. It was the heyday of computing precisely
_because_ they were awful; there was always something to work on or improve.
Fixing bugs in operating systems offered clear improvements to the livelihoods
of thousands or millions.

It's no longer that easy now. Most critical things don't suck that hard
anymore, so it's hard to find a way to feel special, or feel like it's a
special time to be a part of.

Even in startup land most work done is "build this website" or "build this
app" \- what the software does could be cool and easy enough to get passionate
about, but the hope that "we're going to use this code to revolutionize the
entire world for the better" that was omnipresent in '90s hacker culture has
pretty much been murdered by the sins of Facebook and others in Big Tech. Now
it's "what can I do to make sure the stuff I work on hurts the least amount of
people as possible?"

~~~
knorker
On the flip side, while OpenSSH may be great and high quality now, and wasn't
in the '90s, in the '90s people used telnet on the public Internet, and telnet
worked.[1]

IOW: It wasn't basic tech _and_ great opportunity at the same time.

Today you can work on immature things like OpenBSD's MPLS implementation and
who knows, maybe in 20 years people will say that back in 2019 there were so
many things that could be improved about that basic tech! (cue replies from
anti-label-switching dweebs)

[1] Not me, I used ssh since before OpenSSH.

~~~
icedchai
Telnet and rlogin were both very common. I remember installing ssh on my
Slackware Linux box back in 1996. I had to build it from source... fun times!

------
nneonneo
Even without NFS, using mmap requires being real careful about signals -
SIGBUS can be raised any time the underlying file operation fails, including
because someone else truncated the file, or because the underlying storage had
an error (disk error, removed media, network storage). And, as this post so
eloquently illustrates (and as my own experience confirms), handling
SIGBUS/SIGSEGV cleanly in a multithreaded program on POSIX is incredibly
painful.

Honestly, pread is just a much better solution for 90% of use cases, and it
works for large files on 32-bit systems (mmap does not!). If you're doing
largely sequential things, fread/fseek often work remarkably well as they
handle all the caching for you.

mmap tends to shine performance-wise if you need random access to a file but
access certain parts of the file frequently (for example, accessing the index
in a header + contents of the file), because the page cache is literally
designed for this type of usage. But the performance improvement is rarely
worth the technical complexity.
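
To make the comparison concrete, here is a minimal sketch (my own
illustration, not code from the article) of the pread pattern: I/O errors come
back as return values instead of SIGBUS. The helper name read_at is made up.

      /* Sketch only: read exactly len bytes at offset off; returns 0 on
         success, -1 on I/O error or if the file is shorter than expected. */
      #include <errno.h>
      #include <unistd.h>

      static int read_at(int fd, void *buf, size_t len, off_t off) {
          char *p = buf;
          while (len > 0) {
              ssize_t n = pread(fd, p, len, off);
              if (n < 0) {
                  if (errno == EINTR)
                      continue;        /* interrupted by a signal: retry */
                  return -1;           /* real I/O error, reported right here */
              }
              if (n == 0)
                  return -1;           /* unexpected EOF */
              p += n;
              len -= (size_t)n;
              off += n;
          }
          return 0;
      }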

~~~
fpoling
Alternatively one can run a separate process that does the mmap and runs the
calculations, or whatever needs to access the file, as quickly as possible, and
do the straightforward recovery in the parent process when the child process
dies. The drawback is the need for some form of RPC, but there are a lot of
libraries that do that without much hassle.

~~~
quotemstr
You can do that with a MAP_ANONYMOUS | MAP_SHARED mapping too: that kind of
mapping is writable by both parent and child, but isn't backed by a disk file
and so can't be truncated or surprise-removed. The article's points about mmap
infelicity apply mostly to mappings of disk files. Anonymous mappings don't
have the same problems.
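
A minimal sketch of that pattern (an assumption about the shape of the code,
not something from the thread; Linux/BSD-flavoured): an anonymous shared
mapping survives fork, so the parent can read whatever the child managed to
write before dying.

      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      #include <unistd.h>

      int main(void) {
          /* Not backed by a disk file, so it can't be truncated or yanked away. */
          int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          if (shared == MAP_FAILED)
              return 1;

          if (fork() == 0) {   /* child: do the risky work, publish the result */
              *shared = 42;
              _exit(0);
          }
          wait(NULL);          /* parent: recover the result after the child exits */
          printf("child wrote %d\n", *shared);
          return 0;
      }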

~~~
_wmd
Anonymous mappings are backed by swap and may be overcommitted; it's still
possible to catch signals in a wide variety of circumstances.

There is probably enough evidence in this thread to use it as a reference for
why typical apps should avoid mmap whenever possible -- it's clear almost
nobody fully understands it.

~~~
gpderetta
> Anonymous mappings are backed by swap and may be overcommitted,

So is normal memory. Many allocators today even use mmap internally.

~~~
quotemstr
All* allocation is mmap, really. All mmap does is dedicate a region of address
space to some kind of backing storage. The particular kind of backing storage
makes all the difference. The problem is that people colloquially use "mmap"
to mean "mmap of a conventional disk file" and don't mean all the other kinds
of mmap out there, so discussions can become confusing.

* There's sbrk too, but it's just a fancy legacy path that amounts to the same thing as anonymous mmap

------
rocqua
Honestly, this reads like a thorough indictment of signals in user space.

* Signal handlers are process-global
* Signal handlers need to be re-entrant safe

Re-entrancy is painful but can be done; process-global signal handlers,
however, mean that pulling in a totally unrelated library can break your code.
Moreover, they make the combined use of certain libraries straight-up
impossible. Similarly, they mean that using one library can preclude you from
using certain features.

Combined, this means that the use of signal handlers is simply toxic. Which
makes them an anti-feature. It feels to me like having the restrictions of
kernel-space, with all of the downsides of user-space.

Are there any plans in linux to replace signals?

~~~
jschwartzi
The real issue is that people are using threads for things that processes were
originally intended for. Having one thread for the UI, one thread for network
code, one thread for program logic, etc. is actually a mis-use of threads in
POSIX-land. POSIX is designed for you to use multiple processes where in
Windows you would use multiple threads. This gets you a lot of robust
interprocess communication mechanisms and also memory isolation on systems
with an MMU. It also had performance implications on the state of the art
hardware at the time, which is why threads were invented. And finally it
requires you to expand your usage of the operating system.

Being multi-process would solve the issues with a process-global signal
handler because there would no longer be a question of which thread generated
the signal.

~~~
rocqua
This seems like an ivory-tower argument: people shouldn't want shared mutable
state, so anyone suffering is doing so by their own hand.

At the same time, shared mutable state, specifically a fully shared and
transparent (ignoring caches) address space is the least effort way to take
advantage of multi-core CPUs. Hence, people will be using it. There is
essentially no getting around that, and there are even some reasons for
wanting it.

Being multi-process would solve a lot of issues, but so would settling on a
single convention for endianness, rewriting C++ to use unique_ and shared_ptr
where applicable, and many other nice to haves.

At the end of the day, the easiest road is going to be taken most often. If
that road happens to be lined with bandits and barely visible traps, that is a
problem, no matter how often you tell people not to take that road and to
climb a slippery mountain path instead.

------
klodolph
The mistake here is using longjmp / siglongjmp. This is a possible way to
handle SIGBUS, but in practice it will be intractable in larger programs
written in C or C++. The compiler is generally free to move loads and stores
around, and you might be completely blindsided by how the compiler has
reordered your memory operations once you add side effects to one of the
operations. Theoretically, if accessing a memory location can siglongjmp out
then at least the memory location should be volatile.

A better way to handle SIGBUS is to just map zeroes over the offending pages
using MAP_FIXED and then set a flag. After every operation that works on a
file, you check the flag.

~~~
planteen
This sounds interesting. Do you know of an example that does this?

So if I have multiple threads reading the same mmap'd file, I use si_addr in
the signal handler to know which page to call MAP_FIXED on?

~~~
klodolph
I don’t know an open-source example off the top of my head. But that’s the
gist of it… you keep an array somewhere with all the address ranges mapped. If
you get SIGBUS, find the corresponding map, replace it with zeroes, and mark
it as having an error. If there is no corresponding map, uninstall the signal
handler and return—the thread will SIGBUS immediately after return and the
default action will kill the process.
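
A minimal sketch of that approach (my own guess at the shape, assuming 4 KiB
pages and a single tracked mapping; names like g_map_base are made up, and
note that mmap is not formally async-signal-safe, so this is pragmatic rather
than strictly portable):

      #include <signal.h>
      #include <stdint.h>
      #include <sys/mman.h>

      static void  *g_map_base;                   /* start of the tracked file mapping */
      static size_t g_map_len;                    /* its length */
      static volatile sig_atomic_t g_map_failed;  /* checked after every file operation */

      static void on_sigbus(int sig, siginfo_t *info, void *ctx) {
          (void)sig; (void)ctx;
          uintptr_t addr = (uintptr_t)info->si_addr;
          uintptr_t base = (uintptr_t)g_map_base;
          if (addr >= base && addr < base + g_map_len) {
              /* Replace the faulting page with an anonymous zero page so the
                 interrupted load can be retried, and remember the error. */
              uintptr_t page = addr & ~(uintptr_t)4095;   /* assumes 4 KiB pages */
              mmap((void *)page, 4096, PROT_READ,
                   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              g_map_failed = 1;
              return;
          }
          /* Not one of our mappings: restore the default action and return, so
             the re-raised SIGBUS kills the process as it normally would. */
          signal(SIGBUS, SIG_DFL);
      }

      static void install_sigbus_handler(void) {
          struct sigaction sa = {0};
          sa.sa_sigaction = on_sigbus;
          sa.sa_flags = SA_SIGINFO;   /* so si_addr is available in the handler */
          sigaction(SIGBUS, &sa, NULL);
      }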

------
ysleepy
Really cool talk by Bryan Cantrill about the joys of mmap in a "simple" use-
case:
[https://m.youtube.com/watch?v=vm1GJMp0QN4#](https://m.youtube.com/watch?v=vm1GJMp0QN4#)

~~~
karussell
Such a great talk - thanks a lot! Here is the direct link as it is the last of
the lightning talks:
[https://www.youtube.com/watch?time_continue=2462&v=vm1GJMp0Q...](https://www.youtube.com/watch?time_continue=2462&v=vm1GJMp0QN4)

------
atemerev
Things are so much better if you are not writing apps for the general public
(mine are trading-related). You can tell your few clients — make sure that
access to the mmapped file is exclusive — and get away with it.

And yes, mmap is the awesomest thing out there.

~~~
CoolGuySteve
I’ve moved away from mmap for trading in favor of a separate write thread.

While mmap is fast, the combination of factors that can make it decide to
stall your thread while it commits to disk is difficult to manage from an
operational standpoint. A slight misconfiguration is all it takes to introduce
a rare and hard to notice multi-millisecond delay.

Whereas with a spinlocked, size-reserved vector, the fail-state performance is
however long it takes to allocate more space, which is on the level of
microseconds. You do pay 50-150ns for that spinlock though.

~~~
jacobush
Or better yet, mmap in a read-write-thread?

~~~
CoolGuySteve
Ya sure, go nuts on the other thread. It’s not critical to trading. I normally
fprintf all sorts of time stamps, log level info, and handle formatting over
there.

Just make sure you put it on another core to protect your cache.

But if you want mmap, you do you.

------
ben-schaaf
Author here, if anyone has any questions in relation to me or Sublime HQ
please feel free to ask.

~~~
eeZah7Ux
"In hindsight it's difficult to justify using mmap over pread"

This needs a stronger justification. mmap allows reading and writing large
data structures without copying, which can be a huge benefit depending on the
use case.

~~~
ben-schaaf
Using pread would have been more work from the start, but it provides a more
robust solution and doesn't have problems on Windows. I would argue that the
incrementally built mmap-based solution is strictly worse, and thus difficult
to justify doing again.

~~~
eeZah7Ux
"strictly worse" despite no-copy memory access? Again, this needs proof.

~~~
ben-schaaf
Strictly worse because it requires more work to maintain and to write new
code, there's no guarantee we haven't missed any access points in our codebase
(so it is less robust), it still locks files while in use on Windows, and it
requires maintaining patches to Breakpad. Performance is not an issue here;
the program working correctly is, and in the long run mmap is strictly worse
at that.

~~~
superlopuh
Do you have a list of these strictly-better things to do and a plan on when to
implement them? Do you let them pile up? Mostly asking about your decision-
making process.

~~~
ben-schaaf
I'm not sure what you mean. The blog post discusses solutions to all the
problems I listed. They wouldn't have been problems using pread, which has a
straightforward implementation.

------
raphlinus
The other big problem with mmap is what happens when your file changes out
from under you. This seems to be mostly for git packfiles, which I think can
be treated as immutable by convention, but that's not strongly enforced
anywhere. For reading, e.g., program source files, I think mmap is hugely
problematic.

I've been arguing for a long time that operating systems should provide read
only snapshots of files as a primitive, but that's a pretty big ask; it's
especially hard to do when the file system is network-mounted. There are a
couple of copy-on-write filesystems on Linux (btrfs and ZFS if memory serves)
which can do this locally, but it's not mainstream.

~~~
avar
You can assume that git packfiles are immutable. They may disappear from under
you as a repack happens, but they will not be changed.

They even have a name like pack-<SHA-1>.pack where that <SHA-1> is a SHA-1 of
the contents of the pack (minus the last 20 bytes, the checksum SHA-1 is also
part of the pack itself).

~~~
loeg
True for this workload, _iff_ you assume the filesystem and/or media protects
you from corruption (it probably doesn't). I would guess OP is commenting on
mmap IO in general, rather than TFA's specific git use case.

------
CoolGuySteve
I feel like this is kind of a dumb design.

You want to abstract two different kinds of file reader: an mmap reader and a
regular reader. (And I would add a gz reader, personally).

Then by inspecting the properties of the file, you can determine if it is
local when opening, and if so, mmap the file.

I say this because if the file is coming via the network or a FAT32 partition
you’re not going to save much time with mmap relative to the read speed
anyways.

~~~
samcday
But now you have two completely independent code paths. Both of which will
need to go through the same maturation phase that the ST folks evidently went
through with mmap. And if the code needs to evolve for other reasons,
potentially both of these paths will need some love too.

Seems like the worst choice in a situation like this!

~~~
theoh
Butler Lampson talks about the general concept of an overlay in operating
system design. It's a slightly different case, on the surface, but the
possible application in this case is that if mmap fails, you probably _should_
have a fallback code path that tries to access the file normally.

That's obviously not going to work for all applications. But it's something
you consciously have to do anyway if mmap may not be available (e.g. when
working across the network). The OS design should make it easy to move up and
down in this hierarchy of access methods, but the application programmer still
needs to know which "level" of access they have to the file.

There's work to be done here in improving the design of the OS interface. That
would lift part of the burden of maintaining multiple paths in the
application code.

[https://www.youtube.com/watch?v=TRLJ6XdmgnA&t=11m35s](https://www.youtube.com/watch?v=TRLJ6XdmgnA&t=11m35s)

slides:
[https://bwlampson.site/Slides/Hints%20and%20principles%20(HL...](https://bwlampson.site/Slides/Hints%20and%20principles%20\(HLF%202015\).pdf)

see also: [https://ocw.mit.edu/courses/electrical-engineering-and-
compu...](https://ocw.mit.edu/courses/electrical-engineering-and-computer-
science/6-826-principles-of-computer-systems-spring-2002/lecture-notes/30.pdf)

------
quotemstr
There's also the matter of taking an implicit "system call" (via page fault)
the first time your program touches a page that hasn't yet been faulted. This
old myth that mmap is the fast and efficient way to do IO just won't die. mmap
does have perfectly legitimate use cases (e.g., reducing anonymous commit
charge) but you should try to make regular reads work first.

_That said_, there's nothing wrong with mmap or SIGBUS error reporting in
concept. I think I've said it before, but you can think of mmap of a disk file
as basically adding a temporary dedicated swap file to the system, with all
the pages backed by that "swap file" already "swapped" out unless already
cached. Sometimes that's exactly what you want!

The author's signal handler registration difficulties come from the POSIX
signal API being awful, not from the idea that catching CPU traps is somehow
bad. It's possible to do much better than sigaction(2). I wrote up a detailed
proposal for improvement in [1].

When I ran [1] by the glibc people, though, the response from them was pretty
stark and unwarranted hostility toward any use of signals at all, even in
perfectly legitimate scenarios, like mmap SIGBUS or certain kinds of important
JVM-style pointer check [2] optimizations. It's this "we won't change anything
anywhere despite our views being out of step with real user needs" attitude
that's responsible for a lot of weird friction in the Unix API surface.

Practically every program that mmaps disk files has trouble recovering from IO
errors. If it were easier for multiple components in a process to share
responsibility for handling a signal, we'd more often get subtleties like this
right. A better signal API will improve the user experience. Tilting at
signal-shaped windmills won't.

[1] [https://www.facebook.com/notes/daniel-colascione/toward-
shar...](https://www.facebook.com/notes/daniel-colascione/toward-shareable-
posix-signals/10157129032641102/)

[2] Relying on CPU traps lets you perform certain checks for free. Why write
"if (myptr != nullptr) var = _myptr " when you can just write "var = _myptr"
and handle the SIGSEGV when myptr ends up being nullptr? The latter option is
zero-overhead in the non-nullptr case. It's also much more complicated, but if
you're a VM, and you generate _millions_ and _millions_ of these "if (myptr !=
nullptr)" checks in JITed code, the size and speed win of eliminating the
check starts to justify the complexity.

~~~
burntsushi
> This old myth that mmap is the fast and efficient way to do IO just won't
> die.

Well... because it's not a myth in all cases?

    
    
        $ time rg zqzqzqzq OpenSubtitles2016.raw.en --mmap
    
        real    1.167
        user    0.815
        sys     0.349
        maxmem  9473 MB
        faults  0
    
        $ time rg zqzqzqzq OpenSubtitles2016.raw.en --no-mmap
    
        real    1.748
        user    0.506
        sys     1.239
        maxmem  9 MB
        faults  0
    

The OP's adventures with mmap mirror my own, which is why ripgrep includes
this in its man page:

    
    
        > ripgrep may abort unexpectedly when using
        > default settings if it searches a file that
        > is simultaneously truncated. This behavior
        > can be avoided by passing the --no-mmap flag
        > which will forcefully disable the use of
        > memory maps in all cases.
    

mmap has its problems. But _on Linux_ for a simple sequential read of a large
file, it generally does measurably better than standard `read` calls. ripgrep
doesn't even bother with madvise.

Changing the workload can dramatically alter these conclusions. For example,
on a checkout of the Linux kernel:

    
    
        $ time rg zqzqzqzq --mmap
    
        real    1.661
        user    1.603
        sys     3.128
        maxmem  41 MB
        faults  0
    
        $ time rg zqzqzqzq --no-mmap
    
        real    0.126
        user    0.702
        sys     0.586
        maxmem  20 MB
        faults  0
    

Performance of mmap can also vary depending on the platform.

FWIW, I do generally disagree with your broader point, but it's important to
understand that there's actually good reason to believe that using mmaps can
be faster in some circumstances.

~~~
TheCondor
It’s not a myth at all: mmap is faster, since you save on straight copies of
data and the syscalls to do them. It should be faster in nearly all
circumstances, faster by at least a copy. In exchange you pick up a lot of
complexity dealing with faults, and you potentially put stress on the VM
system. If you are doing the ‘O’ part of I/O then mmap starts to get really
complex, fast. rg is kind of a special case: it’s not writing, and it’s going
to do mostly (maybe only; I assume it backtracks on matches) sequential reads
of mostly static files. It’s really the easy case; it’s not clear that madvise
would help, and your brand is speed, so saving on those copies is worth it.
What might be interesting: on certain memory-constrained systems you can slide
a smaller map space through the file rather than mapping the whole thing. It’s
been a while since I looked at it all, but mapping the smaller pieces gives
huge hints to the vmm, and it would probably slow down rg incrementally but
speed up overall system performance.

~~~
burntsushi
Yes... _I_ know it's not a myth. :-) I was responding to someone who was
saying that it was a myth.

> It should be faster in nearly all circumstances

As my previous comment showed, that's definitely not true. If you're searching
a whole bunch of small files in a short period of time, then it appears that
the overhead of memory mapping leads to a significant performance regression
when compared to standard `read` calls.

> it’s really the easy case

I know. :-) That's why ripgrep has both modes. It chooses between them based
on the predicted workload. It uses memory maps automatically if it's searching
a file or two, but otherwise falls back to standard read calls.

Moreover, if ripgrep aborts once in a while because of a SIGBUS, then it's
usually not a big deal. It's fairly rare for it to happen. And if it does
happen to you a lot or you never want it to happen, then you just need to
`alias rg="rg --no-mmap"`.

~~~
TheCondor
I love ripgrep, btw, great work.

I was pondering this some more in the shower. The mmap-for-rg case is also
sort of naturally cache-oblivious: copies will consume hardware cache for the
write, and while there is a ton of cache on modern hardware, it’s a noticeable
cost on some tests. If you’re searching through something big, then it’d be
like doubling the hardware cache, which is probably really noticeable on
smaller devices.

The small-files case is interesting: copying the data is faster than patching
up the page table tree. I bet there is a strong correlation to the hardware
cache size vs. the average file size in that case. The files probably need to
be N pages in size for it to be worth it; that might be an interesting
heuristic to use.

------
rwmj
And that's just for reading. Writing, especially if you want to be sure when
the writes hit the backing file, or in what order, or if you run out of disk
space, is another kettle of problems.

I wonder how Multics dealt with all this, since AIUI in that system everything
was effectively an mmapped file.

~~~
mcculley
Did Multics have anything like NFS? Everything is easier if your kernel has
control of the underlying device.

~~~
BeeOnRope
Well the problematic example given in the article was NTFS where the whole
filesystem can disappear, but the problem applies to local files too, e.g. if
their size is changed by another process.

~~~
mcculley
Where did you see mention of NTFS? I was referring to "As it turns out, the
ticket comes from someone using a networked drive."

~~~
BeeOnRope
Sorry, it was a typo, I meant NFS or more generally some type of networked
drive.

------
saagarjha
> Using setjmp and longjmping from a signal handler is actually unsafe. It
> seems to cause undefined behaviour, especially on MacOS.

Have you considered making a dispatch_source_t of type
DISPATCH_SOURCE_TYPE_SIGNAL and handling all signals in a dispatch queue,
instead of trying to figure out what kind of behavior is legal in a signal
handler?

> If a library such as Breakpad registers for Mach exception messages, and
> handles those, it will prevent signals from being fired. This is of course
> at odds with our signal handling. The only workaround we've found so far
> involves patching Breakpad to not handle SIGBUS.

Would it be possible to install your own handler before Breakpad does?

~~~
ben-schaaf
> Have you considered making a dispatch_source_t

I think that would have been considerably more work than finding the SO answer
that says you need to use sigsetjmp, and would probably still conflict with
Breakpad ;)

> Would it be possible to install your own handler before Breakpad does?

I may be wrong, but I think you can only register one exception handler per
"task" (process), so Breakpad would override ours.

~~~
lgg
When you install a mach exception handler you can get the port of the previous
exception handler, which you can use to forward the messages your newly
installed exception handler receives. Of course (as with all raw mach APIs) it
is poorly documented and error prone.

~~~
saagarjha
Interestingly, it seems like Breakpad does a task_get/set_exception_ports
dance instead of using task_swap_exception_ports:
[https://chromium.googlesource.com/breakpad/breakpad/+/refs/h...](https://chromium.googlesource.com/breakpad/breakpad/+/refs/heads/master/src/client/mac/handler/exception_handler.cc#675).
Doesn't this break if someone registers a handler between the two lines?

------
emersion
Note that some BSDs have a MAP_ZERO flag which makes invalid accesses read
zeros instead of triggering SIGBUS.

~~~
loeg
Which ones? I don't see it in manual pages for any of the ones I know about
(Free, Net, Open, Dragonfly).

------
Asooka
Oh I see you didn't get to caveat 5: you can't read anything more complicated
than raw bytes, i.e. chars, because of unaligned memory access errors. Let's
say you mmap a file and do something like this:

    
    
      char *fileContents = ...mmap etc...;
      /* headerOffset comes straight out of the file and may be any value */
      int headerOffset = *(int *)fileContents;
      /* if headerOffset isn't a multiple of 4, this pointer is misaligned */
      int *someListOfNumbers = (int *)(fileContents + headerOffset);
      int importantSum =
        someListOfNumbers[0] +
        someListOfNumbers[1] +
        someListOfNumbers[2] +
        someListOfNumbers[3];
    

If headerOffset is not a multiple of 4, bad things will happen. On x86 you'll
get away with it... Unless you were summing many more ints and the compiler
decides to use aligned SSE loads for speed[1]. On ARM you'll get a SIGBUS just
for trying to read unaligned ints (IIRC). You can fix that by wrapping your
ints or whatever you're trying to read in a packed struct, but it is yet
another thing to keep in mind.

[1] [http://pzemtsov.github.io/2016/11/06/bug-story-alignment-
on-...](http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html)
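
Besides packed structs, another common fix (a sketch of mine, reusing the
names from the snippet above) is to memcpy out of the mapping into an aligned
local; compilers turn this into a plain load on targets where unaligned access
is fine:

      #include <string.h>

      static int read_int_unaligned(const char *p) {
          int v;
          memcpy(&v, p, sizeof v);   /* well-defined regardless of p's alignment */
          return v;
      }

      /* continuing the example above, inside whatever function holds fileContents: */
      int headerOffset = read_int_unaligned(fileContents);
      int importantSum = 0;
      for (int i = 0; i < 4; i++)
          importantSum += read_int_unaligned(fileContents + headerOffset + i * 4);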

~~~
planteen
You should not directly use structs for data serialization. If you end up
having to support a big-endian platform, the byte order of your fields will
change, not to mention the potential packing issues you mentioned.

[https://commandcenter.blogspot.com/2012/04/byte-order-
fallac...](https://commandcenter.blogspot.com/2012/04/byte-order-
fallacy.html?m=1)

~~~
hyc_symas
Wasting a good performance optimization on the rare chance that you might one
day have to support a different endian architecture is IMO a poor tradeoff.
The number of big-endian machines in use today is continually shrinking, and
the number that are active on a heterogeneous network is even smaller.

In LMDB we simply document "don't use this with remote filesystems" and avoid
the issue - if you're never sharing files with other machines, there's never a
question of mixed endian accessors.

~~~
planteen
I still think you shouldn't be directly sending structs over the wire or to
disk. The alternatives are so much better - SQLite or Cap’n Proto.

I'm a bit shell shocked from supporting both big and little endian in structs
from previous jobs. I've had nightmare situations with it twice. I do embedded
systems and while little endian is winning there too, you still have legacy
things like the LEON (SPARC) that is big endian. I've heard lots of network
embedded hardware is big endian too for obvious reasons.

~~~
cerebellum42
To disk is probably a bad idea but over the wire has some legitimate
applications. Being reliant on same endian systems can be alright if for
example you're building a distributed computing system where the same binary
will be executed on a bunch of systems and all you're doing is sharing
computation results between those instances. You do get unrivaled
serialization speed that way. Timely Dataflow [1] works that way by default,
but it also has an option for "real" serialization if that is required.
Admittedly that's a fairly specific application but it's real and sometimes
it's a good tradeoff.

[1] [https://github.com/TimelyDataflow/timely-
dataflow](https://github.com/TimelyDataflow/timely-dataflow)

~~~
planteen
Yeah that's true for the distributed application you mentioned with the wire.
Certainly advantages there, especially when you control all the nodes.

I hate seeing the fields of a communication protocol listed in two structs -
an ifdef BIG_ENDIAN and LITTLE_ENDIAN. They will invariably be inconsistent,
someone will change something in little but not big, or even worse, update one
incorrectly and not test it.

------
snarfy
Now I'm curious how Vim deals with large files. I'm assuming they went the
pread route.

~~~
rocqua
Vim needs to do a lot more special stuff, because it has to handle random
insertions and deletions. Doing those things to a file (mmaped or otherwise)
requires moving all data after the edit.

If I recall correctly, vim/vi uses a linked list of 'chunks' that is
dynamically merged. For long files I would expect some form of lazy loading of
chunks.

~~~
dullgiulio
That's right. The key data structure is the rope:
[https://en.wikipedia.org/wiki/Rope_(data_structure)](https://en.wikipedia.org/wiki/Rope_\(data_structure\))

~~~
snarfy
Nice! I thought xi was the only editor that used ropes.

------
gok
It would be really nice if there was a POSIX equivalent to Foundation's
NSDataReadingMappedIfSafe, which uses mmap() when the file isn't backed by NFS
and falls back to read() when unsafe or not worth it.

~~~
thomasjudge
Is there a way for a process to know what kind of filesystem something lives
on, or is this just a developer-provided clue?

~~~
jfhufl
getmntent(3)
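
On Linux specifically, statfs(2) is another option; a minimal sketch (my own,
and the list of "network" filesystem magic numbers is deliberately
incomplete):

      #include <linux/magic.h>   /* NFS_SUPER_MAGIC and friends */
      #include <sys/vfs.h>       /* statfs */

      /* Nonzero if the path looks like it lives on a network filesystem. */
      static int looks_network_backed(const char *path) {
          struct statfs s;
          if (statfs(path, &s) != 0)
              return 1;                        /* when in doubt, don't mmap */
          return s.f_type == NFS_SUPER_MAGIC;  /* extend with CIFS/SMB etc. */
      }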

------
dicroce
I've successfully used mmap() a few times in the last few years... Luckily for
me the use cases I've had weren't really subject to the same problems I've
since read about here and elsewhere:

1) I'm always mmap()ing the whole file (and the files are power-of-2 sized).

2) The files I'm mapping are stored on file systems I control (and so are
never on NFS).

3) In one case, my use of mmap() is limited to read-only.

~~~
mcguire
mmap seems to be specifically designed to trigger the differences between NFS
and Unix file system semantics.

~~~
loeg
Really any networked or distributed filesystem will struggle to implement mmap
semantics well.

------
cryptonector
The problem here was wanting to parse a large file at all. You need to be able
to do a) online parsing (so you don't need to read the whole file into memory
_first_), b) stream parse (so you don't need to build a parsed representation
of the whole thing before you can do anything).

------
natmaka
How does widely-used code relying on mmap (LMDB maybe being one of the
prominent cases) fare?

------
z3t4
Or you could stream parse the file, e.g. parse it one chunk at a time.

~~~
saagarjha
That’s what they were doing before?

~~~
0xffff2
Before, they were reading the whole file into memory and parsing it. Now, they
are using mmap to read parts of the file into memory as they are used. If
you're only doing a single linear pass through a file, there is a third
option: you could read the file one chunk at a time, parse that chunk, then
read the next part of the file into the same chunk of memory.

I don't know enough about the process being discussed here to know whether it
needs to do lookups at random offsets in the file, but iff it doesn't then the
chunked read solution could be a simpler way to reduce memory usage.
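
A minimal sketch of that third option (illustrative only; the chunk size and
the parse_chunk callback are made up):

      #include <stdio.h>

      enum { CHUNK = 1 << 20 };   /* reuse one 1 MiB buffer for the whole pass */

      static int parse_in_chunks(FILE *f,
                                 void (*parse_chunk)(const char *buf, size_t len)) {
          static char buf[CHUNK];
          size_t n;
          while ((n = fread(buf, 1, sizeof buf, f)) > 0)
              parse_chunk(buf, n);        /* parser sees one chunk at a time */
          return ferror(f) ? -1 : 0;      /* I/O errors surface here, not as SIGBUS */
      }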

------
strictfp
TRWTF is Breakpad, trying to deal with the process crashing from within the
process itself. Even requires PTRACE to work.

------
loeg
I'd go a step further and broadly recommend not using mmap to access files,
unless there's a _really_ good overriding reason. (E.g., if the file is some
sort of special virtual device/filesystem that by nature cannot have errors.)

Mmap is not good for writing to files — pages may be persisted to disk in
arbitrary order, which makes it harder for filesystems to coalesce adjacent
writes into fewer larger (and faster) IOs. (This is still an issue on SSDs,
although not as bad as on HDDs.) This is a performance issue and can result in
bad file layout (making future reads slower).

As this article describes, the mmap() model sort of assumes no errors happen.
This breaks when files are truncated, even on local POSIX filesystems, and you
get SIGBUS. It can also break if files disappear, such as a failing media or
removed USB stick, or network filesystem. It doesn't mesh well with
distributed filesystems either, for obvious reasons. If a page is to be
writeable, you must take an exclusive data lock on that page's region across
your distributed filesystem (and read access requires a shared data lock, to
prevent corruption from concurrent writers). What if you lose quorum /
availability?

TFA's trick to use thread-specific longjmps around specific virtual memory
accesses probably works out ok on POSIX platforms[1] but it requires wrapping
_all_ of your mmap'd regions carefully. You can't just cast portions to a
struct and access directly, except in small critical regions protected by the
sigsetjmp. And as they point out, SIGBUS is global — it can conflict with
error catchers (mentioned in TFA) but also can be raised for reasons other
than mmap IO failure, such as attempting to access a non-canonical virtual
address, and thus a long-lived global handler may mask other bugs. (Also, if
you mmap many files and install a single long-lived handler in a multi-
threaded program, it can become difficult to determine which file-access
raised the signal.)

rtorrent, for example, used to have a ton of reports of SIGBUS due to mmap'd
file access failure. I don't know if they've addressed that in some way
(perhaps by simply masking SIGBUS) or continue to ignore it.

TFA claims pread was about 2/3 as fast as mmap'd access; some slightly clever
application-specific use of caching, prefetch, or larger IOs might help
eliminate that gap by reducing syscall overhead and/or disk wait. The best
thing about pread/pwrite is that they have error reporting in the interface,
and you can actually check that your IO did what you wanted.

[1]: [http://man7.org/linux/man-pages/man7/signal-
safety.7.html](http://man7.org/linux/man-pages/man7/signal-safety.7.html) :

    
    
        If a signal handler interrupts the execution of an unsafe
        function, and the handler terminates via a call to longjmp(3) or
        siglongjmp(3) and the program subsequently calls an unsafe
        function, then the behavior of the program is undefined.

------
adamc
Really well-written, easy to follow piece.

------
tobyhinloopen
I feel like you should use some kind of library for this instead of handling
it all yourself.

~~~
ben-schaaf
A library for mmap would not have helped with any of the Windows or Breakpad
issues. It would have also been more work up-front than mmap, since none of
the problems of mmap were known to us at the time.

------
b4shout
That one is amazing...!

