
The mmap pattern - physicsistic
http://john.freml.in/mmap-pattern
======
tsukikage
"By preserving state explicitly to memory backed files, several classes of
unexpected events causing the program to crash can be recovered from with
minimal disruption."

The cure is surely worse than the disease!

The program state in memory at the time of abnormal termination is likely to
be inconsistent, leading to an unusable file. The subset of that that happens
to have been committed to disk is likely to be worse.

Contrast with the traditional approach of explicit serialisation /
deserialisation for persistence:

* serialisation occurs when the program is in a known state. We can reason about what invariants hold at the point when state is sent to storage. We can take steps to confirm that data actually has been committed to persistent storage before we treat the operation as complete.

* deserialisation recreates program state in a clean environment from minimal data. This has the side effect of reinitialising state that may have caused the unexpected termination. Coupled with the ability to reason about program state at the point of serialisation it makes it feasible to attempt error recovery.

* Explicit serialisation and deserialisation stages make it possible to construct persistent state that is portable across environments and through development

mmap() is blindingly fast, but it is a tradeoff between performance,
complexity and reliability - naive use across the board is unlikely to improve
all three!

~~~
ninkendo
Also, beyond only the state in the file being inconsistent, even if you manage
to get a complete atomic write of your program state before you crash, the
very same program state that caused the crash (if the crash was your own
program's fault) is now being faithfully read back from disk! You're just
asking to crash again immediately by blindly loading it all back.

~~~
hyc_symas
This is why you must have ACID transactions to use this approach.

Even without mmap, if your program successfully writes an invalid data
structure to disk, the next time you run and read it back, you will crash.
That's orthogonal to the method used to write and read the program state.

------
barrkel
The act of serializing data structures and reloading them can be a guard
against long-term corruption.

If you only use mmap, there's a risk of some corner of the object graph
getting subtly wrong owing to a bug in one version of the software, and never
getting repaired.

Versioning of data structures is also a problem.

I'd leave this pattern for use cases for which copying of memory on load has a
measurable impact on performance, and where required integrity of data isn't
very high - ideally, the data should be able to be regenerated from primary
sources.

One case I know of where this technique was used was the precompiled headers
file for the Borland C++ compiler. There, it makes sense on both counts.

~~~
mikeash
Isn't the potential for long-term corruption purely related to not rewriting
the entire file from scratch every time you make changes? Even without mmap,
any data store where you make incremental changes is potentially vulnerable to
this sort of thing.

One obvious example would be filesystems. They're basically specialized
databases which treat your whole disk as a single gigantic file, and of course
there's a long, proud history of programs used to repair corruption in them
due to bugs or other problematic events.

To me, that's an argument for not using formats with incremental formats if
you can get away with rewriting the file each time, but once you have enough
data to where you can't afford a total rewrite each time, does mmap make the
problem any worse?

~~~
barrkel
Most app file formats do not have the same reliability testing level as
filesystems. Filesystems need years of use by brave people before they're
considered OK for production.

If you're updating files in place without a whole rewrite, then you are
effectively writing a filesystem.

I prefer to use a different approach where possible: write a diff log, and
occasionally do a rewrite of the whole file.

When you do have to treat the on-disk file as a filesystem, I'd explicitly
treat it as such. Have a low-level structure that does not need modification
for different versions, is portable and highly tested. Do application-level
structure with normal read-modify-write, but layered on top of the virtual
filesystem layer. Consider having two modes, one where the virtual filesystem
just proxies the real filesystem, and instead of a "file", you have a
directory structure on disk. Consider using an existing component for this
(zip archive libraries are popular, but not particularly efficient for
writing).

Mmap does make the problem worse. It encourages you to be lazy towards
versioning and portability.

------
rdtsc
This is basically used for this little gem:

[http://symas.com/mdb/](http://symas.com/mdb/)

LMDB is at the heart of the ubiquitous LDAP ( OpenLDAP ) and is very well
optimized ( look at his benchmarks ). Now they are optimized for reading,
which is important.

I would imagine mmap-ing with large amount of write will result in
unpredictable performance....

~~~
edwintorok
Writes to mmap-ed regions aren't very reliable: besides the synchronization
issue there's also the question of what happens on an I/O error or ENOSPC.
Usually your program just gets a SIGBUS and crashes. IIRC LMDB uses regular
syscalls to update the mapping, and use mmap only for read.

~~~
hyc_symas
Correct, at least for the original design of LMDB. We later added an option to
write thru the mmap as well, at the request of some other users. Since LMDB is
quite careful about writing and use of fsync/msync, reliability generally
isn't an issue, but yes, hard errors on the disk and such can crash your
program when using a writable mmap. Also, in the default mode of read-only
mmap, the DB is incorruptible, stray pointer writes will crash the program
instead of corrupting the DB. With the writable mmap you obviously have no
such protection. We recommend developing your app with the default read-only
mode, and only enabling the writable mmap after your app code is well tested.

------
comex
For a while I've wanted a nice C library for an mmapped heap with allocation
and common data structures (and maybe locks?) - all the usual stuff you'd
expect in a standard library, but with support for relative offsets instead of
pointers, crash robustness, introspection, and other features required to work
well with a persistent file. I do not know any library of this type that
currently exists.

~~~
jasonjackson
I believe this would qualify as a Java library [https://github.com/peter-
lawrey/Java-Chronicle](https://github.com/peter-lawrey/Java-Chronicle)

"This library also supports distributed, durable, observable collections (Map,
List, Set)" "It uses almost no heap, trivial GC impact, can be much larger
than your physical memory size (only limited by the size of your disk) and can
be shared between processes with better than 1/10th latency of using Sockets
over loopback."

------
brianpgordon
> One very key architectural decision for a system is the degree of
> reliability that it should possess. This is an explicit trade-off between
> the rapidity of development (in particular the level of indoctrination
> needed before new contributors are able to augment the feature set) and the
> operational stability. By preserving state explicitly to memory backed
> files, several classes of unexpected events causing the program to crash can
> be recovered from with minimal disruption. The benefit here is that the
> system can be developed rapidly with code reviews focusing on data integrity
> and paying less attention to complex interactions that can lead to crashes.

The mmapped files are going to break every time you make the smallest change
to your data structures. This strikes me as not terribly useful for rapid
development.

~~~
Negitivefrags
That depends on how they were created.

On our game project we use a data tool that allows you to create tables of
data out of columns with specific data types as well as references to other
table rows.

When you export from the tool it creates data files but also generates the C++
struct that represents the schema. The loader simply mmaps the data in to
memory. Variable length data like strings gets packed at the end of the table
with the fixed length records containing relative pointers in to it.

The data on disk is stored in the exact format that is needed at run time.

References to rows in other tables turn in to pointer like objects so you get
a perfectly natural syntax as if the data had been loaded in to hand written
classes.

It's perfect for rapid development because designers can add new columns or
tables of data and then just tell the programmer what they are called.

It also loads lightning fast and has excellent run time properties due to
cache coherency.

------
kristianp
Google code search uses an mmapped index of trigrams, in Go:

[https://code.google.com/p/codesearch/source/browse/index/mma...](https://code.google.com/p/codesearch/source/browse/index/mmap_linux.go)

[http://swtch.com/~rsc/regexp/regexp4.html](http://swtch.com/~rsc/regexp/regexp4.html)

------
samlittlewood
If you can live with boost, then boost::interprocess is a good way of managing
a mmapped segment - shared with other processes or shared with yourself over
time.

This give string,set,map,list,deque,vector types that work in the shared
space.

[http://www.boost.org/doc/libs/1_55_0/doc/html/interprocess/a...](http://www.boost.org/doc/libs/1_55_0/doc/html/interprocess/allocators_containers.html#interprocess.allocators_containers.containers_explained.containers)

------
tlb
I did this for a high-performance time sequence database of robot activity.
It's pretty hard. You need custom data structures for everything (no STL),
adding fields to the data structures requires a schema migration tool, and
it's easy to have subtle bugs around things like hash table growth.

However, it is insanely fast when it works.

~~~
simcop2387
This is one place where I'm wondering how Cap'n Proto[1] would help. Since the
structures are already made for in memory usage I think it might work pretty
well. And given that it's all versioned/tagged structure wise it should be
easy to do the schema migration.

[1] [http://kentonv.github.io/capnproto/](http://kentonv.github.io/capnproto/)

~~~
stuaxo
Wow, that looks great !

------
guard-of-terra
"explicitly choosing a file and mapping that into the memory space of the
program means it can persist across crashes"

It's not that your program crashes, it's that your memory becomes corrupt and
that causes the crash.

Restarting program is essentially redoing all the memory. If you persist your
memory your program will crash on run.

~~~
IgorPartola
Different types of crashes and different memory. You can crash because you
program did not handle a network error well, etc. Also, you do not need to use
mmap'ed memory for everything. You can still use the stack and the heap for
temporary structures, but store permanent data in the mmap'ed segments. Think
about persistent data vs running queries in a database server.

~~~
guard-of-terra
How is it then different from sqlite?

~~~
mtdewcmu
SQLite does it for you; here you do it yourself. Presumably you'd have done a
cost/benefit analysis before choosing this over SQLite, so your reasoning
would be sound...

------
kentonv
I had a G+ thread about mmap a couple months ago that surfaced a lot of really
good points. I like mmap a lot, but there are certainly some down sides:

[https://plus.google.com/u/0/+KentonVarda/posts/NKUUzx2nEsN](https://plus.google.com/u/0/+KentonVarda/posts/NKUUzx2nEsN)

If you're looking for an easy way to exploit mmap in your code, Cap'n Proto is
a serialization format that works similarly to Protocol Buffers but is
designed to work well with mmap():

[http://capnproto.org](http://capnproto.org)

(Disclosure: I am the author of Cap'n Proto... and also the former maintainer
of protobufs.)

------
malkia
Microsoft has __based pointers - e.g. pointers that are based on others - not
sure whether gcc/clang has something like it - [http://msdn.microsoft.com/en-
us/library/57a97k4e.aspx](http://msdn.microsoft.com/en-
us/library/57a97k4e.aspx)

------
wging
Reminds me of a post about the internals of ITA, which mentions that they used
memory-mapped files to make data access fast:

[http://www.paulgraham.com/carl.html](http://www.paulgraham.com/carl.html)

------
nly
> Modern CPUs have the ability to arbitrarily (generally at a 4kB granularity)
> map memory addresses as visible to a program to physical RAM addresses

While true and exceptionally powerful TLBs are limited in size and, being
essentially a hash table put in to your CPU by either Intel and AMD, are
likely optimised for common-case mapping patterns. Of course, these penalties
still pale in comparison to disk I/O... but kind of a downer if you're
dreaming of something crazy like a linked list where every node lives in a
separate random page.

------
chrisrohlf
"or try to mmap each region to a well-known start address (and fail if it
cannot obtain that address)."

Optimizations aside, beware this approach. ASLR is one of your two best
friends (the other is DEP). When you purposely circumvent the protection it
provides a security researcher somewhere will make you the topic of a very
pointy blog post.

------
asveikau
Is it me or does this article completely ignore two huge downsides with mmap -
number one is that page faults are expensive, number two is that i/o failures
become bad pointer dereferences.

~~~
hyc_symas
Page faults are no more expensive than any other physical I/O. If the data you
want isn't already in cache, a seek()/read() is as expensive as a page fault.
Moreso really, because the read has to do a buffer copy too.

Fair point about I/O failures though.

------
jzwinck
TFA says "By mmap'ing a region and then accessing into it, the overhead of
creating multitudes of small interlinked items can be reduced hugely."

You do not need to call mmap directly, because malloc() will do it if your
allocation exceeds MMAP_THRESHOLD (usually on the order of a megabyte). So you
get this optimization "for free"\--the only important part is that you do not
call malloc() for each tiny object.

~~~
mpweiher
I think TFA talks about mapping files into memory...

------
jpollock
I'm sorry, this is probably a newbish question, but why didn't he use mmap for
writes?

