
Mio – Cross-platform header-only C++11 library for memory-mapped file IO - starbugs
https://github.com/mandreyel/mio
======
mandreyel
Author here. Long time lurker, but made an account now.

Wow, I did _not_ expect this. I'm really touched. I wrote this as a small
utility for my own consumption because I was unsatisfied with the existing
selection at the time, so I'm both surprised and delighted to learn that
people are finding it useful. Although to be completely frank, I think this
library is way too small and insignificant to deserve a spot on HN's front
page, but it definitely made my day. So thank you kind stranger who posted it!

~~~
starbugs
Op here. Thank you for creating mio! Your project clearly deserves the
attention. I just found it and thought it would belong here. A lot of people
seem to share that opinion :)

I am sure this won't be the last top HN post about one of your projects.

Perfect is the enemy of good.

~~~
mandreyel
Thank you for the kind words. I'll buy you a beer if we ever meet!

------
raphlinus
The name definitely made me think of Rust's mio. On the other hand, `namespace
cplusplus` and `mod rust` maybe are disjoint.

~~~
mandreyel
This is definitely unfortunate, but in my defense I was not aware of Rust's
mio (or anything related to Rust beyond its existence) at the time of writing
and naming my library. I have no emotional investment in the name, so I'm open
to suggestions should anyone take issue with it.

------
rwbt
For those of you already using Boost, Boost also has a MMAP in their IOStreams
library and it works pretty well (like most things Boost).

[https://www.boost.org/doc/libs/1_68_0/libs/iostreams/doc/cla...](https://www.boost.org/doc/libs/1_68_0/libs/iostreams/doc/classes/mapped_file.html)

~~~
StreamBright
And? I am not sure what your comment is about.

~~~
mandreyel
It's a valid point, I think. I'd probably also trust something so established
as Boost more than some random guy's lib on GitHub. However, I specifically
wrote mio because I prefer not to use Boost, and from what I understand, many
others don't either.

~~~
cleeus
Boost's quality varies.

I've seen parts that weren't much better than someone's lib on GitHub, because
essentially that's what Boost is.

------
quotemstr
I wish people used mmap less.

Creating a new memory mapping can be pretty expensive! On both Windows and
Linux, it involves taking a process-wide reader-writer lock in exclusive mode
(meaning you get to sit and wait behind page faults), doing a bunch of VMA
tree manipulation work, doing various kinds of bookkeeping (hello, rmap!) and
then, after you return to userspace, entering the kernel _again_ in response
to VM faults just to fill in a few pages by doing, inside the kernel, what
amounts to a read(2) anyway!

Sure, if you use mmap, you get to look at the page cache pages directly
instead of copying from them into some application-provided buffer, but most
of the time, it's not worth the cost.

There are exceptions of course, but you should always default to conventional
reads.
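
To make the comparison concrete, here is a minimal sketch of the two paths, assuming a POSIX system. It is not taken from mio; `sum_with_pread` and `sum_with_mmap` are illustrative names. Both compute the same byte sum: the first copies page cache data into its own buffer, the second sets up a mapping and touches the pages directly.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative helper (not part of mio): sum every byte of a file using
// pread(2), which copies from the page cache into our own buffer.
// Returns false if the file cannot be opened or read.
bool sum_with_pread(const std::string& path, std::uint64_t& sum) {
    sum = 0;
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    std::vector<unsigned char> buf(64 * 1024);
    off_t pos = 0;
    for (;;) {
        const ssize_t n = ::pread(fd, buf.data(), buf.size(), pos);
        if (n < 0) { ::close(fd); return false; }
        if (n == 0) break;  // end of file
        for (ssize_t i = 0; i < n; ++i) sum += buf[static_cast<std::size_t>(i)];
        pos += n;
    }
    ::close(fd);
    return true;
}

// Illustrative helper (not part of mio): the same sum through a read-only
// mapping. Setting up and tearing down the mapping is the extra cost; the
// first touch of each page may additionally fault into the kernel.
bool sum_with_mmap(const std::string& path, std::uint64_t& sum) {
    sum = 0;
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (::fstat(fd, &st) != 0) { ::close(fd); return false; }
    if (st.st_size == 0) { ::close(fd); return true; }  // mmap rejects length 0
    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping remains valid after the descriptor is closed
    if (p == MAP_FAILED) return false;
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    for (std::size_t i = 0; i < len; ++i) sum += bytes[i];
    const int rc = ::munmap(p, len);
    assert(rc == 0);
    (void)rc;
    return true;
}
```

Which one is faster depends on file size, access pattern, and the OS, so timing both on the real workload is the only reliable answer.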

~~~
mandreyel
This is a valid point. My use case was very frequent reads of large files at
pretty much unpredictable positions, so in theory mmap seemed justified.
However, I never got around to thoroughly testing this assumption, and may indeed
just have been better off using read(2) and its variants.

You seem very experienced, so I hope you don't mind a question. In my use case
the files were as large as tens of gigabytes and I was creating read-only
mappings of 256KB-1MB chunks in them, keeping the mmap handles around
according to a cache policy and RAM usage limit. Do you think in this case
using mmap could in theory introduce performance gains?
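
One detail of mapping chunks at arbitrary positions: mmap requires the file offset to be a multiple of the page size, so a chunk that starts mid-page has to be mapped from the preceding page boundary. A minimal sketch of that arithmetic follows. `ChunkWindow` and `aligned_window` are hypothetical names used only for illustration, and the page size is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one mapping request.
struct ChunkWindow {
    std::uint64_t map_offset;  // page-aligned file offset to hand to mmap
    std::size_t   map_length;  // number of bytes to map from map_offset
    std::size_t   data_skip;   // bytes between mapping start and requested data
};

// Round the requested offset down to a page boundary and grow the length
// by the same amount so the mapping still covers [offset, offset + length).
ChunkWindow aligned_window(std::uint64_t offset, std::size_t length,
                           std::size_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(length > 0);
    const std::uint64_t mask = static_cast<std::uint64_t>(page_size) - 1;
    const std::uint64_t aligned = offset & ~mask;
    const std::size_t skip = static_cast<std::size_t>(offset - aligned);
    return ChunkWindow{aligned, length + skip, skip};
}
```

On POSIX the page size would come from `sysconf(_SC_PAGESIZE)`; the requested data then starts `data_skip` bytes into the returned mapping.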

[edit: typo]

~~~
gmueckl
I think that this is the wrong way to use mmap. Just map the whole file at
once. The operating system will automatically read the pages you access from
disk. And if memory gets tight, these pages will be flushed to disk if they
are dirty and then discarded before the system starts paging. These mmapped
pages essentially live in the disk cache.
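
A minimal sketch of the "map the whole file once" approach, assuming a POSIX system. `WholeFileMap`, `map_whole_file`, and `unmap_whole_file` are illustrative names, not mio's API. The `posix_madvise` hint for random access is an addition that is not part of the suggestion above; it is purely advisory, so its result is ignored.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical handle for a read-only mapping of an entire file.
struct WholeFileMap {
    const unsigned char* data;  // nullptr if mapping failed
    std::size_t size;
};

// Map the whole file read-only. Pages are only read from disk when touched.
WholeFileMap map_whole_file(const std::string& path) {
    WholeFileMap m{nullptr, 0};
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (::fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            // Assumption: accesses are at unpredictable positions, so ask
            // the kernel not to bother with aggressive readahead.
            (void)::posix_madvise(p, len, POSIX_MADV_RANDOM);
            m.data = static_cast<const unsigned char*>(p);
            m.size = len;
        }
    }
    ::close(fd);  // the mapping outlives the descriptor
    return m;
}

// Release the mapping and reset the handle.
void unmap_whole_file(WholeFileMap& m) {
    if (m.data != nullptr) {
        const int rc = ::munmap(const_cast<unsigned char*>(m.data), m.size);
        assert(rc == 0);
        (void)rc;
    }
    m.data = nullptr;
    m.size = 0;
}
```

On a 64-bit system a mapping of tens of gigabytes only consumes address space; resident memory grows with the pages actually touched.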

~~~
quotemstr
> The operating system will automatically read the pages you access from disk.
> And if memory gets tight, these pages will be flushed to disk if they are
> dirty and then discarded before the system starts paging.

You can tell that you understand how modern OS memory management works when
you realize that the OS "automatically read[ing] the pages...from disk" and
"flush[ing them] to disk" on memory pressure _is_ paging whether those pages
are anonymous pages or mmapped file pages. :-)

[Edit: flushing _dirty_ file-backed pages is analogous to swapping anonymous
memory to the swapfile. Discarding clean file-backed pages is a bit like
discarding anonymous pages that have been made unused through munmap, process
death, etc.]

But to the GP's point: you don't need (except to conserve address space) to
limit file mapping size. I think he really wants something like MADV_FREE. But
it's complicated.

~~~
gmueckl
There is a subtle difference between anonymous and mapped read-only pages: the
latter can be discarded right away because their contents were read from
permanent storage to begin with. Anonymous pages need to be written to disk
first and that is significantly slower.

------
salvad0r
Bit of a plug here. But I used this library in a small side project (C++
publisher-subscriber library) and it worked like a charm. There are a few
things you can customize with mmapping but most use cases will do fine with
this. Great work.

[https://github.com/drali/pubsub](https://github.com/drali/pubsub)

------
chris_wot
I wonder if this could replace the cross platform memory mapping in
LibreOffice. This is part of the Operating System Layer (OSL) which is at
least several decades old and uses a C interface.

[https://opengrok.libreoffice.org/xref/core/include/osl/file....](https://opengrok.libreoffice.org/xref/core/include/osl/file.h#811)

~~~
chris_wat
That would not be a good idea at all. The memory mapping works cross platform
to prevent an unsubstantiated configuration of the elon-burrow mechanisms.

------
yread
Not a C++ person here, but is a header-only library an advantage?

~~~
orbifold
Yes, because there is no standardised build system for C++, which makes
integrating non-header-only libraries a pain (or at least somewhat more
painful), especially if your aim is cross-platform code. In that case you
cannot rely on a reasonable package manager being present and will essentially
have to include all your dependencies in the build. This is trivial for
header-only libraries.
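
For readers who have not seen one, a header-only library can be sketched as a single file like the following. The file name `tiny_align.hpp` and everything in the `tiny_align` namespace are made up for illustration and have nothing to do with mio's actual contents.

```cpp
// tiny_align.hpp: a hypothetical single-header library.
#ifndef TINY_ALIGN_HPP
#define TINY_ALIGN_HPP

#include <cassert>
#include <cstddef>

namespace tiny_align {

// `inline` allows this definition to appear in every translation unit
// that includes the header without breaking the one-definition rule.
inline std::size_t round_down(std::size_t value, std::size_t alignment) {
    assert(alignment != 0);
    return value - (value % alignment);
}

// Function templates may likewise be defined in headers, so they need
// no separate source file either.
template <typename T>
T round_up(T value, T alignment) {
    assert(alignment != 0);
    return ((value + alignment - 1) / alignment) * alignment;
}

}  // namespace tiny_align

#endif  // TINY_ALIGN_HPP
```

Using it means copying the file into the project and writing `#include "tiny_align.hpp"`; there is nothing to build or link. The trade-off is that every translation unit including the header compiles its contents again.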

~~~
StreamBright
Is there a reason not to have a standard build system?

~~~
twic
There's no reason not to have one, but there isn't one. There are a few build
systems for C++, but none of them are perfect, and none has won out.

------
freekh
I wonder: is there something like this written in rust?

~~~
steveklabnik
[https://crates.io/crates/memmap](https://crates.io/crates/memmap) is the most
used mmap package, followed by
[https://crates.io/crates/mmap](https://crates.io/crates/mmap)

------
kthielen
Pretty cool, must be something in the air. This is a useful technique, and
having it in a self-contained header-only lib is handy too.

I made a similar library for writing data into memory mapped files, also a
self-contained header-only lib:

[https://github.com/Morgan-Stanley/hobbes/blob/master/include...](https://github.com/Morgan-Stanley/hobbes/blob/master/include/hobbes/fregion.H)

This one also serializes a representation of the type structure of recorded
data so that it can be safely concurrently mmapped and read either with the
same code or with the generic PL/compiler that I've developed in this hobbes
project.

Where unpredictable disk latency is a problem, we've got a similar header-only
lib for logging into shared memory (then have another process to consume this
shared memory ring buffer and dump it to disk for concurrent querying):

[https://github.com/Morgan-Stanley/hobbes/blob/master/include...](https://github.com/Morgan-Stanley/hobbes/blob/master/include/hobbes/storage.H)

This pipeline works well for having lightweight C++ processes feeding large
volumes of data to generic query processes that we can run out of band to look
at this data in various ways (with a Haskell-like query language).

We did hit a slight problem doing things this way: in some cases the
straightforward representation of data (as in memory) used too much space and
wasted too much time in I/O. This mostly affected complex market data, where
data structures aren't trivial and recording ~100GB/day makes it very awkward
to keep around a few weeks of data for random querying.

So I also made this header-only lib to write data into these mmapped files
with a simple compression method (I like to describe it as generalizing Curry-
Howard to probabilities) that gives us much better throughput, much smaller
files, faster query times, and still supports concurrent constant time random
access queries:

[https://github.com/Morgan-Stanley/hobbes/blob/master/include...](https://github.com/Morgan-Stanley/hobbes/blob/master/include/hobbes/cfregion.H)

It gives us compression ratios about the same as EOD gzip, but much faster and
importantly works online and with these query use-cases we have with hobbes.

Anyway, maybe I should write up those details somewhere else, I just mean to
say that this is a useful technique and you can push it very far and do many
things with it in a very straightforward way.

------
alexnewman
Already tons of projects called mio. Please rename

------
loup-vaillant
Why, _why_ do they make it header only? Is it so difficult to integrate a
couple of source files along with the existing headers?

We should not forget compilation times. A project I use depends on spdlog, a
header-only C++ logging library. The thing adds almost _two seconds per
compilation unit_ to single threaded build times. And since logging is kinda
used everywhere, the whole project takes forever to build (thrice the build
time it would have had without spdlog, I've measured).

What benefit is so great that it is worth killing compilation times?

~~~
ur-whale
Doesn't your compiler support pre-compiled headers?

~~~
loup-vaillant
The "rebuild from scratch" integration server does not.

------
ajross
The code seems clean. I'm not sure this is a great idea in practice, though.
Generally the only good reason for mapping stuff out of the filesystem is
performance, and VM behavior with mmap() varies _wildly_ across systems (and
filesystem backends, and drivers if it's a hardware device, and hardware if
it's a framebuffer, and...). Frankly on Windows this is AFAIK a mostly-
unheard-of technique. No one does mapping.

This just isn't really something that can be cleanly abstracted to do what you
want it to do, even if you can make the code "look" the same.

But again, it looks like a nice, clean, modern C++ library. Just IMHO
misapplied.

~~~
bsenftner
MMapping is used in most fault tolerant software as a simplified method of
data persistence. Granted, not the only method in play, but it is a common
data safety net.

~~~
ajross
Flushing behavior is precisely the kind of thing that varies the most between
systems. I'd be _very_ suspicious of a "fault tolerant" system that tried to
use a library like this to be "cross platform". That's almost a contradiction
in terms.
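
To illustrate what "flushing" means here, a minimal POSIX-only sketch of writing through a shared mapping and forcing write-back with `msync`. `write_and_flush` is an illustrative name, not mio's API. On Windows the rough counterpart is `FlushViewOfFile` followed by `FlushFileBuffers`, and the guarantees differ, which is exactly the portability concern.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Illustrative helper: create (or truncate) a file, write `payload` through
// a shared mapping, and ask the kernel to write the dirty pages back.
// Returns true only if every step, including the msync, succeeded.
bool write_and_flush(const std::string& path, const std::string& payload) {
    assert(!payload.empty());  // mmap rejects zero-length mappings
    const int fd = ::open(path.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return false;
    const std::size_t len = payload.size();
    // The file must already be long enough to back the mapping.
    if (::ftruncate(fd, static_cast<off_t>(len)) != 0) {
        ::close(fd);
        return false;
    }
    void* p = ::mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        ::close(fd);
        return false;
    }
    std::memcpy(p, payload.data(), len);
    // MS_SYNC blocks until the dirty pages in the range have been written
    // back; how durable that is still depends on the OS, filesystem, and disk.
    const bool flushed = (::msync(p, len, MS_SYNC) == 0);
    ::munmap(p, len);
    ::close(fd);
    return flushed;
}
```

A system that needs real durability guarantees would have to verify this behaviour separately on each platform it targets.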

~~~
bsenftner
Was not talking about that library, just memory mapped files.

