
Writing a Memory Allocator for Fast Serialization - dryman
http://www.idryman.org/blog/2017/06/28/opic-a-memory-allocator-for-fast-serialization/
======
sriram_malhar
Nice bit of work.

Jiri Soukup's 2001 book on "Serialization and Persistent Objects: Turning Data
Structures into Efficient Databases" consists of many techniques, including
the serialisation of mmap'd pages.

[https://books.google.co.in/books?id=DHDABAAAQBAJ&pg=PA74&dq=...](https://books.google.co.in/books?id=DHDABAAAQBAJ&pg=PA74&dq=jiri+soukup+mmap&hl=en&sa=X&ved=0ahUKEwib_cSD8O7UAhXIOI8KHSLACRUQ6AEIJzAA#v=onepage&q=jiri%20soukup%20mmap&f=false)

------
cjhanks
The idea in general is well understood; the popular HDF5 library has an I/O
driver which essentially memory-maps structures in this way. NumPy also allows
memory-mapping complex array structures. In the past I have used this
strategy for mapping point data (on the order of billions of points) which
could never possibly fit into memory.

Tuning the allocator is not as straightforward as you may believe. If you
have variable-sized allocations the problem is fairly difficult... you are
essentially forced to rewrite a worse version of ptmalloc, jemalloc, or
tcmalloc. If your allocations are fixed-size, you're in a slightly rosier
situation. However, you still have to consider: how will you support deletions?
Will you journal and garbage collect? Will you tolerate variable latency?
Are you going to implement atomic barriers on lockless structures? Now that I
think of it... what is the cost of an atomic operation on a shared memory map?
You will also need to concern yourself with cache hits/misses. In my
experience it is somewhat difficult to predict which memory in your map is
going to be in cache and which won't. If your accesses scatter... your
performance is going to be fairly slow.

~~~
barrkel
I've implemented this scheme in the past, for fast serialization in a C# app:
a byte array was used for session storage, and requests updated values
directly inside the byte array rather than serializing and deserializing an
object graph for every request. It needs to be combined with a GC to work well,
for precisely the reasons you mention: variable-sized allocations and
reallocations will otherwise bloat the blob of memory. Also, with a GC,
allocation becomes a much simpler problem, because fragmentation isn't an issue.

In some cases, where data is built up once and then reused read-only, you don't
need to pay the reallocation cost. For example, the Borland C++ compiler used
this approach for precompiled headers. The symbol table information was
allocated in a single contiguous blob of memory, with pointer locations noted
just as you'd note fixups when writing an object file. Then, when the
precompiled header was loaded, the fixups were iterated over and the
difference between the old and new load addresses was added to every
pointer location: and all the pointers work again!

The interesting idea is in this conjunction of linkers, loaders, and garbage
collectors; all three are strongly related pieces of functionality. A smart
linker is a copying GC; a moving GC is almost isomorphic to an OS loader,
except that the source is memory rather than disk (or mmap); a loader is a
runtime linker; etc.

------
faragon
With the libsrt C library [1][2], using one thread on an Intel Core i5 3330
(maps based on a red-black tree, not hash tables, i.e. no rehashing required):

* libsrt i64-i64 map (equivalent to std::map<int64_t, int64_t>): > 10M QPS

* libsrt string-string map (equivalent to std::map<std::string, std::string>): > 1M QPS (> 2M QPS if key size <= 19 bytes)

[1] Repository:
[https://github.com/faragon/libsrt](https://github.com/faragon/libsrt)

[2] Benchmarks:
[https://github.com/faragon/libsrt/blob/master/doc/benchmarks...](https://github.com/faragon/libsrt/blob/master/doc/benchmarks.md)

------
gpderetta
The survey of the state of the art is missing Boost.Interprocess, which
provides shared memory allocators plus drop-in replacements for C++ standard
library containers that can handle these allocators [1].

The biggest problem with Interprocess is that out of the box it is not capable
of handling application failures gracefully and transparently.

[1] Basically, the container must not assume that the allocator::pointer type
is an actual raw pointer, as Interprocess uses a custom offset pointer.

~~~
dryman
Awesome. This will be a super useful reference once I want to migrate the
library to C++.

------
trishume
This is similar to the approach Jonathan Blow demoed recently in his Jai
language. He uses relative offsets from the location of the pointer itself,
though, instead of from the base of the region. The advantage there is that
you don't need to pass around a reference to the current heap; the
disadvantage is that copying/moving is harder, but that can be mitigated with
compiler support.

~~~
dryman
Author here. I did think about creating a programming language specialized for
serialization. Fortunately, using just C seems to be sufficient for building a
POC. Another advantage of using C is that it is easier to embed into other
languages; OPIC is more library-focused, which helps with integrating it
into other languages. Jai seems to be an application-focused (gaming)
language, where language abstractions that let developers code faster matter
more.

------
cpp_developer
When I see posts about tuning OS-specific things (memory allocators,
page-size tuning, VMs, etc.), I keep coming back to the following question
(not exactly sure it's the right one):

Given the exponential increase in performance-critical software running on
dedicated machines, what if we avoided the abstraction of an operating
system for such software and ran it on bare-minimum, optimized system
software?

What if we could pick the specific OS kernel modules and drivers required for
our application's and machine's needs, tune them for the app, and deploy the
whole stack (kernel modules + app) as one piece of software?

Example: for a database to work, we need a networking module (to accept and
send requests on a given port), a memory module (to access disk, memory,
cache, etc.), and process-handling modules (to create a process/thread, if we
can call it that).

Imagine a Dockerfile-like spec that lists all the required modules/drivers
for the given software. The modules could then be _compiled_ for the
architecture and deployed.

Advantages I can see are:

1. Less abstraction.

2. Full control over scheduling the software, and hence fewer synchronization
issues.

3. Kernel modules optimized for the application software.

All three of these lead to much better performance.

I see the following problems with the approach:

1. Existing software (both user and system) doesn't fit well.

2. Increased development time (as the system software needs to be tuned as
well).

3. Not many system-software developers.

Isn't this how the software architecture for high-performance software
(running on dedicated machines) should have been in the first place, given
that we run so many such systems now?

A complete, generically coded OS (with all the modules) seems fine only for
end users who run a variety of not-so-performance-critical apps.

What do you think about it?

~~~
ilikebits
You've described a unikernel. Check out MirageOS and HaLVM; they're very
similar to what you've described.

One interesting thing I've noticed about these kernels is their tendency
toward full-stack language integration, a la Lisp machines and Lisp OSes like
Genera. For example, Mirage integrates strongly with programs written in
OCaml (the home page describes the project as a "library operating system")
and HaLVM with Haskell. A quick search shows unikernels for Golang (Clive)
and JS (Runtime.js) as well.

~~~
cpp_developer
Why do you think unikernels aren't more popular?

Shouldn't software running on dedicated machines want less management
(scheduling), less abstraction (virtual memory), and finer, architecture-
optimized control over the hardware?

------
setpatchaddress
I don’t understand the comment about not being able to use C++. In C++, you
certainly have the same low-level control over your objects as you do in C.
You gain a lot of syntax sugar and additional type safety.

If you just don’t wanna use C++, fine. But this looks like a great fit for it
to me. Avoid the STL, turn off the features you don’t need, etc.

~~~
dryman
The problem is that C++ brings in many extra pointers, for example the vtable
pointer used by virtual functions. Any pointer not converted to an offset can
be invalid in the next process that deserializes the object.

~~~
bananaboy
You don't have to use virtual functions though. You can just use plain old
data structures.

~~~
dryman
If I use a strict subset of C++, it probably would work. However, figuring
out which subset of the different C++ standards and implementations doesn't
include extra pointers is hard. I might need to re-implement some useful
utilities like unique_ptr and shared_ptr as well. Some fundamental data
structures in C++ contain internal pointers too; the short string
optimization, for example, introduces extra pointers. With my poor C++
knowledge, I don't even know what other pointers I'm missing...

------
Erwin
I had a similar idea, but my target was a Python application that parsed text
to create a complex in-memory object tree with tens of thousands of objects.

The idea was that the individual elements were not always fully accessed when
the application ran, so if I created them on demand from such a dense memory-
mappable dump I could persist that instead of parsing every time.

The overhead of creating Python objects was too high for that. But if you are
using Python and are deserializing read-only dictionaries,
[http://discodb.readthedocs.io/en/latest/](http://discodb.readthedocs.io/en/latest/)
does a subset of that -- if your app reads in, e.g., 100,000 translations from
a JSON file, Disco's serialization will let you just mmap them.

------
bertr4nd
What happens if the program unexpectedly terminates during an update to these
persistent data structures? Is the mmap'ed file corrupted, or still in a
usable state?

~~~
dryman
I don't have a good answer to this question yet. For now I only create the
heap in swap, write it to disk, and then use it as a read-only mmap. To ensure
the written file is valid, one can write it to a temporary file first; once
the file is confirmed written, mv it to the desired file name and location.
This works for immutable data, but is a big blocker for making OPIC work on a
mutable backing store.

This problem is hard in general. See [Ensuring data reaches
disk]([https://lwn.net/Articles/457667/](https://lwn.net/Articles/457667/))

~~~
hyc_symas
You should have looked more closely at LMDB, which already solves this
problem. Also you could look at LMDB's API for using fixed-address mmap, which
lets you store pointer-based structures without any deserialization step at
all.

------
vasquque
Not enough info. Please check out [http://tarantool.org](http://tarantool.org)
for good comparative measurements.

