
Zero-copy deserialization in Julia - jamii
http://scattered-thoughts.net/blog/2018/08/28/zero-copy-deserialization-in-julia/
======
rwmj
This reminds me of the OCaml "Ancient" library[1] that I wrote. It lets you
have an extra heap of OCaml objects which can be stored in an mmap'd file and
even shared between processes. The sharing must be read-only and has a number
of problems: for example, everything has to be mapped at the same address,
which conflicts with ASLR. We didn't use pointer<->offset conversion as done
in this article because it would touch every page and you'd end up with no
sharing.

I originally wrote it to analyze large data sets (where "large" meant 16+ GB
which was much larger than the commonly available RAM at the time).

It integrated nicely with OCaml. You could create ordinary OCaml objects then
incrementally "mark" them so those objects would be moved (recursively) into
the ancient heap. The objects could still be accessed as if they were ordinary
OCaml objects even if they were on the ancient heap. On the other side you
could mmap a previous heap and access the objects as if they were regular
OCaml objects directly (with a few shortcomings - see README). Depending on
your access pattern this worked well even if available RAM was much smaller
than the size of the data set.
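
A minimal Python sketch of the "mmap a previous heap" idea, assuming a made-up file of packed 64-bit integers rather than Ancient's actual heap format. The file is mapped read-only and a value is decoded in place, so only the pages that are actually touched need to be resident:

```python
import mmap
import os
import struct
import tempfile

def write_int64_file(path, values):
    """Write values as packed little-endian int64 (a hypothetical format for this sketch)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}q", *values))

def read_at(path, index):
    """Map the file read-only and decode one element without reading the rest."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # unpack_from decodes directly from the mapping at the given byte offset.
            (value,) = struct.unpack_from("<q", m, index * 8)
            return value

def demo():
    """Write 1000 integers to a temporary file, then fetch one through the mapping."""
    path = os.path.join(tempfile.mkdtemp(), "ints.bin")
    write_int64_file(path, list(range(1000)))
    return read_at(path, 123)
```

Whether this beats a plain read depends on the access pattern, as noted above: random lookups into a large file are where the mapping pays off.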

[1] [http://git.annexia.org/?p=ocaml-ancient.git;a=blob;f=README.txt](http://git.annexia.org/?p=ocaml-ancient.git;a=blob;f=README.txt)

~~~
jamii
I used Ancient years ago for
[https://github.com/jamii/texsearch](https://github.com/jamii/texsearch) to
stop the GC pauses caused by pointlessly traversing the huge immutable index.
I don't know how else I would have met the latency requirements. Thanks for
writing it :)

------
dan-robertson
I suppose the real Julia trick used to make this nice is the generated
function. There isn’t really anything like this in C without code generation.

Other languages with macros can do this sort of thing but one then needs to
“derive blob” or something like that.

In Haskell one could probably do it with Generic (ie the compiler generates a
data representation of the type; you write a function to go from this
representation to your deserialisation function; and then the compiler does a
crazy amount of inlining).

An alternative way to do this in an object oriented language is with a
different metaclass that does the appropriate dereferencing. But this risks
being slow if not compiled well.
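
As a rough Python analogue of the metaclass idea, assuming a hypothetical two-field record layout: descriptors stand in for the generated accessors, and each attribute access decodes the field straight out of the underlying buffer. This is only a sketch of the approach, not how Blobs is implemented.

```python
import struct

class Field:
    """Descriptor that decodes one fixed-offset field from the owner's buffer."""

    def __init__(self, fmt, offset):
        self.fmt = fmt
        self.offset = offset

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Decode on access; nothing is copied out of the buffer ahead of time.
        (value,) = struct.unpack_from(self.fmt, obj._buf, obj._base + self.offset)
        return value

class PointView:
    """View over 12 bytes: int32 key at offset 0, float64 weight at offset 4 (hypothetical layout)."""

    SIZE = 12
    key = Field("<i", 0)
    weight = Field("<d", 4)

    def __init__(self, buf, base=0):
        self._buf = buf
        self._base = base

# Two records packed back to back, with no padding.
buf = struct.pack("<id", 7, 2.5) + struct.pack("<id", 8, -1.0)
first = PointView(buf)
second = PointView(buf, PointView.SIZE)
```

In CPython every such access goes through dynamic lookup, which is the kind of slowness risk mentioned above.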

~~~
jamii
Also being able to remove all the dispatch and stack-allocate everything.
Haskell would probably do a good job of that too but in Python or JS, even
with codegen, it would be hard to avoid heap allocation. The simple examples
here might fall to escape analysis, but in production code the intermediate
Blob values typically cross a lot of function boundaries.

------
isoos
A similar tech is flatbuffers:
[https://github.com/google/flatbuffers](https://github.com/google/flatbuffers)

~~~
vvanders
Flatbuffers is pretty great. One side benefit of these types of technologies
is that you can also use them to structure your memory accesses.

I've done this in Java to do cache-aware reading with pretty great success.

~~~
vladf
Do you mind elaborating?

~~~
blitmap
I am not OP but I'm imagining data-oriented design. So like, things are
structured and then serialized - and at some other end deserialized in place
ready to read cache-coherently?

~~~
vvanders
Yup, pretty much. Zero-copy deserialization implies one large read into a block
of memory (or mmap), which forces a specific layout. From there it's
straightforward to follow standard data-oriented design principles.
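
To make the layout point concrete, here is a small Python sketch, assuming a hypothetical blob format with a count header followed by two contiguous columns. One read yields the whole block, and `memoryview.cast` exposes each column as a typed view without copying, so a scan over one column touches only that column's bytes.

```python
import struct

def build_blob(xs, ys):
    """Pack a uint64 count, then an int64 column, then a float64 column (made-up format).

    "=" means native byte order with standard sizes and no padding, which
    matches the native item formats that memoryview.cast uses below.
    """
    n = len(xs)
    return struct.pack(f"=Q{n}q{n}d", n, *xs, *ys)

def columns(blob):
    """Return zero-copy typed views over the two columns."""
    view = memoryview(blob)
    (n,) = struct.unpack_from("=Q", view, 0)
    xs = view[8 : 8 + 8 * n].cast("q")            # int64 column
    ys = view[8 + 8 * n : 8 + 16 * n].cast("d")   # float64 column
    return xs, ys
```

For example, `xs, ys = columns(build_blob([1, 2, 3], [0.5, 1.5, 2.5]))` gives two views backed by the same bytes object, and `sum(xs)` walks only the integer column.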

------
cdsousa
That is pretty amazing! I wonder what it would look like if it were written in
C. Does anyone have links to "Blobs" implementations in C?

~~~
jamii
If you allow codegen in advance it looks pretty similar eg
[https://capnproto.org/cxx.html](https://capnproto.org/cxx.html)

------
comnetxr
Cool! Could this be used to make on-disk DataFrames that store arbitrary Julia
types? Serializing/deserializing such dataframes with JLD has given me lots of
trouble (both too slow and also sometimes the files get corrupted in ways that
can't be debugged.)

~~~
jamii
JuliaDB already does this eg `loadtable("foo.csv", output="foo/")` will store
the data in "foo/" in binary form and mmap it.

~~~
comnetxr
Thanks for the tip Jamii, and for the nice original post. This is pretty close
to what I've been doing, but since I'm storing Julia values rather than
numeric data I still need to convert my data to strings to store in
"foo.csv" and then parse back from strings when reading in individual values
using `loadtable("foo.csv", output="foo/", colparsers=[parserfortype1,
parserfortype2, ...])`. As a parser for nested Julia types is fairly
complicated (structs containing dictionaries as fields which contain nested
arrays as values which contain ...), it'd be nice to go directly to the binary
form used by JuliaDB without converting to/from the string intermediate
representation needed by the csv format. Perhaps Blob isn't quite the
abstraction needed, but if such a thing exists I'd love to know about it!

------
toolslive
At some point (maybe still? I haven't touched Windows in over a decade) the
Windows registry had this: you could just load the entire registry into a
buffer and then just cast that buffer pointer to the root type.
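
The "cast the buffer to the root type" trick can be sketched in Python with `ctypes`, which is about as close as the language gets to a C pointer cast. The `Header` layout below is hypothetical and is not the registry's real format.

```python
import ctypes

class Header(ctypes.LittleEndianStructure):
    """Hypothetical root record: a magic number, a version, and an entry count."""

    # 4 + 2 + 2 bytes; every field is naturally aligned, so there is no padding.
    _fields_ = [
        ("magic", ctypes.c_uint32),
        ("version", ctypes.c_uint16),
        ("entries", ctypes.c_uint16),
    ]

# Pretend these 8 bytes were just read from disk into a mutable buffer.
raw = bytearray(b"\x66\x67\x65\x72\x01\x00\x03\x00")

# from_buffer shares memory with `raw`; nothing is parsed or copied up front.
root = Header.from_buffer(raw)
```

Because `root` aliases `raw`, a later write to the buffer is visible through the struct, which is both the appeal and the hazard of this style.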

------
lmeyerov
Curious if making the Julia reader for Apache Arrow support Plasma may be more
future friendly:

* [https://github.com/ExpandingMan/Arrow.jl](https://github.com/ExpandingMan/Arrow.jl)

* [https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-ob...](https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)

~~~
jamii
Arrow is basically a dataframe format. It's not really suitable for building
data- _structures_. Eg one of the things Blobs is being used to implement is a
Bε-tree
([http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf](http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf)).

Blobs is less a data-format and more a replacement for C's ability to cast
arbitrary memory locations to structs.

------
lifthrasiir
That kind of reminds me of the dumpable [1] library, which does a similar thing in C++.

[1] [https://github.com/ipkn/dumpable](https://github.com/ipkn/dumpable)

~~~
jamii
Huh, yeah, this is surprisingly similar, even down to needing a custom pointer
type to do the offset<->pointer conversion -
[https://github.com/ipkn/dumpable/blob/master/dptr.h#L64-L69](https://github.com/ipkn/dumpable/blob/master/dptr.h#L64-L69)
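
The offset<->pointer idea can be sketched in Python as follows. The node layout and helper names are hypothetical, and this is not how dumpable or Blobs actually encode things; the point is only that links stored as offsets relative to the start of the buffer stay valid wherever the buffer ends up in memory.

```python
import struct

# Each node is (value, offset of the next node from the buffer start); -1 marks the end.
NODE = struct.Struct("<qq")

def build_list(values):
    """Serialize values as a linked list whose links are buffer-relative offsets."""
    buf = bytearray(NODE.size * len(values))
    for i, v in enumerate(values):
        nxt = (i + 1) * NODE.size if i + 1 < len(values) else -1
        NODE.pack_into(buf, i * NODE.size, v, nxt)
    return bytes(buf)

def walk(buf, head=0):
    """Follow the offset links, decoding each node in place."""
    out = []
    off = head if len(buf) else -1
    while off != -1:
        value, off = NODE.unpack_from(buf, off)
        out.append(value)
    return out
```

Copying the bytes elsewhere, for example `walk(bytearray(build_list([10, 20, 30])))`, needs no fix-up pass, whereas raw addresses would have to be rewritten or mapped at a fixed location.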

------
jnordwick
KDB+ the database and the underlying K language do this. The entire heap and
all data structures are kept so that they can be written to the wire, read, or
mapped efficiently.

------
a-dub
Looks cool. Actually gave Julia 1.0 a whirl just now. The best plotting
library, Plots.jl, doesn't work with 1.0. Oh well, guess it's not cooked yet.

------
sgt101
So, I have to check that libraries don't do this if I am to rely on the idea
that Julia offers safety?

~~~
myrryr
If that worries you, I wouldn't look under the covers of almost any database
you run, or any network stack you have.

Graphics card drivers are WAY out. File systems too.

Julia offers as much safety as most languages. You would be hard pressed to
find a language which doesn't let you pull something like this.

Julia wouldn't solve the two-language problem if there were a lot of things
you would have to drop back to C to do....

~~~
sgt101
Point taken. It would be nice to be able to query my code base to see if
unsafe methods have been used (and where) in the executable.

~~~
myrryr
Rust is really good that way. It would be nice to have for sure.

~~~
frankmcsherry
This is a Rust version, btw:
[https://github.com/frankmcsherry/abomonation](https://github.com/frankmcsherry/abomonation)

It still needs me to opt in to declaring methods as unsafe, rather than
automatically surfacing my uses of unsafe. I could definitely "trick users" by
not doing that and pretending it always works (it doesn't).

------
ChrisRackauckas
Awesome article and demonstration! Blob seems to be a pretty nice package as
well.

------
mariogintili
Can someone explain to me why this is on the homepage? Not being sarcastic. I
genuinely want to know why zero-copy deserialization in Julia is relevant to
most of Hacker News's community.

~~~
dlahoda
Because I want to write the same stuff in F#. Good question overall. Keep going.

------
nickbauman
The fastest way to expedite a job is not to have to do it in the first place.
A homoiconic language doesn't require de/serialization. So many problems
disappear when you have this design.

~~~
darn
You mean structured editors?

