
LoCal: A Language for Programs Operating on Serialized Data - matt_d
https://pldi19.sigplan.org/details/pldi-2019-papers/28/LoCal-A-Language-for-Programs-Operating-on-Serialized-Data
======
kentonv
TL;DR: They created a functional programming language which uses an in-memory
representation of data structures that avoids pointers, instead organizing
object trees in preorder, much like most serialization formats would use.

To support access patterns that aren't strictly preorder, the compiler
(optionally?) injects offsets into the structure, to allow "skipping over"
branches to get to the one you want.
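A minimal sketch of that layout (the tag bytes, 4-byte skip offsets, and little-endian encoding here are illustrative assumptions, not LoCal's actual representation):

```python
import struct

LEAF, NODE = 0, 1  # assumed tag values, for illustration only

def serialize(tree, out):
    """Preorder-serialize a tree into a flat buffer; tuples are nodes,
    ints are leaves. Each node carries a 4-byte offset giving the byte
    size of its left subtree, so a reader can skip straight to the
    right child without walking the left one."""
    if isinstance(tree, int):
        out += struct.pack("<Bq", LEAF, tree)       # tag + 8-byte value
    else:
        left, right = tree
        out.append(NODE)
        slot = len(out)
        out += b"\x00\x00\x00\x00"                  # reserve offset slot
        start = len(out)
        serialize(left, out)
        out[slot:slot + 4] = struct.pack("<I", len(out) - start)  # backpatch
        serialize(right, out)

def right_child_pos(buf, pos):
    """Jump past the left subtree using the stored skip offset."""
    assert buf[pos] == NODE
    (skip,) = struct.unpack_from("<I", buf, pos + 1)
    return pos + 1 + 4 + skip

buf = bytearray()
serialize(((1, 2), 3), buf)
p = right_child_pos(buf, 0)          # lands directly on Leaf 3
tag, value = struct.unpack_from("<Bq", buf, p)
print(value)   # -> 3
```

A strictly preorder consumer never needs the offsets; they only pay off for access patterns that want to skip a subtree.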

Apparently, any value can optionally be a pointer, in order to allow for
aliasing. But this is the exception rather than the norm. Aliasing is of
course critical to allow data structures to be updated -- in the FP copy-on-
write sense, where you clone and update all objects on the path from the root
to the modified one.

This seems to be a nice fit for functional programming and is pretty cool in
that context. I suspect it would be a lot more difficult to take this approach
in an imperative model.

Amusingly, in a way this is the opposite of what Cap'n Proto does (disclosure:
I'm the author of Cap'n Proto). LoCal is a programming language whose in-
memory representation looks like a serialization format (preorder; no
pointers). Cap'n Proto is a serialization format which looks like an in-memory
representation (arbitrary order; using pointers).

They have a table of benchmarks showing LoCal being faster than Cap'n Proto.
It would be nice to see code for these benchmarks. It sounds like the test
case is a binary tree. This is an almost pathologically pointer-heavy use
case, and as a result would tend to show LoCal in the best possible light and
Cap'n Proto in the worst. Cap'n Proto will spend 8 bytes on every pointer;
LoCal will spend zero.

I take issue with the "treeInsert" benchmark, in which the paper claims Cap'n
Proto is O(n) whereas LoCal is O(log(N)). It sounds like for some reason the
authors decided they needed to copy the entire Cap'n Proto message before they
could mutate it. Cap'n Proto supports in-place mutation, so should be
O(log(N)) reads and O(1) writes (compared to O(log(N)) reads and writes for
LoCal). (Admittedly in-place modification of files loaded from disk is tricky
to set up at present, but this is an implementation limitation, not a
theoretical one -- and it's not clear to me that this test was loading files
from disk anyhow.)

Maybe they decided that a complete copy was needed in order to get FP copy-on-
write semantics, but that's not really true either. In principle, it would be
possible to create a Cap'n Proto implementation that can start from a file on
disk and modify it in an object-by-object copy-on-write way, appending
new/modified objects to the end of the file in order to avoid destroying any
information. At present this hasn't been implemented, but that's largely
because no one has had a pressing need for it, to my knowledge.
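A toy version of that append-only copy-on-write discipline, with indices into a growing list standing in for file offsets (a sketch of the idea, not Cap'n Proto's actual format):

```python
# Append-only copy-on-write BST: nodes, once written, never change.
# An insert appends fresh copies of the O(log n) nodes on the path
# from root to the insertion point, so old versions stay readable.

EMPTY = -1  # "null" offset

def insert(buf, root, key):
    """Insert key, returning the offset of the new root; buf only grows."""
    if root == EMPTY:
        buf.append((key, EMPTY, EMPTY))
        return len(buf) - 1
    k, left, right = buf[root]
    if key < k:
        left = insert(buf, left, key)
    elif key > k:
        right = insert(buf, right, key)
    buf.append((k, left, right))     # copied node on the root-to-leaf path
    return len(buf) - 1

def contains(buf, root, key):
    while root != EMPTY:
        k, left, right = buf[root]
        if key == k:
            return True
        root = left if key < k else right
    return False

buf, root = [], EMPTY
for k in [5, 2, 8, 1]:
    root = insert(buf, root, k)
old_root = root
root = insert(buf, root, 7)
print(contains(buf, root, 7))       # -> True
print(contains(buf, old_root, 7))   # -> False: old version is preserved
```

Because nothing is overwritten, the pre-insert tree remains intact behind its old root offset, exactly the property needed for FP copy-on-write semantics.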

~~~
rrnewton
Hi Kenton,

TL;DR: we'd love to get some pointers on our Cap'n Proto usage -- let's set up
a call between all us authors and you.

> difficult to take this approach in an imperative model.

Indeed, we are eventually adding mutation of fixed sized fields, but mutating
the structure of the tree is really problematic with these kinds of encodings.
But one silver lining of this functional approach is that for some tree
traversals the out-of-place copying Gibbon version touches less total memory
than the idiomatic C version that updates the tree in place.

> LoCal is a programming language whose in-memory representation looks
> like a serialization format (preorder; no pointers). Cap'n Proto is
> a serialization format which looks like an in-memory representation
> (arbitrary order; using pointers).

Yes, well put! We're big fans of Cap'n Proto. The comparison in our evaluation
is _not_ apples-to-apples, because Gibbon and Cap'n Proto serve very different
purposes. We used Cap'n Proto here mainly because of a paucity of anything
close to Gibbon in this space.

While there's a bit of discussion of differences in 7.2, I think my student,
Michael Vollmer, emphasized this a bit better in the talk (which should be
posted online at some point).

> They have a table of benchmarks showing LoCal being faster than
> Cap'n Proto. It would be nice to see code for these benchmarks. It
> sounds like the test case is a binary tree.

Yes: several operations on binary trees, plus the benchmarks reading tweets
and large ASTs from disk.

The compiler is here: [https://github.com/iu-parfunc/gibbon](https://github.com/iu-parfunc/gibbon) and the benchmarks from the paper are here: [https://github.com/iu-parfunc/tree-velocity-benchdata](https://github.com/iu-parfunc/tree-velocity-benchdata). (The ACM digital library entry for the paper is supposed to link the artifact, but it seems it's not posted yet.)

> "treeInsert" .. Cap'n Proto supports in-place mutation, so should be
> O(log(N)) reads and O(1) writes (compared to O(log(N)) reads and
> writes for LoCal).

The "insertDestr" variant is the in-place tree-insert with Cap'n Proto
(1.72μs). The code is here [1] and capnp data definition here [2]. Maybe it
could be further improved.

> In principle, it would be possible to create a Cap'n Proto
> implementation that can start from a file on disk and modify it in an
> object-by-object copy-on-write way, appending new/modified objects to
> the end

Yes, that would work the same as the Gibbon version, tacking on a new chunk of
`log(N)` bytes on each insert. I submit that no one has needed this in Cap'n
Proto because they have not used it as the in-memory representation for all
values in a functional-language compiler ;-).

[1] [https://github.com/iu-parfunc/tree-velocity-benchdata/blob/m...](https://github.com/iu-parfunc/tree-velocity-benchdata/blob/master/pldi19/litmus_tests/capnproto/test_litmus_insert_destructive/main.cpp)
[2] [https://github.com/iu-parfunc/tree-velocity-benchdata/blob/m...](https://github.com/iu-parfunc/tree-velocity-benchdata/blob/master/pldi19/litmus_tests/capnproto/test_litmus_insert_destructive/tree.capnp)

~~~
kentonv
Thanks for the updates! I had seen insertDestr in the table but ctrl-f didn't
find any references to it, so I wasn't sure what it meant. Somehow I think I
didn't manage to realize "Destr" meant "destructive" at the time, though it
seems obvious in retrospect.

~~~
rrnewton
That really stupid naming was a function of fitting that table into one column
;-). Page limits.

------
firethief
I can't believe this. I had the same idea and I've been working on a demo. The
abstract and introduction of this paper could be describing my project,
verbatim.

Well, there's nothing more powerful than an idea whose time has come. This is
going to shake up anything that operates on hierarchical data in passes,
because traditional structure layouts have everyone doing a lot of pointer
chasing. That approach got old when CPUs got fast.

~~~
firethief
Now that I've read the full paper: they've taken all the research novelty
out of my sails. I'm amazed how parallel our designs are. I guess once you've
gone down the rabbit hole of committing to operating on a dense serialized
format, things like decoupling application data types from concrete
representations suggest themselves. I'd be sunk if I were doing this for
Science.

But I'm doing this for Engineering, and I like my implementation choices. I
have achieved some properties that I think are important for practical
integration with other code. The working name of my project is Re, because the
core data structure is reified traversals and operating on them involves
reifying some other things that aren't normally reified.

Edit to add: my elevator pitch (needs condensing, I haven't pitched it to
anyone but my mom):

It used to be common practice to use tree structures to represent linear data.
As CPUs have become faster than RAM the pointer-heaviness of this approach has
made the constant factors awful, and it's now avoided when possible. It's now
actually efficient to do the opposite and represent trees as linear data when
you can. This can be done explicitly, e.g. in the WASM format, but such
structures can't be operated on directly unless it's done "in the raw",
outside the type system and other facilities of the language. What's missing
is a generic abstraction layer that maps between operations on conceptual
application objects and linearized trees. I've found that once you have that,
you can also do things with linearized data that would be impractical with
trees.

(Actually, it happened the other way around. Some time ago I prototyped a
regex-like engine for finding and replacing fragments of ASTs
[[https://blog.lambdaverse.org/comacro/](https://blog.lambdaverse.org/comacro/)],
and I discovered operating directly on serialized representation enabled that.
I haven't had a chance to make a production implementation yet. Then recently
I was working on tree-structured template inference to turn collections of
webpages back into the databases that generated them with minimal interaction,
and I realized... the comacro construction is a perfect fit here. So it became
an excuse to develop that idea ostensibly as part of a project I could
actually make money with.)

~~~
chubot
FWIW I've also gone down a similar rabbit hole, and written a bunch of code
too in that direction. Resources here:

[https://github.com/oilshell/oil/wiki/Compact-AST-Representat...](https://github.com/oilshell/oil/wiki/Compact-AST-Representation)

I wouldn't be surprised that somebody else thought of the same idea -- that
happens all the time. The canonical example is that two people independently
invented calculus :)

[https://en.wikipedia.org/wiki/Multiple_discovery](https://en.wikipedia.org/wiki/Multiple_discovery)

I also found that the academic projects Morbig/Colis and Smoosh independently
did some of the same things I'm doing in Oil, at pretty much the exact same
time (2016-2017).
(2016-2017).

Links here:
[https://github.com/oilshell/oil/wiki/ExternalResources](https://github.com/oilshell/oil/wiki/ExternalResources)

Some things are just "in the air". I think there are probably a lot more
precursors to what you were doing than you were aware of (some of which may be
on that wiki page -- I also have a private wiki page called "Avoiding
Pointers" with some more academic papers, if you're interested).

I mention Ken Thompson's encodings of regexes, and that was in the '60s. He
didn't motivate it by "avoiding pointers", and it's specific to a particular
application rather than a general purpose language, but the idea is the same.

The code I wrote was inspired by capnproto to a large extent, and the author
chimed in here with the similarities to this language. protobufs/capnproto are
basically a language-independent type system, and it's not that big a jump to
go from a type system to a general programming language.

~~~
sitkack
Funny, because I've been kicking around an idea to remove pointers entirely
from a language, the compiler and the implementation. Maybe even the ISA. So
heavy, literal, concrete, absolute. Pointers do _not_ go with the flow.

Pointers are a junk-food abstraction that needs to go away.

------
_pmf_
Every serializer/deserializer for nested, dynamically sized blocks of binary
data contains a buggy, ad hoc variant of this. I'd already be happy with
something like Erlang's standard binary matching library[0] as part of each
language's stdlib.

Block deduplication, block reference counting (and the consequent need to
in-file-GC unused blocks), and n:m references all arise in some formats, but
they don't seem to be part of this language?

[0] [http://erlang.org/doc/efficiency_guide/binaryhandling.html](http://erlang.org/doc/efficiency_guide/binaryhandling.html)
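For comparison, here is the kind of ad hoc length-prefixed parsing that Erlang's binary matching expresses as a single declarative pattern (a minimal Python sketch; the 4-byte little-endian framing is an assumed format, not any particular protocol):

```python
import struct

def parse_blocks(data):
    """Parse a sequence of length-prefixed binary blocks: each block is
    a 4-byte little-endian length followed by that many payload bytes.
    The bounds checks below are exactly the fiddly, bug-prone part that
    Erlang's <<Len:32/little, Payload:Len/binary, Rest/binary>> pattern
    handles for you."""
    blocks, pos = [], 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated length prefix")
        (n,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if pos + n > len(data):
            raise ValueError("truncated payload")
        blocks.append(data[pos:pos + n])
        pos += n
    return blocks

data = struct.pack("<I", 3) + b"foo" + struct.pack("<I", 2) + b"hi"
print(parse_blocks(data))   # -> [b'foo', b'hi']
```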

------
rrnewton
One thing that might not be clear from the paper: while the Gibbon compiler
currently handles a small subset of Haskell, we plan to deploy it by
integrating with GHC. That is, we want to make it possible to compile a subset
of your Haskell program through Gibbon, while seamlessly calling that
accelerated code from regular Haskell code utilizing the full library
ecosystem.

------
perfunctory
Definitely going to dive into this. The majority of enterprise software is
processing serialized data - send data from the browser to the middleware,
mung it, send it to the database, and then all the way back in reverse order.
And we don't have good tool support for this.

~~~
pavlov
Is LoCal going to help with this? The abstract suggests that they've figured
out a unified representation for in-memory and disk/wire layout. That's an
interesting performance feature for new applications, but doesn't seem to do
anything to help existing serialization wranglers since you can't make the old
formats go away.

~~~
rrnewton
Absolutely true. It's future work to train our Gibbon/LoCal compiler to
generate code operating over more interesting, existing binary formats (CBOR,
Avro, CapNP...), some of which are a better fit than others. Or for text
formats we could focus on writing high-performance parsers and fusing them
with recursive functions that consume the result. Lots more to do here.

~~~
sitkack
Could you differentiate LoCal programs and generate succinct data structures?
Apply transforms to make the code incremental?

Sounds like magic, so I make requests for more magic.

------
misterman0
This is interesting. Without fully understanding the design of this system, I
wonder how far it scales. Can you process big data with LoCal?

~~~
jacobush
Is it web scale? :-) Sorry couldn’t resist...
[https://m.youtube.com/watch?v=b2F-DItXtZs](https://m.youtube.com/watch?v=b2F-DItXtZs)

~~~
misterman0
I actually thought that by now, in 2019, this forum would have realized that
big data is real. It's a thing. Some of us are forced to deal with it. It's a
hard problem that is well-suited for having discussions around on forums that
are populated by intelligent people, but apparently it's not a suitable topic
on this particular forum. I will certainly keep that in mind from hereon out.

~~~
tempguy9999
Big data is mainly a marketing term to me. I've been involved in a Big Data
project that only existed because someone wanted to massage their CV with
fancy tech. With current multicore machines and SSDs or cheap spinny disks you
can hold and process incredible amounts of data.

So, serious question, how do you define big data? By some size threshold, or
"it can't possibly be done on a single box", or something else?

And if I may, what is the BD work that you do (feel free to decline that one).

TIA

------
enriquto
In a few years, young people will rediscover unix and we will have gone full
circle.

------
the_duke
So after skimming the paper, this seems like a formalization of something like
Protobuf/CaptnProto with native language support.

Interesting, but not necessarily very practical.

~~~
firethief
Cap'n Proto is like a RAM format that can be directly serialized. This is a
serialization format that can be directly worked with. They both allow
skipping deserialization, but that's not the main draw of this approach. The
ability to design data representation around expected access patterns can make
traversal faster and data more compact even if you don't plan to ever
"serialize" it.

