Hacker News
LoCal: A Language for Programs Operating on Serialized Data (sigplan.org)
89 points by matt_d on June 24, 2019 | 27 comments



TL;DR: They created a functional programming language whose in-memory representation of data structures avoids pointers, instead organizing object trees in preorder, much like most serialization formats do.

To support access patterns that aren't strictly preorder, the compiler (optionally?) injects offsets into the structure, to allow "skipping over" branches to get to the one you want.
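To make the layout concrete, here's a toy sketch in Python of a preorder tree encoding with skip offsets. To be clear, this is my own invented byte format for illustration, not LoCal's actual encoding: each interior node is a tag byte followed by a 4-byte offset pointing just past its left subtree, so a reader can jump straight to the right child.

```python
import struct

LEAF, NODE = 0, 1

def serialize(tree, buf=None):
    """tree is either an int (leaf) or a (left, right) tuple (node)."""
    if buf is None:
        buf = bytearray()
    if isinstance(tree, int):
        buf.append(LEAF)
        buf += struct.pack("<q", tree)       # 8-byte leaf payload
    else:
        left, right = tree
        buf.append(NODE)
        off_pos = len(buf)
        buf += struct.pack("<I", 0)          # placeholder skip offset
        serialize(left, buf)
        # Patch in where the right subtree starts, so a reader can
        # skip the left subtree without walking it.
        buf[off_pos:off_pos + 4] = struct.pack("<I", len(buf))
        serialize(right, buf)
    return buf

def rightmost(buf, pos=0):
    """Follow only right children -- O(depth) thanks to the offsets."""
    while buf[pos] == NODE:
        pos = struct.unpack_from("<I", buf, pos + 1)[0]
    return struct.unpack_from("<q", buf, pos + 1)[0]
```

The `rightmost` walk touches only O(depth) bytes; without the injected offsets, a non-preorder access like this would have to scan every left subtree along the way.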

Apparently, any value can optionally be a pointer, in order to allow for aliasing. But this is the exception rather than the norm. Aliasing is of course critical to allow data structures to be updated -- in the FP copy-on-write sense, where you clone and update all objects on the path from the root to the modified one.

This seems to be a nice fit for functional programming and is pretty cool in that context. I suspect it would be a lot more difficult to take this approach in an imperative model.

Amusingly, in a way this is the opposite of what Cap'n Proto does (disclosure: I'm the author of Cap'n Proto). LoCal is a programming language whose in-memory representation looks like a serialization format (preorder; no pointers). Cap'n Proto is a serialization format which looks like an in-memory representation (arbitrary order; using pointers).

They have a table of benchmarks showing LoCal being faster than Cap'n Proto. It would be nice to see code for these benchmarks. It sounds like the test case is a binary tree. This is an almost pathologically pointer-heavy use case, and as a result would tend to show LoCal in the best possible light and Cap'n Proto in the worst. Cap'n Proto will spend 8 bytes on every pointer; LoCal will spend zero.

I take issue with the "treeInsert" benchmark, in which the paper claims Cap'n Proto is O(n) whereas LoCal is O(log n). It sounds like for some reason the authors decided they needed to copy the entire Cap'n Proto message before they could mutate it. Cap'n Proto supports in-place mutation, so it should be O(log n) reads and O(1) writes (compared to O(log n) reads and writes for LoCal). (Admittedly, in-place modification of files loaded from disk is tricky to set up at present, but this is an implementation limitation, not a theoretical one -- and it's not clear to me that this test was loading files from disk anyhow.)

Maybe they decided that a complete copy was needed in order to get FP copy-on-write semantics, but that's not really true either. In principle, it would be possible to create a Cap'n Proto implementation that can start from a file on disk and modify it in an object-by-object copy-on-write way, appending new/modified objects to the end of the file in order to avoid destroying any information. At present this hasn't been implemented, but that's largely because no one has had a pressing need for it, to my knowledge.
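The append-only copy-on-write scheme is easy to demo in miniature. Here's a hedged sketch (a Python list stands in for the file; the node layout is invented, not Cap'n Proto's): each insert appends only the O(log n) rewritten nodes on the root-to-leaf path, and old roots stay valid because nothing is ever overwritten.

```python
EMPTY = -1  # sentinel "null pointer"

def insert(store, root, key):
    """Append a rewritten path for `key`; return the new root index."""
    if root == EMPTY:
        store.append((key, EMPTY, EMPTY))
        return len(store) - 1
    k, left, right = store[root]
    if key < k:
        new_left = insert(store, left, key)
        store.append((k, new_left, right))      # copy node on the path
    else:
        new_right = insert(store, right, key)
        store.append((k, left, new_right))
    return len(store) - 1

def contains(store, root, key):
    while root != EMPTY:
        k, left, right = store[root]
        if key == k:
            return True
        root = left if key < k else right
    return False
```

Since old nodes are never destroyed, every previous root still describes a consistent tree -- exactly the FP copy-on-write semantics described above, with O(log n) bytes appended per insert.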


Hi Kenton,

TL;DR: we'd love to get some pointers on our Cap'n Proto usage -- let's set up a call between all us authors and you.

  > difficult to take this approach in an imperative model.
Indeed, we are eventually adding mutation of fixed-size fields, but mutating the structure of the tree is really problematic with these kinds of encodings. One silver lining of this functional approach is that for some tree traversals, the out-of-place copying Gibbon version touches less total memory than the idiomatic C version that updates the tree in place.
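The distinction is easy to see with a toy flat buffer (layout invented for illustration): a fixed-width field can be patched in place, while anything that grows the encoding would shift every byte after it.

```python
import struct

# Two fixed-width 8-byte fields in a flat buffer.
buf = bytearray(struct.pack("<qq", 10, 20))

# Mutating a fixed-size field is a simple in-place overwrite,
# because its byte width never changes.
struct.pack_into("<q", buf, 8, 99)
assert struct.unpack("<qq", buf) == (10, 99)

# By contrast, inserting a node (growing the encoding) would shift
# every byte after the insertion point and invalidate all downstream
# offsets -- hence the difficulty with structural mutation.
```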

  > LoCal is a programming language whose in-memory representation looks
  like a serialization format (preorder; no pointers). Cap'n Proto is
  a serialization format which looks like an in-memory representation
  (arbitrary order; using pointers).
Yes, well put! We're big fans of Cap'n Proto. The comparison in our evaluation is not apples-to-apples, because Gibbon and Cap'n Proto serve very different purposes. We used Cap'n Proto here mainly because of a paucity of anything close to Gibbon in this space.

While there's a bit of discussion of differences in 7.2, I think my student, Michael Vollmer, emphasized this a bit better in the talk (which should be posted online at some point).

  > They have a table of benchmarks showing LoCal being faster than
  Cap'n Proto. It would be nice to see code for these benchmarks. It
  sounds like the test case is a binary tree.
Yes: several operations on binary trees, plus the benchmarks reading tweets and large ASTs from disk.

The compiler is here: https://github.com/iu-parfunc/gibbon And the benchmarks from the paper are here: https://github.com/iu-parfunc/tree-velocity-benchdata (The ACM digital library entry for the paper is supposed to link the artifact, but it seems it's not posted yet.)

  > "treeInsert" .. Cap'n Proto supports in-place mutation, so should be
  O(log(N)) reads and O(1) writes (compared to O(log(n)) reads and
  writes for LoCal).
The "insertDestr" variant is the in-place tree-insert with Cap'n Proto (1.72μs). The code is here [1] and capnp data definition here [2]. Maybe it could be further improved.

  > In principle, it would be possible to create a Cap'n Proto
  implementation that can start from a file on disk and modify it in an
  object-by-object copy-on-write way, appending new/modified objects to
  the end
Yes, that would work the same as the Gibbon version, tacking on a new chunk of `log(N)` bytes on each insert. I submit that no one has needed this in Cap'n Proto because they have not used it as the in-memory representation for all values in a functional-language compiler ;-).

[1] https://github.com/iu-parfunc/tree-velocity-benchdata/blob/m... [2] https://github.com/iu-parfunc/tree-velocity-benchdata/blob/m...


Thanks for the updates! I had seen insertDestr in the table but ctrl-f didn't find any references to it, so I wasn't sure what it meant. Somehow I think I didn't manage to realize "Destr" meant "destructive" at the time, though it seems obvious in retrospect.


That really stupid naming was a function of fitting that table into one column ;-). Page limits.


I can't believe this. I had the same idea and I've been working on a demo. The abstract and introduction of this paper could be describing my project, verbatim.

Well, there's nothing more powerful than an idea whose time has come. This is going to shake up anything that operates on hierarchical data in passes, because traditional structure layouts have everyone doing a lot of pointer chasing. That approach got old when CPUs got fast.


Now that I've read the full paper: they've taken all the research novelty out of my sails. I'm amazed how parallel our designs are. I guess once you've gone down the rabbit hole of committing to operating on a dense serialized format, things like decoupling application data types from concrete representations suggest themselves. I'd be sunk if I were doing this for Science.

But I'm doing this for Engineering, and I like my implementation choices. I have achieved some properties that I think are important for practical integration with other code. The working name of my project is Re, because the core data structure is reified traversals and operating on them involves reifying some other things that aren't normally reified.

Edit to add: my elevator pitch (needs condensing, I haven't pitched it to anyone but my mom):

It used to be common practice to use tree structures to represent linear data. As CPUs have become faster than RAM the pointer-heaviness of this approach has made the constant factors awful, and it's now avoided when possible. It's now actually efficient to do the opposite and represent trees as linear data when you can. This can be done explicitly, e.g. in the WASM format, but such structures can't be operated on directly unless it's done "in the raw", outside the type system and other facilities of the language. What's missing is a generic abstraction layer that maps between operations on conceptual application objects and linearized trees. I've found that once you have that, you can also do things with linearized data that would be impractical with trees.
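A trivial example of what "operated on directly" means here (a toy encoding of my own, nothing to do with any real project's representation): evaluating an expression tree stored as a flat preorder token list, in a single left-to-right scan, with no node objects or pointers at all.

```python
def evaluate(tokens, pos=0):
    """Evaluate a preorder token list; returns (value, next position)."""
    tok = tokens[pos]
    if tok == "+" or tok == "*":
        left, pos = evaluate(tokens, pos + 1)   # operands follow in order
        right, pos = evaluate(tokens, pos)
        return (left + right if tok == "+" else left * right), pos
    return tok, pos + 1                          # an int literal is a leaf

expr = ["*", "+", 1, 2, 3]       # preorder encoding of (1 + 2) * 3
value, _ = evaluate(expr)        # value == 9
```

The traversal's memory access is purely sequential, which is exactly the cache-friendliness argument above.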

(Actually, it happened the other way around. Some time ago I prototyped a regex-like engine for finding and replacing fragments of ASTs [https://blog.lambdaverse.org/comacro/], and I discovered operating directly on serialized representation enabled that. I haven't had a chance to make a production implementation yet. Then recently I was working on tree-structured template inference to turn collections of webpages back into the databases that generated them with minimal interaction, and I realized... the comacro construction is a perfect fit here. So it became an excuse to develop that idea ostensibly as part of a project I could actually make money with.)


FWIW I've also gone down a similar rabbit hole, and written a bunch of code too in that direction. Resources here:

https://github.com/oilshell/oil/wiki/Compact-AST-Representat...

I wouldn't be surprised that somebody else thought of the same idea -- that happens all the time. The canonical example is that two people independently invented calculus :)

https://en.wikipedia.org/wiki/Multiple_discovery

I also found that the academic projects Morbig/Colis and Smoosh independently did some of the same things I'm doing in Oil, at pretty much the exact same time (2016-2017).

Links here: https://github.com/oilshell/oil/wiki/ExternalResources

Some things are just "in the air". I think there are probably a lot more precursors to what you were doing than you were aware of (some of which may be on that wiki page -- I also have a private wiki page called "Avoiding Pointers" with some more academic papers, if you're interested).

I mention Ken Thompson's encodings of regexes, and that was in the 60's. He didn't motivate it by "avoiding pointers", and it's specific to a particular application rather than a general purpose language, but the idea is the same.

The code I wrote was inspired by capnproto to a large extent, and the author chimed in here with the similarities to this language. protobufs/capnproto are basically a language-independent type system, and it's not that big a jump to go from a type system to a general programming language.


Funny, because I've been kicking around an idea to remove pointers entirely from a language, the compiler, and the implementation. Maybe even the ISA. So heavy, literal, concrete, absolute. Pointers do not go with the flow.

Pointers are a junk-food abstraction that needs to go away.


It sounds like your work would make a great Show HN once it's ready for people to try. If you want to do that, email hn@ycombinator.com when you're ready and we'll give you some tips.


@firethief -- glad to hear you're interested in this idea too. Do you have an implementation for Re that we could learn more about? Maybe we could swap some ideas.


Every serializer/deserializer for nested, dynamically sized blocks of binary data contains a buggy, ad hoc variant of this. I'd already be happy with something like Erlang's standard binary matching library[0] as part of each language's stdlib.

Block deduplication, block reference counting (and the requirement to in-file-GC unused blocks), n:m references are things that arise in some formats that don't seem to be part of this language?

[0] http://erlang.org/doc/efficiency_guide/binaryhandling.html
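For illustration, here is the pattern in question as a hedged Python sketch (the `[len][payload]` framing is invented, but it's exactly the kind of ad hoc code every binary format ends up reimplementing):

```python
import struct

def read_blocks(buf):
    """Split a buffer of [4-byte len][payload] records into payloads."""
    blocks, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        blocks.append(bytes(buf[pos:pos + n]))
        pos += n
    return blocks

def write_blocks(payloads):
    """Inverse of read_blocks: length-prefix and concatenate payloads."""
    buf = bytearray()
    for p in payloads:
        buf += struct.pack("<I", len(p)) + p
    return buf
```

Erlang's bit syntax lets you write the read side declaratively as `<<N:32/little, Payload:N/binary, Rest/binary>>`; most other stdlibs make you hand-roll the cursor arithmetic above.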


One thing that might not be clear from the paper: while the Gibbon compiler currently handles a small subset of Haskell, we plan to deploy it by integrating with GHC. That is, we want to make it possible to compile a subset of your Haskell program through Gibbon, while seamlessly calling that accelerated code from regular Haskell code utilizing the full library ecosystem.


Definitely going to dive into this. Majority of enterprise software is processing serialized data - send data from the browser to the middleware, mung it, send it to the database, and then all the way back in reverse order. And we don't have good tool support for this.


Is LoCal going to help with this? The abstract suggests that they've figured out a unified representation for in-memory and disk/wire layout. That's an interesting performance feature for new applications, but doesn't seem to do anything to help existing serialization wranglers since you can't make the old formats go away.


Absolutely true. It's future work to train our Gibbon/LoCal compiler to generate code operating over more interesting, existing binary formats (CBOR, Avro, CapNP...), some of which are a better fit than others. Or for text formats we could focus on writing high-performance parsers and fusing them with recursive functions that consume the result. Lots more to do here.


Could you differentiate LoCal programs and generate succinct data structures? Apply transforms to make the code incremental?

Sounds like magic, so I make requests for more magic.


This is interesting. Without fully understanding the design of this system, I wonder how far it scales. Can you process big data with LoCal?


Is it web scale? :-) Sorry couldn’t resist... https://m.youtube.com/watch?v=b2F-DItXtZs


I actually thought that by now, in 2019, this forum would have realized that big data is real. It's a thing. Some of us are forced to deal with it. It's a hard problem that is well-suited for having discussions around on forums that are populated by intelligent people, but apparently it's not a suitable topic on this particular forum. I will certainly keep that in mind from hereon out.


Big data is mainly a marketing term to me. I've been involved in a Big Data project that wasn't really big data at all; it only existed because someone wanted to massage their CV with fancy tech. With current multicore machines and SSDs or cheap spinny disks you can hold and process incredible amounts of data.

So, serious question, how do you define big data? By some size threshold, or "it can't possibly be done on a single box", or something else?

And if I may, what is the BD work that you do? (Feel free to decline that one.)

TIA


I actually thought that by now, in 2019, that people doing big data work understood that big data covered a vast number of processing use cases and saying "can it process big data" has about as much semantic meaning as "can it process blue". I will certainly keep that in mind from hereon out.


Well, if you have a big document collection or something, the faster you can process each individual document the more you can scale, right?

I guess I'm not sure what scalability would mean here other than speeding up that individual, sequential processing job on one node (e.g. the map step in a MapReduce).


In a few years, young people will rediscover unix and we will have gone full circle.


So after skimming the paper, this seems like a formalization of something like Protobuf/Cap'n Proto with native language support.

Interesting, but not necessarily very practical.


Cap'n Proto is like a RAM format that can be directly serialized. This is a serialization format that can be directly worked with. They both allow skipping deserialization, but that's not the main draw of this approach. The ability to design the data representation around expected access patterns can make traversal faster and the data more compact even if you don't plan to ever "serialize" it.


The main focus of the Gibbon project is to make tree traversals very fast, and only secondarily to work with serialized data.

CapnProto is a wonderful library interface for dealing with serialized data, but I think the usual library-vs-compiler distinction applies. A library operates at the level of individual data access methods, but it doesn't try to change the control flow of the code calling those methods.

That's our goal with this project. We sometimes turn the code a bit inside out in order to make it operate better on serialized data. (For instance, switching a postorder traversal to a preorder one.)


You may want to read this comment:

https://news.ycombinator.com/item?id=20266970



