To support access patterns that aren't strictly preorder, the compiler (optionally?) injects offsets into the structure, to allow "skipping over" branches to get to the one you want.
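To make the offset idea concrete, here is a minimal sketch of what such a layout might look like. The tag values, field order, and 32-bit offset width are my assumptions for illustration, not LoCal's actual encoding:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// A binary tree serialized in preorder: a leaf is [tag 0][int64 value] and an
// interior node is [tag 1][uint32 skip][left subtree][right subtree], where
// `skip` is the byte length of the left subtree. A traversal that only wants
// the right child can jump over the left subtree instead of walking it.
size_t rightChildPos(const std::vector<uint8_t>& buf, size_t nodePos) {
  uint32_t leftSize;
  std::memcpy(&leftSize, buf.data() + nodePos + 1, sizeof leftSize);
  return nodePos + 1 + sizeof leftSize + leftSize;  // tag + offset + left
}
```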
Apparently, any value can optionally be a pointer, in order to allow for aliasing. But this is the exception rather than the norm. Aliasing is of course critical to allow data structures to be updated -- in the FP copy-on-write sense, where you clone and update all objects on the path from the root to the modified one.
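For readers who haven't seen it, path copying looks roughly like this. This is a generic persistent-tree sketch, not LoCal code; the `Tree` type and field names are mine:

```cpp
#include <cstdint>
#include <memory>

// Persistent binary search tree: an update clones only the nodes on the
// root-to-target path; every untouched subtree is shared (aliased) between
// the old and new versions of the tree.
struct Tree {
  int64_t key, value;
  std::shared_ptr<const Tree> left, right;
};

std::shared_ptr<const Tree> insert(const std::shared_ptr<const Tree>& t,
                                   int64_t k, int64_t v) {
  if (!t) return std::make_shared<const Tree>(Tree{k, v, nullptr, nullptr});
  if (k < t->key)
    return std::make_shared<const Tree>(
        Tree{t->key, t->value, insert(t->left, k, v), t->right});
  if (k > t->key)
    return std::make_shared<const Tree>(
        Tree{t->key, t->value, t->left, insert(t->right, k, v)});
  return std::make_shared<const Tree>(Tree{k, v, t->left, t->right});
}
```

Only the O(log n) nodes along the path are allocated fresh; everything else is aliased.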
This seems to be a nice fit for functional programming and is pretty cool in that context. I suspect it would be a lot more difficult to take this approach in an imperative model.
Amusingly, in a way this is the opposite of what Cap'n Proto does (disclosure: I'm the author of Cap'n Proto). LoCal is a programming language whose in-memory representation looks like a serialization format (preorder; no pointers). Cap'n Proto is a serialization format which looks like an in-memory representation (arbitrary order; using pointers).
They have a table of benchmarks showing LoCal being faster than Cap'n Proto. It would be nice to see code for these benchmarks. It sounds like the test case is a binary tree. This is an almost pathologically pointer-heavy use case, and as a result it would tend to show LoCal in the best possible light and Cap'n Proto in the worst. Cap'n Proto will spend 8 bytes on every pointer; LoCal will spend zero. (At two child pointers per node, 8 bytes each, a binary tree of a million nodes carries about 16 MB of pointer overhead alone.)
I take issue with the "treeInsert" benchmark, in which the paper claims Cap'n Proto is O(n) whereas LoCal is O(log n). It sounds like for some reason the authors decided they needed to copy the entire Cap'n Proto message before they could mutate it. Cap'n Proto supports in-place mutation, so an insert should cost O(log n) reads and O(1) writes (compared to O(log n) reads and writes for LoCal). (Admittedly, in-place modification of files loaded from disk is tricky to set up at present, but this is an implementation limitation, not a theoretical one -- and it's not clear to me that this test was loading files from disk anyhow.)
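For illustration, an in-place insert could look like the following C++, assuming a hypothetical schema `struct Node { value @0 :Int64; left @1 :Node; right @2 :Node; }` run through the capnp code generator. The `Node::Builder` type and the get/has/init accessors below come from that assumed schema, not from the paper's benchmark code:

```cpp
#include <cstdint>
#include "tree.capnp.h"  // hypothetical generated header for the schema above

// Walks O(log n) nodes down the tree, then performs a single O(1) write:
// initLeft()/initRight() allocates exactly one new node inside the existing
// message. No copy of the rest of the message is required.
void treeInsert(Node::Builder node, int64_t v) {
  if (v < node.getValue()) {
    if (node.hasLeft()) treeInsert(node.getLeft(), v);
    else node.initLeft().setValue(v);
  } else {
    if (node.hasRight()) treeInsert(node.getRight(), v);
    else node.initRight().setValue(v);
  }
}
```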
Maybe they decided that a complete copy was needed in order to get FP copy-on-write semantics, but that's not really true either. In principle, it would be possible to create a Cap'n Proto implementation that can start from a file on disk and modify it in an object-by-object copy-on-write way, appending new/modified objects to the end of the file in order to avoid destroying any information. At present this hasn't been implemented, but that's largely because no one has had a pressing need for it, to my knowledge.
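Sketched in C++, such an append-only scheme might look like this; the fixed-size node record and offset-addressing convention are my invention, purely to show the shape of the idea:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Append-only copy-on-write: nodes live at fixed file offsets and are never
// overwritten. To modify one, append a patched copy, then append patched
// copies of its ancestors so the path from the root points at the new
// versions. Old roots stay valid, so prior snapshots remain readable.
struct Node {
  int64_t value;
  uint64_t left, right;  // 0 = no child (a real file reserves a header there)
};

uint64_t appendNode(std::vector<uint8_t>& file, const Node& n) {
  uint64_t off = file.size();
  file.resize(off + sizeof n);
  std::memcpy(file.data() + off, &n, sizeof n);
  return off;
}

Node readNode(const std::vector<uint8_t>& file, uint64_t off) {
  Node n;
  std::memcpy(&n, file.data() + off, sizeof n);
  return n;
}

// Example: change the value of root->left without destroying any old bytes.
// Returns the offset of the new root; the old root remains intact.
uint64_t setLeftValue(std::vector<uint8_t>& file, uint64_t root, int64_t v) {
  Node r = readNode(file, root);
  Node l = readNode(file, r.left);
  l.value = v;
  r.left = appendNode(file, l);  // new copy of the modified node
  return appendNode(file, r);    // new root pointing at it
}
```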
TL;DR: we'd love to get some pointers on our Cap'n Proto usage --
let's set up a call between all us authors and you.
> difficult to take this approach in an imperative model.
> LoCal is a programming language whose in-memory representation looks like a serialization format (preorder; no pointers). Cap'n Proto is a serialization format which looks like an in-memory representation (arbitrary order; using pointers).
While there's a bit of discussion of differences in 7.2, I think my
student, Michael Vollmer, emphasized this a bit better in the talk
(which should be posted online at some point).
> They have a table of benchmarks showing LoCal being faster than Cap'n Proto. It would be nice to see code for these benchmarks. It sounds like the test case is a binary tree.
The compiler is here:
And the benchmarks from the paper are here:
(The ACM Digital Library entry for the paper is supposed to link the artifact, but it seems it's not posted yet.)
> "treeInsert" .. Cap'n Proto supports in-place mutation, so should be
O(log(N)) reads and O(1) writes (compared to O(log(n)) reads and
writes for LoCal).
> In principle, it would be possible to create a Cap'n Proto implementation that can start from a file on disk and modify it in an object-by-object copy-on-write way, appending new/modified objects to the end of the file in order to avoid destroying any information.
Well, there's nothing more powerful than an idea whose time has come. This is going to shake up anything that operates on hierarchical data in passes, because traditional structure layouts have everyone doing a lot of pointer chasing. That approach got old when CPUs got fast relative to RAM.
But I'm doing this for engineering purposes rather than research, and I like my implementation choices. I have achieved some properties that I think are important for practical integration with other code. The working name of my project is Re, because the core data structure is reified traversals, and operating on them involves reifying some other things that aren't normally reified.
Edit to add: my elevator pitch (needs condensing, I haven't pitched it to anyone but my mom):
It used to be common practice to use tree structures to represent linear data. As CPUs have become faster than RAM, the pointer-heaviness of this approach has made the constant factors awful, and it's now avoided when possible. It's now actually efficient to do the opposite and represent trees as linear data when you can. This can be done explicitly, e.g. in the WASM format, but such structures can't be operated on directly unless it's done "in the raw", outside the type system and other facilities of the language. What's missing is a generic abstraction layer that maps between operations on conceptual application objects and linearized trees. I've found that once you have that, you can also do things with linearized data that would be impractical with trees.
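As a concrete (if toy) illustration of representing a tree as linear data, here is one way to flatten an expression tree into a preorder byte stream. The tags and layout are invented for this example and aren't taken from any particular format:

```cpp
#include <cstdint>
#include <vector>

// Flatten an expression tree into a preorder byte stream: a literal becomes
// [TAG_LIT][int64 value]; an addition becomes [TAG_ADD] followed by both
// operands serialized inline. No pointers survive the encoding.
enum : uint8_t { TAG_LIT = 0, TAG_ADD = 1 };

struct Expr { uint8_t tag; int64_t lit; const Expr *lhs, *rhs; };

void encode(const Expr& e, std::vector<uint8_t>& out) {
  out.push_back(e.tag);
  if (e.tag == TAG_LIT) {
    const auto* p = reinterpret_cast<const uint8_t*>(&e.lit);
    out.insert(out.end(), p, p + sizeof e.lit);
  } else {
    encode(*e.lhs, out);
    encode(*e.rhs, out);
  }
}
```

The abstraction layer I'm describing is what would let application code pattern-match and rewrite structures like this without ever seeing the raw bytes.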
(Actually, it happened the other way around. Some time ago I prototyped a regex-like engine for finding and replacing fragments of ASTs [https://blog.lambdaverse.org/comacro/], and I discovered that operating directly on the serialized representation is what enabled it. I haven't had a chance to make a production implementation yet. Then recently I was working on tree-structured template inference, to turn collections of webpages back into the databases that generated them with minimal interaction, and I realized... the comacro construction is a perfect fit here. So it became an excuse to develop that idea, ostensibly as part of a project I could actually make money with.)
I wouldn't be surprised if somebody else thought of the same idea -- that happens all the time. The canonical example is Newton and Leibniz independently inventing calculus :)
I also found that the academic projects Morbig/Colis and Smoosh independently did some of the same things I'm doing in Oil, at pretty much the exact same time (2016-2017).
Links here: https://github.com/oilshell/oil/wiki/ExternalResources
Some things are just "in the air". I think there are probably a lot more precursors to what you were doing than you were aware of (some of which may be on that wiki page -- I also have a private wiki page called "Avoiding Pointers" with some more academic papers, if you're interested).
I mention Ken Thompson's encodings of regexes, and that was in the 1960s. He didn't motivate it by "avoiding pointers", and it's specific to a particular application rather than a general-purpose language, but the idea is the same.
The code I wrote was inspired by Cap'n Proto to a large extent, and its author chimed in here about the similarities to this language. Protobufs/Cap'n Proto are basically a language-independent type system, and it's not that big a jump to go from a type system to a general programming language.
Pointers are a junk-food abstraction that needs to go away.
Block deduplication, block reference counting (and the attendant requirement to GC unused blocks within the file), and n:m references all arise in some formats, but they don't seem to be part of this language? This sounds like magic, so I'm making a request for more magic.
So, serious question: how do you define big data? By some size threshold, or "it can't possibly be done on a single box", or something else?
And if I may, what is the BD work that you do? (Feel free to decline that one.)
I guess I'm not sure what scalability would mean here other than speeding up that individual, sequential processing job on one node (e.g. the map step in a MapReduce).
Interesting, but not necessarily very practical.
Cap'n Proto is a wonderful library for dealing with serialized data, but I think the usual library-vs-compiler distinction applies: a library operates at the level of individual data access methods, but it doesn't try to change the control flow of the code calling those methods.
That's our goal with this project. We sometimes turn the code a bit inside out in order to make it operate better on serialized data (for instance, switching a postorder traversal to a preorder one).
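To illustrate the kind of transformation meant here (my own toy example, not the project's actual output): a tree sum that would naturally be written as a recursive, pointer-chasing fold becomes a single forward scan once the tree is serialized in preorder:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Sum the leaves of a binary tree serialized in preorder, where a leaf is
// [tag 0][int64 value] and an interior node is [tag 1] followed by its two
// children inline. One pass over contiguous bytes; no recursion, no pointers.
int64_t sumLeaves(const std::vector<uint8_t>& buf) {
  int64_t sum = 0;
  for (size_t i = 0; i < buf.size();) {
    if (buf[i] == 0) {                        // leaf: consume tag + payload
      int64_t v;
      std::memcpy(&v, buf.data() + i + 1, sizeof v);
      sum += v;
      i += 1 + sizeof v;
    } else {                                  // interior node: just the tag
      i += 1;
    }
  }
  return sum;
}
```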