
Hitchhiker trees: functional, persistent, off-heap sorted maps - hadronzoo
https://github.com/datacrypt-project/hitchhiker-tree/blob/master/doc/hitchhiker.adoc
======
rdtsc
Those are very cool, and functional too!

> As it turns out, after I implemented the fractal tree, I spoke with a former
> employee of Tokutek, a company that commercialized fractal tree indices.
> That person told me that we’d actually implemented fractal reads
> identically!

That's not necessarily a good thing. Last I heard, Tokutek patented that. Not a
lawyer, but from what I understand, patents specifically pertain to the
"implementation".

Not sure what happened after they were acquired by Percona, but as I am in the
US, I would personally stay away from it. Even if the current owner doesn't
want to enforce the patent, the next one might.

~~~
dgrnbrg
Creator here. I would absolutely agree with this, but let me shed a bit more
light: the underlying idea of B+ trees with inline caches is not patented (or
maybe it was, but a long time ago?). Tokutek's patents are all focused on the
methods they use to achieve concurrent operations on a single tree--they have
a very interesting & novel locking protocol.

The Hitchhiker tree completely avoids their patents, since it makes very
different decisions in order to become purely functional.

~~~
rdtsc
Thanks for explaining. That makes sense.

Great work by the way! Really like the description of the code structure on
the README page.

If the patent is indeed not an issue, I might play with translating it to
Erlang. You already did the hard work and also made it functional. I'd just
have to learn to read Clojure.

Speaking of functional, I was surprised to find the other day that Erlang's
queue module implements an Okasaki API, complete with funny function names
like daeh, liat, and snoc.

[http://erlang.org/doc/man/queue.html](http://erlang.org/doc/man/queue.html)

------
rix0r
Funny: I implemented the exact same thing, with a focus on AWS and pagination
(albeit less well-documented):

[https://github.com/rix0rrr/libbruce/blob/master/README.md](https://github.com/rix0rrr/libbruce/blob/master/README.md)

~~~
dgrnbrg
That is very cool! When you end up doing reads, do you compute all the pending
writes along the path from root to leaf, and then run queries on the
projection?

~~~
rix0r
For index-seeks, yes.

For key-seeks, it's a normal tree walk to the leaf (unless an INSERT is found
at a higher level, IIRC).

In practice, the blocks are so huge that any tree is going to have a depth of
at most 3, and the root page is likely in cache, so 1 or 2 S3 fetches (roughly
200ms, let's say).

I've been toying with the idea of putting the root page (or maybe the top 2
levels of pages) in DynamoDB, which would make them fast even if they weren't
in cache.
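
To make the projection concrete: a key-seek under this scheme can stop as soon
as it finds a pending write for the key in a higher-level log, since that
entry shadows whatever the leaf holds. A rough sketch in Clojure (the
hitchhiker tree's language; the node shape here is made up for illustration,
not either library's actual layout):

```clojure
;; Hypothetical node shape: a leaf holds a sorted-map of its keys; an
;; interior node has sorted :pivots, a vector of :children, and a :log
;; of pending [k v] writes that haven't been pushed down yet.
(defn- child-index
  "Index of the child subtree that key k falls into, given sorted pivots."
  [pivots k]
  (count (take-while #(<= (compare % k) 0) pivots)))

(defn lookup
  "Walk from root to leaf, stopping early if a pending write for k sits
   in a log at a higher level; the most recent log entry for k wins."
  [node k]
  (if (:leaf node)
    (get (:leaf node) k)
    (if-let [[_ v] (last (filter (fn [[lk _]] (= lk k)) (:log node)))]
      v
      (recur (nth (:children node) (child-index (:pivots node) k)) k))))

;; (lookup {:pivots [10]
;;          :children [{:leaf (sorted-map 3 :x)} {:leaf (sorted-map 15 :z)}]
;;          :log [[3 :y]]}
;;         3)
;; => :y, the pending write shadows the leaf's :x
```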

------
anorwell
I'm not sure I understand this. The claim is that "we dramatically improve the
performance of insertions without hurting the IO cost of queries" by writing
to the root's event log. However, it seems to me that this only delays the
eventual write to the leaf node rather than avoiding it. As more items are
written, the log will eventually overflow and spill into its children, and so
on, until the data reaches the leaves. This is what would happen in a regular
B+ tree anyway, so what has been gained?

What am I missing?

~~~
dgrnbrg
It's not just the root's event log--it's all the event logs. The deferrals
allow us to batch writes in an optimized way, and to do so with a fully
incremental algorithm. In the IO cost model, each "block" read or write costs
1 IOP, so we can reduce the IOP cost of writes by a factor of 100-1000x, while
reads still have the same number of nodes to fetch. The event logs augmenting
the tree are an effective IO optimization on a B+ tree.
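
To make "all the event logs" concrete, here is a minimal functional sketch
(not the library's actual code; the node shape, names, and the tiny log limit
are all illustrative). An insert touches only the root block, and an
overflowing log is flushed one level down in bulk, one batch per child:

```clojure
(def log-limit 3) ; tiny for illustration; think 100s-1000s in practice

(defn- child-index
  "Index of the child subtree that key k falls into, given sorted pivots."
  [pivots k]
  (count (take-while #(<= (compare % k) 0) pivots)))

(declare flush-log)

(defn insert
  "Leaves take the write directly; interior nodes just append [k v] to
   their log, flushing it downward only when it overflows."
  [node k v]
  (if (:leaf node)
    (update node :leaf assoc k v)
    (let [node' (update node :log conj [k v])]
      (if (> (count (:log node')) log-limit)
        (flush-log node')
        node'))))

(defn- flush-log
  "Group the pending writes by target child, push each batch down with
   recursive inserts, and clear this node's log. Node splits are omitted
   to keep the sketch short."
  [{:keys [pivots children log] :as node}]
  (let [batches (group-by (fn [[k _]] (child-index pivots k)) log)]
    (assoc node
           :log []
           :children (reduce-kv (fn [cs idx writes]
                                  (update cs idx
                                          (fn [child]
                                            (reduce (fn [c [k v]] (insert c k v))
                                                    child
                                                    writes))))
                                children
                                batches))))
```

The write-side saving falls out of the batching: persisting an insert rewrites
only the root block, and a flush moves a whole batch of entries for the price
of rewriting one child, while a read still fetches one node per level, exactly
as before.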

~~~
anorwell
Thanks, makes sense.

------
dgreensp
How many bytes is a node, in practice? If you stuff hundreds or thousands of
pointers into a node, plus a bunch of log entries, isn't that a lot of data to
clone when you do a path-copy? It seems like there's a major trade-off there,
unless you always write in large batches. I suppose you could have fast
"cons-ing" of events onto the root log without any copying. It would be
interesting to know what choices lead to good performance in a sample use case.

~~~
dgrnbrg
The code actually includes a benchmarking tool that's meant to help with this
decision. Once you've selected your key size (or distribution), value size (or
distribution), and backend, you can play with these factors. Intuitively,
targeting the underlying "block" size should improve performance: 4k-1MB if
you're in memory or on local disk, 1.5k-9k if you're over the network
(depending on configuration).

I did some work to split nodes based on size rather than number of children,
but that requires accurate & fast size estimation of the decompressed objects,
which isn't possible in general.
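
For a back-of-envelope feel before reaching for the benchmark tool (all
numbers here are hypothetical), a rough serialized-size estimate for an
interior node:

```clojure
(defn approx-node-bytes
  "Crude estimate: pivot keys plus child pointers plus the pending log.
   Ignores compression and per-object overhead, which is exactly why
   accurate size-based splitting is hard in general."
  [{:keys [branching key-bytes ptr-bytes log-entries entry-bytes]}]
  (+ (* branching (+ key-bytes ptr-bytes))
     (* log-entries entry-bytes)))

;; A 100-way node with 16-byte keys, 8-byte pointers, and 50 pending
;; 64-byte events comes to ~5.6kB, inside the 1.5k-9k network range:
(approx-node-bytes {:branching 100 :key-bytes 16 :ptr-bytes 8
                    :log-entries 50 :entry-bytes 64})
;; => 5600
```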

------
falcolas
> the best-performing data structure for looking up sorted keys cannot do
> those queries faster than O(log(n))

Feels like a digression off the actual tree structure, but isn't this
incorrect, since a hash map can do key lookups in O(1)? With caveats, of
course.

It's a great idea overall, but I'd also be curious how well overlaying an
event log on a cuckoo hash would work in comparison.

~~~
continuational
It's a common misconception that hash maps have constant-time lookups. In
order for the lookup to be constant time, the hash function must output enough
bits to avoid causing too many collisions. For example, 16 bits is too little
for a hash map with a million keys. In fact, the hash function's output size
needs to grow as O(log(n)), where n is the number of entries. Processing
log(n) bits takes at least O(log(n)) time. QED
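
Spelling out the pigeonhole step behind that example: with b output bits there
are at most 2^b distinct hash values, so some bucket receives at least n/2^b
keys, and keeping buckets at constant expected size forces

```latex
\left\lceil \frac{n}{2^{b}} \right\rceil = O(1)
  \;\Longrightarrow\; b \ge \log_2 n - O(1),
\qquad
\text{e.g. } n = 10^{6},\ b = 16:\ \frac{10^{6}}{2^{16}} \approx 15
\text{ keys per bucket.}
```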

~~~
jasonwatkinspdx
> Processing log(n) bits takes at least O(log(n)) time.

No, because log(n) might be smaller than the word size of the machine, which
is the common case for hash table implementations.

~~~
continuational
If you limit log(n) and thus n to a constant, like the word size of the
machine, then _every algorithm is constant time_.

~~~
jasonwatkinspdx
No, I am not forcing n of any size to a constant; I'm saying that we can
bottom out inductively for small n.

This is a very old discussion. Read Knuth's rationale. Then read the similar
sections in Sedgewick et al. or whatever other standard algorithms textbook
you like. Compare them, and think about which contexts each approach is most
useful in.

In other words, no, you have not somehow cleverly invalidated all of
complexity analysis.

------
Hugie
Does anyone know of an existing visualization of this algorithm, for better
understanding?

~~~
dgrnbrg
Creator here. There are no visualizations of this yet, but I'll be speaking
about it this year at Strange Loop. For that talk, I'll be creating
visualizations, so stay tuned!

------
adamretter
Is it possible to use this from Java directly, or does the Clojure API make
this difficult?

~~~
dgrnbrg
Creator here. You could use this from Java using the Clojure API for Java;
however, many of the extension hooks use Clojure protocols. I would recommend
writing a shim to Java for your particular application, so that your data
types are represented in the way you want.
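
For example (a sketch, not part of the library; the sorted-map atom is a
stand-in for where the hitchhiker tree calls would go), gen-class can present
a plain Java class:

```clojure
(ns example.shim
  ;; AOT-compile this namespace to produce the example.SortedStore class.
  (:gen-class
   :name example.SortedStore
   :state state
   :init init
   :methods [[put [Object Object] void]
             [lookup [Object] Object]]))

(defn -init []
  ;; State: an atom around a sorted-map, standing in for the real tree.
  [[] (atom (sorted-map))])

(defn -put [this k v]
  (swap! (.state this) assoc k v)
  nil)

(defn -lookup [this k]
  (get @(.state this) k))
```

From Java it's then just `new example.SortedStore()` plus `put`/`lookup`
calls, with no Clojure visible at the call site.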

~~~
sievebrain
That'd require the person doing the using to know Clojure, right?

~~~
dgrnbrg
Indeed it would.

------
tuddman
Very cool!

Any idea how this would compare to Clojure-wrapping

[https://github.com/OpenHFT/Chronicle-Map](https://github.com/OpenHFT/Chronicle-Map) ?

Seems like there are some similar design goals between the two.

