
Crit-bit trees - siromoney
http://cr.yp.to/critbit.html
======
agl
For a fuller explanation:
[https://github.com/agl/critbit/blob/master/critbit.pdf?raw=t...](https://github.com/agl/critbit/blob/master/critbit.pdf?raw=true)

(Note: self promotion.)

~~~
nly
Github could really do with treating PDF files differently so I can view them
in my browser.

~~~
phaer
An updated firefox should include pdf.js and there are plugins for at least
chromium. Github could do it, but you could also do it yourself.

~~~
nly
The problem is Github triggers a download action, regardless of whether you
have a PDF plugin.

~~~
mh-
This is a security feature of sorts: MIME sniffing is disabled.

------
yjh0502
Adaptive radix tree
([https://github.com/armon/libart](https://github.com/armon/libart)) is also
an impressive data structure. It also supports ordered iteration while
showing random read/write performance similar to hash tables. The crit-bit
tree is memory efficient, but it suffers from cache misses with many keys (> 1M).

~~~
eloff
That's the problem I see: it's memory efficient, but lookups will cost
thousands of cycles for modestly sized trees. A hash table, on the other
hand, can do lookups in a couple hundred cycles (cold cache for both).

~~~
sprachspiel
But cold caches are an unrealistic assumption. The top-most levels of a tree
will always be in cache, unless you almost never access them -- in which case
there's no problem either. Additionally, a radix tree is ordered, whereas a
hash table is not.

~~~
eloff
Yes, that's correct. However, that's still over a thousand cycles for a tree
of depth 5 below the cached part, and that's a modestly sized tree (or
several smaller trees). Don't forget that lots of things compete for cache;
it's usually safer to assume a cold cache unless you know your data structure
is very high traffic.

------
tel
If you're interested in helping out with a nice Haskell library, Bryan
O'Sullivan, one of the core Haskell library developers, is publicly building
a crit-bit library for Haskell:

[http://hackage.haskell.org/package/critbit](http://hackage.haskell.org/package/critbit)

Contribution information is available on the GitHub page:

[https://github.com/bos/critbit](https://github.com/bos/critbit)

Also, as always, Edward Kmett weighs in with some particularly insightful
comparisons of crit-bit trees, PATRICIA trees, and other variants:

[http://www.reddit.com/r/haskell/comments/1e1ywq/critbit_tree...](http://www.reddit.com/r/haskell/comments/1e1ywq/critbit_trees_in_haskell_fast_and_open_to/c9wcbgd)

~~~
Dewie
The activity for critbit seems to have died in the last half year:

[https://github.com/bos/critbit/graphs/contributors](https://github.com/bos/critbit/graphs/contributors)

------
colmmacc
A while ago I took Adam Langley's and DJB's crit-bit code and put a
CDB-compatible API around it; it's at:

[https://github.com/colmmacc/nutrient](https://github.com/colmmacc/nutrient)

It's still a work in progress, but it made it considerably easier for me to
fully understand what's going on. It may help others.

------
norswap
For those like me who did not understand the rather vague explanation, this is
simply a bit-based trie. Not as exciting as the page makes it sound.
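
To make "bit-based trie" concrete, here is a minimal lookup sketch (my own
illustration, not DJB's or agl's actual code): internal nodes store only the
index of the critical bit at which their two subtrees first differ, and
leaves store whole keys.

```python
# Minimal crit-bit-style lookup sketch (illustrative only).
class Node:
    def __init__(self, bit, left, right):
        self.bit = bit      # index of the critical bit (0 = MSB of byte 0)
        self.left = left    # subtree where that bit is 0
        self.right = right  # subtree where that bit is 1

def bit_at(key: bytes, i: int) -> int:
    # Extract bit i of the key, counting from the most significant bit.
    return (key[i // 8] >> (7 - i % 8)) & 1

def lookup(root, key: bytes):
    # Walk down, testing one bit per internal node, then confirm the
    # candidate leaf with a single full-key compare.
    node = root
    while isinstance(node, Node):
        node = node.right if bit_at(key, node.bit) else node.left
    return node if node == key else None

# b"ab" and b"ac" first differ at bit 15 (the low bit of the second byte).
root = Node(15, b"ab", b"ac")
```

Note that the tree never stores the bits leading up to the critical bit;
the final full-key compare catches any key that merely followed the same
branch pattern.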

------
po
The article mentions replacing the Python hash-backed dict with a crit-bit
tree… Sounds like a good opportunity to try it out with PyPy. If it has no
drawbacks then it should show up as a speed improvement in their benchmarks.

~~~
eloff
I can already tell you that no tree can ever compare performance-wise to a
well-designed hash table (like Python's). "Costly string compares" are much
cheaper than cache misses. And for class attributes and the like, Python
doesn't even do the string compares, just a quick pointer compare.

cache miss = 100 cycles

cache miss + tlb miss = 200 cycles

memcmp compares many bytes per cycle, so it's clear that the cache misses
will dominate the runtime, and trees involve O(log n) cache misses. A hash
table is typically two cache misses. For interned Python strings it's only
one.
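
As a back-of-envelope illustration of those numbers (using the rough cycle
estimates above, not measurements):

```python
# Toy cost model: cache misses dominate, so count misses * miss latency.
import math

MISS = 100  # cycles per cache miss (rough figure from this thread)

def tree_lookup_cycles(n_keys: int) -> int:
    # A balanced binary tree touches ~log2(n) nodes, each a potential miss.
    return math.ceil(math.log2(n_keys)) * MISS

def hash_lookup_cycles(interned: bool) -> int:
    # One miss for the bucket, one for the key -- the key miss is skipped
    # when the compare is a pointer compare on an interned string.
    return MISS if interned else 2 * MISS

print(tree_lookup_cycles(1_000_000))  # 2000 cycles
print(hash_lookup_cycles(False))      # 200 cycles
print(hash_lookup_cycles(True))       # 100 cycles
```

So at a million keys the cold-cache model predicts roughly a 10x gap, which
is the shape of the argument being made here.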

~~~
optimiz3
"no tree can ever compare performance-wise to a well designed hash table"

This is not always true... if you have to hash strings when executing
queries, data structures like tries or naive 256-ary trees may be faster
(the O(strlen) time it takes to hash the string is instead used to walk the
tree).
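
A minimal sketch of the naive 256-ary trie idea (hypothetical code,
dict-of-dicts nodes for brevity): lookup walks one byte of the key per
level, so the O(strlen) work hashing would do is spent on the walk instead,
and no hash is ever computed.

```python
# 256-ary byte trie: each node maps a byte to a child node; the sentinel
# key None marks "a key ends here" and holds its value.
def trie_insert(root: dict, key: bytes, value):
    node = root
    for b in key:
        node = node.setdefault(b, {})
    node[None] = value

def trie_lookup(root: dict, key: bytes):
    node = root
    for b in key:
        node = node.get(b)
        if node is None:
            return None          # no key with this prefix
    return node.get(None)        # None if key is only a prefix

t = {}
trie_insert(t, b"critbit", 1)
```

A real 256-ary trie would use a 256-slot child array per node rather than a
dict, which is exactly the memory-hungry layout that ART-style adaptive
nodes try to compress.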

~~~
nly
It's true that hashing a string of k bits and a lexicographic compare over
k bits are both O(k), but chances are, if you're about to look something up,
you've recently read it from somewhere and it's going to be hot in the
cache.

Memoizing a hash code for a string is pretty cheap also, especially if you
only dynamically allocate on the heap for long strings.

~~~
danieldk
Or precompute and store the hash when the string type is immutable. This is
what Java does, and it gives hash tables an advantage over graphs (trees,
automata) in lookups.

The advantages of trees lie elsewhere, such as persistence and ordering.
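
A small sketch of the memoized-hash pattern being described (illustrative
only; CPython already caches str hashes internally, and Java caches
String.hashCode the same way):

```python
# Immutable key wrapper that computes its hash once and reuses it.
class Key:
    __slots__ = ("data", "_hash")

    def __init__(self, data: bytes):
        self.data = data
        self._hash = None          # computed lazily, then cached

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self.data)  # O(len) once, O(1) afterwards
        return self._hash

    def __eq__(self, other):
        return isinstance(other, Key) and self.data == other.data
```

Every dict probe after the first pays only the cached-hash load plus the
equality check, which is the advantage over a tree that must re-walk or
re-compare the key on every lookup.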

~~~
nly
If you want persistence (in the functional sense) and ordering, I'd rather
implement a skip list personally. You can use your existing hash code as a
randomiser to determine level promotion; if you're going to keep it around,
you may as well make use of it. Skip lists also have the nice property that
you can stick an immutable linked-list facade on them (since the base layer
is fully linked).
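
A minimal sketch of that idea (hypothetical code, names my own): level
promotion is derived from the key's hash bits instead of a fresh RNG call,
and ordered iteration is just a walk of the fully linked base layer.

```python
# Skip list whose level promotions come from the key's hash bits.
MAX_LEVEL = 8

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level   # forward pointers, one per level

class SkipList:
    def __init__(self):
        self.head = SkipNode(None, MAX_LEVEL)

    def _level(self, key):
        # Count trailing one-bits of the hash: geometric(1/2) levels,
        # same distribution a coin-flip RNG would give.
        h = hash(key) & (2 ** MAX_LEVEL - 1)
        lvl = 1
        while h & 1 and lvl < MAX_LEVEL:
            lvl += 1
            h >>= 1
        return lvl

    def insert(self, key):
        # Find the rightmost node before `key` on every level.
        update = [self.head] * MAX_LEVEL
        node = self.head
        for i in reversed(range(MAX_LEVEL)):
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
            update[i] = node
        new = SkipNode(key, self._level(key))
        for i in range(len(new.next)):
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def __iter__(self):
        # The base layer is fully linked, so iteration is already ordered.
        node = self.head.next[0]
        while node:
            yield node.key
            node = node.next[0]
```

The base-layer walk in `__iter__` is what makes the immutable linked-list
facade cheap: it never touches the upper levels at all.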

------
headgasket
Lookups and other operations on tree structures are O(log n). A hash is O(1)
(on average). They have different uses; defaulting a dictionary to a hash is
a valid language decision IMHO: it's lightweight and it fits the bill. And on
average it will be faster.

An explicit language construct for radix trees is an interesting idea, but
once you really need trees you might be closer to needing a real RDBMS, or an
in-process extension such as SQLite.

~~~
__david__
> Lookups and other operations on tree structures are O(log n). A hash is
> O(1) (on average).

Sure, but remember it also takes time to compute the actual hash value. That
process is O(n), where n is the length of the key. For large key sizes and
small set sizes the tree probably wins (for some definition of "large" and
"small").

The critbit algorithm walks the tree while simultaneously moving through the
key bytes/bits. It seems to me that for most modestly sized sets it has the
advantage.

~~~
cpeterso
That's why C++ STL's std::map and std::set are typically implemented using
(some flavor of) binary trees.

~~~
headgasket
Standard C++11 added std::unordered_map, which is hash-backed. Previously
GCC and Microsoft shipped STL-style hash maps, but they were not part of the
standard.

I always wondered why. In my coding experience, for most generic coding
tasks, a key-value dictionary backed by a hash was the better go-to
construct. I wonder if it has something to do with processor branch
prediction; a random guess would be that branch prediction is hard on a
well-balanced tree, while a hash lookup is constant.

It's a very good point that when it comes to hashing a very long key string,
this crit-bit tree structure has interesting properties. I wonder if this
structure could be used to implement a good (better?) average-case
implementation of the LCS problem.
[http://en.wikipedia.org/wiki/Longest_common_subsequence_prob...](http://en.wikipedia.org/wiki/Longest_common_subsequence_problem)

------
slashdotaccount
DJB, if you are reading this, your site is such a treasure, could you please
set up an rsync server so people can create online and offline mirrors?

~~~
phaer
[http://cr.yp.to/mirrors.html](http://cr.yp.to/mirrors.html)

------
phaer
Are there any drawbacks? The implementation linked by agl does not seem too
complicated, but at first glance it almost sounds too good to be true.

~~~
finnw
I imagine it would be hard to implement efficiently in Java.

