
Crit-bit Trees: the best way to store sets - bumbledraven
http://cr.yp.to/critbit.html
======
jfager
If you're looking for a jvm implementation, there are at least a couple:

<https://github.com/rkapsi/patricia-trie>
<https://github.com/jfager/functional-critbit>

The latter is my own, with a focus on memory efficiency (by my benchmarks, it
uses about 40% less space than patricia-trie for the same data set). Would
love to get some extra eyes on it to figure out how to trim it further or
speed it up.

------
ot
Crit-bit trees are great but they're not a silver bullet: like every tree
structure, even in the lucky case a search can cause O(log n) cache misses,
while a good hash table needs just one or two.

Also, as pointed out in another comment, a crit-bit tree is not necessarily
balanced (unlike, say, a red-black tree), and there are some real-world
datasets where crit-bit (or Patricia) trees behave poorly, URLs for example. I
myself have seen Patricia trees with height greater than 100 for sets of URLs
where log n ~ 30. According to VTune, most of the search time was spent in
cache misses.

~~~
fmap
That's a very good point. One way to circumvent this problem is to combine
hashing and crit-bit trees, by constructing an unordered tree on the hash
values of the set elements. The nice thing here is that if the hash function
behaves like a uniform random variable the resulting tree will be balanced.
Furthermore, without ordering the implementation is trivial.
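One way to picture this suggestion (my own sketch, not fmap's code; `h32` is
an arbitrary stand-in 32-bit hash): build a plain bitwise trie on the hash
values instead of on the keys, so that even pathological key sets such as "1",
"11", "111", ... land in a tree of roughly log2(n) depth.

```python
import zlib

HASH_BITS = 32

def h32(key: str) -> int:
    # stand-in hash; any well-mixed 32-bit hash would do
    return zlib.crc32(key.encode()) & 0xFFFFFFFF

class HashTrie:
    """Bitwise trie keyed on hash values; a leaf is the stored key itself."""

    def __init__(self):
        self.root = None  # None (empty), a str (leaf), or [left, right]

    def insert(self, key: str) -> None:
        hv = h32(key)
        if self.root is None:
            self.root = key
            return
        parent, side, node = None, 0, self.root
        bit = HASH_BITS - 1
        while isinstance(node, list):      # descend internal nodes
            side = (hv >> bit) & 1
            parent, node = node, node[side]
            bit -= 1
        if node is None:                   # empty slot: drop the key in
            parent[side] = key
            return
        if node == key:                    # already present
            return
        other = h32(node)
        while bit >= 0:                    # split the leaf we collided with
            new = [None, None]
            if parent is None:
                self.root = new
            else:
                parent[side] = new
            b = (hv >> bit) & 1
            if b != (other >> bit) & 1:    # hashes diverge here: done
                new[b], new[1 - b] = key, node
                return
            parent, side = new, b          # same bit: keep splitting deeper
            bit -= 1
        # full 32-bit collision; real code would need an overflow bucket here

    def depth(self) -> int:
        def d(n):
            return 1 + max(d(n[0]), d(n[1])) if isinstance(n, list) else 0
        return d(self.root)

t = HashTrie()
for i in range(1, 1001):
    t.insert("1" * i)   # the pathological chain mentioned elsewhere in the thread
print(t.depth())        # close to log2(1000), and never more than 32
```

Because the tree is keyed on hash bits, its depth is bounded by the hash width
regardless of how adversarial the keys are.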

At the same time you will not need rehashing once the table "fills up", and
since you don't discard bits of the hash value, the chance that a new
insertion collides with one of n existing elements is about n / 2^32 with a
32-bit hash, i.e. effectively 0 for reasonable values of n.

Additionally you can use clustering (e.g. build a crit-nibble tree, with 16
pointers per node - one 64 byte cache line) to reduce the number of cache
misses. This can backfire spectacularly unless the data is essentially random,
so the hashing step remains important.

It may not be a silver bullet, but there are some interesting trade-offs.

~~~
ot
That is true, but you lose one of the most important properties of crit-bit
trees: ordered operations.

If the data structure is unordered, I don't see any reason why it should be
better than a normal hash table.

~~~
fmap
The data structure is not better than a normal hash table. It is a different
trade-off.

For instance, you don't have to do a rehashing step, which is important for
real time applications. Memory usage is also very deterministic at 2*(n-1)
words in an n element table (with 2 words per node).

If none of this matters to you and you merely want a map with fast lookups
then a simple hash table is probably a better choice.

------
viraptor
> A crit-bit tree for a nonempty prefix-free set S of bit strings has an
> external node for each string in S; an internal node for each bit string x
> such that x0 and x1 are prefixes of strings in S; and ancestors defined by
> prefixes.

And that's a perfect (missing) example for "a picture is worth a thousand
words". Then again... this coming from DJB - I should be surprised the
description is as verbose as it is.

~~~
kragen
The particular words you've quoted have managed to condense a thousand
pictures — an infinite number, really — into 45 words. It's kind of a
counterexample.

Some examples would certainly help, and probably diagrams would help
understanding the examples.
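For what it's worth, here's a tiny worked example of the quoted definition (my
own illustration, not from the article): for a set S, the internal nodes are
exactly the bit strings x such that both x0 and x1 are prefixes of members of
S.

```python
def internal_nodes(S):
    # every prefix of every member, including "" and the full strings
    prefixes = {s[:i] for s in S for i in range(len(s) + 1)}
    return sorted(x for x in prefixes
                  if x + "0" in prefixes and x + "1" in prefixes)

S = {"0010", "0110", "1000"}
print(internal_nodes(S))   # ['', '0']
```

So this three-element set gets the promised n - 1 = 2 internal nodes: ""
(keys split on bit 0) and "0" (keys split on bit 1), plus one external node
per string in S.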

~~~
viraptor
That's assuming you take "nonempty prefix-free set", "internal node", and "bit
string x such that x0 and x1" (not sure if those are supposed to be subscripts
or literal 0 and 1 bits) for granted. A picture solves the problem in many
ways by giving you real examples.

It's 45 words with lots of assumed knowledge that you could write books about
;)

------
jbapple
> Another advantage is that a crit-bit tree guarantees good performance: it
> doesn't have any tricky slowdowns for unusual (or malicious) data.

Consider a crit-bit tree over variable-length bit strings. The set 1, 11, 111,
... (with n bit strings) has depth n.

Depending on how you count, it can be argued that this is still substantially
better than the worst-case of hash tables, but in all of the ways I think of
counting, a crit-bit tree is worse on this input than a balanced search tree.
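The degeneracy is easy to see mechanically (a toy illustration, assuming the
convention that strings are implicitly terminated): the first differing
position between consecutive keys grows by one each time, so each internal
node hangs below the previous one.

```python
def crit_pos(a: str, b: str) -> int:
    # first position where the (terminated) strings differ
    a, b = a + "\0", b + "\0"
    return next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)

keys = ["1" * i for i in range(1, 11)]
print([crit_pos(keys[i], keys[i + 1]) for i in range(9)])
# [1, 2, 3, 4, 5, 6, 7, 8, 9]  -- strictly increasing, so the tree is a path
```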

~~~
beagle3
It isn't really a matter of big-O notation. The big question is the cost of
the different types of comparisons.

The most important thing to remember is that a crit-bit lookup does O(string-
length) "critical" bit lookups, and then an O(string-length) final comparison
- for a total of O(string-length).

Whereas a hash lookup usually does O(string-length) computation of the hash
value, average case O(1) lookups, and then O(string-length) comparison of the
key value - for a total of O(string-length).

A balanced tree does O(log n) comparisons of O(string-length) each (the final
one giving the confirmation, which is an independent step for both hash tables
and crit-bit trees).

This is true even on this specific input. It boils down to how the amortized
single-character comparison compares to a cache miss.

If crit-bit tree leaves are packed in a cache-friendly way (they are constant
size, so this might actually be easy to arrange), they are likely to perform
at least as well as balanced trees.

~~~
jbapple
Good points.

Given a query string of length k and cache lines that store B bits, the crit-
bit trie for the bad input uses Theta(k) computations and causes Theta(k)
cache misses, I think. A balanced tree holding the same set causes Theta(k *
lg n) computations and O(lg n + (k * lg n / B)) cache misses, I think. My
understanding is that lg n / B is frequently much less than 1, even for small
cache line sizes, so that the balanced tree will generally use less I/O than a
crit-bit tree but more computation.

------
Andys
Also known as a radix tree, I believe.

~~~
gsg
Crit-bit trees are not radix trees.

A crit-bit tree stores next-different-bit positions at branches and entire
values at the leaves, and is searched by following branches depending on the
value of the relevant bit in the search key (the "critical bit" which gives
the data structure its name). Once a leaf is reached the entire search key is
compared to the value at the leaf, which is necessary because a key that is
not in the tree might differ at bit positions that are non-critical and would
not have been tested during the tree descent.

This is a very different arrangement to a radix tree, which is essentially a
form of trie.
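The descent described above can be sketched in a few lines (an illustrative
sketch, not agl's or djb's actual code; `bit_at` numbers bits from the MSB of
the first byte):

```python
class Branch:
    def __init__(self, bit, left, right):
        self.bit = bit      # index of the critical bit
        self.left = left    # subtree where that bit is 0
        self.right = right  # subtree where that bit is 1

def bit_at(key: bytes, i: int) -> int:
    byte, off = divmod(i, 8)
    return (key[byte] >> (7 - off)) & 1 if byte < len(key) else 0

def contains(node, key: bytes) -> bool:
    while isinstance(node, Branch):     # test only the critical bits
        node = node.right if bit_at(key, node.bit) else node.left
    return node == key                  # the one full comparison at the leaf

# b"abc" and b"abd" first differ at bit 21, so one branch suffices:
tree = Branch(21, b"abc", b"abd")
print(contains(tree, b"abc"), contains(tree, b"abz"))   # True False
```

Note that searching for b"abz" also walks to the b"abc" leaf, since bit 21 of
b"abz" happens to be 0; only the final comparison rejects it, which is exactly
why that step can't be skipped.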

------
dchest
agl's C translation written in literate programming style:

PDF: <https://github.com/agl/critbit/blob/master/critbit.pdf>

Repo: <https://github.com/agl/critbit>

~~~
pyre
Does 'literate programming style' require the lack of indentation in his
source file?

~~~
DavidHull
The source file is critbit.w. See
<http://en.wikipedia.org/wiki/Literate_programming>.

------
DanWaterworth
I'm not sure it's the best way to store sets. It seems like potentially doing
a pointer dereference per bit is quite an overhead. It would be 8 times faster
(for dense sets) to have a crit-byte tree, though if you did that naively
there would be a huge associated storage cost. To mitigate that, you can do
what AMTs do and have a bitmap per node followed by an array of pointers.
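The bitmap-plus-array node described here (as in Bagwell's array mapped tries)
can be sketched like this; a hedged illustration, not code from any particular
implementation:

```python
class NibbleNode:
    """16-way branch: one 16-bit bitmap plus a dense child array,
    instead of 16 mostly-empty pointers."""

    def __init__(self):
        self.bitmap = 0      # bit i set => a child exists for nibble value i
        self.children = []   # only the existing children, in nibble order

    def get(self, nibble: int):
        mask = 1 << nibble
        if not (self.bitmap & mask):
            return None
        # rank: how many present children precede this nibble
        idx = bin(self.bitmap & (mask - 1)).count("1")
        return self.children[idx]

    def put(self, nibble: int, child) -> None:
        mask = 1 << nibble
        idx = bin(self.bitmap & (mask - 1)).count("1")
        if self.bitmap & mask:
            self.children[idx] = child
        else:
            self.children.insert(idx, child)
            self.bitmap |= mask

n = NibbleNode()
n.put(5, "a")
n.put(3, "b")
print(n.get(3), n.get(5), n.get(7))   # b a None
```

The popcount-of-masked-bitmap trick is what keeps the child array dense while
still giving O(1) indexing per branch.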

~~~
gsg
Crit-bit trees don't branch per bit, they branch per _differing_ bit. Much
more efficient for most data sets.

~~~
DanWaterworth
I know that, thanks.

------
gorset
If you don't need the add and remove operations, it's possible to reduce the
size significantly by applying automata theory:
<http://www.n3labs.com/pdf/lexicon-squeeze.pdf>

It works extremely well "out of the box" for dictionary words, since the
automata enables sharing of both prefixes and suffixes.

~~~
jongraehl
The minimal-acceptor approach is also best when you're considering dynamic
programs that effectively involve some kind of FSA intersection, including
best-first lazy versions like "find the closest spelling correction, within
your lexicon, to a sequence of words in some text", because your state (suffix
set) representation is canonical.

I only skimmed the paper, but a simple O(n log n) approach that produces the
same result is to first sort the lexicon, then build the trie while
canonicalizing the suffix sets depth-first (needing to hash/compare only one
level deep, by induction).
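A toy version of that bottom-up canonicalization (my own sketch, not from the
paper; the "$" end-of-word marker is a convention of this sketch): hash-cons
subtrees so that equal suffix sets collapse into a single state.

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}   # end-of-word marker
    return root

def canonicalize(node, table):
    # assign equal ids to structurally equal subtrees (equal suffix sets)
    key = tuple(sorted((ch, canonicalize(child, table))
                       for ch, child in node.items()))
    return table.setdefault(key, len(table))

trie = build_trie(sorted(["tap", "taps", "top", "tops"]))
table = {}
canonicalize(trie, table)
print(len(table))   # 6 distinct states, versus 12 raw trie nodes
```

Here "tap"/"top" and "taps"/"tops" share their suffixes, so the 12-node trie
collapses to 6 canonical states.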

This guy wrote an entire thesis comparing various methods for the same task:
<http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/adfa.html>

~~~
gorset
Implementing approximate matching has been on my todo list for some time. I'm
planning to either implement
<http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=536247>
or
<http://www.sciencedirect.com/science/article/pii/S157086670400053X>

The data structure for the Mealy recognizer described in the paper can very
easily be changed to support minimal perfect hash numbers for all elements
without using more space. My implementation actually uses less space because
of it, since the same information can be used to calculate the number of
strings with a given prefix in O(1) time.

~~~
jakab922
You might find this interesting as a testbed ->
<http://www.facebook.com/careers/puzzles.php?puzzle_id=17>

------
jongraehl
I couldn't find any benchmarks on the web to substantiate the speed claims.

Still, it's a cool data structure, and I respect djb enough to trust that it's
at least competitive with hash tables; I'll consider it when I want more
operations than offered by a hash table (especially for a set that's very
sparse compared to key length).

------
jakab922
How is this different from a trie? (<http://en.wikipedia.org/wiki/Trie>)

~~~
bumbledraven
It's a particular kind of trie.
<http://en.wikipedia.org/wiki/Trie#Bitwise_tries>

------
willvarfar
The article puts forward a tree as a better alternative to hashtables:

"I have become convinced that this strategy should change. The revised
strategy is much simpler: there should be one fundamental set-storage type,
namely a crit-bit tree. Here's how a crit-bit tree stacks up against the
competition:

* A hash table supports insertion, deletion, and exact searches. A crit-bit tree supports insertion, deletion, exact searches, and ordered operations such as finding the minimum. Another advantage is that a crit-bit tree guarantees good performance: it doesn't have any tricky slowdowns for unusual (or malicious) data."

Oh, what's the big-O of a lookup in a tree vs lookup in a hash-table again?
O(lg N) vs O(1) with bad locality of reference to boot, you say?

FAIL.

~~~
pja
He's not arguing that crit-bit trees are better than hash tables, he's arguing
that they are a better choice for the fundamental set datatype for languages
like python, perl etc etc.

When you write an application, you know the performance requirements of the
set representation you choose & can pick the most appropriate for your
application. A language designer doesn't have this information: their choice
of fundamental datatype affects everyone.

Bernstein is arguing that they should choose a set representation with good
performance for as wide a range of features as possible rather than one that
performs well for a small set of requirements, and very poorly for others.

I can see arguments both ways here, but FAIL is a bit too strong IMO.

~~~
willvarfar
You weaken his rhetoric; he writes like it's crit-bit all the way, for all
purposes.

And since there is not a queue of people complaining that Python's dict is not
iterable in order, I'd say he's wrong. Random access in a dict is the most
common use-case by a very long chalk.

