
Nedtries: an ordered container faster than hash tables and red-black trees - jasonwatkinspdx
http://www.nedprod.com/programs/portable/nedtries/
======
jallmann
This is basically a radix/patricia/critbit tree. Those might be O(1) in the
number of keys but are O(k) in the length of the key. Note this is true for
hashtables as well.

Such trees are pretty excellent in practice; they give hash-like fast lookup
while preserving ordering. Hashtables do have the advantage of avoiding the
cache thrashing you get while traversing the tree, which is probably why
nedtries is best for <10,000 elements. Not to mention the key distribution may
affect tree balance and hence lookup times. Probably an issue when you're
allocating memory from sequential addresses.

I have an unoptimized and under-tested implementation of such a binary trie,
which uses string keys. <https://github.com/j0sh/radixtree>

edit: another big plus to these types of trees: they use a minimum of space,
unlike a hash table.

~~~
saurik
There has actually been work on cache-conscious tries, such as the HAT-trie.

<http://crpit.com/confpapers/CRPITV62Askitis.pdf>

~~~
jallmann
Those are way complicated, though. I read that HAT-trie paper some time ago
and I still don't understand it 100%. Compare this to djb's crit-bit
description which fits nicely in a paragraph and tells you most everything you
need to know.

I suppose it's one of those 80-20 issues: compressed binary tries only have a
few drawbacks, but addressing those requires a ton of effort.

~~~
saurik
FWIW, I don't think nedtrie has path-compression, and it is only level-
compressed in that it has an array at the top that indexes based on leading
zero count (which is likely to lead to a very unbalanced root node in
comparison to just doing level compression up there).

I pointed out the HAT-trie thing because you claimed that that was a bug with
tries, and it is definitely something being worked on. And, as someone who had
spent time implementing this data structure and sounded like they hadn't heard
of that work, you might have found it fascinating ;P.

(Also, "stratified indexes" are a concept that, while technically applying to
most unbalanced structures, was designed for patricia trees and supposedly
leads to good balancing properties. It was designed for disk storage, though,
so it's not quite ideal for in-memory use, but at least it therefore won't be
/ludicrous/ at memory performance if you stick to the page level.)

~~~
liuliu
Since you are here, what do you think would be a suitable data structure for
an associative array keyed on SHA hashes (the first 64 bits)? I implemented a
simple trie structure (no compression at all) to handle this, but I constantly
wonder whether I over-engineered it:
(<https://github.com/liuliu/ccv/blob/release/lib/ccv_cache.c>)

~~~
saurik
Unfortunately, my interest has mostly been at the algorithms level: real-world
implementations are usually dealing with so few elements that you are more
affected by "silly" concerns like cache locality and instruction timing than
anything else.

Therefore, it is really critical to find out how many items you want to be
able to store. If you want to store a couple hundred you are going to do
really well with something incredibly "stupid" like a chaining hash table.

Another important question is regarding your expected concurrency. I mean,
with a really strong random hash function and a data set that reaches a
reasonable "steady state" on the number of objects it contains, hash tables
are going to be ludicrously efficient.

(edit: I meant to say: if you can tolerate the hashtable going offline
entirely during a resize before it reaches steady state, although there are
some algorithms for doing online concurrent resize; memcached actually uses a
concurrent resizable hashtable, from what I remember, and is generally
considered "very fast")

<sidenote> And, SHA is going to be a ludicrously good hash; arguably /too/
good a hash ;P. Out of curiosity, why /are/ you starting with a cryptographic
hash (SHA)? Did your data already come in that form (maybe you are storing
iPhone UDIDs ;P)?

These kinds of hash functions are actually /designed/ to be slow (so you can't
use brute force) and are generally not needed for search structures (as you
hopefully have a plan for organic collisions anyway). ;P </sidenote>

I will say that, before going out and doing a lot of research on this, I was
in the position of building something similar-sounding (but using a fast
64-bit hash), and what I ended up with was similar to an "LPC trie" (which I
luckily discovered quite quickly, and was able to read up on).

<http://www.nada.kth.se/~snilsson/publications/TRASH/trash.pdf>

<http://www.nada.kth.se/~snilsson/publications/Dynamic-trie-compression-experiment/text.pdf>

(However, again back to the hash function: the fact that I was doing the hash
beforehand is not really how most of these algorithms are thought through by
the people who are developing them. Once you use the hash function you are now
storing a fixed size element with a known-random distribution, which makes
"use the first X bits as a hash function and store it in an array" an
irritatingly good choice.)

I also generally recommend keeping a lot of the stuff the HAT-trie guy has
published in mind (I've got it on my "omg must read" list, but had to get back
to actual work). His PhD work was on cache-conscious data structures, and he
simply went through and optimized the performance of everything under the sun
for cache locality.

<http://www.naskitis.com/>

(By the way: I'd love to talk to you further at some point on IRC. I'm on
irc.saurik.com; send a private message, and make it clear who you are in your
initial message: don't just say "u there?" as I get a million of those ;P.)

~~~
stcredzero
_These kinds of hash functions are actually /designed/ to be slow (so you
can't use brute force)_

Not so much. I think they are designed to have good mixing, be impractically
hard to reverse, and have other properties. This just makes them slower than
hash functions generally used in hash tables. My understanding is that both
MD5 and SHA were designed to be efficient under these constraints. bcrypt is
designed to be slow, and is slower than other cryptographic hashes by
ridiculous margins. (Its slowness is parameterized, actually.)

------
jasonwatkinspdx
These are binary tries with some interesting optimizations, in particular
using special instructions supported on all modern CPUs to skip common
prefixes at the start of a lookup.

According to the author's benchmarks they offer very consistent performance
between insert and access operations, which seems to be key to why they
perform better than R-B trees. Since tries don't need to rebalance on insert,
we avoid the memory operations R-B trees spend reorganizing interior nodes.
Even though these are amortized in the R-B tree, we're still saving work.

Likewise, these are faster than small hash tables since we avoid resizing-
rehashing overhead. Searching the binary trie is also likely faster than
following a collision linked list we might find in a naive hash table
implementation. However, for larger containers hashes begin to catch up, since
at these sizes the trie involves increasingly more memory accesses.

I'd like to see these tries compared to generalized cuckoo hashing.

I'd also like to see them compared to the bagwell array mapped tries used by
clojure and others.

~~~
saurik
(For the record, the reason why I have stared at this as much as I have just
now is because I study these kinds of data structures and therefore want to
understand what is new and compelling about this specific one.)

So, tries have the property of being only as balanced as the distribution of
their alphabets. It is very easy to end up in a situation where worst case
performance of a trie is linear in the number of items in the trie. There are
some researchers who were doing work on a concept called "stratified indexes"
that allow you to index into deeply skewed tries that you may find
interesting.

Regardless, the claims in this article are somewhat off. For example: they
state that red-black trees "write out a lot of memory". Any full binary tree
with N leaf nodes has N-1 non-leaf nodes. That statement is just as true of
red-black trees as it is of this data structure. It is definitely true that
the STL red-black tree implementation is inefficient (allocating memory for
empty nodes), but that should not be held against the algorithm.

The documentation is even weirder. "Unlike either red-black trees or most hash
tables, nedtries can store as many items with identical keys as you like."
That is certainly not true, and isn't even true of the STL, which has a
red-black tree without a uniqueness restriction, called std::multimap.

The documentation also claims that hashtables can store arbitrarily large keys
and nedtries can't, which is only as true of hashtables as it is for nedtries.
As hashtables aren't ordered, and in fact require fixed-length keys, you can
(and do: normally this statement would be going in the other direction of
causality) use hash algorithms to hash the information to fit into the key
length and then verify the value when you find it. The same thing could be
used for nedtries.

I'm also having a difficult time understanding why red-black trees are not an
"in place algorithm" and nedtries are. I'm not even certain I understand what
it would mean for a search tree to be "in place", but the documentation makes
claims about "dynamic memory allocation" being its key distinction, and I see
no reason why a red-black tree need require that for search or deletion.
Insert requires a single allocation, but so does a trie; and, even with the
STL most people use a custom pool allocator for that purpose.

To be clear, I am not at all stating that "nedtries aren't faster than other
implementations", just that that has nothing to do with their algorithm. The
algorithm seems to just be an array of tries, where the array is designed to
help with the common case of "there are a large number of zeros at the top of
the key" (caused by using small numbers as keys).

Unfortunately, that makes the distribution among the bin-level tries
drastically uneven, making that first "which bucket do I end up in" question
severely unbalanced.

That specific optimization, then, doesn't seem to offer anything over a
patricia trie, which itself would dramatically decrease the number of non-leaf
node traversals (as nedtries does not describe using any kind of path
compression at all). These path-compressed tries are what people normally use
for high-performance lookup like routing.

~~~
jasonwatkinspdx
I agree that his prose is a little confusing. Here's my understanding of his
intent:

"red-black trees write out a lot of memory"

I think this refers to how R-B trees require rebalancing on insert,
particularly accessing siblings of ancestors. This is doubly bad on modern
processors, as it introduces a chain of control hazards dependent on these
accesses. That can burn a lot of cycles on cache misses. In contrast, the trie
structures only access nodes along the direct line down to the appropriate
leaf. This has a control hazard (i.e., when to end the loop) and a data
dependency between iterations. The control hazard can be easily predicted
(assume search recurses) and modern processors are quite good at unrolling the
data hazards across the shadow registers (or equivalent). Basically, modern
processors and compilers are much happier with the sort of inner loops you see
in tries vs restructuring trees (and are happiest with the inner loop of
accessing a hash table).

"The documentation also claims that hashtables can store arbitrarily large
keys and nedtries can't, which is only as true of hashtables as it is for
nedtries."

I believe his implementation assumes size_t-sized keys. This limit comes from
his root table, which needs as many entries as bits in the key. While you
could code up a variant that used arbitrary length keys I think you'd lose
most of the performance advantages he measures (remember, we're mostly talking
about constant factors here, not complexity class). He doesn't care about this
limit, as his primary application is storing free lists in an allocator, which
will always have size_t keys.

"I'm also having a difficult time understanding why red-black trees are not an
"in place algorithm" and nedtries are."

I'm also confused by this, but I'm assuming he's looking at concurrent access.
Tries only ever push down toward leaves, so with careful ordering it's
possible to ignore many locking concerns. Or put a different way, you're only
ever allocating new nodes or filling in previously nil slots in existing
nodes. Trees with structural rebalancing on insert, like R-B, could invalidate
a concurrent traversal with a rotation.

"That specific optimization, then, doesn't seem to offer anything over a
patricia trie, which itself would dramatically decrease the number of non-leaf
node traversals (as nedtries does not describe using any kind of path
compression at all)."

I think for his purposes there are some assumptions about distribution: namely
if he's storing memory addresses in a nedtrie, it's likely he's allocating
them with brk() or the like, and so they're clustered in one or two slots of
the root table. Likewise, if he's storing block sizes, then the root table
acts as a segregated fit similar to dlmalloc and its many offspring.

I'd agree that the way he overgeneralizes the advantages is a bit confusing,
but I do think there's something interesting here vs vanilla patricia tries.

~~~
saurik
Re: assumptions about distribution

But if the assumption is that they are all using a couple slots of that root
table, then the root table is not a performance advantage. It would be faster
to just start the trie immediately and switch off the first bit.

The point of having a structure like that at the top is that it is a cheap-o
form of level compression, and increases the fan-out of the root node,
allowing you to cut the search space down much faster given a single very
cheap lookup.

As for segregated fit, the bucket array itself is probably actually good
enough for that. What /could/ make that interesting is if you /also/ stated
that "small" (many-leading-zero) keys happened to be much more prevalent, in
fact twice as prevalent as those in the next bucket up.

Otherwise, it is just further unbalancing his trie and gets roughly the same
tradeoff as if he were using a patricia trie (which will use path compression
to eat large numbers of 0s and will otherwise be roughly as unbalanced,
without the cost of the bucket array, which is going to be many internal
nodes' worth of memory).

~~~
saurik
In fact, it is even worse than that: in the case of random keys (which he
admits in the documentation is important for trie balance without helping the
user obtain them) he'd be much, much better off just chopping the first five
bits off the key and using that to index into his top-level array.

Doing that would let the top-level array segment the search space perfectly
into 32 sub-buckets, avoiding the problem that the higher-order array elements
are increasingly meaningless (as they are exponentially smaller).

At that point, however, the data structure is now a level-compressed trie ;P.
(LPC tries seem to be the state of the art for this kind of structure.)

------
beagle3
djb's "critbit" trees are much more interesting, I think: They are basically
binary patricia trees (like nedtries, if I understand correctly). However,
djb's observation is that in order to locate a specific string, you only have
to look at the "critical" bits - the bits that are different between different
entries in the database.

So, for example, if key 1 differs from key 2 (top two keys in the tree) in bit
317, you check bit 317, and you know which one of them it is NOT (and also,
which branch of the tree it is not). Proceed recursively -- most keys will, in
practice, be identified by log2(n) bits where n is the number of keys in the
table, NOT the length of the key.

~~~
saurik
For the record, patricia tries in the literature are defined to have this
"critbit" property. The wikipedia article on that subject is a little broken,
but djb's article is correct when it states:

"This idea was introduced by Morrison in 1968 under the name PATRICIA, and
independently by Gwehenberger at about the same time."

~~~
liuliu
Also, I found this link to be a good read for anyone interested in the
implementation details: <https://github.com/agl/critbit/blob/master/critbit.w>

------
baddox
I like the concept, but the author haphazardly combines asymptotic analysis
with unsubstantiated performance benchmarks and uses some imprecise language.

What does "nearly O(1) complexity" mean? Why does he mention a complexity of
"O(8 * sizeof(void *))"? Shouldn't that constant factor be collapsed?

