
Aguri: Coolest Data Structure You've Never Heard Of - iamelgringo
http://www.matasano.com/log/1009/aguri-coolest-data-structure-youve-never-heard-of/
======
codeslinger
The point of the data structure in this case was not the radix trie alone, but
rather the use of tries in concert with the aggregation characteristics of IP
traffic. This type of innovative combination is common
in the network world due to the necessity of providing ever more introspection
and functionality over ever higher throughputs and ever lower latencies. For
some serious treatment of the subject, check out the book "Network
Algorithmics" by George Varghese.
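
A toy sketch of that combination (illustrative only - the structure below is
nothing like Aguri's actual C implementation): count addresses in a binary
trie, then fold subtrees whose totals fall below a threshold up into shorter
prefixes, the way Aguri aggregates low-traffic IP prefixes.

```python
class Node:
    def __init__(self):
        self.children = {}   # bit (0/1) -> Node
        self.count = 0

def insert(root, addr, bits=8):
    """Walk the address MSB-first, counting at the leaf (8-bit toy addresses)."""
    node = root
    for i in range(bits - 1, -1, -1):
        node = node.children.setdefault((addr >> i) & 1, Node())
    node.count += 1

def aggregate(node, threshold, prefix="", out=None):
    """Post-order fold: subtrees whose totals fall below `threshold` push
    their counts up into the enclosing prefix instead of being reported."""
    if out is None:
        out = []
    total = node.count
    for bit, child in sorted(node.children.items()):
        total += aggregate(child, threshold, prefix + str(bit), out)
    if total >= threshold or prefix == "":
        if total:
            out.append((prefix or "*", total))
        return 0
    return total
```

Feeding in three hits for one address and a single stray hit elsewhere, with
threshold 2, reports the hot address exactly and lumps the stray into the
catch-all `*` prefix.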

~~~
tptacek
This is an awesome, awesome, awesome book. Aguri isn't in it; aguri doesn't
even _rate_ for this book. That's how awesome this book is.

~~~
anupamkapoor
network-algorithmics is just soooo cool. really, what you must be reading
after stevens.

------
Xichekolas
He says binary trees are a better default search table than hash tables...
binary tree search is O(lg n) while hash table lookup is O(1)... so hash
tables are better for search, and binary trees win in other ways (like ordered
traversal if you need that), but no way you could say one is _always_ better
than the other.

Also confused how he proposes to build his Radix Trie in less than linear
time... just seems like he ditches his original premise and launches into
routing. There are selection algorithms that will find the k-th element of an
unsorted array in O(n)... but pretty sure building a tree of any sort is not
part of it.

Radix Tries are sweet fun... so is just good old Radix Sort, but he starts
with something that doesn't really make sense for the topic (the unsorted
array he wants to pull three out of), then makes some weird claims, then jumps
into routing with Radix Tries... not sure what is insightful here.

~~~
tptacek
Hash tables are NOT O(1). You just failed the job interview.

You're like the 4th person I've seen who read (some) of this article and
pulled this tree vs. hash thing out of it. It's a throwaway point. Binary
trees provide comparable performance to hash tables, but also provide
efficient access to ranges of keys. Binary trees support more operations
efficiently than hash tables.
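
A quick Python illustration of that range point (a sorted list with `bisect`
standing in for a tree's in-order key access; a hash table has no analogous
operation short of a full scan):

```python
import bisect

keys = sorted(["apple", "banana", "cherry", "date", "fig", "grape"])

def range_query(keys, lo, hi):
    """All keys k with lo <= k < hi, in order: O(log n + m)."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_left(keys, hi)
    return keys[i:j]

print(range_query(keys, "b", "e"))   # ['banana', 'cherry', 'date']
```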

The article isn't about some war I've declared on hash tables.

~~~
Xichekolas
_"tree vs. hash thing ... It's a throwaway point"_

If you had read more than (some) of my comment, you might have gathered that
this was _my entire point_. The first page of your post was a throwaway
point... nothing to do with your actual point, which seemed to be Radix Tries.
All I was saying is that it would've been nice to dispense with the
distraction.

I know that given a lot of collisions, hash tables degenerate into linear
insertion and lookup... and I did point out that binary trees are better for
several things, but I still maintain that you can't just outright claim one is
better than the other _for every situation_.
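
That degenerate case is easy to demonstrate (the key class here is
illustrative, not a real attack - CPython randomizes string hashes precisely
to prevent this):

```python
class CollidingKey:
    """Every instance hashes to the same bucket, so dict operations on
    these keys degrade from O(1) toward O(n)."""
    def __init__(self, v):
        self.v = v
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return self.v == other.v

n = 500
bad = {CollidingKey(i): i for i in range(n)}   # ~n^2/2 comparisons to build
assert bad[CollidingKey(n - 1)] == n - 1       # each lookup now scans the chain
```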

Oh, and as for failing the job interview... not really sure why that was
necessary. You don't make a stronger point by being a dick about it.

~~~
tptacek
The hash table O(1)/O(n) question is a classic job interview gotcha. I didn't
mean to be a dick.

~~~
Xichekolas
... and I shoulda been nicer in my original comment. Applying radix tries to
routing was pretty interesting. Sorry to pick nits.

------
mojuba
Radix tries are nice. Use them when

(1) you need fast lookup and fast inserts

(2) you need to preserve the lexicographic order of your elements - something
that a hash table won't give you

(3) you are afraid of DoS attacks or maybe just "bad" data - hash tables can
be attacked easily if the implementation is known

(4) you don't care about memory usage that much.

I compared a simple 8-bit radix tree with a standard hash table
implementation - the former took roughly ten times more memory. I then changed
my radix tree to be based on 4 bits (each char is just split into 2 parts) and
the memory usage halved. Now I'm wondering if radix tries have more room
for improvement.
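
A back-of-envelope version of that experiment (my sketch, not mojuba's code):
halving the bits consumed per step doubles the path length but shrinks each
node's child table from 256 slots to 16, which is where the savings come from.

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self, fanout):
        self.children = [None] * fanout

def build(words, bits_per_step):
    """Insert byte strings, consuming `bits_per_step` bits per trie level;
    returns (node count, total child-pointer slots) as a crude size proxy."""
    fanout = 1 << bits_per_step
    root = TrieNode(fanout)
    nodes = 1
    for w in words:
        node = root
        for byte in w.encode():
            for shift in range(8 - bits_per_step, -1, -bits_per_step):
                sym = (byte >> shift) & (fanout - 1)
                if node.children[sym] is None:
                    node.children[sym] = TrieNode(fanout)
                    nodes += 1
                node = node.children[sym]
    return nodes, nodes * fanout

words = ["cat", "car", "cart", "dog"]
print(build(words, 8))   # fewer, fatter nodes
print(build(words, 4))   # twice the depth, but far fewer total slots
```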

So this is something to consider.

And thanks for the article - never heard of Aguri, it's beautiful.

Upd. sorted lists basically give you the same advantages, except for fast inserts

~~~
tptacek
Compressed binary radix tries (also known as PATRICIA trees) give you
excellent memory usage characteristics and pretty much optimal search time
characteristics: both wins come from only storing the edges corresponding to
significant bit differences.
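
A hedged sketch of that structure in Python (crit-bit style, one internal node
per distinguishing bit; the names are mine, and it assumes keys are not
distinguished only by trailing zero bytes):

```python
class Leaf:
    def __init__(self, key):
        self.key = key

class Inner:
    def __init__(self, bit, left, right):
        self.bit = bit      # index of the critical (first differing) bit
        self.left = left    # subtree where that bit is 0
        self.right = right  # subtree where that bit is 1

def get_bit(key, i):
    byte, off = divmod(i, 8)
    if byte >= len(key):
        return 0            # bits past the end of a key read as 0
    return (key[byte] >> (7 - off)) & 1

def lookup(node, key):
    if node is None:
        return False
    while isinstance(node, Inner):      # inspect only the critical bits...
        node = node.right if get_bit(key, node.bit) else node.left
    return node.key == key              # ...then one full compare at the leaf

def insert(node, key):
    if node is None:
        return Leaf(key)
    probe = node                        # walk to the nearest leaf
    while isinstance(probe, Inner):
        probe = probe.right if get_bit(key, probe.bit) else probe.left
    if probe.key == key:
        return node
    crit = next(i for i in range(8 * max(len(key), len(probe.key)))
                if get_bit(key, i) != get_bit(probe.key, i))
    def place(n):                       # descend to where the new bit belongs
        if isinstance(n, Inner) and n.bit < crit:
            if get_bit(key, n.bit):
                n.right = place(n.right)
            else:
                n.left = place(n.left)
            return n
        new_leaf = Leaf(key)
        if get_bit(key, crit):
            return Inner(crit, n, new_leaf)
        return Inner(crit, new_leaf, n)
    return place(node)
```

Note that lookup touches one node per stored bit *difference*, not one per key
bit, which is the memory and search win described above.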

Since hash tables by necessity have to allocate enough storage to keep table
load down and minimize collisions, you could find a PATRICIA tree has even
better memory footprint than whatever hash table you use.

Of course, all data structures share the DoS problem. There's an input to a
PATRICIA tree that pessimizes lookup and memory usage: the one that creates a
full 32 bits of fanout for every key. In all cases, the attack degrades
performance to linear search.

Of all the "mainstream" data structures, balanced binary trees are probably
the hardest to attack this way.

~~~
mojuba
_both wins come from only storing the edges corresponding to significant bit
differences._

Of course I used compressed suffixes, but it didn't help much as compared with
hash tables.

 _you could find a PATRICIA tree has even better memory footprint than
whatever hash table you use_

I don't think you can demonstrate that PATRICIA has a better memory footprint
than a good hash table. In fact, even the best radix trees are much worse.

 _Of course, all data structures share the DoS problem._

You can attack radix trees in terms of memory usage, while hash tables are
prone to serious performance attacks.

 _In all cases, the attack degrades performance to linear search._

Not true. I don't think you understand these algorithms and O(n) complexity
very well.

How a hash table degrades when attacked depends heavily on how exactly
conflict resolution is implemented. One method, for example, is expansion and
rehashing, and you can't tell for sure whether it degrades to linear search or
not.

And radix never degrades to linear search - never.

 _Of all the "mainstream" data structures, balanced binary trees are probably
the hardest to attack this way._

They are prone to a very specific attack where you make the tree re-balance
often.

~~~
tptacek
You're ignoring table load factor in your memory comparison; you can attack
any dictionary "in terms of memory"; I conceded the O(n)/O(k) lookup thing
before you wrote this; and why do you think "balancing" a red-black tree is
expensive?

~~~
mojuba
_You're ignoring table load factor in your memory comparison_

I believe I'm not ignoring anything - I'm measuring the precise memory usage
of two identical programs doing the same thing but based on different
algorithms. I also believe I got the most out of radix in my tests. The 4-bit
version wasn't trivial, but it was reasonably fast and its memory usage was
half that of the 8-bit one, as I said before.

But even speculatively - on paper, if you wish - radix consumes more memory
than a good hash table implementation.

Upd. forgot to mention, I did some more optimizations as well, like
compressing the pointer tables in tree nodes.

~~~
tptacek
I'm sorry, you're talking about a totally different algorithm than I am either
here or in the article you're responding to. There's no such thing as a "4 bit
PATRICIA" trie. Obviously, if you want to burn memory, you can increase the
arity of your tree nodes to "speed up" (heh) fetches.

------
garret
D. J. Bernstein uses these too.

<http://cr.yp.to/critbit.html>

------
RajendranB
The time-space trade-off between hash tables and binary trees isn't touched
upon in any of the above. Hash tables require contiguous memory locations,
hence allocate from the stack, and so won't work for large data sets; binary
search trees and their derivatives allocate memory from the heap and take more
time (O(log n)) for get and put than the constant O(1) of hash tables.

~~~
bayareaguy
It sounds like you're making a lot of assumptions about the storage hierarchy
and addressing scheme.

------
hisham
Wow reading this post was like replaying one of my interviews with Google...

------
dfranke
Knuth has a pretty good treatment of tries in volume 3.

------
jwp
Suffix array.
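
Unpacking the two-word answer: a suffix array is just the start positions of a
string's suffixes, sorted by the suffixes they begin. The naive build below is
O(n^2 log n) and only for illustration; production builders run in O(n) or
O(n log n).

```python
def suffix_array(s):
    """Indices of all suffixes of s, in lexicographic order of the suffix."""
    return sorted(range(len(s)), key=lambda i: s[i:])

print(suffix_array("banana"))   # [5, 3, 1, 0, 4, 2]
# i.e. 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
```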

~~~
andrewf
A couple of years back, googling for "suffix array" got you advertisements for
jobs@google. I'm not sure whether Google HR ended the program, or just
migrated to more obscure search terms :)

~~~
apathy
Burrows-Wheeler Transform maybe?

One of the coolest compression algorithms evar...

~~~
eru
It's no compression algorithm in its own right. Just a first step.

An interesting first step.
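
The forward transform fits in a few lines (naive O(n^2 log n) rotation sort,
for illustration; bzip2 then feeds the output through move-to-front,
run-length, and Huffman stages, which is where the actual compression
happens):

```python
def bwt(s, sentinel="\0"):
    """Burrows-Wheeler Transform: sort all rotations of s + sentinel and
    take the last column. Equal characters cluster in the output, which
    is what the later compression stages exploit."""
    s = s + sentinel                   # unique terminator makes it invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(repr(bwt("banana")))   # 'annb\x00aa' -- the a's and n's cluster
```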

