Here's what my implementation from some time ago looks like:
data Color = Red | Black deriving (Show, Eq)

data Tree a = Nil | Tree Color (Tree a) a (Tree a) deriving Show

insert :: Ord a => a -> Tree a -> Tree a
insert x t = case ins x t of
  Tree _ l y r -> Tree Black l y r
  Nil          -> Nil  -- unreachable: ins never returns Nil

ins :: Ord a => a -> Tree a -> Tree a
ins x Nil = Tree Red Nil x Nil
ins x t@(Tree c l y r) = case compare x y of
  EQ -> t
  LT -> balance $ Tree c (ins x l) y r
  GT -> balance $ Tree c l y (ins x r)

balance :: Tree a -> Tree a
balance (Tree Black (Tree Red (Tree Red a x b) y c) z d) = balanced a b c d x y z
balance (Tree Black a x (Tree Red b y (Tree Red c z d))) = balanced a b c d x y z
balance (Tree Black a x (Tree Red (Tree Red b y c) z d)) = balanced a b c d x y z
balance (Tree Black (Tree Red a x (Tree Red b y c)) z d) = balanced a b c d x y z
balance t = t

balanced :: Tree a -> Tree a -> Tree a -> Tree a -> a -> a -> a -> Tree a
balanced a b c d x y z = Tree Red (Tree Black a x b) y (Tree Black c z d)
The AVL has a fuller implementation:
It's significantly more complex, but it gives a type-level/compile-time guarantee that the RB tree invariants are kept, which is nice.
However, there are several things I don't like about it.
1) There are still specific cases that you have to memorize. The cases for treaps are always the obvious ones: comparing priorities and comparing keys.
2) It obscures the connection to B-trees, which is what leads one to derive all of the special cases after all.
3) Maybe I'm dumb, but the running time guarantees aren't obvious.
4) If you're doing rbtrees in Haskell, you might as well enforce your invariants with the type system.
You mean 2-3-4 trees rather than general B-trees. Anyway, that is true of all implementations of red-black trees I have seen, and intentionally so. When you consider the number of possible parent/child configurations when splitting and merging 2-3-4 nodes, there's a whole lot of them. By expanding a 2-3-4 node as a cluster of red-black nodes, the local combinatorics are more manageable.
Even the presentations of red-black trees in textbooks (e.g. CLRS) usually obscure the connection with 2-3-4 trees, with one exception being Sedgewick's own book. That's too bad.
> 4) If you're doing rbtrees in Haskell, you might as well enforce your invariants with the type system.
If you're going to do that, it's probably easier to use the isomorphism with 2-3-4 trees and work with that. There's a standard trick for representing trees with the 2-3-4 invariant using non-regular data types:
data Tree a = Zero a | Succ (Tree (Node a))
data Node a = Node2 a a | Node3 a a a | Node4 a a a a
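To make the depth-tracking concrete, here is a small self-contained sketch (the declarations are repeated so it stands alone; example, toList, and nodeToList are just illustrative names). A value of Tree a is some number of Succ wrappers around a Zero, and that number fixes the depth of every leaf, so an unbalanced tree doesn't even typecheck:

```haskell
-- Non-regular (nested) encoding of 2-3-4 trees: the number of Succ
-- constructors records the tree depth, so every leaf necessarily sits
-- under the same number of Node layers.
data Tree a = Zero a | Succ (Tree (Node a))
data Node a = Node2 a a | Node3 a a a | Node4 a a a a

-- A depth-2 tree over Int: two Succ wrappers, hence two layers of Node.
example :: Tree Int
example = Succ (Succ (Zero (Node2 (Node2 1 2) (Node3 3 4 5))))

-- In-order flattening: peel one Succ, flatten the inner tree of nodes,
-- then expand each node. Note the polymorphic recursion (toList is
-- called at type Tree (Node a)), which is why the signature is required.
toList :: Tree a -> [a]
toList (Zero a) = [a]
toList (Succ t) = concatMap nodeToList (toList t)

nodeToList :: Node a -> [a]
nodeToList (Node2 a b)     = [a, b]
nodeToList (Node3 a b c)   = [a, b, c]
nodeToList (Node4 a b c d) = [a, b, c, d]
```

Insertion in this encoding is where the real work is, of course; the point here is only that the balance invariant comes for free from the type.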
But it's not clear to me that there's much of a practical win in encoding the balance invariant in the type system here.
This continues to make me sad. My original (ugrad) algorithms class was taught from Sedgewick's book, and I always found RB trees to be elegant and conceptually simple (including the insertion and deletion cases); when I got to grad school, where CLR was the textbook of choice (and RB trees were actually taught in an earlier course anyway), man oh man was there a lot of wailing and gnashing of teeth over having to memorise the complicated deletion cases. Really? I just rederived them from scratch each time, and I was faster, and I was right.
It's a serious disservice to teach someone the red-black tree algorithms without first teaching 2-3-4 trees and then teaching the isomorphism, and it's a shame that so many textbooks haven't figured this out.
It probably doesn't anymore, but I'd guess that getting type-level guarantees still makes it a practical win.
Also, guarantees about correctness are easier to get from tools like QuickCheck. Balance errors will be harder to spot.
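For what it's worth, the two red-black invariants are easy to state as plain predicates of the kind QuickCheck could drive (a sketch against the Tree type from the code upthread, redeclared here so it stands alone; noRedRed and blackHeight are made-up names):

```haskell
-- Sketch: the two red-black invariants as predicates. With QuickCheck
-- one would generate random insertion sequences and assert that both
-- hold on the resulting tree.
data Color = Red | Black deriving Eq
data Tree a = Nil | Tree Color (Tree a) a (Tree a)

-- Invariant 1: no red node has a red child.
noRedRed :: Tree a -> Bool
noRedRed Nil = True
noRedRed (Tree c l _ r) =
  (c /= Red || (color l /= Red && color r /= Red)) && noRedRed l && noRedRed r
  where color Nil              = Black
        color (Tree c' _ _ _)  = c'

-- Invariant 2: every root-to-leaf path has the same number of black
-- nodes. Returns the common black height, or Nothing if paths disagree.
blackHeight :: Tree a -> Maybe Int
blackHeight Nil = Just 1
blackHeight (Tree c l _ r) = do
  hl <- blackHeight l
  hr <- blackHeight r
  if hl == hr then Just (hl + if c == Black then 1 else 0) else Nothing
```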
It should be noted that this puts a Theta(lg n) singly-linked list of Succ nodes at the top of the tree, thus doubling the effective height of the tree.
I suspect treap delete is simpler.
To avoid that issue, Tarjan and Werneck designed self-adjusting trees so the user doesn't have to understand implementation details while still having access to an effective data structure. There is a free implementation (with relevant papers linked), so if you don't want to reinvent the wheel, have a look :)
(A red-black tree is a plain-old binary search tree for everything but modifications.)
edit: For the actual comparison between the three: treaps give no worst-case guarantees, only expected ones. Red-black trees sacrifice some lookup speed in favor of insertions.
Do the logarithmic update times in top trees take into account the time to locate the node to update?
Anyway, when it comes to simplicity, I will admit it is much more appealing than red-black trees and their 5 billion cases for insert and delete - none of them unreasonable, true, but I didn't like debugging that thing at all the first time I attempted to implement it. So, yeah, that's something you can reliably implement when you don't have time or can't be bothered to look up the ridiculously complicated stuff and you just want to dish out a fast data structure that you know works (binary heap - 1, suffix tree - 0, treap - 1, RBTree - 0).
As for the problem, it should be quite simple if you just keep tabs on the element count in the left and right subtrees, which is a small price to pay (under a few million elements; and even if you're working with hundreds of millions - suck it up and buy an extra gig of RAM). Something like this pseudocode (hopefully with added error checking):
pos n root =
  if root.right.count >= n then
    pos n root.right                          -- nth largest is in the right subtree
  else if root.right.count + 1 == n then
    root                                      -- root itself is the nth largest
  else
    pos (n - root.right.count - 1) root.left  -- skip the right subtree and the root
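A runnable version of the same idea in Haskell might look like this (a sketch with illustrative names - STree, nthLargest - and with balancing omitted; the Int field caches the subtree size):

```haskell
-- Order-statistic lookup on a BST whose nodes cache their subtree size.
data STree a = SNil | SNode Int (STree a) a (STree a)

size :: STree a -> Int
size SNil            = 0
size (SNode s _ _ _) = s

-- Smart constructor that maintains the cached size.
node :: STree a -> a -> STree a -> STree a
node l x r = SNode (size l + 1 + size r) l x r

-- n-th largest key, 1-indexed: descend by comparing n against the
-- size of the right subtree at each node.
nthLargest :: Int -> STree a -> Maybe a
nthLargest _ SNil = Nothing
nthLargest n (SNode _ l x r)
  | n <= size r     = nthLargest n r
  | n == size r + 1 = Just x
  | otherwise       = nthLargest (n - size r - 1) l
```

Each step discards one subtree, so the cost is proportional to the height of the tree, just like an ordinary lookup.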
Another benefit of having the node counts -- and this, I think, was the original motivation -- was that it made the tree "binary" not only for looking up a particular key, but also for looking up the Nth key-ordered entry (which was important in the application).
In essence, the treap is the same approach to balancing as the skip list. What's wrong with AVL trees?
edit: that said, if you do have a genuine need to prioritize some nodes, the observation that the BST property does not preclude the heap property is a very nice one.
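That observation is essentially all there is to treap insertion: keep keys in BST order, keep randomly chosen priorities in max-heap order, and restore the latter with single rotations on the way back up. A sketch, with illustrative names and with priorities passed in explicitly rather than drawn from an RNG:

```haskell
-- A treap node carries a priority (Int) alongside its key.
data Treap k = TNil | TNode Int (Treap k) k (Treap k)

-- Insert by key as in a plain BST, then rotate the new node upward
-- whenever its priority beats its parent's.
insertT :: Ord k => Int -> k -> Treap k -> Treap k
insertT p k TNil = TNode p TNil k TNil
insertT p k t@(TNode q l y r) = case compare k y of
  EQ -> t
  LT -> rotR (TNode q (insertT p k l) y r)
  GT -> rotL (TNode q l y (insertT p k r))

-- Rotate only when the child's priority beats the parent's (max-heap).
rotR, rotL :: Treap k -> Treap k
rotR (TNode q (TNode p a x b) y c) | p > q = TNode p a x (TNode q b y c)
rotR t = t
rotL (TNode q a x (TNode p b y c)) | p > q = TNode p (TNode q a x b) y c
rotL t = t
```

With priorities drawn uniformly at random, the resulting shape is distributed as if the keys had been inserted into an unbalanced BST in random order.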
Ignoring the simplicity argument, treaps can perform some other operations (local updates and nearby searches) in optimal expected time (see theorems 3.1 and 3.2):
> As long as it's only probabilistically balanced, the only thing it offers beyond random binary search trees is unpredictability in the face of ordered input.
The randomness treaps use should be independent of the input keys, so ordered input should produce a treap with a shape drawn from the same distribution as the one for unordered input.
That said, their definition of optimal is big-O, which is unimpressive and I believe also reached without any balancing, provided the input has no duplicates.
There are balanced search trees with shorter maximum height, but they often take asymptotically longer for local operations (finger update and finger search).
> which is unimpressive
What do you mean?
> I believe also reached without any balancing
I don't know what this means in context. Treaps are randomly balanced, making them have height O(lg n) with high probability.
> provided the input has no duplicates.
Duplicate keys are handled properly. The problem of duplicate priorities is discussed in the paper.
I believe, but haven't verified, that expected logarithmic height is achieved just by randomly permuting the input sequence and not doing any subsequent balancing - so the treap approach is simply deferring the permuting to insertion time.
I can't think of a search tree with worse finger search performance than the treap - can you expand on that?
That is false. (edit: What I mean is that your belief about the mechanism of randomization is false. If you look at the paper I linked to above, you will see that the authors explain that the randomness must be independent of the input key values.)
> I can't think of a search tree with worse finger search performance than the treap - can you expand on that?
I'm not sure anymore, but I stand by my "asymptotically worse performance on finger updates" claim.
Sorry, I'm not following. Does the treap balancing mechanism not result in a tree equivalent to what one would get permuting the input sequence with "order by rand()"? I'm pretty sure it does. That has nothing to do with the input key values.
> I stand by my "asymptotically worse performance on finger updates" claim.
Excellent. Can you give me an example of that, then? I've never read an analysis of finger updates, but I'm willing to tentatively accept better expected-case performance. I'm not able to handwave my way to anything asymptotically better, though.
Yes, it does, but I was responding to your comment that "expected logarithmic height is achieved just by randomly permuting the input sequence". It achieves the effect of having randomly permuted the input sequence, but it does not actually perform input sequence permutation.
> Can you give me an example of that, then?
Any tree whose nodes store the number of descendants (weight-balanced search trees, for instance) takes Omega(lg n) for finger updates. See also "Heterogeneous Decomposition of Degree-Balanced Search Trees and Its Applications" by Shan Leung ("Maverick") Woo, section 1.1.4, and "Robust balancing in B-trees" by Scott Huddleston and Kurt Mehlhorn. They both mention that the original flavors of B-tree have sequences of k insertion and deletion operations where the update costs alone are omega(k).
Another example is repeated insertion and then deletion of a single element in a full binary tree viewed as an AVL tree. Node height (or even single bit height-balance information like "which child is heavier") propagates all the way to the root for both operations.
Let me compare treaps and AVL trees: Both insert at the same point, then proceed to rebalance. The treap will rotate upwards until p_new>p_parent. The AVL will update balance information upwards until bal_parent%2==1, then update and possibly rotate once.
That's a couple of conditional operations extra, and field operations instead of tree operations. The final step in particular is less elegant, but straggling cleanup operations are everyday things and well worth the better performance, to me.
This is because it associates keys with values. Standard binary trees usually skip the values.
I wonder why BST are not implemented as a standard binary tree plus a hash table. I suppose the trade off depends on the size of the values.
> This is because it associates keys with values. Standard binary trees usually skip the values.
Don't think so. A Binary Search Tree is a Binary Tree in which an inorder traversal goes through the data -- keys, in this case -- in sorted order. Of course, usefulness dictates that keys typically have associated values, but they do not have to be there for it to be a BST. (Why have keys with no associated values? That's one way to represent a set.)
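A minimal sketch of that, with illustrative names and no balancing: a BST over bare keys used as a set.

```haskell
-- A BST storing only keys: no payloads, so it represents a set.
data BST a = Leaf | Branch (BST a) a (BST a)

insertKey :: Ord a => a -> BST a -> BST a
insertKey x Leaf = Branch Leaf x Leaf
insertKey x t@(Branch l y r) = case compare x y of
  LT -> Branch (insertKey x l) y r
  GT -> Branch l y (insertKey x r)
  EQ -> t                      -- already present: it's a set

member :: Ord a => a -> BST a -> Bool
member _ Leaf = False
member x (Branch l y r) = case compare x y of
  LT -> member x l
  GT -> member x r
  EQ -> True

-- Inorder traversal yields the keys in sorted order: the defining
-- property of a BST, values or no values.
inorder :: BST a -> [a]
inorder Leaf = []
inorder (Branch l x r) = inorder l ++ [x] ++ inorder r
```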
A binary tree is good when, on average, each node has the highest probability of cutting its set into a sought part and a not-sought part. Here is a draft of mine of such a binary tree, but I honestly don't know if it is right :p
As for your heuristic, I'm not clear on your meaning. It's true that the "best" greedy top-down heuristic for a BST is to cut at the conditional median - but whether this is the same as a datum partitioning the sought from the unsought depends on more detail, especially since we're now talking about range searches.
Also, provided you have the distribution at hand and are willing to write a lot of hairy code, you can reduce the average path length by less than 1 (yay).
Also, in my paper I make some basic hypotheses: the data are points, with no known distribution, and the search range is convex in the point space. Something like a textbook case.
Under these hypotheses, which are quite general, balancing the tree (as an octree does) is a serious misconception carried over from BSTs.
It would be difficult for me to explain why here, but it's not really hard to see.
Take this example under my hypotheses: you have a logarithmic distribution of points in 1D (though you don't know it in advance) and you are searching for points in random continuous ranges; then the best tree is a totally unbalanced tree. And note that my heuristic will give you a totally unbalanced tree. If you have regularly distributed points, my heuristic will give you an octree.
One day, I hope, I will find a damn good mathematician to prove that the K-Means Tree is the best tree under these general hypotheses. :p Or maybe I'm wrong =)
What do you mean by a totally unbalanced tree? For an exponential distribution, a complete tree would naturally be suboptimal. The real trouble, as I understand it, with k-trees using the median heuristic is that there is no total ordering in more than one dimension, barring special cases, and so binary partitionings can no longer keep all sought nodes inside them.
If your larger point is that assuming a uniform distribution in octrees (instead of using a balanced tree structure) is vulnerable, then yes, I agree fully. If not, I'll conclude that you're talking about some sort of clustering, where the partitioning is spatial but the sought points are not necessarily described by a range. That's interesting, too.