
The Ubiquitous B-Tree (1979) [pdf] - robinson_k
http://wwwold.cs.umd.edu/class/fall2002/cmsc818s/Readings/b-tree.pdf
======
craigching
Both times before this was posted it garnered 0 comments:

[https://news.ycombinator.com/item?id=4317545](https://news.ycombinator.com/item?id=4317545)

[https://news.ycombinator.com/item?id=7243568](https://news.ycombinator.com/item?id=7243568)

When I read this I immediately recognized it as a PDF that I have in
GoodReader (where I store all my technical PDFs). So I'm curious if it will
start a good conversation about b-trees this time :)

To get the conversation started, a couple of years ago, I decided to write a
b-tree data structure to refresh my skills a bit. I wrote it in C since I've
been doing a lot of Java the last 10 years or so. I only got as far as an in-
memory b-tree and had started working on getting it stored to disk before I
abandoned the project (I think I was working on it during the Christmas
holidays and the holiday ended).

I subsequently read that in-memory b-trees can actually outperform other tree
data structures like rb-trees due to cache coherency. I never bothered to
performance test my b-tree vs. an rb-tree I also wrote for fun.

Anyway, that's my story about b-trees!
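For anyone curious, the search half of an in-memory b-tree is tiny. A rough sketch, not my original code (the node layout and names here are just illustrative, fixed minimum degree t = 3):

```cpp
#include <cassert>

// Hypothetical minimum-degree-3 node: keys live in one contiguous
// array, so searching within a node touches only a cache line or two
// instead of chasing pointers, which is where the speed claims about
// in-memory b-trees come from.
struct BTreeNode {
    int keys[5];             // up to 2*t - 1 = 5 keys for t = 3
    BTreeNode* children[6];  // up to 2*t = 6 children
    int nkeys = 0;
    bool leaf = true;
};

// Descend one node per level; within a node, scan the key array
// linearly (fine for small nodes, and very cache friendly).
bool btree_search(const BTreeNode* x, int k) {
    while (x) {
        int i = 0;
        while (i < x->nkeys && k > x->keys[i]) ++i;
        if (i < x->nkeys && x->keys[i] == k) return true;
        if (x->leaf) return false;
        x = x->children[i];
    }
    return false;
}
```

Insertion is where it gets fiddly (splitting full nodes on the way down), which is probably why my project stalled at the disk-storage step.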

~~~
detrino
Cache coherency is something else.

But ya, B-Trees will (sometimes significantly) outperform BSTs for small keys
because they have fewer cache misses. Also interesting: the idea that RB-Trees
are faster than AVL trees for insert/delete may be outdated as well. I've seen
benchmarks of AVL trees beating std::set in both libstdc++ and libc++ for both
of those operations. Again, this is probably due to fewer cache misses.

My B-Tree story is actually a B+Tree story. B-Trees (and B+Trees) can
actually be used for more than just ordered sets/maps; I've implemented a C++
sequence container[1] that allows O(log n) random access insertion and
deletion. It is _much_ faster than a similar container[2] implemented using
AVL trees, but at the cost of iterator stability.

[1]
[https://github.com/det/segmented_tree_seq](https://github.com/det/segmented_tree_seq)

[2] [http://avl-array.sourceforge.net](http://avl-array.sourceforge.net)

~~~
craigching
> Cache coherency is something else.

Ugh, yes, sorry about that. I meant that they are more cache friendly.

Not to disagree with anything you've said; just to back up what I was trying
to get at, here's the SO answer:

[http://stackoverflow.com/questions/647537/b-tree-faster-than-avl-or-redblack-tree?answertab=votes#tab-top](http://stackoverflow.com/questions/647537/b-tree-faster-than-avl-or-redblack-tree?answertab=votes#tab-top)

The part I'm talking about is:

      Also, unless you play nasty allocation tricks with the red-black
      trees, it seems reasonable to conjecture that B-trees have better
      caching behavior (it accesses an array, not pointers strewn about
      all over the place, and has less allocation overhead increasing
      memory locality even more), which might help it in the speed race.

That's not exactly the reference I was thinking of, but it says what I
remember about it.

Again, not taking away from what you're saying and not disagreeing with you in
any way, more backing up what I originally said ;)

------
flgr
If you are interested in an overview of B-Trees, similar data structures, and
how well they work on modern hardware, you may also find the survey portion of
my master's thesis interesting:

See here:
[https://www.researchgate.net/profile/Florian_Gross/publicati...](https://www.researchgate.net/profile/Florian_Gross/publicati..).

Along those lines:

* CSS Trees: Pointerless B-Trees with a layout optimized for cache lines ([http://www.vldb.org/conf/1999/P7.pdf](http://www.vldb.org/conf/1999/P7.pdf))

* Intel & Oracle's fast architecture-sensitive tree search (combines huge pages, cache line blocking, and SIMD in an optimal layout): [http://www.researchgate.net/profile/Jatin_Chhugani/publicati...](http://www.researchgate.net/profile/Jatin_Chhugani/publication/221213860_FAST_fast_architecture_sensitive_tree_search_on_modern_CPUs_and_GPUs/links/0c96051f5d2990770d000000.pdf)

* Adaptive radix trees ([http://codematch.muehe.org/~leis/papers/ART.pdf](http://codematch.muehe.org/~leis/papers/ART.pdf))

------
tomcam
A lucid, enjoyable paper that allowed me to write a very fast B-Tree
implementation in nascent ANSI C back in the day. Comer's books always struck
just the right balance between scholarly completeness and clarity, at least to
me.

------
rizzin
Why does the root have only two children and not d, like every other non-leaf
node?

~~~
detrino
Because the root only gains children when it is split: the old root divides in
two, so the new root above it starts with exactly 2 children.
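A sketch of the split, using a hypothetical CLRS-style node with minimum degree t = 3 (names are illustrative, not from the paper):

```cpp
#include <cassert>

// Hypothetical node: up to 2*t - 1 = 5 keys, 2*t = 6 children.
struct Node {
    int keys[5];
    Node* children[6];
    int nkeys = 0;
    bool leaf = true;
};

// Called when the root is full (2*t - 1 keys): the old root splits
// in two, the median key moves up into a freshly allocated root,
// and that new root has exactly two children -- no more, no less.
Node* split_root(Node* r) {
    const int t = 3;  // minimum degree
    Node* right = new Node;
    right->leaf = r->leaf;
    right->nkeys = t - 1;
    for (int i = 0; i < t - 1; ++i) right->keys[i] = r->keys[i + t];
    if (!r->leaf)
        for (int i = 0; i < t; ++i) right->children[i] = r->children[i + t];
    r->nkeys = t - 1;            // old root keeps the left half
    Node* root = new Node;
    root->leaf = false;
    root->nkeys = 1;
    root->keys[0] = r->keys[t - 1];  // median key moves up
    root->children[0] = r;
    root->children[1] = right;
    return root;                     // exactly two children
}
```

That's why the minimum-children rule that applies to every other internal node doesn't apply to the root: this is the only way the tree ever grows in height.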

~~~
rizzin
I didn't understand your answer even after rereading about tree operations,
but I think I found the actual reason why the root can have fewer than d
children:

it is because of rebalancing after deletion. Underflows can propagate all the
way up to the root, so its children can borrow its indices. And when the root
has only one index (and two children) and another underflow occurs, its
children combined with its index can form a new root, decreasing the tree
height by one.
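A sketch of that height-decreasing merge, with a hypothetical CLRS-style node layout (illustrative names, minimum degree t = 3):

```cpp
#include <cassert>

// Hypothetical node: up to 2*t - 1 = 5 keys, 2*t = 6 children.
struct Node {
    int keys[5];
    Node* children[6];
    int nkeys = 0;
    bool leaf = true;
};

// The case described above: the root holds a single separator key
// and both children are minimal. Pulling the separator down and
// merging the two children yields one legal node, which becomes the
// new root -- the tree's height shrinks by one.
Node* collapse_root(Node* root) {
    Node* l = root->children[0];
    Node* r = root->children[1];
    int base = l->nkeys + 1;              // where r's children land
    l->keys[l->nkeys++] = root->keys[0];  // pull the separator down
    for (int i = 0; i < r->nkeys; ++i)
        l->keys[l->nkeys++] = r->keys[i];
    if (!l->leaf)
        for (int i = 0; i <= r->nkeys; ++i)
            l->children[base + i] = r->children[i];
    delete r;
    delete root;
    return l;                             // height decreased by one
}
```

Together with the root split on insertion, this is the whole story: the root is the only node allowed to be underfull because it's where the tree grows and shrinks in height.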

