Btrees are the new black (me.net.nz)
133 points by jjh42 on April 2, 2014 | hide | past | favorite | 41 comments

Someone should create a library where all your data is stored in a BTree. Think of it! Huge data sets, small data sets, everything scales in this system. And even better, they can create a language-agnostic API to a small server that just stores the btrees for them and gives them back on request. We could call it a database, but that would require we use indexes correctly for once in our lives.

If you do want a B-tree library in C that's not tied to SQL, I've had good success with lmdb[0]. As a bonus, it's also transactional and writers do not block readers (and vice versa). Crazy fast and easier to use from C than SQLite IMO.

If you want B-trees that compete with LSM trees, stratified B-trees[1] are pretty sweet. There's an in-kernel GPL implementation called Castle[2] on GitHub.

[0] http://symas.com/mdb/

[1] http://www.slideshare.net/acunu/20110620-stratifiedbtree

[2] https://github.com/acunu/castle

And then we can stack every conceivable use case on top of it, and write that language-agnostic API in a way that makes interfacing with most current languages so onerous that people can build hundreds of thousands of lines of code to make it slightly less painful!

Maybe we can make the configuration so convoluted that we can base entire companies and revenue models around maintaining them.

>write that language-agnostic API in a way that makes interfacing with most current languages so onerous that people can build hundreds of thousands of lines of code to make it slightly less painful

Those would be people who couldn't be bothered to learn a really simple declarative language, and prefer ad-hoc OO ORMs and other shit like that...

Oh, SQL itself is fairly easy. For many things, I'm a fan.

However, SQL query construction and result munging is painful.

Consider a UI search screen with eight potential search parameters, where the results require a one-to-many join. The query construction, under something like JDBC, can end up as hundreds of lines of tedious code, like this (and this is a condensed example!):

    String query = "SELECT a.*, b.* FROM table1 a INNER JOIN table2 b ON a.fk_id = b.id WHERE " // shorthand
    List whereClauses = []
    if (params.region) { whereClauses.add(getRegionWhereClause(params.region)) } // hope this doesn't require another join!
    if (params.country) { whereClauses.add(getCountryWhereClause(params.country)) }
    if (params.lastParam) { whereClauses.add(getLastParamWhereClause(params.lastParam)) }
    query += whereClauses.join(" AND ")
Each of the params methods is going to have a few lines of code.

    def getRegionWhereClause(regionCodeList) { "a.region IN (...)" } // hope you never want to change the table alias here!
http://use-the-index-luke.com/sql/where-clause/obfuscation/s... shows why using an ISNULL/NVL hackaround for static queries is the wrong answer.

Then you're going to get back a list of rows that looks like:

    | a.1 | a.2 | b.1  | b.2  |
    | a.1 | a.2 | b.1' | b.2' |

Where you really want:

    | a.1 | a.2 | [[b.1 | b.2], [b.1' | b.2']] |

So you have to go through and munge it. (Can you use something like CONCAT as a hack and groupby? Sure, but that introduces other problems.)

Having a way to pass in optional parameters to a where clause for the DB to strip out (while staying performant) and being able to return a 1 to many as an array would solve many problems, but it doesn't fit the paradigm of SQL.

SQL is not a simple language, and expecting everyone to learn it properly is crazy. Simple queries are easy, sure, but as with all things you quickly outgrow the simple cases and then you're stuck. Especially if you want good performance.

SQL is arguably at least as easy as any other language you can define around this idea. Usually the complications come from trying to normalize the crap out of your data and expecting a single relational model to work for all use cases.

Oddly, moving this to another domain would not make things any easier in that regard. If you tried to have a single data structure that stores all of your data for all of your use cases, expect pain.

NTFS uses B-trees, I believe. And it matches your requirements.

Mayhap I'm the one missing the joke. That being that this is just describing databases, period.

Google has an open source implementation of b-trees in C++. You gain performance and memory efficiency (less overhead per element stored). You lose the iterator stability that std::map/std::set guarantees. Other than that the interfaces are pretty much equal.


(top article author). I wasn't aware of that, it looks like they come to similar conclusions "B-trees are widely known as data structures for secondary storage, because they keep disk seeks to a minimum. For an in-memory data structure, the same property yields a performance boost by keeping cache-line misses to a minimum. C++ B-tree containers make better use of the cache by performing multiple key-comparisons per node when searching the tree. Although B-tree algorithms are more complex, compared with the Red-Black tree algorithms, the improvement in cache behavior may account for a significant speedup in accessing large containers."

BTrees are cache-aware, and, in a sense, cache-oblivious.

There are more interesting structures, like fractal tree indices, which are cache-oblivious in the very formal sense.

My own opinion is that Btrees are nice, but not very effective in today's world. Structures like LSM trees can be more effective. And you can construct high-efficiency search structures over the runs of an LSM tree that mimic Btrees.

A sidenote: your blog post is from the future - unless we're already in May? :)

Fixed. I'd like to say it was a subtle implication that it contains wisdom from the future ... but actually it was just me being boneheaded.

Funny how everything Google does becomes dogma. They moved from complicated hash data structures to B+trees, a decade later, and everyone follows. Better structures currently are log-structured merge trees for heavy writes and data-parallel BSTs for read-mostly workloads.

At least they are not stubborn for long, I give them that.

I don't know how you got that impression, but inside Google people use hash tables far, far more often than B+trees. In fact, I don't think I ever used a B+tree once in my entire career inside Google.

It's pretty hard to beat "expected O(1)".


If the hype is to be believed, Cache-oblivious b-trees are an even better match for today's hierarchical memory systems.

Unfortunately, I don't know any simple free/open-source implementation of these.

But here is a "home page" for them: http://supertech.csail.mit.edu/cacheObliviousBTree.html

There are even better fits for modern CPU architectures and hierarchical memory systems (but not on HDDs) than B-Tree based structures, like the adaptive radix tree: http://wwwkemper.informatik.tu-muenchen.de/~leis/papers/ART....

I think B-Trees are mostly important for educational purposes, since this is a very important general way of organizing data. For real world usage, there are hundreds of different structures optimized for different use cases.

Do you know of any examples of good open source projects using the ART?

Since there has been a resurgence of interest in OCaml on HN, some may be curious to see what a production-quality Btree implementation looks like in OCaml. Algebraic data types make it particularly convenient for this sort of thing. For comparison, one may contrast it with an implementation in a language that lacks algebraic data types, C++ for example. So glad Rust chose to have them.

Here is a Btree implementation from the mirage project [1] https://github.com/cgreenhalgh/ocaml-btree/blob/master/lib/b...

EDIT: removed link to binary tree after kerneis pointed it out.

[1] http://www.openmirage.org/

> It can be shorter still, if all you want is insert and read

Despite the name, I'm afraid this is a simple binary tree implementation, not a BTree:

    type 'a tree = Empty | Node of ('a * 'a tree * 'a tree);;

Obviously the author didn't know what he was doing (the intended scope was very large, AVL, BTrees, etc. but he stopped after committing this single file).

> Here is a Btree implementation from the mirage project

And in fact, this implementation does not show a lot because the actual B-tree work is done by the baardskeerder library: https://github.com/Incubaid/baardskeerder

It looks very large and complex; it is certainly possible to write a simpler, pedagogical implementation.

Found cpc https://github.com/kerneis/cpc from your comment stream. Looks super interesting. http://felix-lang.org/share/src/web/tut/fibres_index.fdoc might pique your interest. It also compiles coroutines portably and efficiently (to a C++ runtime library). Felix tries to do for C++ what F# does for C#. (In case there is any confusion, I am just a new user.)

Sweet. I remember writing an implementation in TurboPascal in the nineties. This is... refreshing :-)

The original B-Trees as described by Bayer were "intrusive" (as implemented here), in the sense that every node contains keys, values, and pointers.

There are variants such as the B+tree (http://en.wikipedia.org/wiki/B%2B_tree), which stores only keys in interior nodes and chains the leaf blocks together, making it more efficient for range scans and less so for general retrieval; and the B*-tree (http://en.wikipedia.org/wiki/B*-tree), which is more densely packed.

Interestingly, a red-black tree can be viewed as a BTree [0].

Another case where caching plays a huge role in determining the most efficient data structure (See vector vs list [1]).

[0]: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree#Analogy_...

[1]: http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector...

Robert Sedgewick gives a very good analogy of a red-black tree to a 2-3 tree on his Coursera course (Algorithms, Part I) [0] - a bit different to what is discussed in Wikipedia, but very easy to understand and implement.

[0]: https://class.coursera.org/algs4partI-004

This reminds me of The Ubiquitous B-Tree by Douglas Comer. That article was written in 1978.


Timo Bingmann presented a fast and robust implementation here https://panthema.net/2007/stx-btree/

Speaking of B-trees, I find CouchDB's implementation really interesting: the backing file is append-only, meaning that modifying the tree implies appending the modified nodes at the end of the file (for example, if I change a leaf, then I rewrite it at the end of the file, and rewrite its parent so that it points to the leaf's new position, and so on until we reach the root node).

Sounds like a waste of space, but it solves a bunch of problems: it's crash-proof (since we never overwrite anything, it's always possible to find the last valid root) and it allows reads concurrent with a write without any synchronization. Plus, it's actually well suited to SSDs due to its log-like structure!

CouchDB B-Trees: http://guide.couchdb.org/draft/btree.html

Log structures on SSDs: http://lwn.net/Articles/353411/

Massive waste of space though. LMDB is also copy-on-write and crash-proof but doesn't require compaction like CouchDB does.


Read the LDAPCon 2011 paper (and slides) to see how it works and why it's better than CouchDB.

In C++, std::set and std::map are typically implemented as RB trees. To compare the performance of those to a hash backed container, try std::unordered_set and std::unordered_map.

I have fond memories of red/black binary trees. About 15 years ago I wrote a reusable library for R/B trees in C. I had a series of programs I had to write that were collating a lot of data in memory into several buckets.

(The nice thing about virtual memory is you can make a really big RAM disk.)

Under what conditions will a B-tree outperform a hash map? It seems like the hash map was far and away the better choice in his test.

For insertions and lookups it is probably hard to find one (for large n, hashes are asymptotically superior).

Trees provide some other nice properties, such as maintaining an ordering (so it's fast to get the smallest element or iterate over the elements in numerical order), and they always perform at their asymptotic time cost (a dynamic hash table occasionally takes O(n) when the table is resized, so it might not be a good fit in a realtime situation).

What about false sharing?

I only benchmarked and considered the single-threaded case. Multithreading is a whole new can of worms.

Trivially, one could synchronize around the entire tree.

However, if modified nodes are always copied, then the only node that matters is the root node (since there can never be a partially modified tree accessible from the root), which boils down to atomic update of a root node pointer. Should be very efficient.

There is, of course, a performance implication of node copying, but it only affects add/delete/replace performance - read access runs at full speed - and it is cache-friendly, as well.

If you want different threads to see the same data structure then I think you would also need a mechanism to prevent multiple threads modifying the same node and clobbering each other's changes.

Yes, of course. But there is a difference between read, which only requires synchronization at the root node, and write, which requires synchronization at the modified leaf and potentially all the way back up the path from leaf to root.

LMDB was written specifically for multi-threading; all of its shared data structures are cache-line aligned to prevent false sharing. This is also essential to allowing readers to run with no locks.
