Someone should create a library where all your data is stored in a BTree. Think of it! Huge data sets, small data sets, everything scales in this system. And even better, they can create a language-agnostic API to a small server that just stores the btrees for them and gives them back on request. We could call it a database, but that would require we use indexes correctly for once in our lives.
If you do want a B-tree library in C that's not tied to SQL, I've had good success with lmdb[0]. As a bonus, it's also transactional and writers do not block readers (and vice versa). Crazy fast and easier to use from C than SQLite IMO.
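For anyone who hasn't used it, the core of the C API is tiny; here's a rough sketch from memory (error checking omitted - every call actually returns an int you should check against MDB_SUCCESS - and the ./db directory has to exist before mdb_env_open):

// Minimal LMDB sketch: one write transaction, one read transaction.
#include <lmdb.h>

int main() {
    MDB_env* env;
    mdb_env_create(&env);
    mdb_env_open(env, "./db", 0, 0664);

    // Write transaction; writers don't block readers and vice versa.
    MDB_txn* txn;
    mdb_txn_begin(env, nullptr, 0, &txn);
    MDB_dbi dbi;
    mdb_dbi_open(txn, nullptr, 0, &dbi);   // the unnamed (default) database

    MDB_val key, val;
    key.mv_size = 5;  key.mv_data = (void*)"hello";
    val.mv_size = 5;  val.mv_data = (void*)"world";
    mdb_put(txn, dbi, &key, &val, 0);
    mdb_txn_commit(txn);

    // Read it back in a read-only transaction.
    mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn);
    MDB_val out;
    mdb_get(txn, dbi, &key, &out);   // out.mv_data points straight into the mapped file
    mdb_txn_abort(txn);

    mdb_env_close(env);
    return 0;
}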
If you want B-trees that compete with LSM trees, stratified B-trees[1] are pretty sweet. There's an in-kernel GPL implementation called Castle[2] on GitHub.
And then we can stack every conceivable use case on top of it, and write that language-agnostic API in a way that makes interfacing with most current languages so onerous that people can build hundreds of thousands of lines of code to make it slightly less painful!
Maybe we can make the configuration so convoluted that we can base entire companies and revenue models around maintaining them.
>write that language-agnostic API in a way that makes interfacing with most current languages so onerous that people can build hundreds of thousands of lines of code to make it slightly less painful
Those would be people who couldn't be bothered to learn a really simple declarative language, and prefer ad-hoc OO ORMs and other shit like that...
Oh, SQL itself is fairly easy. For many things, I'm a fan.
However, SQL query construction and result munging is painful.
Consider a UI search screen with eight potential search parameters whose results require a one-to-many join. The query construction, under something like JDBC, can end up as hundreds of lines of tedious code, like this (and this is a condensed example!):
String query = "SELECT a.*, b.* FROM table1 a INNER JOIN table2 b ON a.fk_id = b.id WHERE " // shorthand
List whereClauses = []
if (params.region) { whereClauses.add(getRegionWhereClause(params.region)) } // hope this doesn't require another join!
if (params.country) { whereClauses.add(getCountryWhereClause(params.country)) }
...
if (params.lastParam) { whereClauses.add(getLastParamWhereClause(params.lastParam)) }
query += whereClauses.join(" AND ")
...
Each of the param methods is going to have a few lines of code:
def getRegionWhereClause(regionCodeList) { "a.region IN (" + regionCodeList.collect { "?" }.join(", ") + ")" } // hope you never want to change the table alias here!
Then you're going to get back a list of rows that looks like:
| a.1 | a.2 | b.1 | b.2 |
| a.1 | a.2 | b.1' | b.2' |
Where you really want:
| a.1 | a.2 | [[b.1, b.2], [b.1', b.2']] |
So you have to go through and munge it. (Can you use something like CONCAT as a hack and GROUP BY? Sure, but that introduces other problems.)
Having a way to pass optional parameters into a where clause for the DB to strip out (while staying performant), and being able to return a one-to-many as an array, would solve many problems, but it doesn't fit the paradigm of SQL.
SQL is not a simple language, and expecting everyone to learn it properly is crazy. Simple queries are easy, sure, but as with all things you quickly outgrow the simple stuff and then you're stuck. Especially if you want good performance.
SQL is arguably at least as easy as any other language you can define around this idea. Usually the complications come from trying to normalize the crap out of your data and expecting a single relational model to work for all use cases.
Oddly, moving this to another domain would not make things any easier in that regard. If you tried to have a single data structure that stores all of your data for all of your use cases, expect pain.
Google has an open source implementation of b-trees in C++. You gain performance and memory efficiency (less overhead per element stored). You lose the iterator stability that std::map/std::set guarantees. Other than that the interfaces are pretty much equal.
(top article author). I wasn't aware of that, it looks like they come to similar conclusions "B-trees are widely known as data structures for secondary storage, because they keep disk seeks to a minimum. For an in-memory data structure, the same property yields a performance boost by keeping cache-line misses to a minimum. C++ B-tree containers make better use of the cache by performing multiple key-comparisons per node when searching the tree. Although B-tree algorithms are more complex, compared with the Red-Black tree algorithms, the improvement in cache behavior may account for a significant speedup in accessing large containers."
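To make the "interfaces are pretty much equal" point concrete, here's a small sketch. I'm assuming the Abseil b-tree containers (absl::btree_map), which I believe grew out of that cpp-btree code, so treat the exact header path as an assumption:

// Sketch: absl::btree_map mirrors the std::map interface; the main behavioral
// difference is that insert/erase may invalidate iterators and references,
// which std::map never does.
#include <cstdio>
#include <string>
#include "absl/container/btree_map.h"

int main() {
    absl::btree_map<int, std::string> m;   // same shape as std::map<int, std::string>
    m[3] = "three";
    m[1] = "one";
    m[2] = "two";

    for (const auto& [k, v] : m)           // in-order iteration, just like std::map
        std::printf("%d -> %s\n", k, v.c_str());

    auto it = m.find(2);
    // Unlike with std::map, 'it' may be invalidated by the next insert or erase,
    // because keys can move within and between b-tree nodes.
    m.erase(1);
    (void)it;  // don't rely on it past this point
}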
BTrees are cache-aware, and, in a sense, cache-oblivious.
There are more interesting structures, like fractal tree indices, which are cache-oblivious in the very formal sense.
My own opinion is that B-trees are nice, but not very effective in the current world. Structures like LSM trees can be more effective. And you can construct high-efficiency search structures over the runs of an LSM tree that mimic B-trees.
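To make the LSM idea concrete, a toy sketch of my own (nothing like a production LSM tree: no compaction, no Bloom filters, no write-ahead log): recent writes go into a small ordered memtable, full memtables get flushed to immutable sorted runs, and reads check the memtable and then the runs, newest first.

#include <algorithm>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class TinyLSM {
    static constexpr std::size_t kMemtableLimit = 1024;
    std::map<int, std::string> memtable_;                         // mutable, ordered
    std::vector<std::vector<std::pair<int, std::string>>> runs_;  // immutable sorted runs, oldest first

public:
    void put(int key, std::string value) {
        memtable_[key] = std::move(value);
        if (memtable_.size() >= kMemtableLimit) {
            runs_.emplace_back(memtable_.begin(), memtable_.end());  // already sorted
            memtable_.clear();
        }
    }

    std::optional<std::string> get(int key) const {
        if (auto it = memtable_.find(key); it != memtable_.end()) return it->second;
        for (auto run = runs_.rbegin(); run != runs_.rend(); ++run) {   // newest run first
            auto it = std::lower_bound(run->begin(), run->end(), key,
                                       [](const auto& kv, int k) { return kv.first < k; });
            if (it != run->end() && it->first == key) return it->second;
        }
        return std::nullopt;
    }
};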
Funny how everything Google does becomes dogma. They move from complicated hash data structures to B+trees, a decade later, and everyone follows. Better data structures right now are log-structured merge trees for write-heavy workloads and data-parallel BSTs for read-mostly ones.
At least they are not stubborn for long, I give them that.
I don't know how you got that impression, but inside Google people use hash tables far, far more often than B+trees. In fact, I don't think I ever used B+tree once in my entire career inside Google.
I think B-Trees are mostly important for educational purposes, since this is a very important general way of organizing data. For real world usage, there are hundreds of different structures optimized for different use cases.
Since there has been a resurgence of interest in OCaml on HN, some may be curious to see what a production-quality Btree implementation looks like in OCaml. Algebraic data types make it particularly convenient for these sorts of things. For comparison, one may contrast it with an implementation in a language that does not have algebraic data types, C++ for example. So glad Rust chose to have them.
> It can be shorter still, if all you want is insert and read
Despite the name, I'm afraid this is a simple binary tree implementation, not a BTree:
type 'a tree = Empty | Node of ('a * 'a tree * 'a tree);;
Obviously the author didn't know what he was doing (the intended scope was much larger: AVL, BTrees, etc., but he stopped after committing this single file).
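For contrast, here's roughly what the two node shapes look like (my own illustrative sketch, branching factor picked arbitrarily): the snippet above has one key and two children per node, while a B-tree node packs many sorted keys and children into one block, which is the whole point of the structure.

struct BinaryNode {
    int key;
    BinaryNode* left;
    BinaryNode* right;
};

constexpr int kOrder = 32;  // illustrative branching factor

struct BTreeNode {
    int num_keys;                  // how many key slots are actually in use
    int keys[kOrder - 1];          // sorted keys, searched within a single node
    BTreeNode* children[kOrder];   // children[i] covers keys between keys[i-1] and keys[i]
    bool is_leaf;
};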
Found cpc https://github.com/kerneis/cpc from your comment stream. Looks super interesting. http://felix-lang.org/share/src/web/tut/fibres_index.fdoc might pique your interest. It also compiles coroutines portably and efficiently (to a C++ runtime library). Felix tries to do for C++ what F# does for C# (in case there is any confusion, I am just a new user).
Robert Sedgewick gives a very good analogy of a red-black tree to a 2-3 tree on his Coursera course (Algorithms, Part I) [0] - a bit different to what is discussed in Wikipedia, but very easy to understand and implement.
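For anyone curious what that analogy looks like in code, here's a minimal sketch of the left-leaning red-black insert he teaches (keys only, no deletion); every red link stands for the two keys that would share a single 2-3 node.

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    bool red = true;   // new links start red, like adding a key into an existing 2-3 node
    explicit Node(int k) : key(k) {}
};

static bool isRed(const Node* n) { return n != nullptr && n->red; }

static Node* rotateLeft(Node* h) {   // fix a right-leaning red link
    Node* x = h->right;
    h->right = x->left;
    x->left = h;
    x->red = h->red;
    h->red = true;
    return x;
}

static Node* rotateRight(Node* h) {  // fix two consecutive left-leaning red links
    Node* x = h->left;
    h->left = x->right;
    x->right = h;
    x->red = h->red;
    h->red = true;
    return x;
}

static void flipColors(Node* h) {    // split a temporary 4-node, pushing a red link up
    h->red = true;
    h->left->red = false;
    h->right->red = false;
}

static Node* insert(Node* h, int key) {
    if (h == nullptr) return new Node(key);
    if (key < h->key)      h->left  = insert(h->left, key);
    else if (key > h->key) h->right = insert(h->right, key);
    if (isRed(h->right) && !isRed(h->left))     h = rotateLeft(h);
    if (isRed(h->left) && isRed(h->left->left)) h = rotateRight(h);
    if (isRed(h->left) && isRed(h->right))      flipColors(h);
    return h;
}

static Node* insertRoot(Node* root, int key) {
    root = insert(root, key);
    root->red = false;   // the root is always black
    return root;
}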
Speaking of B-trees, I find CouchDB's implementation really interesting: the backing file is append-only, meaning that modifying the tree implies appending the modified nodes at the end of the file (for example, if I change a leaf, then I rewrite it at the end of the file, and rewrite its parent so that it points to the leaf's new position, and so on until we reach the root node).
Sounds like a waste of space, but it solves a bunch of problems, like being crash-proof (since we never overwrite anything, it's always possible to just find the last root) and allowing reads concurrent with a write without any synchronization necessary. Plus, it's actually well suited to SSDs due to its log-like structure!
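Here's a tiny sketch of that update pattern (my own toy version, not CouchDB's actual on-disk format), with an in-memory vector standing in for the append-only file: changing a leaf means re-appending the leaf and then every ancestor, because each parent must record its child's new offset, and the last root written wins.

#include <cstddef>
#include <cstdint>
#include <vector>

struct DiskNode {
    std::vector<int> keys;
    std::vector<std::uint64_t> child_offsets;  // "file offsets" of children; empty for a leaf
};

// Stand-in for the append-only file: appending a node returns its offset.
std::vector<DiskNode> file;
std::uint64_t append(const DiskNode& n) {
    file.push_back(n);
    return file.size() - 1;
}

// path = nodes from the root down to the (already modified) leaf;
// child_index[i] = which slot of path[i] points at path[i + 1].
std::uint64_t rewrite_path(std::vector<DiskNode> path, const std::vector<std::size_t>& child_index) {
    std::uint64_t new_offset = append(path.back());      // the changed leaf goes in first
    for (std::size_t i = path.size() - 1; i-- > 0; ) {   // then each parent, bottom-up
        path[i].child_offsets[child_index[i]] = new_offset;
        new_offset = append(path[i]);
    }
    return new_offset;                                    // offset of the new root
}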
In C++, std::set and std::map are typically implemented as RB trees. To compare the performance of those to a hash backed container, try std::unordered_set and std::unordered_map.
I have fond memories of red/black binary trees. About 15 years ago I wrote a reusable library for R/B trees in C. I had a series of programs I had to write that were collating a lot of data in memory into several buckets.
(The nice thing about virtual memory is you can make a really big RAM disk.)
For insertions and lookups alone it is probably hard to find a situation where a tree beats a hash table (for large n, hashes are asymptotically superior).
Trees provide some other nice properties, such as maintaining an ordering (so it's fast to get the smallest element or iterate over the elements in numerical order), and they always perform at their asymptotic time cost (a dynamic hash table occasionally takes O(n) when the table is resized, so it might not be a good fit in a realtime situation).
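A tiny illustration of the ordering point, using tree-backed std::map, where a hash table has no equivalent:

#include <cstdio>
#include <map>

int main() {
    std::map<int, const char*> m{{42, "answer"}, {7, "seven"}, {19, "nineteen"}};

    std::printf("smallest key: %d\n", m.begin()->first);   // 7, for free

    // All keys in [10, 40), visited in increasing order.
    for (auto it = m.lower_bound(10); it != m.end() && it->first < 40; ++it)
        std::printf("%d -> %s\n", it->first, it->second);
}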
Trivially, one could synchronize around the entire tree.
However, if modified nodes are always copied, then the only node that matters is the root node (since there can never be a partially modified tree accessible from the root), which boils down to atomic update of a root node pointer. Should be very efficient.
There is, of course, a performance implication of node copying, but it only affects add/delete/replace performance - read access runs at full speed - and it is cache-friendly, as well.
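Something like this minimal sketch (my own, single writer assumed, and reclamation of the replaced nodes deliberately ignored; a real implementation needs RCU, hazard pointers, or refcounting):

#include <atomic>

struct Node {
    int key;
    const Node* left;
    const Node* right;
};

std::atomic<const Node*> root{nullptr};

// Reader: one acquire load, then an immutable snapshot of the tree.
bool contains(int key) {
    const Node* n = root.load(std::memory_order_acquire);
    while (n) {
        if (key == n->key) return true;
        n = key < n->key ? n->left : n->right;
    }
    return false;
}

// Writer: copy every node on the path to the insertion point, then publish.
// Single writer assumed; multiple writers would additionally need a lock or a CAS loop.
const Node* insert_copy(const Node* n, int key) {
    if (!n) return new Node{key, nullptr, nullptr};
    if (key < n->key) return new Node{n->key, insert_copy(n->left, key), n->right};
    if (key > n->key) return new Node{n->key, n->left, insert_copy(n->right, key)};
    return n;  // already present
}

void insert(int key) {
    const Node* old_root = root.load(std::memory_order_relaxed);
    root.store(insert_copy(old_root, key), std::memory_order_release);
}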
If you want different threads to see the same data structure then I think you would also need a mechanism to prevent multiple threads modifying the same node and clobbering each other's changes.
Yes, of course. But there is a difference between read, which only requires synchronization at the root node, and write, which requires synchronization at the modified leaf and potentially all the way back up the path from leaf to root.
LMDB was written specifically for multi-threading; all of its shared data structures are cache-line aligned to prevent false sharing. This is also essential to allowing readers to run with no locks.