The subtleties of proper B+Tree implementation (ayende.com)
145 points by tim_sw 10 months ago | 38 comments



Note: In most cases you don't want to split pages prior to insertion as is being done here. Instead, you want to insert data first, and then restore the page size invariants at the end (prior to committing the operation / writing to disk / etc). If you're doing multiple operations at once (e.g. adding new keys and deleting others) it may turn out that you can avoid the expensive page-splitting operation by postponing it.
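For anyone who hasn't seen this pattern, a minimal sketch of the idea in C. The node layout, the 4 KiB page size, and all the names are assumptions for illustration, not the article's actual code:

    #include <stdbool.h>
    #include <stddef.h>

    /* Sketch of "mutate first, rebalance at commit": the node's byte budget
     * may be exceeded temporarily while a batch is applied, and the split is
     * only paid for if the node is still oversized at the end. */

    enum { PAGE_SIZE = 4096 };

    struct node {
        size_t used_bytes;   /* bytes occupied by keys + values in this node */
        /* keys, values, child pointers elided */
    };

    /* delta is positive for an insert, negative for a delete. */
    static void apply_op(struct node *n, long delta)
    {
        n->used_bytes = (size_t)((long)n->used_bytes + delta);
    }

    /* Check the size invariant once, just before committing / writing to disk. */
    static bool commit_needs_split(const struct node *n)
    {
        return n->used_bytes > PAGE_SIZE;
    }

    int main(void)
    {
        struct node n = { .used_bytes = 4000 };
        apply_op(&n, +200);   /* splitting up front would have been wasted work... */
        apply_op(&n, -300);   /* ...because a delete in the same batch undoes it   */
        return commit_needs_split(&n) ? 1 : 0;   /* 0: no split needed */
    }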


No, you don't. That'll add expensive memory reallocations; it may let you avoid the split operation, but that isn't worth it because you've added a nasty performance cliff to your insert case - you'll either be doing the memory allocation repeatedly, or you'll waste a ton of memory when your buffer is bigger than your on-disk node size.


For my workload it's optimal to keep working nodes deserialized anyway. YMMV but what I'm doing gets me over a million operations per second -- 1 Gbps of requests -- on a single CPU so I don't think I'm going far wrong.

EDIT: I just noticed you're the bcachefs author. I'm guessing you're dealing with much larger key/value pairs than I am (40+40)?


No, that's a typical size for bcachefs as well.

But I've had to optimize for memory usage and for performance when the working set does not fit in RAM. We also don't have a serialize/deserialize step; that will really hurt you when the working set doesn't fit in RAM.

You want to validate when you're reading in a btree node, but you want to keep that as cheap as possible.

For avoiding the serialize/deserialize, the thing to use now would be Cap'n Proto.


Ah, that would explain it. Yeah, I have a total datastore size which is out of core, but the working set is generally in core... partly because I can use all the RAM on the system if I want to, whereas filesystems are (usually!) expected to occupy just a portion of the system memory.

FWIW, my "deserialize" doesn't involve copying all the data -- I just construct pointers into the (serialized) page. Serializing has to copy data though.
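For readers following along, a rough sketch of what a pointer-only "deserialize" can look like. The length-prefixed page layout here is made up for illustration and isn't the parent poster's actual format:

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    /* "Deserializing" by building pointers into the serialized page rather
     * than copying entries out of it.  Assumed layout: each key is prefixed
     * by a little-endian u16 length. */

    struct key_ref {
        const uint8_t *data;   /* points into the page buffer, no copy */
        uint16_t       len;
    };

    static size_t build_refs(const uint8_t *page, size_t page_len,
                             struct key_ref *out, size_t max_refs)
    {
        size_t off = 0, n = 0;
        while (off + 2 <= page_len && n < max_refs) {
            uint16_t len = (uint16_t)(page[off] | (page[off + 1] << 8));
            if (len == 0 || off + 2 + len > page_len)
                break;
            out[n].data = page + off + 2;
            out[n].len  = len;
            off += 2 + (size_t)len;
            n++;
        }
        return n;
    }

    int main(void)
    {
        /* Two keys, "ab" and "xyz", already in serialized form. */
        const uint8_t page[] = { 2, 0, 'a', 'b', 3, 0, 'x', 'y', 'z' };
        struct key_ref refs[8];
        size_t n = build_refs(page, sizeof(page), refs, 8);
        for (size_t i = 0; i < n; i++)
            printf("%.*s\n", (int)refs[i].len, (const char *)refs[i].data);
        return 0;
    }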


Just wanted to say, I really enjoyed reading you guys' conversation, since I almost never do low-level optimizations like that. Very interesting, thanks for sharing your thoughts!


Well, bcachefs is currently the slowest of all Linux filesystems.



Probably more interesting to consider the follow-up article[1] after some debug code was disabled and such.

[1]: https://www.phoronix.com/review/bcachefs-benchmarks-linux67


Still dead last on a number of benchmarks, including FIO random reads, Dbench, and Postgres. And by a lot.


If I created the slowest mainstream Linux filesystem I'd be quite proud. Sneer all you want.


> No, you don't.

This is correct. Most disk-oriented DBMSs use fixed-size pages, so you can't go larger than the node size (not entirely true if you use auxiliary data structures to buffer changes, like an inefficient B-epsilon tree).


That assumes that you are actually doing all the work in a side buffer? The way we do it, all the operations are done on the actual buffers of the tree, so there is no actual way _to_ have more space (even temporarily).

This way, you avoid extra allocation, copying, etc.


The only time I copy data (or keys for that matter) is when I'm constructing the new serialized page(s). All the rest of the work is done with pointers into the old page or the request structures.


There’s something to generalize here: if you run out of X, split by X. This bug happens when you run out of space but split by count. The author’s comment said he fixed it by splitting by size (space taken).
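As a rough illustration (not the article's code; the entry sizes and names are made up), "split by size" means choosing the split point by accumulated bytes rather than by entry count, so one huge entry can't leave one half of the split nearly as full as the original node:

    #include <stddef.h>
    #include <stdio.h>

    /* Pick the split point by accumulated byte size rather than entry count.
     * Returns the number of entries that stay in the left node. */
    static size_t split_point_by_size(const size_t *entry_sizes, size_t n_entries)
    {
        size_t total = 0, acc = 0;
        for (size_t i = 0; i < n_entries; i++)
            total += entry_sizes[i];

        for (size_t i = 0; i < n_entries; i++) {
            acc += entry_sizes[i];
            if (acc >= total / 2)      /* first entry crossing the byte midpoint */
                return i + 1;
        }
        return n_entries;
    }

    int main(void)
    {
        /* One big entry followed by small ones: splitting by count (3 left,
         * 3 right) would leave the left half nearly full and the split
         * pointless.  Splitting by size keeps only the big entry on the left. */
        size_t sizes[] = { 3000, 40, 40, 40, 40, 40 };
        printf("entries kept in the left node: %zu\n",
               split_point_by_size(sizes, sizeof(sizes) / sizeof(sizes[0])));
        return 0;
    }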


If you're interested in B+Trees, and especially if you find yourself implementing one, go read this: https://w6113.github.io/files/papers/btreesurvey-graefe.pdf

I certainly haven't read the entire thing, but gosh, there are some good bits. E.g. "B-trees Versus Hash Indexes" - tons of insights jam-packed into a couple of pages for anyone who hasn't already been exposed to B-trees in detail (e.g. in a research context vs the usual data structures class).


Yeah, this paper/book has a lot of great information about the details that your typical textbook explanation will never get into.


When I implemented this, long ago, I had an expanded page structure containing the old page plus the new key, so it held more memory than one page. When writing it back, the expanded page becomes two pages if a split is needed.

I did it this way for a few reasons. One was that I could release the old key pages only after the new pages were written, avoiding data loss on I/O errors. The second was that key (prefix) compression became a lot easier. I never realised it, but it avoids the trouble in both blog posts too.

It comes with some copy overhead, unfortunately.
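If it helps anyone picture it, something along these lines (the sizes and names are guesses for illustration, not the parent's actual code):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* An "expanded page": a working buffer big enough for a full page plus
     * one extra entry.  It is written back as one page, or as two if a split
     * is needed; the old page is freed only after the new one(s) are on disk. */

    enum { PAGE_SIZE = 4096, MAX_ENTRY = 512 };

    struct expanded_page {
        uint8_t buf[PAGE_SIZE + MAX_ENTRY];   /* old page contents + the new key */
        size_t  used;
    };

    static int pages_on_writeback(const struct expanded_page *p)
    {
        return p->used <= PAGE_SIZE ? 1 : 2;
    }

    int main(void)
    {
        struct expanded_page p = { .used = 4300 };
        printf("write-back produces %d page(s)\n", pages_on_writeback(&p));
        return 0;
    }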


I've got a Data Structures and Algorithms final tomorrow, likely including a few 2-4 Tree problems... I will be forever grateful that we at least don't have to implement them, just simulate them.


Many years ago I taught a Data Structures course section where all the students were B+s or As. I gave them an assignment to do in-memory B-tree insertion and search. They mostly succeeded, though I had more than one paper handed in that was titled “The Assignment From Hell”.


> all the students were B+s or As

Or maybe

> all the students were B+s or B*s

I'll see myself out now


You win!


> “The Assignment From Hell”

LOL! I love that

That one for me must have been pricing reverse exchangeable notes in R


It really isn't so big a deal; the author of this post is spreading fear, in a way.

For example, saying that you need "real world data" to catch "edge cases" like this... no, proper unit tests will do just fine. Also, the problem of splitting different-sized keys in the way the author first tries does not strike me as a mistake anyone would fall into, or one that is hard to get out of.


just curious.. what language is it being taught in?


When I did it, it wasn't in any language. We just drew the evolution of the trees on paper. I got my degree in 2021.


What? They're fun to implement


Is this only relevant for string-based B-tree implementations? Or is this more relevant for custom-rolled B-trees for a concrete set of types?


This kind of thing makes me really wish we lived in a world where languages/libraries had first-class support for containers.

C++ at least tries but falls so far short.


What would that first-class support look like? Having more container types as language built-in?


It's not that the types need to be built-in, but that the compiler needs to know that a type supports an interface, and the library needs to take advantage appropriately.

For this particular case, key-based containers need to know if their key is a homogeneous sequence. But that's actually slightly too restrictive - in particular, a fixed-size prefix of a different type isn't actually any harder to optimize for, assuming the introspection API is good enough.

But for other things ... even "just give me the list of fields that this class has" involves containers. Implementing compare/hash efficiently (with short-circuiting!) requires peeking inside the heterogeneous tuple. Writing a partial JSON-like object literal? That's a container, and one with very common uses. Making a copy of an object with certain fields replaced? A container. Functions that take keyword arguments? A container which you would really like to propagate sanely. But most compilers that deal with such things don't actually support the usual container APIs, often even lacking simple concatenation (speaking of which - what about compile-time duplicate key detection?).


A functor typeclass?


what's the point of storing different-sized keys together?


There are a lot of things you can do, e.g. factor out a common prefix. But even without that, this avoids doing extra indirections, which mean slow memory accesses.

The reason to use B-family trees over binary trees is to avoid said memory slowness in the first place.
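A concrete example of the prefix point (a sketch; the key format is invented):

    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    /* One payoff of keeping variable-sized keys together in a node: a shared
     * prefix can be stored once per node instead of once per key (and without
     * chasing a pointer per key to find it). */
    static size_t common_prefix_len(const char *const *keys, size_t n)
    {
        size_t len = strlen(keys[0]);
        for (size_t i = 1; i < n; i++) {
            size_t j = 0;
            while (j < len && keys[i][j] == keys[0][j])
                j++;
            len = j;
        }
        return len;
    }

    int main(void)
    {
        const char *keys[] = { "user:0001", "user:0002", "user:0413" };
        size_t p = common_prefix_len(keys, 3);
        printf("store \"%.*s\" once, then just the suffixes (%zu bytes saved per key)\n",
               (int)p, keys[0], p);
        return 0;
    }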


Uhh.... maybe don't roll your own data structure if such a basic issue is causing you havoc.


Maybe link to a resource for people to download a good implementation of that data structure. Maybe link to an article describing how to implement this data structure correctly in depth. Maybe try finding some other way to add to the conversation instead of disparaging the author for sharing something they learned. Maybe don’t post anything at all if you don’t have something productive to say. Maybe use nonviolent-communication (NVC) techniques to say exactly what you just said, but without condescension or attack.


This makes it sound like it’s the author’s fault or something. They’re pointing out a pretty subtle gotcha.


Aside from the other replies, I should point out that two very capable programmers had to implement this differently because the performance implications of a design decision differ depending on the working-set size.

For all but the most basic data structures, the variations are so many that rolling your own is often a good design choice.



