
Let’s Invent B(+)-Trees - panic
https://shachaf.net/w/b-trees
======
AlexeyMK
That made sense. I was following along, and then all of a sudden, it just kind
of ended.

As a layman who doesn't clearly remember B Trees, it would be awesome to have
even a sentence at the end, like

...and that's B Trees! Commonly used for storing fields in relational
databases, filesystems, and more!

For fellow laymen,
[https://en.wikipedia.org/wiki/B-tree](https://en.wikipedia.org/wiki/B-tree)
isn't bad, but is there more?

~~~
why-el
The Postgres documentation on btrees is absolutely stellar if you can afford
some extended time to study it [1]. From there you can read the actual code or
watch some videos on how PostgreSQL indexes use them. You can even dump
indexes locally to see their content, from the root all the way to the leaf
nodes [2] (this assumes you read and understood the above).

[1]
[https://github.com/postgres/postgres/tree/master/src/backend...](https://github.com/postgres/postgres/tree/master/src/backend/access/nbtree)

[2]
[https://www.postgresql.org/docs/10/pageinspect.html](https://www.postgresql.org/docs/10/pageinspect.html)

------
benhoyt
This sounds exactly like how the excellent Python Sorted Containers library
works; it achieves "faster than C" performance in pure Python by effective use
of the bisect module (which itself is written in C).
[http://www.grantjenks.com/docs/sortedcontainers/implementati...](http://www.grantjenks.com/docs/sortedcontainers/implementation.html)

> The Sorted Containers internal implementation is based on a couple
> observations. The first is that Python’s list is fast, really fast. Lists
> have great characteristics for memory management and random access. The
> second is that bisect.insort is fast. This is somewhat counter-intuitive
> since it involves shifting a series of items in a list. But modern
> processors do this really well. A lot of time has been spent optimizing
> mem-copy/mem-move-like operations both in hardware and software.

> But using only one list and bisect.insort would produce sluggish behavior
> for lengths exceeding ten thousand. So the implementation of Sorted List
> uses a list of lists to store elements. In this way, inserting or deleting
> is most often performed on a short list. Only rarely does a new list need to
> be added or deleted.

> Sorted List maintains three internal variables: _lists, _maxes, and _index.
> The first is simply the list of lists, each member is a sorted sublist of
> elements. The second contains the maximum element in each of the sublists.
> This is used for fast binary-search. The last maintains a tree of pair-wise
> sums of the lengths of the lists.
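
To make this concrete, here's a minimal sketch of the list-of-lists idea in
Python (my own toy version, not the actual Sorted Containers code; the real
library keeps sublists around a thousand elements, the tiny LOAD here is just
for illustration):

    import bisect

    class TinySortedList:
        LOAD = 4  # Sorted Containers uses a much larger load factor

        def __init__(self):
            self._lists = []  # sorted sublists of elements
            self._maxes = []  # max element of each sublist

        def add(self, value):
            if not self._lists:
                self._lists.append([value])
                self._maxes.append(value)
                return
            # binary-search _maxes for the sublist that should hold value
            i = bisect.bisect_left(self._maxes, value)
            if i == len(self._lists):
                i -= 1  # larger than every max: goes in the last sublist
            bisect.insort(self._lists[i], value)
            self._maxes[i] = self._lists[i][-1]
            # split an oversized sublist so each insort stays short
            if len(self._lists[i]) > 2 * self.LOAD:
                half = self._lists[i][self.LOAD:]
                del self._lists[i][self.LOAD:]
                self._maxes[i] = self._lists[i][-1]
                self._lists.insert(i + 1, half)
                self._maxes.insert(i + 1, half[-1])

        def __contains__(self, value):
            i = bisect.bisect_left(self._maxes, value)
            if i == len(self._lists):
                return False
            sub = self._lists[i]
            j = bisect.bisect_left(sub, value)
            return j < len(sub) and sub[j] == value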

~~~
pixelpoet
"Faster than C", by effective use of C? Maybe I'm missing something, but that
seems to immediately be a contradiction.

~~~
benhoyt
Yeah, you're right. On its homepage it's labelled as "fast as C extensions"
and "often faster than C implementations". I guess because C implementations
often use inferior data structures -- I used to say "algorithm is better than
assembly".

------
ryanmjacobs
Haha, his web directory layout cracked me up.

After reading, I thought to myself, "Hmm... I wonder what else he has written.
I'll navigate upwards a level to
[https://shachaf.net/w](https://shachaf.net/w)..."

It returns a blank page with the letter "w". That actually made me laugh out
loud.

"Ah, shucks. Let's go up again."

Returns a non-styled page that reads "Success!".

~~~
mden
And then when you click on Success you get this graph (which is a bit sad) -
[https://shachaf.net/pet.txt](https://shachaf.net/pet.txt)

~~~
ryanmjacobs
Oh shoot, I didn't even notice that. Looks like he's at a constant
Happiness=100 though.

------
ar-nelson
This is weirdly relevant. I just finished implementing a persistent B-tree
library in Scheme.[1][2]

I don't know why B-trees aren't used more as a purely-functional data
structure. Once they get large enough, they don't move around much when
changed; there is no rebalancing except on deletes, and even that only affects
the direct path to the deleted node.

[1] [https://github.com/ar-nelson/schemepunk#b-trees](https://github.com/ar-nelson/schemepunk#b-trees)

[2] [https://github.com/ar-nelson/schemepunk/blob/master/btree.sl...](https://github.com/ar-nelson/schemepunk/blob/master/btree.sld)

~~~
aabbcc1241
I like the concept of a persistent B-tree; can you share details about the
storage mechanism?

~~~
ar-nelson
I guess I should have disambiguated my usage of "persistent". In this case it
refers to a purely-functional data structure[1], not a data structure that is
persisted to disk. Inserting an item into a persistent B-tree produces a new
tree, while leaving the old one intact. Ideally, this is done with minimal
copying; unchanged subtrees are just pointers to the same subtrees in the old
tree.

This is very similar to persistent hash tries, which are used by purely-
functional languages like Haskell. But I designed this B-tree library as an
implementation of SRFI 146, which uses a comparison function, not a hash
function, to create a key-value mapping.

If you're interested in B-trees being persisted to disk, that's how most
databases and most filesystems already work.

[1] [https://www.geeksforgeeks.org/persistent-data-structures/](https://www.geeksforgeeks.org/persistent-data-structures/)
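
Here's the path-copying idea in miniature, as a toy Python sketch on a plain
binary search tree rather than a B-tree (same principle, smaller nodes; this
is not the schemepunk code):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Node:
        key: int
        left: "Optional[Node]" = None
        right: "Optional[Node]" = None

    def insert(node, key):
        # copy only the nodes on the root-to-insertion-point path;
        # every other subtree is shared with the old tree
        if node is None:
            return Node(key)
        if key < node.key:
            return Node(node.key, insert(node.left, key), node.right)
        if key > node.key:
            return Node(node.key, node.left, insert(node.right, key))
        return node  # key already present: old tree returned unchanged

    t1 = insert(insert(insert(None, 5), 2), 8)
    t2 = insert(t1, 3)           # t1 is left intact
    assert t2.right is t1.right  # the untouched subtree is shared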

~~~
kadoban
Persistent B-trees do seem like they'd work fine. You will have to copy whole
blocks that get modified, but I don't think that should hurt much. I'm fairly
curious how benchmarks will look compared to the usual choices (Haskell has
what, finger trees and radix trees or something? And red-black or AVL have
persistent implementations all over of course.)

~~~
ar-nelson
The reference implementation of SRFI 146 (persistent mappings) is a red-black
tree[1]. I benchmarked my B-tree implementation against it and found that, in
almost every R7RS Scheme, constructing small trees (~100 elements) and
querying and deleting on large trees (~10,000 elements) were 2-3 times faster
than with red-black trees. The major difference was constructing large trees,
which could easily be 10 times faster.

[1]
[https://srfi.schemers.org/srfi-146/srfi-146.html](https://srfi.schemers.org/srfi-146/srfi-146.html)

------
afc
Interesting. :-)

I recently posted a brief document I wrote about a similar structure,
immutable AVL trees. I find AVL trees a very nifty structure; I think they
don't get as much credit as they deserve: O(log n) (which in practice is about
the same as "constant" for most values of n one encounters) insertion, lookup,
deletion, and even append (for position-based trees, with some assumptions).
Incredibly simple implementation. And, if immutable (based on shared memory),
they allow trivial snapshotting, with multiple "concurrent" trees sharing most
of their memory.

Anyhow, it's here: [https://github.com/alefore/weblog/blob/master/immutable-avl-...](https://github.com/alefore/weblog/blob/master/immutable-avl-trees.md)

I have a small implementation here (which I use, among other things, to hold
all lines in a file, in my text editor):
[https://github.com/alefore/edge/blob/master/src/const_tree.h](https://github.com/alefore/edge/blob/master/src/const_tree.h)

~~~
jiggawatts
All node-based trees with a small branching factor of 2-3 have the same
performance problems on modern hardware, irrespective of the theoretical big-O
algorithmic complexity.

They all have poor memory locality, high overheads, and high indirection. This
is kryptonite for typical x86 or ARM CPUs. Unless you're only developing for
some tiny embedded CPU with no cache, no branch predictor, and low frequency,
such algorithms belong in the dustbin of history.

Practically always, you get better _actual_ performance when using array-based
structures such as hashtables, B-Trees, or the like.

Features like snapshotting can be implemented using virtual memory tricks, but
that's a gimmick rarely used outside of pure functional languages. You might
find that simply copying an array when needed for a snapshot is faster than a
fancy tree with sharing as a native capability. The exception would be certain
dynamic programming scenarios where cloning is very common.

~~~
magicalhippo
The traditional ZFS implementation(s) use AVL trees as the in-memory data
structure for allocating space (i.e. finding free disk space).

However, their performance limitations have started to show, so work has been
done to switch to B-trees[1].

[1]:
[https://www.youtube.com/watch?v=LZpaTGNvalE](https://www.youtube.com/watch?v=LZpaTGNvalE)

------
nayuki
I published a reference implementation of B-trees that is optimized for
clarity and conciseness: [https://www.nayuki.io/res/btree-set/btreeset.py](https://www.nayuki.io/res/btree-set/btreeset.py) ;
[https://www.nayuki.io/page/btree-set](https://www.nayuki.io/page/btree-set)

------
tedd4u
You can follow along and build a toy B+-tree-based sqlite clone in this
(incomplete) tutorial.

[https://cstack.github.io/db_tutorial/](https://cstack.github.io/db_tutorial/)

------
aabbcc1241
I like this website: it's simple and to the point. And it works without
JavaScript :)

------
derefr
Does anyone here know if there’s a data structure more ideal (i.e. lower-
overhead) than a B+-tree for holding onto large amounts of on-disk data, if
you can constrain the operations you’re going to do? Personally, in my
workload, I know I’m never going to delete from the tree, only insert and
query. Does this “get me” anything?

For another consideration: B+-trees (or trees generally) aren’t necessarily
the best when the workload is mostly appends to the end of the keyspace. Is
there an on-disk data structure that _is_ tuned for mostly-append, rarely-
insert workloads? Or how about for insert workloads, where each insert is
actually a _batch_ of inserts of a large contiguous range of sorted keys?
(I’ve observed that this workload makes LevelDB fall over, so LSM trees aren’t
it.)

~~~
kyralis
I've personally used a B+Tree variant in this case, with special-cased insert
rules. Rather than splitting nodes when they're full, if they're filled due to
appending at the right-hand side, we simply start the next leaf (or next
interior + leaf pair) during insert. So, for pure in-order insertions, we end
up with a maximally efficient tree with no splits needed; the rest of the
logic still works if you then go back and need to update or insert into the
middle of the tree at a later date.

I'm sure I'm not the only one with that particular implementation, but I'm not
aware of any named algorithm/variant for it.
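
A tiny sketch of that insert rule, using a flat list of sorted leaves in place
of a full B+Tree (hypothetical code, not the implementation described above;
LEAF_CAP is made up):

    import bisect

    LEAF_CAP = 4

    def insert(leaves, key):
        if not leaves:
            leaves.append([key])
            return
        last = leaves[-1]
        if key >= last[-1]:           # appending at the right-hand side
            if len(last) < LEAF_CAP:
                last.append(key)
            else:
                leaves.append([key])  # start the next leaf: no split, and
            return                    # the old leaves stay completely full
        # out-of-order insert: find the right leaf, split only if needed
        i = bisect.bisect_left([leaf[-1] for leaf in leaves], key)
        leaf = leaves[i]
        bisect.insort(leaf, key)
        if len(leaf) > LEAF_CAP:
            mid = len(leaf) // 2
            leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]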

------
htiek
I like this approach to describing B-trees! I recently taught B-trees for a
data structures course and presented them using a similar method. Thought I'd
leave the link in case it was useful:
[http://web.stanford.edu/class/cs166/lectures/05/Slides05.pdf](http://web.stanford.edu/class/cs166/lectures/05/Slides05.pdf)

~~~
acqq
Really wonderful presentation, thank you! Especially the way the "animations"
(actually slides with incremental changes) convey the dynamics of the sequence
of events being explained.

------
6510
I'm somewhat ignorant of theory, but when I ran into this I just kept a
separate unsorted array of values to be added. In moments of idleness the
extra array is sorted and, if time allows, the whole thing is merged into the
original.

Also, a fun approach for an array of sorted numbers is to use bits and have
the offset represent the value.
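
The second idea in code (a quick sketch; Python's arbitrary-precision ints
make a convenient unbounded bitmap):

    # bit i set <=> value i is present, so reading the set bits back in
    # order yields the values already sorted
    bits = 0
    for v in (42, 7, 19, 7):
        bits |= 1 << v  # insert; duplicates are a no-op

    values = [i for i in range(bits.bit_length()) if bits >> i & 1]
    assert values == [7, 19, 42]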

------
winrid
Very... short article. Anyway, my favorite kind of tree is the R-tree, if you
want to go down the research rabbit hole.

~~~
benibela
That is like a German pun

tree is Baum

R tree could be called Raum.

Which means space, which fits, because the tree stores the keys spatially.

~~~
winrid
I like it!

------
mr_gibbins
What a brilliant one-page summary of a complicated topic. Well done, a
pleasure reading it. Bookmarked.

------
valera_rozuvan
While implementing this, I found that there is a large number of disk hops
(and hence a longer time required to read), and accessing data is slow. Only a
chunk of data fits in a disk block (remember, there is a predefined size limit
for blocks).

How do you get around this?

~~~
thesz
B+-trees degrade quite fast with the slightest hint of randomness in the
inserted data.

To get around this unfortunate property, people invented Log-structured Merge
Trees: you keep the new part of your data in a structure that is cheap to
insert into randomly, and once in a while merge it (as sorted data!) with the
main data. This way degradation is greatly reduced, at the expense of reading
speed around merges.

BTW, LSM trees are a variant of the logarithmic method - a way to create
dynamic structures (fast to insert and query) from static ones (slow to
insert, fast to query). Even sorted arrays can be made fast for insertion this
way.

Anyway, try to keep your insertion order as sorted as possible. Delay actual
insertion into the on-disk data if needed.
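
A toy sketch of that shape in Python (not any real LSM engine; the flush
threshold is tiny just for illustration):

    from heapq import merge

    main, buffer = [], []  # big sorted run + small unsorted buffer

    def insert(key):
        buffer.append(key)    # cheap, order doesn't matter
        if len(buffer) >= 4:  # made-up flush threshold
            flush()

    def flush():
        global main
        # one sequential pass merges the sorted buffer into the main run
        main = list(merge(main, sorted(buffer)))
        buffer.clear()

    for k in [9, 2, 7, 4, 1, 8]:
        insert(k)
    print(main, buffer)  # [2, 4, 7, 9] [1, 8]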

~~~
senderista
Could you elaborate on the "logarithmic method"? The fastest technique I know
of to insert into sorted arrays uses a square-root decomposition.

~~~
thesz
You can split a collection of 2^(k-1) <= N < 2^k elements into up to k arrays
of length 2^i. Insertion into an empty one-element array is easy; inserting
into an array that already holds a single element creates an array of two
elements. If you already have an array of two elements, you merge everything
into an array of 4 elements, and so on.

The merge operation is fast and cache-oblivious. Thus, insertion into a set of
these arrays is fast.

Queries are not so fast - O(log^2 N). But I can direct you to the
Cache-Oblivious Lookahead Arrays (COLA) paper, where a technique to get lookup
back down to O(log N) complexity is described.

Or you can simply merge these arrays into one at the end of a bulk insertion.

Source code (not mine):
[https://github.com/giannitedesco/cola](https://github.com/giannitedesco/cola)
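
A sketch of that binary decomposition in Python (my own toy version, not the
linked cola code):

    import bisect
    from heapq import merge

    def insert(levels, value):
        # levels[i] is None or a sorted list of exactly 2**i elements;
        # inserting works like binary addition, with merge as the carry
        carry = [value]
        for i, level in enumerate(levels):
            if level is None:
                levels[i] = carry
                return
            carry = list(merge(level, carry))  # two sorted runs -> one
            levels[i] = None
        levels.append(carry)

    def contains(levels, value):
        # O(log^2 N): one binary search in each of the O(log N) levels
        for level in levels:
            if level:
                j = bisect.bisect_left(level, value)
                if j < len(level) and level[j] == value:
                    return True
        return False

    levels = []
    for v in [5, 3, 9, 1, 7]:
        insert(levels, v)
    assert contains(levels, 9) and not contains(levels, 4)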

------
hyperpallium
I understand that B+ trees made relational databases practical.

Is that true? How are they used (sorted, according to the link)? How do B+
trees differ from B trees?

~~~
shakna
They have some things that a relational database likes:

+ Fast search (O(log N))

+ Fast insert/delete (O(log N), even when nodes split)

+ Fast in-order iteration (unlike a hash map or similar)

You could have a look at [0] for a deeper dive.

[0]
[https://cstack.github.io/db_tutorial/parts/part7.html](https://cstack.github.io/db_tutorial/parts/part7.html)

~~~
hyperpallium
Thanks, sounds like sorting is an implementation detail, and not the main
benefit in itself.

~~~
aboodman
Imagine you're doing:

SELECT * FROM EMPLOYEE, DEPARTMENT WHERE EMPLOYEE.DEPARTMENTID = DEPARTMENT.ID

The naive thing would be to iterate employees and for each one read the
corresponding department. However, that would mean one disk seek per employee.

Disk seeks on the rotational disks relational dbs were developed for took tens
to hundreds of ms to complete. Even today's rotational disks have seeks that
take around ten ms. So doing one seek per join result doesn't work at all. You
can add caching to disguise this to some extent, but that costs memory.

Compare to B-Trees: You build an index on EMPLOYEE.DEPARTMENTID. That index is
also a B-Tree. Because both are sorted, you can just walk the employee index
and the department table in parallel by id. You only need to do n/k seeks
(where k is the branching factor of the b-tree), and only one record's worth
of caching per table (the one under the current cursor).

Bonus: because the indices are sorted, it's easy to pipeline each seek after
the previous one to further reduce the latency of the query.
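
A sketch of that parallel walk as a merge join (hypothetical in-memory tables
standing in for sorted on-disk scans):

    def merge_join(employees, departments):
        # both inputs are sorted by department id; one forward pass over
        # each, so there is no per-employee seek
        deps = iter(departments)
        dep = next(deps, None)
        for emp_id, dept_id in employees:
            while dep is not None and dep[0] < dept_id:
                dep = next(deps, None)
            if dep is not None and dep[0] == dept_id:
                yield emp_id, dept_id, dep[1]

    employees = [(1, 1), (3, 1), (4, 2), (2, 3)]          # (emp_id, dept_id)
    departments = [(1, "Sales"), (2, "Eng"), (3, "Ops")]  # (dept_id, name)
    print(list(merge_join(employees, departments)))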

~~~
hyperpallium
Thanks, makes sense! I've seen it noted before that comparing sorted values is
fast, even when not all of them match.

If "building an index" means storing just the keys (not all the rows of the
table), doesn't that indirection mean an extra seek to get the full row? (I
probably have a fundamental misunderstanding here)

~~~
aboodman
You're right!

But you can sort the second column of the index too. So in our case, the index
would be sorted by dept id, then employee id.

    
    
      deptid  empid
      1       1
      1       3
      1       7
      1       10
      ...
      2       4
      2       5
      2       9
      ...
      3       2
      3       6
      3       8
    

Now if we just do the basic thing and run through the index in order, we'll
end up doing m sorted scans of the employee table, where m is the number of
unique department ids. Not ideal, but if departments are large, still far
better than seeking each employee individually.

But we can do better: If there are relatively few departments, the database
can put a cursor at the beginning of each run of them:

    
    
      deptid  empid
      1       1    <- cursor
      1       3
      1       7
      1       10
      2       4    <- cursor
      2       5
      2       9
      3       2    <- cursor
      3       6
      3       8
    

Now the database can scan those cursors in parallel, merge-sort them, and feed
the result into a single scan of the employee table. Once again we have one
sorted scan of the employee table, at the cost of O(m) extra memory.
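
That cursor merge is just a k-way merge of sorted runs; a Python sketch with
the runs from the example above (heapq.merge plays the database's part):

    from heapq import merge

    runs = [            # employee ids under each cursor, sorted per run
        [1, 3, 7, 10],  # deptid 1
        [4, 5, 9],      # deptid 2
        [2, 6, 8],      # deptid 3
    ]
    scan_order = list(merge(*runs))
    print(scan_order)   # [1, 2, 3, ...]: one sorted pass over employees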

This isn't a general solution. If there are zillions of departments and each
one has only a handful of employees, this doesn't work. And in modern
databases low-cardinality indexes like our dept_emp_id index sometimes use
more specialized data structures.

One of the beautiful/crazy things about working on databases is that it's a
product promising an abstraction that can't actually be implemented perfectly
in all cases. Databases use all kinds of heuristics internally to decide among
different strategies to answer queries. And vendors are constantly refining,
trying to get closer to an ideal they can never actually reach.

But this particular strategy is a common and important one and an example of
why B-trees were so important to early relational systems.

------
globular-toast
I love B-trees. Probably my favourite data structure. But do yourself a favour
and read Knuth's treatment of them.

------
zanderwohl
I love the idea, but I'd like to see someone implement this and then show
performance and memory comparisons against other trees and normal arrays.

------
every
Faint memories of MUMPS...

------
raincom
Very intuitive explanation.

------
techslave
i don’t get it. what’s next, visual bubble sort?

------
geophile
Uhh ... why is this the top article on HN? I know all about B-trees, and I do
not understand what is new or interesting in this very brief page.

~~~
photoGrant
Your knowledge isn't mine -- I found this to be quite a condensed way to
understand something I have little familiarity with.

What do you have to contribute that teaches, rather than criticises?

