Grokking AVL and RAVL Trees (photonlines.substack.com)
53 points by photon_lines on Aug 6, 2023 | 20 comments



This feels written by ChatGPT, certainly the first and last paragraphs. And it omits a few important characteristics. Balanced trees were known, but balancing was expensive. AVL trees have an O(log n) cost for search, insertion and deletion, and they achieve this not by balancing but by almost balancing.
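For concreteness, here's a minimal sketch (mine, not from the article) of what "almost balancing" means: each node only keeps its two subtree heights within 1 of each other, and that invariant alone is what bounds the height at O(log n).

    #include <algorithm>
    #include <cstdlib>

    struct Node {
        int key;
        Node* left = nullptr;
        Node* right = nullptr;
    };

    // Height of a subtree; a real AVL node would cache this instead of recomputing it.
    int height(const Node* n) {
        return n ? 1 + std::max(height(n->left), height(n->right)) : 0;
    }

    // Balance factor: positive = left-heavy, negative = right-heavy.
    int balance_factor(const Node* n) {
        return height(n->left) - height(n->right);
    }

    // The AVL invariant: every node's balance factor is -1, 0 or +1.
    bool is_avl(const Node* n) {
        if (!n) return true;
        return std::abs(balance_factor(n)) <= 1
            && is_avl(n->left) && is_avl(n->right);
    }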


It was mostly written by me, although I do admit to using ChatGPT to help generate some of my material. Most of the recent blog posts are Feynman-style lecture notes I wrote for myself to learn new concepts; this piece I actually wrote well over 10 years ago and figured I'd share in case anyone else wanted a quick and visual overview. Apologies for missing any important concepts. At the time I wrote down the material I found important, so there are definitely things I may not have explained correctly.


All examples rotate a node with a single child; what about the case where you have two children?


You would only rotate a node when it becomes unbalanced (i.e. the only case you deal with is a new child node being added to another child within the tree). There are no cases where you would need to re-balance when you add a 2nd child: the balance factor would already have been larger than 1 when the 1st child was added to the node, so you could never find cases like this (unless I'm misunderstanding what you're asking).


Take the case of 3 with children 1 and 5, and 5 with children 4 and 6: both 3 and 5 have two children. If you add 7 as a child of 6, the tree is unbalanced. This is not one of the cases shown.
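For what it's worth, here's a sketch (mine, not from the article) of how a single left rotation handles exactly that shape; the pivot (5) has two children, and its left child (4) gets re-attached as 3's new right child:

    struct Node { int key; Node* left; Node* right; };

    // Left rotation around x = 3; the pivot y = 5 has two children (4 and 6).
    // (A full AVL implementation would also update cached heights here.)
    Node* rotate_left(Node* x) {
        Node* y = x->right;   // y = 5
        x->right = y->left;   // 4 becomes 3's right child
        y->left = x;          // 3 becomes 5's left child
        return y;             // 5 is the new subtree root
    }

    // Unbalanced after adding 7:        After rotate_left on 3:
    //
    //        3                                  5
    //       / \                                / \
    //      1   5                              3   6
    //         / \                            / \   \
    //        4   6                          1   4   7
    //             \
    //              7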


AVL trees are binary trees, so each node can only have 2 children.


There can be nodes with a single child. Otherwise an AVL tree would always have 2^n - 1 or 2^n - 2^(n-2) - 1 nodes. Imagine adding a node to a balanced tree with 3 nodes.
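Concretely, to spell out that example: start with a balanced 3-node tree (root 2, leaves 1 and 3) and insert 4.

        2                 2
       / \               / \
      1   3     -->     1   3
                              \
                               4

Node 3 now has a single child, yet every balance factor is still in {-1, 0, +1}, so this is a valid 4-node AVL tree.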


What kind of data persistence strategies make sense for AVL trees?


They're only used in memory, AFAIK. I doubt they could outperform B-trees in a db context.


They won't outperform b-trees in memory either. Cache lines and the cache hierarchies of modern computers (as in, built after the 90s) mean that neither AVL trees nor red-black trees have any place anymore.


Yup, b-trees should be the default these days. Even in memory, latencies are so high that the high fanout and shallow tree dominate other concerns. Red-black trees can be straight-up terrible on modern hardware.
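Rough numbers, back-of-the-envelope rather than a benchmark: over a million keys a binary search tree is about log2(10^6) ≈ 20 levels deep, so a lookup is roughly 20 dependent cache misses, while a B-tree with fanout 64 is about log64(10^6) ≈ 3-4 levels deep, and the comparisons inside each node hit cache lines that were just fetched.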

There are radix tree variants that still make sense vs b-trees, though. Typically they take up more space than b-trees, but their operations are bounded by the key length, not the dataset size. They also form the same structure regardless of insertion order, making them useful in cryptographic contexts.


As with linked lists, pretty much the only reasons for which you might want them are:

a) If your keys are really big, copying them around may be slower than using pointer manipulations; you'd have to benchmark this to be sure for particular cases.

b) You can use them as intrusive data structures, with one structure in multiple trees or lists at a time, and you can insert one without dynamic allocation (rough sketch below).
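A rough sketch of (b), with made-up names (Connection, by_idle, by_backend) purely for illustration:

    struct ListHook {
        ListHook* prev = nullptr;
        ListHook* next = nullptr;
    };

    // One object, two embedded hooks: the same Connection can sit in an
    // "idle" list and a per-backend list at the same time, and linking it
    // into either list needs no allocation, just pointer surgery.
    struct Connection {
        int fd = -1;
        ListHook by_idle;
        ListHook by_backend;
    };

    // Splice a hook in after 'pos'.
    void insert_after(ListHook* pos, ListHook* node) {
        node->prev = pos;
        node->next = pos->next;
        if (pos->next) pos->next->prev = node;
        pos->next = node;
    }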


The extra invalidations are nasty, though. There's a reason the standard chose what it did.


What standard?


C++ (mentioned elsewhere in the thread). It's the main language that people bother to implement the full gamut of containers and their optimizations in; Rust is only barely starting to intrude.

Most other languages don't support things like value types, stuffing bits inside aligned pointers, etc. So even if someone implements a particular container it's sure to be quite suboptimal.
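As an illustration of the pointer-bit trick (a sketch, not any particular library's code): a pointer to a type aligned to at least 4 bytes has its two low bits free, so e.g. a red/black colour bit can ride inside the pointer itself.

    #include <cassert>
    #include <cstdint>

    struct Node;  // assume alignof(Node) >= 4

    struct TaggedPtr {
        std::uintptr_t bits;

        static TaggedPtr make(Node* p, bool red) {
            auto raw = reinterpret_cast<std::uintptr_t>(p);
            assert((raw & 0x3) == 0);  // alignment guarantees the low bits are zero
            return TaggedPtr{raw | (red ? 1u : 0u)};
        }
        Node* ptr() const { return reinterpret_cast<Node*>(bits & ~std::uintptr_t{3}); }
        bool  red() const { return (bits & 0x1) != 0; }
    };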


Ah. C++ was mentioned in a different sub-thread. I may be wrong but I don't think the C++ standard says anything about red-black trees. It just provides big-O bounds on operations that mean that you need to pick some sort of balanced binary tree to implement std::set and std::map.

C++ containers are usually suboptimal compared to writing the code you actually need for the situation you're really in. In most programs you have a lot of situations where you can just use a pretty basic data structure like a linked list or a dynamic array and the performance characteristics don't matter much. Then you have one big data structure or a couple of big data structures that need to use something a bit esoteric. Those are where you should probably just write your own.

For example, a text editor is going to have a lot of collections of things. But most of them are very small, and AVL tree vs RB tree vs B-tree simply does not matter. But the buffer representation itself will be some sort of custom data structure: maybe a piece table, a piece tree, a gap buffer, or a rope, or maybe something new. Either way, it's not going to be a reusable data structure.

What I'm saying is that in practice, there's very little actual value in having generic data structures beyond the very most basics. See Go, see C. So I don't think the facility for implementing 'optimal' generic data structures as libraries is particularly useful, and in practice they're suboptimal anyway.


AVL vs RB (and a few others) remains an implementation choice, but B-trees (and a few others, e.g. splay trees) are effectively forbidden by the standard's iterator/reference-stability and complexity guarantees unless you add gratuitous overhead.


See also:

> C++ standard libraries commonly implement the map and set data structures using Red-Black trees, which store one element per node, requiring three pointers plus one bit of information per element to keep the tree balanced. By comparison, B-tree containers store a number of elements per node, thereby reducing pointer overhead and saving a significant amount of memory. For small data types, B-tree containers typically reduce memory use by 50 to 80% compared with Red-Black tree containers.

https://opensource.googleblog.com/2013/01/c-containers-that-...

And also:

> However, there are many reasons why B-trees are often a good choice for in-memory collections. The first reason is cache locality. When searching for a key in a binary tree, the algorithm would visit up to logN elements that are very likely dispersed in memory. On a B-tree, this search will consist of two phases — intra-node search and descending the tree — executed one after another. And while descending the tree doesn’t differ much from the binary tree in the aforementioned sense, intra-node search will access keys that are located next to each other, thus making much better use of CPU caches.

https://www.scylladb.com/2021/11/23/the-taming-of-the-b-tree...

And also, also:

> In this paper, we conduct a comprehensive performance study on five state-of-the-art indices, Masstree, BwTree, FAST, ART and PSL, with respect to throughput, scalability, latency, memory consumption and cache miss rate. According to our results, PSL achieves a similar query performance in terms of get throughput to a read-only index, FAST, and meanwhile performs up to 5x, 0.3x and 2.5x faster than Masstree, ART and BwTree respectively. Another interesting point observed in the results is that the performance of BwTree drops significantly when the workload is write-only, probably due to the contenction caused by the frequent fails on the hot blocks. FAST and ART win the competition on the cost of memory space, leading other indices with a saving factor between 20%-100%. Our experiments also reveal that for the dense dataset, ART and BwTree, which belong to indices built based on tries, may be the suitable candidates for a better overall performance.

https://www.comp.nus.edu.sg/~dbsystem/download/xie-icde18-pa...


a B-tree in the C++ STL would be a welcome addition!


They could be good for in-memory stores / log-structured merge trees / other data store applications, although they aren't used much nowadays. I find them simpler to implement and understand than red-black trees -- although that's a matter of taste, I suppose. They beat red-black trees in read-heavy loads (writes / updates are more costly for AVL trees than for red-black trees). You can find another implementation in Illumos (an open source Unix operating system) available here: https://github.com/illumos/illumos-gate/blob/master/usr/src/...



