Hacker News
Binary Trees are optimal except when they’re not (hbfs.wordpress.com)
75 points by jjgreen 11 days ago | hide | past | favorite | 50 comments

Author seems to have just backed into rediscovering B-trees


> To conclude, binary trees are optimal when we ignore the cost of accessing a node, but they aren’t when it becomes costly to access a node. When we access the nodes on disk, with a high cost, it becomes interesting to bundle many keys in a node, and we gave the optimal solution. However, the problem is often solved much more directly. We just fit as many keys as possible in an I/O block. Disks operate on small blocks of (typically) 512 bytes, but file systems operate in somewhat larger, but fixed-sized blocks, something like 4 or 8k. A good strategy is therefore to fit as many keys as possible in that block, since even if the number of comparisons is large, it will still be faster than bringing that block into main memory.

I didn’t get that impression. I got the impression the author knew full well what B-trees are and is alluding to them.

The author’s other writings seem to show that they do indeed know of B-trees.

Skiplists are my favorite data structure. There are a few use cases that they can't satisfy, but for the typical memory-bound key-value store application, it's a very simple & effective approach.

If you have to go to disk, a combination of: splay tree, batching and append-only log structure are your best bet.

If you are a fan of skip lists, you may want to look into treaps as well. "Treap" is a portmanteau of "tree" and "heap": it is a binary search tree with respect to the search key, and it is a heap in another random or quasi-random quantity such as a hash of the key. Treaps have essentially all the properties of skip lists, and they are even simpler to implement.

One way to look at treaps is to start with a skip list. A skip list has variable-size nodes, which is kind of annoying. If you transform the variable-size node into a linked list, you end up with something that is essentially a binary tree, but now you have redundant information in multiple tree nodes. If you remove all redundancy you get the treap. This is not how one usually thinks of treaps, and it is not how they were invented, but it's good to be aware of the correspondence.

For a given set of keys and hash function, the treap is effectively unique (unlike binary search trees) irrespective of the insertion order, which is sometimes a good property to have.
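A minimal treap sketch (illustrative, not production code; names are made up) showing the shape is independent of insertion order, using std::hash of the key as the quasi-random priority:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// A treap: a BST on `key`, a max-heap on `prio`. Using std::hash as the
// priority makes the resulting shape deterministic for a given key set.
struct TNode {
    std::string key;
    std::size_t prio;
    TNode *left = nullptr, *right = nullptr;
    explicit TNode(std::string k)
        : key(std::move(k)), prio(std::hash<std::string>{}(key)) {}
};

TNode* rotate_right(TNode* n) { TNode* l = n->left;  n->left = l->right;  l->right = n; return l; }
TNode* rotate_left (TNode* n) { TNode* r = n->right; n->right = r->left;  r->left  = n; return r; }

TNode* treap_insert(TNode* root, const std::string& key) {
    if (!root) return new TNode(key);
    if (key < root->key) {
        root->left = treap_insert(root->left, key);
        if (root->left->prio > root->prio) root = rotate_right(root);   // restore heap order
    } else if (key > root->key) {
        root->right = treap_insert(root->right, key);
        if (root->right->prio > root->prio) root = rotate_left(root);
    }
    return root;
}

void preorder(const TNode* n, std::vector<std::string>& out) {
    if (!n) return;
    out.push_back(n->key);
    preorder(n->left, out);
    preorder(n->right, out);
}
```

Inserting {"dog", "ant", "fly"} and {"fly", "ant", "dog"} produces identical trees, because the heap order on the hashed priorities pins the structure.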

> If you have to go to disk, a combination of: splay tree, batching and append-only log structure are your best bet.

Does anyone here actually use this combo (I don't know much about this), and can you show it's good?


If I were to build an RDBMS today, for OLTP rather than OLAP, is this a good combination? I have imagined building my relational language around this stuff, and have wondered if a paged "columnar" layout could be the best:

   Page1(SizeOfBlock) =
      col1 = 1, 2, 3,
      col2 = "hello" "cruel" "world"
   Page2(SizeOfBlock) =
      col1 = 1, 2, 3,
      col2 = "hello" "cruel" "world"
So this flips the rows into columns, but for a smallish batch. Probably making locks expensive, but I have never implemented this stuff.

I don't know about splay trees specifically, but the combination of an in-memory index structure and append-only log persistence has been a common theme in recent database research. Microsoft's Bw-tree, a derivative of which powers CosmosDB, is one example (though it uses a lock-free hash table instead of a splay tree or similar).

The reason for the splay tree is its self-adjusting behavior. When combined with an append-only log, you are naturally pushing the most frequently utilized information to the front, and letting older items fall to the back of the log.

All you have to do to keep an item alive is to access it using a splay tree. Cleanup is as simple as deciding how many older log segment files to chop.

In practical use cases you generally don't even have to worry about potential downsides. The sequential integer insert thing giving you log(N) is so rare. Most live systems are touching tons of items in arbitrary order so the probabilities work themselves out.
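For anyone who hasn't seen the self-adjusting behavior in code: here is a heavily simplified "move-to-root" sketch (names made up). Real splay trees use zig-zig/zig-zag double rotations, which is what gives the amortized O(log n) bound; plain move-to-root just illustrates that accessing a key pulls it to the top.

```cpp
#include <cassert>

// Every lookup rotates the sought key all the way up to the root, so
// recently accessed items cluster near the top of the tree.
struct SNode {
    int key;
    SNode* l = nullptr;
    SNode* r = nullptr;
};

SNode* rot_r(SNode* n) { SNode* x = n->l; n->l = x->r; x->r = n; return x; }
SNode* rot_l(SNode* n) { SNode* x = n->r; n->r = x->l; x->l = n; return x; }

// Plain BST insert, no balancing.
SNode* bst_insert(SNode* root, int key) {
    if (!root) return new SNode{key};
    if (key < root->key) root->l = bst_insert(root->l, key);
    else if (key > root->key) root->r = bst_insert(root->r, key);
    return root;
}

// Lookup that rotates the found node (or the last node on the search
// path) up to the root on the way back out of the recursion.
SNode* access(SNode* root, int key) {
    if (!root || root->key == key) return root;
    if (key < root->key) {
        if (!root->l) return root;
        root->l = access(root->l, key);
        return rot_r(root);
    } else {
        if (!root->r) return root;
        root->r = access(root->r, key);
        return rot_l(root);
    }
}
```

After inserting 1..7 in order (a degenerate right spine), accessing 7 makes it the root; accessing 1 afterwards makes 1 the root again, and the rotations preserve the in-order key sequence throughout.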

I can see how that'd be very effective for a cache, but I'm not at all seeing how it'd work for a database. You'd need to touch all live data to copy it forward and keep track of the low water mark before you truncate the prefix of the log. That'd have horrific write amplification.

> You'd need to touch all live data to copy it forward

This is already happening as a natural part of the utilization patterns. Just looking at something in a splay tree will cause it to be relocated to the root (the most live part of the log). That is the dynamic part of the data structure. You are doing a huge part of the GC effort simply by reading the database.

There are clearly applications where this would not be a good fit (i.e. data warehousing where you want to keep old shit on purpose). Even with the natural cleanup mechanism, you may still need to save stuff you haven't accessed in a while. Linked list nodes built on top being an example of an area that would need careful attention to avoid chopping up data structures unexpectedly.

>That'd have horrific write amplification.

This is precisely why I recommend batching as part of the approach in my original post above. With batching, you can take a total rewrite of all nodes in a linked list and squeeze them into a single I/O operation that flows into exactly the # of storage blocks required. In implementations without batching, you are talking about block I/O per linked list node that needs updating. Clearly this is a total non-starter for drive lifetime & performance.

That particular page layout is called PAX, I believe [1].

[1] https://www.pdl.cmu.edu/PDL-FTP/Database/pax.pdf

Great, I was sure I was not the first to think of this :)

Links to efficient implementations would be much appreciated.

I've played around with the idea but eventually settled on binary searched arrays for the small stuff and red/black trees for the rest.

One thing that bothers me about skiplists is having a static number of levels, which will either be too few or too many for most situations.

> One thing that bothers me about skiplists is having a static number of levels, which will either be too few or too many for most situations.

You should be inserting a new level whenever you add an element to the highest. I like skiplists, they're fun to teach, but the part that irks me the most is that their O(log n) comes through "amortization": sometimes it's a long search, sometimes it's a short search, so it "averages out in the end".

Geometrically-growing arrays (std::vector) and hash tables also have amortized O(1) insertion complexity. They are some of the most widely-used data structures of all. I agree that amortized guarantees can be a problem in certain applications, but it seems that programmers are happy to accept them most of the time.
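You can watch the geometric growth directly: the number of reallocations grows logarithmically in the element count, which is exactly where the amortized O(1) push_back comes from.

```cpp
#include <cstddef>
#include <vector>

// Count how many times std::vector reallocates while growing to n elements.
// Geometric growth means this is O(log n), so the total copy work is O(n)
// and each push_back is amortized O(1).
int count_reallocations(std::size_t n) {
    std::vector<int> v;
    int reallocs = 0;
    std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != cap) { ++reallocs; cap = v.capacity(); }
    }
    return reallocs;
}
```

For 100,000 elements the count lands in the tens, not the thousands (the exact number depends on the library's growth factor).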

True, I think it's more that skiplists are both amortized and (taught to) expand randomly unless you dig deeper into optimizing the structure. I think with regard to hash tables, the annoyance of hashing is overlooked due to the simplicity of the algorithm.

From a purely academic side, it's difficult to explain why you'd use one data structure over another. We can say THIS structure has an O() for THIS operation, but it's often at the tradeoff of another operation. Then it comes down to ease of implementation, where I'd say skiplists are harder to explain vs. hashing.


Not an implementation, but a very fun explanation, that you could definitely implement if desired.

You can do probabilistic inserts which should ensure the # of levels scales appropriately with the quantity of data. The probability factor can be a simple constant or some linear function of the quantity and/or index #.
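The simple-constant variant is just a coin flip per promotion; a sketch (the 0.5 probability and level cap are illustrative choices):

```cpp
#include <random>

// Probabilistic level assignment for a skip-list node: keep promoting while
// a fair coin comes up heads. Levels follow a geometric distribution, so
// with n elements the expected maximum level is about log2(n), and the
// number of levels scales with the quantity of data instead of being fixed.
int random_level(std::mt19937& rng, int max_level) {
    std::bernoulli_distribution promote(0.5);
    int level = 1;
    while (level < max_level && promote(rng)) ++level;
    return level;
}
```

With p = 0.5 the mean level is just under 2, and roughly half of all nodes stay at level 1.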

I'm sure online resources abound...but do you have a favorite descriptor or implementation of them? Or a particular use-case?

I've used indexed skiplist implementations in DIY database projects. The idea was to first build a super fast K/V store (i.e. to memory+disk), and then build skiplists on top of that for tying together collections of indexed data. The problem with skiplists is that they kinda suck when used directly with mutable block storage. There is no way to do incremental updates like with a splay tree where you can write just the diff (a modified sub-tree and a new root node) to the end of a contiguous log file.

Basically algorithmic complexity analysis often ignores the cost of accessing data because the underlying storage is sometimes unknown - be it in memory, on a spinning disc or on magnetic tape.

Unfortunately that gives suboptimal algorithms because that basic practicality is ignored. You then end up with a bunch of algorithms that look worse “on paper” but still outperform the theoretically optimal ones.

Often the number of items is also ignored. If you want to implement a set-like feature but you’ll be using fewer than X elements, you can often get better performance using a linear search through an array than using a full-fledged hash set.
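For example (the size 8 here is illustrative; the actual crossover point vs. a hash set varies by hardware and should be measured):

```cpp
#include <algorithm>
#include <array>

// For a handful of elements, a linear scan over a contiguous array is often
// faster than a hash set: no hash computation, no buckets, no pointer
// chasing, and the whole array likely sits in one or two cache lines.
bool contains(const std::array<int, 8>& small_set, int x) {
    return std::find(small_set.begin(), small_set.end(), x) != small_set.end();
}
```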

This reads like a criticism of complexity analysis, but really you're just describing it. By definition and by design, algorithmic complexity analysis only shows you how an algorithm behaves asymptotically.

That means it's often not the right tool for selecting an algorithm with optimal real-world performance, but I think that's less a fault of the tool than it is a fault of the tool selector.

Saying, "complexity analysis ignores constant factors and small-data size performance" is sort of like saying "hammers ignore the threading on screws".

Saying something is a poor fit in the real world is a perfectly valid criticism. Arguably complexity analysis is such a poor fit programmers should never actually use it, but vastly simplifying problems can make for a good first approximation.

Saying "why do we teach programmers complexity analysis in a word RAM model, there are no idealized RAM machines that behave like that in the real world" is no better than saying "why do we teach people Euclidean geometry, there are no such idealized geometrical systems in the real world".

The point is that: 1. The analysis is useful and instructive. 2. It teaches you how to analyse computing models, starting from a simple one, so that when you need to, you can build and analyse more complex models specific to your domain.

In the real world spacetime is very close to Euclidean geometry near earth. Complexity analysis on the other hand can be wrong by literal orders of magnitude.

If carpenters needed to worry about the internal angles of a triangle adding up to anywhere between 96.72 and 172 degrees then I think people would consider Euclidean geometry a waste of time, but 180 +/- 0.0000001 is fine.

You are responding to an argument I never made.

You don't teach children Euclidean geometry because they will ever actually need to calculate the area of a triangle. You teach them geometry because it is a good exercise in working with logic, proofs and mathematical models.

>Complexity analysis on the other hand can be wrong by literal orders of magnitude.

Seeing how complexity analysis only applies to the asymptotic behavior of an algorithm, I don't even know how to make any semantic sense of that statement. It is not even wrong.

> only applies to asymptotic behavior.

This is false; complexity analysis is vastly more than just Big O. Other resources like memory usage are also considered, as are exact results.

Can you explain what it means for a complexity analysis result to be wrong by literal orders of magnitude? I don't understand what that means, even with the clarification.

Memory usage is perhaps the easiest to understand. Allocation and deallocation to specific addresses can result in a lot of unusable address space depending on the hardware and OS. There is actually a huge body of work around memory allocation due to such issues.

Ex: Suppose you are allocating memory for a growing array. It’s allocated (+) and deallocated (-) in sequence: +1000, then +1001, then -1000, creating a 1000-byte-wide gap at the start, but your next +1002 doesn’t fit in the gap. You now have 1001 bytes allocated but, due to fragmentation, in effect 2001 are used up. This one isn’t that bad, since the second time around you have a 2001-byte-wide free space, but in more complex situations you can run out of memory vastly sooner than expected.

On PC’s virtual memory mapping generally means you don’t need to worry about this, but it used to be a huge pain and it can still be an issue for embedded systems.

PS: Not that this is 100% solved on PC’s, it’s still possible to run out of physical RAM much sooner than expected just significantly less likely.
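The arithmetic in the example above can be made concrete with a toy first-fit allocator over a flat address space (entirely hypothetical, for illustration only):

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Toy first-fit allocator: blocks live in a flat address space, and each
// allocation takes the lowest-addressed gap that fits.
class ToyHeap {
    std::map<std::size_t, std::size_t> live_;  // start address -> size
    std::size_t limit_ = 1 << 20;
public:
    std::size_t alloc(std::size_t n) {
        std::size_t addr = 0;
        for (const auto& [start, size] : live_) {
            if (start - addr >= n) break;      // gap before this block fits
            addr = start + size;               // otherwise skip past it
        }
        assert(addr + n <= limit_);
        live_[addr] = n;
        return addr;
    }
    void free(std::size_t addr) { live_.erase(addr); }
    std::size_t live_bytes() const {           // bytes actually allocated
        std::size_t total = 0;
        for (const auto& [start, size] : live_) { (void)start; total += size; }
        return total;
    }
    std::size_t high_water() const {           // address space consumed
        if (live_.empty()) return 0;
        auto last = std::prev(live_.end());
        return last->first + last->second;
    }
};
```

Replaying the sequence from the comment: after +1000, +1001, -1000, only 1001 bytes are live but the high-water mark is 2001; the +1002 doesn't fit in the 1000-byte gap, so afterwards 2003 live bytes consume 3003 bytes of address space.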

> Arguably complexity analysis is such a poor fit programmers should never actually use it

I've never once felt that way in my career. I've often found that the insight I get from an algorithm's complexity is not a complete picture of its worthiness for a particular problem, but it is almost always a necessary piece of the puzzle.

I would never feel confident using an algorithm that I'd only empirically measured on some number of hand-chosen datasets because then who knows how it will behave when a user throws something unexpected at it.

I think you and the person you responded to are talking past each other a little bit. Complexity analysis may well be a poor fit, but still not as poor as a hammer's screw-turning ability. Complexity's poor fit to the real world is a criticism of its use as a programmer's tool, not the concept itself.

What you are saying may be a valid criticism of basic undergraduate algorithms courses, but it is a bit unfair as an absolute criticism of the field. There exists a huge literature that studies algorithms under various models of the memory hierarchy, and under various models of I/O. Even Knuth's venerable The Art of Computer Programming from 50 years ago studied the problem of optimal sorting assuming that the data is on tape and one has six tape drives available (for some value of six that I forgot).

Not many people realize that the literature you mention is the field of "analysis of algorithms", which is a sub-field of (or, in practice, somewhat different from) computational complexity theory / theory of algorithms. Robert Sedgewick (CS professor at Princeton, and an early PhD student of Knuth) has a great book with Flajolet on Analysis of Algorithms [1]. In one of the lectures from his course [2] he draws a distinction between the complexity analysis usually taught in basic undergraduate algorithms courses (in that context he calls O-notation "not the scientific method") and AofA, which says "running time is ~aN^c" instead of "running time is O(N^c)", and actually measures against real programs. Watch the video or read the slides; it's an interesting distinction.

And the Purdue website [3] is even better at giving a sense of the field.

[1]: https://aofa.cs.princeton.edu

[2]: https://aofa.cs.princeton.edu/online/slides/AA01-AofA.pdf / https://www.coursera.org/learn/analysis-of-algorithms/lectur...

[3]: https://aofa.cs.purdue.edu

I recently had occasion to re-implement the R-tree structure [1], which I based on the original code by Guttman. There, each set of child nodes was stored in a (hard-coded 1k) page of memory, and there was a macro which allowed n sets of child nodes in a page, but that value was set to 1. I assumed that this was an experiment which turned out not to work, hence the hard-coded 1 to disable it. As the code was nearing completion I experimented with values other than 1, and found a 3-fold increase in speed for a value of n=8 (IIRC) against a 4k memory page. It's always worth testing your assumptions :-)

[1] http://soliton.vm.bytemark.co.uk/pub/jjg/en/code/librtree/

Agreed. There are plenty of cases where an asymptotically "worse" algorithm performs better than a "better" one for all practical sizes.

It is very much possible that the asymptotically "better" algorithms have coefficients so high that the break-even point takes a week or more of computation. There are not many programs where users will tolerate using them with so much data that operations take that long.

> The spaceship operator, <=> is C++20’s version of C’s old timey qsort comparison function (it is in fact much better because it also automagically provides all comparison operators). Basically, a<=>b can return <0 if a<b, 0 if they’re equal, and >0 if a>b

Wow, had no idea this operator was a thing now.

Does anyone here know if <=> actually gives a performance boost over separately checking < and then =, at least for basic things like integers and ASCII strings?

A quick search online doesn't seem to provide an obvious answer -- whether it compiles to something more efficient, or whether compilers were already smart enough to compile previous <-then-= checks in the same manner.

I'm assuming it's syntactic sugar, or else it would have been introduced for performance reasons long ago... but surely someone here has the definitive answer.

<=> on primitives probably won't be any faster than the binary comparison operators because of compiler optimizations.

However, something like std::less can be accidentally exponential on deeply-nested objects like trees: https://news.ycombinator.com/item?id=26344987. This is because two separate comparison operations may be performed at each level (< in both directions for std::less), each of which may recurse on the corresponding nodes. <=> does not suffer from this problem as it means that only one (rich) comparison is performed for each node.

`<=>` seems to replace `-` not `if`.

Only for integers -- you can't subtract strings, but <=> compares them just fine.

Yup, ignoring the costs of memory/storage access (cache, main memory, disk) can hurt an algorithm's real-world performance. Laying your data out in chunks that match the memory/storage blocks sizes (cache line size, disk block size) can yield much better performance.

And while having the algorithm be aware of the chunk sizes in every layer of your memory hierarchy is optimal, it can be inconvenient to get all that set up. But what's cool is that you can often lay your data out in a fractal-like pattern, and get pretty good performance no matter what the various chunk sizes are: https://en.wikipedia.org/wiki/Cache-oblivious_algorithm

If the author fetches an array of 1000 children from disk, what prevents him from using binary search on those?

This makes me wish that we had a language (or library, or runtime, take your pick) that abstracted away performance concerns.

As an example of how this might work, suppose you want a data structure that provides an ordered collection of items, with the ability to insert, randomly access, remove, and iterate over them. The compiler/library/runtime might be able to supply any of a binary tree, a linked list, or an array.

You, the programmer, have some idea of the distribution of the data that will be stored in the collection (e.g. the collection will contain integers in a specific range) and the operations your program will perform on it (e.g. the collection will never be iterated over, insertions will be twice as common as removals, only the first/last elements will ever be accessed at any given time). The compiler knows the performance characteristics of each of the operations over various distributions of data (e.g. linked list insertion has a cost of O(1), random access has a cost of O(n), removal has a cost of O(1)).

The language would allow the programmer to provide the compiler with hints about the expected distributions of data and operations, and the compiler would use this to pick the optimal data structure for the use case. Another example might be the compiler picking a sorting algorithm automatically, based on the expected distribution of collection sizes and data (e.g. automatically pick insertion sort if the collections are typically small and already partially sorted, but merge sort if they are typically large and unsorted).

It would also be interesting if this language/runtime could enable sampling of statistics on the actual observed distributions of data/operations under production workloads, and make it easy to update the programmer's specified distributions to get better performance with minimal effort. Or if the compiler could be given the performance characteristics of the target hardware as well, which could impact its choices. Maybe the program will be reading from an SSD, making the overhead of caching something in memory no longer worth it?

More broadly, it would be very cool if program semantics and performance could be decoupled entirely, and the compiler could have the ultimate freedom to pick the optimal data structures and algorithms so long as the program's observed behavior remained the same.

This sounds fairly doable. Rust pseudocode:

```
// Traits for the operations you want to perform...
trait Map<K, V> { /* ... operations ... */ }
trait List<K, V> { /* ... operations ... */ }

// Functions which dispatch for your data/operation distribution
fn gimme_map<K, V>(data: DataDistribution, op: OperationDistribution) -> impl Map<K, V> { /* .. */ }

enum DataDistribution { /* ... */ }
enum OperationDistribution { /* ... */ }
```

Ideally the decision would occur at compile time rather than runtime, to eliminate the dispatch overhead, so I think it would be likely to require macros, constexprs, or whatever compile-time execution mechanism is available in your language of choice

This very much sounds like a typical Relational Database (to a degree), with the language thus being SQL.

You're right, and I think a lot of data structures (all of them?) can be seen as special cases of relational databases. I also feel like there are similarities to logic programming languages like Prolog and Datalog, and a language that combined them while still allowing the programmer to finely tune performance through explicit hints could be powerful.

A binary search tree with string keys is just a compressed version of a binary tree with bit keys, which greatly reduces the number of nodes.

E.g. this has three nodes:

    dog
   /   \
  ant   fly

  ant -> 61 6E 74 -> 011000010110111001110100
  dog -> 64 6f 67 -> 011001000110111101100111
  fly -> 66 6C 79 -> 011001100110110001111001

                           / \
                      ,---'   `-------.
                     /                 \
                    /                   \
                   0                     1
                  /                     / \
                 /             ,-------'   `---.
                /             /                 \
               0             0                   1
                \           /                   /
                 1         0                   0
                /         /                   /
               0         0                   0
                \         \                   \
                 1         1                   1
                  \         \                   \
                   1         1                   1
                  /         /                   /
                 0         0                   0
                  \         \                   \
                   1         1                   1
                    \         \                   \
                     1         1                   1
                      \         \                 /
                       1         1               0
                      /           \             /
                     0             1           0
                    /             /           /
                   0             0           0
                    \             \           \
                     1             1           1
                      \             \           \
                       1             1           1
                        \           /             \
                         1         0               1
                        /         /                 \
                       0         0                   1
                        \         \                 /
                         1         1               0
                        /           \             /
                       0             1           0
                      /               \           \
                     0                 1           1
                   (ant)             (dog)       (fly)
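The bit strings above can be generated mechanically; a small sketch:

```cpp
#include <string>

// Expand each byte of a key into its 8 bits, most significant bit first;
// this is the left/right path you follow in the bitwise trie above.
std::string to_bits(const std::string& s) {
    std::string out;
    for (unsigned char c : s)
        for (int i = 7; i >= 0; --i)
            out += ((c >> i) & 1) ? '1' : '0';
    return out;
}
```

For example, to_bits("ant") yields 011000010110111001110100, matching the listing above.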

Your new tree has the same nodes (with an additional root node), but I can't see how it records that `dog` is the parent of `ant` and `fly`.

"dog" being the parent of "ant" in the original string-based binary search tree is just an accident.

Consider the characters. Is the "g" character in "dog" the parent of the "a" in "ant"? We can "explode" the original tree like this:

      [ L | R | S ]
         /        \
   [L | R | S ]    [0] -> 'd'
             \     [1] -> 'o'
             "ant" [2] -> 'g'
                   [3] -> 0   // null-terminated C string :)

Now we have an anonymous node which has two links to other nodes, and a link to a string. So it is the parent of the string and of the two nodes.

What the binary search tree is doing is representing the sequence { "ant", "dog", "fly" }. It captures that this sequence has three elements, which are in that order.

The fact that "dog" is a parent of "ant" is not relevant in a search tree; it's just an artifact of how the tree is balanced. In the following tree, it is a child of "ant", yet it is an equivalent search tree:

  ant
     \
      dog
         \
          fly

The bitwise tree, a kind of radix tree, does the same thing: it stores "ant", "dog" and "fly", indicating their presence and order in the most detailed way possible: by switching left/right on their individual bits.

> The bitwise tree, a kind of radix tree

I believe "trie tree" is the specific name for these un-compacted ones... So, a "bitwise trie" here.

