Hacker News new | past | comments | ask | show | jobs | submit login
Hoare’s Rebuttal and Bubble Sort’s Comeback (reverberate.org)
147 points by haberman on May 30, 2020 | hide | past | favorite | 48 comments

A quick ramble on the article's point about using a hybrid approach:

As the article says, hybrid algorithms can make good sense, where you use a clever algorithm to break down large problems into many small problems, and use another simpler algorithm to solve the small problems. As the article says, in sorting, it makes sense to use something like quicksort to break down large problems, and to then use bubble sort to solve the resulting small problems. In matrix multiplication, it makes sense to use the mind-bending Strassen algorithm (with superior time complexity compared to the naive algorithm) to break down multiplications of large matrices, and to then use the naive algorithm to solve the resulting small matrix multiplication problems. [0]

The threshold for switching over to the clever solution increases year on year, as hardware improves. Clever algorithms tend to behave worse in terms of branch-prediction and cache.

Apparently a similar thing occurs in large integer multiplication. [1] I'm sure there are plenty of other examples besides.

Related: Of all the known matrix multiplication algorithms, the ones with the best time-complexity are never used in practice, as their constant factors spoil things. [2]

[0] https://en.wikipedia.org/wiki/Strassen_algorithm#Implementat...

[1] https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strasse...

[2] https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...

The "hybrid" as used in this article referred to a partition scheme that combines a "Lomuto partitioning like algorithm" together with a "Hoare partitioning like algorithm".

This amplifies the the strength of each scheme and mitigates the weakness. Ie. Lomuto does too many swaps, but naturally fits a branchfree implementation. Hoare is frugal with swaps, but is rather branchy. The "hybrid" here inherits the frugality of swaps from Hoare's scheme, while benefits from the branchfree implementation of Lomuto scheme.

What you seem to describe here is the classical and well-known hybridization of divide and conquer algorithms. This is also used and implicitly implied when branchless bubble sort is discussed. But the "hybrid" as used in the article wasn't meant to refer to this.


In what application are you going to sort an array of a pure scalar type that is native to the machine?

Usually you're sorting pointers to objects, and there is some indirection required to compare them (perhaps an outright call through a function pointer).

Here is one good application: ordering an array of pointers by increasing address. That doesn't require indirection upon them, and is useful: e.g. if these are pointers into a heap we wish to compact.

> In what application are you going to sort an array of a pure scalar type that is native to the machine?

All the time in my experience. This is why radix sort is so useful. Many key types can be packed into 64 bits in a way that's compatible with the bitwise lexicographic ordering. You can pack 9-character low ASCII strings (or 10-character [a-zA-Z_][a-zA-Z0-9_]* identifiers) in 64 bits. [1] You can even do sub-bit packing, e.g. c + 3 * b + 3 * 7 * a for the tuple (a, b, c) with lexicographic ordering where 0 <= c < 3 and 0 <= b < 7. If you wanted to sort ascending on a and c but descending on b you'd replace b by 6 - b in the embedding. For radix sort you can also support 2's complement integers and IEEE-754 floats by simple key transformations. You get the idea: many complex keys are efficiently reducible to machine integers.

My default algorithm for scalar sorting is usually an optimized LSD radix sort. It's inherently branchless; its downside compared to quicksort is worse cache locality for the scattered writes after the histogram phase. Incidentally, quicksort is closely related to MSD radix sort with radix 2, which effectively partitions using data-independent pivots.

[1] For longer strings you would sort on the prefix using this embedding and then do a final pass where you sort among each (hopefully small) set of prefix-colliding strings by a different method, e.g. insertion sort for small sets.

Towards the end of the article, Gerben makes the point that the comparison is not on the critical path of his variant of QuickSort, so even if the comparison involves indirection, that comparison can still proceed in parallel with the main loop of the algorithm:

> Combining these two tricks could lead to merge sort on par with the simple partition loop of QuickSort. However it’s just a mitigation. If instead of sorting a simple primitive value we would sort pointers to a struct. The comparison operator would have an extra load which immediately adds to the critical path. QuickSort is immune to this, even rather costly comparisons do not influence the ability to make progress. Of course if the comparison itself suffers from frequent branch misses, then that will limit the ability to overlap different stages of the iteration in the CPU pipeline.

In the benchmark code for the blog, he includes a benchmark that has indirection to illustrate this, though these results were not included in the article itself. On my machine I get:

    $ CC=clang bazel build -c opt :all
    $ bazel-bin/bench_sort --benchmark_filter=BM_IndirectionSort
    Benchmark                                              Time           CPU Iterations
    BM_IndirectionSort<1, std_sort>                       64 ns         64 ns   10900000
    BM_IndirectionSort<1, std_stable_sort>                89 ns         89 ns    7900000
    BM_IndirectionSort<1, std_heap_sort>                 112 ns        112 ns    6300000
    BM_IndirectionSort<1, exp_gerbens::QuickSort>         26 ns         26 ns   27000000
    BM_IndirectionSort<1, HeapSort>                       96 ns         96 ns    7300000
    BM_IndirectionMergeSort<1>                            45 ns         45 ns   15500000

For more clarification. The number 0, 1 in these benchmarks mean the level of indirection. IndirectionSort<0, xxx> is sorting pointers on their address. While IndirectionSort<1, xxx> is sorting pointers on the value it's pointing to. Notice how MergeSort regresses much more then the QuickSort. Without an indirection MergeSort is competitive with an indirection QuickSort becomes relatively much better. As you point out sorting pointers to structs is much more important use case. This property becomes even more important as L1 cache is typically kind of small O(32kb), while L2/L3 cache is large O(1mb). So when sorting pointers it's not unreasonable to hit L2/L3 cache but miss L1. With L2/L3 typically being 10-20 cycles latency (vs 5 cycle for L1) QuickSort is largely insensitive to the L1 cache misses.

    BM_IndirectionSort<1, std_sort>                       93 ns         93 ns    7700000
    BM_IndirectionSort<1, std_stable_sort>               126 ns        125 ns    5600000
    BM_IndirectionSort<1, std_heap_sort>                 172 ns        172 ns    4200000
    BM_IndirectionSort<1, exp_gerbens::QuickSort>         34 ns         34 ns   19000000
    BM_IndirectionSort<1, HeapSort>                      157 ns        157 ns    4700000
    BM_IndirectionMergeSort<1>                            68 ns         68 ns   10400000
    BM_IndirectionSort<0, std_sort>                       74 ns         74 ns    9500000
    BM_IndirectionSort<0, std_stable_sort>                91 ns         91 ns    7800000
    BM_IndirectionSort<0, std_heap_sort>                 125 ns        125 ns    5800000
    BM_IndirectionSort<0, exp_gerbens::QuickSort>         26 ns         26 ns   27200000
    BM_IndirectionSort<0, HeapSort>                       77 ns         77 ns    9300000
    BM_IndirectionMergeSort<0>                            36 ns         36 ns   19400000

Well, consider an authorization system where you have a subject's group memberships and an access control entry set consisting of grants to group and user IDs, and access is granted if the intersection of the two is non-empty.

In such a system, sorting the two sets of IDs allows you to implement an O(N) or O(M log N) algorithm depending on the relative sizes of the two sets. If one set is much smaller than the other, then you can just iterate the smaller set and binary search the larger set. If the two sets are comparable in size you can walk a cursor in each.

These IDs are probably small -- 32 bits will generally do. That's a "pure scalar type that is native to the machine".

Mind you, these are smallish arrays in practice. Typically you might see users with a few hundred group memberships and access controls with a few tens of grants. But you might have many users, and many access controls, so many sorts to perform.

Intersection is calculable in amortized O(M + N) with hashing.

Build hash table in O(N) of the smaller set, then probe M entries of the larger in the table.

If we rely on sorted order, we can make it a requirement/invariant that these sets are maintained in sorted order. Membership in a group rarely changes; if a user is added to a new group, that can be inserted in order. A binary heap structure could be used to keep these objects in array form, while allowing better than linear insertion.

Andrei Alexandrescu's post, linked at the top, is a great read as well: https://dlang.org/blog/2020/05/14/lomutos-comeback/

Great article, this isn't my domain so the results are unexpected and the fact that a domain expert working on the problem can miss stuff like this reinforces something I see over and over when it comes to optimization - forget your assumption, intuition and best practices - everything is context sensitive (size of data, hardware specific) and gets obsolete through hardware evolution - the only way to optimise is to measure to identify bottlenecks after you have working code and them test again that you actually succeeded - don't optimise in advance.

And measure. Always measure. Without data it’s all posturing

the opposition usually sounds like this - "Microbenchmarks lie". On one of the past projects, the leading engineers and architects just loved that phrase and voiced it like a deep profound truth understanding of which was a hallmark of belonging to their circle of chosen. Any specific measurements showing issues with the supposedly brilliant architecture and implementation choices of that project (with many of these choices following the latest "best practices" fads published by some venerable technologists) were dismissed by that phrase just like by magic. Naturally that was one of the slowest, not to mention buggy, projects in my experience, and working in the enterprise space i've seen my share of slow and buggy.

Except that's not opposition? Microbenchmarks do, always, lie; you need to measure the performance of your actual, real code that you're actually caring about the speed of.

Microbenchmarks are not lying, but they can be misleading...

The most frequent problem is selecting the fastest solution with a microbenchmark and creating a lot of unnecessary cache eviction/pollution, sometimes on the code cache, sometimes on L1/L2 or both...

Microbenchmarks are good to explore the optimization space, but the final decision should be made with the full execution context in mind.

I'm surprised that they found a legit use-case for bubble sort. I always had the impression that bubble sort had very little going for it because the naive implementation is much slower than insertion sort, while also not being as intuitive. https://users.cs.duke.edu/~ola/bubble/bubble.html

Out of curiosity, does anyone know how selection sort would compare to bubble sort and insertion sort in this kind of problem?

If you define

  inline bool swap_if(
    bool c, int& a, int& b)
    int ta = a, tb = b;
    a = c ? tb : ta;
    b = c ? ta : tb;
    return c;
and then use it in partition's inner loop like

   right += swap_if(
     *left < pivot,
     *left, *right);
..., compiled with Clang (which generates `cmov` instructions), a simplest-possible quicksort is more than twice as fast as `std::sort`. Then you can doctor up the endgame with heapsort, insertion sort, even bubble sort if you like, and get an incremental improvement.

But using AVX machinery to implement an optimal sorting network for small arrays is faster, if you are sorting scalars. See <https://arxiv.org/abs/1704.08579>

A fully general swap_if,

  template <typename T>
  bool swap_if(
    bool c, T& a, T& b)
    T v[2] = { a, b };
    b = v[c], a = v[1-c];
    return c;
could and should be peephole-optimized to use a pair of `cmov` instructions, but is not in Clang, Gcc, Icc, or MSVC. But even without such an optimization, it makes Quicksort much faster than std::sort. It is slower than the cmov version probably because it costs more L1-bus traffic.

(Gcc, incidentally, is very, very sensitive to details of this version. Change the order of assigning a and b, or use `[!c]` in place of `[1-c]`, and it gets much slower, on Intel, for very non-(to me-)obvious reasons. Gcc also will never, ever produce two `cmov` instructions in a basic block. I have filed a bug.)

Another formulation is

  template <>
  bool swap_if(
    bool c, int& a, int& b)
    int ta = a, tb = b;
    a =  -c&tb | c-1&ta;
    b = c-1&ta |  -c&tb;
    return c;
which Clang also optimizes to a pair of `cmov` instructions. Gcc (like Icc and MSVC) emits the literal and/and/or instructions, which is also faster than std::sort, but less so than the array version.

If `swap_if` were in the Standard Library, it would probably be implemented optimally on all compilers, and almost half of the Standard algorithms could use it to get, often, ~2x performance improvement.

A partition using AVX machinery is potentially much faster, although not as much faster as one might hope. See


That took much of a year to implement. AVX512 instructions enable even better optimizations, but AVX512 might not come out in AMD until Zen-4. (We are at Zen-2 now.)

Your "simplest-possible quicksort" appears to be the same as the first branchless Lomuto partition given in the article.

The premise of the article is that, while branchless, this loop suffers from store/load dependencies that significantly inhibit instruction-level parallelism. Gerben's algorithm gets a 2x win over this approach by using a temporary buffer that prevents loads from aliasing previous stores.

Thank you for the correction.

The best sort in the article appears to yield 3x vs. std::sort for an array that fits in L2 cache. In my testing (with much larger arrays) I got 2.5x without any small-array transition, and 3x with a branchless sorting-network for ranges of 3 or less, and dramatically less code than that presented in the article.

I suspect that the algorithms used in the article, implemented just a little differently, could still yield substantially better performance, beating my simpler code.

This is a consequence of our deal with the devil: caches and speculation make our code faster, but we can no longer know whether it is objectively fast, or how much faster it could be. What seemed fast becomes slow the moment somebody does it faster.

Hm interesting, so is there any analysis / theory that will capture branch mispredictions?

e.g. like big-O for something closer to real machines.

I feel like CPU cycles should already be analyzed separately from memory accesses (although I guess they are usually proportional). But analyzing branch mispredicts seems harder

How does this hold up on ARM and other architectures? (RISC V, etc.)

I'm not sure about much analysis and theory that explicitly captures branch prediction, but in last year's cppcon keynote Andrei offered a possible metric [0] where he takes into account the average "distance" between two subsequent array accesses.

It's basically, (C(n) + M(n) + kD(n)) / n

Where M is moves, C is comparisons, k is a some constant, and D is "distance".

[0]: https://youtu.be/FJJTYQYB1JQ?t=2943

Please take into account these discussions are about scalar sorting methods. For simple symbol types (i.e. cpu register) vectorized sorts are much better and don't suffer branch prediction per comparison (using result masks to do exchange/swap).

Thanks for the interesting post! I liked the MergeDouble approach which I hadn’t seen before. Couldn’t that be extended to finding the median of the results using binary search and then doing four concurrent merges, for even more ILP?

Yes that is very much possible. In general you won't get 4 equal length ranges though. For instance in an already sorted array the median results in (n/2,0) and a (0,n/2) division. But the average case should be two sets of merge intervals both of size (n/4, n/4). In reality I think you will quickly run into other limits. You need quite a bit of registers, that shouldn't spill to stack.

I think the median of the results works great in any case (but, you are the expert here). If we find the result median in each of the two inputs, say 'a' and 'b' we have four merges (two forward, two reverse):

    [0, ..) w (0, ..)
    (.., a) w (.., b)
    [a, ..) w [b, ..)
    (.., n) w (.., n)
each of which should produce n/4 elements and then meet.

If the array is already sorted then great; that split will just "merge":

    [0, ..) w nothing
    (.., n/2) w nothing
    nothing w [n/2, ..)
    nothing w (.., n)
which should go very fast. Anyhow, possible I'm missing something!

No you are not missing something. This is, I think, correct. I haven't tried it but it seems possible and possibly very effective.

What program was used to create the dependency diagrams? And is there some more general introduction for using this sort of dependency analysis for low level optimizations?

That was just hand analysis of the data flow together with Google Diagrams.

As I mention in the article, dependency analysis has, IMO, not received the widespread attention it should. I like Fabian Giesen's article if you want an introduction.


FWIW, there are some interesting tools for microarchitectural performance analysis which are getting closer (although it's a work in progress): https://github.com/MattPD/cpplinks/blob/master/performance.t...

One particularly interesting example in this context is OSACA (Open Source Architecture Code Analyzer), https://github.com/RRZE-HPC/osaca

The related publications explain the approach used to model and perform critical path analysis relevant for the modern superscalar out-of-order processors:

- Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels (2019): https://arxiv.org/abs/1910.00214

- Cross-Architecture Automatic Critical Path Detection For In-Core Performance Analysis (2020): https://hpc.fau.de/files/2020/02/Masterarbeit_JL-_final.pdf

There's also a broader line of research on performance modeling in this vein (some pretty detailed, including microarchitectural details like branch misprediction penalties, ROB capacity, etc.): https://gist.github.com/MattPD/85aad98ee8b135e675d49c571b67f...

More on modeling microarchitectural details (chronological order):

- Tejas S. Karkhanis and James E. Smith. "A First-Order Superscalar Processor Model." (ISCA 2004) - http://citeseerx.ist.psu.edu/viewdoc/download?doi=

- Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2 (2009) - http://www.elis.ugent.be/~leeckhou/papers/tocs09.pdf

- Maximilien B. Breughe, Stijn Eyerman, and Lieven Eeckhout. "Mechanistic analytical modeling of superscalar in-order processor performance." ACM Trans. Architec. Code Optim. 11, 4, Article 50 (2014) - https://users.elis.ugent.be/~leeckhou/papers/taco2015-breugh....

- "Modeling Superscalar Processor Memory-Level Parallelism", Sam Van den Steen and Lieven Eeckhout, IEEE Computer Architecture Letters (CAL), Vol 17, No 1 (2018) - https://users.elis.ugent.be/~leeckhou/papers/cal2018-MLP.pdf

I wonder if they tried shell sort? IME it's so close to bubble sort in implementation that you can view it as a trivial optimization.

And just a few shells could be added, to make this optimization right? If the sequence is 1, 4, 10, 23, .. one could start by just adding a pass with 4 and then 1.


I don't know how to select the fallback sort loop size (related to cache line size maybe?); in his github code he uses 16 elements as the threshold.

A very interesting suggestion indeed. Shell sort is an insertion sort though. It owes its better complexity from being able to breaking out of loop early. But these are difficult to predict branches that bubble sort avoids.

Certainly this is worth an experiment

Well, you are the expert :). I misremebered and thought of the base case for shell sort to be a kind of bubble sort.

No mention of Timsort though.

For this case, in tiny arrays, something like timsort would be a bad idea -- the basic idea of the article is that it takes so little work to sort tiny arrays that you want to run the simplest algorithm possible.

For tiny arrays, Timsort uses insertion sort.

I think the Fallback for sorting short arrays section makes it clear that it's not only about sorting small arrays, so I think it's reasonable to wonder why Timsort wasn't mentioned.

You are right. For people who like quicksort, the "timsort of quicksorts" is pattern-defeating quicksort (pdqsort), which is my usual goto algorithm nowadays.

So does this work out as a three-part hybrid algorithm? Timsort, 'then' Quicksort, then bubble sort/insertion sort for short sequences?

Since the larger case was using QuickSort, it seems.

I don't follow. Why put all this work into optimising Quicksort without even mentioning that a superior algorithm has since been discovered?

I'm not sure how you came to the conclusion that Timsort is "superior". As far as I was aware, Quicksort generally does better on random lists and on primitive datatypes? Timsort is often used for its strengths: it's stable, it does better on partially sorted data, etc. Keeping this in mind, this article specifically focuses on optimizing the "small array" case of Quicksort, because it forms an important part of the performance profile of the algorithm in the places where it is used.

Timsort is great in Python and Java; the objects sorted are pointer-sized (cheap swaps) and there is always indirection (expensive comparisons).

Timsort is not in place, and is essentially similar to mergesort. Most libraries use quicksort for sorting primatives, so it's the right thing to compare to.

Josh, Gerben!

Have you tried a sorting network, instead of the bubble sort?

No and I suspect that there is room for significant improvement here.

I got pretty dramatic results with just a 3-element branchless sorting network.

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact