
Hoare’s Rebuttal and Bubble Sort’s Comeback - haberman
https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html
======
MaxBarraclough
A quick ramble on the article's point about using a hybrid approach:

As the article says, hybrid algorithms can make good sense, where you use a
clever algorithm to break down large problems into many small problems, and
use another simpler algorithm to solve the small problems. As the article
says, in sorting, it makes sense to use something like quicksort to break down
large problems, and to then use bubble sort to solve the resulting small
problems. In matrix multiplication, it makes sense to use the mind-bending
Strassen algorithm (with superior time complexity compared to the naive
algorithm) to break down multiplications of large matrices, and to then use
the naive algorithm to solve the resulting small matrix multiplication
problems. [0]

The threshold for switching over to the clever solution increases year on
year, as hardware improves. Clever algorithms tend to behave worse in terms of
branch-prediction and cache.
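To make the divide-and-conquer hybrid concrete, a toy version might look like the following (my sketch, not code from any of the linked articles; the cutoff of 16 is an illustrative guess, and real libraries tune it per platform):

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative hybrid: quicksort for large ranges, insertion sort below
// a small-size cutoff. The cutoff value is a guess; real libraries tune it.
constexpr std::size_t kCutoff = 16;

inline void insertion_sort(int* lo, int* hi) {
  for (int* i = lo + 1; i < hi; ++i) {
    int v = *i;
    int* j = i;
    while (j > lo && j[-1] > v) { *j = j[-1]; --j; }
    *j = v;
  }
}

void hybrid_sort(int* lo, int* hi) {
  while (static_cast<std::size_t>(hi - lo) > kCutoff) {
    int pivot = lo[(hi - lo) / 2];
    // [lo, mid) gets everything strictly below the pivot
    int* mid = std::partition(lo, hi,
                              [pivot](int x) { return x < pivot; });
    // [mid, mid2) gets the pivot-equal run, which guarantees progress
    // even when every element compares equal
    int* mid2 = std::partition(mid, hi,
                               [pivot](int x) { return !(pivot < x); });
    hybrid_sort(lo, mid);  // recurse on the left half
    lo = mid2;             // tail-iterate on the right half
  }
  insertion_sort(lo, hi);  // simple algorithm mops up the small range
}
```

The interesting knob is kCutoff: the best value drifts upward as hardware evolves, so it's worth re-measuring rather than trusting folklore.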

Apparently a similar thing occurs in large integer multiplication. [1] I'm
sure there are plenty of other examples besides.

Related: Of all the known matrix multiplication algorithms, the ones with the
best time-complexity are _never_ used in practice, as their constant factors
spoil things. [2]

[0]
[https://en.wikipedia.org/wiki/Strassen_algorithm#Implementat...](https://en.wikipedia.org/wiki/Strassen_algorithm#Implementation_considerations)

[1]
[https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strasse...](https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm#Optimizations)

[2]
[https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...](https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm)

~~~
gerbenst
The "hybrid" as used in this article referred to a partition scheme that
combines a Lomuto-like partitioning algorithm with a Hoare-like partitioning
algorithm.

This amplifies the strength of each scheme and mitigates its weakness. I.e.
Lomuto does too many swaps, but naturally fits a branch-free implementation;
Hoare is frugal with swaps, but is rather branchy. The "hybrid" here inherits
the frugality of swaps from Hoare's scheme, while benefiting from the
branch-free implementation of the Lomuto scheme.
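The Lomuto side of that trade-off can be sketched roughly like this (an illustration of the branch-free idea, not the article's actual code):

```cpp
#include <cstddef>

// Sketch of a branch-free Lomuto-style partition. Instead of
// `if (v[i] < pivot) swap(...)`, every iteration does an unconditional
// swap and advances the store index by the 0/1 comparison result, so
// there is no data-dependent branch for the predictor to miss.
std::size_t lomuto_branchfree(int* v, std::size_t n, int pivot) {
  std::size_t store = 0;
  for (std::size_t i = 0; i < n; ++i) {
    int x = v[i];
    v[i] = v[store];       // unconditional swap: more stores than Hoare,
    v[store] = x;          // but no unpredictable branch
    store += (x < pivot);  // bool converts to 0 or 1
  }
  return store;            // elements [0, store) are < pivot
}
```

Every element pays for an unconditional swap (Lomuto's surplus of swaps), but no control flow depends on the data.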

What you seem to describe here is the classical and well-known hybridization
of divide and conquer algorithms. This is also used and implicitly implied
when branchless bubble sort is discussed. But the "hybrid" as used in the
article wasn't meant to refer to this.

~~~
MaxBarraclough
Thanks.

------
kazinator
In what application are you going to sort an array of a pure scalar type that
is native to the machine?

Usually you're sorting pointers to objects, and there is some indirection
required to compare them (perhaps an outright call through a function
pointer).

Here is one good application: ordering an array of pointers by increasing
address. That doesn't require indirection upon them, and is useful: e.g. if
these are pointers into a heap we wish to compact.

~~~
haberman
Towards the end of the article, Gerben makes the point that the comparison is
not on the critical path of his variant of QuickSort, so even if the
comparison involves indirection, that comparison can still proceed in parallel
with the main loop of the algorithm:

> Combining these two tricks could lead to merge sort on par with the simple
> partition loop of QuickSort. However it’s just a mitigation. If instead of
> sorting a simple primitive value we would sort pointers to a struct. The
> comparison operator would have an extra load which immediately adds to the
> critical path. QuickSort is immune to this, even rather costly comparisons
> do not influence the ability to make progress. Of course if the comparison
> itself suffers from frequent branch misses, then that will limit the ability
> to overlap different stages of the iteration in the CPU pipeline.

In the benchmark code for the blog, he includes a benchmark that has
indirection to illustrate this, though these results were not included in the
article itself. On my machine I get:

    
    
        $ CC=clang bazel build -c opt :all
        $ bazel-bin/bench_sort --benchmark_filter=BM_IndirectionSort
        -------------------------------------------------------------------------------------
        Benchmark                                              Time           CPU Iterations
        -------------------------------------------------------------------------------------
        BM_IndirectionSort<1, std_sort>                       64 ns         64 ns   10900000
        BM_IndirectionSort<1, std_stable_sort>                89 ns         89 ns    7900000
        BM_IndirectionSort<1, std_heap_sort>                 112 ns        112 ns    6300000
        BM_IndirectionSort<1, exp_gerbens::QuickSort>         26 ns         26 ns   27000000
        BM_IndirectionSort<1, HeapSort>                       96 ns         96 ns    7300000
        BM_IndirectionMergeSort<1>                            45 ns         45 ns   15500000

~~~
gerbenst
For more clarification: the numbers 0 and 1 in these benchmarks are the level
of indirection. IndirectionSort<0, xxx> sorts pointers by their address, while
IndirectionSort<1, xxx> sorts pointers by the value they point to. Notice how
MergeSort regresses much more than QuickSort: without an indirection MergeSort
is competitive, but with an indirection QuickSort becomes relatively much
better. As you point out, sorting pointers to structs is the much more
important use case. This property becomes even more important because L1 cache
is typically rather small, O(32 KB), while L2/L3 cache is large, O(1 MB). So
when sorting pointers it's not unreasonable to hit L2/L3 cache but miss L1.
With L2/L3 typically having 10-20 cycles of latency (vs. 5 cycles for L1),
QuickSort is largely insensitive to the L1 cache misses.

    
    
        BM_IndirectionSort<1, std_sort>                       93 ns         93 ns    7700000
        BM_IndirectionSort<1, std_stable_sort>               126 ns        125 ns    5600000
        BM_IndirectionSort<1, std_heap_sort>                 172 ns        172 ns    4200000
        BM_IndirectionSort<1, exp_gerbens::QuickSort>         34 ns         34 ns   19000000
        BM_IndirectionSort<1, HeapSort>                      157 ns        157 ns    4700000
        BM_IndirectionMergeSort<1>                            68 ns         68 ns   10400000
        BM_IndirectionSort<0, std_sort>                       74 ns         74 ns    9500000
        BM_IndirectionSort<0, std_stable_sort>                91 ns         91 ns    7800000
        BM_IndirectionSort<0, std_heap_sort>                 125 ns        125 ns    5800000
        BM_IndirectionSort<0, exp_gerbens::QuickSort>         26 ns         26 ns   27200000
        BM_IndirectionSort<0, HeapSort>                       77 ns         77 ns    9300000
        BM_IndirectionMergeSort<0>                            36 ns         36 ns   19400000

------
saagarjha
Andrei Alexandrescu's post, linked at the top, is a great read as well:
[https://dlang.org/blog/2020/05/14/lomutos-comeback/](https://dlang.org/blog/2020/05/14/lomutos-comeback/)

------
rubber_duck
Great article. This isn't my domain, so the results are unexpected, and the
fact that a domain expert working on the problem can miss stuff like this
reinforces something I see over and over when it comes to optimization:
forget your assumptions, intuition, and best practices. Everything is
context-sensitive (size of data, hardware specifics) and gets obsoleted by
hardware evolution. The only way to optimize is to measure to identify
bottlenecks after you have working code, and then test again that you
actually succeeded. Don't optimize in advance.

~~~
mchaynes
And measure. Always measure. Without data it’s all posturing

~~~
trhway
the opposition usually sounds like this: "Microbenchmarks lie". On one of my
past projects, the leading engineers and architects just loved that phrase
and voiced it like a deep, profound truth, understanding of which was a
hallmark of belonging to their circle of the chosen. Any specific
measurements showing issues with the supposedly brilliant architecture and
implementation choices of that project (with many of those choices following
the latest "best practices" fads published by some venerable technologists)
were dismissed with that phrase as if by magic. Naturally that was one of the
slowest, not to mention buggiest, projects in my experience, and working in
the enterprise space I've seen my share of slow and buggy.

~~~
a1369209993
Except that's not opposition? _Microbenchmarks_ do, always, lie; you need to
measure the performance of the actual, real code whose speed you actually
care about.

------
ufo
I'm surprised that they found a legit use-case for bubble sort. I always had
the impression that bubble sort had very little going for it because the naive
implementation is much slower than insertion sort, while also not being as
intuitive.
[https://users.cs.duke.edu/~ola/bubble/bubble.html](https://users.cs.duke.edu/~ola/bubble/bubble.html)

Out of curiosity, does anyone know how selection sort would compare to bubble
sort and insertion sort in this kind of problem?
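For what it's worth, the property that seems to matter here is that bubble sort's inner step is a fixed-trip-count compare-and-swap that can compile down to min/max (often conditional moves), whereas insertion sort's early exit and selection sort's running-minimum update are both data-dependent branches. A rough sketch of the branch-free form (my illustration, not the article's code):

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of a branch-free bubble sort: the inner compare-and-swap is a
// min/max pair the compiler can lower to conditional moves, and the loop
// bounds depend only on n, never on the data.
void bubble_sort_branchfree(int* v, std::size_t n) {
  for (std::size_t pass = n; pass > 1; --pass) {
    for (std::size_t i = 0; i + 1 < pass; ++i) {
      int lo = std::min(v[i], v[i + 1]);
      int hi = std::max(v[i], v[i + 1]);
      v[i] = lo;      // unconditional stores; no data-dependent branch
      v[i + 1] = hi;
    }
  }
}
```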

------
ncmncm
If you define

    
    
      inline bool swap_if(
        bool c, int& a, int& b)
      {
        int ta = a, tb = b;
        a = c ? tb : ta;
        b = c ? ta : tb;
        return c;
      }
    

and then use it in partition's inner loop like

    
    
       right += swap_if(
         *left < pivot,
         *left, *right);
    

..., compiled with Clang (which generates `cmov` instructions), a simplest-
possible quicksort is more than twice as fast as `std::sort`. Then you can
doctor up the endgame with heapsort, insertion sort, even bubble sort if you
like, and get an incremental improvement.
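Spelled out, the partition that inner loop lives in might look like this (a sketch; the last-element pivot choice is purely for illustration):

```cpp
#include <cstddef>

inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, tb = b;
  a = c ? tb : ta;  // Clang lowers these to cmov
  b = c ? ta : tb;
  return c;
}

// One way the swap_if snippet can sit inside a full Lomuto-style
// partition. The pivot choice and layout here are illustrative, not
// taken from any particular implementation.
int* partition_swap_if(int* first, int* last) {
  int pivot = last[-1];  // last element as pivot, for simplicity
  int* right = first;
  for (int* left = first; left != last - 1; ++left)
    right += swap_if(*left < pivot, *left, *right);
  swap_if(true, *right, last[-1]);  // put the pivot into place
  return right;  // *right == pivot; [first, right) holds elements < pivot
}
```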

But using AVX machinery to implement an optimal sorting network for small
arrays is faster, if you are sorting scalars. See
[https://arxiv.org/abs/1704.08579](https://arxiv.org/abs/1704.08579)

A fully general swap_if,

    
    
      template <typename T>
      bool swap_if(
        bool c, T& a, T& b)
      {
        T v[2] = { a, b };
        b = v[c], a = v[1-c];
        return c;
      }
    

could and should be peephole-optimized to use a pair of `cmov` instructions,
but is not in Clang, Gcc, Icc, or MSVC. But even without such an optimization,
it makes Quicksort much faster than std::sort. It is slower than the cmov
version probably because it costs more L1-bus traffic.

(Gcc, incidentally, is very, very sensitive to details of this version. Change
the order of assigning a and b, or use `[!c]` in place of `[1-c]`, and it gets
much slower, on Intel, for very non-(to me-)obvious reasons. Gcc also will
never, ever produce two `cmov` instructions in a basic block. I have filed a
bug.)

Another formulation is

    
    
      template <>
      bool swap_if(
        bool c, int& a, int& b)
      {
        int ta = a, tb = b;
        a =  -c&tb | c-1&ta;
        b = c-1&ta |  -c&tb;
        return c;
      }
    

which Clang also optimizes to a pair of `cmov` instructions. Gcc (like Icc and
MSVC) emits the literal and/and/or instructions, which is also faster than
std::sort, but less so than the array version.

If `swap_if` were in the Standard Library, it would probably be implemented
optimally on all compilers, and almost half of the Standard algorithms could
use it to get, often, ~2x performance improvement.

A partition using AVX machinery is potentially much faster, although not as
much faster as one might hope. See

[https://bits.houmus.org/2020-02-02/this-goes-to-eleven-pt5](https://bits.houmus.org/2020-02-02/this-goes-to-eleven-pt5)

That took the better part of a year to implement. AVX512 instructions enable even better
optimizations, but AVX512 might not come out in AMD until Zen-4. (We are at
Zen-2 now.)

~~~
haberman
Your "simplest-possible quicksort" appears to be the same as the first
branchless Lomuto partition given in the article.

The premise of the article is that, while branchless, this loop suffers from
store/load dependencies that significantly inhibit instruction-level
parallelism. Gerben's algorithm gets a 2x win over this approach by using a
temporary buffer that prevents loads from aliasing previous stores.

~~~
ncmncm
Thank you for the correction.

The best sort in the article appears to yield 3x vs. std::sort for an array
that fits in L2 cache. In my testing (with much larger arrays) I got 2.5x
without any small-array transition, and 3x with a branchless sorting-network
for ranges of 3 or less, and dramatically less code than that presented in the
article.

I suspect that the algorithms used in the article, implemented just a little
differently, could still yield substantially better performance, beating my
simpler code.

This is a consequence of our deal with the devil: caches and speculation make
our code faster, but we can no longer know whether it is objectively fast, or
how much faster it could be. What seemed fast becomes slow the moment somebody
does it faster.

------
chubot
Hm interesting, so is there any analysis / theory that will capture branch
mispredictions?

e.g. like big-O for something closer to real machines.

I feel like CPU cycles should already be analyzed separately from memory
accesses (although I guess they are usually proportional). But analyzing
branch mispredicts seems harder

How does this hold up on ARM and other architectures? (RISC V, etc.)

~~~
lumberjackstian
I'm not sure about much analysis and theory that explicitly captures branch
prediction, but in last year's cppcon keynote Andrei offered a possible metric
[0] where he takes into account the average "distance" between two subsequent
array accesses.

It's basically (C(n) + M(n) + kD(n)) / n,

where C is comparisons, M is moves, k is some constant, and D is "distance".

[0]:
[https://youtu.be/FJJTYQYB1JQ?t=2943](https://youtu.be/FJJTYQYB1JQ?t=2943)
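To make the metric concrete, a toy instrumentation might look like this (the exact accounting of moves and distance is my assumption about what the talk intends, not its definition):

```cpp
#include <cmath>
#include <cstddef>

// Illustrative instrumentation for the blended metric
// (C(n) + M(n) + k*D(n)) / n. What counts as a "move" and how distance
// is accumulated are my assumptions, not the talk's exact definitions.
struct SortCost {
  double comparisons = 0;      // C(n)
  double moves = 0;            // M(n)
  double distance = 0;         // D(n): sum of gaps between array touches
  std::size_t last_index = 0;  // previous array index accessed

  void touch(std::size_t i) {
    distance += std::fabs(static_cast<double>(i) -
                          static_cast<double>(last_index));
    last_index = i;
  }
  double blended(double k, std::size_t n) const {
    return (comparisons + moves + k * distance) / static_cast<double>(n);
  }
};
```

The idea is that D(n) penalizes access patterns that jump around memory, which counts of C(n) and M(n) alone miss.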

------
alecco
Please take into account that these discussions are about scalar sorting
methods. For simple types (i.e. ones that fit in a CPU register), vectorized
sorts are much better and don't suffer a branch misprediction per comparison
(they use result masks to do the exchange/swap).

------
frankmcsherry
Thanks for the interesting post! I liked the MergeDouble approach which I
hadn’t seen before. Couldn’t that be extended to finding the median of the
results using binary search and then doing four concurrent merges, for even
more ILP?
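For reference, my reading of the both-ends merge (a sketch under my own assumptions, not the article's code):

```cpp
#include <cstddef>

// Rough sketch of the "merge from both ends" idea: one chain merges
// forward from the fronts of two sorted inputs, another merges backward
// from the backs. The two chains have independent data dependencies, so
// the CPU can overlap them.
void merge_double(const int* a, std::size_t na,
                  const int* b, std::size_t nb, int* out) {
  std::size_t lo_a = 0, lo_b = 0;        // forward cursors into a, b
  std::size_t hi_a = na, hi_b = nb;      // backward cursors into a, b
  std::size_t lo_o = 0, hi_o = na + nb;  // output cursors
  while (lo_o < hi_o) {
    // forward step: emit the smallest remaining element at the front
    bool fa = (lo_b == hi_b) || (lo_a < hi_a && a[lo_a] <= b[lo_b]);
    out[lo_o++] = fa ? a[lo_a++] : b[lo_b++];
    if (lo_o == hi_o) break;
    // backward step: emit the largest remaining element at the back
    bool ba = (hi_b == lo_b) || (hi_a > lo_a && a[hi_a - 1] >= b[hi_b - 1]);
    out[--hi_o] = ba ? a[--hi_a] : b[--hi_b];
  }
}
```

The forward chain always emits the smallest remaining element and the backward chain the largest, so the two chains never interact until they meet in the middle.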

~~~
gerbenst
Yes, that is very much possible. In general you won't get 4 equal-length
ranges, though. For instance, in an already-sorted array the median results
in an (n/2, 0) and a (0, n/2) division. But in the average case you should
get two sets of merge intervals, both of size (n/4, n/4). In reality I think
you will quickly run into other limits: you need quite a few registers, and
they shouldn't spill to the stack.

~~~
frankmcsherry
I think the median of the results works great in any case (but you are the
expert here). If we find the result median in each of the two inputs, say at
'a' and 'b', we have four merges (two forward, two reverse):

    
    
        [0, ..) w (0, ..)
        (.., a) w (.., b)
        [a, ..) w [b, ..)
        (.., n) w (.., n)
    

each of which should produce n/4 elements and then meet.

If the array is already sorted then great; that split will just "merge":

    
    
        [0, ..) w nothing
        (.., n/2) w nothing
        nothing w [n/2, ..)
        nothing w (.., n)
    

which should go very fast. Anyhow, possible I'm missing something!

~~~
gerbenst
No, you are not missing anything. This is, I think, correct. I haven't tried
it, but it seems possible and potentially very effective.

------
zucker42
What program was used to create the dependency diagrams? And is there some
more general introduction for using this sort of dependency analysis for low
level optimizations?

~~~
gerbenst
That was just hand analysis of the data flow together with Google Diagrams.

As I mention in the article, dependency analysis has, IMO, not received the
widespread attention it should. I like Fabian Giesen's article if you want an
introduction.

[https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/](https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/)

~~~
matt_d
FWIW, there are some interesting tools for microarchitectural performance
analysis which are getting closer (although it's a work in progress):
[https://github.com/MattPD/cpplinks/blob/master/performance.t...](https://github.com/MattPD/cpplinks/blob/master/performance.tools.md#microarchitecture)

One particularly interesting example in this context is OSACA (Open Source
Architecture Code Analyzer), [https://github.com/RRZE-HPC/osaca](https://github.com/RRZE-HPC/osaca)

The related publications explain the approach used to model and perform
critical path analysis relevant for the modern superscalar out-of-order
processors:

\- Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly
Kernels (2019):
[https://arxiv.org/abs/1910.00214](https://arxiv.org/abs/1910.00214)

\- Cross-Architecture Automatic Critical Path Detection For In-Core
Performance Analysis (2020):
[https://hpc.fau.de/files/2020/02/Masterarbeit_JL-_final.pdf](https://hpc.fau.de/files/2020/02/Masterarbeit_JL-_final.pdf)

There's also a broader line of research on performance modeling in this vein
(some pretty detailed, including microarchitectural details like branch
misprediction penalties, ROB capacity, etc.):
[https://gist.github.com/MattPD/85aad98ee8b135e675d49c571b67f...](https://gist.github.com/MattPD/85aad98ee8b135e675d49c571b67fde9#file-performance-modeling-criticality-md)

More on modeling microarchitectural details (chronological order):

\- Tejas S. Karkhanis and James E. Smith. "A First-Order Superscalar Processor
Model." (ISCA 2004) -
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79...).

\- Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. "A
mechanistic performance model for superscalar out-of-order processors." ACM
Trans. Comput. Syst. 27, 2 (2009) -
[http://www.elis.ugent.be/~leeckhou/papers/tocs09.pdf](http://www.elis.ugent.be/~leeckhou/papers/tocs09.pdf)

\- Maximilien B. Breughe, Stijn Eyerman, and Lieven Eeckhout. "Mechanistic
analytical modeling of superscalar in-order processor performance." ACM Trans.
Architec. Code Optim. 11, 4, Article 50 (2014) -
[https://users.elis.ugent.be/~leeckhou/papers/taco2015-breugh...](https://users.elis.ugent.be/~leeckhou/papers/taco2015-breugh..).

\- "Modeling Superscalar Processor Memory-Level Parallelism", Sam Van den
Steen and Lieven Eeckhout, IEEE Computer Architecture Letters (CAL), Vol 17,
No 1 (2018) -
[https://users.elis.ugent.be/~leeckhou/papers/cal2018-MLP.pdf](https://users.elis.ugent.be/~leeckhou/papers/cal2018-MLP.pdf)

------
megameter
I wonder if they tried shell sort? IME it's so close to bubble sort in
implementation that you can view it as a trivial optimization.

~~~
kzrdude
And just a few gap sizes could be added to get this optimization, right? If
the sequence is 1, 4, 10, 23, ..., one could start by just adding a pass with
gap 4 and then gap 1.

[https://stackoverflow.com/questions/2539545/fastest-gap-sequence-for-shell-sort](https://stackoverflow.com/questions/2539545/fastest-gap-sequence-for-shell-sort)

I don't know how to select the fallback sort's loop size (related to cache
line size, maybe?); in his GitHub code he uses 16 elements as the threshold.
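Concretely, something like this, maybe (gap values 4 and then 1, per the above; purely a sketch):

```cpp
#include <cstddef>

// Sketch of the suggestion above: a shell-sort-style pass with gap 4
// followed by gap 1 (plain insertion sort). The gap values are
// illustrative; this is not code from the article.
void shell_sort_4_1(int* v, std::size_t n) {
  static const std::size_t gaps[] = {4, 1};
  for (std::size_t g : gaps) {
    for (std::size_t i = g; i < n; ++i) {
      int x = v[i];
      std::size_t j = i;
      // gapped insertion: the early exit here is the hard-to-predict
      // branch that bubble sort avoids
      while (j >= g && v[j - g] > x) {
        v[j] = v[j - g];
        j -= g;
      }
      v[j] = x;
    }
  }
}
```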

~~~
gerbenst
A very interesting suggestion indeed. Shell sort is an insertion sort at
heart, though. It owes its better complexity to being able to break out of
the loop early. But those are difficult-to-predict branches that bubble sort
avoids.

Certainly this is worth an experiment.

~~~
kzrdude
Well, you are the expert :). I misremembered and thought the base case for
shell sort was a kind of bubble sort.

------
zitterbewegung
No mention of Timsort though.

~~~
CJefferson
For this case, with tiny arrays, something like Timsort would be a bad idea:
the basic idea of the article is that it takes so little work to sort tiny
arrays that you want to run the simplest algorithm possible.

~~~
MaxBarraclough
I think the _Fallback for sorting short arrays_ section makes it clear that
it's not only about sorting small arrays, so I think it's reasonable to wonder
why Timsort wasn't mentioned.

~~~
CJefferson
You are right. For people who like quicksort, the "timsort of quicksorts" is
pattern-defeating quicksort (pdqsort), which is my usual goto algorithm
nowadays.

~~~
MaxBarraclough
So does this work out as a three-part hybrid algorithm? Timsort, 'then'
Quicksort, then bubble sort/insertion sort for short sequences?

------
oxxoxoxooo
Josh, Gerben!

Have you tried a sorting network, instead of the bubble sort?
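(For concreteness: a sorting network is a fixed sequence of compare-exchange steps with no data-dependent control flow. For four elements, the optimal five-comparator network looks like this generic sketch:)

```cpp
#include <algorithm>

// A compare-exchange: after the call, a holds the smaller value and b the
// larger. The min/max pair can compile to conditional moves, or to SIMD
// min/max when sorting many lanes in parallel.
inline void cmp_exchange(int& a, int& b) {
  int lo = std::min(a, b);
  int hi = std::max(a, b);
  a = lo;
  b = hi;
}

// Optimal 5-comparator sorting network for n = 4: the same fixed sequence
// of steps sorts every input, with no data-dependent branches.
void sort4_network(int& a, int& b, int& c, int& d) {
  cmp_exchange(a, b);
  cmp_exchange(c, d);
  cmp_exchange(a, c);
  cmp_exchange(b, d);
  cmp_exchange(b, c);
}
```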

~~~
gerbenst
No, and I suspect that there is room for significant improvement here.

