
Lomuto's Comeback - VHRanger
https://dlang.org/blog/2020/05/14/lomutos-comeback/
======
ipsofacto
There's an even better branch-free (super scalar) sorting algorithm: "In-place
Parallel Super Scalar Samplesort (IPS4o)" which we started using:

[https://github.com/SaschaWitt/ips4o](https://github.com/SaschaWitt/ips4o)

[https://arxiv.org/abs/1705.02257](https://arxiv.org/abs/1705.02257)

As an example, to sort 10 million random longs on my computer it takes
std::sort 766 ms (roughly in line with Andrei's numbers) and ips4o::sort takes
274 ms.
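
For anyone who wants to reproduce the comparison, here's a minimal harness sketch (the `ips4o::sort` call is commented out so the snippet builds standalone; the header name `ips4o.hpp` is taken from the linked repo, and it's a drop-in replacement with the same iterator interface as `std::sort`):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
// #include "ips4o.hpp"  // from the linked repo

// Sort n random longs and report the elapsed time in milliseconds.
std::vector<long> sort_random_longs(std::size_t n) {
    std::mt19937_64 rng(42);
    std::vector<long> v(n);
    for (auto& x : v) x = static_cast<long>(rng());

    auto t0 = std::chrono::steady_clock::now();
    std::sort(v.begin(), v.end());
    // ips4o::sort(v.begin(), v.end());  // drop-in replacement, same iterators
    auto t1 = std::chrono::steady_clock::now();

    std::printf("%zu longs: %.1f ms\n", n,
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return v;
}
```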

[edit:formatting]

~~~
goldenkey
You might want to mention if you are affiliated with the algorithm you are
promoting. ipsofacto ~ ips4o

~~~
ComputerGuru
Or maybe the throwaway was named after the comment it was intended to be used
to post.

~~~
ipsofacto
No affiliation, just picked the name because it fit... The company I work at
was exploring various sort algorithms to use and this turned out to be the
fastest.

Another benefit of this algorithm is that it parallelizes extremely well.

------
nightcracker
Hoare partitioning can also be implemented in a branchless manner, my
algorithm pdqsort does this:
[https://github.com/orlp/pdqsort](https://github.com/orlp/pdqsort)

~~~
andralex
Thanks! I tried that too, but it's slower than Lomuto. Forgot to mention in
the article.

~~~
nightcracker
> I tried that too, but it's slower than Lomuto.

That is very strange. I cannot reproduce your results in C++, using your
code.

On my machine (Threadripper 2950x, 64 bit Windows 10, GCC 10.1.0 from MSYS2
MinGW64) my algorithm performs best and your branchless version ends up slower
than your branchy one:

    
    
        $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && a 10000000
        min_milliseconds=828.1250
        median_milliseconds=890.6250
    
        $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && a 10000000
        min_milliseconds=671.8750
        median_milliseconds=671.8750
    
        $ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && a 10000000
        min_milliseconds=343.7500
        median_milliseconds=343.7500
    

On an Intel university machine (Xeon E5-2667 v2 @ 3.3GHz, 64 bit Ubuntu 16.04
LTS, GCC 5.4.0) I get:

    
    
        $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && ./a.out 10000000
        min_milliseconds=514.8051
        median_milliseconds=536.1984
    
        $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && ./a.out 10000000
        min_milliseconds=688.2471
        median_milliseconds=698.9603
    
        $ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && ./a.out 10000000
        min_milliseconds=340.1935
        median_milliseconds=344.7432
    

Maybe AMD or Windows doesn't like your branchless code or perhaps GCC is
generating inferior code for AMD/Windows, as at least on Intel/Linux your
Lomuto branchless code can beat the branchy code. But pdqsort (which uses
branchless Hoare partitioning) consistently beats both.

I didn't change the benchmark, I only added pdqsort to yours, my full changes
are the simple inclusion of 1 header and 3 lines of code:
[https://github.com/orlp/lomuto/commit/29381176f49e3588f6882c...](https://github.com/orlp/lomuto/commit/29381176f49e3588f6882ca992cbb04f0c19c5f5)

In order for any interested readers to reproduce, simply clone
[https://github.com/orlp/lomuto](https://github.com/orlp/lomuto) and run the
above commands.

------
mehrdadn
Notice that this is on _random longs_. Of _course_ branch misprediction and
memory bandwidth are going to crush you for a branchy sort... you'll be wrong
about half the time, and the comparisons and swaps are trivial! Real world
data isn't random, so I'd expect branch predictors to do much better with
recognizing patterns. And your sorting isn't going to be as simple as sorting
integers according to their values all that often. It's a neat exercise, but
absent more realistic benchmarking to the contrary, I wouldn't assume I'll get
a faster program by just substituting even a typical quicksort with this...

~~~
sesuximo
Sorting is the case study but not the insight. I think the cool/surprising
part of the article is that doing more work can be faster... Even more memory
work (which per conventional wisdom is not trivial!)

~~~
mehrdadn
My comment wasn't inherently about sorting either. Branch-free in general is
faster in the _worst_ cases, i.e. if the branch is difficult to predict.
Except most real-world branches are predictable... that's why we have branch
predictors. So in general you should expect branch-free to be slower, unless
you have reason to assume your branches are actually close to random.

~~~
derefr
> unless you have reason to assume your branches are actually close to random.

I mean, to the degree that you need to sort the data at all, it contains
informational entropy.

Many "presents as sorted" data structures trade space for time by keeping
track of the internal sortedness of chunks of the data between sorting
passes. Heck, any insert-optimized data structure (B+ trees; LevelDB's sorted-
string-tables; N-ary tries generally) will get most of its performance gains
by being able to guarantee, at node split time, that the children of any
particular node are sorted relative to one-another.

So, in many cases, by the very fact that you don't already have metadata
attached to your data asserting that it's sorted, it's highly likely to have
no branch-predictability.

(Couple that to the fact that load-balancing in OLTP distributed systems
favors writes to evenly-keyspace-distributed partitioning keys, ala Dynamo,
and you'll frequently find that your inputs to an index-like data structure
can be _guaranteed random_.)

~~~
mehrdadn
I'm struggling to follow your comment. I said real-world data have patterns
and aren't completely random, so you can in general expect branch predictors
to help (unless, obviously, your data is actually known to be random like in
the example). You rebutted with "because load balancers evenly spread their
key space, you will frequently find that your inputs can be guaranteed
random", as if that somehow contradicts the point I was making...? It's also
clearly not true that it's "highly likely" your data has "no branch-
predictability" just because "you don't already have metadata attached to it
asserting it's sorted"... right? Not only is that obviously not the case
on its own, but why would you even sort data that's already guaranteed
to be sorted? I can't make sense of this, so I feel like we must be speaking
past each other somehow? But I'm confused how.

~~~
derefr
The insight I might not have made clear is that if you have “this node and
its children are already sorted” metadata, then it’s cheaper to check whether
a given (fine-grained) input chunk is already sorted—and if so, to directly
construct a node from it—than it is to feed it blindly to the sorting
algorithm.

If such a pre-check is in play, then the sorting algorithm itself will
basically never have the sort of “runs” of sorted values that make some
algorithms cheaper. Those “runs” were already plucked out and turned into
nodes!

------
kragen
I used Lomuto's algorithm in my presentation of Quicksort in
[http://canonical.org/~kragen/sw/dev3/paperalgo#addtoc_21](http://canonical.org/~kragen/sw/dev3/paperalgo#addtoc_21)
because my emphasis there is on using the simplest possible algorithms. It
didn't occur to me that it might actually be faster, or be capable of being
made faster, and indeed in that page I said you could improve efficiency with
Hoare partitioning.

Alexandrescu's branch-free code in this article is a thing of beauty, in the
same sort of rugged way that Stepanov's code is.

------
zengid
I love this sort of "Historical Fiction" where interactions and conversations
between The Great People of Science are imagined and portrayed. A great
example of this is Louisa Gilder's wonderful book
_The_Age_Of_Entanglement:_When_Quantum_Physics_was_Reborn

[https://www.amazon.com/Age-Entanglement-Quantum-Physics-Reborn/dp/1400095263](https://www.amazon.com/Age-Entanglement-Quantum-Physics-Reborn/dp/1400095263)

~~~
mwcampbell
Oh, I assumed that dialogue between Lomuto and Hoare was real -- at least the
gist of it, if not the exact words.

~~~
zengid
The interaction was probably real, since there is a date and location, but the
dialog (and certainly the narration) seems to be taking some artistic
liberties perhaps? This is also what Louisa did in her book on the history of
quantum mechanics; real interactions with imagined dialogs. My apologies if I
gave the wrong impression.

------
kleiba
To anyone interested in a deeper treatment of quicksort and variants thereof,
I can recommend Sebastian Wild's PhD thesis on that topic:
[https://kluedo.ub.uni-kl.de/frontdoor/deliver/index/docId/4468/file/wild-dissertation.pdf](https://kluedo.ub.uni-kl.de/frontdoor/deliver/index/docId/4468/file/wild-dissertation.pdf)

~~~
bachmeier
Off topic: Is it common practice for a dissertation at a German university to
be written in English?

~~~
DaiPlusPlus
Yes - to ensure the widest possible audience. English is the lingua franca of
academia and has been for almost a century now.

~~~
kleiba
At least in MINT (STEM) fields, probably less so in the humanities. Correct me if I'm
wrong.

~~~
atq2119
You're right. German economics is famously insular, for example. Part of that
is cultural, part of it is that they don't write as much in English.

------
codeflo
I’ve read that some vector instruction sets use masking instructions to avoid
branching. You execute both arms of the conditional and discard the unwanted
computation, which can be cheaper than an X% chance of a branch misprediction.

Should CPUs include more masking instructions for regular, non-vectorized code
as well?

~~~
tom_mellior
32-bit ARM assembly language provided conditional execution for most
instructions. This was dropped in the 64-bit instruction set.
([https://en.wikipedia.org/wiki/Predication_(computer_architec...](https://en.wikipedia.org/wiki/Predication_\(computer_architecture\)),
[https://en.wikipedia.org/wiki/ARM_architecture#64-bit](https://en.wikipedia.org/wiki/ARM_architecture#64-bit))
I imagine the architects had a clear picture of the advantages and
disadvantages, and made a very well-informed decision. I guess that part of
the reason might have been that it's not clear when compilers should emit
predicated code instead of branches.

Also, you can get something very similar even for scalar code as long as your
machine has a conditional move operation, which they all have:

    
    
        /* true path */
        a = ...
        b = ...
        true_result = ...
        /* false path */
        x = ...
        y = ...
        false_result = ...
        /* final result */
        result = (condition ? true_result : false_result)
    

This works if the two "branches" have no side effects (memory writes, function
calls) and cannot raise exceptions. Again, it's very difficult to estimate for
compilers (and humans) if it will pay off to run both computations rather than
branch.

The trade-offs for vectorization are different from scalar code since
vectorizing a loop by a factor N gives a huge win, and even if you waste some
time doing some redundant computations, you still have a reasonable chance of
being faster than scalar.

~~~
ncmncm
You can get close to the performance of a cmov instruction by generating a
pair of all-1s and all-0s values: a=c, b=c-1, and a result z=a&x|b&y.

Gcc will not under any circumstances produce two cmov instructions in a basic
block, so that is your only alternative without dropping to asm. Clang is
happy to produce two adjacent cmov instructions. Usually your ALUs are not
otherwise so engaged as to make the number of operations involved costly.

~~~
tom_mellior
I don't get this. If c can be 0 or 1, then a can be 0 or 1, so it can be
all-0s but not all-1s. b can be -1 or 0, so that part is fine. Can you show
actual C or C++ code for a function like this:

    
    
        int select(int x, int y, bool condition) {
            ...
        }

~~~
ncmncm
Sorry, brain freeze. See

[https://news.ycombinator.com/item?id=23237663](https://news.ycombinator.com/item?id=23237663)

The bitwise version is, correctly, -c&x|(c-1)&y.

~~~
tom_mellior
Oh, yes, I should have realized that -c is enough to transform from 0/1 to
0/-1. Thanks!
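
For concreteness, the corrected formula as compilable code (a sketch, filling in the `select` signature from the question above):

```cpp
#include <cassert>

// Branchless select: returns x when condition is true, y otherwise.
// -c is all-1s when c == 1 and all-0s when c == 0; (c-1) is the
// complementary mask, so exactly one of the two AND terms survives.
int select(int x, int y, bool condition) {
    int c = condition;  // 0 or 1
    return (-c & x) | ((c - 1) & y);
}
```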

------
ncmncm
If you define

    
    
      inline bool swap_if(
        bool c, long& a, long& b)
      {
        long ta = a, tb = b;
        a = c ? tb : ta;
        b = c ? ta : tb;
        return c;
      }
    

and then use it in a partition like

    
    
       right += swap_if(
         *left < pivot,
         *left, *right);
    

compiled with Clang (which generates `cmov` instructions), the Hoare partition
is still faster, and more than twice as fast as `std::sort`.

Using `swap_if` in Lomuto is also faster, but not by as much. I interpret
the difference (vs. array ops) as resulting from reduced L1 bus traffic.

A fully general swap_if,

    
    
      template <typename T>
      bool swap_if(
        bool c, T& a, T& b)
      {
        T v[2] = { a, b };
        b = v[1-c], a = v[c];
        return c;
      }
    

could and IMHO should be peephole-optimized to use a pair of `cmov`
instructions, but is not in Clang, Gcc, Icc, or MSVC. But even without such an
optimization, it makes Quicksort much faster.

(Gcc, incidentally, is very, very sensitive to the details of the second
swap_if. Change the order of assigning a and b, or use `!c` in place of
`1-c`, and it gets much slower on Intel, for reasons that are not at all
obvious (to me). Gcc also will never, ever produce two `cmov` instructions
in a basic block. I have filed a bug.)

If `swap_if` were in the Standard Library, it would probably be implemented
optimally on all compilers, and almost half of the Standard algorithms could
use it to get, often, ~2x performance.
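
As a sketch of how `swap_if` slots into a partition (the loop shape here is illustrative, not necessarily the exact code benchmarked above):

```cpp
#include <algorithm>
#include <cassert>

inline bool swap_if(bool c, long& a, long& b) {
    long ta = a, tb = b;
    a = c ? tb : ta;
    b = c ? ta : tb;
    return c;
}

// Lomuto-style partition using swap_if; the pivot is the last element.
// Returns a pointer to the pivot's final position.
long* partition_with_swap_if(long* first, long* last) {
    long pivot = *(last - 1);
    long* store = first;
    for (long* p = first; p != last - 1; ++p)
        store += swap_if(*p < pivot, *store, *p);  // advance only on swap
    std::swap(*store, *(last - 1));                // put pivot in place
    return store;
}
```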

------
cozzyd
Writing branch-free code (which I find myself sometimes doing to take
advantage of simd vectorization) is always really painful. I wonder if someone
has made an attempt at better tooling for this.

~~~
repsilat
One thing people have made is array languages. The "primitive" operations
apply to arrays of data, so branches around individual elements can't be
expressed.

I think it's also generally less expressive, though. Writing Lomuto's
partition as efficiently would likely be impossible, but writing it branch-
free might not be -- you would compose branch-free primitives like

    
    
        left = filter(array, array < array[0])
        right = filter(array, array >= array[0])
    

where `filter` is branch-free and takes an array of Boolean values as its
second argument.
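
One way such a branch-free `filter` primitive can be implemented even on a scalar machine (a sketch, names mine) is the classic unconditional-store compaction:

```cpp
#include <cassert>
#include <cstddef>

// Branch-free filter: copies in[i] to out whenever keep[i] is true.
// The store is unconditional; the output index advances by 0 or 1,
// so the loop body contains no data-dependent branch.
std::size_t filter(const long* in, const bool* keep, std::size_t n,
                   long* out) {
    std::size_t j = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[j] = in[i];  // always store (overwritten if not kept)
        j += keep[i];    // advance only for kept elements
    }
    return j;
}
```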

Shrugs, aside from some standard set of functions/primitives and idioms I'm
not sure it would help much, though perhaps a mindset-shift is a more likely
solution than something more technical.

~~~
VHRanger
This sort of vectorised pseudo-branching is often used in SIMD algorithms.

~~~
cozzyd
Yes... and the transformations are often non-trivial (especially when
considering the possibility of out-of-bounds elements, which can't just safely
be multiplied by zero--they might be a NaN, or on the edge of a page). It
would be nice if there was a tool (or compiler option!) that could convert my
branchy code into an "identical" (assuming floating-point associativity, etc.)
branchless version.

~~~
snovv_crash
The Eigen Array [1] object is fairly close to this, combining 'select' with
comparisons, chained operators, etc. Eigen vectorizes this stuff pretty
effectively too.

1\. [https://eigen.tuxfamily.org/dox/group__TutorialArrayClass.ht...](https://eigen.tuxfamily.org/dox/group__TutorialArrayClass.html)

~~~
cozzyd
Interesting... I've used Eigen for matrix operations before, I did not know
they had such a smart general-purpose array implementation!

------
gautamcgoel
Reminds me of this great article on writing a fast mergesort:

[http://pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort/](http://pvk.ca/Blog/2012/08/13/engineering-a-list-merge-sort/)

------
eloff
If you have a branch free algorithm, you can vectorize it. If he discussed
implementing this with SIMD instructions (whether by human hand or compiler
auto vectorization) I missed that part. But that seems an interesting angle to
me.

~~~
kccqzy
Not necessarily. A branch-free algorithm can totally have a data dependency
from the previous iteration of the loop to the next, making it impossible to
vectorize.

~~~
eloff
That's true, although there are sometimes workarounds: modify the algorithm
to iterate over small groups, so that there is still a data dependency
between groups but not within each group, allowing one to use SIMD. E.g.
instead of delta compression over an array of integers where you subtract
each integer from the prior one, you can subtract each group from the prior
group, with only a small loss of compression ratio.
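
The group-wise delta idea can be sketched like this (a minimal illustration; the names and group size are mine):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t G = 4;  // group size, e.g. the lanes of one SIMD register

// Subtract each value from the value G positions earlier instead of the
// immediately prior one: the G "lanes" are independent of each other, so a
// whole group can in principle be encoded or decoded with one vector op.
void delta_encode_groups(std::vector<int>& v) {
    for (std::size_t i = v.size(); i-- > G; )  // backwards: keep originals
        v[i] -= v[i - G];
}

void delta_decode_groups(std::vector<int>& v) {
    for (std::size_t i = G; i < v.size(); ++i)
        v[i] += v[i - G];
}
```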

The question is, is there such a dependency in this algorithm? I didn't check.

~~~
klyrs
Yes, sorting typically (and this one specifically) has long dependency chains.
The natural reflex is to split the chains across multiple workers... and
that's how you get merge-sort. But even the last pass has a linear dependence
chain. Not long ago, somebody posted an algorithm based on 4-way swaps,
which is a clever way of addressing the issue.

Edit: I can't seem to locate that 4-way swapping algorithm, I swear it was
posted to HN in the last year... somebody halp!

~~~
kccqzy
You might be thinking of this:
[https://news.ycombinator.com/item?id=22322967](https://news.ycombinator.com/item?id=22322967)

~~~
klyrs
Bingo, thanks!

------
Arcuru
Reminds me of this paper on binary search, which also shows a speed up using
branch-free comparisons.

"Array Layouts for Comparison-Based Searching"
[https://arxiv.org/pdf/1509.05053.pdf](https://arxiv.org/pdf/1509.05053.pdf)

------
eqvinox
So what's the impact of the data dependency chain on "first -= smaller" to the
next iteration?

(I guess at the very least it prevents vectorization, but other than that ...
shouldn't be too much, but probably still the new limiting factor?)

------
a1369209993
[https://dlang.org/blog/2020/05/14/lomutos-comeback/](https://dlang.org/blog/2020/05/14/lomutos-comeback/)
gives a 404 error. And archive.org
([http://web.archive.org/cdx/search/cdx?url=https://dlang.org/...](http://web.archive.org/cdx/search/cdx?url=https://dlang.org/blog/2020/05/14/lomutos-comeback/))
is also getting nothing but 404s, so it's not a problem on my
end. I'm not really sure what's wrong with this site, since other commenters
seem not to have noticed a problem.

~~~
zombinedev
It's ok now, as far as I can see.

~~~
a1369209993
Actually, it's still giving a 404, but (after remembering and looking up the
--content-on-error option to wget) it appears TFA is served just fine, as if
it were a 404 error page.

------
ncmncm
It is true that more swaps may be cheaper than fewer swaps.

The operations to count, nowadays, are not swaps or comparisons, which are
very cheap, but instead pipeline stalls, which are very, very expensive.

It is easy to better-than-double the speed of vanilla quicksort with a three-
line change in the partition step.

~~~
ncmncm
Details

[https://news.ycombinator.com/item?id=23237663](https://news.ycombinator.com/item?id=23237663)

------
Ericson2314
I'm not sure I've ever sorted an array or list. Rather, I put things into
sets or maps or built an index---I always wanted to access things later, or
have an on-line data structure (which a sorted flat thing is not).

This whole field is a waste of time, and anachronism from when the master copy
was paper/pre-computer and computers were just doing the analytics.

~~~
ncmncm
Putting things in a map or set is almost invariably slower, often much slower,
than a smart sort. If performance doesn't matter, then go ahead. Most often
sort performance doesn't matter, or anyway most of the time is spent
elsewhere. People do use Python or Bash in production, without shame.

But some of us work on problems where performance of the sort does matter.

~~~
Ericson2314
Sure, it is slower, but my point is not that I sort by doing `Map.toList .
Map.fromList`, but that I _don't immediately turn the map back into a list_.
The sorting literature is optimizing a thing I don't need to do.

~~~
ComputerGuru
Just because you’re not using it directly doesn’t mean that you’re not
benefiting from those optimizations in the layers you are depending on.

~~~
Ericson2314
I'm not using it directly or indirectly. The whole point of this sorting
literature is a space-time tradeoff that's completely different from when you
want an ordered map/index as the end result.

