
Pattern-defeating quicksort - pettou
https://github.com/orlp/pdqsort
======
atilimcetin
Rust's stable sort is based on timsort ([https://doc.rust-
lang.org/std/vec/struct.Vec.html#method.sor...](https://doc.rust-
lang.org/std/vec/struct.Vec.html#method.sort)) and unstable sort is based on
pattern-defeating quicksort ([https://doc.rust-
lang.org/std/vec/struct.Vec.html#method.sor...](https://doc.rust-
lang.org/std/vec/struct.Vec.html#method.sort_unstable)). The documentation
says that 'It [unstable sorting] is generally faster than stable sorting,
except in a few special cases, e.g. when the slice consists of several
concatenated sorted sequences.'
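For reference, here's a minimal sketch of the two entry points (the exact speed difference depends on the data and on how expensive the comparison is):

    fn main() {
        let mut a = vec![5, 3, 9, 3, 1];
        let mut b = a.clone();

        // Stable sort (timsort-based): equal elements keep their relative
        // order, but it allocates a temporary buffer.
        a.sort();

        // Unstable sort (pdqsort-based): in place and generally faster,
        // but equal elements may be reordered.
        b.sort_unstable();

        assert_eq!(a, b);
        assert_eq!(a, [1, 3, 3, 5, 9]);
    }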

~~~
gnarbarian
I like merge sort. Average time may be worse, but its upper bound is better
and it is conceptually cleaner and easier to understand (IMO).

~~~
partycoder
Just that it takes extra space and that's sometimes a constraint.

~~~
chii
[https://stackoverflow.com/questions/2571049/how-to-sort-
in-p...](https://stackoverflow.com/questions/2571049/how-to-sort-in-place-
using-the-merge-sort-algorithm)

In-place merge sort exists. It's hard to write, though.

~~~
partycoder
Which decreases the space complexity but increases the time complexity.

~~~
stjepang
No, the time complexity is the same: O(n log n). The author of the top answer
links to his book, where you can find a proof of time complexity:

[https://sites.google.com/site/algoxy/home/elementary-
algorit...](https://sites.google.com/site/algoxy/home/elementary-
algorithms.pdf)

~~~
EvgeniyZh
...but it increases run time. It's fine not to care about hidden constants
while analyzing algorithms, but not while using them in real life.

------
stjepang
I think it's fair to say that pdqsort (pattern-defeating quicksort) is overall
the best unstable sort and timsort is overall the best stable sort in 2017, at
least if you're implementing one for a standard library.

The standard sort algorithm in Rust is timsort[1] (slice::sort), but soon
we'll have pdqsort as well[2] (slice::sort_unstable), which shows great
benchmark numbers.[3] Actually, I should mention that neither implementation is
100% equivalent to what is typically considered timsort or pdqsort, but they're
pretty close.

It is notable that Rust is the first programming language to adopt pdqsort,
and I believe its adoption will only grow in the future.

Here's a fun fact: Typical quicksorts (and introsorts) in standard libraries
spend most of the time doing literally nothing - just waiting for the next
instruction because of failed branch prediction! If you manage to eliminate
branch misprediction, you can easily make sorting twice as fast! At least that
is the case if you're sorting items by an integer key, or a tuple of integers,
or something primitive like that (i.e. when comparison is rather cheap).
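
To make that concrete, here's a rough sketch of the trick in Rust - not
pdqsort's actual block partitioning (which buffers offsets BlockQuicksort-style),
just the simplest branch-free variant of a Lomuto partition:

    // Classic Lomuto does `if v[j] < pivot { v.swap(i, j); i += 1; }`, and on
    // random data that branch is essentially a coin flip. Here the swap is
    // unconditional and the comparison result is added as 0 or 1, which
    // typically compiles to straight-line code with no unpredictable branch.
    fn partition_branchless(v: &mut [u64], pivot: u64) -> usize {
        let mut i = 0;
        for j in 0..v.len() {
            v.swap(i, j);
            i += (v[i] < pivot) as usize;
        }
        i // v[..i] < pivot, v[i..] >= pivot
    }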

Pdqsort efficiently eliminates branch mispredictions and brings some other
improvements over introsort as well - for example, the complexity becomes
O(nk) if the input array is of length n and consists of only k different
values. Of course, worst-case complexity is always O(n log n).

Finally, last week I implemented parallel sorts for Rayon (Rust's data
parallelism library) based on timsort and pdqsort[4].

Check out the links for more information and benchmarks. And before you start
criticizing the benchmarks, keep in mind that they're rather simplistic, so
please take them with a grain of salt.

I'd be happy to elaborate further and answer any questions. :)

[1] [https://github.com/rust-lang/rust/pull/38192](https://github.com/rust-
lang/rust/pull/38192)

[2] [https://github.com/rust-lang/rust/issues/40585](https://github.com/rust-
lang/rust/issues/40585)

[3] [https://github.com/rust-lang/rust/pull/40601](https://github.com/rust-
lang/rust/pull/40601)

[4]
[https://github.com/nikomatsakis/rayon/pull/379](https://github.com/nikomatsakis/rayon/pull/379)

~~~
Retric
Comparing sorting algos often says more about your benchmark than the algos
themselves. Random and pathological are obvious, but often you're dealing with
something in between. Radix vs n log n is another issue.

So, what were your benchmarks like?

~~~
stjepang
That is true - the benchmarks mostly focus on random cases, although there are
a few benchmarks with "mostly sorted" arrays (sorted arrays with sqrt(n)
random swaps).

If the input array consists of several concatenated ascending or descending
sequences, then timsort is the best. After all, timsort was specifically
designed to take advantage of that particular case. Pdqsort performs
respectably, too, and if you have more than a dozen of these sequences or if
the sequences are interspersed, then it starts winning over timsort.

Anyways, both pdqsort and timsort perform well when the input is not quite
random. In particular, pdqsort blows introsort (e.g. typical C++ std::sort
implementations) out of the water when the input is not random[1]. It's pretty
much a strict improvement over introsort. Likewise, timsort (at least the
variant implemented in Rust's standard library) is pretty much a strict
improvement over merge sort (e.g. typical C++ std::stable_sort
implementations).

Regarding radix sort, pdqsort can't quite match its performance (it's O(n log
n) after all), but can perform fairly respectably. E.g. ska_sort[2] (a famous
radix sort implementation) and Rust's pdqsort perform equally well on my
machine when sorting 10 million random 64-bit integers. However, on larger
arrays radix sort starts winning easily, which shouldn't be surprising.

I'm aware that benchmarks are tricky to get right, can be biased, and are
always controversial. If you have any further questions, feel free to ask.

[1]: [https://github.com/orlp/pdqsort](https://github.com/orlp/pdqsort)

[2]:
[https://github.com/skarupke/ska_sort](https://github.com/skarupke/ska_sort)

~~~
prestonbriggs
There are lots of pathologies to watch out for... Imagine sorting 10 million
random ints, where 1% of them are random 64-bit values and 99% are in the
range [0..9]. Might be extra fun in parallel.
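
If anyone wants to try it, here's a sketch of a generator for that kind of
input (assumes the rand crate's 0.8-style API):

    use rand::Rng;

    // ~1% arbitrary 64-bit values, ~99% values in [0..9].
    fn skewed_input(n: usize) -> Vec<u64> {
        let mut rng = rand::thread_rng();
        (0..n)
            .map(|_| {
                if rng.gen_range(0..100) == 0 {
                    rng.gen::<u64>()
                } else {
                    rng.gen_range(0..10)
                }
            })
            .collect()
    }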

------
nneonneo
I would love to see the benchmark results against Timsort, the Python sorting
algorithm that also implements a bunch of pragmatic heuristics for pattern
sorting. Timsort has a slight advantage over pdqsort in that Timsort is
stable, whereas pdqsort is not.

I see that timsort.h is in the benchmark directory, so it seems odd to me that
the README doesn't mention the benchmark results.

~~~
nightcracker
There are multiple reasons I don't include Timsort in my README benchmark
graph:

1. There is no authoritative implementation of Timsort in C++. In the bench
directory I included [https://github.com/gfx/cpp-
TimSort](https://github.com/gfx/cpp-TimSort), but I don't know the quality of
that implementation.

2. pdqsort intends to be the algorithm of choice for a system's unstable sort.
In other words, a direct replacement for introsort in std::sort. So std::sort
is my main comparison vehicle, and anything else is more or less a
distraction. The only reason I included std::stable_sort in the benchmark is
to show those unaware that unstable sorting is a speed advantage.

But, since you're curious, here's the benchmark result with Timsort included
on my machine:
[http://i.imgur.com/tSdS3Y0.png](http://i.imgur.com/tSdS3Y0.png)

This is for sorting integers, however; I expect Timsort to become substantially
better as the cost of a comparison increases.

------
ComputerGuru
Just because I was confused: this is by Orson Peters, who first invented pdqsort.
It's not brand new (as in yesterday), but it is a very, very recent innovation
(2016).

------
dpcx
Question as a non low-level developer, and please forgive my ignorance:

How is it that we're essentially 50 years into writing sorting algorithms,
and we still find improvements? Shouldn't sorting items be a "solved" problem
by now?

~~~
stjepang
Basically all comparison-based sort algorithms we use today stem from two
basic algorithms: mergesort (stable sort, from 1945) and quicksort (unstable
sort, from 1959).

Mergesort was improved by Tim Peters in 2002 and that became timsort. He
invented a way to take advantage of pre-sorted intervals in arrays to speed up
sorting. It's basically an additional layer over mergesort with a few other
low-level tricks to minimize the amount of memcpying.
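
The run-detection part of that idea is easy to sketch (an illustration only,
not Rust's actual implementation):

    // Length of the maximal non-descending run starting at `start`
    // (assumes start < v.len()). Timsort-style sorts find such runs,
    // reverse the descending ones, and then merge runs instead of
    // sorting from scratch.
    fn run_len<T: Ord>(v: &[T], start: usize) -> usize {
        let mut end = start + 1;
        while end < v.len() && v[end - 1] <= v[end] {
            end += 1;
        }
        end - start
    }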

Quicksort was improved by David Musser in 1997 when he developed introsort. He
set a strict worst-case bound of O(n log n) on the algorithm, as well as
improved the pivot selection strategy. And people are inventing new ways of
pivot selection all the time. E.g. Andrei Alexandrescu published a new method
in 2017[1].

In 2016 Edelkamp and Weiß found a way to eliminate branch mispredictions
during the partitioning phase in quicksort/introsort. This is a vast
improvement. The same year Orson Peters adopted this technique and developed
pattern-defeating quicksort. He also figured out multiple ways to take
advantage of partially sorted arrays.

Sorting is a mostly "solved" problem in theory, but as new hardware emerges
different aspects of implementations become more or less important (cache,
memory, branch prediction) and then we figure out new tricks to take advantage
of modern hardware. And finally, multicore became a thing fairly recently so
there's a push to explore sorting in yet another direction...

[1]
[http://erdani.com/research/sea2017.pdf](http://erdani.com/research/sea2017.pdf)

~~~
xenadu02
It's always good to remember that while Big-O is useful, it isn't the be-all
end-all. The canonical example on modern hardware is a linked list. In theory
it has many great properties. In reality chasing pointers can be death due to
cache misses.

Often a linear search of a "dumb" array can be the fastest way to accomplish
something because it is very amenable to pre-fetching (it is obvious to the
pre-fetcher what address will be needed next). Even a large array may fit
entirely in L2 or L3. For small data structures arrays are almost always a
win; in some cases even hashing is slower than a brute-force search of an
array!

A good middle ground can be a binary tree with a bit less than an L1's worth
of entries in an array stored at each node. The binary tree lets you skip
around the array quickly while the CPU can zip through the elements at each
node.

It is more important than ever to test your assumptions. Once you've done
the Big-O analysis to eliminate exponential algorithms and other basic
optimizations, you need to analyze the actual on-chip performance, including
cache behavior and branch prediction.

~~~
flukus
> It's always good to remember that while Big-O is useful, it isn't the be-all
> end-all. The canonical example on modern hardware is a linked list. In
> theory it has many great properties. In reality chasing pointers can be
> death due to cache misses.

My favorite example is adding an ordered list of items into a simple tree:
all you've really done is create a linked list. Big-O doesn't know what your
data looks like, but you generally should.
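
A tiny sketch of what I mean - with a plain unbalanced binary search tree,
ordered inserts always go to the right child, so the "tree" is really a chain:

    // Minimal unbalanced BST, no rebalancing whatsoever.
    struct Node {
        key: i32,
        left: Option<Box<Node>>,
        right: Option<Box<Node>>,
    }

    fn insert(node: &mut Option<Box<Node>>, key: i32) {
        match node {
            None => *node = Some(Box::new(Node { key, left: None, right: None })),
            Some(n) => {
                if key < n.key {
                    insert(&mut n.left, key);
                } else {
                    insert(&mut n.right, key);
                }
            }
        }
    }

    fn depth(node: &Option<Box<Node>>) -> usize {
        match node {
            None => 0,
            Some(n) => 1 + depth(&n.left).max(depth(&n.right)),
        }
    }

    fn main() {
        let mut root = None;
        for key in 0..1000 {
            insert(&mut root, key); // ordered input: every insert goes right
        }
        assert_eq!(depth(&root), 1000); // depth == n, i.e. a linked list
    }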

~~~
beagle3
A simple binary tree is O(n^2) in that case, just like a linked list.

Unless you know your distributions and are generally proficient in probability
theory (in 99% of cases, neither can be relied on), the only relevant big-O
metric is the worst-case one.

------
graycat
I always wondered if there would be a way to have quicksort run slower than
O(n ln(n)).

Due to that possibility, when I code up a sort routine, I use heap sort. It is
guaranteed O(n ln(n)) worst case and achieves the Gleason bound for sorting by
comparing keys, which means that, in both the average and the worst case, it
is impossible ever to do better than heap sort's O(n ln(n)) number of key
comparisons.

For a stable sort, sure, just extend the sort keys with a sequence number, do
the sort, and remove the key extensions.
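
In code, that trick might look something like this (a sketch; the function
name is made up):

    // Make an unstable sort behave stably: extend each key with its original
    // index, sort on (key, index), then drop the index again.
    fn sort_stably_by_key_extension(v: &mut Vec<(u32, char)>) {
        let mut tagged: Vec<(u32, usize, char)> = v
            .iter()
            .enumerate()
            .map(|(i, &(key, payload))| (key, i, payload))
            .collect();
        // Ties on `key` are broken by the original index, so records with
        // equal keys keep their relative order even under an unstable sort.
        tagged.sort_unstable_by_key(|&(key, i, _)| (key, i));
        *v = tagged
            .into_iter()
            .map(|(key, _, payload)| (key, payload))
            .collect();
    }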

Quicksort has good main memory locality of reference and a possibility of some
use of multiple threads, and heap sort seems to have neither. But there is a
version of heap sort modified for doing better on locality of reference when
the array being sorted is really large.

But, if you are not too concerned about memory space, then you don't have to
care about the sort routine being _in place_. In that case, you get O(n ln(n)),
a stable sort, no problems with locality of reference, and the ability to sort
huge arrays with just the old merge sort.

I long suspected that much of the interest in in-place, O(n ln(n)), stable
sorting was due to some unspoken but strong goal of finding some fundamental
_conservation law_ of a trade off of processor time and memory space. Well,
that didn't really happen. But heap sort is darned clever; I like it.

~~~
torrent-of-ions
It's cool to play with a pack of cards and run sorting algorithms on them. To
see the worst case of quicksort, use the first element as the pivot and give
it an already sorted list. It will take quadratic time to give back the same
list.
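
The same experiment in code, with a comparison counter instead of cards (a
sketch):

    // Quicksort that always picks the first element as the pivot.
    fn quicksort_first_pivot(v: &mut [u32], comparisons: &mut u64) {
        if v.len() <= 1 {
            return;
        }
        let pivot = v[0];
        let mut i = 1;
        for j in 1..v.len() {
            *comparisons += 1;
            if v[j] < pivot {
                v.swap(i, j);
                i += 1;
            }
        }
        v.swap(0, i - 1);
        let (lo, hi) = v.split_at_mut(i - 1);
        quicksort_first_pivot(lo, comparisons);
        quicksort_first_pivot(&mut hi[1..], comparisons);
    }

    fn main() {
        let mut deck: Vec<u32> = (1..=52).collect(); // already sorted
        let mut comparisons = 0;
        quicksort_first_pivot(&mut deck, &mut comparisons);
        // Every call picks the smallest element as the pivot, so this does
        // 51 + 50 + ... + 1 = 1326 comparisons instead of ~n log n ≈ 300.
        println!("{} comparisons", comparisons);
    }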

~~~
graycat
Right. So, for the first "pivot" value, people commonly use the median of
three -- take three keys and use as the pivot the median of those three, that
is, the middle value. Okay. But then the question remains: While in practice
the median of three sounds better, maybe there is a goofy, pathological array
of keys that still makes quicksort run in quadratic time. Indeed, maybe for
any way of selecting the first pivot, there is an array that makes quicksort
quadratic.
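
For reference, the median-of-three pick itself is tiny (a sketch; assumes a
non-empty array):

    // Median of the first, middle, and last elements. On an already sorted
    // or reverse-sorted array this picks the true middle value, avoiding the
    // quadratic blow-up of the naive first-element pivot on those inputs.
    fn median_of_three(v: &[u64]) -> u64 {
        let a = v[0];
        let b = v[v.len() / 2];
        let c = v[v.len() - 1];
        // The median of three values is the smallest of the pairwise maxima.
        a.max(b).min(a.max(c)).min(b.max(c))
    }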

Rather than think about that, I noticed that heap sort meets the Gleason bound,
which means that heap sort's O(n ln(n)) performance, both worst case and
average case, can never be beaten by a sort routine that depends on comparing
keys two at a time.

Then, sure, you can beat O(n ln(n)). How? Use radix sort -- that was how the old
punched card sorting machines worked. So, for an array of length n and a key
of length k, the thing always runs in O(nk), which for sufficiently large n is
less than O(n ln(n)). In practice? Nope: I don't use radix sort!
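
For reference, here is roughly what such an O(nk) sort looks like - a minimal
LSD radix sort sketch, one byte per pass, so k = 8 passes on 64-bit keys, each
pass O(n):

    // Least-significant-digit radix sort on u64 keys, one byte at a time.
    // Each pass is a counting sort, so the total work is O(n * k) with k = 8,
    // no matter how the keys compare to each other.
    fn radix_sort_u64(v: &mut Vec<u64>) {
        let mut buf = vec![0u64; v.len()];
        for pass in 0..8 {
            let shift = pass * 8;
            // Histogram of the current byte.
            let mut counts = [0usize; 256];
            for &x in v.iter() {
                counts[((x >> shift) & 0xFF) as usize] += 1;
            }
            // Prefix sums turn counts into starting offsets.
            let mut offsets = [0usize; 256];
            for b in 1..256 {
                offsets[b] = offsets[b - 1] + counts[b - 1];
            }
            // Stable scatter into the buffer, then swap buffers.
            for &x in v.iter() {
                let b = ((x >> shift) & 0xFF) as usize;
                buf[offsets[b]] = x;
                offsets[b] += 1;
            }
            std::mem::swap(v, &mut buf);
        }
    }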

------
jkabrg
[Post-edit] I made several edits to the post below. First, to make an
argument. Second, to add paragraphs. [/Post-edit]

Tl;dr version: It seems to me you should either use heapsort or plain
quicksort; the latter with the sort of optimisations described in the linked
article, but not including the fallback to heapsort.

Long version:

Here's my reasoning for the above:

You're either working with lists that are reasonably likely to trigger the
worst case of randomised quicksort, or you're not working with such lists. By
likely, I mean the probability is not extremely small.

Consider the case when the worst case is very unlikely: you're so unlikely to
hit a worst case that you gain almost nothing by accounting for it except
extra complexity. So you might as well only use quicksort with optimisations
that are likely to actually help.

Next is the case that a worst case might actually happen. Again, this is not
by chance; it has to be because someone can predict your "random" pivot and
screw with your algorithm; in that case, I propose just using heapsort. Why?
This might be long, so I apologise. It's because usually when you design
something, you design it to a high tolerance; a high tolerance in this case
ought to be the worst case of your sorting algorithm. In which case, when
designing and testing your system, you'll have to do extra work to tease out
the worst case. To avoid doing that, you might as well use an algorithm that
takes the same amount of time every time, which I think means heapsort.

~~~
nightcracker
The fallback to heapsort adds a negligible, non-measurable amount of
processing time and guarantees a worst-case runtime of O(n log n) - to be more
precise, a worst case that is 2 - 4 times as slow as the best case.

Your logic also would mean that any sorting function that is publicly facing
(which is basically any interface on the internet, like a sorted list of
Facebook friends) would need to use heapsort (which is 2-4 times as slow), as
otherwise DoS attacks are simply done by constructing worst case inputs.

There are no real disadvantages to the hybrid approach.
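
For the record, the idea is roughly this (an introsort-style sketch, not
pdqsort itself; std's BinaryHeap stands in for a real in-place heapsort to
keep it short):

    use std::collections::BinaryHeap;

    // O(n log n) fallback. A real implementation would heapsort in place.
    fn heapsort(v: &mut [u64]) {
        let mut heap: BinaryHeap<u64> = v.iter().copied().collect();
        for slot in v.iter_mut().rev() {
            *slot = heap.pop().unwrap(); // max-heap pops in descending order
        }
    }

    fn introsort_rec(v: &mut [u64], depth_budget: u32) {
        if v.len() <= 1 {
            return;
        }
        if depth_budget == 0 {
            // Pivots have been consistently bad (or adversarial): bail out to
            // heapsort and keep the O(n log n) worst-case guarantee.
            heapsort(v);
            return;
        }
        // Middle element as pivot, Lomuto partition.
        v.swap(0, v.len() / 2);
        let pivot = v[0];
        let mut i = 1;
        for j in 1..v.len() {
            if v[j] < pivot {
                v.swap(i, j);
                i += 1;
            }
        }
        v.swap(0, i - 1);
        let (lo, hi) = v.split_at_mut(i - 1);
        introsort_rec(lo, depth_budget - 1);
        introsort_rec(&mut hi[1..], depth_budget - 1);
    }

    fn introsort(v: &mut [u64]) {
        // Allow ~2 * log2(n) levels of quicksort before falling back.
        let depth_budget = 2 * (usize::BITS - v.len().leading_zeros());
        introsort_rec(v, depth_budget);
    }

The only per-call cost of the guarantee is carrying the depth budget around;
on sane inputs the heapsort branch is simply never taken.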

~~~
jkabrg
Thanks for your reply.

> Your logic also would mean that any sorting function that is publicly facing
> (which is basically any interface on the internet, like a sorted list of
> Facebook friends) would need to use heapsort (which is 2-4 times as slow),
> as otherwise DoS attacks are simply done by constructing worst case inputs.

Why is that a wrong conclusion? It might be; I'm not a dev. But if I found
myself caring about that sort of minutiae, I would reach exactly that
conclusion.

Reasons:

* the paranoid possibility that enough users can trigger enough DoS attacks that your system can fall over. If this is likely enough, maybe you should design for the 2-4x worst case, and make your testing and provisioning of resources easier.

* a desire for simplicity when predicting performance, which you're losing by going your route because you're adding the possibility of a 2-4x performance drop depending on the content of the list. Ideally, you want the performance to solely be a function of n, where n is the size of your list; not n and the time-varying distribution of evilness over your users.

Finally, adding a fallback doesn't seem free to me, because it might fool you
into not addressing the points I just made. That O(n^2) for Quicksort might be
a good way to get people to think; your O(n log n) is hiding factors which
don't just depend on n.

~~~
carussell
I'm sympathetic because it may not be clear:

pdqsort is a hybrid sort; when it encounters apparently pathological input, it
switches from a strategy resembling quicksort to heapsort—it doesn't share
quicksort's worst case characteristics.

Your thesis is wrong:

> you might as well use an algorithm that takes the same amount of time every
> time, which I think means heapsort

Heapsort has a variable runtime. It will selectively reheap. Whether this
happens is dependent on the state of the heap and your next element, which
means the total number of times you reheap will vary with input.

~~~
jkabrg
> pdqsort is a hybrid sort; when it encounters apparently pathological input,
> it switches from a strategy resembling quicksort to heapsort—it doesn't
> share quicksort's worst case characteristics.

I understand how hybrid sorts work. I thought that would be clear enough. I
guess when you're a stranger on the internet, people don't automatically trust
you to know what you're talking about. I imagine this is even truer when you
say something counterintuitive or controversial. In spite of learning that,
I'm responding chiefly because of that quote.

> Your thesis is wrong:
>
> Heapsort has a variable runtime

Let's not get hung up on heapsort. My thesis is that if you have an algorithm
with a gap between the worst case and average case, then you shouldn't use it
on the web. That gap might _not_ manifest as a gap in the big O -- it could be
4x slower in the worst case than the average case, and that would still be bad.

To see why, try imagining this world of pure evil:

You create a website that uses _gap_ sort. Gapsort is an algorithm that is 4x
slower in the worst case than it is in the average case, for a fixed value of
n (the size of the input). On top of that, triggering the worst case by
accident is hard. You deploy your site for several months, buy servers, and
get a userbase. In your performance testing, you only encountered the average
case, so you only provisioned hardware for the average case. Now imagine that
suddenly all your users turn evil, and start triggering the worst case; this
leads to a sudden 4x performance drop. You may have underprovisioned for this.

So my thesis is that it _looks like_ having a difference between the worst case
and average case is great, on average. But in a hostile world, that actually
makes things more complicated. When doing empirical performance testing, you'll
test for the average. _Moreover_ the gap can be just 4x; _this will not
manifest as a difference between the big Os of the worst and average cases_
(as I've said previously).

Changing from quicksort to heapsort on some inputs may manifest as a
performance gap between the QS case and fallback case. Maybe that's not such a
good idea. In fact, maybe you shouldn't use any quicksort variant, even
introsort, in a potentially hostile environment, _because of the unavoidable
average-to-worst-case gap_. In introsort, that gap is merely a different
coefficient, but it's still a gap.

I hope that wasn't too repetitive.

[edit] I deleted and reposted this because the edits weren't being applied.

[edit] Done it twice now.

~~~
bmm6o
I hear what you're saying, but it seems like this is potentially a problem
only in cases where you are devoting significant resources to sorting. Like if
sorting takes only 1% of your CPU time, worst-case input will only bump that
to 4% - and that's only if every user starts hitting you with worst-case
requests. Even if you spend more, it's a question of how much work the
malicious users can cause you to do.

------
j_s
Has HN ever discussed the possibility of purposely crafting worst-case input
to amplify a denial-of-service attack?

~~~
nightcracker
If whoever you're targeting uses libc++, I already did the analysis:
[https://bugs.llvm.org/show_bug.cgi?id=20837](https://bugs.llvm.org/show_bug.cgi?id=20837)

To my knowledge it's still not fixed.

------
kleiba
This book is one of the most general treatments of parameterized Quicksort
available: [http://wild-inter.net/publications/wild-2016.pdf](http://wild-
inter.net/publications/wild-2016.pdf)

------
beagle3
Anyone know how this compares to Timsort in practice?

A quick google turns up nothing.

~~~
mastax
[https://github.com/rust-lang/rust/pull/40601](https://github.com/rust-
lang/rust/pull/40601)

"stable" is a simplified Timsort: [https://github.com/rust-
lang/rust/pull/38192](https://github.com/rust-lang/rust/pull/38192)

"unstable" is a pdqsort

~~~
stjepang
To summarize:

If comparison is cheap (e.g. when sorting integers), pdqsort wins because it
copies less data around and the instructions are less data-dependency-heavy.

If comparison is expensive (e.g. when sorting strings), timsort is usually a
tiny bit faster (around 5% or less) because it performs a slightly smaller
total number of comparisons.

------
jorgemf
Where is a high-level description of the algorithm? How is it different from
quicksort? It seems quite similar based on a quick look at the code.

~~~
klodolph
The readme file actually contains a fairly thorough description of how it
differs from quicksort. Start with the section titled "the best case".

~~~
jorgemf
> On average case data where no patterns are detected pdqsort is effectively a
> quicksort that uses median-of-3 pivot selection

So basically it is quicksort with a bit more clever pivot selection, but only
for some cases.

~~~
stjepang
You're forgetting probably the most important optimization: block
partitioning. This one alone makes it almost 2x faster (on random arrays) than
typical introsort when sorting items by an integer key.

------
wiz21c
Is there an analysis of its complexity? The algorithm looks very nice!

~~~
nightcracker
Hey, author of pdqsort here, the draft paper contains complexity proofs of the
O(n log n) worst case and O(nk) best case with k distinct keys:
[https://drive.google.com/open?id=0B1-vl-
dPgKm_T0Fxeno1a0lGT0...](https://drive.google.com/open?id=0B1-vl-
dPgKm_T0Fxeno1a0lGT0E)

~~~
ouid
Best case? Give worst and average case when describing complexities.

~~~
nightcracker
I already did, you may want to re-read my comment.

~~~
ouid
Am I misinterpreting your usage of "best case"?

~~~
nightcracker
Without enlightening me on what your interpretation is, I have no way of
telling.

pdqsort has a worst case of O(n log n). That means, no matter what, the
algorithm never takes more than a constant factor times n log n time to
complete.

Since pdqsort is strictly a comparison sort, and comparison sorts can do no
better than O(n log n) in the average case, pdqsort is asymptotically optimal
in the average case (because the average case can never be worse than the
worst case).

On top of the above guarantees, if your input contains only k distinct keys,
then pdqsort has a worst case complexity of O(nk). So when k gets small (say,
1-5 distinct elements), pdqsort approaches linear time. That is pdqsort's best
case.
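
If you want to see that effect yourself, here's a rough timing sketch using
Rust's slice::sort_unstable (the pdqsort implementation discussed elsewhere in
this thread); standard library only, with a toy LCG standing in for a proper
RNG, and the exact numbers will of course depend on your machine:

    use std::time::Instant;

    fn main() {
        let n = 10_000_000;

        // Same length, different number of distinct keys.
        let mut x: u64 = 0x243F6A8885A308D3;
        let mut random: Vec<u64> = (0..n)
            .map(|_| {
                x = x.wrapping_mul(6364136223846793005)
                    .wrapping_add(1442695040888963407);
                x
            })
            .collect();
        let mut few_distinct: Vec<u64> = random.iter().map(|&v| v % 3).collect();

        let t = Instant::now();
        random.sort_unstable();
        println!("many distinct keys: {:?}", t.elapsed());

        let t = Instant::now();
        few_distinct.sort_unstable();
        println!("3 distinct keys:    {:?}", t.elapsed());
    }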

------
unruledboy
It's interesting that .NET's built-in quicksort is actually doing the same
thing, with introsort behind the scenes.

------
torrent-of-ions
I can see why one might blindly call qsort on already sorted data (when using
user input), but why sorted data with one out of place element? Presumably
that element has been appended to a sorted array, so you would place it
properly in linear time using insertion and not call a sort function at all.
Why does such a pattern arise in practice?

~~~
nightcracker
You would be surprised how often people just use a (repeatedly) sorted vector
instead of a proper data structure or proper insertion calls. It's a lot
simpler to just append and sort again. Or the appending happens somewhere else
in the code entirely.

As a real-world example, consider a sprite list in a game engine. Yes, you
could keep it properly sorted as you add/remove sprites, but it's a lot
simpler to just append/delete sprites throughout the code, and just sort it a
single time each frame, even if it only adds a single sprite.

So yes, technically this pattern is not needed if everyone always recognized
and used the optimal data structure and algorithm at the right time. But that
doesn't happen, and it isn't always the simplest solution for the programmer.

