
A faster Python sort - luu
http://dpb.bitbucket.org/a-faster-python-sort.html
======
raymondh
Adding an option to shoot yourself in the foot is bad API design and a
premature optimization.

For the most part, it is rare to have both fully randomized data and a need to
sort it.

One way to test the quality of a pseudo-random number generator is to count
ascending and descending runs
[https://en.wikipedia.org/wiki/Diehard_tests](https://en.wikipedia.org/wiki/Diehard_tests).
This suggests that runs are hard to get rid of even when you're trying to do
so on purpose. The Timsort algorithm expends a little effort to see if runs are
present (which is common) so that it can dramatically reduce the effort
required to sort. Taking the check out is likely to be a bad idea in general
-- "Hey, I'll speed things up by a few percent by removing the check for a
high-speed shortcut. Now, give me a raise."

In a way, this micro-optimization isn't much more interesting than adding a
flag to indicate that the data is already sorted. "Hey, I sped up the sort
routine by 100% because I happened to know that the data was already
sorted. Another raise please."
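
The payoff of the run check is easy to see from Python itself. A minimal
sketch using only the standard library (the list size and the number of runs
are arbitrary choices here, not anything from CPython's internals):

```python
import random
import timeit

random.seed(0)
n = 100_000

# Fully random data: Timsort finds only short natural runs.
rand_data = [random.random() for _ in range(n)]

# The same amount of data arranged as ten pre-sorted runs.
runs_data = []
for _ in range(10):
    runs_data.extend(sorted(random.random() for _ in range(n // 10)))

t_rand = timeit.timeit(lambda: sorted(rand_data), number=20)
t_runs = timeit.timeit(lambda: sorted(runs_data), number=20)
print(f"random: {t_rand:.3f}s   ten runs: {t_runs:.3f}s")
```

On a typical machine the run-structured input sorts several times faster,
which is the "high-speed shortcut" the check buys.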

Lastly, I question the benchmarking procedure. One timing uses NumPy arrays
without any object overhead, while the other uses plain Python lists. This is an
apples-to-oranges comparison that doesn't allow you to draw any meaningful
conclusions about the relative speed of the two algorithms.
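
A like-for-like comparison would hold the data structure fixed and vary only
the algorithm; NumPy's own `kind` parameter makes that easy. A sketch (the
array size is an arbitrary choice):

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1_000_000)

# Same container, same dtype -- only the sorting algorithm varies.
for kind in ("quicksort", "mergesort", "heapsort", "stable"):
    t = timeit.timeit(lambda: np.sort(a, kind=kind), number=5)
    print(f"{kind:>10}: {t:.3f}s")
```

Any difference seen this way is attributable to the algorithm rather than to
Python-object overhead.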

Very few people have crossed light sabers with Tim Peters and won. I'm amazed
at how many people think that the first ideas that pop into their head will be
better than decades of experience, deep analysis, and careful design. This is
mature code, widely studied, and adopted in multiple languages.

~~~
scythe
The difference between merge sort and timsort is not "removing a check";
timsort is actually a different algorithm. In the ordinary sense of the word,
there is no check. In fact, the use of insertion sort is most important to
speed up sorting _random_ data, because the real advantage of insertion sort
is that for small enough chunks -- anything that fits in a few cache lines --
the amount you save on comparisons vastly exceeds the amount you spend on
swapping. Most well-written sort functions use insertion sort at small enough
sizes.
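
The usual pattern looks like this. A sketch of a merge sort that hands small
slices to insertion sort (the cutoff of 32 is a typical but arbitrary choice,
not taken from any particular library):

```python
SMALL = 32  # cutoff below which insertion sort wins; libraries tune this


def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place; cheap for slices that fit in a few cache lines."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x


def hybrid_mergesort(a, lo=0, hi=None):
    """Merge sort that switches to insertion sort on small sublists."""
    if hi is None:
        hi = len(a)
    if hi - lo <= SMALL:
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_mergesort(a, lo, mid)
    hybrid_mergesort(a, mid, hi)
    # Merge the two sorted halves back into a[lo:hi].
    merged = []
    i, j = lo, mid
    while i < mid and j < hi:
        if a[i] <= a[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(a[j])
            j += 1
    merged.extend(a[i:mid])
    merged.extend(a[j:hi])
    a[lo:hi] = merged
```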

The adaptive speedup in timsort does not come from adopting ideas in insertion
sort; rather, it comes from strand sort.

[http://en.wikipedia.org/wiki/Strand_sort](http://en.wikipedia.org/wiki/Strand_sort)

As to the speedup, who knows? As you've stated, this is an apples-to-oranges
comparison. It depends on the Python vs NumPy object model, and it's pretty
suspicious to see mergesort beating quicksort.

------
ntenenz
If you know the structure of your data (or lack thereof) a priori, of course
you can select an algorithm that outperforms one that is strong in a more
general sense. An insertion sort is O(N) for nearly sorted data (which is why
it's used as a part of many hybrid sorts) and, on data that's already sorted,
will outperform ANY sort you throw at it. However, you would never use it on
data that you know is reversed.
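
The asymmetry is easy to see by counting comparisons. A minimal sketch (n is
an arbitrary choice):

```python
import random


def insertion_sort_counting(data):
    """Insertion sort; returns (sorted list, number of comparisons made)."""
    a = list(data)
    comparisons = 0
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= x:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a, comparisons


n = 1000
random.seed(0)
_, c_sorted = insertion_sort_counting(range(n))           # n - 1 comparisons
_, c_reversed = insertion_sort_counting(range(n, 0, -1))  # n(n-1)/2 comparisons
_, c_random = insertion_sort_counting(random.sample(range(n), n))
print(c_sorted, c_random, c_reversed)
```

Already-sorted input costs exactly n - 1 comparisons, reversed input the full
quadratic n(n-1)/2, with random input in between.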

In other words, calling a merge sort a "faster sort" is a complete misnomer.
Saying it's faster for a certain sequence of data is far more accurate (and
also a bit of a "no shit" statement since one could construct a multitude of
sequences where the opposite is also true).

~~~
tetha
That's why the best sorting algorithm is to generate sorted data, if that's
possible without too much overhead.

------
andreasvc
Maybe it's just me, but I feel like sorting is an overstudied problem. There
are much more interesting challenges in data structures and algorithms.

EDIT: the blog post presents a benchmark on two completely different data
structures (a NumPy matrix vs. a Python list), so it is unclear whether any
conclusion can be drawn.

------
keypusher
There are good reasons Timsort is used by the stdlib, namely that real world
data is often partially sorted. Of course you can manufacture cases in which
sorting a numpy array via mergesort is faster than timsort on a stdlib list,
but that doesn't make this headline appropriate. The author literally has one
example. That's not how you characterize an algorithm. Is he unaware of big O?
I wouldn't mind seeing a thorough review of the implementation and behavior
differences between these two algorithms, but this article isn't it.

------
ggchappell
Okay, this is a bit of a side issue, but a few details about the NumPy
Quicksort are bothering me.

First, we've known since 1997 how to improve Quicksort's worst case number of
comparisons to O(n log n) for an n-item list. (Track the recursion depth and
switch to Heap Sort on the current sublist if the depth exceeds 2 log_2(n);
this is usually called "Introsort".) This optimization is pretty well known.
For a front-line library like NumPy to include a quadratic-worst-case sorting
algorithm in 2014 is ridiculous.
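
The scheme is only a few lines on top of a plain quicksort. A rough sketch of
the idea, not NumPy's actual implementation (pivot choice and cutoffs are
simplified, and the heapsort fallback here reuses `heapq` for brevity):

```python
import heapq
import math


def introsort(a):
    """Quicksort with a recursion-depth budget; when the budget runs out,
    fall back to heapsort so the worst case stays O(n log n)."""

    def sort(lo, hi, depth):
        if hi - lo <= 1:
            return
        if depth == 0:
            # Depth budget exhausted: heapsort this sublist.
            sub = a[lo:hi]
            heapq.heapify(sub)
            a[lo:hi] = [heapq.heappop(sub) for _ in range(len(sub))]
            return
        # Hoare-style partition around the middle element.
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi - 1
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        sort(lo, j + 1, depth - 1)
        sort(i, hi, depth - 1)

    sort(0, len(a), 2 * int(math.log2(len(a))) if a else 0)
    return a
```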

Second (and rather less important), we can't write Quicksort with constant
extra space usage, since, when sorting a small sublist, we need to remember
the pivots in the larger sublists we are in the middle of sorting. So
Quicksort's extra space usage is going to be at best O(log n)
subscripts/pointers for an n-item list. However, the docs say 0 work space
required. This would be correct if the sort were written in C and NumPy only
counted Python objects in its space usage, but I can't find anywhere that this
is stated.
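
For what it's worth, the classic trick for keeping that overhead at O(log n)
rather than O(n) is to recurse only into the smaller partition and loop on the
larger one, so no more than log_2(n) frames are ever live. A sketch:

```python
def quicksort_small_stack(a):
    """Quicksort using O(log n) stack: recurse into the smaller partition,
    then continue iterating on the larger one in the same frame."""

    def sort(lo, hi):
        while hi - lo > 1:
            # Hoare-style partition around the middle element.
            pivot = a[(lo + hi) // 2]
            i, j = lo, hi - 1
            while i <= j:
                while a[i] < pivot:
                    i += 1
                while a[j] > pivot:
                    j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1
                    j -= 1
            # Recurse into the smaller side; loop on the larger side.
            if j + 1 - lo < hi - i:
                sort(lo, j + 1)
                lo = i
            else:
                sort(i, hi)
                hi = j + 1

    sort(0, len(a))
    return a
```

Since each recursive call handles at most half the current sublist, the stack
depth is bounded by log_2(n) even on adversarial input.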

Third (and perhaps _very_ important), I don't see anything in the docs about
locality of reference or any other is-this-algorithm-friendly-to-modern-
processor-architectures issues. There are good reasons Quicksort is largely
being abandoned for standard-library sort implementations.

~~~
bluecalm
>>There are good reasons Quicksort is largely being abandoned for standard-
library sort implementations.

Maybe I am not up to date on that, but there was some fuss lately about the
implementation of dual-pivot quicksort in the Java standard library, which was
to replace the old implementation
([http://www.docjar.com/html/api/java/util/DualPivotQuicksort....](http://www.docjar.com/html/api/java/util/DualPivotQuicksort.java.html))

My experience is also that quicksort smokes all merge-sort hybrids on
partially sorted inputs but of course that might be because my inputs had
something very specific in them when I tested it. Quicksort is very cache
friendly (the pivot sits in a register, and you keep comparing to what is
already in cache, assuming you sort small structs with a key in them).

My intuition is that mergesort hybrids being faster in some languages is very
specific to the implementations of those languages, and that they will lose
once you are down to sorting C structs.

~~~
ggchappell
Well, perhaps I'm not terribly up-to-date on the issues I mentioned third,
either. FWIW, there is a comment on why GNU libc defaults to Merge Sort for
qsort() on Reddit:

> The reason we made qsort() default to using merge sort was, indeed, that it
> always makes fewer compares than quick sort. We thought that in the common
> case, people would sort small objects (numbers, pointers) rather than large
> objects, and that therefore the performance would be dominated by
> comparisons rather than data movement.

Link to full comment:
[http://www.reddit.com/r/programming/comments/1jxq4/ask_reddi...](http://www.reddit.com/r/programming/comments/1jxq4/ask_reddit_why_does_the_gnu_libc_use_mergesort_to/c1k07v)

Since you say "pivot sits in a register", you're also talking about very small
data. It looks to me like the two of you disagree. (OTOH, that comment was 7
years ago.)

------
bluecalm
My intuition about sorting in Python is that by far the fastest way would be
to calculate the keys first, then call an optimized C quicksort on an array of
structs {handle, key}, and copy the result back to Python. That would require more than
2n memory (and exactly 2n for things like ints or chars) but the speed-up
should be huge because well implemented C quicksort on small structs really
shines when it comes to exploiting characteristics of modern hardware (cache
hits are more important than number of comparisons).
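
Something close to this decorate-sort-undecorate scheme can already be
approximated from Python with NumPy: compute the keys into a flat C array,
sort index handles in C, then reorder the objects. A sketch, not the exact
{handle, key} struct layout proposed above (`len` as the key function is an
arbitrary example):

```python
import numpy as np

words = ["pear", "fig", "banana", "kiwi", "apple"]

# Decorate: compute all keys up front into a flat C array (no per-item
# Python-object overhead during the sort itself).
keys = np.fromiter((len(w) for w in words), dtype=np.int64, count=len(words))

# Sort the handles (indices) by key, entirely in C.
order = np.argsort(keys, kind="stable")

# Undecorate: copy the objects back into a Python list.
by_length = [words[i] for i in order]
print(by_length)  # ['fig', 'pear', 'kiwi', 'apple', 'banana']
```

The `kind="stable"` choice makes the result match what `sorted(words,
key=len)` would produce for equal keys.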

~~~
zeckalpha
Timsort doesn't have n^2 worst-case performance. Also, CPython's sort is
implemented in C, so there is no need for copying:
[http://hg.python.org/cpython/file/be77b213ace0/Objects/listo...](http://hg.python.org/cpython/file/be77b213ace0/Objects/listobject.c#l2037)

~~~
bluecalm
It is implemented in C, but it doesn't sort small structs without any pointer
dereferences. That quicksort has an n^2 worst case doesn't matter in practice
at all. First, you can make it n log n (by either running heapsort at some
point or taking the median of 3-5 random elements instead of a pivot at
predetermined positions).
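
The randomized-pivot tweak is tiny. A sketch with a sample of three, as
mentioned above (a production version would sample indices without
replacement and partition around the chosen value):

```python
import random


def random_median_pivot(a, lo, hi):
    """Pivot = median of three elements sampled at random from a[lo:hi],
    so no fixed input can reliably trigger the quadratic case."""
    candidates = [a[random.randrange(lo, hi)] for _ in range(3)]
    return sorted(candidates)[1]
```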

