
On the Worst-Case Complexity of TimSort - pelario
http://drops.dagstuhl.de/opus/volltexte/2018/9467/
======
kieckerjan
I am surprised to see the author of TimSort (Tim Peters) does not have a
Wikipedia entry. Seems to me he has enough claim to (wiki)fame.

~~~
saagarjha
It's probably just that nobody's gotten around to writing one.

------
0x0
The linked Java test file,
[http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java](http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java)
\- still crashes the latest Java 10.0.2 with an 'Exception in thread "main"
java.lang.ArrayIndexOutOfBoundsException: 49'. Amazing! I wonder if this makes
some web services vulnerable... if the user can submit a just-so array of ints
to be sorted? But it does seem like it would require uploading a really huge
array (>4GB?)

~~~
iCannotEven

      arrayToSort[sum] = 1;
    

This is just a blatant programmer error. The code is attempting to assign a
value to a slot that does not exist in a fixed-size array.

Use:

    
    
      Integer[] arrayToSort = new Integer[2000000000];
    

No error.

~~~
toxik
The equivalent Python code works fine, though.

    
    
        >>> a=[0]*sum(rls)
        >>> sum=-1
        >>> for i in rls:
        ...   sum += i
        ...   a[sum] = 1
        ... 
        >>> a.sort()
        >>> 
    

Takes a real good while, too.

~~~
iCannotEven
That's because the array to be sorted is packed and inflated, so as to be the
worst possible input for that kind of sort.

Worst-case complexity.

Also, I needed to bump my JVM heap up to 16GB (not 9GB as recommended), just
to run it.

------
julienfr112
Amazing there is still something to find in a sort algorithm.

~~~
lorenzhs
Especially when targeting realistic machine models, there are a lot of things
that are suboptimal about the classical sorting algorithms like quicksort or
mergesort. For example, a quicksort with perfect choice of pivot will incur a
branch miss with probability 50% for every element. That's not something
classical complexity analysis measures, but on actual CPUs, branch misses have
quite an impact (something like 10–15 cycles). Shameless plug for something I
once built for a lecture to demonstrate this:
[https://github.com/lorenzhs/quicksort-pivot-imbalance](https://github.com/lorenzhs/quicksort-pivot-imbalance)
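
A minimal Java sketch of the point (my own, not from the linked repo): in a
Lomuto-style partition, every element gets compared to the pivot once, and
with a perfect (median) pivot on random data that comparison goes either way
with probability ~1/2, which is the worst case for a branch predictor.

```java
import java.util.Arrays;

public class PartitionBranchDemo {
    // Lomuto-style partition: each element is compared against the pivot once.
    // With a median pivot and random input, the comparison below is true with
    // probability ~1/2 -- the hardest case for the CPU's branch predictor, and
    // a cost that classical O(n log n) analysis simply doesn't see.
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {          // data-dependent, hard-to-predict branch
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;                        // pivot's final position
    }

    public static void main(String[] args) {
        int[] a = {3, 8, 1, 9, 5};       // 5 is the median, used as pivot
        int p = partition(a, 0, a.length - 1);
        System.out.println(p + " " + Arrays.toString(a)); // 2 [3, 1, 5, 9, 8]
    }
}
```

A skewed pivot makes the branch *more* predictable (mostly one outcome), which
is the trade-off the repo above measures.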

For sorting algorithms that take the behaviour of modern CPUs into account,
check out ips4o
([https://arxiv.org/abs/1705.02257](https://arxiv.org/abs/1705.02257), code:
[https://github.com/SaschaWitt/ips4o](https://github.com/SaschaWitt/ips4o)) or
for a simpler algorithm that's still much faster than quicksort in most cases,
blockquicksort
([https://arxiv.org/abs/1604.06697](https://arxiv.org/abs/1604.06697), code:
[https://github.com/weissan/BlockQuicksort](https://github.com/weissan/BlockQuicksort)).
Note that both papers were published in the last two years :)

Of course these algorithms are much more complex and error-prone to implement
and use some additional memory, which may explain why they're not used in
standard library implementations of popular languages.

~~~
kjeetgill
Out of curiosity, what do you read/follow that makes stuff like this
discoverable?

~~~
lorenzhs
I'm a PhD student in algorithmics / algorithm engineering, so I work with a
lot of people who do stuff like this, even though my research isn't related to
sorting. Super Scalar Sample Sort (which ips4o is based on) was co-authored by
my advisor, Peter Sanders, and I re-implemented it a few years ago in modern
C++ just for the fun of it. Turns out that was a lot nicer to read than the
original code (which isn't public) and a bit faster, too. I put that on GitHub
where the blockquicksort authors found it and contacted me and we traded some
emails and made some improvements. Sometime later my advisor and two
colleagues came up with an idea for in-place super-scalar sample sort, out of
which eventually ips4o was born.

So, uh, I guess I just work with the right people :) Sorry that I can't give
you anything concrete. All of these papers were presented at ESA (European
Symposium on Algorithms), though, so that's a good venue to follow. But
beware, ESA has a theory track that's a lot bigger than the experimental
track, and papers published there can be somewhat unapproachable ;)

~~~
kjeetgill
Ahh, that makes sense, thanks!

I'd recently asked elsewhere, 'I've always wondered: is there a good
mathematical representation of an algorithm that is useful for algebraic
manipulation? Like producing a quicksort from a bubble sort.' And got linked
this paper [0]. That's the only time I've heard the word 'algorithmics' since
then.

Any interesting 'entry level' reads you could send my way?

[0]:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.2247&rep=rep1&type=pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.2247&rep=rep1&type=pdf)

~~~
lorenzhs
Thanks for the link, that was fun to read (well, skim).

I don't think the term 'algorithmics' appears very often in publications; it's
more of an umbrella term for many things. The stuff we work on in our group is
sort of the practical side of theoretical computer science, in that our focus
is on algorithms that can be efficiently implemented and don't just look good
on paper. The methodology is called Algorithm Engineering, it's described
quite well in
[https://en.wikipedia.org/wiki/Algorithm_engineering](https://en.wikipedia.org/wiki/Algorithm_engineering).
Apart from ESA's track B, the major conferences there are Alenex and SEA
(Symposium on Experimental Algorithms). All three are open access.

It's difficult to recommend anything in particular, not only because the scope
is very broad, but also because most papers aren't written for a wide
audience. Good writing is not something that academics optimize for
(perversely, good writing can be seen as a negative point, in that if a paper
is easy to understand, it may be rejected for being too simple), nor is it
taught. Maybe something like
[https://github.com/papers-we-love/papers-we-love](https://github.com/papers-we-love/papers-we-love)
could serve as a starting point?

------
TeMPOraL
How is it that the abstract is talking about "Java version" and "Python
version" when discussing computational complexity? Aren't algorithms
algorithms, independent of the language you're implementing them in?

~~~
grillorafael
Yes. If the Java version has a different complexity, it is a different
algorithm.

In the writers' defense, they have the algorithm in pseudocode in the article.

~~~
brazzy
> If the java have a different complexity it is a different algorithm.

It doesn't seem wrong to me to talk about different versions of the same
algorithm when there are only minor differences.

~~~
ehsankia
Right, like how Quicksort can be pretty different depending on how you choose
the pivot. It's still Quicksort, but there's different variants.

~~~
grillorafael
Yes, but they will share the same complexity.

~~~
FreeFull
Depending on how you choose the pivot, the worst-case complexity of Quicksort
can be O(n^2) or O(n log n).

~~~
klmr
No, worst-case complexity of real quicksort is always O(n^2), regardless of
pivot choice strategy (even with stochastic pivot choice, although you’d have
to get _very_ unlucky to hit that worst case). You can make the average case
better or worse though.

The only way of making quicksort’s worst-case runtime O(n log n) is by
limiting recursion depth, as done e.g. in introsort. But that’s no longer
quicksort.

~~~
nh2
This is wrong.

See
[https://en.m.wikipedia.org/wiki/Quicksort](https://en.m.wikipedia.org/wiki/Quicksort),
section "Selection-based pivoting".

~~~
deathanatos
Specifically,

> _A variant of quickselect, the median of medians algorithm, chooses pivots
> more carefully, ensuring that the pivots are near the middle of the data
> (between the 30th and 70th percentiles), and thus has guaranteed linear time
> – O(n). This same pivot strategy can be used to construct a variant of
> quicksort (median of medians quicksort) with O(n log n) time. However, the
> overhead of choosing the pivot is significant, so this is generally not used
> in practice._
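
A hedged Java sketch of median-of-medians selection (my own illustration, not
from the Wikipedia article). It allocates scratch arrays instead of working in
place, so it is illustrative only; the point is that the pivot it returns is
guaranteed to be between the 30th and 70th percentiles, which is what makes
the O(n log n) quicksort variant possible.

```java
import java.util.Arrays;

public class MedianOfMedians {
    // Returns the k-th smallest element (0-based) of a, in guaranteed O(n)
    // time. Not in place: scratch arrays keep the sketch short and clear.
    static int select(int[] a, int k) {
        if (a.length <= 5) {             // base case: just sort the handful
            int[] c = a.clone();
            Arrays.sort(c);
            return c[k];
        }
        // Median of each group of 5.
        int groups = (a.length + 4) / 5;
        int[] medians = new int[groups];
        for (int g = 0; g < groups; g++) {
            int[] grp = Arrays.copyOfRange(a, g * 5, Math.min(g * 5 + 5, a.length));
            Arrays.sort(grp);
            medians[g] = grp[grp.length / 2];
        }
        int pivot = select(medians, groups / 2);  // recurse on the medians
        // Three-way partition by value, then recurse into the side holding k.
        int less = 0, equal = 0;
        for (int x : a) { if (x < pivot) less++; else if (x == pivot) equal++; }
        if (k < less) {
            int[] left = new int[less];
            int i = 0;
            for (int x : a) if (x < pivot) left[i++] = x;
            return select(left, k);
        } else if (k < less + equal) {
            return pivot;
        } else {
            int[] right = new int[a.length - less - equal];
            int i = 0;
            for (int x : a) if (x > pivot) right[i++] = x;
            return select(right, k - less - equal);
        }
    }

    public static void main(String[] args) {
        int[] a = {9, 1, 8, 2, 7, 3, 6, 4, 5};
        System.out.println(select(a, a.length / 2)); // median: 5
    }
}
```

The constant factors of the group-of-5 pass are exactly the overhead the quote
says makes this "generally not used in practice".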

------
grillorafael
Given that rho can vary with the input and is a completely arbitrary value,
shouldn't it also be called n?

My memory of the subject is not great, so I might be saying bullshit here.

~~~
matharmin
In the worst case, rho is equal to n, and you get O(n log n). However, O(n + n
log rho) gives a better description of how it performs on partially sorted
arrays.

~~~
Scarblac
And in the best case (already sorted array), it's equal to 1 and the algorithm
performs as O(n), which is nice to prove in one go.

In some other typical cases (an otherwise sorted array with one element
inserted, or two sorted arrays appended to each other), rho is 3 and 2
respectively, so also O(n).
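
To make rho concrete, here is a simplified Java run counter (real TimSort
also detects strictly descending runs and reverses them; this sketch only
counts maximal nondecreasing runs):

```java
public class RunCount {
    // rho in the paper is the number of monotonic runs TimSort decomposes
    // the input into. Simplification: count maximal nondecreasing runs only.
    static int countRuns(int[] a) {
        if (a.length == 0) return 0;
        int runs = 1;
        for (int i = 1; i < a.length; i++) {
            if (a[i] < a[i - 1]) runs++;  // order breaks: a new run starts here
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(countRuns(new int[]{1, 2, 3, 4}));        // sorted: 1
        System.out.println(countRuns(new int[]{1, 2, 3, 1, 2, 3}));  // two sorted arrays appended: 2
    }
}
```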

~~~
dmurray
All sorting algorithms can trivially be made to perform in O(n) in the
"already sorted" scenario without worsening the worst case complexity, so that
isn't really helpful.

~~~
giovannibajo1
That can only be done by adding an initial check that does exactly that. But
then the algorithm's speed doesn't gradually improve with partially sorted
sequences.

For instance, TimSort is also very fast if only a single element is unsorted,
or only two elements, or only three. These are not special cases explicitly
handled; it's just the natural way the algorithm works.
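
The "initial check" approach the parent describes could look like this sketch
(a hypothetical wrapper, not anyone's actual library code); note it only helps
when the input is exactly sorted:

```java
import java.util.Arrays;

public class SortedCheckWrapper {
    // One O(n) pass detects already-sorted input; otherwise fall back to a
    // full O(n log n) sort. Unlike TimSort, an input with even one element
    // out of place gets no benefit from the check at all.
    static void sort(int[] a) {
        boolean sorted = true;
        for (int i = 1; i < a.length && sorted; i++) {
            if (a[i] < a[i - 1]) sorted = false;
        }
        if (!sorted) Arrays.sort(a);  // placeholder for any comparison sort
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 2};
        sort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3]
    }
}
```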

------
willtim
Not good that Java's sort still has bugs.

~~~
wiz21c
I'd say it the other way around: 1) it's amazing that code used in countless
places can still have a bug; 2) (I've studied sorting algorithms for a while)
finding such a bug is very clever... Kudo's to the authors.

~~~
pvg
It looks deceptively like it but 'kudos' is not plural of a kudo (or reference
to some awesome thing Kudo once did).

~~~
klmr
… and even if it were a plural the apostrophe would be misplaced.

