Timsort is a sorting algorithm that is efficient for real-world data (hackernoon.com)
260 points by signa11 on July 1, 2018 | 77 comments



Recently I "made" [0] one of those sorting-visualization videos for Timsort on real-world data. The point was to give an example of data that was already mostly sorted and show how fast Timsort sorts it. It may be of interest here:

https://www.youtube.com/watch?v=ZxLxf5xqqyE

[0] My use of "made" gives me way more credit than I'm due. Really I just hacked others' work to use another data source. See the description in the video for more info on how it was produced.


I feel like this is important to note:

My example of real-world data is data in which the first 90% is already sorted and the next 10% is random. Whether this is real enough is questionable, but it does demonstrate how Timsort automatically takes advantage of runs of sorted data.
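
For reference, that 90%-sorted / 10%-random shape is easy to reproduce; here's a minimal sketch (sizes and ranges are arbitrary, not the exact parameters used in the video):

  import random

  n = 1_000_000
  data = list(range(int(n * 0.9)))                        # first 90%: already sorted
  data += [random.randrange(n) for _ in range(n // 10)]   # last 10%: random
  data.sort()   # CPython's list.sort() is Timsort, so the long run is exploited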

It would be nice to see a comparison working on truly real-world data. Perhaps from an open dataset.


I have been experimenting with lidar datasets lately. In lidar you work with huge point clouds, millions to billions of points. For some algorithms - convex hull, for example - you need to sort the points, e.g. by x-coordinate. I did some quick tests with quicksort, merge sort, and Timsort. In my experiments quicksort was the slowest at 5.5 seconds on the chosen dataset, mergesort was 5 seconds, and Timsort was around 4 seconds.


When sorting coordinates, radix sort with an appropriately chosen radix (often but not always 256) is usually fastest.
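
For the curious, the basic shape of an LSD radix sort with radix 256 looks roughly like this; just a sketch for non-negative integer keys (float coordinates would first need an order-preserving bit transform, e.g. flipping the sign bit):

  def radix_sort_256(keys):
      # Least-significant-digit radix sort, one byte (radix 256) per pass.
      # Buckets are emptied in order, so each pass is stable.
      if not keys:
          return keys
      shift, max_key = 0, max(keys)
      while (max_key >> shift) > 0:
          buckets = [[] for _ in range(256)]
          for k in keys:
              buckets[(k >> shift) & 0xFF].append(k)
          keys = [k for b in buckets for k in b]
          shift += 8
      return keys

  print(radix_sort_256([70000, 3, 256, 255, 65536]))  # [3, 255, 256, 65536, 70000]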


Hi there, thanks. Need to try this


Yeah, seeing more datasets would definitely be nice. I linked my GitHub repo where I made the changes necessary for this example in the YouTube video. Using the last commit (the only changes from upstream) as a guide, other datasets could be supported similarly.

That said, I don't think my choice of data is that bad actually. Apparently starting with a sorted dataset, appending some new items, and then sorting the result again is very commonly done by programmers. That is essentially what's represented in my example.


Maybe the UK Postcode lookup database?

It's about 650MB (CSV):

https://data.gov.uk/dataset/7ec10db7-c8f4-4a40-8d82-8921935b...

In theory, it's supposed to be updated every now and then, but that doesn't seem to have happened since 2015. :(

For a direct SQLite version of it, we have it on DBHub:

https://dbhub.io/justinclift/National%20Statistics%20Postcod...

That site is still in (slow) development though. YMMV. :)


Here's an idea: if you're on Linux, use this to sort the list of all paths on your file system (find / -xdev). It's a nice dataset since (a) it's a large list, (b) the comparisons are nontrivial, and (c) it's a realistic mix of somewhat-sorted-but-not-completely. (Note: if the file system already sorts everything this won't do the job, so it won't work on e.g. WSL.)
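
A rough way to build the same dataset from Python, if you'd rather not pipe find into a script (os.walk is only an approximation of `find / -xdev`, since it doesn't stop at filesystem boundaries):

  import os

  paths = []
  for root, dirs, files in os.walk("/"):
      paths.append(root)
      paths.extend(os.path.join(root, name) for name in files)

  print(len(paths), "paths collected")
  paths.sort()   # CPython's list.sort() is Timsort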


Sorting strings by generic comparison is very inefficient regardless of the algorithm you choose, especially if the long prefixes repeat, which is the case for FS paths.


I wasn't trying to suggest otherwise.


Cool video, thanks for the visualization!


Nice video, thanks for sharing this.


In Python 3.7, Timsort was sped up by another 40 to 75% for homogeneous lists (just some of the common cases).

It works by replacing Python's slow generic PyObject_RichCompareBool() with a type-specialized comparison function.

* https://docs.python.org/3/whatsnew/3.7.html#optimizations

* https://bugs.python.org/issue28685
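
A quick, unscientific way to see the effect from pure Python (illustrative only, not the benchmark from the issue): sort a homogeneous list of floats, which gets the specialised comparison, against the same values wrapped in a class whose comparisons go through __lt__:

  import random
  from timeit import timeit

  data = [random.random() for _ in range(100_000)]

  class Boxed:                          # forces the generic comparison path
      __slots__ = ("v",)
      def __init__(self, v): self.v = v
      def __lt__(self, other): return self.v < other.v

  boxed = [Boxed(v) for v in data]

  print("floats:", timeit(lambda: sorted(data), number=20))
  print("boxed :", timeit(lambda: sorted(boxed), number=20))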


For those with a similar train of thought, this PR describes the standard sorting algorithm used in Rust: https://github.com/rust-lang/rust/pull/38192


This is great. A simpler TimSort without galloping but keeping merging of runs. It needs a good name (or any name for that matter) so that it can get picked up and used elsewhere.

I'll suggest Pacesort (because the pace gait has smaller steps than a gallop and doesn't have many search results).


Should be Stjepan sort, didn't he invent this variant?


TimmaySort?


Timmay!


I'm not a fan of the anti-academic vibes in this post. Good technology can be born anywhere, but it's foolish to ignore the advantages of research.


Well, to be fair, I'm always concerned about those vat grown algorithms from a lab. I mean, are we sure they're safe?


I only trust real algos that are grown in a free range PHP forum.


Aye, but are they better than the local-sourced samples in the comments on the PHP manual website? 17 upvotes can't be wrong.


Hey! Author here! Thank you so much for posting this here! If you have any questions feel free to ask me


It was not apparent from your article how Timsort ensures stability. After doing some reading on my own I discovered that it does so by only reversing strictly decreasing runs, and by never reordering runs.

While your description covers both of those points[1], it didn't mention them together as contributing to stability.

1: Actually, rereading to make sure I didn't miss anything, you have a mistake: you say "Runs have to be strictly increasing or decreasing", while they have to be either monotonically increasing (i.e. non-decreasing) or strictly decreasing.

so: 1,2,2,3 is a run, but 3,2,2,1 is not.
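
In code, the rule looks roughly like this (a simplified sketch of run detection, not CPython's listobject.c, and without the minrun extension):

  def next_run(a, start):
      # Non-decreasing runs are kept as-is; only *strictly* decreasing runs
      # are reversed, so equal elements never swap places (stability).
      end = start + 1
      if end == len(a):
          return a[start:end]
      if a[end] < a[start]:                      # strictly decreasing run
          while end < len(a) and a[end] < a[end - 1]:
              end += 1
          return a[start:end][::-1]              # reversal can't reorder equals
      while end < len(a) and a[end] >= a[end - 1]:   # non-decreasing run
          end += 1
      return a[start:end]

  print(next_run([1, 2, 2, 3], 0))   # [1, 2, 2, 3] - one run
  print(next_run([3, 2, 2, 1], 0))   # [2, 3]       - run stops at the equal pair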


The exposition of the sorting algorithm is beautiful and clear. Thanks!

Your disparaging of academic work is utterly ridiculous. I am a mathematician working in academia with real-world data, developing algorithms that are used in industry. This sentence at the beginning of your article makes me think that you are not really aware of how the real world works.


Sure. Why are you highlighting that this wasn't "created in an academic laboratory"? Do you think your blogging approach of thinly veiled negativity toward established peer-reviewed works will give you more likeability among the general readership?


Is the implementation at the end of the article complete? I don't see anything that looks like the "galloping" behavior described just prior.
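
For reference, the galloping the article describes is roughly an exponential probe followed by a binary search; a simplified sketch (not the article's or CPython's code, and without the tighter range-narrowing the real thing does):

  from bisect import bisect_right

  def gallop_right(key, a, lo, hi):
      # Probe a[lo:hi] at offsets 1, 3, 7, 15, ... until we overshoot key,
      # then binary-search the narrowed range.  Keeping equal elements to
      # the left preserves stability during merges.
      offset = 1
      while lo + offset < hi and a[lo + offset - 1] <= key:
          offset = offset * 2 + 1
      return bisect_right(a, key, lo, min(lo + offset, hi))

  run = [1, 2, 4, 8, 16, 32, 64]
  print(gallop_right(20, run, 0, len(run)))   # 5: the first five elements are <= 20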


Ooh, it's actually not complete. I put that in as a placeholder while I tried to find the original source code for it. I didn't expect it to be shared anywhere. Thank you for bringing this to my attention; I'll see if I can find the source code again. Terribly sorry for doing the worst thing imaginable and publishing something that isn't complete!


> Since the algorithm has been invented it has been used as the default sorting algorithm in Python, Java, the Android Platform, and in versions of GNU.

I'm glad you mentioned how far the algorithm has spread to other languages since originating in Python. But, "versions of GNU" what?


I assume he means sort/gsort, from the coreutils.


http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/sort... says:

   Use a recursive divide-and-conquer algorithm, in the style
   suggested by Knuth volume 3 (2nd edition), exercise 5.2.4-23.  Use
   the optimization suggested by exercise 5.2.4-10; this requires room
   for only 1.5*N lines, rather than the usual 2*N lines.  Knuth
   writes that this memory optimization was originally published by
   D. A. Bell, Comp J. 1 (1958), 75.


Versions of GNU. Although it's rare to see this term used in this context these days, "GNU" was invented as the name of an operating system.


Where is TimSort used in a GNU operating system?

I suspect this line originates from the Wikipedia entry for TimSort with the line:

"Timsort has been Python's standard sorting algorithm since version 2.3. It is also used to sort arrays of non-primitive type in Java SE 7,[3] on the Android platform,[4] and in GNU Octave."

Notice how the order of Python, Java, Android, and GNU is the same? My tentative hypothesis is that "Octave" was dropped, and that "GNU" here means something more like "in the GNU project."

Also, the author uses a time-complexity table with Ω(), Θ(), and O() notations, but for "To learn about Big O notation" points the reader to another hackernoon article which only talks about O() and not big omega or big theta.


That raises the question, what does the Hurd kernel use timsort for?


In your example of reversing decreasing runs, you dropped a 9.


Thank you so much!


another minor typo:

> do the merging as soon as possible to exploit the run that the run just found is still high in the memory hierarchy

s/run/fact/ if I'm reading correctly

for the python timsort, the original source code is here: https://svn.python.org/projects/python/trunk/Objects/listobj..., although porting it to idiomatic python definitely seems nontrivial


Also, it looks like the pypy implementation is here: https://bitbucket.org/pypy/pypy/src/default/rpython/rlib/lis...



Oops, thanks! That's much better. Not sure why I stumbled on the svn link first.


Are there benchmarks available for this kind of algorithm? If you come up with a fast sorting algorithm how would you go about making a case it's faster than currently used ones on real world data?


You can go back and read the mailing list archives to see the benchmarks that were used by Python devs to decide on using Timsort: https://mail.python.org/pipermail/python-dev/2002-July/threa...

You can also read Tim Peters' notes about the sort: https://bugs.python.org/file4451/timsort.txt

That was back in 2002, so there's certainly been a lot of scrutiny since then. In particular, I've never actually read about Java sorting benchmarks, but there are tremendously sophisticated benchmarking tools/benchmarking experts in the Java world, so I imagine that it got a lot of analysis there.


It's worth noting that Java doesn't exclusively use Timsort. Depending on the data type and heuristics based mostly on the size of the input array, a simpler sort such as counting sort may be used.

You can find a more detailed summary at:

https://stackoverflow.com/a/41129231/149138


Cool find. I'd never actually realized that.


I'd hope people using it wouldn't have chosen it unless they had measured performance -- doubly so if they switched to it from something else.

On the other hand, common third-party benchmarks may not be useful for measuring these sorts of improvements, which take advantage of common input distribution biases. "Real world workloads" is really too much of a simplification -- people sort all shapes and sizes of data.

When JS engines started getting fast(er), every engine made their own preferred benchmark because they said existing ones didn't adequately show the real-world gains their engines were providing.

We shouldn't be reflexively cynical about that. If they have made good-faith measurements, it is likely that old benchmarks were "unrepresentative" in some way. Again, "real world benchmarking" is hard, and that applies just as much to yesterday's benchmarks as today's.


Probably more than you can shake a stick at: https://www.postgresql.org/message-id/CAEYLb_W%2B%2BUhrcWprz... :)


I wrote a personal benchmark suite specifically to compare the Javascript port of Timsort to my own 'pet' Javascript sorting module 'Barsort.js' which was specialised to sorting Numbers, while Timsort also does strings quickly and takes custom comparators.

I tested these distributions for arrays of 100 to 1000000 elements:

  descending ints
  ascending ints
  descending ints with maverick vals
  ascending ints with maverick vals
  equally distributed reals from -20 to 20
  gaussian distribution -1 to 1
  midspiked reals with duplicates -200 to 200
  huge magnitude with gaps and duplicates and infinites
  equal-dist of ints 1 or 2
  equal-dist of ints 1 2 3
  equal-dist of ints 1 2 3 4
  equal-dist of ints 1 2 3 4 5
Timsort seemed relatively slow with gaussian/normal and equal distributions, but was still 10x or more faster than the browsers' native sort at the time.

I was able to extend my pet sorting module to be quite competitive with Timsort's Javascript port. It was a captivating side project that I escaped from as soon as the tests passed, so its code is not in a decent state, but it's here - https://github.com/strainer/barsort



> If you come up with a fast sorting algorithm how would you go about making a case it's faster than currently used ones on real world data?

For analysing sorting algorithms - and the related things that leads into - there are established ways to determine their best, average, and worst-case runtimes for input of a given size.

If you're interested in learning this stuff, the videos here are useful and not math-heavy:

https://www.coursera.org/learn/algorithms-part1

Technically it's explained using Java, but it's mostly done using primitive data types (e.g. int, float, byte, ...).

So it's very easy to follow along and copy the implementation, but translated to a different language (Go in my case, as I don't do java at all).


Well, I think that ship has already sailed: Timsort is used in Java, on Android, and in Python, so it's already seen massive adoption.



In the film _Heist_, Gene Hackman's character is asked how he pulled something off. "I tried to imagine someone smarter than myself, and then I thought: What would he do?"

When I need to pull something off, I ask myself: What would Tim do?


I ported timsort to C#/dotnet recently. Available in this nuget:

https://www.nuget.org/packages/Redzen/

Source code here (there are three variants):

https://github.com/colgreen/Redzen/tree/master/Redzen/Sortin...

None of the standard framework sort methods perform a stable sort so it's handy for that, as well as being a lot faster for some data, i.e. with pre-sorted runs. Otherwise the framework now uses introsort which is pretty good, so you should definitely performance test with both. I use both sorts depending on the context and I usually end up using the built in introsort, with timsort used for a few special cases.


It's a great piece of engineering, and because of the tongue-in-cheek name it might be confused with a joke sorting algorithm... my favorite of which is sleep sort.

https://www.geeksforgeeks.org/sleep-sort-king-laziness-sorti...
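
For anyone who hasn't seen it, sleep sort in a few lines (a joke, obviously: it only "works" for small non-negative numbers, and scheduling jitter can still reorder close values):

  import threading, time

  def sleep_sort(values):
      result, lock = [], threading.Lock()

      def worker(v):
          time.sleep(v * 0.01)          # sleep proportionally to the value
          with lock:
              result.append(v)

      threads = [threading.Thread(target=worker, args=(v,)) for v in values]
      for t in threads: t.start()
      for t in threads: t.join()
      return result

  print(sleep_sort([3, 1, 2, 0, 5]))    # usually [0, 1, 2, 3, 5]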


> Timsort is a sorting algorithm that is efficient for real-world data and not created in an academic laboratory.

It's always a bit sad to see academia painted as an ivory tower where researchers generate unusable knowledge. Academia is not only about finding theoretically pure solutions but often also about finding solutions to real-world problems. For instance, machine learning, logic design, networking algorithms, and cryptography all deeply depend on work done in academia (and not just “theoretical” work).


In addition, in Tim's post on the python mailing list he says he developed the algorithm after reading through a number of papers on sorting[1]. So while the algorithm was developed with the goal of being practical, plenty of academic research was used to develop it.

[1]: https://mail.python.org/pipermail/python-dev/2002-July/02683...


Reading the mailing list end-to-end actually shows how much research Tim did before designing it. To suggest that the "academics" had no clue is extremely disingenuous. Worse, it misleads the next generation of great would-be algorithm designers into thinking that studying the research is foolish. Study of both the theoretical and the practical are hugely important; ignore either at your peril.


I believe Tim Peters was trained as a mathematician. He is good at analyzing algorithms; he often posts on the Python lists about the subtleties of floating point behavior. As you say, he read much of the recent research on sorting before coming up with TimSort. His special ability, I think, is making the "engineering" tradeoffs to come up with an algorithm that works well on real-world data.


You believe or you know? Nothing I've read suggests that. Please don't advance ideas like that without giving proper justification; this is exactly how myths start.


FWIW, mergesort (von Neumann, 1940s, Manhattan Project), Shell sort (Shell, General Electric Company, 1959), and quicksort (Hoare, Elliott brothers Ltd, 1961) were not created in an academic laboratory.


OK, but the OP's point was that academia is not simply some free-floating institution avoiding usefulness. The fact that some sorting algorithms were developed outside of a university doesn't contradict that.

Also both von Neumann and Hoare worked as professors. Von Neumann was a polymath and published plenty of academic papers. Hoare is still a professor. Are you implying that their work in academia had no value in developing those algorithms?


I think the comment in the essay was a gratuitous and needless dismissal of academic research. All modern improvements to sorting, including TimSort, have been influenced by academic research.

I think the most commonly used sorting algorithms were made in the first days of computing, before CS was even established as a field, making the point irrelevant. People don't even care to know what algorithm they are using. How many know when they are using introsort, even though that was developed by an academic?

I think your objection points out the needless specificity of the original claim. "Created in an academic laboratory" requires that the person be in an academic lab, thus excluding (say) academics who are on sabbatical at (say) a national lab.

Since there were few academic laboratories in the first few decades of computers, it's rather pointless.

While Hoare did go on to become a professor, if Tim Peters were to get a professorship, would that make a difference?


> machine learning, logic design, networking algorithms, cryptography, all deeply depend on work done in academia (and not just “theoretical” work)

All of these were highly theoretical fields long before we found concrete applications. It often took one or more people an entire lifetime to convince others that these were valuable subjects to pursue.

> Academia is not only about finding theoretical

No, but it has to start there. Theoretical science is how we find new things to turn into applied science. It took 45+ years between Heinrich Hertz demonstrating electromagnetic waves and RADAR being put into use, and 33 years from when Einstein published mass–energy equivalence (E=mc2) to Otto Hahn cracking nuclear fission, but only a few more years until Robert Oppenheimer and the Manhattan Project turned that into the first atom bomb.


> No, but it has to start there.

That's simply not true. There's a loop: sometimes industry comes up with a specific innovation and academia generalises it; other times academia comes up with an idea and it's implemented in industry (often a commercialisation effort by the inventors). And sometimes academics are employed by industry and there's a hybrid. There's no one true way.


Let's add all the work on finding decision procedures for all of mathematics to that list


I ran the code and it gave me [2, 3, 5, 6, 7], missing the 1. Appending the_array[i] right after clearing the run with `new_run = []` seems to fix the issue.
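
For anyone else who hits this, here's a hypothetical reconstruction of the run-splitting loop being described (the article's actual code isn't shown in this thread), with the fix applied; `the_array` and `new_run` follow the names in the comment above:

  def find_runs(the_array):
      runs = []
      new_run = [the_array[0]]
      for i in range(1, len(the_array)):
          if the_array[i] < the_array[i - 1]:   # current run ends here
              runs.append(new_run)
              new_run = []
              new_run.append(the_array[i])      # the fix: keep the element that broke the run
          else:
              new_run.append(the_array[i])
      runs.append(new_run)
      return runs

  print(find_runs([2, 3, 1, 5, 6, 7]))  # [[2, 3], [1, 5, 6, 7]] - the 1 is no longer dropped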


Any sorting algorithms that use deep learning?


You could of course write one. Presumably you're talking about using deep learning to decide the order in which to compare elements, but it would be difficult to ensure that it both always terminates and always gives correct results, to say nothing of it likely being very inefficient.

The difficulties in sorting algorithms aren't difficulties in recognizing patterns, so it's difficult to see what deep learning brings to the table.

That being said, knock yourself out; surprise me!


For creating custom comparators?

"Sort these web pages in order of quality."


I've never seen a reference text that considers the comparator part of the sorting algorithm.

Timsort, quicksort, mergesort, introsort, smoothsort, etc. don't include the comparator as part of the algorithm definition.


You mean like a search engine? :-)

I believe ML is used in search engines nowadays.


Maybe finding in the input data.


"Finding" what in the input data as a part of sorting?


You could always just write out all permutations and train a network on recognizing the sorted one ;-)


A storage requirement of O(n!).

Sorting a mere 13 one-byte items means 13! ≈ 6.2 billion permutations; written out at 13 bytes each, that's around 80 GB of storage.


We just generate them on the fly - after all we're not stupid ;-)

https://en.wikipedia.org/wiki/Permutation#Algorithms_to_gene...

.. and there

"Generation in lexicographic order"

This algorithm only requires you to hold the last permutation in memory to obtain a new (distinct) one.
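
In Python, that's the classic "next permutation" step; a minimal sketch:

  def next_permutation(p):
      # Return the next permutation of p in lexicographic order, or None if
      # p is already the last one.  Only the current permutation is held in memory.
      p = list(p)
      i = len(p) - 2
      while i >= 0 and p[i] >= p[i + 1]:   # find the rightmost ascent
          i -= 1
      if i < 0:
          return None
      j = len(p) - 1
      while p[j] <= p[i]:                  # rightmost element larger than p[i]
          j -= 1
      p[i], p[j] = p[j], p[i]
      p[i + 1:] = reversed(p[i + 1:])      # smallest possible tail
      return p

  print(next_permutation([1, 2, 3]))  # [1, 3, 2]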


Good interview question ;)



