My use of "made" gives me way more credit than I'm due. Really I just hacked others' work to use another data source. See the description in the video for more info on how it was produced.
My example of real world data is data in which the first 90% is already sorted and the next 10% is random. Whether this is real enough is questionable, but it does demonstrate how Timsort automatically takes advantage of runs of sorted data.
It would be nice to see a comparison working on truly real-world data. Perhaps from an open dataset.
That said, I don't think my choice of data is that bad actually. Apparently starting with a sorted dataset, appending some new items, and then sorting the result again is very commonly done by programmers. That is essentially what's represented in my example.
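For anyone who wants to reproduce that kind of input, here's a minimal sketch of generating "90% sorted, 10% random" data (the function name and parameters are mine, not from the demo):

```python
import random

def make_mostly_sorted(n, sorted_frac=0.9, seed=0):
    """First `sorted_frac` of the list is already in order,
    the rest simulates newly appended, unsorted items."""
    rng = random.Random(seed)
    head = sorted(rng.random() for _ in range(int(n * sorted_frac)))
    tail = [rng.random() for _ in range(n - len(head))]
    return head + tail

data = make_mostly_sorted(100_000)
# Timsort detects the long ascending run in the first 90%
# and mostly just has to merge in the shuffled tail.
result = sorted(data)
```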
It's about 650MB (CSV):
In theory, it's supposed to be updated every now and then, but doesn't seem to have happened since 2015. :(
For a direct SQLite version of it, we have it on DBHub:
That site is still in (slow) development though. YMMV. :)
It works by replacing Python's slow generic PyObject_RichCompareBool() with a type-specialized comparison function.
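The real optimization lives in CPython's C code (listobject.c), where the list is pre-scanned once and, if it's type-homogeneous, a specialized compare like unsafe_long_compare is swapped in. A toy Python-level sketch of the same idea (names and structure are mine, not CPython's):

```python
import operator

def pick_compare(items):
    # One pass up front, like CPython's pre-sort check: if every element
    # is an int, later comparisons can skip generic dispatch entirely.
    if all(type(x) is int for x in items):
        return operator.lt          # stand-in for a specialized int compare
    return lambda x, y: x < y       # generic rich-comparison path

def binary_insertion_sort(items):
    """Tiny sort that picks its comparator once, then reuses it."""
    lt = pick_compare(items)
    out = []
    for x in items:
        # binary search for the insertion point using the chosen comparator
        lo, hi = 0, len(out)
        while lo < hi:
            mid = (lo + hi) // 2
            if lt(x, out[mid]):
                hi = mid
            else:
                lo = mid + 1
        out.insert(lo, x)
    return out
```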
I'll suggest Pacesort (because the pace gait has smaller steps than a gallop and doesn't have many search results).
While your description covers both of those points, it doesn't mention them together as contributing to stability.
1: Actually, rereading to make sure I didn't miss anything, there is a mistake: you say "Runs have to be strictly increasing or decreasing", while they have to be either monotonically increasing (i.e. non-decreasing) or strictly decreasing.
so: 1,2,2,3 is a run, but 3,2,2,1 is not.
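The rule can be sketched as a simplified version of Timsort's run detection (the real count_run in CPython also remembers whether the run descends so it can be reversed stably; this sketch just returns the length):

```python
def count_run(a, lo=0):
    """Length of the run starting at lo: non-decreasing (equal elements
    allowed), or strictly decreasing (so reversing it stays stable)."""
    n = len(a)
    if lo >= n - 1:
        return n - lo
    i = lo + 1
    if a[i] < a[lo]:                  # strictly decreasing run
        while i + 1 < n and a[i + 1] < a[i]:
            i += 1
    else:                             # non-decreasing run
        while i + 1 < n and a[i + 1] >= a[i]:
            i += 1
    return i + 1 - lo

count_run([1, 2, 2, 3])   # 4: equal elements extend an ascending run
count_run([3, 2, 2, 1])   # 2: equal elements end a decreasing run
```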
Your disparaging of academic work is utterly ridiculous. I am a mathematician working in academia with real world data, developing algorithms that are used in industry. This sentence at the beginning of your article makes me think that you are not really aware of how the real world works.
I'm glad you mentioned how far the algorithm has spread to other languages since originating in Python. But, "versions of GNU" what?
Use a recursive divide-and-conquer algorithm, in the style
suggested by Knuth volume 3 (2nd edition), exercise 5.2.4-23. Use
the optimization suggested by exercise 5.2.4-10; this requires room
for only 1.5*N lines, rather than the usual 2*N lines. Knuth
writes that this memory optimization was originally published by
D. A. Bell, Comp J. 1 (1958), 75.
I suspect this line originates from the Wikipedia entry for Timsort, which says:
"Timsort has been Python's standard sorting algorithm since version 2.3. It is also used to sort arrays of non-primitive type in Java SE 7, on the Android platform, and in GNU Octave."
Notice how the order of Python, Java, Android, and GNU is the same? My tentative hypothesis is that "Octave" was dropped, and that "GNU" here means something more like "in the GNU project."
Also, the author uses a time complexity table with Ω(), Θ(), and O() notations, but for learning about Big O notation links to another hackernoon article that only covers O(), not big omega or big theta.
> do the merging as soon as possible to exploit the run that the run just found is still high in the memory hierarchy
s/run/fact/ if I'm reading correctly
for the python timsort, the original source code is here: https://svn.python.org/projects/python/trunk/Objects/listobj..., although porting it to idiomatic python definitely seems nontrivial
You can also read Tim Peters's notes about the sort: https://bugs.python.org/file4451/timsort.txt
That was back in 2002, so there's certainly been a lot of scrutiny since then. I've never actually read any Java sorting benchmarks, but there are tremendously sophisticated benchmarking tools and benchmarking experts in the Java world, so I imagine it got a lot of analysis there.
You can find a more detailed summary at:
On the other hand, common third-party benchmarks may not be useful for measuring these sorts of improvements, which take advantage of common input distribution biases. "Real world workloads" is really too much of a simplification -- people sort all shapes and sizes of data.
When JS engines started getting fast(er), every engine made their own preferred benchmark because they said existing ones didn't adequately show the real-world gains their engines were providing.
We shouldn't be reflexively cynical about that. If they have made good-faith measurements, it is likely that old benchmarks were "unrepresentative" in some way. Again, "real world benchmarking" is hard, and that applies just as much to yesterday's benchmarks as today's.
I tested these distributions for arrays of 100 to 1000000 elements:
descending ints with maverick values
ascending ints with maverick values
equally distributed reals from -20 to 20
gaussian distribution -1 to 1
midspiked reals with duplicates -200 to 200
huge magnitudes with gaps, duplicates, and infinities
equal-dist of ints 1 or 2
equal-dist of ints 1 2 3
equal-dist of ints 1 2 3 4
equal-dist of ints 1 2 3 4 5
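The distribution names above are the commenter's own; a hedged sketch of how a few of them might be generated (the parameters here are guesses, not the commenter's actual harness):

```python
import random

rng = random.Random(42)

def ascending_with_mavericks(n, p=0.01):
    # mostly ascending ints, with occasional out-of-place "maverick" values
    return [rng.randint(-10**6, 10**6) if rng.random() < p else i
            for i in range(n)]

def uniform_reals(n, lo=-20.0, hi=20.0):
    # equally distributed reals from -20 to 20
    return [rng.uniform(lo, hi) for _ in range(n)]

def gaussian_clipped(n):
    # gaussian distribution clipped to [-1, 1]
    return [max(-1.0, min(1.0, rng.gauss(0, 0.4))) for _ in range(n)]

def small_alphabet(n, k):
    # equally distributed ints from 1..k (k = 2, 3, 4, 5 above)
    return [rng.randint(1, k) for _ in range(n)]
```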
For analysing sorting algorithms - and the related topics that leads into - there are established ways to determine their best, average, and worst case runtimes for input of a given size.
If you're interested in learning this stuff, the videos here are useful and not math heavy:
Technically it's explained using Java. But it's mostly done using primitive data types (e.g. int, float, byte, ...).
So it's very easy to follow along and copy the implementation, but translated to a different language (Go in my case, as I don't do Java at all).
When I need to pull something off, I ask myself: What would Tim do?
Source code here (there are three variants):
None of the standard framework sort methods perform a stable sort so it's handy for that, as well as being a lot faster for some data, i.e. with pre-sorted runs. Otherwise the framework now uses introsort which is pretty good, so you should definitely performance test with both. I use both sorts depending on the context and I usually end up using the built in introsort, with timsort used for a few special cases.
It's always a bit sad to see academia portrayed as an ivory tower where researchers generate unusable knowledge. Academia is not only about finding theoretically pure solutions but often also about finding solutions to real-world problems. For instance, machine learning, logic design, networking algorithms, cryptography, all deeply depend on work done in academia (and not just “theoretical” work).
Also both von Neumann and Hoare worked as professors. Von Neumann was a polymath and published plenty of academic papers. Hoare is still a professor. Are you implying that their work in academia had no value in developing those algorithms?
I think the most commonly used sorting algorithms were made in the first days of computing, before CS was even established as a field, making the point irrelevant. People don't even care to know what algorithm they are using. How many know when they are using introsort, even though that was developed by an academic?
I think your objection points out the needless specificity of the original claim. "Created in an academic laboratory" requires that the person be in an academic lab, thus excluding (say) academics who are on sabbatical at (say) a national lab.
Since there were few academic laboratories in the first few decades of computers, it's rather pointless.
While Hoare did go on to become a professor, if Tim Peters were to get a professorship, would that make a difference?
All highly theoretical fields long before we found concrete applications. It often took one or more people an entire lifetime to convince others that these were valuable subjects to pursue.
> Academia is not only about finding theoretical
No, but it has to start there. Theoretical science is how we find new things to turn into applied science. It took 45+ years between Heinrich Hertz's demonstration of electromagnetic waves and RADAR being put into use, and 33 years from when Einstein published mass–energy equivalence (E=mc²) until Otto Hahn cracked nuclear fission, but only a few more years until Robert Oppenheimer and the Manhattan Project turned that into the first atom bomb.
That's simply not true. There's a loop: sometimes industry comes up with a specific innovation and academia generalizes it; other times academia comes up with an idea and industry implements it (often a commercialisation effort by the inventors). And sometimes academics are employed by industry and there's a hybrid. There's no one true way.
The difficulties in sorting algorithms aren't difficulties in recognizing patterns, so it's difficult to see what deep learning brings to the table.
That being said, knock yourself out; surprise me!
"Sort these web pages in order of quality."
Timsort, quicksort, mergesort, introsort, smoothsort, etc. don't include the comparator as part of the algorithm definition.
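That separation is visible in any sorting API: the algorithm takes the ordering as a parameter and doesn't care how it's computed. A sketch in Python (the page names and quality scores are invented for illustration):

```python
from functools import cmp_to_key

pages = ["b.html", "a.html", "c.html"]
quality = {"a.html": 0.9, "b.html": 0.5, "c.html": 0.7}  # hypothetical scores

def by_quality(x, y):
    # higher quality first; the sort algorithm itself never sees
    # how these scores were produced (could be hand-tuned, could be ML)
    return (quality[y] > quality[x]) - (quality[y] < quality[x])

sorted(pages, key=cmp_to_key(by_quality))
# → ['a.html', 'c.html', 'b.html']
```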
I believe ML is used in search engines nowadays.
Sorting a mere 13 items of 1-byte each would require 6.2 GB of storage.
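The arithmetic behind that (assuming the 6.2 GB figure counts one byte per permutation):

```python
import math

perms = math.factorial(13)   # 6_227_020_800 orderings of 13 items
gb = perms / 1e9             # ≈ 6.2 GB at one byte per permutation
```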
.. and there
"Generation in lexicographic order"
This algorithm only requires you to hold the last permutation in memory to obtain a new (distinct) one.
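The standard next-lexicographic-permutation step (essentially Knuth's Algorithm L), sketched in Python to show that only the current permutation is ever held in memory:

```python
def next_permutation(a):
    """Advance a to its next permutation in lexicographic order, in place.
    Returns False when a is already the last (fully descending) one."""
    # 1. find the rightmost ascent a[i] < a[i+1]
    i = len(a) - 2
    while i >= 0 and a[i] >= a[i + 1]:
        i -= 1
    if i < 0:
        return False
    # 2. find the rightmost element greater than a[i] and swap
    j = len(a) - 1
    while a[j] <= a[i]:
        j -= 1
    a[i], a[j] = a[j], a[i]
    # 3. reverse the (descending) suffix to make it the smallest possible
    a[i + 1:] = reversed(a[i + 1:])
    return True

p = [1, 2, 3]
next_permutation(p)   # p is now [1, 3, 2]
```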