Hacker News
Timsort is a sorting algorithm that is efficient for real-world data (hackernoon.com)
260 points by signa11 on July 1, 2018 | 77 comments

Recently I "made" [0] one of those visualized sorting videos for Timsort on real-world data. The point was to give an example of data that was already mostly sorted and show how fast Timsort handles it. It may be of interest here:


[0] My use of "made" gives me way more credit than I'm due. Really I just hacked others' work to use another data source. See the description in the video for more info on how it was produced.

I feel like this is important to note:

My example of real world data is data in which the first 90% is already sorted and the next 10% is random. Whether this is real enough is questionable, but it does demonstrate how Timsort automatically takes advantage of runs of sorted data.
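That input shape is easy to reproduce with Python's built-in sorted(), which is Timsort in CPython. A minimal sketch (exact timings will vary by machine, but the mostly-sorted input should come out noticeably faster):

```python
import random
import time

def make_mostly_sorted(n, sorted_fraction=0.9):
    """First 90% already ascending, last 10% random values."""
    cut = int(n * sorted_fraction)
    return list(range(cut)) + [random.randrange(n) for _ in range(n - cut)]

n = 1_000_000
mostly_sorted = make_mostly_sorted(n)
fully_random = [random.randrange(n) for _ in range(n)]

for label, data in [("mostly sorted", mostly_sorted), ("fully random", fully_random)]:
    start = time.perf_counter()
    sorted(data)  # CPython's sorted() is Timsort
    print(f"{label}: {time.perf_counter() - start:.3f}s")
```

The long ascending prefix is detected as a single run, so Timsort only has real work to do on the random tail.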

It would be nice to see a comparison working on truly real-world data. Perhaps from an open dataset.

I have been experimenting with lidar datasets lately. In lidar you work with huge point clouds, from millions to billions of points. For some algorithms, such as convex hull, you need to sort the points, for example by x-coordinate. I did some quick tests with quicksort, merge sort, and Timsort. In my experiments quicksort was the slowest at 5.5 seconds on the chosen dataset, merge sort took 5 seconds, and Timsort was around 4 seconds.

When sorting coordinates, radix sort with an appropriately chosen radix (often but not always 256) is usually fastest.
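For non-negative integer coordinates, an LSD radix sort with radix 256 looks roughly like this (a sketch; real lidar coordinates are often floats, which would first need an order-preserving mapping to integer keys):

```python
def radix_sort_256(values):
    """LSD radix sort with radix 256 for non-negative integers.

    Processes one byte per pass, least significant first. Each pass
    is stable, so the list is fully sorted after the final pass.
    """
    if not values:
        return []
    passes = (max(values).bit_length() + 7) // 8
    for shift in range(0, passes * 8, 8):
        buckets = [[] for _ in range(256)]
        for v in values:
            buckets[(v >> shift) & 0xFF].append(v)
        values = [v for bucket in buckets for v in bucket]
    return values
```

The number of passes depends on the magnitude of the largest key rather than on n log n comparisons, which is why it tends to win on large numeric datasets.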

Hi there, thanks. Need to try this

Yeah, seeing more datasets would definitely be nice. I linked my GitHub repo where I made the changes necessary for this example in the YouTube video. Using the last commit (the only changes from upstream) as a guide, other datasets could be supported similarly.

That said, I don't think my choice of data is that bad actually. Apparently starting with a sorted dataset, appending some new items, and then sorting the result again is very commonly done by programmers. That is essentially what's represented in my example.

Maybe the UK Postcode lookup database?

It's about 650MB (CSV):


In theory, it's supposed to be updated every now and then, but doesn't seem to have happened since 2015. :(

For a direct SQLite version of it, we have it on DBHub:


That site is still in (slow) development though. YMMV. :)

Here's an idea: if you're on Linux, use this to sort the list of all paths on your file system (find / -xdev). It's a nice dataset since it's (a) a large list, (b) comparisons are nontrivial, and (c) it's a realistic mix of somewhat-sorted-but-not-completely. (Note: If the file system already sorts everything this won't do the job, so it won't work on e.g. WSL.)

Sorting strings by generic comparison is very inefficient regardless of the algorithm you choose, especially if the long prefixes repeat, which is the case for FS paths.

I wasn't trying to suggest otherwise.

Cool video, thanks for the visualization!

Nice video, thanks for sharing this.

In Python 3.7, Timsort was sped up by another 40 to 75% for homogeneous lists (covering some of the common cases).

It works by replacing Python's slow generic PyObject_RichCompareBool() with a type-specialized comparison function.

* https://docs.python.org/3/whatsnew/3.7.html#optimizations

* https://bugs.python.org/issue28685
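The fast path only kicks in when a pre-scan finds that every element has the same type. One rough way to see the effect (assuming CPython 3.7+; exact numbers vary by machine and version) is to poison an otherwise homogeneous list with a single element of a different type:

```python
import random
import timeit

n = 100_000
homogeneous = [random.random() for _ in range(n)]  # all floats
heterogeneous = homogeneous[:-1] + [1]             # one int disables the fast path

t_homo = timeit.timeit(lambda: sorted(homogeneous), number=20)
t_hetero = timeit.timeit(lambda: sorted(heterogeneous), number=20)
print(f"homogeneous: {t_homo:.3f}s, mixed: {t_hetero:.3f}s")
```

The mixed list falls back to the generic comparison machinery, so on 3.7+ it should measurably trail the all-float list.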

For those with a similar train of thought, this PR describes the standard sorting algorithm used in Rust: https://github.com/rust-lang/rust/pull/38192

This is great. A simpler TimSort without galloping but keeping merging of runs. It needs a good name (or any name for that matter) so that it can get picked up and used elsewhere.

I'll suggest Pacesort (because the pace gait has smaller steps than a gallop and doesn't have many search results).

Should be Stjepan sort, didn't he invent this variant?



I'm not a fan of the anti-academic vibes in this post. Good technology can be born anywhere, but it's foolish to ignore the advantages of research.

Well, to be fair, I'm always concerned about those vat grown algorithms from a lab. I mean, are we sure they're safe?

I only trust real algos that are grown in a free range PHP forum.

Aye, but are they better than the local-sourced samples in the comments on the PHP manual website? 17 upvotes can't be wrong.

Hey! Author here! Thank you so much for posting this here! If you have any questions feel free to ask me

It was not apparent from your article how Timsort ensures stability. After doing some reading on my own, I discovered that it does so by only reversing strictly decreasing runs and by never reordering runs.

While your description covers both of those points[1], it didn't mention them together as contributing to stability.

1: Actually rereading to make sure I didn't miss anything, you have a mistake, as you say: "Runs have to be strictly increasing or decreasing" while they have to be either monotonically increasing (i.e. non-decreasing) or strictly decreasing.

so: 1,2,2,3 is a run, but 3,2,2,1 is not.
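The reason for the asymmetry: a strictly decreasing run can be reversed in place without ever swapping equal elements, so stability is preserved; if runs containing equal neighbors were also reversed, equal elements would change relative order. A sketch of the run-detection rule in Python:

```python
def find_run(a, start):
    """Return the end index (exclusive) of the run beginning at a[start].

    A run is either non-decreasing (a[i] <= a[i+1]) or strictly
    decreasing (a[i] > a[i+1]). Only strictly decreasing runs are
    reversed, which is what keeps Timsort stable.
    """
    end = start + 1
    if end == len(a):
        return end
    if a[start] <= a[end]:  # non-decreasing run
        while end < len(a) and a[end - 1] <= a[end]:
            end += 1
    else:                   # strictly decreasing run
        while end < len(a) and a[end - 1] > a[end]:
            end += 1
    return end
```

So 1,2,2,3 is detected as one run of length 4, while 3,2,2,1 stops at the equal pair and yields a run of length 2.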

The exposition of the sorting algorithm is beautiful and clear. Thanks!

Your disparaging of academic work is utterly ridiculous. I am a mathematician working in academia with real-world data, developing algorithms that are used in industry. This sentence at the beginning of your article makes me think that you are not really aware of how the real world works.

Sure. Why are you highlighting this wasn't "created in an academic laboratory". Do you think your blogging approach of thinly veiled negativity toward established peer-reviewed works will give you more likeability among the general readership?

Is the implementation at the end of the article complete? I don't see anything that looks like the "galloping" behavior described just prior.

Ooh, it's actually not complete. I put that in as a placeholder while I tried to find the original source code for it; I didn't expect it to be shared anywhere. Thank you for bringing this to my attention. I'll see if I can find the source code again. Terribly sorry for doing the worst thing imaginable and publishing something that isn't complete!

> Since the algorithm has been invented it has been used as the default sorting algorithm in Python, Java, the Android Platform, and in versions of GNU.

I'm glad you mentioned how far the algorithm has spread to other languages since originating in Python. But, "versions of GNU" what?

I assume he means sort/gsort, from the coreutils.

http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/sort... says:

   Use a recursive divide-and-conquer algorithm, in the style
   suggested by Knuth volume 3 (2nd edition), exercise 5.2.4-23.  Use
   the optimization suggested by exercise 5.2.4-10; this requires room
   for only 1.5*N lines, rather than the usual 2*N lines.  Knuth
   writes that this memory optimization was originally published by
   D. A. Bell, Comp J. 1 (1958), 75.

Versions of GNU. Although it's rare to see the term used in this context these days, "GNU" was coined as the name of an operating system.

Where is TimSort used in a GNU operating system?

I suspect this line originates from the Wikipedia entry for TimSort with the line:

"Timsort has been Python's standard sorting algorithm since version 2.3. It is also used to sort arrays of non-primitive type in Java SE 7,[3] on the Android platform,[4] and in GNU Octave."

Notice how the order of Python, Java, Android, and GNU is the same? My tentative hypothesis is that "Octave" was dropped, and that "GNU" here means something more like "in the GNU project."

Also, the author uses a time complexity table with Ω(), Θ(), and O() notations, but suggests "To learn about Big O notation", go to another hackernoon article which only talks about O() and not big omega or big theta notations.

That raises the question, what does the Hurd kernel use timsort for?

In your example of reversing decreasing runs, you dropped a 9.

Thank you so much!

another minor typo:

> do the merging as soon as possible to exploit the run that the run just found is still high in the memory hierarchy

s/run/fact/ if I'm reading correctly

for the python timsort, the original source code is here: https://svn.python.org/projects/python/trunk/Objects/listobj..., although porting it to idiomatic python definitely seems nontrivial

Also, it looks like the pypy implementation is here: https://bitbucket.org/pypy/pypy/src/default/rpython/rlib/lis...

Oops, thanks! That's much better. Not sure why I stumbled on the svn link first.

Are there benchmarks available for this kind of algorithm? If you come up with a fast sorting algorithm how would you go about making a case it's faster than currently used ones on real world data?

You can go back and read the mailing list archives to see the benchmarks that were used by Python devs to decide on using Timsort: https://mail.python.org/pipermail/python-dev/2002-July/threa...

You can also read Tim Peter's notes about the sort: https://bugs.python.org/file4451/timsort.txt

That was back in 2002, so there's certainly been a lot of scrutiny since then. In particular, I've never actually read about Java sorting benchmarks, but there are tremendously sophisticated benchmarking tools/benchmarking experts in the Java world, so I imagine that it got a lot of analysis there.

It's worth noting that Java doesn't exclusively use Timsort. Depending on the data type and heuristics based mostly on the size of the input array, a simpler sort such as counting sort may be used.

You can find a more detailed summary at:


Cool find. I'd never actually realized that.

I'd hope people using it wouldn't have chosen it unless they had measured performance -- doubly so if they switched to it from something else.

On the other hand, common third-party benchmarks may not be useful for measuring these sorts of improvements, which take advantage of common input distribution biases. "Real world workloads" is really too much of a simplification -- people sort all shapes and sizes of data.

When JS engines started getting fast(er), every engine made their own preferred benchmark because they said existing ones didn't adequately show the real-world gains their engines were providing.

We shouldn't be reflexively cynical about that. If they have made good-faith measurements, it is likely that old benchmarks were "unrepresentative" in some way. Again, "real world benchmarking" is hard, and that applies just as much to yesterday's benchmarks as today's.

Probably more than you can shake a stick at: https://www.postgresql.org/message-id/CAEYLb_W%2B%2BUhrcWprz... :)

I wrote a personal benchmark suite specifically to compare the Javascript port of Timsort to my own 'pet' Javascript sorting module 'Barsort.js' which was specialised to sorting Numbers, while Timsort also does strings quickly and takes custom comparators.

I tested these distributions for arrays of 100 to 1000000 elements:

  descending ints
  ascending ints
  descending ints with maverick vals
  ascending ints with maverick vals
  equally distributed reals from -20 to 20
  gaussian distribution -1 to 1
  midspiked reals with duplicates -200 to 200
  huge magnitude with gaps and duplicates and infinites
  equal-dist of ints 1 or 2
  equal-dist of ints 1 2 3
  equal-dist of ints 1 2 3 4
  equal-dist of ints 1 2 3 4 5
Timsort seemed relatively slow with the gaussian/normal and equal distributions, but was still 10x or more faster than the browsers' native sort at the time.

I was able to extend my pet sorting module to be quite competitive with Timsort's Javascript port. It was a captivating side project that I escaped from as soon as the tests passed, so its code isn't in a decent state, but it's here - https://github.com/strainer/barsort
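For anyone wanting to try something similar, a minimal harness along these lines works (the distributions and sizes here are illustrative, not the ones from Barsort's suite):

```python
import random
import timeit

def distributions(n):
    """A few input shapes worth benchmarking (illustrative subset)."""
    return {
        "ascending": list(range(n)),
        "descending": list(range(n, 0, -1)),
        "gaussian": [random.gauss(0, 1) for _ in range(n)],
        "few values": [random.randint(1, 4) for _ in range(n)],
    }

def benchmark(sort_fn, n=10_000, repeats=5):
    results = {}
    for name, data in distributions(n).items():
        # copy per run so the sort always sees the original distribution
        results[name] = min(
            timeit.timeit(lambda: sort_fn(list(data)), number=1)
            for _ in range(repeats)
        )
    return results

for name, t in benchmark(sorted).items():
    print(f"{name:12s} {t * 1000:.2f} ms")
```

Taking the minimum over several repeats filters out most timer noise, which matters when comparing sorts whose differences are in the tens of percent.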

> If you come up with a fast sorting algorithm how would you go about making a case it's faster than currently used ones on real world data?

For analysing sorting algorithms - and the related topics that leads into - there are established ways to determine their best, average, and worst case runtimes for input of a given size.

If you're interested in learning this stuff, the videos here are useful and not math heavy:


Technically it's explained using Java, but it's mostly done using primitive data types (e.g. int, float, byte, ...).

So it's very easy to follow along and copy the implementation, translated to a different language (Go in my case, as I don't do Java at all).

Well I think that ship has already sailed, Timsort is used in Java on Android and Python, so it's already seen massive adoption.

In the film _Heist_, Gene Hackman's character is asked how he pulled something off. "I tried to imagine someone smarter than myself, and then I thought: What would he do?"

When I need to pull something off, I ask myself: What would Tim do?

I ported timsort to C#/dotnet recently. Available in this nuget:


Source code here (there are three variants):


None of the standard framework sort methods perform a stable sort so it's handy for that, as well as being a lot faster for some data, i.e. with pre-sorted runs. Otherwise the framework now uses introsort which is pretty good, so you should definitely performance test with both. I use both sorts depending on the context and I usually end up using the built in introsort, with timsort used for a few special cases.

It's a great piece of engineering and because of the tongue-in-cheek name might be confused with a joke sorting algorithm...my favorite of which is sleep sort.


> Timsort is a sorting algorithm that is efficient for real-world data and not created in an academic laboratory.

It's always a bit sad to see academia as an ivory tower where researchers generate unusable knowledge. Academia is not only about finding theoretical pure solutions but often also about finding solutions to real-world problems. For instance, machine learning, logic design, networking algorithms, cryptography, all deeply depend on work done in academia (and not just “theoretical” work).

In addition, in Tim's post on the python mailing list he says he developed the algorithm after reading through a number of papers on sorting[1]. So while the algorithm was developed with the goal of being practical, plenty of academic research was used to develop it.

[1]: https://mail.python.org/pipermail/python-dev/2002-July/02683...

Reading the mailing list end-to-end actually shows how much research Tim did before designing it. To suggest that the "academics" had no clue is extremely disingenuous. Worse, it misleads the next generation of great would-be algorithm designers into thinking that studying the research is foolish. Study of both the theoretical and the practical are hugely important; ignore either at your peril.

I believe Tim Peters was trained as a mathematician. He is good at analyzing algorithms; often he will post on the Python lists about the subtleties of floating point behavior. As you say, he read much of the recent research on sorting before coming up with TimSort. His special ability, I think, is making the "engineering" tradeoffs to come up with an algorithm that works well on real-world data.

You believe or you know? Nothing I've read suggests that. Please don't advance ideas like that without giving proper justification; this is exactly how myths start.

FWIW, mergesort (von Neumann, 1940s, Manhattan Project), Shell sort (Shell, General Electric Company, 1959), and quicksort (Hoare, Elliott brothers Ltd, 1961) were not created in an academic laboratory.

OK, but the OP's point was that academia is not simply some free-floating institution avoiding usefulness. The fact that some sorting algorithms were developed outside a university doesn't contradict that.

Also both von Neumann and Hoare worked as professors. Von Neumann was a polymath and published plenty of academic papers. Hoare is still a professor. Are you implying that their work in academia had no value in developing those algorithms?

I think the comment in the essay was a gratuitous and needless dismissal of academic research. All modern improvements to sorting, including TimSort, have been influenced by academic research.

I think the most commonly used sorting algorithms were made in the first days of computing, before CS was even established as a field, making the point irrelevant. People don't even care to know what algorithm they are using. How many know when they are using introsort, even though that was developed by an academic?

I think your objection points out the needless specificity of the original claim. "Created in an academic laboratory" requires that the person be in an academic lab, thus excluding (say) academics who are on sabbatical at (say) a national lab.

Since there were few academic laboratories in the first few decades of computers, it's rather pointless.

While Hoare did go on to become a professor, if Tim Peters were to get a professorship, would that make a difference?

> machine learning, logic design, networking algorithms, cryptography, all deeply depend on work done in academia (and not just “theoretical” work)

All were highly theoretical fields long before we found concrete applications. It often took one or more people an entire lifetime to convince others that these were valuable subjects to pursue.

> Academia is not only about finding theoretical

No, but it has to start there. Theoretical science is how we find new things to turn into applied science. It took 45+ years between Heinrich Hertz demonstrating electromagnetic waves and radar being put into use, and 33 years from when Einstein published mass–energy equivalence (E=mc²) until Otto Hahn cracked nuclear fission, but only a few more years until Robert Oppenheimer and the Manhattan Project turned that into the first atom bomb.

> No, but it has to start there.

That's simply not true. There's a loop: sometimes industry comes up with a specific innovation and academia generalises it; other times academia comes up with an idea and it's implemented in industry (often a commercialisation effort by the inventors). And sometimes academics are employed by industry and there's a hybrid. There's no one true way.

Let's add all the work on finding decision procedures for all of mathematics to that list

I ran the code and it gave me [2, 3, 5, 6, 7], missing the 1. Appending the_array[i] after clearing the new_run by `new_run = []` seems to fix this issue.
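I haven't seen the article's exact code, but the symptom matches a classic run-splitting bug: when an element breaks the current run, resetting the run list without re-appending that element silently drops it. A hypothetical sketch of the corrected pattern:

```python
def split_into_runs(the_array):
    """Split the_array into non-decreasing runs.

    The bug described above: when a value breaks the current run,
    resetting new_run with `new_run = []` without re-appending that
    value silently drops it from the output.
    """
    runs, new_run = [], []
    for value in the_array:
        if new_run and value < new_run[-1]:
            runs.append(new_run)
            new_run = []
        new_run.append(value)  # the fix: append AFTER resetting, never skip it
    if new_run:
        runs.append(new_run)
    return runs
```

With the append in the right place, concatenating the runs always gives back the original array, so no element can go missing.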

Any sorting algorithms that use deep learning?

You could of course write one. Presumably you're talking about using deep learning to decide the order in which to compare elements, but this would be difficult to ensure both always terminated and always gave correct results, to say nothing of likely being very inefficient.

The difficulties in sorting algorithms aren't difficulties in recognizing patterns, so it's difficult to see what deep learning brings to the table.

That being said, knock yourself out; surprise me!

For creating custom comparators?

"Sort these web pages in order of quality."

I've never seen a reference text that considers the comparator part of the sorting algorithm.

Timsort, quicksort, mergesort, introsort, smoothsort, etc. don't include the comparator as part of the algorithm definition.

You mean like a search engine? :-)

I believe ML is used in search engines nowadays.

Maybe finding in the input data.

"Finding" what in the input data as a part of sorting?

You could always just write out all permutations and train a network on recognizing the sorted one ;-)

A storage requirement of O(n!)

Sorting a mere 13 one-byte items would mean 13! ≈ 6.2 billion permutations, so at least 6.2 GB of storage even at just one byte per permutation.

We just generate them on the fly - after all we're not stupid ;-)


.. and there

"Generation in lexicographic order"

This algorithm only requires you to hold the last permutation in memory to obtain a new (distinct) one.
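For reference, the lexicographic next-permutation step needs only the current permutation (a standard textbook sketch in Python):

```python
def next_permutation(a):
    """Advance list a to its lexicographic successor in place.

    Returns False when a is already the last (fully descending) permutation.
    """
    # 1. Find the rightmost index i with a[i] < a[i+1].
    i = len(a) - 2
    while i >= 0 and a[i] >= a[i + 1]:
        i -= 1
    if i < 0:
        return False
    # 2. Find the rightmost j > i with a[j] > a[i] and swap them.
    j = len(a) - 1
    while a[j] <= a[i]:
        j -= 1
    a[i], a[j] = a[j], a[i]
    # 3. Reverse the descending suffix to make it the smallest possible.
    a[i + 1:] = reversed(a[i + 1:])
    return True
```

Looping this from the sorted permutation enumerates all n! orderings one at a time, using O(n) memory.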

Good interview question ;)
