Some of the core JDK developers are really on another level. The JDK code, post 1.5, is a joy to read, though verbose. Compilers and Languages seem to attract a certain caliber of engineers, I think.
As best as I can remember, those 9 lines are the only literal copying Oracle was able to find; the rest of their suit centers around whether the design of an API (e.g. parameter names) can be copyrighted. (And unless the Supreme Court steps in, the answer is "maybe.")
The code (it was backend server code for a web app) was heavily tested with ab and JMeter against extreme load, and everything seemed to work fine. Fast-forward 6 months, when we had a sudden peak in users (for a few days, we went from around 500 visitors a day to around 500,000, which roughly meant > 5 million requests per day). Suddenly, the backend, which had run without problems for half a year, crashed in production every 5 hours or so with a segmentation fault. I could not for the life of me reproduce this. After some panicking, I let the backend run in gdb in production against the ~50 requests per second we were still getting. After a few hours, the segfault occurred again, and I figured out that in extremely rare edge cases, the pseudocode I had copied to C++ divided by 0, which led to a chain of problems afterwards, eventually resulting in said segfault. If I remember correctly, the fix was trivial.
I think this is the wrong way to look at the problem. Progress is always iterative. If a given person happens to make a bigger iteration, then people call them 'masters of the universe', but that iteration isn't inherently different from the normal, smaller kind of iteration. And - consider: if every line of code has a chance of being buggy, then a bigger change is likelier to be more buggy. I say more buggy, rather than have more bugs, because bugginess is rarely disconnected; the whole conception is likely to be more flawed the newer it is. If we were to only accept small contributions then, somewhere along the way from here to the limit, we would arrive at timsort. And all the flaws of timsort wouldn't be gone, they would just be amortized into smaller chunks along the way from here to there. Which may be preferable, but that's something you have to consider; it's not so cut-and-dried as 'if we only make changes conservatively, then we avoid catastrophic failure'.
All this not to mention that there are people who are 'masters of the universe' at fixing bugs.
+1. Bloch himself was iterating on work by Arthur van Hoff and others at Sun before him.
The specific line from the Javadoc you're alluding to is:
> The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y.
The problem is the second implied half of that statement: Pre-Timsort it was ", and if you don't, results might not be sorted in the correct order" while with Timsort it was ", and if you don't, you will get a RuntimeException."
It's way too easy to accidentally violate that condition: System.currentTimeMillis(), incorrect null handling, random numbers. Sometimes not even in the Comparator itself. The condition I posted, plus the other two:
> The implementor must also ensure that the relation is transitive: (x.compareTo(y)>0 && y.compareTo(z)>0) implies x.compareTo(z)>0.
> Finally, the implementor must ensure that x.compareTo(y)==0 implies that sgn(x.compareTo(z)) == sgn(y.compareTo(z)), for all z.
Are just way too easy for someone to inadvertently violate to justify a RuntimeException when it occurs.
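To make it concrete, here's a minimal Java sketch (class name and data are mine, not from the JDK) of how an innocent-looking comparator breaks the contract and can trip TimSort's check:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BrokenComparatorDemo {
    public static void main(String[] args) {
        Random rnd = new Random();
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100_000; i++)
            data.add(rnd.nextInt());          // full int range, near MIN/MAX too
        try {
            // (a - b) overflows when a and b are far apart, so for some pairs
            // sgn(compare(x, y)) != -sgn(compare(y, x)). The correct version
            // is Integer.compare(a, b).
            data.sort((a, b) -> a - b);
            System.out.println("sorted without complaint this run");
        } catch (IllegalArgumentException e) {
            // TimSort: "Comparison method violates its general contract!"
            System.out.println(e.getMessage());
        }
    }
}
```

Whether it actually throws on a given run depends on the data: TimSort only notices the inconsistency when a merge happens to expose it, which is exactly why these bugs slip into production.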
It's good to have a contract that defines behaviour even in the case of invalid input.
That could be throwing an error, or it could be given unsorted input. Either way is ok.
(The C and C++ answer is to leave the behaviour in the face of invalid input undefined. Undefined means that anything goes, including formatting your hard disk.)
I’m sure there are quite a few languages that will catch that at compile time.
Is it? In Rust and other languages you would use trait bounds to restrict the input types to be comparable with each other, which would make this a compile time error.
Making sure you can compare the types is already happening in the compiler, and doesn't find the bug.
The comparison being invalid is definitely a bug; if I didn't want sorted output, I wouldn't be sorting.
The first software patent, for SyncSort, is for a sort that beats O(n log n). The basic idea is to read records for a while, get some stats about the key distribution, and set up the buckets to get a roughly equal fraction of the observed keyspace. If blocks of records show up with very different stats, action has to be taken to adjust the bucketing.
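The basic idea, as I understand it, in a rough Java sketch (all names mine; this is the general sample-and-bucket technique, not the patented algorithm): sample a prefix to estimate the key distribution, cut bucket boundaries at the observed quantiles, scatter, then sort each bucket.

```java
import java.util.*;

public class DistributionSort {
    // Sketch only: sample a prefix, cut bucket boundaries at its quantiles,
    // scatter keys into buckets, sort each bucket, concatenate.
    static int[] distributionSort(int[] keys, int numBuckets) {
        if (keys.length == 0) return keys;
        int sampleSize = Math.min(keys.length, 1024);
        int[] sample = Arrays.copyOf(keys, sampleSize);
        Arrays.sort(sample);
        int[] bounds = new int[numBuckets - 1];     // quantiles of the sample
        for (int i = 1; i < numBuckets; i++)
            bounds[i - 1] = sample[(int) ((long) i * sampleSize / numBuckets)];
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) buckets.add(new ArrayList<>());
        for (int key : keys) {
            int b = Arrays.binarySearch(bounds, key);
            if (b < 0) b = -b - 1;                  // insertion point = bucket
            buckets.get(b).add(key);
        }
        int[] out = new int[keys.length];
        int pos = 0;
        for (List<Integer> bucket : buckets) {      // buckets are ordered, so
            Collections.sort(bucket);               // sorting each one yields
            for (int k : bucket) out[pos++] = k;    // a fully sorted result
        }
        return out;
    }
}
```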
A couple years ago I wrote an article about where this O(n log n) bound comes from if anyone is interested:
"Comparison Sorting Algorithms and Mystery of NlogN complexity"
And yes, radix sorts are something like O(n·k), where k is the bit width of the maximal key. Since k is often a constant, this could be thought of as O(n). A Python sort could enumerate the list, check for (1) all-integer contents and (2) no overridden comparison method, and then use a radix sort, I guess.
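For concreteness, a minimal LSD (least-significant-digit) radix sort for non-negative 32-bit ints, one byte per pass, i.e. k/8 = 4 passes of O(n) work (my sketch; handling negative keys would need the sign bit flipped):

```java
// Four stable counting passes, one per byte, least significant first.
static int[] radixSort(int[] a) {
    int[] src = a.clone(), dst = new int[a.length];
    for (int shift = 0; shift < 32; shift += 8) {
        int[] count = new int[257];
        for (int x : src) count[((x >>> shift) & 0xFF) + 1]++;
        for (int i = 0; i < 256; i++) count[i + 1] += count[i];   // prefix sums
        for (int x : src) dst[count[(x >>> shift) & 0xFF]++] = x; // stable scatter
        int[] t = src; src = dst; dst = t;
    }
    return src; // after an even number of passes the result is back in src
}
```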
I've always thought claiming radix sort on integers to be linear was a bit disingenuous, since k << log(n) only if your list has a lot of duplicate entries.
In fact, if your integers are all unique, then k >= log(n).
(for consistent choice of base for log)
It's not clear the choice of log should be consistent. For radix sort the number of iterations is the log of the max element in the base of the number of buckets. For a merge sort the number of comparisons is pretty close to log-base-two of the number of elements to sort. (Not exactly -- to merge n items you need n-1 comparisons, not n.)
Of course, asymptotically they're the same -- lg(n) == log(n) etc -- and we're counting comparisons on one two-fingered hand and divs+mods on the other, so non-asymptotic comparisons don't make much sense without also talking about benchmarks.
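To put made-up numbers on it: for $n = 10^6$ keys of 32 bits with 256 buckets,

$$ \text{radix passes} = \log_{256} 2^{32} = \tfrac{32}{8} = 4 \;\Rightarrow\; \approx 4n = 4 \times 10^6 \text{ bucket operations}, $$
$$ \text{merge comparisons} \approx n \log_2 n \approx 10^6 \times 20 = 2 \times 10^7, $$

so the radix version does fewer "steps", but each step is a div+mod+store rather than a single comparison.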
Imagine a list of all 0s and 1s that you wanted to sort (like 001110101). Why bother sorting? Just count up how many 0’s and 1’s there are (four 0’s, five 1’s) and generate the sorted list. O(n).
Clearly this doesn’t work when you have real world data to sort, but it is the basis for a radix sort, where you sort a batch of numbers digit-by-digit, giving you O(n*d). Though I think it’s rarely used.
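That counting idea, in a minimal Java sketch (names mine), for keys in a known small range [0, maxKey]:

```java
// Counting sort for keys in [0, maxKey]: tally, then emit keys in order.
// O(n + maxKey) time, and not a single comparison.
static int[] countingSort(int[] a, int maxKey) {
    int[] count = new int[maxKey + 1];
    for (int x : a) count[x]++;
    int[] out = new int[a.length];
    int pos = 0;
    for (int key = 0; key <= maxKey; key++)
        for (int c = 0; c < count[key]; c++)
            out[pos++] = key;
    return out;
}
```

No comparisons happen at all, which is how it sidesteps the n log n bound.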
We have a leaderboard with a list of users, normally totalling between 2 and 50,000, but in rare circumstances it can be several million. The scores users can achieve are in a rather small range, perhaps only 450k distinct valid scores, and we were doing this sort on average once every 30 seconds.
What we ended up doing was switching to a radix sort when the list of users got very long; it was substantially faster in those cases, which made the user experience much better, since the leaderboard was much more up to date for the very large lists.
Hadoop TeraSort does the same sort of thing: it starts off with a partitioner doing a radix-style bucketing, then sorts each part with a lower number of comparisons (the Spark version actually does local sorts and has the merge step just fetch ranges after the first pass). Both work great when the key ranges are almost entirely unique, with no big runs of identical keys and differing values.
Timsort, however, has an advantage when there are huge runs of identical values or almost-sorted inputs, like a time series with the occasional out-of-order packet. The galloping mode in the algorithm neatly eats those ranges up very fast.
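The galloping trick, roughly (my own sketch; Tim Peters' actual routine also gallops outward from a hint index): probe the sorted run at exponentially growing offsets, then binary-search the last bracket, so a long run of identical or already-ordered values is consumed in O(log run) comparisons instead of one per element.

```java
// Find the first index in a sorted run where run[index] >= key
// ("lower bound"), galloping first, then binary searching.
static int gallopLeft(int[] run, int key) {
    if (run.length == 0 || run[0] >= key) return 0;
    int lo = 0, hi = 1;                       // invariant: run[lo] < key
    while (hi < run.length && run[hi] < key) {
        lo = hi;
        hi = hi * 2 + 1;                      // probe offsets 1, 3, 7, 15, ...
    }
    if (hi > run.length) hi = run.length;
    while (lo + 1 < hi) {                     // binary search in (lo, hi]
        int mid = (lo + hi) >>> 1;
        if (run[mid] < key) lo = mid; else hi = mid;
    }
    return hi;
}
```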
My favourite example to show that sorting is data-dependent is the tricolor sort, which illustrates what happens when you have a large number of identical keys.
 - https://en.wikipedia.org/wiki/Dutch_national_flag_problem#Th...
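The tricolor partition itself fits in a few lines (a standard sketch, in Java to match the thread): one pass, and all keys equal to the pivot end up in a single contiguous block, which is exactly what helps with lots of identical keys.

```java
// Dutch-national-flag / tricolor partition: after the loop,
// a = [ < pivot | == pivot | > pivot ], in one O(n) pass.
static void tricolorPartition(int[] a, int pivot) {
    int lo = 0, mid = 0, hi = a.length;
    while (mid < hi) {
        if (a[mid] < pivot)      swap(a, lo++, mid++);
        else if (a[mid] > pivot) swap(a, mid, --hi);
        else                     mid++;
    }
}

static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
```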
The reason it "violates" the n log n lower bound of comparison sorts is that it isn't a comparison sort. Sort of like how hash table lookups "violate" the O(log n) average lower bound of binary tree lookups. It has different performance bounds because it's a different problem.
Which version ends up faster is mostly based on factors that big O notation doesn't capture easily. You have to push things to impractical numbers like n=2^1000 or 2^1000000 to properly distinguish between constant factors and log factors, and at that point it doesn't actually help you write better code.
In the worst case, every element falls into a single bucket, at which point your best bet is to do a bitwise radix sort over the low 8 bits.
This ends up being equivalent to k=log(n).
Original author: https://github.com/BonzaiThePenguin/WikiSort
Graph and copyage (by me): https://tse.gratis/aArray/#details
Or grail sort: https://github.com/Mrrl/GrailSort/blob/master/README.md
Sorting is a deep rabbit hole.
Can't find a direct comparison with TimSort, though.
If you're using another language, you might want to verify that the bug is fixed/not present in your library
Stable sorts often do require that (mergesort is usually stable; heapsort and quicksort are inherently not), but even that's not required: there is a completely in-place variant of merge sort that only requires O(log n) space for the stack (like quicksort; heapsort is O(1)). See e.g. https://xinok.wordpress.com/2014/08/17/in-place-merge-sort-d...
I spent some time four summers ago with merge sort. I had a PoC for an in-place algorithm that survived several rounds of poorly selected sample data. That was quite a disappointment.
In looking around I believe I ran across several implementations that required sqrt(n) extra space and one that I think claimed log n but was so complicated I never did figure out why it was supposed to work. At least one of these had higher time complexity, but often enough you need to work with data sets that dominate your memory footprint. Even a second array of pointers might push you over.
Stack frames are external storage. I’d have to see the code to see how they manage to do recursion without log(n) external storage.
Traditional merge sort can be written using iteration, which makes the external storage for sort state O(1), but the semi-spaces are still there.
If it's just a matter of keeping up with the bounds of the partitions, I can see that being done in constant space.
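For reference, a minimal bottom-up (iterative) merge sort in Java (my sketch): the control state is indeed just the run width and the partition bounds, i.e. O(1), but the O(n) merge buffer is still there.

```java
// Bottom-up merge sort: merge runs of width 1, 2, 4, ... instead of
// recursing. Sort state is O(1); the buffer is still O(n).
static void bottomUpMergeSort(int[] a) {
    int[] buf = new int[a.length];
    for (int width = 1; width < a.length; width *= 2) {
        for (int lo = 0; lo < a.length - width; lo += 2 * width) {
            int mid = lo + width, hi = Math.min(lo + 2 * width, a.length);
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi) buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
            while (i < mid) buf[k++] = a[i++];
            while (j < hi)  buf[k++] = a[j++];
            System.arraycopy(buf, lo, a, lo, hi - lo); // copy merged run back
        }
    }
}
```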
I believe when I realized that fixing my bug would turn it essentially into this algorithm, I found something else to do.
If it's not stable, then what's the point? It's not a merge sort variant by the most important measure, IMO, and as I said, I was only considering merge sort variants.
Even with all of the additional logic people have created to avoid worst case performance, quicksort is simpler than this algorithm by a huge margin.
Edit: Whoops, apparently heapsort is not "stable" (not sure what that means actually), sorry.
Example: sorting 2#a, 1#c, 2#b by only the first number.
An algorithm that produces 1#c, 2#b, 2#a is a correct sorting algorithm, but not stable as it changes the order of 2#a and 2#b.
One of the things you can do with a stable sort is fake complex sorting by iteratively sorting by each criterion. Though I have rarely seen that be practical anywhere other than tabular data.
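A small Java illustration (record type and data are made up): sort by the least significant key first, then by the most significant; because the second sort is stable, ties keep the order the first sort established.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MultiKeySort {
    record Row(String name, int score) {}   // made-up record for illustration

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>(List.of(
                new Row("b", 2), new Row("c", 1), new Row("a", 2)));
        rows.sort(Comparator.comparing(Row::name));     // secondary key first
        rows.sort(Comparator.comparingInt(Row::score)); // then primary key
        // Stability keeps ties on score in name order:
        // [Row[name=c, score=1], Row[name=a, score=2], Row[name=b, score=2]]
        System.out.println(rows);
    }
}
```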
Might be of interest...
Quite a few submissions for something no one has heard of: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
The other main discussion is from 2011: https://news.ycombinator.com/item?id=3214527
Truly a giant of the Python community.
<1/2 wink-ly yr's>
Basically, timsort is to mergesort as pdqsort is to quicksort.
Can you enlighten me?
You can certainly be happy not knowing the tools of your trade very intimately, but that's definitely not something to encourage.
Values in a quantum computer are superpositions, but when measured they return only one result. They don’t have an infinite amount of time and/or space.
Like why are you not arguing that C++24 will make all algorithms O(1)? There is as much reason to believe that as there is to believe that quantum computing will do so.
Or maybe graphene memristors will form timelike loops and allow computers that give you results before you ask the question? Maybe! Again, there's no more reason to think that will happen than to expect some kind of purely conjectural magic from quantum computation.
Maybe the development of green energy such as solar panels will lead to the development of nano-scale self-assemblers, allowing us to turn the entire moon into ultra-efficient computronium powered by sinking the moon's residual heat of formation into deep space, making the asymptotic complexity of most algorithms on many problems largely irrelevant. Maybe! Or maybe the insight might actually come from a school kid that trips over a rock and gets a vision after hitting their head. Better start putting pebbles out in front of schools, because you never know! :)
Isn't getting a sqrt() speedup for everything and an exponential speed up for a few things good enough for you?
If we really understand the idea of quantum computing so poorly that we've massively underestimated it, isn't it even more likely that our misunderstanding has made us massively overestimate it?