
Proving that Android’s, Java’s and Python’s sorting algorithm is broken - amund
http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
======
bsdetector
Their corrected version:

    
    
        if (n > 0 && runLen[n-1] <= runLen[n] + runLen[n+1]  
        || n-1 > 0 && runLen[n-2] <= runLen[n] + runLen[n-1])
    

In the first clause they add earlier to later runLen elements, but in the
second they add later to earlier elements. Switching the order just makes the
expression harder to understand, like reusing variables within a scope.
Addition is commutative and there's nothing technically wrong with it, but
this construction makes it appear like there may be something special about
element n when there's not.

The programmer also has to do mental arithmetic to check the bounds. Bounds
checks can be written so that the largest index subtracted is also the value
tested:

    
    
        if (n >= 1 && runLen[n-1] <= runLen[n] + runLen[n+1]  
         || n >= 2 && runLen[n-2] <= runLen[n-1] + runLen[n])
    

This makes it easier to see that the bounds are correct. Formal methods found
an important bug that would not have been found otherwise, but a lot of lesser
bugs can be prevented just by writing clear and consistent code.
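
For what it's worth, the equivalence of the two spellings can be checked exhaustively over small arrays (a throwaway sketch; the array size and value range are arbitrary, and the short-circuiting `and` stands in for Java's `&&`):

```python
from itertools import product

# The article's fixed condition, as quoted above.
def fixed(runLen, n):
    return (n > 0 and runLen[n-1] <= runLen[n] + runLen[n+1]
            or n - 1 > 0 and runLen[n-2] <= runLen[n] + runLen[n-1])

# The rewrite with consistent operand order and bounds checks.
def rewritten(runLen, n):
    return (n >= 1 and runLen[n-1] <= runLen[n] + runLen[n+1]
            or n >= 2 and runLen[n-2] <= runLen[n-1] + runLen[n])

# Exhaustive check over all small run-length arrays: they agree everywhere.
assert all(fixed(r, n) == rewritten(r, n)
           for r in product(range(1, 6), repeat=4)
           for n in range(3))
```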

~~~
jonahx
Question: Formal methods found the bug, but now that we know about it, what
lessons can human programmers learn? That is, in the spirit of "20/20
hindsight," can we see the false assumption that made the bug possible as an
instance of a certain kind of mistake, which we can look for and avoid in the
future?

Or do you think the only lesson here is "Never fully trust anything that
hasn't been formally verified"?

~~~
maxerickson
To me, [https://mail.python.org/pipermail/python-
dev/2002-July/02689...](https://mail.python.org/pipermail/python-
dev/2002-July/026897.html) reads like the bug was known (err, the possibility
of running out of slots in the bookkeeping stack was understood). That
indicates that it could be an intentional tradeoff.

~~~
bradleyjg
The source code for the python version has this comment:

    
    
      /* The maximum number of entries in a MergeState's pending-runs stack.
      * This is enough to sort arrays of size up to about
      *     32 * phi ** MAX_MERGE_PENDING
      * where phi ~= 1.618.  85 is ridiculouslylarge enough, good for an array
      * with 2**64 elements.
      */
    

According to the linked article that's incorrect, it is only good enough for
an array with 2^49 elements. So, if the (implicit) design guarantee is for
lists up to 2^64, the implementation is technically bugged. But it'd be pretty
hard to run into it in practice. 2^49 of python's plain integers consume more
than 4.5 petabytes of RAM.
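
The arithmetic in that comment is easy to sanity-check, assuming (as the comment does) that the invariant holds and run lengths grow like 32 * phi**k; the article's point is precisely that the broken invariant voids this assumption, dropping the real guarantee to about 2^49:

```python
import math

phi = (1 + math.sqrt(5)) / 2      # ~1.618
MAX_MERGE_PENDING = 85

# If the invariant held, 85 stack slots would cover about 32 * phi**85
# elements, which is (just barely) more than 2**64.
capacity = 32 * phi ** MAX_MERGE_PENDING
assert capacity > 2 ** 64
```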

~~~
kragen
2⁴⁹ item references in a Python list will consume about 2⁵² bytes of RAM on an
LP64 system, which is 4 pebibytes (4.5 petabytes, as you said). You might also
need to allocate space for the objects that the references refer to, which
will almost certainly be bigger. I don’t understand the bug well enough to
know if it will work with large numbers of duplicate/identical items.

Nitpick: the idiomatic term is “buggy”, not “bugged”. Something is “bugged” if
it has a hidden microphone in it transmitting to spies, not if it contains a
software “bug”.

------
mkesper
From the article: The reaction of the Java developer community to our report
is somewhat disappointing: instead of using our fixed (and verified!) version
of mergeCollapse(), they opted to increase the allocated runLen
“sufficiently”. As we showed, this is not necessary. In consequence, whoever
uses java.utils.Collection.sort() is forced to over allocate space. Given the
astronomical number of program runs that such a central routine is used in,
this leads to a considerable waste of energy.

~~~
tveita
It looks like the Python version that's good for 2^49 items uses 1360 bytes
for this array, allocated on the stack.

I wouldn't worry about an extra couple of bytes nearly as much as I would
worry about changing the behaviour of a function used in an "astronomical
number of program runs", so this looks like a pretty reasonable and
conservative choice, at least for an immediate patch.


~~~
krick
Does it "change behavior"? As far as I understand, the only change is that
this test wouldn't crash anymore. That certainly is "changing the behavior",
but so is allocating more memory. If so, I don't see any reason not to use
the formally verified version. Not that it's hugely important, but it seems
the more reasonable choice.

------
thomasahle
This is actually really cool. They tried to verify Timsort as implemented in
Python and Java using formal methods. When it didn't seem to work, they
discovered that there was actually a missing case in both implementations,
which could lead to array out of bounds exceptions.

Really shines as an example of how important proof is in computer science.

~~~
stingraycharles
In the Haskell community there is a tool called QuickCheck [1]: it is able to
generate inputs based on preconditions, and you provide a function to verify
the postconditions.

I am not sure why these kinds of testing methods aren't used more often in
other communities, since they make it really easy to catch corner cases; in
the Haskell world, at least, using QuickCheck is somewhat pervasive.

[1]
[https://wiki.haskell.org/Introduction_to_QuickCheck2](https://wiki.haskell.org/Introduction_to_QuickCheck2)

EDIT: I have never done this before, but could anyone explain why I am being
downvoted? I wasn't making the claim that this is a substitute for a formal
proof, I was merely adding this information to the discussion since it seemed
relevant.

~~~
grandpa
There's a difference, though: QuickCheck randomly tries a large-ish number of
possible inputs to try and disprove a postcondition, whereas formal methods
prove the postcondition for all inputs. Practically, if the number of possible
inputs is huge compared to the number of failure cases, it's unlikely that
QuickCheck will find one. I strongly suspect - and this would be an interesting
experiment - that QuickCheck wouldn't have found this particular bug, since
Timsort has been in use for years without anyone noticing it.

~~~
TheLoneWolfling
What about AFL?

~~~
jamesfisher
What's AFL?

~~~
fpgaminer
American Fuzzy Lop
([http://lcamtuf.coredump.cx/afl/](http://lcamtuf.coredump.cx/afl/)), a fuzzer
which instruments programs at compile-time to help find interesting inputs
faster than brute-force fuzzing.

It might well have found the bug in TimSort, since it's better at exercising
branches, but I think AFL is C/C++ only.

------
bglazer
Has anyone used the KeY project in industry?

I'd certainly prefer proofs over unit tests. However, I don't understand
formal proof systems well enough to know whether they would work for your
typical "app" that makes RPC calls, makes DB changes, and generally has lots
of moving parts and statefulness.

~~~
rwmj
I have tried to use Coq and Frama-C to prove commercial OCaml and C programs,
without, it has to be said, any success. I'm waiting for someone to write the
brilliant tutorial.

~~~
dhekir
You're waiting for a tutorial on Coq integrated with Frama-C, or for a
tutorial on any of them?

~~~
rwmj
I'm waiting for a tutorial on how to prove correctness of either OCaml or
(especially) C code in real programs.

~~~
dhekir
Research tools are getting closer to what people consider "real" programs,
even if they often focus on embedded systems software. CompCert already deals
with a quite large subset of C. Frama-C can deal with most if not all
syntactic features of C, but then the next challenge is the C standard
library, e.g. specifying every useful function (and even "simple" ones such as
memcpy can be quite tricky). Afterwards, you have to deal with the glibc, then
other high-level libraries, etc...

Most of these tools are either still in a mostly-academic setting (where
"documentation = conference paper"), or do not have enough funding to pay for
the development of more user-friendly features and extensive documentation.
But with the ever-increasing security issues receiving media attention lately,
we can hope more funding will allow these tools to reach a more mainstream
status.

By the way, could you give an example of a small program that you would
consider "real"? Just to have an idea of its size and complexity.

~~~
rwmj
Firstly I don't expect to prove a whole program. However being able to prove
the correctness of key functions or algorithms in a program could be useful.

You mention glibc, and it would be great to prove that (for example) 'qsort'
is correct. That wouldn't be entirely trivial:

[https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/msor...](https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/msort.c;h=4e17a8874736d7a653b3f589fbb7ad253225c939;hb=HEAD)
[https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/qsor...](https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/qsort.c;h=304fbed8e5d8d2edfd606ab7db6412e408c1d479;hb=HEAD)

For an example of a larger program, I'd like to prove various invariants of C
programs, such as that 'reply_with_error' is called exactly once on every
error path in this program:

[https://github.com/libguestfs/libguestfs/tree/master/daemon](https://github.com/libguestfs/libguestfs/tree/master/daemon)

------
jordigh
Oh, crap. I suppose this also means we have the bug in our GNU Octave
implementation:

[http://hg.savannah.gnu.org/hgweb/octave/file/0486a29d780f/li...](http://hg.savannah.gnu.org/hgweb/octave/file/0486a29d780f/liboctave/util/oct-
sort.cc)

Well, time to patch it there too.

------
Animats
Nice. As usual, entry and exit conditions aren't that hard to write; it's loop
invariants that are hard.

    
    
      /*@ loop_invariant
        @  (\forall int i; 0<=i && i<stackSize-4; 
        @             runLen[i] > runLen[i+1] + runLen[i+2])
        @  && runLen[stackSize-4] > runLen[stackSize-3]
        @*/
    

It's surprising how close their notation is to our Pascal-F verifier from 30
years ago.[1] Formal verification went away in the 1980s because of the
dominance of C, where the language doesn't know how big anything is. There
were also a lot of diversions into exotic logic systems (I used to refer to
this as the "logic of the month club"). The Key system is back to plain old
first-order predicate calculus, which is where program verification started in
the 1970s.

For the invariant, you have to prove three theorems: 1) that the invariant is
true the first time the loop is executed, given the entry conditions, 2) that
the invariant is true for each iteration after the first if it was true on the
previous iteration, and 3) that the exit condition is true given that the
invariant is true on the last iteration. You also have to prove loop
termination, which you do by showing that a nonnegative integer gets smaller
on each iteration. (That, by the way, is how the halting problem is dealt with
in practice.) #2 is usually the hardest, because it requires an inductive
proof. The others can usually be handled by a simple prover. There's a
complete decision procedure by Oppen and Nelson for theorems which contain
only integer (really rational) addition, subtraction, multiplication by
constants, inequalities, subscripts, and structures. For those, you're
guaranteed a proof or a counterexample. But when you have an internal
quantifier (the "forall int i") above, proof gets harder. Provers are better
now, though.
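
As an illustration of those proof obligations, here is a toy loop with the entry condition, invariant, exit condition, and termination measure checked dynamically via assert (a verifier like KeY would discharge the same obligations symbolically, for all inputs; the function and its names are invented for the sketch):

```python
def sum_first(n):
    """Sum of 0..n-1; loop invariant: total == i*(i-1)//2."""
    assert n >= 0                          # entry condition
    total, i = 0, 0
    assert total == i * (i - 1) // 2       # (1) invariant true on loop entry
    while i < n:
        variant = n - i                    # nonnegative termination measure
        total += i
        i += 1
        assert total == i * (i - 1) // 2   # (2) invariant preserved by the body
        assert 0 <= n - i < variant        # measure strictly decreases
    assert total == n * (n - 1) // 2       # (3) exit condition from invariant
    return total

assert sum_first(10) == 45
```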

A big practical problem with verification systems is that they usually require
a lot of annotation. Somebody has to write all those entry and exit
conditions, and it's usually not the original programmer. A practical system
has to automate as much of that as possible. In the example shown, someone had
to tell the system that a function was "pure" (no side effects, no inputs
other than the function arguments). That could be detected automatically. The
tools have to make the process much, much easier. Most verification is done by
people into theory, not shipping products.

[1]
[http://www.animats.com/papers/verifier/verifiermanual.pdf](http://www.animats.com/papers/verifier/verifiermanual.pdf)

------
inglor
This is an excellent use case of formal verification. I can definitely see the
merit in using tools like this in my code.

They mention KeY [http://www.key-project.org/](http://www.key-project.org/) .
Is anyone using this here? Are there any good resources on it except for the
official site (and this blog post)?

~~~
amund
Hi, I will ask the authors of the blog post and corresponding academic paper
about additional KeY resources and follow-up.

~~~
inglor
Thanks and thanks for the interesting read. I've been looking for real uses of
formal verification for a long time. I've played a lot with code contracts in
C# and some with languages like Eiffel - the advantage of this approach is
that it's static and it performs an actual proof rather than runtime
enforcement.

These forms of formal verification could really help with building robust
software and if someone makes them easy enough to use I can definitely see
them as useful alongside if not instead of unit tests.

------
ezyang
Everyone here is talking about Timsort, but you should also check out the
materials they've published about the KeY project, which they used to carry
out this verification. [http://www.key-
project.org/~key/eclipse/SED/index.html](http://www.key-
project.org/~key/eclipse/SED/index.html) describes their "symbolic execution
debugger", which lets you debug any Java code even if you don't know all the
inputs (by simply taking the input as a symbolic value). The screencast is
very accessible.

------
ujjwal_wadhawan
Spark 1.1 also switched its default sorting algorithm from quicksort to
TimSort, in both the map and reduce phases -
[https://databricks.com/blog/2014/10/10/spark-petabyte-
sort.h...](https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html)

------
ericfrederich
Where can I get that PS1 from the shell in the video? It appears to show a
green check mark if the previous command returned 0, otherwise a red symbol.
Looks cool.

~~~
protomyth
for bash [http://stackoverflow.com/questions/16715103/bash-prompt-
with...](http://stackoverflow.com/questions/16715103/bash-prompt-with-last-
exit-code)

------
tenfingers
This is a good example of why formal verification is incredibly useful. I
tried to invest some time in learning some of the proof verification languages
and tools, but so far I wasn't too successful.

It looks like, to formulate a proof, I always have to rewrite the
algorithm/problem in the tool's language first, which is often _not_ easy. I
could see myself making mistakes in writing the proof just as easily as I do
when programming.

Proof validation is also tricky. Coq isn't fully automatic, as I initially
expected. I actually used "prover9", which is first-order only but does
automatic validation. I guess Coq is really useful when you need to understand
the proof and interactive validation can guide you, whereas prover9 could help
with automation.

The thing is, it's still too much work, even for seemingly simple algorithms,
to write a proof in either system in order to improve on the current situation
of unit testing (that is: if I wanted to get something with more intrinsic
value than a test case).

Formally verified languages are nice, but for a gazillion reasons you still
need to verify what's actually running.

~~~
vstolz
With tools like Coq, you may even have the benefit of extracting an
implementation from your proof! Some assembly may be required though, and
extraction only works to functional languages.

------
amund
The authors have posted a follow-up posting: about KeY - the tool used to
prove the bug in TimSort - [http://envisage-project.eu/key-deductive-
verification-of-sof...](http://envisage-project.eu/key-deductive-verification-
of-software/)

------
imaginenore
And I always thought sort functions were routinely tested with millions of
randomly generated sequences, with the results compared against those of other
implementations known to be good.

~~~
maxerickson
That was done. It's talked about here:

[http://svn.python.org/projects/python/trunk/Objects/listsort...](http://svn.python.org/projects/python/trunk/Objects/listsort.txt)

The bug would only be triggered by generating a truly massive array (the
implementation mentions that it will work for up to 2^64 elements:
[http://svn.python.org/projects/python/trunk/Objects/listobje...](http://svn.python.org/projects/python/trunk/Objects/listobject.c)
search for MAX_MERGE_PENDING).

~~~
e12e
As mentioned by another commenter, the bug is that it's (for python)
documented to work for 2^64 elements, but "only" works for 2^49. 2^49 is still
pretty big... note that for 64-bit integers, 2^49 integers is 2^(49+3)=2^52
bytes... or 4 _petabytes_ of raw data. Even if you're sorting single bits (one
and zero) it's quite a bit of data to chew through.

[ed: On reflection, the first figure was right after all: a 64-bit integer is
2^6 _bits_, i.e. 2^3 bytes, so it's 2^(49+3)=2^52 bytes, or 4 petabytes. 2^55
would be the count in bits.]

------
nichochar
This is bad-ass. Thanks for taking the time to investigate something that so
many overlook yet use every day (myself included).

------
maxerickson
Fixed in python:

[http://bugs.python.org/issue23515](http://bugs.python.org/issue23515)

------
RubyPinch
anyone know the list of bugs for this?

[https://bugs.openjdk.java.net/browse/JDK-8072909](https://bugs.openjdk.java.net/browse/JDK-8072909)

It seems they haven't submitted it to any other trackers, which is a bit
unfortunate.

~~~
vstolz
There is also
[http://bugs.java.com/view_bug.do?bug_id=8011944](http://bugs.java.com/view_bug.do?bug_id=8011944),
where IIUC the suggested "fix" was to use a VM switch to enable the old (and
slower) sorting...

------
andrewstuart2
I'm surprised that modern implementations don't use something like quicksort,
since it sorts in place (no linear-size auxiliary array needed) and has good
constants on average for its n*log2(n) running time.

~~~
Arnt
Quicksort has poor worst-case behaviour, so it's easy to carry out a DoS
attack on network services that use it. Language maintainers don't like
restricting their library to input from friendly users, at least not when
there are other algorithms that work reasonably with unfriendly input.
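
For intuition, a toy quicksort with a deterministic first-element pivot (a sketch, not any library's actual implementation; the comparison counter is bolted on for illustration) shows how cheap the worst case is to trigger: already-sorted input alone forces all n*(n-1)/2 comparisons.

```python
def quicksort(xs, stats):
    # Naive quicksort with a deterministic first-element pivot.
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    stats["cmps"] += len(rest)           # one comparison per remaining element
    lo = [x for x in rest if x < pivot]
    hi = [x for x in rest if x >= pivot]
    return quicksort(lo, stats) + [pivot] + quicksort(hi, stats)

n = 400
stats = {"cmps": 0}
assert quicksort(list(range(n)), stats) == list(range(n))
assert stats["cmps"] == n * (n - 1) // 2   # quadratic on sorted input
```

Random pivot selection dodges this particular input, but as the subthread below discusses, it only obscures the problem rather than removing it.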

~~~
jonesetc
Isn't the malicious attack fought by just picking a random pivot?

~~~
Freaky
There are still inputs that will make it go quadratic, that just obscures it
slightly. A proper fix is to fall back on a different sort if it looks like
quicksort is doing that:
[https://en.wikipedia.org/wiki/Introsort](https://en.wikipedia.org/wiki/Introsort)

~~~
thaumasiotes
> There are still inputs that will make it go quadratic

Can you elaborate on this? To force quicksort into a quadratic running time,
you need to ensure that each pivot splits off a bounded number of elements
(e.g. no more than three, or no more than twenty million) from the rest of the
list. If the pivot is being chosen at random, then it looks to me like the
guarantee you'd need to make is "every single element of this list [because
any of them might be chosen] is larger, and smaller, than no more than _k_
other elements of the list". But as the size of the list grows, that condition
forces it to be mostly composed of the same element repeated over and over
again, which is really easy to sort, and in particular is really easy for
quicksort to handle.

~~~
nightcracker
See my bug report on libc++:
[http://llvm.org/bugs/show_bug.cgi?id=20837](http://llvm.org/bugs/show_bug.cgi?id=20837)
.

~~~
thaumasiotes
But that adversary works by causing the sorting agent to run arbitrary,
adversary-supplied code every time it makes a comparison. It has to do that
because it invents the values in its list on the fly when it detects them
being accessed by the sort (also, in order to detect them being accessed by
the sort). That's not really the same thing. I mean, if you want to tie up a
process, and you can already make it run arbitrary code that you supply, just
give it something like while(1); .

If this adversary were forced to realize all the values it fed to the sorter
before the sorter did any work, or if it were unable to supply its own code to
the sorter, random pivot selection would be a defense.

edit:

I feel like pointing out that a quicksort implementation could defeat this
adversary, without hurting its O(n log n) running time, by just comparing the
first element of the sublist it was working with to every other element in the
sublist -- and throwing away the results -- and then proceeding as normal.
This is O(n) comparisons, which violates the vulnerability criterion of making
only O(1) comparisons per call, but doesn't affect the big-O running time at
all. What it does do, with an eye to this particular adversary, is realize all
the values before doing any sorting work. It still doesn't fix the actual
vulnerability the paper identifies, which is that you're running adversary-
supplied code. I'm growing to feel like your bug report was frivolous.

~~~
nightcracker
What you fail to realize is that this attack is simply an academic proof that
shows the libc++ implementation is broken and can have quadratic performance.
This is a bug, because the C++ standard mandates a worst case of O(n log n).

What you also don't seem to realize is that this attack merely uses a
comparison function to _find_ the worst case. Once the worst case is found you
can feed this input to any program using libc++'s std::sort, without
comparison function, and trigger the worst case.

So no, this is not frivolous at all.

~~~
thaumasiotes
> What you also don't seem to realize is that this attack merely uses a
> comparison function to find the worst case. Once the worst case is found you
> can feed this input to any program using libc++'s std::sort, without
> comparison function, and trigger the worst case.

This is true iff pivots are selected deterministically. In which case, why did
you post it as a response to "how can an adversary force quadratic behavior
when pivots are chosen randomly?"

------
wheaties
Wish they'd put that up on GitHub or Bitbucket. There are so many things to be
learned from it. I don't want to just download things or just use a tool
(although it's pretty awesome that they made the tool available).

~~~
vstolz
What are you looking for?

~~~
wheaties
To watch how change sets go into the tool. To see what is being done, and why,
as it happens. I find that one of the best ways to learn these things,
especially when the work is theory-heavy and you can watch the implementation
congeal around the theoretical framework.

~~~
vstolz
Sorry to keep following up, but we'd really like to know what we can improve
-- in which tool? Are you talking about the KeY tool, or the various libraries
that are now catching up on this issue?

------
noahl
Just a nitpick, but the article makes a mathematical mistake. In the last
paragraph of section 1.2, it says

    
    
        For performance reasons, it is crucial to allocate as
        little memory as possible for runLen, but still enough to
        store all the runs.  *If the invariant is satisfied by all
        runs, the length of each run grows exponentially (even faster
        than fibonacci: the length of the current run must be
        strictly bigger than the sum of the next two runs lengths).*
    

However, fibonacci growth is strictly faster than exponential. In fact, this
is why n * log(n) is the lower bound on the number of comparisons a
comparison-based sorting algorithm must use: because n! is approximately n *
log(n).

~~~
ithinkso
None of this is true.

 _> fibonacci growth is strictly faster than exponential_

There is even an explicit formula for F(n) that shows fibonacci growth is
_exactly_ exponential[1]

 _> In fact, this is why nlog(n) is the lower bound on the number of
comparisons a comparison-based sorting algorithm_

It's true that the lower bound for comparison-based sorting is nlog(n), but it
has nothing to do with Fibonacci

 _> n! is approximately nlog(n)_

Stirling's formula[2] says otherwise (n! grows even faster than exponentially;
it's log(n!) that is approximately n*log(n))

[1] [http://en.wikipedia.org/wiki/Fibonacci_number#Closed-
form_ex...](http://en.wikipedia.org/wiki/Fibonacci_number#Closed-
form_expression)

[2]
[http://en.wikipedia.org/wiki/Stirling%27s_approximation](http://en.wikipedia.org/wiki/Stirling%27s_approximation)

~~~
noahl
How embarrassing! It looks like I flipped around Fibonacci and factorial in my
head when writing my reply.

You're completely right that the comparison-based sorting bound has nothing to
do with Fibonacci numbers. And in fact log_2(n!) is about n*log_2(n), which is
where the comparison-based sorting bound comes from. And Fibonacci growth is
absolutely exponential, with base phi ~= 1.618.
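
Both points are easy to check numerically (a quick sketch; the cutoff of 30 Fibonacci terms just keeps double-precision floats exact enough for round() to recover the integers):

```python
import math

# Fibonacci is exactly exponential: F(k) = round(phi**k / sqrt(5)).
phi = (1 + math.sqrt(5)) / 2
fib = [0, 1]
for _ in range(28):
    fib.append(fib[-1] + fib[-2])
assert all(round(phi ** k / math.sqrt(5)) == fib[k] for k in range(30))

# Stirling: log2(n!) is about n*log2(n) for large n (within ~10% at n = 10**6).
n = 10 ** 6
log2_fact = math.lgamma(n + 1) / math.log(2)   # log2(n!)
assert abs(log2_fact - n * math.log2(n)) / (n * math.log2(n)) < 0.1
```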

