Hacker News
Mathematically analyzing sorting algorithms (peter-eigenschink.at)
73 points by eigenschinkpete on Jan 22, 2012 | 14 comments

Like many self-taught programmers without any formal education in computer science, I've always been a bit mystified by analysis of algorithms. Anything beyond counting the iterations of loops seems to turn into a morass of notation and obscure analytic techniques—ignorance of which has never kept me from writing code that runs, well, fast enough.

This article, though, was a nice walk-through that reminded me that this stuff is probably quite straightforward if I would just sit down and give it a shot. It got me to read the Wikipedia page on the Master theorem [1], which seems easy to apply and genuinely useful for figuring out what's going on in complex recursive code. I think I'll finally click "buy" on Cormen's book [2], which has been sitting in my Amazon shopping cart for years.

1. http://en.wikipedia.org/wiki/Master_theorem

2. http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&...

The master theorem is a blight on the teaching of analysis of algorithms. The main reason is that few students understand what makes it work and end up treating it as a piece of plug-and-chug equipment. The second reason is that it is not very general.

The alternative? Learn to love recursion trees. It addresses both issues. It's intuitive and it's general. As a side effect, the master theorem becomes an almost obvious fact; the three different cases correspond to geometric sums with increasing, decreasing and constant terms, with the caveat that since we are working asymptotically, increasing/decreasing/constant must be interpreted in the looser asymptotic sense. A geometric series with increasing (decreasing) terms is dominated by its last (first) term, corresponding in the recursion to the last (first) level of the tree. That makes it trivial to estimate the sum: just estimate the cost at the appropriate level and you're done. Only in the constant term case do all the terms play a role, but then you get your estimate by just multiplying the number of terms (the tree depth) by the constant value the terms have in common.
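The geometric-sum picture above is easy to play with numerically. Here's a small sketch (the helper name `level_costs` is my own, not from the article) that lists the per-level cost of the recursion tree for T(n) = a*T(n/b) + n^d; the three Master-theorem cases fall out as increasing, constant, or decreasing level costs:

```python
def level_costs(n, a, b, d):
    """Cost at each level of the recursion tree for T(n) = a*T(n/b) + n**d.

    Level i has a**i subproblems of size n/b**i, each costing (n/b**i)**d,
    so the level cost is a**i * (n/b**i)**d -- a geometric sequence in i
    with ratio a / b**d.
    """
    depth = 0
    m = n
    while m > 1:     # depth of the tree: how many times we can divide by b
        m //= b
        depth += 1
    return [a**i * (n / b**i)**d for i in range(depth + 1)]

# Mergesort-like: a=2, b=2, d=1 -> ratio 1, every level costs n (the
# "constant" case); total is n times the depth, i.e. n log n.
print(level_costs(1024, 2, 2, 1))

# a=4, b=2, d=1 -> ratio 2 > 1: increasing, the last level (the leaves)
# dominates the sum.
print(level_costs(1024, 4, 2, 1))

# a=1, b=2, d=1 -> ratio 1/2 < 1: decreasing, the root level dominates.
print(level_costs(1024, 1, 2, 1))
```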

All but the very simplest recurrences require heavy machinery to find closed forms. If you can't apply the Master Theorem, often your best bet is to guess the answer and prove it by induction; otherwise you quickly end up with generating functions and differential equations.

At a high level it's not really that much more complex (conceptually) than counting iterations and trying to find closed-form equations to describe the total number of iterations in best/worse/average cases. For example, something that requires one loop that iterates N times for N data points is O(N); something that requires two nested loops, each of which iterates N times, to process the same N data points, is O(N^2), because the inner loop will execute N^2 times, which is clearly worse than two loops in succession, i.e. 2N iterations, which is just O(N). The rest is mostly working out clever schemes to actually quantify more complex iteration (or recursion) patterns, or to find ways of accounting for infrequent expensive operations, like re-hashing ("amortization" basically averages them out across all the intervening steps), among other twists.
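The loop-counting point is easy to make concrete. A quick sketch (function names are mine, just for illustration) that literally counts iterations for the three shapes mentioned:

```python
def count_single_loop(n):
    # One loop over n items: n iterations -> O(n).
    count = 0
    for _ in range(n):
        count += 1
    return count

def count_nested_loops(n):
    # Two nested loops over n items: the inner body runs n*n times -> O(n^2).
    count = 0
    for _ in range(n):
        for _ in range(n):
            count += 1
    return count

def count_sequential_loops(n):
    # Two loops in succession: 2n iterations, which is still just O(n).
    count = 0
    for _ in range(n):
        count += 1
    for _ in range(n):
        count += 1
    return count

print(count_single_loop(100))      # 100
print(count_nested_loops(100))     # 10000
print(count_sequential_loops(100)) # 200
```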

On the other hand, results like "n log n is a lower bound for comparison sorts" can't easily be reduced to counting your for-loops.

There is a (fairly) simple proof of that, however. You may well be aware of this already, but in case someone hasn't seen it before:

  Each comparison of objects provides one bit of information.
  So k comparisons provide k bits of information, or 2^k possibilities. 
  Therefore to distinguish between n options, we need at least log_2(n) comparisons.
  There are n! ways of ordering a list of length n.
  log(n!) ~ n * log (n) 
  So to sort a list of length n, we need at least log_2(n!) ~ n * log_2(n) comparisons, i.e. Omega(n * log(n)).
It certainly requires more maths than just counting 'for loops', but I find the explanation surprisingly elegant.
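You can also watch the bound in action. A sketch (the `Tracked` wrapper is a hypothetical helper, not a real library class) that counts the comparisons Python's built-in sort makes on a random list and prints them next to the information-theoretic bound; note the bound is on the worst case, but for random input the measured count typically lands near or above it:

```python
import math
import random

comparisons = 0

class Tracked:
    """Wraps a value and counts every < comparison made on it."""
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        global comparisons
        comparisons += 1
        return self.v < other.v

random.seed(0)
n = 64
data = [Tracked(v) for v in random.sample(range(1000), n)]
sorted(data)  # list.sort()/sorted() use only __lt__

# Any comparison sort needs at least ceil(log2(n!)) comparisons
# in the worst case, since it must distinguish all n! orderings.
lower_bound = math.ceil(math.log2(math.factorial(n)))
print("comparisons used:", comparisons)
print("information-theoretic bound:", lower_bound)
```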

Information-theoretic lower bounds are neat. Here's an even simpler example. In the decision tree model of computation (where decision nodes have constant degree), consider the problem of finding the value corresponding to a key in a dictionary with n elements. Each leaf is labeled by a value, and in the general case where all values are distinct there are n leaves. So, the tree must have at least one path of length Omega(log n). Note that this is true even when the key space is a contiguous interval like [0..n], where it's just an array indexing problem. That's why the RAM model of computation is strictly more powerful than the decision tree model.

> log(n!) ~ n * log (n)

Random aside: while this is usually proven with Stirling's approximation, there is a much simpler proof. For the upper bound, n! <= n^n because 1 <= n, ..., n <= n. For the lower bound, n! >= (n/2)^(n/2) because you can drop the first n/2 terms without increasing the product and estimate the last n/2 terms as being >= n/2. That gives the double inequality (n/2)^(n/2) <= n! <= n^n. Take logs and you get (n/2) log(n/2) <= log(n!) <= n log(n). What makes this proof tick is that while (n/2)^(n/2) is a really shitty lower bound on n!, the shittiness evaporates once you take logs.
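The double inequality is cheap to sanity-check numerically, computing log(n!) directly as a sum of logs (a quick sketch, my own function name):

```python
import math

def log_factorial(n):
    # log(n!) = log(1) + log(2) + ... + log(n), summed directly to
    # avoid overflowing with huge factorials.
    return sum(math.log(k) for k in range(1, n + 1))

for n in (10, 100, 1000):
    lower = (n / 2) * math.log(n / 2)   # from n! >= (n/2)^(n/2)
    upper = n * math.log(n)             # from n! <= n^n
    lf = log_factorial(n)
    print(f"n={n}: {lower:.1f} <= log(n!)={lf:.1f} <= {upper:.1f}")
```

As the proof predicts, the bounds sandwich log(n!), and both ends grow like n log n.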

If you want a bible to demystify algorithms (and other stuff too) for you, try picking up a copy of Knuth's Art of Computer Programming.

With this (my first) article I intend to give a little introduction to the analysis of algorithms, to show why it is necessary and how it basically works. There are many algorithms out there that solve the same problem, and every programmer should know why to use one over another in a particular situation. As this was my first article, it should also be an opportunity for readers to tell me which aspects of the analysis should be highlighted and to give me some feedback. Thank you anyway for the comment; the next post will be concerned with a more specific topic.

I've got to admit, I don't understand why this article exists. If you're a beginner, the math is too complicated. If you're anything more than a beginner then you already know that bubble sort is O(n^2) and quick sort is O(n log n) and you probably know how to derive it. Who is this aimed at?

I know the complexities at an intuitive level, but I can't derive them, even after reading this article.

As a fellow Austrian I don't intend to criticize the author too much (;. It's an okay article, but nothing special; to be honest it feels like the introductory section of the first chapter of an algorithms book.

What's bugging me though, is that lately quite a few of those "okay but not Hacker News material" articles made it to the top.

It seems that esp. the words mathematics and algorithm are enough for an article to appear in the top ten. I wonder why that is ...

This article missed the last and most interesting bit: we know that Quicksort is O(n log n) in the best case and O(n^2) in the worst case; what is it like on average? That is, what is its performance averaged over all possible inputs?

I can see why it was left out: the proof that Quicksort is O(n log n) on average is much more complicated than anything in the article. (So, of course, my algorithms professor thought it would be a great homework problem. Cute.)

That said, the fact that it is O(n log n) on average is actually well known; it's just the step from where the article left off to the conclusion that's missing. Wikipedia has an outline of the proof, if anybody is interested: http://en.wikipedia.org/wiki/Quicksort#Average_complexity
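Short of doing the full proof, you can at least measure the average. A sketch (function names are mine) that counts the comparisons of a textbook first-element-pivot quicksort over many random inputs and checks the measured average against the known exact average 2(n+1)H_n - 4n, which is asymptotically 2n ln n:

```python
import math
import random

def quicksort_comparisons(a):
    """Number of element comparisons made by a simple quicksort that
    always picks the first element as pivot."""
    if len(a) <= 1:
        return 0
    pivot, rest = a[0], a[1:]
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]
    # Partitioning conceptually costs one comparison per non-pivot element
    # (the two comprehensions just split the work into two passes).
    return len(rest) + quicksort_comparisons(less) + quicksort_comparisons(greater)

random.seed(1)
n, trials = 200, 200
avg = sum(quicksort_comparisons(random.sample(range(10**6), n))
          for _ in range(trials)) / trials

# Exact average for this variant on a random permutation: 2(n+1)H_n - 4n.
harmonic = sum(1 / k for k in range(1, n + 1))
expected = 2 * (n + 1) * harmonic - 4 * n
print(f"measured average: {avg:.0f}, exact average: {expected:.0f}")
```

On random permutations the first-element pivot behaves like a uniformly random pivot, which is why the closed-form average applies.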

The bubble sort has an off-by-one error: The outer loop has i go from 1 to A.length and then references A[i+1].
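For reference, a version with the loop bounds fixed (a Python sketch rather than the article's pseudocode) would bound the inner index so that the `a[j + 1]` access never runs past the end:

```python
def bubble_sort(a):
    """In-place bubble sort. The inner loop stops at n - i - 1, so
    a[j + 1] is always a valid index; after pass i, the largest i + 1
    elements are already in their final positions at the end."""
    n = len(a)
    for i in range(n - 1):
        for j in range(n - i - 1):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]
```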
