Hacker News new | comments | ask | show | jobs | submit login
A beginner's guide to Big O notation (2009) (rob-bell.net)
312 points by g4k on May 5, 2016 | hide | past | web | favorite | 65 comments

On the topic of efficient algorithms, I recently read a nice note in an Algorithms textbook -

> It would appear that Moore’s law provides a disincentive for developing polynomial algorithms. After all, if an algorithm is exponential, why not wait it out until Moore’s law makes it feasible? But in reality the exact opposite happens: Moore’s law is a huge incentive for developing efficient algorithms, because such algorithms are needed in order to take advantage of the exponential increase in computer speed.

Here is why. If, for example, an O(2n) algorithm for Boolean satisfiability (SAT) were given an hour to run, it would have solved instances with 25 variables back in 1975, 31 variables on the faster computers available in 1985, 38 variables in 1995, and about 45 variables with today’s machines. Quite a bit of progress—except that each extra variable requires a year and a half’s wait, while the appetite of applications (many of which are, ironically, related to computer design) grows much faster. In contrast, the size of the instances solved by an O(n) or O(n log n) algorithm would be multiplied by a factor of about 100 each decade. In the case of an O(n2) algorithm, the instance size solvable in a fixed time would be multiplied by about 10 each decade. Even an O(n6) algorithm, polynomial yet unappetizing, would more than double the size of the instances solved each decade. When it comes to the growth of the size of problems we can attack with an algorithm, we have a reversal: exponential algorithms make polynomially slow progress, while polynomial algorithms advance exponentially fast! For Moore’s law to be reflected in the world we need efficient algorithms.

In the quote, O(2n) and O(n6) are meant to be O(2^n) and O(n^6), right?

Yup, that's correct. Sadly, I can't update the comment now :(

Looks like someone finally got it to the front page:


Looks like the asymptotic complexity of getting this article to the frontpage is pretty high.

It's all about putting the year on there.

> O(N) describes an algorithm whose performance will grow linearly and in direct proportion to the size of the input data set

Nitpick, but this is technically wrong. There needs to be a "no faster than" somewhere in there. Merge sort is, for example, included in O(2^n) by the actual definition of big O (vs big theta, which is defined as a tight bound). So there exist algorithms that are O(N) that don't grow linearly.

f(x) is O(g(x)) just means there is some constant K such that f(x) <= Kg(x) for all x beyond some value. This has nothing to do with 'worst case performance'.

You'll be a lot happier if you mentally replace O(n) with Theta(n) every time you see it, rather than feeling the need to nitpick when you know that that's what people mean when they say O(n), 99% of the time.

Problem is that Theta(n) is an almost useless notation.

Thus, most times people say O(n) on practice[1], they mean O(n). But most times people define O(n), they cite the Theta(n) definition.

It isn't hard to put an "up to", "no more than" or "or less" at the definition, and it does not make it threatening or hard to read.

[1] There's probably a small corner in Hell reserved for people that mean O(n) on average but refuse both to say that qualifier and care about what data distribution created that average.

There seems to be a misconception here that this has something to do with best/average/worst case input. "f(x) is O(g(x))" is a purely mathematical statement about two functions, and means that f(x) is smaller than or equal to (K times) g(x) for all x beyond some threshold. I.e. O(g(x)) is a set of functions that includes f(x).

When we say "quicksort is O(g(n))" for some g we also need to specify, or assume, the distribution of the input (i.e. best/average/worst case), and we are free to assume whatever distribution we like. Then, assuming that distribution, the running time of the algorithm is just a function of the input size, and we can use the above definition to make a statement about that function (i.e. what complexity class it is in).

A complete statement looks like "quicksort is O(n^2) in the worst case", which means "for every n, pick the list of length n that takes the longest to sort (i.e. the worst-case input), and the number of steps executed by quicksort will be no more than Kn^2 as long as n is large enough, for some fixed K".

The difference between O, Theta and Omega only concerns what functions the set contains - in particular, Theta(g(x)) is a subset of O(g(x)). This is a separate issue to whether we are talking about best/average/worst case input.

I disagree.

In my experience, outside of technicality-penis-measuring contests (like this thread, or a few particular minutes during the class I was leading a few hours ago), every time someone says O(foo), they're describing an algorithm that they believe to be Theta(foo). On the other hand, I've never seen someone _define_ big-Oh and accidentally define big-Theta, which is unsurprising, given that the big-Theta definition is literally twice as much work to write out as the big-Oh.

Relatedly, Theta is a highly useful concept, except for that fact that it's much easier to type O(n) and also mildly faster to whiteboard it (relative to Theta(n)). Why do you think it's almost useless? Mind you, I don't think it's useful enough to care about the fact that most non-specialists can't keep track of the difference between these complexity classes.

Also, I question your claims about Hell, and what I take to be your implication that you disapprove of both saying using big-Oh to refer to average-case without saying so, and of saying "average" without specifying a distribution or an aggregation function.

To the former, that's how people use the language, and meaning average-case is 100% as reasonable as assuming that big-Oh always means worst-case. Big-Oh is a system for dividing arbitrary functions into nested equality classes (well, I guess maybe the right name for this is inequality classes?), and <<average-case resource consumption vs input size>> is just as useful an input function as <<worst-case resource consumption vs input size>>. There's no technical reason, and no pragmatic reason, to insist otherwise.

As for distribution, if someone doesn't specify distribution, it means they're implying that it's true across all the reasonable distributions that they can think would apply, just like it always means something like that when a human leaves out a detail. Every single time I've looked into a case like this, the assumed distribution is a uniform random distribution of all possible inputs. There are always things left unsaid, why claim that this particular elision is a sin?

What do you mean? Why's Theta(n) useless?

Do you mean O(n^2)? Because merge sort certainly won't be exponential.

No, I mean O(2^n). O(2^n) includes everything that grows slower than 2^n as well. It is no relation to the worst case performance of merge sort, just a clarification on the definition of Big O.

> Big O specifically describes the worst-case scenario, and can be used to describe the execution time required or the space used (e.g. in memory or on disk) by an algorithm.

From higher up in the article. You picked the O(N) section description to pick a nit with the whole article when it covers exactly what you want it to earlier.

That's not what the grandparent is getting at. Big O describes an upper bound (and often, but not necessarily, refers to the worst-case scenario).

I like this page http://bigocheatsheet.com/

Espcially the graph which hammers home how quickly things go wrong.

That table is so wrong it makes me mad.

In what way? Having skimmed it, I see a couple of minor mistakes but the bulk of it seems correct.

The chart says you can search a Cartesian tree in O(log n) expected time, but I don't see how that's possible given that there's no ordering between a node's children.

Also, it's possible to do mergesort in O(1) space; in fact, it's one of the exercises in TAOCP, if I recall correctly. But the algorithm is so complicated that nobody uses it in practice.

The article is mostly good, but I have one nitpick.

> An example of an O(2^N) function is the recursive calculation of Fibonacci numbers

No, the naive recursive evaluation of the Fibonacci sequence has complexity O(1.618^N) (or just O(Fib(n)). It is unequal to O(2^N) because the base of the power is different.

To be fair, as Lxr points out below, big O notation is an upper bound on the growth rate of a function, and since 2^N grows faster (asymptotically) than (1.618...)^N, O(Fib(N)) <= O(1.618...^N) <= O(2^N) (with some abuse of notation), and the article is technically correct. You may be thinking of big theta notation, which requires both an upper bound and lower bound.

This example is helpful: https://en.wikipedia.org/wiki/Big_O_notation#Use_in_computer...

You are right. I made a mistake, not thinking through big O vs. theta notation. Thanks.

Technically 1.618^N is still O(2^N) but I agree with your point.

I was about to correct you until I realized you used big-O instead of big-theta notation.

Informally we tend to use big-O to mean big-theta which only adds to the confusion.

Actually we don't use it informally as big-Theta, because big-Theta assumes that the lower and upper bound are asymptotically the same.

For example Quicksort is O(n^2) but Omega(nlogn) it is neither Theta(nlogn) nor Theta(n^2).

You probably meant that informally we just assume that the stated bound is as tight as possible.

> For example Quicksort is O(n^2) but Omega(nlogn) it is neither Theta(nlogn) nor Theta(n^2).

No, this is flat out wrong. Big O is an upper bound on an asymptotic growth rate. Big Omega is a lower bound on the asymptotic growth rate. Big Theta is a tight bound on the asymptotic growth rate. These are independent to the average case, best case, or worst case run time of a given algorithm.

Quicksort has an average run time of Theta(n lg n). Equivalently, its average run time is O(n lg n) and Omega(n lg n). It has a worst case run time of Theta(n^2). Equivalently, its worst case run time is O(n^2) and Omega(n^2).

> Actually we don't use it informally as big-Theta,

Wrong again.

You can state that quicksort has run time O(n^2). The function you're describing with O(n^2) is f : S -> R, where S is the set of input values and R is the set of real numbers, and f(x) is the running time of running quicksort on input value x, and n : S -> N is the size of input value x. The notation O(g(n)) describes the function f in terms of the function n.

That f is in O(g(n)) means there exist constants C and N_0 such that for all x in S, provided that n(x) >= N_0, it is true that |f(x)| <= |C g(n(x))|.

And that f is in Omega(g(n)) means there exist C > 0 and N_0 such that for all x in S, provided that n(x) >= N_0, it is true that |f(x)| >= |C g(n(x))|.

Perhaps it is you that need to be educated.

They are independent only if you explicitly state that you are computing them for best/average/worst case independently. Is it commonly done? Sure. But there is nothing in the definition that forces us to do that.

If the things were as you state would you have any use for Omega and Omicron then? Wouldn't just Theta suffice?

> But there is nothing in the definition that forces us to do that.

That's true. You're right.

> If the things were as you state would you have any use for Omega and Omicron then? Wouldn't just Theta suffice?

Couldn't there be cases where we don't have a known tight asymptotic bound but do have an upper and/or lower bound? And although it's an abuse of notation, you do often see big-O used in place of Theta. From CLRS:

"In the literature, we sometimes find O-notation informally describing asymptotically tight bounds, that is, what we have defined using Theta-notation."

My interpretation is that "O(...)" informally means "Theta(...) in the average/amortized case". In the average case, Quicksort is Theta(n log n).

Common mistake. When people say O(2^n) they USUALLY mean something closer to 2^O(n).

The reason is that f(n) = O(g(n)) means that for some constant k and integer N, if N < n then |f(n)| < kg(n). In other words "grows no faster than a constant times". However when you've got exponential growth the slightest of things can cause the exponent to be slightly different, or there to be a small polynomial factor.

That was the case in his example. (The exponent was different.)

The best layman's example of O(log N) I've heard is finding a name in the phone book. You open it around the middle, select which half gets you closer, and repeat. It's then easy to "get" why it's not a big deal if it's a huge phone book versus a tiny phone book.

That seems more like a layman's explanation of a binary search than O(log N). Binary search just happens to be O(log N)

Binary search is an example of a O(log N) algorithm, as OP said. Having tangible examples is great for explaining algorithmic complexity to people.

One I've used is sorting with a deck of cards. Take a single suit, A-K, and have people sort it using bubble or insertion or other sorts. It's only 13 cards, but the difference even between some of the O(n^2) algorithms becomes obvious when swapping is postponed until the latest stage versus earliest.

Then you can also demonstrate radix sort. One of the most underrated sorting algorithms, in my opinion. A lot of people focus on the (potentially) large constant. But it can be appropriate, or may be reduced depending on your use-case. And it's a fantastic sort when you can't give a neat numeric quality for sorting and have to sort over multiple dimensions. Again with cards, you can do two passes. One with buckets by value, and one with buckets by suit. After the second, you collect each of the 4 stacks and the deck is sorted.

Cards, phone books, recipes, these are things that most people encounter (ok, maybe not phone books anymore), in their lives. They offer great aids to demonstrating CS concepts to either new students or people who have only a passing curiosity (or none but they shouldn't have asked :P).

I'm not sure how drawing that distinction helps a layman.

Clarifying something as an example of a principle instead of the principle itself seems pretty important, regardless of lay status.

Ah, when you phrase it that way I see what you mean, I agree.

A phone book search does O(log N) string comparisons, but its running time is not O(log N) unless you consider all string comparisons to take a constant amount of time. Because for the names in the phone book to be distinct, the strings have to be about log N in size, so each string comparison costs O(log N) in the worst case. So the cost of phone book search is O((log N)^2).

I've skipped some details here: string comparisons can finish early, and maybe N should be the size of the phone book rather than the number of names in the phone book -- but even if you consider these factors, the runtime complexity is still more than O(log N).

Thats not a good example for Big O because number of operations computes exactly to logN in worst case.

Try with mergesort and see how it ends being nlogn

Isn't O(log n) just an upper bound limit ? So isn't it more correct to say, that in worst case the number of steps are < O(log n) ?

Big-O describes an upper limit in itself. Something that is O(N) is also O(2^N). No less-than needed, it's included in the definition of big-O.

Yes, but context matters; Big O is often used informally. The interviewer's going to look at you weird when you tell him the O(N) algorithm is O(2^N).

We have the same problem with things being "in NP". It's clear that all of P is in NP, but not clear whether all of NP is in P. But people often say "NP" with the implication of "(apparently, as far as we know) not in P", which is not actually correct based on the meanings of those terms.

(In this case the problem is probably made worse by the mental interpretation of "NP" as "Not Polynomial", when it really means "Nondeterministic machine can solve in Polynomial time", and if a deterministic machine can solve something in polynomial time, a nondeterministic machine can do so as well!)

A phone book? How is Amazon's Kindle App O(log N)? ;-)

Nice! Very succinct and clear. I wrote an intro myself, as someone who'd recently learned the concept. Wordier, but might be helpful if you think like I do. http://nathanmlong.com/2015/03/understanding-big-o-notation/

Super interesting comparing this to the one we have on Interview Cake: https://www.interviewcake.com/article/big-o-notation-time-an...

I like how Rob Bell's piece has headings with the most common time costs--sorta makes it easier to see at a glance what you're going to get, and I imagine gives folks a quick sense of, "Right, I've seen that! Been /wondering/ what that means."

I think the nicest way to learn this if you don't have a formal CS education is to still pick up a Discrete Math text book (like Chapter 3 in Rosen) and then read chapters 3 and 4 in CLRS.

If something puzzles you, don't be afraid to start drawing yourself a diagram. O(n^2) vs. O(n) amortized for naive string concatenation vs. growing a buffer by doubling:


This can be a good way of getting an intuitive feel for what's going on. Note in the 2nd case, you can stack all of 1st two rows representing cost into the 3rd row. In fact, you can always stack the execution costs into row (n-1) and never exceed the size of row n.

Looks like you're alluding to the recurrence tree method for solving recurrences mentioned in CLRS. And you're right, it's a great way to visualize and get an intuitive feel of big-oh

It would be sensible to put the terms like "constant" and "logarithmic" in there, IMO.

Is there a beginner's guide to proving different time complexities?

> O(N) describes an algorithm whose performance will grow linearly and in direct proportion to the size of the input data set.

Argh, I hate this every time I see Big O notation covered.

Big O != performance. If you have an O(N) algorithm that walks the data set in the order that it's laid out in memory(assuming contiguous data) it will beat your O(NlogN) and sometimes even O(logN).

[edit]meant to omit nlogn, that's what I get for any early morning rant pre-coffee.

Radix Sort is the classic example I always bring up. On machine word size keys, with a separate pointer look-up table(to get final value) it will beat QSort, MergeSort and all the other NlogN sorts by 10-50x. This includes having to walk the data 3-4 times depending on how you want to split the radix to line up with cache sizes.

Friends don't let Friends use Big O to describe absolute performance.

When talking about theory, I have two points: - First: O(N) < O(N log N), so your example isn't good - Second: Your description of Radix sort is actually O(N), and the standard sorting algorithms are O(N log N), so it again isn't a surprise, theoretically speaking...

The thing that Big-O knowingly hides is the constant factor. For example, some algorithms are 2N and some are 1000N. For large enough N, this may not matter, but for any reasonably sized N, it does.

In practice, of course that Big-O is not sufficient to describe absolute algorithm performance, as it depends on so many factors such as implementation, computer architecture, relative size of data, etc. But, when looking at an algorithm and knowing its Big-O complexity can often be a useful input to your decision on whether to use it. So I disagree with this Big-O bashing...

> If you have an O(N) algorithm that walks the data set in the order that it's laid out in memory(assuming contiguous data) it will beat your O(NlogN)

Not sure what you're trying to say with this particular statement. Of course it will. NLogN grows way faster than N.

And Radix sort is of course drastically faster than comparison sorts if the word size is smaller than logN regardless of the particularities of the implementation.

People don't and shouldn't use big O to describe absolute performance but it's a great place to start and you can't reach the level of implementation details performance tuning if you can't understand or formalize the basics.

But you do have a point overall.

In practice, performance may be different. Big O assumes a particular 'virtual machine' with a particular word size and a particular time it takes to execute an instruction.

Yeah, meant to omit NlogN there, I blame it on lack of coffee.

My meta point is the constant factors that Big O throws out usually end up dominating real-world performance.

That's why some Data Stuctures courses use tilde notation instead (same as big O, but the constant is preserved).

In big-O senses Radix sort is one of those more persistent misconceptions. At scale Radix sort has to be nlog(n) because you should really consider the possibility all your numbers are unique.

This does not even start on the reality that memory access really is O(n^(1/3)), addition is O(n) and a lot of other things we assume are O(1) are not.

The statement you quoted is correct. It doesn't say "O(log(N)) algorithms are faster than O(N) algorithms." It says performance of a particular algorithm degrades linearly as data set grows, if that algorithm is O(N). Which is true. (asymptotically)

I was nitpicking the performance part of the statement. N does scale linearly.

The point I was trying(and failed) to drive home is that it matters how you walk your N not just what the N is.

If you've got to do a pointer indirection for each N rather than a linear read that can end up dominating the non-constant factors in a lot of cases.

You're not understanding what I'm saying. It's of course true that some O(N) algorithms are slower than other O(N) algorithms. But the statement you quote doesn't imply the contrary. Obviously walking an array can be thousands of times faster than walking a linked list. That doesn't change the fact that walking an array of size 3*N is 3 times slower than walking an array of size N, which is (roughly) what big-O notation means and what the quote you're objecting to correctly claims.

> If you've got to do a pointer indirection for each N rather than a linear read that can end up dominating the non-constant factors in a lot of cases.

Huh? Pointer indirection and cache miss for each element is already non-constant (you have to do it N times) so I don't understand what it means that this "can end up dominating the non-constant factors".

Not really.

Take the linked-list case for example. Let's say the allocator you have is a small-block allocator, of which most good malloc implementations use for linked-list style allocations.

Since the SBA allocates in chunks your cache locality(driving your "constant-factor" performace) isn't linear any more. It'll change in jumps & starts as you break chunk boundraries. Also another class allocating in-between your linked list allocations(really common) will break up the chunk locality and now your real-world O(N) performance no longer scales linearly.

You're overlooking the word "grow".

There's nothing wrong with that sentence. No absolutes.

Of course an O(N) method will beat an O(NlgN) method, because O(N) < O(NlgN).

Radix sort is a linear time sort because the hardware (random access memory) is built to allow constant-time lookup of memory. Radix sort wouldn't be linear time on a sequential access memory system, like tape, but merge sort would still be O(NlgN) on such a system.

It doesn't invalidate what you say, but you're assuming there's a bounded number of different values in the array -- 2^k where k is your word size.

Under this assumption, a correctly implemented quicksort also runs in O(N) time.

>By one estimate, the health care costs of obesity are responsible for nearly 21 percent of total health care spending in the U.S.

BigO has nothing to do with performance even though that what we most use it for.

Its simply a classification of growth characteristics, relationship between two functions.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact