The Myth of RAM (2014) (ilikebigbits.com)
593 points by ddlatham on Aug 29, 2016 | hide | past | favorite | 277 comments

It so happens that a large part of my PhD was on this very subject. The result I got was N log(N); this becomes more visible with larger RAM (I had 0.5 TB of RAM at the time). We have an empirical result, a justification, and a rigorous predictive model.

The reason has to do with hashing, but of a different kind: the TLB.

I posted more details at https://news.ycombinator.com/item?id=12385458

Thanks, I came across your research before and thought it was quite cool. In Section 8.5, when discussing whether hash tables would be suitable for handling TLB misses, wouldn't denial of service attacks also be a concern? If an attacker knows the hash function and can control the addresses being looked up, they might be able to trigger worst-case behaviour on every lookup, couldn't they?

That's addressed in that very section:

> Moreover, an adversary can try to increase a number of necessary rehashes.

It seems to me that the section is a bit too dismissive, though, as there are hash tables and hash functions that mitigate these concerns. In particular, collisions can be replaced with trees, like in Java, limiting the worst case to O(log n) again.

Rehashes and bad probing behaviour aren't quite the same thing, but I'll let it count and admit I may have replied too quickly ;)

The problem with this analysis is that in the graph in the very first part he shows that memory access IS O(1) for pretty substantial scaling factors, and then when you hit some limit (e.g. size of cache, size of RAM) access times increase very rapidly. Sure, if you draw a line across 6 orders of magnitude, it ends up looking like O(n^1/2), but how often do you scale something through 6 orders of magnitude?

The "memory access is O(1)" approximation is pretty good, certainly good enough for almost all everyday use. The median size of a hash table I allocate definitely fits in L1 cache, so why shouldn't I think of it as O(1)? If you are reading off of disk, the O(1) approximation holds as long as your dataset stays between 1 MB and 1 GB. That's quite a bit of room to play around in.

Yes, you need to be aware of access times and the changes in them if you are really scaling something way up. But I'm not convinced that I shouldn't just keep thinking of "hash access is O(1)" as a convenient, generally accurate shortcut.

It is trivial to exceed the L1 cache size and not that uncommon to exceed L2. That brings us to a 100x delta which is worth thinking about, no? Even exceeding L3 isn't horribly rare for average desktop CPUs (let alone mobile devices).

There are other dimensions to this too like prefetching, streaming, pipelining, vectorization, etc. In some cases using an array and doing a linear search is faster than any hashmap on a modern CPU.

I think the takeaway from this article is not to blindly trust the theoretical big-O numbers for data structures and actually test them with realistic datasets on your target hardware.

This is a very good point. You should run actual performance tests. The results may surprise you!

In my actual work, optimizing in memory searches is not worthwhile because saving half a dozen microseconds on a search is not real valuable when you are about to spend half a dozen milliseconds making a database call (which is technically O(log N), since there's a B-tree index back there). But I do spend quite a bit of time figuring out which of those queries should be moved to an in-memory O(1) cache (redis or memcache, depending).

And I actually never think about the perf tradeoff between the DB and the cache being O(log N) vs O(1) - I think of it in terms of the median and 99th %tile times that I know from monitoring our actual production servers. So as xenadu points out, I guess there is some truth in the article, but it's just sort of lost in this somewhat academic discussion of scaling things to infinity and beyond.

In my actual work, optimizing in memory searches is not worthwhile

Funnily enough, I had to do this last week. We started from the question "why does it take 160ms to scan 60,000 objects in memory?" and reached the disappointing conclusion that, on this particular platform (iMX6/800MHz/DDR3) a cache miss costs you an astonishing 233ns. I got as far as double-checking all the RAM initialisation parameters before giving up.

There is now an index on the in-memory search.
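One way to chase down a suspicion like that is to walk the same data sequentially and in shuffled order and compare wall-clock times. This is only a rough sketch: Python object overhead muddies the absolute numbers, so treat the ratio, not the values, as the signal (the same experiment over a flat array in C shows the effect far more starkly).

```python
# Walk the same array in two orders and time each pass. A shuffled
# order defeats the prefetcher and turns most reads into cache misses.
import random
import time

def walk(order, data):
    start = time.perf_counter()
    acc = 0
    for i in order:
        acc += data[i]
    return time.perf_counter() - start, acc

n = 1_000_000
data = list(range(n))
seq_order = list(range(n))
rnd_order = seq_order[:]
random.shuffle(rnd_order)

t_seq, sum_seq = walk(seq_order, data)
t_rnd, sum_rnd = walk(rnd_order, data)
# Both walks touch exactly the same elements, so the sums must agree;
# only the access pattern (and therefore the time) differs.
print(f"sequential {t_seq:.3f}s, shuffled {t_rnd:.3f}s")
```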

Random percentile abuse. A 99th percentile on its own can be less useful than a median. The real data is the full histogram, plus the maximum time.

99% might mean every 100th customer gets no service.
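For what it's worth, the nearest-rank versions of those summaries are only a few lines (the latency sample here is invented):

```python
# Summarize a latency sample the way the thread suggests: median for
# the typical case, p99 for the tail, max for the absolute worst case.
import math

def pct(xs, q):
    # Nearest-rank percentile over an already-sorted list.
    return xs[max(0, math.ceil(q * len(xs)) - 1)]

def summarize(latencies_ms):
    xs = sorted(latencies_ms)
    return {"median": pct(xs, 0.50), "p99": pct(xs, 0.99), "max": xs[-1]}

# 98 fast requests and 2 slow ones: the median hides the tail, while
# p99 and max expose it.
sample = [1.0] * 98 + [250.0] * 2
print(summarize(sample))  # {'median': 1.0, 'p99': 250.0, 'max': 250.0}
```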

Totally with you on 'benchmark everything'; who knows when the branch predictor will mispredict or a full cache flush will happen. (Interestingly enough, I rarely try to beat the compiler since ICC is super smart, but strategically placed, manually inserted CLFLUSHes have sped up some of my work. Go figure.)

Eh, L1 is bigger than you think. What trashes the cache is context switching. A common technique is shielding processes[1], delegating all interrupts to certain processors. Here's an easy example: with HT, each of the two "CPUs" inside the physical CPU keeps its own architectural state while the execution resources [i.e., the ALU (and the auxiliary AVX registers)] are shared between both. So if you're doing anything computationally intensive with an unpredictable set of interrupts, but your crunch pattern is predictable enough, do the following: set the affinity for the 'cruncher' to one CPU with a full shield, and pin all other components (interrupts, IO, things you don't mind taking cache misses on, etc.) to the rest.

[1] https://rt.wiki.kernel.org/index.php/Cpuset_Management_Utili... This is just one way to do it, but it explains the concept.
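A minimal sketch of the pinning half of that setup from Python, using `os.sched_setaffinity` (Linux-only; this is the same syscall tools like taskset wrap). The interrupt-steering half needs cpusets or IRQ affinity masks and isn't attempted here:

```python
# Pin the current process (the "cruncher") to a single CPU chosen out
# of its currently allowed set, then restore the original affinity.
import os

original = os.sched_getaffinity(0)   # CPUs the process may run on now
target = {min(original)}             # pick one CPU to act as the shield
os.sched_setaffinity(0, target)
assert os.sched_getaffinity(0) == target
os.sched_setaffinity(0, original)    # undo the pinning for tidiness
```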

>I think the takeaway from this article is not to blindly trust the theoretical big-O numbers for data structures and actually test them with realistic datasets on your target hardware.

You have to do this anyway to make sure that the constant factors aren't big enough to change the optimal approach at the size of your dataset. Big-O only really buys you an initial rule to not do things that are obviously wrong.

Right, an algorithm that always produces an answer right after the heat death of the universe is O(1).

Big-O is great for telling you which algorithms/datastructures won't scale, it can't tell you which are actually fast enough.

And in fact, some data structures that scale poorly will perform better on small sets of data.

For example, a simple list of pairs with linear lookup time can be faster than a hash table for small amounts of data, because it doesn't have to compute the hash.
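A quick sketch of that comparison. Note the caveat: CPython caches the hash on str objects, so the dict may still win here; the effect is clearer in languages where the hash is recomputed on every lookup (e.g. a vector scan vs. std::unordered_map in C++):

```python
# Compare a linear scan over a small list of pairs against a dict
# lookup. For a handful of entries the scan avoids hashing entirely.
import timeit

pairs = [("key%d" % i, i) for i in range(8)]
table = dict(pairs)

def scan(key):
    for k, v in pairs:
        if k == key:
            return v
    return None

t_scan = timeit.timeit(lambda: scan("key7"), number=50_000)
t_dict = timeit.timeit(lambda: table["key7"], number=50_000)
print(f"scan {t_scan:.4f}s vs dict {t_dict:.4f}s")
```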

I would say it differently: you absolutely should trust big O numbers for data structures and use that as a major guiding principle in terms of how to structure a program ... and you should require extraordinary evidence, like a surprising result in a performance test, before navigating away from trusting the big O analysis.

What this all says to me is that for an extreme claim (e.g. using an array and doing a linear search will be faster than a hashmap lookup) it should require equally extreme evidence to substantiate it (e.g. the results of performance tests that stand up to heavy scrutiny).

I feel there is a trap in which someone might look at this sort of thing and say, a-ha I don't actually need to care about big O reasoning or standard understanding of data structures at all! That would be a tragic misreading of this sort of example.

> before navigating away from trusting the big O analysis.

I don't think that's the spirit of the article. The spirit is to not assume O(1) is O(1) if you work with a physical computer and care about anything beyond how many times your program counter increments. Make sure to use big O notation that includes the complexities of carrying out those single instructions.

If you work with more than 32k of data, you either need to modify your big O to take latency into account, or you're assuming some worst case cache level, which is fine, and what you're doing with an idealistic big O...but that's a terrible assumption if you're actually trying to achieve speed or realistic metrics.

I agree. That's why the article is wrong-headed in its presentation.

Generally when you begin working on problems where this sort of thing is relevant, you have to make choices to get something going. You want to avoid premature optimization and you need something on the ground. In short, you have to assume O(1) is O(1) ... and it basically always is, except when there's evidence that it's not.

The value of the article is to point out cases when these abstractions break down, and the value of performance testing. But carrying it to an extreme such as, "Never make any assumptions about how any data structures work until you've performance tested every single thing in your application" is of course wildly unproductive.

> using an array and doing a linear search will be faster than a hashmap lookup

For small enough sets of data, this will likely be true, simply because you don't have to compute a hash.

Of course, it likely won't scale very well.

I always thought the Big-O notation was purely to compare one algorithm to another, not actually measure literal, real world performance.

Trying to describe a hardware operation in Big-O notation and then asserting everyone should use something other than O(1) for a memory access operation seems like a gross misunderstanding of its purpose.

Big-O notation is there purely to compare one algorithm to another, yes. But in terms of what?

Say you have two search algorithms. For a dataset of size N, Algorithm 1 does N memory reads and N data compares. Algorithm 2 does N * (log N) memory reads and log N data compares. These are not real algorithms, just illustrations.

It's not uncommon in the analysis of such algorithms (both searching and sorting) to only consider the number of data compares and assume memory reads are completely free. So that would give you O(N) for algorithm 1 and O(log N) for algorithm 2.

Analyses with more sophistication will note that a memory read is not in fact free, and will look at the number of "operations", whether those be memory reads or compares. In terms of the number of "operations", algorithm 1 is O(N) and algorithm 2 is O(N log N).

These are both true statements about these algorithms: Algorithm 2 is O(log N) if you look at number of compares and O(N log N) if you look at number of "operations". Which of these characterizations is more relevant to you? Depends on whether memory reads are actually free in your setup.

Anyway, the original article is arguing that for the comparison most people care about, wall clock time, even the more sophisticated version is wrong, because it assumes that the time needed per operation does not depend on N, so you can just count all the operations and not worry about their relative speed, because that's just a constant factor. If the time taken for a memory read _does_ depend on N, then you can't do that. For example, if you want to look at the algorithmic complexity of these algorithms in terms of clock cycles, and a compare is one clock cycle (might not be, depending on what your data is!) and a memory read is sqrt(N) clock cycles (due to caching effects), then the complexity of algorithm 2 will be O(N * sqrt N * log N), whereas algorithm 1 is O(N * sqrt N).

Of course once you're caring about actual time very often constant factors start to matter too, and the whole idea of doing asymptotic analysis might break down. But it might not. It really depends on your exact problem and on the assumptions underlying the asymptotic analysis...
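To make that concrete, here are the two toy algorithms costed under the "reads cost sqrt(N) cycles" model; the formulas simply transcribe the operation counts given above:

```python
# Cost the two illustrative algorithms in "clock cycles" when a
# compare costs 1 cycle and a memory read costs sqrt(N) cycles.
from math import log2, sqrt

def algo1_cycles(n):
    # N memory reads at sqrt(N) cycles each, plus N compares.
    return n * sqrt(n) + n

def algo2_cycles(n):
    # N*log(N) memory reads at sqrt(N) cycles each, plus log(N) compares.
    return n * log2(n) * sqrt(n) + log2(n)

# Under the flat-memory model algorithm 2's O(log N) compares look
# unbeatable; once reads cost sqrt(N), algorithm 1's N*sqrt(N) total
# undercuts algorithm 2's N*sqrt(N)*log(N) at every large N.
for n in (2**10, 2**20, 2**30):
    print(n, algo1_cycles(n) < algo2_cycles(n))
```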

> It's not uncommon in the analysis of such algorithms (both searching and sorting) to only consider the number of data compares and assume memory reads are completely free.

Can you link to an example? That seems plainly wrong and I haven't seen an analysis which makes that assumption.

https://en.wikipedia.org/wiki/Binary_search_algorithm#Perfor... is a simple example. It takes great pains to talk about the number of comparisons per iteration, but never once considers the number of memory accesses per iteration (which happens to be 1 or 2, depending on how you count things).

In fact, most of the performance analysis of algorithms on Wikipedia that I've seen look like this. Similar for introductory algorithms textbooks, if I recall correctly (which I may not; it's been a while since I looked at one of those).

Now in practice the number of memory accesses in an algorithm like this is typically O(1) in the number of comparisons so an implicit assumption that all your memory accesses are O(1) means just adding a constant factor to your overall complexity, so it's OK to just not worry about the whole thing. And counting "memory accesses" is complicated because whether something is a "memory access" or not is a bit tricky. For example, in the binary search case, do you need to "memory access" the value you're searching for before every compare, or can you assume it's stored in O(1) storage like a register? But all of this falls down if your memory accesses are not in fact O(1).
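A quick way to see the "1 or 2 per iteration" claim is to instrument binary search and count both quantities (the counter names here are made up for illustration):

```python
# Binary search that counts compares and array reads per lookup.
# Standard analyses report the compare count and treat each a[mid]
# read as free; here we tally both.
def binary_search(a, x):
    compares = reads = 0
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        v = a[mid]
        reads += 1              # one memory access per iteration
        compares += 1
        if v == x:
            return mid, compares, reads
        compares += 1
        if v < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, compares, reads

idx, c, r = binary_search(list(range(1024)), 700)
# reads grows like log2(N) right alongside compares: if each read
# really is O(1), ignoring them only changes a constant factor.
print(idx, c, r)
```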

There is a complementary notation for additional memory use.

Any engineer worth their salt actually specifies basic operations separately. This includes comparisons, fused or normal multiplies and additions, divisions if applicable, and memory loads and writes.

Even then you can get fooled due to locality.

Specifically, Big-O describes the relative performance of an algorithm as the input grows. That's why constants are discarded, and that's why O(n^2) is probably fine if you only ever have very small data sets.

> purely to compare one algorithm to another, not actually measure literal, real world performance.

Estimating real world performance is the whole point of Big-O notation. Trying to solve the traveling salesman problem on all the stars in the Milky Way is something your PC isn't going to do in your lifetime, for instance.

If you ignore the non-Euclidean effects of general relativity, there are very good polynomial-time approximation algorithms.

This right here is the most relevant response to the article.

I agree with aaronbwebber that for general use, and for teaching, it's far easier to assume O(1) memory access. It's a decent approximation, and works until you start having applications where loads/stores dominate. It also falls apart when things like multi-threaded apps start spinning on locks, but let's not go there in an HN post; you could make an entire blog out of memory issues.

Here's why we should continue to use O(1) access in general (unless you enjoy rat-holes; I wouldn't have a job without them... so, why not). If we really wanted to, for every algorithm we could consider things like:

1) the prefetch algorithm (assuming you know which one will be chosen)
2) associativity
3) cache size at each level (including buffering)
4) queueing depth at all levels
5) DDR controller capacity (and while we're at it, is the memory controller on-chip... wait, which one?)
6) bank conflicts in DDR
7) NUMAness (how about NUMA cache behavior?)
8) even worse: should we consider the cost of differing memory technologies?
9) TLB behavior... there's a lot there, can't list it all

Here's why we don't: YOU CAN'T. Do you realize the incredible amount of work required to get the info for the analysis? The list of things that influence memory behavior is huge. Even if you can get all of it, you work it out for one algorithm...you've now wasted a year, and figured out a cost for one architecture/OS/run-time combo. Congratulations. I'm all for including the cost of memory...just realize what you're asking before you write a post asking about it. There's no end to how detailed you want to go. And in the end you end up destroying the asymptotic bound you're trying to create b/c you'll realize that it's now a probability distribution that really doesn't lend itself to generalizable asymptotic behavior, which is what you want when considering algorithms.

When choosing an algorithm for specific hardware, you can choose an algorithm for that arch... but most people don't do this; it's not necessary unless you're in HPC, datacenter-scale analytics, or the embedded/cyber-physical space.

Permit me to disagree; I think it is useful to internalize the general rule that for every 100x increase in data size, the theoretical access time grows by 10x. This isn't a perfect rule, but it is just as valid a shortcut as saying hash lookups are O(1).

For all intents and purposes all modern CPU hardware works the same way (a cache hierarchy, superscalar, pipelined, with relatively slow DRAM hanging off some bus).

In practice measure different approaches to performance-critical code with realistic datasets and don't make assumptions. Don't assume that you know dictionaries are O(1) and therefore searching an array must be slower.
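The 100x-to-10x rule of thumb above is just the square root at work:

```python
# If access time scales as sqrt(N), growing the data by a factor k
# multiplies the per-access time by sqrt(k): 100x data, 10x slower.
from math import sqrt

def access_time_growth(data_growth_factor):
    return sqrt(data_growth_factor)

assert access_time_growth(100) == 10.0      # 100x data  -> 10x time
assert access_time_growth(10_000) == 100.0  # 10,000x data -> 100x time
```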

The problem is, that's not really true. There isn't a hard/fast rule that'll get you what you want given the way modern hardware is designed. Run something on a core designed for mobile workloads and that changes vs. one designed for server workloads vs. one designed for HPC. They're definitely not all the same, and the differences will kick you in the rear when trying to simplify.

I do agree teaching people that there are differences, and give them the tools to learn when algorithmic changes are necessary to optimize for the hardware.

The article rationalizes this with physics, arguing that access to a physically large data store, however efficiently implemented, is limited by the speed of light and by how densely information can be packed in a 3D world.

Certainly there are exceptions when you make a system more efficient (for example, filling empty memory slots or moving systems closer together or paying for a better interconnect).

> Here's why we don't: YOU CAN'T.

But that’s one of the main points of the article: you don’t have to. Asymptotically, you won’t do better than O(sqrt(N)) because physics does not allow it. You can put in all the different caches you want, you can buy the most expensive RAM there is, you can buy the best interconnect systems for the fanciest NUMA cluster you can find: you won’t beat O(sqrt(N)) if you wish to go to arbitrary N.

Of course, if you bound your problem size, you may well get O(1) access times and for some applications, it may well be sensible to consider the coefficient associated to the square-root scaling small enough to neglect it, but if you truly want to talk about algorithmic complexity (and not just the scaling of one particular implementation for one particular range of valid N), you should take that additional square root into account.

The one exception is possibly quantum computation in some cases.

I don’t know of a mechanism in quantum mechanics that would allow packing information more densely (or even infinitely densely, as would be required by O(1) random access times). What did you have in mind?

There is a middle ground between tuning an algorithm to the exact memory hierarchy of the system it is running on and ignoring the problem altogether: https://en.wikipedia.org/wiki/Cache-oblivious

This is true... and thankfully this technique is also being taught in some CS algorithms classes.

Check out "cache-oblivious algorithms" for code that is designed to work well with memory hierarchies but which does not need to be tuned for the specifics of any particular architecture.

I like it: https://en.wikipedia.org/wiki/Cache-oblivious_algorithm

It gives a name to a good default way of writing code.
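As a flavour of the technique, here is a minimal cache-oblivious matrix transpose: recurse on halves until blocks are small, so at some recursion depth the sub-problem fits each level of the cache hierarchy without the code knowing any cache size (the base-case threshold of 8 is an arbitrary choice for this sketch):

```python
# Cache-oblivious out-of-place transpose: split the longer dimension
# in half recursively; tiny blocks are transposed directly.
def transpose(src, dst, r0, r1, c0, c1):
    if r1 - r0 <= 8 and c1 - c0 <= 8:
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif r1 - r0 >= c1 - c0:
        mid = (r0 + r1) // 2
        transpose(src, dst, r0, mid, c0, c1)
        transpose(src, dst, mid, r1, c0, c1)
    else:
        mid = (c0 + c1) // 2
        transpose(src, dst, r0, r1, c0, mid)
        transpose(src, dst, r0, r1, mid, c1)

n = 64
m = [[i * n + j for j in range(n)] for i in range(n)]
t = [[0] * n for _ in range(n)]
transpose(m, t, 0, n, 0, n)
```

The recursion is what makes it "oblivious": whatever the cache sizes are, some level of the recursion produces blocks that fit.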

> but how often do you scale something through 6 orders of magnitude?

Well, that's the whole point of using big-O notation.

Otherwise, there's no point in distinguishing between O(1) and O(lg N) either--lg(10^6) is only ~20, after all.

> but how often do you scale something through 6 orders of magnitude?

I can't speak for others but the team I'm on has a mind-boggling amount of data. 6 orders of magnitude is not even close (no, I'm not doing something clever like starting with "byte" or "bit" as smallest unit). The truth is most people don't have "big data" but plenty of people do—petabyte scale is becoming pedestrian these days and a few larger organizations are into the exabyte range.

Latency grows as you go from registers to RAM to hard drives in distant data centers. Consider that data centers don't really stack on top of each other, they have to be built flat or you can't get rid of the waste heat fast enough. Same with microprocessors and RAM. Everything is flat, in practice, with a limit to how much it gets stacked.

Maybe you think that this is a special case, but with everyone building their applications on top of the same cloud providers, your provider's scalability starts to affect you personally. So even if you build a small application in the cloud, your ability to access data is impacted by the fact that your cloud provider is providing access to exabytes of data.

You're failing to account for the fact that it's a log-log graph, though, so even a modest slope can be pretty significant.

If you want a rule of thumb, might as well just stick to "hash tables are fast" instead of the false precision of O-notation.

Big O-notation isn't precise, it is an approximation by nature. It only cares about the largest factors for determining the estimate.

The log-log graph is the way to determine those large factors. Taking the log removes the multiplicative constants. The differences between n^.5 and n^1 and n^1.5 are quite significant for Big-O.

The graph really is totally flat below the L1 cache size line, and even on a log-log graph the slope between 10 MB and 1 GB is not significant.

That being the kB range for L1 and single-digit MB for L2. Relatively simple Java apps tend to eat multiple GB while accessing most of it.

The definition of big-O is what happens as n approaches infinity. You are free to model your computer as having O(1) memory access as it's a much simpler model and close enough in practice.

As N climbs orders of magnitude, we start using ever more distant kinds of memory, so access time does not scale as O(1) in theory.

As for practice, it's not O(1) either or else cache misses would just be a theoretical nicety that didn't matter in the real world.

If you make the assumption that there is an upper bound on the maximum time a read can take (e.g. reading from disk), reading from memory becomes an O(1) operation regardless of cache misses.

If you make that assumption, sure. If instead you assume that you live in a universe where information can't propagate faster than the speed of light (true), storage takes up space (true), and memory is arranged on a plane (generally true in practice), then you get back to sqrt(n) access times. If some day we move to fully 3D computers we can get that down to cube root of n.
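The planar argument reduces to simple geometry, sketched here (the bit density is a made-up figure; only the scaling matters):

```python
# At a fixed areal bit density, N bits occupy a disc of radius
# proportional to sqrt(N), so a light-speed round trip to the far
# edge needs Omega(sqrt(N)) time.
from math import pi, sqrt

def min_radius(n_bits, bits_per_mm2=1e12):  # density is illustrative
    return sqrt(n_bits / (pi * bits_per_mm2))

# Growing the data from N to 100*N grows the worst-case distance 10x,
# independent of the chosen density.
ratio = min_radius(100e12) / min_radius(1e12)
print(ratio)
```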

Anthropic O(1) argument: An algorithm will either finish before all humanity dies, or it won't. If it does, it's bounded from above by a constant. If it doesn't, it's completely irrelevant to everyone. So for all practical purposes every algorithm is O(1).

By similar assumption, every terminating algorithm running on a real computer (having bounded memory) is O(1). This is technically true, but not useful.

Most people assume the average memory access will be reasonably fast (why we have a cache hierarchy in the first place). Sometimes the growth of it matters, sometimes it does not.

Algorithm complexity is not something to be memorized, but analyzed.

The assumption that the computer you are modeling has an infinite amount of memory behind a finite number of caches is a useful one: it dramatically simplifies the analysis and still lets you use Big-O to somewhat accurately analyze the performance of a real computer.

You missed what I was illustrating by referencing bounded memory: every realized program has a finite (therefore constant-bounded) number of states. The point is that such analysis is not useful, because it "cheats" with a misleading implicit constant for the bound.

Similarly, if one is analyzing the performance of a general algorithm and assumes memory access is constant time, then that constant needs to be understood as the full access time of main memory (assuming that's what problems will always fit in).

One cannot casually make this assumption to get an O(n) complexity result, profile the program with a problem size that actually fits in cache, and then combine the two to extrapolate larger running time. They each have different baked-in assumptions.

Yes, that assumption can be used to make any operation O(1).

We live in a world where we buy hardware to run our programs, not write programs to run our hardware. [This is a Dijkstra paraphrase—can somebody find me the original?]

Think of the O(1) as paying the worst cost of a given memory tier, uniformly. The more data your program uses the more hardware you need to run it the worse your worst case will be.

That makes the analysis completely worthless. The idea of the analysis is to evaluate asymptotic behaviour for large data sets. If there is ever a nonlinear cost of accessing the data, you should take it into account.

Suppose the memory is tree-structured like a typical TLB; you then get at least one extra log n factor. And there are actually even more costs along the way.

It happens all the time. In real-world code people throw hash tables and maps at all types of problems of all sizes. If you hit arbitrary boundaries where your fundamental assertion of access time becomes non-constant, it's an issue.

This post might spur new thought about containers and how their usage should really be governed by their expected capacity.

I think that the "best" message here is that the performance of memory access, hash lookup, etc, is a more complicated question than always just O(1). Knowing the context -- basically, the second-to-last graph that highlights performance vs size vs cache limits -- is the bigger picture for the system. And it's perfectly usable every day; you just have to remember a few more words: "memory access is O(1) as long as I stay within L1".

The problem I have with the article is that the author has replaced one broad generalization with another equally reductive but diametrically opposed one. It's no more accurate, and no more decontextualized, than the original statement, and in my head that means it's also not really all that useful.

Simpler version: locality of data matters a lot.

One can move across all these scaling factors by changing from linear access to random access. A common example is linked lists vs arraylists. Many articles show it takes about 5 inserts per read for a linked list to beat arraylists.

Not with large data sizes it doesn't, which is the whole point. An array would be even slower than the list on insert because you actually have to move much more data around.

The one thing an array is sometimes better at is locality, especially if you iterate over neighbouring elements.

> For the purpose of this series of articles I'll be using the O(f(N)) to mean that f(N) is an upper bound (worst case) of the time it takes to accomplish a task accessing N bytes of memory (or, equivalently, N number of equally sized elements).

That's not really valid; it's not how algorithmic analysis works. The author's conclusion for what is happening and why is correct, but I believe he is confused about how to get there.

Simply, when doing complexity analysis on an algorithm, one must always count an operation. It's not okay to point to the time taken for an implementation and say "That's our function." It is a function, but it's a function of time, not a count of how many operations are performed at given sizes of N.

However, he is correct that naive analysis of arrays and linked lists will result in this odd behavior: arrays will tend to outperform lists on real systems. The problem with the naive analysis is in what it counts. For example, on an insert, a naive analysis will count the number of elements accessed in the structure. That's naive because it assume all accesses are the same - which is what he's getting at with the "myth of RAM". Because of the memory hierarchy, they are not all equal.

But the correct response is not to give up counting operations and look at time, the correct response is to find the right thing to count. And the right thing to count is basically going to be last level cache misses - the operations that force one to go to memory. If you do that, then you will find that the operations you are counting will correlate much better to the actual time spent.

In some places, the author gets this mostly correct: "You can also use Big-O to analyze the time it takes to access a piece of memory as a function of the amount of memory you are regularly accessing." That's fine, as you're counting memory accesses.

In other places, it's not correct: "That I use Big O to analyze time and not operations is important." You can't count time, only operations. You want to count the operations that correlate with your actual running time, but the entire point of good analysis is to find those operations. You can't just shortcut it, only measure time, and then call it algorithmic analysis.

The author gets a lot right, but despite the lengthy discussion, I think he still has some confusions about algorithm complexity analysis.
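The "count last-level cache misses" suggestion can be illustrated with a toy simulator; the cache parameters below are illustrative, not any real CPU's:

```python
# Direct-mapped cache simulator: count misses, the "right thing to
# count" per the parent, for two access patterns over the same amount
# of data. 64-byte lines, 512 lines (a 32 KiB cache).
LINES, LINE_BYTES = 512, 64

def misses(addresses):
    cache = [None] * LINES          # one tag per cache line slot
    n = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        slot = line % LINES
        if cache[slot] != line:     # tag mismatch: a miss, then fill
            cache[slot] = line
            n += 1
    return n

# Sequential 8-byte reads over 512 KiB: one miss per 64-byte line.
seq = [i * 8 for i in range(65536)]
# 4 KiB-strided reads over the same range: pathological conflicts,
# so every single access misses in a direct-mapped cache.
stride = [(i * 4096) % (65536 * 8) for i in range(65536)]

print(misses(seq), misses(stride))
```

Both patterns touch the same address range, but the miss counts (and hence the real running times) differ by 8x, which an instruction count alone would never show.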

For the record, these lessons should be familiar to anyone who has done serious performance analysis of computer systems, either on their own, or in the context of a course that focused on systems or architecture.

Fwiw theoretical computer scientists do this kind of analysis as well, while the post makes it sound like they don't. The equal-cost model of RAM operations is only one model used in asymptotic algorithm analysis, and there are other cost models with other properties. The introduction of this paper gives a decent concise overview of some of them: http://www.cs.cmu.edu/~rwh/papers/iolambda/short.pdf

Since it is a topic I'm interested in I took the time to read all 4 parts, the author manages to summarize it in a paragraph which would have been helpful at the beginning:

When somebody says “Iterating through a linked list is a O(N) operation” what they mean to say is “The number of instructions needed to be executed grows linearly with the size of the list.”. That is a correct statement. The argument I’m trying to make is that it would be a mistake to also assume that the amount of time needed would grow linearly with the size of the list as well. This is an important distinction. If you only care about the number of instructions executed that’s fine, you can use Big-O for that! If you care about the time taken, that’s fine too, and you can use Big-O for that too!

Sadly, he doesn't take this knowledge to its conclusion. Let's introduce the notation Oi() for the Big-O notation in instructions, and Ot() for the Big-O notation for time.

Lemma: For all f and g, if Oi(f(N)) > Oi(g(N)), then Ot(f(N)) > Ot(g(N)).

Or put another way, it's important not to confuse complexity scaling with time scaling, but the more complex the computation, the longer it will take.

A complexity measure is a measure of complexity over a model. Mostly we talk about input size models, but we can (and do) talk about other models. It isn't a question of complexity scaling vs. time scaling, but of complexity scaling over an input size model vs. a machine architecture model.

There are tons of well-researched machine architectural models for complexity analysis. In the case of the article, any one of the many external memory (EM) or hierarchical memory models would be appropriate. These can capture latencies and bandwidth of hierarchical memories and storage, and in some cases even eviction, contention, partition, etc.

This isn't some esoteric point. In high performance computing, distributed algorithms, or even systems & architecture it's pretty much expected that if you talk about complexity, you should do so under some sort of EM model.

Well stated. That's a much better way to say it.

Don't think your conclusion holds - and that's actually something the author brings up.

It might be quicker in time (not in the amount of instructions executed) to calculate something multiple times, or to do a more complex calculation or what have you, in order to touch less memory.

> if Oi(f(N)) > Oi(g(N)), Ot(f(N)) will be > Ot(g(N)).

That doesn't hold. Some instructions take longer than others.

This is how I would go about stating the proof.

1) The article shows that the time to process N datums of data is related to the size of the total data processed, and that as the size of the data set grows, the time to process it grows with the root of N.

2) Computational complexity's order of magnitude is the relative number of datums that need to be processed in order to complete the computation.

3) Less complex computations process fewer datums, and fewer datums mean a smaller total memory footprint.

Therefore, processing fewer datums will lower your total memory processing cost which will lower your total time. Q.E.D.

Q.E.Nope. Memory can get reused.

If all instructions complete within some constant deadline independent of N/the input then it does hold even if they take different amounts of time.

The article demonstrates that memory access is an instruction that actually has O(n^1/2) performance.

It doesn't hold for binary tree lookups vs B-tree lookups.

The hypothesis of the statement doesn't hold. O(n log n) isn't greater than O(n log n).

You have to evaluate it in a model with an explicit cache width size, then O(log_b(n)) < O(log(n))

O(log_b(n)) is the same as O(log n). With O(sqrt i) memory accesses you still have a constant factor separation between the absolute time performance of the two.

> O(log_b(n)) is the same as O(log n)

Not in every abstract model. See other reply.

It has nothing to do with the model and is just a question of whether b is a variable in the big O notation.

And in this case of O(sqrt i) memory access times, a binary tree and b-tree stay within a constant factor even as you vary b. (The reason is, the binary tree accesses that a single b-sized access replaces get exponentially more "local.")

> It has nothing to do with the model and is just a question of whether b is a variable in the big O notation.

There are models, like the cache oblivious model, where b is assumed to be a variable, so the model matters.

> a binary tree and b-tree stay within a constant factor even as you vary b.

it's not a constant factor if b is a variable. It's a factor of log b.

> it's not a constant factor if b is a variable. It's a factor of log b.

It's between 1/(sqrt(2)-1) and sqrt(2)/(sqrt(2)-1), depending on how full the b-tree nodes are.

I don't know where you got those numbers from.

> The reason is, the binary tree accesses that a single b-sized access replaces get exponentially more "local."

sqrt(n)+sqrt(n/b)+sqrt(n/b^2)+... versus sqrt(n)+sqrt(n/2)+sqrt(n/4)+...

And since b-tree nodes can be half empty, there's a sqrt(2) uncertainty. (And of course there is no other memory overhead at all, none whatsoever.)
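Those geometric sums are easy to check numerically. Here's a quick sketch (plain Python, with arbitrary illustrative values of n and b) showing that the ratio between the binary-tree cost and the b-tree cost stays below the constant sqrt(2)/(sqrt(2)-1) ≈ 3.41 no matter how large b gets:

```python
import math

def chase_cost(n, b):
    """Sum sqrt(n) + sqrt(n/b) + sqrt(n/b^2) + ... down to the root."""
    total, m = 0.0, float(n)
    while m >= 1:
        total += math.sqrt(m)
        m /= b
    return total

n = 2**20
binary = chase_cost(n, 2)        # binary tree: fanout 2
for b in (4, 16, 256, 4096):     # b-trees with growing fanout
    ratio = binary / chase_cost(n, b)
    # the ratio approaches 1/(1 - 1/sqrt(2)) ~= 3.41 but never exceeds it
    print(b, round(ratio, 3))
```

So varying b only moves the cost around within a fixed constant-factor band, which is the "exponentially more local" point above.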

Thanks, I see. Are you assuming ephemeral usage of nodes to make that claim?

I don't know what that means. I'm assuming a random element of the tree is picked, that parent nodes are in a smaller or equal cache level than children, that all but one cache levels are used completely, that each cache level has O(sqrt n) access time, and that there is an upper bound on the ratio between successive cache sizes.

Or less generally: it takes sqrt(j) nanoseconds to dereference the pointer with value j, and parent nodes are at smaller addresses than their children.

Right, you're assuming there's one tree being maintained.

Or any fixed number of them, or any curve where the number is polynomially smaller than the total size...

log_b(n) * log(b) = log(n). log(b) is a constant in our algorithmic analysis, and can therefore be removed. There will never be a log_b(x^2) that grows at slower than log(x), no matter how big b is.
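That change-of-base identity is trivial to sanity-check numerically (a throwaway sketch; the values of n and b are arbitrary):

```python
import math

n, b = 10**6, 1024
# log_b(n) * log(b) == log(n): changing the base only introduces the
# constant factor log(b), which big-O (with b held fixed) discards
assert math.isclose(math.log(n, b) * math.log(b), math.log(n))
```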

Which is why you have to evaluate in a model with an explicit cache width. In such a model, b is not a constant. The cache-oblivious model [1] is a fairly well-known example, but that won't work in this case, since we need to know b to set the tree width. Any of the other external memory models will do.

[1] https://en.wikipedia.org/wiki/Cache-oblivious_algorithm

I think the notation is imprecise.

The quantity the author describes is commonly called the working set. His explicit thesis is that the time to process the working set is related to the size of the working set regardless of computational complexity. He spends the entire first part of his discussion belaboring this point with a linked list example. He goes on to do a systems analysis of the results and concludes that time of execution is system dependent not complexity dependent. Again using the same algorithm engaging with a larger and larger memory system.

Generally engineers would experience this important principle in a more binary way: once their program started swapping to disk, its performance would fall through the floor.

If you hold the size of the datum constant (and that is important), then processing fewer of them will take less time than processing more of them.

The argument you make, as I understand it, is as follows: consider two algorithms that process N and P datums respectively to achieve the same computational result, where N > P. There exists such a pair where the algorithm processing P datums takes less time, resulting in a violation of my Lemma because Oi(P) > Oi(N) but Ot(P) < Ot(N). And you cite cache obliviousness as the principle that enables that violation.

Would you agree that I have accurately summed up your argument?

It's roughly right, but it's not cache obliviousness that enables the violation. For the argument to work out, you have to evaluate it within a model that has an explicit cache and hence, treats big O results as being parametrized over the cache width.

Does that really follow? The sqrt(N) factor is the overhead of memory access only, so if you're comparing an algorithm where every operation is a memory access to another where that is not the case it would seem you could have the inverse relation in Oi and Ot.

Edit: A counterexample then:

Oi(f(N)) = N^(2/3) (memory-accessing operations)
Oi(g(N)) = N (non-memory-accessing operations)

Ot(f(N)) = N^(7/6)
Ot(g(N)) = N

So Oi(f(N)) < Oi(g(N)) but Ot(f(N)) > Ot(g(N))
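Plugging in a concrete size confirms the crossover (a sketch; N = 10^6 is an arbitrary large input size):

```python
N = 10**6
instructions_f = N ** (2/3)   # fewer instructions...
instructions_g = N
time_f = N ** (7/6)           # ...but each one is a memory access costing sqrt
time_g = N

assert instructions_f < instructions_g   # Oi(f) below Oi(g)
assert time_f > time_g                   # yet Ot(f) above Ot(g)
```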

I just assume that every big-O equation has some sort of units (possibly implicit) and that neither the inputs nor the results are dimensionless quantities.

We just tend to assume that "N" is clear from context and that the result is some number of similarly-loose "instructions".

Your Lemma doesn't hold if the time to execute one instruction is dependent on N - which is pretty much the entire point the blog post makes.

I think there's some good info in this article covered by various degrees of misinformation. For some reason, the article starts off with this totally wrong definition of big-O, and proceeds to make conclusions with this wrong definition. Let me provide the accurate definition:

The statement "f is O(g)" means there exist some input, call it t, and some constant multiplier M (i.e., constant in x), such that for every x >= t, M times g is absolutely no smaller than f. In notation:

|f(x)| <= M * |g(x)|, where x is at least t.

This bit about "x is at least t" is very important and notifies us that this is "asymptotic behavior".

It does not make a difference how wacky or weird f is compared to g below t. It can contain all these crazy memory hierarchy artifacts, it could contain a short burst of exponential slowdown, it could contain anything.

Furthermore, according to the above definition, big-O has nothing to do with any tangible quantity whatsoever. It's a method for comparing functions. The functions may represent whatever is of tangible or intangible interest: memory, time, money, instructions, ...

Big-O analysis usually posits that the details below t aren't the details that matter. (Of course, there are situations where they do, but in such you would not use big-O.) If you want to have some analysis that is global, you don't need asymptotic analysis (though it might help as a start). You can just talk about functions that are strictly greater than or less than your function of interest everywhere. But these analyses are difficult because a much higher level of understanding of your function of interest is required.
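To make the definition concrete, here's a tiny sketch with a made-up pair of functions: f(x) = x^2 + 10x is O(x^2), witnessed by M = 2 and t = 10, and nothing f does below t matters:

```python
def f(x):
    return x**2 + 10*x

def g(x):
    return x**2

M, t = 2, 10
# |f(x)| <= M * |g(x)| for every x >= t  =>  f is O(g)
assert all(abs(f(x)) <= M * abs(g(x)) for x in range(t, 10**5))
# below t the inequality can fail, and that's fine: e.g. f(5) > 2 * g(5)
assert abs(f(5)) > M * abs(g(5))
```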

> It does not make a difference how wacky or weird f is compared to g below t. It can contain all these crazy memory hierarchy artifacts, it could contain a short burst of exponential slowdown, it could contain anything.

Yes, but if you continue reading the next post, the author is essentially arguing that the cache hierarchy extends indefinitely due to 1) the speed of light and 2) the amount of information that can fit within a light cone of a certain radius.

In that case, cache hierarchy effects aren't 'weird things that happen below t', they represent the asymptotic behavior we can expect due to the laws of physics, no matter how far out to the right we go.

I think it's great to make such arguments but a bit more rigor would be appreciated in such theoretical constructions. If the main point to get across is in Part II, and Part I does not support Part II, as a matter of convenience, I would elide Part I.

I wouldn't. It's nice to show these effects in practice. In fact, I would have added a part 1.5 that gave theoretical bounds for how much heat can be dissipated from a volume that contains N bits.

Heat dissipation is proportional to the surface area, so this suggests a reason that information density is (physically) bound by surface area, rather than volume.

Another factor that gives the same limit in theory is gravity. If you have a volume of space and start shoving hard drives in it, eventually the hard drives will collapse into a black hole, and your information density is limited by the surface area of the black hole—again, giving you O(R^2) bits of storage for a volume with radius R.
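The Bekenstein-Hawking entropy makes that scaling concrete: a horizon of area A stores at most A / (4 l_p^2 ln 2) bits, where l_p is the Planck length. A rough sketch (the 1 m radius is arbitrary):

```python
import math

PLANCK_LENGTH = 1.616e-35  # metres

def max_bits(radius):
    """Bekenstein-Hawking bound: bits on a sphere of the given radius."""
    area = 4 * math.pi * radius**2
    return area / (4 * PLANCK_LENGTH**2 * math.log(2))

# doubling the radius quadruples the capacity: O(R^2), not O(R^3)
print(max_bits(1.0))                   # ~1.7e70 bits for a 1 m sphere
print(max_bits(2.0) / max_bits(1.0))   # 4.0
```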

That's explicitly mentioned in the article.

It's mentioned in the second article in the series, not the linked article.

Interesting. Perhaps a stupid question but is this somehow related to the recently proposed theory which says that the universe is essentially a hologram?

One way it's related: if a volume's information only depends on its surface area, then you can imagine the volume is really just a hologram with the same number of bits, and the bits are directly embedded on the surface of that hologram.

From what I understand, the holographic principle (the theory you mentioned) was inspired by the consequences of black hole thermodynamics.

There's only so much surface area you can fit in a given volume.

Helge von Koch begs to differ ;-) Also, simple counterexamples: https://people.emich.edu/aross15/math121/misc/love-1989-supe...

How are those counterexamples? I'm not talking about abstract mathematical geometry, I'm talking about actual physics. A paper that says "just drill infinite holes in a cube" isn't relevant here.

Stuffing hard drives together until they form a perfect Schwarzschild black hole is hardly "actual physics", so I posited we're long into abstract constructs giving theoretical upper bounds.


Worrying about the speed of light in processor caching these days is like putting a spoiler on your cart and horse to reduce drag. It might matter eventually, and that's kind of cool to think about, but it's absurdly negligible with the technology of today. Processors today are orders of magnitude slower than that.

That's completely false. The speed of light today already constrains processor design (and by extension, on-CPU caches):


"Even if signals in the chip were moving at the speed of light, a chip running above 5GHz wouldn't be able to transmit information from one side of the chip to the other". Note that a light-nanosecond is about a foot, so I'm assuming this is talking about the propagation delay of electrons in silicon that need to go through gates, which I'm sure is much slower than light in a vacuum.

To put this another way, you can view the scaling law as a hard upper bound on the speed of access to data; engineers can only approach that bound from below as technology improves.

But given a certain level of technology, it's easier to make a faster L1 cache than an L2 cache, and that will hold even as we approach theoretical limits.

A bit of a nit: We're not worried about the propagation of electrons over distance as much as we're worried about the propagation of the electric field. Electrons move too slow, the electric field moves much faster.


> which I'm sure is much slower than light in a vacuum.

Speed of signal in copper is about ⅔c, I suspect silicon will be in a similar ballpark – and the metal layers linking the silicon layers are made of copper anyway. So it's about 8"/ns… and the (three dimensional, 10+ layer) signal routes are anything but a straight line.

The speed of the signal depends on the properties of the surrounding medium though, right? You get one speed through copper in air but another speed entirely for copper in silicon.

And another for pure silicon, and others for the various doped silicons, …

I don't even know what the actual materials are with current technologies, but whatever it is, the speed will be much lower than the vacuum speed of light.

Well, that depends on the context. You're totally right in general, but for the purpose of order-of-magnitudes of maximal clock rates, the speed of light hardly varies. I can't find any normal, plausible processor material with a refractive index of 4, let alone 10. So while light in silicon may well travel [almost 3.5 times](http://refractiveindex.info/?shelf=main&book=Si&page=Chandle...) more slowly than in a vacuum - which is hardly nothing, at the granularity we're talking about I think it's not a huge issue. But that's just perspective - clearly the materials matter and will affect signal transmission speeds, and probably even today impose significant (if indirect) limitations.

It's possible weird dopings or quantum effect change this dramatically, I suppose. But if I had to guess, they don't.

And the signals all need to get there at the same time

And that's such a hard problem that you don't even have to go down to CPU scale for it to matter.

If you look at mainboards (or some smaller PCBs), the connecting lanes between e.g. RAM and CPU closer to the centre are often zigzagged, to match the lengths – and delays – of the necessarily longer outer lanes. The speed of light is actually really damn slow.

> The speed of light is actually really damn slow.

I don't think one really appreciates that until they find out, after X hours of place & route, that they missed timing by some fraction of a nanosecond

> you don't even have to go down to CPU scale for it to matter.

I'd say it's the opposite, really. The longer the wires, the more effort it takes to synchronize their delays within a certain number of picoseconds.

Well, not entirely. For data signals in general you just need all the lines to resolve to the desired value before the clock. You do need to be very careful about the clock, though.

DDR4-SDRAM achieves speeds in excess of 1.2 GHz over distances of 4" or more. That doesn't leave too much wiggle room.
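A back-of-the-envelope sketch of that wiggle room, using the figures above and assuming the roughly ⅔c propagation speed of copper traces mentioned elsewhere in the thread:

```python
C = 299_792_458.0          # speed of light in vacuum, m/s
SIGNAL_SPEED = 2 * C / 3   # assumed ~2/3 c in copper traces
INCH = 0.0254              # metres

trace_length = 4 * INCH            # ~10 cm DRAM trace
flight_time = trace_length / SIGNAL_SPEED
clock_period = 1 / 1.2e9           # 1.2 GHz clock

# signal flight time eats most of a single clock period
print(flight_time * 1e9)           # ~0.51 ns
print(flight_time / clock_period)  # ~0.61 of a cycle
```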

Surprisingly, the behavior at those speeds is fairly well known. I don't even think you need to move from FR4 to a more controlled material (generally in mmWave, Isola FR408, along with sometimes Rogers stuff, is what is used, IME). Allegro PCB SI (Signal Integrity) even models high-speed timing fairly well at the design stage.

You've got plenty of stuff at the test stage too. E.g., high-end LeCroy metrology gear runs at 100 GHz. Agilent (Keysight, whatever, it's still HP to me) has a full test rig for USB 3.1 [0] at 10 Gbps for their consumer-level gear (again, fairly slow). Step it up to FPGA speeds and here's [1] an app note by Altera with way higher speeds.

Here's a really brief overview of 'rules of thumb' that work up to PCI-e[2] by TI. 50 minutes and worth a watch. Clean power that won't couple in, matching lines lengths on diff pairs, proper isolated ground planes (give AGND and DGND their own layers) and proper termination will easily get you 95% of the way there.

(Shameless promotion-- available for high-speed design, pre-EMC compliance testing, fault diagnostics, etc).

[0] http://www.keysight.com/en/pd-2472798-pn-U7243B/usb-31-compl... [1] https://www.altera.com/en_US/pdfs/literature/an/an528.pdf [2] https://www.youtube.com/watch?v=A3qw_Ecx9Co

Note that GDDR5 abandoned length matching, and just measures the latencies of each wire to calibrate itself.

Well yeah, I figured that that's implied, the only times that really matter are the rising and falling edges of the clock.

>"Even if signals in the chip were moving at the speed of light, a chip running above 5GHz wouldn't be able to transmit information from one side of the chip to the other".

That's almost completely misleading. Except for the clock itself, nothing that happens inside a processor takes a single clock cycle. And clock distribution delays need compensation anyway.

That's also true of memory. No one has ever seen 1-0-0-0 CAS RAM. (And no one ever will.)

It's true that pushing up against distribution times makes topology more complicated and lowers the maximum possible speed.

But that's actually the difference between electronics and photonics, which promise to run many orders of magnitude faster than EM-based silicon design. (I've seen "millions" quoted, but that may turn out to be hyperbole. Even so - optical fibre can run at tens of Tbps.)

Of course there's a difference between transmission speed, which is a function of the speed of light, and maximum data rate, which is a function of bandwidth.

But unlike electrons-in-silicon connections, photonics may not need to be 100% serial. It's entirely possible to get multiple data channels down a single line with frequency multiplexing, and perhaps also by phase rotation.

That has huge potential to change the packing density and the speed of processor designs; potentially you could have processor elements that were massively parallel but connected with single lines.

There's also some hope that because power dissipation shouldn't be such a problem, it will be easier to make 3D designs - although that's probably more speculative.

Bottom line is silicon is nearly done, but photonics is just getting started. Like fusion power it's probably a couple of decades away, but when it arrives it will be huge.

Most stuff that happens inside processors takes a single cycle, and we used to have single-cycle RAM until the CPUs ran away with increasing speed.

Photonics has a serious density problem: visible light is too large in terms of wavelength. This is already an issue with etching silicon. It's also at best at the 1950s stage where people have developed single photonic transistors in the lab but not yet photonic VLSI.

Anyway, the main factor of delay in chips is capacitance, both between wires and on the gates of FETs. This must be charged through the on resistance of the driving FET and all the intermediate wiring.

So, what you're really saying is that it's quite accurate, there are just details and more details but in the end - as long as we use a single clock - signals need to be able to travel from one end to the other in a single clock cycle.

Now, if we had (partially) clockless processors...

Also, your details are a lot more misleading than those in the article. Lots of things in a modern processor take 1 clock cycle. Agner has lovely tables: www.agner.org/optimize/. Note that most basic integer and floating point operations on a Skylake processor take 1 cycle: things like bit shuffling, moving, adding (not floats), and quite a few incidental other operations. Now, I'm sure you can quibble that a latency of 1 cycle in that table isn't really 1 cycle under some unusual interpretation of what's going on, but at that point you're quibbling about what "doing something" means, which isn't helpful. What programmers/compiler writers call 1 clock cycle can execute what most people call 1 instruction for many common instructions, even today.

All cache (and in fact, memory as well) used to be single-cycle.

Why aren't they any more? Because the speed of light doesn't allow it.

Multi-cycle cache/memory latency and the cache hierarchy it implies is a result of being bound by the speed of light.

On the massively-parallel-connected-with-single-lines theme, there's been some interesting work done with on-chip RF, which does essentially the same thing with conventional silicon.

If anyone talks about signals crossing the chip in a single cycle, they're barking up the wrong tree. It's absolutely unnecessary. It would not take a tremendous engineering work to make a CPU where signals go no more than 3mm or even 1mm in a cycle. (At 5GHz you have a distance budget of over 40mm.) The most important constraint for single-threaded operation is transistor switching time. It's likely the most important constraint will always be transistor switching time.

Transistor switching time is limited mostly by capacitance, which is determined by the permittivity of the material the transistor is made from. And permittivity and permeability are the only things determining the speed of light in a material. So saying that the speed of light is the limiting factor on transistor switching time is valid.

Spoilers don't reduce drag.

They produce downforce to counteract the lift effect of air moving under the car.

It is only necessary at very high speeds, and prevents loss of grip in (especially) the rear tires.

But, yeah, horse and buggy. I'm with you there.

The downforce isn't there to counteract air under the car (actually fast flowing air under the car produces downforce; read up on ground effect and Bernoulli).

The downforce is there so the car can go around corners quicker; the downforce increases the friction (Ff<=mu x Fn), i.e. the force of friction is less than or equal to the coefficient of friction (mu / μ) multiplied by the normal force. In the case of a tire the normal force is the downward force (provided by the wings or spoilers). Therefore the more downward force, the higher the friction, and hence the more lateral force (cornering force) the tire can exert. For example an F1 car at high speed can exert approximately 4Gs of lateral force when cornering, or 5Gs of force when braking (at slower speeds an F1 car can't brake as hard).
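Plugging illustrative numbers into Ff <= mu x Fn shows the effect; every specific value here (mass, mu, the downforce-to-weight ratio) is a rough, made-up F1-ish figure:

```python
G = 9.81          # m/s^2
mass = 800.0      # kg, car plus driver (illustrative)
mu = 1.7          # tyre friction coefficient (illustrative)
downforce = 1.5 * mass * G   # assumed aero load ~1.5x the car's weight

normal_force = mass * G + downforce
max_lateral = mu * normal_force     # Ff <= mu * Fn

# without aero the car corners at ~mu g; with aero, several g
print(mu)                           # ~1.7 g
print(max_lateral / (mass * G))     # ~4.25 g
```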

Dragsters don't go around corners (not while racing, anyway), but they still benefit from the extra weight on the drive wheels produced by the air hitting the top of the spoiler.

It's the same effect though - the F1 car cornering needs to maintain sufficient friction to prevent the tyres slipping laterally, whereas the F1 car braking or the dragster accelerating needs to maintain sufficient friction to prevent the tyres slipping longitudinally.

The air hitting the top of the spoiler isn't generating anywhere near the suction created on the bottom of the spoiler with air underneath having to speed up (and hence have much lower pressure compared to the top).

Spoilers reduce drag and provide downforce at high speeds.

Aerospace engineer here: Producing lift in either the up (airplane) or down (spoiler) direction generates induced drag. The amount depends on your design, in that a large real or effective AR can reduce it, but it's going to be there.

I think what happens in automotive designs, which are not always penned with the same attention to aerodynamics as aerospace designs, is that they often have unintentional lift at the rear. Adding a small spoiler, then, doesn't add actual downforce, it just reduces lift -- and therefore reduces drag into the bargain.

The author gets the definition of big-O completely backwards. Every single time the author says "Big-O" he means "little-o".

Here are the definitions:


Basically, big-O is a claim that your process is at least -this- fast. Small-o is a claim that your process can't be faster than -this-. He means the latter.

> Still not convinced? Think I'm misunderstanding Big-O?

Yea, I do.

Your claims about what little-o and big-O mean don't match the explanation at https://cathyatseneca.gitbooks.io/data-structures-and-algori...

That page says o(f) == O(f) and not Theta(f). When you say "small-o is a claim that your process can't be faster than -this-", that seems to match the definition of Omega(f). Did you mean that the author should have used big-Omega?

As a side-note, I've always thought that {big-O,Theta,Omega,small-o} notation was very confusing. We should just have a notation for "the asymptotic equivalence class of a function", let's say A(f). Then to say f \in O(g), we say A(f) <= A(g). To say f \in Theta(g), we say A(f) = A(g). f \in o(g) becomes A(f) < A(g). And so forth. Instead of lots of new notation, we add a single function A and re-use notation for orderings which everyone already knows. (Of course, I'm omitting the definition of the ordering on these equivalence classes, so maybe there is some difficulty in defining that...)

>As a side-note, I've always thought that {big-O,Theta,Omega,small-o} notation was very confusing. We should just have a notation for "the asymptotic equivalence class of a function", let's say A(f). Then to say f \in O(g), we say A(f) <= A(g). To say f \in Theta(g), we say A(f) = A(g). f \in o(g) becomes A(f) < A(g). And so forth. Instead of lots of new notation, we add a single function A and re-use notation for orderings which everyone already knows. (Of course, I'm omitting the definition of the ordering on these equivalence classes, so maybe there is some difficulty in defining that...)

Your A is pretty much equivalent to \Theta. Saying f \in \Theta(g) is equivalent to saying \Theta(f) = \Theta(g).

Edit: as an aside O(f) = O(g) is also equivalent to f \in \Theta(g), and O(f) \subset O(g) is equivalent to f \in O(g).

For equality, yes, Theta works! But I care not just about equality but about comparison. In particular, it would be nice if the "natural" ordering on A(f) had the property that

    A(f) <= A(g) iff f is O(g)
Theta(g) is a set, and the most natural ordering on sets is subsetting/inclusion. However, Theta(x) is not a subset of Theta(x^2), even though x is O(x^2)! So the natural ordering on sets doesn't do what I want.

It might be possible to define an ordering on Theta-classes that does what I want, though.

Edit: I just saw your comment about O. I think O might just work! So instead of remembering big-O, Theta, and Omega, I can just use

    O(f) <= O(g)  for  f is O(g)
    O(f) = O(g)   for  f is Theta(g)
    O(f) >= O(g)  for  f is Omega(g)
Is that true?

(I think small-o might be slightly more complicated, unfortunately. It has a more unusual definition: https://en.wikipedia.org/wiki/Big_O_notation#Little-o_notati...)

>Edit: I just saw your comment about O. I think O might just work! So instead of remembering big-O, Theta, and Omega, I can just use

> O(f) <= O(g) for f is O(g)
> O(f) = O(g) for f is Theta(g)
> O(f) >= O(g) for f is Omega(g)

>Is that true?

Yeah, I'm pretty sure that will work (haven't checked it very rigorously though). The definition of O(f) <= O(g) would simply be O(f) \subset O(g), which immediately satisfies almost all criteria (only tricky bit is proving that either O(f) is a subset of O(g) or the other way around).

For little o I think you should be able to use O(f) < O(g), using the ordering defined above.

> Is that true?

Yes, and that's the reason why

- g \in O(f) is an upper bound on g

- g \in Omega(f) is a lower bound on g

- g \in Theta(f) is a "tight" bound on g, or both an upper and a lower bound at the same time

Crap, you're right. I should have said Omega.

I'm afraid the author is correct in using big-O here instead of Omega or little-o.

For comparison, suppose that someone claims that for all x, sin(x) ≤ 1/2. That would just be wrong. If someone claims that sin(x) ≤ 2, then that is true, but not as informative as saying that sin(x) ≤ 1. There's something special about 1 here. If you want to highlight that, you can say that 1 is the least upper bound for sin(x). This is a bit of a mouthful, but we cannot achieve the same meaning by just replacing "≤" by "≥" or "=" or "<". Any of these replacements would just make the statement incorrect.

Big-O is something like the asymptotic version of ≤. For example, how many comparisons does heapsort need to sort an array of length n? If someone says O(n), that's just wrong. If someone says O(n^2), then that's correct, but not as informative as saying O(n log n). O(n log n) is special here, since it is the smallest complexity class that contains the number of comparisons. Again, it is a bit of a mouthful, but we cannot say the same thing by just replacing big-O by Omega, Theta, or little-o (the asymptotic versions of ≥, =, and <, respectively). For example, Theta(n log n) and Omega(n log n) are incorrect, since heapsort only requires a linear number of comparisons if the array already happens to be sorted.

The author argues that random memory access time is O(sqrt(n)), meaning that for large n, it might take up to constant * sqrt(n) time, but might also be faster if the memory to be accessed happens to be very close. Using Omega(sqrt(n)) instead would mean that random memory access can take an arbitrarily long time, but at least constant * sqrt(n). This is not what the author is trying to say.

You're right, but almost all colloquial usage of big-O is really closer to Theta, because people pick the tightest bound they can.

That's why everyone writes that comparison sorts are O(N lg N), even though they're technically O(exp N) as well.

Yea, I thought about that, but he doesn't mean theta, either. What he's giving is a physical impossibility proof -- you can't do memory access better than sqrt(N). When he gives a physical example of a machine of arbitrary size that accesses memory in sqrt(N) time, then we can talk about theta.

Well, cube root of N; you can stack memory.

The speed of light limitation on distance to DRAM memory is real, but so far seems to affect mostly supercomputers. Are there any large server boards where speed of light lag is the limit on memory capacity?

Read part two of the article. You can't stack memory indefinitely -- you'll get a black hole.

> it's also technically O(exp N) as well.

I wonder if this is what one interviewer was trying to get at when he asked me over and over again if a binary search was worst case O(log N). I just kept saying that if you took more than log N steps to perform a binary search then you by definition did not perform a binary search. He was off his rocker so I kind of doubt he was looking for the difference between little-o, theta, and big-O. It's good stuff to remember and might impress a more sane interviewer some day though.

Please see https://en.wikipedia.org/wiki/Best,_worst_and_average_case

Big-O is valid notation for _many aspects_ of an algorithm. The author applied it to a very practical one and explained what exactly he describes very well.

The best/worst/average issue is orthogonal to the definition of what big-O means. The time it take an algorithm has best/worst/average cases, and each of these is a function and thus is a member of big-O classes.

The definition of big-O has a completely standardized and unambiguous definition, and the author of the article somehow has a four-part blog post on big-O without knowing it.

I think it's a massive and needless abuse of standard computer science terminology to re-define it as something completely different.

I'm not entirely sure what you are referring to. You might be referring to the fact that the author's definition of big-O doesn't say anything about constant factors or asymptotics. This makes the definition incorrect, or at least sloppy. But judging by usage, it seems that he actually knows and is using the standard definition. The error is just in that one sentence, and it doesn't affect the rest of the argument.

You might also be objecting to the fact that he makes a distinction between time and instruction count, and is using big-O notation for both. I don't think there's anything nonstandard about making this distinction when it needs to be made. Take, for example, Karmarkar's algorithm:


Re-read the article, bearing in mind that O(sqrt n) is a set that contains O(1).

Yes, but it isn't very helpful to think of it that way, we're not competing to find the weakest constraint.

Great series of articles and the lessons are very important to someone writing performance system's programs.

Here is another chart I like to show people: https://dl.dropboxusercontent.com/u/4893/mem_lat3.jpg

This is a circular linked list walk where the elements of the list are in order in memory. So in C the list walk looks like this: while (1) p = *p;

Then the time per access was measured as the total length of the array was increased and the stride across that array was increased. The linked-list walk prevents out of order processors from getting ahead. (BTW another huge reason why vectors are better than lists)

(This is from an old processor that didn't have a memory prefetcher with stride detection in the memory controller. A modern x86 will magically go fast.)

From that chart you can read off the L1 size, L2 size, cache line size, cache associativity, page size, and TLB size. (It also exposed an internal port-scheduling bug on the L2: a 16-byte stride should have been faster than a 32-byte stride.)

Math is pure and not constrained by the real world. Big O analysis begins with the assumption that you have unlimited uniform memory. The author points out that memory is not uniform in the real world. It's equally untrue that we have infinite memory at our disposal. The limits of the real world are good to remember but that does not invalidate Big O analysis.

For the analysis in the article, it is enough to assume an unlimited amount of space into which you can put your memory as well as an unlimited amount of time. For your own analysis, you can of course make any assumptions you wish, but the author argues (and I agree with him there) that the spatial scaling O(sqrt(N)) for random access is both relevant in practice (as seen by the caching graph) and cannot be improved on when considering the laws of physics.

> The limits of the real world are good to remember but that does not invalidate Big O analysis.

No, it just means that you failed to take one specific bit into account and that your analysis may hence be less relevant to the real world. Much the same way we can say that integer addition is O(1), because usually we deal with fixed-size integers, we can say that memory access is O(1), because usually we deal with a fixed maximal memory size. Of course, integer addition is O(log(N)) and, according to the arguments made in the article, memory access is O(sqrt(N)).

The author assumes uniformly random memory access throughout the article. This is the worst case for a cache. If you're going to get down to the nuts and bolts of your implementation and come out of the high world of mathematics and Big O, then you cannot just consider one element, namely, in this case, caching. You should also consider your access pattern, which very likely is not uniformly random and therefore probably does not fit the author's analysis. In fact, the only reason caching works is because the author's premise is generally wrong.

The author does address this point in part three of the series where he compares access times to a small area of a larger memory region. Even with absolutely sequential access to an array of size K, you first have to find this particular array in your larger memory region of size N, giving you total cost to iterate over the full array as O(sqrt(N) + K).

I am not sure how costly it is to iterate through your entire memory (assuming a no-op), but I would argue that eventually you reach some bottleneck and end up at O(sqrt(N)+N), too.

Interesting read. Researchers in the HPC community have developed a number of performance models to predict real-world performance in more detail than is possible through simple Big-Oh counts of operations; e.g., while OP concentrates on latency, the Roofline model ( https://en.wikipedia.org/wiki/Roofline_model ) mainly considers limited memory bandwidth.

A lot of people are pointing out that BigO is a purely theoretical, mathematical model that should be understood and used properly without regard to silly details like physics.

That is theoretically correct. But, the difference between theory and practice is that in practice there exists a large percentage of programmers writing code for the real world without understanding and using BigO properly. Their mental model of performance begins and ends with BigO. As far as they are aware, its model is reality.

Source: I've been giving a large number of programmer job interviews lately. It's a rare day when I encounter an engineer (even a senior one) who is aware of any of the issues brought up in this series. And, I work in games!

The article is still wrong - iterating through a linked list is O(N log(N) sqrt(N)). You can't have infinite nodes in a 16-bit, 32-bit, or even a 64-bit address space - to deal truly with N, one must consider the more generic case of a variable address encoding, which has a variable size (log(N)) and associated lookup etc. costs as the number of nodes grows.

This is the motivation behind e.g. the "x32 ABI" in Linux: All the power of x86-64 instructions, with none of the additional cache pressure/overhead of 64-bit pointers - log(32) being cheaper than log(64).

...ahh, being this explicit in your Big-O notation is probably not that useful, usually, although I've seen it occasionally in papers (where they're quite explicit about also counting the number of bits involved). Maybe they're dealing with BigNums, which would make it a practical concern? The key takeaway is this:

> That I use Big O to analyze time and not operations is important.

Time depends on compiler settings, allocation strategy, and a whole host of other factors that are outside the purview of your algorithm. Operation counts are a lot easier to contrast and compare between different algorithms, which is the meat of what you're trying to do most of the time. Both are valid choices, just know which one you're dealing with.

The time factors are good to be aware of, to be sure - the performance pitfalls of (potentially) highly fragmented, hard-to-prefetch linked lists over unfragmented flat arrays should be well known to anyone charged with optimizing code - but it's probably easier to think of them as some nebulous large time constant (as even array iteration is going to hit the same worse-than-O(N) behavior, although with proper prefetching the bottleneck may become memory bandwidth rather than memory latency) and deal with those differences with profiling and other measurements, instead of Big-O notation.

Sorry, but for asymptotic analysis pointer size is considered constant. So even the smallest list uses B-bit pointers.

Feel free to introduce a packed pointer type that would be asymptotically faster, but then it is a different data structure.

Since when does a linked list mandate unpacked, fixed-width pointers? Citation needed! I note they are not called pointered lists, and challenge the assumption that they even need use pointers - or do you think this abstract computer science concept is something different when applied to disk storage formats by using file offsets as the link? Linked lists as a concept date back to ~1955, when RAM was very expensive - magnetic-core memory being just introduced. Think punch cards.

Returning back to the modern era - one also does not need a packed pointer type, merely support for multiple pointer sizes within the same process, and the ability to dynamically dispatch to the correct one for the job. And multiple pointer sizes within the same process are a matter of routine:

In the 16-bit C era, vanilla "near" and "far" pointers very clearly weren't the same size. A very similar situation occurs under the hood when dealing with 32-bit processes on 64-bit windows via WoW64, which can lead to some interesting complications if you use a 64-bit application to create a crashdump for a "32-bit" process, where you can actually inspect the 64-bit side of things too (fire up WinDbg and dig in: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3... )

This is to say nothing of 20+ byte member function pointers on 32-bit architectures because C++ is only out-crazied by C++ compilers. Or dealing with the complications of multiple pointer sizes via shared memory. Or dealing with file "pointers" (read: offsets) of varying sizes.

> You'll know that iterating through a linked list is O(N), binary search is O(log(N)) and a hash table lookup is O(1). What if I told you that all of the above is wrong?

It's not wrong, it doesn't have enough contextual information to be right or wrong.

Ehh, his assumption is that O notation should say something about small input sizes, which is simply wrong. He should be testing 10-15+ GB data structures, not sub-MB data structures.

Consider: O(x * log(x)) is generally thought of as good. But let's assume the algorithm takes 24 hours + x * log(x) nanoseconds. Well, for low values of x that 24 hours is going to be a pain and it's going to look like constant time, but for really, really big data sets x * log(x) will actually dominate.

There's been a fair bit written on the topic. One of the better papers that has a parameterized model is here: https://www.computer.org/web/csdl/index/-/csdl/proceedings/f...

I should note that this paper is more than 25 years old. :-)

I guess the author is trying to simplify, but it's way more complex than that. Simply assuming a few layers of cache completely misses all the other layers that have effects, starting with:

Cache lines, RAM read vs. write turnaround, DRAM pages, number of open DRAM pages, other CPUs interfering with the same RAM channel, remote NUMA nodes, and probably some I'm forgetting. All this is very similar to secondary storage access rules (even for SSDs)...

Assuming each layer is at worst tree-like, but you can easily devise the upper bound.

Sure, but I think the point is that big-O notation fails miserably at analyzing certain algorithms because it doesn't have a way to represent the locality of the algorithm.

Put another way, all the little "constants" thrown away in the analysis may not actually be constants, and their non-constantness may be enforced with actual physics. In other words, like the article says, the idea that storage access times are constant is nonsense. Due to physical limitations, this is insurmountable rather than being a side effect of architecture. So, its quite possible that for certain algorithms the "constant" factors may be overriding terms in the analysis.

The article is conflating theoretical algorithm analysis and low level implementation details.

Big O analysis is a theoretical measurement of algorithm performance. By definition it ignores details like memory access speed, the exact instructions used, and other details of specific hardware architectures.

Real life algorithm implementations obviously need to deal with those low level implementation details, but that doesn't change the theoretical analysis. It's easy enough to find (or design) machines without cache where this difference in memory speed doesn't exist.

"RAM+arithmetic"-style complexity analysis in terabyte-scale-and-above problems is unable to distinguish between very practical and absolutely impractical algorithms. "square-root ideal cache hierarchy + arithmetic" has much better distinguishing power (for parallel programs, you also need to remember to bound throughput, e.g. Bernstein's area*time, for similar reasons).

It's amazing how many people didn't actually read all 4 parts of the article.

His argument has nothing to do with caching or prefetching, etc.

First, it's about random access. You can't prefetch a random fetch!

Second, he's measuring time, a perfectly valid thing to do. And the reality is when you lay your memory cells out in 2 dimensions it takes order of sqrt(n) time to fetch a random memory cell value, where n is the number of memory cells you're using.

Third, it turns out order of sqrt(n) time is the best you can do even if you had the best technology in the universe.

I'm not sure the cost of accessing the storage medium belongs in the complexity of the algorithm, since that cost will change based on the storage medium, not the algorithm itself. It strikes me as more of a constant (even though it isn't constant).

Still, an interesting read nonetheless.

In part two, the author demonstrates that if you packed information as densely as possible, i.e. your computer is a black hole, the best you could do is O(n^1/2).

If, and only if, you're accessing that memory randomly. If you're accessing it in a linear scan, that line is going to drop away from sqrt(N) very quickly.

Yeah, but that is not an argument against including it in the analysis of how a function behaves when you throw lots of data into it. Also, the author comments on this in one of the parts.

I agree that in practice most of the time it won't affect your choice of algorithms. But it may be helpful in understanding how your run time will actually scale as your data size increases.

Also, I could conceive it being useful in odd cases if there are choices between two algorithms:

Algorithm 1: O(N) memory access operations, O(N log^2 N) CPU operations

Algorithm 2: O(N log N) memory access operations, O(N log N) CPU operations

If we assume memory accesses take O(1) time, then Algorithm 2 is faster (O(N log N) total vs. O(N log^2 N)), but if memory accesses take O(sqrt(N)) time then Algorithm 1 is faster (O(N^1.5) vs. O(N^1.5 log N)).

Here's the problem, the sqrt(N) assumption is based on completely random access, where most accesses bust all the caches. What will matter more, when it comes to the cost of memory in real applications, is the actual size of your dataset in relation to the cache sizes, and whether your access is sequential across a region of memory (an array) or random (a scrambled linked list).

In practical terms, the plateaus of the stair step graphs matter more than the general trend of the line.

Unless you're concerned about the worst-case scenario, where the last part is definitely never a plateau. You can reduce it to a Turing machine fetching your data sequentially.

I find this really odd, it's not wrong, but it doesn't invalidate O(1). It's mashing two-things together that are unneccessary and can cause misunderstanding.

Big-O provides a decent tool for generic analysis, as does an understanding of the access times of memory hierarchies. Since memory hierarchies can vary, they mostly shouldn't be considered while doing generic analysis anyway.

Both are important to understand. The key thing is setting your Big-O access expectations to the slowest level of your hierarchy. In that way, your expectation remains generic and still approximately accurate across the average cases.

When you consider them together, think of the hierarchy as a series of piecewise functions that modify the value of the constant time based on the speed of the bounds that fit your data.

This square-root-of-N notation falls apart in other cases. 128 GB of RAM would have roughly the same access speed as the 8 GB he had available, if he had that much in his system. But having 128 GB of RAM would completely destroy the square root by flattening an entire magnitude from his hypothesis.

But it is a nice display of memory heirarchies, IMO.

If you go on to read the later parts, the author explains how physics limits random access to N bits to O(sqrt(N)) even without a memory hierarchy. This has practical implications when comparing sequential access in an array to random access in a hashmap.

> This square-root-of-N notation falls apart in other cases.

Not particularly; since O(n^1/2) is typically used as a catchall for all O(n^x) where 0 < x < 1. The whole idea of Big O notation is that it provides an idea of systemic complexity that is normalized to a theoretical "constant" system represented by a coefficient (that is usually dropped when writing out the Big O).

IMO it still holds: distributed systems behave differently than local ones. Now that we're writing algorithms on the cloud, we've added an abstraction layer that adds complexity, and that brings some non-constant amount of overhead. It didn't make sense to think about memory like this 50 years ago; but it does now.

Nah. Sorry, cache misses don't count as part of a theoretical analysis of complexity. Why? Because you're getting into specific access-pattern performance. Complexity is about "all things being equal". Is it the only thing you should consider? At first it should be; then, if you run into a problem with a specific structure that has remarkable scale or access patterns, go ahead and consider what the underlying hardware might be doing with the specific access patterns your structure is encountering.

It's interesting to see linked-list as his example, because it is the most likely to have cache-misses as you move through it as the allocations are very fragmented. I'd be very curious to see the same chart on a warmed-up hash-table.

Also, if we're considering the hardware, can we take into account prefetching and branch prediction? What are your numbers then? Yeah, RAM is farther out than the local caches, but the CPU is also not completely ignorant of what it has to do next.

Did you read part II of the series? It's a deeper analysis than you seem to give him credit for.

His argument is that in this universe, the specific laws of physics we have here enforce the property that memory access takes o(sqrt(N)) time [1] where N is the size of your data set.

[1] - What's actually going on is that it takes o(N * sqrt(N)) time to touch N bits of information. Since for any particular bit, it might be very close to the processor and thus fast, but hitting N bits necessitates that most of the bits you access are far away.
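A back-of-the-envelope version of that footnote: with memory laid out in a plane, the i-th distinct bit you touch sits at a distance on the order of sqrt(i), so touching all N bits costs roughly

```latex
\sum_{i=1}^{N} \sqrt{i} \;\approx\; \int_{0}^{N} \sqrt{x}\,dx \;=\; \tfrac{2}{3}\,N^{3/2}
```

i.e. about sqrt(N) per access on average, which is where the N * sqrt(N) total comes from.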

I did read part II, and I think his analogy is off. Memory isn't linearly limited (it happens to be in the processor, but this isn't a given); I don't even understand why he brings circles into his conception of memory. Memory isn't linearly bound to the previous local cache. RAM isn't an order of magnitude larger than the L3 cache, it's MANY orders of magnitude larger! His analogy is just wrong. A better analogy would be: library A takes X amount of time to access its books and can store 100 of them, and library B takes 10X amount of time to access its books, but it can store 1,000,000 books!

Yes, as we increase memory size, latency goes up, but he never proves that this is sqrt(N) (he only shows a correlation). Each jump up can be explained by cache misses in each successive cache level, but RAM can scale more than a few orders of magnitude beyond the L3 cache. He needed to keep his chart going past 1GB to see that there is actually a plateau to be hit. If he had scaled to 10GB and then to 100GB he would have seen access times be about the same as they were for 1GB.

Re-read the "The theoretical limit" section in part II. His arguments depend on the physics in this particular universe, not how computers happen to be built.

My response was about his math and physics:

If you scale the radius of a sphere by one unit then you indeed get an order of magnitude increase in volume (bits of info), but that's not the correct model! We don't increase by one unit! RAM isn't a one-unit increase in radius, it's an order of magnitude increase! If you increase the radius by an order of magnitude you get three orders of magnitude more memory. So you jump up in latency by one order of magnitude and you get three orders of magnitude of memory in return.

His analysis and math are wrong.

You're totally missing his argument. If you scale RAM cubicly at any positive density, as you are suggesting, eventually you will achieve the matter density sufficient for your RAM to gravitationally collapse into a black hole.

If the density of your RAM is d, then the volume that a mass M of RAM takes up is M/d. If it's arranged in a sphere, the radius is (3M / (4 pi d))^(1/3). Notice how this radius is a constant times the cube root of M.

Whereas the Schwarzschild radius scales proportionally to mass.


Thus, if you put enough mass of constant postive density together (no matter how small that density is) eventually you get a black hole.
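Spelling the two radii out (G is the gravitational constant, c the speed of light):

```latex
r_{\text{sphere}}(M) = \left(\frac{3M}{4\pi d}\right)^{1/3} \propto M^{1/3},
\qquad
r_s(M) = \frac{2GM}{c^2} \propto M
```

r_s grows linearly in M while r_sphere grows only as the cube root, so past some finite mass the sphere sits inside its own Schwarzschild radius.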

N.B. - The author's argument is a little more subtle because he's talking about information density via the Bekenstein bound, and he gets a square root instead of a cube root. But the argument is the same flavor.

You clearly did not understand the linked article at all. He is referring to theoretical results on the information density of a black hole, which indeed collapses to area. In practice, the reason RAM doesn't appear to increase by three orders of magnitude is not that (it's that we don't actually build RAM in a sphere) but his point is that whether you are looking at theory or practice the square root is the correct model.

> Sorry cache-misses don't count as part of a theoretical analysis on complexity.

They don't in the random access machine model, but there are other models where cache misses do factor in.

Yes. This is what I came to say. Big-O assumes a model of computation. Typically people assume a random access machine with certain constant time operations like arithmetic. In some models arithmetic might not be O(1) (e.g., in a pure lambda calculus it's O(n)). A random access machine model makes a bunch of simplifying assumptions that don't actually hold in practice. As the article suggests, if you're writing code in industry it's not a bad idea to understand how your specific hardware differs from the model.

Why is this wrong-headed discussion top-rated on HN?

And why is there so much misunderstanding on HN of big-O notation wrt cache misses lately?

All you kids, get off my lawn.

Closely related but unfamiliar to most software geeks, Bélády's work in the 1960s and later on the theoretical limits of operation throughput when using cache hierarchies is very relevant to high-performance software design. The theory generalizes nicely to any topology where you can control how access latencies are distributed, and carefully designed software can get relatively close to the throughput limits (though it is somewhat incompatible with the way most software engineers design systems these days e.g. multithreaded concurrency is a non-starter).

Do you think something like lightweight modular staging (https://scala-lms.github.io/) would allow fairly high level code to get to the theoretical limit?

> At this point some of you may argue that the whole idea of Big-O analysis is to abstract architectural details such as memory latency. This is correct - but I argue that O(1) is the wrong abstraction.

No, your model is wrong. Others have already pointed out some issues with the author's understanding of Big-O notation. However, this is a fundamental misunderstanding. Big-O is a tool to analyse some function's asymptotic behaviour, i.e., how it behaves when the input parameter grows versus infinity. You have to put your model of cost into that function. If your measure is time, and memory access doesn't take constant time in your model, then you have to account for that in your cost function. You can just as well use Big-O notation to describe the asymptotic space complexity of an algorithm (how much memory does it need?). O(1) has no special meaning - it's just the set of all unary functions whose value stays below a constant, no matter how large their input parameter gets.
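For reference, the standard definition being invoked here:

```latex
f \in O(g) \iff \exists\, c > 0,\ n_0 \ \text{such that}\ |f(n)| \le c\,|g(n)| \ \text{for all}\ n \ge n_0
```

Take g(n) = 1 and this reads "bounded above by a constant for all large n", nothing more.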

The author is literally blaming his tools for his own misunderstandings.

Blaming the authors for ignoring true memory models, more like.

The notation is worthless for big-data use due to the unrealistic machine model it implies.

No, the RAM model is worthless for big data use, because it doesn't model distributed computing at all. You need a different model - one that also includes communication, and probably file I/O as well. Big-O notation is not to blame if the thing you put into it models the wrong thing. Just assign a cost α to initiating a connection, β for sending a word of data, and similar for I/O. Treat those as variables, and you'll get O(local work + β communication volume + α latency).

Thanks to the prefetcher, a low-entropy access to memory, like reading the next value in an array, will tend to happen in constant time. For a linked list, tree, or other data structure where the location of the next access can't be predicted easily by something like stride analysis, the author is correct.

The author argues that a random access to memory is not O(1) but instead O(root N) because of distance.

The easy reactive response is that with respect to algorithm design the size of RAM, N, is a constant.

On the other hand for very high scaling factors, as input size rises the size of RAM must also rise. In this way N can be thought of as a variable and that seems to be what the author is thinking. Different algorithms will behave differently as they are scaled to infinity and beyond.

I think the author's argument is interesting but maybe it's better to make new models for time complexity analysis. I think Bob Harper's students have done good work on this.

In addition to distance there is also the cost of selection, namely the muxes and decoders, which would multiply the cost of access by log N.

I am sorry... but no, the article is interesting and well written, but it has nothing to do with big O notation. Random access in memory is still in O(1); it doesn't depend on the size of the data structure (I am assuming that is the "n" the author talks about when pretending that a memory access is O(sqrt(n))). Even if you have a very complex memory architecture with 15 caching levels, spread all over the world, if you have a maximum of a 5-day delay for accessing your memory through the mail, it will still be O(1), because 5 days is constant; it does not depend on the size of the data structure.

The "n" the author is really talking about may be the depth of the cache hierarchy.

Are your questions maybe answered in part IV of the series, the FAQ[1]?

[1] http://www.ilikebigbits.com/blog/2015/2/9/the-myth-of-ram-pa...

There are models like the Transdichotomous model [1], where the properties of the machine vary based on the problem. That seems to be what is happening here.

[1] https://en.wikipedia.org/wiki/Transdichotomous_model

What he tried to show is precisely that in practice it does depend on the size of the data structure.

...but if you have a significantly smaller dataset, you don't need such a complex caching architecture, which brings faster access times at the margins. So even in your example, worst-case access time scales with dataset size.

It's just a fun way to frame a conversation about cache/ram latency, relax.

I see the point and it is interesting, it's just not big O ^^ If I replace in my head all O(thing) by MyCustomComplexityMeasureYetToBeDefinedProperly(thing) then it makes perfect sense.

Big O is basically that. Big O notation predates computers (late 19th/early 20th century) -- it's simply a notation to signify how a problem scales in complexity in relation to the size of the data set. I would argue that multi-level memory access is itself a problem that scales complexity in the same way.

You can say everything is O(1) because nothing would ever take more than `2^128` seconds. In practice, O-notation means "with a reasonably small constant".

> but it has nothing to do with big O notation

If this has nothing to do with Big-O then Big-O is a useless concept and should be abandoned. You may be interested in measuring the number of CPU instructions it takes to traverse a linked list in which case O(N) is accurate.

Personally I care about how much wall-clock time it takes in which case a linked list is very much not O(N) and the larger the list the further away from O(N) it becomes. There is a rather large delta between a 10K and a 100MB linked list. You can go test this for yourself right now.

> Random access in memory is still in O(1)

No it isn't. Random access only looks like O(1) at certain plateaus, like size of L1, size of L2, size of L3, size of RAM on the local NUMA node, size of RAM on neighboring NUMA nodes, size of RAM on distant NUMA nodes, size of SSD, size of HDD, size of NAS locally connected, size of NAS distantly connected, size of cloud storage.

Constants can matter in Big-O, especially if they are very large. Compared to the number of instructions 3Ghz CPU must execute to do a hash lookup, accessing RAM has a relatively large constant factor and execution time can be massively impacted by sub-optimal RAM access patterns.

Except Big-O was never intended to describe wall clock time. As soon as you try to use Big-O to describe a number of concrete units you'll fail. Big-O is purely a comparative measure, this algorithm vs. that algorithm. As it stands, it is fairly meaningless outside of purely academic exercises.

EDIT: I mean, consider that the entire basis for Big-O notation assumes that all "operations" are equal. Adding is always constant. Big-O is very hand-wavey, and claiming the constants matter is to invent an entirely new notation.

Big-O isn't handwavey, it was absolutely intended to describe (among other things!) wall-clock time, and the article isn't just talking about constants. Also, many operations (like adding) are constant-time on modern computers, outside of memory hierarchy effects, ALU stalls, etc., since they're bound to finish within exactly one clock-cycle.

Big-O is mathematically rigorously defined, but has nothing to do with wall-clock time. It's about the asymptotic behaviour of functions. How you interpret the values of these functions is up to you - of course you can model wall-clock time, but then you'd probably be interested in all the constant factors and lower-order terms that asymptotics hide from you, so Big-O analysis is the wrong tool.

If wall-clock time is being affected by non-constant factors (as the article convincingly asserts), why on earth wouldn't you want to model those factors with big-O notation? Moreover, why do you think that people uninterested in wall-clock time wouldn't care about constant factors, but people interested in wall-clock time unilaterally would?

If your program does two steps, one of which takes time O(n) and the other O(√n), then your program takes time O(n) because that dominates. That's an example of lower order terms. If you measure wall-clock time, then you'd be interested to know about that second step. But if you're looking at asymptotic analysis, then it's completely irrelevant because the linear term dominates it.

Your other questions are all best answered by: Big-O notation is not the right tool if you want to make predictions about wall-clock time. I don't think I need to justify why someone interested in wall-clock time would be very interested in the constant factors involved - the difference between n and 20n is a factor of twenty, and whether a program takes half a second or ten does make quite the difference.

But asymptotically, they're the same. Someone who is interested in the asymptotic behaviour wants to know how fast the function grows - is it linear? linearithmic? quadratic? cubic? exponential? (with which base - 1.1^n, 2^n, 10^n?). For them, a constant factor isn't interesting.

Of course, neither extreme is particularly helpful when trying to compare algorithms in practice. Modelling the hardware extensively and including all the little constant factors is exceedingly tedious, even if you can actually get all the information required to do so. Pure asymptotics also don't help much if they hide huge constant factors. That's why Algorithm Engineering is a thing (and a very good thing!). Basically, it tries to bridge the gap between algorithms that theorists like, and those that practitioners use. To that end, design, analysis, implementation, and experimentation must form a circle and influence each other. Wikipedia probably does a better job of explaining it, though.

The article points out that accessing n words of RAM takes O(n·sqrt(n)), which is not dominated by O(n); that is my entire point here. And in the cache-oblivious model (for example), big-O notation is still alive and well; there are just additional variables introduced to account for the various constants.

What the graph really seems to indicate is that time is only linear when working within a cache size on the author's computer (remember that iterating a linked list accounts for the gradual increase in between the cache jumps). If the theoretical upper bound of RAM access was really the important factor at this scale, I wouldn't expect it to be almost flat and to suddenly jerk up every time we have to go to the next cache.

Assuming the author's O(sqrt(n)) is correct, it seems only relevant on much, much larger scales.

In light of that, it really doesn't make sense to pollute the typical use of Big O notation. It should always be understood to be just one metric to understand an algorithm.

Related, Herb Sutter's fantastic talk about arrays:

https://channel9.msdn.com/Events/Build/2014/2-661 @ 23:30

The black hole piece in part II was amusing, if you keep reading.

I am pretty sure it's wrong. I just made a comment about this on the author's article. He claims the information content scales as N ~ m·r, but for any normal material the mass is proportional to r^3. So N ~ r^4, not r^2 as the author claims.

Black holes and information is a thorny issue, see e.g. [1]. So I suspect something went awry in the argument when (s)he brought quantum gravity into it.

[1]: https://en.wikipedia.org/wiki/Black_hole_information_paradox

You have a volume V1, and let d1 be the maximum density of information you can store in V1. If you try to store any more, your storage will collapse into a black hole, so d1 is the maximum density.

Now you get 8 times more data. But you cannot store this data in a volume 8·V1, because if you try to use 8 volumes V1 with density d1 and place them next to each other, they will collapse into a black hole.

Thus, the more data you have, the smaller density you can use for storing it.

"Normal materials" don't scale - with a constant density an expanding sphere will collapse into a black hole.
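A quick sanity check on that last claim, using the Schwarzschild radius (a sketch, ignoring the exact constants):

```latex
r_s = \frac{2Gm}{c^2}, \qquad m = \tfrac{4}{3}\pi \rho\, r^3
\quad\Longrightarrow\quad
r_s = \frac{8\pi G \rho}{3c^2}\, r^3 .
```

At constant density the Schwarzschild radius grows like r^3 while the sphere's own radius grows like r, so beyond some critical radius r_s exceeds r and the sphere sits inside its own event horizon.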

I think this way of looking at the problem is misleading. O(1) or O(N) always stays O(1) or O(N), just the constant changes. You can always access any element in RAM (on a SSD, HDD) in a bounded amount of time. Use that pessimistic time as the time of one step.

Viewed in this way, O(N) is still O(N), and a processor with caches is a magic device that somehow computes faster than O(N)... or for O(1) computes in sub-constant time (if that can be even well-defined).

If I understood it correctly, the author links cache misses in the memory hierarchy to asymptotic complexity (big O), so if the operation that services a cache miss has a higher time complexity, he charges that instead of O(1).

Something similar happens when you write an O(1) algorithm while relying on malloc(), which is usually O(n log n); your algorithm is then not really O(1), but O(n log n).

What's your n? The amount of memory allocated? In that case, I'd be interested to see such an algorithm that takes time O(1) but touches O(n) memory cells in the process...

Obviously you need to include the time your calls take in the analysis, but your example seems odd.

I think there is a key point in the FAQ (article four, all linked through the series):

> You are conflating Big-O with memory hierarchies

> No, I’m applying Big-O to memory hierarchies. Big-O is a tool, and I am applying it to analyze the latency of memory accesses based on the amount of memory you are using.

As some others have pointed out, the line is crossing hierarchies of cache, and that he is not looking at the big O of instructions. Both of these are accurate, and the author is aware of this.

He is using the tool of big O analysis to measure a performance characteristic. That characteristic is not the traditional number of instructions or amount of memory utilized in the computation of an algorithm. It is the latency for access to a random piece of data stored on a system.

There are two cases considered: the practical and the theoretical.

At the practical level, we do not have a unified physical implementation of the address space in a modern computer. This means that accessing a random address in memory is an action that will most likely cross levels of the cache hierarchy. It is well known that there are order of magnitude jumps crossing these levels. Perhaps it is uninteresting to you, and the importance of cache locality in an algorithm is something that you already have a very strong handle on. That makes his observation of time-to-access a random address trivial, but not wrong.

Big O tells us that a binary search is the most efficient search algorithm for an array (constraint - the array must be sorted), but in practice a linear search with a sentinel value across an unsorted array will be faster if the array fits in cache. Keeping in mind the big O latency of random memory access across cache hierarchy levels would be the theoretical analysis to tell us this. The traditional big O looks at number of instructions. These are both valid tools in choosing an optimal algorithm.
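For illustration, here's a sketch of the sentinel trick mentioned above (in Python for brevity; the cache argument really applies to compiled code scanning a contiguous array, where the sentinel removes the per-element bounds check):

```python
def linear_search_sentinel(arr, target):
    """Linear scan using a sentinel to avoid a bounds check in the loop.
    Temporarily overwrites the last slot; assumes arr is non-empty."""
    last = arr[-1]
    arr[-1] = target          # sentinel guarantees the loop terminates
    i = 0
    while arr[i] != target:
        i += 1
    arr[-1] = last            # restore the original value
    if i < len(arr) - 1 or last == target:
        return i
    return -1                 # target was not in the array

data = [7, 3, 9, 1, 5]
print(linear_search_sentinel(data, 9))   # -> 2
print(linear_search_sentinel(data, 4))   # -> -1
```

For the sorted case, `bisect.bisect_left` from the standard library is the binary search this gets compared against: fewer comparisons, but each probe lands on a different cache line.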

The second point the author makes is the theoretical limit. Assume the ideal storage medium with minimum access latency and maximum information density. This storage medium is matter. The limit of packing is the point at which you would create a black hole.

With this ideal storage medium, you cannot pack an infinite amount of data within a distance that can be traversed at the speed of light within one clock cycle. For this colossal storage array, there are some addresses which cannot be physically reached by a signal moving at the speed of light within the amount of time that a single clock cycle (or single instruction) takes. Accessing a random address is not a constant time operation, though the instruction can be dispatched in a constant time. There is a variable time for the result of that instruction to return to the processor.

At this theoretical limit, we would still end up with a cache hierarchy, though it would be 100% logical. With a single storage medium and unified address space, the cache hierarchy would be determined by physical distance from CPU to physical memory location. Those storage cells (whatever form they take) that can be round-tripped by a speed of light signal in one clock cycle are the first level of cache, and so on. You could have very granular, number-of-clock-cycles cache levels stepping by one at each concentric layer of the sphere, or you could bucket the number of clock cycles. Either would effectively act as a cache.

This theoretical exercise is an extreme limit, but it bears out the practical implications that our current physical implementations of the cache hierarchy exhibit in practice.

Again, perhaps these observations are trivial, but I believe they do stand up to scrutiny. The key insight is that the performance characteristic being described by big O is time, not the more traditional space or number of instructions.

I think time is a valuable metric in terms of algorithm selection. If we think about end users - they don't care that one instruction or 1,000,000,000 are being executed. They care about how quickly work is done for them by the computer. Instruction-based analysis can be a huge help in this consideration, but so can time-based analysis.

Neither should be ignored, and neither invalidates the other.

> Big O tells us that a binary search is the most efficient search algorithm for an array (constraint - the array must be sorted), but in practice a linear search with a sentinel value across an unsorted array will be faster if the array fits in cache.

From a cache perspective, a binary search is actually the most pessimal case. If you want efficiency, look at b-trees optimized for memory hierarchies.

Only after reading the last article of the series did I check the link to share it. Only then did I notice that I had misread the heading on the blog. I read "I like big tits" and thought: is this page hacked or something? The URL corrected my dirty mind :).

Great series. Even if you don't agree with the notation it has still valuable information. Thanks author!

Let me quote Einstein: "Everything should be made as simple as possible, but NOT simpler!"

And that's IMHO exactly where the original author erred. But I find his musings so incredibly funny and enlightening that I will use them as a future reference for how NOT to do an analysis.

He didn't just do an apples vs. oranges comparison; he essentially threw eggs, potatoes, and ham into the mix and tried to deduce a universal law from his concoction by sprinkling some quantum mechanics fairy dust on top! Hilarious!

Just by looking at his sloppy graph (typical origin shenanigans are often a dead giveaway for the quality of an examination), one should be able to recognize the form of an underlying step function, as expected from a multi-layered memory system (L1, L2, L3, RAM, ...).

But NO, he envisions a square-root function, just by arbitrarily placing a line in a logarithmic coordinate system. WTF?! Where is the fitting? And how does he defend his conclusion? drum-roll QUANTUM MECHANICS... Muhahahaha! Great show!

So essentially it's not cost that prohibits our brave engineers from increasing L1 cache size ad infinitum, but quantum mechanics! Muhahahaha! Stop it, my stomach hurts...

I love that you have had the foresight to create throwaway accounts and leave them for a year before using them to be insulting and indulging in pointless theatrics.

I'm sorry that you feel that way, but instead of criticizing me as "insulting" and "indulging" in "pointless theatrics", maybe you could explain how and why his "Myth of RAM" argument and O(√N) hold any merit.

Just by looking at his graphs you could argue that, if it's about RAM, he should just look at the part from 10 MB to 8 GB, which shows an almost ideal approximation to O(n log n). The L1 part is even better than O(1), which is curious in itself.

So please excuse me if I'm a bit sarcastic about the "groundbreaking" insights. "Myth of RAM" seems like an attention-grabbing caption, and I couldn't help but comment accordingly. No insult or harm was ever meant. If it was perceived as such, I do apologize.

Why is information within a sphere bound by m * r? Naively, I'd expect it to be bound by r^3 or m * r^3.

If anybody is curious about the physics, the principle he described is pretty similar to the Holographic Principle.


It's a very interesting experiment/conclusion, but it rests upon one assumption: The assumption that the entire dataset has been preloaded into the L1/L2/L3 caches.

This assumption is a shaky one to make, and is easily violated. Imagine if you have a hashmap that is small enough to fit entirely in L3 cache. However, most of it has been evicted from the L1/L2 caches, by other data that the core has been reading/writing to as well. Eventually, the thread returns to the hashmap and performs a single lookup on it. In this scenario, the time required will indeed be O(1).

So what you really have is a best-case complexity of O(sqrt(N)) if your data has been preloaded into the closest possible caches, and a worst-case complexity of O(1) if your data is stuck in an outer-level cache or DRAM. Given that we usually care more about the worst-case scenarios than the best-case scenario, using the O(1) time complexity seems like a reasonable choice.

Going back to the author's premise that the time-complexity of a single memory access is O(sqrt(N)), not O(1), this is true only where N represents all/most of the dataset being processed. If N represents only a small fraction of the dataset being processed, and your caches are going to be mostly filled with other unrelated data, then the time complexity is closer to O(1).

Clearly the O(sqrt(N)) is more accurate than O(1) under some circumstances, but even so, it's not clear what benefit this accuracy confers. All models are inaccurate simplifications of reality, but simple-inaccurate models can still be useful if they can help in decision-making. Big-O analysis isn't used to estimate the practical running-time of an application. For that, you'd be better off just running the thing. Big-O analysis is more used to compare and decide between different competing algorithms/data-structures. And in that sense, whether you choose to model linked-lists/binary-search/hash-maps as O(Nsqrt(N))/O(log(N)sqrt(N))/O(sqrt(N)), or O(N)/O(logN)/O(1), the recommendation you end up with is the same.

Great article. It gets better, too, so make sure you read all 4 parts.

The library example is a bad one, since it leads to O(N) and not O(√N), a conclusion that contradicts the thesis.

"In general, the amount of books N that fits in a library is proportional to the square of the radius r of the library, and we write N∝ r²."

No, the number of books N is proportional to the area of the front face of the shelving, not the area enclosed within the circle. Assuming all libraries are the same height, that means N is proportional to the circumference of the circle, which is proportional to r, not r². Meanwhile, assuming that all books are reachable in the same amount of time by the librarian no matter their height on the shelf, that means T ∝ r (as before). Since T ∝ r and N ∝ r, that means T ∝ N or T=O(N).
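Writing both readings out (with T the retrieval time, N the number of books, and r the library's radius):

```latex
\text{Shelves on the wall only:}\quad N \propto r,\;\; T \propto r
  \;\Rightarrow\; T = O(N).
\qquad
\text{Shelves filling the floor:}\quad N \propto r^2,\;\; T \propto r
  \;\Rightarrow\; T = O(\sqrt{N}).
```

The disagreement is entirely about whether books occupy the circumference or the area of the library.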

Eh, your library seems to have a very poor use of floor space. Place the shelves well and you will definitely have something that can store books proportional to the area of the library.

Hint for more efficient placement: You can put shelves away from the outer wall in arbitrary positions within the circle. At that point, the assumption that the librarian can reach any point in the circle at constant time is blatantly false (as the books closer to the centre will generally be faster to reach).

It's not "my" library, that's how it was defined in the article (Part II).

The graph in the article shows the impact of the L1, L2, and L3 caches. If the array fits into the L1 cache, access will be the fastest, and then it degrades with the L2 cache, then L3, then main memory.

> If you instead iterate through an array of size K you will only pay O(√N + K) since it's only the first memory access that's random. Re-iterating over it will cost O(K). This teaches us an even more important lesson: If you plan to iterate through it, use an array.

This is rubbish. Re-iterating it is the same as iterating it the first time: if your array doesn't fit into cache, you're going to pay for pulling it in from further out in the memory hierarchy.

To anyone who doubts me: try it. Try iterating an array that fits entirely in L1 many times, then do the same with an array that has to be pushed out to swap. The slowdown will be considerably worse than linear.
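A rough harness for that experiment (only a sketch: in Python the interpreter overhead mutes the effect, so the convincing version of this measurement is done in C with a large spread of array sizes):

```python
import array
import time

def ns_per_element(n_elems, passes=10):
    """Average time per element for repeated sequential scans over an
    array of n_elems 64-bit ints. In a low-level language, the
    per-element cost jumps once the array stops fitting in each
    successive cache level."""
    a = array.array('q', range(n_elems))
    total = 0
    start = time.perf_counter()
    for _ in range(passes):
        for x in a:
            total += x
    elapsed = time.perf_counter() - start
    assert total != 0  # make sure the scan isn't eliminated
    return 1e9 * elapsed / (n_elems * passes)

# Compare an L1-sized array with one far larger than L3, e.g.:
# print(ns_per_element(4_000), ns_per_element(40_000_000))
```

The sizes in the comment are illustrative guesses; pick them relative to your own machine's cache sizes.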

The theoretical discussion is interesting, especially the circular library that gives some intuition of the square root law.

But in practice, you usually know the order of magnitude of your data, so access is rather O(1), for some constant that depends on the size of the data. Jeff Dean's "Numbers Everyone Should Know" quantifies this constant.


I think that at some point this O(n·sqrt(n)) is actually not precise. Maybe it works for the first few GB, but then other mechanisms come into play.

For example, processing 100 GB of data doesn't actually have to be O(n·sqrt(n)), because if you process it on a cluster, the other machines are also using their L1, L2, L3 caches and RAM. The whole process can then be streamlined, which means that some operations can be faster than the pessimistic n·sqrt(n).

Unfortunately, you now have to deal with network transfer between the machines, which are even more expensive than RAM access.

Definitely, but then is it still n·sqrt(n)? Do I have to add sqrt(n) to each n I'm processing? I doubt it.

I think the problem is that at each stage there is a different component (sqrt(n)) that comes into play.

There comes a point where the latency is high enough that you're very unwilling to pay for random access. And very often that point is "broadcast over a network," where think time really is often dominated by things like the speed of light. So yes, it's still n * sqrt(n) in some abstract sense (where n is the amount of theoretically available memory within a given distance over the network); in practice it's much worse than that because you never come close to saturating all the available space between computers with storage nodes.
