Funnily enough I published something similar as part of my PhD (2010). Essentially communication costs dominate over computation costs - an addition operation is practically free compared to the time and energy of moving data within a chip to the ALU to do the addition, let alone between chips. This is the reverse of the historical VLSI situation, in which computation was slow and expensive compared to practically "fast and free" on-chip communication. The implications extend far beyond just RAM. But yes, depending on the access patterns involved and the (physical) spatial arrangement of data, binary tree traversal (on a 2D CMP or 2D cross-chip layout) is O(sqrt(N)) or O(log(N)sqrt(N)), and this result is not dependent on the system boundaries of memory hierarchies - it is as true for dedicated on-chip scratchpad memories as it is for giant wafer-scale chips (such as Cerebras). There are some special cases where it can be O(1) or O(log(N)) on average that I won't go into, but you can read more here if interested (sections 2.1 and 8.3):
https://www.cl.cam.ac.uk/~swm11/research/greenfield.html
A key way to think about things is that algorithms are not really running in some Platonic realm, each executed instruction occurs at some physical location and time, and data needs to move from one place and time to another place and time. To this end both physical wiring (or networks) are required to move the data spatially, and memory is used to move the data temporally. Together an algorithm's executed instructions are situated somewhere spatio-temporally and RAM (both on-chip and off-chip) serves as a temporal interconnect, but one that itself takes up physical space.
I like to think of this as level 2 systems analysis. It is implied in some of Feynman's papers on computation as well.
It gets even more interesting (to me) when you consider semantic entanglement of data in non-Platonic spaces. When you treat 'time to data access' as a fourth dimension and the state transition vector of an algorithm as the path, one can show that Ft(O(path(n))) (Ft being the function that converts the complexity of a path into the time such a path takes to transit) is rather difficult to nail down for arbitrary n. It can also pop out the result that the lowest-complexity path (algorithm) is not faster than a higher-complexity path (algorithm) if the entangled elements (data) are in a slow space. Crazy, I know, but it is a pretty straightforward path from Amdahl's law to this sort of analysis.
This article has a strange take on algorithmic scaling.
L1, L2, LL, MM, Disk, NetDisk ..., GalacticRAM.
Wow, you actually expected to traverse ‘system boundaries’ with distinct access costs and maintain theoretical scaling?
An algorithm that is O(n) will be O(n) in L1, O(n) in L2, ..., and yes, O(n) in GalacticRAM. Each (sub-)system traversal is due to exceeding capacity limits.
There is likely a more interesting mathematical result buried in here, since we see that a series of sub-systems with relatively constant order-of-magnitude increases in access costs in aggregate transforms f -> f', with this interesting n^0.5 term showing up instead of the logs.
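A rough sketch of where that n^0.5 comes from, under the assumption (mine, borrowed from the article's physical argument) that each level's latency grows with the square root of its capacity:

    \text{Let level } k \text{ hold } S_k = S_0\,b^k \text{ bytes with latency } t_k = t_0\sqrt{S_k/S_0}.
    \text{A random access into a working set of size } N \text{ is served by the smallest level with } S_k \ge N,
    \text{so } t(N) \approx t_0\sqrt{N/S_0} \text{ (to within a factor } \sqrt{b}\text{), i.e. } t(N) = \Theta(\sqrt{N}).

If latency instead grew by a constant additive amount per level, the same aggregation would give a log; the multiplicative, square-root-of-capacity latency growth is what turns the logs into n^0.5.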
This is the 2nd post I've seen recently submitted to HN that plays fast and loose with big-O notation. It has a defined meaning. If someone wants to talk about something else, great, use different notation. People already seem to have some trouble grokking big-O and we don't need to muddy it with ad hoc usages.
According to the exact literal meaning, every algorithm you ever run on your computer is O(1), because your computer is finite. The same can be applied to the reachable universe.
That’s not true - an algorithm is a mathematical “entity”; it can be, and often is, defined over inputs of unbounded size. The actual implementation on a real machine might not run forever, but there is no such limit in the math.
The article was specifically talking about putting algorithms onto real machines and the speed consequences. This creates a new version of the algorithm with certain limits. And because of these limits, there is a maximum runtime for any halting code, and it is a constant number.
No? I'm tired and it was way too long since I dealt with Big O, so someone will surely correct me.
Your computer being finite is irrelevant. Big O is about growth factor.
Just because you are capped at value X (< infinity) does not mean that every value below X has the same characteristics. If it did, sure, you are O(1). But if it is N^2 it will still be N^2 even if your machine only has resources for N < 12.
Big O is about the eventual behavior for large enough N. An algorithm is O(N^2) if, for sufficiently large inputs, the running time is < c * N^2 for some fixed constant c. That's a really weak guarantee for practical purposes -- it allows an algorithm to take stupidly long amounts of time on all practical inputs, so long as eventually it's got a running time with the given bound.
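For reference, the definition being invoked here (standard, nothing specific to this article):

    f(N) = O(g(N)) \;\iff\; \exists\, c > 0,\ \exists\, N_0 \ \text{such that}\ f(N) \le c\,g(N) \ \text{for all } N \ge N_0.

The N_0 is exactly the "sufficiently large inputs" clause; the definition promises nothing below it.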
The colloquial use of big O as a general growth factor doesn't hold up to its definition, if you want to be sure it's valid for small inputs too.
I'm in agreement that saying algorithms on computers are all O(1) doesn't make much sense. In a finite memory model, it's impossible to solve big problems, so the running time function becomes undefined for big enough inputs, so big O becomes completely inappropriate in this situation.
> I'm in agreement that saying algorithms on computers are all O(1) doesn't make much sense. In a finite memory model, it's impossible to solve big problems, so the running time function becomes undefined for big enough inputs, so big O becomes completely inappropriate in this situation.
The difference between "everything is O(1)" and "you're not allowed to use O" is pretty minor anyway.
Either way it takes a lot of very useful analysis and throws it out the window because a particular branch of math was being too pedantic about needing infinity.
Use big O anyway. It works fine. It's a good tool even within the limits of real machines. If you're not abusing it on purpose, it tells you useful information in the exact same way that the pure math version does.
> That I use Big O to analyze time and not operations is important.
This is why we have benchmarks and big-O for different things. If we care about cache level sizes and the constant factors, use benchmarks with increasing orders of magnitude.
Or, if you prefer to remain abstract, model N as the bits of memory so that the access cost goes up with increasing N. To represent specific cache-level sizes and latencies, use benchmarks everyone will understand.
I mean, big O comes from that particular branch of math. The whole point of it -- even in CS -- is that it's asymptotic analysis. (It's also been long criticized in CS for this deficiency. Usually it's the hidden constant factor, but the minimum problem size before it's relevant is an issue too!)
It's fine colloquially misusing big O to mean something like "for usual inputs, the running time is roughly this function", but you can't then in the same breath say all practical algorithms are technically O(1), because that's just not true in the technical sense (and it's useless or at least inaccurate in the colloquial one).
It doesn't really matter, but to add more to the discussion, the way I see it is that computer programs are implementations of algorithms on some restricted set of inputs. The algorithm has the asymptotic analysis, but the program itself does not.
> It's fine colloquially misusing big O to mean something like "for usual inputs, the running time is roughly this function", but you can't then in the same breath say all practical algorithms are technically O(1), because that's just not true in the technical sense (and it's useless or at least inaccurate in the colloquial one).
My argument is "there's no reason to stick to the exact technical definition, because the exact technical definition breaks when you apply it to real computers". I don't think that's contradictory in any way.
But I don't think this is a misuse either. Talking about "usual inputs" and "roughly" would be a misuse, sure. But if I can prove it uses at most an^2 + bn + c steps, and I call it O(n^2), then what's the problem?
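Spelling that last claim out (a one-line proof, assuming a, b, c are non-negative):

    a n^2 + b n + c \;\le\; (a+b+c)\,n^2 \quad \text{for all } n \ge 1,
    \text{so taking } c' = a+b+c \text{ and } n_0 = 1 \text{ satisfies the definition of } O(n^2) \text{ exactly, not just colloquially.}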
This is typically why we don't pick N to mean all of the resources in the universe. I guess you're the kind of guy that chooses to use non-8-bit bytes in an interview?
Instead, typically it's the number of accesses into some array. Other times people get cute and it's `log2(input length)`, so somebody's algorithm could be O(N) while, if N were just `input length`, it'd be N^2.
This isn't about picking N. This is saying "No matter what N you pick, your runtime will be smaller than roughly 2^[number of atoms in the reachable universe]" (Because if you hit the same state twice, you're stuck in a loop that will never halt.)
So yes, your algorithm is bound by O(N^2), and that's the number that matters on practical computers... but it's also bound by O(1) with a stupidly large constant. So it's O(1).
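Making the loop argument explicit (the constant is absurd, which is the point):

    \text{A machine with } S \text{ bits of state has at most } 2^S \text{ distinct configurations.}
    \text{A halting run never repeats a configuration, so it takes at most } 2^S \text{ steps,}
    \text{a (gigantic) constant independent of } N.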
> I guess you're the kind of guy that chooses to use non-8-bit bytes in an interview?
Dude, I'm here specifically to argue against insisting on the exact literal meaning, and to choose the practical option.
...which becomes meaningless and not useful. There is a whole area of mechanical sympathy and cache aware or cache oblivious algorithms that are well worth discussing and learning, but we don't need to use big-O notation to do so.
I have a video link to an MIT lecture that disproves your assertion. Cache aware, cache adaptive, and cache oblivious algorithms are definitively analyzed using big O. In fact it is big O notions that provide assurances about these approaches.
Mechanical sympathy has little (if anything) to do with big-O, btw. LMAX, the poster child of mechanical sympathy, is really about minimizing the use of memory coherence machinery and not taking the substantial latency hits.
It's useful notation even when we're not dealing with mathematical infinities. Unless you have a better alternative for the same purpose, "don't use it" is a pretty bad answer. It's being used for pretty much the same meaning. Not meaningless at all.
Ending up at sqrt(n) from a visual linear regression on a log plot doesn't mean the same thing to me. It would be better to frame the discussion as an illustration of the limitations of big-O analysis in real-world cases than to say that the big-O of a linked list is really O(sqrt(N)).
Second, there are three images of the same graph, each showing how access times aren't linear when they cross cache boundaries. There aren't any transforms and there aren't any "aggregate transforms f -> f’".
Third, algorithmic complexity is about abstract amounts of work relative to the number of elements. It isn't about "a series of sub-systems" or "galactic RAM".
This article shouldn't use big O notation because it isn't really about algorithms, it is just about non constant access times.
We strongly & emphatically disagree. Analysis assumes a homogeneous computational regime for the “infinite” store; otherwise it is basically meaningless. If this assumption is known not to hold (for example in considering the role of caching in algorithms), this is specifically incorporated in the analysis. See “cache models” and cache-oblivious algorithms for examples.
Cache-Oblivious Structures I, MIT, by Erik Demaine
None of this is true for basic O notation, it's just what exponent the algorithm carries on the number of elements.
Big-O notation doesn't even have to do with this article, it's just showing that memory access isn't strictly constant. This all seems like you're trying to argue something that isn't a part of this thread and I'm not sure what that is.
There is also a simple, weaker argument about why memory addressing is not O(1).
To address memory of size N, you need log(N)-sized addresses. Operations on log(N)-sized numbers are not constant time, so addressing memory of arbitrary size can't be constant time.
We usually take operations like addition as if they were constant time because we assume a certain number of bits, and indeed, 32 or 64 bit addition is constant time, but not the addition of arbitrary sized numbers (aka. bignum).
Retrieving data from memory, even magic memory that doesn't care about the laws of physics is actually a kind of binary search. You take the most significant bit of the address, which tells you in which half of the memory bank it is, then the next bit gives you the next half, then the next, etc... until you get to the cell containing your data. That's O(log(N)), not O(1).
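A toy sketch of that bit-by-bit view in C (this models an idealized decoder walking the address bits, not any real hardware; the function and the numbers are made up for illustration):

    #include <stdio.h>

    /* Toy model: "decode" an address by walking toward it one halving at a
       time, most significant bit first, exactly like a binary search. The
       number of steps equals the number of address bits, i.e. log2(N). */
    static unsigned decode(unsigned long addr, unsigned long n_cells) {
        unsigned long lo = 0, hi = n_cells;   /* candidate cells: [lo, hi) */
        unsigned steps = 0;
        while (hi - lo > 1) {
            unsigned long mid = lo + (hi - lo) / 2;
            if (addr >= mid) lo = mid;        /* address bit says: upper half */
            else             hi = mid;        /* address bit says: lower half */
            steps++;
        }
        return steps;
    }

    int main(void) {
        printf("1 Ki cells: %u steps\n", decode(123,    1UL << 10));
        printf("1 Gi cells: %u steps\n", decode(123456, 1UL << 30));
        return 0;
    }

It prints 10 and 30 steps respectively: the work grows with the number of address bits, not with a constant.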
My argument is weaker than the one in the article, because it doesn't take the laws of physics into consideration, so it only puts memory operations at O(log(N)) and not O(sqrt(N)), but the idea is similar in that memory operations are not constant time.
It is also the reason why I think that log(N) in big-O notation is the limit of practicality. That is, beyond that, big-O becomes meaningless and one needs to take the actual machine into consideration. Maybe, after reading the article, O(sqrt(N)) should be the limit instead.
I’m struggling to understand this argument. The size of memory that you can address on any particular hardware is fixed, just like the word size for additions. Restricting yourself to a smaller part of that address space in software won’t change anything about how the address is decoded.
Big O notation deals with asymptotic behavior of a function - in other words, as N goes to infinity. If we're accepting the premise that N has an upper bound, the rest of the discussion is meaningless.
Presumably the argument is an abstract one, not applying to any particular machine that actually exists.
Big O notation is often used with assumptions like adding numbers is always constant time. You can't physically build a machine where addition between any two numbers is constant time, but that doesn't matter.
The argument makes sense once you think of memory as a hierarchy of different sizes. The log of the size of the set of registers is smaller than the logs of the sizes of Li caches, which in turn are smaller than the log of the size of main memory.
The whole reason we do big-O complexity analysis is to compare the performance of algorithms. If asymptotic complexity analysis had nothing to teach about the actual performance of actual algorithms in hardware, we wouldn't be doing it at all.
The point is now to refine the computational model to make it even better at predicting performance on actual hardware. It is well known for example that algorithms which require arbitrary jumps in memory are worse than those that require linear scans - so this is an attempt to model this aspect of performance as well.
"It is well known for example that algorithms
which require arbitrary jumps in memory are
worse than those that require linear scans"
Even this is a suspect assumption once you factor in things like address space randomization, whatever weirdo microcode stuff systems are doing to mitigate timing attacks, multiple cores, etc.
Or rather: arbitrary jumps are bad, but you can't make any assumptions about what is a "linear" scan, so good luck attempting to model that with some kind of generalized notation.
Even with all of that considered, admittedly, theoretically-linear memory accesses will still tend to be actually linear, I guess, if they fall inside the same memory page or whatever. In which case this shiny new notation may actually be moderately predictive. Which isn't that much better or different than where we're already at with old-school Big O.
ASLR has absolutely nothing to do with this. All it does is ensure that the address of, say, the value of (int) ((void*)malloc) is not always 0x12345678 or something predictable, and the same is true for the start of the stack and heap etc. What it obviously doesn't do is randomly scatter every allocation at different addresses. If the address of arr[0] is 0x1000000, then the address of arr[1000000] is very much guaranteed to be 0x1000000 + 1000000 * sizeof(arr[0]).
Now, the actual layout of your address space in RAM is not necessarily contiguous of course. That has nothing to do with ASLR, it is all related to virtual memory, a much older concept. However, this is also irrelevant for performance, at least until you get to NUMA architectures which anyway need their own specific models.
The reason virtual memory to physical memory mapping doesn't really matter for performance is that RAM is, actually, random access at this level. Reading two contiguous cache lines worth of data from physical RAM into the processor cache is just as fast as reading two non-contiguous ones.
The reason random memory accesses are much slower than linear ones is not the physical layout in RAM. It is because when doing linear reads, you'll get most of the data from the processor cache with a single slow read from RAM, whereas with random access you'll need to go all the way to RAM (or gods forbid, to disk) for each and every read. The difference is exacerbated even more by predictive pre-fetchers, which can parallelize reading the next cache line from RAM while your code is running on the current cache line - but those only work if your code will indeed use the predicted next cache line, otherwise you'll stall.
So, random access code is slow because it is actually running at the RAM's speed. Linear access code is fast because it is actually running at the L1 cache's speed. Physical layout of your pages in memory is almost entirely irrelevant (it's only a problem if your reads are not page-size aligned, which is not normally possible).
And finally, the cache-oblivious computing model for complexity analysis is pretty well established, and it is known to be significantly better at predicting real-world performance differences than the more traditional system. It is more complex to work with, though, so some people don't bother. And it is not of that much theoretical interest in the field of computational complexity analysis, since it doesn't fundamentally change the complexity class of any algorithms (e.g. both this and the simpler O(1) memory access model will agree on whether an algorithm is polynomial or exponential).
I think, though, that your argument depends very directly on hardware implementation concerns, which is exactly the kind of thing that big O notation is meant to abstract away.
It doesn't. It merely depends on processing speed being finite. If you need to address the 2^(2^64)th item (remember, big O notation is about behavior as n goes to infinity), your item index needs 2^64 bits. You can't determine what item to get without looking at all the bits in the index. If your processing speed is finite, that requires time proportional to the number of bits in the index.
I think you’re confusing some variables here. If you say mergesort is O(N log N), N is the number of elements in the collection. The size of the address space is orthogonal to the limit being expressed. If a memory look up is O(log(M)) where M is the size of the address space, then it is constant time or O(1) with respect to N.
It's easy to show that an N-element array requires at least N bits of memory. If we call the memory size M, then an obvious condition is M >= N. And if M >= N, then O(log(M)) as N tends to infinity is emphatically NOT O(1).
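Written out, still within this model where the per-access cost is the address width:

    M \ge N \;\Rightarrow\; \log M \ge \log N, \text{ so a cost of } \Theta(\log M) \text{ per access is } \Omega(\log N),
    \text{which grows without bound as } N \to \infty \text{ and is therefore not } O(1).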
Fair enough that perhaps you can allow M to be a function of N. It feels like this sort of model might run into other issues, though. If you treat a memory look up as an operation on an infinitely long bit string that therefore takes non-constant time, then why wouldn’t you select integer addition to act on unbounded numbers and have similar asymptotic behavior?
Obviously in practice M >> N, M is not a function of N, and/or M is constant, but we’re essentially discussing how to define the axioms underlying the complexity analysis.
The model is ultimately a choice, and you chose one that models important aspects of your domain.
People very, very often work with data that doesn't fit into the fastest available memory, so modeling the fact that memory accesses take O(foo(N)) time makes a lot of sense if you want to compare the performance of algorithms with different amounts of memory accesses. A model that adds this complexity will predict performance differences better than one that doesn't.
For example, traditional complexity analysis predicts that linear search in a linked list should be about as fast as in an array up to some constant factor. If you instead model memory access as taking longer the larger the data structure you seek to access, you'll predict that linked list search is in fact worse and worse compared to array search as the size of each increases. This second prediction will more closely match the actual performance on real hardware.
In contrast, people very rarely work on problems where fixed, small size integers (i32, i64) are not good enough. Those that do also typically model addition as an O(foo(N)) operation.
There is of course some interesting overlap where you'd perhaps want to model the actual address calculation using large integers, but since in practice that will almost always be much faster than the actual memory access itself, it's rarely useful to account for it instead of the memory access cost.
I think we’re in agreement mostly. I was trying to make the point that the model needs to contain some practical assumptions at the level we’re discussing, or it breaks down. Both binary integer addition and binary address look ups are O(L) where L is the number of bits in the number. In practice for both, we assume L is a constant so both operations are O(1). Hardware completes the operations in essentially a single atomic step, too.
Even with data that is so big you can’t fit it in local storage, you’re almost surely not going to exceed the size of a 64 bit address space. If you can’t fit in cache/RAM/HDD/cloud storage the bottleneck won’t be a binary address look up. The upper bound will be a function of the latency between the CPU and where the memory is stored (like TFA is talking about).
Doing arithmetic on numbers larger than 64 bits (e.g. cryptography) is far more common than exceeding the address space of the OS / hardware / theoretical address space maximum (hundreds of terabytes up to exabytes).
The author took the time to write four long blog posts on the subject but apparently didn't bother to look up cache-oblivious algorithms [1], i.e. the field that deals with designing algorithms that work well in the presence of memory hierarchies.
These seem minimally useful in the sense that once one needs to whip out the SIMD intrinsics and software prefetches one probably actually cares about the 30% or 100% speedup one can get by writing cache-aware algorithms instead.
The opportunity from designing cache-aware algorithms is much larger than a 2X speedup. As a rule of thumb, main memory : L1 access times are ~ 100 : 1.
The canonical example of a cache-aware algorithm is scanning a 2D array as it is laid out on RAM vs in cache-hostile strides. The perf diff varies per architecture, but on my Broadwell server, the former is ~6.7X faster than the latter for a 10000 x 10000 int array.
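For anyone who wants to reproduce that kind of gap, a minimal C sketch (the sizes mirror the 10000 x 10000 example above; the ratio you measure will depend on your machine and compiler flags):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000   /* 10000 x 10000 ints, roughly 400 MB */

    int main(void) {
        int *a = malloc((size_t)N * N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1;

        long long sum = 0;
        clock_t t0 = clock();
        for (size_t i = 0; i < N; i++)          /* row-major: matches memory layout */
            for (size_t j = 0; j < N; j++)
                sum += a[i * N + j];
        clock_t t1 = clock();
        for (size_t j = 0; j < N; j++)          /* column-major: ~40 KB stride per access */
            for (size_t i = 0; i < N; i++)
                sum += a[i * N + j];
        clock_t t2 = clock();

        printf("row-major %.2fs, column-major %.2fs (sum=%lld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        free(a);
        return 0;
    }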
Of course there is a large perf opportunity in vectorizing code, but that was not in scope for this blog post.
The so called cache-oblivious algorithms are exactly designed to account for the speed difference between cache levels. The term cache-oblivious refers to the fact that these algorithms don't take into account the exact sizes of the various cache levels.
This also applies when you scale up to the multi petabyte RAM range via distributed computing/cloud. I wonder if we could actually continue the (very rough) linear fit on a log-log plot.
FYI: This article sucks (in relative terms) if you read Part I in isolation. Click on Part II to get to the really good parts. From there on, it gets weaker. If you're busy, you can skip / skim parts III and IV.
My hope is this post saves someone some time.
Part 1: Empirical. Good.
Part 2: Theoretical limit. Really motivates Part 1 and gives an interesting perspective.
Part 3: Implications. Weaker overall.
Part 4: Clarifications / definitions / etc. Only helpful if you have questions or didn't understand something.
Missing: Showing best-case is O(log N) and not O(1) simply due to addressing. If I want to address a petabyte, I need 50 bits, and a kilobyte, only 10. That's also a good theoretical limit as to why it's not O(1), although not nearly as nice as the author's.
O(log(n)) confuses me. I could have sworn that the "Center-of-spherical library" thought experiment proves that it's O(cube-root(n)) in 3-dimensional space, and O(square-root(n)) for 2-dimensional space.
That is: all memory has a physical shape, and the worst-case speed is related to the distance away from the center (aka: CPU Core) and whatever "book" or memory-cell you're looking for.
For a 2d space, the optimal organization of books (or memory cells) is a perfect circle: each book is exactly R (the radius) away from the center, meaning all books can be fetched in equal time. This leads to O(sqrt(n)) performance, as the more memory you have (assuming memory is the same "shape"), the larger the circle _must_ become.
For 3d space, the optimal organization of books is a perfect sphere. Same-same really, except we can fly now.
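Putting numbers on the thought experiment, assuming a bounded density of one book per unit of area or volume (my normalization; the constant doesn't matter):

    \text{2D: } \pi R^2 \ge N \Rightarrow R \ge \sqrt{N/\pi}, \text{ so worst-case fetch time is } \Omega(\sqrt{N}).
    \text{3D: } \tfrac{4}{3}\pi R^3 \ge N \Rightarrow R \ge \left(\tfrac{3N}{4\pi}\right)^{1/3}, \text{ so } \Omega(N^{1/3}).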
Real-world chips are largely 2d technologies, but given the sheer number of layers available on modern chips (hundreds of layers for Flash and other tech), maybe spherical / 3d perspectives are a better theory to work off of. Transitioning between layers is always fraught with problems (vias in physical chips / boards have inductances and other parasitics after all), so we're somewhere between 2-dimensions and 3-dimensions in practice (with more and more tricks to get us towards the 3d side).
------------
In the real world: chips are limited by these parasitic elements (inductances and capacitances), which slow down the electrical signal. But these are primarily related to distance (more inductance the longer a trace is, and more capacitance the more ground-is-parallel to that trace). So this theory ends up working out as a physical model of the electrons moving at the chip level to fetch data and communicate.
Let me clarify: This wasn't intended to contradict the root-n limit. Both limits exist.
* The root-n limit is based on physics of a 3d universe.
* A log n limit is based on computation, and would apply even in an infinite-dimensional universe, where everything is one unit away.
The root-n is a stronger limiting factor, obviously, but both are helpful to know since they talk about different limits from different premises.
Both disprove O(1). Both are (1) quite theoretical (2) not the limiting factor in most scenarios in practical terms (3) still apply in a limited set of cases.
The log n argument is helpful, for example, for network addressing for the internet and network topologies (and issues like 32 bit versus 128 bit addresses, NAT, etc.), where distances are well above the limits of physics, and the root-n argument doesn't apply. It's also helpful to know since it also applies to storage, and not just speed. If I have twice as much storage, my pointers all get longer by a bit.
(I'm intentionally omitting square root versus cube root, since that's tangential to the above, and a rabbit hole I don't want to go down right now; perhaps another post)
As it turns out, the information density of a region of space is bounded by its surface area, not its volume. If you try exceeding that bound, you collapse into a black hole. This observation led to the idea of the Holographic principle in physics. Having said that, this is one of those areas that depends on a full understanding of quantum gravity, which we do not yet have, so I wouldn't call it settled science.
In practical terms, if your argument ends with "otherwise, your computer would collapse into a black hole", you are considering technology far beyond anything we are currently capable of, so n^3 information density is still on the table for everyone except theoretical physicists.
The article makes an argument about black holes to get to r^2 instead of r^3. It seems a bit surprising, but then I don’t know that type of physics.
Thinking of the more classical geometrical point of view, I agree that between 2D and 3D seems right. If nothing else, I’d imagine there aren’t many wires going diagonally through the layers, haha. Maybe a cylinder would be the right model.
There, memory access speed depends not only on N, as described in this article, but also on the order in which you access the little bits of N - similar to CPU cache locality, but different in important ways.
> The blue line corresponds to a O(√N) cost of each memory access. Seems like a pretty good fit, no?
Kind-of obvious, and I’m sure I’ve bumped into this before, but it suddenly surprised me again looking at this: a sqrt(N) curve on a log-log graph is a straight line, so at a glance it looks identical to linear N apart from its slope. Without a reference point for N, and with two axes of differing units and spacing, it’s pretty hard to see whether a line is sqrt(N), or how far away from N it is.
A better way to ensure non-contiguous memory is allocating 2, maybe 3x the memory you need by repeatedly calling mmap(), then picking random points in that memory, and using those. Otherwise chances are your memory is still contiguous
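A rough sketch of what that looks like (Linux-flavored; MAP_ANONYMOUS and the chunk count/size are arbitrary choices of mine, and whether the kernel actually scatters the mappings is not guaranteed, but the access pattern stops relying on one contiguous allocation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    /* Grab many separate anonymous mappings instead of one big malloc, then
       touch random spots in random chunks. */
    #define CHUNKS      64
    #define CHUNK_SIZE  (1 << 20)   /* 1 MiB per mapping */

    int main(void) {
        char *chunks[CHUNKS];
        for (int i = 0; i < CHUNKS; i++) {
            chunks[i] = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (chunks[i] == MAP_FAILED) { perror("mmap"); return 1; }
        }
        /* Random chunk, random offset: no single large contiguous allocation
           is being walked. rand() is good enough for a sketch. */
        for (int i = 0; i < 1000000; i++) {
            int c = rand() % CHUNKS;
            size_t off = (size_t)rand() % CHUNK_SIZE;
            chunks[c][off]++;
        }
        printf("done\n");
        return 0;
    }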
How did the author ensure the memory allocations were not contiguous? I suppose in C/C++ it's relatively straightforward since you could allocate a bunch, then free 99% of it, repeatedly.
Of those, only the first one (memory) is actually used during measurement. The second one (nodes) is just there to help shuffle the linked list, and is freed before the actual test:
    free(nodes); // Free up unused memory before measuring:
    // Do the actual measurements:
    int start = clock();
    ...
Since the linked list is shuffled, the nodes will be linked in a random order. However, since the total memory is contiguous, that is the only factor (instead of having every node be a separate allocation, which could allow the system to put every node on a separate page).
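For concreteness, roughly what that setup looks like, as my reconstruction of the idea rather than the article's exact code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct node { struct node *next; size_t payload; };  /* payload unused here */

    int main(void) {
        size_t n = 1u << 24;                        /* one contiguous block of nodes */
        struct node *memory = malloc(n * sizeof *memory);
        size_t *order = malloc(n * sizeof *order);  /* helper, freed before timing */
        if (!memory || !order) return 1;

        for (size_t i = 0; i < n; i++) order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle; rand() is
                                                       simplistic but fine for a sketch */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i + 1 < n; i++)          /* link nodes in shuffled order */
            memory[order[i]].next = &memory[order[i + 1]];
        memory[order[n - 1]].next = NULL;

        struct node *head = &memory[order[0]];
        free(order);                                /* free helper before measuring */

        clock_t start = clock();
        size_t count = 0;
        for (struct node *p = head; p; p = p->next) count++;
        printf("traversed %zu nodes in %.3fs\n", count,
               (double)(clock() - start) / CLOCKS_PER_SEC);
        free(memory);
        return 0;
    }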
The same trick should work in Rust as in C or C++. In Java you could create a linked list class, pin the nodes in memory so that the GC doesn't compact them, then apply the same shuffling. In Go you can go back to the C trick, since Go's garbage collector doesn't do any kind of compaction.