The Myth of RAM (2014)

Tojot · on Aug 29, 2016

It so happens that a large part of my PhD was on this very subject. The result I've got N log(N), this is more visible when you get to larger RAM (I had 0,5 TB RAM at the time). We have an empirical result, a justification and a rigorous predictive model.

The reason has to do with hashing, but a different type: TLB.

I posted more details as https://news.ycombinator.com/item?id=12385458

lorenzhs · on Aug 30, 2016

Thanks, I came across your research before and thought it was quite cool. In Section 8.5, when discussing whether hash tables would be suitable for handling TLB misses, wouldn't denial of service attacks also be a concern? If an attacker knows the hash function and can control the addresses being looked up, they might be able to trigger worst-case behaviour on every lookup, couldn't they?

Veedrac · on Aug 30, 2016

That's addressed in that very section:

> Moreover, an adversary can try to increase a number of necessary rehashes.

It seems to me that the section is a bit too dismissive, though, as there are hash tables and hash functions that mitigate these concerns. In particular, collisions can be replaced with trees, like in Java, limiting the worst case to O(log n) again.

lorenzhs · on Aug 30, 2016

Rehashes and bad probing behaviour aren't quite the same thing, but I'll let it count and admit I may have replied too quickly ;)

aaronbwebber · on Aug 29, 2016

The problem with this analysis is that in the graph in the very first part he shows that memory access IS O(1) for pretty substantial scaling factors, and then when you hit some limit(e.g. size of cache, size of RAM) access times increase very rapidly. Sure, if you draw a line across 6 orders of magnitude, it ends up looking like O(n^1/2), but how often do you scale something through 6 orders of magnitude?

The "memory access is O(1)" approximation is pretty good, certainly good enough for almost all every day use. The median size of a hash table I allocate definitely fits in L1 cache, so why shouldn't I think of it as O(1)? If you are reading off of disk, the O(1) approximation holds as long as your dataset stays between 1 MB and 1 GB. That's quite a bit of room to play around in.

Yes, you need to be aware of access times and the changes in them if you are really scaling something way up. But I'm not convinced that I shouldn't just keep thinking of "hash access is O(1)" as a convenient, generally accurate shortcut.

xenadu02 · on Aug 29, 2016

It is trivial to exceed the L1 cache size and not that uncommon to exceed L2. That brings us to a 100x delta which is worth thinking about, no? Even exceeding L3 isn't horribly rare for average desktop CPUs (let alone mobile devices).

There are other dimensions to this too like prefetching, streaming, pipelining, vectorization, etc. In some cases using an array and doing a linear search is faster than any hashmap on a modern CPU.

I think the takeaway from this article is not to blindly trust the theoretical big-O numbers for data structures and actually test them with realistic datasets on your target hardware.

aaronbwebber · on Aug 29, 2016

This is a very good point. You should run actual performance tests. The results may surprise you!

In my actual work, optimizing in memory searches is not worthwhile because saving half a dozen microseconds on a search is not real valuable when you are about to spend half a dozen milliseconds making a database call (which is technically O(log N), since there's a B-tree index back there). But I do spend quite a bit of time figuring out which of those queries should be moved to an in-memory O(1) cache (redis or memcache, depending).

And I actually never think about the perf tradeoff between the DB and the cache being O(log N) vs O(1) - I think of it in terms of the median and 99th %tile times that I know from monitoring our actual production servers. So as xenadu points out, I guess there is some truth in the article, but it's just sort of lost in this somewhat academic discussion of scaling things to infinity and beyond.

pjc50 · on Aug 30, 2016

In my actual work, optimizing in memory searches is not worthwhile

Funnily enough, I had to do this last week. We started from the question "why does it take 160ms to scan 60,000 objects in memory?" and reached the disappointing conclusion that, on this particular platform (iMX6/800MHz/DDR3) a cache miss costs you an astonishing 233ns. I got as far as double-checking all the RAM initialisation parameters before giving up.

There is now an index on the in-memory search.

AstralStorm · on Aug 30, 2016

Random percentile abuse. 99% is less useful than a median. The real data is a true histogram as well as maximum time.

99% might mean every 100th customer does not get a service.

iheartmemcache · on Aug 30, 2016

Totally with you on 'benchmark everything, who knows when the branch predictor will mispredict or a full cache flush will happen. (Interestingly enough, I rarely try to beat the compiler since ICC is super smart but strategically placed, manually inserted CLFLUSHes have sped up some my work. Go figure.)

Eh L1 is bigger than you think. What cache-trashes is context switching. A common technique is shielding processes[1], delegating all interrupts to certain processors. Here's an easy example - since HT has it's own set of caches for both "CPUs" inside the physical CPU [i.e., the ALU (and all the aux AVR registers too) is shared between both], if you're doing anything computationally intensive with an unpredictable set of interrupts, but your crunch pattern is predictable enough do the following-- set the affinity for the 'cruncher' to one CPU with a full shield, and then all other components like interrupts, IO, things that you don't mind re: cache misses, etc.

[1] https://rt.wiki.kernel.org/index.php/Cpuset_Management_Utili... This is just one way to do it, but it explains the concept.

pedrocr · on Aug 29, 2016

>I think the takeaway from this article is not to blindly trust the theoretical big-O numbers for data structures and actually test them with realistic datasets on your target hardware.

You have to do this anyway to make sure that the constant factors aren't big enough to change the optimal approach at the size of your dataset. Big-O only really buys you an initial rule to not do things that are obviously wrong.

wongarsu · on Aug 29, 2016

Right, an algorithm that always produces an answer right after the heat death of the universe is O(1).

Big-O is great for telling you which algorithms/datastructures won't scale, it can't tell you which are actually fast enough.

wtetzner · on Aug 29, 2016

And in fact, some data structures that scale poorly will perform better on small sets of data.

For example, a simple list of pairs with linear lookup time can be faster than a hash table for small amounts of data, because it doesn't have to compute the hash.

p4wnc6 · on Aug 29, 2016

I would say it differently: you absolutely should trust big O numbers for data structures and use that as a major guiding principle in terms of how to structure a program ... and you should require extraordinary evidence, like a surprising result in a performance test, before navigating away from trusting the big O analysis.

What this all says to me is that for an extreme claim (e.g. using an array and doing a linear search will be faster than a hashmap lookup) it should require equally extreme evidence to substantiate it (e.g. the results of performance tests that stand up to heavy scrutiny).

I feel there is a trap in which someone might look at this sort of thing and say, a-ha I don't actually need to care about big O reasoning or standard understanding of data structures at all! That would be a tragic misreading of this sort of example.

nomel · on Aug 29, 2016

> before navigating away from trusting the big O analysis.

I don't think that's the spirit of the article. The spirit is to not assume O(1) is O(1) if you work with a physical computer and care about anything beyond how many times your program counter increments. Make sure to use big O notation that includes the complexities of carrying out those single instructions.

If you work with more than 32k of data, you either need to modify your big O to take latency into account, or you're assuming some worst case cache level, which is fine, and what you're doing with an idealistic big O...but that's a terrible assumption if you're actually trying to achieve speed or realistic metrics.

p4wnc6 · on Aug 29, 2016

I agree. That's why the article is wrong-headed in its presentation.

Generally when you begin working on problems where this sort of thing is relevant, you have to make choices to get something going. You want to avoid premature optimization and you need something on the ground. In short, you have to assume O(1) is O(1) ... and it basically always is, except when there's evidence that it's not.

The value of the article is to point out cases when these abstractions break down, and the value of performance testing. But carrying it to an extreme such as, "Never make any assumptions about how any data structures work until you've performance tested every single thing in your application" is of course wildly unproductive.

wtetzner · on Aug 29, 2016

> using an array and doing a linear search will be faster than a hashmap lookup

For small enough sets of data, this will likely be true, simply because you don't have to compute a hash.

Of course, it likely won't scale very well.

krinchan · on Aug 29, 2016

I always thought the Big-O notation was purely to compare one algorithm to another, not actually measure literal, real world performance.

Trying to describe a hardware operation in Big-O notation then assert everyone should something other than O(1) for a memory access operation seems like a gross misunderstanding of its purpose.

bzbarsky · on Aug 29, 2016

Big-O notation is there purely to compare one algorithm to another, yes. But in terms of what?

Say you have two search algorithms. For a dataset of size N, Algorithm 1 does N memory reads and N data compares. Algorithm 2 does N * (log N) memory reads and log N data compares. These are not real algorithms, just illustrations.

It's not uncommon in the analysis of such algorithms (both searching and sorting) to only consider the number of data compares and assume memory reads are completely free. So that would give you O(N) for algorithm 1 and O(log N) for algorithm 2.

Analyses with more sophistication will note that a memory read is not in fact free, and will look at the number of "operations", whether those be memory reads or compares. In terms of the number of "operations", algorithm 1 is O(N) and algorithm 2 is O(N log N).

These are both true statements about these algorithms: Algorithm 2 is O(log N) if you look at number of compares and O(N log N) if you look at number of "operations". Which of these characterizations is more relevant to you? Depends on whether memory reads are actually free in your setup.

Anyway, the original article is arguing that for the comparison most people care about, wall clock time, even the more sophisticated version is wrong, because it assumes that the time needed per operation does not depend on N, so you can just count all the operations and not worry about their relative speed, because that's just a constant factor. If the time taken for a memory read _does_ depend on N, then you can't do that. For example, if you want to look at the algorithmic complexity of these algorithms in terms of clock cycles, and a compare is one clock cycle (might not be, depending on what your data is!) and a memory read if sqrt(N) clock cycles (due to caching effects), then the complexity of algorithm 2 will be O(N * sqrt N * log N), whereas algorithm 1 is O(N * sqrt N).

Of course once you're caring about actual time very often constant factors start to matter too, and the whole idea of doing asymptotic analysis might break down. But it might not. It really depends on your exact problem and on the assumptions underlying the asymptotic analysis...

dlubarov · on Aug 29, 2016

> It's not uncommon in the analysis of such algorithms (both searching and sorting) to only consider the number of data compares and assume memory reads are completely free.

Can you link to an example? That seems plainly wrong and I haven't seen an analysis which makes that assumption.

bzbarsky · on Aug 29, 2016

https://en.wikipedia.org/wiki/Binary_search_algorithm#Perfor... is a simple example. It goes to great pains to talk about numbers of comparisons per iteration, but never once considers the number of memory accesses per iteration (which happens to be 1 or 2 depending on how you count things).

In fact, most of the performance analysis of algorithms on Wikipedia that I've seen look like this. Similar for introductory algorithms textbooks, if I recall correctly (which I may not; it's been a while since I looked at one of those).

Now in practice the number of memory accesses in an algorithm like this is typically O(1) in the number of comparisons so an implicit assumption that all your memory accesses are O(1) means just adding a constant factor to your overall complexity, so it's OK to just not worry about the whole thing. And counting "memory accesses" is complicated because whether something is a "memory access" or not is a bit tricky. For example, in the binary search case, do you need to "memory access" the value you're searching for before every compare, or can you assume it's stored in O(1) storage like a register? But all of this falls down if your memory accesses are not in fact O(1).

AstralStorm · on Aug 30, 2016

There is a complimentary notation for additional memory use.

Any engineer worth their salt actually specifies basic operations separately. This includes comparisons, fused or normal multiplies and additions, divisions of applicable, memory loads and writes.

Even then you can get fooled due to locality.

TillE · on Aug 29, 2016

Specifically, Big-O describes the relative performance of an algorithm as the input grows. That's why constants are discarded, and that's why O(n^2) is probably fine if you only ever have very small data sets.

mypalmike · on Aug 30, 2016

> purely to compare one algorithm to another, not actually measure literal, real world performance.

Estimating real world performance is the whole point of Big-O notation. Trying to solve the traveling salesman problem on all the stars in the Milky Way is something your PC isn't going to do in your lifetime, for instance.

GFK_of_xmaspast · on Aug 30, 2016

If you ignore the nonEuclidean effects of general relativity, there are very good polynomial time approximation algorithms.

fapjacks · on Aug 29, 2016

This right here is the most relevant response to the article.

jcbeard · on Aug 29, 2016

I agree with aaronbwebber that for general use, and for teaching, that it's far easier to assume O(1) memory access. It's a decent approximation, and works until you start having applications where loads/stores dominate. It also falls apart when things like multi-threaded apps start spinning on locks, but lets not go there in a HN post, could make an entire blog out of mem issues.

Here's why we should continue to use O(1) access in general. Unless you enjoy rat-holes (I wouldn't have a job without them...so, why not). If we really wanted to, for every algorithm we could consider things like: 1) pre-fetch algorithm (assuming you know which one will be chosen) 2) associativity 3) cache size at each level (including buffering) 4) queueing depth at all levels 5) DDR controller capacity & while we're at it, do we have latency of mem controller on-chip..wait, which one? 6) bank conflicts in DDR 7) NUMAness (how about NUMA cache behavior?) 8) even worse...should we consider cost of differing mem tech. 9) TLB behavior...there's a lot there, can't list

Here's why we don't: YOU CAN'T. Do you realize the incredible amount of work required to get the info for the analysis? The list of things that influence memory behavior is huge. Even if you can get all of it, you work it out for one algorithm...you've now wasted a year, and figured out a cost for one architecture/OS/run-time combo. Congratulations. I'm all for including the cost of memory...just realize what you're asking before you write a post asking about it. There's no end to how detailed you want to go. And in the end you end up destroying the asymptotic bound you're trying to create b/c you'll realize that it's now a probability distribution that really doesn't lend itself to generalizable asymptotic behavior, which is what you want when considering algorithms.

When choosing an algorithm for a specific hardware, then you can choose an algorithm for that arch.....but most people don't do this, it's not necessary unless you're in HPC, Datacenter scale analytics, or in the embedded space/cyper-physical space.

xenadu02 · on Aug 29, 2016

Permit me to disagree; I think it is useful to internalize the general rule that for every 100x increase in data size the theoretical performance will drop by 10x. This isn't a perfect rule but it is just as valid a shortcut as saying hash lookups are O(1).

For all intents and purposes all modern CPU hardware works the same way (a cache hierarchy, superscalar, pipelined, with relatively slow DRAM hanging off some bus).

In practice measure different approaches to performance-critical code with realistic datasets and don't make assumptions. Don't assume that you know dictionaries are O(1) and therefore searching an array must be slower.

jcbeard · on Aug 29, 2016

The problem is, that's not really true. There isn't a hard/fast rule that'll get you what you want given the way modern hardware is designed. Run something on a core designed for mobile workloads and that changes vs. one designed for server workloads vs. one designed for HPC. They're definitely not all the same, and the differences will kick you in the rear when trying to simplify.

I do agree teaching people that there are differences, and give them the tools to learn when algorithmic changes are necessary to optimize for the hardware.

neolefty · on Aug 29, 2016

The article rationalizes with physics, arguing that a physically large data store, when it is efficiently implemented, is limited by the inverse-square of distance inherent in a 3D world.

Certainly there are exceptions when you make a system more efficient (for example, filling empty memory slots or moving systems closer together or paying for a better interconnect).

claudius · on Aug 29, 2016

> Here's why we don't: YOU CAN'T.

But that’s one of the main points of the article: You don’t have to. Asymptotically, you won’t be better than O(sqrt(N)) because physics does not allow it. You can put all the different caches in you want, you can buy the most expensive RAM there is, you can buy the best interconnect systems for the fanciest NUMA cluster you can find: You won’t beat O(sqrt(N)) if you wish to go to arbitrary N.

Of course, if you bound your problem size, you may well get O(1) access times and for some applications, it may well be sensible to consider the coefficient associated to the square-root scaling small enough to neglect it, but if you truly want to talk about algorithmic complexity (and not just the scaling of one particular implementation for one particular range of valid N), you should take that additional square root into account.

AstralStorm · on Aug 30, 2016

The one exception is possibly quantum computation in some cases.

claudius · on Aug 30, 2016

I don’t know of a mechanism in quantum mechanics that would allow to pack information more densely (or even infinitely dense as would be required by O(1) random access times), what did you have in mind?

brassic · on Aug 29, 2016

There is a middle ground between tuning an algorithm to the exact memory hierarchy of the system it is running on and ignoring the problem altogether: https://en.wikipedia.org/wiki/Cache-oblivious

jcbeard · on Aug 29, 2016

this is true....and thankfully this technique is also being taught in some CS algorithms classes.

fanf2 · on Aug 31, 2016

Check out "cache-oblivious algorithms" for code that is designed to work well with memory hierarchies but which does not need to be tuned for the specifics of any particular architecture.

kazinator · on Aug 31, 2016

I like it: https://en.wikipedia.org/wiki/Cache-oblivious_algorithm

It gives a name to a good default way of writing code.

smallnamespace · on Aug 29, 2016

> but how often do you scale something through 6 orders of magnitude?

Well, that's the whole point of using big-O notation.

Otherwise, there's no point in distinguishing between O(1) and O(lg N) either--lg(10^6) is only ~20, after all.

klodolph · on Aug 29, 2016

> but how often do you scale something through 6 orders of magnitude?

I can't speak for others but the team I'm on has a mind-boggling amount of data. 6 orders of magnitude is not even close (no, I'm not doing something clever like starting with "byte" or "bit" as smallest unit). The truth is most people don't have "big data" but plenty of people do—petabyte scale is becoming pedestrian these days and a few larger organizations are into the exabyte range.

Latency grows as you go from registers to RAM to hard drives in distant data centers. Consider that data centers don't really stack on top of each other, they have to be built flat or you can't get rid of the waste heat fast enough. Same with microprocessors and RAM. Everything is flat, in practice, with a limit to how much it gets stacked.

Maybe you think that this is a special case, but with everyone building their applications on top of the same cloud providers, your provider's scalability starts to affect you personally. So even if you build a small application in the cloud, your ability to access data is impacted by the fact that your cloud provider is providing access to exabytes of data.

stdbrouw · on Aug 29, 2016

You're failing to account for the fact that it's a log-log graph, though, so even a modest slope can be pretty significant.

If you want a rule of thumb, might as well just stick to "hash tables are fast" instead of the false precision of O-notation.

jimminy · on Aug 29, 2016

Big O-notation isn't precise, it is an approximation by nature. It only cares about the largest factors for determining the estimate.

epistasis · on Aug 29, 2016

The log-log graph is the way to determine those large factors. Taking the log removes the multiplicative constants. The differences between n^.5 and n^1 and n^1.5 are quite significant for big-o

aaronbwebber · on Aug 29, 2016

The graph really is totally flat below the L1 cache size line, and even on a log-log graph the slope between 10 MB and 1 GB is not significant.

AstralStorm · on Aug 30, 2016

That being kB range for L1, single digit MB for L2. Relatively simple Java apps tend to eat multiple GB, while accessing most of it.

malisper · on Aug 29, 2016

The definition of big-O is what happens as n approaches infinity. You are free to model your computer as having O(1) memory access as it's a much simpler model and close enough in practice.

adrianratnapala · on Aug 29, 2016

As N climbs orders of magnitude, we start using ever more distant kinds of memory, so access time does not scale as O(1) in theory.

As for practice, it's not O(1) either or else cache misses would just be a theoretical nicety that didn't matter in the real world.

malisper · on Aug 29, 2016

If you make the assumption that there is an upper bound on the maximum time a read can take (e.g. reading from disk), reading from memory becomes an O(1) operation regardless of cache misses.

Symmetry · on Aug 29, 2016

If you make that assumption, sure. If instead you assume that you live in a universe where information can't propogate faster than the speed of light (true), storage takes up space (true), and memory is arranged on an plane (generally true in practice), then you get back to sqrt(n) access times. If some day we move to totally 3D computers we can get that down to cube root of n.

roywiggins · on Aug 29, 2016

Anthropic O(1) argument: An algorithm will either finish before all humanity dies, or it won't. If it does, it's bounded from above by a constant. If it doesn't, it's completely irrelevant to everyone. So for all practical purposes every algorithm is O(1).

mindslight · on Aug 29, 2016

By similar assumption, every terminating algorithm running on a real computer (having bounded memory) is O(1). This is technically true, but not useful.

Most people assume the average memory access will be reasonably fast (why we have a cache hierarchy in the first place). Sometimes the growth of it matters, sometimes it does not.

Algorithm complexity is not something to be memorized, but analyzed.

malisper · on Aug 29, 2016

The assumption that the computer you are modeling has an infinite amount of memory finite number of caches is a useful assumption as it dramatically simplifies the analysis and still allows you to use big-O to somewhat accurately analyze the performance of a real computer.

mindslight · on Aug 29, 2016

You missed what I was illustrating by referencing bounded memory - every realized program has a finite (therefore constant-bounded) number of states. The point is that such analysis is not useful, because it "cheats" with an misleading implicit constant for the bound.

Similarly, if one is analyzing the performance of a general algorithm and assumes memory access is constant time, then that constant needs to be understood as the full access time of main memory (assuming that's what problems will always fit in).

One cannot casually make this assumption to get an O(n) complexity result, profile the program with a problem size that actually fits in cache, and then combine the two to extrapolate larger running time. They each have different baked-in assumptions.

firethief · on Aug 29, 2016

Yes, that assumption can be used to make any operation O(1).

Ericson2314 · on Aug 29, 2016

We live in a world where we buy hardware to run our programs, not write programs to run our hardware. [This is a Dijkstra paraphrase—can somebody find me the original?]

Think of the O(1) as paying the worst cost of a given memory tier, uniformly. The more data your program uses the more hardware you need to run it the worse your worst case will be.

AstralStorm · on Aug 30, 2016

Makes the analysis completely worthless. The idea of the analysis is to evaluate asymptotic behaviour for large data sets. Of the idea ever a nonlinear cost of accessing the data, you should take it into account.

Suppose the memory is tree structured like typical TLB, you do get at least one extra log n factor. And actually there are even more costs on the way.

sabarn01 · on Aug 29, 2016

It happens all the time. In real world code people throw hash tables and maps at all types of problems of all sizes. If you hit arbitrary boundaries where your fundamental assertion of access time becomes non constant its an issue.

This post might spur new thought about containers and how there usage should really be governed by their expected capacity.

nbadg · on Aug 29, 2016

I think that the "best" message here is that the performance of memory access, hash lookup, etc, is a more complicated question than always just O(1). Knowing the context -- basically, the second-to-last graph that highlights performance vs size vs cache limits -- is the bigger picture for the system. And it's perfectly usable everyday, you just have to remember a few more words: "memory access is O(1) as long as I stay within L1".

The problem I have with the article is that the author has replaced one broad generalization with another equally reductive but diametrically opposed one. It's no more accurate, and no more decontextualized, than the original statement, and in my head that means it's also not really all that useful.

AstralStorm · on Aug 30, 2016

Simpler version: locality of data matters a lot.

rocqua · on Aug 29, 2016

One can move across all these scaling factors by changing from linear access to random access. A common example is linked lists vs arraylists. Many articles show it takes about 5 inserts per read for a linked list to beat arraylists.

AstralStorm · on Aug 30, 2016

Not with large data sizes it doesn't. Which is the whole point. Array would be even slower than the list on insert because you actually have to move much more data around.

The one thing array is better at its sometimes locality, especially if you iterate over neighbouring elements.

ChuckMcM · on Aug 29, 2016

Since it is a topic I'm interested in I took the time to read all 4 parts, the author manages to summarize it in a paragraph which would have been helpful at the beginning:

When somebody says “Iterating through a linked list is a O(N) operation” what they mean to say is “The number of instructions needed to be executed grows linearly with the size of the list.”. That is a correct statement. The argument I’m trying to make is that it would be a mistake to also assume that the amount of time needed would grow linearly with the size of the list as well. This is an important distinction. If you only care about the number of instructions executed that’s fine, you can use Big-O for that! If you care about the time taken, that’s fine too, and you can use Big-O for that too!

Sadly, he doesn't take this knowledge to its conclusion. Let's introduce the notation Oi() for the Big-O notation in instructions, and Ot() for the Big-O notation for time.

Lemma: For all f(N), if Oi(f(N)) > Oi(g(N)), Ot(f(N) will be > Ot(g(N)).

Or put another way, it's important not to confuse complexity scaling with time scaling, but the more complex the computation, the longer it will take.

aschampion · on Aug 29, 2016

A complexity measure is a measure of complexity over a model. Mostly we talk about input size models, but we can (and do) talk about other models. It isn't a question of complexity scaling vs. time scaling, but of complexity scaling over an input size model vs. a machine architecture model.

There are tons of well-researched machine architectural models for complexity analysis. In the case of the article, any one of the many external memory (EM) or hierarchical memory models would be appropriate. These can capture latencies and bandwidth of hierarchical memories and storage, and in some cases even eviction, contention, partition, etc.

This isn't some esoteric point. In high performance computing, distributed algorithms, or even systems & architecture it's pretty much expected that if you talk about complexity, you should do so under some sort of EM model.

ChuckMcM · on Aug 29, 2016

Well stated. That's a much better way to say it.

gnud · on Aug 29, 2016

Don't think your conclusion holds - and that's actually something the author brings up.

It might be quicker in time (not in the amount of instructions executed) to calculate something multiple times, or to do a more complex calculation or what have you, in order to touch less memory.

DanWaterworth · on Aug 29, 2016

> if Oi(f(N)) > Oi(g(N)), Ot(f(N) will be > Ot(g(N)).

That doesn't hold. Some instructions take longer than others.

ChuckMcM · on Aug 29, 2016

This is how I would go about stating the proof.

1) The article shows that the time to process N datums of data is related to the size of the total data processed. And that as the size of the data set grows, the time to process it grows with the root of N.

2) Computational complexity's order of magnitude is relative number of datums need to be processed in order to complete the computation.

3) Less complex computations process fewer datums, fewer datums mean a smaller total memory footprint.

Therefore, processing fewer datums will lower your total memory processing cost which will lower your total time. Q.E.D.

SamReidHughes · on Aug 29, 2016

Q.E.Nope. Memory can get reused.

jacobparker · on Aug 29, 2016

If all instructions complete within some constant deadline independent of N/the input then it does hold even if they take different amounts of time.

malisper · on Aug 29, 2016

The article demonstrates that memory access is an instruction that actually has O(n^1/2) performance.

DanWaterworth · on Aug 29, 2016

It doesn't hold for binary tree lookups vs B-tree lookups.

malisper · on Aug 29, 2016

The hypothesis of the statement doesn't hold. O(n log n) isn't greater than O(n log n).

DanWaterworth · on Aug 29, 2016

You have to evaluate it in a model with an explicit cache width size, then O(log_b(n)) < O(log(n))

SamReidHughes · on Aug 29, 2016

O(log_b(n)) is the same as O(log n). With O(sqrt i) memory accesses you still have a constant factor separation between the absolute time performance of the two.

DanWaterworth · on Aug 29, 2016

> O(log_b(n)) is the same as O(log n)

Not in every abstract model. See other reply.

SamReidHughes · on Aug 29, 2016

It has nothing to do with the model and is just a question of whether b is a variable in the big O notation.

And in this case of O(sqrt i) memory access times, a binary tree and b-tree stay within a constant factor even as you vary b. (The reason is, the binary tree accesses that a single b-sized access replaces get exponentially more "local.")

DanWaterworth · on Aug 29, 2016

> It has nothing to do with the model and is just a question of whether b is a variable in the big O notation.

There are models, like the cache oblivious model, where b is assumed to be a variable, so the model matters.

> a binary tree and b-tree stay within a constant factor even as you vary b.

it's not a constant factor if b is a variable. It's a factor of log b.

SamReidHughes · on Aug 29, 2016

> it's not a constant factor if b is a variable. It's a factor of log b.

It's between 1/(sqrt(2)-1) and sqrt(2)/(sqrt(2)-1), depending on how full the b-tree nodes are.

DanWaterworth · on Aug 29, 2016

I don't know where you got those numbers from.

SamReidHughes · on Aug 29, 2016

> The reason is, the binary tree accesses that a single b-sized access replaces get exponentially more "local."

sqrt(n)+sqrt(n/b)+sqrt(n/b^2)+... versus sqrt(n)+sqrt(n/2)+sqrt(n/4)+...

And since b-tree nodes can be half empty, there's a sqrt(2) uncertainty. (And of course there is no other memory overhead at all, none whatsoever.)

DanWaterworth · on Aug 29, 2016

Thanks, I see. Are you assuming ephemeral usage of nodes to make that claim?

SamReidHughes · on Aug 29, 2016

I don't know what that means. I'm assuming a random element of the tree is picked, that parent nodes are in a smaller or equal cache level than children, that all but one cache levels are used completely, that each cache level has O(sqrt n) access time, and that there is an upper bound on the ratio between successive cache sizes.

Or less generally: it takes sqrt(j) nanoseconds to dereference the pointer with value j, and parent nodes are at smaller addresses than their children.

DanWaterworth · on Aug 30, 2016

Right, you're assuming there's one tree being maintained.

SamReidHughes · on Aug 30, 2016

Or any fixed number of them, or any curve where the number is polynomially smaller than the total size...

GauntletWizard · on Aug 29, 2016

log_b(n) * log(b) = log(n). log(b) is a constant in our algorithmic analysis, and can therefore be removed. There will never be a log_b(x^2) that grows at slower than log(x), no matter how big b is.

DanWaterworth · on Aug 29, 2016

Which is why you have to evaluate in a model with an explicit cache width. In such a model, b is not a constant. The cache-oblivious model [1] is a fairly well-known example, but that won't work in this case, since we need to know b to set the tree width. Any of the other external memory models will do.

[1] https://en.wikipedia.org/wiki/Cache-oblivious_algorithm

ChuckMcM · on Aug 29, 2016

I think the notation is imprecise.

The quantity the author describes is commonly called the working set. His explicit thesis is that the time to process the working set is related to the size of the working set regardless of computational complexity. He spends the entire first part of his discussion belaboring this point with a linked list example. He goes on to do a systems analysis of the results and concludes that time of execution is system dependent not complexity dependent. Again using the same algorithm engaging with a larger and larger memory system.

Generally engineers would experience this important principal in a more binary way, once their program started swapping to disk, its performance would fall through the floor.

If you hold the size of the datum constant (and that is important), then processing fewer of them will take less time than processing more of them.

The argument you make, as I understand it is as follows; Consider two algorithms that process N and P datums respectively to achieve the same computational result, and where N > P. There exists an algorithm such that P datums takes less per unit time to process than N. Resulting in a violation of my Lemma because Oi(P) > Oi(N) but Ot(P) < Ot(N). And you cite cache obliviousness as the principle that enables that violation.

Would you agree that I have accurately summed up your argument?

DanWaterworth · on Aug 29, 2016

It's roughly right, but it's not cache obliviousness that enables the violation. For the argument to work out, you have to evaluate it within a model that has an explicit cache and hence, treats big O results as being parametrized over the cache width.

pedrocr · on Aug 29, 2016

Does that really follow? The sqrt(N) factor is the overhead of memory access only, so if you're comparing an algorithm where every operation is a memory access to another where that is not the case it would seem you could have the inverse relation in Oi and Ot.

Edit: A counterexample then:

Oi(f(N)) = N^(2/3) (memory accessing operations) Oi(g(N)) = N (non memory accessing operations)

Ot(f(N)) = N^(7/6) Ot(g(N)) = N

So Oi(f(N)) < Oi(g(N)) but Ot(f(N)) > Ot(g(N))

Terr_ · on Aug 30, 2016

I just assume that every big-O equation has some sort of units (possibly implicit) and that neither the inputs nor the results are dimensionless quantities.

We just tend to assume that "N" is clear from context and that the result is some number of similarly-loose "instructions".

Kenji · on Aug 29, 2016

Your Lemma doesn't hold if the time to execute one instruction is dependent on N - which is pretty much the entire point the blog post makes.

reikonomusha · on Aug 29, 2016

I think there's some good info in this article covered by various degrees of misinformation. For some reason, the article starts off with this totally wrong definition of big-O, and proceeds to make conclusions with this wrong definition. Let me provide the accurate definition:

The statement "f is O(g)" means there exists some input, call it t, such that for every x >= t, it only takes some constant multiplier M (i.e., constant in x) to always have g absolutely no smaller than f. In notation:

|f(x)| <= M * |g(x)|, where x is at least t.

This bit about "x is at least t" is very important and notifies us that this is "asymptotic behavior".

It does not make a difference how wacky or weird f is compared to g below t. It can contain all these crazy memory hierarchy artifacts, it could contain a short burst of exponential slowdown, it could contain anything.

Furthermore, according to the above definition, big-O has nothing to do with any tangible quantity whatsoever. It's a method for comparing functions. The functions may represent whatever is of tangible or intangible interest: memory, time, money, instructions, ...

Big-O analysis usually posits that the details below t aren't the details that matter. (Of course, there are situations where they do, but in such you would not use big-O.) If you want to have some analysis that is global, you don't need asymptotic analysis (though it might help as a start). You can just talk about functions that are strictly greater than or less than your function of interest everywhere. But these analyses are difficult because a much higher level of understanding of your function of interest is required.

smallnamespace · on Aug 29, 2016

> It does not make a difference how wacky or weird f is compared to g below t. It can contain all these crazy memory hierarchy artifacts, it could contain a short burst of exponential slowdown, it could contain anything.

Yes, but if you continue reading the next post, the author is essentially arguing that the cache hierarchy extends indefinitely due to 1) the speed of light and 2) the amount of information that can fit within a light cone of a certain radius.

In that case, cache hierarchy effects aren't 'weird things that happen below t', they represent the asymptotic behavior we can expect due to the laws of physics, no matter how far out to the right we go.

reikonomusha · on Aug 29, 2016

I think it's great to make such arguments but a bit more rigor would be appreciated in such theoretical constructions. If the main point to get across is in Part II, and Part I does not support Part II, as a matter of convenience, I would elide Part I.

zeroer · on Aug 29, 2016

I wouldn't. It's nice to show these effects in practice. In fact, I would have added a part 1.5 that gave theoretical bounds for how much heat can be dissipated from a volume that contains N bits.

smallnamespace · on Aug 29, 2016

Heat dissipation is proportional to the surface area, so this suggests a reason that information density is (physically) bound by surface area, rather than volume.

klodolph · on Aug 29, 2016

Another factor that gives the same limit in theory is gravity. If you have a volume of space and start shoving hard drives in it, eventually the hard drives will collapse into a black hole, and your information density is limited by the surface area of the black hole—again, giving you O(R^2) bits of storage for a volume with radius R.

zeroer · on Aug 29, 2016

That's explicitly mentioned in the article.

klodolph · on Aug 29, 2016

It's mentioned in the second article in the series, not the linked article.

amelius · on Aug 29, 2016

Interesting. Perhaps a stupid question but is this somehow related to the recently proposed theory which says that the universe is essentially a hologram?

smallnamespace · on Aug 29, 2016

One way it's related: if a volume's information only depends on its surface area, then you can imagine the volume is really just a hologram with the same number of bits, and the bits are directly embedded on the surface of that hologram.

klodolph · on Aug 29, 2016

From what I understand, the holographic principle (the theory you mentioned) was inspired by the consequences of black hole thermodynamics.

Retra · on Aug 30, 2016

There's only so much surface area you can fit in a given volume.

lloeki · on Aug 30, 2016

Helge von Koch begs to differ ;-) Also, simple counterexamples: https://people.emich.edu/aross15/math121/misc/love-1989-supe...

Retra · on Aug 31, 2016

How are those counterexamples? I'm not talking about abstract mathematical geometry, I'm talking about actual physics. A paper that says "just drill infinite holes in a cube" isn't relevant here.

lloeki · on Aug 31, 2016

Stuffing hard drives together until they form a perfect Schwartzchild black hole is hardly "actual physics", so I posited we're long into abstract constructs giving theoretical upper bounds.

zeroer · on Aug 29, 2016

Exactly!

rosalinekarr · on Aug 29, 2016

Worrying about the speed of light in processor caching these days is like putting a spoiler on your cart and horse to reduce drag. It might matter eventually, and that's kind of cool to think about, but it's absurdly negligible with the technology of today. Processors today are orders of magnitude slower than that.

smallnamespace · on Aug 29, 2016

That's completely false. The speed of light today already constrains processor design (and by extension, on-CPU caches):

http://arstechnica.com/science/2014/08/are-processors-pushin...

"Even if signals in the chip were moving at the speed of light, a chip running above 5GHz wouldn't be able to transmit information from one side of the chip to the other". Note that a light-nanosecond is about a foot, so I'm assuming this is talking about the propagation delay of electrons in silicon that need to go through gates, which I'm sure is much slower than light in a vacuum.

To put this another way, you can view the scaling law as a hard upper bound on the speed of access to data; engineers can only approach that bound from below as technology improves.

But given a certain level of technology, it's easier to make a faster L1 cache than an L2 cache, and that will hold even as we approach theoretical limits.

klodolph · on Aug 29, 2016

A bit of a nit: We're not worried about the propagation of electrons over distance as much as we're worried about the propagation of the electric field. Electrons move too slow, the electric field moves much faster.

https://en.wikipedia.org/wiki/Velocity_factor

creshal · on Aug 29, 2016

> which I'm sure is much slower than light in a vacuum.

Speed of signal in copper is about ⅔c, I suspect silicon will be in a similar ballbark – and the metal layers linking the silicon layers are made of copper anyway. So it's about 8"/ns… and the (three dimensional, 10+ layer) signal routes are anything but a straight line.

klodolph · on Aug 29, 2016

The speed of the signal depends on the properties of the surrounding medium though, right? You get one speed through copper in air but another speed entirely for copper in silicon.

creshal · on Aug 29, 2016

And another for pure silicon, and others for the various doped silicons, …

I don't even know what the actual materials are with current technologies, but whatever it is, the speed will be much lower than the vacuum speed of light.

emn13 · on Aug 30, 2016

Well, that depends on the context. You're totally right in general, but for the purpose of order-of-magnitudes of maximal clock rates, the speed of light hardly varies. I can't find any normal, plausible processor material with a refractive index of 4, let alone 10. So while light in silicon may well travel [almost 3.5 times](http://refractiveindex.info/?shelf=main&book=Si&page=Chandle...) more slowly than in a vacuum - which is hardly nothing, at the granularity we're talking about I think it's not a huge issue. But that's just perspective - clearly the materials matter and will affect signal transmission speeds, and probably even today impose significant (if indirect) limitations.

It's possible weird dopings or quantum effect change this dramatically, I suppose. But if I had to guess, they don't.

hamiltonkibbe · on Aug 29, 2016

And the signals all need to get there at the same time

creshal · on Aug 29, 2016

And that's such a hard problem that you don't even have to go down to CPU scale for it to matter.

If you look at mainboards (or some smaller PCBs), the connecting lanes between e.g. RAM and CPU closer to the centre are often zigzagged, to match the lengths – and delays – of the necessarily longer outer lanes. The speed of light is actually really damn slow.

hamiltonkibbe · on Aug 29, 2016

> The speed of light is actually really damn slow.

I don't think one really appreciates that until they found out that after X hours of place&route they missed timing by some fraction of a nanosecond

Dylan16807 · on Aug 30, 2016

> you don't even have to go down to CPU scale for it to matter.

It's say it's the opposite, really. The longer the wires, the more effort it takes to synchronize their delays within a certain number of picoseconds.

klodolph · on Aug 29, 2016

Well, not entirely. For data signals in general you just need all the lines to resolve to the desired value before the clock. You do need to be very careful about the clock, though.

creshal · on Aug 29, 2016

DDR4-SDRAM achieves speeds in excess of 1.2 GHz over distances of 4" or more. That doesn't leave too much wiggle room.

iheartmemcache · on Aug 30, 2016

Surprisingly, the behavior at that speeds is fairly well known. I don't even think you need to move from FR4 to a more controlled material (e.g.generally in mmWave, Isola FR408, along with sometimes Rogers stuff, IME, is what is used). Allegro PCB SI(Signal Integrity) even models high speed timing fairly well at the design stage.

You've got plenty of stuff at the test stage too. E.g., the gear on high-end Lecroy's metrology test gear is at 100ghz. Agilent (Keysight, whatever, it's still HP to me) has a full test rig for USB3.1[0] at 10gbps for their consumer level gear (again, fairly slow). Step it up to FPGA speeds and here's[1] an app-note by Altera with way higher speeds.

Here's a really brief overview of 'rules of thumb' that work up to PCI-e[2] by TI. 50 minutes and worth a watch. Clean power that won't couple in, matching lines lengths on diff pairs, proper isolated ground planes (give AGND and DGND their own layers) and proper termination will easily get you 95% of the way there.

(Shameless promotion-- available for high-speed design, pre-EMC compliance testing, fault diagnostics, etc).

[0] http://www.keysight.com/en/pd-2472798-pn-U7243B/usb-31-compl... [1] https://www.altera.com/en_US/pdfs/literature/an/an528.pdf [2] https://www.youtube.com/watch?v=A3qw_Ecx9Co

Dylan16807 · on Aug 30, 2016

Note that GDDR5 abandoned length matching, and just measures the latencies of each wire to calibrate itself.

hamiltonkibbe · on Aug 29, 2016

Well yeah, I figured that that's implied, the only times that really matter are the rising and falling edges of the clock.

TheOtherHobbes · on Aug 30, 2016

>"Even if signals in the chip were moving at the speed of light, a chip running above 5GHz wouldn't be able to transmit information from one side of the chip to the other".

That's almost completely misleading. Except for the clock itself, nothing that happens inside a processor takes a single clock cycle. And clock distribution delays need compensation anyway.

That's also true of memory. No one has ever seen 1-0-0-0 CAS RAM. (And no one ever will.)

It's true that pushing up against distribution times makes topology more complicated and lowers the maximum possible speed.

But that's actually the difference between electronics and photonics, which promise to run many orders of magnitude faster than EM-based silicon design. (I've seen "millions" quoted, but that may turn out to be hyperbole. Even so - optical fibre can run at tens of Tbps.)

Of course there's a difference between transmission speed, which is a function of the speed of light, and maximum data rate, which is a function of bandwidth.

But unlike electrons-in-silicon connections, photonics may not need to be 100% serial. It's entirely possible to get multiple data channels down a single line with frequency multiplexing, and perhaps also by phase rotation.

That has huge potential to change the packing density and the speed of processor designs; potentially you could have processor elements that were massively parallel but connected with single lines.

There's also some hope that because power dissipation shouldn't be such a problem, it will be easier to make 3D designs - although that's probably more speculative.

Bottom line is silicon is nearly done, but photonics is just getting started. Like fusion power it's probably a couple of decades away, but when it arrives it will be huge.

pjc50 · on Aug 30, 2016

Most stuff that happens inside processors takes a single cycle, and we used to have single-cycle RAM until the CPUs ran away with increasing speed.

Photonics has a serious density problem: visible light is too large in terms of wavelength. This is already an issue with etching silicon. It's also at best at the 1950s stage where people have developed single photonic transistors in the lab but not yet photonic VLSI.

Anyway, the main factor of delay in chips is capacitance, both between wires and on the gates of FETs. This must be charged through the on resistance of the driving FET and all the intermediate wiring.

emn13 · on Aug 30, 2016

So, what you're really saying is that it's quite accurate, there are just details and more details but in the end - as long as we use a single clock - signals need to be able to travel from one end to the other in a single clock cycle.

Now, if we had (partially) clockless processors...

Also, your details are a lot more misleading than those in the article. lots of things in a modern processor take 1 clock cycle. Agner has lovely tables:www.agner.org/optimize/ and note that most basic integer and floating point operations on a skylake processor take 1 cycle - things like bit shuffling, moving, adding (not floats), and quite a few incidental other operations. Now, I'm sure you can quibble that a latency of 1 cycle in that table isn't really 1 cycle under some unusual interpretation of what's going on, but at that point you're quibbling about what "doing something" means, which isn't helpful. What programmers/compiler writers call 1 clock cycle can execute what most people call 1 instruction for many common instructions, even today.

smallnamespace · on Aug 30, 2016

All cache (and in fact, memory as well) used to be single-cycle.

Why aren't they any more? Because the speed of light doesn't allow it.

Multi-cycle cache/memory latency and the cache hierarchy it implies is a result of being bound by the speed of light.

regularfry · on Aug 30, 2016

On the massively-parallel-connected-with-single-lines theme, there's been some interesting work done with on-chip RF, which does essentially the same thing with conventional silicon.

Dylan16807 · on Aug 30, 2016

If anyone talks about signals crossing the chip in a single cycle, they're barking up the wrong tree. It's absolutely unnecessary. It would not take a tremendous engineering work to make a CPU where signals go no more than 3mm or even 1mm in a cycle. (At 5GHz you have a distance budget of over 40mm.) The most important constraint for single-threaded operation is transistor switching time. It's likely the most important constraint will always be transistor switching time.

pkolaczk · on Sept 1, 2016

Transistor switching time is limited mostly by electrical capacity which is determined by the permittivity of the material the transistor is made from. And permittivity and permeablity are the only things determining the speed of light in a material. So saying that speed of light is the limiting factor of transistor switching time is valid.

aporetics · on Aug 29, 2016

Spoilers don't reduce drag.

They produce downforce to counteract the lift effect of air moving under the car.

It is only necessary at very high speeds, and prevents loss of grip in (especially) the rear tires.

But, yeah, horse and buggy. I'm with you there.

tankenmate · on Aug 29, 2016

The downforce isn't there to counteract air under the car (actually fast flowing air under the car produces downforce; read up on ground effect and Bernoulli).

The downforce is there so the car can go around corners quicker; the downforce increases the friction (Ff<=mu x Fn), i.e. the force of friction is less than or equal to the coefficient of friction (mu / μ) multiplied by the normal force. In the case of a tire the normal force is the downward force (provided by the wings or spoilers). Therefore the more downward force the higher the friction hence the more the tire can exert lateral force (cornering force). For example a F1 car at high speed can exert approximately 4Gs of lateral force when cornering, or 5Gs of force when breaking (at slower speeds a F1 car can't brake as hard).

mac01021 · on Aug 30, 2016

Dragsters don't go around corners (not while racing, anyway), but they still benefit from the extra weight on the drive wheels produced by the air hitting the top of the spoiler.

caf · on Aug 30, 2016

It's the same effect though - the F1 car cornering needs to maintain sufficient friction to prevent the tyres slipping laterally, whereas the F1 car braking or the dragster accelerating needs to maintain sufficient friction to prevent the tyres slipping longitudinally.

tankenmate · on Aug 30, 2016

The air hitting the top of the spoiler isn't generating anywhere near the suction created on the bottom of the spoiler with air underneath having to speed up (and hence have much lower pressure compared to the top).

lasc4r · on Aug 29, 2016

Spoilers reduce drag and provide down force and high speeds.

coredog64 · on Aug 29, 2016

Aerospace engineer here: Producing lift in either the up (airplane) or down (spoiler) direction generates induced drag. The amount depends on your design, in that a large real or effective AR can reduce it, but it's going to be there.

ScottBurson · on Aug 29, 2016

I think what happens in automotive designs, which are not always penned with the same attention to aerodynamics as aerospace designs, is that they often have unintentional lift at the rear. Adding a small spoiler, then, doesn't add actual downforce, it just reduces lift -- and therefore reduces drag into the bargain.

zeroer · on Aug 29, 2016

The author gets the definition of big-O completely backwards. Every single time the author says "Big-O" he means "little-o".

Here are the definitions:

https://cathyatseneca.gitbooks.io/data-structures-and-algori...

Basically, big-O is a claim that your process is at least -this- fast. Small-o is a claim that your process can't be faster than -this-. He means the latter.

> Still not convinced? Think I'm misunderstanding Big-O?

Yea, I do.

rntz · on Aug 29, 2016

Your claims about what little-o and big-O mean don't match the explanation at https://cathyatseneca.gitbooks.io/data-structures-and-algori...

That page says o(f) == O(f) and not Theta(f). When you say "small-o is a claim that your process can't be faster than -this-", that seems to match the definition of Omega(f). Did you mean that the author should have used big-Omega?

As a side-note, I've always thought that {big-O,Theta,Omega,small-o} notation was very confusing. We should just have a notation for "the asymptotic equivalence class of a function", let's say A(f). Then to say f \in O(g), we say A(f) <= A(g). To say f \in Theta(g), we say A(f) = A(g). f \in o(g) becomes A(f) < A(g). And so forth. Instead of lots of new notation, we add a single function A and re-use notation for orderings which everyone already knows. (Of course, I'm omitting the definition of the ordering on these equivalence classes, so maybe there is some difficulty in defining that...)

contravariant · on Aug 29, 2016

>As a side-note, I've always thought that {big-O,Theta,Omega,small-o} notation was very confusing. We should just have a notation for "the asymptotic equivalence class of a function", let's say A(f). Then to say f \in O(g), we say A(f) <= A(g). To say f \in Theta(g), we say A(f) = A(g). f \in o(g) becomes A(f) < A(g). And so forth. Instead of lots of new notation, we add a single function A and re-use notation for orderings which everyone already knows. (Of course, I'm omitting the definition of the ordering on these equivalence classes, so maybe there is some difficulty in defining that...)

Your A is pretty much equivalent to \Theta. Saying f \in \Theta(g) is equivalent to saying \Theta(f) = \Theta(g).

Edit: as an aside O(f) = O(g) is also equivalent to f \in \Theta(g), and O(f) \subset O(g) is equivalent to f \in O(g).

rntz · on Aug 29, 2016

For equality, yes, Theta works! But I care not just about equality but about comparison. In particular, it would be nice if the "natural" ordering on A(f) had the property that

    A(f) <= A(g) iff f is O(g)

Theta(g) is a set, and the most natural ordering on sets is subsetting/inclusion. However, Theta(x) is not a subset of Theta(x^2), even though x is O(x^2)! So the natural ordering on sets doesn't do what I want.

It might be possible to define an ordering on Theta-classes that does what I want, though.

Edit: I just saw your comment about O. I think O might just work! So instead of remembering big-O, Theta, and Omega, I can just use

    O(f) <= O(g)  for  f is O(g)
    O(f) = O(g)   for  f is Theta(g)
    O(f) >= O(g)  for  f is Omega(g)

Is that true?

(I think small-o might be slightly more complicated, unfortunately. It has a more unusual definition: https://en.wikipedia.org/wiki/Big_O_notation#Little-o_notati...)

contravariant · on Aug 29, 2016

>Edit: I just saw your comment about O. I think O might just work! So instead of remembering big-O, Theta, and Omega, I can just use

> O(f) <= O(g) for f is O(g) > O(f) = O(g) for f is Theta(g) > O(f) >= O(g) for f is Omega(g)

>Is that true?

Yeah, I'm pretty sure that will work (haven't checked it very rigorously though). The definition of O(f) <= O(g) would simply be O(f) \subset O(g), which immediately satisfies almost all criteria (only tricky bit is proving that either O(f) is a subset of O(g) or the other way around).

For little o I think you should be able to use O(f) < O(g), using the ordering defined above.

jez · on Aug 29, 2016

> Is that true?

Yes, and that's the reason why

- g \in O(f) is an upper bound on g

- g \in Omega(f) is a lower bound on g

- g \in Theta(f) is a "tight" bound on g, or both an upper and a lower bound at the same time

zeroer · on Aug 30, 2016

Crap, you're right. I should have said Omega.

MrManatee · on Aug 30, 2016

I'm afraid the author is correct in using big-O here instead of Omega or little-o.

For comparison, suppose that someone claims that for all x, sin(x) ≤ 1/2. That would just be wrong. If someone claims that sin(x) ≤ 2, then that is true, but not as informative as saying that sin(x) ≤ 1. There's something special about 1 here. If you want to highlight that, you can say that 1 is the least upper bound for sin(x). This is a bit of a mouthful, but we cannot achieve the same meaning by just replacing "≤" by "≥" or "=" or "<". Any of these replacements would just make the statement incorrect.

Big-O is something like the asymptotic version of ≤. For example, how many comparisons does heapsort need to sort an array of length n? If someone says O(n), that's just wrong. If someone says O(n^2), then that's correct, but not as informative as saying O(n log n). O(n log n) is special here, since it is the smallest complexity class that contains the number of comparisons. Again, it is a bit of a mouthful, but we cannot say the same thing by just replacing big-O by Omega, Theta, or little-o (the asymptotic versions of ≥, =, and <, respectively). For example, Theta(n log n) and Omega(n log n) are incorrect, since heapsort only requires a linear number of comparisons if the array already happens to be sorted.

The author argues that random memory access time is O(sqrt(n)), meaning that for large n, it might take up to constant * sqrt(n) time, but might also be faster if the memory to be accessed happens to be very close. Using Omega(sqrt(n)) instead would mean that random memory access can take an arbitrarily long time, but at least constant * sqrt(n). This is not what the author is trying to say.

smallnamespace · on Aug 29, 2016

You're right, but almost all colloquial usage of Big-O seems to be close to theta-O because people pick the tightest bound they can.

That's why everyone writes that comparison sorts are O(n lg N), even though it's also technically O(exp N) as well.

zeroer · on Aug 29, 2016

Yea, I thought about that, but he doesn't mean theta, either. What he's giving is a physical impossibility proof -- you can't do memory access better than sqrt(N). When he gives a physical example of a machine of arbitrary size that accesses memory in sqrt(N) time, then we can talk about theta.

Animats · on Aug 29, 2016

Well, cube root of N; you can stack memory.

The speed of light limitation on distance to DRAM memory is real, but so far seems to affect mostly supercomputers. Are there any large server boards where speed of light lag is the limit on memory capacity?

zeroer · on Aug 30, 2016

Read part two of the article. You can't stack memory indefinitely -- you'll get a black hole.

perfectfire · on Aug 30, 2016

> it's also technically O(exp N) as well.

I wonder if this is what one interviewer was trying to get at when he asked me over and over again if a binary search was worst case O(log N). I just kept saying that if you took more than log N steps to perform a binary search then you by definition did not perform a binary search. He was off his rocker so I kind of doubt he was looking for the difference between little-o, theta, and big-O. It's good stuff to remember and might impress a more sane interviewer some day though.

jo909 · on Aug 29, 2016

Please see https://en.wikipedia.org/wiki/Best,_worst_and_average_case

Big-O is valid notation for _many aspects_ of an algorithm. The author applied it to a very practical one and explained what exactly he describes very well.

zeroer · on Aug 29, 2016

The best/worst/average issue is orthogonal to the definition of what big-O means. The time it take an algorithm has best/worst/average cases, and each of these is a function and thus is a member of big-O classes.

The definition of big-O has a completely standardized and unambiguous definition, and the author of the article somehow has a four-part blog post on big-O without knowing it.

reikonomusha · on Aug 29, 2016

I think it's a massive and needless abuse of standard computer science terminology to re-define it as something completely different.

MrManatee · on Aug 30, 2016

I'm not entirely sure what you are referring to. You might be referring to the fact that the author's definition of big-O doesn't say anything about constant factors or asymptotics. This makes the definition incorrect, or at least sloppy. But judging by usage, it seems that he actually knows and is using the standard definition. The error is just in that one sentence, and it doesn't affect the rest of the argument.

You might also be objecting to the fact that he makes a distinction between time and instruction count, and is using big-O notation for both. I don't think there's anything nonstandard about making this distinction when it needs to be made. Take, for example, Karmarkar's algorithm:

https://en.wikipedia.org/wiki/Karmarkar%27s_algorithm

DanWaterworth · on Aug 29, 2016

Re-read the article, bearing in mind that O(sqrt n) is a set that contains O(1).

angry_octet · on Aug 30, 2016

Yes, but it isn't very helpful to think of it that way, we're not competing to find the weakest constraint.

scott_s · on Aug 29, 2016

> For the purpose of this series of articles I'll be using the O(f(N)) to mean that f(N) is an upper bound (worst case) of the time it takes to accomplish a task accessing N bytes of memory (or, equivalently, N number of equally sized elements).

That's not really valid; it's not how algorithmic analysis works. The author's conclusion for what is happening and why is correct, but I believe he is confused about how to get there.

Simply, when doing complexity analysis on an algorithm, one must always count an operation. It's not okay to point to the time taken for an implementation and say "That's our function." It is a function, but it's a function of time, not a count of how many operations are performed at given sizes of N.

However, he is correct that naive analysis of arrays and linked lists will result in this odd behavior: arrays will tend to outperform lists on real systems. The problem with the naive analysis is in what it counts. For example, on an insert, a naive analysis will count the number of elements accessed in the structure. That's naive because it assume all accesses are the same - which is what he's getting at with the "myth of RAM". Because of the memory hierarchy, they are not all equal.

But the correct response is not to give up counting operations and look at time, the correct response is to find the right thing to count. And the right thing to count is basically going to be last level cache misses - the operations that force one to go to memory. If you do that, then you will find that the operations you are counting will correlate much better to the actual time spent.

In some places, the author gets this mostly correct: "You can also use Big-O to analyze the time it takes to access a piece of memory as a function of the amount of memory you are regularly accessing." That's fine, as you're counting memory accesses.

In other places, it's not correct: "That I use Big O to analyze time and not operations is important." You can't count time, only operations. You want to count the operations that correlate with your actual running time, but the entire point of good analysis is to find those operations. You can't just shortcut it, only measure time, and then call it algorithmic analysis.

The author gets a lot right, but despite the lengthy discussion, I think he still has some confusions about algorithm complexity analysis.

For the record, these lessons should be familiar to anyone who has done serious performance analysis of computer systems, either on their own, or in the context of a course that focused on systems or architecture.

_delirium · on Aug 29, 2016

Fwiw theoretical computer scientists do this kind of analysis as well, while the post makes it sound like they don't. The equal-cost model of RAM operations is only one model used in asymptotic algorithm analysis, and there are other cost models with other properties. The introduction of this paper gives a decent concise overview of some of them: http://www.cs.cmu.edu/~rwh/papers/iolambda/short.pdf

wscott · on Aug 29, 2016

Great series of articles and the lessons are very important to someone writing performance system's programs.

Here is another chart I like you show people: https://dl.dropboxusercontent.com/u/4893/mem_lat3.jpg

This is a circular linked list walk where the elements of the list are in order in memory. So in C the list walk looks like this: while (1) p = *p;

Then the time per access was measured as the total length of the array was increased and the stride across that array was increased. The linked-list walk prevents out of order processors from getting ahead. (BTW another huge reason why vectors are better than lists)

(This is from an old processor that didn't have a memory prefetcher with stride detection in the memory controller. A modern x86 will magically go fast.)

From that chart you can read, L1 size, L2 size, cache line size, cache associativly, page size, TLB size. (It also exposed an internal port scheduling bug on the L2. A 16-byte stride should have been faster than a 32-byte stride.)

jcoffland · on Aug 29, 2016

Math is pure and not constrained by the real world. Big O analysis begins with the assumption that you have unlimited uniform memory. The author points out that memory is not uniform in the real world. It's equally untrue that we have infinite memory at our disposal. The limits of the real world are good to remember but that does not invalidate Big O analysis.

claudius · on Aug 29, 2016

For the analysis in the article, it is enough to assume an unlimited amount of space into which you can put your memory as well as an unlimited amount of time. For your own analysis, you can of course make any assumptions you wish, but the author argues (and I agree with him there) that the spatial scaling O(sqrt(N)) for random access is both relevant in practice (as seen by the caching graph) and cannot be improved on when considering the laws of physics.

> The limits of the real world are good to remember but that does not invalidate Big O analysis.

No, it just means that you failed to take one specific bit into account and that your analysis may hence be less relevant to the real world. Much the same way we can say that integer addition is O(1), because usually we deal with fixed-size integers, we can say that memory access is O(1), because usually we deal with a fixed maximal memory size. Of course, integer addition is O(log(N)) and, according to the arguments made in the article, memory access is O(sqrt(N)).

jcoffland · on Aug 29, 2016

The author assumes throughout the article uniformly random memory access. This is the worst case for a cache. If you're going to get down to the nuts and bolts of your implementation and come out of the high world of mathematics and Big O then you cannot just consider one element, namely, in this case caching. You should also consider your access pattern which very likely is not uniformly random and therefore probably does not fit the authors analysis. In fact, the only reason caching works is because the authors premise is generally wrong.

claudius · on Aug 29, 2016

The author does address this point in part three of the series where he compares access times to a small area of a larger memory region. Even with absolutely sequential access to an array of size K, you first have to find this particular array in your larger memory region of size N, giving you total cost to iterate over the full array as O(sqrt(N) + K).

I am not sure how costly it is to iterate through your entire memory (assuming a no-op), but I would argue that eventually you reach some bottleneck and end up at O(sqrt(N)+N), too.

michaf · on Aug 29, 2016

Interesting read. Researchers in the HPC community have developed a number of performance models to predict real-world performance in more detail than possibe through simple Big-Oh of number of operations, e.g. while OP concentrates on latency, the Roofline model ( https://en.wikipedia.org/wiki/Roofline_model ) mainly considers limited memory bandwidth.

corysama · on Aug 29, 2016

A lot of people are pointing out that BigO is a purely theoretical, mathematical model that should be understood and used properly without regard to silly details like physics.

That is theoretically correct. But, the difference between theory and practice is that in practice there exists a large percentage of programmers writing code for the real world without understanding and using BigO properly. Their mental model of performance begins and ends with BigO. As far as they are aware, its model is reality.

Source: I've been giving a large number of programmer job interviews lately. It's a rare day when I encounter an engineer (even a senior one) who is aware of any of the issues brought up in this series. And, I work in games!

MaulingMonkey · on Aug 29, 2016

The article is still wrong - iterating through a linked list is O(N log(N) sqrt(N)). You can't have infinite nodes in a 16-bit, 32-bit, or even a 64-bit address space - to deal truly with N, one must consider the more generic case of a variable address encoding, which has a variable size (log(N)) and associated lookup etc. costs as the number of nodes grows.

This is the motivation behind e.g. the "x32 ABI" in Linux: All the power of x86-64 instructions, with none of the additional cache pressure/overhead of 64-bit pointers - log(32) being cheaper than log(64).

...ahh, being this explicit in your Big-O notation is probably not that useful, usually, although I've seen it occasionally in papers (where they're quite explicit about also counting the number of bits involved). Maybe they're dealing with BigNum s, which would make it a practical concern? The key takeaway is this:

> That I use Big O to analyze time and not operations is important.

Time depends on compiler settings, allocation strategy, and a whole host of other factors that are outside the purview of your algorithm. Operations is a lot easier to contrast and compare between different algorithms, the meat of what you're trying to do most of the time. Both are valid choices, just know which one you're dealing with.

The time factors are good to be aware of, to be sure - the performance pitfalls of (potentially) highly fragmented, hard-to-prefetch linked lists over unfragmented flat arrays should be well known to anyone charged with optimizing code - but it's probably easier to think of them as some nebulous large time constant (as even array iteration is going to hit the same worse-than-O(N) behavior, although with proper prefetching the bottleneck may become memory bandwidth rather than memory latency) and deal with those differences with profiling and other measurements, instead of Big-O notation.

AstralStorm · on Aug 30, 2016

Sorry, but for asymptotic analysis pointer size is considered constant. So, even the smallest last uses B-bit pointers.

Feel free to introduce a packed pointer type that would be asymptotically faster, but then it is a different data structure.

MaulingMonkey · on Aug 30, 2016

Since when does a linked list mandate unpacked, fixed-width pointers? Citation needed! I note they are not called pointered lists, and challenge the assumption that they even need use pointers - or do you think this abstract computer science concept is something different when applied to disk storage formats by using file offsets as the link? Linked lists as a concept date back to ~1955, when RAM was very expensive - magnetic-core memory being just introduced. Think punch cards.

Returning back to the modern era - one also does not need a packed pointer type, merely support for multiple pointer sizes within the same process, and the ability to dynamically dispatch to the correct one for the job. And multiple pointer sizes within the same process are a matter of routine:

In the 16-bit C era, vanilla "near" and "far" pointers very clearly weren't the same size. A very similar situation occurs under the hood when dealing with 32-bit processes on 64-bit windows via WoW64, which can lead to some interesting complications if you use a 64-bit application to create a crashdump for a "32-bit" process, where you can actually inspect the 64-bit side of things too (fire up WinDbg and dig in: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3... )

This is to say nothing of 20+ byte member function pointers on 32-bit architectures because C++ is only out-crazied by C++ compilers. Or dealing with the complications of multiple pointer sizes via shared memory. Or dealing with file "pointers" (read: offsets) of varying sizes.

DanWaterworth · on Aug 29, 2016

> You'll know that iterating through a linked list is O(N), binary search is O(log(N)) and a hash table lookup is O(1). What if I told you that all of the above is wrong?

It's not wrong, it doesn't have enough contextual information to be right or wrong.

tantalor · on Aug 29, 2016

"Not even wrong" https://en.wikipedia.org/wiki/Not_even_wrong

Retric · on Aug 29, 2016

Ehh, his assumption is that O notation should say something at lower bounds, which is simply wrong. He should be testing 10-15+ GByte data structures not < MB data structures.

Consider, O(X * Log (X)) is generally thought of as good. But, let's assume the algorithm takes 24 hours + x * log(x) nanoseconds. Well for low values of X that 24 hours is going to be a pain and it's going to look like constant time, but for really really big data sets x * Log(x) will actually dominate.

kenjackson · on Aug 29, 2016

There's been a fair bit written on the topic. One of the better papers that has a parameterized model is here: https://www.computer.org/web/csdl/index/-/csdl/proceedings/f...

I should note that this paper is more than 25 years old. :-)

maker1138 · on Aug 30, 2016

It's amazing how many people didn't actually read all 4 parts of the article.

His argument has nothing to do with caching or prefetching, etc.

First, it's about random access. You can't prefetch a random fetch!

Second, he's measuring time, a perfectly valid thing to do. And the reality is when you lay your memory cells out in 2 dimensions it takes order of sqrt(n) time to fetch a random memory cell value, where n is the number of memory cells you're using.

Third, it turns out order of sqrt(n) time is the best you can do even if you had the best technology in the universe.

StillBored · on Aug 29, 2016

I guess the author is trying to simplify, but its way more complex than that. Simply assuming a few layers of cache completely misses all the other layers that have effects starting with.

Cache lines, RAM Read vs write turnaround, dram pages, number of open dram pages, other CPU's interfering with the same RAM channel, remote NUMA nodes, and probably some I'm forgetting. All this is very similar to secondary storage access rules (even for SSDs)...

AstralStorm · on Aug 30, 2016

Assuming each layer is at worst tree-like, but you can easily devise the upper bound.

StillBored · on Aug 30, 2016

Sure, but I think the point is that big-O notation fails miserably at analyzing certain algorithms because it doesn't have a way to represent the locality of the algorithm.

Put another way, all the little "constants" thrown away in the analysis may not actually be constants, and their non-constantness may be enforced with actual physics. In other words, like the article says, the idea that storage access times are constant is nonsense. Due to physical limitations, this is insurmountable rather than being a side effect of architecture. So, its quite possible that for certain algorithms the "constant" factors may be overriding terms in the analysis.

jlarocco · on Aug 29, 2016

The article is conflating theoretical algorithm analysis and low level implementation details.

Big O analysis is a theoretical measurement of algorithm performance. By definition it ignores details like memory access speed, the exact instructions used, and other details of specific hardware architectures.

Real life algorithm implementations obviously need to deal with those low level implementation details, but that doesn't change the theoretical analysis. It's easy enough to find (or design) machines without cache where this difference in memory speed doesn't exist.

arielb1 · on Aug 29, 2016

"RAM+arithmetic"-style complexity analysis in terabyte-scale-and-above problems is unable to distinguish between very practical and absolutely impractical algorithms. "square-root ideal cache hierarchy + arithmetic" has much better distinguishing power (for parallel programs, you also need to remember to bound throughput, e.g. Bernstein's area*time, for similar reasons).

falcolas · on Aug 29, 2016

I'm not sure the cost of accessing the storage medium belongs in the complexity of the algorithm, since that cost will change based on the storage medium, not the algorithm itself. It strikes me as more of a constant, (even though it isn't constant).

Still, interesting read, nontheless.

malisper · on Aug 29, 2016

In part two, the author demonstrates that if you packed information as densely as possible, i.e. your computer is a black hole, the best you could do is O(n^1/2).

falcolas · on Aug 29, 2016

If, and only if, you're accessing that memory randomly. If you're accessing it in a linear scan, that line is going to drop away from sqrt(N) very quickly.

yxhuvud · on Aug 29, 2016

Yeah, but that is not an argument not to include that in the analysis of how a function behaves when you throw lots of data into it. Also, the author comments this in one of the parts.

ddlatham · on Aug 29, 2016

I agree that in practice most of the time it won't affect your choice of algorithms. But it may be helpful in understanding how your run time will actually scale as your data size increases.

Also, I could conceive it being useful in odd cases if there are choices between two algorithms: Algorithm 1: O(N) memory access operations, O(N log^2 N) cpu operations Algorithm 2: O(N log N) memory access operations, O(N log N) cpu operations

If we assume memory accesses are O(1) time, then Algorithm 1 is faster, but if we assume memory access are O(sqrt(N)) then Algorithm 2 is faster.

falcolas · on Aug 29, 2016

Here's the problem, the sqrt(N) assumption is based on completely random access, where most accesses bust all the caches. What will matter more, when it comes to the cost of memory in real applications, is the actual size of your dataset in relation to the cache sizes, and whether your access is sequential across a region of memory (an array) or random (a scrambled linked list).

In practical terms, the plateaus of the stair step graphs matter more than the general trend of the line.

AstralStorm · on Aug 30, 2016

Unless you're concerned about worst case scenario, where the last part is definitely not a plateau ever. You can reduce it to a Turing machine fetching you data sequentially.

jimminy · on Aug 29, 2016

I find this really odd, it's not wrong, but it doesn't invalidate O(1). It's mashing two-things together that are unneccessary and can cause misunderstanding.

Big-O provides a decent tool for generic analysis and an understanding of access times of memory hierarchies. Since memory hierarchies can vary, they shouldn't be considered while doing generic analysis, much anyways.

Both are important to understand. The key thing is setting your Big-O access expectations to the slowest level of your heirarchy. In that way, your expectation remains generic and still proximally accurate across the average cases.

When you consider them together, think of the heirarchy as a series of piecewise functions that modify the value of the constant time based on the speed of the bounds that fit your data.

This square of N notation falls apart in other cases. 128GB's of RAM would have roughly the same access speed as the 8GB's he had available, if he had that much in his system. But having 128GB of RAM would completely destroy the squaring by flattening an entire magnitude from his hypothesis.

But it is a nice display of memory heirarchies, IMO.

benkuykendall · on Aug 29, 2016

If you go on to read the later parts, the author explains how physics limits random access to N bits to O(sqrt(N)) even without a memory hierarchy. This has practical implications when comparing sequential access in an array to random access in a hashmap.

exelius · on Aug 29, 2016

> This square of N notation falls apart in other cases.

Not particularly; since O(n^1/2) is typically used as a catchall for all O(n^x) where 0 < x < 1. The whole idea of Big O notation is that it provides an idea of systemic complexity that is normalized to a theoretical "constant" system represented by a coefficient (that is usually dropped when writing out the Big O).

IMO it still holds: distributed systems behave differently than local ones. Now that we're writing algorithms on the cloud, we've added an abstraction layer that adds complexity, and that brings some non-constant amount of overhead. It didn't make sense to think about memory like this 50 years ago; but it does now.

hacknat · on Aug 29, 2016

Nah. Sorry cache-misses don't count as part of a theoretical analysis on complexity. Why? Because you're getting into specific access pattern performance. Complexity is about "all things being equal". Is it the only thing you should consider? At first it should be, then if you run into a problem with a specific structure that has remarkable scale or access then go ahead and consider what the underlying hardware might be doing with the specific access patterns your structure is encountering.

It's interesting to see linked-list as his example, because it is the most likely to have cache-misses as you move through it as the allocations are very fragmented. I'd be very curious to see the same chart on a warmed-up hash-table.

Also, if we're considering the hardware, can we take into account pre-fetching and branch prediction? What are your numbers then? Yeah RAM is farther out then the local caches, but the CPU is also not completely ignorant of what it has to do next.

zeroer · on Aug 29, 2016

Did you read part II of the series? It's a deeper analysis than you seem to give him credit for.

His argument is that in this universe, the specific laws of physics we have here enforce the property that memory access takes o(sqrt(N)) time [1] where N is the size of your data set.

[1] - What's actually going on is that it takes o(N * sqrt(N)) time to touch N bits of information. Since for any particular bit, it might be very close to the processor and thus fast, but hitting N bits necessitates that most of the bits you access are far away.

hacknat · on Aug 29, 2016

I did read part II, and I think his analogy is off. Memory isn't linearly limited (it happens to be in the processor, but this isn't a given), I don't even understand why he brings circles into his conception of memory. Memory isn't linearly bound to the previous local cache. RAM isn't an order of magnitude larger than the L3 cache, it's MANY of order of magnitude larger! His analogy is just wrong. A better analogy would be library A takes X amount of time to access its books and can store 100 of them, and library B takes 10X amount of time to access its books, but it can store 1,000,000 books!

Yes as we increase in memory size latency goes up, but he never proves that this is sqrt(N) (he correlates that it is). Each jump in up can be explained by cache misses in each successive local cache, but RAM can scale more than a few orders of magnitude beyond the L3 cache. He needed to keep his chart going past 1GB to see that there is actually a plateau to be hit. If he had scaled to 10GB and then to 100GB he would have seen access times be about the same that they were for 1GB.

zeroer · on Aug 29, 2016

Re-read the "The theoretical limit" section in part II. His arguments depend on the physics in this particular universe, not how computers happen to be built.

hacknat · on Aug 29, 2016

My response was about his math and physics:

If you scale the radius of a sphere by one unit then you indeed get an order of magnitude increase in volume (bits of info), but that's not the correct model! We don't increase by on unit! RAM isn't a one unit increase in radius! It's an order of magnitude increase! If you increase the radius by an order of magnitude you get a 3 fold order of magnitude increase in memory. So you jump up in latency by one order of magnitude and you get 3 orders of magnitude of memory in return.

His analysis and math are wrong.

zeroer · on Aug 30, 2016

You're totally missing his argument. If you scale RAM cubicly at any positive density, as you are suggesting, eventually you will achieve the matter density sufficient for your RAM to gravitationally collapse into a black hole.

If the density of your RAM is d, then the volume that a mass of M RAM takes up is M/d. If it's arranged in a sphere, the radius is ((3M)/(4dpi))^(1/3). Notice how this radius is a constant times the cube root of M.

Whereas the Schwarzschild radius scales proportionally to mass.

https://en.wikipedia.org/wiki/Schwarzschild_radius

Thus, if you put enough mass of constant postive density together (no matter how small that density is) eventually you get a black hole.

N.B. - The author's argument is a little more subtle because he's talking about information density via the Berkenstein Bound, and he gets a square root instead of a cube root. But the argument is the same flavor.

Jweb_Guru · on Aug 30, 2016

You clearly did not understand the linked article at all. He is referring to theoretical results on the information density of a black hole, which indeed collapses to area. In practice, the reason RAM doesn't appear to increase by three orders of magnitude is not that (it's that we don't actually build RAM in a sphere) but his point is that whether you are looking at theory or practice the square root is the correct model.

DanWaterworth · on Aug 29, 2016

> Sorry cache-misses don't count as part of a theoretical analysis on complexity.

They don't in the random access machine model, but there are other models where cache misses do factor in.

mmalone · on Aug 29, 2016

Yes. This is what I came to say. Big-O assumes a model of computation. Typically people assume a random access machine with certain constant time operations like arithmetic. In some models arithmetic might not be O(1) (e.g., in a pure lambda calculus it's O(n)). A random access machine model makes a bunch of simplifying assumptions that don't actually hold in practice. As the article suggests, if you're writing code in industry it's not a bad idea to understand how your specific hardware differs from the model.

geophile · on Aug 29, 2016

Why is this wrong-headed discussion top-rated on HN?

And why is there so much misunderstanding on HN of big-O notation wrt cache misses lately?

All you kids, get off my lawn.

jandrewrogers · on Aug 29, 2016

Closely related but unfamiliar to most software geeks, Bélády's work in the 1960s and later on the theoretical limits of operation throughput when using cache hierarchies is very relevant to high-performance software design. The theory generalizes nicely to any topology where you can control how access latencies are distributed, and carefully designed software can get relatively close to the throughput limits (though it is somewhat incompatible with the way most software engineers design systems these days e.g. multithreaded concurrency is a non-starter).

infinite8s · on Aug 30, 2016

Do you think something like lightweight modular staging (https://scala-lms.github.io/) would allow fairly high level code to get to the theoretical limit?