When Big O Fools You (jackmott.github.io)
205 points by matthewwarren on Aug 23, 2016 | 130 comments



This article gets a few things right, but it also skips over some important points.

Namely, the amortised time complexity of dynamic lists.

Amortised analysis treats operations not as single events but looks at the time complexity over the span of many operations (via the so-called accounting method). An initial "investment" of array over-allocation is paid off by subsequent inserts, but only over time. Inserting into an ArrayList is amortised O(1) given enough insert operations.

Essentially, you over-allocate in order to save time: you're trading space for time.
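To make the amortisation concrete, here is a minimal sketch in C++ (not the C# the article uses) of a dynamic array that doubles its capacity; IntVec and the growth factor of 2 are just illustrative choices, not how any particular standard library does it:

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Minimal dynamic array of ints that doubles its capacity when it runs out
    // of room. Across n appends each element is copied O(1) times on average,
    // so append is amortised O(1) even though an individual append can be O(n).
    struct IntVec {
        int*   data = nullptr;
        size_t size = 0;
        size_t cap  = 0;

        void push_back(int v) {
            if (size == cap) {                      // out of room: grow geometrically
                size_t newCap = cap ? cap * 2 : 4;  // doubling; any factor > 1 works
                int* bigger = (int*)std::malloc(newCap * sizeof(int));
                std::memcpy(bigger, data, size * sizeof(int));  // O(size) copy, but rare
                std::free(data);
                data = bigger;
                cap  = newCap;
            }
            data[size++] = v;                       // the common case: O(1)
        }
        ~IntVec() { std::free(data); }
    };

    int main() {
        IntVec v;
        for (int i = 0; i < 1000000; ++i) v.push_back(i);  // only ~18 reallocations
        std::printf("size=%zu cap=%zu\n", v.size, v.cap);
    }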

A really good explanation of amortised analysis, which explicitly covers dynamic arrays, can be found on Wikipedia:

https://en.wikipedia.org/wiki/Amortized_analysis#Dynamic_Arr...

As always, a look into the source code of your language of choice helps. In Python, a list object over-allocates using a very peculiar, but finely tuned formula: https://github.com/python/cpython/blob/09dc3ec1713c677f71ba7...


The nuance you're missing here is that the author is always inserting at the front of the arraylist. So even though the list grows dynamically, it still needs to move every value one spot over.

e.g. arraylist: [0][1][2][3][ ]

Even though there is space left in the array, we still need to move all the values one index to the right to be able to insert at the front, which takes O(n) time. In contrast, inserting at the front of a linked list is simply a matter of moving pointers and therefore constant time.
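A rough sketch of the difference in C++ (rather than the article's C#); array_push_front and list_push_front are illustrative names, and the array version assumes spare capacity already exists:

    #include <cstdio>
    #include <cstring>
    #include <initializer_list>

    // Front-insert into an array that already has spare capacity: every existing
    // element still has to shift one slot to the right, so it is O(n) writes.
    void array_push_front(int* a, int& n, int v) {
        std::memmove(a + 1, a, n * sizeof(int));  // shift [0, n) up to [1, n+1)
        a[0] = v;
        ++n;
    }

    // Front-insert into a singly linked list: one allocation and two pointer
    // writes, O(1) regardless of how long the list is.
    struct Node { int value; Node* next; };

    Node* list_push_front(Node* head, int v) {
        return new Node{v, head};
    }

    int main() {
        int buf[8] = {0, 1, 2, 3};                // capacity 8, 4 slots used
        int n = 4;
        array_push_front(buf, n, 9);
        for (int i = 0; i < n; ++i) std::printf("%d ", buf[i]);   // 9 0 1 2 3
        std::printf("\n");

        Node* head = nullptr;
        for (int v : {3, 2, 1, 0}) head = list_push_front(head, v);
        for (Node* p = head; p; p = p->next) std::printf("%d ", p->value);  // 0 1 2 3
        std::printf("\n");                        // (list nodes leaked; fine for a demo)
    }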


Some languages, such as Perl, start the array in the middle to avoid exactly this situation.


Can't you just imagine it is reversed, where the_array[num_things-1] is the first element?


Yes, then it would be a discussion about append performance instead of insert performance. Might be handy to do if you know that all the changes to an array will be inserts at the front, then you can just write the array in reverse and also read in reverse.

But it is helpful to keep the terminology clear, because an insert in the middle of a linked list is still O(1), but inserting into the middle of an array will require moving some fraction of the data.

The OP is discussing inserts at the start just because it is the worst-case scenario for arrays; a positive result will still hold even if some of the inserts are in the middle or near the end.


> But it is helpful to keep the terminology clear, because an insert in the middle of a linked list is still O(1), but inserting into the middle of an array will require moving some fraction of the data.

This is true if you already have a reference to the middle of the list. If you don't (say, because you want to insert while preserving the fact that the list is sorted), then inserting into a linked list is O(n) just like the array is.


Memory reads and memory writes are not the same. Linked list insert in the middle needs to do only O(1) writes. But because linked lists preclude memory prefetch, they should not be used these days.


Afaik, if you read in reverse you will miss out on cache prefetching, since it only works forward.


Intel's cache prefetchers will detect forward, backward, and some odd movement patterns. It has not been forward-only prediction for a very long time.


Do you know how the popular ARM based chips fare in comparison?


Stay strong.


The awful thing about writing benchmarks is that if you don't give the worst case example, people give you shit for coming up with metrics that say whatever you want to say.

If you give metrics for the case that should be worst for your argument, people give you shit about using such a stupid example.

There's no way to avoid someone giving you grief no matter what you do.


Um, give both?


Then both groups agree that your examples are terrible ;)


This point should be understood very clearly: the article isn't wrong that an ArrayList has O(n) for a single insertion. However, the article misses that amortized analysis shows the same structure can be made to take O(n) total time for n insertions. In effect, ArrayLists can be thought of as having O(1) insertion time in practice, meaning that all you are comparing are the differences in constants between the two structures.

I think the biggest takeaway from this isn't that big-O can fool you (which it very well can), but that misunderstanding analysis can lead to poor decisions.

A better argument for big-O fooling you can be made by comparing Quicksort against any guaranteed O(n log n) sorting algorithm. Quicksort will outperform most of them in practical cases, but is technically O(n^2) in the worst case (average case O(n log n)).


Isn't that an argument for average case performance being a better predictor of average cases than worst case performance?


Well, except quicksort's worst case is very easy to hit. Hence why you don't use it in its CS101 form.


Amortising allocation won't help the article's example, which is insertion at the front, unless you have a quite unorthodox implementation of your array list (e.g. implementing something like a deque using a circular buffer).

The Python growth allocation doesn't look peculiar to me. It's just calculating exponential growth without floating point operations. Whether you over-allocate in 50%, 25%, 12.5%, or whatever proportional increments (all trivial to calculate using bit shifts), they are all enough to make dynamic growth of the array O(n) over time. Any exponential growth will do. If anything, Python's list growth looks to me like it trades off slightly too much time to save space. Actual results will vary depending on realloc implementation etc of course.


Would be interesting to see the performance of Python's over-allocation algorithm compared to a simpler one like Lua's, which just over-allocates to the next power of two (so a list with 4 items that needs to grow jumps to 8, then 16, then 32). I can't imagine the bitwise operations adding up to that much time in Python, but Lua needs to grow much less often the larger the list gets (after 7 iterations Lua has room for 64 items, Python only 46 according to that link). Interesting choices; I wonder which optimizes for what.

Would love to get Ierusalimschy and van Rossum in a room together and let them talk it out.
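For a rough comparison of the two schedules in the meantime, here is a small C++ sketch; python_like_growth is only an approximation of the formula in the linked listobject.c (not the exact CPython code), and next_pow2_growth stands in for the Lua-style scheme described above:

    #include <cstddef>
    #include <cstdio>

    // Rough approximation of the over-allocation in the linked listobject.c:
    // grow to newsize + newsize/8 + a small constant.
    size_t python_like_growth(size_t newsize) {
        return newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
    }

    // Lua-style growth: round the required size up to the next power of two.
    size_t next_pow2_growth(size_t newsize) {
        size_t cap = 1;
        while (cap < newsize) cap <<= 1;
        return cap;
    }

    int main() {
        // Simulate appending one element at a time and print each capacity jump.
        size_t capPy = 0, capLua = 0;
        for (size_t n = 1; n <= 64; ++n) {
            if (n > capPy)  { capPy  = python_like_growth(n); std::printf("py:  n=%2zu cap=%zu\n", n, capPy); }
            if (n > capLua) { capLua = next_pow2_growth(n);   std::printf("lua: n=%2zu cap=%zu\n", n, capLua); }
        }
    }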


Even with only 25% growth per realloc, you still get only a logarithmic number of reallocations (just with a higher constant, because there is more actual copying).


I think what you are saying is only true if you insert at the end of the ArrayList (or at the beginning, if you have one that leaves open space at either end).

If you insert into arbitrary positions, or the front of these particular array lists, it will be O(n) even ignoring the growth issue, because every element has to be copied over 1 spot.


Insertion at the head of a dynamic array makes the amortized argument moot, though, since every element in the array needs to be copied on every insert, whether a new array needed to be allocated or not. Unless you're concerned about the time complexity of the memory allocator?


In not so many words, the article is pointing out that O(n) + O(1) is not necessarily quicker than O(n) + O(n): both add up to O(n) (since only the highest-order term matters), and the constant factor is unsurprisingly better for array lists.

He's not benchmarking an O(1) operation against an O(n) one, or anything surprising, or even pointing out a "hidden" O(n) operation; it's simply a demonstration that O(n) + O(1) = O(n).


Of course this was a single example.

The main point is that Big O notation ignores cache misses.

For example:

- try writing a quicksort as a template to avoid the comparison function calls

- in the sorting loop, switch to an insertion sort when buckets have <= 32 elements

You will see speeds 2 to 3 times faster than a vanilla implementation of quicksort.

Although this approach is slightly more complex at the abstract level, it's faster because it avoids a lot of cache misses.
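A sketch of that kind of hybrid in C++ (templated comparisons, insertion sort below a small cutoff); the pivot choice and the cutoff of 32 are illustrative, not tuned values:

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Templated insertion sort: very cache- and branch-friendly on tiny ranges.
    // Sorts a[lo..hi] inclusive.
    template <typename T>
    void insertion_sort(T* a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; ++i) {
            T key = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > key) { a[j + 1] = a[j]; --j; }
            a[j + 1] = key;
        }
    }

    // Templated quicksort (comparisons are inlined, no function-pointer calls)
    // that hands small partitions off to insertion sort.
    template <typename T>
    void hybrid_quicksort(T* a, int lo, int hi) {
        if (hi - lo + 1 <= 32) { insertion_sort(a, lo, hi); return; }
        T pivot = a[lo + (hi - lo) / 2];          // middle element as pivot
        int i = lo, j = hi;
        while (i <= j) {                          // Hoare-style partition
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        if (lo < j) hybrid_quicksort(a, lo, j);
        if (i < hi) hybrid_quicksort(a, i, hi);
    }

    int main() {
        std::vector<int> v(1000000);
        std::mt19937 rng(42);
        for (int& x : v) x = (int)(rng() % 1000000);
        hybrid_quicksort(v.data(), 0, (int)v.size() - 1);
        std::printf("sorted: %s\n", std::is_sorted(v.begin(), v.end()) ? "yes" : "no");
    }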


Surely the main point is that

    O(n) + O(n) = O(2n) = O(n) = O(n + 1) = O(n) + O(1)
and the constant factors left out of O() often dominate execution time.

Cache performance is just one example of a constant factor, right?


Cache performance is not constant, if it was, we wouldn't even consider it when optimizing.

The point is to use data that was recently fetched into the (faster) cache memory as much as possible, instead of incurring the penalty of a cache miss.

Cache performance really depends on memory access patterns.

Anyways :) I'm being pedantic here, I should probably go back to work.


Memory access time is a constant multiplier, whether it hits or misses the cache. There can be a big difference in that constant depending on the access pattern, but it is still considered a constant factor.

Yes, random access (missing the cache every time) may be 100x slower than sequential (hitting the cache almost always), but if you're iterating through an array twice as large, it will still be 100x slower.


I agree, on the same machine, the slow-downs are constant for each level of cache and for ram access.


One of my favorite ACM papers was an analysis of how the asymptotic runtime behaviors of various sort algorithms didn't tell the full story, and how the 'faster' algorithms took a pretty big data set before they even broke even with some of the other sort algorithms. In fact, I think at 10,000 records the asymptotically better algorithm was still about 30% slower than a 'worse' one, and you had to sort a considerable amount of data before it was 'best'.

I don't know about you, but I don't sort 100k records in a single batch, and if I am it's because I messed up. But I might sort different batches of 100 records 1000 times a minute.


Not necessarily. You might want to create an algorithm that iterates in a sequential fashion through memory, since your memory controller will see the iteration and prefetch the next block of memory. This is a huge performance win on larger data structures.


Also, Big-O notation is a theoretical construct for thinking about complexity that works well for academic proofs but glosses over many real world considerations. Those ignored constant factors can be large and the datasets are not always large enough to amortize their cost.


Furthermore, amortization isn't always accounted for in Big-O. Inserting at the head of a linked list is O(1), and for a vector it is O(n). But on modern hardware vector inserts are much faster.

If your software engineering begins and ends with Big-O notation, you're likely doing a terrible job. Measure, and test. Always.


I would say, "If your software engineering begins and ends with [jargon] you're likely doing a terrible job. Measure, and test. Always."

As a field, do we measure and test the effect of using various programming paradigms, patterns, and coding standards? I don't think most companies do this. Most developer decisions are made by "gut." As a field, we are still more like the "before" of Moneyball than the after.


There's a third option: analysis and understanding.

This seems to be entirely forgotten by the "measure, test" crowd: the scientific method requires a hypothesis to test for measurements to be meaningful. Separately, not all things that are produced are "tested" in every aspect, since many things are well established, and applying that existing theoretical framework makes measurement and testing a waste of time.

Having a theoretical understanding of the thing you care about is far more important than the "measure, test" step - which is certainly important, but by itself of really limited power.


The programming field isn't lacking in typical programmers with hypotheses. What it's lacking are typical programmers who are checking their hypotheses.


If you take the word hypothesis at its meanest, and ignore everything else I said, sure. But a hypothesis that is grounded in an absence of analysis or understanding of the situation, or an appreciation of the existing body of literature related to a topic, or any other informing principle, is of limited value.

i.e., it's not about checking your hypothesis, it's about checking the method by which you formed your hypothesis.

Checking your hypothesis is fine, but of value only to reject, not to actually find a hypothesis you can accept.


Many would rate Aristotle as a 1st rate thinker. Too bad he didn't check his hypotheses more.


> the scientific method requires a hypothesis to test for measurements to be meaningful

That's the simplistic grade school version. In practice, characterization and measurement is part of a feedback cycle that is used to develop and refine hypotheses. Experiments need a hypothesis to test, but measurements do not require experiments. Where the "measure, test" crowd usually goes off the rails is in not entering that scientific feedback cycle after the initial measurements and instead jumping to conclusions based on gut feel interpretation of them.


I deleted from my comment an explicit calling out of this feedback loop, as it was implied, I thought, by the statement that the measurement and testing were both important, but of limited value by themselves.

However I disagree that measurements have any meaning unless they are conducted as experiments, or potentially as exploratory work looking for an experiment to conduct. Even such exploratory work needs to be driven by at least the loosest of hypotheses, since measuring everything is infeasible. So this is just as much an experiment, just one you don't have a high expectation of predicting the outcome of.

Either way, this recognition of the necessity of at least a minimal hypothesis in all of these scenarios is typically lacking is my fundamental point.

The cycle of revision of that hypothesis is clearly the next step and, as I say, I thought clearly implied.


> However I disagree that measurements have any meaning unless they are conducted as experiments, or potentially as exploratory work looking for an experiment to conduct. Even such exploratory work needs to be driven by at least the loosest of hypotheses, since measuring everything is infeasible. So this is just as much an experiment, just one you don't have a high expectation of predicting the outcome of.

> Either way, this recognition of the necessity of at least a minimal hypothesis in all of these scenarios is typically lacking is my fundamental point.

That's an overly broad definition of hypothesis that results in every decision-making process fitting the model. Even the "measure, test" people you were complaining about have that. They formulate a "minimal hypothesis", take measurements, fix what they see, and call it a day. What they don't do is iterate on the process and get down to the root causes.

Furthermore, in some closed systems measuring everything (at some level of granularity) is feasible. It leads to a troubleshooting cycle of continuously focusing the measurement towards an indeterminate point, with the specific point being the answer. This kind of fault isolation is much more productive, in general, than continuously developing and testing hypotheses about what specific things might be an issue.


I am not buying your distinction between measurements and experiments - experiments are measurements with a purpose.

You don't have to start with measurements. You can often do some analysis before you have anything to test, and that may save you a lot of time measuring, testing and tinkering with something that doesn't have a prayer of working. If you are measuring the environment that the system will interact with, you have already done some analysis in determining what to measure, and developed some theories about what is going to matter; that is hard to do without some causal reasoning. If your tests show you have a problem, you need to think about what is going on in order to find a solution.

There's a curious anti-intellectualism in software development that is opposed to any suggestion that thinking about problems is useful. It seems to go hand-in-hand with a simplistic world view that only sees dichotomies, together with the assumption that there can only be one answer.


> I am not buying your distinction between measurements and experiments - experiments are measurements with a purpose.

I'm not making one. I'm saying that measurements don't require an experiment to be meaningful.

> You don't have to start with measurements.

I didn't say you did, I said you can. The person I was responding to claimed otherwise.

> There's a curious anti-intellectualism in software development that is opposed to any suggestion that thinking about problems is useful. It seems to go hand-in-hand with a simplistic world view that only sees dichotomies, together with the assumption that there can only be one answer.

As someone who constantly rails at the lack of engineering rigor in software "engineering", I agree completely.


Hell, let's go a step farther. If your software engineering begins and ends with jargon, you're likely doing a terrible job.

(the business people always pick the worse solution that sounds better unless you can explain why they should care, in english, meant for normal human beings, without sounding condescending about it)


What's worse, the field is constantly changing. What was true on one architecture may not be true on another, and even different generations of a single architecture can change a lot.


A cache miss can be a slowdown of ~200x on modern CPU architectures. Let's say you have a choice between an O(n) algorithm that always misses the cache and an O(n lg n) algorithm that never misses the cache.

The crossover point where it makes sense to use the O(n) algorithm will occur when:

200 * n = n lg n

which occurs for n > 2^200. Most of us are not dealing with problems larger than can be represented in our universe, so cache efficiency matters, kids.

Note that this changes for other comparisons: a cache-efficient O(n^2) algorithm beats a maximally cache-inefficient O(n) one only for n up to about 200, and beats a cache-inefficient O(n lg n) only for n up to about 2223.

Of course, in practice your algorithm won't be missing the cache on every access, so reality is somewhere between these values.
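If you want to sanity-check the arithmetic, here is a tiny sketch that assumes the flat 200x penalty above (a big simplification, obviously):

    #include <cmath>
    #include <cstdio>

    // Crossover points for the comparisons above, assuming every cache-missing
    // access costs a flat 200x.
    int main() {
        // 200*n vs n*lg(n): equal when lg(n) = 200, i.e. n = 2^200.
        std::printf("200n = n*lg(n) at n = 2^200 ~= %.3g\n", std::pow(2.0, 200.0));

        // 200*n vs n^2: equal when n = 200.
        std::printf("200n = n^2 at n = 200\n");

        // 200*n*lg(n) vs n^2: equal when n = 200*lg(n); find it by searching.
        double n = 2.0;
        while (n < 200.0 * std::log2(n)) n += 1.0;
        std::printf("200*n*lg(n) = n^2 at n ~= %.0f\n", n);
    }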


Make Sure the Abstraction is Worth It

Yes! All abstraction has a price. This is what people really mean when they say "all abstractions leak." Basically, all code and runtime features have a cost in terms of developer resources, CPU, memory, etc. The "inner game" of development isn't being able to muster huge amounts of "cleverness" power. The "inner game" is being able to put whatever resources you have to the best possible use.


This isn't true at all. What's the cost in developer resources, cpu, or memory of making a complex number class in C++ a template over floats and doubles? How about the cost of using Rust's generics to make a parser that can parse from any linear source of bytes (e.g. both files and in memory byte arrays)?

There's plenty of abstractions which don't actually cost anything, and there's room for even more.


What's the cost in making a template more generic? Greater compile time. Need for compiler that can handle generics. Greater number of functions in binaries, and thus more unique places for bugs to occur. Dynamic library bloat. Harder to understand code. Quirky edge cases. Harder to learn languages. Harder to parse languages. And of course, developer time is spent dealing with all of this.

Abstractions leak because they are models, and thus are necessarily different from the concrete things that they model. These differences can and will cause problems. Generic programming is an abstraction that pretends you aren't working on a low-level, un-typed system. And you pay for that abstraction any time you need to deal with low-level demands through it.


> Greater compile time.

Not really with newer generics implementations (.NET, Rust, etc.). I don't think anyone thinks C++ templates are ideal, and the others show this is at least a non-essential cost.

> Greater number of functions in binaries, and thus more unique places for bugs to occur.

Less human-written code, so this reduces the surface area by moving it into the compiler; which is an "issue" no matter how low level your code is.

> Dynamic library bloat.

Your alternatives are A. duplicate the classes yourself B. let the compiler duplicate them. Where's the bloat? C++ requires you to explicitly instantiate templates if you want them in a library, and usually they're just left to the header. The others I listed won't generate anything unless you actually use a particular instantiation.

> developer time is spent dealing with all of this

It's pretty hard to argue that more developer time is wasted by the compiler writer than that of the time saved for developers who use the language for such a broadly used feature.

> Abstractions leak because they are models, and thus are necessarily different from the concrete things that they model.

What concrete thing do generics model? Tediously copy-pasted code? These will only be different if you mess up when you do the copy-pasting.

> Generic programming is an abstraction that pretends you aren't working on low-level un-typed system.

What? It's an abstraction over an already typed system that obviates the need to duplicate code, which after the generics are unwound produces the same code you would have had anyway. You're just as far or close to the low-level un-typed system as you were before.

"All abstractions leak" is a nice platitude, but handwaving about "models" and "concreteness" doesn't prove anything.


You seem to be misunderstanding the generality at which I'm speaking. Take this:

>It's pretty hard to argue that more developer time is wasted

That's NOT what I'm arguing. I'm saying that it takes a different amount of time. Hence the concept of "trade-offs", AKA "cost."

With that said...

>Your alternatives are A. duplicate the classes yourself B. let the compiler duplicate them. Where's the bloat?

C. Monomorphise at runtime.

>Less human written code, so this reduces the surface area to the compiler; which is an "issue" no matter how low level your code is.

Yes, but it's a different issue. Again, there are trade-offs.

>What concrete thing do generics model? Tediously copy-pasted code?

No. They model the behavior of the algorithm as it physically exists in a machine.

>What? It's an abstraction over an already typed system that obviates the need to duplicate code, which after the generics are unwound produces the same code you would have had anyway. You're just as far or close to the low-level un-typed system as you were before.

Generics are part of your language. They do not abstract over the language. They abstract over the behavior of your program. It's just a feature of the language that does so with greater generality than the rest of the language.

And the code you "would have written anyway" is machine code. Which you are abstracting over. Generics allow you to pretend your algorithm doesn't need a fixed machine representation. The fact that you can produce two different machine representations for the same template code is an example of what I'm talking about, not a counter example. The abstraction leaks. You write one algorithm, and it has to be compiled into two different representations. That's the cost of the abstraction. You get increased generality at the cost of memory and time.


> Generics are part of your language. They do not abstract over the language.

Most generics systems allow the code to be unfolded into generic free code; this is how templates are usually compiled in C++. So in a material sense the compiler treats templates as syntactic sugar added over C++ without templates (which is a valid language, which the compiler uses internally).

> C. Monomorphise at runtime.

A lot of generics (like my complex number example) can't be monomorphized at runtime since floats and doubles don't have dynamic dispatch in these languages. Not to mention the loss of type safety even if you implemented it this way. So that is not equivalent.

> And the code you "would have written anyway" is machine code.

That's not analyzing the cost of generics, that's analyzing the cost of the whole language, and then putting it on generics. If that's fair, why wouldn't we add the cost of designing and operating computers instead of doing it on paper? There's nothing special about machine code here.

> The abstraction leaks. You write one algorithm, and it has to be compiled into two different representations. That's the cost of the abstraction.

If (as in my example before) you need complex numbers with floating point numbers, and complex numbers with doubles, the compilation will likely be faster because the compiler doesn't need to take two parallel implementations through the whole parsing process, and can instead generate the necessary ASG itself. If you use on only one, then maybe there is a detectable differential cost.

Also, how is that even a leak? That's the abstraction working absolutely perfectly and giving you exactly what you intended.


I agree with the fundamental point in the article, but isn't it well known that when you consider multiple levels of the memory hierarchy as a whole, big O notation needs to be modified to take into account the relative cost of access?

I think the problem is not so much the big-O notation, but that the underlying assumption that the data access is from an "all main memory" model with no cache or secondary storage is often forgotten.

For example, in database algorithms, all memory operations are often considered to be unit cost, since they are an order of magnitude cheaper than accessing disk storage.


Big O notation is typically counting comparisons, and ignores constant factors. It isn't so much that the memory hierarchy is ignored as that comparison growth (in n) gets equated with performance.

The field of cache oblivious algorithms is focused explicitly on memory hierarchy, and accounts for cache misses.


It's typically not counting comparisons; there are none when inserting elements. But sometimes it pretends that comparisons are O(1).


That depends on whether you expect the data to already be in memory or not. If it's already in memory, you would care a lot about cache locality. And in general, you don't want to spuriously lose on some benchmark if there's anything you can do about it.


Big O isn't fooling you, it's giving you a one-function explanation of how the algorithm runs on an idealized system as the size of the problem increases. There's other systems you can simulate for big O, but the math is much harder and you wouldn't generally use it unless you're doing something complex like a cache-oblivious algorithm.

If you're using solely big O to decide on an algorithm, you're fooling yourself. If performance is an issue, profile, benchmark, study, don't guess.


Big O notation is asymptotic. Using only 5 insertions to understand Big O is definitely the wrong way. For example, insertion sort works better when the array is small, but insertion sort is definitely O(n^2), worse than qsort.

Don't let Big O notation fool you, don't misunderstand Big O.


"Inserts" in that chart is the ratio of inserts to reads.

5 means "write five times then read once per iteration" not "write five times".


> The result is that when you iterate over contiguous memory, you can access it about as fast as the CPU can operate, because you will be streaming chunks of data into the L1 cache.

This is not true at all. You'll be able to access it about as fast as the memory can operate - it'll still be much faster than randomly chasing pointers all over the place - but the maximum processing speed of the CPU is an order of magnitude greater than that again.


A lot of the focus in the HN comments is on ArrayLists, but I'm curious why the author chose to give linked lists an O(1) insert time. In some implementations (doubly linked list, inserting at the head of the list) I can see the O(1) time, but when appending to the end it also is O(n) because one has to traverse the entire list of elements before updating the link at the end of the list. Might just be nitpicking here but that might also be affecting the author's results.


Just keep a pointer to the head and tail of the list. You can then append to either end of the list in O(1). When you put a new node at the tail end, use the tail pointer to get to the tail node instead of walking the whole list. Of course, you need to make sure the head/tail pointers are always updated to point to the current head/tail of the list.
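A minimal sketch of that in C++ (illustrative names, no error handling):

    #include <cstdio>

    // Minimal singly linked list that tracks both head and tail, so adding at
    // either end is O(1): no walk over the existing nodes is needed.
    struct Node { int value; Node* next; };

    struct List {
        Node* head = nullptr;
        Node* tail = nullptr;

        void push_front(int v) {
            head = new Node{v, head};
            if (!tail) tail = head;       // first element: tail == head
        }
        void push_back(int v) {
            Node* n = new Node{v, nullptr};
            if (tail) tail->next = n;     // link after the current tail
            else      head = n;           // list was empty
            tail = n;
        }
        ~List() { while (head) { Node* n = head->next; delete head; head = n; } }
    };

    int main() {
        List l;
        l.push_back(2); l.push_back(3); l.push_front(1);
        for (Node* p = l.head; p; p = p->next) std::printf("%d ", p->value);  // 1 2 3
        std::printf("\n");
    }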


The question I always ask before replacing an n^2 process with an nlg(n) one is "Am I in that window where n^2 is actually faster?"


Ported to C++, added two ATL collections, also added 100M elements data point:

https://github.com/Const-me/CollectionMicrobench

Arrays are still generally faster than lists.

The funny thing is Microsoft’s linked lists are faster than C++ standard vectors.


>The funny thing is Microsoft’s linked lists are faster than C++ standard vectors.

If I had to guess, it's because the std::vector is more conservative in memory use and it causes more malloc/array copy calls.


I think the main reason is that the CAtlList class encapsulates its own memory pool. It allocates RAM in batches. The default batch size for CAtlList is 10 elements per batch, user-adjustable in the constructor, but I kept the default value of 10.

The elements are created directly adjacent to each other. This makes iteration faster because of RAM locality, despite the pointer-based data structure.


There's a simpler and broader point: don't use Big O as the sole means of analysis of a high level language's data structures. The theoretical time/space complexity of a data structure may or may not accurately reflect how it's actually implemented in that language.


I just read this nice piece on linked lists and why they have almost no place in programming nowadays

http://cglab.ca/~abeinges/blah/too-many-lists/book/


Arrays are fast for getting random elements and inserting/deleting at the end (amortized), but slow for inserting/deleting in the middle.

Linked lists are fast for inserting/deleting/getting at the beginning and end, but slow for random access for anything in the middle.

So far, that's what the article covered. Now if you want fast random insertions/deletions and fast random access, this is possible with balanced trees. A tree-based list can support all single-element operations in O(log n) time. Sample code: https://www.nayuki.io/page/avl-tree-list


That's not what the article is about, nayuki. The author of the article makes and proves the claim that arrays are faster than linked lists regardless of the insertion point. The metrics presented clearly show this.

The reason for this apparent anomaly is cache locality (there are other reasons, which I will not go into right here).

Balanced trees can be pretty slow in fact. After some operations, a tree structure can become quite fragmented in memory, leading to many cache misses. In my experience, it is often faster to work with arrays instead of trees when processing in memory. However, using external storage, a b-tree often introduces quite some performance gain.


Here's a question: How often does this matter? No, not that big O can fool you. That always matters.

How often does it matter that non-contiguous memory access is slow? Really. How much do those few microseconds really matter? In most apps, I would guess that a CPU cache miss isn't noticeable by humans.

Yes, non-contiguous structures are significantly slower, but if you don't need to be as fast as possible, eliminating them for that reason only (assuming that there is any non-perf reason to stick with them) is a premature optimization.

But if you ARE optimizing, yeah, you need to think about how often worst-case occurs. Because big O only tells you about worst-case. NOT the average.


A cache miss isn't noticeable by a human. Code that cache misses a lot runs 10-100x slower than code that takes into consideration that it's running on a physical machine and not an abstraction. That is very noticeable by humans. Even when your data structures and algos are designed with a nice O(logN), it's very noticeable when one program bogs down with 1/6 the data compared to another.

I work in games, so the story I tell new kids is: The PlayStation2 ran at 300Mhz and a cache miss would cost you 50 cycles. The PlayStation3 ran at 3200Mhz and a cache miss would cost you 500 cycles. So, if you are cache missing a lot, your PS3 game will run as if it were on a PS2.

In other words, not paying attention to the cache makes your computer run like it's 10 years older than it is. You paid for a modern machine, but you are getting the results of a Craigslist junker. This is true outside of games. It's the reason 4x2GHz cellphones struggle to run seemingly simple apps. It's a big part of the reason people struggle to orchestrate armies of servers (distributed computing is easier when it's 90% less distributed).

Is it really harder to work with the cache system instead of ignoring it? Yeah, it requires a tiny bit of study and a little bit of planning. In contrast, the theme I see a lot online is to completely dismiss physical reality in favor of theory. And the theme I see almost universally in the students (and many senior engineers) I interview is complete ignorance of the very existence of the cache and its effects on the code they write. It's very concerning...


No, I don't deny it's important to know that cache misses exist, and what they can do: to the contrary, it's vital.

However, in 90% of applications, it's not going to matter, because those applications are spending hundreds of cycles waiting anyways: Disk or network IO, user input, all that stuff is way slower than a cache miss. If you're writing a video game, or a database, or other software with very high soft-realtime speed requirements, or heavy data access, by all means, optimize to avoid cache miss.

But if you're writing a company-internal Rails app, nobody's going to notice, even if you're getting cache miss after cache miss. Which you probably won't.

Actually, if your language isn't compiled, a cache miss is the least of your worries, perf-wise.

And now I've got to see if I can optimize my code to avoid cache misses. But the code's in Scheme, so unless the initial access cost is amortized, I'm already doomed...


You're looking at this problem backward. For example, you mention user input. Users may need a second to click or touch a button, but when they do the software should react instantly, and that does not leave you that many cycles. My smartphone's lock screen is my go-to example: most times it fails to follow my finger, and I barely have anything running on it.

Most of the dynamic languages are data and instruction cache-miss machines. They chase objects and pointers all around the memory.


>My smartphone's lock screen is my go-to example: most times it fails to follow my finger, and I barely have anything running on it.

...That doesn't sound like a cache miss. Knowing Android, a cache miss is probably the least of your worries.

>Users may need a second to click or touch a button, but when they do the software should react instantly, and that does not leave you that many cycles.

You raise a good point...

>Most of the dynamic languages are data and instruction cache-miss machines. They chase objects and pointers all around the memory.

...and this is part of my point. If you look at a problem and think, "A high-level language is fast enough," then you are implicitly saying that the latency of a cache miss is acceptable. And IME, in most cases that's true. I mean, heck, I'm using Scheme, so while I may have pointer chases like the Amazon has trees, I CAN optimize them into array lookups, and my code is compiled: not great, but better than most HLLs.

It's the same argument as always: perf vs. development speed. You can be in the C and FP loop, or the Lisp and JS loop.


> That doesn't sound like a cache miss. Knowing Android, A cache miss is probably the least of your worries.

My example was meant to illustrate the user input problem. From what I know about Android, the abysmal performance is very much a case of "death by a thousand cuts".

> It's the same argument as always: perf vs. development speed. You can be in the C and FP loop, or the Lisp and JS loop.

The fast(er) languages we have are old and full of warts, and that makes them slow to develop in. The heavily used HLLs such as Python and Ruby were made by people who did not care much (at all?) about performance, and it shows in many design decisions. But here's the thing: we could have both at the same time. I don't buy this dichotomy.


>But here's the thing: we could have both at the same time. I don't buy this dichotomy.

That's actually not true: OO, dynamism, late binding, and a lot of the other things that HLLs have to offer require a lot of pointer chasing and non-contiguous data structures. I'm mostly a Schemer, and Scheme and Lisp have had decades of research put into making them compile and run fast. Most dynamic languages aren't so lucky. But the required pointer chasing and garbage collection mean they'll never be as fast as C.

Functional programming languages, however, are rarely late-binding, and don't expose as much about their implementation, so some of the pointer chasing can be avoided.

Rust doesn't need a GC, and is fairly C-like - or rather, ALGOL-like and BLISS-like - with added memory safety. So with a programmer who knows what they're doing, it can be pretty fast. But here's the rub: the faster a language is, the closer it has to be to the metal, and the less it can do with high-level features.

So yes, you can make HLLs faster, but you can't take the cache misses out of an HLL, and you can't make a systems language wearing an HLL's clothing - although Rust is making an admirable attempt.


> OO, dynamism, late binding

None of these are required for ease of development. At least the first two often result in precisely the opposite.

HLLs are not required to focus on slow abstractions. For example, homogeneous arrays of tagged unions can replace inheritance most of the time. And they avoid breaking your code into 10 files and 20 classes (though for some reason this metric is seen as a good thing way too often).


OO, perhaps, but without dynamism and late binding, metaprogramming is difficult, as is any number of techniques. Like, for instance, class generation, and extending methods.

And while I don't set much store by inheritance, I set considerably more store by duck typing and polymorphism.

>None of these are required for ease of development. At least the first two often result in precisely the opposite.

So are you talking about Java-style OO? Because I was talking about Smalltalk OO, which is pretty different.

Required, no, but they're tools, and they come in handy. Certainly make development a lot more comfortable.

And some of the time, they make development a lot easier.


> Like, for instance, class generation, and extending methods.

This is precisely the kind of thing that leads to unmaintainable "magic" code, even though it can still be useful but with extreme moderation. So I don't see the point of making that a core feature of any language.

If you have any example to the contrary, I'd love a link to a good open-source project that uses these things extensively.


RSpec? A lot of ruby uses metaprogramming in some respect.

As for not seeing the point of making it a feature of your language, how about Lisp? And I'm not just talking macros. Lambdas, late binding, and, in Scheme, the ability to rebind anything lead to a lot of cool tricks and capabilities.

And late binding is extraordinarily important!


I don't consider the Ruby ecosystem to be a good example of much. Idiomatic Ruby code is much slower and usually not more maintainable than C++. Actually, it may even be worse thanks to dynamic typing, which makes refactoring much more painful than it already is in large code bases.

Well I guess it's good at making CRUD web sites. Hardly rocket science.


Okay then: Lisp.

Tinyclos is a fairly sophisticated implementation of OO and the MOP, written in Scheme.

Its descendants, COOPS, GOOPS, and others, are in most Schemes today. Many of them are written in their respective dialect of Scheme, with little or no specific compiler support.

SXML allows for writing XML in native scheme syntax.

The anaphoric macros (aif, acond, etc.) are all, well, macros, and thus use metaprogramming principles.

tclOO, [incr tcl], and other OO TCL systems are usually implemented in regular TCL.

Give or take, any large Lisp or Smalltalk codebase takes advantage of dynamic typing, late binding, and some form of metaprogramming.

However, you've made it clear that you hate Ruby, dynamic typing, and other such things, and given that much of metaprogramming requires this sort of flexibility, I very much doubt anything I say will convince you that dynamic languages are in any way useful.


I don't hate them. I've used python more than once, and will continue to do so. And I think it's great teaching material. It's a good scripting language. But I think its drawbacks far outweigh its advantages for large projects.

All your examples are programming gimmicks, and I've yet to see stuff that solves actual hard problems. I'm not interested in programming for programming's sake. I want to use it to make my computer do useful stuff.


Gimmicks? Some of them, yes, but CLOS and its descendants are used in Real applications, as are the TCL OO systems. But if you want Real World, I'll give you real world.

Maxima is a descendant of the original MACSYMA, developed at MIT. It is still a usable and viable system, even if it has aged a bit.

Emacs is a popular programmer's text editor written in C and a dialect of lisp.

Both of the above programs are large, useful, and written in an HLL - one particularly amenable to pointer chasing, I might add - and they make use of the variety of abstractions which that HLL provides.

If those aren't modern enough for you, check out some Clojure applications.


Considering the Ruby object and memory models, you probably will get cache miss after cache miss. You just won't have the tools to do anything about them.


You're probably right. I was just thinking, "It can't possibly be THAT bad," especially seeing as there are so many objects that are referenced frequently.


This advice applies a ton to databases. At my job, our application is heavily disk I/O bound, with a database query typically taking several seconds. I benchmarked how sorting the rows in our DB by time would compare to our current layout (partially sorted by user, partially sorted by time, which was completely accidental). Sorting by time benchmarked ~10x faster, which is absolutely huge.


Yeah, heavy data access is where this matters.


In game dev, it often matters a lot. The difference between code that takes cache performance into account and code that does not is often measured in frames per second.


Memory contiguity may be the most fundamental thing to keep in mind when doing numerical computing, for example.


Okay, so now I know.


> How much do those few useconds really matter

There is at least one area where even a split microsecond will cost you millions.

P.S.: even in consumer apps, microseconds here and there add up, and then the user will have an overall sluggish experience.


Big O notation is used to describe whatever you want it to. The average case is the usual choice, in fact.


Oh. Sorry.


In games, critical systems and high traffic backend systems it means a lot.


That I could guess. My point is that most code isn't in high traffic backend systems and games.


There are a million things that a person writing a boring CRUD web app doesn't need to know. But if you want to be a good programmer, you should probably understand how a computer works.


I didn't say you shouldn't. I just asked how often that kind of performance optimization mattered. The world is not comprised only of CRUD and situations where that kind of perf matters.

But I do agree that even if you don't have to worry about cache latency for every single thing you write, it is something you should know about.


I think the overemphasis of big O, especially during job interviews, is a sad thing.

I think multi-threading is an equally important skill, that gets less attention.


It's not. The reason is, a lot of stuff isn't multithreaded or is just a bunch of threads talking to a database. (I've been asked questions about multithreaded stuff, in interviews at companies that specifically did multithreaded stuff. Companies that talked to databases would be better off asking SQL stuff.)


One thing I didn't see in the article: in addition to cache misses, the list stores 3 words per word of data (the value plus two pointers); the array between 1 and 2 (up to 3 only while reallocating), so the array touches less RAM, even when inserting.


The author is doing fewer than 10 insertions per benchmark; that's practically a constant, O(1). Even having to copy the entire array 5 or 10 times doesn't affect the Big O analysis.


But he's inserting at the front of an array, which requires the rest of the elements to be copied into a new array. DotNetPerls has a clearer example: http://www.dotnetperls.com/list-insert


Technically speaking, he ought to be using Big Theta (Θ) to describe his bounds. Throwing Big O around to describe everything is foolish.


The benefits of the list abstraction are small---because it's a clumsy, blub-like list abstraction with a container object and iterators.

Look at the code; termination of the loop is even based on an integer count pulled from the container.

A True Scotsman's linked list has no such thing. It's either an empty indicator (like NIL in Lisp) or a binary cell consisting of an item and a pointer to the next one.

The benefit of that abstraction is that you can recurse over it directly without clumsy additional parameters having to be passed.

Another benefit is substructure sharing. We can insert at the front of a list not only in O(1) time, which is great, but, perhaps more importantly, existing copies of the list before the insertion do not change. And it is the same way if we delete from the front: we just move our local head pointer to the next node, which doesn't affect anyone else who is still holding on to the pointer to the original front node.

These lists also allow lock-free operation, unlike "heavy weight" containers. Suppose we have a list that acts purely as a container, but is shared by multiple threads. We can insert into it by consing a new node onto the head of the current snapshot of the list, and then doing an atomic-compare-swap to install that head in place of the old head. If it fails, it means the list changed behind our back; we cons the node onto the new list (or rewrite the cons to point to the new one as its "rest" pointer) and try again.
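A sketch of that lock-free push in C++ with std::atomic (illustrative only; it deliberately leaks nodes, since safe memory reclamation is the genuinely hard part of lock-free lists):

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Lock-free push: build a node whose "rest" points at the current head, then
    // try to atomically swing the head to it; if another thread won the race,
    // compare_exchange refreshes our snapshot and we simply try again.
    struct Node {
        int   value;
        Node* next;
    };

    std::atomic<Node*> head{nullptr};

    void push(int v) {
        Node* n = new Node{v, head.load(std::memory_order_relaxed)};
        // On failure, compare_exchange_weak writes the current head into n->next,
        // so the loop retries with the fresh snapshot already in place.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back([t] { for (int i = 0; i < 1000; ++i) push(t * 1000 + i); });
        for (auto& th : threads) th.join();

        int count = 0;
        for (Node* p = head.load(); p; p = p->next) ++count;
        std::printf("pushed %d nodes\n", count);   // 4000
        // Nodes are intentionally leaked: safe reclamation (hazard pointers, GC)
        // is out of scope for this sketch.
    }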

Some of the caching benefits of the array will disappear if the array holds only references/pointers to items. In this example, the containers are typed. The List<int> can actually allocate the int objects in an array that packs them together in memory, whereas the LinkedList<int> has individual, separately allocated nodes which hold the int.

Suppose the List and LinkedList hold pointers to heaped objects. Then the impact of caching is softened. It's still the case that multiple pointers in the array can be cached together, but these have to be traversed to access the items themselves. In the case of the LinkedList, we have to traverse a pointer to get to the next node and traverse a pointer to get to the heaped object. But two pointer traversals versus one is not as bad as one traversal versus zero.

If the objects being traversed are fairly complex, and have pointers to additional objects that have to be examined during the traversal, it matters even less what kind of list they are accessed from. If I have a list of File objects, for each of which a stream is opened and a regex scan performed, why would I care whether it's an array list or a linked list?

The results shown in this test case tell me that the performance of the two containers is not that far off! Here they are subject to a test case that is designed to highlight the difference between them by making the container content, and its processing, almost as trivial as possible. That 38 seconds versus 51 difference is almost purely in the container-related operations. That is as bad as it gets: from there, the more actual real work you do per container node, the smaller the actual difference.

What is 51 versus 38 in "orders of magnitude"? Why, 0.12 orders. It's 0.42 "binary orders of magnitude" (where 1 binary order is a doubling; terminology mine). So in terms of classic Moore's Law (speed doubling every 18 months), that's a 7.6 month advance. "My arrays are 7.6 months ahead of your linked list container, in Moore's Law terms, in a pure benchmark; eat my dust!"


Oprah fooled me once!


If the author sees this: please consider changing your color choices. It'd make reading the content you put so much work into producing easier for everyone. I couldn't finish the post. The 90s hacker's lair colors were that offensive.


While I don't doubt that the author could fix this up, you are not helpless either. If you have colour/vision problems it is often worthwhile to install a client-reader bookmarklet or plugin to your browser, which may come along with a host of other benefits (like saving, place-marking, etc).


Or a bookmarklet like this one:

    javascript:(function(){function%20R(w){try{var%20d=w.document,j,i,t,T,N,b,r=1,C;for(j=0;t=["object","embed","applet","iframe"][j];++j){T=d.getElementsByTagName(t);for(i=T.length-1;(i+1)&&(N=T[i]);--i)if(j!=3||!R((C=N.contentWindow)?C:N.contentDocument.defaultView)){b=d.createElement("div");b.style.width=N.width;%20b.style.height=N.height;b.innerHTML="<del>"+(j==3?"third-party%20"+t:t)+"</del>";N.parentNode.replaceChild(b,N);}}}catch(E){r=0}return%20r}R(self);var%20i,x;for(i=0;x=frames[i];++i)R(x)})();%20javascript:(function(){var%20newSS,%20styles='*%20{%20background:%20white%20!%20important;%20color:%20black%20!important%20}%20:link,%20:link%20*%20{%20color:%20#0000EE%20!important%20}%20:visited,%20:visited%20*%20{%20color:%20#551A8B%20!important%20}';%20if(document.createStyleSheet)%20{%20document.createStyleSheet("javascript:'"+styles+"'");%20}%20else%20{%20newSS=document.createElement('link');%20newSS.rel='stylesheet';%20newSS.href='data:text/css,'+escape(styles);%20document.getElementsByTagName("head")[0].appendChild(newSS);%20}%20})();%20javascript:(function(){var%20d=document;%20function%20K(N,w)%20{%20var%20nn%20=%20d.createElement(w),%20C%20=%20N.childNodes,%20i;%20for(i=C.length-1;i>=0;--i)%20nn.insertBefore(C[i],nn.childNodes[0]);%20N.parentNode.replaceChild(nn,N);%20}%20function%20Z(t,w)%20{%20var%20T%20=%20document.getElementsByTagName(t),%20j;%20for%20(j=T.length-1;j>=0;--j)%20K(T[j],w);%20}%20Z("blink",%20"span");%20Z("marquee",%20"div");%20})();%20javascript:(function(){var%20H=["mouseover","mouseout","unload","resize"],o=window.opera;%20if(document.addEventListener/*MOZ*/&&!o)%20for(j%20in%20H)document.addEventListener(H[j],function(e){e.stopPropagation();},true);%20else%20if(window.captureEvents/*NS4*/&&!o)%20{%20document.captureEvents(-1/*ALL*/);for(j%20in%20H)window["on"+H[j]]=null;}%20else/*IE*/%20{function%20R(N){var%20i,x;for(j%20in%20H)if(N["on"+H[j]]/*NOT%20TEXTNODE*/)N["on"+H[j]]=null;for(i=0;x=N.childNodes[i];++i)R(x);}R(document);}})();%20javascript:(function()%20{%20var%20c,%20tID,%20iID;%20tID%20=%20setTimeout(function(){},%200);%20for%20(c=1;%20c<1000%20&&%20c<=tID;%20++c)%20clearTimeout(tID%20-%20c);%20iID%20=%20setInterval(function(){},1000);%20for%20(c=0;%20c<1000%20&&%20c<=iID;%20++c)%20clearInterval(iID%20-%20c);%20})();
which strips out basically all the ill-considered design choices you're likely to find on a page, including best-of-1993 color palettes like that under discussion here.

If you don't want to go that route, most modern browsers have a reader function built in, which extracts the page content and displays it with high contrast, a readable font, and no annoyances.


Note that it's not so much the grey-on-black that's offensive, but the bright green and dark purple that produce eye strain. If you're going to use a black background, use light pastel colors for your fonts. That's usually a pretty good reading experience.


I agree. It's quite distracting. I didn't get through it.


I didn't find it offensive. The color scheme isn't that far off from what I prefer for my terminal settings.

What would you have preferred the author do differently?


I absolutely agree, but not everyone is gifted with good taste. Try the Instapaper Text bookmarklet!


I like the colors a lot. Much easier on the eyes than blinding white with gray text.


This is not what "offensive" means.


It's a bit of an older usage than is common today, but it is correct. You'll also find this usage in phrases like "offensive odor".


That's exactly what offensive means.

adjective 1. causing resentful displeasure; highly irritating, angering, or annoying: offensive television commercials. 2. unpleasant or disagreeable to the sense: an offensive odor. 3. repugnant to the moral sense, good taste, or the like; insulting:


Yeah, every website needs to be plastered with big Fisher Price[tm] Facebook buttons and have unreadable hipster light-gray-on-white text.


True story.

I don't know how long it will take us till we finally stop with the white backgrounds...


The process of weeding out. Sometimes it's better to select against sensitive people.


Why would you select against people with poor vision?


The green seems a bit harsh, yes.

But all in all it is a good page for people with poor vision.

No low-contrast and no white background, which are both considered bad on screens.


Yes, the white on black is fine. I checked the #8a7ae2 purple (http://webaim.org/resources/contrastchecker/) and, surprisingly, it passes WCAG AAA guidelines. I say "surprisingly" because it hurts my eyes. I think jumping between the alternating white and purple is what does it.


Why do you hate freedom?


Why would you select against people reading your article?


Why would you select against `; drop database prod;?


This is a great point and I'll probably have to do some performance measurements and change parts of my code now. I should be more careful with sacrificing contiguous storage for O(1) insertions.

It is also worth noting that some manuals say that appending to an array list is amortized O(1). Which is true, if you do an amortized analysis (which essentially distributes the work of copying the array into a larger block over all the inserts). That's something to keep in mind for systems that have to be realtime, or at least produce stable framerates: the worst case is important, and amortized analysis generously glosses over it.

EDIT: Not inserting, appending


Insertion into an array list at a uniformly-distributed location is always O(n): you can't avoid moving half the list. Appending is amortized O(1).


Not the person you're responding to but perhaps that person meant assuming your insertions are uniformly distributed? Then I think insertion is O(n) amortized..?

At an insertion you'd have

(n + (n-1) + .. + 1)/n = O(n^2) / n = O(n)

The first part comes from the possible run times, each occurring with probability 1/n. You might have to expand the list, but that'd also be O(n) amortized.


You are right, my bad. I meant appending.



