Knuth was writing for MIX. This machine, like all its contemporaries, had a flat memory hierarchy.
If you've ever wondered why the CS literature seems full of trees, but in 2010 practice, sets, associative memories, etc., are often implemented with hashing, consider that a cache miss on much modern hardware is 300 cycles. The equivalent of
for (register int i = 0; i < 299; i++) ;
can pay for itself by saving a cache miss. The log N cache misses to find something in a tree (the pointers all point somewhere random, and you should expect them to miss) is much more expensive than hashing reasonable-length keys and taking a single cache miss in a hash table, for almost any value of N > 1.
The author has made a nice contribution; as Colivas probably was trying to hint to him, there is a long history of converting linked data structures to block-friendly form by an analogous transformation. A straight write-up of his work, without all the chest-pounding over his rediscovery of the fact that block devices operate on blocks, would have been a pleasure to read. Pity.
This is a big problem for functional languages, and there've been a bunch of efforts to solve it. For example, Clojure's default dictionary type uses a hash array-mapped trie, with each level sized so that a typical block can fit in a single cache line. It's very common for the youngest generation in a generational garbage collector to be sized to fit entirely into L2 cache, so that you know that every write or read of a recently-allocated value will at least be an L2 cache hit. (Unfortunately, it's usually not practical to make it a L1 hit, lest you be stuck thrashing in the GC, so you're still looking at ~12 cycles.) Deforestation optimizes out allocation entirely, instead performing the next computation immediately on values that are already in registers. Many GCs will also try to copy nearby pointers next to their parents, so that the pointers aren't quite random.
Note that the article's "improved" data structure is 30% slower than the naive binary heap when there's no VM pressure. And his heap is small compared to the cache itself.
In other words, if he had just used mlock() to make sure his heap wasn't getting swapped out, he would have gotten better performance and avoided the need to invent a "new" data structure. He chose a really bad example to get on a soapbox about.
I'm having a hard time reconciling this statement from the beginning of the article (where he establishes his premise):
"One particular task, inside Varnish, is expiring objects from the cache when their virtual lifetimers run out of sand. This calls for a data structure that can efficiently deliver the smallest keyed object from the total set."
with this statement, where he defends his claims that the Stoopid Computer Scientists have it all wrong:
"Did you just decide that my order of magnitude claim was bogus, because it is based on only an extreme corner case? If so, you are doing it wrong, because this is pretty much the real-world behavior seen....Creating and expiring objects in Varnish are relatively infrequent actions. Once created, objects are often cached for weeks if not months, and therefore the binary heap may not be updated even once per minute; on some sites not even once per hour."
So, maybe I'm reading this wrong, but it sure sounds like he's going out of his way to find a scenario that results in an "order of magnitude" difference, simply so that he can write an article with a big claim. The only justification for this is left as an exercise in critical thinking for the reader:
"At this point, is it wrong to think, 'If it runs only once per minute, who cares, even if it takes a full second?' We do, in fact, care because the 10 extra pages needed once per minute loiter in RAM for a while, doing nothing for their keep—until the kernel pages them back out again, at which point they get to pile on top of the already frantic disk activity, typically seen on a system under this heavy VM pressure."
(that sound you hear is the frantic waving of hands)
Basically, the claim is that if you build a binary heap for an exceptionally infrequent operation, and make that heap big enough to require multiple pages (about 8 MB, in this case, on a machine that is presumably managing a multi-gigabyte resident web cache), then do absolutely nothing to ensure that it stays in memory, then pick the worst possible runtime scenario (touching every item in the heap in a pattern that results in repeated page faults) you can get pathological behavior.
I think I speak for dopey CS professors everywhere, when I say: Duh.
Don't misunderstand my point: it's not that the article is wrong...it's just that it's so arrogantly written that it's hard to forgive the fact that the situation described is contrived. If I presented this problem to any of my own CS professors, I'm willing to wager that I'd be asked why I was being so stupid as to allow my index to page out of memory, when it represents such a trivial percentage of my heap.
I could be wrong, but it sounds to me like you may not have the base of knowledge necessary to accurately evaluate Poul's writing.
> So, maybe I'm reading this wrong, but it sure sounds like he's going out of his way to find a scenario that results in an "order of magnitude" difference, simply so that he can write an article with a big claim.
What he's saying is that this order of magnitude difference is where his software, Varnish, spends most of its time in reality. Systems that run Varnish are almost always at the end of his graph where VM pressure is the greatest, largely because Varnish is specifically written to allow the VM to manage disk-to-memory and memory-to-disk movement, rather than implementing it (poorly) in-process like Squid does.
> (that sound you hear is the frantic waving of hands)
I don't see how his statement is a hand wave at all.
> Basically, the claim is that if you build a binary heap for an exceptionally infrequent operation, and make that heap big enough to require multiple pages (about 8 MB, in this case, on a machine that is presumably managing a multi-gigabyte resident web cache),
Pages are 4KB on most systems; I don't know where you're getting your 8MB number from.
> then do absolutely nothing to ensure that it stays in memory,
Why should you? It's only used infrequently.
> then pick the worst possible runtime scenario (touching every item in the heap in a pattern that results in repeated page faults) you can get pathological behavior.
You're not understanding. A heap's ordinary behavior causes excessive page faults on account of its poor locality. I assume you recall (or can remind yourself easily) the algorithm for using a heap as a priority queue: you remove the first element, substitute a leaf element, and sift the leaf element down, swapping it until it's greater than both its children). Because the children of element `k` are elements `2k` and `2k+1`, after the first 2,047 elements, every single comparison between levels potentially requires that the OS page in another 4KB page from disk. When you've got a million elements, that's 11 fast, in-memory comparisons for the first page, and 19 comparisons which require at least 1ms to read from disk the VM page where the children reside. This is not a worst case at all, but the common case. This doesn't require reading or writing all the elements in the queue, it just uses the heap normally.
> If I presented this problem to any of my own CS professors, I'm willing to wager that I'd be asked why I was being so stupid as to allow my index to page out of memory, when it represents such a trivial percentage of my heap.
If you kept the whole heap resident in memory, that'd just mean that other pages which you probably need more frequently than once per minute would be paged out to disk instead. At high VM pressure, you'll probably pay a higher paging cost if you keep an infrequently-used heap in memory, because pages you need more, that would stay in memory if you allowed the OS to page out your heap, end up being paged out instead.
"I could be wrong, but it sounds to me like you may not have the base of knowledge necessary to accurately evaluate Poul's writing."
Touché. I'm not Knuth or anything, but I try.
"Pages are 4KB on most systems; I don't know where you're getting your 8MB number from."
When I said "a heap big enough to require multiple pages", I actually meant it. Hence, I was referring to the size of the heap he used for his test: 8MB (1M records, 512 elements per page, 1954 pages allocated in memory).
I may not be brilliant, but to compensate for my lack of intellectual horsepower, I tried to read the article closely.
"If you kept the whole heap resident in memory, that'd just mean that other pages which you probably need more frequently than once per minute would be paged out to disk instead....pages you need more, that would stay in memory if you allowed the OS to page out your heap, end up being paged out instead."
Indeed. If you fixed your 8MB heap in memory, you would lose that 8MB of pages for other uses. I guess it's a trade-off then...is it better to use 8MB of RAM on a system with gigabytes of main memory, or to re-write fundamental data structures for worst-case timing of operations that occur once per minute? It's certainly a conundrum....
> When I said "a heap big enough to require multiple pages", I actually meant it.
Did you? Your math that follows seems to indicate otherwise.
> Hence, I was referring to the size of the heap he used for his test: 8MB (1M records, 512 elements per page, 1954 pages allocated in memory).
I don't understand. I demonstrated that a heap with only 2048 records requires multiple 4 KB pages. Are you perhaps confusing KB with MB?
> Indeed. If you fixed your 8MB heap in memory, you would lose that 8MB of pages for other uses. I guess it's a trade-off then...is it better to use 8MB of RAM on a system with gigabytes of main memory, or to re-write fundamental data structures for worst-case timing of operations that occur once per minute?
The knowledge you seem especially to lack here is that Poul's software, Varnish, is designed to operate in conditions where swap is being used. It's irrelevant how many gigabytes of main memory are available: Varnish will use it all, and let the operating system decide which VM pages should be evicted to disk and which should remain resident in memory. His software lives at high VM pressure by design, so every page his index/heap requires in memory is a page of cached content that Varnish will not be able to deliver quickly. He doesn't just pay that cost when the heap runs once per minute, but also when the heap's pages are paged out again, and pages which would otherwise have remained resident in memory are brought back into memory.
Why are you still arguing about this? Your original claim was, "if you build a binary heap...and make that heap big enough to require multiple pages (about 8 MB". It does not take 8 MB to make a heap "big enough to require multiple pages." It takes 4 KB. PHK's example was a middling-sized heap, not the minimum heap necessary to require multiple pages, as you implied with your original statement.
The original was make that heap big enough to require multiple pages (about 8 MB, in this case
You are arguing that it has to be 4 KB. timr is arguing it has to be big enough, and that it is 8 MB. The implication is that it is big enough to qualify, without commenting on how big is big enough.
The wording was misleading, and you made the mistake of focusing on what you saw to be an error in expression and taking it to be error in meaning, even after debate. Sadly, that's a very easy mistake to make, especially if you haven't argued with people very much, and it is only exacerbated by a textual medium with memory that can be referenced. If you ever find yourself in a similar situation again, try to read the whole offending paragraph for context. A lot of people make inaccurate statements that can only be correctly understood by identifying their frame of thought (which can be in error, muddled, or just presuming).
Varnish makes the assumption that the OS cache replacement policy is awesome. You should let it do its thing and write your data structures accordingly. Considering that the author is intimately familiar with that policy, it's a valid approach.
You could alternately say that you don't want to depend on the OS policy, which you don't control. You could say that using a differently-packed heap layout is more important than retaining 8MB more page content in the cache. (Maybe you have a nice fully-debugged heap library handy and you don't want to throw it away to get that 8MB back.) In that case, wiring that 8MB into physical memory is a valid approach.
Databases (and, indeed, OS kernels) override generic cache replacement policies like this all the time.
What is not a valid approach ("doing it wrong") is not thinking through issues like this at all, ignoring the dominating runtime cost of the memory hierarchy.
I'm not talking past anyone -- I'm just choosing to ignore the silly aspersions to my intelligence. It's possible to make an argument without calling the other guy stupid.
What you're seeing here is really just the usual battle between ideology and reason that seems to pop up on HN on the weekends. We've got a fundamentally silly article that wants to take all of computer science to task, based on a corner-case analysis of worst-case performance of a single algorithm, in a single, exaggerated context.
You're right that the author wants to rely entirely upon the OS cache replacement policy for Varnish and that's fine, as far as it goes. But this ideology leads directly to the problem observed (namely: some rarely used things that really should stay in memory are evicted, because the OS has no ability to discern the semantics of memory use by the application). Rather than acknowledging this limitation, the author has decided instead that the algorithms are all wrong, and that the Stoopid Computer Scientists are all a bunch of short-sighted eggheads.
Again, it's not a question of who's right, and who's wrong -- it's a matter of philosophy. You can assume that the OS memory manager is the all-knowing, all-powerful Wizard of RAM, or you can give it some guidance. In this case, locking an 8MB heap into RAM is hardly a trade-off, when you're talking about a system that is actively managing several orders of magnitude more memory on a regular basis. Spending days of coding time optimizing a basic data structures for worst-case memory access patterns is short-sighted, when the alternative is an mlock() call.
> I'm not talking past anyone -- I'm just choosing to ignore the silly aspersions to my intelligence.
There haven't been any "silly aspersions" to your intelligence. You just didn't exhibit a sufficient knowledge of the problem to dismiss the article like you did. Computer science is a large field, and it doesn't say anything at all about your intelligence if you're unaware of one particular implementation of one particular type of software by one particular author. You're the one who's cast this whole conversation into a competition, not me.
> It's possible to make an argument without calling the other guy stupid.
Exactly, which is why I responded like I did, and gave you an opportunity to explain whether I misunderstood you. You, on the other hand, responded with sarcasm and rhetoric, and clarified nothing.
Not totally, it started with "the article is dumb", to which someone replied "nuh uh! you're dumb", to which the OP replied, "oh no i'm not", to which another reply said "oh yes you are". Very focused.
(The rest of the posts were good reading, but getting the emotion involved just makes everyone mad. Neither of the commenters cared anymore about their argument, they only cared about winning.)
At least on 4chan we get pictures of cats with this stuff ;)
> The rest of the posts were good reading, but getting the emotion involved just makes everyone mad.
For my part, there was no emotion involved. That's why I didn't bother replying to the emotional aspects of timr's second post.
> Neither of the commenters cared anymore about their argument, they only cared about winning.
I care a lot about the argument; I think Poul's article was interesting and relevant, and should not be dismissed by people who (it seems clearly to be the case) did not understand it, but think they did.
First: congratulations, the level of discussion here is a fair bit above what I have seen so far on reddit and slashdot.
Second: A sense a number of bruised egos. Good. That is often a sound indication that a sore point successfully poked.
Third: user "gaius" hits the problem spot on:
"In mine, we learnt about the hardware, the cache hierarchy and so on, completely separately from the algorithms and complexity theory. Two different classes, two professors. Probably they both knew it themselves, but it never occurred to them to cross-pollinate their course materials."
If I can get those two professors to talk to each other, my article will have been worth the effort. Hopefully the HW-prof will tease the algorithm-prof with my article and some fruitfull cooperation ensue.
The important point in my article is not the B-heap, that took me all of an hour to figure out and I'm sure most of you could have done the same thing, had your thoughts been wandering in that direction.
No, the important point is that most CS educations don't even mention that actual computers have Virtual Memory, Write buffers, multilevel caches and so on, and if they do, the certainly don't mention that the O() function depends on 13+ more or less stochastic variables extracted therefrom.
Several people here, and elsewhere, seem to make the sign of the cross at there mere mention of virtual memory and RAM overcommit. That is understandable, in particular if nobody ever taught them how to properly use those facilities.
Getting into a fight with the kernels VM bits is a sure-fire recipe for lousy performance, and after one or two such experiences, becoming a bit sceptical is fair.
But the facts are, that by intelligently using the those VM bits you save a lot of complex code, and get as good performance as you can hope for.
But you have to know what you are doing, and for most of us, that means that the professor should have told us.
And there are different frameworks for building algorithms that perform well in the memory hierarchy, such as Architecture Cognizant and Cache Oblivious algorithms (although, as you can tell by the name, they have somewhat opposing ideological beliefs).
Nevertheless, this is a good example as to why performance on real machines often needs to be implemented, rather than speculated, as the the architectural complexity can lead to some interesting peformance characteristics (of course, some good ground in complexity and experience will help prioritize which algorithms are even in the ballpark).
Nevertheless, this is a good example as to why performance on real machines often needs to be implemented, rather than speculated, as the the architectural complexity can lead to some interesting peformance characteristics
I don't really agree with this. In the old days, to really get an idea of how fast something would run, you had to count cycles and think about pipeline hazards and whatnot. Today, you can simply estimate the number of misses at the various memory levels -- this is often not that difficult to do -- and get a good idea of the program's performance.
So architectural differences used to be much more important back when I/O didn't dominate as much as it does today.
Actually, even if your goal is to simply count jumps in the memory hierarchy, it can still be tough for several reasons. For example, the set associativity of a given level of the memory hierarchy can make analysis very tricky. Somethings that would ostensibly be efficient w/ respect to the memory hierarchy could actually have a whole bunch of conflict misses. Modeling this is possible, but not easy. Especially for non-trivial algorithms.
Another example of a complicating issue is garbage collection. If the GC uses something like compaction you really don't have a great idea of where your data is. We used to do all these tricks in 'C' to map data and move it around to get the best locality. This isn't nearly as feasible in C# and Java.
And there really are kind of two related questions:
1) How does this algorithm perform for a given memory hierarchy?
2) What is the best parameterization of this algorithm for the memory hierarchy?
Question 1 is a fair bit easier than question 2. And you'd be surprised at how much faster you can make the same algorithm, by changing certain parameters (chunk size, etc...) for a given architecture. And these things are hard to answer by simple reasoning. And a local maxima, may in fact be far from the global maxima.
> We used to do all these tricks in 'C' to map data and move it around to get the best locality. This isn't nearly as feasible in C# and Java.
Did we really? With malloc things are just as unpredictable. If you resort to mmap-like memory-management, such schemes are almost as feasible in high-level languages (with appropriate FFIs) as in low-level languages, provided you are accessing data and not code.
Well there are some big differences. One thing being that you don't have the same type safety in C as in many higher level languages. If I say the memory at 0xdeadbeef is of type X, well then its type X.
The other thing is that memory locations don't change. If my object is at location L then I know its at location L unless I move it. With compaction, unless I'm pinning objects (and making the GCs and allocators life hard) my object doesn't have a fixed location.
No, it's not. The van Emde Boas layout splits the data set into sqrt(n) chunks of sqrt(n) items. Then it recursively splits each of those chunks similarly.
This version simply divides the data set into page sized chunks. That has worse memory access complexity than the van Emde Boas layout, but is likely simpler to deal with in practice. It's not trivial to maintain the van Emde Boas layout under insertion and deletion for example.
It's good of him to remind us not to live in the clouds, I guess, but I think he's hitting a bit of a straw man here. I just finished teaching a CS2 course where a recurring theme is that big-O complexity isn't and can't be the only thing you consider when choosing an algorithm. My understanding is that this is typical---that this idea is usually covered somewhere in the first couple terms of a standard CS major. Admittedly, I didn't talk specifically about memory paging (since they won't really hit that until their OS/Networking class next year), but it's a fairly natural extension of the tradeoffs I did train them to consider.
Put another way, if presented with this article, I'd expect my students not to say "ZOMG what have I been doing??" but rather "Ah, another thing to look out for!" And I suspect this is typical of modern CS2 courses, not to mention more advanced algorithms courses.
Well this certainly sparks my interest. Could I ask for some recommendations for any books and articles on algorithms which take the current computer/memory architecture into account?
I'm currently reading "Mastering algorithms with C", and I had some other books in queue. But I'm always looking for some fresh content.
To be honest. I think it's a misconception that CS is about algorithms. The science should be modeling the relationships between the environment, the desired functionality and the resulting trade-offs. There's actually a lot of this to go around, but it's not recognized enough. The electrical engineering establishment, which is more concerned with buildings specific systems and algorithms, is far more influential.
We're currently craftsmen. We have very little knowledge about building systems, except our experience and some arcane beliefs. It's time to start digging ourselves out of the stone age.
Good article but a little resarch into recent publications in systems conferences could have helped put it in a better direction. Althogh lot of programmers use text-book algorithm there exist a large body of work (in CS) that focus on the engineering aspects of algorithms. Look into compression/decompression algorithms like PPM for example and you will see O(n^2) algos are proposed as improvements over O(lg n). (run time vs. execution time).
Design and analysis of algorithm is only one area in CS. Author is correct in pointing out CS programs lack engineering focus, at least in undergrad education.
How much of this applies to transparent object persistence? Things like Rucksack & Elephant (in common lisp) let you use objects in RAM but automatically read and write them to disk to keep them across system restarts and crashes. These systems are also essentially using RAM as a cache explicitly. Could performance be improved by taking advantage of virtual memory?
What they do is if you use foo.bar, and bar is not in RAM then they load it into RAM. If you do foo.bar = baz then they store the modification to disk. So they are keeping a copy of persistent objects in RAM and on disk.
this article is ridiculous. He makes blanket claims based on the performance of a very specialized application. For many / most applications, memory is so cheap that you CAN effectively ignore VM. Where some thought about the relative latency of memory is useful though is in parallel applications on NUMA hardware.
this article is great. He gives a specific example based on real-world performance, showing that you can't always rely on theory assuming a general case. For many intensive applications, memory is so precious, you CAN'T effectively ignore VM. Some thought about the relative latency of memory is useful in more than just parallel applications on NUMA hardware.
(Git and Mercurial are designed with an eye towards optimizations based on disk latencies. Python is optimized with regard to CPU cache. We're back to "Abstractions Leak.")
One problem is that the article seems to incorrectly assume that "theory" always ignores memory hierarchies. That was true in, say, 1975, but the past 20 years of algorithms theory pays a lot of attention to memory hierarchies. You can even get all sorts of off-the-shelf algorithms designed to perform well on typical modern memory configurations.
I mean, he's basically arguing that CS hasn't revisited heaps since 1961, and hasn't noticed that things like caches or VM pressure might change what the optimal algorithm looks like. But that's of course not the case.
His specific B-heaps might indeed be novel; I'm not claiming he makes no contribution. I'm just objecting to the portion of his paper that claims that nobody in CS has ever thought of the idea of optimizing heaps for the properties of a modern computer's memory hierarchy. He seems to really believe Knuth's 1961 paper is the last word on the subject, or at least says so.
It's quite possible that no existing heap layouts solve his specific problem, but he could've at least acknowledged that there exist heaps newer than Knuth's, and that many of them specifically look at the influence of the memory hierarchy on performance.
Hmm, my undergrad algorithms class at least mentioned the existence of cache effects, though it didn't really go into any detail. It had a bit of disclaimer along the lines of: for simplicity, we don't take into account caches in this class, but more modern models of computation do, and for real-world applications you might want to look up cache-aware algorithms.
In mine, we learnt about the hardware, the cache hierarchy and so on, completely separately from the algorithms and complexity theory. Two different classes, two professors. Probably they both knew it themselves, but it never occurred to them to cross-pollinate their course materials.
The article is really poorly written. The specific example is very specific, the claims that "nobody else gets it" are in fact unsubstantiated. All the effects he mentions are very known among right professionals.
And the title text is not about "relative latency" you mention but about the cost of paging, once something gets paged out.
the claims that "nobody else gets it" are in fact unsubstantiated
I'm pretty sure that the entire shop at my last consulting job would all claim it was news to them. Most people don't have to deal with issues like the ones discussed in the article, so it's perfectly understandable that it would be news to them. I don't have to deal with such issues on a daily basis, though I think I would have been able to do the same analysis if I had his job.
And the title text is not about "relative latency" you mention but about the cost of paging
I didn't know "relative" was so associated with memory and paging that you could jump to a conclusion about my specific meaning. (And actually, I meant to be quite general.) AFAIK, the cost of paging has something to do with the latency of resident memory vs. latency memory that was paged out.
That depends whether, when you're talking about 'most applications', you mean most by sheer number, or most by processing time used. Because yes, there are a zillion tiny applications that don't care about VM space. But the ones that do are used far more often (running nonstop on servers, etc) than that random iBoobs app on your iPhone.
Even for most applications "running non-stop on servers", VM is almost moot these days. For anything computational, swap is the kiss of death. For many applications where the working set is larger than available memory, the work is typically done inside a DBMS which usually has it's own systems for managing paging. Algos in DBMS are certainly aware of non-uniform access time: http://en.wikipedia.org/wiki/B-tree
The request has been canceled by the administrator or by the server.
coldfusion.monitor.event.MonitoringServletFilter$StopThreadException: The request has been canceled by the administrator or by the server.