If you've ever wondered why the CS literature seems full of trees while, in 2010 practice, sets, associative memories, etc. are often implemented with hashing, consider that a cache miss on much modern hardware costs around 300 cycles. The equivalent of:
for (register int i = 0; i < 300; i++) ;
The author has made a nice contribution; as Colivas probably was trying to hint to him, there is a long history of converting linked data structures to block-friendly form by an analogous transformation. A straight write-up of his work, without all the chest-pounding over his rediscovery of the fact that block devices operate on blocks, would have been a pleasure to read. Pity.
In other words, if he had just used mlock() to make sure his heap wasn't getting swapped out, he would have gotten better performance and avoided the need to invent a "new" data structure. He chose a really bad example to get on a soapbox about.
He's writing a data structure that's not intended to fit within physical memory. mlock() doesn't do you any good in that scenario.
For example, see: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.5...
And there are different frameworks for building algorithms that perform well in the memory hierarchy, such as Architecture Cognizant and Cache Oblivious algorithms (although, as you can tell by the name, they have somewhat opposing ideological beliefs).
Nevertheless, this is a good example of why performance on real machines often needs to be measured rather than speculated about, as the architectural complexity can lead to some interesting performance characteristics (of course, a good grounding in complexity theory and experience will help prioritize which algorithms are even in the ballpark).
I don't really agree with this. In the old days, to really get an idea of how fast something would run, you had to count cycles and think about pipeline hazards and whatnot. Today, you can simply estimate the number of misses at the various memory levels -- this is often not that difficult to do -- and get a good idea of the program's performance.
So architectural differences used to be much more important back when I/O didn't dominate as much as it does today.
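That estimation style can be sketched as a toy model; the latency figures below are illustrative assumptions, not measurements of any particular machine:

```python
# Toy model: estimate runtime by counting accesses satisfied at each
# level of the memory hierarchy and weighting them by that level's
# latency. All latency figures are illustrative assumptions.

LATENCY_CYCLES = {"L1": 4, "L2": 12, "L3": 40, "RAM": 300}  # assumed values

def estimated_cycles(accesses_per_level):
    """Sum of (accesses at a level) * (assumed latency of that level)."""
    return sum(LATENCY_CYCLES[lvl] * n for lvl, n in accesses_per_level.items())

# One million sequential 8-byte reads: a new 64-byte line comes from RAM
# only once every 8 reads (125,000 misses); everything else hits L1.
sequential = estimated_cycles({"L1": 1_000_000, "RAM": 125_000})

# The same million reads chasing pointers, pessimistically assuming
# every access misses all the way to RAM.
pointer_chase = estimated_cycles({"RAM": 1_000_000})
```

Even this crude model predicts that the pointer-chase is several times slower than the sequential scan, which is usually what you want to know before going anywhere near a cycle counter.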
Another example of a complicating issue is garbage collection. If the GC uses something like compaction you really don't have a great idea of where your data is. We used to do all these tricks in 'C' to map data and move it around to get the best locality. This isn't nearly as feasible in C# and Java.
And there really are kind of two related questions:
1) How does this algorithm perform for a given memory hierarchy?
2) What is the best parameterization of this algorithm for the memory hierarchy?
Question 1 is a fair bit easier than question 2. And you'd be surprised at how much faster you can make the same algorithm by changing certain parameters (chunk size, etc.) for a given architecture. These things are hard to answer by simple reasoning, and a local maximum may in fact be far from the global maximum.
Did we really? With malloc things are just as unpredictable. If you resort to mmap-like memory-management, such schemes are almost as feasible in high-level languages (with appropriate FFIs) as in low-level languages, provided you are accessing data and not code.
The other thing is that memory locations don't change. If my object is at location L, then I know it's at location L unless I move it. With compaction, unless I'm pinning objects (and making the GC's and allocator's lives hard), my object doesn't have a fixed location.
The work by Arge, et al is probably the foundational work in priority queues for modern memory hierarchies. If I recall correctly, they use a cache oblivious approach: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106....
That's a good start if you're looking at the literature.
This version simply divides the data set into page sized chunks. That has worse memory access complexity than the van Emde Boas layout, but is likely simpler to deal with in practice. It's not trivial to maintain the van Emde Boas layout under insertion and deletion for example.
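As a rough illustration of why page-sized chunking wins, you can count the distinct 4 KB pages a leaf-to-root traversal touches. The sketch below uses the article's 512-records-per-page figure; the chunked count is an idealized bound, not the actual B-heap index arithmetic:

```python
import math

ENTRIES_PER_PAGE = 512  # 4096-byte page / 8-byte record (figures from the article)

def pages_on_path(leaf, entries_per_page=ENTRIES_PER_PAGE):
    """Distinct pages touched walking a classic 1-indexed binary heap
    from element `leaf` up to the root (the parent of k is k // 2)."""
    pages = set()
    k = leaf
    while k >= 1:
        pages.add(k // entries_per_page)
        k //= 2
    return len(pages)

def pages_on_path_chunked(leaf, entries_per_page=ENTRIES_PER_PAGE):
    """Idealized count if each page holds a complete subtree: with 512
    entries a page holds 9 full levels (2**9 - 1 = 511 nodes), so a path
    of d levels crosses about ceil(d / 9) pages."""
    levels_per_page = (entries_per_page + 1).bit_length() - 1  # 9 for 512
    depth = leaf.bit_length()                                  # levels on the path
    return math.ceil(depth / levels_per_page)
```

For a million-entry heap the linear layout touches about a dozen distinct pages per traversal, while a page-per-subtree layout touches about three; that is the shape of the improvement being discussed.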
(The post you link is a good one btw.)
First: congratulations, the level of discussion here is a fair bit above what I have seen so far on reddit and slashdot.
Second: I sense a number of bruised egos. Good. That is often a sound indication that a sore point was successfully poked.
Third: user "gaius" hits the problem spot on:
"In mine, we learnt about the hardware, the cache hierarchy and so on, completely separately from the algorithms and complexity theory. Two different classes, two professors. Probably they both knew it themselves, but it never occurred to them to cross-pollinate their course materials."
If I can get those two professors to talk to each other, my article will have been worth the effort. Hopefully the HW prof will tease the algorithms prof with my article, and some fruitful cooperation will ensue.
The important point in my article is not the B-heap, that took me all of an hour to figure out and I'm sure most of you could have done the same thing, had your thoughts been wandering in that direction.
No, the important point is that most CS educations don't even mention that actual computers have virtual memory, write buffers, multilevel caches and so on, and if they do, they certainly don't mention that the O() function depends on 13+ more or less stochastic variables extracted therefrom.
Several people here, and elsewhere, seem to make the sign of the cross at the mere mention of virtual memory and RAM overcommit. That is understandable, in particular if nobody ever taught them how to properly use those facilities.
Getting into a fight with the kernel's VM bits is a sure-fire recipe for lousy performance, and after one or two such experiences, becoming a bit sceptical is fair.
But the fact is that by intelligently using those VM bits, you save a lot of complex code and get as good performance as you can hope for.
But you have to know what you are doing, and for most of us, that means that the professor should have told us.
Please don't sign comments, especially with your URL. They're already signed with your username. If other users want to learn more about you, they can click on it to see your profile.
"One particular task, inside Varnish, is expiring objects from the cache when their virtual lifetimers run out of sand. This calls for a data structure that can efficiently deliver the smallest keyed object from the total set."
with this statement, where he defends his claims that the Stoopid Computer Scientists have it all wrong:
"Did you just decide that my order of magnitude claim was bogus, because it is based on only an extreme corner case? If so, you are doing it wrong, because this is pretty much the real-world behavior seen....Creating and expiring objects in Varnish are relatively infrequent actions. Once created, objects are often cached for weeks if not months, and therefore the binary heap may not be updated even once per minute; on some sites not even once per hour."
So, maybe I'm reading this wrong, but it sure sounds like he's going out of his way to find a scenario that results in an "order of magnitude" difference, simply so that he can write an article with a big claim. The only justification for this is left as an exercise in critical thinking for the reader:
"At this point, is it wrong to think, 'If it runs only once per minute, who cares, even if it takes a full second?' We do, in fact, care because the 10 extra pages needed once per minute loiter in RAM for a while, doing nothing for their keep—until the kernel pages them back out again, at which point they get to pile on top of the already frantic disk activity, typically seen on a system under this heavy VM pressure."
(that sound you hear is the frantic waving of hands)
Basically, the claim is that if you build a binary heap for an exceptionally infrequent operation, make that heap big enough to require multiple pages (about 8 MB, in this case, on a machine that is presumably managing a multi-gigabyte resident web cache), do absolutely nothing to ensure that it stays in memory, and then pick the worst possible runtime scenario (touching every item in the heap in a pattern that results in repeated page faults), you can get pathological behavior.
I think I speak for dopey CS professors everywhere, when I say: Duh.
Don't misunderstand my point: it's not that the article is wrong...it's just that it's so arrogantly written that it's hard to forgive the fact that the situation described is contrived. If I presented this problem to any of my own CS professors, I'm willing to wager that I'd be asked why I was being so stupid as to allow my index to page out of memory, when it represents such a trivial percentage of my heap.
> So, maybe I'm reading this wrong, but it sure sounds like he's going out of his way to find a scenario that results in an "order of magnitude" difference, simply so that he can write an article with a big claim.
What he's saying is that this order of magnitude difference is where his software, Varnish, spends most of its time in reality. Systems that run Varnish are almost always at the end of his graph where VM pressure is the greatest, largely because Varnish is specifically written to allow the VM to manage disk-to-memory and memory-to-disk movement, rather than implementing it (poorly) in-process like Squid does.
> (that sound you hear is the frantic waving of hands)
I don't see how his statement is a hand wave at all.
> Basically, the claim is that if you build a binary heap for an exceptionally infrequent operation, and make that heap big enough to require multiple pages (about 8 MB, in this case, on a machine that is presumably managing a multi-gigabyte resident web cache),
Pages are 4KB on most systems; I don't know where you're getting your 8MB number from.
> then do absolutely nothing to ensure that it stays in memory,
Why should you? It's only used infrequently.
> then pick the worst possible runtime scenario (touching every item in the heap in a pattern that results in repeated page faults) you can get pathological behavior.
You're not understanding. A heap's ordinary behavior causes excessive page faults on account of its poor locality. I assume you recall (or can remind yourself easily) the algorithm for using a heap as a priority queue: you remove the first element, substitute a leaf element, and sift the leaf element down, swapping it until it's smaller than both its children. Because the children of element `k` are elements `2k` and `2k+1`, after the first 2,047 elements, every single comparison between levels potentially requires that the OS page in another 4KB page from disk. When you've got a million elements, that's 11 fast, in-memory comparisons for the first page, and 9 comparisons which each require at least 1ms to read from disk the VM page where the children reside. This is not a worst case at all, but the common case. This doesn't require reading or writing all the elements in the queue; it just uses the heap normally.
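For reference, here is a generic textbook sketch (not Varnish's code) of that pop-and-sift-down operation on a 1-indexed array min-heap:

```python
def sift_down(h, k):
    """h[1:] is a 1-indexed min-heap; the children of k are 2k and 2k+1.
    Swap k downward until it is no larger than both children."""
    n = len(h) - 1
    while 2 * k <= n:
        c = 2 * k
        if c + 1 <= n and h[c + 1] < h[c]:
            c += 1                       # pick the smaller child
        if h[k] <= h[c]:
            break
        h[k], h[c] = h[c], h[k]          # each level crossed here may live
        k = c                            # on a different VM page

def pop_min(h):
    """Remove and return the smallest element: substitute a leaf for the
    root, then sift it down."""
    top = h[1]
    h[1] = h[-1]
    h.pop()
    if len(h) > 1:
        sift_down(h, 1)
    return top

def heapify(items):
    h = [None] + list(items)             # dummy slot so indexing is 1-based
    for k in range(len(h) // 2, 0, -1):
        sift_down(h, k)
    return h
```

Each swap in `sift_down` moves one level down the tree, and past the first few levels each of those levels can sit on a different VM page.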
> If I presented this problem to any of my own CS professors, I'm willing to wager that I'd be asked why I was being so stupid as to allow my index to page out of memory, when it represents such a trivial percentage of my heap.
If you kept the whole heap resident in memory, that'd just mean that other pages which you probably need more frequently than once per minute would be paged out to disk instead. At high VM pressure, you'll probably pay a higher paging cost if you keep an infrequently-used heap in memory, because pages you need more, that would stay in memory if you allowed the OS to page out your heap, end up being paged out instead.
Touché. I'm not Knuth or anything, but I try.
"Pages are 4KB on most systems; I don't know where you're getting your 8MB number from."
When I said "a heap big enough to require multiple pages", I actually meant it. Hence, I was referring to the size of the heap he used for his test: 8MB (1M records, 512 elements per page, 1954 pages allocated in memory).
I may not be brilliant, but to compensate for my lack of intellectual horsepower, I tried to read the article closely.
"If you kept the whole heap resident in memory, that'd just mean that other pages which you probably need more frequently than once per minute would be paged out to disk instead....pages you need more, that would stay in memory if you allowed the OS to page out your heap, end up being paged out instead."
Indeed. If you fixed your 8MB heap in memory, you would lose that 8MB of pages for other uses. I guess it's a trade-off then...is it better to use 8MB of RAM on a system with gigabytes of main memory, or to re-write fundamental data structures for worst-case timing of operations that occur once per minute? It's certainly a conundrum....
Did you? Your math that follows seems to indicate otherwise.
> Hence, I was referring to the size of the heap he used for his test: 8MB (1M records, 512 elements per page, 1954 pages allocated in memory).
I don't understand. I demonstrated that a heap with only 2048 records requires multiple 4 KB pages. Are you perhaps confusing KB with MB?
> Indeed. If you fixed your 8MB heap in memory, you would lose that 8MB of pages for other uses. I guess it's a trade-off then...is it better to use 8MB of RAM on a system with gigabytes of main memory, or to re-write fundamental data structures for worst-case timing of operations that occur once per minute?
The knowledge you seem especially to lack here is that Poul's software, Varnish, is designed to operate in conditions where swap is being used. It's irrelevant how many gigabytes of main memory are available: Varnish will use it all, and let the operating system decide which VM pages should be evicted to disk and which should remain resident in memory. His software lives at high VM pressure by design, so every page his index/heap requires in memory is a page of cached content that Varnish will not be able to deliver quickly. He doesn't just pay that cost when the heap runs once per minute, but also when the heap's pages are paged out again, and pages which would otherwise have remained resident in memory are brought back into memory.
No. There are 1M records, 512 per page. This is directly from the article. From this, we can deduce that each record is 4096 / 512 = 8 bytes. Eight bytes multiplied by 1M records: 8MB.
Don't believe my math? OK. The article explicitly said that there were 1954 pages in the heap. If we have a 4k page size, that's (4096 * 1954) bytes == 7.63 MB. Even better.
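A quick arithmetic check, using only figures quoted from the article:

```python
PAGE = 4096                       # 4 KB page size
record = PAGE // 512              # 512 records per page -> 8 bytes per record
via_records = record * 1_000_000  # 8 bytes * 1M records
via_pages = 1954 * PAGE           # 1954 pages * 4096 bytes
MiB = 1 << 20
```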
You are arguing that a page has to be 4 KB. timr is arguing that the heap has to be big enough to span many pages, and that it is 8 MB. The implication is that it is big enough to qualify, without commenting on how big "big enough" is.
The wording was misleading, and you made the mistake of focusing on what you saw as an error of expression and taking it to be an error of meaning, even after debate. Sadly, that's a very easy mistake to make, especially if you haven't argued with people very much, and it is only exacerbated by a textual medium whose memory can be referenced. If you ever find yourself in a similar situation again, try to read the whole offending paragraph for context. A lot of people make inaccurate statements that can only be correctly understood by identifying their frame of thought (which can be in error, muddled, or just presumptuous).
Varnish makes the assumption that the OS cache replacement policy is awesome. You should let it do its thing and write your data structures accordingly. Considering that the author is intimately familiar with that policy, it's a valid approach.
You could alternately say that you don't want to depend on the OS policy, which you don't control. You could say that using a differently-packed heap layout is more important than retaining 8MB more page content in the cache. (Maybe you have a nice fully-debugged heap library handy and you don't want to throw it away to get that 8MB back.) In that case, wiring that 8MB into physical memory is a valid approach.
Databases (and, indeed, OS kernels) override generic cache replacement policies like this all the time.
What is not a valid approach ("doing it wrong") is not thinking through issues like this at all, ignoring the dominating runtime cost of the memory hierarchy.
What you're seeing here is really just the usual battle between ideology and reason that seems to pop up on HN on the weekends. We've got a fundamentally silly article that wants to take all of computer science to task, based on a corner-case analysis of worst-case performance of a single algorithm, in a single, exaggerated context.
You're right that the author wants to rely entirely upon the OS cache replacement policy for Varnish and that's fine, as far as it goes. But this ideology leads directly to the problem observed (namely: some rarely used things that really should stay in memory are evicted, because the OS has no ability to discern the semantics of memory use by the application). Rather than acknowledging this limitation, the author has decided instead that the algorithms are all wrong, and that the Stoopid Computer Scientists are all a bunch of short-sighted eggheads.
Again, it's not a question of who's right and who's wrong -- it's a matter of philosophy. You can assume that the OS memory manager is the all-knowing, all-powerful Wizard of RAM, or you can give it some guidance. In this case, locking an 8MB heap into RAM is hardly a trade-off when you're talking about a system that is actively managing several orders of magnitude more memory on a regular basis. Spending days of coding time optimizing a basic data structure for worst-case memory access patterns is short-sighted, when the alternative is an mlock() call.
There haven't been any "silly aspersions" to your intelligence. You just didn't exhibit a sufficient knowledge of the problem to dismiss the article like you did. Computer science is a large field, and it doesn't say anything at all about your intelligence if you're unaware of one particular implementation of one particular type of software by one particular author. You're the one who's cast this whole conversation into a competition, not me.
> It's possible to make an argument without calling the other guy stupid.
Exactly, which is why I responded like I did, and gave you an opportunity to explain whether I misunderstood you. You, on the other hand, responded with sarcasm and rhetoric, and clarified nothing.
Not totally; it started with "the article is dumb", to which someone replied "nuh uh! you're dumb", to which the OP replied "oh no i'm not", to which another reply said "oh yes you are". Very focused.
(The rest of the posts were good reading, but getting the emotion involved just makes everyone mad. Neither of the commenters cared anymore about their argument, they only cared about winning.)
At least on 4chan we get pictures of cats with this stuff ;)
For my part, there was no emotion involved. That's why I didn't bother replying to the emotional aspects of timr's second post.
> Neither of the commenters cared anymore about their argument, they only cared about winning.
I care a lot about the argument; I think Poul's article was interesting and relevant, and should not be dismissed by people who (it seems clearly to be the case) did not understand it, but think they did.
That doesn't change the fact that, as best I can discern, the OP has made some factual errors which demand (<http://xkcd.com/386/>) a response.
(I have it hung on the wall in my office for this very reason)
We're currently craftsmen. We have very little knowledge about building systems, except our experience and some arcane beliefs. It's time to start digging ourselves out of the stone age.
Put another way, if presented with this article, I'd expect my students not to say "ZOMG what have I been doing??" but rather "Ah, another thing to look out for!" And I suspect this is typical of modern CS2 courses, not to mention more advanced algorithms courses.
Design and analysis of algorithms is only one area in CS. The author is correct in pointing out that CS programs lack engineering focus, at least in undergrad education.
What they do is: if you use foo.bar and bar is not in RAM, they load it into RAM. If you do foo.bar = baz, they store the modification to disk. So they keep a copy of persistent objects in RAM and on disk.
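That load-on-read, store-on-write behavior can be sketched roughly as follows; the class and file layout are hypothetical, for illustration only, not any real library's API:

```python
import os
import pickle

class Persistent:
    """Minimal sketch of the behavior described: reading foo.bar faults
    the object's state in from disk on first touch; writing foo.bar = baz
    stores the modification back out. Hypothetical, not a real library."""

    def __init__(self, path):
        # Bypass our own __setattr__ for the bookkeeping slots.
        object.__setattr__(self, "_path", path)
        object.__setattr__(self, "_state", None)

    def _load(self):
        # Fault the persisted state into RAM on first touch.
        if self._state is None:
            if os.path.exists(self._path):
                with open(self._path, "rb") as f:
                    object.__setattr__(self, "_state", pickle.load(f))
            else:
                object.__setattr__(self, "_state", {})
        return self._state

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for persisted fields.
        state = self._load()
        try:
            return state[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        # Write-through: update the in-RAM copy and persist immediately.
        state = self._load()
        state[name] = value
        with open(self._path, "wb") as f:
            pickle.dump(state, f)
```

Real object databases add caching, invalidation and transactions on top, but the core is this same shape.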
(Git and Mercurial are designed with an eye towards optimizations based on disk latencies. Python is optimized with regard to CPU cache. We're back to "Abstractions Leak.")
I mean, he's basically arguing that CS hasn't revisited heaps since 1961, and hasn't noticed that things like caches or VM pressure might change what the optimal algorithm looks like. But that's of course not the case.
Then kindly link me to the paper which describes his B-heaps; I'd love to read it.
Fwiw, here's a widely cited 1996 paper that describes a different variety of block-aggregated heaps, "d-heaps", aimed mainly at cache-aware performance: http://lamarca.org/anthony/pubs/heaps.pdf
It's quite possible that no existing heap layouts solve his specific problem, but he could've at least acknowledged that there exist heaps newer than Knuth's, and that many of them specifically look at the influence of the memory hierarchy on performance.
There should at least be exposure on the level of a walkthrough of the optimization of a real-world server. There was a concurrency optimization video posted to HN a couple of weeks ago.
Sounds good to me.
the claims that "nobody else gets it" are in fact unsubstantiated
I'm pretty sure that the entire shop at my last consulting job would all claim it was news to them. Most people don't have to deal with issues like the ones discussed in the article, so it's perfectly understandable that it would be news to them. I don't have to deal with such issues on a daily basis, though I think I would have been able to do the same analysis if I had his job.
And the title text is not about "relative latency" you mention but about the cost of paging
I didn't know "relative" was so associated with memory and paging that you could jump to a conclusion about my specific meaning. (And actually, I meant to be quite general.) AFAIK, the cost of paging has something to do with the latency of resident memory vs. the latency of memory that was paged out.
I'm pretty sure that the entire shop at my last consulting job would all claim it was news to them