
You're Doing It Wrong: CS in the real world - fragmede
http://queue.acm.org/detail.cfm?id=1814327
======
kmavm
Knuth was writing for MIX. This machine, like all its contemporaries, had a
flat memory hierarchy.

If you've ever wondered why the CS literature seems full of trees, but in 2010
practice, sets, associative memories, etc., are often implemented with
hashing, consider that a cache miss on much modern hardware is _300_ cycles.
The equivalent of

    
    
      for (register int i = 0; i < 299; i++) ;
    

can pay for itself by saving a cache miss. The log N cache misses to find
something in a tree (the pointers all point somewhere random, and you should
expect them to miss) are much more expensive than hashing reasonable-length
keys and taking a single cache miss in a hash table, for almost any value of N
> 1.
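
As a rough illustration of that trade-off, here is a back-of-envelope sketch
in C. The 300-cycle miss cost is the figure above; the 50-cycle hashing cost,
and the assumption of one miss per tree level versus one miss per hash probe,
are illustrative guesses rather than measurements.

      /* Back-of-envelope comparison, not a benchmark: estimated cycles to
       * find an item by pointer-chasing a tree vs. a single hash-table probe.
       * Assumes ~300 cycles per cache miss (see above), ~50 cycles to hash a
       * short key, one miss per tree level, and one miss per hash probe. */
      #include <math.h>
      #include <stdio.h>
      
      int main(void)
      {
          const double miss_cycles = 300.0;   /* assumed cache-miss cost */
          const double hash_cycles = 50.0;    /* assumed cost of hashing a short key */
          double n;
      
          for (n = 1e3; n <= 1e9; n *= 1000.0) {
              double tree = log2(n) * miss_cycles;     /* one miss per level */
              double hash = hash_cycles + miss_cycles; /* hash, then one miss */
              printf("n=%.0e  tree ~%5.0f cycles  hash ~%4.0f cycles\n",
                     n, tree, hash);
          }
          return 0;
      }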

The author has made a nice contribution; as Colivas probably was trying to
hint to him, there is a long history of converting linked data structures to
block-friendly form by an analogous transformation. A straight write-up of his
work, without all the chest-pounding over his rediscovery of the fact that
block devices operate on blocks, would have been a pleasure to read. Pity.

~~~
haberman
Note that the article's "improved" data structure is 30% slower than the naive
binary heap when there's no VM pressure. And his heap is small compared to the
cache itself.

In other words, if he had just used mlock() to make sure his heap wasn't
getting swapped out, he would have gotten better performance and avoided the
need to invent a "new" data structure. He chose a really bad example to get on
a soapbox about.
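
For concreteness, a minimal sketch of what that mlock() approach might look
like; the 8 MB size and the bare allocation are hypothetical, not taken from
Varnish:

      /* Minimal sketch of the mlock() approach: pin the heap's pages so the
       * kernel never swaps them out.  The size and allocation here are
       * hypothetical, for illustration only. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <unistd.h>
      
      int main(void)
      {
          size_t heap_bytes = (size_t)8 << 20;          /* ~8 MB heap array */
          long   page       = sysconf(_SC_PAGESIZE);
          void  *heap       = NULL;
      
          if (posix_memalign(&heap, (size_t)page, heap_bytes) != 0)
              return 1;
      
          if (mlock(heap, heap_bytes) != 0)   /* may fail if RLIMIT_MEMLOCK is low */
              perror("mlock");
      
          /* ... build and use the binary heap in this pinned region ... */
      
          munlock(heap, heap_bytes);
          free(heap);
          return 0;
      }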

~~~
davidcuddeback
FTA: _A 300-GB backing store, memory mapped on a machine with no more than 16
GB of RAM, is quite typical. The user paid for 64 bits of address space, and I
am not afraid to use it._

He's writing a data structure that's not intended to fit within physical
memory. mlock() doesn't do you any good in that scenario.

~~~
rlpb
I think your quote is about the entire cache, not just the heap.

------
kenjackson
It's a good article, although certainly not new, even to academics. There is a
whole branch of CS that is dedicated to understanding complexity with respect
to the memory hierarchy.

For example, see:
[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.5...](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.587)

And there are different frameworks for building algorithms that perform well
in the memory hierarchy, such as Architecture Cognizant and Cache Oblivious
algorithms (although, as you can tell by the name, they have somewhat opposing
ideological beliefs).

Nevertheless, this is a good example of why performance on real machines
often needs to be measured by implementation rather than speculation, as the
architectural complexity can lead to some interesting performance
characteristics (of course, a good grounding in complexity theory and
experience will help prioritize which algorithms are even in the ballpark).

~~~
gojomo
Does the 'B-heap' PHK describes already have a name in the literature?

~~~
mad
It's the van Emde Boas layout. For example, see here:
<http://blogs.msdn.com/b/devdev/archive/2007/06/12/cache-oblivious-data-structures.aspx>.

~~~
ssp
No, it's not. The van Emde Boas layout splits the data set into sqrt(n) chunks
of sqrt(n) items. Then it recursively splits each of those chunks similarly.

This version simply divides the data set into page sized chunks. That has
worse memory access complexity than the van Emde Boas layout, but is likely
simpler to deal with in practice. It's not trivial to maintain the van Emde
Boas layout under insertion and deletion for example.

(The post you link is a good one btw.)

------
phkamp
Ohh, dear, another forum to keep an eye on :-)

First: congratulations, the level of discussion here is a fair bit above what
I have seen so far on reddit and slashdot.

Second: I sense a number of bruised egos. Good. That is often a sound
indication that a sore point was successfully poked.

Third: user "gaius" hits the problem spot on:

"In mine, we learnt about the hardware, the cache hierarchy and so on,
completely separately from the algorithms and complexity theory. Two different
classes, two professors. Probably they both knew it themselves, but it never
occurred to them to cross-pollinate their course materials."

If I can get those two professors to talk to each other, my article will have
been worth the effort. Hopefully the HW-prof will tease the algorithm-prof
with my article and some fruitful cooperation will ensue.

The important point in my article is not the B-heap, that took me all of an
hour to figure out and I'm sure most of you could have done the same thing,
had your thoughts been wandering in that direction.

No, the important point is that most CS educations don't even mention that
actual computers have Virtual Memory, Write buffers, multilevel caches and so
on, and if they do, they certainly don't mention that the O() function depends
on 13+ more or less stochastic variables extracted therefrom.

Several people here, and elsewhere, seem to make the sign of the cross at
the mere mention of virtual memory and RAM overcommit. That is
understandable, in particular if nobody ever taught them how to properly use
those facilities.

Getting into a fight with the kernel's VM bits is a sure-fire recipe for lousy
performance, and after one or two such experiences, becoming a bit sceptical
is fair.

But the facts are that by intelligently using those VM bits you save a
lot of complex code, and get as good performance as you can hope for.

But you have to know what you are doing, and for most of us, that means that
the professor should have told us.

Poul-Henning

~~~
mhartl
Glad to have you here! Since you're new to HN, I've got one quick note from
the site guidelines (<http://ycombinator.com/newsguidelines.html>):

    
    
      Please don't sign comments, especially with your url. 
      They're already signed with your username. If other users
      want to learn more about you, they can click on it to see
      your profile.

------
timr
I'm having a hard time reconciling this statement from the beginning of the
article (where he establishes his premise):

 _"One particular task, inside Varnish, is expiring objects from the cache
when their virtual lifetimers run out of sand. This calls for a data structure
that can efficiently deliver the smallest keyed object from the total set."_

with this statement, where he defends his claims that the Stoopid Computer
Scientists have it all wrong:

 _"Did you just decide that my order of magnitude claim was bogus, because it
is based on only an extreme corner case? If so, you are doing it wrong,
because this is pretty much the real-world behavior seen....Creating and
expiring objects in Varnish are relatively infrequent actions. Once created,
objects are often cached for weeks if not months, and therefore the binary
heap may not be updated even once per minute; on some sites not even once per
hour."_

So, maybe I'm reading this wrong, but it sure sounds like he's going out of
his way to find a scenario that results in an "order of magnitude" difference,
simply so that he can write an article with a big claim. The only
justification for this is left as an exercise in critical thinking for the
reader:

 _"At this point, is it wrong to think, 'If it runs only once per minute, who
cares, even if it takes a full second?' We do, in fact, care because the 10
extra pages needed once per minute loiter in RAM for a while, doing nothing
for their keep—until the kernel pages them back out again, at which point they
get to pile on top of the already frantic disk activity, typically seen on a
system under this heavy VM pressure."_

(that sound you hear is the frantic waving of hands)

Basically, the claim is that if you build a binary heap for an exceptionally
infrequent operation, make that heap big enough to require multiple pages
(about 8 MB, in this case, on a machine that is presumably managing a _multi-
gigabyte_ resident web cache), _do absolutely nothing_ to ensure that it stays
in memory, and then _pick the worst possible runtime scenario_ (touching every
item in the heap in a pattern that results in repeated page faults), you can
get pathological behavior.

I think I speak for dopey CS professors everywhere, when I say: Duh.

Don't misunderstand my point: it's not that the article is _wrong_...it's just
that it's so arrogantly written that it's hard to forgive the fact that the
situation described is contrived. If I presented this problem to any of my own
CS professors, I'm willing to wager that I'd be asked why I was being so
stupid as to allow my index to page out of memory, when it represents such a
trivial percentage of my heap.

~~~
jemfinch
I could be wrong, but it sounds to me like you may not have the base of
knowledge necessary to accurately evaluate Poul's writing.

> So, maybe I'm reading this wrong, but it sure sounds like he's going out of
> his way to find a scenario that results in an "order of magnitude"
> difference, simply so that he can write an article with a big claim.

What he's saying is that this order of magnitude difference is where his
software, Varnish, spends most of its time _in reality_. Systems that run
Varnish are almost always at the end of his graph where VM pressure is the
greatest, largely because Varnish is specifically written to allow the VM to
manage disk-to-memory and memory-to-disk movement, rather than implementing it
(poorly) in-process like Squid does.

> (that sound you hear is the frantic waving of hands)

I don't see how his statement is a hand wave at all.

> Basically, the claim is that if you build a binary heap for an exceptionally
> infrequent operation, and make that heap big enough to require multiple
> pages (about 8 MB, in this case, on a machine that is presumably managing a
> multi-gigabyte resident web cache),

Pages are 4KB on most systems; I don't know where you're getting your 8MB
number from.

> then do absolutely nothing to ensure that it stays in memory,

Why should you? It's only used infrequently.

> then pick the worst possible runtime scenario (touching every item in the
> heap in a pattern that results in repeated page faults) you can get
> pathological behavior.

You're not understanding. A heap's _ordinary_ behavior causes excessive page
faults on account of its poor locality. I assume you recall (or can remind
yourself easily) the algorithm for using a heap as a priority queue: you
remove the first element, substitute a leaf element, and sift the leaf element
down, swapping it until it's greater than both its children. Because the
children of element `k` are elements `2k` and `2k+1`, after the first 2,047
elements, every single comparison between levels potentially requires that the
OS page in another 4KB page from disk. When you've got a million elements,
that's 11 fast, in-memory comparisons for the first page, and 19 comparisons
which require at least 1ms to read from disk the VM page where the children
reside. This is not a worst case at all, but the common case. This doesn't
require reading or writing all the elements in the queue, it just uses the
heap normally.
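
For reference, a minimal sketch of that sift-down, written here as a min-heap
(Varnish wants the smallest expiry time first) with 1-based array indexing;
the rest of the heap code is omitted:

      #include <stddef.h>
      
      /* Pop the root and sift the replacement leaf down, as described above.
       * Children of element k live at 2k and 2k+1, so once the heap outgrows
       * one page, each step down the tree can touch a different 4KB VM page.
       * Assumes *n >= 1. */
      static void pop_min(unsigned *heap, size_t *n)
      {
          size_t k = 1;                      /* 1-based indexing */
      
          heap[1] = heap[*n];                /* move the last leaf to the root */
          (*n)--;
          for (;;) {
              size_t child = 2 * k;          /* left child */
              if (child > *n)
                  break;
              if (child + 1 <= *n && heap[child + 1] < heap[child])
                  child++;                   /* pick the smaller child */
              if (heap[k] <= heap[child])
                  break;                     /* heap property restored */
              unsigned tmp = heap[k];        /* swap down one level */
              heap[k] = heap[child];
              heap[child] = tmp;
              k = child;
          }
      }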

> If I presented this problem to any of my own CS professors, I'm willing to
> wager that I'd be asked why I was being so stupid as to allow my index to
> page out of memory, when it represents such a trivial percentage of my heap.

If you kept the whole heap resident in memory, that'd just mean that other
pages which you probably need more frequently than once per minute would be
paged out to disk instead. At high VM pressure, you'll probably pay a _higher_
paging cost if you keep an infrequently-used heap in memory, because pages you
need more, that would stay in memory if you allowed the OS to page out your
heap, end up being paged out instead.

~~~
timr
_"I could be wrong, but it sounds to me like you may not have the base of
knowledge necessary to accurately evaluate Poul's writing."_

Touché. I'm not Knuth or anything, but I try.

 _"Pages are 4KB on most systems; I don't know where you're getting your 8MB
number from."_

When I said _"a heap big enough to require multiple pages"_ , I actually meant
it. Hence, I was referring to the size of the heap he used for his test: 8MB
(1M records, 512 elements per page, 1954 pages allocated in memory).
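
For what it's worth, those figures from the article do combine to roughly that
size (a quick check, assuming the 4 KB page size discussed elsewhere in this
thread):

      1,000,000 records / 512 records per page  ~= 1,954 pages
      1,954 pages * 4 KB per page               ~= 7.6 MB, i.e. roughly 8 MB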

I may not be brilliant, but to compensate for my lack of intellectual
horsepower, I tried to read the article closely.

 _"If you kept the whole heap resident in memory, that'd just mean that other
pages which you probably need more frequently than once per minute would be
paged out to disk instead....pages you need more, that would stay in memory if
you allowed the OS to page out your heap, end up being paged out instead."_

Indeed. If you fixed your 8MB heap in memory, you would lose that 8MB of pages
for other uses. I guess it's a trade-off then...is it better to use 8MB of RAM
on a system with gigabytes of main memory, or to re-write fundamental data
structures for worst-case timing of operations that occur once per minute?
It's certainly a conundrum....

~~~
jemfinch
> When I said "a heap big enough to require multiple pages", I actually meant
> it.

Did you? Your math that follows seems to indicate otherwise.

> Hence, I was referring to the size of the heap he used for his test: 8MB (1M
> records, 512 elements per page, 1954 pages allocated in memory).

I don't understand. I demonstrated that a heap with only 2048 records requires
multiple 4 KB pages. Are you perhaps confusing KB with MB?

> Indeed. If you fixed your 8MB heap in memory, you would lose that 8MB of
> pages for other uses. I guess it's a trade-off then...is it better to use
> 8MB of RAM on a system with gigabytes of main memory, or to re-write
> fundamental data structures for worst-case timing of operations that occur
> once per minute?

The knowledge you seem especially to lack here is that Poul's software,
Varnish, is designed to operate in conditions where swap is being used. It's
irrelevant how many gigabytes of main memory are available: Varnish will use
it all, and let the operating system decide which VM pages should be evicted
to disk and which should remain resident in memory. His software lives at high
VM pressure _by design_ , so every page his index/heap requires in memory is a
page of cached content that Varnish will not be able to deliver quickly. He
doesn't just pay that cost when the heap runs once per minute, but also when
the heap's pages are paged out again, and pages which would otherwise have
remained resident in memory are brought back into memory.

~~~
wrs
You guys are totally talking past each other.

Varnish makes the assumption that the OS cache replacement policy is awesome.
You should let it do its thing and write your data structures accordingly.
Considering that the author is intimately familiar with that policy, it's a
valid approach.

You could alternately say that you don't want to depend on the OS policy,
which you don't control. You could say that using a differently-packed heap
layout is more important than retaining 8MB more page content in the cache.
(Maybe you have a nice fully-debugged heap library handy and you don't want to
throw it away to get that 8MB back.) In that case, wiring that 8MB into
physical memory is a valid approach.

Databases (and, indeed, OS kernels) override generic cache replacement
policies like this all the time.

What is _not_ a valid approach ("doing it wrong") is not thinking through
issues like this at all, ignoring the dominating runtime cost of the memory
hierarchy.

~~~
timr
I'm not talking past anyone -- I'm just choosing to ignore the silly
aspersions to my intelligence. It's possible to make an argument without
calling the other guy stupid.

What you're seeing here is really just the usual battle between ideology and
reason that seems to pop up on HN on the weekends. We've got a fundamentally
silly article that wants to take all of computer science to task, based on a
corner-case analysis of worst-case performance of a single algorithm, in a
single, exaggerated context.

You're right that the author wants to rely entirely upon the OS cache
replacement policy for Varnish and that's fine, as far as it goes. But this
ideology leads directly to the problem observed (namely: some rarely used
things that _really should_ stay in memory are evicted, because the OS has no
ability to discern the semantics of memory use by the application). Rather
than acknowledging this limitation, the author has decided instead that the
_algorithms_ are all wrong, and that the Stoopid Computer Scientists are all a
bunch of short-sighted eggheads.

Again, it's not a question of who's right, and who's wrong -- it's a matter of
philosophy. You can assume that the OS memory manager is the all-knowing, all-
powerful Wizard of RAM, or you can give it some guidance. In this case,
locking an 8MB heap into RAM is hardly a trade-off, when you're talking about
a system that is actively managing several orders of magnitude more memory on
a regular basis. Spending days of coding time optimizing a basic data
structure for worst-case memory access patterns is short-sighted, when the
alternative is an mlock() call.

~~~
jemfinch
> I'm not talking past anyone -- I'm just choosing to ignore the silly
> aspersions to my intelligence.

There haven't _been_ any "silly aspersions" to your intelligence. You just
didn't exhibit a sufficient knowledge of the problem to dismiss the article
like you did. Computer science is a large field, and it doesn't say anything
at all about your intelligence if you're unaware of one particular
implementation of one particular type of software by one particular author.
You're the one who's cast this whole conversation into a competition, not me.

> It's possible to make an argument without calling the other guy stupid.

Exactly, which is why I responded like I did, and gave you an opportunity to
explain whether I misunderstood you. You, on the other hand, responded with
sarcasm and rhetoric, and clarified nothing.

------
AndrejM
Well this certainly sparks my interest. Could I ask for some recommendations
for any books and articles on algorithms which take the current
computer/memory architecture into account? I'm currently reading "Mastering
algorithms with C", and I had some other books in queue. But I'm always
looking for some fresh content.

------
tybris
To be honest, I think it's a misconception that CS is about algorithms. The
science should be about modeling the relationships between the environment,
the desired functionality, and the resulting trade-offs. There's actually a
lot of this to go around, but it's not recognized enough. The electrical
engineering establishment, which is more concerned with building specific
systems and algorithms, is far more influential.

We're currently craftsmen. We have very little knowledge about building
systems, except our experience and some arcane beliefs. It's time to start
digging ourselves out of the stone age.

~~~
whimsy
This is very general. What would you recommend, more specifically?

------
blahedo
It's good of him to remind us not to live in the clouds, I guess, but I think
he's hitting a bit of a straw man here. I just finished teaching a CS2 course
where a recurring theme is that big-O complexity _isn't_ and _can't be_ the
only thing you consider when choosing an algorithm. My understanding is that
this is typical---that this idea is usually covered somewhere in the first
couple terms of a standard CS major. Admittedly, I didn't talk specifically
about memory paging (since they won't really hit that until their
OS/Networking class next year), but it's a fairly natural extension of the
tradeoffs I did train them to consider.

Put another way, if presented with this article, I'd expect my students not to
say "ZOMG what have I been doing??" but rather "Ah, another thing to look out
for!" And I suspect this is typical of modern CS2 courses, not to mention more
advanced algorithms courses.

------
kvs
Good article, but a little research into recent publications in systems
conferences could have helped point it in a better direction. Although a lot
of programmers use textbook algorithms, there exists a large body of work (in
CS) that focuses on the engineering aspects of algorithms. Look into
compression/decompression algorithms like PPM, for example, and you will see
O(n^2) algorithms proposed as improvements over O(lg n) ones (theoretical run
time vs. actual execution time).

Design and analysis of algorithms is only one area in CS. The author is
correct in pointing out that CS programs lack engineering focus, at least in
undergraduate education.

------
jules
How much of this applies to transparent object persistence? Things like
Rucksack & Elephant (in Common Lisp) let you use objects in RAM but
automatically read and write them to disk to keep them across system restarts
and crashes. These systems, too, are essentially using RAM as an explicit
cache. Could performance be improved by taking advantage of virtual memory?

What they do is: if you access foo.bar and bar is not in RAM, they load it
into RAM. If you do foo.bar = baz, they store the modification to disk. So
they keep a copy of persistent objects both in RAM and on disk.
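
The VM-centric alternative the article argues for would be to mmap() the
persistent store and let the kernel decide what stays resident. A minimal
sketch, with a hypothetical file name and object layout:

      /* Minimal sketch of the mmap() alternative: map the persistent store
       * and let the kernel's VM system decide which objects stay in RAM.
       * The file name "objects.db" and struct obj are hypothetical. */
      #include <fcntl.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>
      
      struct obj { long bar; };            /* stand-in for a persistent object */
      
      int main(void)
      {
          int fd = open("objects.db", O_RDWR);
          struct stat st;
          struct obj *objs;
      
          if (fd < 0 || fstat(fd, &st) != 0)
              return 1;
      
          /* Objects are read and written through the mapping; the kernel pages
           * them in on first touch and writes dirty pages back to the file. */
          objs = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
          if (objs == MAP_FAILED)
              return 1;
      
          objs[0].bar = 42;                /* "foo.bar = baz", via the mapping */
      
          munmap(objs, (size_t)st.st_size);
          close(fd);
          return 0;
      }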

------
slioslat
this article is ridiculous. He makes blanket claims based on the performance
of a very specialized application. For many / most applications, memory is so
cheap that you CAN effectively ignore VM. Where some thought about the
relative latency of memory is useful though is in parallel applications on
NUMA hardware.

~~~
stcredzero
this article is great. He gives a specific example based on real-world
performance, showing that you can't always rely on theory assuming a general
case. For many intensive applications, memory is so precious, you CAN'T
effectively ignore VM. Some thought about the relative latency of memory is
useful in more than just parallel applications on NUMA hardware.

(Git and Mercurial are designed with an eye towards optimizations based on
disk latencies. Python is optimized with regard to CPU cache. We're back to
"Abstractions Leak.")

~~~
_delirium
One problem is that the article seems to incorrectly assume that "theory"
always ignores memory hierarchies. That was true in, say, 1975, but the past
20 years of algorithms theory pays a lot of attention to memory hierarchies.
You can even get all sorts of off-the-shelf algorithms designed to perform
well on typical modern memory configurations.

I mean, he's basically arguing that CS hasn't revisited heaps since 1961, and
hasn't noticed that things like caches or VM pressure might change what the
optimal algorithm looks like. But that's of course not the case.

~~~
jemfinch
> I mean, he's basically arguing that CS hasn't revisited heaps since
> 1961...But that's of course not the case.

Then kindly link me to the paper which describes his B-heaps; I'd love to read
it.

~~~
_delirium
His specific B-heaps might indeed be novel; I'm not claiming he makes no
contribution. I'm just objecting to the portion of his paper that claims that
nobody in CS has ever thought of the idea of optimizing heaps for the
properties of a modern computer's memory hierarchy. He seems to really believe
Knuth's 1961 paper is the last word on the subject, or at least says so.

Fwiw, here's a widely cited 1996 paper that describes a different variety of
block-aggregated heaps, "d-heaps", aimed mainly at cache-aware performance:
<http://lamarca.org/anthony/pubs/heaps.pdf>

It's quite possible that no existing heap layouts solve his specific problem,
but he could've at least acknowledged that there exist heaps newer than
Knuth's, and that many of them specifically look at the influence of the
memory hierarchy on performance.

~~~
jemfinch
That's a great paper, thanks for the reference.

------
ez77
I doubt he'll bother, but I would love to know Knuth's reaction to this
article.

~~~
jrockway
"TAOCP is written for MIX and MIX is not FreeBSD."

------
albertcardona
Reads like their server is beyond capacity at the moment. Any cached copies?

EDIT:
[http://webcache.googleusercontent.com/search?q=cache:Q3SZ-4y...](http://webcache.googleusercontent.com/search?q=cache:Q3SZ-4yVYJIJ:queue.acm.org/detail.cfm)

    
    
      500
      
      The request has been canceled by the administrator or by the server.
      
      coldfusion.monitor.event.MonitoringServletFilter$StopThreadException: The request has been canceled by the administrator or by the server.
      	at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:65)
      	at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)
      	at jrun.servlet.FilterChain.doFilter(FilterChain.java:94)
      	at jrun.servlet.FilterChain.service(FilterChain.java:101)
      	at jrun.servlet.ServletInvoker.invoke(ServletInvoker.java:106)
      	at jrun.servlet.JRunInvokerChain.invokeNext(JRunInvokerChain.java:42)
      	at jrun.servlet.JRunRequestDispatcher.invoke(JRunRequestDispatcher.java:286)
      	at jrun.servlet.ServletEngineService.dispatch(ServletEngineService.java:543)
      	at jrun.servlet.jrpp.JRunProxyService.invokeRunnable(JRunProxyService.java:203)
      	at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:428)
      	at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)

