
B-Heap vs. Binary Heap (2010) - navinsylvester
https://queue.acm.org/detail.cfm?id=1814327
======
slx26
A great reminder that spatial locality can be critical to performance. But
that's actually quite well known, and maybe a more important lesson would be:
don't take performance for granted, even in the standard libraries of well-
established languages or in textbook algorithms; there may be plenty of room
for improvement for your use case (if you need it).

That said, the article feels quite arrogant. As other commenters say, the idea
behind the optimization is nothing new, just some basic concepts well applied.
That's a job well done, but it probably doesn't justify the tone that I, at
least, perceive in the article.

~~~
rocqua
I find it interesting that he is really concerned with page-faults as opposed
to cache misses. Most of the recent stuff I've read on performance focuses
much more on the cache.

Usually I've seen the use of swap space met with derision; I'd only expect
page faults to really happen when using a memory-mapped file. And on the note
of caches, I'd expect more fanfare about the TLB cost of touching a different
page than about the page faults themselves.

~~~
blattimwind
The internal caches of the CPU are orders of magnitude faster than handling a
page fault (which means: CPU traps, switch to kernel, VM lookup, issue I/O
request, block for completion, switch to user).

TFA's context is a larger-than-memory data structure mapped from disk, where
heap block size = page size.

------
userbinator
It's pretty much the same reason why databases use B-trees and not binary
trees --- when you have to consider the nonuniform latency of storage access,
traditional algorithmic analysis that assumes every memory address is
accessible in the same time as any other becomes extremely misleading.

~~~
vbezhenar
Every memory access can be bounded by some large enough constant (just assume
there's no cache in your CPU), so traditional analysis is fine. That people
don't like how slow memory access then turns out to be is another matter.
O(N^2) might be faster than O(N ln N) if the constant factor on the latter
algorithm is bad enough.
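
A toy illustration with made-up constants (not numbers from the article): if
the quadratic algorithm's constant factor is 1 and the O(N ln N) one's is
1000 (say every access misses all the way to RAM or disk), the quadratic one
stays ahead until N reaches roughly 14,000:

    /* Compare n^2 against 1000 * n * log2(n) for a few sizes; the
       constants are invented purely to illustrate the crossover point. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (double n = 1024; n <= 1048576; n *= 4)
            printf("n=%8.0f  n^2=%14.0f  1000*n*log2(n)=%14.0f\n",
                   n, n * n, 1000.0 * n * log2(n));
        return 0;
    }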

------
abainbridge
I agree with his rant about "CS departments' algorithmic analysis" failing to
cover these real-world hardware issues. That was certainly true of my ~1997
undergrad CS course at the University of St Andrews, which was otherwise
exemplary.

The only thing I don't understand is why his data structure for finding the
least-recently-used item wasn't just a linked list, where each time you touch
an object you move it to one end of the list, and when you want to discard an
object you remove it from the other end. That'd be O(1) and would cause at
most one page fault per update, and none for the add and remove cases (since
the two ends would surely be paged in). And it'd be a lot easier than his
fancy thing. I guess I missed something.
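
Roughly the kind of thing I mean, as a rough, untested sketch (the struct and
field names here are made up, not Varnish's actual ones):

    /* Doubly-linked intrusive LRU list; hypothetical names, not Varnish's.
       lru_touch() also covers the "add" case for an object that isn't
       linked in yet (both pointers NULL). */
    #include <stddef.h>

    struct obj {
        struct obj *lru_prev, *lru_next;
        /* ... payload ... */
    };

    struct lru {
        struct obj *head;       /* most recently used */
        struct obj *tail;       /* least recently used, evicted first */
    };

    /* O(1): unlink o from wherever it is and push it to the head. */
    static void lru_touch(struct lru *l, struct obj *o)
    {
        if (l->head == o)
            return;
        if (o->lru_prev != NULL)
            o->lru_prev->lru_next = o->lru_next;
        if (o->lru_next != NULL)
            o->lru_next->lru_prev = o->lru_prev;
        if (l->tail == o)
            l->tail = o->lru_prev;
        o->lru_prev = NULL;
        o->lru_next = l->head;
        if (l->head != NULL)
            l->head->lru_prev = o;
        l->head = o;
        if (l->tail == NULL)
            l->tail = o;
    }

    /* O(1): pop the least recently used object off the tail. */
    static struct obj *lru_evict(struct lru *l)
    {
        struct obj *o = l->tail;
        if (o == NULL)
            return NULL;
        l->tail = o->lru_prev;
        if (l->tail != NULL)
            l->tail->lru_next = NULL;
        else
            l->head = NULL;
        o->lru_prev = o->lru_next = NULL;
        return o;
    }

Everything is just pointer swaps, so the only page that needs to be resident
is the one holding the object you touched anyway, plus the two ends.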

It might be worth pointing out that NVMe or Intel Optane would solve his
problem (1) without his fancy data structure too. Some might go as far as to
say that while his fancy data structure is outdated now, the stuff that the
CS departments teach has become correct again. Maybe St Andrews had the right
idea all along.

1) Or huge-pages or NUMA node pinning, to some extent.

~~~
zombieprocesses
> I agree with his rant about "CS departments' algorithmic analysis" failing
> to cover these real-world hardware issues.

I disagree. CS algorithms courses are about the logical/mathematical
understanding of algorithms. If you want to understand hardware implications,
you are generally taught that in a computer architecture class. Blaming the
algorithms class for not considering real-world hardware issues is like
blaming a theory-of-computation class because the Turing machine doesn't
factor in hardware issues. You abstract it out to make general logical
observations.

~~~
lorenzhs
I disagree. It should at least cover what aspects should be considered when
designing an algorithm for use in the real world. You dismiss the entire field
of experimental algorithmics as if it were nothing more than a bunch of
algorithmicists who know some computer architecture things, but there’s a lot
more to it.

I’m not arguing against teaching classical algorithm design and analysis,
they’re both incredibly important imho. But we should stop ignoring the
shortcomings of the models used, hand-waving them away with “the rest is just
a bit of engineering”. It’s not.

------
alecco
A similar but more sophisticated use of this was Intel Research's seminal
paper on FAST tree (index). Also from 2010.

[http://www.webislands.net/pubs/FAST__SIGMOD10.pdf](http://www.webislands.net/pubs/FAST__SIGMOD10.pdf)

------
dang
Past discussions:

[https://news.ycombinator.com/item?id=5586889](https://news.ycombinator.com/item?id=5586889)

[https://news.ycombinator.com/item?id=1426211](https://news.ycombinator.com/item?id=1426211)

------
frankmcsherry
If anyone wants to read more, there are papers (predating this article) on
cache-friendly/cache-oblivious priority queues, e.g.

[http://erikdemaine.org/papers/BufferTrees_STOC2002/paper.pdf](http://erikdemaine.org/papers/BufferTrees_STOC2002/paper.pdf)

The rough gist is that if the only thing you need to be able to do is insert
and pop-min, then you can buffer up roughly a page's worth of incomplete work
for each page, getting amortized performance that tracks the optimal number of
cache transfers at each level of the cache.

Further, in the paper above you are introduced (in the related work) to the
citation:

"A. LaMarca and R. E. Ladner. The influence of caches on the performance of
heaps. Journal of Experimental Algorithmics, 1, 1996."

which describes the author's "B-Heap" (at least, it describes a b-ary heap
where you pick b to match your cache line/page size).
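
For reference, the index arithmetic of such a b-ary (blocked) heap is tiny; a
rough sketch below of the plain LaMarca/Ladner-style layout, not PHK's exact
B-heap, which additionally skips a generation at each page boundary:

    /* 0-based d-ary heap indexing; pick D so a node's D children (one
       sibling group) land contiguously in a single cache line or VM page.
       Plain blocked layout, not the article's exact B-heap. */
    #include <stddef.h>

    #define D 8

    static inline size_t heap_parent(size_t i) { return (i - 1) / D; }

    /* k-th child of node i, for k in 0..D-1 */
    static inline size_t heap_child(size_t i, size_t k) { return D * i + 1 + k; }

Comparing all children of a node then touches one line/page instead of up to
D different ones.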

I'm guessing the problem isn't that academics don't understand these things so
much as that practicing programmers don't always read very much.

------
Goopplesoft
> ..., not to mention wasting enormous amounts of hardware and electricity.

I wonder how legitimate this concern is from an environmental angle. On the
one hand, a lot of code in the world probably has much worse performance
ratios than the one described in the article -- using Python is enough to
cause an order-of-magnitude performance loss. On the other hand, data centers
are using more renewable energy sources and reducing their carbon output.

Is this only a wasted-money problem? How worthwhile is it to write "green
code"? Back-of-the-envelope calculations on how much developer comfort costs
would be interesting.

------
throwaway080383
I had a quick question about this part:

> Two details are worth noting:
>
> * Once we leave a VM page through the bottom, it is important for performance that both child nodes live in the same VM page, because we are going to compare them both with their parent.
>
> * Because of this, the tree fails to expand for one generation every time it enters a new VM page in order to use the first two elements in the page productively.

Is the point just that you want to pack 8 elements into each page, but if the
tree expands the first generation of a page, then you end up only being able
to pack in 6?

------
matte_black
Binary Heap was always my favorite data structure because of how it would
improve my A* pathfinding performance. Guess I should have used B-Heap.

------
fithisux
Despite the arrogance, I would be more interested if there were a more
mathematically oriented explanation and/or a self-contained code sample, like
for example the TinyQueue JavaScript implementation (any language is welcome,
though).

------
kqr
Do I understand this correctly: to optimise for virtual memory, do the things
you'd do to optimise for the CPU cache except on a larger scale?

~~~
blattimwind
Yes. Virtual memory is backed by a cache (the page cache), so optimizing
algorithms to be "more cache friendly" works at that level too.

------
daddosi
There is a joke in there about economics but I can't think of it right now.

~~~
anonytrary
Care to explain further? I am genuinely interested in the reference.

