
You're doing it wrong: B-heap 10x faster than binary heap (2010) - eyegor
http://phk.freebsd.dk/B-Heap/queue.html
======
tempguy9999
This article appeared in the ACM a while ago[0] and it didn't go down entirely
well (some of the snarky comments I remember seem to be missing there, but
some remain [see Edit below]). He just laid out the data to match the memory
architecture and... that's it, really. Important but hardly a new field of
computer science.

[0]
[https://queue.acm.org/detail.cfm?id=1814327](https://queue.acm.org/detail.cfm?id=1814327)

Edit: Here's the full comments:
[https://queue.acm.org/fullcomments.cfm?id=1814327](https://queue.acm.org/fullcomments.cfm?id=1814327)

~~~
joe_the_user
From comment in link: _" So you reinvented the T-tree, plus some Straw Man
graphs and some comments on how you're smarter than the entire CS field.
Congrats?"_

Yeah, an incidentally-optimal-with-a-certain-architecture algorithm is great
as long as you can count on the architecture remaining constant.

"Cache oblivious" are one seemingly better approach since they're optimal to
some degree on a wide range of cache structures - but no theoretically as fast
purely in-memory structures, if they are somehow in a single memory type.

"Architecture aware" trees are yet another scheme, if the program can discover
the architecture.

Of course, a lot of algorithms are going to GPUs and how these structures
relate to GPU processing is another thing I'd be curious about.

~~~
rumanator
> Yeah, an incidentally-optimal-with-a-certain-architecture algorithm is great
> as long as you can count on the architecture remaining constant.

That's a theoretical problem that might happen sometime in the future.

Meanwhile the author's approach brings a 10x speedup in real-world
deployments?

Why the snark? Theoretical work is vital to understanding some problems, but
practical applications are where the real problem expresses itself, and where
real-world constraints manifest themselves.

~~~
ultrablack
It's not a bad idea. I did it myself in '98, down to making data fit in the L3
cache. Memory optimisation isn't really new.

~~~
nol3cachein1998
I'm calling BS because there was no L3 cache in 1998; the Pentium Pro only had L2.

~~~
saltcured
All the world was not x86. The DEC Alpha definitely had 3-level cache systems
by the mid 90s, and I would not be surprised if other RISC systems did as
well.

------
phkamp
I guess it is that season again ? :-)

This article seems to pop up about every other year, and pretty much garner
the same general reactions every time.

A lot of CS-majors immediately go into defensive mode, starting to spew
O(blablabla) formulas while thumbing their text-books.

That actually just underscores the main point I tried to communicate in that
article: CS education simplified the computing platform used for algorithmic
analysis so much that it has (almost) lost relevance.

The big-O has little to do with real-world performance any more, because of
the layers of caching and mapping going on. In practice O(n²) can easily be
faster than O(n) for all relevant n, entirely depending on ordering of access
in the implementation.
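
A minimal C++ sketch of the "ordering of access" point (not from the article):
both loops touch the same n integers, but the second makes every load depend on
a cache-missing pointer chase, so the identical big-O work runs many times
slower on typical hardware.

```cpp
// Same asymptotic work, two access orders: sequential streaming vs. a
// dependent chase through a shuffled permutation of the same array.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t n = 1 << 24;                        // ~16M elements
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});

    auto seconds = [](auto &&f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    };

    size_t sum = 0;
    double sequential = seconds([&] { for (size_t i = 0; i < n; i++) sum += next[i]; });

    size_t pos = 0;
    double chased = seconds([&] { for (size_t i = 0; i < n; i++) { pos = next[pos]; sum += pos; } });

    std::printf("sequential %.3fs, pointer-chase %.3fs (checksum %zu)\n",
                sequential, chased, sum);
}
```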

Also, CS-education tends to gloss over important footnotes. Take quicksort's
worst-case performance, which even Wikipedia describes as "rare". It is not.
Sorted data are very, very common.
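
As a concrete illustration of that footnote (a toy sketch, not from the
article): a textbook quicksort with a naive first-element pivot does roughly
n²/2 comparisons on already-sorted input, because every partition is maximally
lopsided.

```cpp
// Naive first-element-pivot quicksort: already-sorted input is its worst case.
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

static long comparisons = 0;

void quicksort(std::vector<int> &a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[lo], i = lo;                    // naive pivot choice
    for (int j = lo + 1; j <= hi; j++) {
        comparisons++;
        if (a[j] < pivot) std::swap(a[++i], a[j]);
    }
    std::swap(a[lo], a[i]);
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);
}

int main() {
    std::vector<int> v(20000);
    std::iota(v.begin(), v.end(), 0);             // the "rare" sorted case
    quicksort(v, 0, static_cast<int>(v.size()) - 1);
    std::printf("%ld comparisons for n = %zu (~n*n/2)\n", comparisons, v.size());
}
```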

Originally researchers were more careful about these issues, and they operated
with several distinct big-O's. Re-read TAoCP about sorting on tape-drives for
a classic example.

Some pull out "Cache-Oblivious Algorithms", usually having never actually
tried to implement, benchmark or use any of them.

They are, as a rule, horribly complex, and seldom faster in practice, because
they hide a very large performance constant in their complexity.

More often than not, there is no available reference implementation, so
people cannot easily try them out, and none of them are fast enough to
compensate for the embuggerance of the software patents which cover them.

There are usually one or two who claim that "Virtual Memory" is no longer a
relevant concern and that one should just put "enough RAM" in the machine.

Throwing hardware at the problem is a nice job if you can get it, and a nice
hobby if you can afford it.

However, even if the money is there, what about the energy consumption ?

Our climate is rapidly deteriorating because of greenhouse gases from our
energy-production, and more and more algorithms run on battery-power.

If I were writing a CS-thesis or CS-textbook today, it would be all about
big-E() notation, and how it only weakly correlates with big-O() notation.

~~~
staticassertion
I dropped out of a not very good program, but even in my time there we did
learn that O(N^2) can (and almost certainly will) be faster than O(N) on most
architectures due to data locality and large constants hidden by Big O
notation.

I'm not asking this to be snarky, again I'm a dropout and don't have a ton of
insight into this sort of thing - do most CS programs not cover this in at
least some detail at some point?

~~~
Izkata
Mine not only did not (2006-2010): when the professor was demonstrating
something, he dropped a constant that I knew from prior experience would
overpower the big-O in question, then danced around it when I tried to ask
about it.

Didn't even try to go down the path of "we're focusing on big-O, so we're only
ignoring it for this particular analysis", he just pretended the term didn't
exist.

~~~
staticassertion
Interesting, thanks. I suppose it isn't necessarily part of a core curriculum
(though I did not make it through the curriculum, which is why I was curious).

------
asdfasgasdgasdg
absl::btree_map and absl::btree_set are drop-in replacements for std::map and
std::set that use b-trees under the covers. Worth noting that this simulation
doesn't factor in pressure on the CPU cache, which means that even if you
aren't under VM pressure, b-trees can still offer a speedup. They are also
gentler to the allocator.

[https://abseil.io/docs/cpp/guides/container#b-tree-
ordered-c...](https://abseil.io/docs/cpp/guides/container#b-tree-ordered-
containers)
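
A sketch of what the drop-in swap looks like, assuming Abseil is available on
your include and link path (the header and type names are as in the Abseil docs
linked above):

```cpp
// Only the include and the type name change relative to std::map; the
// interface and ordered iteration stay the same, but each node holds many
// keys, so lookups touch far fewer cache lines / pages.
#include <cstdio>
#include <string>

// #include <map>                          // before: std::map<std::string, int>
#include "absl/container/btree_map.h"      // after:  absl::btree_map<std::string, int>

int main() {
    absl::btree_map<std::string, int> hits;
    hits["foo"] += 1;
    hits["bar"] += 2;
    for (const auto &[key, count] : hits)   // still iterated in key order
        std::printf("%s %d\n", key.c_str(), count);
}
```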

------
kragen
A lot of people seem to be missing that Varnish is a two-billion-dollar
company, Fastly, largely but far from entirely as a result of PHK doing a good
job of optimizing it. This is not a question of freshman homework quibbling.

------
beagle3
Constant speedup, but you need to optimize for page size; if you optimized for
the wrong page size, you may get a lower constant, perhaps even < 1.

And the computation is much less trivial.

Still, an interesting and important observation - page sizes change every
decade or two, and almost exclusively upwards, so this is worth pursuing if
you need to squeeze more speed from your heap.
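
As a rough illustration of tuning to the page size at run time rather than
baking in a constant (the node size below is a made-up figure, purely
illustrative):

```cpp
// Ask the OS for its page size and derive the heap fanout from it, instead of
// hard-coding 4096. (POSIX: sysconf(_SC_PAGESIZE) is standard on Linux/BSD.)
#include <cstdio>
#include <unistd.h>

int main() {
    long page_bytes = sysconf(_SC_PAGESIZE);   // typically 4096, but not always
    const long node_bytes = 16;                // hypothetical key + payload per node
    std::printf("page size %ld bytes -> %ld heap nodes per page\n",
                page_bytes, page_bytes / node_bytes);
}
```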

~~~
swiftcoder
> if you need to squeeze more speed from your heap

Everyone needs to "squeeze more speed from their heap". CPU-bound problems are
quite rare in modern software - despite the legions of software engineers who
will stare at a CPU profile that says 100% utilisation, and declare "it's CPU
bound, nothing to be done".

(this is a pet peeve of mine, in case you couldn't tell)

~~~
devbug
Latency kills: memory latency, network latency, and storage latency. In that
order.

~~~
jhayward
> _Latency kills: memory latency, network latency, and storage latency._

I think the point of the article is that

> _memory latency_

Is actually a hierarchy of its own, and non-linear depending on access
patterns.

Instruction fetch latency, branch mis-predict latency, cache hit/miss latency
(L1 through L3), cache pressure, register pressure, TLB hit/miss, VM pressure,
etc. all become significant at scale.

Just knowing big-O/big-Ω characteristics isn't enough.

------
perl4ever
"VM pressure, measured in the amount of address space not resident in primary
memory"

Ok, question from someone clueless without recent low-level programming
experience - is this in fact a normal situation on modern systems?

Because I had the general impression that exactly because the gulf between
storage and memory speeds has widened so much, that virtual memory is more of
a historical appendix than anything.

My laptop has 8 GB of memory, which I think is fairly small these days, so how
likely is it that I would be using approaching, but less than, 8 MB more than
that - between 100% and 100.01% of my memory?

...I went and looked for the date of this article, and it's almost 10 years
old, so perhaps things have changed...

~~~
mcguire
Poul-Henning Kamp is rather famous for writing code that uses mmap and a
straightforward data structure for giant blobs of data and arguing that the
kernel memory manager should handle the details. To my knowledge, most people
dealing with data larger than memory use more active techniques.

~~~
hinkley
A common failure mode for this is that devs and less sophisticated Ops people
see that the machine is only “using” 60% of available memory and they carve
out a chunk for another process or a feature. They don’t get that file caching
and mmapping are providing them a lot of their performance, and aren’t clearly
accounted for by ‘free’.

So suddenly we are at 90% memory and everything is slow as molasses.

~~~
mcguire
Or say things like, " _had 90% of their CPU available for twiddling their
digital thumbs_ ". CPU usage is irrelevant if all your processes are blocked
on i/o.

------
malisper
The tl;dr is that for many use cases, you can get order of magnitude
improvements by optimizing constant factors. In their case, they made a disk
optimized version of a binary heap by gathering nodes into pages. This is
better because it's faster to read one 4 KB page from disk than it is to
perform multiple 64-byte reads from disk.

I think what's even more interesting are cases where algorithms with worse
asymptotic complexity perform better than algorithms with a better asymptotic
complexity. You see this all the time in the database world, since you are
often better off optimizing for disk operations than for pure asymptotic
complexity.

The main example that comes to mind is that if you are performing a GROUP BY
on a dataset that won't fit in memory, it's usually faster to calculate the
GROUP BY with mergesort than it is with a hash-table. Hash tables are
notoriously bad for disk performance. Every lookup or insert becomes one disk
read or write. This is in contrast to mergesort where you can take two sorted
files, and in one pass merge the two files together. This leads to a much
smaller total number of disk reads and writes.
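
A minimal in-memory sketch of that idea (real engines do this with sorted runs
spilled to disk, but the merge step is the same): once the runs are sorted by
key, the GROUP BY is one sequential pass, so the I/O is streaming rather than
random hash-table probes.

```cpp
// Merge two key-sorted runs, summing values per group in a single pass.
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

using Row = std::pair<std::string, long>;  // (group key, value)

std::vector<Row> merge_group_by(const std::vector<Row> &a, const std::vector<Row> &b) {
    std::vector<Row> out;
    size_t i = 0, j = 0;
    auto emit = [&](const Row &r) {
        if (!out.empty() && out.back().first == r.first)
            out.back().second += r.second;     // same group: fold into last row
        else
            out.push_back(r);
    };
    while (i < a.size() || j < b.size()) {
        if (j == b.size() || (i < a.size() && a[i].first <= b[j].first))
            emit(a[i++]);
        else
            emit(b[j++]);
    }
    return out;
}

int main() {
    std::vector<Row> run1 = {{"apple", 1}, {"apple", 2}, {"pear", 5}};   // already sorted
    std::vector<Row> run2 = {{"apple", 4}, {"banana", 3}, {"pear", 1}};
    for (const auto &[key, sum] : merge_group_by(run1, run2))
        std::printf("%s %ld\n", key.c_str(), sum);   // apple 7, banana 3, pear 6
}
```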

Two other cases like this that come to mind are Fibonacci heaps vs binary
heaps[0] and matrix multiplication algorithms[1].

[0]
[https://en.wikipedia.org/wiki/Fibonacci_heap#Practical_consi...](https://en.wikipedia.org/wiki/Fibonacci_heap#Practical_considerations)

[1]
[https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...](https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm)

~~~
hinkley
The log(n) that dominates in search algorithms relates to the height of the
tree. The larger block size doesn’t just improve the locality of reference for
finding a node, it also substantially reduces the height of the tree.

------
Franciscouzo
The problem is that the memory model most commonly used in algorithm
performance research is not true in the real world: O(1) memory access. In the
real world, memory access is not an O(1) operation; it takes a different
amount of time to access memory at different levels of the cache hierarchy.

[https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html](https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html)

~~~
jstimpfle
Memory access time is bounded by a constant time - given by the last level in
the hierarchy (which would be the SSD or HD in this case). So, access time is
indeed O(1).

What's not true in the real world is that the actual value of the constant can
be ignored. Memory or disk access on the one hand, and simple arithmetic
operations on the other, are both O(1) but the difference is significant.

~~~
CalChris
No, it isn't bounded by the access time of the last level in the hierarchy.
You could miss in the TLB before you even get to the storage hierarchy. Your
TLB miss may even require a page table walk with more misses. You could even
get rescheduled because your quantum was exceeded.

You probably think this never happened but it used to _always_ happen. The
Linux scheduler used to traverse a linked list of processes (because Linus didn't
know any better). Once the number of processes being scheduled exceeded
something like 64, it slowed to a crawl consuming most of the quanta just
rescheduling.

~~~
jstimpfle
Yeah, there are finer points to consider to get at an actual bound, and there
might be some quirks and infelicities on some platforms. But this is not about
competing processes on the system. For matters of asymptotic analysis, the
access time is still O(1), i.e. bounded by a constant. If it wasn't, what
would it be?

------
totalperspectiv
Isn't this just a version of a cache-oblivious algorithm, using van Emde Boas
ordering to make maximal use of the bytes pulled into cache on each fetch? Or
is there more to it that I'm not seeing?

Good link to a similar discussion: An Introduction to Cache-Oblivious Data
Structures -
[https://news.ycombinator.com/item?id=16317613](https://news.ycombinator.com/item?id=16317613)

------
inlined
This reminds me of a Mongo optimization problem I faced years ago. Given the
query {x:A, y:B}, where x is a scalar and y is an array, it was 40x faster to
index [x, y] than [y, x].

Both indexes will show a query plan that's index-only and will hit the same
number of records. Locality of memory in that index (since y had repeats) was
hugely important.

------
elif
I have a less negative reaction than most commenters seem to. I am not fazed
by the snarky know-it-all attitude of the author, and it actually makes his
point more concise and clear IMO.

I also can't help but appreciate the anecdote of abstraction hiding exactly
what needs to be shown for performance computing considerations.

------
dgaudet
for a long time i used a variation of this as an interview question for
performance analysis positions at google. instead of focusing on I/O costs i
was focusing on cache miss costs, but it's pretty much the same observation at
a different scale. it's always seemed like an excellent educational example of
the difference between theoretical and practical concepts in computer
engineering, such as big-O and turing machine vs. practical problem sizes and
hierarchical storage/cache capacities/costs.

------
FullyFunctional
The problems start in the first sentence. "Optimal" doesn't mean what Poul
thinks, so he's attacking a straw man. The data structure he poo-poos is CS
101 material. Anyone doing serious development knows better and uses
structures appropriate for the underlying hardware (which, e.g., on an FPGA
might be vastly different from an embedded processor, and typically different
from a workstation).

Databases have used B-trees and variants for at least 50 years, not binary
trees, so obviously we know this.

Much ado about absolutely nothing.

EDIT: typos

------
mcnichol
Can I get some clarity on pink bits?

Haven't heard the term and googling around, I am certain the author isn't
talking about what I am finding at the top.

~~~
thwarted
> Today Varnish is used by all sorts, from FaceBook, Wikia and SlashDot to
> obscure sites you have surely never heard about, many of which serve mostly
> pink bits, lots of pink bits.

Adult content.

~~~
mcguire
It's hard to overstate the importance of adult content to progress in
networking, specifically.

~~~
dclusin
Adult content (and the companies behind it) has always been quick to utilize
new technologies: JPEGs, the first online payments, massive video services,
etc. Its presence is even felt in places you might not expect, like Reddit and
Imgur. I remember reading somewhere that at one point 50% of Reddit's traffic
was adult in nature.

It took Netflix until 2015 to beat out porn and piracy as the top bandwidth
consumer on the public internet[1].

1 - [https://learnbonds.com/news/netflix-inc-nflx-now-uses-
more-b...](https://learnbonds.com/news/netflix-inc-nflx-now-uses-more-
bandwidth-than-porn/)

------
jmull
Hm... this is interesting, but I'm wondering whether it makes more sense to
compare the b-heap to other block-oriented data structures? I'm thinking of
the b-tree and its variants.

~~~
proverbialbunny
Not a benchmark, but RRB-trees are pretty neat. They give effectively O(1) for
all operations.

In modern languages like Scala, the Vector type is an RRB-tree under the hood:
[https://docs.scala-
lang.org/overviews/collections/performanc...](https://docs.scala-
lang.org/overviews/collections/performance-characteristics.html)

For more fun: [https://youtu.be/sPhpelUfu8Q](https://youtu.be/sPhpelUfu8Q)

~~~
mruts
Unfortunately, Scala’s vector type actually sucks big time in regards to
performance:

[http://www.lihaoyi.com/post/BenchmarkingScalaCollections.htm...](http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html)

Moreover if you want random access just use Arrays (especially now that 2.13
has immutable ones). The only use case for Vector is when you don’t know the
collection size up front _and_ need random access. Though even then it
honestly might just be a better idea to build up a list and then convert it to
an array.

------
LaserToy
I mean, isn't array vs linked list similar? In theory linked lists have a lot
of good applications, but in real life they don't.

~~~
nilsb
Indeed, this is why for C++ people are generally told to use std::vector (i.e.
arrays) when some other data structure may in theory be more appropriate.

------
Ciberth
Out of pure curiosity: "... That is where it all went pear-shaped: that model
is totally bogus today. ..." I agree that most concepts about operating
systems are dated, but where are the good resources that give an overview
without being overwhelming?

------
andromaton
Hey, I object to the ZX81 C64 and TRS-80 being called toys!

------
martamoreno2
I wonder if he realizes that CS uses this thing called O-notation; it's really
useful. A 10x factor is, by the very definition of complexity theory, a wash.
Nobody cares. Your implementation might be faster if it is optimized for a
certain architecture, CPU instructions, CPU cache and so forth. It literally
doesn't mean anything; it certainly has nothing to do with being more
"efficient" or with Knuth not being able to make it more efficient. Optimality
is discussed in O-notation, not in some contrived laboratory constant factors
in front of it.

~~~
kstrauser
People writing the AWS checks care a whole awful lot about that insignificant
constant.

------
purplezooey
A bit of a specialized use case.

