
On modern hardware the min-max heap beats a binary heap - pedro84
https://probablydance.com/2020/08/31/on-modern-hardware-the-min-max-heap-beats-a-binary-heap/
======
noctune
And if you only need a monotone priority queue (i.e. a priority queue where
the popped elements are monotonically increasing/decreasing) you should
consider using a radix heap. This monotonicity requirement is satisfied more
often than you would expect, e.g. when using time as the key or in
pathfinding. I have a simple radix heap implementation here:
[https://github.com/mpdn/radix-heap](https://github.com/mpdn/radix-heap)
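
For illustration, here is a rough C++ sketch of the textbook version of the
idea (the repo above is Rust; the class and names below are made up for this
comment and are not its API):

    // Minimal radix heap sketch for uint64_t keys. Invariant: every key
    // pushed is >= the key most recently popped (the monotone requirement).
    #include <algorithm>
    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct RadixHeap {
        uint64_t last = 0;                         // most recently popped key
        std::array<std::vector<uint64_t>, 65> buckets;
        std::size_t count = 0;

        // Bucket 0 holds keys equal to `last`; otherwise the bucket index is
        // the position of the highest bit in which key and `last` differ.
        static int bucket_of(uint64_t key, uint64_t last) {
            int b = 0;
            for (uint64_t x = key ^ last; x != 0; x >>= 1) ++b;
            return b;                              // 0..64
        }

        void push(uint64_t key) {
            assert(key >= last);                   // monotonicity requirement
            buckets[bucket_of(key, last)].push_back(key);
            ++count;
        }

        uint64_t pop() {
            assert(count > 0);
            if (buckets[0].empty()) {
                // Find the first non-empty bucket, advance `last` to its
                // minimum, and redistribute; every key lands in a strictly
                // lower bucket, which is what bounds the total work.
                std::size_t i = 1;
                while (buckets[i].empty()) ++i;
                last = *std::min_element(buckets[i].begin(), buckets[i].end());
                for (uint64_t k : buckets[i])
                    buckets[bucket_of(k, last)].push_back(k);
                buckets[i].clear();
            }
            buckets[0].pop_back();
            --count;
            return last;
        }
    };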

------
twotwotwo
Like folks mentioned there, I wonder if a higher-fanout heap (people asked
about 4-ary) might also do well in practice. Looking at Wikipedia
([https://en.wikipedia.org/wiki/D-ary_heap](https://en.wikipedia.org/wiki/D-ary_heap)),
it looks like maybe so -- the growth in number of comparisons isn't that bad,
and it mentions 4-ary heaps working well specifically.

(Like how linear search wins for small N, or insertion sort helps with small
subarrays in introsort: more steps, but each step is also a lot cheaper on
modern hardware due to fewer branch mispredictions, better cache locality, or
something else that better fits the hardware's strengths.)
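
As a sketch of what that looks like (generic textbook code, not anything from
the article): the children of node i in a 4-ary heap live at 4i+1 through
4i+4, so the tree is half the height of a binary heap and the four children
sit next to each other in memory.

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Restore the min-heap property below node i in a 4-ary heap stored in a
    // flat vector. More comparisons per level than a binary heap, but half as
    // many levels, and the four children share cache locality.
    template <typename T>
    void sift_down_4ary(std::vector<T>& heap, std::size_t i) {
        const std::size_t n = heap.size();
        for (;;) {
            std::size_t first = 4 * i + 1;
            if (first >= n) return;                    // i is a leaf
            std::size_t end = std::min(first + 4, n);
            std::size_t best = first;                  // smallest child
            for (std::size_t c = first + 1; c < end; ++c)
                if (heap[c] < heap[best]) best = c;
            if (!(heap[best] < heap[i])) return;       // already a heap
            std::swap(heap[i], heap[best]);
            i = best;
        }
    }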

~~~
barbecue_sauce
I don't have a formal background in CS. Any insight into why "d-" is used as
the prefix?

~~~
ncmncm
There are only two important things taught in CS that most people don't seem
to pick up on the job. The most important is order notation. The second is the
relation between grammars and state machines. Both are worth as much attention
as you can afford for a week.

Things not taught in CS that you need to know on the job are legion. By far
the most important of these is the use of invariants. Second might be the
memory hierarchy, and cache semantics. Third might be use of bitmaps and
bitwise operations.

~~~
iudqnolq
I'm learning about invariants in a fairly introductory CS course, and I've
been told they end up being relatively useless in actual practice. When have
you found them useful?

~~~
ncmncm
Invariants are essential when designing a system, to ensure that the resulting
system can be understood by the actual humans who will be responsible for
maintaining it.

They are right that invariants will not be essential for your assignments, at
least for the first couple of years, although they would save you from some
dead ends.

You will be able to code some of your invariants as assertions, but as often
as not you won't. The ones you can't are the ones you have to pay the most
attention to.

If the system invariants get too complicated, or hard to express, the design
is wrong--reliably. That is when they are most valuable. Millions of systems
designed without invariants could have been right, but we're stuck with what
we got, instead.
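
A made-up example of the assertion case, sticking with the thread's topic: the
heap property is an invariant you can check mechanically in debug builds,
whereas something like "this lock is always taken before that one" usually
can't be reduced to a single assert.

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Debug-only check of the min-heap invariant: no element is smaller than
    // its parent.
    template <typename T>
    bool is_min_heap(const std::vector<T>& v) {
        for (std::size_t i = 1; i < v.size(); ++i)
            if (v[i] < v[(i - 1) / 2]) return false;
        return true;
    }

    // e.g. at the end of every mutating operation on the heap:
    //   assert(is_min_heap(heap));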

------
gliese1337
I have used a min-max heap once. I don't remember why I needed it at the time
-- it was a previous job -- but I do remember that I had to roll my own,
because it's just not that popular of a data structure, and it was the obvious
and only good solution to the problem at the time.

So, it's nice to see a detailed analysis of the structure like this! Perhaps
if it becomes more popular, I will find more places to use it.

~~~
mav3rick
IIRC it's used in chess programs to evaluate moves.

~~~
Hello71
you're thinking of minimax.

~~~
mav3rick
Ah I was wrong. Yes, you are absolutely correct.

------
ncmncm
> " _...C++20 finally added a way of accessing this instruction, using
> std::countl_zero. I don’t have C++20 though so I had to do it the old
> platform specific ways ..._ "

You don't need C++20. Even C++98 had std::bitset<>::count(), which has a nice
conversion from unsigned long and which, when compiled for a target with the
POPCNT instruction (SSE4.2-era hardware or later), compiles down to a single
POPCNT. It is pretty simple to produce various other results, including
countl_zero, from a popcount with a handful of additional bitwise ops.

Modern compilers are happy to keep a bitset<32> in the same register as an
unsigned long, so the conversion takes exactly zero cycles. POPCNT takes three
cycles, and the extra bitwise ops another couple.
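
For reference, one way to spell that out (the standard bit-smearing trick, not
necessarily the exact sequence ncmncm has in mind): smear the leading one bit
down across all lower positions, then the leading-zero count falls out of the
popcount.

    #include <bitset>
    #include <cstdint>

    // countl_zero for 32-bit values without C++20: after the shifts, x has
    // ones from its original most significant set bit all the way down, so
    // popcount(x) == 32 - countl_zero(original x).
    inline int countl_zero32(uint32_t x) {
        x |= x >> 1;
        x |= x >> 2;
        x |= x >> 4;
        x |= x >> 8;
        x |= x >> 16;
        return 32 - static_cast<int>(std::bitset<32>(x).count());
    }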

------
ghj
In certain domains, the trend has been to give up constant factors in order to
increase programmer productivity (e.g., Python pays a 10x slowdown but comes
batteries-included).

So in that case I would use this data structure even if it weren't faster. I
can't count the number of times I have had to mess with inserting negative
priorities into a min heap to create a max heap! We should just have one data
structure that does both.

(though taking this idea to its logical extreme means we should just use an
order statistic tree for everything, since it not only gives you O(log n)
min/max but also O(log n) select-kth and rank-of-x)
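
For anyone who hasn't run into the workaround being described, here it is with
std::priority_queue (which is a max-heap by default, so the negation runs the
other way than with, say, Python's heapq); a min-max heap, or any double-ended
priority queue, makes both variants unnecessary:

    #include <cstdio>
    #include <functional>
    #include <initializer_list>
    #include <queue>
    #include <vector>

    int main() {
        // A min-heap faked by negating keys on the way in and out.
        std::priority_queue<int> negated;
        for (int x : {5, 1, 9}) negated.push(-x);
        std::printf("min via negation: %d\n", -negated.top());   // prints 1

        // The same thing stated directly with a comparator.
        std::priority_queue<int, std::vector<int>, std::greater<int>> min_heap;
        for (int x : {5, 1, 9}) min_heap.push(x);
        std::printf("min via comparator: %d\n", min_heap.top()); // prints 1
    }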

~~~
twotwotwo
So, kind of a non sequitur, but I occasionally think about Qt's QList<T>. When
it was introduced, it was unusual in reserving space at the beginning as well
as the end, so you got constant-time inserts/removals at the beginning. That,
together with storing pointers for any T larger than a pointer, also reduced
the constant factor on insertions/deletions from the middle. Since it was
always stored as a list of pointer-sized items, some code could be shared
between the implementations for different types, so your program's binary
could be a bit smaller. I think they said at the initial release that they
benchmarked a bunch of options on some real code and this design won.
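
Roughly this layout, as a toy sketch (my own simplification, not Qt's actual
code, which also indirects through pointers for large T and shares the buffer
copy-on-write):

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // One contiguous buffer with free slots kept at both ends, so push_front
    // and push_back are both amortized O(1). This simplified form requires T
    // to be default-constructible and copy-assignable.
    template <typename T>
    class BothEndsBuffer {
        std::vector<T> buf_;              // backing storage
        std::size_t begin_ = 0, end_ = 0; // live elements are buf_[begin_, end_)

        // Reallocate with the live range re-centred so both ends get slack.
        void regrow() {
            std::size_t n = end_ - begin_;
            std::vector<T> next(std::max<std::size_t>(4, 3 * n));
            std::size_t new_begin = next.size() / 3;
            std::copy(buf_.begin() + begin_, buf_.begin() + end_,
                      next.begin() + new_begin);
            buf_ = std::move(next);
            begin_ = new_begin;
            end_ = new_begin + n;
        }

    public:
        void push_back(const T& v)  { if (end_ == buf_.size()) regrow(); buf_[end_++] = v; }
        void push_front(const T& v) { if (begin_ == 0) regrow();        buf_[--begin_] = v; }
        const T& operator[](std::size_t i) const { return buf_[begin_ + i]; }
        std::size_t size() const { return end_ - begin_; }
    };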

That was waaay at odds with the mindset of the STL, which seemed to me, and I
think to a lot of folks, like the state of the art in containers at the time:
with the STL, if you prepend/insert much you need to use a deque or a linked
list, you decide pointer vs. value each time you declare a type, and if you
want to know about the constant factor on an insert in the middle of the list
you've probably already lost.

(Qt has/had some other fun pragmatic things, like copy-on-write types in a lot
of places. I didn't really do nearly enough with it to discover the pitfalls,
but it was fun to read.)

So I think about that, about how there are likely some trees out there that
could just as well be sorted arrays, about adaptive structures that change
their behavior based on dynamically observing how they're used (one recently
discussed here:
[http://blog.erlang.org/the-new-scalable-ets-ordered_set/](http://blog.erlang.org/the-new-scalable-ets-ordered_set/)),
even about tricks in JS string implementations and VMs. I don't have answers,
but I sometimes wonder if we should be thinking _more_ in terms of tools that
may not be perfect on all axes but soften or eliminate performance cliffs we
take for granted. Not that the zero-overhead fine-tuned types are going away,
just that they aren't the only imaginable approach.

I wonder if we'll see progress in those sorts of imperfect-but-resilient tools
over time. I think the long-run trend is towards _some_ form of programming
being accessible to more people more easily (like how, as you note, folks do
real work in Python now). That certainly fits with trying to make your tools
resilient to "misuse" -- or, better, properly supporting more patterns so
they're not even misuse. No conclusions, just hand-waving, but I do wonder if
anyone working in software construction tools will eventually more usefully
wave their hands in that direction :P

~~~
ghj
You would probably enjoy reading about an alternative implementation of STL's
std::deque, called a tiered vector[1][2].

It supports O(1) push_front/pop_front/push_back/pop_back/operator[]/at (like
you would expect from a deque) but also O(sqrt(N)) insert and remove in the
middle!!!

The papers are from 1999 and 2001, but I only learned about it from a recent
HN post[3] where some guy rediscovered it (20 years too late). I still wonder
why this design lost out to the std::deque we have in the STL today.

[1]
[http://www.cphstl.dk/Report/Deque-first-attempt/cphstl-report-2001-4.pdf](http://www.cphstl.dk/Report/Deque-first-attempt/cphstl-report-2001-4.pdf)

[2]
[https://www.ics.uci.edu/~goodrich/pubs/wads99.pdf](https://www.ics.uci.edu/~goodrich/pubs/wads99.pdf)

[3]
[https://news.ycombinator.com/item?id=20872696](https://news.ycombinator.com/item?id=20872696)
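
A toy sketch of the tiered-vector idea, under a pile of simplifying
assumptions (fixed block size, insert-only, linear scan to find the block; the
papers above do all of this properly): keep blocks of roughly sqrt(N)
elements, pay O(block) to insert inside one block, then O(1) per later block
to ripple the overflow along.

    #include <cstddef>
    #include <deque>
    #include <vector>

    template <typename T, std::size_t BLOCK = 64>
    class TieredVector {
        // Each block is a double-ended queue of at most BLOCK elements.
        std::vector<std::deque<T>> blocks_ = std::vector<std::deque<T>>(1);

    public:
        void insert(std::size_t pos, const T& value) {
            // Find the block containing `pos` (linear scan for clarity).
            std::size_t b = 0;
            while (b + 1 < blocks_.size() && pos > blocks_[b].size()) {
                pos -= blocks_[b].size();
                ++b;
            }
            blocks_[b].insert(blocks_[b].begin() + pos, value);
            // Ripple the overflowing last element of each too-full block into
            // the front of the next block; each step is O(1).
            for (; blocks_[b].size() > BLOCK; ++b) {
                if (b + 1 == blocks_.size()) blocks_.emplace_back();
                blocks_[b + 1].push_front(blocks_[b].back());
                blocks_[b].pop_back();
            }
        }

        const T& operator[](std::size_t i) const {
            std::size_t b = 0;
            while (i >= blocks_[b].size()) { i -= blocks_[b].size(); ++b; }
            return blocks_[b][i];
        }
    };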

~~~
twotwotwo
This made me search around for unsorted B-tree-like collections. It turns out
Clojure[1] and the im immutable-collections crate in Rust[2] use a structure
called an RRB-tree, and searching 'blist' (because why not) turned up a Python
module[3]. So, yes, it's a thing!

The tricky bit is working well when used like a plain vector too. Clojure's
RRB-tree special-cases repeated small appends to the end to help with this,
and the Rust page mentions a clever chunking technique for append/prepend. And
I _imagine_ largish leaf pages might help with average
prepend/append/iteration overhead, at some cost when inserting in the middle?

Anyway, just neat that it's been taken further than I realized!

[1]
[https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf](https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf)

[2]
[https://docs.rs/im/15.0.0/im/struct.Vector.html](https://docs.rs/im/15.0.0/im/struct.Vector.html)

[3] [https://pypi.org/project/blist/](https://pypi.org/project/blist/)

------
nickcw
If you need a min-max heap (or double-ended heap) in Go, here is one I've used:
[https://github.com/aalpar/deheap](https://github.com/aalpar/deheap)

Very useful when you need it!

------
cellularmitosis
It would be neat to fire this up on an older processor which doesn’t have
modern instruction-level parallelism and verify the difference in performance.

~~~
brandmeyer
On x86 you'd have to search pretty far back before the available ILP really
dropped off. Some of the lower-end OoO ARMs might be a good testing ground,
though. Say, a Raspberry Pi 4? Earlier-gen RPi used in-order cores.

------
usefulcat
I've been using the author's flat_hash_map
([https://github.com/skarupke/flat_hash_map](https://github.com/skarupke/flat_hash_map))
for several years now and have been really impressed. I've yet to find a
single bug; it's nearly as fast as folly's hash maps (for my use case, anyway)
but far easier to integrate than folly.

------
gpderetta
Impressive. Looking forward to the d-heap article.

~~~
IshKebab
Yeah, I was going to say: it sounds like it's faster because it has higher
arity rather than because it tracks both min and max. So if you only need min
or max, a d-heap is probably better. Hopefully he will update the article.

[https://en.wikipedia.org/wiki/D-ary_heap](https://en.wikipedia.org/wiki/D-ary_heap)

(Also I didn't know they were called d-heaps, thanks!)

------
ww520
Wow. This is a very good analysis of a fundamental algorithm. I haven’t seen a
high-quality analysis like this in a good while.

