
Binary Search Is a Pathological Case for Caches (2012) - ot
http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/
======
jandrewrogers
This article demonstrates why performance-intensive software systems (like
database engines) implement many of their own data structures and algorithms,
often not even reusing those implementations internally, and avoid standard
libraries for many use cases. The suboptimality of any particular use case can
be quite large relative to a purpose-built implementation for that spot in the
software. It is a lot more work but it can also dramatically improve real-
world throughput and variance because you have visibility into exactly what
the access pattern will be.

If you measure these things at large scales you will see surprisingly smooth
and regular constructive interference patterns that emerge from the
interaction of boundary constants in the software and hardware. This is a
difficult type of variance to eliminate. The obvious solution is to inject
pseudo-randomization into scheduling and boundary constants. This dampens the
patterns but even if you turn the randomization knobs pretty high the patterns
do not go away! And you can't go too crazy because that randomization has its
own performance costs as the algorithms move further away from their
theoretical optimum.
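
The dampening-but-not-eliminating effect is easy to reproduce in a toy model. The sketch below (my own illustration, not from the comment) treats two periodic access streams landing on the same slot of a 64-slot resource as a stand-in for cache-set aliasing, and jitters one stream's stride:

```python
import random

SLOTS = 64  # toy stand-in for the number of cache sets / banks

def collisions(stride_a, stride_b, steps, jitter=0):
    """Count how often two periodic access streams land on the same slot.

    With jitter=0 and identical strides the streams stay locked in phase;
    adding jitter to one stream dampens the collisions without eliminating
    them, mirroring the comment's observation.
    """
    random.seed(1)                  # deterministic for the example
    a = b = 0
    hits = 0
    for _ in range(steps):
        a = (a + stride_a + random.randint(0, jitter)) % SLOTS
        b = (b + stride_b) % SLOTS
        hits += (a == b)
    return hits
```

With no jitter and identical strides, every one of the 10,000 steps collides; with a little jitter the count drops sharply but stays well above zero, just as described.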

Some latency hiding computing architectures, like barrel processors, actually
hash memory addressing in hardware to eliminate locality so that you do not
see these kinds of locality and aliasing effects. However, most modern CPUs
are specifically designed to exploit locality so this technique generalizes
poorly.
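
A toy sketch of the address-hashing idea (illustrative only; real hardware hash functions differ): XOR-folding the line number before picking a bank spreads a pathological power-of-two stride across all banks, where simple modulo interleaving would alias every access onto one bank.

```python
BANKS = 16
LINE = 64   # bytes per line

def bank_mod(addr):
    """Plain interleave: consecutive lines go to consecutive banks."""
    return (addr // LINE) % BANKS

def bank_hash(addr):
    """XOR-fold the line number so power-of-two strides still spread out."""
    line = addr // LINE
    return (line ^ (line >> 4) ^ (line >> 8)) % BANKS
```

A stride of `BANKS * LINE` bytes hits a single bank under modulo interleaving, but all 16 banks under the hashed mapping.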

------
IgorPartola
This to me signals that our hardware is just not designed for the kind of
workloads we actually want to do. Trying to control cache lines is akin to
trying to drive a car from the back seat using a hanger, a pool cue, and rope.
You can do it, but why not get in the driver's seat? Sadly, the modern CPU
doesn't let you have the type of explicit control you really want over the
very fast memory here. You are always at the mercy of some specifics of how
the automagic works.

This is where a GPU or a co-processor would be nice. When you can dedicate
hardware and memory (cache and RAM) to specific processes, you get to write
algorithms that can fully take advantage of the hardware. This can be very
useful where, for example, you run a database server: you don't care about
random other tasks being extremely fast: bash, cron, SMTP server, etc. can all
take a long time, but the all-important database should be very fast. And
that's when you can run its sorting algorithms on a few dozen dedicated cores
where you are allowed to directly place data into the on-die cache (which
would no longer be a cache, but just another tier of RAM).

~~~
lorenzhs
Manually managing the cache would defeat the purpose of a cache. Computing
what to fill the cache with, and then actually filling it, would likely take
more time than simply handling the cache misses.

Even specifying the caching strategy (whether it be associativity, eviction
strategy, or something else) seems very unlikely to be beneficial to me.
Caches are fast because they are hard-wired and close to the registers. For
them to be useful, they need to be able to decide in the blink of an eye
whether a value is already present or not. Doing that in software would be
absurd. That is why the CPU vendor tries to find a strategy that performs well
in a broad range of scenarios.
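
To see why the presence check must be trivial, here is a minimal direct-mapped cache model (a sketch of the general scheme, not any real CPU's design): presence is one array index plus one tag compare, which is exactly the kind of operation that can be hard-wired.

```python
LINE = 64     # bytes per cache line
SETS = 256    # sets in this toy cache (16 KB total, direct-mapped)

class DirectMappedCache:
    """Toy direct-mapped cache: the presence check is one array index
    plus one tag compare, cheap enough to decide in a couple of cycles."""
    def __init__(self):
        self.tags = [None] * SETS

    def access(self, addr):
        """Return True on a hit; on a miss, install the new line."""
        line = addr // LINE
        index = line % SETS       # low line bits pick the set
        tag = line // SETS        # high bits identify which line lives there
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag    # miss: evict whatever was there
        return False
```

Note that addresses exactly `SETS * LINE` bytes apart fight over the same set; that aliasing on power-of-two strides is the pathology the article is about.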

I have no idea what GPUs have to do with this, but co-processors are nothing
new. What you are describing is a NUMA system with all the background stuff
pinned to one node and your actual workload to the other (or both).

EDIT: Also, 99.9% of the time, the CPU is better at caching than a human could
hope to be. You don't know the input data in advance, so how do you expect to
know which caching strategy would be optimal? From what information do you draw
conclusions as to how the cache should behave for optimal performance? There's
a reason programmers don't manage registers manually (because the compiler is
almost always better at it), what gives you the impression that managing the
cache manually would have a positive outcome?

~~~
IgorPartola
While what you are saying is true for caches, I am talking about a CPU
architecture where the near-die memory is not a cache at all. If I can request
that my process get direct access to what used to be the L2 cache, with, let's
say, a capacity of 4 MB, I could then place my 2 MB array that needs sorting
into it, and sort it without ever leaving L2. I then relinquish control over
this fast hardware buffer, so the next process can use it.

There are of course a lot of issues with this model in a time-sharing
multiprocess OS: how do you request access, how do you ensure clean
allocation, etc.? Those are problems that would need to be resolved. This does
work well in PICs: you get some small amount of RAM (something like 64-512
bytes), and then you can attach a separate RAM chip via I2C or SPI. Perhaps if
a CPU die was surrounded by a reasonable number of such buffers (something
like 16 x 8 MB), we could write a completely different kind of code.
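
As a very rough sketch of what such a programming model might look like: the `SCRATCHPAD_BYTES` constant and the whole API below are invented for illustration (no such API exists), and plain memory stands in for the on-die buffer so the example runs on a normal CPU.

```python
import array

SCRATCHPAD_BYTES = 4 * 1024 * 1024   # the hypothetical 4 MB "ex-L2"

def sort_in_scratchpad(values):
    """Sort a working set that is guaranteed to fit on die.

    Hypothetical model: on real scratchpad hardware (DSP local stores,
    Cell SPE local memory, CUDA shared memory) the buffer below would be
    pinned on-die SRAM; here ordinary memory stands in so the sketch runs.
    """
    buf = array.array('q', values)           # stand-in for the on-die buffer
    if buf.itemsize * len(buf) > SCRATCHPAD_BYTES:
        raise MemoryError("working set must fit entirely in the scratchpad")
    # On real hardware: DMA the data in, sort entirely within SRAM, DMA out.
    return sorted(buf)
```

The interesting design question is the failure mode: unlike a cache, which degrades gracefully, a scratchpad simply refuses working sets that don't fit.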

------
lorenzhs
This article is a great example of the very tangible effects caused by
hardware properties that are often disregarded. Very interesting read!

Also, I can't help but note the (probably non-existent) connection to 3-way
partitioning for quicksort (which is consistently ~10-15%, iirc, faster than
traditional quicksort). It is interesting how we seem to assume that surely
factors of two must be the best choice, computers working in binary and all.
But sometimes this is true only in theory, not in practice (binary search);
sometimes it's good but not optimal (quicksort); and sometimes it is outright
the provably worst choice (dynamic arrays)!
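
The dynamic-array remark refers to the classic allocator-reuse argument: with growth factor 2, the combined size of all previously freed blocks is always one element short of the next request, so freed memory can never be reused, while a smaller factor like 1.5 eventually lets the new block fit into the old holes. A quick check of the arithmetic (a sketch under the usual adjacent-freed-blocks assumption):

```python
def first_reuse_step(growth, max_steps=50):
    """Return the first growth step at which the combined freed blocks
    could hold the next allocation, or None if that never happens.

    Assumes freed blocks are adjacent and the old block is released
    only after the copy, the usual setting for this argument.
    """
    cap, freed = 1, 0
    for step in range(max_steps):
        new_cap = max(cap + 1, int(cap * growth))
        if freed >= new_cap:
            return step
        freed += cap          # old block becomes a reusable hole
        cap = new_cap
    return None
```

With factor 2 the holes always total one element short of the next request (1 + 2 + ... + 2^(k-1) = 2^k - 1 < 2^k), so reuse never happens; with factor 1.5 it becomes possible after only a few growths.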

------
ntoshev
I wish the author included data sorted in a van Emde Boas layout, which should
be the cache-friendly equivalent of a sorted array to be searched with binary
search.
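
For comparison, a simpler cousin of the van Emde Boas idea is the Eytzinger (BFS) layout, which also clusters the first few comparisons of every search into the first few cache lines of the array. A minimal sketch (my illustration, not the article's benchmark code; the recursion is fine for modestly sized arrays):

```python
def eytzinger(sorted_vals):
    """Re-lay a sorted list in BFS order: element 0 is the root, and the
    children of index i sit at 2i+1 and 2i+2."""
    out = [None] * len(sorted_vals)
    it = iter(sorted_vals)

    def fill(i):                      # in-order walk of the implicit tree
        if i < len(out):
            fill(2 * i + 1)           # left subtree: smaller values
            out[i] = next(it)
            fill(2 * i + 2)           # right subtree: larger values

    fill(0)
    return out

def eytzinger_contains(out, key):
    """Membership test walking the implicit BFS tree."""
    i = 0
    while i < len(out):
        if out[i] == key:
            return True
        i = 2 * i + 1 + (out[i] < key)   # descend left or right
    return False
```

Unlike binary search on a sorted array, the hot top levels of the tree all live in a handful of adjacent cache lines, regardless of the array's size.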

~~~
andreasvc
I believe it is rather memory hungry.

------
TheLoneWolfling
Personally, I think trying to automagically control the cache at the hardware
level has... issues. It far too often is, or becomes, a leaky abstraction.

The problem is that hardware is too inflexible. And trying to explicitly
manage things in software, that is, from the operating system, has other
issues. It's expensive, both in terms of time taken for the context switch,
etc., and it tends to wipe out a large chunk of cache too.

Modern processors are mainly cache, in terms of die area. There's plenty of
room for adding new bits and pieces if it helps speed things up.

So, what's between the hardware and the OS? Well, what about an FPGA? So you
have the hardware read cache if present, but have the FPGA make the decisions
about how to manage the cache. What to keep, what to toss, that sort of thing.
This FPGA could also be used for basic system calls / scheduling, and perhaps
even for specialized disk access (read: something like "this process can read
in this range without consulting the OS", since OS time can be most of the
latency when reading from an SSD), with a fallback to a full syscall if
necessary.

You could perhaps even implement the branch prediction unit in the FPGA.

------
agumonkey
old reddit thread:
[http://www.reddit.com/r/programming/comments/xe1zz/binary_se...](http://www.reddit.com/r/programming/comments/xe1zz/binary_search_is_a_pathological_case_for_caches/)

