
RobinHood: tail latency aware caching – dynamic reallocation - godelmachine
https://blog.acolyer.org/2018/10/26/robinhood-tail-latency-aware-caching-dynamic-reallocation-from-cache-rich-to-cache-poor/
======
jandrewrogers
This is similar to how continuous load shifting works in some massively
parallel database engines, though working from a different architectural
perspective. This problem is at the core of how you scale out mixed workloads
when the distribution of data and workload is intrinsically unpredictable and
non-uniform. However, in the database cases the cache itself is sharded and a
fixed size (basically RAM fixed to a core), for performance and scalability
reasons -- the cache isn't a separate system.

Since you have all the local cache metadata and you can't change the cache
size per node, you ensure approximately uniform load across nodes by
continuously and transparently shipping the smallest number of shards (while
under full load) that will give each logical node approximately the same cache
performance profile. This works well in my experience. It sounds onerous
resource-wise, but the number of shards you typically need to ship, even in a
very dynamic and high-throughput system, is relatively small if your algorithm
picks them well, and the total latency to ship a shard can be pretty small
(tens or hundreds of milliseconds?) depending on the design. As a percentage
of system load the cost of shipping is pretty nominal and it all but
eradicates hotspotting. However, the whole system has to be designed under the
assumption that you'll be doing this, it isn't something you can reasonably
add on later.
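To make the idea concrete, here's a minimal sketch of the shard-shipping policy described above: given per-node shard loads, greedily move the fewest shards from the hottest node to the coldest until every node sits within a tolerance of the mean load. The function name, the greedy policy, and the tolerance are all my own illustrative choices, not taken from any real database engine.

```python
# Hypothetical sketch of shipping the smallest number of shards needed to
# give every node approximately the same load. Names and the greedy policy
# are illustrative assumptions, not any real system's implementation.

def plan_shard_moves(nodes, tolerance=0.10):
    """nodes: {node_id: {shard_id: load}}. Mutates `nodes` and returns a
    list of (shard_id, src_node, dst_node) moves."""
    loads = {n: sum(s.values()) for n, s in nodes.items()}
    mean = sum(loads.values()) / len(loads)
    moves = []
    while True:
        hot = max(loads, key=loads.get)
        cold = min(loads, key=loads.get)
        if loads[hot] - mean <= tolerance * mean:
            break  # every node is already close enough to the mean
        # Ship the largest shard that doesn't push the cold node past the mean.
        candidates = [(sz, sid) for sid, sz in nodes[hot].items()
                      if loads[cold] + sz <= mean * (1 + tolerance)]
        if not candidates:
            break  # no safe move left; stop rather than oscillate
        size, shard = max(candidates)
        del nodes[hot][shard]
        nodes[cold][shard] = size
        loads[hot] -= size
        loads[cold] += size
        moves.append((shard, hot, cold))
    return moves
```

Because a receiving node can never be pushed above `mean * (1 + tolerance)`, a shipped shard is never immediately shipped back, which matches the point that the steady-state shipping cost stays nominal.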

It is great to see more work being published on achieving an effectively
uniform distribution of load for data models that intrinsically have
unpredictable, dynamic, and non-uniform distributions of data and load.
There are many high-value data models that can't be handled at scale without
addressing this problem directly.

~~~
d--
Author here: thank you for pointing out the connection to parallel DBMSs - I
was not aware of the similarity of the problem despite the difference in
approach.

Can you point me to a DBMS that does this, either open source or in
production?

------
d--
Direct link to paper:
[https://www.usenix.org/system/files/osdi18-berger.pdf](https://www.usenix.org/system/files/osdi18-berger.pdf)

Repo:
[https://github.com/dasebe/robinhoodcache](https://github.com/dasebe/robinhoodcache)

~~~
kltutor
What's the license of RobinHood's core algorithm? This seems useful.

But this being sponsored research (Microsoft) brings to mind the old ARC cache
licensing problem.

[http://www.varlena.com/GeneralBits/96.php](http://www.varlena.com/GeneralBits/96.php)

~~~
d--
No, we do not expect any licensing problems. There's an open issue on github:
[https://github.com/dasebe/robinhoodcache/issues/1](https://github.com/dasebe/robinhoodcache/issues/1)

The issue mentions the 3-clause BSD license, which will probably be applied
today.

------
gstaro
How widely has this been tested? Is this being used in production?

~~~
d--
Author here: this is very much a research project with the purpose of
prototyping the idea that _caching can help tail latency_. Previously, people
didn't believe that caches could be used that way, and it turns out you really
have to rethink how you use the caches.

We built a testbed and replayed production workloads. This is not running in
production yet. We're actively looking for new workloads and scenarios to test
this at scale!
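For readers skimming the thread, here's a rough sketch of the reallocation loop as I understand it from the paper: each interval, find the requests near the tail, credit the backend that blocked each one (its "request blocking count"), and shift a small slice of the shared cache budget toward high-RBC backends. The function names, the P99 cutoff, and the 1% step size are illustrative simplifications, not the paper's exact implementation.

```python
# Hedged sketch of a RobinHood-style reallocation step. `requests` is one
# entry per user request, mapping each backend it fanned out to -> latency.
# All identifiers here are illustrative, not from the actual codebase.

from collections import Counter

def reallocate(cache_mb, requests, step_frac=0.01):
    """cache_mb: {backend: MB}. Returns a new allocation with step_frac of
    the total cache shifted toward backends that block tail requests."""
    totals = sorted(max(r.values()) for r in requests)
    p99 = totals[int(0.99 * (len(totals) - 1))]
    rbc = Counter()
    for r in requests:
        backend, latency = max(r.items(), key=lambda kv: kv[1])
        if latency >= p99:
            rbc[backend] += 1  # this backend blocked a tail request
    pool = step_frac * sum(cache_mb.values())
    # Take the pool evenly from everyone (the "cache-rich" pay in), then
    # hand it out proportionally to blocking counts (the "cache-poor").
    new = {b: mb - pool / len(cache_mb) for b, mb in cache_mb.items()}
    total_rbc = sum(rbc.values()) or 1
    for b, count in rbc.items():
        new[b] += pool * count / total_rbc
    return new
```

The point of taking only a small slice per interval is that the allocation converges gradually instead of thrashing when the tail-latency bottleneck shifts between backends.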

