
Cache Eviction: When Are Randomized Algorithms Better Than LRU? (2014) - zwliew
http://danluu.com/2choices-eviction/
======
kazinator
> When Are Randomized Algorithms Better Than LRU?

When the access pattern is random: recently used items are _not_ more likely
to be accessed than other items.

Or worse: the access pattern is "anti-recent": items recently accessed are
less likely to be accessed again than other items.

This is undergraduate stuff I once had to know for exams in machine
architectures and operating systems.

The choice of LRU replacement is favorable when there exists "locality of
reference"; that's what it's predicated upon.

All of the replacement policy choices are only substitutes for a "magic
oracle": a replacement strategy which can peer into the future and know which
items will be accessed soonest. With a magic oracle, we can not only retain
those items in the cache in preference to other items, to great advantage, but
we can automatically prefetch the correct items that are not yet in the cache
but are about to be accessed.

In general, randomized algorithms provide a defense against situations in
which things don't work out according to the best case assumptions. For
instance, they help defend against tickling worst cases in algorithms (e.g.
sorting) that have good average-case behavior.
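
In code, that defense is often a single line. A minimal sketch of a
randomized pivot (nothing here is from the article):

    import random

    def quicksort(xs):
        # A random pivot means no fixed input ordering can reliably
        # trigger the O(n^2) worst case, unlike "always pick the first".
        if len(xs) <= 1:
            return xs
        pivot = random.choice(xs)
        return (quicksort([x for x in xs if x < pivot])
                + [x for x in xs if x == pivot]
                + quicksort([x for x in xs if x > pivot]))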

~~~
chacham15
This comment dangerously trivializes the problem. Even in cases where the
access pattern is contiguous, not "anti-recent", etc., LRU can still fail to
perform. For example, if you simply iterate over a contiguous list of 5 items
with a cache size of 4, you will have a 0% hit rate with LRU. With 2-random
it would be > 0%, because there is a chance you wouldn't evict the item at
exactly the wrong time. The main takeaway from the article is that this
mismatch between working-set size and cache size is what causes problems, and
it is evident in the data: in the L1 and L2 caches LRU outperforms, but in
the L3, 2-random outperforms, because at that point you are more likely to
hit this exact case.
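
That case is easy to check with a quick simulation (a sketch, not the
article's code; the numbers follow the 5-items/4-slots example above):

    import random

    def run(policy, cache_size=4, n_items=5, accesses=10_000):
        cache, last_used, hits = set(), {}, 0
        for t in range(accesses):
            key = t % n_items                  # cycle 0,1,2,3,4,0,1,...
            if key in cache:
                hits += 1
            elif len(cache) >= cache_size:
                if policy == "lru":
                    victim = min(cache, key=last_used.get)
                else:                          # 2-random
                    a, b = random.sample(sorted(cache), 2)
                    victim = a if last_used[a] < last_used[b] else b
                cache.remove(victim)
            cache.add(key)
            last_used[key] = t
        return hits / accesses

    print("LRU     :", run("lru"))       # 0.0: always evicts the next item needed
    print("2-random:", run("2-random"))  # > 0: sometimes a "safe" item goes instead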

~~~
kazinator
Repeatedly iterating over a sequence of items is in fact "anti-recent".
Round-robin cycling is probably the simplest possible algorithm for accessing
a set of items such that, at any time, the item accessed next is the one that
was least recently visited. Proof: since that item was last visited, every
other item in the set has been visited.

~~~
chacham15
You're missing the point exactly: "anti-recent" is a relative term. Imagine
that the data set is actually 1000 items long, but we iterate through it five
items at a time, looping over each group of five 10 times before moving on to
the next 5. Those five items are much more likely to be needed than the 995
others, but within the current group, the most recently accessed member is
less likely to be needed next than the other 4. You see, it's a relative
term. If the cache size had been 5, we would say that this is a nearly
perfect example of caching the members most likely to appear.

------
okl
> If you have a tight loop, LRU is going to be perfect as long as the loop
> fits in cache, but it's going to cause a miss every time if the loop doesn't
> fit. A random eviction policy degrades gracefully as the loop gets too big.

That's the key take-away from my experience optimizing control loops for
microcontrollers with small caches. As soon as your loop is too big, your
code suddenly runs at a hundredth of the speed.

~~~
eternalban
I am not sure I understand what you mean by the size of the "loop".

Conventional LRU uses linked lists, and when the list no longer fits in the
L1/L2 caches you will see the performance hit you mention (given that
accessing the associated list node and reordering the list will most likely
result in two cache misses).

Try libclc [1] for a compact _segmented_ cache of n 64B containers that
exactly fit a cache line, with each container having its own eviction policy.
The overhead of LRU per container (of 7 data lines) is 9.14 bits/entry and the
same cache line access will give you both the data lines and the LRU order.

[1]: [https://github.com/alphazero/libclc](https://github.com/alphazero/libclc)
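
For a sense of scale: 9.14 bits/entry is just one 64-bit word per 7-entry
container, and a full recency order of 7 slots is a permutation that fits in
21 of those bits at 3 bits per slot. A toy illustration of that kind of
packing (my own sketch, not libclc's actual layout):

    # Toy sketch: the recency order of 7 slots, packed 3 bits per slot
    # into the low 21 bits of one 64-bit word (MRU in the lowest bits).
    IDENTITY = sum(i << (3 * i) for i in range(7))

    def touch(order, slot):
        """Return the order with `slot` moved to the MRU position."""
        rest, pos = 0, 0
        for i in range(7):
            s = (order >> (3 * i)) & 0b111
            if s != slot:
                rest |= s << (3 * pos)
                pos += 1
        return slot | (rest << 3)      # slot into bits 0-2, rest shifted up

    def lru_victim(order):
        return (order >> 18) & 0b111   # the highest 3 bits hold the LRU slot

    order = IDENTITY
    for s in range(7):
        order = touch(order, s)
    print(lru_victim(order))           # 0: touched longest ago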

~~~
blattimwind
> Conventional LRU uses linked-lists and when the LL no longer fits in the
> L1/L2 caches

He refers to the LRU-like policy of the I/D cache itself.

~~~
eternalban
Got it. Thanks for the clarification.

------
vesinisa
Interesting read, but I wish he'd explained better how 2-random and LRU
differ. In a nutshell: LRU tracks the time when each cache item was last
accessed and always discards the oldest item. 2-random also tracks the age of
all cache items, but when discarding it picks two random items and discards
the older of those two.

~~~
wizeman
It doesn't discard the oldest item, it discards the item that was least
recently used/accessed (LRU, like the name says). To see the difference more
clearly, note that the oldest item can in fact be the _most_ recently used
one.
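
In code the distinction is just which timestamp you compare. An illustrative
sketch, with 2-random eviction:

    import random, time

    cache = {}  # key -> [inserted_at, last_used_at, value]

    def put(key, value):
        now = time.monotonic()
        cache[key] = [now, now, value]

    def get(key):
        entry = cache[key]
        entry[1] = time.monotonic()    # refresh last-used, not insertion time
        return entry[2]

    def evict_2random():
        a, b = random.sample(list(cache), 2)
        # Compare last-used times: the oldest entry by *insertion*
        # may well be the most recently *used* one.
        victim = min((a, b), key=lambda k: cache[k][1])
        del cache[victim]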

------
boyter
Love reading this. It has always been one of those interesting things I kept
in the back of my mind in my day-to-day work.

I was very excited when I actually got to implement it on a real-world
project.

I was writing a scale-out job which used ffmpeg to make clips of video files.
To speed it up I kept the downloaded files (which could be 150 GB in size) as
a local cache, since quite often a clip is made from the same file. When the
disk was full (there was a separate disk for download and clip output), the
job selected two of the downloaded files randomly and deleted the older one,
looping until there was enough disk space or no files remained.
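
In sketch form (the directory, the free-space target, and the use of atime
are invented for illustration):

    import os, random

    CACHE_DIR = "/data/downloads"          # hypothetical cache directory
    MIN_FREE_BYTES = 200 * 1024**3         # hypothetical free-space target

    def free_bytes(path):
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize

    def make_room():
        while free_bytes(CACHE_DIR) < MIN_FREE_BYTES:
            files = [os.path.join(CACHE_DIR, f) for f in os.listdir(CACHE_DIR)]
            if not files:
                break                      # no files left to delete
            if len(files) == 1:
                os.remove(files[0])
                continue
            a, b = random.sample(files, 2)
            # 2-random: delete whichever of the two was accessed longer ago.
            os.remove(min((a, b), key=lambda f: os.stat(f).st_atime))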

It's something I thought I would never actually get to implement in the real
world. Thus far it's working very well: the caching speeds things up and the
eviction seems to avoid too many cache misses.

~~~
zodvik
2-random was also used in this HDFS change -
[https://issues.apache.org/jira/browse/HDFS-8131](https://issues.apache.org/jira/browse/HDFS-8131).

The default block placement policy chose datanodes randomly for new blocks.
This change selects 2 datanodes at random and picks the one with less used
disk space.
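
The same power-of-two-choices idea in miniature (a sketch, not the HDFS
patch):

    import random

    def pick_datanode(nodes, used_bytes):
        """nodes: candidate datanode ids; used_bytes: id -> used disk space."""
        a, b = random.sample(nodes, 2)
        return a if used_bytes[a] <= used_bytes[b] else b

    used = {"dn1": 500, "dn2": 120, "dn3": 980, "dn4": 300}
    print(pick_datanode(list(used), used))   # never "dn3" when it is sampled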

------
pvorb
Caffeine, a caching library for Java, has a nice write-up on more advanced
cache eviction strategies.
[https://github.com/ben-manes/caffeine/wiki/Efficiency](https://github.com/ben-manes/caffeine/wiki/Efficiency)

------
harryf
Worth reading
[http://open-zfs.org/wiki/Performance_tuning](http://open-zfs.org/wiki/Performance_tuning)
on the Adaptive Replacement Cache algorithm
([https://en.m.wikipedia.org/wiki/Adaptive_replacement_cache](https://en.m.wikipedia.org/wiki/Adaptive_replacement_cache)),
which is less vulnerable to cache flushes than LRU.

~~~
shereadsthenews
The only big gain of ARC is the adaptive sizing. If you can live with static
sizing you will do just as well with 2Q, which is not patented.

I haven't used a true LRU in quite a while because the coordinated
bookkeeping required is a disaster in multithreaded programs. ARC and 2Q can
be largely lock-free.

~~~
NovaX
Can you explain how ARC can be more easily lock-free than LRU? ARC has to
coordinate 4 doubly-linked lists and a pivot on every access.

~~~
shereadsthenews
You can implement ARC that way and there is a Go library on GitHub that does,
but that is not going to scale well. Normally you won't need to evict from the
frequently-used list, and there is no need to maintain LRU order on that cache
during access. One need only atomically update the access time of the entry
and defer ordering until such a time as eviction is required. During an access
only a read lock on that cache is needed. During eviction from frequently-used
you need exclusive access to all four structures, but if you can arrange for
that to be rare, you'll have uncontested read access to your hot items without
the need to maintain the LRU property.
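
A sketch of that shape, ignoring the rest of ARC and most real concurrency
details (the class name is mine):

    import threading, time

    class DeferredLRU:
        """Access path only stamps a time; LRU order is computed lazily,
        so reads never reorder a shared list."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = {}                   # key -> [last_used, value]
            self.evict_lock = threading.Lock()

        def get(self, key):
            entry = self.data.get(key)
            if entry is None:
                return None
            entry[0] = time.monotonic()      # cheap, no list reordering
            return entry[1]

        def put(self, key, value):
            if len(self.data) >= self.capacity:
                with self.evict_lock:        # exclusive access, but rare
                    victim = min(self.data, key=lambda k: self.data[k][0])
                    self.data.pop(victim, None)
            self.data[key] = [time.monotonic(), value]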

~~~
NovaX
That certainly helps, though you assume strict LRU vs approximate ARC. One can
of course use similar schemes on LRU (like memcached does) or sampling LRU,
making it approximate but reducing the lock contention. The technique that
I've used is to buffer and replay events against the policy, so that reads are
just a CAS into a ring buffer and replaying is under performed under a non-
blocking lock. That works well with any policy and extends to other features,
like expiration.
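
Roughly like this (a single-buffer sketch; `policy_on_access` is a
hypothetical callback, and `itertools.count` stands in for an atomic
counter):

    import itertools, threading

    SIZE = 128
    buffer = [None] * SIZE
    ticket = itertools.count()             # stand-in for an atomic counter
    drain_lock = threading.Lock()

    def record_read(key, policy_on_access):
        buffer[next(ticket) % SIZE] = key  # lossy: may clobber an old event
        # Replay buffered events against the policy, but never block the
        # read path: if another thread holds the lock, just skip draining.
        if drain_lock.acquire(blocking=False):
            try:
                for i, k in enumerate(buffer):
                    if k is not None:
                        buffer[i] = None
                        policy_on_access(k)
            finally:
                drain_lock.release()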

~~~
shereadsthenews
I think we'd definitely have things to discuss :-) It sounds like we agree
there is rarely a good reason to maintain strict LRU property no matter what
scheme or structure is being used.

BTW for expiration I have simply invalidated on next access. This seems to
have worked well for me with DNS caching, for example, where basically you
have three hot records (gmail.com, hotmail.com, yahoo.com) that expire every 5
seconds. In this case of course there is value in proactively refreshing cache
entries with some probability proportional to the expiry time so you don't
have thundering herds.
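
That probabilistic refresh can be as simple as this (a sketch; the 10% window
is an arbitrary choice):

    import random, time

    def should_refresh(fetched_at, ttl, now=None):
        """Refresh early with probability rising as expiry approaches,
        so a herd of clients doesn't all refetch at the same instant."""
        now = now if now is not None else time.monotonic()
        age = now - fetched_at
        if age >= ttl:
            return True                    # already expired: must refetch
        # Probability ramps from 0 to 1 over the last 10% of the TTL.
        window = 0.1 * ttl
        return random.random() < max(0.0, (age - (ttl - window)) / window)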

~~~
NovaX
Yes, that would be fun. :)

For fixed expiration (e.g. a 5 min TTL), I use time-bounded LRU queues. If
the head item has not expired, then its successors have not either; if it
has, removing it is merely a poll.
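
A FIFO sketch of that idea (assuming one fixed TTL per write; `evict` is a
hypothetical hook):

    import collections, time

    TTL = 300.0                            # one fixed TTL for every entry
    queue = collections.deque()            # (expires_at, key) in write order

    def on_write(key):
        queue.append((time.monotonic() + TTL, key))

    def expire():
        # A fixed TTL plus append order means the deque is sorted by
        # expiry: if the head is still live, so is everything behind it.
        now = time.monotonic()
        while queue and queue[0][0] <= now:
            _, key = queue.popleft()
            evict(key)                     # hypothetical eviction hook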

For custom expiration (varies per-entry), I use a hierarchical timer wheel.
This replaces sorting with hashing, with an amortized O(1) cost.
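
A single-level wheel shows the hashing idea (a sketch; a real hierarchical
wheel cascades timers beyond this horizon into coarser levels, and `evict` is
again a hypothetical hook):

    import time

    WHEEL_SIZE = 60                        # one-second buckets, 60 s horizon
    wheel = [set() for _ in range(WHEEL_SIZE)]

    def schedule(key, expires_at):
        # Hashing replaces sorting: the deadline picks the bucket directly.
        wheel[int(expires_at) % WHEEL_SIZE].add((key, expires_at))

    def tick(now=None):
        now = now if now is not None else time.monotonic()
        bucket = wheel[int(now) % WHEEL_SIZE]
        for key, expires_at in list(bucket):
            if expires_at <= now:
                bucket.remove((key, expires_at))
                evict(key)                 # hypothetical eviction hook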

For refresh, I do it on access, since the entry is being used. I keep a
refresh timestamp and an expiration timestamp, so that a refresh can occur
while the data is valid but aging, whereas a blocking load is required when
it's too stale (expired).

You might enjoy this slide deck of the ideas that I use:
[https://docs.google.com/presentation/d/1NlDxyXsUG1qlVHMl4vsU...](https://docs.google.com/presentation/d/1NlDxyXsUG1qlVHMl4vsUUBQfAJ2c2NsFPNPr2qymIBs/edit?usp=sharing)

------
npstr
An algorithm that performs better than LRU on most workloads is Window
TinyLFU. Some nice simulations and graphs can be found here:
[https://github.com/ben-manes/caffeine/wiki/Efficiency](https://github.com/ben-manes/caffeine/wiki/Efficiency)

~~~
Scaevolus
LHD performs better than LFU on most workloads. It's harder to implement, but
it more accurately computes the actual expected value of each cache item.
[https://www.cs.cmu.edu/~beckmann/publications/papers/2018.ns...](https://www.cs.cmu.edu/~beckmann/publications/papers/2018.nsdi.lhd.pdf)

~~~
6keZbCECT2uB
I glanced through the paper and didn't see a comparison to TinyLFU.
Unfortunately, it seems to be a misleading name which would make one think it
was simply an LFU algorithm. The description in Caffeine seems to indicate
that it's more of an LRU algorithm with an LFU admission heuristic.

If the overhead of the cache itself is significant, as in kernel virtual
memory management, a clock-based algorithm is usually preferable: it uses an
array to approximate node-based structures. ARC and LIRS have CAR and
CLOCK-Pro, respectively, to simulate them. CAR is vulnerable to reuse
distances near the boundary of the cache size. [1] CLOCK-Pro is widely used,
notably in the Linux VM. [1]
[https://www.slideshare.net/huliang64/clockpro](https://www.slideshare.net/huliang64/clockpro)
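
For reference, the basic clock (second-chance) scheme those variants build on
(a sketch; CLOCK-Pro itself additionally tracks reuse distance):

    def clock_evict(keys, referenced, hand):
        """keys: fixed array of cached keys; referenced: parallel bit array,
        set to 1 on every access. Sweep the hand, clearing bits, and evict
        the first slot whose bit is already clear (second chance)."""
        while True:
            if referenced[hand]:
                referenced[hand] = 0       # spare it once
                hand = (hand + 1) % len(keys)
            else:
                victim, hand = hand, (hand + 1) % len(keys)
                return victim, hand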

~~~
NovaX
Naming aside, the structure is <LRU | LFU filter | LRU>. The latest version
of that is adaptive, where the front LRU may grow or shrink based on the
workload. This allows the cache to mimic LRU for recency-skewed workloads
(e.g. job queues) and LFU for frequency-skewed ones (e.g. loops). In a stress
test, ARC and LIRS had a 20% hit rate, Caffeine had 39.6%, and Optimal was
40.3%.

[https://user-images.githubusercontent.com/378614/52891655-29...](https://user-images.githubusercontent.com/378614/52891655-2979e380-3141-11e9-91b3-00002a3cc80b.png)
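
The admission step in miniature (a toy sketch using exact counters where
Caffeine uses a CountMin sketch):

    import collections

    freq = collections.Counter()           # stand-in for a CountMin sketch

    def on_access(key):
        freq[key] += 1

    def admit(candidate, victim):
        # The new entry only displaces the eviction victim if it has
        # historically been seen more often: one-hit wonders bounce off.
        return freq[candidate] > freq[victim]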

------
kccqzy
Randomization can really do wonders at times. For some problems, the most
efficient known algorithm essentially uses randomness in a clever way; the
Miller-Rabin primality test comes to mind, or the randomized minimum cut
algorithm. And even when deterministic algorithms have the same big-O
asymptotic complexity as randomized ones, the randomized ones often perform
better due to simpler code. Treap vs. red-black tree is a classic example.
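
For example, Miller-Rabin fits in a few lines (standard textbook form; k
rounds drive a composite's false-positive probability below 4^-k):

    import random

    def is_probable_prime(n, k=20):
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        # Write n - 1 as d * 2^s with d odd.
        d, s = n - 1, 0
        while d % 2 == 0:
            d //= 2
            s += 1
        for _ in range(k):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(s - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False               # a is a witness: n is composite
        return True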

------
iampims
It’s also working very well for load-balancing.

[https://brooker.co.za/blog/2012/01/17/two-random.html](https://brooker.co.za/blog/2012/01/17/two-random.html)

------
p1necone
The article never explains what the difference between 2-random and pseudo
2-random is. Does anyone know?

~~~
j16sdiz
Pseudo 2-random uses pseudo-LRU. Check Wikipedia for pseudo-LRU.

It is a tree algorithm for tracking the approximate last use time.
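
For a 4-way set it is just three bits (a sketch of tree-PLRU; each bit points
toward the side holding the approximately older entries):

    # Tree-PLRU for one 4-way set: three bits approximate the LRU order.
    # bits[0] is the root (0 = look left for a victim, 1 = look right);
    # bits[1] chooses within ways {0, 1}, bits[2] within ways {2, 3}.
    bits = [0, 0, 0]

    def touch(way):
        # Point every bit on the path away from the way just used.
        if way in (0, 1):
            bits[0] = 1
            bits[1] = 1 if way == 0 else 0
        else:
            bits[0] = 0
            bits[2] = 1 if way == 2 else 0

    def victim():
        if bits[0] == 0:
            return 0 if bits[1] == 0 else 1
        return 2 if bits[2] == 0 else 3

    for w in (0, 1, 2, 3):
        touch(w)
    print(victim())                        # 0: approximately least recently used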

------
antirez
Note that LFU does not have the same limits and will adapt to access patterns
that are pathological for LRU, like circular accesses. LRU is an
approximation of LFU based on the idea that the last access time is related
to the future access frequency of an object. The pathological cases where LRU
fails and random eviction is better are related to the fact that such access
patterns violate that premise. LFU does not have such limits because it
attempts to measure the actual access frequency (smoothed over time) of an
object, and uses that as a prediction of the future frequency.
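
One way to smooth frequency by time (a sketch of a saturating, time-decayed
counter; illustrative only, not Redis's implementation):

    import random, time

    class SmoothedLFU:
        """Tiny frequency estimate: saturating, logarithmic increments
        plus time-based decay, so stale popularity fades away."""
        def __init__(self):
            self.count = 0
            self.last_decay = time.monotonic()

        def touch(self):
            self._decay()
            # Logarithmic: the higher the count, the rarer the increment,
            # so a small counter can cover a huge frequency range.
            if self.count < 255 and random.random() < 1.0 / (self.count + 1):
                self.count += 1

        def value(self):
            self._decay()
            return self.count

        def _decay(self):
            now = time.monotonic()
            elapsed_minutes = int((now - self.last_decay) / 60)
            if elapsed_minutes:
                self.count = max(0, self.count - elapsed_minutes)
                self.last_decay = now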

~~~
hinkley
I have just started fiddling with treaps recently. I suspect they're not
applicable to a hardware cache, but this thread has morphed a bit into
conversations about software caches as well.

A treap is an _unbalanced_ binary tree in which frequently used nodes sit
closer to the root, giving fewer indirections per access (assuming an uneven
distribution of lookups). Essentially it has an MFU or LFU queue built into
it, and now I'm curious about how it would behave as a cache data structure.
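
A sketch of that variant, with access counts as priorities instead of the
classic random ones (insertion and deletion omitted):

    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.priority = 1              # access count, not a random number
            self.left = self.right = None

    def rotate_right(n):
        l = n.left
        n.left, l.right = l.right, n
        return l

    def rotate_left(n):
        r = n.right
        n.right, r.left = r.left, n
        return r

    def access(n, key):
        """BST search; on a hit, bump the node's count and rotate it up
        while it outranks its parent, so hot keys drift toward the root."""
        if n is None:
            return None
        if key < n.key:
            n.left = access(n.left, key)
            if n.left and n.left.priority > n.priority:
                n = rotate_right(n)
        elif key > n.key:
            n.right = access(n.right, key)
            if n.right and n.right.priority > n.priority:
                n = rotate_left(n)
        else:
            n.priority += 1
        return n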

------
drallison
It would be interesting to compare the behavior of real workloads under the
different cache policies. In a "real workload" the sequence of reads and
writes to memory (and the cache) comes from a multiplicity of independent
processes running concurrently, with some distribution of lifetimes of cache
ownership. In many cases, the caches will be in a non-equilibrium state and
behaving badly.

------
phyzome
Note that Redis's LRU algorithm is randomized:
[https://redis.io/topics/lru-cache](https://redis.io/topics/lru-cache)

> The reason why Redis does not use a true LRU implementation is because it
> costs more memory.

...but it sounds like it wouldn't necessarily be totally desirable to use true
LRU anyhow. :-)

------
known
Random boosting could have saved Mars Pathfinder:
[https://en.wikipedia.org/wiki/Priority_inversion](https://en.wikipedia.org/wiki/Priority_inversion)

------
Labo333
The experiments are interesting, but the article never explains what the
strategies are. That's fine for LRU and FIFO, but certainly not for 2-random!

