Hacker News new | past | comments | ask | show | jobs | submit login
Cache Eviction: When Are Randomized Algorithms Better Than LRU? (2014) (danluu.com)
235 points by zwliew 30 days ago | hide | past | web | favorite | 42 comments



> When Are Randomized Algorithms Better Than LRU?

When the access pattern is random: recently used items are not more likely to be accessed than other items.

Or worse: the access pattern is "anti-recent": items recently accessed are less likely to be accessed again than other items.

This is undergraduate stuff I once had to know for exams in machine architectures and operating systems.

The choice of LRU replacement is favorable when there exists "locality of reference"; that's what it's predicated upon.

All of the replacement policy choices are only substitutes for a "magic oracle": a replacement strategy which can peer into the future and know which items will be accessed soonest. With a magic oracle, we can not only retain those items in the cache in preference to other items, to great advantage, but we can automatically prefetch the correct items that are not yet in the cache but are about to be accessed.

In general, randomized algorithms provide a defense against situations in which things don't work out according to the best case assumptions. For instance, they help defend against tickling worst cases in algorithms (e.g. sorting) that have good average-case behavior.


This comment is dangerously trivializing the problem. Even in cases where the access pattern is contiguous, not "anti-recent", etc. LRU can still fail to perform. For example, if you simply iterate over a contiguous list of 5 items with a cache size of 4, you will have a 0% hit rate with LRU. With 2 random, it would be > 0% because there is a chance you wouldnt evict the item in exactly the wrong time. The main takeaway from the article is that it is this mismatch between cache sizes which can cause problems and is evident by the data. In L1+L2 caches, LRU outperforms, but in L3 2-random outperforms because at that point, you are more likely to hit this exact case.


Repeatedly iterating over a sequence items is in fact "anti-recent". Round-robin cycling is probably the simplest possible algorithm for accessing a set of items such that, at any time, the item that is accessed next is the one that was least recently visited. Proof: since that item was last visited, every other item in the set was visited.


You're missing exactly the point. "anti-recent" is a relative term. Imagine that the data set is actually 1000 items long, but we iterate through it five at a time, but we iterate over that five 10 times each before moving on to the next 5. Those five items are much more likely to be necessary than all the 995 other items, but the currently added member is less likely than the other 4 for the next loop. You see, its a relative term. If the cache size had been 5, we would say that this is a nearly perfect example of caching the most likely members to appear.


Your loop example is more akin to the "anti-recent" pattern: the most recently accessed item will not be accessed again sooner than any of the four other items. Contiguous patterns may well be anti-recent, whether an access is "recent" or not should be considered in relation to the size of the cache.


>>When Are Randomized Algorithms Better Than LRU?

>When the access pattern is random: recently used items are not more likely to be accessed than other items.

Wouldn't they perform equally well in this case? There's no advantage in LRU, but also no disadvantage if they're really unpredictably random.


A random strategy has less overhead than LRU because there is no bookkeeping.


This argument does not apply to the OP as it does not account for any bookkeeping. I think an actual answer is more complicated and will entirely depend on the definition of “random access pattern.”


Yes, and bookkeeping and randomness both cost something.


Even when the access pattern is not random, random can provide some benefits.

It can prevent any otherwise harmless sweep/scan of the system from completely busting your entire cache for all of the other uses.


> If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it's going to cause a miss every time if the loop doesn't fit. A random eviction policy degrades gracefully as the loop gets too big.

That's the key take-away from my experience with optimizing control loops for microcontrollers with small caches. As soon as your loop is too bit, your code suddenly runs at a hundredth of the speed.


I am not sure I understand what you mean by the size of the "loop".

Conventional LRU uses linked-lists and when the LL no longer fits in the L1/L2 caches, you will see the performance hit you mention (given that accessing the associated LL node and reordering the list will most likely result in two cache misses).

Try libclc [1] for a compact segmented cache of n 64B containers that exactly fit a cache line, with each container having its own eviction policy. The overhead of LRU per container (of 7 data lines) is 9.14 bits/entry and the same cache line access will give you both the data lines and the LRU order.

[1]: https://github.com/alphazero/libclc


> Conventional LRU uses linked-lists and when the LL no longer fits in the L1/L2 caches

He refers to the LRU-like policy of the I/D cache itself.


Got it. Thanks for the clarification.


It goes beyond microcontrollers to every step up the chain. You want to optimize you system to hit the cache at every level as much as possible. The processor and its cache. The database and its cache. Web servers and their cache. Because going outside of the cache at any level usually incurs a performance hit orders of magnitude worse.


Interesting read, but I wish he'd explain how 2-random and LRU differ better. In a nutshell, LRU tracks the time when a cache item was last accessed and always discards the oldest item. 2-random also tracks the age of all cache items, but when discarding picks two random items and discards the oldest of those two.


It doesn't discard the oldest item, it discards the item that was least recently used/accessed (LRU, like the name says). To see the difference more clearly, note that the oldest item can in fact be the most recently used one.


Love reading this. It has always been one of those interesting things I kept in the back of my mind in my day to day.

I was very excited when I actually got to implement it on a real world project.

I was writing a scale out job which used ffmpeg to make clips of video files. To speed it up I kept the downloaded files (which could be 150 GB in size) as a local cache. Quite often a clip is made of the same file. When the disk was full (there was a separate disk for download and clip output) selected two of the downloaded files randomly and deleted the older one. Loop till there was enough disk space, or no files.

It's something I thought I would never actually get to implement in the real word, and thus far is working very well, the caching speeds things up and the eviction seems to avoid too many cache misses.


2-random was also used in this HDFS change - https://issues.apache.org/jira/browse/HDFS-8131.

Default block placement policy chose datanodes randomly for new blocks. This change selects 2 datanodes randomly and picks the one with lower used disk space.


Caffeine, a caching library for Java, has a nice write-up on more advanced cache eviction strategies. https://github.com/ben-manes/caffeine/wiki/Efficiency


Worth reading http://open-zfs.org/wiki/Performance_tuning about the Adaptive Replacement Cache algorithm ( https://en.m.wikipedia.org/wiki/Adaptive_replacement_cache ) which is less vulnerable to cache flushes than LRU


The only big gain of ARC is the adaptive sizing. If you can live with static sizing you will do just as well with 2Q, which is not patented.

I haven't used a true LRU in quite a while because the coordinated bookkeeping required is as disaster on multithreaded programs. ARC and 2Q can be largely lock-free.


Can you explain how ARC can be more easily lock-free than LRU? ARC has to coordinate 4 doubly-linked lists and a pivot on every access.


You can implement ARC that way and there is a Go library on GitHub that does, but that is not going to scale well. Normally you won't need to evict from the frequently-used list, and there is no need to maintain LRU order on that cache during access. One need only atomically update the access time of the entry and defer ordering until such a time as eviction is required. During an access only a read lock on that cache is needed. During eviction from frequently-used you need exclusive access to all four structures, but if you can arrange for that to be rare, you'll have uncontested read access to your hot items without the need to maintain the LRU property.


That certainly helps, though you assume strict LRU vs approximate ARC. One can of course use similar schemes on LRU (like memcached does) or sampling LRU, making it approximate but reducing the lock contention. The technique that I've used is to buffer and replay events against the policy, so that reads are just a CAS into a ring buffer and replaying is under performed under a non-blocking lock. That works well with any policy and extends to other features, like expiration.


I think we'd definitely have things to discuss :-) It sounds like we agree there is rarely a good reason to maintain strict LRU property no matter what scheme or structure is being used.

BTW for expiration I have simply invalidated on next access. This seems to have worked well for me with DNS caching, for example, where basically you have three hot records (gmail.com, hotmail.com, yahoo.com) that expire every 5 seconds. In this case of course there is value in proactively refreshing cache entries with some probability proportional to the expiry time so you don't have thundering herds.


Yes, that would be fun. :)

For fixed expiration (e.g. 5 min TTL), I use time-bounded LRU queues. Then if the head item has not expired, its successors have not either. If it has, it is merely a poll.

For custom expiration (varies per-entry), I use a hierarchical timer wheel. This replaces sorting with hashing, with an amortized O(1) cost.

For refresh, that's when I do it on access since its being used. I keep a refresh timestamp and an expiration timestamp, so that a refresh can occur while the data is valid but aging, but a blocking load is required when its too stale (expired).

You might enjoy this slide deck of the ideas that I use: https://docs.google.com/presentation/d/1NlDxyXsUG1qlVHMl4vsU...


An algorithm that's more optimal than LRU on most workloads is Window TinyLFU. Some nice simulations and graphs can be found here: https://github.com/ben-manes/caffeine/wiki/Efficiency


LHD is more optimal than LFU on most workloads. It's harder to implement, but more accurately computes the actual expected value of each cache item. https://www.cs.cmu.edu/~beckmann/publications/papers/2018.ns...


I glanced through the paper and didn't see a comparison to TinyLfu. Unfortunately, it seems to be a misleading name which would make one think it was simply an LFU algorithm. The description in caffeine seems to indicate that it's more of an LRU algorithm with an LFU admission heuristic.

If the overhead of the cache itself is significant, as in kernel virtual memory management. A clock based algorithm is usually preferable which uses an array to approximate node based structures. ARC and LIRS have CAR and CLOCK-Pro respectively to simulate them. CAR is vulnerable to re-use distances near the boundary of the cache size. [1] CLOCK-Pro is widely used, notably in Linux VM. [1] https://www.slideshare.net/huliang64/clockpro


Naming aside, the structure is <LRU | LFU filter | LRU>. The latest version of that is adaptive, where the front LRU may grow or shrink based on the workload. This allows the cache to mimic LRU for recency-skewed workloads (e.g. job queues) and LFU for frequency-skewed ones (e.g. loops). In a stress test, ARC and LIRS had 20% hit rate, Caffeine has 39.6%, and Optimal is 40.3%.

https://user-images.githubusercontent.com/378614/52891655-29...


The LHD paper lacks a robust analysis against different workloads. The few traces they use are either private or not considered representative workloads. For MSR, the provider (SNIA) specifically states: "WARNING: These traces are over 10 years old! They should not be used for modern research!".

It would be much more interesting if they used the traces from ARC and LIRS, perhaps also UMass and Wikipedia. These traces are publically available and used by multiple papers from a variety of groups.

I have not seen a sampling-based policy that outperforms a policy that has tracks history outside of its working set. Since the examples specifically avoided that comparison, I suspect that it under performs in comparison. That isn't to say it isn't a useful advancement, but that a more robust analysis is needed to validate their claims.


Randomization can really do wonders at times. There are some algorithms for which the most efficient algorithm is essentially using randomness in a clever way. The Miller–Rabin primality test comes to mind. Or the randomized minimum cut problem. And even when deterministic algorithms have the same big-O asymptotic complexity as randomized algorithms, the randomized ones often perform better due to simpler code. Treap vs red-black tree is a classic example.


It’s also working very well for load-balancing.

https://brooker.co.za/blog/2012/01/17/two-random.html


The article never explains what the difference between 2-random and pseudo 2-random is, does anyone know?


Pseudo 2-Random use Pseudo LRU. Check the Wikipedia for Pseudo LRU.

It is a tree algorithm for tracking the approximate last use time.


Note that LFU does not have the same limits and will adapt to access patterns that are pathological for LRU like circular accesses. LRU is an approximation of LFU based on the idea that last access time is related to the future access frequency of an object. The pathological cases where LRU fails and random eviction is better is related to the fact that such access patterns violate such premise. LFU does not have such limits because it attempts to measure the actual access frequency (smoothed by time) of an object, and uses that as prediction of the future frequency.


I have just started fiddling with treaps recently. Not applicable I suspect for a hardware cache but this thread has morphed a bit into conversations about software caches as well.

A treap is an unbalanced binary tree where the tree depth is proportional to the frequency of use to get fewer indirections per access (assuming an uneven distribution of lookups). Essentially it has a MFU or LFU queue built into it and now I'm curious about how it would behave as a cache data structure.


It would be interesting to compare the behavior of real workloads with different cache protocols. In a "real workload" the sequence of reads and writes to memory (and the cache) come from a multiplicity of independent processes functioning co-sequentially with some distribution of lifetimes of cache ownership. In many cases, the caches will be in a non-equilibrium state and behaving badly.


Note that Redis's LRU algorithm is randomized: https://redis.io/topics/lru-cache

> The reason why Redis does not use a true LRU implementation is because it costs more memory.

...but it sounds like it wouldn't necessarily be totally desirable to use true LRU anyhow. :-)


Random boosting could have saved Mars pathfinder https://en.wikipedia.org/wiki/Priority_inversion


The experiments are interesting but they never explained what the strategies are. It's OK for LRU and FIFO but certainly not for 2-random!




Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: