Hacker News new | past | comments | ask | show | jobs | submit login
Cache eviction: when are randomized algorithms better than LRU? (2014) (danluu.com)
192 points by signa11 on Apr 19, 2017 | hide | past | favorite | 67 comments

Related: the author, Dan Luu has, IMHO, one of the best IT-related Twitter feeds out there: https://twitter.com/danluu

Highly recommended.

One thing I've seen is that a weighted LRU is usually slightly better for unpredictable workloads. i.e. it will never evict high use cache data when there's a flurry of one-off traffic that would normally start evicting a normal LRU cache, and even a 2 random LRU scheme.

This is particularly relevant for databases and keeping rows in memory.

The algorithm is something like "add two weight when an item is used, and decay one from every item every second, evict items under pressure from the lowest weight to the highest."

Well, once you go down this path you gotta consider ARC: https://en.wikipedia.org/wiki/Adaptive_replacement_cache as I believe it is designed to be scan-resistant.

ARC has modest scan resistance due to a limited history. Like LRU, it suffers from cache pollution. See the glimpse trace for this workload type [1].

LIRS and TinyLFU make better use of their history to filter out low value items early. The fundamental difference in their designs is starting point: LIRS starts from LRU and infers recency, whereas TinyLFU starts from frequency and infers recency. There is work on an Adaptive TinyLFU that mimics LRU by using an admission window, sampling the hit rate, and using hill climbing to adjust the region sizes to best fit the workload.

[1] https://github.com/ben-manes/caffeine/wiki/Efficiency#glimps...

Hmm given that patents span for 20 years, and that's essentially forever in computer world, once this patent expires it probably will be worthless.

In a world where advances are built on top of other advancements, patents just stifle innovation.

I'd disagree. There's a common saying that the great companies of today are made by someone trawling the forgotten CS papers of the 1980s, and implementing them to solve problems that didn't exist then.

Also this patent expires in 8 years. Will we still need caches in 8 years? Why yes we will!

And yet innovation proceeds at a pace unmatched in history, and is so commonplace it gets dismissed as not counting when it's not a revolutionary breakthrough.

When you're referring to innovation, you're talking about everything. I'm referring to software.

It's relatively impolite to tell someone what they meant

PostgreSQL uses ARC (actually 2Q which is very similar), and they worked around the patent.

Postgres doesn't use 2Q either, sorry. It's a simple clock-sweep approximated LRU implementation currently.

Nope, no arc there anymore. It got ripped out shortly after introduction due to IP concerns.

I was going to joke about how we need generational LRUs and here I see a comment about ARC. I guess its not about ref counts though.

Looks like they just combined LRU + LFU + a victim cache. Or is there something more insightful about its design?

It keeps a history of evicted keys. If there's a miss, it knows if the miss was in the lfu or lru part, and it tunes the ratio of space allocated to the lru and lfu part of the cache.

Hmm, that sounds pretty cool. The history part is exactly a victim cache, so it seems like the size tuning is what's novel.

Are there any languages that allow you to give the compiler a hint that you're about to grind over a gigantic dataset so don't bother to cache any of this data because it won't be accessed again for a long time? It seems like it could be helpful in keeping a big crunch from obliterating the cache constantly. You might also be able to apply other optimizations, like preloading the next data blocks so they're ready when the CPU rolls around. Maybe compilers already do this behind the scenes?

There are "stream" instructions available on some CPU. These instructions tell the CPU to store data, but not to load it into a cache line (it might actually load it into a temporary cache line that is not part of the normal cache lines, depending on architecture). This is useful is you are writing data to memory but don't read it - there's no point in loading that data to the cache line since you will not access it again.

One such instruction can be used with the _mm_stream_si128 intrinsic:

void _mm_stream_si128(__m128i *p, __m128i a);

Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16 byte aligned.

Learned about it in "What every programmer should know about memory": https://www.akkadia.org/drepper/cpumemory.pdf

It's from 2007, but still relevant and a good read.

Not the compiler per se, but you can remap memory to mark it as "non-cacheable". This is typically done in drivers for instance, when you access memory-mapped device registers that should most definitely never be cached.

You probably don't want to do that for regular RAM unless you access a lot of data only once and your access pattern is completely unpredictable (which is almost never the case).

Even if you only read the data once having a cache means that the prefetcher can work its magic if it notices a pattern in your accesses, preloading what it thinks you'll read next in the cache ahead of you.

You can also help the prefetcher by explicitly adding preload instructions (if the ISA allows it, at least) if you know ahead of time where you'll read next but you expect the prefetcher won't be able to guess it. That's useful if for instance you're navigating through some sort of linked list data structure, where the access pattern is potentially random looking "from the outside" but you know where you'll go next in your algorithm.

Don't. Yes, there are instructions for this. No, don't use them, unless you really exactly know what you are doing and optimizing towards a specific, single µarch only, otherwise they will invariably hurt performance, not improve it.

Similarly explicit prefetching usually does not improve performance, but reduces it.

(Non-temporal stores are quite a good example here, since a game engine used them in a few spots until recently, causing not only worse performance on Intel chips, but also heavily deteriorated performance on AMD's Zen µarch. Removing them improved performance for all chips across the bank. Ouch!)

Links, examples, tutorials?

I suspect you will find its sorta unnecessary, most modern machines have stream detection built into the caches as well, and simple memcpy's and the like won't actually roll the entire cache. They might roll a set, or simply stop filling after some number of cache lines. This is also why prefetch hints don't tend to help streaming operations.

That said, most modern machines also have stream oriented instructions (sse nontemporal stores for example) that don't allocate on write, but you have to be careful because for workload after workload they have been shown to be slower. There are a couple reasons, but the most important thing to remember is that microbenchmarks repeatedly copying the same piece of data over and over aren't really reflective of most workloads.

There are non-temporal loads and stores that bypass the cache, but last I checked the non-temporal loads don't actually work with cacheable memory types - this may have changed by now.

The cache partitioning in Broadwell can help with this. If you have a very noisy background task that accesses a lot of data (e.g. like garbage collection scanning), you could partition it with a small slice of cache so that it doesn't evict the data the rest of your application uses. I got myself a Broadwell Xeon v4 for my desktop specifically so I could play with this feature.

What level of cache?

For memory mapped regions there is madvise() DONT_NEED to tell the kernel to that you won't need this data again so don't bother with the disk cache.

For the CPU cache the non-temporal instructions MOVNTI that tells the CPU to not bother caching the move to prevent cache pollution.

The short answer is no, because the machine language doesn't allow general manipulation of the cache. So the cache is pretty strictly controlled by the hardware. In general the hardware works very well for a wide variety of different workloads. It can often work better than a compiler can, because it works at runtime, and is thus flexible to changes in workloads. The preloading optimization you mentioned is sort of done already: it's called prefetching. The idea is a cache will automatically cache memory near where you just accessed, because for a lot of memory intensive operations you are linearly scanning a specific section of memory.

SSE allows you to do non-temporal loads & stores, within certain constraints.

"A case for the non-temporal store" https://blogs.fau.de/hager/archives/2103

Languages, no, but the madvise syscall has flags for that kind of pattern.

This 2 random cache scheme reminds me a lot of the 2 random load balancing scheme laid out in "The Power of Two Choices in Randomized Load Balancing" https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.... Which shows an exponential improvement in choosing the less loaded of two randomly chosen servers over just random.

It is indirectly referenced near the bottom of the page, with a pointer to http://brooker.co.za/blog/2012/01/17/two-random.html, which references http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...

The "2 random choices" idea also shows up as a good idea in other settings, such as load balancing for hashing. I wrote a writeup[1], but if you google "power of two random choices" you'll see a lot of literature on it.

[1]: https://jeremykun.com/2015/12/28/load-balancing-and-the-powe...

Redis supports LRU and random eviction: https://support.redislabs.com/hc/en-us/articles/203290657-Wh...

I wonder if it would be worth adding a k-random eviction strategy, for a balance between the two.

Hello, now Redis (4.0, still in release candidate) supports LFU (Least Frequently Used) which should work better in most cases since LRU also is also just a way to approximate LFU, since the most frequently used items are often the last recently used ones.

[...] since LRU also is also just a way to approximate LFU, since the most frequently used items are often the last recently used ones.

That is not really true. LRU and LFU favor different items, LRU ones used frequently in the short term, LFU ones used frequently in the long term. Only under special circumstances does one approximate the other, for example if your cache is large enough that LRU does not evict the most frequently used items when many less frequently used items are placed in the cache before the most frequently used items are accessed again, then LRU approximates LFU.

What I mean is that, the most frequently used objects, if accessed roughly with an even period of time, tend to be also the least frequently used objects, so if the accesses are very even LFU and LRU tend to be similar. For uneven access patterns, they are different, but usually LRU is used because access patterns are even so that it approximates LFU, since what we want to have in cache is, without other application-specific clues, the objects that we use more often. Knowing exactly the access pattern one can use an application-assisted strategy which is optimal, but without clues LRU adapts well to different scenarios while LRU may fail in a catastrophic fashion.

Also note that your idea of LFU (in this context), from what you write, is perhaps one that does not adapt over time. Redis LFU employs a decay strategy so that the recent frequency is computed, not the whole-lifetime frequency of access, so even short lived items accessed very frequently for some time, will do well and will not get easily evicted.

Full story: http://antirez.com/news/109

That is what I meant, under specific assumptions about the access pattern they can show similar behavior. Classical LFU tracking the absolute frequency of items will keep items that are used frequently in absolute terms but infrequently in terms of the instantaneous frequency in the cache while LRU will evict such items in favor of items with high instantaneous frequency. Those two algorithms are in some sense two extremes, LRU cares about what is most frequently used right now, LFU cares about what is most frequently used over the entire lifetime of the cache.

The LFU variant you describe in the linked article is somewhere in the middle, it tracks the absolute frequency of access just like classical LRU but then it also decays this frequency and therefore turns it into something like a weighted and averaged relative frequency. But it is, at least for me, hard to tell what exactly the algorithm tracks, how the accesses are wighted over time and therefore were exactly this falls on the spectrum between LRU and classical LFU.

Algorithms like ARC and CAR also try to strike a balance between LRU and LFU with the difference that they adaptively control where on the spectrum to operate.

Yes, there is a tuning problem indeed. I made two parameters user-tunable (the decay and the logarithm factor of the Morris counter), but tuning needs some expertise. I've severe limitations on the number of bits I can use per object, but depending on the user feedbacks I may try some auto-adaptive approach as well in the future.

I used hill climbing to tune TinyLFU. A larger admission window improves recency-biased workloads, while a smaller improves for frequency. By sampling the hit rate and making an initial adjustment, it can determine when to change directions and hone in on the best configuration. Other than a small warm-up penalty, it quickly optimizes itself.

Redis's LRU eviction is actually 3-random by default. It does not support true/full LRU.

Edit: this is clearer - https://redis.io/topics/lru-cache#approximated-lru-algorithm

Excerpt from https://raw.githubusercontent.com/antirez/redis/2.8/redis.co...

  # LRU and minimal TTL algorithms are not precise algorithms but approximated
  # algorithms (in order to save memory), so you can select as well the sample
  # size to check. For instance for default Redis will check three keys and
  # pick the one that was used less recently, you can change the sample size
  # using the following configuration directive.
  # maxmemory-samples 3

Redis does a much better thing in recent versions: it uses a few samples per iteration, but instead of discarding the accumulated data, it takes a pool of good candidates to re-check in the next iteration, so the approximation to LRU is very close compared to vanilla "among 3" algorithms. Moreover since the performance was improved, now the default is to look at 5 items making the algorithm better. More than the precision of LRU itself, I'm more intrigued with the potential improvements that LFU (Redis >= 4.0) can provide, but being 4.0 just in RC state, most users are currently just using LRU.

Nice! Thanks for the in-depth explanation :)

[...] choosing the least recently used (LRU) is an obvious choice, since you’re more likely to use something if you’ve used it recently.

There are many other cache replacement policies [1] and they can outperform LRU especially if the quoted assumption is not true. It is for example quite common to have two types of data, one is frequently reused over very long periods, the other is reused over short periods after the first use. In those cases a more complex policy like ARC or CAR can provide noticeable improvements.

[1] https://en.wikipedia.org/wiki/Cache_replacement_policies

> There are many other cache replacement policies [1] and they can outperform LRU especially

The article is about CPU caches so speed and memory usage are very critical.

One of the main benefit of LRU is that it only needs one extra bit per cache line.

Even when discussing CPU caches, there are policies that can severely outperform LRU, specifically at the Last-Level Cache (LLC).

LLCs often are exposed to access streams where the temporal locality (which is the assumption that makes LRU good) has been filtered out by earlier caches.

For that reason, newer cache replacement policies such as RRIP, SHiP, and Hawkeye have significant headroom to improve upon LRU (and PseudoLRU).

> One of the main benefit of LRU is that it only needs one extra bit per cache line.

This might be well known, but why's that? I've recently seen a Go implementation of LRU and it uses a lot of memory. Maybe this would allow us to save some of that in a better implementation.

Requiring only a single bit is only true for a two-way set associative cache, i.e. the content stored at every memory address may only be cached in two different cache slots. In this case you can simply flag the other corresponding cache entry you did not access as least recently uses every time you access a cache entry.

The implementation in software was probably fully associative, i.e. every item can be cached in every slot. This requires a lot more memory and it is the same for caches in processors, they require more additional bits and logic the more freedom they have where to cache the content of every address.

To be more precise, you have to keep track of the order all cache entries were accessed which requires at least about n * log(n) bits unless you only implicitly store the access order by using something like move to front.

> It is for example quite common to have two types of data, one is frequently reused over very long periods, the other is reused over short periods after the first use.

I can't help but see the parallels with garbage collection algorithms, which seem to actually be the same problem approached from a different angle. Which objects should be scanned for cleanup and how often for optimal use of time maps very closely to what items to keep or event from a cache.

Or when transmitting data packets over a network you want to favor real time data like phone or video calls over file downloads. And in case of a failed transmission you want to just forget about the few milliseconds of voice you just lost but you certainly want to retransmit a lost packet from the middle of a binary.

It's all the same, identifying objects with different characteristics in order to handle them with tailored algorithms instead of simply using a generic but less optimal algorithm for all objects.

LRU is predicated on locality of reference.

Ideally we could peer into a crystal ball and know which items are going to be accessed in the near future and keep those in the cache at all costs. Since we don't know that, we use a predictor based on past behavior. If past behavior is random, then prediction is for shit. So we assume that behavior has locality: something referenced recently is more likely to be referenced again than something not referenced recently. Based on that guess, we sort things from least recently used to most recently (or approximately sort, using aging tricks to avoid actually sorting), and boot out the least recently used entries.

If accesses are completely random, then the LRU selection is pure overhead. Or worse; some patterns, like repeated sequential accesses of something large through a small cache, are anti-locality. If we are accessing a big buffer in a circle, then the next thing we are accessing at all times is precisely the least-recently used item that LRU wants to boot out.

For situations like that, caching mechanisms sometimes support some API for providing a hint. "These memory locations are going to be needed soon". "This memory mapped file is being sequentially accessed". And such.

Another characteristic of the data which would make random particularly bad is if reloading/recomputing the most popular items took significantly longer than reloading unpopular items. This could be the case for instance if item data size grew in proportion to popularity.

Dumb question, what is "2-random"?

From the article:

...on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?

"2-random" is his short hand for the above scenario.

[edit: formatting, I have no idea how to signify a quote; I don't want preformatted because it makes a scroll box.]

On HN, typically a quote is done like this:

> This is a quote. It won't become a scroll box, no matter how long it gets. It will just wrap to the next line. True, the next line won't begin with a ">" character, but the convention is that the whole paragraph is a quote if the first line begins with a ">".

Thanks! Sorry, I missed that.

Pick two entries at random, and then choose the least recently used one.


Basically pick-2 allows you to make a mixture of a totally random and optimal choice in cases where you have imperfect information that is much more resilient to the staleness of data than either a random choice or the globally optimum choice.

Often you can get away with dumping the whole cache when it fills up and starting fresh.

that's a horrible pattern. As soon as you go 1 bit over your cache size in a hot loop you'll have a 100% miss rate. (assuming each element is loaded once in the hot loop).

You don't use this pattern while looping. You use it while memoizing results that you don't want to compute again.

Here's an example, in the Python package wordfreq [1]. When you look up word frequencies, there's some normalization it has to do to your text. Some words are very common, and it would be silly to normalize them repeatedly. So frequency lookups are cached.

The cache dictionary has a maximum size of 100,000, so that exceptional or malicious cases don't use up all your memory. It is extremely rare for text to contain 100,000 distinct words, but if that happens, the cache gets full and the entire thing is dropped. It will quickly be re-populated with common words, of course.

Yes, you can make this perform horribly by looking up frequencies of ["zero", "one", "two", "three", "four", ..., "one hundred thousand"] in a loop. That's not realistic data.

Do you actually have a suggestion for how to do this faster? We benchmarked this against other cache strategies (with smaller cache sizes so it would drop sometimes) on realistic data. It's much faster than Python's LRU decorator, for example.

[1] https://github.com/LuminosoInsight/wordfreq/blob/master/word...

Strict LRU has the exact same problem.

That's one reason the article was advocating for 2-Random, since it has more gradual performance degradation when the data set gets too large.

Ouch. At 600+ clock cycles for a cache line fill (more if you're doing NUMA) this doesn't sound even remotely okay.

Sad to see this downvoted.

There is definitely a case where this applies, which is the case where going over your cache size is an exceptional circumstance that you don't expect to reach very often. In that case, all the bookkeeping for determining what to evict is wasted effort.

Curious - why wouldn't your CPU L1/L2/L3 caches always be "full?" What does it even mean to have a "full" cache in that context? Wouldn't they, under normal operations, typically fill up in a few seconds? (If not milliseconds?)

They will, but not with the data of the currently running process.

When the cost of retrieving an item from your slower storage is pretty much the same for every item, may it be old or new, small or big, popular or not.

Ajax made predictive (random)prefetching mainstream .its only natural that random eviction algorithm are coming along . when you pick one end of stick you pick another too

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact