This is particularly relevant for databases and keeping rows in memory.
The algorithm is something like "add two to an item's weight when it is used, subtract one from every item's weight every second, and under pressure evict items from the lowest weight to the highest."
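A minimal sketch of that policy (hypothetical, not any particular database's implementation) might look like this:

    # Hypothetical sketch: +2 weight on use, -1 per second for every item,
    # evict from the lowest weight upward under pressure.
    class WeightedCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.values = {}    # key -> cached value
            self.weights = {}   # key -> weight

        def get(self, key):
            if key in self.values:
                self.weights[key] += 2          # "add two when an item is used"
            return self.values.get(key)

        def put(self, key, value):
            if key not in self.values and len(self.values) >= self.capacity:
                self._evict_one()               # under pressure
            self.values[key] = value
            self.weights[key] = self.weights.get(key, 0) + 2

        def decay_tick(self):
            # Call once per second: "decay one from every item".
            for key in self.weights:
                self.weights[key] = max(0, self.weights[key] - 1)

        def _evict_one(self):
            victim = min(self.weights, key=self.weights.get)  # lowest weight first
            del self.values[victim]
            del self.weights[victim]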
LIRS and TinyLFU make better use of their history to filter out low-value items early. The fundamental difference in their designs is their starting point: LIRS starts from LRU and infers frequency, whereas TinyLFU starts from frequency and infers recency. There is work on an Adaptive TinyLFU that mimics LRU by using an admission window, sampling the hit rate, and using hill climbing to adjust the region sizes to best fit the workload.
In a world where advances are built on top of other advancements, patents just stifle innovation.
Also this patent expires in 8 years. Will we still need caches in 8 years? Why yes we will!
One such instruction can be used with the _mm_stream_si128 intrinsic:
void _mm_stream_si128(__m128i *p, __m128i a);
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16 byte aligned.
Learned about it in "What every programmer should know about memory":
It's from 2007, but still relevant and a good read.
You probably don't want to do that for regular RAM unless you access a lot of data only once and your access pattern is completely unpredictable (which is almost never the case).
Even if you only read the data once, having a cache means the prefetcher can work its magic if it notices a pattern in your accesses, preloading what it thinks you'll read next into the cache ahead of you.
You can also help the prefetcher by explicitly adding preload instructions (if the ISA allows it, at least) if you know ahead of time where you'll read next but expect the prefetcher won't be able to guess it. That's useful if, for instance, you're navigating through some sort of linked-list data structure, where the access pattern potentially looks random "from the outside" but you know where you'll go next in your algorithm.
Similarly, explicit prefetching usually does not improve performance, but rather reduces it.
(Non-temporal stores are quite a good example here, since a game engine used them in a few spots until recently, causing not only worse performance on Intel chips, but also heavily deteriorated performance on AMD's Zen µarch. Removing them improved performance for all chips across the board. Ouch!)
That said, most modern machines also have stream-oriented instructions (SSE non-temporal stores, for example) that don't allocate on write, but you have to be careful because workload after workload has shown them to be slower. There are a couple of reasons, but the most important thing to remember is that microbenchmarks repeatedly copying the same piece of data over and over aren't really reflective of most workloads.
The cache partitioning in Broadwell can help with this. If you have a very noisy background task that accesses a lot of data (e.g. garbage collection scanning), you could partition it off with a small slice of cache so that it doesn't evict the data the rest of your application uses. I got myself a Broadwell Xeon v4 for my desktop specifically so I could play with this feature.
For memory-mapped regions there is madvise() MADV_DONTNEED to tell the kernel that you won't need this data again, so it shouldn't bother keeping it in the disk cache.
For the CPU cache there are non-temporal instructions like MOVNTI that tell the CPU not to bother caching the move, to prevent cache pollution.
"A case for the non-temporal store"
I wonder if it would be worth adding a k-random eviction strategy, for a balance between the two.
That is not really true. LRU and LFU favor different items: LRU favors items used frequently in the short term, LFU items used frequently in the long term. Only under special circumstances does one approximate the other. For example, if your cache is large enough that LRU does not evict the most frequently used items even when many less frequently used items are placed in the cache before the most frequently used items are accessed again, then LRU approximates LFU.
Also note that your idea of LFU (in this context), from what you write, is perhaps one that does not adapt over time. Redis LFU employs a decay strategy so that the recent frequency is computed, not the whole-lifetime frequency of access, so even short-lived items accessed very frequently for some time will do well and will not get easily evicted.
Full story: http://antirez.com/news/109
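The general decayed-counter idea (this is just an illustration, not Redis's actual LFU code) can be sketched like this:

    import time

    # Illustrative decaying access counter: recent accesses dominate,
    # old ones fade away exponentially. The half-life value is made up.
    class DecayingCounter:
        def __init__(self, halflife_seconds=600.0):
            self.halflife = halflife_seconds
            self.value = 0.0
            self.last_update = time.monotonic()

        def _decay(self):
            now = time.monotonic()
            elapsed = now - self.last_update
            self.value *= 0.5 ** (elapsed / self.halflife)  # exponential decay
            self.last_update = now

        def touch(self):        # call on every access
            self._decay()
            self.value += 1.0

        def read(self):         # approximate "recent frequency"
            self._decay()
            return self.value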
The LFU variant you describe in the linked article is somewhere in the middle: it tracks the absolute frequency of access just like classical LFU, but then it also decays this frequency and therefore turns it into something like a weighted and averaged relative frequency. But it is, at least for me, hard to tell what exactly the algorithm tracks, how the accesses are weighted over time, and therefore where exactly this falls on the spectrum between LRU and classical LFU.
Algorithms like ARC and CAR also try to strike a balance between LRU and LFU with the difference that they adaptively control where on the spectrum to operate.
Edit: this is clearer - https://redis.io/topics/lru-cache#approximated-lru-algorithm
Excerpt from https://raw.githubusercontent.com/antirez/redis/2.8/redis.co...
# LRU and minimal TTL algorithms are not precise algorithms but approximated
# algorithms (in order to save memory), so you can select as well the sample
# size to check. For instance for default Redis will check three keys and
# pick the one that was used less recently, you can change the sample size
# using the following configuration directive.
# maxmemory-samples 3
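In other words (a rough sketch of the sampling idea, not Redis's actual code): on eviction, look at a handful of random keys and drop the one that was used least recently.

    import random
    import time

    last_access = {}   # key -> monotonic timestamp of last use

    def record_use(key):
        last_access[key] = time.monotonic()

    def evict_one(samples=3):
        # maxmemory-samples style: only inspect `samples` random keys.
        candidates = random.sample(list(last_access), min(samples, len(last_access)))
        victim = min(candidates, key=last_access.get)
        del last_access[victim]   # a real cache would drop the stored value too
        return victim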
There are many other cache replacement policies, and they can outperform LRU, especially if the quoted assumption is not true. It is, for example, quite common to have two types of data: one is frequently reused over very long periods, the other is reused over short periods after the first use. In those cases a more complex policy like ARC or CAR can provide noticeable improvements.
The article is about CPU caches, so speed and memory usage are critical.
One of the main benefits of LRU is that it only needs one extra bit per cache line.
LLCs often are exposed to access streams where the temporal locality (which is the assumption that makes LRU good) has been filtered out by earlier caches.
For that reason, newer cache replacement policies such as RRIP, SHiP, and Hawkeye have significant headroom to improve upon LRU (and PseudoLRU).
This might be well known, but why's that? I've recently seen a Go implementation of LRU and it uses a lot of memory. Maybe this would allow us to save some of that in a better implementation.
The implementation in software was probably fully associative, i.e. every item can be cached in every slot. This requires a lot more memory, and it is the same for caches in processors: they require more additional bits and logic the more freedom they have in where to cache the contents of each address.
To be more precise, you have to keep track of the order in which all cache entries were accessed, which requires at least about n * log(n) bits unless you only implicitly store the access order by using something like move-to-front.
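For illustration, here's a minimal move-to-front LRU where the access order is stored implicitly in the structure (Python's OrderedDict) rather than as explicit per-entry counters:

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = OrderedDict()   # order-preserving dict

        def get(self, key):
            if key not in self.items:
                return None
            self.items.move_to_end(key)          # move to "front": most recently used
            return self.items[key]

        def put(self, key, value):
            if key in self.items:
                self.items.move_to_end(key)
            self.items[key] = value
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # evict the least recently used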
I can't help but see the parallels with garbage collection algorithms, which seem to actually be the same problem approached from a different angle. Which objects should be scanned for cleanup, and how often, for optimal use of time maps very closely to deciding which items to keep or evict from a cache.
It's all the same, identifying objects with different characteristics in order to handle them with tailored algorithms instead of simply using a generic but less optimal algorithm for all objects.
Ideally we could peer into a crystal ball and know which items are going to be accessed in the near future and keep those in the cache at all costs. Since we don't know that, we use a predictor based on past behavior. If past behavior is random, then prediction is for shit. So we assume that behavior has locality: something referenced recently is more likely to be referenced again than something not referenced recently. Based on that guess, we sort things from least recently used to most recently (or approximately sort, using aging tricks to avoid actually sorting), and boot out the least recently used entries.
If accesses are completely random, then the LRU selection is pure overhead. Or worse: some patterns, like repeated sequential accesses of something large through a small cache, exhibit anti-locality. If we are accessing a big buffer in a circle, then the next thing we access is, at all times, precisely the least recently used item that LRU wants to boot out.
For situations like that, caching mechanisms sometimes support some API for providing a hint. "These memory locations are going to be needed soon". "This memory mapped file is being sequentially accessed". And such.
...on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?
"2-random" is his short hand for the above scenario.
[edit: formatting, I have no idea how to signify a quote; I don't want preformatted because it makes a scroll box.]
> This is a quote. It won't become a scroll box, no matter how long it gets. It will just wrap to the next line. True, the next line won't begin with a ">" character, but the convention is that the whole paragraph is a quote if the first line begins with a ">".
Basically, pick-2 lets you mix a totally random choice with the optimal choice in cases where you have imperfect information, and that mixture is much more resilient to stale data than either a purely random choice or the globally optimal choice.
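As a concrete sketch of 2-random eviction (assuming a dict-backed cache with per-key timestamps and a capacity of at least two; not the article's implementation):

    import random
    import time

    class TwoRandomCache:
        def __init__(self, capacity):
            assert capacity >= 2
            self.capacity = capacity
            self.values = {}      # key -> value
            self.last_used = {}   # key -> monotonic timestamp

        def get(self, key):
            if key in self.values:
                self.last_used[key] = time.monotonic()
            return self.values.get(key)

        def put(self, key, value):
            if key not in self.values and len(self.values) >= self.capacity:
                # Pick two entries at random, evict whichever was used longer ago.
                a, b = random.sample(list(self.values), 2)
                victim = a if self.last_used[a] <= self.last_used[b] else b
                del self.values[victim]
                del self.last_used[victim]
            self.values[key] = value
            self.last_used[key] = time.monotonic()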
Here's an example, in the Python package wordfreq. When you look up word frequencies, there's some normalization it has to do to your text. Some words are very common, and it would be silly to normalize them repeatedly. So frequency lookups are cached.
The cache dictionary has a maximum size of 100,000, so that exceptional or malicious cases don't use up all your memory. It is extremely rare for text to contain 100,000 distinct words, but if that happens, the cache gets full and the entire thing is dropped. It will quickly be re-populated with common words, of course.
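The pattern being described is roughly this (a hypothetical sketch, not wordfreq's actual code):

    CACHE_SIZE = 100_000
    _cache = {}

    def cached_normalize(word, normalize):
        if word in _cache:
            return _cache[word]
        result = normalize(word)        # the expensive normalization step
        if len(_cache) >= CACHE_SIZE:
            _cache.clear()              # cache is full: drop the whole thing
        _cache[word] = result
        return result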
Yes, you can make this perform horribly by looking up frequencies of ["zero", "one", "two", "three", "four", ..., "one hundred thousand"] in a loop. That's not realistic data.
Do you actually have a suggestion for how to do this faster? We benchmarked this against other cache strategies (with smaller cache sizes so it would drop sometimes) on realistic data. It's much faster than Python's LRU decorator, for example.
That's one reason the article was advocating for 2-Random, since it has more gradual performance degradation when the data set gets too large.
There is definitely a case where this applies, which is the case where going over your cache size is an exceptional circumstance that you don't expect to reach very often. In that case, all the bookkeeping for determining what to evict is wasted effort.