
Constant Time LFU - lukastyrychtr
https://arpitbhayani.me/blogs/lfu
======
ridiculous_fish
I read it; I will try to summarize the idea with some intuition.

A classic LRU cache has a map from key to value, with the entries of the map
arranged in a doubly linked list. When a value is accessed, it is promoted to
the front of the list. So the last item in the list is least-recently
accessed.
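
A minimal LRU sketch in Python, using OrderedDict (whose hash map plus
doubly linked list is exactly this structure); a toy, not production code:

    import collections

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = collections.OrderedDict()  # hash map + linked list

        def get(self, key):
            if key not in self.data:
                return None
            self.data.move_to_end(key, last=False)  # promote to the front
            return self.data[key]

        def put(self, key, value):
            self.data[key] = value
            self.data.move_to_end(key, last=False)
            if len(self.data) > self.capacity:
                self.data.popitem()  # drop the least-recently-used tail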

In an LFU cache, you might try the same idea, except each node also has a
counter. When an item is accessed, increment its counter, and then swap with
previously-higher-counter items to keep the list sorted. In this way an
infrequently accessed item does not get promoted.

However, this has a problem for certain access patterns. If every item has the
same counter value, then accessing the last item in the list will force it to
migrate up the whole list - O(N). You could use a heap instead, and get log
time, but that's still not the constant time that LRU provides.

What's proposed here is to bucket items by counter. Each bucket has a linked
list of entries, and each entry knows its bucket. The buckets themselves are
arranged in a linked list. When an item in bucket N is accessed, it
conceptually unhooks itself from bucket N and adds itself to bucket N+1,
creating that bucket if necessary.
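
Here's a minimal Python sketch of that access path (dicts and sets stand in
for the paper's linked frequency nodes; a toy, not production code):

    import collections

    class LFUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.values = {}  # key -> value
            self.count = {}   # key -> access count
            self.buckets = collections.defaultdict(set)  # count -> keys
            self.min_count = 0  # lowest count currently present

        def get(self, key):
            if key not in self.values:
                return None
            c = self.count[key]
            self.buckets[c].remove(key)  # unhook from bucket c...
            if not self.buckets[c]:
                del self.buckets[c]
                if self.min_count == c:
                    self.min_count = c + 1
            self.count[key] = c + 1
            self.buckets[c + 1].add(key)  # ...and hook into bucket c+1
            return self.values[key]

        def put(self, key, value):
            if key in self.values:
                self.values[key] = value
                self.get(key)  # bump its count
                return
            if len(self.values) >= self.capacity:
                # evict from the lowest bucket; the paper breaks ties by
                # recency within the bucket, a set ignores that detail
                victim = next(iter(self.buckets[self.min_count]))
                self.buckets[self.min_count].remove(victim)
                if not self.buckets[self.min_count]:
                    del self.buckets[self.min_count]
                del self.values[victim], self.count[victim]
            self.values[key] = value
            self.count[key] = 1
            self.buckets[1].add(key)
            self.min_count = 1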

Incidentally, a tip for writing this sort of cache: make your list circular,
starting with a sentinel node (I call it the "mouth" since it eats its tail).
This eliminates all the null checks in the linked list manipulation.
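
For illustration, a sentinel-based circular list in Python; since the
"mouth" is always present, splicing never has to branch on None:

    class Node:
        __slots__ = ("prev", "next", "key")

        def __init__(self, key=None):
            self.prev = self.next = self  # circular: an empty list is
            self.key = key                # just the sentinel, eating its tail

    def insert_front(mouth, node):
        # splice node in right after the sentinel; no null checks needed
        node.prev, node.next = mouth, mouth.next
        mouth.next.prev = node
        mouth.next = node

    def unlink(node):
        # works even on the last real node: the sentinel absorbs the links
        node.prev.next = node.next
        node.next.prev = node.prev
        node.prev = node.next = node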

~~~
fluffy87
Is there a way to make these data structures hardware-friendly, so that SIMD
instructions can be used, memory is traversed sequentially without multiple
pointer indirections, etc.?

Are there any blog posts about this?

~~~
jasonwatkinspdx
For chaining hash tables like this you can use a linked list of buckets where
each bucket allows multiple entries, and entries can be scanned with SIMD
operations. The downside is that it considerably complicates the
insertion/update logic, so depending on the workload it may be a net
performance loss. If the hash table is designed for concurrency, the extra
complexity can easily lead to tricky-to-understand bugs or performance
cliffs.
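
A rough sketch of that layout in Python, with numpy's vectorized compare
standing in for a real SIMD compare-and-movemask (sizes and names here are
made up):

    import numpy as np

    BUCKET_SIZE = 8       # entries per bucket, sized to a SIMD register
    EMPTY = np.uint64(0)  # assumes 0 is never a real key

    class Bucket:
        def __init__(self):
            self.keys = np.zeros(BUCKET_SIZE, dtype=np.uint64)  # contiguous
            self.values = [None] * BUCKET_SIZE
            self.next = None  # overflow bucket, chained like a linked list

        def find(self, key):
            # one vectorized compare scans the whole bucket at once; with
            # real SIMD this is a single compare plus movemask
            hits = np.nonzero(self.keys == np.uint64(key))[0]
            return self.values[hits[0]] if hits.size else None

        def insert(self, key, value):
            # claim the first empty lane; on overflow the caller would
            # chain a new bucket onto self.next
            free = np.nonzero(self.keys == EMPTY)[0]
            if free.size == 0:
                return False
            self.keys[free[0]] = key
            self.values[free[0]] = value
            return True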

------
NightMKoder
In practice I think cache eviction policies have trended toward optimizing
_memory_ usage. Storing two pointers per cache entry for LRU on 64-bit CPUs
uses 16 bytes of data per entry - which adds up quickly! Especially if you're
storing e.g. a like count (so an int64 or float per cache value). This
approach to LFU has an even higher overhead.

There are a few cool alternatives. The first and most obvious: use a good
dense hash table implementation! For eviction strategies, a random eviction
strategy actually does pretty well in the general case and requires no extra
memory - great if you're caching an int32->int32 mapping and the cost of
recomputing isn't astronomical. You can also try the CLOCK algorithm, which
requires just 1 bit of storage per cache entry.
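
A minimal CLOCK sketch in Python (one reference bit per slot; real
implementations differ in the details):

    class ClockCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.slots = {}                       # key -> (value, slot)
            self.keys = [None] * capacity         # ring of cached keys
            self.referenced = [False] * capacity  # the 1-bit-per-entry state
            self.hand = 0

        def get(self, key):
            if key in self.slots:
                value, i = self.slots[key]
                self.referenced[i] = True  # mark, but don't move anything
                return value
            return None

        def put(self, key, value):
            if key in self.slots:
                _, i = self.slots[key]
                self.slots[key] = (value, i)
                self.referenced[i] = True
                return
            # sweep the hand, clearing reference bits, until an
            # unreferenced slot is found; that victim is evicted
            while self.referenced[self.hand]:
                self.referenced[self.hand] = False
                self.hand = (self.hand + 1) % self.capacity
            if self.keys[self.hand] is not None:
                del self.slots[self.keys[self.hand]]
            self.keys[self.hand] = key
            self.slots[key] = (value, self.hand)
            self.referenced[self.hand] = True
            self.hand = (self.hand + 1) % self.capacity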

From an abstract perspective, all page replacement algorithms can also be used
as cache eviction policies. There's a whole list of these at
[https://en.wikipedia.org/wiki/Page_replacement_algorithm#Page_replacement_algorithms](https://en.wikipedia.org/wiki/Page_replacement_algorithm#Page_replacement_algorithms).

------
rrobukef
Cache sizes are nowhere near the size where an O(log n) factor is
significant. In 1 TB of RAM you can fit at most 2^36 entries (64-bit key to
64-bit value without additional data structures). A factor of 36 is easily
dominated by CPU-cache behaviour. This analysis should not be done with time
complexities but with deeper tools.

While binary heaps are notorious for bad CPU-cache behaviour, improvements
exist. Thus only benchmarks will convince me.

------
eloff
The overhead here, in terms of space per element and the amount of work done
while holding a mutex, seems like it would only suit some kinds of
workloads. In particular, the design doesn't lend itself well to lock-free
algorithms or other techniques for reducing contention.

Knowing how some real-life high-performance LFU caches work, I think this is
optimizing the wrong thing. Look at the design of caffeine and ristretto[1]
for some discussion of the tradeoffs with actual measurements. I believe both
are based on TinyLFU with modifications.

I don't know of any serious LFU implementations that actually use a min heap
and are still competitive, so the premise that LFU is commonly O(log N) is
not actually true.

[1] [https://dgraph.io/blog/post/introducing-ristretto-high-perf-go-cache/](https://dgraph.io/blog/post/introducing-ristretto-high-perf-go-cache/)

------
ghj
I've implemented this for leetcode without realizing it's paper-worthy:
[https://leetcode.com/problems/lfu-cache/](https://leetcode.com/problems/lfu-cache/)

My implementation:
[https://pastebin.com/48mMTTdK](https://pastebin.com/48mMTTdK)

I am surprised to see that the paper was only published in 2010. I would
expect something this simple to have been discovered earlier. I guess it's
easy only after someone tells you constant time is possible.

~~~
dominotw
I had this on a phone screen :D.

~~~
person_of_color
Did you get the job?

------
contravariant
You could do this, although LRU has the nice property that it is at most a
constant factor slower than an optimal (prescient) algorithm that has less
memory available [1] (e.g. with twice as much memory as the optimal
algorithm, LRU incurs at most roughly twice as many misses). Interestingly,
FIFO has the same property, suggesting that even LRU's attempt to keep
frequently used items in cache can only have a limited effect (so in some
cases FIFO might be faster simply due to lower overhead).
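
For reference, the bound from [1], for LRU given k pages of memory against
an optimal algorithm given h <= k pages, up to an additive constant:

    faults_LRU(k) <= k / (k - h + 1) * faults_OPT(h) + O(1)

With k = 2h the ratio is 2h/(h+1), i.e. just under 2.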

LFU, by contrast, has access patterns that force it to be slower than LRU.
From the paper referenced above:

>A counterexample for LFU is a sequence of k + 1 accesses to each of N - 1
pages, followed by 2k alternating accesses to two new pages, k to each, with
this pattern repeated indefinitely (so that all but the initial N - 1 pages
have k accesses). [here N is the size of the cache]

[1]: [https://www.cs.cmu.edu/~sleator/papers/amortized-efficiency.pdf](https://www.cs.cmu.edu/~sleator/papers/amortized-efficiency.pdf)

------
T0pH4t
Maybe also consider TinyLFU
([https://arxiv.org/abs/1512.00727](https://arxiv.org/abs/1512.00727)).
I've implemented the windowed version with segmented caches and it works
really well, though it is a much more complicated implementation.
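
A toy sketch of the core admission idea in Python - a count-min sketch
estimates frequencies, and a candidate is admitted only if it beats the
would-be victim (the sizes and aging rule here are made up; see the paper
for the real design):

    import random

    class TinyLFUAdmitter:
        def __init__(self, width=4096, depth=4):
            self.width, self.depth = width, depth
            self.tables = [[0] * width for _ in range(depth)]
            self.seeds = [random.randrange(1 << 31) for _ in range(depth)]
            self.ops = 0

        def _slots(self, key):
            return [(i, hash((key, s)) % self.width)
                    for i, s in enumerate(self.seeds)]

        def record(self, key):
            for i, slot in self._slots(key):
                self.tables[i][slot] += 1
            self.ops += 1
            if self.ops >= 8 * self.width:  # periodic halving = aging
                self.tables = [[c // 2 for c in row]
                               for row in self.tables]
                self.ops //= 2

        def estimate(self, key):
            # count-min: the minimum over rows bounds the true count
            return min(self.tables[i][slot]
                       for i, slot in self._slots(key))

        def admit(self, candidate_key, victim_key):
            return self.estimate(candidate_key) > self.estimate(victim_key)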

~~~
NovaX
Have you tried making it adaptive [1]? Using hill climbing with a decay rate
does a nice job of correcting for LRU-biased workloads [2].

[1]
[https://dl.acm.org/doi/10.1145/3274808.3274816](https://dl.acm.org/doi/10.1145/3274808.3274816)
[2] [https://github.com/ben-manes/caffeine/wiki/Efficiency#adaptivity](https://github.com/ben-manes/caffeine/wiki/Efficiency#adaptivity)

------
The_rationalist
Recency and frequency are two orthogonally useful pieces of information that
can allow better evictions. It's not that LFU is superior to LRU or the
reverse.

The best of both worlds is to use LRFU. Now an O(1) LRFU implementation
would be nice to see. Also, I believe there's no talk on this blog about the
constant factor.

After a code search on GitHub: Gecko, Chromium, and Linux have zero LFUs
despite having tons of LRUs. And zero LRFUs too. So it seems that browser
and OS engineers should learn better cache algorithms; it could allow huge
performance gains!

~~~
jleahy
The main Linux page cache is 50% LRU and 50% LFU (half of the available
memory devoted to each).

------
namanaggarwal
I have been asked to implement this in Google, Amazon and Facebook interviews
and used the same method. :)

------
compsciphd
This is a core data structure built into Java.

[https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html](https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html)

~~~
pxx
You're thinking of the data structure commonly used for an LRU cache, not an
LFU cache.

~~~
compsciphd
You're 100% right, stupid mistake on my part. Though the implementations are
very similar, you just can't easily reuse the Java implementation I linked
to for this.

