Caches: LRU vs. Random (2014) (danluu.com)
92 points by eatonphil 3 months ago | 26 comments



I've been playing a lot of Baldur's Gate 3 lately and this kind of reminds me of when you have to roll a d20 in a skill check. Early in the game there are skill checks with a difficulty class of 2 (you need to roll a 2 or higher to succeed), but you might have proficiency in that skill, so you get a +3 modifier to your roll.

But if your base d20 roll is a 1, it's considered a critical fail and your modifiers are not added: you fail no matter what. So there is always a 5% chance of failing (1 in 20).

However, you can have special items or spells which grant you advantage on a roll: you roll 2 d20 instead of 1 and take the higher of the two. Now the only way to critically fail is to roll a 1 on both d20s, which has only a 0.25% chance (1 in 400).

So the "2 random choices" cache eviction policy feels kinda like rolling with advantage in D&D: Roll 2 die (pick to random keys) and evict the higher time since last access to evict.


I love that game and all the source material, but I really dislike the d20 as a base. I've been working on a system that uses a fixed number of d6, since with just a few d6 you get something closer to a normal distribution, so you can model real-life skills more accurately and provide a better spread of outcomes for higher-level players than "I have a 5% chance to fail, or a 90+% chance in something I'm not good at".

20 years of d&d will motivate some unofficial errata.


Check out the Dominions series, which just released its 6th installment this week! It uses "exploding D6s" for its rolls, which means rolling a bunch of D6s and then re-rolling any 6s (recursively) and summing the result. This results in a fascinating long-tailed distribution (similar to advantage/disadvantage), so that even the weakest unit can occasionally deal damage to bosses.
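
If anyone wants to play with the distribution, here's a minimal Python sketch of one exploding d6 (my own code, function name made up):

    import random

    def exploding_d6():
        # any 6 is kept and re-rolled; every roll is summed
        total = 0
        while True:
            roll = random.randint(1, 6)
            total += roll
            if roll != 6:
                return total

    # a "bunch of d6s" is just the sum of several of these;
    # the occasional chain of 6s produces the long right tail
    print(sum(exploding_d6() for _ in range(3)))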


I didn't mention it but I stole this wholesale! It does add a very nice long tail.


I love the Shadowrun system, where you roll a whole bunch of d6. To succeed, a certain number of these need to hit a threshold. Proficiency and other advantages are granted by letting you increase the number of d6 you roll.


Yeah that's nice. Does it reduce the d6 to a d2 basically?

In this system we're trying, you roll a fixed batch of dice and grab as many as you can that add up to your skill level. The number so "captured" is how well you do.

This matches the normal CDF quite well, with higher skill naturally being better, and gives nice means and variances that "spread out" at higher levels of difficulty. So easy tasks are easy because you need to capture fewer dice and are more likely to do so. Harder tasks have a much broader range of skill levels with a decent chance of succeeding, due to the higher variance, so everyone gets to help.

We added a bunch of sugar on top: multi-roll complex checks as minigames, synergy so more people can help, and so on, to make rolls as collaborative as possible.

I could go on and on...I'm in the process of writing it up for open use.


The critical fail on a 1 is a variant rule that I hate; it makes everything feel like slapstick comedy. I do wish it were played as written.


> So we've seen that this works, but why would anyone think to do this in the first place? The Power of Two Random Choices: A Survey of Techniques and Results by Mitzenmacher, Richa, and Sitaraman has a great explanation. The mathematical intuition is that if we (randomly) throw n balls into n bins, the maximum number of balls in any bin is O(log n / log log n) with high probability, which is pretty much just O(log n). But if (instead of choosing randomly) we choose the least loaded of k random bins, the maximum is O(log log n / log k) with high probability, i.e., even with two random choices, it's basically O(log log n) and each additional choice only reduces the load by a constant factor.

This sounds very powerful! And intuitively doesn't make any sense to me. So, say I have n choices, I do not know which choice is better. Does this result say that it's more efficient to randomly choose k and find the best among them (even for k=2), instead of randomly choosing a single choice? Could someone point me in the right direction?


> This sounds very powerful! And intuitively doesn't make any sense to me. So, say I have n choices, I do not know which choice is better. Does this result say that it's more efficient to randomly choose k and find the best among them (even for k=2), instead of randomly choosing a single choice? Could someone point me in the right direction?

Yes, let's consider the k=2 vs selecting a random item from a set of N items:

Selecting one item uniformly randomly from a set of N is identical to selecting two distinct items and then picking one of those two uniformly randomly.

So if you select items A and B, then pick the best item, you end up with an item of MAX(A,B) utility instead of MEAN(A,B) utility. MAX(A,B) >= MEAN(A,B) should hopefully be obvious.
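
If the balls-and-bins claim from the quoted passage still feels abstract, it's quick to simulate. A rough sketch of my own (function name and trial count are arbitrary), throwing n balls into n bins with either one random choice or the least loaded of two:

    import random

    def max_load(n, k, trials=5):
        # throw n balls into n bins; each ball lands in the least loaded of k random bins
        worst = 0
        for _ in range(trials):
            bins = [0] * n
            for _ in range(n):
                candidates = [random.randrange(n) for _ in range(k)]
                bins[min(candidates, key=lambda i: bins[i])] += 1
            worst = max(worst, max(bins))
        return worst

    n = 100_000
    print("k=1:", max_load(n, 1))   # grows like log n / log log n
    print("k=2:", max_load(n, 2))   # more like log log n; noticeably smaller in practice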


Random numbers sound possibly expensive for 'small' caches. Though offhand I guess that if there's a hardware RNG source, it doesn't need to be cryptographically secure or trusted for this task, as long as it's fast.

Similarly, if updates are bursty, an idle cull could use a bitvector to mark free slots within a fixed array, for rather low overhead and a burst of speedy inserts when load picks up. Something close to this is probably already necessary overhead for when the cache isn't yet populated.
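
Something like this is what I have in mind for the free-slot bitvector; a rough Python sketch (class and method names are just made up for illustration):

    class SlotCache:
        """Fixed array of slots plus an integer bitmask marking which slots are free."""
        def __init__(self, capacity):
            self.slots = [None] * capacity
            self.free_mask = (1 << capacity) - 1   # bit i set => slot i is free

        def insert(self, entry):
            if self.free_mask == 0:
                return False                        # nothing free; caller must evict
            i = (self.free_mask & -self.free_mask).bit_length() - 1  # index of lowest free slot
            self.slots[i] = entry
            self.free_mask &= ~(1 << i)
            return True

        def cull(self, i):
            # run during idle time: mark slot i as reusable
            self.slots[i] = None
            self.free_mask |= 1 << i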


You've explained that wonderfully, thank you! I'm not the original commenter, but was also confused by my intuition here. That makes a lot more sense now :)


It didn't make any sense to me.

I think an intuitive perspective is that if you pick one item randomly, you can't use the LRU signal. So you pick two items; now you also get to use LRU. Notice that it doesn't add any value to pick three random items and then apply LRU.


> Notice that it doesn't add any value to pick three random items and then apply LRU.

It actually does. The more items you pick, the better it works, because in the extreme case of k=N you will always pick the best item. For a lot of real-world distributions there are rapidly diminishing returns with higher k, but the returns are still there.

[edit]

The same example as above works for k=3, observing that discarding one of the 3 chosen uniformly randomly devolves to the k=2 form.

k=3: MAX(A,B,C)

k=2: MEAN(MAX(A,B),MAX(B,C),MAX(A,C))

k=1: MEAN(A,B,C)
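
To see the diminishing returns concretely: if item utilities were, say, independent uniform(0,1) draws, the expected best of k picks is k/(k+1), i.e. 0.50, 0.67, 0.75, 0.80, ..., so each extra choice buys less than the previous one. A quick Monte Carlo check (my own sketch, with that assumed uniform distribution):

    import random

    def expected_best_of_k(k, trials=200_000):
        # estimate E[max of k independent uniform(0,1) draws]; exact value is k / (k + 1)
        return sum(max(random.random() for _ in range(k)) for _ in range(trials)) / trials

    for k in (1, 2, 3, 4, 8):
        print(k, round(expected_best_of_k(k), 3))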


If you think you need a better k because you have insight into your distribution, then don't use uniform sampling; use something more appropriate instead.

When k=N you degrade to LRU.

Your MAX / MEAN logic is completely flawed. If k=N then you'd have MAX over all N items; wouldn't this be optimal? But this is just LRU, and LRU is not optimal.

All this is hand-waving. The proper response is to use the language of mathematics to prove how this algorithm behaves. Perhaps there is an optimal k.


I didn't realize we had moved on to talking about cache-replacement rather than putting balls in buckets. For putting balls in buckets (or any case where it is possible to compare options and come out with a "best"), k=N is optimal.

For cache-replacement strategies, you can prove very little for the general case, since for any two reasonable replacement strategies, there is usually an access pattern that favors one over the other. When TFA says "Random and FIFO are both strictly worse than either LRU or 2-random", the context is running SPEC CPU with an 8-way associative cache. It's not hard to come up with artificial benchmarks that invert that relationship.

Also TFA contradicts your earlier assertion that k=3 is not better than k=2, at least when used with a pseudo-LRU: "Also, we can see that pseudo 3-random is substantially better than pseudo 2-random, which indicates that k-random is probably an improvement over 2-random for the k."


The algorithm described here (randomly picking some small number of keys and evicting the oldest one among them) is the core of how Redis' "LRU" cache eviction policies work, in case anyone was wondering.
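
A rough Python sketch of the sampled-eviction idea, in case it helps; this is just an illustration, not Redis' actual implementation (the names and the sample size of 5 are my own choices):

    import random, time

    class SampledLRUCache:
        def __init__(self, capacity, sample_size=5):
            self.capacity = capacity
            self.sample_size = sample_size
            self.data = {}          # key -> value
            self.last_access = {}   # key -> time of last access

        def get(self, key):
            if key in self.data:
                self.last_access[key] = time.monotonic()
                return self.data[key]
            return None

        def put(self, key, value):
            if key not in self.data and len(self.data) >= self.capacity:
                # approximate LRU: sample a few keys, evict the one idle the longest
                sample = random.sample(list(self.data), min(self.sample_size, len(self.data)))
                victim = min(sample, key=self.last_access.get)
                del self.data[victim]
                del self.last_access[victim]
            self.data[key] = value
            self.last_access[key] = time.monotonic()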


In a similar vein, there's a neat "Efficient Page Replacement" strategy described in a LeanStore paper [0] that combines random selection with FIFO:

> Instead of tracking frequently accessed pages in order to avoid evicting them, our replacement strategy identifies infrequently-accessed pages. We argue that with the large buffer pool sizes that are common today, this is much more efficient as it avoids any additional work when accessing a hot page

> by speculatively unswizzling random pages, we identify infrequently-accessed pages without having to track each access. In addition, a FIFO queue serves as a probational cooling stage during which pages have a chance to be swizzled. Together, these techniques implement an effective replacement strategy at low cost

[0] https://db.in.tum.de/~leis/papers/leanstore.pdf
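
Very roughly, as I read the quoted description (my own sketch of the idea only, not what the paper actually implements; all names are invented): random pages get speculatively "unswizzled" into a cooling FIFO, pages touched while cooling are rescued back to hot, and whatever survives the whole queue untouched gets evicted.

    import random
    from collections import deque

    class CoolingBuffer:
        def __init__(self, pages, cooling_fraction=0.1):
            self.hot = set(pages)      # "swizzled" pages; hot accesses cost nothing extra
            self.cooling = deque()     # FIFO probational stage
            self.cooling_target = max(1, int(len(pages) * cooling_fraction))

        def access(self, page):
            if page in self.cooling:   # touched while cooling: it was hot after all
                self.cooling.remove(page)
                self.hot.add(page)

        def evict_one(self):
            # top up the cooling stage with randomly chosen hot pages
            while self.hot and len(self.cooling) < self.cooling_target:
                page = random.choice(tuple(self.hot))
                self.hot.remove(page)
                self.cooling.append(page)
            # the page that survived the whole FIFO untouched is the eviction victim
            return self.cooling.popleft() if self.cooling else None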


Came here to say the same thing.

There's a follow-up paper from last year that talks about some refinement of this strategy (6.2 page replacement), among other things.

"The Evolution of LeanStore"

https://dl.gi.de/server/api/core/bitstreams/edd344ab-d765-44...


Thanks for the pointer :) it seems there's a follow-up to that paper in turn, with an explicit focus on that eviction strategy: "Write-Aware Timestamp Tracking"

https://www.vldb.org/pvldb/vol16/p3323-vohringer.pdf


Ah, thanks, that was actually the paper I think I was looking for.



It depends on the distribution of access probabilities. If the least recently used data has the same probability of being requested as the others, then random picking will do as well as LRU.

When the least recently used data has a lower probability of being requested, then LRU will outperform random picking.

There is no silver bullet algorithm.


Even just as a cheap way of approximating LRU, the k-random algorithm is quite nice.

Honestly I think LRU doesn't get enough recognition for being provably only a fixed factor away from what an optimal algorithm could do with less memory [1]. In fact, LRU does about as well as any online algorithm can without knowing up front which memory is about to be accessed (you can exploit statistical regularities, but that means other access patterns must perform worse). My takeaway is that more memory beats clever algorithms.
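
Concretely, if I remember the statement in [1] correctly: for any request sequence σ, LRU with a cache of size k satisfies

    misses_LRU(σ) <= k / (k - h + 1) * misses_OPT(σ) + O(1)

where OPT is the optimal offline algorithm running with a smaller cache of size h <= k. So with just twice the memory (k = 2h), LRU is already within about a factor of 2 of the clairvoyant optimum.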

The paper is interesting by the way, I think it's one of the first to use potentials to prove amortized optimality, and it actually generalizes a bit further showing that 'move-to-front' is similarly close to optimal if the cost function of accessing an item increases with some convex function.

[1]: https://dl.acm.org/doi/10.1145/2786.2793


> My take from it is that more memory beats clever algorithms.

Well, that's to be expected from a cache, isn't it?


Not really; I mean, more CPU only beats better algorithms up to a point. With caching you can guarantee 90% of the performance of the best possible algorithm by increasing the RAM tenfold. This is not the case with CPU: you can increase it however many times you want, and there's still no 'master' algorithm that guarantees performance within 90% of the optimal algorithm, no matter how much more CPU you throw at the problem (though it would be handy; we could finally solve the Collatz conjecture).

I mean I do vaguely recall that there are sometimes ways to guarantee close to optimal asymptotic complexity by running all possible programs in parallel and seeing which one finishes first, but it's nowhere near as practical as LRU.


> I mean I do vaguely recall that there are sometimes ways to guarantee close to optimal asymptotic complexity by running all possible programs in parallel [...]

Yes, that's a neat parlour trick, but it's utterly impractical. It blew me away the first time I learned about it.

(Though it would have been even more impressive to me a few decades ago, because in the meantime I learned enough of the building blocks to immediately work out the algorithm, once I heard someone mention that it's possible. Still: very, very neat.)

> With caching you can guarantee 90% of the performance of the best possible algorithm by increasing the RAM tenfold.

Sorry, I thought you were just comparing caching schemes. And I also thought our prototypical example was reading, e.g., from disk, so you have to pay that cost on a cache miss, and we were talking about burning CPU cycles to decide on a better cache eviction strategy (but even an optimal strategy will have to hit the disk when the cache is too small).

It seems like you want to talk about a situation where you use the cache to avoid recomputing some result with CPU cycles. Eg like investigating the Collatz conjecture. That's also a very interesting application, but not what I had in mind originally.



