Rough Summary: Start with a dumb-cache using "FIFO" (first in/first out) Keep tr...

gridspy · on June 23, 2023

Useful summary. Made me want to read about the ghost cache. The cache layout is:

Probationary 10% + Traditional 90% (of memory).

The idea is that most items are not requested again prior to dropping out of probationary and so don't take up space and process in the Traditional Cache.

Some infrequently used items however would remain inside the Traditional cache but never make it through the probationary cache. So a small "ghost" cache is used to specifically detect these items rejected from the Probationary cache.

Next time we go to fetch a ghost item they go directly into the Traditional cache. The ghost cache is just storing metadata (pointers essentially).

firstlink · on June 23, 2023

What's the eviction rule from the ghost cache, though? Does it share an eviction queue with the Traditional cache?

gridspy · on June 23, 2023

It is the same size as the traditional cache, but it is also a FIFO. Naturally it doesn't need to worry about tracking access, since the moment the ghost cache is "hit" the relevant memory location is loaded into the traditional cache and the ghost cache entry becomes irrelevant. So I guess that means the ghost cache could just be a queue.

ramses0 · on June 23, 2023

So a “bypass probation” ID-only queue.

valleyer · on June 23, 2023

This sounds pretty similar to the pageout scan in Mach (i.e., NeXTSTEP, Darwin, macOS, iOS, and derivatives). It's also used by 4.4BSD and some of the modern BSDs.

There is an "active" queue and an "inactive" queue. The pageout daemon tries to maintain them in a 1:2 ratio. Active drains to inactive; inactive drains to be paged out. Both are simple FIFO, but the daemon clears the page table referenced bit of pages as they are enqueued onto inactive.

When dequeueing a page from inactive, the referenced bit is checked. If it's been set (i.e., the page was referenced since being moved to inactive), the page is diverted back up to active. Otherwise, it's paged out.

1a1a11a · on June 23, 2023

This is interesting! Do you have a good reference?

valleyer · on June 24, 2023

Not a great one offhand, but here's a decent one from Apple, from back in the days when they put energy into documenting internal stuff like this. (As a result, some of the details in it are slightly out-of-date, but the general shape is still correct.)

https://developer.apple.com/library/archive/documentation/Pe...

1a1a11a · on June 24, 2023

This is very helpful (for someone who has never worked on Darwin)! The algorithm uses two FIFO lists with a "visited" bit (lazy promotion), and is a great idea! There are two differences between what is described and our algorithm, 1. the two FIFO lists are not fix-sized, which cannot guarantee quick demotion; 2. new data (hard fault) are inserted into the active list (I assume it is large than the inactive list), which is different from the quick demotion.

weebull · on June 23, 2023

What I'm not getting is...

1. You've got a cache miss 2. You need to evict something, so look at the next item due for eviction. 3. It's a candidate for promotion, so you reinsert it. 4. You still need to evict something, so goto 2.

Isn't that a potential infinite loop? And even if not (because there's some other exit clause), how is it quick?

1a1a11a · on June 23, 2023

There are two things. One is reinsertion or lazy promotion as you described, which is used to keep popular objects (reinserted objects have the bit reset to 0, so it won't be infinite loop). And it performs fewer work than LRU because many hits on a popular object result in one "lazy promotion".

The second is quick demotion which uses a small FIFO to filter out most objects that are not requested soon after insertion in the cache.

eis · on June 23, 2023

It looks like your lazy promotion algorithm is O(n). If you get a particularily unlucky request pattern you'll be walking the whole list to know which item to evict. I don't think the claim that this LP FIFO method is always less intensive than LRU will hold up to scrutiny. But I still find the paper inspiring. Thanks for shaking a bit the established believes in the field :)

1a1a11a · on June 23, 2023

This is common confusion. It is O(n) where n is the number of objects, but LRU is O(m) where m is the number of requests, which is often more than the number of objects :)

eis · on June 23, 2023

I think we are not talking about the same thing. LRU needs to access/modify the tail and head of the queue once per request for both Get and Insert.

Your algorithm on the other hand does not need to modify the order of the queue on a Get (just mark single item as hit if not marked already) which is really good. But on Insert your algorithm needs to do a reverse scan through the queue until it finds an item it can evict (not marked as hit). This might need to walk the whole queue if unlucky. And it can get so bad that it needs to do that on every single eviction so latency and CPU usage could skyrocket. Granted, this is a worst case scenario that shouldn't happen usually but if it does then it could be a real problem.

So in that sense LRU would be O(m) in terms of queue modifications and reads but yours would be O(n*m) for queue reads.

The O(n) I mentioned was for the number of items the algorithm needs to check on a single insert that results in an eviction aka the usual case.

1a1a11a · on June 24, 2023

Yes, in the worst case, it would need O(n) to evict, and it can hurt tail latency, but it only happens once in a while in the worst case because each time when an object is checked, the bit is set to 1.

I don't understand the O(n*m) argument, does the answer invalidates it?

On the tail latency part, if it becomes a problem, we can bound the number of objects it checks.

eis · on June 24, 2023

It might not happen only once in a while. Consider the following pattern in which a Fetch() does a Get and then an Insert on cach-miss:

   Fetch("1"); Fetch("1"); Fetch("2"); Fetch("2"); Fetch("3"); Fetch("3"); ...

This will result in your queue being 100% filled with items that are marked as hit because of the second Fetch for each key. After the queue is full each first Fetch will result in an eviction (new key to insert) which has to walk the whole queue because the only item marked not-hit is at the head of the queue due to being lazy promoted.

My O(n*m) reply was because your O(m) was considering a whole run over multiple requests and my O(n) was only considering one request. So if I were to also talk about multiple requests then your eviction algorithm needs to look at O(n*m) items for evictions in the example pattern I presented. O(n) per request for m requests.

A malicious actor could abuse this fact to cause a DoS.

No you can't just bound the number of objects it checks because that would result in a flurry of cache misses as it would mean you have to fail the insert.

1a1a11a · on June 24, 2023

Consider the the request sequence you gave a, a, b, b, c, c, d, d ... and a cache size of 2, when the first c arrives, the cache is in the following state (left is head and right is the tail, inserted at the head): b(1) a(1)

and it will trigger an eviction, and requires resetting 2 (or O(m)) objects with the following state changes b(1) a(1) -> b(1) a(0) -> b(0) a(0) -> c(0) b(0)

the next c will change the bit to 1, and we will have c(1) b(0),

When d arrives, b will be evicted, and we have d(0) c(1)

Note that we only check/reset one object this time instead of O(m).

Therefore, in the worst case between each O(m) eviction, you need O(m) requests to set the m objects to "visited", strictly no more than a LRU cache.

eis · on June 25, 2023

Ok I see where the difference is now. You are clearing the hit bit for all items marked as hit on a single eviction. I mean yes I guess that means that during the following request you don't have to check all items in the queue but now you have a different problem: you evict items prematurely. With a cache of size 2 I can't show it but consider a cache of size 3 and the following pattern: a, a, b, b, c, b, d, e

When d arrives state changes c(0) b(1) a(1) -> d(0) c(0) b(0). When e arrives state changes d(1) c(0) b(0) -> e(0) d(0) c(0).

Note how b got evicted from the cache even though it was requested more recently and more frequently than c because d cleared out the hit bit for both b and c.

Now you traded off hitrate vs latency. But the worst case latency is still O(n) just can't happen multiple times in a row anymore.

Imagine you have a cache of 1 million items all marked as hit and you get a new request which would evict all 1 million items. You gained latency for some subsequent requests but increased it in the current one. Crucially you also just cleared out the hit marker of whole cache.

1a1a11a · on June 25, 2023

I agree with you.

Any eviction algorithm would have an adversarial workload, so a simpler algorithm is better because you know what the workload was and whether it happens in production; on the other end, the complicated one only makes this worse and the failure of a cache is often catastrophic. :)

eis · on June 25, 2023

I can't agree with that. More sophisticated algorithms don't necessarily have to make things worse. In fact I think the opposite is true. Simple algorithms are the ones most likely to fail catastrophically because they are made for a certain range of access patterns.

You don't always know the workload beforehand and can pick the appropriate cache policy. Workloads can even change during runtime.

I haven't seen a simple cache algorithm that performs well in every workload and I believe any that wants to succeed needs to incorporate different sub-algorithms for different workloads and be adaptive.

1a1a11a · on June 26, 2023

Have you experienced workload changes that caused an algorithm to be less effective? I am happy to learn about it. From my experience and study of traces collected in the past, I haven't seen such cases. Of course, there is diurnal changes, and sometimes short abrupt changes (e.g., failover, updating deployment), but I have not seen a case that a workload change that needs a different algorithm, e.g., switch from LRU-friendly to LFU-friendly.

Having an ideal adaptive algorithm is certainly better, and I do observe different workloads favor different algorithms, but all adaptive algorithms have parameters that are not optimal for a portion of workloads and also hard to tune (including ARC).

If you give me an algorithm, I can always find a production workload for you that the algorithm does not work well.

"Simple algorithms are the ones most likely to fail catastrophically because they are made for a certain range of access patterns" I do not see how FIFO and LRU fail catastrophically except on scanning/streaming workloads. Happy to learn about it.

eis · on June 26, 2023

The classic example of a changing workload is a database which has a normal workload that maybe resembles a zipf or bell distribution but then someone runs exports which do full scans over large parts of the DB. These changes in workloads can be intermittent but they exist. A very similar change in workload that is well known is the pagecache for filesystems (e.g. backup doing full scan over disk and evicting everything from the cache) which is why admission policies were invented in the first place. So you have two different algorithms already. Just maybe not adaptive.

A friend of mine works on a online-shop / distribution system that during most of the day has a zipf like distribution but after normal working hours when retail shops close he gets completely different patterns as the shops push their data for the day into the system and then all kinds of BI tools run for a few hours.

Of course adaptive caches are never completely optimal because it takes time for them to learn which workload is actually currently happening but after a while they can get very close. A good adaptive cache should not need tuning to get good results because the whole reason for it to be adaptive is to tune itself. Of course there are quite a few adaptive caches, especially the first ones that pioneered this, that still needed tuning.

I don't understand the last question. You say you can't see how FIFO can fail catastrophically and at the same time state a case how it can. Scanning workloads are something that does show up in the real world all the time. NovaX seems to have run your algorithm through some traces where it had a near zero hitrate. It can be more important to avoid these extreme cases than having a slightly better hitrate in the normal case.