Hacker News
Sieve is simpler than LRU (cachemon.github.io)
287 points by SerCe on Jan 3, 2024 | 183 comments



Here is a flaw in the algorithm. The Hand pointer is essentially a roving pointer around a circular buffer. Elements that it passes over can live for another cycle (at least; more if they get visited).

But there is a bias in the algorithm. Arbitrarily, the circle is broken into a linear queue, and items that are accessed while not present are inserted at the head of the queue (leftmost slot).

This does not compute: a circular roving pointer sweeping through a queue? What?

The problem is that items that are inserted at the head of the queue during a moment when the Hand is in a low position (close to the head, toward the left) are closer to being doomed than items added when the Hand is in the rightmost position. This is for no good reason other than just luck.

How it should work is that new items should be added at a fixed distance from the Hand pointer. In other words, an Insert point moves around the circular buffer in lock step with the Hand. When the Hand obliterates an evicted item, items between that item and the Insert point will shuffle over, and the new item goes to the Insert point. Next time the Hand moves, the Insert point will move with it in the same direction.

And now here is an elegant thing: we can make Insert and Hand be the same slot! When the Hand finds a victim we just replace the Hand slot with the new item, and mark it visited. Don't move any items around. On the next miss, the Hand will hop away from that item and mark it unvisited, and that's it. That item can then live until the next visit of the Hand, if it isn't touched again. This is just a FIFO with protection for the visited, essentially.
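
To make that concrete, here is a minimal Python sketch of this Insert-at-Hand special case (my own illustration of the proposal, not the SIEVE implementation; the ring is a plain list of [key, visited] slots):

  class InsertAtHandCache:
      def __init__(self, size):
          self.slots = [None] * size          # each slot: [key, visited] or None
          self.index = {}                     # key -> slot position
          self.hand = 0

      def access(self, key):
          pos = self.index.get(key)
          if pos is not None:                 # hit: just mark visited
              self.slots[pos][1] = True
              return True
          # miss: sweep the Hand over visited slots, flipping them to unvisited
          while self.slots[self.hand] is not None and self.slots[self.hand][1]:
              self.slots[self.hand][1] = False
              self.hand = (self.hand + 1) % len(self.slots)
          victim = self.slots[self.hand]
          if victim is not None:
              del self.index[victim[0]]       # evict the unvisited victim
          # the new item takes the victim's slot, marked visited, so the next
          # sweep merely ages it instead of evicting it immediately
          self.slots[self.hand] = [key, True]
          self.index[key] = self.hand
          return False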

The distance between Insert and Hand is a parameter; the algorithm can be measured at various values of that parameter.


> This is for no good reason other than just luck.

> When the Hand finds a victim we just replace the Hand slot with the new item, and mark it visited

That would cause newly inserted items to take at least N (size of the cache) rounds to be evicted. Evicting cache entries early is extremely important for performance. Their algorithm means new items will have a higher chance of being evicted, and that is by design, not random chance.

Here's the paper: https://yazhuozhang.com/assets/pdf/nsdi24-sieve.pdf


I clearly articulated in my comment that this is one choice we could make for the distance between the Insert point and the Hand pointer. That distance is a parameter whose space can be experimentally tested.

Without my proposal, there is no control over the parameter: the distance varies, averaging about half the cache size. When the Hand is on the far right, a new item inserted at that time will stay in the cache for a long time. When the Hand is far left, it may get evicted almost immediately.

The only thing important for performance is a good hit ratio, however it is achieved. An evicted item is one that cannot be hit. Early eviction is not ipso facto good for performance; only with regard to some access patterns where early eviction of the right items improves retention of other items that yield good hits.


Early eviction is fine for performance if the average eviction you get is good enough that any improvement from a more optimal choice of distance is outweighed by the effort of picking the insertion point.

You really need to measure, because this will depend on your workload, how fast your implementation is relative to the cost of a miss, what the ratio of "likely to never be requested again" to potentially long-lived objects is, etc.


The roving Insert point is harder to implement, due to the reversed case:

  [ ][ ][I][ ][ ][H][ ][ ]
         ^^^^^^^^
                 `these items move right

  [ ][ ][H][ ][I][ ][ ][ ]
   ^^^^        ^^^^^^^  ^- this item moves to the top (left most slot)
    |                 `--- these items move right
     ` these items move right
If we think about doing it in hardware, like for each set of a set-associative L1 or L2 cache, it's a hindrance.


I'd just have implemented it as a ring instead of a list, so insertion at any point is identical. But my point was exactly that unless your workload specifically fits inserting at the "hand" better than the current algorithm does, you will need to account for the cost of whatever complexity you add in order to move the insert point.

> If we think about doing it in hardware, like for each set of a set-associative L1 or L2 cache, it's a hindrance.

I don't see why I would think about doing it in hardware.


I totally agree with you, this is not designed for hardware... and as others mentioned, RRIP might be better for set-associative caches.


With respect to my other reply mentioning a ring instead of a linked list, here's a quick and dirty Ruby port (because I couldn't resist) that uses a ring with a single dummy head node to simplify the logic:

https://gist.github.com/vidarh/b72615c16673cb38630cab5148190...

You'll note it avoids a bunch of conditionals in add_to_head, remove_node and evict, because the wraparound happens automatically and the next/prev pointers are never nil/None.

I've not verified it matches the behaviour of yours exactly other than adding a handful of extra access calls and comparing the state.

If you're not familiar with Ruby, the one particularly dirty trick I'm using is that in Ruby you can define methods directly on a specific object, so by defining the method "@head.visited" to always return true, the eviction will just try to set it to false and move straight over it every run, cutting a conditional. You could do a ring without a dummy @head too, but it makes the insertion/deletion logic more annoying.


it looks like the ring is implemented using a linked list?


Yes, I wanted to change as little as possible to illustrate why you almost always want to close your lists into a ring when using a linked list ;)

It simplifies a lot of algorithms to be able to guarantee no null pointers. It's a very old trick. You trade a lot of conditionals and a risk of null pointer dereferences for a smaller risk of non-termination (failure to check for a null pointer is always bad; failure to check for the end is only a problem if the end of the list is actually your termination condition).
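
For anyone who hasn't seen the trick, here's a minimal sketch of the sentinel idea (in Python rather than Ruby; names are mine): with a dummy node closing the ring, insert and unlink never special-case the ends or check for nil.

  class Node:
      __slots__ = ("key", "visited", "prev", "next")
      def __init__(self, key=None):
          self.key = key
          self.visited = False
          self.prev = self.next = self     # a lone node is already a ring of one

  head = Node()                            # dummy sentinel that closes the ring

  def insert_after(anchor, node):
      # no nil checks: the ring is never empty thanks to the sentinel
      node.prev, node.next = anchor, anchor.next
      anchor.next.prev = node
      anchor.next = node

  def unlink(node):
      node.prev.next = node.next
      node.next.prev = node.prev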

You could do a ring with an array and modulus too, but then you end up copying the whole thing to open up slots to insert to maintain the property of your insertion point being separate from the hand.

That may or may not do worse; it's likely to depend on cache size and implementation language.

What you'd almost certainly want in a production implementation, in any case, is to do away with constantly allocating nodes: move them to a free list and reuse them, to avoid dynamic allocation overhead and garbage collection.


I added a second, singly-linked ring version to the gist I linked above. It doesn't save much, as to be able to unlink the hand you need to keep a pointer to the node adjacent to the hand (I changed the "direction" in the naming because it feels weird to have a singly linked list with just "prev" pointers, but that doesn't matter). It does mean one pointer less per Node, though, and from experience a lot of people struggle to get the pointer updates right for doubly linked lists; that is a lot harder to get wrong when singly linked...


Got it. Makes sense; I like the reason why you want to make it circular. :)


I was curious too so I took a look. I think this is the relevant bit:

Recent work [94] shows that “lazy promotion” and “quick demotion” are two important properties of efficient cache eviction algorithms.

[...]

Quick demotion removes most objects quickly after they are inserted. Many previous works have discussed this idea in the context of evicting pages from a scan [16,60,67,70,75,77]. Recent work also shows that not only storage workloads but web cache workloads also benefit from quick demotion [94] because object popularity follows a power-law distribution, and many objects are unpopular.

[...]

[94] Juncheng Yang, Ziyue Qiu, Yazhuo Zhang, Yao Yue, and K. V. Rashmi. FIFO can be better than LRU: the power of lazy promotion and quick demotion. In The 19th Workshop on Hot Topics in Operating Systems (HotOS 23), 2023.


I read that before.

Caching is predicated on locality. We put things into a cache based on the idea that something accessed once is likely to be used again, due to locality.

Lazy promotion is useful in a situation in which we don't trust that the access pattern exhibits locality, or at least not fully. Perhaps it exhibits locality for some data, but is widely scattered for others.

For instance, an algorithm is working with numerous local variables in a loop (good locality) but the loop is also stepping through a very large set of data in memory, much larger than the cache. Moreover, items in this set are visited only once (poor/nonexistent locality).

If we regard accesses with a bit of suspicion ("this might not have good locality") we can keep those latter items from overstaying in the cache. Items in the cache that are not going to be touched again effectively shrink the cache: if at any given time about half our cache holds such items, the cache is effectively halved.

With a deterministic, fixed distance between the Hand pointer and the Insert point, which we can tune, we could empirically adjust for this. For instance we could have it so that, consistently, every new item that is not visited within 4 subsequent cache misses will be evicted.

If we had a magic oracle, we would not use LRU or any of these other approaches. We would ask the oracle questions "Great Oracle, given these items in the cache, and this new one which is not in the cache, which one(s) will not be seen again, or else which one(s) will be seen again farthest into the future?" That's really what we want to know, but since oracles don't exist, we use heuristics, like the idea that the longer an item goes without being accessed, the more likely it is that it's a never-again-accessed item.


Great point! We tried the oracle (Belady MIN) and found that the oracle often does not admit new objects (<20% chance of admitting new objects) in most traces. That's why we need to evict new objects quickly.


> This is for no good reason other than just luck.

There is a good reason. And that reason is that many recently accessed items survived. It's only prudent to devote more of the cache to surviving items than to fresh ones in such circumstances.

This continues until the cache is completely filled with recent survivors, which moves the hand back to the tail, and the oldest previous survivors get re-evaluated for eviction. This, I imagine, happens periodically in practice, and often enough for the algorithm to perform well.


The problem is that the more the cache is filled, and the closer the Hand gets to the head of the queue, the closer are newly inserted items to being evicted.

There doesn't seem to be any good reason for this over putting the newly inserted items at a consistent distance away from the Hand, so they are a predictable, controllable number of steps away from the threat of eviction, and so that the Hand movement is just a cyclic treadmill with no spot of the circular buffer being arbitrarily special.

If randomization of the distance is beneficial, let that be explicit! Feed a stream of pseudo-random integers to the cache which is used to select the distance between the Hand and the Insert point. That could still be nicely controlled as a jitter within a minimum and maximum range.


> ... the closer are newly inserted items to being evicted.

That's a good thing because newly inserted items haven't yet proven that they are worth keeping.

They have less time to prove themselves, but if they do prove themselves they can have a long lifetime, because they will remain when the hand loops back to the tail.

The end of the cache is filled with items of proven utility. New ones are unproven. The shrinking distance between head and hand reflects the accumulation of proven items, which leaves less space for unproven ones. Which is good. Eventually (when the space for fresh ones ends) you want to reconsider the proven ones again, and this is achieved by moving the hand back from the head to the tail.

It might be unfair to new items coming in at the times when there are a lot of proven ones accumulated, but it is a good strategy from the point of view of the whole process. Good solutions are not necessarily fair.

> If randomization of the distance is beneficial, let that be explicit!

But it's better than random. It's adaptive to how many of the items already stored were visited recently.


> That's a good thing because newly inserted items haven't yet proven that they are worth keeping.

It isn't good that the probation is random, depending on how far left the Hand is at the time the item is inserted. Or else if it is good for that distance to be randomized, it would be good to make that random choice explicit.

> It's adaptive to how many of the items already stored were visited recently.

It's not adaptive whatsoever. The Hand moves left when there are cache misses, in order to find items to evict, and then wraps back to the rightmost slot. The current position of the Hand bears no relation to a newly observed cache miss, yet affects its retention.


You are probably thinking of block I/O or CPU cache workloads where scans cause the miss ratio to increase. This algorithm targets web workloads where scans are not common, and the miss ratio does not change abruptly.

Yes, SIEVE does not adapt to miss ratio, but it adapts to the inherent popularity in the workloads. You can see the evidence in the results.


> The Hand moves left when there are cache misses

It moves only if there were cache hits. If there are only cache misses, it stays at the tail, so the whole cache is dedicated to new unproven entries. Once there are some cache hits, those items are retained and the area shrinks. It's not random. It reacts to whether it managed to catch something valuable or not.

> The current position of the Hand bears no relation to a newly observed cache miss, yet affects its retention.

Yes, but it is related to recent cache hits. It sacrifices chances of new items getting retained in exchange for retaining items that were recently proven to be worth keeping.


When there is a cache hit, the accessed item is flipped to visited state (if necessary). No eviction takes place and so the Hand stays where it is.

The Hand moves when it has to scan over visited items in search of an unvisited item to evict, when there is a cache miss. The scanned items are flipped to unvisited.
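
In code, that behaviour looks roughly like this (a minimal Python sketch of SIEVE as described in this thread; the structure and names are mine, not the authors' implementation):

  class Sieve:
      def __init__(self, size):
          self.size = size
          self.items = []        # index 0 = head (newest), last index = tail (oldest)
          self.visited = {}      # key -> visited bit
          self.hand = None       # key the Hand points at; None means "start at the tail"

      def access(self, key):
          if key in self.visited:            # hit: mark visited, the Hand does not move
              self.visited[key] = True
              return True
          if len(self.items) >= self.size:   # miss on a full cache: evict first
              self._evict()
          self.items.insert(0, key)          # new items always enter at the head
          self.visited[key] = False
          return False

      def _evict(self):
          i = self.items.index(self.hand) if self.hand in self.visited else len(self.items) - 1
          while self.visited[self.items[i]]: # sweep toward the head, ageing visited items
              self.visited[self.items[i]] = False
              i = i - 1 if i > 0 else len(self.items) - 1
          victim = self.items.pop(i)         # first unvisited item found is evicted
          del self.visited[victim]
          # the Hand moves on toward the head, wrapping to the tail past the head
          self.hand = self.items[i - 1] if i > 0 else None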


Exactly. And if there are no visited items, evictions happen at the hand, the hand is at the tail, and it doesn't move. The full cache is dedicated to finding items that will repeat. They are getting the best chance they can.

Once some visited items arrive they are retained, and the fraction of the cache dedicated to new ones shrinks, so they have less time to get visited again and avoid eviction. But this happens for the benefit of retained items. If they were already seen a second time, there's a chance they'll occur a third time and more. They are worth more than fresh ones. It's unfair to new items but beneficial for the cache as a whole.

When many different items are revisited, the hand moves and eventually reaches the head. Then the cache is full. So it's moved back to the tail to see if we can evict some of the old stuff, because maybe they weren't re-visited since the last time we considered them for eviction.

Can you see it's not random and that it's actually useful?


The algorithm is arbitrarily unfair towards new items based on its state, which has no relation to those items.

It's like being rude to customers whenever the day of year is divisible by five. Sure it's not mathematically even pseudo-random, but it's unrelated to the customers. It's effectively random to them.


It's rude to new customers when restaurant is already full of paying customers.

It's unrelated to the new customers and in that sense unfair to them. But it's related to how successful the restaurant is at the moment.

Fairness is not the goal. Success of the restaurant is.

Objects near the tail are the people who occupy the table but keep ordering new dishes. Objects near the head are people who ordered just one thing and now they are politely asked to leave so new, potentially more generous customers can be accommodated.

If the restaurant is fuller, new customers have less time to justify their continued occupation of the table by ordering again, but it's fine because it allows more of the older customers with repeated orders to be served again and again.

There is a potential problem with the algorithm if there's very little space left for new items, so they are never retained and the hand never moves to the head. Probably you could refine the algorithm by moving the hand to the tail when it gets closer to the head than some fixed value, finetuned to the production load. Or just occasionally move the hand to the tail every n inputs on average (also finetuned).


You're still on a mistaken premise whereby, because items are inserted at the leftmost slot of the array, that slot is the head. And so the rightmost slot is the tail.

This is not so. The tail of the buffer is the Hand pointer, and it behaves like a circular buffer tail.

This is at odds with the fixed head.

Your full versus empty analysis is flawed, because once the cache warms up, which takes only one cycle, it stays permanently full. At each subsequent cache miss, one item is evicted and one item is inserted. One in, one out. There is no tide-like ebb and flow between empty and full. That is just rhetoric which is not anchored to anything that is real in the algorithm.

Items that are booted out are identified at the circular buffer tail. But, oddly, the new ones are inserted at the physical top of the circular buffer's linear implementation which bears no relation to the circularity.

In a circular buffer, the linear indices have no special meaning.

So it is like a code smell, but for algorithms.


You can look at this cache as circular buffer. But in this interpretation there is not one pointer into that buffer, but two. Insertion pointer and eviction pointer. Their location in memory is irrelevant. What matters is that eviction pointer moves while insertion point doesn't.

Those two pointers split the circular buffer into two parts. The part after the insertion pointer but before the eviction pointer you could call staging. It's a space where items that have not yet been considered for eviction are held. When an item gets evicted, items in this staging space are moved one spot towards the eviction pointer, so that a free space appears under the insertion pointer, which is filled by the newly inserted item.

As items are considered for eviction and rejected (because they had visited status), the eviction pointer moves closer and closer to the insertion pointer, shrinking the staging area. So the new items have less and less time to acquire visited status before the eviction pointer reaches them. This is what's unfair: the chances of an item getting visited and spared from eviction are lower for items that arrive when staging is small.

This is somewhat alleviated by the fact that after the eviction pointer in its motion reaches the stationary insertion pointer, it passes it, so the staging area grows from almost nothing to the full size of the circular buffer. The items which had the bad luck to arrive when staging is small, if they are repeated, have a later chance to arrive again when staging is large, and then have a lot of time to earn visited status and avoid eviction the next time they are considered.

What you are advocating for is that the insertion pointer should move too, either to keep a fixed distance from the eviction pointer or to keep an explicitly randomized distance. Those are of course valid choices when you design your cache algorithm. But the author(s) of this one went with another method. They reduce the distance between the insertion and eviction pointers by one only after an item is considered for eviction but rejected (because it got visited status since the time it was inserted or considered for eviction the previous time). Once the distance between the insertion and eviction points reaches a small enough value (for example zero), they increase the distance to the maximum possible (the size of the circular buffer).

You don't see a reason for this choice and I'm telling you that the reason is that the second part of the buffer (the part starting at the eviction point and ending at the insertion point) is devoted exclusively to items that since their insertion acquired visited status at least once, thus proving they are worth keeping in cache.

So it might be a good strategy to dedicate a larger part of the buffer to those items in times when they are abundant in the input, rather than a fixed part or a random part as you propose.

To address your concern that new items are inserted at the physical top of the circular buffer implementation... It doesn't matter. The insertion pointer could be placed at any spot in the buffer. You can randomize its initial (and ultimate, since it's not moving) placement and it would change nothing in the performance of the algorithm (after the buffer is initially filled). This cache could be implemented as a circular linked list with two pointers into it.

Do you understand all that? Because I'm not sure if I can explain it any more clearly.


> You can look at this cache as circular buffer.

After that, you cannot "unsee" it any more.

> items that since their insertion acquired visited status at least once

> the second part of the buffer (the part starting at the eviction point and ending at the insertion point) is devoted exclusively to items that since their insertion acquired visited status at least once, thus proving they are worth keeping in cache.

Items can acquire visited status anywhere in the cache, to the left or right of the Hand pointer. There is no special area.

The "reaper" which ages visited items to unvisited, and which collects aged items, moves in a circle.

If an item is inserted while the Hand is in the leftmost position, it will be evicted immediately when another not-present item is accessed. If an item is inserted while the Hand is farthest away at the rightmost slot, and not visited again, it can stay in the cache for a number of miss cycles, until the Hand goes to the top again.

That variability, as such, may be useful and good.

One problem is that it's not actually random, and it's not even a good form of pseudo-random.

I posit that input sequences exist in which two items of equal importance, A and B, are consistently treated unequally, with B enjoying much longer retention in the cache than A.

Moreover, the algorithm doesn't allow for experimentation with randomly varying the trim distance between two arbitrary values. Like say random between 4 and 8 slots in a 64 slot cache. That's a disadvantage, because for some given patterns, it prevents the optimal parameters from being discoverable.

Algorithms like LRU don't have the above weird property. While their hit rate may be good or bad depending on the locality of patterns, you cannot construct a sequence of access patterns under LRU in which two items A and B appear equally frequently, yet enjoy significantly different retention times.


> Items can acquire visited status anywhere in the cache, to the left or right of the Hand pointer. There is no special area.

They can lose visited status only if they are between the head and the hand pointer. That's the special area. There's no way for an item to lose its visited status if it is between the hand and the tail, until the hand reaches the head. In this sense the area between the hand and the tail is privileged.

> If an item is inserted while the Hand is in the leftmost position, it will be evicted immediately when another not-present item is accessed.

You see the cost of small head-hand area but you don't see the benefit of large hand-tail area.

> Moreover, the algorithm doesn't allow for experimentation with randomly varying the trim distance between two arbitrary values.

In its basic form it doesn't. However, I suggested that you might tweak how far from the head the hand loops back to the tail. It doesn't have to happen at exactly zero distance if you don't want to ever disadvantage new entries too much.

> .. construct a sequence of access patterns under LRU in which two items A and B appear equally frequently, yet enjoy significantly different retention times.

I see no cost to that. A cache doesn't need to be fair. It only needs to be successful. If it can serve item A from cache 100% of the time and item B 0% of the time despite the same patterns, it's still better than if it serves each of them 25% of the time from cache.


> it's still better than

* is not worse than


Yeah... It's hard not to read "luck" as, "there's something I don't understand". Luck doesn't really happen with caching.


It's not hard to see that in this algorithm when a new cache miss is handled to insert an item, the current position of the Hand bears no relation to that new item.

Sure, we can contrive access sequences where that is not random, because the whole thing is actually deterministic. That's not what I mean by random; I mean not expected to be related under realistic conditions where the inputs aren't contrived.

A good cache algorithm should be immune against contrived input sequences. With this one, we could devise a degenerate access sequence such that a particular item is always placed just ahead of the Hand, such that it is victimized in the next access cycle. Yet another particular item is given the maximum leeway. And that could be in spite of both items appearing equally often and deserving the same treatment.


Unfortunately most researchers do not consider attack vectors and often consider inputs that do not meet their preferred examples to be invalid. If a hostile user is able to pollute the cache so that it significantly degrades then that exploit becomes an easy denial of service attack. It is somewhat out of scope, but means that more general robust algorithms are needed because most users won't fine tune a system for a bespoke algorithm and instead take an off-the-shelf solution. I am therefore not a big fan of these minor tweaks to historical algorithms by making a one or two line change and therefore only slightly modifying what it is good or bad at. I try for more holistic solutions, even if a small percent worse in some cases, and infuse randomness to avoid deterministic exploits. To me that's more worthwhile and a big factor in whether it is a research toy or production-grade engineering.


I agree with you; any non-random algorithm has the problem


Well, any pseudo-random algorithm. If we avoid pseudo-random behaviors, we can provide certain guarantees that no pattern of access can get around, such as that if an item is freshly inserted into the cache, then it will stay there for (at least) the next N cache misses, for some specific N; that those N misses will evict something else.

If we consider LRU, no contrived pattern of access on LRU will cause it to evict some item Y if there is an item Z present which was less recently touched than Y.


Those guarantees are also what makes the algorithm exploitable and inflexible in handling unknown or adversarial workloads. Your LRU example suffers in scan / loop / zigzag patterns, which the author's policies do poorly at as well. They target CDN workloads, where that can be ignored due to their scale, but that is not a generalizable assumption for other workloads where it is more common, like databases, search, and analytics.

If the algorithm is known and accepts user facing requests then it becomes a target. A client can use the network speed of a dialup modem to bring down a modern server simply by crafting a request pattern to exploit hash flooding. This is why you generally want some injected noise, adaptivity, and use an algorithm that offers robustly good hit rates. A specialized policy that is superior in a narrow set of workload is still very useful when targeted at an expensive problem, but that's a design assumption that few can reasonably make.


Do you have an example of a non-random algorithm that does not have adversarial workloads?


> How it should work is that new items should be added at a fixed distance from the Hand pointer.

This would help with a circular-buffer implementation. But for now, let's decouple the algorithm from the implementation.

> we can make Insert and Hand be the same slot

This is just the CLOCK algorithm. It is easy to implement with a circular buffer. But the change we made deliberately improves the miss ratio significantly.


Wouldn't a hand that follows the head at an offset of ca 10% of the ring buffer size be mimicking the S3-FIFO quick demotion threshold and likely reap the same kind of benefits?


Yes if you add a ghost queue. :)


In reference to a key aspect of ARC's design.


> The problem is that items that are inserted at the head of the queue during a moment when the Hand is in a low position (close to the head, toward the left) are closer to being doomed than items added when the Hand is in the rightmost position.

That could possibly reveal interesting security-related properties, i.e. maybe future vulnerabilities.


Whether that is "how it should work" and whether your proposed fix is better or worse for a given workflow depends on a lot of factors.

You're right that putting the insert in a fixed location will mean the time a new element has before it must either be visited or get evicted will vary quite a bit, but it's not obvious that it will always be better to "fix" that.

E.g. consider the typical "generation gap" where the set of objects is bimodal: objects are either likely to be frequently used or to be garbage right away. Depending on the odds that a new object falls in either category, it might be better for the insert to be closer to the hand than further away (or it might be the opposite). That is, you might want the chance for a new entry to survive to be fairly low.

If the best bet is to have the new entry on average an inconvenient distance from the hand, it might well be that the variability is better than either moving the insert to the hand or a more convoluted scheme to keep them at a fixed distance.

I agree it'd be more predictable and/or easier to reason about if it's a fixed distance, and if the fill size of the cache is the right distance then certainly inserting it at the hand works great.

But really, I think you should measure for your specific workflow. It'd be interesting to see this option and some other ways of separating them tested, though.

Another thing I think would be interesting to test (it wasn't an accident that I hinted at generational garbage collection above) is whether it'd be worth keeping a second pointer that you move past objects that have survived "long enough" in the cache, and most of the time move the hand back to that pointer instead of "all the way", so you don't spend as much time iterating over objects that won't get evicted anyway. If the number of cache entries is small, probably not worth it, but for a large cache with a decently large proportion of persistently hot objects, maybe (but then you could just also outright move objects to a second cache, and that might well be simpler; again, something to measure on actual workloads).

I'm assuming this isn't anything new to people who follow the field more closely in any case.


"Your specific workflow" is a myth. A cache designed for "your specific workflow" will be subject to completely unforseen workflows when put into production.


While that's something to keep in mind, that's a reason for measuring the effects and choosing accordingly, not a reason not to.

Otherwise on that basis you shouldn't put a cache in at all as there will be workflows where the hit rate is so low the cache has a negative effect, however unlikely.

If you don't know e.g. the rough performance profile of your requests, and the typical distribution of them, you likely have far bigger things you should be worrying about.


I wonder about allowing the hand to 1-D random walk left or right when examining the visited bit for evicting an item.


> SIEVE isn't just playing the part of a cache eviction algorithm; it's stepping up as a cache design superstar. Think of it like giving a fresh spin to classics. We've plugged SIEVE into LeCaR, TwoQ, ARC, and S3-FIFO, swapping out their LRU or FIFO queue for a SIEVE one.

> Swapping LRU with SIEVE in these algorithms isn't just a minor tweak; it's more like giving them a turbo boost. Take ARC-SIEVE, for instance – it's turning heads with its slick efficiency, especially noticeable across different cache sizes. We didn't stop there. We pushed SIEVE a bit further by letting it peek into the future – well, sort of. We tested how well it could guess the next request. It turns out, with this extra bit of foresight, SIEVE is nailing it, outperforming the rest in almost all scenarios.

I find that style of language off-putting and immature.


Their early work was a little too new wave for my taste. But when Sports came out in '83, I think they really came into their own, commercially and artistically. The whole album has a clear, crisp sound, and a new sheen of consummate professionalism that really gives the songs a big boost. He's been compared to Elvis Costello, but I think Huey has a far more bitter, cynical sense of humor.


hahaaaaa

love references to this movie


Nicely done!


I’m sorry this was downvoted, because I thought spangry's comment deserved more praise than just my upvote.


Disclaimer: this is the author.

Thanks for pointing this out! I tried to write a blog post about SIEVE that was informative yet not too technical, and to infuse it with some humor. However, I clearly failed to find the balance. I'd like to change it back to a simpler, more straightforward style.

I have a long way to go to write a good tech blog. Criticism taken.


FWIW I don't think the quotes are bad.

The advice I got about posting publicly: don't read the comments section lol


I love your comments!


> I find that style of language off-putting and immature.

I normally do too but somehow it didn’t rub me the wrong way in this instance.

Perhaps the 1980s style day-glo star disarmed me? I didn’t really like that esthetic back when it was in fashion, but for some reason it was disarming here.

Perhaps it was the fact that the page was mainly a simple explanation with some clear drawings so a little stylistic exuberance didn’t overclutter things in my head.

All that being said, I’d normally be the same kind of curmudgeon as you.


I bet this is ChatGPT style


next time try 'in the style of the economist'


It's your right to feel that way as is the right of the author to use whatever style they please.


Those are just weasel words. When I see them in a tech topic, I consider it either not directed at people who would care, or just lacking substance... or both.


It's their right to be able to point out language choices that they don't like, just like you've got the right to point out about the rights the author has. Everyone has all the rights, woohoo!


Thanks for pointing that out, I was completely unaware.


Disclaimer: this is the co-author.

Sorry for the hyped language. If my guess is correct, the blog was "polished" by ChatGPT, and the intention was to "polish" the English. The paper has more content and should be read more seriously.

We will fix the animation error and update the writing shortly. Thank you for the feedback!

Regarding some of the complaints on SIEVE performance. Here are some more results that I ran for other purposes. https://observablehq.com/@1a1a11a/sieve-miss-ratio-plots

I haven't had time to add other algorithms to the plot, but they are often similar to or worse than TinyLFU (I did not deliberately pick TinyLFU to compare against; I happen to have the figures on hand). While the results do not include byte miss ratio, which is important in some use cases, they are better than the results already shown.

We acknowledge that

* SIEVE may not work well on some block I/O workloads (this has been the statement in our writing from the beginning). However, SIEVE works well on large multi-tenanted block I/O workloads (see the Tencent and Alibaba figure in the link). For web workloads, SIEVE shows consistently good results.

* As the author, when I look at this algorithm, my own feeling is that SIEVE may not be robust because it keeps evicting new objects when the hand moves to the head, effectively behaving as MRU. However, our large evaluations showed that modern workloads have certain properties that enable SIEVE to achieve a low miss ratio. We are still investigating this property and how to model it. If someone is interested, feel free to shoot me an email.

* implementing SIEVE with a log-structured design or circular buffer is not trivial, but we had a design called Segcache (https://segcache.com), which runs a constrained version of SIEVE. In simple libraries, a linked list implementation should be easy, and one can check the examples on the website.


To share a bit more information on how we got here: SIEVE is the more general form of the FIFO-Merge algorithm in Segcache. When I designed the FIFO-Merge algorithm, I thought it was worse than LRU, but it turned out to be better.

At first, I thought FIFO-Merge keeping objects in the insertion order (FIFO) helps the eviction because I mostly worked with Twitter data in which popularity decay is common.

But later when I evaluated more traces (including block I/O workloads), it turned out that quickly evicting new objects (quick demotion) is the source of a low miss ratio. That's why we wrote these papers.

Interested readers are welcome to read more of the related work: 1. FIFO can be better than LRU: the power of lazy promotion and quick demotion https://dl.acm.org/doi/pdf/10.1145/3593856.3595887

2. FIFO queues are all you need for cache eviction https://dl.acm.org/doi/10.1145/3600006.3613147


How does it compare to "random two choices" [1]?

[1] https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2...

It is simpler to implement than LRU - only a single "access timestamp" needs to be attached to the data. It also shows performance on par with or better than LRU when applied to database caching.


Sorry, I may be missing something. How would "random two choices" differ from LRU?


I provided a link; you can read more about the differences there.

LRU suffers from linear access patterns, where it cannot degrade gracefully. Try to cache N+1 items in an N-item LRU when items are accessed linearly, or at random with removal.

The "random two choices" algorithm will replace the less recently accessed of two randomly selected entries. There is a chance that some of the N+1 linearly accessed items will still be in the cache after a full scan.
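
For anyone curious, a rough sketch of the policy (assuming each entry carries a last-access timestamp; the names here are mine):

  import random, time

  def evict_two_random_choices(cache):
      """cache maps key -> [value, last_access_time]."""
      a, b = random.sample(list(cache), 2)             # pick two distinct keys at random
      victim = a if cache[a][1] <= cache[b][1] else b  # evict the less recently accessed
      del cache[victim]

  def access(cache, key, load, capacity):
      if key not in cache:
          if len(cache) >= capacity:
              evict_two_random_choices(cache)
          cache[key] = [load(key), 0.0]
      cache[key][1] = time.monotonic()                 # refresh the access timestamp
      return cache[key][0]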


Thank you! Yes, this addresses the scan-resistance issue, but I am not sure how different it would be on workloads without scans. The workloads we used in evaluations have very few scans, and the power of SIEVE comes from quickly evicting newer objects, so I guess "random two" would not improve things. But I do not have a number right now; I am happy to add this to the comparison list.


Please do add the "two random choices" to the list. It is a noticeable omission; let me explain why.

Here are some comparisons of 2-random-choices with LRU in context of CPU caches: https://danluu.com/2choices-eviction/

Relevant quote: "LRU does worse than 2-random when the miss rates are high, and better when the miss rates are low."

"The miss rates are high" is the case when evicting newer but unused in nea future objects is preferable, in my opinion. Thus, "two random choices" algorithm is a direct competitor to SIEVE, they work better than LRU in about same conditions.


I had a try on a few traces; the random two-choice algorithm is only better when there is a big scan, and the LRU miss ratio curve shows a cliff. In all other cases, it is worse than LRU eviction.

The implementation can be found at https://github.com/1a1a11a/libCacheSim/blob/develop/libCache...


This is the implementation of hashtable_rand_obj:

  cache_obj_t *chained_hashtable_rand_obj_v2(const hashtable_t *hashtable) {
    uint64_t pos = next_rand() & hashmask(hashtable->hashpower);
    while (hashtable->ptr_table[pos] == NULL)
      pos = next_rand() & hashmask(hashtable->hashpower);
    return hashtable->ptr_table[pos];
  }
It returns the first object in a random chain, not a random object. This introduces a bias into the selection of random objects. I have not had time to review your other code, but I strongly suspect the return values of hashtable_rand_obj are skewed toward more recently inserted objects.

As the code of RandomTwo depends on hashtable_rand_obj, it is also biased.

The same goes for the Random cache policy.


And Segcache is available in Rust as part of the Pelikan project https://github.com/pelikan-io/pelikan


We also published it as a crate https://crates.io/crates/segcache


My first impression was that this seems related to the S3-FIFO queue data structure that was posted here last year[0], and the personal site of that algorithm's main author actually mentions it[1]:

> S3-FIFO and SIEVE are being evaluated at several companies, if you are interested in joining the adoption army or if you look for help and want to provide help, find my email on this page. :)

Looking into the papers of both algorithms, they basically swapped primary and secondary authors[2][3], with the main author of Sieve being the one to write this blog post.

Just sharing this to give some context that could be useful for people who actually know about these topics to assess the claims being made. I don't deal with the problems these data structures aim to solve at my day job, so I'll refrain from commenting on that.

But I will say that from an implementation perspective I think the algorithms look elegant in their simplicity, and easy to grok, so at least that part of what they promise holds up.

EDIT: Marc Brooker also wrote a blog post about it[4], for those who want to read the perspective of a third party with the proper background.

[0] https://s3fifo.com/

[1] https://junchengyang.com/

[2] https://yazhuozhang.com/assets/pdf/nsdi24-sieve.pdf

[3] https://dl.acm.org/doi/10.1145/3600006.3613147

[4] https://brooker.co.za/blog/2023/12/15/sieve.html


I also noticed that S3-FIFO was the one approach that didn't seem to improve when SIEVE was added in.


Yup, SIEVE is good enough for web workloads :)


SIEVE isn't just playing the part of a cache eviction algorithm; it's stepping up as a cache design superstar. Think of it like giving a fresh spin to classics.

Is this how academics write about their own work in blog posts now?

The result seems really strong. It's sad the author feels this kind of hype is necessary.


Recent hype cache algorithms are characterized by cherry-picked hit rates in content caching such as CDNs. Common cache algorithms such as ARC, LIRS, and TinyLFU have made consistent and common comparisons across a variety of disk access patterns, but hype cache algorithms boast hit/miss rates for network accesses without these comparisons. Hype cache algorithms have only been validated in a very narrow domain, and their general superiority has not been validated at all.


Just to clarify, this means hype cache algorithms are basically not replaceable with common cache algorithms such as LRU or ARC.


I probably should have written the opposite. English is unfamiliar to me.


I haven't read the original article yet but your excerpt here reminds me strongly of what ChatGPT writes when you ask it to rewrite technical prose in an approachable, non-technical "for dummies" style.

As in, I asked ChatGPT to rewrite some text with metaphors and it kept giving me answers that included the phrase "Think of it like....", followed by whatever metaphor it came up with. It also often used words like "superstar" far more than I would.


I agree that the hyped-up language is off-putting.

It is however not at all typical for academic publications and blog posts.

In the few times I have seen it before, the author was always a PhD student and a non-native speaker (as it happens from China) and this case is no exception. They were also all smart people, so I am optimistic that with time their writing style will become more attractive.


That caught my attention too. It mixes levels of communication, jumping from highly technical to kindergarten, when you don't have a cognitive seat belt for that.

Also, why? The only value that comes out of that comment in the middle of the technical discussion is vanity and nothing else. It could have been more useful if it had created a metaphor that approximates understanding. That would be forgivable. But vanity isn't.


I believe there is also a paper [1] if you're looking for more details or a more academic description.

[1]: https://junchengyang.com/publication/nsdi24-SIEVE.pdf


Impressive with how few code changes they managed to convert an existing LRU library into SIEVE.

https://github.com/cacheMon/lru-rs/commit/1c45ef7fe85f4a4395...


I presume you have not experienced being an academic in recent years and having to deal with the expectation to sell your research on "Scientific Outreach Twitter". Now, technically neither have I, but I've worked for a couple of years alongside scientists who do.

Yes, it's kind of sad, but I feel more empathy for the situation the author is in that leads to it. The actual papers themselves seem pretty solid from a surface reading as a non-expert, at least in terms of having clear explanations and easily verifiable results.


I also feel empathy. My "sad" is more about the state of the community than the author's own attitude.

But still, even in the current climate, I think this is going too far. The whole paragraph after the part I quoted is just as bad.


> The actual papers themselves seem pretty solid from a surface reading as a non-expert, at least in terms of having clear explanations and easily verifiable results.

If you only read one paper in an unfamiliar field, you will not know if it lacks the necessary analysis.


This isn't exactly a paper on room temperature superconductors...

I read plenty of compsci papers and whitepapers, I just don't have to work with this specific niche of web caching. I doubt the methodology for benchmarking is so radically different from other areas of computing that my ability to detect fudging, or other structural issues, would be compromised.

And on top of that, in this case both the code and data sets are available and the papers went through peer review.


> This isn't exactly a paper on room temperature superconductors...

Same in every field.

> I read plenty of compsci papers and whitepapers, I just don't have to work with this specific niche of web caching.

You clearly have no expertise in this field. Over.


Ah, you didn't come here to have a discussion, but to "win" a "debate". Sure, whatever, you win


I didn't even notice the style. I first read the verbal description of the algorithm, got angry that it's incomplete and I lack context of what this is trying to achieve. Then I skipped to reading Python code. Then got back to verbal desc to confirm. Concluded it's not that bad but still incomplete. Then I looked at the performance graphs and concluded I might use it if I ever need something like that.

Why do you guys read prose at all when it's obviously just filler?


> SIEVE is nailing it, outperforming the rest in almost all scenarios.

I'm curious why there's a lack of transparency regarding the scenarios where SIEVE did not surpass other algorithms.

Another concern is the emphasis on average performance and P10/P90 metrics. When evaluating a caching algorithm, metrics like P99x are often more insightful.

The omission of these details coupled with the sensationalist tone of the article definitely triggers some red flags for me.


The con is mentioned in the "SIEVE is not scan-resistant" section

https://cachemon.github.io/SIEVE-website/blog/2023/12/17/sie...


I've always wondered about setting up a Cost-Benefit Cache. Elixir's built-in cache made me think of it first because it would simplify the implementation, but it could really be done in any language.

The idea being that you track a few stats for each cache item:

a. Time to retrieve the cached data from original source

b. Frequency of requests for the data from the cache

c. Size of the data to be stored

Size of the data to be stored is the "cost" since it takes up space in the cache that could be occupied by other data.

Retrieval time and frequency of requests would be the benefit since it should reflect the amount of time saved (a 5 second query called 4 times / minute would save 20 seconds per minute of execution time).

(a * b) / c = Priority.

Lowest priority gets evicted first as the cache size reaches its limit.

The idea behind doing this is that you could run just about any data that could potentially be cached through the system and it would only be stored if the priority was higher than the lowest priority item currently in the cache. This should also deal with spikes or sudden slowness automatically if a part of the application is suddenly hammered.
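
A rough sketch of what that bookkeeping could look like (the field names and the full scan over entries are my simplifications; a real implementation would keep a heap or similar):

  import time

  class Entry:
      def __init__(self, data, fetch_seconds):
          self.data = data
          self.fetch_seconds = fetch_seconds    # (a) time to retrieve from the original source
          self.hits = 1                         # used to derive (b) request frequency
          self.created = time.monotonic()
          self.size = len(data)                 # (c) cost: space occupied in the cache

      def priority(self):
          age = max(time.monotonic() - self.created, 1.0)       # clamp to avoid warm-up noise
          frequency = self.hits / age                           # requests per second
          return (self.fetch_seconds * frequency) / self.size   # (a * b) / c

  def get(cache, key, fetch, max_bytes):
      if key in cache:
          cache[key].hits += 1
          return cache[key].data
      start = time.monotonic()
      data = fetch(key)
      cache[key] = Entry(data, time.monotonic() - start)
      while sum(e.size for e in cache.values()) > max_bytes and len(cache) > 1:
          lowest = min(cache, key=lambda k: cache[k].priority())
          del cache[lowest]                     # lowest priority gets evicted first
      return data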


Why call it a "FIFO queue" (isn't it just a stack?) when you actually need to remove elements from the middle?

Also I think you could do away with storing "prev" for each node and make it like std::forward_list (with some additional work during eviction).

> We pushed SIEVE a bit further by letting it peek into the future – well, sort of. We tested how well it could guess the next request. It turns out, with this extra bit of foresight, SIEVE is nailing it, outperforming the rest in almost all scenarios.

No idea what they mean here.


A stack is actually LIFO?

Otherwise, yeah, if you remove elements from the middle, you either need a structure that allows it, or you amortize it by making such updates less frequent. Still not a queue.


If you need middle or random access then it’s an array, neither a queue nor a stack :)

I couldn’t understand the animation at all. LRU is much simpler to understand and implement; what’s so complicated about a doubly linked list?


The median could trivially be the root of a red/black tree.


I think they invoke FIFO because the hand pointer starts at the tail, so the first-in objects get the first chance to be evicted (be out).


I don't necessarily understand why caching wouldn't naturally move to the same strategies as garbage collection? Generational collectors, in particular.

I remember we had a CACHE for credentials in a big service at a previous gig, and we found that periodically we would get our cache wiped clean. Turned out, a lot of our clients were working with a third party system that would periodically log into each of the clients' accounts to check on details. To us, that was every credential getting hit in a near linear scan every hour or so. Which... was perfectly set to kill anyone that was using LRU caching.

Outside of randomness to hopefully retain enough of the items in the cache that are likely to get hit again soon, it seemed natural to have a set of generations for items that have only been seen once, and then another generation for things that have been more active. Which, sounds exactly like a lot of garbage collection schemes to my naive ear.


LRU is a fairly intuitive and simple strategy but a rather lousy performing one. If you look at the various caches in modern CPUs, such as data and branch target caches, they have been moving away from (P)LRU as the replacement strategy. The most modern one I'm aware of is DRRIP [1] in the Arm Cortex-X3 and derived V2 chips, as shown in this [2] post [3]. And those are implemented in hardware; in software you are even more flexible.

[1] https://en.wikipedia.org/wiki/Cache_replacement_policies#Dyn...

[2] https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2023...

[3] https://chipsandcheese.com/2023/09/11/hot-chips-2023-arms-ne...


> it seemed natural to have a set of generations for items that have only been seen once, and then another generation for things that have been more active

Have you looked at ARC? It sounds similar - it is a cache split between LRU and MFU areas, where the split point changes dynamically depending on the workload. https://www.youtube.com/watch?v=F8sZRBdmqc0 is a fun watch on the topic.


Caching is bounded, GC is potentially boundless. Caching is all about deciding what to evict.


Right, I know there are differences. The easiest consideration there is that caches are probably never walking object graphs from root pointers. They can choose to work on individual entries with no regard to the rest.

My point is not any different than how you design your kitchen/house. Things that you use all of the time, you have up front. The more often you use them, the closer they are to where they are used. Same for general storage.

Why would a cache not follow the same pattern? In each "bin" you can go with LRU or Sieve, but having multiple bins makes a ton of sense. And it looks a lot like the way that GC works.


Cache design has a lot of different constraints and you can exploit those constraints to build more efficient systems. And yes, people in the space are aware of GC research and leverage design concepts from there where appropriate.

Think of sieve as one component of a full caching systems. Many caching systems will combine primitives to have a “newly encountered” and “recently expired” generation where they’ll use a probabilistic filter or something similarly compact to determine whether they’ve seen something enough to spend resources caching it.

There are memory usage concerns (frequently you want a hard upper bound on memory consumption), speed/latency concerns (these systems are handling many thousands of requests per second and need to have near-zero latency), and the need to fit as much as possible into the memory constraints (e.g. using probabilistic filters for admittance).

SIEVE can be integrated into things like W-TinyLfu / s3fifo to replace certain individual components (I would read their papers - they’re fairly approachable and discuss the full design).


GC is intrinsically bounded by the amount of memory available. GC is also about deciding what to evict, albeit the heuristic used to determine that is based on current use rather than frequency of use.

The dotnet runtime GC is interesting for caching because it has multiple generations. Each time there is a GC, the survivors are moved to the next generation. Moving the most used items to the next generation and looking in the later generations first when resolving could be pretty quick.


A lab mate of mine is actually submitting a paper very soon investigating why clock sometimes outperforms LRU. The other benefit of clock is simplicity of implementation: you just have a pointer march around a ring buffer. I’m surprised that sieve is sometimes better; we might have to add a small section to the paper discussing it.


Do you work with Peter? If not, this will probably be the third paper...


Haha I do, are you with CMU's Yang Jun Cheng? Can't really think of anybody else working on this that would know these names.


LRU makes sense to me, so that is what I use.

The fact that some algos outperform it on some distributions (which might or might not be relevant to the use cases I might face) isn't good enough to convince me to switch.


The one problem with LRU that I always remember is one John Carmack once wrote about, with an LRU cache for items that were always accessed in the same order (objects to be rendered, in z-order?), which happened to be slightly bigger than the frequently-accessed set... until the day it wasn't. And performance suddenly dropped off a massive cliff.

The problem, it turned out, was that iterating through the same frequently accessed set of items, 1...N, in the same order every time, with an LRU cache of size N-1, would put items 1...N-1 in the cache, and then on the very last step evict item 1 (because it was the oldest) to put in item N. The next time round, when item 1 was not in the cache, it would evict item 2 (the oldest) to make way for item 1. And then evict item 3 to make way for item 2, evict 4 for 3, etc..., until finally evicting item 1 (again) to make way for item N. And do the same on the next iteration.
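
The cliff is easy to reproduce (a toy simulation, not Carmack's code):

  from collections import OrderedDict

  def lru_hits(cache_size, n_items, rounds):
      cache, hits = OrderedDict(), 0
      for _ in range(rounds):
          for item in range(n_items):            # same items, same order, every round
              if item in cache:
                  hits += 1
                  cache.move_to_end(item)        # refresh recency on a hit
              else:
                  if len(cache) >= cache_size:
                      cache.popitem(last=False)  # evict the least recently used
                  cache[item] = True
      return hits

  print(lru_hits(cache_size=9, n_items=9, rounds=10))   # 81 hits: everything fits
  print(lru_hits(cache_size=9, n_items=10, rounds=10))  # 0 hits: one item too many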

IIRC, he solved it by setting aside one of the cache slots as a "scratch" slot for the current iteration, so the last few items in the set would only overwrite each other, and items 1...C-1 would remain in the cache for the next iteration (where C is the size of the cache).

It is an unusual use case, to be sure, and not a reason to avoid LRU caches in general. Most users are unlikely to hit it. But it's useful to consider as an example of how edge cases can pop up in places you might not have been expecting. It also shows you don't necessarily need to completely switch algorithms if you hit a pathological case - just tweaking how the cache works slightly (LRU to LRU-plus-scratch-slot) can sometimes get you back to 95% of where you were, which might be good enough.
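For the curious, here's a tiny simulation of that pathology (hypothetical sizes, plain LRU built on an OrderedDict), showing the hit rate fall off the cliff the moment the working set exceeds the cache by one:

    from collections import OrderedDict

    def run(cache_size, n_items, passes):
        cache, hits, misses = OrderedDict(), 0, 0
        for _ in range(passes):
            for item in range(n_items):              # always the same access order
                if item in cache:
                    hits += 1
                    cache.move_to_end(item)          # classic LRU promotion
                else:
                    misses += 1
                    cache[item] = True
                    if len(cache) > cache_size:
                        cache.popitem(last=False)    # evict the oldest
        return hits, misses

    print(run(cache_size=8, n_items=8, passes=100))  # (792, 8): ~100% hits after warm-up
    print(run(cache_size=7, n_items=8, passes=100))  # (0, 800): every single access misses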


This is the "scan resistance" pathology mentioned in the article. And the scratch slot is a (very simple) form of generational caching; i.e. generational LRU or multi-generational LRU [0].

[0] https://lwn.net/Articles/856931/


I think one can use random eviction to avoid any kind of pattern changing the average performance.


LRU is expensive in the sense that it doesn't scale (thread/core-wise). What I do is probabilistic eviction, e.g. randomly sample 17 items and evict the three with the lowest scores (you need some algorithm for scoring). What I tend to use for scoring is access count, last access time, creation time, and perhaps creation cost (e.g. latency), plus a scoring formula.

In that regard - such eviction can be implemented "entirely" concurrently.
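Something along these lines, I take it (a rough single-threaded sketch; the class name, sample sizes, and scoring formula are made up for illustration, and a real implementation would be lock-free and tuned):

    import random, time

    class SampledCache:
        def __init__(self, capacity, sample=17, evict=3):
            self.capacity, self.sample, self.evict = capacity, sample, evict
            self.data = {}  # key -> [value, access_count, last_access_time]

        def get(self, key):
            entry = self.data.get(key)
            if entry is None:
                return None
            entry[1] += 1                    # bump access count
            entry[2] = time.monotonic()      # record last access time
            return entry[0]

        def put(self, key, value):
            if key not in self.data and len(self.data) >= self.capacity:
                # Randomly sample a handful of keys, evict the lowest-scoring few.
                candidates = random.sample(list(self.data), min(self.sample, len(self.data)))
                for k in sorted(candidates, key=self._score)[:self.evict]:
                    del self.data[k]
            self.data[key] = [value, 0, time.monotonic()]

        def _score(self, key):
            _, count, last = self.data[key]
            idle = time.monotonic() - last
            # Toy formula: frequently and recently accessed entries score higher.
            return count / (idle + 1e-9)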


I don't understand how it doesn't scale; on a hit you just move a node to the front, which is just a few pointers to update (prev next, next prev, front prev, front, current prev, current next).

If you have multiple threads, then presumably you should think how to serialize access to your shared resource anyway.


You need locking; locking does not scale.

>presumably you should think how to serialize access to your shared resource anyway.

I'd call that 'false'. The shared resource is the LRU Map in that case. The one with probabilistic eviction can be implemented (as in, has been implemented) without a global lock over the Map, which LRU requires. Probabilistic eviction can be implemented entirely lock-free, for that matter.

Locks offer no forward-progress guarantees, and don't scale. Changes to the lock state require coherency traffic between the cores. Reads scale very well; writes don't.

Overall, any data structure that requires a lock over the entire structure, especially on the read path, doesn't scale. Like I've said, it's possible to have a Map without locks on both the read and write paths.


Locking does not mean blocking. If you don't block the readers on the LRU reordering then it scales very well. That's a trick that's been in my production systems for 14 years now, and pretty much everywhere in Java land; it scales to over 900M reads/s on my laptop. Instead of probabilistic eviction you can use probabilistic reordering, e.g. via lossy sampling of the hits. The main benefit over sampled eviction or clock policies is that it is easier to use more robust and constant-time algorithms.


From the look of it, it seems like this really doesn't like frequently accessed items being close together, as they probably get evicted the first time there is a cache miss on the currently-pointed-to item (though they end up further down the array, which would mitigate the problem the second time around).

Which also means, this does not perform well at all with very high hit rates. A dumb LRU cache might perform better if you can fit most of your items in the cache.

In any case, it seems simple enough, but very unintuitive compared to an LRU algorithm, which makes obvious sense. It is really hard on paper, without looking at actual numbers, to tell how this would be better than LRU.


But the article has tests with results; in practical terms it has actual results of it working. So is your focus on how well it actually performs, or on how we analyse how it achieves its results?

As to fitting most of your items in the cache: just about any cache method will work well if you can fit most things in the cache (which will typically blow out your hardware bill - ever tried to make your L3 as big as main memory?).


>L3 as big as main memory

That's not possible for many reasons. Caches on the CPU die (or substrate, think AMD) tend to use the infamous 6 transistors per bit; the cells in RAM are a single transistor plus a capacitor, much denser. We can't have more L3 because the latency to access it would pretty much make it main memory. The caches are as big as they can be; increasing their size doesn't contribute a significant performance boost overall.

The hardware bill is just not the issue; it's physically impossible to build very large caches that have low latency.


There are CPUs that have external (including L3) caches; and yes, off-die they will have higher latency. And that is not to mention that a cache topology uses much more power, and not just in the 6T cell itself - there's all the cache handling logic as well. But it is possible to do; insane, but possible.


I understood it eventually but both the description and animation were confusing at first.

I think I would have been a lot less confused if there weren't so much hype and buildup about how simple it was gonna be.

Maybe it really is easier to implement than LRU. I'll give the benefit of the doubt on that.

But conceptually it's way more complex. I understand the LRU concept immediately from the name alone.

All the later hype was off-putting too, in a way that reduces trust. I simply do not believe the author is being candid due to the marketing speak. Maybe those parts of the post can just be ignored but it leaves a lingering uncertainty about even the technical points.


I feel like a simple improvement to LRU is to not promote on every fetch. This adds some frequency bias into the mix and reduces the need to synchronize between threads. Is this a known/used technique?


Yes, basically you sample requests to improve concurrency and rely on the hottest entries to be more frequently observed in the sample.

Memcached uses 60s intervals between promotions and a try-lock guard. A small variation is to do this probabilistically using a thread-local random.
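The probabilistic variant is basically this (Python standing in for Java's ThreadLocalRandom; the 1-in-16 rate is an arbitrary knob):

    import random
    from collections import OrderedDict

    def on_hit(lru: OrderedDict, key) -> None:
        # Promote on only ~1 in 16 hits. Hot keys are hit often enough that they
        # still end up near the MRU end, with a fraction of the synchronization.
        if random.getrandbits(4) == 0:
            lru.move_to_end(key)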

Caffeine (BP-Wrapper) decouples the event from the policy by recording into an array of ring buffers hashed to by the thread id. This serves to spread out contention, and the buffers are drained under a try-lock to update the eviction policy. When the buffers fill up too quickly it will drop events instead of blocking. This allows the cache to shed load by reducing the sample quality during a burst of activity.
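Very roughly, the buffered approach looks like this (a toy sketch of the idea, not Caffeine's actual code; apply_to_policy is a stand-in for the real policy update):

    import threading
    from collections import deque

    NUM_BUFFERS = 16
    buffers = [deque(maxlen=256) for _ in range(NUM_BUFFERS)]  # lossy: overflow drops events
    drain_lock = threading.Lock()

    def apply_to_policy(key):
        pass  # stand-in for updating the eviction policy's metadata

    def record_access(key):
        # Hash the thread id to a striped buffer to spread out contention.
        buffers[threading.get_ident() % NUM_BUFFERS].append(key)
        # Drain opportunistically; if another thread holds the lock, just skip.
        if drain_lock.acquire(blocking=False):
            try:
                for buf in buffers:
                    while buf:
                        apply_to_policy(buf.popleft())
            finally:
                drain_lock.release()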

Sieve is a very slight modification to the classic Clock algorithm (1969, Multics), which is a well known and commonly used pseudo-LRU. In Clock, the hand is a cursor both to scan from and to insert behind, whereas Sieve separates the two to maintain FIFO insertion order. Clock and its variations are probably the most commonly used cache policies, e.g. Linux's page cache and Postgres' buffer pool. Sieve decouples the hand from the insertion point so that new arrivals are evicted sooner, so the benefit is very workload dependent.

Sampled policies are another common policy type, e.g. where the timestamp is updated on access. Then on eviction a random subset is chosen and a utility function discards the least valuable. Since there are no promotions, this is nice for simple concurrent caches and mimics a classic LRU or LFU policy.

In short, a concurrent cache will not promote synchronously on every access event. That is only done as a strawman comparison for a simple baseline, though it's actually good enough for many simple use cases. It would be very misleading if researchers pretended that is how concurrent LRU-style caches are implemented.


Are you referring to an old version of memcached? Newer versions don't seem to need that 60s interval anymore. The architecture feels similar to BP-Wrapper. Ref: https://memcached.org/blog/modern-lru/


Oh, thank you! I didn't realize that the LRU Maintainer Thread was more than an expiration reaper. When it was first being introduced that was its primary responsibility, since lazily removing expired entries only via size eviction meant dead entries wasted capacity. It was all work in progress when I read about it [1] and talked to dormando, so it got fuzzy. The compat code [2, 3] might have also thrown me off if I only looked at the setting and not the usage. It's a neat variant on all of these ideas.

[1] https://github.com/memcached/memcached/pull/97

[2] https://github.com/memcached/memcached/blob/9723c0ea8ec1237b...

[3] https://github.com/memcached/memcached/blob/9723c0ea8ec1237b...


There seems to be a mistake in the animation.

> On a cache miss, SIEVE checks the object pointed to by the hand. If the object has been visited, its visited bit is reset

When the animation gets to the point where the I element is checked against the cache, the hand moves off of D without resetting its visited bit.


Yes, the illustration is incorrect, and does not match the code's evict function, which only moves the hand after marking the current object "unvisited".

From the initial state, with the hand pointing to A(visited), when the request for H comes in, A is marked unvisited before moving the hand.

When the hand is pointing to D(visited) and the request for I comes in, it should mark D unvisited before moving the hand. This "feels wrong" because D had just barely been marked visited, but it is an implication of the SIEVE algorithm.

It's generally a red flag when a "better, simpler" algorithm is explained with a faulty illustration, suggesting the author overlooked or misunderstood the implications of the algorithm.


Thanks for pointing that out. We have updated the animation.


I may be missing some subtlety but it seems like this is computationally O(N) on insertion, since the eviction will have to visit and mark every node in the pathological case where all existing entries are already visited?

Compared to typical LRU O(logN) or O(1).


I think you're correct about the worst case being O(n), but its behavior heavily depends on the access pattern of the cached items, which makes it hard to reason about.


O(N), but amortized O(1).


How is it amortized O(1)? Are you able to prove it, or at least construct an outline of an argument as to why?


Because every object stepped over is downgraded from visited to not visited, and at most 1 object is set to visited per lookup. The total number of bits the hand ever clears is therefore bounded by the number of lookups, so each eviction does only O(1) work on top of that, even though a single eviction can touch O(N) entries.


There are two important differences between academic and real-world cache replacement designs that should be kept in mind:

First, most competent real-world designs are considerably more complex than the academic model whence they are derived. Support for or resistance to many common cache patterns can often be grafted onto simpler academic designs, making them considerably more robust and flexible than academic analysis may suggest. Consequently, benchmarking performance of an academic design with other academic designs is not all that indicative of how it would perform against a real-world implementation. In practice, thoroughly engineered real-world designs all have similar cache hit rate characteristics regardless of the algorithm starting point because it is relatively straightforward to take ideas from other algorithms and make something that works similarly for your algorithm, albeit a bit less elegantly.

Second, one of the most important properties of cache replacement algorithms that is often overlooked in academic literature is real-world computational cost and operation latency. It is not unusual for a disk cache to be accessed 10s of millions of times per second. This cost can vary by orders of magnitude across algorithm implementations that have similar cache hit rates. Consequently, you often go with a design that has excellent memory locality, efficient concurrency, minimizes CPU cache thrashing etc because that is what keeps cache replacement out of the profiler. Given the first point above, there is a bit of a convergent evolution in real-world cache replacement designs because this has a larger engineering impact than the specific algorithm you use. Heavy optimization of the CPU throughput characteristics severely restricts the practical design space, which tends to push you toward the same couple algorithm starting points. It is not intuitive to most software engineers that this is often the property you most optimize for, not things like cache hit rate, when they first get into the space.

It is a fun design problem with a very high dimensionality tradeoff space.


Good points.

The first point is probably true for most research in computer systems. Different systems have different constraints that make a simple design very complex or not possible.

Can you elaborate on the second point more? It feels to me that S3-FIFO is more optimized for latency, throughput, and scalability than existing algorithms (e.g., ones with multiple LRU queues).


On the second point, there are a few different aspects that designs typically navigate. Effective thread concurrency used to be one, but the strong trend toward thread-per-core architecture has mooted that to a significant extent. More common is how many cold cache lines a cache replacement hits and how amenable the algorithm is to the CPU’s preference for sequential access.

You are probably familiar with this problem but it is trickier than many average software devs allow. Basic CLOCK-derived algorithms are an obvious choice for CPU efficiency, and easy to patch up for most weaknesses, but when cache metadata becomes large, which is common on modern servers, they have a tendency to thrash the CPU cache for some access patterns. This has some pretty negative performance effects. Not having to worry about multithreading (because thread-per-core) enabled some modifications that came about as close as I’ve seen to ensuring that cache replacement lives in hot CPU cache lines. This largely involves ensuring that you can find an immediately evictable buffer with a short scan with very high probability, and if that fails (a rare case), the scan metadata is so compressed that it often easily fits within L1. And that metadata is often evaluated with massive parallelism in a handful of clock cycles using vector intrinsics. AFAIK, these variants are the strongest cache replacement designs in terms of CPU efficiency (not necessarily cache hit rate for some access patterns).

Some of the core techniques above are things I learned from other database kernel designers who themselves learned it from someone else. The scan length reduction technique was so elegant and obvious in hindsight that I am surprised it never occurred to me or most anyone else (though it does make the designs more subtle). Quite a lot of these practical details have never been written down as far as I can tell; they are passed along as lore and beer convos. I am as guilty of this as anyone else.

I thought the SIEVE algorithm was great, by the way! This area of research deserves far more attention than it seems to receive, most people act like it is a fully solved problem. Caches are in the hot path of everything.

That said, in recent years I have become a lot more interested in the theoretical limits of effective cache replacement, regardless of algorithm. There are modern workloads that are asymptotically approaching randomness for all cache replacement even when the cache is 10-20% of the working set. Being able to narrowly split and assign workload to things that are effectively cacheable and things that are not is becoming important. I routinely deal with workloads now where <1% of the working set fits in memory, and it is becoming more common. Caching is important but the impact on overall performance is diminishing.


This is an awesome reply, the kind I like to see here. Do you mind sharing more details on what type of cache or use case needs the performance you mentioned? Database bufferpool?

> a lot of these practical details have never been written down as far as I can tell

I wish the valuable knowledge could be written down and shared.

Yup, I am on the same page as you: after seeing the many state-of-the-art algorithms having similar miss ratios, I am inclined to think we are close to the maximum possible hit ratio, bounded by the limited information in the request history (information theory).


I’d be happy to chat more, this is probably best taken offline. My contact details are in my profile. I have had an ambition to write up everything I know about this subject, both theory and practice, but I never get around to it as a practical matter.


> show that SIEVE outperforms all state-of-the-art eviction algorithms on more than 45% of the traces.

Doesn't that mean that over half the time, 55%, it's at best equal and, more likely, worse?


Well, it mentions comparison against multiple other algorithms. So if those 55% are divided between a bunch of other algorithms, being the best in 45% could mean it's the best more often than any other algorithm.


My favourite simple "eviction" algorithm is just to wipe out the entire cache as soon as it gets full.

Insertion:

    %hash = () if keys %hash > 100000;
    $hash{$key} = $val;
Retrieval:

    return $hash{$key};
Desirable properties include bounded memory usage (assuming bounded size of $val), O(1) (amortised) insertion/retrieval, and (most importantly) it only takes a second to write and there's almost 0 chance of writing it wrong.

The downside is that the mean utilisation is only 50%.


Another simple one is to just identify a bunch of completely random candidates for eviction, give them some interval of time for a "second chance" to be used, and if not used in that time, flush them.

Has the advantage of lower overhead and simplicity.


I wonder how that would compare to randomly evicting something.


"Why does our service slow to a crawl every 5 minutes?"


I love the simplicity, but wiping the cache is a no-go for most serious apps.


It depends on the application.


If wiping the cache is a no-go, how do you handle (unexpected) restarts?


With a 3 day outage and a status page that says "a small percentage of users are experiencing decreased availability while we investigate a network issue in us-east-1".


This is very interesting! I really like the emphasis on simplicity in addition to performance - seems easy enough to reason about. I guess for observability’s sake, when implementing this, you’d probably want to record a distribution of distance from the pointer to the evicted item, and match it up with request latencies.


In case someone is interested, there is an independent review and implementation at https://github.com/scalalang2/golang-fifo


I also made a JS implementation based on the golang one here: https://github.com/kurtextrem/js-sieve. It's a bit different from your reference implementation.


This is fantastic work!



There is obviously an LRU-like element in the algorithm because

1. Eviction generally shifts items to the right.

2. The Hand moves to the left.

3. Items skipped by the Hand (appearing to its right) are spared from eviction until the Hand wraps around.

You can see it in the animation. D is flipped to visited just in the nick of time, and the Hand moves over it. Then it is spared from then on until the Hand wraps around (the animation doesn't go that far). Items that are used frequently will tend to be in the Visited state when the Hand passes over them, and stay cached.


I don't understand why D is NOT switched to false when the hand moves left and removes E. At the beginning A is switched to false when the hand moves left over it.


I watched it again and read the algorithm description part, and I think you're right - D should have been switched to false, as the hand was pointing to it when a cache miss happened.


I watched it again and checked it against the python reference code.

The code and documentation are correct, the animation is wrong.


The behavior is inconsistent with what happens to A and B at the very beginning.

Whenever the hand skips a visited node in search of an unvisited one, it must flip it to unvisited.

If D were given this special treatment every time the hand cycles around, it would forever stay in the cache even if not accessed (D-plomatic immunity?)


Yes; D is spared by the fact that the hand passed over it, but it should be turned gray so that if it is not accessed again by the time the hand comes around again, it will be toasted.


Replaced the linked list with a Python list, for clarity: https://gist.github.com/tommie/577b4e154731a280136295180211e... (and I hope I got it right)

Since this is using a 1-bit visited counter, a small cache could use a bit vector and bit shifting to "remove" a value from the queue.


A Python list is backed by an array, so removing an element from the middle of the list is costly. (I assume this was the reason for the linked list in the original version.)


Indeed, but this is clearer to read.


Python's ordered dict uses an efficient linked-list implementation; it might be elegant here?


Oh, interesting.

The documentation there also says

> They have become less important now that the built-in dict class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7).

https://docs.python.org/3/library/collections.html#ordereddi...

It assumes iterators are stable (and valid) under insertion, and I can't find any docs on that for either. Since you (AFAIK) can't create an iterator at a specific key, we also couldn't use keys as the "hand."


The effectiveness really depends on the request pattern. In some use cases, LRU is the most efficient. When the pattern is purely random, it obviously won't be the best, as any value has the same probability of being selected.

Regarding simplicity, there is a simpler algorithm where we don't need the next and prev pointers: we simply replace the non-flagged value at the hand. This is most probably what is used in CPUs.
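Roughly like this, I think (a sketch of that variant as plain Clock over a flat array; the class and field names are mine):

    class ClockCache:
        def __init__(self, capacity):
            self.slots = [None] * capacity   # each slot holds (key, value) or None
            self.ref = [False] * capacity    # the per-slot "visited"/reference bits
            self.index = {}                  # key -> slot number
            self.hand = 0

        def get(self, key):
            i = self.index.get(key)
            if i is None:
                return None
            self.ref[i] = True
            return self.slots[i][1]

        def put(self, key, value):
            i = self.index.get(key)
            if i is not None:                # update in place
                self.slots[i] = (key, value)
                self.ref[i] = True
                return
            # Advance the hand, clearing flags, until an unflagged slot is found,
            # then overwrite that slot in place -- no pointers to maintain.
            while self.slots[self.hand] is not None and self.ref[self.hand]:
                self.ref[self.hand] = False
                self.hand = (self.hand + 1) % len(self.slots)
            old = self.slots[self.hand]
            if old is not None:
                del self.index[old[0]]
            self.slots[self.hand] = (key, value)
            self.ref[self.hand] = True       # give the fresh entry one sweep of grace
            self.index[key] = self.hand
            self.hand = (self.hand + 1) % len(self.slots)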


What does the acronym SIEVE mean? I guess SImple EViction something? I couldn't find that in the paper


This is presented like the best cache algorithm ever invented, but it's not simpler than LRU.


I always heard of this algorithm under the name "Not Recently Used" cache.


It's a bit off topic, but how are the animations made?


Came here to ask the same question.


I used Keynote, especially the magic move function, to create the animation. A magic move is a transition that creates the effect of objects moving from their positions on one slide to new positions on the next slide. Then you can export the Keynote file as a GIF.


It was made with Keynote, using the magic move between slides. Maybe @yazhuo can share the file.


@Hrun0 @portpecos Sure, I'd like to share the file if anyone needs it.


No need, but thanks :)


Is there a term for such hyped up language?


Yes, language like "... it's turning heads with its slick efficiency..." and "We didn't stop there. We pushed SIEVE a bit further..." smacks of sales pitches for cheap-quality, overpriced merchandise. Certainly the statements are a kind of hyperbole. Maybe puffing/puffery[0] is close to what you're looking for.

[0]: https://www.law.cornell.edu/wex/puffing


>On a cache hit, SIEVE marks the object as visited. On a cache miss, SIEVE checks the object pointed to by the hand. If the object has been visited, its visited bit is reset, and the hand moves to the next position, keeping the retained object in its original position in the queue. This continues until an unvisited object is found and evicted. After eviction, the hand moves to the next position.

Dude, what?


Look at the animation. The description is confusing.

It uses a list that's treated as a ring buffer to keep track of keys currently in the cache.

The "hand" is a pointer that points to a given node in that ring buffer.

When the cache is full and a space is needed, the "hand" moves through that buffer (possibly wrapping around). Whenever it points at a node that hasn't been touched since the last time it was seen, that node is evicted and you're done for this time; otherwise it clears the visited bit for that node and keeps going.

(You're always guaranteed to find a node to evict, because worst case you scan through the entire set of nodes, and find the first node you cleared the visited bit of)
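In code, that scan is roughly the following (a toy sketch using a plain list instead of a linked list, so removal is O(n); it's my reading of the description, not the authors' reference implementation):

    class SieveSketch:
        def __init__(self, capacity):
            self.capacity = capacity
            self.queue = []        # index 0 = oldest; new keys are appended at the end
            self.values = {}
            self.visited = {}
            self.hand = 0          # scans from older entries toward newer, wrapping

        def get(self, key):
            if key in self.values:
                self.visited[key] = True     # the only work done on a hit
                return self.values[key]
            return None

        def put(self, key, value):
            if key not in self.values and len(self.queue) >= self.capacity:
                # Clear visited bits until an unvisited key is found, then evict it.
                while self.visited[self.queue[self.hand]]:
                    self.visited[self.queue[self.hand]] = False
                    self.hand = (self.hand + 1) % len(self.queue)
                victim = self.queue.pop(self.hand)
                del self.values[victim], self.visited[victim]
                if self.queue:
                    self.hand %= len(self.queue)   # hand rests on the next survivor
            if key not in self.values:
                self.queue.append(key)
                self.visited[key] = False          # new objects start unvisited
            self.values[key] = value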


Maybe it's a bit odd that it resets the 'visited' bit on non-related objects (why are they suddenly not visited)?

For me it helped to instead think about 'visited' as a special case of a counter called 'valuation' (or something like that). New objects come in with valuation 1, on a cache hit an object's valuation is set to 2, and when the hand moves left it decreases objects' valuations until it reaches one that would get valuation 0; that one is evicted immediately. One could easily imagine a modified algorithm where 'valuation' could be increased further than 2 if an object got visited many times (but then eviction would be much slower in some cases).

Then this is all optimized by using a bool instead of an int since we truncate 'valuation' at max 2.

Idk why they chose to call it 'visited' though.


What "visited" really denotes is "accessed since the last time the hand reached this node".


Yahoo has (or had) a very similar 'cache' for tracking abuse (ydod). As I recall it was patented, but I can't find the patent to confirm it has expired.

Really handy for tracking potentially abusive IPs, if your table is big enough. On a request, check if the IP is in the table - if so, increase its count, and if it's above the threshold, mark it. If it's not in the table, decrement the current item and move current to the next item.

An aggressive user will clear out a slot for itself, and rapidly hit the threshold. Other users will come and go. You can iterate through and try to always clear a spot for a new entry, but it might not be necessary, depending on what you're caching.
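For the curious, a sketch of that scheme as I read it (essentially a space-saving/heavy-hitters style table; the threshold, table size, and the rule that a drained slot gets taken over by the new IP are my guesses):

    THRESHOLD = 100     # hypothetical "this looks abusive" count
    TABLE_SIZE = 1024

    slots = [None] * TABLE_SIZE    # each slot: [ip, count] or None
    current = 0

    def observe(ip):
        """Record one request; return True if this IP looks abusive."""
        global current
        for slot in slots:                       # a real version would index by IP
            if slot is not None and slot[0] == ip:
                slot[1] += 1
                return slot[1] >= THRESHOLD
        # Not in the table: decrement the current slot; if it's empty or drained,
        # the new IP takes it over. Aggressive clients win a slot quickly, while
        # occasional visitors mostly just knock other counts down.
        slot = slots[current]
        if slot is None or slot[1] <= 1:
            slots[current] = [ip, 1]
        else:
            slot[1] -= 1
        current = (current + 1) % TABLE_SIZE
        return False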


The hand is a pointer into the queue to the current cache item being inspected. The eviction behavior on a cache miss depends on whether the pointed-to element is marked as visited.

If you're familiar with LRU/LFU, ring buffers (though this isn't one), and data structures, this will probably make a lot more sense.

A rigorous course in data structures is fantastic for any engineer. Everything is data structures and algorithms at the end of the day.



