Better analysis here than in the SIEVE paper, imho.
Ultimately, the right eviction policy depends on the application's behavior, and you'll only find the right answer by profiling against actual data from service logs... and since the difference between miss rates isn't large, I really don't think it should even be the first thing that gets profiled. Use LRU or LFU (based on intuition about usage) and move on. But I do love to see newly proposed algorithms, more tools for the bag.
The optimal algorithm, even one that knows exactly which memory is going to be requested, is at most twice as good as LRU while using only half the memory. If LRU is not enough you need to be clever or use more memory.
Something like LFU has no such guarantee. Just read a big block of data 10 times, and then you can get close to 100% cache misses by continuing to read random bits of new data 10 times each: LFU clings to the old block's high counts and keeps evicting the newcomers. LRU would only miss about 10% of the time there, and I think that's close to optimal.
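To make that concrete, here's a toy replay of that kind of shift (a made-up trace and a deliberately naive LFU that never ages its counts; real LFU variants usually add decay precisely to avoid this):

    from collections import OrderedDict, defaultdict

    def run(trace, cache_size, policy):
        """Replay a key trace against a toy cache; returns a hit/miss flag per access."""
        cache, freq, flags = OrderedDict(), defaultdict(int), []
        for key in trace:
            freq[key] += 1                      # lifetime count, never decayed (naive LFU)
            hit = key in cache
            flags.append(hit)
            if hit:
                cache.move_to_end(key)          # recency update, used by LRU
                continue
            if len(cache) >= cache_size:
                if policy == "lru":
                    cache.popitem(last=False)   # evict the least recently used key
                else:                           # naive LFU: evict the lowest lifetime count
                    victim = min(cache, key=freq.__getitem__)
                    del cache[victim]
            cache[key] = True
        return flags

    # Made-up workload: a 100-key block read 100 times, then repeated passes
    # over 50 brand-new keys (a working set that easily fits the cache).
    warmup = [f"a{i}" for i in range(100)] * 100
    churn = [f"b{i}" for i in range(50)] * 50
    for policy in ("lru", "lfu"):
        flags = run(warmup + churn, cache_size=100, policy=policy)
        tail = flags[len(warmup):]
        print(policy, "hit rate after the shift:", round(sum(tail) / len(tail), 2))
    # LRU adapts after one pass; the naive LFU keeps the stale block around and
    # misses on nearly every access to the new keys.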
I tend to think similarly about most optimisation things. Sure, sometimes in niche cases (embedded, or sometimes gaming, are good examples) doubling performance takes it from a no-go to a go, so it's definitely worth knowing things like this. Usually though we're looking for more than 50% scalability, and if you're within 50% of max capacity you need to re-think things on a much higher level.
And as far as optimising things go this is a bit of a special case. Often you can make something several orders of magnitude faster or more efficient, but when it comes to LRU you know you're already quite close to the optimal algorithm (in logarithmic terms).
Basically there's very little speed-up you can achieve with a smarter algorithm that you couldn't achieve with a simple increase in cache.
Twice as good meaning that the cache miss rate is halved?
A standard homework problem in a computer architecture class is to work through the math on how marginal improvements to cache rates can provide dramatic performance improvements, due to the disparity in times of providing data from cache and from the source. Small improvements to caching are a big deal. I hadn’t heard of SIEVE but I’ve had to work more with traditional block cache strategies, where scans are intermixed with random access. LIRS works pretty well.
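That homework math is just the average-memory-access-time formula; with illustrative (made-up) latencies:

    # Average memory access time (AMAT) = hit_time + miss_rate * miss_penalty.
    # Illustrative numbers only: a 1 ns hit and a 100 ns miss penalty.
    hit_time, miss_penalty = 1, 100
    for hit_rate in (0.90, 0.95, 0.99):
        amat = hit_time + (1 - hit_rate) * miss_penalty
        print(f"hit rate {hit_rate:.0%}: AMAT = {amat:.0f} ns")
    # 90% -> 11 ns, 95% -> 6 ns, 99% -> 2 ns: when the miss penalty dominates,
    # halving the miss rate roughly halves the average access time.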
Oh, "cache" means a huge amount of functionality, on both hardware and software, with a huge amount of dissimilarity.
People are talking as if the one cache they happen to be thinking about is the be-all and end-all instance of "cache" and nothing can differ. Even the article's author, who really should know better.
Anyway, you seem to be talking about a basically free network (probably a bus), with huge throughput and relatively short latency. The article seems to be about a more limited network, with comparatively lower throughput and larger latency. The same algorithm won't work for both cases.
This is one thing you can probably easily test on historical data. In fact I wonder if there is an automated tool/framework that can consume some structured log of accessed keys to select and tune the optimal policy.
The problem with that approach is not only the time and effort to capture and analyze the trace, but that the workload changes over the application's lifetime. When the workload is expensive and important then cache analysis is worth doing, and there are cache simulators that are pretty easy to work with.
For the general case the ideal is an eviction policy that adapts to fit the given workload. Unfortunately adaptivity is not a very well explored area so most policies have heuristic settings that default to work well in a target range of workloads, but this has to be tuned for others. In this and the author's other paper, S3-FIFO, their lower hit rates with TinyLFU turned out to be due to the implementation. When running it with Caffeine, a productionized version that adds hill climbing and jitter, it had equivalent hit rates [1]. That cache is pretty robust across a wide range of workloads, but also much more complicated to implement. I'd really like to see more research on adaptivity because I think all policies are converging on similar structures and goals, but self-tuning is still an error prone art.
This should be doable. On each access, append the key to a list of recent keys, which wouldn't be terribly expensive if the keys are much smaller than the values. Every so often you can then look through the list of recent keys, analyze the access patterns, and maybe change the eviction algorithm.
But I think this might also not be too helpful overall. I think in many cases there is an obvious choice depending on how and for what the cache is used. And then there are caches that have more unpredictable access patterns, a file system cache might be a good candidate, experiencing the access patterns of all the different applications.
In that case there is probably no best choice, as several applications with different access patterns concurrently hit the cache. Or if there is a best choice, it might change very quickly as different applications or components of applications execute. In such a scenario it seems more promising to me to try to categorize the access patterns on a per-source basis and treat them accordingly, instead of trying to find one best policy for all sources combined.
On the other hand something like this might quickly become prohibitively expensive.
Easy to measure, sure, but that's different from its impact on performance. You're much more likely to get a bigger win from analyzing the database schema, divide-and-conquering the workload, tuning the algorithms in the critical path, or even choosing _what_ gets cached. If tuning the cache eviction is the main thing, it is probably because that's the only thing left.
In college we included the optimal case when studying caching. It's a very worthwhile exercise. SIEVE's claim is simplicity and lower eviction latency; that has to be weighed against hit rate. A tiny fraction more hits will almost always outweigh the differences in eviction overhead between algorithms. And depending on cache size and the size of what you're caching, LRU's bookkeeping can add a nontrivial amount of memory.
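For a rough sense of that LRU memory overhead (back-of-the-envelope numbers I made up, not figures from the paper):

    # A classic LRU is a hash map plus a doubly linked list, i.e. roughly two
    # pointers of list bookkeeping plus a hash slot per cached item.
    entries = 100_000_000            # hypothetical cache of 100M small objects
    per_entry = 2 * 8 + 8            # prev + next + hash slot, 64-bit pointers
    print(f"~{entries * per_entry / 2**30:.1f} GiB of eviction metadata")  # ~2.2 GiB
    # If the cached values themselves are small (say 64 bytes), that's over a
    # third of the data size again in overhead; a singly linked FIFO/SIEVE-style
    # queue shaves one pointer per entry off that.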
It is promising to see more experimenting with these algorithms.
I mostly agree with this, except that for hardware (CPU) caches, hit latency is absolutely critical and all the logic must be implemented in circuits. On top of the latency requirement, organizing the data structures and handling coherency are really tough design problems.
I love this kind of work. I've had a (40+ year old) belief that LRU is probably the right approach, and since I don't pay much attention to cache algos that belief has settled comfortably in the back of my mind.
It's so great to see that this area is the subject of theoretical and empirical work, and that in fact my default view is likely wrong in most cases.
Do people really look for the best possible general-purpose cache eviction approach?
It appears evident that there is no single approach that will work for all workloads (i.e. combinations of access patterns, read/write ratios, scans, processor/cache parameters, etc.). Perhaps I am wrong, and there's a holy grail out there, but...
If there is no single approach, shouldn't the focus be on matching algorithms to workloads? Better selection approaches would seem to yield far more value than yet another boutique algorithm.
The optimal algorithm is not computable. Approximately optimal algorithms are computationally intractable for all practical systems. This is why practical real-world algorithms are all narrowly optimized, it is required to make the algorithm usefully tractable. It can also work relatively well if you can very narrowly define the workload characteristics you need to optimize for.
Cache replacement algorithms are in the family of universal sequence prediction problems. These problems are unfortunately related to the problem of generally optimal AI/learning algorithms, which gives you a sense of why the task is so difficult.
What if you kept statistics on the access patterns you encounter every day? A lot of workloads do the same thing day in, day out. Some workloads almost never scan, others only scan; it would be a quick win to automatically optimize these cases.
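A crude version of that statistic isn't much code. For example, assuming the access log carries integer block ids (an assumption on my part; other key types would need a different adjacency test):

    def scan_fraction(keys, window=1000):
        """Rough heuristic (mine, not from the paper): the fraction of recent
        accesses that continue a sequential run, assuming integer block ids."""
        recent = keys[-window:]
        sequential = sum(1 for a, b in zip(recent, recent[1:]) if b == a + 1)
        return sequential / max(1, len(recent) - 1)

    # e.g. switch to a scan-resistant policy when the workload looks scan-heavy
    print(scan_fraction([7, 42, 7, 99, 42] * 200))   # ~0.0: random re-reads
    print(scan_fraction(list(range(2000))))          # ~1.0: a pure scan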
Not necessarily if it takes time to collect, ingest and update the data structure. Many algorithms will rebalance based on usage but that also has a cost.
That's pretty much what the self-tuning elements of JVM GCs are doing. I think we might label the overall problem as "non-computable" but we have real-world systems doing it all day long.
No, but it's easily findable. If you have a group of, say, 10 cache machines, you can set each one to a different eviction algorithm and then track misses, hits, and eviction rates. You can even use a multi-armed bandit algo to pick between the algos if you want to be really thorough.
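A minimal sketch of what that could look like (hypothetical policy names, epsilon-greedy standing in for whichever bandit you prefer, and the hit/miss counters would come from your real cache metrics):

    import random

    # Hypothetical setup: each "arm" is a fleet of cache machines running one policy.
    policies = ["lru", "lfu", "sieve", "s3-fifo"]
    hits = {p: 0 for p in policies}
    requests = {p: 0 for p in policies}

    def choose_policy(epsilon=0.1):
        """Epsilon-greedy: usually exploit the best observed hit rate, sometimes explore."""
        if random.random() < epsilon or not any(requests.values()):
            return random.choice(policies)
        return max(policies, key=lambda p: hits[p] / max(1, requests[p]))

    def record(policy, was_hit):
        requests[policy] += 1
        hits[policy] += was_hit

    # In practice you'd route a slice of traffic to each arm and feed the real
    # hit/miss counters from your cache metrics back in through record().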
But the main question would be why? In 99.9% of cases, the cache eviction algorithm is not going to be your bottleneck. So might as well just use LRU and move on to more important problems.
It really isn't. This has been known for things like silicon design since at least the 1980s, and there are a few different seemingly basic properties that make an otherwise simple data model have intrinsically poor cache characteristics. One of the best known is the use of high-dimensionality access methods (e.g. graphics, spatial, graph, etc processing).
The traditional solution to cache replacement intractability is latency hiding architectures (e.g. GPUs, barrel processors, etc) for applications that are mostly about throughput. In fact, this is a reasonable working definition of the difference between latency-oriented and throughput-oriented design (caching versus latency hiding).
> In 99.9% of cases, the cache eviction algorithm is not going to be your bottleneck.
What types of cases are we talking about here? For data intensive applications like database engines, using a naive algorithm like LRU will definitely show up in the profiler unless the rest of your code is similarly naive. You don't notice it for high-throughput workhorse systems because someone else has already spent a lot of time tweaking and optimizing this, using some combination of caching and latency hiding.
> If you have 1e4 machines and run a global video distribution network it might be worth an engineer's salary to look into it.
Sure, and that would be the 0.1% of the time where it makes sense.
> if you're implementing a feature in the standard library of a widely used language
I would contend that unless your library is very opinionated on exactly how data will be accessed, you can't possibly optimize cache access in a way that makes sense for every user.
You can't possibly optimize sorting in a way that makes sense for every user. Let us use insertion sort in our standard library and call it a day. Why are you bothering to discuss general-case sorting algorithms?
Insertion sort is heuristically good in many practical circumstances. You wouldn't call it optimal, nor choose to use it as your general-purpose sorting algorithm.
Having a strong default starting point is nothing to scoff at. You wouldn't want a standard library to use insertion sort because "no singular approach works optimally in all cases, so why bother trying for good general case?"
I thought the article might be discussing the impact of CPU caches on the Sieve of Eratosthenes when used as a processor benchmark. I'm pretty sure I thought wrong.
Next I thought it might be an algorithm for managing the CPU caches. Wrong again (I think).
I think it's a discussion of managing application caches or perhaps things like disk caches. Not sure if I'm right.
I'm sad that this ends on a cliffhanger, I was excited to see the results for the circular array version. Hopefully I don't lose track of this story when (if) someone who knows what they're doing does that test (I am not such a person, or I'd do it myself).
I haven't tested an actually fast version of this, but I did hack together two versions that used rings of nodes instead of lists. There are two separate issues, IMHO. Rings - be it in an array or using linked nodes - make the logic even simpler because you don't need to test for the start/end of the list.
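Roughly like this Python sketch (my own rough rendering of the idea, not the code from the gist mentioned elsewhere in the thread; it assumes a capacity of at least two):

    class Node:
        __slots__ = ("key", "value", "visited", "next")
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.visited = False
            self.next = self          # a single node is a ring of one

    class SieveRing:
        """SIEVE on a singly linked ring: next pointers run from older to newer
        and wrap around, so neither insertion nor the hand ever has to check
        for the ends of a list. Assumes capacity >= 2."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.table = {}           # key -> Node
            self.head = None          # newest node; head.next is the oldest
            self.hand = None          # predecessor of the next eviction candidate

        def get(self, key):
            node = self.table.get(key)
            if node is None:
                return None
            node.visited = True       # the hit path just sets a bit
            return node.value

        def put(self, key, value):
            if key in self.table:
                node = self.table[key]
                node.value, node.visited = value, True
                return
            if len(self.table) >= self.capacity:
                self._evict()
            node = Node(key, value)
            if self.head is None:
                self.head = self.hand = node
            else:
                node.next = self.head.next       # new head wraps to the oldest node
                self.head.next = node
                if self.hand is self.head:       # keep the hand aimed at the oldest entry
                    self.hand = node
                self.head = node
            self.table[key] = node

        def _evict(self):
            # Walk the ring: clear and skip visited nodes, evict the first
            # unvisited one. No wrap-around or null checks - it's a ring.
            while self.hand.next.visited:
                self.hand.next.visited = False
                self.hand = self.hand.next
            victim = self.hand.next
            self.hand.next = victim.next
            if victim is self.head:
                self.head = self.hand
            del self.table[victim.key]

    c = SieveRing(capacity=2)
    c.put("a", 1); c.put("b", 2)
    c.get("a")                        # marks "a" as visited
    c.put("c", 3)                     # evicts "b", the oldest unvisited entry
    print(c.get("a"), c.get("b"), c.get("c"))   # 1 None 3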
Implementing it using arrays isn't hard. You have two obvious options:
One is to still use a linked ring. In that case your array just saves you dynamic allocations and allocator overhead (and if in a gc'd language, gc pressure).
The other is to copy to compact the ring on eviction. Whether that is a horrible idea or might work will depend a great deal on your performance characteristics and the number of entries in your cache.
Personally I think the "linked ring in an array" approach is a nice middle ground - you're likely to get better (CPU) cache locality than with a naive linked list (whether that's enough to matter really depends on your implementation language's underlying allocator), and the singly linked version has quite low extra overhead vs. just an array (and it can be made smaller if size is a concern - e.g. the link doesn't need to be a pointer but can be an array index).
I'm having a hard time understanding why this is better than CLOCK. It seems like a hack to make updates easier which however makes both practical performance and conceptual justification much worse.
The elements which are between 'hand' and 'tail' are going to stay there for a long time, until the hand pointer cycles. If there are lots of elements which get put in the cache but don't really deserve it, that could take a long time. But that's exactly the case where you wouldn't want to keep the elements behind the hand around for a long time. Remember, they got 'promoted' for appearing twice within roughly a cache-size window of requests. That's not a great indication that they should be kept for a long time. On the other hand, once the pointer does cycle, they can get thrown away quite easily.
On the other hand CLOCK converges to a situation where elements which have appeared a lot are at the front. They are much less likely to be thrown out even if they have a temporary run of scarcity. The elements which don't appear much quickly bubble to the back and become much more likely to be thrown out in the next few evictions. This is accentuated if there are lots of elements which are getting hit relatively more frequently.
After a long time of running SIEVE, there's not really anything about the state which will improve future performance compared to when it starts cold. In many cases it seems it will be locked in to a suboptimal state by overadapting to past observations.
So this made me think of spaced repetition, which is based on how the brain retains memories. I wonder if you could make a cache retention algorithm similar to what the brain uses.
I'm thinking something like this:
Every entry has two values "solidity" and "time since last seen". When an entry is accessed solidity increases based on time since last seen - if it was a short while ago it increases just a little, longer means bigger increase. This is to ensure that solidity will correspond roughly to the long term need for the entry and doesn't go up too much if something is accessed a lot of times in quick succession.
Entries with low solidity and long time-since-last-seen are evicted.
That could work efficiently in a sampled caching policy. For example, Hyperbolic [1] maintains the frequency and insertion time on each entry and evicts the entry in the sample with the least utility value of frequency / (currentTime - insertionTime). For yours, you could easily add it to the SampledPolicy in this simulator [2] and run it against a few traces to see how it compares.
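For illustration, a toy sampled policy along those lines (my own sketch, not Hyperbolic's or the simulator's code; the "solidity" score from the parent comment could be swapped in for the utility function):

    import random, time

    class SampledCache:
        """Evict from a small random sample using a utility score, in the spirit
        of sampled policies like Hyperbolic. Toy code, not a production cache."""
        def __init__(self, capacity, sample_size=8):
            self.capacity, self.sample_size = capacity, sample_size
            self.data = {}       # key -> value
            self.freq = {}       # key -> access count
            self.since = {}      # key -> insertion time

        def _score(self, key, now):
            # Hyperbolic-style utility: frequency / age. A "solidity" score that
            # grows with the gap since the last access could be plugged in here.
            return self.freq[key] / max(1e-9, now - self.since[key])

        def get(self, key):
            if key in self.data:
                self.freq[key] += 1
                return self.data[key]
            return None

        def put(self, key, value):
            now = time.monotonic()
            if key not in self.data and len(self.data) >= self.capacity:
                sample = random.sample(list(self.data), min(self.sample_size, len(self.data)))
                victim = min(sample, key=lambda k: self._score(k, now))
                for d in (self.data, self.freq, self.since):
                    del d[victim]
            if key not in self.data:
                self.freq[key], self.since[key] = 1, now
            self.data[key] = value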
https://arxiv.org/abs/2301.11886 achieves pretty remarkable improvements over standard heuristics for storage and CDN caches by using common heuristics as a filter for ML decisions. I imagine even TLB caches might see some benefit from ML models with batched pre-prediction as done in MAT, but L1-3 are almost certainly too high throughput.
This is not really "sieve-ing" per the article, but what prevents me from running another process that periodically queries the data in a cache? Like just running a celery queue in Python that continually checks the cache for out of date information constantly updating it? Is there a word for this? Is this a common technique?
I think this is not as simple as it sounds, because to achieve good metrics (latency, cache hit rate) you will need to be predicting the actual incoming query load, which is quite hard. Letting the query load itself set the values is the state of the art.
In some ways, caching can be seen as a prediction problem, and the cache miss rate is the prediction error as we lag the previous history at time T. Blending load over time is effectively what these various cache algorithms do to avoid overfitting.
If you have an idea of what you need to cache or can fit everything into the cache it's extremely effective.
Though potentially just refreshing out-of-date data in the cache could increase effectiveness, given that the general assumption of a cache is that what's in it will probably be used again.
I called it a periodically refreshing cache when I wrote one. Not sure if there is a more formal name.
You might call that prefetching. That's what unbound calls it when it returns a nearly expired cached entry to a client and also contacts upstream to update its cache. I remember having a similar option in squid, but it might have been only in an employer's branch (there were a lot of nice extensions that unfortunately didn't make it upstream).
You're describing cache maintenance (and cache eviction), a practice for which there are many algorithms (FIFO, LRU, LFU, etc.), including the algorithm the article describes (SIEVE).
I think this is orthogonal to cache maintenance and cache eviction. Instead this is having a background process periodically refreshing the data in the cache to keep it hot.
Refreshing the cache to keep it hot, and deciding how to do it, e.g. which parts do we do with the caching layer directly, which parts do we do with an external process, what to evict to make room, etc, are subtopics of cache management.
If I understand you correctly, you're asking if this is different because an external process is involved. I don't see a use in drawing a distinction, and as far as I know, there's no special term for that pattern.
Update: after looking into it, it looks like this cache/architecture pattern is called "refresh-ahead"
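For what it's worth, the pattern is small enough to sketch (hypothetical fetch_fresh and TTL values; a real version would use your cache client and something like celery beat instead of a bare thread):

    import threading, time

    cache = {}             # key -> (value, expires_at)
    TTL = 300              # hypothetical time-to-live, seconds
    REFRESH_MARGIN = 60    # refresh anything within 60s of expiring

    def fetch_fresh(key):
        """Stand-in for the real backend call."""
        return f"value-for-{key}"

    def refresher(interval=30):
        """Background loop: re-fetch entries that are about to expire so readers
        rarely see a cold key - the 'refresh-ahead' part."""
        while True:
            now = time.time()
            for key, (_, expires_at) in list(cache.items()):
                if expires_at - now < REFRESH_MARGIN:
                    cache[key] = (fetch_fresh(key), now + TTL)
            time.sleep(interval)

    threading.Thread(target=refresher, daemon=True).start()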
I started reading the original paper and stopped as soon as I read the bit about not being scan-resistant. I'll stick to ARC (Adaptive Replacement Cache, designed by IBM and used among other places in ZFS), thank you very much.
No, it's not. For starters, it's new; and second, it's not a given that it'd be useful for CPUs at all. Caches are used in all kinds of places, not just CPUs - e.g. operating systems, content delivery networks, databases, reverse web caches. There's a huge number of places for different kinds of cache algorithms, and the tradeoff that works best varies a great deal depending on workloads and requirements (e.g. whether worst-case latency, average latency, hit rate, or any number of other factors matters most).
Copy to compact might be fine if your cache is small. The larger the cache is, the more expensive copy to compact gets relative to the rest of your logic. Hence why the original in the paper uses a linked list.
You can still get closer to what an array would buy you. E.g. the example code in the paper uses a doubly linked list and there's no reason to, so you can reduce overhead right away there. One of the other benefits of an array is that you can wrap around just by using a modulo of an index, but you can avoid the conditionals in their list implementation by linking the list together into a ring instead. Those bits I've implemented in a quick and dirty Ruby version (the singly linked ring is last here: https://gist.github.com/vidarh/b72615c16673cb38630cab5148190... ).
You can then also do a "linked ring in array" version, and use that to both reduce the memory overhead (both from individual node allocations and from the size of the pointers in the linked list) by placing linked list/ring nodes in a statically sized array. You just need to relink the node you evict into where you want to insert. You can save on pointer size if that matters to you as well by using array indices instead of pointers in that case.
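Something like this layout, sketched in Python for brevity (hypothetical names; in C or Rust the slots would just be a flat struct array):

    from dataclasses import dataclass

    @dataclass
    class Slot:
        key: object = None
        value: object = None
        visited: bool = False
        nxt: int = -1            # index of the next slot in the ring, not a pointer

    class RingArena:
        """Fixed-size pool of slots linked into a ring by integer indices. The
        evicted slot is refilled and relinked in place, so there's no per-node
        allocation and no compaction."""
        def __init__(self, capacity):
            self.slots = [Slot() for _ in range(capacity)]

        def link_after(self, prev_idx, idx):
            self.slots[idx].nxt = self.slots[prev_idx].nxt
            self.slots[prev_idx].nxt = idx

        def unlink_after(self, prev_idx):
            idx = self.slots[prev_idx].nxt
            self.slots[prev_idx].nxt = self.slots[idx].nxt
            return idx           # freed slot index, ready to be reused for the insert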
Whether any of these matters is something you really should measure in your preferred implementation language.
Not really on topic, but a question about papers: when viewing the linked paper, the top gives the title, the names of the authors, and their universities, but no date. The way to guesstimate the age is to scroll down to the references and find the most recent date visible there. But of course that method is not foolproof if the references all happen to be older.
Putting author names but no date at the top seems to be some sort of standard template or convention that every single paper author or institution everywhere uses since forever, since I never see dates at tops of papers.
Why is the date not part of the standard template? Knowing the date is so incredibly important for placing the paper in context and in relation to other science. Especially for science I just don't understand why there's no convention to date this stuff.
---
Anyway, why I was interested in the date was because the article said "Why aren’t we all SIEVE-ing?". Well turns out the paper is from 2023 or newer, so that could explain why.
> Putting author names but no date at the top seems to be some sort of standard template or convention that every single paper author or institution everywhere uses since forever, since I never see dates at tops of papers.
The year usually appears in the bottom left of the first page, and in the header or footer of subsequent pages.
The first linked PDF [1] says only "14th USENIX Symposium on Operating Systems Design and Implementation" (without the year), but does have the year on an initial cover page. The second [2] says "HotOS ’23, June 22–24, 2023, Providence, RI, USA". The third [3] says "SOSP ’23, October 23–26, 2023, Koblenz, Germany" in the bottom-left of the first page.
However, this information may be missing from author-produced files/preprints, as it may be inserted along with page numbers when the conference proceedings or journal issue gets created. This seems to be the case for the fourth linked PDF [4].
A link like https://junchengyang.com/publication/nsdi24-SIEVE.pdf is the one we can actually read, though, and relying on a third party to add this critical piece of information, rather than making the paper self-contained, seems suboptimal.
I noticed that arxiv at least prints the date vertically on the left side of the PDFs there, but it seems that too is something arxiv adds, not the paper itself.
If it's a matter of subtleties like "what is the exact publication date", then at least put some date; at least let us know whether it's from the 2020s or the 2000s, since that's not guaranteed to be easy to tell. It should be possible to put down the year in which the text was written.
> Well turns out the paper is from 2023 or newer, so that could explain why.
TFA and the paper both wonder why this extremely simple algorithm was not discovered sooner. Both posit that no paper should be a prerequisite to applying this modification.
The question in TFA is less about the paper and more about the concept -- is there some reason why this has evaded us? Is it too good to be true?
The article does a better job than the paper digging in to the details IMO.
Asking why humanity didn't invent an invention earlier, given that we had all the prerequisites, is kind of a banal question. Why didn't we invent cars sooner? Why didn't we invent the internet sooner? We had all the prerequisites, or else they wouldn't have been invented. Why didn't we invent them faster?
The answer to all these questions is that having the prerequisites for inventing something doesn't mean we instantly invent that thing, a la Sid Meier's Civilization. The expectation that we would is actually what's kind of odd here. Humans don't even all share the same info, much less automatically know all the possibilities of all combinations of all humanity's knowledge.
Another factor might be that Yet Another Eviction Algorithm isn't necessarily one of humanity's top focuses for inventions at the moment. We could ask any of us here on HN, for example, why we didn't think of it, and sooner. The answer is likely that we were all working on something else we found more important than YAEA.
That banal question is in the original paper. The paper spends significant time explaining that, because this is such a simple modification, they find it difficult to believe it's been overlooked until now.
The article's question "why aren't we SIEVE-ing?" follows this theme.
No one is asking the equivalent of "why didn't we invent cars sooner?" here.
I don't doubt that the banal question in question is in the original paper. My point was how banal the question was, and how easily it is answered, not where it appeared.
Many inventions in humanity were "simple modifications" of other things. Why didn't we invent them sooner? The answer is in the post you replied to: because it's totally normal for "obvious" inventions to not be invented for a long time, even when somebody thinks, after the fact, that it's a "simple modification". Indeed, what percentage of humanity is currently searching for Yet Another Eviction Algorithm, and what percentage of that percentage is looking in this direction, instead of one of the hundreds of other directions?
So yes, the question is indeed equivalent to "why didn't we invent [X] sooner", replacing X with any invention that at least 1 person out of billions in the world finds "obvious" or a "simple modification".
Do you have any routines that you've refined over the years? Is every step you take in a routine something you were taught or learned elsewhere? Or did you "invent" part of it yourself?
I know how to make tea or coffee. Should someone take the credit for the fact that I know to start boiling water first, while I get everything else together, because this results in the most efficient use of my time?
My perception of the paper is this analogy isn't too far off. It's not an invention, it's a discovery.
Perhaps not everything is an invention, but I think everything we're talking about here is, including algorithms. In any case, that's just semantics: you can call it a "discovery" if you feel better about it, perhaps even run s/invention/discovery/g on my post.
The points therein still stand: [inventions/discoveries] which at least 1 person in the world thinks (after the fact) are obvious and just a "small modification" of other [inventions/discoveries], usually go [uninvented/undiscovered] for quite some time.
The paper and TFA ask, "why?" But the better question is, "why wouldn't they?" There's no reason to think they would be immediately [invented/discovered] as soon as all the prerequisites are [invented/discovered] (again, even if someone thinks they're similar), for the reasons described in my above posts.
The electric car makes sense to me: when cars were new, all kinds of propulsion methods were attempted, and eventually one dominant technology gained traction.
The electric printing telegraph in 1848 though, that's amazing!
Looking at the BibTeX metadata can give you a better idea of how recent/relevant the paper is:
year = {2023},
…
month = apr
But since this is not yet published (it's to appear at NSDI 2024), I use a method similar to the one you describe. Most of the time you can also infer it from the paper itself (sometimes from the abstract or the discussion).