I see a couple of monitoring/reporting systems... but no cache invalidation solution. These are tools to catch bugs in whatever solution you're using. Good work, but not a solution for cache invalidation.
And regarding those tools:
It doesn't sound like Polaris would handle a network partition well. If a cache invalidation triggers it to check the other caches for consistency, that assumes Polaris will receive that invalidation message. Imagine a scenario with 5 cache servers and a Polaris server. On one side of the split are Polaris and 2 cache servers, and the other side has 3 cache servers. It's possible for the 3 cache servers to receive an update that is not received by the Polaris+2 network... and Polaris would not only be unaware of the inconsistency, it also wouldn't know to check for it later once the network partition is resolved.
I also feel like the consistency tracing assumes that only one fill and invalidate is occurring at a time (when in practice there may be multiple fills and invalidates occurring in parallel on the same data), and that those calls will arrive in order. If they arrive out of order, it doesn't sound like it would catch that. I think you're relying on Polaris to catch this case, but high latency cannot be differentiated from a network partition except by the length of the delay, so these two types of errors can be seen together... in which case you'd have a cache error that neither tool would detect.
I would like to hear why I'm wrong.
I understand this is being used in prod, but network partitions don't occur every day... and I'm not convinced this has seen enough flaky networks to work out the bugs.
Yeah, I have learned that some people have different perceptions of what cache invalidation means (mine is a narrower definition, a subset of what some people think of as cache invalidation). I will not die on this hill. I am happy to rename/rephrase/do anything to be helpful.
> it doesnt sound like Polaris would handle a network partition well ...
Good question. And you are right. My thought on this is that, over time, Polaris would still catch those unique issues that happen during a network partition or whatnot. Polaris doesn't promise 100% coverage all the time. However, over time, if there are flaws in the system, Polaris should help surface them. And once it does (even with just one inconsistent sample), it's very actionable: we can find out why it happened and fix the underlying cause.
> I also feel like the consistency tracing assumes that only one fill and invalidate is occurring at a time (when in practice there may be multiple fills and invalidates occurring in parallel on the same data), and that those calls will arrive in order. If they arrive out of order, it doesn't sound like it would catch that...
It doesn't make that assumption, actually. Distributed systems are state machines. Consistency tracing essentially logs state transitions (it doesn't do much detection). So when Polaris detects an anomaly, we have all the information to help us diagnose it. And you are right that fills and invalidations can happen in parallel; that's fine. E.g. if a fill happens before an invalidation, we can always track the state mutations caused by the invalidation. If the invalidation is the culprit (the earlier fill can't be, because it happened earlier), we will have a log for it.
I was also surprised not to see the solution to cache invalidation. The blog post describes how the tracing/logging mechanism works, but what about the actual cache consistency/invalidation strategy? I.e., what about many-to-many cases where one fetch returns inconsistent results? How do you even know it's inconsistent? What about data queried while there's a transaction rollback? When is it updated in the cache?
Sorry if these sound like silly questions; I was genuinely curious to learn about the strategy, since the blog started with "When it comes to cache invalidation, we believe we now have an effective solution to bridge the gap between theory and practice".
These are good questions. I might not have satisfying answers to all of them but I will try.
> what about many-to-many where one fetch returns inconsistent results
Are you referring to the case where the cache stores a complicated result of a function that fetches data from many sources, and how to figure out when to invalidate it? If so, I am sorry, but that's not the cache invalidation I was referring to. (https://news.ycombinator.com/item?id=31676102)
> What about data queried while there's a transaction rollback.. when is it updated in the cache
This might be easier to answer. I assume that if you run the database with an isolation level higher than read-committed, you will be fine.
> delaying performing the compute-intensive operation until an inconsistent sample crosses the reporting timescale
Could this delay cause wrong reporting, since the source of truth (the DB) could have already received multiple mutate events for the data Polaris is trying to verify? How do you handle this case?
No, actually. When a sample has crossed the reporting timescale and it's inconsistent, it counts as an anomaly at that moment, regardless of whether the DB changes later or the cache becomes consistent later.
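A minimal sketch of how I understand that check, with hypothetical names (the sample fields and the read_cache/read_db callables are not the actual Polaris interfaces): a sample is queued when the invalidation event arrives, and is only compared against the source of truth once it has aged past the reporting timescale.

```python
import time

REPORTING_TIMESCALE_S = 300  # e.g. the five-minute window

def check_sample(sample, read_cache, read_db):
    """Return True if this sample counts as an anomaly for this timescale."""
    age = time.time() - sample.received_at
    if age < REPORTING_TIMESCALE_S:
        return False  # too early to judge; re-queue and check again later
    # The sample crossed the timescale: compare cache against the DB *now*.
    # Even if the DB mutates again later, or the cache heals itself later,
    # what we observe at this moment is what gets reported.
    return read_cache(sample.key) != read_db(sample.key)
```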
> Good work, but not a solution for cache invalidation.
Assuming your definition of cache invalidation is about "when/who" to invalidate on writes, let's actually try solving it.
E.g. in its most generic form, a cache can store an arbitrary materialization from any data source. Now, when updating the data source, in order to keep caches consistent you essentially need to transact (a cross-system transaction) on both the data source and the cache(s). A cache usually has more replicas, and I am not sure running this type of transaction is practical at scale.
What happens if we don't transact on both systems (the data source and the cache)? Well, now whenever the asynchronous update pipeline performs the computation, it's done against a moving data source (not a snapshot of when the write was committed). Now let's say the data source is Spanner, which provides point-in-time snapshots. On a Spanner commit you get a commit time (TrueTime) back. Using that commit time, reading the data and computing the cache update asynchronously can be done.
Because the materializations in cache don't take writes themselves, updates to them are essentially performed blindly and can be ordered by the commit time (TrueTime).
Now, this does assume that whatever we cache (the query, e.g.) is schematized and made known to the invalidation pipeline. I think that's a very fair assumption to make. Otherwise (anyone can cache anything without the invalidation pipeline knowing at all), it's pretty obvious that this problem can't be solved.
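To make that concrete, here is a rough sketch of such an asynchronous pipeline, assuming a Spanner-like store that returns a commit timestamp and supports snapshot reads at that timestamp (db, cache_pipeline, and cache are placeholder interfaces, not a real client API):

```python
def write_and_schedule(db, cache_pipeline, mutation, affected_queries):
    # Commit the write; a Spanner-like store hands back a TrueTime commit timestamp.
    commit_ts = db.commit(mutation)
    # Hand the registered (schematized) queries to the async pipeline together with
    # the commit timestamp, so recomputation is not done against a moving data source.
    for query in affected_queries:
        cache_pipeline.enqueue(query, commit_ts)

def refresh_cache_entry(db, cache, query, commit_ts):
    # Recompute the materialization from a point-in-time snapshot...
    value = db.snapshot_read(query, at=commit_ts)
    # ...and apply it blindly, ordered by commit time, so an older recomputation
    # can never overwrite a newer one.
    cache.set_if_newer(query.cache_key, value, version=commit_ts)
```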
- client starts a transaction
- client runs any mutations as needed
- client collects user_ids whose cache entries need to be invalidated
- client invalidates cache
- client commits the transaction
Caches can fetch and store the "old" state between the last two steps, so this sequence does not correctly invalidate the cache in all scenarios.
Good catch! I shouldn't have omitted the details here. Roughly there are two ways to solve the race you mentioned here.
You can use a versioning scheme supported by the database to do compare-and-swap, i.e. send the invalidate along with a hybrid logical clock (HLC). An HLC is nice in this case as it handles DB transaction rollback gracefully, if the database supports it.
Or we can always do the invalidation asynchronously (by recording the invalidation keys transactionally and having a tailer that sends out the invalidations).
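A minimal sketch of that second option, with hypothetical db/cache interfaces (not our actual pipeline): the keys to invalidate are recorded in the same transaction as the mutation, and a separate tailer sends the invalidations after commit, so a rollback never produces an invalidation and a committed write can never lose one.

```python
def mutate_users(db, user_ids, mutation):
    with db.transaction() as txn:
        txn.apply(mutation)
        # Record the invalidation work in the same transaction as the write,
        # so it becomes visible if and only if the mutation commits.
        txn.insert("invalidation_log", {"keys": [f"user:{uid}" for uid in user_ids]})

def tailer_loop(db, cache):
    # Runs continuously after commit: reads the durable invalidation log in
    # order and issues the actual invalidations (at-least-once delivery).
    for entry in db.tail("invalidation_log"):
        for key in entry["keys"]:
            cache.invalidate(key)
        db.ack(entry)
```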
I will make an edit to the blog to make it more clear. Let me know if that makes sense!
I mean this – cache invalidate/fill race specifically – is very much a solved problem from a protocol perspective, as long as our definition of the problem is the same – do not leave stale data in cache indefinitely. That is not to say it is easy; I am not trying to minimize anyone's struggles. https://research.facebook.com/publications/scaling-memcache-... might be of interest.
A lot of the challenges in caching are in making the trade-offs between consistency and coordination overhead based on the workload and the requirements, _and actually_ making caches consistent in production. As explained in the post, there are practical challenges that are very unique to caches and cache invalidation.
I will not say cache invalidation is easy, and I am not trying to minimize anyone's struggles. But cache invalidation is _not_ like FLP impossibility or CAP. Too many systems reason about caching (inherently a distributed system) in an ad-hoc way that leads to failures and to the belief that cache invalidation is uniquely hard (https://twitter.com/marcjbrooker/status/1534944338341310470?...).
>I understand this is being used in prod, but network partitions don't occur everyday...
Can't we assume they tested it before deploying to production by introducing all the weird stuff that might happen in a distributed system (partitions, high latency), so they aren't relying only on "it seems to work in production"?
My experience is that this is easier said than done. But I agree with you that, in theory, one can inject failures, run tests with all possible combinations, and have a repeatable test framework (think FoundationDB's test framework) to help root-cause test failures. SQLite has very impressive reliability; its test framework is not open sourced. It also benefits from user bug reports, as it's literally the most widely deployed DB. Bugs/flaws happen.
Having better tests and better observability in production are also not mutually exclusive.
Do you guarantee referential integrity in TAO? Last I heard the answer is no; complex queries are not supported and clients would have to do their own work with internal versions to achieve a consistent view on the whole graph (if such a consistent view even exists). But it seems to work fine since global referential integrity doesn't seem to be a big deal; there aren't a lot of cases where dependency chains are long enough or actions quick enough that it matters (e.g. a group admin adds a new user, makes them an admin, who adds an admin, and then everyone tries to remove the admin who added them [always locally appearing as if there is >=1 remaining admin], but causes the group to ultimately have no admins when transactions settle). Run into any fun issues like that?
Contrasting that with Spanner where cache coherency is solved by reading from the recent past at consistent snapshots or by issuing global transactions that must complete without conflict within a brief truetime window. I am guessing the cost of Spanner underneath the social graph would be a bit too much for whatever benefits might be gained, but curious if anyone looked into using something similar.
For the fun issues you described, read-modify-write (either using optimistic concurrency control or pessimistic) can work, if I understand your question correctly.
Spanner is awesome and one of my favorite systems/papers. I think it would be very computationally and power expensive to run the social graph workload on a Spanner-like system. Do you have an estimate of how many megawatts (if not more) would be needed to support one quadrillion queries a day on Spanner?
I wish I had more solid data. Cloud Spanner claims about 10K read QPS and 2K write QPS on a "node" which costs $1/hour. The Spanner paper reports about 10K read QPS per core. As I understand it, the cloud deployment is 3 replicas at that price, so $0.33/hour should buy me about 8 cores, and I'm not sure what the disparity is (maybe markup? CloudSQL is 2X the cost of compute), but I'll go with 10K QPS/8 cores at the low end. Anyway, something like 500-1000W for ~100 cores gives something between 100 QPS/W and 10 QPS/W using the 1000W figure and the high and low performance estimates, and 10^15 / 86400 ≈ 1.16e10 QPS, for somewhere between 116 MW and a little over a GW. Sounds comparable(?) to MySQL+TAO on the low end to ridiculously expensive at the high end.
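For concreteness, the same back-of-envelope as code (the inputs are the rough estimates above, not measurements):

```python
queries_per_day = 1e15                   # "one quadrillion queries a day"
qps = queries_per_day / 86_400           # ~1.16e10 QPS

qps_per_watt_high = 100                  # optimistic end of the estimate
qps_per_watt_low = 10                    # pessimistic end of the estimate

print(f"{qps:.2e} QPS")                            # ~1.16e10
print(f"{qps / qps_per_watt_high / 1e6:.0f} MW")   # ~116 MW
print(f"{qps / qps_per_watt_low / 1e9:.2f} GW")    # ~1.16 GW
```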
EDIT: I have no clue how efficient MySQL+TAO really are but figure that at least tens of thousands of machines go into it.
If that's on x86_64 then the power budget seems some ten times too low to me. A typical Xeon Silver 4208 is about 10 W TDP per core, and other things would also add up, even taking into account that newer CPUs are more efficient.
I am basing this on Epyc Milan, which looks closer to 5 W/core, and possibly mutilating my language a bit because I think cloud products are sold by thread (vCPU), not by physical core.
I was actually a little surprised how hard it is to get performance/W figures since I don't currently have any physical server hardware to go off of.
Since I observed data inconsistencies on Facebook multiple times, I believe they aren't as concerned about data integrity as a company doing financial transactions would be.
> At a high level, Polaris interacts with a stateful service as a client and assumes no knowledge of the service internals. This allows it to be generic. We have dozens of Polaris integrations at Meta. “Cache should eventually be consistent with the database” is a typical client-observable invariant that Polaris monitors, especially in the presence of asynchronous cache invalidation. In this case, Polaris pretends to be a cache server and receives cache invalidation events.
I'm a bit confused here. Polaris behaves as a regular cache server which receives invalidation events, but isn't the typical cache invalidation bug that a service forgets to contact the cache server for invalidation? So this will only catch cases where (1) you remember to contact Polaris but forgot to contact other cache servers [which Polaris happens to know about], OR (2) you're not handling errors during invalidation requests to the cache server [and the request to Polaris was successful]? Or are your cache servers "smart", with internal logic which might "ignore" an invalidation?
What am I missing?
EDIT: Reading through "A real bug we found and fixed this year" and I'm still a bit confused. It seems like a very contrived bug tied directly to how you deal with versioning (e.g. you allow the latest version to be present with stale metadata, or something?). My main concern with cache invalidation is what to invalidate at what time.
I did an analysis of the bug from the blog post and the TAO paper, and from what I can tell the fundamental bug is that the error handler that handles cache invalidation errors drops only lesser versions, while the cache invalidation contract requires replacing equal or lesser versions.
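In other words (this is my reading of the write-up, sketched with hypothetical names, not Meta's actual code), the error handler used a strict comparison where the contract needs a non-strict one:

```python
def handle_invalidation_failure_buggy(cache_entry, invalidation):
    # Only drops the cached item when its version is strictly older, so an
    # entry at the *same* version but with stale metadata survives.
    if cache_entry.version < invalidation.version:
        cache_entry.drop()

def handle_invalidation_failure_fixed(cache_entry, invalidation):
    # Contract: replace/drop entries at an equal *or* lesser version.
    if cache_entry.version <= invalidation.version:
        cache_entry.drop()
```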
> isn't the typical bug related to cache invalidation that a service forgets to contact the cache server for invalidation?
That's not the case, based on our experience at FB. At-least-once delivery is basically a solved problem.
But you are absolutely right that if there's an issue in the invalidation delivery, it's possible that Polaris won't receive the event either. Polaris actually supports a separate event stream (starting all the way from client-initiated writes) to cover this case.
It helps that TAO is a write through cache so clients can't really forget to invalidate, correct? If someone were to directly write to MySQL shards there would be stale data ~indefinitely.
I'm assuming the versioning is what ensures proper write, invalidation, and fetch ordering so that e.g. slow mysql writes/replication don't cause remote TAO clusters to receive an invalidation message and then read a stale value from a replica?
> It helps that TAO is a write through cache so clients can't really forget to invalidate, correct?
That's not the case.
> If someone were to directly write to MySQL shards there would be stale data
For the sake of this discussion, you can assume this is the setup, and everything should still stand.
> I'm assuming the versioning is what ensures proper write, invalidation, and fetch ordering so that e.g. slow mysql writes/replication don't cause remote TAO clusters to receive an invalidation message and then read a stale value from a replica?
You should join us :p This is getting into the details of our cache invalidation protocol and our cache consistency model. Maybe another post on another day!
> EDIT: Reading through "A real bug we found and fixed this year" and I'm still a bit confused.
Yeah, that specific bug is convoluted.
> It seems like a very contrived bug directed directly to how you deal with versioning (e.g. you allow the latest version to be present with stale metadata, or something?).
This is mostly for performance reasons that I won't go into too much detail on here. Version is mostly an internal concept that clients don't care about. So it's OK-ish.
> My main concern with cache invalidation is what to invalidate at what time.
Can you elaborate? Our experience with cache invalidation is that you invalidate on mutation, and in terms of "what to invalidate" it depends on the computation, for which you might have to give me a few more details.
I am the author of the blog post. I believe the methodology described should be applicable to most if not all invalidation-based caches. I am serious when I say that cache invalidation might no longer be a hard thing in computer science. AMA!
> you have tools to analyze the correctness of your solution
That's half of it. Cache invalidation is hard not only because of the complexity of cache coherence protocols (some of which are not very complicated). Cache invalidation also introduces countless races that manage to create cache inconsistencies in ways that are just hard to imagine ahead of time (in my experience). IMO, that's the harder part of cache invalidation: when cache inconsistencies happen, answering the "why" question is much harder than having a cache invalidation protocol (you can have TLA+ for one if you will). And answering "why is my cache inconsistent" is the problem we solved, which I think is the harder part of the cache invalidation problem.
> Instead, it feels like grandiose claims about solving cache invalidation itself.
That is definitely something I worried about. But at the same time, I do think we solved the harder part of the cache invalidation problem. TAO and Memcache serve quadrillions of queries a day. Based on our experience, answering the question of "why is my cache inconsistent" is the hardest thing about cache invalidation.
Cache invalidation protocols can be complicated, but some are pretty manageable. You can also verify one with TLA+ if you will. But once the rubber hits the road, some cache entries tend to be inconsistent.
I definitely worry about people taking this the wrong way, but at the same time, I stand by the claim of "cache invalidation might no longer be a hard thing in computer science".
The gist of what I got from the post was that you created a tool which monitors caches, and that helped find bugs that cause inconsistent cache issues. How is that solving a computer science problem?
The tool that monitors cache consistency is easy to build. That by itself doesn't solve anything major. The most important contribution is a novel approach to consistency tracing that helps find out "why" caches are inconsistent; pinpointing a bug is much harder than saying "there is a bug". This is based on the key insight that a cache inconsistency can only be introduced (for an invalidation-based cache) in a short time window after the write/mutation, which is what makes the tracing possible.
> How is that solving a computer science problem?
It depends on your definition of a computer science problem. I am definitely not solving P = NP. By your definition, does Google's Paxos Made Live paper solve a computer science problem?
The claim is more a play on Phil Karlton's quote, as the work here makes cache invalidation much easier (in my opinion). Also Phil Karlton's quote doesn't necessarily _make_ a problem a computer science problem, don't you think? I think it's a good quote and there's a lot of truth in it.
I disagree with your claim that "cache invalidation might no longer be a hard thing in computer science" for the same reason that you say that pinpointing bugs is harder than saying "there is a bug". You can't just build a tool and declare a whole problem space in computer science (and a long running joke) as being solved.
> the work here makes cache invalidation much easier
The tool itself doesn't make cache invalidation easier; it makes finding bugs (that happen to be cache invalidation bugs) easier. By that logic, if one never wrote a bug, cache invalidation would still be as difficult as before.
Again, good work on the tool. That's fantastic. I've definitely done some huge work in this area myself and struggled a lot. Let's also realize that it is also FB internal, so whatever solutions you've come up with aren't really helping anyone else without spending the same amount of engineering time and resources on the problem.
You are right. The assumptions I made are
- the definition of cache invalidation https://news.ycombinator.com/item?id=31676102
- subsequently, by that definition, _making_ caches consistent in production is the harder problem
I understand if you disagree with these premises. And all these make sense.
Let's discuss the "cache invalidation problem" by your definition.
E.g. in its most generic form, a cache can store an arbitrary materialization from any data source. Now, when updating the data source, in order to keep caches consistent you essentially need to transact (a cross-system transaction) on both the data source and the cache(s). A cache usually has more replicas, and I am not sure running this type of transaction is practical at scale.
What happens if we don't transact on both systems (the data source and the cache)? Well, now whenever the asynchronous update pipeline performs the computation, it's done against a moving data source (not a snapshot of when the write was committed). Now let's say the data source is Spanner, which provides point-in-time snapshots. On a Spanner commit you get a commit time (TrueTime) back. Using that commit time, reading the data and computing the cache update asynchronously can be done.
Now, this does assume that whatever we cache (the query, e.g.) is schematized and made known to the invalidation pipeline (in the form of some control-plane metadata). I think that's a very fair assumption to make. Otherwise (anyone can cache anything without the invalidation pipeline knowing at all), it's pretty obvious that this problem can't be solved.
How do you handle caching derived data assembled from multiple individual records? For example, how would you maintain a cache for a query like getFriendsOfFriends(userId)?
My context is working on Notion’s caches for page data. Our pages are made out of small units called “blocks” that store pointers to their child content. To serve all the blocks needed to render a page, we need to do a recursive traversal of the block tree. We have an inconsistent cache of “page chunks” right now, how would you think about making a consistent cache?
This is a fantastic question! We face something very similar (if not identical).
There are two parts to solve this problem
1. monitor and measure how consistent the cache is
2. figure out why they are inconsistent
I will focus on #1 in this comment. You can build something very similar to Polaris (mentioned in the blog) that
- tails your database's binlog so it knows when e.g. "friendship" data is mutated
- it can then perform the computation to figure out which cache entries "should have been" updated. E.g. if Alice just friended Bob, then both Alice's and Bob's friends-of-friends entries should reflect the change, and your monitoring service will "observe" that and alert on anomalies (rough sketch below).
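Here is a minimal sketch of that monitor, assuming hypothetical binlog/cache/db interfaces and a fof:<user_id> key scheme (none of this is Notion's or Meta's actual code):

```python
def report_anomaly(user_id, event):
    print(f"friends-of-friends cache for user {user_id} did not reflect {event}")

def monitor_friends_of_friends(binlog, cache, db):
    # Tail the binlog so the monitor learns about every friendship mutation.
    for event in binlog.tail(table="friendship"):
        # Both endpoints' friends-of-friends entries should reflect the change.
        for user_id in (event.user_id, event.friend_id):
            expected = db.compute_friends_of_friends(user_id)  # recompute from the source of truth
            observed = cache.get(f"fof:{user_id}")
            # A miss is fine (fill-on-read); a stale hit is the anomaly.
            if observed is not None and observed != expected:
                report_anomaly(user_id, event)
```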
So if you just implement the cache invalidation logic in your monitoring system, you can tell if you got the cache invalidation logic correct in your caching system. That sounds really helpful!
This continues to point out how you just completely don't understand the quote. You very clearly seem to think that the "hard" part of cache invalidation is how to implement invalidating it when you know exactly what needs invalidation. The "hard" part is actually in knowing what needs invalidation. Your grandiose claims make you, your team and org, and your company look bad.
You think figuring out when to invalidate is the harder part of cache invalidation; in that sense Polaris is not checking the "when" logic but rather repeating it. And you are right. This is not the problem we solved in the post. I should have applied a narrower definition of cache invalidation, and I apologize for the confusion.
There is a trade-off between coordination (in some cases with unbounded write amplification https://twitter.com/uvdn7/status/1534979480363667457?s=21&t=...) and relaxed consistency. I assumed this is a trade-off that people just have to make based on their workload (hence you saw me mention TTL in other comments; it's about trade-offs, for when the cost of coordination/invalidation surpasses the benefit of a higher hit rate, etc.).
> You very clearly seem to think that the "hard" part of cache invalidation is how to implement invalidating it when you know exactly what needs invalidation. The "hard" part is actually in knowing what needs invalidation.
I acknowledge that both are hard. The former can't be avoided. And the latter is easy in certain cases (if you have a simpler data model, etc.), though I am not saying the latter is easy in general. More details in https://news.ycombinator.com/item?id=31676102
And I made the mistake of assuming a single definition of "cache invalidation", as it means different things to different people. Marc's view is a very good one (https://twitter.com/marcjbrooker/status/1534944340266864640?...) but it's actually different from some people's definition on this thread.
I should have applied a narrower and more specific definition of cache invalidation in the blog post. The confusion it caused was not intended.
> how to implement invalidating it when you know exactly what needs invalidation
I stand by the claim that this is a hard problem, as explained in the blog post. When you deal with a cache you are inherently dealing with a distributed system (there's cache and the source of truth). And the contribution is making this aspect more manageable, which is common to all invalidation based caches.
I read the article. It seems to be suggesting that cache invalidation might no longer be a hard thing in computer science because of the insights uncovered by your tracing and observability solution. IOW, now that you can measure and audit the system, you're able to find and fix bugs in the system. Is that the correct take-away?
Yep. You're exactly right. And the approach is generic and I think it should work for everyone.
The idea is that with all these observability capabilities, debugging cache inconsistencies becomes very actionable, close to how we debug an error with a message telling us exactly where an exception happened.
I didn't understand the tracing part. Is tracing 100% inside of polaris? If not does it start at the database? The first cache?
Does the invalidation need to be predictable before you can use tracing? What kind of parameters do you have so you don't have too much logging or too little?
Tracing is not in Polaris. It's a separate library that runs on every cache host. Its main job is logging every cache state mutation for traced writes. So when Polaris detects cache inconsistencies, we know what happened and why the cache is inconsistent.
It starts all the way from the client-initiated write (where we mint a unique id and plumb it all the way through).
> What kind of parameters do you have so you don't have too much logging or too little?
This is the key question! If you go to the second half of the post, it talks about an insight into how we managed to log only when and where cache inconsistencies _can_ be introduced. There's only a small window after a mutation/write in which the cache can become inconsistent due to invalidation. So it's very cheap to trace and provides the information we need.
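A rough sketch of that shape as I understand it (class and function names are made up, not the actual library): the write path mints a trace id, the invalidation carries it to each cache host, and the host only logs state transitions for keys that were written within the window.

```python
import time

TRACE_WINDOW_S = 60  # only shortly after a write can an invalidation race corrupt the entry

def log_transition(trace_id, key, old_state, new_state):
    print(f"[{trace_id}] {key}: {old_state} -> {new_state}")

class ConsistencyTracer:
    def __init__(self):
        self._recent_writes = {}  # key -> (trace_id, time the traced invalidation arrived)

    def on_invalidation(self, key, trace_id):
        # The trace id was minted at the client-initiated write and plumbed through.
        self._recent_writes[key] = (trace_id, time.time())

    def on_cache_mutation(self, key, old_state, new_state):
        # Log state transitions only for recently written keys; outside that
        # window an invalidation-based cache can't become newly inconsistent,
        # which is what keeps the tracing cheap.
        entry = self._recent_writes.get(key)
        if entry and time.time() - entry[1] < TRACE_WINDOW_S:
            log_transition(entry[0], key, old_state, new_state)
```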
Is your argument here that cache invalidation may no longer be hard because you can implement systems like Polaris which continually test your cache implementation to try and catch invalidation problems so you can then go and fix them?
Polaris is actually fairly simple to build. The harder question to answer is "why is the cache inconsistent" and how you debug it. The second half of the post talks about consistency tracing, which tracks all cache data state mutations.
Distributed systems are state machines. With consistency tracing keeping track of all the state transitions, debugging cache inconsistencies in an invalidation-based cache is very actionable and manageable, based on our experience.
Correct me if I'm wrong, but don't we have general solutions that ensure correctness if you allow for enough time? Wouldn't performance be a critical aspect of the claim that this is "no longer a hard problem"? How about the CAP theorem?
The "hard problem" defined here is more about the engineering side. The analogy I have is Paxos the protocol and Google's Paxos Made Live paper. We do have many cache coherency protocols that are provably correct (using TLA+ if you will); but making them actually consistent in production is a completely different story; and a hard problem for an invalidation-based cache.
Can you speak more on why ‘making them actually consistent in production is a completely different story’?
Curious because I've been learning TLA+ recently and am interested to know more about cases where an algorithm has been proven but an actual implementation of it fails.
Let me try to put this as concisely as possible. When you put an algorithm in code, when the rubber hits the road, you have to make certain assumptions about how things work, e.g. how events are ordered, how fsync works, what kinds of failure scenarios you are expecting, etc. More likely than not, reality will be a little different. Not to mention just innocent bugs in the code.
My favorite example is Paxos. Its algorithm fits on a single slide. But it’s notoriously hard to make it actually work correctly in production.
My problem has never been invalidation as stated in the article, but how to manage interdependent cache entries, so that when one is invalidated, the dependent ones must also be invalidated. For example get_user(123) and get_users_for_org(456). Suppose user with id 123 is part of the org with id 456. When the user is deleted, you have to invalidate the get_users_for_org(456) entry. I haven’t seen any convincing “design pattern” for managing such dependencies.
I am sorry for the confusion. We have internal abstractions that deal with the problem you described (especially in simpler forms; the example you have is fairly manageable, actually).
EDIT: I can see why it gets complicated if your data model and query is very complicated.
In the past I've used a timestamp key for the org. Create a key like f"org_{org.pk}" that holds a timestamp. Then append that timestamp to the `get_users_for_org` key. Now all you have to do to invalidate is bump the org timestamp, which generates a rolling key (requires more memory).
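For what it's worth, here is that pattern sketched in Python with hypothetical cache/db interfaces (get/set on the cache, a query helper on the db):

```python
import time

def org_generation_key(org_id):
    return f"org_{org_id}"

def get_users_for_org(cache, db, org_id):
    # The org's "generation" (a timestamp) is part of the derived key, so
    # bumping the generation implicitly invalidates every dependent entry.
    gen = cache.get(org_generation_key(org_id))
    if gen is None:
        gen = str(time.time())
        cache.set(org_generation_key(org_id), gen)
    key = f"get_users_for_org_{org_id}_{gen}"
    users = cache.get(key)
    if users is None:
        users = db.query_users_for_org(org_id)
        cache.set(key, users)
    return users

def on_user_deleted(cache, org_id):
    # One write rolls the key; the old entries simply become unreachable
    # and age out (hence the extra memory).
    cache.set(org_generation_key(org_id), str(time.time()))
```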
I don’t get it. If you already have the version in the cache, then when a client does GET X, the cache can reply X@v4.
As long as the client submits transactions with “I calculated this based off of X@v4” then the database can reject if there is a newer version of X.
The client can then replay the transactions against the cache and by then there will be a X@v5.
With this scheme you can track the transaction replay rate instead of having to build a new system to read-back the cache for consistency.
To get the traceability on which cache node is stale, the client can forward a header from the cache node that identifies it. Then using the same transaction rejection rate ops has visibility on which cache node isn’t being updated.
No cache invalidation needed. Always just cache X@version forever; it's just that the cache allows an unstable query for GET X.
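A minimal sketch of the scheme described above, with hypothetical cache/db calls (get_versioned, commit_if_version) standing in for whatever the real systems expose:

```python
def read_modify_write(cache, db, key, update_fn):
    while True:
        # Unstable read: the cache answers GET X with whatever version it has.
        value, version = cache.get_versioned(key)          # e.g. (..., "v4")
        new_value = update_fn(value)
        # The write declares which version it was computed from; the database
        # rejects it if a newer version of X has already been committed.
        if db.commit_if_version(key, expected_version=version, value=new_value):
            return new_value
        # Rejected: replay against the cache, which by then should serve v5.
        # The rejection/replay rate is the staleness signal ops would watch.
```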
EDIT: Alice, Bob, and Carol are three different servers. What has happened here is that Alice and Bob are closer together, so their updates are faster between each other (synchronizing the writes). Carol is slower for some reason (bad connection?), and is going to update off of old data.
In this case, the two valid caches are X@Alice-V2/Bob-V2, and X@Carol-V1. The other cached data is old and can be discarded.
Things get complicated when you have not only reads (such as in your simplified question), but also writes that are cached (such as what happens in practice).
Tracking all data dependencies used for all writes in a large system seems rather challenging.
Plus what about read workloads, like “pull message from queue and send it as a push notification to user Y”? I guess it’s fine if the push is re-delivered due to stale cache?
So basically, classic timestamp versioning with some consistency checking. Might work. Cache invalidation is hard. The only problem I ever faced while working in the biz that I could not solve even at theory level was cache related.
What I thought the article would tackle is the dependencies between stored objects. Some solve it with super complicated graphs others by just flushing the entire cache.
At Meta, why not just use flat files or a store that acts as one, pubsub the changes, listen to the changes, and update aggregated views as needed and store them? Then just short-term cache everything on the edge.
Yes. The Boss changes his name from Mark. It will directly hit a lot of caches. It will need an update of all bosses that materialized this, then all the staff that materialized the boss of the boss, and so on. This is a simple example because it is a tree; there can be circular dependencies as well.
> In other words, 99.99999999 percent of cache writes are consistent within five minutes.
Those are the kinds of rates that lead to very hard-to-find bugs. You won't properly write code in downstream systems to handle that failure case, and suddenly someone will get paged at 3am because it has just happened, years after the code was deployed, and nobody really understands how this corner case suddenly appeared!
When I'm building systems, every happy codepath should either happen frequently or never.
Having a cache whose design allows it to be stale, but only very rarely, seems like a bad idea. If I were building something on top of this cache I would add a layer that forces staleness for 1 in 1000 requests, just to force the application built on top to be designed to handle it properly.
> Having a cache whose design allows it to be stale but only very rarely seems like a bad idea.
And that's exactly why we first brought this number from six 9's to more than ten 9's; and we are not done yet. Also, the cache is not "designed" to be inconsistent. It's like how Paxos is designed to solve the consensus problem, yet when implemented, most Paxos implementations do not work properly.
Marc put it extremely well. I agree with every single word of his thread. I should have applied a narrower and more specific definition of cache invalidation in the blog post. I apologize for any confusion it caused.
> Data in cache is not durable, which means that sometimes version information that is important for conflict resolution can get evicted.
> We also added a special flag to the query that Polaris sends to the cache server. So, in the reply, Polaris would know whether the target cache server has seen and processed the cache invalidation event.
To make the special flag work, the cache server needs to track not only the current version state, but also past versions. If it tracks past versions, then conflicts can be resolved at the cache server level, but the whole premise of the article is that cache servers can't resolve conflicts by themselves.
I will try to answer this one without too many implementation details.
First of all, cache items can be evicted and are not durable, which is just a fact. But that doesn't mean we can't track progress (in terms of "time", or "logical time"/"order" in distributed-systems terms). The social graph is sharded (not surprisingly). We can keep the progress of each cache host per shard, which is just a map kept in memory, and that map doesn't get evicted.
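A minimal sketch of that idea (hypothetical names, not the actual implementation): a per-shard watermark that survives item eviction and can answer Polaris's "have you seen and processed this invalidation?" flag.

```python
class InvalidationProgress:
    def __init__(self):
        self._watermark = {}  # shard_id -> highest logical time / version processed

    def record(self, shard_id, version):
        # Updated as invalidation events are processed; kept in memory and
        # never evicted, unlike the cache items themselves.
        self._watermark[shard_id] = max(self._watermark.get(shard_id, 0), version)

    def has_seen(self, shard_id, version):
        # What the special flag in the Polaris query effectively asks.
        return self._watermark.get(shard_id, 0) >= version
```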
As naive as it is, I always kinda liked MySQL's query cache invalidation strategy: blow away the whole cache anytime a mutating statement arrives at the server.
Simple. Effective. But it obviously doesn't work for everything.
> Based on consistency tracing, we know the following happened...
I'm a little confused:
- Is the database updated notification going out while the transaction is still in-progress? Doesn't it make more sense to delay the notification until after the transaction commits?
- If a cache tries to update itself while there's an open transaction against the data it's trying to read, shouldn't the database block the operation until the write is complete? And then shouldn't another notification after the transaction commits trigger the cache to update?
> Is the database updated notification going out while the transaction is still in-progress?
No, that's not what happened.
> If a cache tries to update itself while there's an open transaction against the data it's trying to read, shouldn't the database block the operation until the write is complete?
Yes. Generally speaking, having a read transaction here would be better. There are unrelated reasons why it's done this way. The point of the example is that it's really intricate and we can still identify the bug regardless.
Ahh, dirty reads when you need data integrity are dangerous. (They're fine for things like a UI, where the consumer of state is essentially transient.)
I'm not too familiar with your database, but I wonder if you could have a "read transaction" with such a short timeout that you could return a "try again soon" error to the cache? (The only time I dealt with dirty reads like this was Microsoft SQL, I suspect you're using something with very different transaction semantics or guarantees.)
De facto, it is a cache server, with the associated performance decrease. Isn't the main purpose of caching to improve performance?
>>We deploy Polaris as a separate service so that it will scale independently from the production service and its workload.
Assuming the scaling is horizontal, so then, to synchronize among the service instances, what do you do? Create another meta-Polaris service?
Not rhetoric or sarcasm - hoping for an open discussion.
> De facto, it is a cache server, with the associated performance decrease. Isn't the main purpose of caching to improve performance?
Polaris only receives cache invalidation events and doesn't serve any client queries.
> so then, to synchronize among the service instances
It doesn't synchronize among the service instances. It's basically stateless. It pretends to be a cache server only for receiving invalidation events, acts as a client otherwise, and assumes no knowledge of the cache internals.
Thanks for the clarification. However, in that case, won't the querying of all the cache copies/replicas turn out to be the bottleneck at high volumes? Because you are going to have to check all the existing copies/replicas somewhere, right?
>Distributed systems are essentially state machines. If every state transition is performed correctly, we will have a distributed system that works as expected.
I imagine huge distributed systems as Markov chains, where transitions are probabilistic rather than deterministic.
So you measure with Polaris and have found you have data consistency of up to 10 nines within a period of 5 minutes. But aren't inconsistencies building up over time?
What if you measure for 1 hour instead of 5 minutes? Still 10 nines?
The number of nines actually goes up with increasing timescale windows, because Polaris checks anomalies at write/invalidation time.
I guess the behavior you are describing is cache inconsistency at read time. E.g. if I introduced an inconsistent cache entry at time T, it can be exposed to many subsequent reads (hence the "building up over time" you mentioned). This is an important metric too, and we measure it as well.
The key difference between read-time and write-time cache consistency measurement is purpose. Write-time measurement is more actionable, as it captures the moment (or very close to it) when the cache becomes inconsistent; if you want to debug something, you want to get close to when the anomaly happens. Read-time measurement is more about quantifying the negative impact of cache inconsistencies (which is client-facing).
And assuming that certain problems are different from each other and should be counted separately.
(I'm starting to think that 'naming things' and 'cache invalidation' are the same thing, since you need to invalidate the cache anytime the description of the thing changes)
I understand some people have a different definition of cache invalidation. I am using the following definition from Wikipedia
> Cache invalidation is a process in a computer system whereby entries in a cache are replaced or removed.
It's the smallest unit of function that "cache invalidation" must perform. Some people define cache invalidation as the problem of figuring out "when/who" to invalidate. That's not the definition I am using (mine is narrower in that sense). The confusion was not intended; I believe (as I explained in the post) that even with the narrower definition, cache invalidation is still insanely hard.
If it helps, let's break cache invalidation into two parts: 1. knowing when/who to invalidate, and 2. actually processing the invalidation.
My argument is that #1 can be very manageable with simpler data models (as we did), while #2 can't be avoided, and #2 is very hard. The post is about how we believe we have a systematic approach for managing #2.
For #1, say you are caching the result of joining two tables, filtering on two ids. It's still very manageable to track the dependency and know when to invalidate. It can easily grow out of hand (join 10 tables with 100 lines of SQL). Then solving the "when/who to invalidate" problem is essentially equivalent to doing "joins" on the write/invalidation path. First of all, it's unbounded: the number of cache entries you need to invalidate can be unbounded (not bounded by the number of indices, but a function of the data in the database instead). My argument is: why do this to begin with? I acknowledge this is hard. But why do it? On the other hand, you can have simpler data models (e.g. TAO), and fetching and stitching everything together on the read path scales fairly well. It's essentially doing "joins" on the read path, but it's all hitting caches, so it's still fast.
For some complicated queries, you can cache secondary indices (for which it's easier to figure out the "when/who" question, just as a DB figures out which index entries to update on transactions) to make your read-path join faster. The write amplification / cache invalidation fanout is bounded. You don't do "joins" on writes/invalidations.
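To make that shape concrete, a sketch of the "join on the read path" approach (cache.get_or_fill is a hypothetical fill-on-miss helper, and the key scheme is made up):

```python
def get_users_for_org(cache, db, org_id):
    # Read-path "join": one cached secondary index plus per-user records.
    member_ids = cache.get_or_fill(f"org_members:{org_id}",
                                   lambda: db.select_user_ids(org_id=org_id))
    return [cache.get_or_fill(f"user:{uid}", lambda uid=uid: db.select_user(uid))
            for uid in member_ids]

def on_user_removed_from_org(cache, org_id, user_id):
    # Bounded invalidation fan-out: the index entry and the row itself,
    # not every materialized query result that ever touched this user.
    cache.invalidate(f"user:{user_id}")
    cache.invalidate(f"org_members:{org_id}")
```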
Let's discuss the "when/who" problem specifically, if say we just have to solve it.
E.g. in its most generic form, a cache can store an arbitrary materialization from any data source. Now, when updating the data source, in order to keep caches consistent you essentially need to transact (a cross-system transaction) on both the data source and the cache(s). A cache usually has more replicas, and I am not sure running this type of transaction is practical at scale.
What happens if we don't transact on both systems (the data source and the cache)? Well, now whenever the asynchronous update pipeline performs the computation, it's done against a moving data source (not a snapshot of when the write was committed). Now let's say the data source is Spanner, which provides point-in-time snapshots. On a Spanner commit you get a commit time (TrueTime) back. Using that commit time, reading the data and computing the cache update asynchronously can be done.
On the other hand, it's very easy to feel like #2 is an easy problem, which probably explains why people think #1 is what Phil Karlton was referring to. The analogy I like to use is Paxos. The protocol fits on a single slide. It's easy to feel like you have Paxos working; but it's very hard to have Paxos actually work.
We seem to have different concepts with regards to the cache invalidation problem.
The one I know has to do with "Is there a change in the source data? How should I check this? How often?" and such things.
Yours seems to be "In my distributed system, I cannot guarantee the order of the messages that get exchanged, and thus, different nodes may end up having different information at times".
IMO you have a consensus/distributed data problem, not a cache invalidation problem (the inconsistency between your cache nodes is a consequence of the underlying design of your system).
Yep, cache invalidation is hard because you can't just cache /users/1/networkstatus as a function of the request itself; its underlying value is a complex function of values fetched from possibly dozens of tables and services, any of which could change at any time without any immediately discernible connection to User 1.
OP's response to this, in another comment, is: "Now going back to your example with complicated dependencies. Maybe TTL is a better solution. With many dependencies, any changes from the dependency list might trigger invalidation. At some point, just doing TTL, would be simpler."
But what if you actually do want your dependency list to trigger invalidation?
https://materialize.com/ is the closest thing I know of to a "solution" for this - it lets you build nested, realtime materialized views on top of streaming and non-streaming data, all with SQL syntax and Postgres wire compatibility, that fully recalculate and recache their results whenever any input changes, no matter how deep (based on https://timelydataflow.github.io/timely-dataflow/ from Microsoft Research).
You could either use the outputs of such a system as a cache, or weave together a cache invalidation signaling system that sends a stream of keys that need to be evicted from cache at any time, directly into a Kafka topic. And then, of course, this could plug into Meta's cache invalidation consensus system. But it's remarkably disingenuous for Meta to pretend that the dependency side of this problem is trivial or can be solved by TTLs alone.
Nice comment, you opened a new perspective on this.
>But what if you actually do want your dependency list to trigger invalidation?
I (we, as it was a team) once had a scenario where there was a big dependency graph with a lot of replicas (not necessarily caches, but a similar thing), and if we just used TTLs the effect on the whole thing was that there were periods of time where we just got a flood of nodes (children, grandchildren, and so on) all wanting to update at the same time, talking to each other and whatnot.
In the end the solution was quite complex, as it was a combination of a short TTL, some random delays here and there to lower the probability of things happening at the exact same time, very low message overhead, and a system akin to ETags in HTTP [1]. It was very hard due to the scale of the whole system, and I'm sure Meta has this problem 100x, so, disregarding the poor choice of title, it was an interesting read nonetheless.
> But what if you actually do want your dependency list to trigger invalidation?
Let's do it.
E.g. in its most generic form, a cache can store an arbitrary materialization from any data source. Now, when updating the data source, in order to keep caches consistent, you essentially need to transact (a cross-system transaction) on both the data source and the cache(s). So in theory this can be done. But a cache usually has more replicas, and I am not sure running this type of transaction is practical at scale.
What happens if we don't transact on both systems (the data source and the cache), but rather invalidate caches asynchronously? Well, now whenever the asynchronous update pipeline performs the computation, it's done against a moving data source (not a snapshot of when the write was committed). It can't be done.
Now let's say the data source is Spanner, which provides point-in-time snapshots. On a Spanner commit you get a commit time (TrueTime) back. Using that commit time, reading the data and computing the cache update asynchronously can be done.
Now, this does assume that whatever we cache (the query, e.g.) is schematized and made known to the invalidation pipeline (in the form of some control-plane metadata). I think that's a very fair assumption to make. Otherwise (anyone can cache anything without the invalidation pipeline knowing at all), it's pretty obvious that this problem can't be solved.
I am using the following definition from Wikipedia
> Cache invalidation is a process in a computer system whereby entries in a cache are replaced or removed.
It's the smallest unit of function that "cache invalidation" must perform. Some people define cache invalidation as the problem of figuring out "when/who" to invalidate. That's not the definition I am using (mine is narrower in that sense). The confusion was not intended; I believe (as I explained in the post) that even with the narrower definition, cache invalidation is still insanely hard.
It's not disingenuous. You are welcome to speculate.
If it helps, let's break cache invalidation into two parts: 1. knowing when/who to invalidate, and 2. actually processing the invalidation.
My argument is that #1 can be very manageable with simpler data models (as we did), while #2 can't be avoided, and #2 is very hard. The post is about how we believe we have a systematic approach for managing #2.
It's very easy to feel like #2 is an easy problem, which probably explains why people think #1 is what Phil Karlton was referring to. The analogy I like to use is Paxos. The protocol fits on a single slide. It's easy to feel like you have Paxos working; but it's very hard to have Paxos actually work.
> Now going back to your example with complicated dependencies. Maybe TTL is a better solution. With many dependencies, any changes from the dependency list might trigger invalidation. At some point, just doing TTL, would be simpler.
#1 is a hard problem if the data model and the data in cache are complex, I agree. My comment here is referring to update-heavy workloads. If you have a large dependency list, your invalidation workload will tend to be heavy. At some point, you will spend more power processing invalidations than serving user queries. The post is talking about a narrower problem, and I am not trying to minimize anyone's struggle.
See also Noria, https://pdos.csail.mit.edu/papers/noria:osdi18.pdf - later research with similar thought process. Claims to scale better than differential-dataflow. Summary is, "let's make a timely distributed graph computation to keep SQL queries up-to-date, but this time we can forget some parts of the query to save on cache memory". Now productionized by https://readyset.io/
The nice thing about these systems (Materialize/Timely/Differential-Dataflow/Noria/ReadySet) is that the invalidation pipeline is automatically derived from the query structure. The user just needs to know how to write their query (regular SQL with Materialize/ReadySet), pipe in the source-of-truth changelog from all the tables in the database, and the system takes care of the rest. Your answer is somewhat vague about how to build the invalidation pipeline. I think that's what many people find challenging about cache design.
Yep. To generalize this even more, it's not that cache invalidation is hard but that "materialization/denormalization is hard" (even if it's a table in the same database). Hence why I never thought this is what made cache invalidation hard (because it's not even unique to caches).
I can imagine an automated solution to achieve what you described, e.g. by keeping track of the table:columns touched by the query. Maybe that's what these systems did.
It has its downsides though, as the invalidation cost in this case can be unbounded (read: huge write amplification in some cases). I imagine it would be hard to provision the invalidation pipeline. This is where it gets into trade-offs. E.g. by just caching the primary index and secondary indices (not materializations) you can get very far as well (do joins on read). But it depends on the workload.
Exactly. The race condition problem from the example below still exists even without any caching, and perhaps still exists even without any replica.
> Imagine that after shuffling Alice’s primary message store from region 2 to region 1, two people, Bob and Mary, both sent messages to Alice. When Bob sent a message to Alice, the system queried the TAO replica in a region close to where Bob lives and sent the message to region 1. When Mary sent a message to Alice, it queried the TAO replica in a region close to where Mary lives, hit the inconsistent TAO replica, and sent the message to region 2. Mary and Bob sent their messages to different regions, and neither region/store had a complete copy of Alice’s messages.
A solution to this problem is to update Alice's primary message store to be "region 2;region 1" for a period of time, and get Alice to retrieve messages from both message stores during that period.
However, since you have a caching layer that doesn't have an explicit TTL, such that any cache inconsistency can last indefinitely, you now have a much bigger distributed data problem.
> 1. The cache tried to fill the metadata with version.
> 2. In the first round, the cache first filled the old metadata.
> 3. Next, a write transaction updated both the metadata table and the version table atomically.
> 4. In the second round, the cache filled the new version data.
Another way to look at this is that while the database can handle writes to the 2 tables atomically, the caching layer can't do a consistent read from the 2 tables, because it is either too expensive to do the 2 reads from within a database transaction, or too expensive to do a single read with join?
Our current system also suffers heavily from distributed inconsistency, mostly due to these "too expensive" constraints imposed by an ex-FB guy, even though our scale is nowhere near FB level. Yes I am bitter.
> Our current system also suffers heavily from distributed inconsistency
Despite people having different definitions of cache invalidation (mine is narrower, and a subset), I hope the techniques covered here can be helpful (even a tiny bit).
AFAIK the original cache invalidation problem is always seen from the point of view of a writer in a multiple-writer distributed system. It's not about determining the freshness of data as seen from a reader/consumer, but about atomically (!) updating all data copies as part of a write action in a distributed system.
If you assume loosely coupled, eventually consistent data, you haven't solved the problem at all; you've just side-stepped it. If you assume a single-writer, multiple-reader architecture, you don't have the original problem at all.
Do you have any reference or links about it? Thanks.
> point of view of a writer in a multiple-writer distributed system
> If you assume a single-writer, multiple-reader architecture, you don't have the original problem at all.
It starts to sound like you are describing Paxos?
Yeah, I assumed it would originate from cache coherency protocols as well. In that case, it's not hard to figure out "when" to invalidate, as you just invalidate the cache line on mutation based on addresses. If cache invalidation originates from cache coherency protocols, I would assume Phil Karlton's quote was not referring to the "when" to invalidate as the hard problem. But I will not die on this hill. I am open to rephrasing/redefining/rewording in any way that is helpful to people.
OK. I am here to learn. Say we go with your definition of cache invalidation problem. Do you think that's THE hard part of cache invalidation?
Say, you are just caching simple k/v data (no joins, nothing). Are you claiming cache invalidation is simple in that case?
Also the reason why I didn't mention the cache invalidation dependencies is that _I believe_ it's a solved problem (see the link above, we do operate memcache at scale). I am happy to discuss if and why it wouldn't work for your case.
>Are you claiming cache invalidation is simple in that case?
No.
>I believe it's a solved problem
It's not.
Suppose you have source-of-truth A (it doesn't really matter if it's a key-value store or whatever; it could even be a function for all intents and purposes) and a few clients B1, B2, B3, ... that rely on the data from A. You have to keep them in sync. When should B* check if A has changed? Every time they need it? Every minute? Every hour? Every day? This is the cache invalidation problem; which IMO is not even a problem but a tradeoff, but whatever.
Epilogue: With all due respect, you have an unbelievable career, IBM, Autodesk, Google, Box and now Meta. None of those companies would give me five minutes of their time because I am self-taught, yet here we are :)
Please don't cross into personal attack. It's not obvious what you were intending by that last paragraph, but it doesn't sound nice, and bringing someone's personal details into an argument is already not ok.
> Epilogue: With all due respect, you have an unbelievable career, IBM, Autodesk, Google, Box and now Meta. None of those companies would give me five minutes of their time because I am self-taught, yet here we are :)
Correlation does not imply causation. I’m also self taught and some of those companies have given me a lot more than 5 minutes of their time.
Here’s some unsolicited feedback - I interpreted your comment above as implying you know more than $peer, and it comes off a little snarky with the emoji face. That approach doesn’t demonstrate professionalism to me. On the other end, $peer focuses on the technical challenge. Most of the gig boils down to communicating ideas.
> Suppose you have source-of-truth A (doesn't really matter if it's a key value store or whatever, it could be a function for all intents) and a few clients B1, B2, B3, ... that rely on the data from A. You have to keep them in sync. When should B* check if A has changed? Every time they need it? Every minute? Every hour? Every day? That is the cache invalidation problem.
A few things to clarify here: what you are referring to as clients are cache hosts (as they keep data). You seem to imply that the cache is running on the client? I was referring to cache servers (think memcache, Redis, etc.), for which the membership can be determined. So on update (e.g. when you mutate A), you know all the Bs to invalidate.
Now, continuing with your example, with cache running on the client. Assuming we are talking about the same concept when we say "client", the membership is non-deterministic. Clients can come and go (connect and disconnect as they wish). There are some attempts to do invalidation-based caching on clients, but they are hard for the reason I just mentioned. So usually client caches are TTL'ed. E.g. the very browser you are using to see this comment has a lot of things cached. DNS is not going to send an invalidate event to your browser. It's TTL based.
I guess what I am saying is that cache invalidation rarely applies to client-side caches as far as I know. Maybe you have a different example, which we can discuss.
It doesn't matter whether the cache is co-located with the "client" that ultimately uses the data. Say A is a database, and B1, B2, B3... are memcached servers. The exact same situation applies.
> So on update (e.g. when you mutate A, you know all the Bs to invalidate).
But "knowing" this is a big part of what people mean when they say cache invalidation is hard! If the value in B is dependent on a complicated function of A's state, then it may be difficult to automatically determine, for any given mutation to A, which parts of B's cache need to be invalidated.
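A hypothetical sketch of what that "knowing" requires: for every mutation of A you effectively need an inverse of the derivation, mapping each source key to the derived cache entries that read it. Missing one edge means a silently stale entry. (All keys and names below are made up for illustration.)

```python
# Hypothetical sketch: if cache entries are derived from A by a complicated
# function, then for every mutation you must answer "which derived keys does
# this touch?" -- effectively maintaining an inverse of the derivation.

# Derived cache entries and the source keys each one reads (its dependency set).
dependencies = {
    "top_posts:us": {"post:1", "post:7", "geo:us"},
    "profile_card:42": {"user:42", "photo:42"},
}

# Inverse index: source key -> cache keys to invalidate when it changes.
inverse = {}
for cache_key, sources in dependencies.items():
    for s in sources:
        inverse.setdefault(s, set()).add(cache_key)

def on_mutation(source_key):
    # The hard part in practice: keeping `dependencies` (and thus `inverse`)
    # correct as the derivation logic evolves.
    return inverse.get(source_key, set())

print(on_mutation("photo:42"))  # {'profile_card:42'}
```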
> There are some attempts to do invalidation-based cache on clients, but they are hard because of the reason I just mentioned. So usually client cache is TTL'ed.
Given this statement, the original title of this submission is even more baffling. If you recognize that data can be cached in clients, and that invalidating those caches is so hard that most systems -- including yours -- just completely abandon the goal of being able to do it correctly/reliably, then how can you claim your system makes it no longer a hard problem?
> But "knowing" this is a big part of what people mean when they say cache invalidation is hard!
I can see that. Memcache is a look-aside cache we run at scale. There are abstractions on top to manage this complexity, and it has been fairly manageable. I am sure you can come up with a complicated dependency tree where things are not obvious at all. But when you do have a very large dependency tree, any change in it can trigger cache invalidation, at which point caching with TTL will be a better option IMO. I can see where you are coming from.
> If you recognize that data can be cached in clients, and that invalidating those caches is so hard that most systems
My reasoning for this being hard is different than yours, I think. In my comment, it's due to the non-determinism of the cluster membership. I think in that case, we are talking about different problems.
> Given this statement, the original title of this submission is even more baffling. If you recognize that data can be cached in clients, and that invalidating those caches is so hard that most systems -- including yours -- just completely abandon the goal of being able to do it correctly/reliably, then how can you claim your system makes it no longer a hard problem?
I see where you are coming from; and I don't disagree. I just want to clarify a few things I talked about.
I know the tradeoff, but I didn't consider the tradeoff specifically to be what makes cache invalidation hard (I thought of it as a distributed-systems challenge in general). In that context, I brought up TTL, as I think it makes the better tradeoff in that scenario. Again, cache invalidation can be done (in some cases with unbounded write/invalidation amplification). It's just that IMO it's not worth it in that case.
It's a good observation that out-of-orderness and asynchrony contribute a lot to the challenge.
> the inconsistency between your cache nodes is a consequence of the underlying design of your system
I disagree with this. First of all, imposing order and synchrony is expensive, slow and unreliable. Even if we do that, cache fill and invalidation will never be externally coordinated; and will collide on the cache host.
> > the inconsistency between your cache nodes is a consequence of the underlying design of your system
> I disagree with this. First of all, imposing order and synchrony is expensive, slow and unreliable. Even if we do that, cache fill and invalidation will never be externally coordinated; and will collide on the cache host.
This is a weird reply. GP writes "A implied B" and you disagree by arguing that you had to do A. Sure. But it's still A that implied B. Quite a fundamental logic flaw, no?
I don't see that. Yes I did say I had to do A. Then I said that it doesn't imply B. Because even if the underlying system is strongly ordered, we can't order external events. Did I miss something?
There are multiple models for cache consistency. What I'm saying is that it seems this tracing library is built on a particular model of cache consistency, and that if it doesn't find any issues, then it regards the cache as consistent. The question is, how much does that model accurately reflect actual cache behavior in the live system, and how much does it simply test that the cache behaves according to its internal model?
Correct me if I'm wrong, but the title seems a little clickbaity. "Cache invalidation might no longer be hard" --> "We built a tracing library and manually fixed the bugs it uncovered"
"A little clickbaity" if you are generous. Short-sighted and grandiloquent if you are not.
Short-sighted because it narrows the original problem (cache invalidation) to a specific version (detecting cache invalidation errors past an arbitrary time-frame).
Grandiloquent because it then claims to have "solved" the general problem, whereas they haven't even solved the narrowed version. Notice their own premise:
> For any anomaly in a stateful service, it is an anomaly only if clients can observe it one way or the other. Otherwise, we argue that it doesn’t matter at all.
This is a practical attitude that can yield great results, and probably even good engineering practice. However, it will never lead to solving a problem, because a problem isn't solved until you prove that it is.
> For any anomaly in a stateful service, it is an anomaly only if clients can observe it one way or the other. Otherwise, we argue that it doesn’t matter at all.
Thanks for the comment! I stand by this claim; I would love to hear more about why an anomaly (any anomaly) matters if it can't possibly be observed by any client of a stateful service.
The claim itself is fine because it is kind of tautological. An error that cannot be observed in any way is probably not an error. However, you have to prove that it can't be observed.
This is quite different than what the described system does though. The system flags errors that have been observed. The correct claim to go with what the article does would thus be:
> For any anomaly in a stateful service, it is an anomaly only if the current clients running their current operations happen to observe it. Otherwise, we argue that it doesn't matter at all.
The exact same argument applies to normal exceptions and errors. It might be the case that out of quadrillions of queries a day, just none of them hit the error path, and hence we won't observe any issues.
I can see that in this case, if it matters or not becomes somewhat subjective. It's a good point though!
I am serious when I say cache invalidation might no longer be a hard thing in computer science. In the post, I explained why cache invalidation is hard; and how we solve/manage its unique challenge and complexity. By my definition, we are solving the cache invalidation problem.
The analogy I have is Paxos. It's notoriously hard to implement Paxos correctly. Google published a paper on Paxos Made Live just on how they managed the implementation and productionization complexity.
Sure - you typically cache the result of some expensive query.
The hard part of cache invalidation is to detect when an update somewhere in your system is going to affect the result of that query, such that you need to trigger an invalidation.
We have memcache which is a look-aside cache that serves this type of workload. What you described can be solved by adding one level of abstraction and letting reads and writes all go through it.
Now on your read path, you can construct arbitrary complex sql queries or whatnot, but it must take some kind of input to filter on. Those become part of the "keys".
The invariant is that as long as the "context/filter" you encode covers all the mutations which would impact your cached data, you should be good. Based on our experience, it has been fairly manageable.
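As a rough sketch of that pattern (hypothetical helper names, not Meta's actual abstraction): the filters that parameterize the read become part of the cache key, and the write path invalidates by constructing the same key from the mutation.

```python
# Sketch of the look-aside pattern described above. All names are hypothetical.
cache = {}

def cache_key(table, **filters):
    # The filters are the "context" that must cover every mutation affecting the data.
    return table + "|" + "&".join(f"{k}={v}" for k, v in sorted(filters.items()))

def read_orders(db_query, user_id, status):
    key = cache_key("orders", user_id=user_id, status=status)
    if key not in cache:                     # look-aside: a miss goes to the database
        cache[key] = db_query(user_id=user_id, status=status)
    return cache[key]

def write_order(db_write, user_id, status, order):
    db_write(order)                          # update the source of truth first
    # Then invalidate every cached read whose key this mutation can affect.
    cache.pop(cache_key("orders", user_id=user_id, status=status), None)

# Example usage with a stand-in "database":
orders = [{"user_id": 1, "status": "open", "id": 7}]
get = lambda user_id, status: [o for o in orders
                               if o["user_id"] == user_id and o["status"] == status]
print(read_orders(get, user_id=1, status="open"))   # miss -> database
print(read_orders(get, user_id=1, status="open"))   # hit  -> cache
```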
Let's actually also try to solve the "cache invalidation" by your definition.
E.g. in its most generic form, a cache can store arbitrary materializations from any data source. Now when updating the data source, in order to keep caches consistent, you essentially need to transact (a cross-system transaction) on both the data source and the cache(s). Usually the cache has more replicas, and I am not sure running this type of transaction is practical at scale.
What happens if we don't transact on both systems (the data source and the cache)? Well, now whenever the asynchronous update pipeline performs the computation, it's done against a moving data source (not a snapshot of when the write was committed). Now let's say the data source is Spanner, which provides point-in-time snapshots. On a Spanner commit you get a commit time (TrueTime) back. Using that commit time, reading the data and computing the cache update asynchronously can be done.
Because the materializations in cache don't take writes themselves, updates to them are essentially performed blindly and can be ordered by the commit time (TrueTime).
Now this does assume whatever we cache (the query e.g.) needs to be schematized, and made known to the invalidation pipeline (in the form of some control plane metadata). I think it's a very fair assumption to make. As otherwise (anyone can cache anything without the invalidation pipeline knowing at all), it's pretty obvious that this problem can't be solved.
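A small sketch of that ordering idea (not Spanner's actual API, just the principle): each cache entry remembers the commit timestamp it was derived from, and asynchronous updates are applied blindly but only if they carry a newer timestamp, so out-of-order delivery still converges to the newest materialization.

```python
# Sketch of ordering blind cache updates by commit timestamp. Hypothetical names.
cache = {}  # key -> (commit_ts, value)

def apply_async_update(key, commit_ts, value):
    current_ts, _ = cache.get(key, (-1, None))
    if commit_ts > current_ts:          # older, delayed updates lose and are dropped
        cache[key] = (commit_ts, value)

# Updates may arrive out of order; the cache still converges to the newest one.
apply_async_update("friends_of_friends:42", commit_ts=105, value=["bob", "eve"])
apply_async_update("friends_of_friends:42", commit_ts=100, value=["bob"])   # ignored
print(cache["friends_of_friends:42"])   # (105, ['bob', 'eve'])
```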
Throw a few bits of conditional logic onto that fire while you're at it.
Some cache entries depend on the state of rows in Table C, but for some combinations of Table A and Table B, no data from Table C ends up in the cache entry.
All of this is business logic, and now it's touching one of our supposedly low-level libraries, which is separated by at least one level of indirection from the rest of the business logic - if your architecture is good. But caching tends to rot architecture.
> Some cache entries depend on the state of rows in Table C, but for some combinations of Table A and Table B, no data from Table C ends up in the cache entry.
This is a good example. Let's talk about it. Say we have a table for "friends", and a separate materialization, in cache(!), for "friends-of-friends-who-live-in-us". First of all, the query for the cached data needs to be schematized and made known to the data source. Otherwise, if a client can cache arbitrary materializations of anything, it would be obvious that, in its most generic form, the problem can't be solved.
Now assume the data in cache is schematized. There's a transaction that changes the "friends" table. Now we have two options. One is that, within the same transaction (cross-system two-phase commit, for example), we update the cache (the one that stores "friends-of-friends-who-live-in-us"). This is the synchronous flavor of it, which has obvious scaling challenges.
Spanner, etc. are about handling the async flavor of the same logic.
> if your architecture is good. But caching tends to rot architecture.
My speculation is that sometimes people use a cache without knowing they are dealing with a distributed system. A cache is by nature part of a distributed system (because there's the cache and the source of truth).
The linked blog post is targeted towards cache service owners, not cache users (who put materializations in cache). If I learned anything from this public civil discourse on Hacker News, it is that:
1. we should avoid putting the power/responsibility of cache invalidation in the hands of cache users
2. we should provide guard rails via better abstractions and simpler data models (e.g. a graph data model)
3. we should avoid caching relations (separate materializations) and prefer caching indices instead, so the write/invalidation amplification is bounded
I also think it can be solved by changing the data model.
Say you are caching the result of joining two tables with two ids that you are filtering on. It's still very manageable to track the dependency and know when to invalidate. But it can easily grow out of hand (think 10-table joins and 100 lines of SQL). Then solving the "when/who to invalidate" problem is essentially equivalent to doing the "joins" on the write/invalidation path. First of all, it's unbounded: the number of cache entries you need to invalidate can be unbounded (not bounded by the number of indices, but a function of the data in the database instead). My argument is: why do this to begin with? I acknowledge this is hard.
But why do it? On the other hand, you can have simpler data models (e.g. TAO), fetching and stitching everything together on the read path scales fairly well. It's essentially doing "joins" on the read path. But it's all hitting caches, so it's fast still.
For some complicated queries, you can cache secondary indices (which is easier to figure out the "when/who" question, just as how DB figures out which index entry to update on transactions) to make your read-path join faster.
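For illustration, a hypothetical sketch of that approach: cache adjacency lists (indices) keyed by a single id, do the "join" (friends-of-friends) on the read path by stitching cached entries together, and note that a write then only invalidates a bounded number of cache entries.

```python
# Hypothetical sketch of read-path joins over cached indices (TAO-style idea).
friends_cache = {          # id -> cached friend list (an index, not a relation)
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
}

def friends_of_friends(user):
    # Read-path join: stitch cached index entries together per request.
    result = set()
    for friend in friends_cache.get(user, []):
        result.update(friends_cache.get(friend, []))
    result.discard(user)
    return result

def add_friend(a, b):
    # Only two cache entries can be affected by this write, no matter how many
    # friends-of-friends results it would have changed -- bounded amplification.
    friends_cache.pop(a, None)
    friends_cache.pop(b, None)

print(friends_of_friends("alice"))  # {'dave'}
```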
I can see that. One can argue that the "difficulty" is front-loaded: whether or not you can identify the list of "contexts" is a local decision. I guess you can probably come up with complicated dependencies and argue it's hard to capture them, and I would agree with you.
Now getting back to the “what’s really hard about cache invalidation” part. Even with a much simpler model. Say you just have a k/v store, no joins, nothing. Is cache invalidation simple in that case?
I went into details about why it’s still insanely hard. And the big example at the end might help make that point.
And this is one level beneath challenges from tracking dependencies, and I argue that’s what makes cache invalidation hard.
Now going back to your example with complicated dependencies. Maybe TTL is a better solution. With many dependencies, any change in the dependency list might trigger invalidation. At some point, just doing TTL would be simpler.
Accepted. Just to be clear, it's not link bait as far as I understand it, for two reasons:
1. I am dead serious about making cache invalidation a simpler problem
2. “Cache invalidation might no longer be a hard problem in Computer Science” is the subtitle of the corresponding systems @scale talk I just gave.
I respect the guidelines and I am OK with the rename. I just want to clarify that I don’t think the old title is misleading.
It's still massively click-bait until such time as the claim isn't pure speculation. Also, casually optimistic claims about complex and hard problems are the essence of click-bait. There isn't much unique about your instance of it.
> until such time as the claim isn't pure speculation
Which I don't think it is.
> casually optimistic claims about complex and hard problems are the essence of click-bait
A group of people at Facebook worked on this for many years. TAO and Memcache serve quadrillions of queries a day. I wanted to make this claim – cache invalidation might no longer be a hard problem in Computer Science – years ago; but I didn't because I didn't want to jump the gun. We baked the solution for years. There's nothing casual about this.
Just about all papers and write-ups that describe big advances on important problems aren't titled 'Big advance in Problem X' or 'X now an ex-problem'.
I am glad you caught that! I am personally very very excited about this. Because if you think about it, distributed systems and cache coherence problems on a multi-core system are essentially the same thing.
If C++'s memory model can abstract away multiple cachelines, why can't we steal the idea and abstract away distributed system details? The single keyword of distributed systems is "order". And that's exactly what `std::memory_order` captures, which is not a coincidence.
You can DM me and I would love to chat. I am @uvdn7 on twitter.
This seems to be a pure click-bait title; cache invalidation will always be hard since it's basically a balanced give-and-take issue which, simple logic dictates, can't be avoided.
I went into details in the post about what I believe is the root cause of why cache invalidation is hard, which seems different than what you are saying.
Based on the wide disagreement over whether or not this solves cache invalidation, this post provides strong evidence that the hardest bit is in fact naming things.
This would have been a really good article in itself, without the frankly unbelievable hubris of the author.
TL;DR this is a blogpost about some very interesting problems with distributed cache consistency and observability. The article doesn't address cache invalidation and - based on their replies here - the author doesn't seem to understand the cache invalidation problem at all.
On the definition of cache invalidation, and specifically why it's hard: can you send me a link to any definition of it? This is what's in Wikipedia, and I think it's reasonable.
> Cache invalidation is a process in a computer system whereby entries in a cache are replaced or removed.
And I think I am describing that process, and what's hard about it. Some comments here explicitly talk about dependencies, which I can see why it's hard. My point is that even without dependencies, cache invalidation remains a hard problem. Now about dependency tracking, some of my thoughts are captured here https://news.ycombinator.com/item?id=31674933.
Numerous replies to your comments here have pointed out why cache invalidation is hard. You've responded to them by saying "Good point!" (always with an exclamation point), and then proceeded to demonstrate in your response that you didn't understand their point.
First, you acknowledge that they're correct, though you use obtuse language to say so: you seem to like using the very abstract term "dependency" to represent the very simple concept of detecting updates.
In your second paragraph you then go off on an unrelated tangent by saying:
> Now getting back to the “what’s really hard about cache invalidation” part.
No. You're not "getting back" to that - you're changing the subject back to the topic of your article, which is unrelated to cache invalidation.
> Say you just have a k/v store, no joins, nothing.
Cache invalidation is about invalidation - it's not about your store architecture. It's not about the implementation of logic that processes the clearing/overwrite of values, it's about the when and nothing else.
> just doing TTL, would be simpler.
Yes. It is simpler. If you want to avoid solving the hard problem, you can use a TTL. Now... how long should it be?
Let's solve the "cache invalidation problem" by your definition.
E.g. in its most generic form, a cache can store arbitrary materializations from any data source. Now when updating the data source, in order to keep caches consistent, you essentially need to transact (a cross-system transaction) on both the data source and the cache(s). Usually the cache has more replicas, and I am not sure running this type of transaction is practical at scale.
What happens if we don't transact on both systems (the data source and the cache)? Well, now whenever the asynchronous update pipeline performs the computation, it's done against a moving data source (not a snapshot of when the write was committed). Now let's say the data source is Spanner, which provides point-in-time snapshots. On a Spanner commit you get a commit time (TrueTime) back. Using that commit time, reading the data and computing the cache update asynchronously can be done.
Because the materializations in cache don't take writes themselves, updates to them are essentially performed blindly and can be ordered by the commit time (TrueTime).
Now this does assume whatever we cache (the query e.g.) needs to be schematized, and made known to the invalidation pipeline (in the form of some control plane metadata). I think it's a very fair assumption to make. As otherwise (anyone can cache anything without the invalidation pipeline knowing at all), it's pretty obvious that this problem can't be solved.
First of all, I do think you folks make good points. Let me try again.
> The invariant is that as long as the "context/filter" you encode covers all the mutations which would impact your cache data, you should be good.
I wrote this line; in the comment you linked, this is referred to as being exactly why cache invalidation is hard.
> it's about the when and nothing else.
Let's talk about that. Not to over-generalize this, but with a simpler cache model (say you just cache a single item), do you agree that solving the "when" problem is very manageable? If not, I would like to be enlightened.
Now take this very simple cache model, where we have "magically" solved the "when" problem. Do you think cache invalidation is solved? Or that it's simple? After knowing when, you still need to actually update the cache, right, and not leave it in an inconsistent state (against the source of truth). Is that simple?
Let's essentially break cache invalidation into a few parts
1. knowing when/who to invalidate
2. actually processing the invalidate
My argument is that #1 can be very manageable with simpler data models. #2 can't be avoided; and #2 is very hard.
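A toy interleaving (not the protocol from the post, just the failure mode) of why #2 is hard even for a single key: a slow fill that read the old value can land after the invalidation has already been processed, leaving the stale value in cache indefinitely.

```python
# Toy interleaving showing why "actually processing the invalidate" (#2) is the
# hard part even for a single key. All names are hypothetical.
db = {"k": "v1"}
cache = {}

# 1. A cache miss starts a fill and reads the old value from the database.
fill_value = db["k"]                  # reads "v1"

# 2. Meanwhile a write updates the database and sends an invalidation.
db["k"] = "v2"
cache.pop("k", None)                  # the invalidate is processed -- nothing to drop

# 3. The slow fill finally completes and writes its stale read into the cache.
cache["k"] = fill_value

print(cache["k"], db["k"])            # v1 v2 -- stale until the next invalidation,
                                      # or forever if nothing touches "k" again
```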
> with a simpler cache model (say you just cache a single item), do you agree that solving the "when" problem is very managable?
The "when" problem is dependent on your application architecture, not on your cache backend nor the number of keys in it.
If, overall, you have an extremely simplistic application architecture, then cache invalidation may be quite easy, but you'll either:
1. forgo advanced user interaction or dynamic updates, in which case cache invalidation may not even be required at all (excepting publishing)
2. have scalability problems, and need to increase the complexity of your application to meet those challenges
The difficulty of cache invalidation scales with the complexity of your application (and not necessarily linear scaling)
> My argument is that #1 can be very managable with simpler data models. #2 can't be avoided; and #2 is very hard.
Yes, #1 can be manageable for low-traffic, simple, static applications. It is a general problem whose difficulty relates to the complexity of the application.
Yes, #2 can't be avoided. But, while it is interesting, and - in some cases, given a specific caching stack - it may be relatively hard, it's not a general problem. Difficulties with it are implementation-specific, not broadly applicable. Significantly, it's not the general and fundamental hard problem being referred to when people talk about the universal difficulty of "cache invalidation".
> The "when" problem is dependent on your application architecture, not on your cache backend nor the number of keys in it.
I would argue the "when" problem is dependent on data models. And it's possible that we are referring to the same thing with different names.
> If, overall, you have an extremely simplistic application architecture
I mean FB is not a simple app. A complicated app can be built on a relatively simple data model as well (that's essentially how TAO works). But we do have memcache as well; and in some cases it can be complicated/hard. I can see that, and I won't argue against it.
> Yes, #2, can't be avoided. But, while it is interesting, and - in some cases, given a specific caching stack - it may be relatively hard, it's not a general problem.
I respectfully disagree. I explained in the blog post why #2 is a generally hard problem. The analogy I like to use is Paxos. The protocol fits on a single slide. It's easy to feel like you have Paxos working; but it's very hard to have Paxos actually work.
> No, we don't sell your information. Instead, based on the information we have, advertisers and other partners pay us to show you personalized ads on the Facebook family of apps and technologies.
A lot can be said about Meta but I find it counterproductive to make up facts. It's not correct that they "sell personal details".