See also Noria, https://pdos.csail.mit.edu/papers/noria:osdi18.pdf - later resea...

uvdn7 · on June 9, 2022

Looks like the separate materialization (what's stored in cache) is schematized?

I will take a closer read. It seems similar to what I described here https://news.ycombinator.com/item?id=31676849.

jitl · on June 9, 2022

The nice thing about these systems (Materialize/Timely/Differential-Dataflow/Noria/ReadySet) is that the invalidation pipeline is automatically derived from the query structure. The user just needs to know how to write their query (regular SQL with Materialize/ReadySet), pipe in the source-of-truth changelog from all the tables in the database, and the system takes care of the rest. Your answer is somewhat vague about how to build the invalidation pipeline. I think that's what many people find challenging about cache design.

uvdn7 · on June 9, 2022

Yep. To generalize this even more, it’s not that cache invalidation is hard but “materialization/denormalization is hard” (even if it’s a table on the same database). Hence why I never thought this is what made cache invalidations hard (because it’s not even unique to cache).

I can imagine an automated solution to achieve what you described eg by keeping track of the table:columns touched by the query. Maybe that’s what these system did.

It has its downside though, as the invalidation cost in this case can be unbounded (read as huge write amplification in some cases). I imagine it would be hard to provision the invalidation pipeline. This is where it gets into trade offs. Eg by just caching the primary index and secondary indices (not materializations) you can very far as well (do joins on read). But it depends on the workload.