See also Noria, https://pdos.csail.mit.edu/papers/noria:osdi18.pdf - later research with a similar thought process, which claims to scale better than differential dataflow. The summary is: "let's make a timely, distributed dataflow graph computation to keep SQL queries up-to-date, but this time we can forget some parts of the query to save on cache memory". Now productionized by https://readyset.io/
The nice thing about these systems (Materialize/Timely/Differential-Dataflow/Noria/ReadySet) is that the invalidation pipeline is automatically derived from the query structure. The user just needs to know how to write their query (regular SQL with Materialize/ReadySet), pipe in the source-of-truth changelog from all the tables in the database, and the system takes care of the rest. Your answer is somewhat vague about how to build the invalidation pipeline. I think that's what many people find challenging about cache design.
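To make "the invalidation pipeline is derived from the query structure" concrete, here's a toy Python sketch. It is not Materialize's or ReadySet's actual API (there the interface is plain SQL plus a changelog feed); it just shows the shape of the idea: the user supplies a query (here a grouped count) and the changelog, and the incremental maintenance logic falls out mechanically from the query's structure.

```python
# Toy sketch only: a materialization of `SELECT key, COUNT(*) FROM t GROUP BY key`
# kept up to date from a changelog, with no full recomputation on writes.
from collections import defaultdict

class CountByKeyView:
    """Incrementally maintained grouped count, derived from the query shape."""
    def __init__(self):
        self.counts = defaultdict(int)

    def apply(self, event):
        # Changelog events are ("insert" | "delete", row) pairs.
        op, row = event
        delta = 1 if op == "insert" else -1
        self.counts[row["key"]] += delta          # incremental update
        if self.counts[row["key"]] == 0:
            del self.counts[row["key"]]

view = CountByKeyView()
for ev in [("insert", {"key": "a"}), ("insert", {"key": "a"}), ("delete", {"key": "a"})]:
    view.apply(ev)
print(dict(view.counts))   # {'a': 1} -- always consistent with the changelog so far
```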
Yep. To generalize this even more: it's not that cache invalidation is hard, it's that materialization/denormalization is hard (even when the materialization is a table in the same database). That's why I never thought this is what makes cache invalidation hard; it isn't even unique to caches.
I can imagine an automated solution to achieve what you described, e.g. by keeping track of the table:columns touched by the query. Maybe that's what these systems do; a rough sketch of what I mean is below.
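A hedged sketch of that table:column dependency tracking, in Python. The cache keys, table names, and registry are made up for illustration, and I'm not claiming any of these systems implement it this way:

```python
# Illustrative only: record which (table, column) pairs each cached query read,
# then drop the dependent cache entries whenever a write touches those columns.
from collections import defaultdict

class DependencyInvalidator:
    def __init__(self):
        self.deps = defaultdict(set)   # (table, column) -> cache keys that read it
        self.cache = {}

    def register(self, cache_key, result, touched):
        """Store a query result plus the (table, column) pairs it depended on."""
        self.cache[cache_key] = result
        for table, column in touched:
            self.deps[(table, column)].add(cache_key)

    def on_write(self, table, columns):
        """A write touched these columns: invalidate every dependent entry."""
        for column in columns:
            for key in self.deps.pop((table, column), set()):
                self.cache.pop(key, None)

inv = DependencyInvalidator()
inv.register("top_posts", ["p1", "p2"], {("posts", "score"), ("posts", "title")})
inv.on_write("posts", ["score"])       # any score update invalidates "top_posts"
assert "top_posts" not in inv.cache
```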
It has its downsides though, as the invalidation cost in this case can be unbounded (read: huge write amplification in some cases), and I imagine it would be hard to provision the invalidation pipeline for. This is where it gets into trade-offs. E.g. by just caching the primary index and secondary indices (not materializations) you can get very far as well (do joins on read; see the sketch below). But it depends on the workload.
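For contrast, a rough sketch of that "cache only the indices, join on read" alternative (table and key names are illustrative only): each write invalidates a bounded number of per-row entries instead of fanning out to every materialized result, at the cost of doing the join work at read time.

```python
# Illustrative only: primary-key row caches plus one secondary index,
# with the join stitched together per request instead of materialized.
users = {}                 # primary index: user_id -> user row
posts = {}                 # primary index: post_id -> post row
posts_by_user = {}         # secondary index: user_id -> set of post_ids

def write_post(post):
    posts[post["id"]] = post                                   # O(1) invalidation
    posts_by_user.setdefault(post["user_id"], set()).add(post["id"])

def posts_with_author(user_id):
    # "Join on read": combine the cached rows at request time.
    return [{**posts[pid], "author": users[user_id]["name"]}
            for pid in posts_by_user.get(user_id, set())]

users["u1"] = {"id": "u1", "name": "alice"}
write_post({"id": "p1", "user_id": "u1", "title": "hello"})
print(posts_with_author("u1"))
```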