Caches, Modes, and Unstable Systems (brooker.co.za)
64 points by yarapavan on Aug 29, 2021 | 14 comments



It starts off well, and then just as it gets going it ends with 'more on this later'. It reads like the first installment of a much longer series.

It would probably be a good idea to write the article with fewer assumptions about who the audience is: explain why caches are considered a good idea in the first place, don't mix CPU caches and web intermediary result caches as though they are one and the same, and cover the concepts of invalidation and affinity before moving on to things like feedback loops for measuring cache efficiency/effectiveness.


The way I look at it, the article has two main messages.

The first message is that real systems comprise "core" components that are expected to work in all cases (e.g., databases, or CPU arithmetic units and DRAM memory) and "uncore" components that are just performance optimizations (caches, prefetchers, load estimators, whatever) and are expected to be ineffective in certain cases. The point is that, while "uncore" components may well provide a 100x performance improvement, the designer must make sure that the dynamics of the system are stable even if these components fail completely. A related point is that you cannot infer the dynamics of the core by looking at production metrics alone: you must either have a deep analysis of the system, or a provably robust algorithm that works in all cases, or test cases that explicitly disable caches or whatever other uncore is relevant.
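For the last point, a minimal sketch of what such a test might look like, in Python, with a made-up System class and a hypothetical cache_enabled knob (none of this is from the article):

    import time

    # Hypothetical sketch: exercise the "core" path with the "uncore"
    # (here, a cache) disabled, and check that its dynamics still meet
    # the latency target. The System class is a stand-in.

    class System:
        def __init__(self, cache_enabled=True):
            self.cache_enabled = cache_enabled
            self.cache = {}

        def handle(self, key):
            start = time.monotonic()
            if not (self.cache_enabled and key in self.cache):
                time.sleep(0.001)          # slow path: pretend to hit the database
                self.cache[key] = True
            return time.monotonic() - start

    def test_core_alone_meets_slo(slo_seconds=0.01):
        system = System(cache_enabled=False)
        for key in range(100):
            assert system.handle(key) <= slo_seconds, "core alone misses the SLO"

    test_core_alone_meets_slo()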

The second message is that real systems have metastable states, which surprises people. "Metastable" means that the system runs fine for a while, then something bad happens, the system goes into a bad state, and it stays in the bad state even after the bad stimulus is removed. Historically, one of the first examples is the infamous Internet congestion collapse of 1986, which prompted the implementation of TCP congestion control. The feedback loop in that case had nothing to do with caches. What happened is that routers drop packets when the offered rate is too high; the drops cause TCP to retransmit, which causes more drops. This feedback loop is self-sustaining and persists even if you remove the initial overload that caused the first packet drops. See https://ee.lbl.gov/papers/congavoid.pdf for ways to avoid this problem, which are more broadly applicable than just TCP.
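For a feel of how the fix in that paper works, here is a rough additive-increase/multiplicative-decrease (AIMD) sketch in Python; the constants and names are illustrative, not taken from the paper:

    # Illustrative AIMD sketch: grow the send window slowly while things
    # are healthy, cut it sharply on any sign of congestion (a drop), so
    # the drop -> retransmit -> more drops loop can't sustain itself.

    def aimd_update(cwnd, loss_detected, increase=1.0, decrease=0.5, floor=1.0):
        if loss_detected:
            return max(floor, cwnd * decrease)   # multiplicative decrease
        return cwnd + increase                   # additive increase

    cwnd = 1.0
    for loss in [False, False, False, True, False, False]:
        cwnd = aimd_update(cwnd, loss)
        print(f"cwnd={cwnd:.1f}")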

Specifically w.r.t. congestion collapse, my experience is that many smart engineers have never heard of it, and I stopped counting how many outages have been caused by congestion-related problems. So there is something to be learned here.


The first is absolutely impossible: the fact that performance changes is in many situations already a problem, even if the change is an improvement, because it can lead to security issues. Caching is extremely subtle and, due to non-local effects, has all kinds of implications that are often poorly understood or only come out after a long time (sometimes decades, as in the case of the Spectre-type bugs, which affected CPUs across many generations).

Caching is hard. The benefits are too large to ignore, but the downsides are extensive and can bite you when you least expect it.

As for that particular interesting tidbit, see also: thundering herd and exponential back-off.
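For anyone unfamiliar, a minimal sketch of exponential back-off with jitter, the usual mitigation for thundering-herd retries (the parameters here are illustrative):

    import random

    # Retry with exponentially growing, jittered delays so that many
    # clients failing at the same moment don't all retry in lockstep.
    def backoff_delay(attempt, base=0.1, cap=30.0):
        # "Full jitter": pick uniformly between 0 and the capped exponential.
        return random.uniform(0, min(cap, base * (2 ** attempt)))

    for attempt in range(6):
        print(f"attempt {attempt}: sleep {backoff_delay(attempt):.2f}s")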


A related read is Reliability, constant work, and a good cup of coffee - https://aws.amazon.com/builders-library/reliability-and-cons...


See also (from 2014):

Running Multiple HTTP Endpoints as a Highly Available Health Proxy, https://aws.amazon.com/blogs/architecture/running-multiple-h...

Doing Constant Work to Avoid Failures, https://aws.amazon.com/blogs/architecture/doing-constant-wor...

A Case Study in Global Fault Isolation, https://aws.amazon.com/blogs/architecture/a-case-study-in-gl...

Organizing Software Deployments to Match Failure Conditions, https://aws.amazon.com/blogs/architecture/organizing-softwar...

Shuffle Sharding: Massive and Magical Fault Isolation, https://aws.amazon.com/blogs/architecture/shuffle-sharding-m...

AWS and Compartmentalization, https://aws.amazon.com/blogs/architecture/aws-and-compartmen...


As an example of this in distributed systems:

There's a classic metastability issue in probably most implementations of state machine replication protocols such as Raft, where a lagging follower, if it sees an op that's newer than what it's expecting, must first repair and catch up its state (FIFO repair) before it can ACK back to the leader.

Apart from introducing latency into the critical path, this can lead to really bad queueing behavior: the lagging follower queues the latest request from the leader while it first catches up, but because this catch-up can take seconds or minutes, the pending request queue overflows in the meantime, and now we're back to state-transfer catch-up all over again, a vicious cycle.

However, there is a new approach (LIFO repair) that we developed for TigerBeetle [1] to eliminate this bimodality, maintain (and even reinforce) safety and achieve constant ACK latencies from all followers, no matter their state. If you're curious as to how this works, we'll be talking about this LIFO repair as part of Viewstamped Replication Made Famous [2], a $20,000 consensus challenge launching in September.

[1] https://www.tigerbeetle.com

[2] https://github.com/coilhq/viewstamped-replication-made-famou...


I am looking forward to learning about the LIFO repair when you guys decide to talk about it.

However, the fact that Raft is goofy and conflates safety, liveness, and dynamics doesn't mean that the metastability problem hasn't been solved in a more traditional FIFO setup.

The basic issue is that things like Raft (and most other descriptions in the literature as well) describe replication in terms of RPC: a request arrives at the leader, the leader sends three RPCs to followers, waits for two acks, etc. If you actually implement things this way (and many people do), then you end up with the problem that you describe.

However, you can organize the system in a different way. A request arrives at the leader and is appended to the leader's proposer log. A separate asynchronous protocol reconciles the proposer log with the followers' acceptor logs. This protocol looks like TCP, with some caveats to account for log divergence, so it is pretty well-understood technology. By dimensioning log sizes and timeouts correctly, you can deamortize the whole catch-up procedure and bound the latency of every operation as seen from the client (assuming that a quorum of acceptors is available).
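Roughly, the shape being described might look like this in Python; the Leader/Follower classes and names are mine, purely for illustration, and acks/quorums are omitted:

    import queue

    # The client-facing path only appends to the leader's proposer log;
    # a separate, asynchronous loop reconciles that log with the
    # followers' acceptor logs, so slow followers never block submits.

    class Follower:
        def __init__(self):
            self.acceptor_log = {}

        def receive(self, index, op):
            self.acceptor_log[index] = op    # ack handling omitted

    class Leader:
        def __init__(self, followers):
            self.proposer_log = []           # bounded in a real system
            self.followers = followers
            self.pending = queue.Queue()

        def submit(self, op):
            # Constant-time append; never waits on a lagging follower.
            self.proposer_log.append(op)
            self.pending.put(len(self.proposer_log) - 1)

        def reconcile_once(self):
            # In a real system this runs continuously in the background,
            # much like TCP retransmission, shipping missing entries.
            index = self.pending.get()
            for f in self.followers:
                f.receive(index, self.proposer_log[index])

    leader = Leader([Follower(), Follower()])
    leader.submit("op-1")
    leader.reconcile_once()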


Yes, and in fact one of the things I like about Viewstamped Replication is that it does not describe replication in terms of RPC, but rather in terms of message passing, which I believe maps better to the domain of consensus, and which nudges people in the right direction when they start to think about their messaging protocol.

Thanks, hope to see you at the live launch event in September! All the details are in the repo linked above.

Beyond some of these new techniques, what I'm really most excited about for the live launch event in September is the back-to-back interviews with Brian Oki (Viewstamped Replication 1988) and James Cowling (Viewstamped Replication Revisited 2012). It's going to be a special moment, celebrating the pioneering consensus and replication protocol and the people behind it.


I like the intro style of the article, but I also found that it glossed over things a bit and presented some leading assumptions as a given. "Goodput" is not helpfully distinguished here, and is also explained in a way that is contrary to convention.

Throughput is throughput. Goodput is different: it counts only the useful application-level data actually delivered, excluding protocol overhead and retransmissions.


It's worth exploring alternatives to read-through caches for this reason; for example, we populate caches when the underlying data changes. It introduces alternative challenges, of course, such as deciding whether or not to cache a particular key.
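A minimal sketch of that write-time population idea (as opposed to read-through); the cache and db dicts here are stand-ins for real stores:

    # Populate the cache when the data changes, instead of reading
    # through on a miss; misses then fall back to the database without
    # warming the cache, so database load stays predictable.

    cache = {}
    db = {}

    def should_cache(key):
        # The "alternative challenge" mentioned above: deciding which
        # keys are worth caching at all. Placeholder policy: all of them.
        return True

    def write(key, value):
        db[key] = value
        if should_cache(key):
            cache[key] = value               # populate at write time

    def read(key):
        return cache.get(key, db.get(key))   # no read-through population

    write("user:42", {"name": "Ada"})
    print(read("user:42"))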


Nice read. Just curious: isn't this handled via rate throttling at the load-balancer/gateway level? I.e., only forward requests to the database at the rate it can handle?


That introduces a "mode" of its own. In fact, load balancers / traffic shaping can themselves misoperate, often resulting in catastrophic consequences.

See chapters 19 to 22: https://sre.google/sre-book/load-balancing-frontend/


I like the idea of priority queues driven by the number of prior successful requests plus wait time, such that once you're in, you get reasonable performance; otherwise you get a 503 "too busy" until you've waited long enough.
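Something like this, perhaps; the scoring below is a guess at one way to weight prior successes against wait time, not anything from the article:

    import heapq, time

    # Admit requests in priority order: clients with prior successful
    # requests and longer waits rank higher; the rest get a 503 until
    # they have waited long enough.

    def priority_key(prior_successes, enqueued_at, now=None):
        waited = (now or time.monotonic()) - enqueued_at
        return -(prior_successes + waited)   # lower key = served first (min-heap)

    heap = []
    heapq.heappush(heap, (priority_key(5, time.monotonic() - 2.0), "warm client"))
    heapq.heappush(heap, (priority_key(0, time.monotonic() - 0.1), "new client"))
    print(heapq.heappop(heap)[1])            # "warm client" is admitted first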


A really great approachable article, putting into words things experienced engineers know to be true. Will bookmark :)



