
Health Checks and Graceful Degradation in Distributed Systems - ingve
https://medium.com/@copyconstruct/health-checks-in-distributed-systems-aa8a0e8c1672
======
matt_oriordan
Fascinating and brilliant article!

We take a similar approach to health in our distributed messaging platform:
when a node has capacity problems, we ensure existing connections and
requests continue to be serviced, while new requests are rejected, triggering
a failover to other regions. We effectively put the affected nodes into an
“under siege mode” until the load subsides.
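
A minimal sketch of that idea, not Ably's actual implementation: the class name, the load thresholds, and the `health()` strings below are all hypothetical, chosen only to illustrate "keep serving existing connections, reject new ones until load subsides".

```python
class SiegeGate:
    """Admission gate: serves existing connections, rejects new ones under load."""

    def __init__(self, enter_load=0.9, exit_load=0.7):
        # Hypothetical thresholds (fractions of capacity), not from the comment.
        # Leaving siege mode at a lower load than entering it avoids flapping.
        self.enter_load = enter_load
        self.exit_load = exit_load
        self.under_siege = False
        self.connections = set()

    def observe_load(self, load):
        """Update siege state from a load reading in the range 0..1."""
        if load >= self.enter_load:
            self.under_siege = True
        elif load <= self.exit_load:
            self.under_siege = False
        # Between the two thresholds the previous state is kept.

    def admit(self, conn_id):
        """Return True if a request on conn_id should be served by this node."""
        if conn_id in self.connections:
            return True   # existing connections are always serviced
        if self.under_siege:
            return False  # caller is expected to fail over to another region
        self.connections.add(conn_id)
        return True

    def health(self):
        """Hypothetical health-check answer for a load balancer."""
        return "siege" if self.under_siege else "ok"
```

With this sketch, a connection admitted before `observe_load(0.95)` keeps being served, while a new connection ID is refused until a later reading drops to `exit_load` or below.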

Interestingly, in our recent distributed deep dive interview
(https://blog.ably.io/hidden-scaling-issues-of-distributed-systems-system-design-in-the-real-world-9a9f0d309e8e)
with Paul Nordstrom, a former systems architect at AWS & Google, he also
talked about designing systems around how they would recover after failures.
For example, if a huge backlog of work has built up, simply trying to catch
up is probably not going to work, so the design needs to account for that
too.

On a separate note, one of our engineers, Simon, recently wrote an article on
distributed rate limiting
(https://blog.ably.io/how-adopting-a-distributed-rate-limiting-helps-scale-your-platform-1afdf3944b5a)
and the challenges he faced. He also found that hard static limits often
caused significant yo-yoing of load, which did not help solve the problem.

Matt, technical co-founder, Ably Realtime (https://www.ably.io)

