
Operating a large distributed system in a reliable way: practices I learned - gregdoesit
https://blog.pragmaticengineer.com/operating-a-high-scale-distributed-system/
======
LaserToy
I would add: you should start your monitoring with business metrics.
Monitoring low-level things is good to have, but putting the whole emphasis
on it misses the point. You should be able to answer, at any point in time:
are users having a problem, what problem, how many users, and what are they
doing to work around it?

In other words, when a person is in the ER, doctors look at heart rate,
temperature, and so on, not at some low-level metric like how many grams of
oxygen a specific cell is consuming.

~~~
cperciva
Yes, in the ER, doctors look at things like heart rate, respiration rate, and
temperature. But they also draw blood for electrolytes, glucose, creatine
kinase, etc. since those can detect underlying problems which the body is
compensating for.

A well-designed distributed system is going to be able to handle a certain
failure rate in its components, because requests will be retried automatically.
If a component's failure rate increases from 0.01% to 0.1%, there will
probably not be any user-visible impact... but if you can detect that
increase, you might be able to correct the underlying issue before that
component's failure rate increases to 1% or 10% or 100% -- at which point no
amount of retrying will avoid problems.
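
To make the numbers concrete (a back-of-envelope sketch, assuming independent
failures and a fixed retry budget; the numbers are illustrative):

    # User-visible failure probability when each request gets up to
    # `attempts` tries against a component with per-attempt failure rate `p`
    # (assumes failures are independent across attempts).
    def user_visible_failure(p, attempts=3):
        return p ** attempts

    for p in (0.0001, 0.001, 0.01, 0.1, 0.5):
        print(f"component fails {p:.2%} -> user sees ~{user_visible_failure(p):.2g}")

    # At 0.01%-1% the retries hide almost everything; past ~10% the
    # user-visible rate climbs fast, which is why the early detection matters.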

~~~
GauntletWizard
You're missing the forest for the trees. The deeper metrics are important for
diagnosis. The business metrics are what you care about.

I'd phrase your advice differently as well: monitor at contact points. Monitor
where the rubber meets the road. Monitor where two components intersect, be it
between frontend and backend components owned by different teams, between your
application and its database, or at the low level - the system API. I've found
many interesting problems and averted several potential crises because metrics
like request counts looked different between the client and the server, or
between the system metrics and my application's, etc. Contact points are what's
important, and it just happens that (figuratively) every application has a
contact point with the OS.

~~~
cperciva
But that's my point: The deeper metrics aren't merely useful for diagnosis
_after_ the business metrics go sideways; they can be useful as leading
indicators, to warn you _before_ problems reach the point of affecting the
business metrics which you care about.

~~~
linza
The deep metrics, as you call them, become increasingly irrelevant the larger
the system gets, at least for the use case you mention.

To put it unscientifically: something always goes wrong, but that does not
mean the SLA is violated. As a corollary, you will be constantly alerted about
something that might go wrong, because of some signal that someone thought
might lead to an outage.

At face value it seems proactive, but it's really not. Better to spend the
time actively making the system more reliable (e.g. look at your or someone
else's postmortems, or do premortem exercises).

~~~
jorangreef
On the contrary, low level metrics become more important the larger a system
gets, since the quantity of low probability events (e.g. cosmic ray bit flips)
increases.

Without those "irrelevant" low-level metrics for a large, growing distributed
system, you're not only flying blind but crashing more and more often.

~~~
linza
You are just chasing your own tail by looking at those metrics. Your system
needs to be resilient to these random events and you must account for them,
that's for sure. But you would not put your focus on them; it's still the
higher-level alerts that are more relevant to the business.

Crashing more often is caught by an alert that looks at the total capacity of
a service. At first it doesn't matter whether it's random bit flips or OOMing
nodes. Longer term, these metrics can be useful for improving efficiency
(should I first fix the random bit flips or the OOMing tasks?). But I would
not base SLA-relevant alerts on such low-level signals.

~~~
jacquesm
This is simply wrong. Quality starts in the basement and works its way up. So
if you monitor a system at the lowest level, you will be able to build
confidence about those layers, and you will spot trends _long_ before they
become apparent at higher levels, _because_ those higher levels will erase the
fact that at lower levels things are already going wrong.

Systems that only monitor the highest levels appear to function fantastically
well right up to the moment they crash spectacularly. Then the forensics will
show you that at lower levels there were plenty of warning signs that the
system was headed for the cliffs, and a large number of those signs were
apparent while there was still time to do something about them. Reliable
systems engineering is not something you can do only at the highest levels,
because of the built-in resilience in the intermediary levels. This is
counter-intuitive but borne out by countless examples of systems that look
robust but aren't, versus systems that really are robust.

~~~
antpls
After reading this thread, everyone has convinced me that there are good
arguments for having different levels of metrics. I think we can better
understand the problem if we introduce the word "priority".

The priority is always the customer. The focus can change given the
circumstances and the context of each issue. Sometimes the focus must be
high-level, sometimes low-level, but the priority is always the customer.

------
bcoates
This is a good guide. One thing I'd add:

While you're monitoring for traffic/errors/latency, throw in a minimum success
rate. Make a good estimate of how many successful operations the monitored
systems will do per minute during the slowest hour of the year, and put in an
alert if the throughput drops _below_ that. You'd be surprised how many
0-errors/0-successes faults happen in a complex system, and a minimum-throughput
alarm will catch them.
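
Roughly (a sketch only; the floor value and function names are made up):

    # Minimum-throughput alarm: page when successes per minute drop below a
    # floor estimated from the slowest hour of the year.
    MIN_SUCCESSES_PER_MINUTE = 50  # derive your own floor from historical data

    def check_min_throughput(successes_last_minute, page):
        # The 0 errors / 0 successes case is exactly what this catches.
        if successes_last_minute < MIN_SUCCESSES_PER_MINUTE:
            page(f"only {successes_last_minute} successes/min, below floor of "
                 f"{MIN_SUCCESSES_PER_MINUTE} -- possible silent outage")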

~~~
mkeedlinger
In some systems it's also nice to have a "canary" acting as a user that calls
your API every minute or so. Then even during low- or no-traffic hours you can
still catch errors/outages/etc.
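
A minimal sketch of such a prober (hypothetical endpoint, Python just for
illustration):

    import time
    import urllib.request

    PROBE_URL = "https://api.example.com/health"  # hypothetical endpoint

    # Hit a known-good endpoint once a minute and record success/failure,
    # so there is still signal even when real traffic drops to zero.
    def probe_forever(report):
        while True:
            start = time.monotonic()
            try:
                with urllib.request.urlopen(PROBE_URL, timeout=5) as resp:
                    report(ok=(resp.status == 200), latency=time.monotonic() - start)
            except Exception:
                report(ok=False, latency=time.monotonic() - start)
            time.sleep(60)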

~~~
lrem
That's a prober, not a canary.

~~~
nitrogen
Sometimes things can be described in more than one way. Canary, readiness
probe, health check, heartbeat, etc.

It makes communication more efficient and enjoyable when all parties are aware
of this.

~~~
lrem
Yes, there are many words in the space. But there are also many classes of
systems. It makes communication more efficient and enjoyable if all parties
can think of the same thing when hearing a word.

------
cpursley
Can anyone recommend MOOCs and/or university courses (open syllabus) covering
Distributed Systems?

~~~
titanomachy
Just go work as an SRE at Google for a couple years.

~~~
rurban
Or better, work at AVL for a few years. Their large distributed system is
real-time, and on failures people might die. Especially in the Formula 1
department, where everything is 10x faster and larger, with many more sensors.

With normal complicated control systems, such as in an airplane, you do have
plenty of time to complete your task in the given timeframe, but in F1 there's
not much time left, and you have to optimize everything: HW, protocols and SW.
Much harder than in gaming or at scale at Google/Facebook/Amazon.

E.g. you have to convert your database rows from write-optimized to
read-optimized on demand; a simple fast database cannot do both fast enough. A
normal FireWire protocol is not fast enough; you have to split it up into
trees. A normal CAN protocol ditto; you have to multiplex it. You have to load
logic into the sensors to compress and filter the data, to help transmit the
huge amount of data.

------
NewsAware
Nice article. The amount of in-house procedures/tooling developed by the
backend seems impressive (maybe some not-invented-here syndrome, but I can't
really judge). What I am astonished at, though, is that the backend part of
Uber seems so professional while the Uber Android app feels like it was built
by 2 junior outsourced devs. I have rarely used an app which felt so buggy and
awkward (aside from regular crashes, e.g. when registering for Jump bikes in
Germany inside the app, I had to restart the app to make the corresponding
menu item appear).

~~~
RhodesianHunter
Really? That's a bit surprising to me as I've always thought their app was
slick, easy to use, and responsive. Out of curiosity what device are you on?

~~~
NewsAware
Huawei P10 lite. If that were the explanation, I would actually be relieved -
as I said, it didn't fit my mental model that they wouldn't have a super slick
app with so many dev resources.

------
toolslive
You need to have a strategy for backward (and forward) compatibility for your
components. If the environment is large enough, you don't know exactly which
component is running what version of the code, as people are constantly
upgrading (or holding back on upgrading) some part of your system. This
includes extra (parameters to) RPC calls, data type evolution, and schema
evolution. Without a decent strategy you'll be in over your head quickly.
(Tip: a version number for your API as part of the API, v0.0.1, ain't gonna be
enough.)

------
techie128
Interesting, although it is on the light side. For example, it doesn't talk
about chaos testing, defining effective and comprehensive metrics (KPIs),
alert noise, or running services like databases in an active-active (hot-hot)
mode.

------
jwilliams
Good read. There are a few things that I'd throw on top as important:

- Active monitoring

- Chaos testing

- Cold start testing

------
joshgel
> I like to think of the effort to operate a distributed system being similar
> to operating a large organization, like a hospital.

Clearly never worked for a hospital. Hospitals need good engineers (and often
don’t have them). Our ‘nines’ are embarrassing...

~~~
jpitz
Are you referring to medical operations, or IT operations, in hospitals? I
think he is referring to medical operations, where I would expect the relevant
professions to be doctors and nurses, not engineers.

~~~
arkades
Actually large hospital systems tend to hire one or two systems engineers to
be part of the QI department. But yeah, most QI is front line staff.

~~~
ambicapter
QI department?

~~~
arkades
Quality improvement.

------
ggregoire
Most of this advice applies to small non-distributed systems too.

------
drdrey
I find it problematic that this recommends the Five Whys to get to "the root
cause". Haven't we collectively moved past that?

~~~
teraflop
Would you care to explain why you find that problematic?

~~~
drdrey
See this post by John Allspaw: [https://www.oreilly.com/ideas/the-infinite-hows](https://www.oreilly.com/ideas/the-infinite-hows)

> The Five Whys, as it’s commonly presented, will lead us to believe that not
> only is just one condition sufficient, but that condition is a canonical
> one, to the exclusion of all others.

Five whys presents itself as a way to dig deep but promotes doing so linearly
and getting to a singular thing you can fix, hiding a lot of potential
learnings along the way. Thinking about contributing factors is a much more
powerful framework.

~~~
lrem
I find it useful: you'll find at least one problem deep down. The more
surface-level problems will get worked on anyway. In any case, the point is to
make your system more robust over time.

------
VincentEvans
Did you do all these things by yourself?

Really great content, but I was really taken aback by "I" used everywhere.
Maybe it's a new thing that I am not hip to and ought to try - "I built and ran
transaction processing software for Bloomberg! This is what I learned!"

But perhaps you really did do all that by yourself; in that case, sorry that I
doubted you - it looks like a lot.

------
learnfromstory
I don't really agree that this list could have come about through discussions
with engineers at Google, Facebook, etc. The more computers you have, the less
important it becomes to monitor junk like CPU and memory utilization of
individual machines. Host-level CPU usage alerting can't possibly be a
"must-have" if there are extremely large distributed systems operating without it.

If you've designed software where the whole service can degrade based on the
CPU consumption of a single machine, that right there is your problem and no
amount of alerting can help you.

~~~
kevinsundar
I work at a FAANG and host-level CPU is most definitely an alert we page on.
Though a single host hitting 100% CPU isn't really a problem in and of itself
(our SOP is just to replace the host), it's an important sign to watch for
other hosts becoming unhealthy. It might be overkill, but hey, there's
mission-critical stuff at hand.

For example: if you have a fleet of hosts handling jobs with retries, a bad
job could end up being passed from host to host, killing or locking up each
one as it gets passed along. And that could happen in minutes, while
replacing, deploying, and bootstrapping a new host takes longer. So by the
time your automated system detects, removes, and spins up a new host,
everything is on fire.
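
The usual mitigation is a cap on delivery attempts, something like this
(made-up names, just a sketch):

    MAX_ATTEMPTS = 3  # made-up cap

    # Poison-pill guard: a job that has already taken out a few hosts gets
    # parked in a dead-letter queue instead of hopping to yet another host.
    def handle(job, process, dead_letter):
        attempts = job.get("attempts", 0)
        if attempts >= MAX_ATTEMPTS:
            dead_letter(job)
            return
        job["attempts"] = attempts + 1
        process(job)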

~~~
learnfromstory
Could you mention which FAANG so I can avoid applying for a job there?
Large-scale software systems _must_ be designed to serve through local
resource exhaustion. If you are paging on resource exhaustion of a single
host, you are just paying the interest on your technical debt by drawing down
your SREs' quality of life.

I stand by my beef with this article. The statement that "I've talked with
engineers at Google [and concluded that a thing Google wouldn't tolerate is a
must-have]" doesn't make sense. What I get from this article is you can talk
with engineers at Google without learning anything.

~~~
packetslave
A single host stuck at 100% CPU also has a nasty effect on your tail latency,
in a system with wide fanout. If a request hits 100 backend systems, and 1 of
them is slow, your 99th percentile latency is going to go in the toilet.
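
The arithmetic behind that is brutal (a quick sketch, assuming independent
backends):

    # If each backend independently has probability p of being "slow", a
    # request fanned out to n backends is slow whenever any one of them is.
    def p_request_slow(p, n=100):
        return 1 - (1 - p) ** n

    print(p_request_slow(0.01))   # ~0.63: 1% slow backends -> a majority of requests hit one
    print(p_request_slow(0.001))  # ~0.10: even 0.1% slow backends hurts the tail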

~~~
learnfromstory
Which is a good reason to hedge and replicate but NOT a reason to alert on
high CPU usage of single computers.

~~~
packetslave
You definitely want to TRACK CPU usage on individual hosts, but, yeah, I would
alert on service latency instead. Symptom, not cause.

