
Getting Real About Distributed System Reliability (2012) - sacheendra
https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability
======
ChuckMcM
While it is a bit dated in terms of the 'types' of applications, it is still
quite relevant in terms of understanding "web 2.0[1]" distributed system
reliability versus other systems.

One of the coolest things Greg Lindahl built at Blekko was a system that
processed the error logs of every single system in the cluster all at the same
time. It then used string matching to collapse error messages that differed
only in their variables into a single error (so 'error reading sector 10 on
disk sda3' and 'error reading sector 20 on disk sde2' would reduce to 'error
reading sector X on disk Y', x={...}, y={...}). Really fun stuff. From that you
could easily pull out errors in the software (happening on many nodes), errors
in hardware (happening on unrelated nodes), and errors in infrastructure
(happening on nodes correlated by a switch, PDU, or rack).
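
To make the idea concrete, here is a minimal sketch of that kind of reduction. It is not Blekko's implementation, which isn't described in any more detail here. The rule that "any token containing a digit is a variable", the `<*>` placeholder, the function names, and the classification thresholds are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: any whitespace-delimited token containing a digit is a variable.
VARIABLE_TOKEN = re.compile(r"\S*\d\S*")


def template_of(message):
    """Return (template, variables), replacing digit-bearing tokens with '<*>'."""
    variables = VARIABLE_TOKEN.findall(message)
    template = VARIABLE_TOKEN.sub("<*>", message)
    return template, variables


def reduce_logs(entries):
    """Group (node, message) pairs by template.

    Returns {template: {"nodes": set of nodes, "values": list of variable lists}}.
    """
    groups = defaultdict(lambda: {"nodes": set(), "values": []})
    for node, message in entries:
        template, variables = template_of(message)
        groups[template]["nodes"].add(node)
        groups[template]["values"].append(variables)
    return dict(groups)


def classify(nodes, rack_of, cluster_size, widespread=0.5):
    """Rough guess at where an error comes from (hypothetical heuristic).

    - seen on a large share of the cluster -> "software"
    - seen on several nodes in a single rack -> "infrastructure"
    - otherwise                             -> "hardware"
    """
    racks = {rack_of[node] for node in nodes}
    if len(nodes) >= widespread * cluster_size:
        return "software"
    if len(nodes) > 1 and len(racks) == 1:
        return "infrastructure"
    return "hardware"


if __name__ == "__main__":
    entries = [
        ("node1", "error reading sector 10 on disk sda3"),
        ("node7", "error reading sector 20 on disk sde2"),
    ]
    for template, info in reduce_logs(entries).items():
        print(template, sorted(info["nodes"]), info["values"])
```

A real system would need a smarter notion of "variable" than "contains a digit" (hostnames, hex IDs, and paths all complicate it), and the switch/PDU/rack correlation would come from an inventory database rather than a hand-written dict.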

[1] Distributed systems made of many, many nodes that are identical in both
software and hardware, managed by an orchestration service with software that
assumes unreliable platforms.

------
gregdoesit
While this post is from 2012, much of it still applies, especially when
building your own systems/(micro)services:

>The actual reliability of your system depends largely on how bug free it is,
how good you are at monitoring it, and how well you have protected against the
myriad issues and problems it has. This isn’t any different from traditional
systems, except that the new software is far less mature. I don’t mean this
disparagingly, I work in this area, it is just a fact. Maturity comes with
time and usage and effort.

Working at Uber on larger systems, I’ve been surprised by how much more effort
we put into operating the system reliably than into upfront planning/design
(and we spent a lot of time on planning/design). I wrote in depth about those
practices, and here’s the relevant HN discussion:
[https://news.ycombinator.com/item?id=20462349](https://news.ycombinator.com/item?id=20462349)

------
dang
Small thread from back then:
[https://news.ycombinator.com/item?id=3724508](https://news.ycombinator.com/item?id=3724508)

------
qaq
I wonder if anyone has ever measured failure rates for real-world distributed
systems outside of FANG vs. simple non-distributed alternatives.

------
pixelmonkey
Probably requires "(2012)" in title.

~~~
dang
Added. Thanks!

