
On Infrastructure at Scale: A Cascading Failure of Distributed Systems - aberoham
https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df
======
hardwaresofton
This was a fascinating read.

> I’ve had concerns about the sidecars in the past, however after this event I
> am convinced that having each workload ship with their own logging and
> metric sidecars is the right thing. If these had been a shared resource to
> the cluster, the problem almost certainly would have been exacerbated and
> much harder to fix.

While I'm all for the sidecars model, I'm not sure that I agree with this...

If logging/metrics ingestion were a shared resource that applications
called out to directly (basically a shim for Kafka 99% of the time), it
seems they could have built in a layer that dumps writes to disk when
Kafka is unreachable. Even if they hadn't, and the shared service just
started failing, that's a single (likely highly scaled) service failing
(or, best case, 503ing and dropping logs and metrics where necessary),
_not_ workload downtime or cascading failures. Maybe I'm misunderstanding
something about this quote.
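
To make that concrete, here's a minimal sketch of the fallback layer
being described, under assumptions of mine: `sendToKafka` and the spool
path are hypothetical stand-ins, not anything from the article. When the
shared ingestion service is unreachable, writes land on local disk for
later replay instead of failing the workload.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// sendToKafka is a hypothetical stand-in for the real producer call to
// the shared ingestion service.
func sendToKafka(record []byte) error {
	return fmt.Errorf("kafka unreachable") // simulate the outage case
}

// emit never fails the caller: if the shared service is down, the record
// is appended to a local spool file for later replay.
func emit(record []byte) {
	if err := sendToKafka(record); err != nil {
		f, ferr := os.OpenFile("/var/spool/logs/pending.log", // hypothetical path
			os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
		if ferr != nil {
			return // worst case: drop the record, never block the workload
		}
		defer f.Close()
		f.Write(append(record, '\n'))
	}
}

func main() {
	emit([]byte(time.Now().Format(time.RFC3339) + " metric=requests value=1"))
}
```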

Maybe it's just me, but workloads going down due to a sidecar that handles
logging/metrics doesn't sound like the right prioritization to me.

Also, like other people have said, resource limits and the Kubernetes
compute reservation system[0] seem like a key bit that was missing from
this.

[0]: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
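
For reference, the reservation that [0] describes is normally written as
a KubeletConfiguration YAML file; here's a rough sketch via the
corresponding Go config types, with amounts that are purely
illustrative. It carves CPU and memory out of the node's allocatable
pool so pods can't starve the kubelet or the container runtime.

```go
package main

import (
	"fmt"

	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

func main() {
	cfg := kubeletconfig.KubeletConfiguration{
		// Reserved for OS daemons (sshd, the container runtime, ...).
		SystemReserved: map[string]string{"cpu": "500m", "memory": "1Gi"},
		// Reserved for Kubernetes components (the kubelet itself).
		KubeReserved: map[string]string{"cpu": "500m", "memory": "1Gi"},
	}
	// Node allocatable = capacity - systemReserved - kubeReserved
	// - eviction thresholds; pods can't be scheduled into the reserve.
	fmt.Printf("systemReserved=%v kubeReserved=%v\n",
		cfg.SystemReserved, cfg.KubeReserved)
}
```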

------
jacques_chester
Some questions I had after reading it:

* Would this have been less painful with Netflix-esque patterns? Circuit breakers, throttling, and the like (see the sketch after this list)? Failing hard on purpose is often easier to diagnose than failing fuzzily without clear cause.

* Would it be acceptable to maintain spare capacity? For a prod cluster on the main money path, I think everyone thinks yes. Who holds the decision bit for dev, which is a second-order source of money?

* Can I avoid the second act by having a clean way to (1) reset the entire system and (2) bring it up in an orderly fashion, as quickly as possible? Not every app needs to be relaunched simultaneously; there's presumably _some_ order of preference. Once the cluster state began to oscillate, it doesn't seem like there were any easy options other than resetting it.
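
On the first bullet, a bare-bones circuit breaker in the Hystrix style,
as a sketch only (the threshold, cool-down, and protected call are
made-up values): after a run of consecutive failures the breaker opens
and rejects calls immediately for a cool-down period, turning fuzzy
degradation into a hard, diagnosable failure.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// breaker trips open after `threshold` consecutive failures and rejects
// all calls until `cooldown` has elapsed.
type breaker struct {
	failures  int
	threshold int
	cooldown  time.Duration
	openUntil time.Time
}

var errOpen = errors.New("circuit open: failing fast")

func (b *breaker) call(fn func() error) error {
	if time.Now().Before(b.openUntil) {
		return errOpen // fail hard, on purpose, with a clear cause
	}
	if err := fn(); err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openUntil = time.Now().Add(b.cooldown)
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // a success closes the breaker again
	return nil
}

func main() {
	b := &breaker{threshold: 3, cooldown: 30 * time.Second}
	for i := 0; i < 5; i++ {
		// The protected call always fails here, to show the breaker trip.
		err := b.call(func() error { return errors.New("downstream timeout") })
		fmt.Println(err)
	}
}
```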

~~~
kronin
The risk with having a reset mentality, "after all, it's only dev" (edit:
not intending to imply you said this), is that the opportunity to learn
is lost. If they had taken the reset tack here, and this had later
occurred in production where that option wasn't available, the reset path
would suddenly be seen in hindsight as a lost opportunity.

~~~
jacques_chester
I hadn't thought of that. I still think it would be helpful in prod, though.

------
rmb938
For those who are not too versed in distributed systems, Kubernetes, or
Docker, I highly suggest you read the comments on the post. I am going to
promote myself a bit here and link directly to my comment:
https://medium.com/@rmb938/thanks-dan-for-the-great-post-1ef87cbd5bcc

Sadly, a large environment was brought down by not following best
practices. This event probably could have been prevented, or at least
have had very little impact, if best practices had been followed.

------
sciurus
The author's conclusions seem questionable. Ryan Belgrave's comment on
them is worth reading:

https://medium.com/@rmb938/thanks-dan-for-the-great-post-1ef87cbd5bcc

------
marknadal
The only real way to deal with this is to test distributed systems.

Doing so is hard, but it's the only way to reliably know how a system
behaves under unpredictable failures.

So learn up:

- http://jepsen.io

- https://github.com/Netflix/chaosmonkey

- https://github.com/gundb/panic-server
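
In that spirit, a toy illustration (mine, not from any of those tools):
crash workers at random and check that a supervisor loop keeps the
population alive. Jepsen and Chaos Monkey do the real version of this
against live clusters and networks.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
	"time"
)

func main() {
	var alive int64

	// A worker that "crashes" after a random interval, standing in for
	// an unpredictable failure.
	worker := func() {
		atomic.AddInt64(&alive, 1)
		defer atomic.AddInt64(&alive, -1)
		time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)
	}

	// Supervisor loop: keep restarting workers as they die, for a
	// bounded test window, then report.
	deadline := time.After(3 * time.Second)
	for {
		select {
		case <-deadline:
			fmt.Println("test window over; workers alive:", atomic.LoadInt64(&alive))
			return
		default:
			if atomic.LoadInt64(&alive) < 5 {
				go worker()
			}
			time.Sleep(10 * time.Millisecond)
		}
	}
}
```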

------
tzhenghao
For some reason, this reminds me of Jay's piece on reliability of distributed
systems [1]

[1] - http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability

------
tomnipotent
> This problem did not affect the TAP production environment in the same way
> due to it simply being a much smaller environment

Is it common for a dev environment to be larger than production...? Maybe
it paints a picture of Target's ratio of software developers to
traffic/load?

~~~
throwawayjava
He's not saying that the total dev environment is bigger than the total
prod environment. He's saying that this one cluster of dev machines is
much bigger than other clusters of machines (dev or prod). That could be
true while simultaneously having many more prod machines than dev
machines.

The author explains: _however there is one cluster that was built
significantly larger than the others and hosted around 2,000 development
environment workloads. (This was an artifact of the early days of running
“smaller clusters, more of them”, when we were trying to figure out the right
size of “smaller”.)_

~~~
tomnipotent
Thanks, somehow missed that.

------
tedk-42
Key takeaway that isn't mentioned in the article:

- Put resource limits on your pods.
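
As a concrete sketch of that takeaway, using the Kubernetes Go API types
rather than a YAML manifest (the image name and quantities are
illustrative): explicit requests and limits bound what the scheduler
places on a node and what the kubelet lets the container consume.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	container := corev1.Container{
		Name:  "app",
		Image: "example/app:latest", // hypothetical image
		Resources: corev1.ResourceRequirements{
			// Requests: what the scheduler guarantees on the node.
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("250m"),
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
			// Limits: the ceiling the kubelet/runtime enforces.
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("500m"),
				corev1.ResourceMemory: resource.MustParse("512Mi"),
			},
		},
	}
	fmt.Printf("%+v\n", container.Resources)
}
```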

~~~
bboreham
What’s your reasoning?

From my experience, a memory limit, if hit, will cause periodic OOM and
restart; and a CPU limit will cause slowing-down of the processes.

The tale already mentions restarting and slowing-down; it’s not clear to me
that adding more would have improved anything.

~~~
jpgvm
The issue was the Docker daemon was affected due to poor resource
configuration. It should run at a higher priority and with guaranteed
resources so it can't be impacted by user mode tasks.

k8s has been slow to catch up in this area but finally has priorities and
preemption. That said Docker normally doesn't run in a pod so it's also a
matter of setting the node total allocatable CPU to something reasonable and
ensuring all pods spawned by kubelet are nested under a namespace that has a
lower priority than the system tasks (kubelet itself, Docker if you use it
etc).
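
A rough sketch of the priority mechanism being referred to, via the
Kubernetes scheduling API types; the class name and value are
illustrative, and sit far below the built-in system-cluster-critical and
system-node-critical classes so user pods get preempted first.

```go
package main

import (
	"fmt"

	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	userWorkloads := schedulingv1.PriorityClass{
		ObjectMeta: metav1.ObjectMeta{Name: "user-workloads"}, // hypothetical name
		// Far below the built-in system-* classes (~2e9), so system
		// components win any preemption contest.
		Value:         1000,
		GlobalDefault: true, // applies to pods with no priorityClassName
		Description:   "Default priority for tenant pods.",
	}
	fmt.Printf("%s -> %d\n", userWorkloads.Name, userWorkloads.Value)
}
```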

~~~
merb
Yeah, after reading that I came to the same conclusion, and thought that
he mostly blames it on cluster size, which is just wrong.

------
coldcode
Complex systems, even ones you think you designed to avoid complexity,
are unpredictable when shit happens. Any system that is hard to
completely understand is likely to fail in unforeseen ways, and possibly
be difficult to recover from, or to know how to change to keep it from
happening again. Sadly, I see these types of failures all the time where
I work, though not quite at the same scale. One time a simple DNS update
brought down our operations worldwide, even in places thought not to be
connected at all.

------
nerdbaggy
Interesting. Is there a benefit to running Kubernetes on OpenStack in
their case vs. running it on bare metal?

~~~
erikb
Storage and network stability is much higher if you run it on top of
OpenStack. k8s on bare metal is a pain in the butt. The disadvantage is
that you have even more layers of abstraction.
~~~
nalbyuites
I understand storage, but could you elaborate on why there would be
higher network stability with OpenStack?

------
jijji
I wonder if Target still stores their customers' cardholder PINs [1].

[1] https://www.theguardian.com/world/2013/dec/27/target-hackers-stole-pin-numbers

~~~
kiallmacinnes
According to your own linked article, Target does not store the PINs.

What happens is, the PIN pad encrypts the PIN and other payment info
using a key known only to the card issuer (e.g. Visa). This encrypted
data then finds its way to the card issuer for verification via one of a
few possible paths: either PSTN dialup or, more commonly these days, over
the internet. In the Target incident, hackers grabbed this data as it was
being transferred over the LAN.

(For completeness, there are many more organisations this encrypted data
passes through between the merchant and card issuer, but nobody but the card
issuer can decrypt it)

Also, it's incredibly off-topic for the post. I'll bet that's why you're
getting all the downvotes.

