
Ask HN: Why is there so much downtime recently? - lsnickolov
I have a general question. Our team has been discussing how many big companies have had downtime over the past few weeks (Slack, Intercom, Google Cloud, Cloudflare, Pingdom, Pagerduty, Facebook, Braintree, and probably more). Our wild guess is that there is some kind of dependency between those services. For example, if Pingdom relies on Cloudflare, and Cloudflare relies on Google Cloud (this is just an example, I don't know if it is true), then downtime on the Google Cloud platform would result in downtime for both Cloudflare and Pingdom. I'd love to hear your opinions on the matter.
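
To make the guess concrete, here is a minimal sketch of how one outage would propagate through such a chain. The dependency map is purely hypothetical (it reuses the made-up Pingdom -> Cloudflare -> Google Cloud example, which may not reflect reality), and `DEPENDS_ON` and `affected_by` are illustrative names.

```python
from collections import deque

# Hypothetical map: service -> services it relies on. Not real data.
DEPENDS_ON = {
    "Pingdom": ["Cloudflare"],
    "Cloudflare": ["Google Cloud"],
    "Google Cloud": [],
    "Slack": [],
}


def affected_by(outage, depends_on):
    """Return the set of services that go down if `outage` goes down."""
    # Invert the map: provider -> services that rely on it directly.
    dependents = {}
    for service, providers in depends_on.items():
        for provider in providers:
            dependents.setdefault(provider, []).append(service)

    # Breadth-first walk from the failed service through its dependents.
    down = {outage}
    queue = deque([outage])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


print(sorted(affected_by("Google Cloud", DEPENDS_ON)))
```

With this made-up map, a failure at the bottom of the chain takes everything above it down too, while a failure in an unconnected service stays isolated.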
======
bwb
Here is some data from a site I run called downforeveryoneorjustme.com, I
pulled out 2016 to today and looked at just direct traffic to avoid SEO
changes: [https://ibb.co/RzVFrXN](https://ibb.co/RzVFrXN)

Not a ton of difference, so I am not sure the data backs this up :). I will say
that from my view on the backend, we do see more micro-outages now that services
are more resilient, and outages often affect only certain areas or nodes...

Thanks, Ben

~~~
loco5niner
Thanks for running the site! I use it on a semi-non-regular basis and it's
super helpful when I need it.

------
seanwilson
[https://en.wikipedia.org/wiki/Baader%E2%80%93Meinhof_effect](https://en.wikipedia.org/wiki/Baader%E2%80%93Meinhof_effect)?

> The Baader–Meinhof effect, also known as frequency illusion, is the illusion
> in which a word, a name, or other thing that has recently come to one's
> attention suddenly seems to appear with improbable frequency shortly
> afterwards (not to be confused with the recency illusion or selection
> bias[1]). It was named in 1994 after the German Baader–Meinhof Group, when a
> contributor to the Bulletin Board column in the St. Paul Pioneer Press
> reported starting to hear the group's name repeatedly after learning about
> them for the first time.

~~~
lsnickolov
Cool. This explains some other questions (e.g. when you buy a new car and
suddenly everyone else starts driving the same model :D). However, I don't
think this is the case here. We're using all of the services I've given as an
example. We have experienced issues with them before but not in such a short
period of time.

------
weeruz
[https://www.stilldrinking.org/programming-sucks](https://www.stilldrinking.org/programming-sucks)

Read the whole thing; it's funny as hell and frightening at the same time. But
as for the internet, focus on the section headed "A lot of work is done on the
internet and the internet is its own special hellscape".

------
vikramkr
It could also just be pure statistical chance. Depending on how often
opportunities for failure occur it's not infeasible that just by chance a
bunch go down at once, especially if they don't all have to go at the same
time and just going out within a few days of each other is enough to make
people see a connection there. And if they happen to correlate with each other
(Slack going down means a remote employee can't work for the day, which means
some bug isn't caught) even slightly, then maybe this sort of situation goes
from not infeasible to expected to occur once every few years.
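
A rough back-of-the-envelope sketch of the independent case, assuming each service fails independently with the same probability per time window. All the numbers below (30 services, a 5% chance per two-week window, "cluster" meaning 5 or more outages) are made up for illustration, not measured.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(at least k of n independent services have an outage in one window)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_cluster_in_period(k, n, p, windows):
    """P(at least one of `windows` non-overlapping windows sees >= k outages)."""
    q = prob_at_least(k, n, p)
    return 1 - (1 - q) ** windows


# Hypothetical inputs, chosen only to illustrate the shape of the argument.
n_services = 30      # services a team happens to watch
p_outage = 0.05      # assumed chance one service has an outage in a window
k = 5                # how many outages we would call a "cluster"
windows = 3 * 26     # two-week windows in three years

per_window = prob_at_least(k, n_services, p_outage)
over_period = prob_cluster_in_period(k, n_services, p_outage, windows)
print(f"one window: {per_window:.4f}, three years: {over_period:.4f}")
```

A cluster that is unlikely in any single window becomes much more likely once you ask whether it happens in any window over several years, and that is before adding any correlation between the services.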

~~~
davismwfl
Initially reading your response I was going to argue no way in hell there are
that many coincidences in a short period of time. But the argument isn't about
coincidences, it is a cause and effect argument, which does make sense to me.
Whether it is slack or just internet accessibility for someone and they decide
to take off that day so they miss a log entry that normally would've been
caught and fixed immediately. Or pagerduty failed and didn't notify a few
teams about things that normally wouldn't cause downtime because people would
be responding quickly to them. Totally makes sense to me thinking of it that
way.

I am super curious to read any after-action reports that get published on
these failures though. I always love learning even if I don't have a relevant
situation.

~~~
vikramkr
It's also about coincidence, because the question isn't necessarily if it's
likely to happen this week, but if it's likely to happen in general, and I
think it is, and if there is a single causal link anywhere that could make the
odds really high.

~~~
davismwfl
Isn't what you just said exactly the opposite of coincidence? A coincidence is
two or more separate events that happen without an apparent or predictable
causal link. So the fact the odds are high that this type of situation could
occur (which I agree with) means it is predictable and has an apparent causal
link with an associated probability.

e.g. team A doesn't get notifications so they can't fix a server issue is a
pretty high prediction of a future failure for that team. So it is cause and
effect, with the associated probability.

A coincidence would be you and I have the same birthday, we can associate a
probability to that but there is no apparent causal link for why we would have
the same birthday.

------
Spooky23
I don’t buy the psychological illusion argument.

A bunch of companies with good track records of availability are having issues
with availability due to networking. This is happening during a time of
high-drama geopolitical/diplomatic conflict.

There are other unusual events that fit the geopolitical situation. US
municipal governments are suddenly getting targeted with ransomware. A
Russian submarine cable “survey” nuclear submarine has a mysterious fire.
There’s an uprising in Hong Kong.

------
djohnston
end of half. everyone trying to shit out code for reviews. that's my guess

------
tedmiston
Not sure about recent incidents but cascading downtime is definitely a problem
e.g., when S3 has an outage.

I hear the multi-cloud redundancy idea talked about a lot but I'd be curious
to see data around how many people do this in practice.

~~~
digitalsanctum
I think the number is much lower than people realize. There are a lot of
multi-billion dollar companies that still operate in one region and don't have
a solid disaster recovery plan.

~~~
was_boring
There are a lot of multi-billion-dollar companies still coming to terms with
one provider, let alone redundancy.

------
tomxor
A lot of stuff relies on google cloud which explains some of it... but then
I've also noticed other seemingly unrelated incidents on Linode recently, main
one was a network outage in the Dallas data centre.

I haven't heard what all the underlying causes are, but I wonder if this has
to do with the so-called "768K Day" issue.

------
thebiglebrewski
I noticed this too. My two-pronged theory is:

- Many of these companies have IPO-ed recently, maybe employees with enough
  vested stock are just quitting
- Summer vacation - it's July, a lot of people are out of the office. In the US
  it's the July 4th Holiday in addition.

------
karmakaze
Summer vacations. The guardians of quality are enjoying time with their
families, and the crew in charge isn't as experienced or is supporting more
than its capacity.

------
Bucephalus355
I would like to know the answer to this question too. In theory we should know
everything about the internet and its underlying dependencies; it’s not like
biological systems, for example, which are still so poorly understood. Why can
these outages not be traced better?

------
machiaweliczny
vacations? Human config error is probably the most common cause of service
failure, so if some of the more senior people are missing it might lead to
some failures.

------
vectorEQ
agile :D

------
33a
Seems to match up with 6-4 and the HK protests. In China there were major VPN
crackdowns around this time, not sure if all that had anything to do with the
other failures.

