
Cloudflare Dashboard and API Outage on April 15, 2020 - jplevine
https://blog.cloudflare.com/cloudflare-dashboard-and-api-outage-on-april-15-2020/
======
atonse
This is not a critique; Cloudflare is clearly a solid, well-engineered system
given its scale. Just look at their other post-mortems.

But it's just kind of interesting: you can have all the redundant systems and
smart software, and some dude can still accidentally pull cables – oh, humans!

Would love to see what other mitigations they came up with beyond the ones
listed (apart from, probably, putting 20 BRIGHT RED labels next to the patch
panels saying DO NOT DISCONNECT, EVER EVER EVER!).

Perhaps one mitigation could be a better way to identify who's physically
there, so you can call them up within seconds and ask what they just did?

------
bogomipz
>"Documentation: After the cables were removed from the patch panel, we lost
valuable time identifying for data center technicians the critical cables
providing external connectivity to be restored. We should take steps to ensure
the various cables and panels are labeled for quick identification by anyone
working to remediate the problem. This should expedite our ability to access
the needed documentation."

So they failed to label their cables? I'm sorry, but this is "datacenter 101"
stuff. How are none of the cables plugged into your patch panels labeled?
Every colo has a label gun you can borrow! Also remote hands will gladly send
you a pic of a rack or cabinet to verify what they're looking at.

~~~
ahofmann
Why is this post being downvoted? What Cloudflare has built since its founding
is extremely impressive. My company has been a customer since 2011 because of
me, and yet Cloudflare looks like a nice shiny shell with the same total chaos
underneath as any small IT shop I've ever had the chance to look into.
Unfortunately, that doesn't let me sleep well when my company depends on
Cloudflare. That's why we use hardly any Cloudflare features, so that we can
switch to our own infrastructure at any time. As annoying as Google and
Microsoft are, I sleep better with them because they have their processes
better under control (I know these companies offer different products, but the
question of dependency remains the same).

~~~
bogomipz
There seems to be an almost unspoken rule on HN that you don't say anything
critical about Cloudflare. Criticism will almost universally be downvoted,
even when it's expressed respectfully. You'll even see people fawn over their
post-mortems without being critical or expressing disappointment at being
affected by the actual outage. I'm not sure why this is, but it's very
peculiar. It's all the stranger given that Cloudflare seemingly uses HN as its
primary marketing channel to generate discussion about the company, yet
somehow that discussion should only ever be positive commentary.

------
idrism
It’s strange to me that their remediation did not include distributing these
systems redundantly across multiple data centers, perhaps with a globally
distributed database.

> we knew that the failback from disaster recovery would be very complex

Disaster recovery failover to a second data center (and failback) should not
force a choice about whether to fail over. They should be able to fail over
immediately, and the system should self-heal once the original data center
comes back online.
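
As a rough sketch of what that could mean: a health-check loop that fails
over automatically and fails back on its own once the primary is stable
again. Everything here is hypothetical (the endpoints, the thresholds, and
the route_traffic_to() hook, which in practice would be a DNS, anycast, or
load-balancer change):

    # Sketch of health-check-driven failover with automatic failback.
    # All endpoints, thresholds, and hooks below are hypothetical.
    import time
    import urllib.request

    PRIMARY = "https://core-dc-primary.example.com/health"      # hypothetical
    SECONDARY = "https://core-dc-secondary.example.com/health"  # hypothetical

    FAIL_THRESHOLD = 3      # consecutive failed probes before failing over
    RECOVER_THRESHOLD = 10  # consecutive good probes before failing back
    CHECK_INTERVAL = 5      # seconds between probes

    def healthy(url: str) -> bool:
        """True if the health endpoint answers 200 within two seconds."""
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    def route_traffic_to(target: str) -> None:
        """Placeholder for the real routing change (DNS, anycast, LB config)."""
        print(f"routing control-plane traffic to {target}")

    def run() -> None:
        active, failures, recoveries = "primary", 0, 0
        while True:
            if healthy(PRIMARY):
                failures, recoveries = 0, recoveries + 1
                if active == "secondary" and recoveries >= RECOVER_THRESHOLD:
                    active = "primary"
                    route_traffic_to(PRIMARY)
            else:
                recoveries, failures = 0, failures + 1
                if active == "primary" and failures >= FAIL_THRESHOLD:
                    active = "secondary"
                    route_traffic_to(SECONDARY)
            time.sleep(CHECK_INTERVAL)

    if __name__ == "__main__":
        run()

The asymmetric thresholds are the point: failing over is fast, but failing
back waits until the primary has been healthy for a while, so a flapping link
doesn't bounce traffic back and forth.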

------
cookiecaper
I'll just leave this here ... https://github.com/netbox-community/netbox
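
(For context: NetBox is a DCIM/IPAM tool for documenting racks, patch panels,
and cabling, i.e. exactly the documentation gap the post-mortem describes.) A
rough sketch of pulling a panel's port labels and cables through its pynetbox
client; the URL, token, and device name are made up:

    # Rough sketch, assuming a NetBox instance at a hypothetical URL, with a
    # hypothetical API token and device name. Requires: pip install pynetbox
    import pynetbox

    nb = pynetbox.api("https://netbox.example.com", token="YOUR_API_TOKEN")

    # Look up the patch panel and list its documented front ports, so a
    # remote-hands tech can match physical labels against the records.
    panel = nb.dcim.devices.get(name="patch-panel-01")
    for port in nb.dcim.front_ports.filter(device_id=panel.id):
        print(port.name, port.label, "->", port.cable)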

------
rkwasny
In summary, 10% of internet traffic relies on one patch panel somewhere :)

~~~
atonse
The cloud is just someone else's computer, right? :-)

Submarine cables are like this too. It all comes down to a quarter-inch-thick
bundle of fibers (each thinner than a human hair).

------
majjaa
Is it just me, or does it feel like these post-mortem blog posts are becoming
extremely common with Cloudflare?

