
Facebook blames a server configuration change for yesterday’s outage - Errorcod3
https://techcrunch.com/2019/03/14/facebook-blames-a-misconfigured-server-for-yesterdays-outage/
======
fenwick67
To everyone jumping to conclusions, remember that the words "server" and
"configuration" can mean a whole host of things. It doesn't necessarily mean
they mistyped their nginx config.

~~~
stingraycharles
Exactly. Doing an upgrade to an internal email service is a configuration
change. Scaling down a cluster is a configuration change. Mitigating a DDoS
attack by implementing a firewall rule is a configuration change.

“Configuration” in this context is the high level system configuration, and
can mean pretty much anything that falls under that.

------
mattbeckman
A large part of this outage's impact fell on those who use Facebook Login as
a convenient OAuth option. A good thing for developers to remember if someone
asks them to avoid a native login option.

------
aboutruby
Not a big surprise as it's one of the harder things to test.

------
cannedslime
Keep calm and blame dev ops!

------
mancerayder
.. but, but, didn't they wring the DevOps folks through coding challenges,
sorting algos and whiteboard coding before hiring them? I heard that's the
number 1 way to ensure uptime at FAANG.

(Configuration changes, that's the source of my sarcasm)

------
lousken
It took very long to fix so I think this was related to their databases, maybe
some data corruption.

------
CodeSheikh
Seems like a lot of people were forced to be productive yesterday (:

~~~
canada_dry
> Seems like a lot of people were forced to be productive yesterday

Well... except for those companies dumb enough to farm out their intra-company
communications to facebook.

A friend of mine's law firm had to resort to - OMG - the phone - yesterday.

~~~
thisacctforreal
Wait, the law firm talks about their dealings over Facebook?

~~~
evv
Facebook Workplace, presumably (the Slack competitor)

------
rachelbythebay
Ah, the tao of reliability.

Async too.

------
chowes
Must be related to them merging the chat backends for WhatsApp, FB, and
Instagram

~~~
segmondy
I suspect this too: quickly integrating the systems so they can't be broken up.

------
ajsharp
Also, 'many people had trouble accessing our apps and services' is some ninja-
level gaslighting:
[https://twitter.com/facebook/status/1106229690069442560](https://twitter.com/facebook/status/1106229690069442560)

~~~
traek
In what way is that gaslighting?

~~~
ajsharp
'many people': every product was down, completely, for everyone, afaik.

~~~
snazz
No. The services were down intermittently, with only certain parts down
completely (auth, etc).

------
halfnibble
WAT. A server configuration change? What kind of server configuration can
affect presumably thousands of machines replicated across the globe? I'm
trying to understand this.

~~~
ceejayoz
It's hardly unheard of.

[https://en.wikipedia.org/wiki/Cascading_failure](https://en.wikipedia.org/wiki/Cascading_failure)

An organization Facebook's size isn't gonna be applying configuration changes
to one server at a time over SSH, either. A server configuration can easily
affect thousands of machines across the globe if it's deployed to them all.

~~~
wyre
Why did it take so long for Facebook to release the cause of the outage? If
they are applying configuration changes at a large scale, shouldn't it be
fairly easy for them to figure out what the cause was?

~~~
ceejayoz
That's silly. Error rates show as elevated on
[https://developers.facebook.com/status/dashboard/](https://developers.facebook.com/status/dashboard/)
until 11pm Pacific yesterday. The @facebook Twitter account sent out a
statement basically within an hour of the start of the next business day.

------
ajsharp
This is fucking bananas. For nearly a decade, Facebook has been at the
forefront of innovating how code is deployed at global scale. They presumably
have gradual rollouts, automated rollbacks, anomaly detection, not to mention
(I assume) loads of organizational safeguards in place to ensure this sort of
thing never happens.

Something else happened. This was not a configuration issue. Edit: If it was,
I'd expect a post-mortem post-haste.
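
For anyone unfamiliar with the terms, here is a minimal sketch of what a
gradual rollout with an automated rollback could look like. It is purely
illustrative and says nothing about how Facebook actually deploys: the
function name, the callbacks (apply_config, revert_config, error_rate), the
stage fractions and the error threshold are all assumptions made up for this
example.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: List[str],
    apply_config: Callable[[str], None],   # hypothetical: push config to a host
    revert_config: Callable[[str], None],  # hypothetical: restore old config
    error_rate: Callable[[str], float],    # hypothetical: observed error rate
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_error_rate: float = 0.05,
) -> bool:
    """Apply a config in growing stages; revert every updated host on anomaly.

    Returns True if the config reached all hosts, False if it was rolled back.
    """
    if not hosts:
        return True

    updated: List[str] = []
    for fraction in stages:
        # Number of hosts that should have the new config after this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            apply_config(host)
            updated.append(host)

        # Crude anomaly check: look at the worst error rate among the hosts
        # updated so far. Real systems would use far richer signals.
        if max(error_rate(h) for h in updated) > max_error_rate:
            for host in reversed(updated):
                revert_config(host)
            return False
    return True
```

Note that a scheme like this only catches problems that show up in the signal
being watched, within the stage where the bad host is updated. A change that
looks healthy on 1% of hosts and only misbehaves under full load would pass
straight through it.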

~~~
shereadsthenews
How did you determine that Facebook leads this space? I recently read an
article about how Facebook distributes RPMs internally and it struck me as the
kind of thing an insane person might have invented fifteen years ago. I mean,
NFS in front of glusterfs? Also, RPMs???? Talk about bananas.

~~~
gmmeyer
What's wrong with RPMs?

~~~
lykr0n
It has a bunch of well built and supported tooling, including dependency
management, dependencies, and versioning /s

Nothing. I quite like using them to deploy applications. If you package them
right and build your deployment system correctly, they're not the worst way to
do things.

