

Today’s outage for several Google services - panarky
http://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html

======
mhowland
Cliff notes: Google's configuration service broke itself, then fixed itself
today. Engineers were alerted. Skynet is self-aware.

~~~
aroman
yeah I'm very curious as to what sort of bug deploys a bad configuration and
then magically deploys the fix 30 minutes later...

~~~
reidrac
I don't know the specifics of what happened here, but in my experience with
automatic configuration generation one must have a way to validate the config,
but that validator can have bugs (as any other software).

Then either the software loading the configuration detects the problem or the
monitoring system detects that something's not right, and _the last working
configuration is automatically applied_ while the non-working one is discarded.
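
A minimal sketch of that validate-then-fall-back flow, in Python. Nothing here
reflects Google's actual system: `ConfigManager`, `validate`, and `healthy` are
hypothetical names, and the health check stands in for whatever the loader or
the monitoring would really report.

```python
from typing import Callable

Config = dict


class ConfigManager:
    """Tracks the active config and the last one known to work (hypothetical)."""

    def __init__(self,
                 validate: Callable[[Config], bool],
                 healthy: Callable[[Config], bool],
                 initial: Config) -> None:
        self._validate = validate   # static checks; may itself have bugs
        self._healthy = healthy     # stand-in for loader/monitoring feedback
        self.active = initial
        self.last_good = initial

    def apply(self, candidate: Config) -> bool:
        """Return True if the candidate is kept, False if rejected or rolled back."""
        if not self._validate(candidate):
            # Rejected before it ever goes live.
            return False
        self.active = candidate
        if not self._healthy(candidate):
            # The validator let a bad config through: restore the last working one.
            self.active = self.last_good
            return False
        self.last_good = candidate
        return True
```

The point of keeping `last_good` separate from `active` is that the rollback
does not depend on the validator being right, only on the health signal.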

By the looks of it I would say their monitoring detected the problem but the
reliability team needed some minutes to realise it was a configuration
problem. A classic problem is a network appliance that is misbehaving (e.g.
firewall, switch, etc.), but nobody knows it is because of the configuration,
so it is replaced by a fallback appliance that... oh, has the same problem
(configuration).

Altogether, 25 minutes seems like a lot, but when you're troubleshooting and
you know an important part of your infrastructure is down, time flies!

------
jjoe
Sound familiar? The outage pattern is identical to the one that hit GitHub on
January 8 [1]; see my comment in that thread discussing the root cause [2].
Systems generating config files and then pushing them out to services within
the infrastructure without proper checking and linting. In Google's case, I
just can't believe their systems are so delicately integrated and such a
critical component so botched up.

[1] [https://github.com/blog/1759-dns-outage-post-mortem](https://github.com/blog/1759-dns-outage-post-mortem)

[2] [https://news.ycombinator.com/item?id=7081913](https://news.ycombinator.com/item?id=7081913)

~~~
magicalist
Can you be more specific? They don't sound very similar at all, though to be
fair, google didn't provide that many details here. They both involve
generated configuration files?

> _Systems generating config files and then pushing them out to services
> within the infrastructure without proper checking and linting._

It's not really that easy, as "proper checking and linting" might as well be
phrased as "sufficiently smart checking and linting". You can have _amazing_
checking and linting and still let a bug pass through.

------
roskilli
The main thing that comes to mind is why they do not deploy these kinds of
changes to a small slice and smoke test the slice before deploying to all
users? This seems to be a pretty common routine for services at scale
nowadays...?
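
A rough sketch of that "small slice first" routine, in Python. The function and
its `push`, `smoke_test`, and `rollback` callbacks are hypothetical
placeholders, not any real deployment tool's API.

```python
from typing import Callable, Sequence


def staged_rollout(hosts: Sequence[str],
                   push: Callable[[str], None],
                   smoke_test: Callable[[str], bool],
                   rollback: Callable[[str], None],
                   canary_fraction: float = 0.05) -> bool:
    """Push to a small canary slice, smoke test it, then push to everyone else."""
    # Always test on at least one host, even for small fleets.
    canary_count = max(1, int(len(hosts) * canary_fraction))
    canaries, rest = hosts[:canary_count], hosts[canary_count:]

    for host in canaries:
        push(host)

    if not all(smoke_test(host) for host in canaries):
        # The slice is unhealthy: undo the canaries and leave the rest untouched.
        for host in canaries:
            rollback(host)
        return False

    for host in rest:
        push(host)
    return True
```

The trade-off is latency: the wider rollout waits on the smoke test, which
matters when the change itself is meant to be an urgent fix.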

~~~
pfg
I'm sure they use Canary Deployments, Gradual Rollouts and what have you to
update their services.

I suppose this is a hard problem to solve on a configuration change level
though. Imagine the configuration change that triggered the bug was something
like "hey load balancers, stop sending traffic to the cluster with that new
version of service X which seems to cause elevated error rates." You don't
really want that kind of change to take too long to propagate.

------
cwyers
I gotta admit, even though I already knew this in the back of my head, the
most surprising thing about this is that Google still uses Blogspot for stuff.

~~~
yeukhon
How is that surprising? Tumblr uses Tumblr to post announcements, and Google
owns Blogspot, so of course it would use Blogspot to make announcements.

~~~
cwyers
Honestly, I keep forgetting that Google owns Blogspot, mostly because I keep
forgetting that Blogspot still exists.

~~~
rahimnathwani
You may visit a site hosted on blogspot more often than you think. Blogspot is
blocked where I live (China) so I notice immediately when I click on a
blogspot link on HN. If I didn't have to deliberately turn on VPN, I might
just read the post without bothering to notice where it's hosted.

~~~
yeukhon
True. Matt Green for example uses blogspot too.
[http://blog.cryptographyengineering.com/](http://blog.cryptographyengineering.com/)

The PyCon blog is also on Blogspot.
[http://pycon.blogspot.com/](http://pycon.blogspot.com/)

------
jlgaddis
Back in the day, we used to _TEST_ configurations on a few machines before
blowing them out to, you know, _EVERYTHING_!

------
ape4
Having a system that sends out config files sounds like the part of the body
that sends out hormones, etc. Both can go wrong.

