

DNS Outage Post Mortem - streeter
https://github.com/blog/1759-dns-outage-post-mortem

======
jjoe
These are the kinds of corner cases that get put on the back burner en route to
delivering an MVP. Just as you don't prematurely optimize your infrastructure,
you almost never enumerate every issue that can crop up under less-than-ideal
conditions.

This isn't a GitHub-only issue but one that would affect most quick-to-launch
startups. What I'm learning from this is that one needs to regularly revisit
the infrastructure and how it's glued together with the provisioning system.

If it's not broken, break it.

~~~
mahmoudimus
+1. This is invaluable. It falls within a larger frame of thinking --
"immutable infrastructure." Schedule time to regularly provision your entire
stack from the ground up, without any of the caching optimizations, and run it
in production.

With tools like Chef & AWS CloudFormation, there shouldn't be an excuse.
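As a sketch of what "provision from the ground up" can look like with CloudFormation (the resource names, AMI ID, and bootstrap commands here are placeholders, not anything from the post), the whole stack lives in a template that can be deleted and re-created from scratch rather than mutated in place:

```yaml
# Illustrative CloudFormation template: deleting and re-creating this
# stack exercises the full provisioning path -- no cached state survives.
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal immutable-infrastructure sketch (placeholder values)
Resources:
  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-12345678      # placeholder AMI
      InstanceType: t2.micro
      UserData:
        Fn::Base64: |
          #!/bin/bash
          # Bootstrap runs on every rebuild, so drift can't accumulate.
          yum install -y nginx
          service nginx start
```

The point of the exercise is that the rebuild itself is the test: anything your provisioning code silently depends on shows up the first time you rebuild in anger, not during an outage.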

------
bscanlan
"an initial verification led us to believe the changes had been rolled out
successfully"

I would love more detail on the type II error in this validation step; it's
worth exploring deeper. What was the verification step? Why did it not detect
the issue? What review process was used for the verification step?

While the failed verification step is not the root cause, having good safety
checks is the most important part of planning good changes, whether they're
DNS reconfigurations, network changes or software deployments.

~~~
jlgaddis
I was surprised that the total amount of time between rolling out to the first
servers, waiting, verifying, and then rolling out to the second set of servers
was a whopping _nine minutes_.

Maybe I'm just too careful (perhaps because I've seen it happen before) but I
prefer to wait a helluva lot longer than that for verification.

And perhaps it's because I dealt with Microsoft Active Directory so much in
the past but I am extremely careful when it comes to DNS. If there's one thing
that'll screw up your entire environment ( _especially_ in an AD-based
network), it's broken DNS.

------
badmadrad
This is why I like Chef... I feel there are tools out there to test your code
better (FoodCritic, ChefSpec, Test Kitchen) before rolling to production and
having to validate machines in production... ouch.

------
overworkedasian
Who in the right mind would schedule a critical infrastructure upgrade during
the day?

~~~
kawsper
What is the difference between day and night when your users are worldwide?

~~~
colmmacc
There is significant variance in population per timezone, and even more
significant variance in internet-usage per time zone. Much of this variance is
just demographic, but most of it is actually geographic. An interesting and
convenient thing about the present layout of the world is that the Pacific
Ocean takes up almost half of it, and almost half of the world's land masses
are uninhabitable tundra and desert (that's not so relevant to time peaks
though).

This has the great effect of lowering the median travel times and information
transmission latencies between the world's population centers, and it means
that for at least this geological epoch, we're always going to have daily
global peak and off-peak times for human-driven activity.

------
iwasphone
13:20 PST = 16:20 EST

------
nullrouted
What a seriously dumb outage to have. I'm still confused about it even after
reading the RFO.

