
How Stack Overflow plans to survive the next DNS attack - samhamilton
http://blog.serverfault.com/2017/01/09/surviving-the-next-dns-attack/
======
matt4077
I think I measured Cloudflare's performance and chose it over Google because
it was consistently faster. If the stack-stackers are reading, I'd love to
hear why they didn't make the list.

Also, it'd be a great public service to publish the results. Even if it's only
enabled for a day per year or so, the results would probably be appreciated by
many. And you could always sell your altruism as the need to continually
monitor the situation :)

~~~
samhamilton
+1 - I am also very interested in why they switched away from CloudFlare for
both their DNS and their CDN, and over to Fastly. Nick Craver did a write-up
[1] where they specifically mentioned Cloudflare for both their DNS and CDN.

Do you think that after the Dyn outage everyone's sysadmins are running around
adding redundancy, too worried to trust the uptime of their site to just
CloudFlare?

[1] [http://nickcraver.com/blog/2016/02/17/stack-overflow-the-
arc...](http://nickcraver.com/blog/2016/02/17/stack-overflow-the-
architecture-2016-edition/)

~~~
dx034
The fact that they abandoned Cloudflare only 6 months after this post means
that they must've been pretty disappointed. I wonder if this is the same
reason some of the other sites (e.g. Imgur) moved from Cloudflare to Fastly.

~~~
hehheh
It could also be due to some mundane detail like the service's cost.

~~~
dx034
He mentioned 503 errors and "missed deadlines". I don't think it's cost:
prices were known beforehand, and you wouldn't switch after 6 months over cost
alone. And at that scale, I think you'd get a counteroffer from Cloudflare if
you threatened to leave.

Since you can whitelist Tor traffic in Cloudflare, it seems to come down to
those 503 errors (edge to origin). But I haven't heard of that before, so I'm
not sure whether it's a problem that occurs more often.

For Imgur, I could imagine that purging by cache tag is just too restrictive
at Cloudflare (the limit is very low, even for Enterprise clients). Fastly
doesn't have a limit there; they encourage you to cache everything and purge
where needed. That makes it much easier to cache APIs and HTML pages.

------
bks
Umm, brilliant. Thank you for this.

I ended up with a Dyn / Route53 configuration. We used libcloud to sync
everything together. We also added the exported zone to Cloudflare but did not
enable it.

We had actually planned for this, but in no way did we ever come close to your
in depth testing. The @ Azure issue - thank you for uncovering this for the
rest of us.

~~~
tibu
Can you maybe share how you did the sync between them? There are already some
tools that pull zone data from Dyn and add it to Route53 - can you share why
you chose to build your own sync? (I'm planning to do the same and I'm
interested in others' opinions.)

~~~
captncraig
We actually wrote a tool to manage this. We define our desired records in a
common DSL format, and the tool can interact with various providers to ensure
things match the expected state.

We should be open sourcing this rather shortly, so stay tuned.

Sorry, I'm not who you asked, but that is how we are doing it at stack
overflow now.
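The desired-state idea they describe can be sketched roughly like this. All
names and record structures here are hypothetical illustrations, not the
actual Stack Overflow tool:

```python
# A rough sketch of desired-state DNS management: records are declared
# once, and each provider's current state is diffed against that
# declaration. Keys are (name, type) pairs; values are sets of record data.
DESIRED = {
    ("example.com.", "A"): {"192.0.2.10", "192.0.2.11"},
    ("www.example.com.", "CNAME"): {"example.com."},
}

def diff(provider_records):
    """Compare one provider's current records against DESIRED.
    Returns (missing, extra): records to add and records to delete."""
    missing, extra = {}, {}
    for key, want in DESIRED.items():
        have = provider_records.get(key, set())
        if want - have:
            missing[key] = want - have
        if have - want:
            extra[key] = have - want
    for key, have in provider_records.items():
        if key not in DESIRED:
            extra[key] = have
    return missing, extra

# e.g. a provider that is missing one A record and has a stale TXT:
current = {
    ("example.com.", "A"): {"192.0.2.10"},
    ("old.example.com.", "TXT"): {"v=spf1 -all"},
}
missing, extra = diff(current)
print(missing)  # records to create at this provider
print(extra)    # stale records to delete
```

Running the same diff against each provider's API is what keeps multiple
providers converged on one declared zone.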

------
matt4077
The calculation regarding the ideal number of name servers to list needs some
empirical data on the likelihood of provider and server outages and on how
clients react to them, right? Because otherwise 2 must be the best number, if
I'm not mistaken (the chance of hitting the provider that's offline is always
0.5 on the first try, but the second try would be guaranteed to hit the
other).

Here's the math for the expected number of tries if half of the servers are
offline (it's a hypergeometric distribution, but I couldn't find a closed
formula):

E(2 server) = 1 * 1/2 + 2 * 1/2 = 1.5

E(4 server) = 1 * 2/4 + 2 * 2/4 * 2/3 + 3 * 2/4 * 1/3 = 1.67

E(8 server) = 1 * 4/8 + 2 * 4/8 * 4/7 + 3 * 4/8 * 3/7 * 4/6 + 4 * 4/8 * 3/7 *
2/6 * 4/5 + 5 * 4/8 * 3/7 * 2/6 * 1/5 = 1.8
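For what it's worth, this is the expected number of draws until the first
success under a negative hypergeometric distribution, which does have a closed
form: E = (n + 1) / (k + 1) for n listed servers of which k are online. A
quick brute-force check (a sketch, assuming the resolver picks servers
uniformly at random without replacement):

```python
from fractions import Fraction

def expected_tries(total, up):
    """Expected number of nameservers a resolver has to try before getting
    an answer, picking uniformly at random without replacement, when `up`
    of `total` servers are online."""
    down = total - up
    e = Fraction(0)
    p_prefix = Fraction(1)  # P(the first t-1 tries all hit dead servers)
    for t in range(1, down + 2):  # at most down+1 tries are ever needed
        p_hit = Fraction(up, total - (t - 1))  # try t hits a live server
        e += t * p_prefix * p_hit
        if t <= down:
            p_prefix *= Fraction(down - (t - 1), total - (t - 1))
    return e

for n in (2, 4, 8):
    print(n, expected_tries(n, n // 2))  # 3/2, 5/3, 9/5
```

Note that the series for eight servers needs a fifth term (the case where all
four dead servers are drawn first), which brings the value to 9/5 = 1.8.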

~~~
thefarseeker
Author here. One thing I didn't cover in the post: to have Google and AWS
honour their SLAs, you have to use _all 4_ of the nameservers they provide.
Because we're not doing that (we only use half), we have to balance the chance
of an outage against the impact of an outage.

You are correct in saying that more empirical data could be used here. We
might even end up changing our minds. I'm not much of a numbers person but I
might pass this onto some of the people in our company who love solving
problems like this.

------
jlgaddis
It'd be great if more DNS providers supported "slaving" a zone from an
existing server. It would make it much easier to keep DNS synchronized across
multiple providers.

Hurricane Electric supports this but most of the providers mentioned in this
article do not.

~~~
thefarseeker
Author here. I agree. There are built-in mechanisms for doing this - AXFR and
IXFR. However, these mechanisms were not really designed with this sort of
scale in mind. You have to keep an up-to-date whitelist of all the servers
that can talk to each other, and they would need to talk to each other on a
non-anycasted address (otherwise the NOTIFY packet would go to just a single
anycasted node).

Managing whitelists between multiple 3rd party DNS providers is likely to
break frequently as servers move around, are added, removed, etc.

Interestingly, Hurricane Electric would have been one of our top choices if
they had a first-class API and a commercial SLA. Their support for zone
transfers is admirable and did not go unnoticed. DNS Made Easy also supports
zone transfers.

~~~
jlgaddis
Just as an additional data point for anyone else reading this...

Hurricane Electric supports zone transfers and only requires you to allow
AXFRs from a single host -- slave.dns.he.net (IPv4: 216.218.133.2, IPv6:
[2001:470:600::2]). NOTIFYs should not be sent to slave.dns.he.net but instead
to ns1.he.net.

n.b.: ns1.he.net is not anycasted, but ns[2-5] are. In addition, ns1 does not
have an AAAA RR.
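In BIND terms, that setup might look something like the fragment below. This
is a sketch assuming a BIND master; the transfer-host addresses come from the
details above, and the notify target is a placeholder since ns1.he.net's
address isn't given here:

```
// Hypothetical master config for feeding a zone to Hurricane Electric's
// secondary service.
zone "example.com" {
    type master;
    file "zones/example.com.db";
    // Only HE's single transfer host needs AXFR access:
    allow-transfer { 216.218.133.2; 2001:470:600::2; };
    // NOTIFYs must go to ns1.he.net (not the anycasted slave host);
    // 192.0.2.1 is a placeholder for its actual address:
    also-notify { 192.0.2.1; };
};
```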

We (ISP) currently run our own authoritative name servers in our own
facilities but I've been seriously debating adding another provider into the
mix so "secondary" service is an important feature to me.

------
ksec
In my experience, EdgeCast and DNSMadeEasy were consistently the fastest DNS
providers. I guess both were dropped because of price, since Google DNS and
Route53 did a similar job.

And as others have said, while Cloudflare may not be for everyone, their DNS
is possibly the fastest. Not sure why SO decided to drop them.

Some old data: [http://www.dnsperf.com/](http://www.dnsperf.com/)

I also wonder about the performance of DNSimple, but they don't seem to
emphasize performance much.

~~~
thefarseeker
So DNSMadeEasy made it all the way through the barrage of tests. I even wrote
a library for their API so that we could integrate it into our DNS software
([https://github.com/mhenderson-
so/godnsmadeeasy](https://github.com/mhenderson-so/godnsmadeeasy)), but at the
end of the day their performance in certain regions was not good enough. In
some countries they were measurably faster than R53, but in others they were
measurably much slower.

EdgeCast were dropped due to pricing, and because there's talk of Verizon
selling the EdgeCast services again.

DNSimple didn't make it to performance testing because they only had 5 POPs,
as opposed to the 20+ of other providers.

CloudFlare's DNS was consistently one of the fastest, you are correct about
that. If you read my responses to other comments here, you'll find that we
decided not to use their DNS service because of some fairly pervasive API
issues we had with it.

------
elktea
Netflix have a tool for this as well
[https://github.com/Netflix/denominator](https://github.com/Netflix/denominator)

~~~
skuhn
Denominator is not actively developed:
[https://github.com/Netflix/denominator/issues/374](https://github.com/Netflix/denominator/issues/374)

Last commit of substance was in Sept 2015.

~~~
majewsky
I have not looked at this particular commit history, but I want to argue
against the notion that a lack of fresh commits must mean a project is
abandoned. Some stuff is just mostly _finished_ at some point.

~~~
skuhn
In the GitHub issue I linked to, the project maintainer indicates both that he
will no longer work on the project and that Netflix has retired the software.

So... I wouldn't invest time in it unless you want to take over stewardship
(no one else has offered in the last 7 months).

------
Mojah
I'm currently working on a tool [1] that can help with checking whether all
your different providers are 'in sync' and responding with the same answers.
Setups like these are only going to grow more common as people realise that a
single DNS provider is a SPOF of its own.

Very good analysis of SO and a smart move to roll this out _before_ a new DNS
outage!

[https://dnsspy.io](https://dnsspy.io)

~~~
SteveNuts
I've thought about doing something like this; the biggest issue I found is the
lack of feature parity between DNS providers.

If you could have a unified API that creates the records on multiple
providers, that would be money. It's just that you'd lose out on some things
like Route 53 health checking, etc.

------
cuu508
Is there a good writeup somewhere about setting up redundant NS records at the
zone apex? Or, more generally, "DNS primer for busy developer" article?

~~~
skuhn
Once you have configured your zone with multiple providers, it's simply a
matter of adding NS entries for each provider's authoritative servers to your
registrar. The harder part is ensuring that the zones are kept in sync and
that you don't rely on features (such as GSLB stuff or ALIAS records) that
aren't available with all providers.

It's up to the client resolver to handle failover, so it's not perfect in
terms of availability, but better than nothing.

For example:

    
    
      $ dig ns amazon.com
      amazon.com.		3599	IN	NS	ns4.p31.dynect.net.
      amazon.com.		3599	IN	NS	ns1.p31.dynect.net.
      amazon.com.		3599	IN	NS	ns3.p31.dynect.net.
      amazon.com.		3599	IN	NS	ns2.p31.dynect.net.
      amazon.com.		3599	IN	NS	pdns1.ultradns.net.
      amazon.com.		3599	IN	NS	pdns6.ultradns.co.uk.
    

(note that this is also TLD redundant, since there's a .co.uk included)

~~~
captncraig
The tricky part is making sure the apex NS records are consistent across all
authoritative nameservers. A surprising number of DNS providers do not let you
edit those.
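One minimal way to sanity-check that, assuming you've already collected the
apex NS RRset that each authoritative server returns (e.g. via
`dig @server ns example.com`; the server names below are hypothetical):

```python
from collections import Counter

def ns_inconsistent(answers):
    """answers: dict mapping each authoritative server to the set of apex
    NS records it returned. Returns the servers whose RRset disagrees with
    the majority view."""
    # Take the most common RRset as the reference view of the zone apex.
    reference, _ = Counter(frozenset(v) for v in answers.values()).most_common(1)[0]
    return {srv for srv, rrset in answers.items() if frozenset(rrset) != reference}

answers = {
    "ns1.provider-a.net.": {"ns1.provider-a.net.", "ns1.provider-b.net."},
    "ns1.provider-b.net.": {"ns1.provider-a.net.", "ns1.provider-b.net."},
    "ns2.provider-b.net.": {"ns1.provider-b.net."},  # stale: missing provider A
}
print(ns_inconsistent(answers))  # → {'ns2.provider-b.net.'}
```

Running a check like this periodically is essentially what the zone-monitoring
tools mentioned elsewhere in this thread automate.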

~~~
skuhn
Yeah, it does need to be the same in the zone as well as with your registrar.
As mentioned in StackOverflow's blog post, Azure doesn't support changing NS
records.

Similarly, there are a fair number of DNS providers that don't allow you to
use all DNS record types. For something so simple, providers can really go out
of their way to screw it up.

------
vaara
I wonder why there's no consideration of anycast servers.

~~~
thefarseeker
Author here. Google and AWS are anycasted services. We were not interested in
running our own anycasted DNS due to the management overhead of doing so, and
because the cost of anycasting our own services would be orders of magnitude
more expensive than outsourcing that to an established DNS provider.

