In addition you can reach out to our customer support team at email@example.com or +1 (844) 700-3889.
Tim Armandpour, SVP of Product Development, PagerDuty
Specifically, they shouldn't have all of their DNS hosted with one company. That is a major design flaw for a disaster-handling tool.
I ask because my secondary question, as a network noob, is: was anybody prepared for, or preparing for, a DDoS on a DNS provider like this? Were people talking about this before? I live in Mountain View, so I've been thinking today about the steps my company and I could take in case something horrifying happens. I remember reading on reddit years ago about local internets, wifi nets, etc., and would love to start building some fail-safes with this in mind.
Two pronged comment, sorry.
Re question #1, check out PagerDuty's reliability page here: https://www.pagerduty.com/features/always-on-reliability/
Namely "Uninterrupted Service at Scale -
Our service is distributed across multiple data centers and hosting providers, so that if one goes down, we stay available."
It seems fair to expect them to have a backup DNS too, but I am not an expert.
I have personally been hit with attacks on my DNS infrastructure as large as or larger than today's, and survived.
Knocking half of the web off the grid because their DNS provider is under attack? It happened recently to DNSimple.
The irony is that I noticed it when dotnetrocks.com went offline, at that time dotnetrocks was sponsored by dnsimple...
While 'expensive' is a relative term, I disagree that it's cost-prohibitive for most firms, as I looked into this specifically (and, ironically, considered using Dyn as our secondary). The challenge isn't coming up with the funds; it's that if you happen to use 'intelligent DNS' features, these are proprietary (by nature) and thus don't translate 1:1 between providers.
In addition to having to bridge the divide yourself, by analyzing the intelligent DNS features and using the API from each provider to simultaneously push changes to both providers, you have to write and maintain automation/tooling that ensures your records are the same (or as close as possible) between the providers. If you don't do this right, you'll get different / less predictable results between the providers, making troubleshooting something of a headache.
Thus in that case the 'cost' is the engineering effort (and the risk, given that APIs change and tooling can go wrong) in addition to the monthly fee.
If all you're doing is simple, standard DNS (no intelligent DNS features), it's not as hard, and it's just another monthly cost. Since you typically get charged by queries/month, if you run a popular service you're probably well able to afford the redundancy of a second provider.
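For the "ensure your records are the same between the providers" part, here is a minimal sketch of the comparison step only. It assumes you have already exported each provider's records into plain (name, type, value) tuples; how you fetch them depends on each provider's API and is not shown. The names `build_zone` and `diff_zones` and the example.com data are hypothetical, and this only covers standard records, not the proprietary 'intelligent DNS' features.

```python
from typing import Dict, Iterable, List, Set, Tuple

# A zone is modelled as (name, type) -> set of record values.
Zone = Dict[Tuple[str, str], Set[str]]


def normalize(name: str) -> str:
    """Lower-case and drop the trailing dot so 'Example.com.' equals 'example.com'."""
    return name.rstrip(".").lower()


def build_zone(records: Iterable[Tuple[str, str, str]]) -> Zone:
    """Group (name, type, value) tuples into record sets keyed by (name, type)."""
    zone: Zone = {}
    for name, rtype, value in records:
        key = (normalize(name), rtype.upper())
        zone.setdefault(key, set()).add(value)
    return zone


def diff_zones(primary: Zone, secondary: Zone) -> Dict[str, List[Tuple[str, str]]]:
    """Report record sets that are missing, extra, or different on the secondary."""
    return {
        "missing": sorted(k for k in primary if k not in secondary),
        "extra": sorted(k for k in secondary if k not in primary),
        "changed": sorted(
            k for k in primary if k in secondary and primary[k] != secondary[k]
        ),
    }


if __name__ == "__main__":
    # Hypothetical exports from two providers (documentation-range IPs).
    provider_a = build_zone([
        ("example.com.", "A", "192.0.2.1"),
        ("www.example.com.", "CNAME", "example.com."),
    ])
    provider_b = build_zone([
        ("example.com.", "A", "192.0.2.2"),
    ])
    print(diff_zones(provider_a, provider_b))
```

Running something like this on a schedule, and alerting on any non-empty result, is one way to catch drift before it turns into the troubleshooting headache described above.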
The sarcasm is curious. It's a business decision. Either your revenue is high enough that the monetary loss from a several-hour intra-day outage is potentially worse than the cost of said redundancy, or you don't care enough to invest in that direction (it's expensive).
Making things redundant is exactly a core piece of what infrastructure engineering is. I guess with the world of VPSes and cloud services, that aspect is being forgotten? And yes, redundancy / uptime costs money!
It may, at scale, be like 100% (one provider), 55%+55% (two) and 40%+40%+40% (three). Still eminently affordable.
Route53 on AWS is $0.50/zone and $0.40/million queries. API integration is also very easy.
Using something like Route53 as a backup is significantly cheaper than suffering from the current Dyn outage.
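As a back-of-the-envelope check, the two prices quoted above can be turned into a monthly estimate. This is only a sketch: it uses the $0.50/zone and $0.40/million-queries figures from this comment as flat rates, ignores any volume tiers or special record types, and the function name is made up for illustration.

```python
# Prices as quoted in the comment above; treated as flat rates (an assumption).
ZONE_FEE_USD = 0.50
PER_MILLION_QUERIES_USD = 0.40


def monthly_cost_usd(zones: int, queries_per_month: int) -> float:
    """Estimate the monthly bill for a number of hosted zones and queries."""
    query_cost = (queries_per_month / 1_000_000) * PER_MILLION_QUERIES_USD
    return zones * ZONE_FEE_USD + query_cost


if __name__ == "__main__":
    # One zone serving 100 million queries a month.
    print(f"${monthly_cost_usd(1, 100_000_000):.2f}")  # prints $40.50
```

Even at a billion queries a month, the same arithmetic comes to about $400.50 for a single zone.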
$ dig -tNS github.com @184.108.40.206
; <<>> DiG 9.8.3-P1 <<>> -tNS github.com @220.127.116.11
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15616
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;github.com. IN NS
;; ANSWER SECTION:
github.com. 899 IN NS ns-1283.awsdns-32.org.
github.com. 899 IN NS ns-1707.awsdns-21.co.uk.
github.com. 899 IN NS ns-421.awsdns-52.com.
github.com. 899 IN NS ns-520.awsdns-01.net.
github.com. 899 IN NS ns1.p16.dynect.net.
github.com. 899 IN NS ns2.p16.dynect.net.
github.com. 899 IN NS ns3.p16.dynect.net.
github.com. 899 IN NS ns4.p16.dynect.net.
;; Query time: 32 msec
;; SERVER: 18.104.22.168#53(22.214.171.124)
;; WHEN: Fri Oct 21 13:01:48 2016
;; MSG SIZE rcvd: 248
$ dig -tNS twitter.com
; <<>> DiG 9.8.3-P1 <<>> -tNS twitter.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62729
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 4
;; QUESTION SECTION:
;twitter.com. IN NS
;; ANSWER SECTION:
twitter.com. 75575 IN NS ns3.p34.dynect.net.
twitter.com. 75575 IN NS ns4.p34.dynect.net.
twitter.com. 75575 IN NS ns1.p34.dynect.net.
twitter.com. 75575 IN NS ns2.p34.dynect.net.
;; ADDITIONAL SECTION:
ns3.p34.dynect.net. 54698 IN A 126.96.36.199
ns4.p34.dynect.net. 81779 IN A 188.8.131.52
ns1.p34.dynect.net. 8544 IN A 184.108.40.206
ns2.p34.dynect.net. 54775 IN A 220.127.116.11
;; Query time: 0 msec
;; SERVER: <local>
;; WHEN: Fri Oct 21 13:02:14 2016
;; MSG SIZE rcvd: 179
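The difference between the two answers above (github.com delegating to both awsdns and dynect name servers, twitter.com only to dynect) can be checked mechanically. Below is a rough sketch that parses the ANSWER lines of pasted dig output and groups the NS hosts by provider. The hint table is a hand-written assumption based on the hostnames seen in this thread, not an authoritative list, and the function names are hypothetical.

```python
import re
from typing import List, Set

# Substring -> provider label. Hand-maintained guesses from hostnames in this thread.
PROVIDER_HINTS = {
    "awsdns": "Route 53",
    "dynect.net": "Dyn",
    "nsone.net": "NS1",
}

# Matches answer lines such as: "github.com. 899 IN NS ns1.p16.dynect.net."
NS_LINE = re.compile(r"^\S+\s+\d+\s+IN\s+NS\s+(\S+)$")


def ns_targets(dig_output: str) -> List[str]:
    """Extract NS host names from the text of a dig answer."""
    targets = []
    for line in dig_output.splitlines():
        match = NS_LINE.match(line.strip())
        if match:
            targets.append(match.group(1).rstrip(".").lower())
    return targets


def providers(targets: List[str]) -> Set[str]:
    """Map NS hosts to provider labels; unknown hosts are kept as-is."""
    found = set()
    for host in targets:
        for hint, label in PROVIDER_HINTS.items():
            if hint in host:
                found.add(label)
                break
        else:
            found.add(host)
    return found


if __name__ == "__main__":
    sample = """
    twitter.com. 75575 IN NS ns3.p34.dynect.net.
    twitter.com. 75575 IN NS ns4.p34.dynect.net.
    """
    used = providers(ns_targets(sample))
    print(used, "single provider" if len(used) == 1 else "multiple providers")
```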
> Running a redundant DNS provider is expensive as all hell.
What makes you think that?
Even a single quad-core server with 4GB RAM running TinyDNS could serve 10K queries per second, based on extrapolation and assumed improvements since this 2001 test, which showed nearly 4K/second performance on 700MHz PIII CPUs: https://lists.isc.org/pipermail/bind-users/2001-June/029457....
EDIT to add: lengthening TTLs temporarily would also mean that those 10K queries per second would quickly lessen the outage, since each answer might stay cached for 12 hours; and large ISPs like Comcast cache answers for all their customers, so a single successful query delivered to Comcast would have some amount of multiplier effect.
See MaxCDN, for example, which uses a mix of DNS providers (AWS Route53 and NS1):
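To put a rough number on the TTL point: if each answer stays cached for the full TTL, the number of distinct resolver caches one server can fill in a TTL window is just rate times TTL. This is an idealized upper bound, assuming one record, one query per resolver per TTL, and the 10K queries/second and 12-hour figures from this comment; real traffic will be messier.

```python
def caches_filled_per_ttl(queries_per_second: int, ttl_hours: int) -> int:
    """Upper bound on distinct resolver caches served within one TTL window."""
    ttl_seconds = ttl_hours * 3600
    return queries_per_second * ttl_seconds


if __name__ == "__main__":
    # 10K queries/second with a 12-hour TTL.
    print(caches_filled_per_ttl(10_000, 12))  # prints 432000000
```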
ns-5.awsdns-00.com. ['18.104.22.168'] [TTL=172800]
ns-926.awsdns-51.net. ['22.214.171.124'] [TTL=172800]
ns-1762.awsdns-28.co.uk. ['126.96.36.199'] (NO GLUE) [TTL=172800]
ns-1295.awsdns-33.org. ['188.8.131.52'] (NO GLUE) [TTL=172800]
dns1.p03.nsone.net. ['184.108.40.206'] [TTL=172800]
dns2.p03.nsone.net. ['220.127.116.11'] [TTL=172800]
dns3.p03.nsone.net. ['18.104.22.168'] [TTL=172800]
dns4.p03.nsone.net. ['22.214.171.124'] [TTL=172800]
If both Dyn and R53 go down, that's exactly when you want a service like PagerDuty to work without a hitch.
Spoiler: I'd bet my complete net worth against your assertion and give you incredible odds.
Golden rule: Fixing a DNS outage with actions that require DNS propagation = game over. You might as well hop in the car and start driving your content to people's homes.
I was giving a bare-minimum example of how this (or some other backup solution) should have already been set up and ready to be switched over.
DNS is bog-simple to serve and secure (provided you don't try to do the fancier stuff and just serve DNS records): it is basically like serving static HTML in terms of difficulty.
That a company would have a backup of all important sites/IP addresses locally available and ready to deploy on some other service, or even be built by hand via some quickly rented servers, is I think quite a reasonable thing to have. I guess it would also be simple to run on GCE and Azure as well, if you don't like the idea of dedicated servers.
I just wish it scaled to multiple cores :(
Calling it a "challenge" implies that there is some difficult, but possible, action that the customer could take to resolve the issue. Since that is not the case, this means either you don't understand what's going on, or you're subtly mocking your customers inadvertently.
Try less to make things sound nice and MBAish, and try more to just communicate honestly and directly using simple language.