
I wanted to provide an update on the PagerDuty service. At this time we have been able to restore the service by migrating to our secondary DNS provider. If you are still experiencing issues reaching any pagerduty.com addresses, please flush your DNS cache. This should restore your access to the service. We are actively monitoring our service and are working to resolve any outstanding issues. We sincerely apologize for the inconvenience and thank our customers for their support and patience. Real-time updates on all incidents can be found on our status page and on Twitter at @pagerdutyops and @pagerduty. In case of outages with our regular communications channels, we will update you via email directly.

In addition you can reach out to our customer support team at support@pagerduty.com or +1 (844) 700-3889.

Tim Armandpour, SVP of Product Development, PagerDuty




I had the privilege of being on-call during this entire fiasco today, and I have to say I was really, really disappointed. It's surprising how broken your entire service was when DNS went down. I couldn't acknowledge anything, and my secondary on-call was getting paged because it looked like I wasn't trying to respond. I was getting phone calls for alerts that weren't even showing up on the web client, etc. Overall, it caused chaos and I was really disappointed.


"It's surprising how broken your entire service was when DNS went down." lol


How does the service you're responsible for work when DNS stops functioning?


Hopefully you have a nice SLA with them.


I appreciate the update, but your service has been unavailable for hours already. This is unacceptable for a service whose core value is to ensure that we know about any incidents.


Given that a large swath of SaaS services, infrastructure providers, and major sites across the internet are impacted, this seems harsh. Are you unhappy with PagerDuty's choice of DNS provider, or something else they have control over? I don't think anyone saw this particular problem coming.


A company that bills itself as a reliable, highly available disaster-handling tool ought to know better than to have a single point of failure anywhere in its infrastructure.

Specifically, they shouldn't have all of their DNS hosted with one company. That is a major design flaw for a disaster-handling tool.


I'm not using the service, but I'm curious what an acceptable threshold for this company is. Like, if half the DNS servers are attacked? If hostile actors sever fiber optic lines in the Pacific?

I ask because my secondary question, as a network noob, is: was anybody prepared / preparing for a DDoS on a DNS provider like this? Were people talking about this before? I live in Mountain View, so I've been thinking today about the steps I and my company could take in case something horrifying happens - I remember reading on reddit years ago about local internets, wifi nets, etc., and would love to start building some fail-safes with this in mind.

Two-pronged comment, sorry.


I'm not using the service either, but I noticed this comment [1]. It's not the first time that a DNS server has been DDoS-ed, so it has been discussed before (e.g. [2]). At minimum, I would expect a company that exists for scenarios like this to have more than one DNS server. Staying up when half of existing DNS servers are down is a new problem that no one has faced yet, but this is an old, solved one.

[1] https://news.ycombinator.com/item?id=12759653

[2] https://www.tune.com/blog/importance-dns-redundancy/


Re question #2, Amazon uses UltraDNS as a backup and seemed to be relatively unaffected by today's attack.

Re question #1, check out PagerDuty's reliability page here: https://www.pagerduty.com/features/always-on-reliability/

Namely "Uninterrupted Service at Scale - Our service is distributed across multiple data centers and hosting providers, so that if one goes down, we stay available."

It seems fair to expect them to have a backup DNS provider too, but I am not an expert.


> was anybody prepared / preparing for a DDoS on a DNS provider like this

Yes.

I have personally had my DNS infrastructure under attacks as large as or larger than today's, and survived.


This is exactly my point.


From the perspective of my service being down, my customers being pissed, and me not being notified.. yes, maybe PD should be held to a higher standard of uptime. Seems core to their value prop.


> I don't think anyone saw this particular problem coming.

Knocking half of the web off the grid because their DNS provider is under attack? It happened recently to DNSimple.

https://blog.dnsimple.com/2014/12/incident-report-ddos/

The irony is that I noticed it when dotnetrocks.com went offline; at the time, dotnetrocks was sponsored by DNSimple...


Flush your DNS like the parent said.


Flushing DNS won't do shit


pagerduty.com moved to Route53, but the TTL on NS records can be very long. Flushing (or restarting) whatever caches DNS records in your infra will help it pick up the new nameservers quickly.
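What that looks like in practice depends entirely on which resolver your machines and infra actually run; a rough sketch for a few common ones (examples, not a checklist):

    # macOS client
    sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder

    # Linux box using systemd-resolved
    sudo systemd-resolve --flush-caches

    # BIND resolver
    sudo rndc flush

    # Unbound resolver (flush just the affected zone)
    sudo unbound-control flush_zone pagerduty.com

    # then confirm you now get the new (Route53) nameservers
    dig NS pagerduty.com +short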


Not on your laptop. On your local DNS resolver.


[flagged]


> Running a redundant DNS provider is expensive as all hell.

While 'expensive' is a relative term, I disagree that it's cost-prohibitive for most firms, as I looked into this specifically (ironically, we considered using Dyn as our secondary). The challenge isn't coming up with the funds; it's if you happen to use 'intelligent DNS' features: these are proprietary (by nature) and thus don't translate 1:1 between providers.

In addition to having to bridge the divide yourself, by analyzing the intelligent DNS features and using the API from each provider to simultaneously push changes to both providers, you have to write and maintain automation/tooling that ensures your records are the same (or as close as possible) between the providers. If you don't do this right, you'll get different / less predictable results between the providers, making troubleshooting something of a headache.

Thus, in that case, the 'cost' is the engineering effort (and the risk, given that APIs change and tooling can go wrong) in addition to the monthly fee.

If all you're doing is simple, standard DNS (no intelligent DNS features), it's not as hard, and it's just another monthly cost. Since you typically get charged by queries/month, if you run a popular service you're probably well able to afford the redundancy of a second provider.
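For that simple case, the "push to both providers" tooling can be a thin wrapper over each API. A minimal sketch, assuming Route53 as one provider and a second provider with a generic REST API (the secondary endpoint, token, and zone variables are placeholders, not any real product's API):

    #!/bin/sh
    # upsert-both.sh NAME VALUE [TTL] -- push the same A record to two providers
    NAME="$1"; VALUE="$2"; TTL="${3:-300}"

    # Provider 1: Route53, via the real AWS CLI
    BATCH="{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"$NAME\",\"Type\":\"A\",\"TTL\":$TTL,\"ResourceRecords\":[{\"Value\":\"$VALUE\"}]}}]}"
    aws route53 change-resource-record-sets \
        --hosted-zone-id "$ROUTE53_ZONE_ID" --change-batch "$BATCH"

    # Provider 2: placeholder REST call standing in for your secondary provider's API
    curl -X PUT "https://api.secondary-dns.example/v1/zones/$ZONE/records/$NAME/A" \
        -H "Authorization: Bearer $SECONDARY_TOKEN" \
        -d "{\"ttl\": $TTL, \"answers\": [\"$VALUE\"]}"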


Ah so make everything redundant. Double my costs in man hours and in monetary cost. Brilliant!


> Ah so make everything redundant. Double my costs in man hours and in monetary cost. Brilliant!

The sarcasm is curious. It's a business decision. Either your revenue is high enough that the monetary loss from a several-hour intra-day outage is potentially worse than the cost of said redundancy, or you don't care enough to invest in that direction (it's expensive).

Making things redundant is exactly a core piece of what infrastructure engineering is. I guess with the world of VPSes and cloud services, that aspect is being forgotten? And yes, redundancy / uptime costs money!


When your service literally says it exists to help provide uptime, redundancy makes sense.


Your automation should handle creating/modifying records in both providers. Also, if you're using multiple providers you don't need to pay for 100% of your QPS (or whatever metric is used for billing) on every provider, only 50% for two or 33% for three. You can just pay for overages when you need to send a higher percentage of your traffic to a single provider.


A lot of providers do have 'fixed' portions of costs, so, it won't be quite 1/2 or 1/3rd.

It may, at scale, be like 100% (one provider), 55%+55% (two) and 40%+40%+40% (three). Still eminently affordable.


Really?

Route53 on AWS is $0.50/zone and $0.40/million queries. API integration is also very easy.

Using something like Route53 as a backup is significantly cheaper than suffering from the current Dyn outage.
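To put numbers on it, here is a rough sketch of standing up that backup zone with the AWS CLI (example.com is a placeholder), plus the back-of-envelope math at those published rates:

    # create a backup hosted zone (the caller reference just needs to be unique)
    aws route53 create-hosted-zone \
        --name example.com \
        --caller-reference "dyn-backup-$(date +%s)"

    # cost estimate at $0.50/zone/month + $0.40 per million standard queries:
    #   500M queries/month  =>  0.50 + 500 * 0.40  =  ~$200.50/month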


That is not helpful if you want vanity name servers.



I assume your clients would prefer working nameservers over vanity ones. Especially if you are in a critical business like PagerDuty.


Latest GitHub NS records moved to awsdns:

        $ dig -tNS github.com @8.8.8.8

        ; <<>> DiG 9.8.3-P1 <<>> -tNS github.com @8.8.8.8
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15616
        ;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0

        ;; QUESTION SECTION:
        ;github.com.                    IN      NS

        ;; ANSWER SECTION:
        github.com.             899     IN      NS      ns-1283.awsdns-32.org.
        github.com.             899     IN      NS      ns-1707.awsdns-21.co.uk.
        github.com.             899     IN      NS      ns-421.awsdns-52.com.
        github.com.             899     IN      NS      ns-520.awsdns-01.net.
        github.com.             899     IN      NS      ns1.p16.dynect.net.
        github.com.             899     IN      NS      ns2.p16.dynect.net.
        github.com.             899     IN      NS      ns3.p16.dynect.net.
        github.com.             899     IN      NS      ns4.p16.dynect.net.

        ;; Query time: 32 msec
        ;; SERVER: 8.8.8.8#53(8.8.8.8)
        ;; WHEN: Fri Oct 21 13:01:48 2016
        ;; MSG SIZE  rcvd: 248

But my local copy is still on dynect:

        $ dig -tNS twitter.com

        ; <<>> DiG 9.8.3-P1 <<>> -tNS twitter.com
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62729
        ;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 4

        ;; QUESTION SECTION:
        ;twitter.com.                   IN      NS

        ;; ANSWER SECTION:
        twitter.com.            75575   IN      NS      ns3.p34.dynect.net.
        twitter.com.            75575   IN      NS      ns4.p34.dynect.net.
        twitter.com.            75575   IN      NS      ns1.p34.dynect.net.
        twitter.com.            75575   IN      NS      ns2.p34.dynect.net.

        ;; ADDITIONAL SECTION:
        ns3.p34.dynect.net.     54698   IN      A       208.78.71.34
        ns4.p34.dynect.net.     81779   IN      A       204.13.251.34
        ns1.p34.dynect.net.     8544    IN      A       208.78.70.34
        ns2.p34.dynect.net.     54775   IN      A       204.13.250.34

        ;; Query time: 0 msec
        ;; SERVER: <local>
        ;; WHEN: Fri Oct 21 13:02:14 2016
        ;; MSG SIZE  rcvd: 179


Your local copy is also of twitter, instead of github :)


I believe you don't understand DNS. It's probably the most resilient service out there (granted, it's used correctly). There's nothing inherent in the protocol that would prevent them from using multiple DNS providers.

> Running a redundant DNS provider is expensive as all hell.

What makes you think that?


Sorry if this sounds dickish, but renting 3 servers @ $75 apiece from 3 different dedicated server companies in the USA, putting TinyDNS on them, and using them as backup servers, would have solved your problems hours ago.

Even a single quad-core server with 4GB RAM running TinyDNS could serve 10K queries per second, based on extrapolation and assumed improvements since this 2001 test, which showed nearly 4K/second performance on 700MHz PIII CPUs: https://lists.isc.org/pipermail/bind-users/2001-June/029457....

EDIT to add: lengthening TTLs temporarily would mean those 10K queries per second go further in lessening the outage, since each answer might be cached for 12 hours; and large ISPs like Comcast would cache the answers for all their customers, so a single successful query delivered to Comcast would have some multiplier effect.
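For concreteness, a minimal sketch of the tinydns "data" file such a backup might serve; all names and IPs below are placeholders, and you'd run tinydns-data over it to build the data.cdb the server actually reads:

    # "." lines create SOA + NS + A records for the listed nameservers
    .example.com:198.51.100.1:a:3600
    .example.com:198.51.100.2:b:3600
    # "+" lines are plain A records, "@" lines are MX
    +www.example.com:203.0.113.10:300
    +api.example.com:203.0.113.11:300
    @example.com:203.0.113.20:mail.example.com:10:3600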


That's not how that should be done. Just use a mix of two providers. Using your own servers and TinyDNS is silly for million/billion-dollar companies.

See MaxCDN for example who uses a mix of dns providers (AWS Route53 and NS1):

    ns-5.awsdns-00.com.   ['205.251.192.5']   [TTL=172800] 
    ns-926.awsdns-51.net.   ['205.251.195.158']   [TTL=172800] 
    ns-1762.awsdns-28.co.uk.   ['205.251.198.226'] (NO GLUE)   [TTL=172800] 
    ns-1295.awsdns-33.org.   ['205.251.197.15'] (NO GLUE)   [TTL=172800] 
    dns1.p03.nsone.net.   ['198.51.44.3']   [TTL=172800] 
    dns2.p03.nsone.net.   ['198.51.45.3']   [TTL=172800] 
    dns3.p03.nsone.net.   ['198.51.44.67']   [TTL=172800] 
    dns4.p03.nsone.net.   ['198.51.45.67']   [TTL=172800]
Curious, are you the kind of person that runs their own SMTP email server and complains about GitHub pricing being too expensive?


No tool is silly as long as it does the job adequately. Are paperclips silly for a billion-dollar company?

If both Dyn and R53 go down, that's exactly when you want a service like PagerDuty to work without a hitch.


You're asserting that your (or their) homegrown DNS service will have better reliability than Dyn and Route53 combined. That assertion gets even worse when it's a backup because people never, ever test backups. And "ready to go" means an extremely low TTL on NS records if you need to change them (which, for a hidden backup, you will), and many resolvers ignore that when it suits them, so have fun getting back to 100% of traffic.

Spoiler: I'd bet my complete net worth against your assertion and give you incredible odds.

Golden rule: fixing a DNS outage with actions that require DNS propagation = game over. You might as well hop in the car and start driving your content to people's homes.
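You can see the size of that hole with a couple of ordinary dig queries; a.gtld-servers.net is one of the real .com registry servers, and pagerduty.com is just an example target:

    # TTL your DNS provider hands out for the NS set
    dig NS pagerduty.com +noall +answer

    # TTL of the delegation at the .com registry itself
    # (shows up in the authority section of the referral, typically 172800s = 2 days)
    dig NS pagerduty.com @a.gtld-servers.net +norecurse +noall +authority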


Idea: Chaos Monkey for DNS outages


I don't know how big PagerDuty is; IIRC over 200 employees, so a decent size.

I was giving a bare-minimum example of how this (or some other backup solution) should have already been set up and ready to be switched over.

DNS is bog-simple to serve and secure (provided you don't try to do the fancier stuff and just serve DNS records): it is basically like serving static HTML in terms of difficulty.

That a company would have a backup of all important sites/IP addresses locally available and ready to deploy on some other service, or even built by hand on some quickly rented servers, is, I think, quite a reasonable thing to expect. I guess it would also be simple to run on GCE and Azure as well, if you don't like the idea of dedicated servers.


Not necessarily. Granted, this is how I would configure a system (two providers), but it is just as sensible to use one major provider which falls back to company servers in the event of an attack like this. It is all down to sysadmin preference: while it is smart to relegate low-level tasks to managed providers, it is also smart to have a backup solution that is under full control, just in case that control needs to be taken at some point.


That would be a quick fix, similar to adding another NS provider. Of course, if Dyn is out completely they might not have their master zone anywhere else; then it's like any service rebuilding without a backup.
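One cheap hedge against losing the master zone is keeping a current copy pulled via zone transfer, assuming your provider allows AXFR from your IP (many don't by default); the nameserver below is a placeholder:

    # pull a standard zone file copy from the primary provider
    dig AXFR example.com @ns1.primary-provider.example > example.com.zone

    # that file can be loaded into BIND/NSD or imported at another
    # provider if the primary ever disappears entirely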


+1 for using a mix of two providers. That's what we do at my startup. Never had a problem since (knock on wood).


+1 for TinyDNS.

I just wish it scaled to multiple cores :(


[flagged]


[flagged]


[flagged]


You can't comment like this on Hacker News. Please read the guidelines:

https://news.ycombinator.com/newsguidelines.html


"Challenges" is exactly the sort of Dilbertesque euphemism that you should never say in a situation like this.

Calling it a "challenge" implies that there is some difficult, but possible, action that the customer could take to resolve the issue. Since that is not the case, it means either you don't understand what's going on, or you're inadvertently mocking your customers.

Try less to make things sound nice and MBAish, and try more to just communicate honestly and directly using simple language.


Running multiple DNS providers is not actually that difficult and certainly not cost prohibitive. I am sure after this, we will see lots of companies adding multiple DNS providers and switching to AWS Route53 (which has always been solid for me).


How am I meant to see Twitter status updates when Twitter is down?


Please check our status page as an alternative method for updates. Unfortunately, it's also been encountering the same issue, so we're sending out an email with the latest updates.


I didnae get an email


It's still a work in progress. If you have any immediate issues please contact us at support@pagerduty.com or (844) 700-3889.


The PagerDuty outage is the real low point of this whole situation. Email alerts from PagerDuty that should have alerted us to the outage in the first place only got delivered hours later, after the whole mess cleared up.


The outage started more than eight hours before you posted this message.


To be fair, Hacker News isn't exactly the first place where companies post status messages. I actually applaud him for posting his message here.



