Out of curiosity, why do caching DNS resolvers, such as the DNS resolver I run on my home network, not provide an option to retain last-known-good resolutions beyond the authority-provided time to live? In such a configuration, after the TTL expiration, the resolver would attempt to refresh from the authority/upstream provider, but if that attempt fails, the response would be a more graceful failure of returning a last-known-good resolution (perhaps with a flag). This behavior would continue until an administrator-specified and potentially quite generous maximum TTL expires, after which nodes would finally see resolution failing outright.
Ideally, then, the local resolvers of the nodes and/or the UIs of applications could detect the last-known-good flag on resolution and present a UI to users ("DNS authority for this domain is unresponsive; you are visiting a last-known-good IP provided by a resolution from 8 hours ago."). But that would be a nicety, and not strictly necessary.
Is there a spectacular downside to doing so? Since the last-known-good resolution would only be used if a TTL-specified refresh failed, I don't see much downside.
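To make the ask concrete, here is a rough sketch of what such knobs might look like in an Unbound-style unbound.conf (the option names are modeled on Unbound's serve-expired settings; treat the exact names and values as illustrative rather than a recommendation):

    server:
        # Keep answering from cache with expired records when the
        # upstream lookup does not produce a fresh answer in time.
        serve-expired: yes
        # Try the upstream first; only fall back to the expired record
        # if no fresh answer arrives within this many milliseconds.
        serve-expired-client-timeout: 1800
        # Cap how long past expiry a record may still be served (the
        # "administrator-specified and potentially quite generous
        # maximum TTL" described above).
        serve-expired-ttl: 86400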
It's called HOSTS and djb's cdb constant database.
And one does not need to use a recursive cache to get the IP addresses. Fetching them non-recursively and dumping them to a HOSTS file and a cdb file can sometimes be faster; I have a script that does that. Fetching them from scans.io can be even faster.
#!/bin/sh
# Build and query a cdb of hostname->IP pairs taken from /etc/hosts.
# Requires djb's cdb tools: cdbmake, cdbdump, cdbget.
cd || exit                        # work from $HOME
[ -c null ] || mknod null c 2 2   # local "null" node for discarding output
                                  # (device numbers are system-specific)
case $# in
0)
  {
    # keep only uncommented lines that start with an IP address
    sed '/#/d; /^[0-9]/!d;' /etc/hosts \
    | while read a b c d
      do
        # cdbmake record format: +keylen,datalen:hostname->IP
        echo "+${#b},${#a}:$b->$a"
      done
    echo                          # blank line terminates cdbmake input
  } \
  | awk '!($0 in a){a[$0];print}' \
  | cdbmake "$0.cdb" "$0.t" || exit
  exec cdbdump < "$0.cdb"
  ;;
1)
  # longer script name: report presence via the exit status only
  test ${#0} = 2 ||
    exec cdbget "$1" < "$0.cdb" > null
  # two-character script name: also print the stored data
  exec cdbget "$1" < "$0.cdb"
  ;;
esac
usage: $0
usage: $0 domainname
The first usage compiles the database and dumps it to the screen.
The second usage checks for the presence of domainname and exits 0 if present, otherwise 100.
The third usage applies when $0 is only two characters long: it checks for the presence of domainname and, if present, prints the IP and domainname in HOSTS format.
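A hypothetical session, assuming the script lives on PATH under a two-character name such as hc, djb's cdb tools are installed, and www.example.com appears in /etc/hosts:

    hc                   # usage 1: rebuild hc.cdb from /etc/hosts and dump it
    hc www.example.com   # usage 2/3: look the name up in hc.cdb
    echo $?              # 0 if the name was present, 100 otherwise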
With all due respect to the enormous reliance on it that has built up over the past decades, DNS is not the internet. It is just a service heavily used for things like email and web. This does not mean that, in an emergency, email and web cannot work without DNS. They once did, and they still can.
The internet runs just fine without DNS. Some software may refuse to honour HOSTS and rely solely on DNS. But that is a vulnerability of the software, not the internet. (And in such cases, e.g., qmail, I just serve my own zone via tinydns, which again is just a mirror of HOSTS.)
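For the curious, that tinydns zone is nothing exotic: it is just a data file compiled with tinydns-data, mirroring the HOSTS entries. A hypothetical fragment (names and addresses made up) looks like:

    # '+fqdn:ip:ttl' adds an A record; '=' would also add the matching PTR
    +www.example.com:192.0.2.10:86400
    +mail.example.com:192.0.2.25:86400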
Be aware that some things (Netflix, Comcast, YouTube) expect you to use your local DNS server so that they can route you to the nearest media server. Using a centralized resolver IP like the ones mentioned here can result in unsatisfactory video streaming... at least that's what I found with our Apple TV.
That's good to know - the ads are the reason I reluctantly switched from OpenDNS to google.
(Reluctantly in that Google already has enough of my data, thanks, through gmail, search, maps, docs and other services, not because it doesn't work well.)
A shame OpenDNS used to redirect me to some spam webpage every time I tried to resolve a domain that didn't exist--they earned a spot on my black list forever. :(
Well, the fact that people still remember goes to show what a truly terrible idea it really was and that it probably did permanent damage to your brand.
I'm not sure what metric you use to judge it as terrible.
I thought it was great. 10,000 companies pay for my service today. 65 million people use my infrastructure today. Cisco bought the company for more than $650m. It continues to innovate on the decades old DNS in secure and useful ways.
I used OpenDNS for a long time. I eventually switched to Google DNS mostly because its IPs are shorter and easier to remember, and I didn't use any of the power user features for OpenDNS. I remember the page full of ads and to be honest I don't begrudge it. We all expect everything given to us for free these days, and then we don't even want the company to make money showing us an ad on the rare occasion that we mistype a URL. It's hard to get paid these days.
Ironically, those unrealistic expectations are probably a significant factor in the growth of data mining and reselling; how else is a free-to-use website that doesn't have any ads (or whose users mostly block ads) going to get paid? You may say "not my problem", but it affects you when you leave the company no option but to resell data on the behaviors they observe from you.
Everyone has preferences, I guess. I far prefer honest and curt to the kind of anodyne, contentless word-payloads pumped out by so many corporate communications departments.
Say, generating corporate communications seems like a promising direction for neural networks. A Markov chain comes close...
They don't do that any more, for what it's worth. I think for a while that was the only revenue stream for what was otherwise a free service. https://www.opendns.com/no-more-ads/
This attitude only promotes the idea that "well, we might as well just continue like this then". If you can never forgive a company for doing wrong when they've corrected themselves years ago, and they now have a track record of doing nothing else that's irked you, then what's the point in them ever bothering to make the change?
If what they do is useful to you but has a feature or bug or something else you don't like, then you absolutely should forgive them if they fix that feature or bug to work in a way you like. They may as well never bother fixing things if they can never be forgiven after repenting their internet sins.
If you've since found something that does do what you want then fair play, fill your boots. Otherwise you're being petty for the sake of being petty.
Imagine migrating your website to a new host. A month later, you learn that a major ISP has decided that its customers don't need to know about the move, because they hold on to last-known-good records as they like. So half your traffic and business is gone. Or maybe you can't use anything run on Heroku, because the dynamism there doesn't play nice with your resolver's policies.
That's the kind of world we used to live in when TTLs were often treated as vague suggestions.
The scenario I was describing was one where a last-known-good resolution would be used if and only if a refresh attempt fails after the authority-provided TTL expires.
I believe the scenario you are describing is a rogue ISP ignoring that authoritative TTL wholesale, caching resolutions according to its own preferences regardless of whether the authority is able to provide a response after the authoritative TTL expires.
The rogue ISPs thought they were helping people by serving stale data. After all, better something past its use-by date than failing, right? A low tolerance for DNS response times, and suddenly large chunks of the internet are failing a lot...
Among other problems, this enables attacks. Leak a route, DDoS a DNS provider, and watch as traffic everywhere goes to an attack server because servers everywhere "protect" people by serving known-stale data rather than failing safe.
Be very, very careful when trying to be "safer". It can unintentionally lead somewhere very different.
> A low tolerance for DNS response times, and suddenly large chunks of the internet are failing a lot...
Hang on a second. I feel that you're piling on other resolver changes in order to make a point. I'm not suggesting that the tolerance for DNS response times be reduced. Nor am I suggesting a scenario where the authority gets one shot after their TTL, after which they're considered dead forever. I would expect my caching DNS resolver to periodically re-attempt to resolve with the authority once we've entered the period after the authority's TTL.
> Leak a route, DDoS a DNS provider, and watch as traffic everywhere goes to an attack server because servers everywhere "protect" people by serving known-stale data rather than failing safe.
I think you're suggesting that someone could commandeer an IP and then prevent the rightful owner from correcting their DNS to point to a temporary new IP.
Isn't the real problem in this scenario the ability to commandeer an IP? The malicious actor would also need to be able to provide a valid certificate at the commandeered IP. And at that point, I feel we've got a problem way beyond DNS resolution caching. Besides, if what you have proposed is possible, isn't it also possible against any current domain for the duration of their authoritative TTL? That is, a domain that specifies an 8-hour TTL is vulnerable to exactly this kind of scenario for up to an 8-hour window. Has this IP commandeering and certificate counterfeiting happened before?
> Hang on a second. I feel that you're piling on other resolver changes in order to make a point.
Yes. The point I am making is about the additional failure modes that need to be considered and the pain they can cause, and historically have caused.
At no point did I ever think you were suggesting that one failure to respond renders a server dead to your resolver forever. Instead, I expect that your resolver will see a failure to respond from a resolver a high percentage of the time, leading to frequent serving of stale data.
> Isn't the real problem in this scenario the ability to commandeer an IP?
You're absolutely right! The real problem here is the ability to commandeer an IP.
However, that the real problem is in another castle does not excuse technical design decisions that compound the real problem and increase the damage potential.
> Instead, I expect that your resolver will see a failure to respond from a resolver a high percentage of the time, leading to frequent serving of stale data.
If this were true, the current failure mode would have end users receiving NXDOMAIN a "high percentage of the time," which obviously is not happening.
{edit: To be clear, I'm reading the quote as you stating that "failure to resolve" currently happens a high percentage of the time, and therefore this new logic would result in extended TTLs more often than the original post would assume they would happen}
> However, that the real problem is in another castle does not excuse technical design decisions that compound the real problem and increase the damage potential.
It's fair to point out that this change, combined with other known issues, could create a "perfect storm," but as was pointed out, this exploit is already possible within the current authoritative TTL window. Exploiting the additional caching rules would just be a method of extending that TTL window.
On the other hand, where do you draw the line here? If you had to make sure that no exploits were possible most of the systems that exist today would never have gotten off the ground. It seems a bit like complaining that the locks to the White House can be exploited (picked), while missing the fact that they are only supposed to slow someone down before the "men with guns" can react.
Based on the highly unscientific sample of the set of questions asked by my coworkers in my office today, the failure mode of end users receiving NXDOMAIN has happened much more than on most days.
I don't need to make sure no exploits are possible. However, if at all possible, I'd like to help ensure that things aren't accidentally made more dangerous. It's one thing to consider and make a tradeoff. It's quite another to be ignorant of what the price is.
Well, it obviously happens when the resolver is down, but that's the situation that this logic is being proposed to smooth over. The normal day-to-day does not see a high percentage of resolvers failing to respond, or else people would be getting NXDOMAIN for high-profile domains much more often.
All the attacks mentioned here seem to be of the following shape:
1. Let's somehow get a record that points at a host controlled by us into many resolvers (by compromising a host or by actually inserting a record).
2. Let's prolong the time this record is visible to many people by denying access to authoritative name servers of a domain.
(1) is unrelated to caching-past-end-of-TTL, so you need to be able to do (1) already. (2) just prolongs the time (1) is effective and requires you to be able to deny access to the correct DNS server. Is it really that much easier to deny access to a DNS server than it is to redirect traffic to that DNS server and supply bogus responses?
DNS cache poisoning is currently a very common sort of attack. The UDP-y nature of DNS makes it very easy. There are typically some severe limitations placed on the effectiveness of this attack by low TTLs. It does not require you to deny access to the authoritative server. This attack is also known as DNS spoofing: https://en.wikipedia.org/wiki/DNS_spoofing
Ignoring TTLs in favor of your own policy means poisoned DNS caches can persist much longer and be much more dangerous.
Right now, to keep a poisoned entry one must keep poisoning the cache.
In that world, one can still do that. One can also poison the entry once and then deny access to the real server. You seem to be arguing that this is easier than continuous poisoning. Do I understand you correctly?
You are correct in your assessment of the current dangers of DNS poisoning.
I am in no way arguing about ease of any given attack over any other. I am arguing that a proposed change results in an increased level of danger from known attacks.
I'm arguing that the proposed change at hand, keeping DNS records past their TTLs, makes DNS poisoning attacks more dangerous because access to origin servers can be denied. Right now TTLs are a real defense against DNS cache poisoning, and the idea at hand removes that in the name of user-friendliness.
The way I read your argument, it relies on denying access being cheaper or simpler than spoofing (X == spoofing, Y == denying access to authoritative NS):
You are arguing that a kind of attack is made more dangerous, because in the world with that change an attacker can not only (a) keep performing attack X, but can also (b) perform attack X and then keep performing Y. If Y is in no way simpler for the attacker, why would an attacker choose (b)? S/he can get the same result using (a), in that world or in ours.
Am I misreading you or missing some other important property of these two attack variants?
I believe you may have failed to consider the important role played by reliability.
X cannot always be done reliably - it usually relies on timing. Y, as we've seen, can be done with some degree of reliability. Combining them, in the wished-for world, creates a more reliable exploit environment because the spoofed records will not expire. The result is more attacks that persist longer and are more likely to reach their targets.
Such a world is certain to not be better than this one and likely to be worse.
I appreciate the support. But FWIW, I don't think Kalium was trolling. Although he (I assume, but correct me if I am wrong) and I disagree on the risk versus reward of extending the time-to-live of cached resolutions beyond the authoritative TTL, I nevertheless appreciated and enjoyed his feedback.
I'm afraid we're simply going to have to agree to disagree on this point. I do not share the opinion that this is a good idea with significant upside and virtually no downside. I also do not agree that none of the issues I have raised apply to the original suggestion - I believe they do apply, which is why I raised them.
WRT the second attack, what they're referring to is actually DNS cache poisoning - inserting a false record into the DNS pointing your name at an attacker-controlled IP address. This is a fairly common attack, but usually has an upper time limit - the TTL (which is often limited by DNS servers).
This proposal would allow an attacker to prolong the effects of cache poisoning by running a simultaneous DDoS against un-poisoned upstream DNS servers.
Not sure whether it could be used in a legitimate attack (probably), but it can definitely lead to confusing behavior in some scenarios. You switch servers, your old IP is handed to some random person, your website temporarily goes down - and now your visitors end up at some random website. Would you want that? Especially if you're a business?
Also, "commandeering" an IP of a small hosting might be easier than you think. It depends entirely on how they recycle addresses.
You seem to be continuing to warn against a proposal that isn't the one that was made. What specifically is dangerous about using cached records only in the case of the upstream servers failing to reply?
It doesn't take much of an imagination to attack this.
The older I get in tech the more I realize we just go in circles re-implementing every bad idea over again for the same exact reasons each "generation". Ah well.
TTL is TTL for a reason. It's simple. The publisher is in control, they set their TTL for 60 seconds so obviously they have robust DNS infrastructure they are confident in. They are also signaling with such low TTLs that they require them technically in order to do things like load balance or HA or need them for a DR plan.
Now I get a timeout. Or a negative response. What is the appropriate thing to do? Serve the last record I had? Are you sure? Maybe by doing so I'm actually redirecting traffic they are trying to drain and have now increased traffic at a specific point that is actually contributing to the problem vs. helping. How many queries do I get to serve out of my "best guess" cache before I ask again? How many minutes? Obviously a busy resolver (millions of qps at many ISPs) can't be checking every request so where do you draw the line?
It's just arrogant I suppose. The publisher of that DNS record could set a 30 day TTL if they wanted to, and completely avoid this. But they didn't, and they usually have a reason for that which should be respected. We have standards for a reason.
- Attacker generates or acquires counterfeit facebook.com certificate.
- DDoS nameservers
- facebook removes IP from rotation
- Users still connect to bad actor even though TTL expired

"We have standards for a reason" is absolutely correct, and we can't start ignoring the standards because someone can't imagine why we need them _at this moment_.
> - DDoS nameservers
> - facebook removes IP from rotation
> - Users still connect to bad actor even though TTL expired
I understand what you are saying, but this attack scenario is extraordinarily difficult as a means to attack users who have opted to configure their local DNS resolver to retain a last-known-good IP resolution. It involves commandeering an IP and counterfeiting Facebook's SSL/TLS certificate. As I have said elsewhere in this thread, all sites are currently vulnerable to such an attack today for the duration of their TTL window. So if this is a plausible attack vector, we could plausibly see it used now.
You're right! Completely, absolutely, 100% right. If this was a plausible attack vector, we could see it used now. And you know what? We do!
This is why some people are concerned about technical decisions that make this vector more dangerous. Systems that attack by, say, injecting DNS responses already exist and are deployed in real life. The NSA has one - Quantum. Why make the cache poisoning worse?
If my adversary can steal an IP from Facebook, create a valid certificate for facebook.com, and provide bogus DNS resolution for facebook.com, I feel it's game over for me. My home network is forfeit to such an adversary.
But I get your point. It's about layering on mitigating factors. The lower the TTL, the lower the exposure. Still, my current calculus is that the risk of being attacked by such an adversary is fairly low (well, I sure hope so), and I would personally like to configure my local caching resolver to hold onto last-known-good resolutions for a while.
All that said, I have to hand it to you and others like you, those who keep the needle balanced between security and convenience.
Now that I think about it more, it's even worse than that. A bogus non-DNSSEC resolution and a forged cert, both of which are real-life attacks that have actually happened, and you're done for. Compromising an IP isn't really necessary if you're going to hang on to a bad one forever, but it's a nice add-on. It removes the need to take out the DNS provider, but we can clearly see that that is possible.
Keeping the balance between security and convenience is difficult on the best of days. Today is not one of them. :/
It's part of what I would use them for. A big, splashy attack distracts a bunch of people while you MITM something important with a forged cert? Great way to steal a bunch of credentials with something that leaves relatively few traces while the security people are distracted.
> - Attacker generates or acquires counterfeit facebook.com certificate.
So you enabled an attack vector that has to be nullified by a deeper layer of defense? And in some cases possibly impacted by a user having to do the right thing when presented with a security warning.
Why would you willingly do that?
Also, I do find your assumption of ubiquitous TLS rather alarming - facebook is a poor example here; there are far softer and more valuable targets for such an attack vector to succeed.
Edit: Also to keep my replies down...
> I would personally like to configure my local caching resolver to hold onto last-known-good resolutions for a while.
You can! All these tools are open source, and there are a number of simple stub resolvers that run on linux (I'd imagine OSX as well) which you can configure to ignore TTL. They may not be as configurable as you like, but again they are open source and I'm sure would welcome a pull request :)
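For instance, dnsmasq exposes a flag along these lines (a sketch with an illustrative value; note that dnsmasq caps this particular option at an hour unless you rebuild it):

    # never cache a record for less than 30 minutes, whatever its TTL says
    dnsmasq --min-cache-ttl=1800

That enforces a floor on TTLs rather than serving stale data only on upstream failure, but it shows the general shape of the knobs available.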
The policy that caused so much pain before is to take DNS records, ignore their TTLs, and apply some other arbitrarily selected policy instead. I confess, I don't understand how the proposal at hand is different in ways that prevent the previous pains from recurring.
Maybe you can enlighten me on key differences I've overlooked? How do you define "failing to reply"? Do you ever stop serving records for being stale, or do you store them indefinitely?
So if our resolver was on our laptop and had a nice UI, that would work great. Now the question is: why is the resolver not on my laptop?
It can be if you want it to be, but it's probably much less interesting than you think.
You likely underestimate the sheer number of DNS records you look up just by surfing the web, and overestimate how useful that information would be to 99.99% of users.
Basically the tools exist for you to do this yourself if you are so inclined, but they may not be that user friendly since they aren't generally useful to most.
Seconded. This is a common idea which occurs to people who haven't dealt with DNS before, and it ends with a much better understanding of how many things HTTPS is used for, and with going back to using OpenDNS as your resolver.
It'd be nice to have a "backup TTL" included, to allow sites to specify whether and how long they wanted such caching behavior.
Also, that cache would need to only kick in when the server was unreachable or produced SERVFAIL, not when it returned a negative result. Negative results returned by the authoritative server are correct, and should not result in the recursive resolver returning anything other than a negative result.
> Also, that cache would need to only kick in when the server was unreachable or produced SERVFAIL, not when it returned a negative result. Negative results returned by the authoritative server are correct, and should not result in the recursive resolver returning anything other than a negative result.
Precisely. I am not suggesting any change to how a caching resolver comprehends valid responses from the authoritative server for a domain. For example, if the authoritative server says, "No such domain," then the domain is understood to be gone. At that point, the domain being gone is in fact the last-known-good resolution.
This is why TTL is a variable. If you want to have your records last longer, you can. If you want to later shorten them, you can. People screw DNS up enough already, let's not make it worse by adding layers of TTL.
It might be a stretch to use the information, but the SOA RR does contain an EXPIRE field, defined as "A 32 bit time value that specifies the upper limit on the time interval that can elapse before the zone is no longer authoritative." It's an additional request, but the SOA RR does contain the type of information you are asking for.
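For example (output values are illustrative):

    $ dig +short SOA example.com
    ns.icann.org. noc.dns.icann.org. 2016102401 7200 3600 1209600 3600

The five numbers are serial, refresh, retry, expire, and minimum; here EXPIRE is 1209600 seconds, meaning a secondary may keep serving the zone for up to two weeks after losing contact with the primary.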
I've been thinking of adding this exact feature to my DNS framework that I've been working on (if github was resolving): https://github.com/bluejekyll/trust-dns
To be perfectly honest, a "feature" like this has no business being in a safe and secure DNS server. You should fail-safe, rather than serving stale data of unknown safety.
Serving data you cannot verify is a dangerous failure state.
Perhaps. In this case the web was down. I definitely understand the point: stale data with TTLs which have expired, especially on RRSIG records, is dangerous.
But I have to wonder, in situations like this where DNS has been taken down, what the greater good is. If the records can be proven to have been cached as authentic data at some point within some period of time (in this case, hours), is it for the greater good that stale authentic records are acceptable to serve back? Here, a stale period of some number of hours would have been helpful.
Would you believe me if I said that the DNS protocol itself has an answer for this? The answer is in the basic design of what a TTL is. It's preferable to serve nothing than to serve known bad data. Stale data is a form of bad data.
As another user put it, we have these standards for good reason.
> Is there a spectacular downside to doing so? Since the last-known-good resolution would only be used if a TTL-specified refresh failed, I don't see much downside.
Because you would keep old DNS records around forever if a server goes away for good. So you need to have a timeout for that anyway.
1) Memory and disk are cheap. My caching DNS resolver can handle some stale records.
2) I suggested above that this behavior would continue until an administrator-specified and potentially quite generous maximum TTL expires. That is, I could configure my caching DNS resolver to fully purge expired records after, for example, 2 weeks.
> 1) Memory and disk are cheap. My caching DNS resolver can handle some stale records.
The problem is not that it would require storage but that stale records can be outright wrong. That timeout would require configuration and DNS does not provide that.
So sure, a new timeout could be introduced but that currently does not exist in DNS.
> The problem is not that it would require storage but that stale records can be outright wrong.
Again, the scenario is that the authoritative/upstream resolver cannot be reached in order to refresh after the authority-provided TTL expires. Are you saying that in the case of a service having been intentionally removed from the Internet (the domain is deactivated; the service is simply no more), my caching resolver will continue to resolve the domain for a time? Yes, it would. What's the downside though?
> That timeout would require configuration and DNS does not provide that.
Yes. This would be a configurable option in my caching DNS resolver, in the same vein as specifying the forwarders, roots, and so on. But to be clear, this would not be a change to the DNS protocol, merely a configuration change to control the cache expiration behavior of my resolver. I don't want to sound flippant; I'm just not sure I understand the point you're trying to make here.
If a server goes away for good, at some point NS records will stop pointing to it. We could serve stale records as long as all of the stale record's authority chain is either still there or unreachable.
I've had an IP address from a certain cloud provider for a month. Some abandoned domain still has its nameserver and glue records pointing to the IP, and I get DNS queries all the time.
The domain expires in January. I hope it's not set to auto-renew. :-)
Note that this is already happening. The only thing my proposal would change is that it would also affect servers that used to be authoritative for subdomains of such abandoned domains. I would expect there to be very few of them: very few domains have delegations of subdomains to a different DNS server and they are larger and thus less likely to be abandoned.
HTTP has a good solution/proposal for this: the server can include a stale-if-error=someTimeInSeconds directive in addition to the normal cache lifetime, and then every cache is allowed to continue serving stale data for the specified time while the origin is unreachable. Probably a good idea to include such a mechanism in DNS, too.
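Concretely, that is RFC 5861's stale-if-error extension to Cache-Control (values illustrative):

    Cache-Control: max-age=300, stale-if-error=86400

That is, treat the response as fresh for five minutes, but if revalidation hits an error, a cache may keep serving the stale copy for up to a day. A DNS analogue would be exactly the kind of second, failure-only TTL being proposed here.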
FWIW, since the comment was a reply to my message above:
It provided value by answering my question concerning serious downsides to providing optional post-TTL last-known-good caching within a DNS resolver. The answer is implicit in that a major DNS resolver provides exactly this functionality.
I seem to remember that DNS has generally been reliable (until recently, I guess), so probably nobody has ever thought that to be necessary.
You could write a cron script that generates a date-stamped hosts file based on a list of your top-used domain names, and simply use that on your machine(s) if your DNS ever goes down. That's basically a very simple local DNS cache.
If you feel like living dangerously, have it update /etc/hosts directly.
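A rough sketch of such a script (the domain list and output path are made up; adapt to taste and point a crontab entry at it):

    #!/bin/sh
    # Resolve a list of frequently used names and write a date-stamped
    # hosts-format snapshot that can be copied into /etc/hosts if needed.
    out=$HOME/hosts.$(date +%Y%m%d)
    : > "$out"
    for name in news.ycombinator.com example.com example.net
    do
        # dig +short also prints CNAME targets, so keep only address lines
        ip=$(dig +short "$name" A | grep '^[0-9]' | head -n 1)
        [ -n "$ip" ] && printf '%s\t%s\n' "$ip" "$name" >> "$out"
    done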
> I seem to remember that DNS has generally been reliable (until recently, I guess)
Probably because people used to use long TTLs (1 hour, 4 hours, whatever) and now the default behavior in services like Amazon Route 53 is to use 5 minutes.
The 20 seconds with Akamai is because of their dynamic end-user IP mapping technology. Basically, they need to map in near real time based on characteristics of the end-user IP; they can't afford a long TTL.
It's not illegal to have a TTL that short, but it certainly feels like a violation of some implicit contract between users and providers. Of course, the root cause of this is the horrendous hack of using DNS for CDN routing. It doesn't have to be that way... I wrote a recent article about this very issue here
I haven't heard of PacketZoom before; I'll definitely take time over the weekend or next week to dig into your approach.
I wouldn't call DNS based IP mapping "horrendous" simply because it doesn't work as well for mobile. I understand you have your own pitch, but let's go easy on the hyperbole :)
The fact is that it is still very effective. The major CDNs are quite aware of the mobile shortcoming of DNS based mapping and I am pretty sure it is something they are working to address.
At the end of the day, location is just one component involved in accelerating content; there are plenty of other features that various CDNs use to deliver optimal performance.
Regarding the short TTLs, I get your argument; it is indeed like a user's browser is constantly chasing a moving origin. The alternative, however, is a non-optimized web, which would be orders of magnitude worse. Remember, the benefits of CDNs don't just accrue to end users but also to content providers; most origin servers can't handle even the slightest uptick in traffic.
"I wouldn't call DNS based IP mapping "horrendous" simply because it doesn't work as well for mobile"
OK I'll take back the word "horrendous" but it's a hack alright.
> The fact is that it is still very effective. The major CDNs are quite aware of the mobile shortcoming of DNS based mapping and I am pretty sure it is something they are working to address.
No, not really. They're certainly trying to patch DNS to pass enriched information in DNS requests through recursive resolvers... but it's such a long shot to work consistently across tens of thousands of networks around the world, and requires coordination from so many different entities, that it's clearly a desperation move more than a serious effort. Regardless, there's no real solution in sight for the web platform.
For mobile (native) apps though, the right way to discover nearby servers is to directly build in that functionality using mobile specific techniques. There's no reason to keep limiting mobile apps to old, restrictive web technologies considering that apps have taken over as majority of traffic around the world. That's the root idea behind a lot of what we're doing at PacketZoom. Not just in service discovery, but also in more intelligent transport for mobile with built-in knowledge of carriers and network technologies etc, automatic load-balancing/failover of servers and many other things. Here's my older article on the topic
Say I want to implement my own dynamic DNS solution on a VPS somewhere - if I set short TTLs am I causing problems for someone? How short is too short?
I think a problem that you might be overlooking is that DNS lookups aren't just failing; they are also very slow when a DDoS attack is underway on the authority servers. This introduces a latency shock to the system, which causes cascading failures.
All of this will break the moment one of the websites you access makes a server-side request to another website (think logging services, server clusters, database servers, etc.; they all have either IPs or, most likely, domains).
The scenario is that my local network's caching DNS resolver retains resolutions beyond the authority-provided TTL in the event that a TTL-specified refresh at expiration fails. Therefore, my web browser may—in the very rare situation where this arises—make an HTTP request to an IP address of a server that has been intentionally moved by a service provider (let's assume they did so expecting their authoritative TTL to have expired). Since this scenario only arises because my caching resolver wasn't able to reach the authority, I'm not seeing a downside.
But if I understand your reply correctly, you are saying that the web server I've contacted may, in turn, be using a DNS resolver that is similarly configured to provide last-known-good resolution when its upstream provider/authority cannot provide resolution. This would potentially result in that web server making an HTTPS API request to a wrong IP, again only in the rare case where we have defaulted to a last-known-good resolution. I'm not really seeing the problem here except that the HTTPS request might fail (if the expected service was moved and no longer exists at the last-known-good IP), but how is that worse than the DNS resolution having failed? In both cases, the back-end service request fails.
One possible issue is that IPs are re-used in cloud environments. Potentially, your browser could POST sensitive data to an IP address that now belongs to a totally different company.
I mean, hopefully it is over HTTPS so they can't do anything with it... but if it isn't then it can definitely happen. Our servers get random web traffic all of the time.
True. I am not aware of any web services POSTing sensitive data over the public Internet that don't use HTTPS. If your service is sending sensitive data over HTTP without TLS, I feel the problem is bigger than a potential long-lived DNS resolution.
> I mean, hopefully it is over HTTPS so they can't do anything with it...
DV certs only rely on you being able to reply to an HTTP request, so if any CA was using such a caching DNS server, you could probably get a valid cert from them.
I think you understood me. Maybe I can explain more.
Say you have a service `log.io` with its own DNS servers (running named or djbdns), and one day you decide to shut them down and rename the service to `loggy.io`.
What will happen is that any resolver trying to query the `log.io` DNS will reach an unreachable server, which will lead to serving the last-known IP from the proposed DNS cache on your machine.
If you don't use a forever-fallback cache after the TTL expires, you will just reach an unreachable server and get back no IP address.
This is the equivalent of retiring the domain name itself. If you stop renewing it, anyone can hijack it and serve whatever they like. Not to mention, they will also get email intended for that domain.
Anyone sane will keep the domain name and NS infrastructure and serve a 301 HTTP redirect.
All anyone is proposing here is to override the TTL to something longer (like 48h) if the nameserver is unreachable.
Of course the perfect solution would be to have the recursive nameserver fetch the correct record from a blockchain.
> Say you have a service `log.io` with its own DNS servers (running named or djbdns), and one day you decide to shut them down and rename the service to `loggy.io`.
> What will happen is that any resolver trying to query the `log.io` DNS will reach an unreachable server, which will lead to serving the last-known IP from the proposed DNS cache on your machine.
To reiterate the scenario you've put forth as I understand it: I'm a service operator and I've just renamed my company and procured a new domain. I've retired the old domain and expect to fulfill no more traffic sent to that domain. When a customer of my service attempts to resolve my old domain, their caching DNS resolver may return an IP address even though I have since shut down the authoritative DNS servers for the retired domain. They will make an HTTPS request to my servers (or potentially someone else's, if I also gave up my IP addresses), and fail the request because the certificate will be a mismatch.
The customer's application will see a failed request either by (a) DNS failing to resolve or (b) HTTPS failing to negotiate. Either way, my customer needs to fix their integration to point to my new domain.
To be clear, it is up to my customer to decide whether they want their caching DNS resolver to provide a last-known-good resolution in the event that authoritative servers are unreachable. If they prefer failure type (a) over (b), they would configure their DNS resolver to not provide last-known-good resolutions.
Exactly. You can't know if it is still valid, so you might send clients to an IP that's now controlled by somebody else. Worst case, they know and set up a phishing site. DNS generally has been reliable enough that the trade-off is not worth it.
> Worst case, they know and set up a phishing site.
They'd need to specifically gain access to the last known good IP address, which might be different depending on which DNS resolver you talk to (geodistribution, when the record was last updated, etc). I wouldn't really consider that a realistic attack vector.
Within a small hosting provider this might be pretty simple. An attacker might lease a bunch of new servers and get an IP that was recently released. Then they could launch a DDoS to force address resolution in their favor. It's a bit far-fetched, but a lot of very successful attacks seem that way until someone figures out a way to pull them off.
Sure, within a small host or ISP, that may be doable; even then, I'd consider it a stretch. But if we were to limit it to those constraints, when will this ever be exploited except as a PoC? No entity large enough to be worth attacking this way has an architecture where this would be feasible (due to things like geodistribution and load balancing), nor would it be hosted on a provider so small that its IP pool can be exploited in the manner described. If I were an attacker, I'd focus on something with a much bigger RoI.
I like this idea. Grab a new elastic IP on AWS. Set up an EC2 listening on 80, maybe some other interesting ports, and see if anything juicy comes in. Or just respond with some canned SPAM or phishing attempt. Repeat.
Yes. You either positively know the right answer, or you return the fact that you don't know the right (currently valid per the spec) answer. The right answer in the situation you posed is "I don't know".
That was exactly my thought. This may be unrelated, or it may be a test run. But a large scale attack on Election Day that crippled communications would stir up unrest for a variety of reasons. Although I think that's highly unlikely to change the outcome, unrest after such a contentious election is not good.
What can they do? It's not Twitter themselves being DDoSed; it's a DNS provider. This propagates up the chain to impact both a Tier 1 network and cloud providers, which hits tons of stuff on top of that.
If you utilize the geo-IP routing features of one provider, it can be difficult or impossible to then ensure repeatable/deterministic behavior on a second provider.
Assuming you're referring specifically to targeting media companies reporting on the results and not the electric grid like someone else mentioned, wouldn't they have to DDoS Google itself for that to work? I don't really see a DDoS of Google being effective.
Probably not a great idea. If the internet went down at my work, none of us would be able to do anything, so we'd probably all head out to the polls just because we have nothing better to do. Unintentionally increased turnout.
This is terrifying. Thankfully I don't think much actual voting infra is network reliant. But it could probably delay the results from being finalized for days, and allow Trump to spew further allegations of rigging.
Though if they targeted electric grid, water, and public transport, starting early in the day and choosing the regions by their populations political leaning, it could easily have an effect on the result itself.
You don't need to target voting infrastructure. You target media infrastructure (DNS, streaming, web media) in order to either reduce or shift voter turnout. A candidate ahead in a battleground state? You stomp on media reporting to ensure their opponent's voters aren't dissuaded from heading to the polls.
Control the message, and through that the actual votes cast.
Yeah, it just needs to be "The internet was broken so your votes were lost" and then some made-up post-hoc explanations that 90% of people don't understand, so they can't dispute them.
I'm working at the polls in CA, and can verify this; all critical information is moved by sneakernet with a two-person rule on its handling.
Of course, I have no information on the security model of the pre-election preparations and post-election tabulation, but luckily results for each polling place are also posted for the public to inspect - media outlets and campaigns can verify the tabulation themselves with a slight delay.
Hahahaha. For sure it's not supposed to be network reliant. But from my experience working on critical infra, even things like power grids and rail systems, this is almost never the case.
I think you're right, not much of the voting infra is network reliant, but the more I think about it the more it seems that the "fear" factor of such outages could influence the election. Or, perhaps a curated working set of information sources, thanks to selective DDoS. Regardless, terrifying to be sure.
Is it really important who wins if there are only two candidates that share a common view on many problems? And you don't need the Internet to count votes anyway.
It's okay. James Comey, the FBI chief, said the US electoral system is such a mess, it would be too hard for an attacker to hack it or damage its integrity in any way. It's all good.
From what I know of the situation (don't trust me, I'm not going to offer citations or sources), this attack wasn't particularly large in terms of gigabits/second. It was, however, very large in terms of economic impact.
I would assume that when a large number of big enterprise-y things go down, HSI takes notice. When other providers get attacks that are 20x larger (gbit/sec), but have much less widespread impact and impact on less enterprise-y things, they don't care so much.
It doesn't make a whole lot of sense for the USA to take down the internet, as they benefit the most from it. A significant fraction of that economy is based on it, much larger than in the cases of China and Russia. It would be like the owner of a coal mine campaigning for a carbon emissions tax: maybe there's something we don't know, but from the information we have it seems unlikely.
Note that this wouldn't rule out the USA as such. First, it could be a longshot preparedness thing, with no expectation that it would ever be used. Second, they could be red-teaming the thing (looking for weaknesses so that they can arrange for them to be shored up).
In either of these scenarios, it's no less likely that the USA would be doing it than anyone else. If you assume that whoever is doing this is planning to use their knowledge, however, the economic argument makes the USA less likely to be involved.
There are many types of actors even within the USA nation-state / government.
For example, if a particular part of the government got wind of a data dump about to be released by another nation-state or independent actor (for example, a leak of some kind), I think some parts of the USA government that possess the ability to do so wouldn't hesitate to take down DNS for the entire internet to avoid another data leak similar to the Snowden dump.
Be really wary of attributing intent: you do not know who will benefit the most from taking down certain services. To claim that the US benefits from the internet so much that it wouldn't do certain actions to protect itself from certain types of harm is shortsighted.
Even my example could be really wrong, but the idea is that nobody really can say - "oh the internet is too important to xyz, they'll never do anything!"
I'm probably wrong, but this is how I see it (not sure about the OP).
News cycles happen fairly rapidly, so if you could take down a number of sites that might be friendly to the dissemination of potentially damaging information just long enough such that it's forgotten about, or the attack is so large the media talks about the attack instead, then you might be able to successfully avoid widespread public knowledge of such information. Though, this would be best aided with collusion or cooperation (intentional or otherwise) from the media. Toss in a few unrelated services as a bonus for collateral damage, and you might be able to avoid scrutiny or, at the very least, shift the blame to an unrelated state actor. It won't prevent the release of information, but that's not the point--you want to prevent the dissemination and analysis of that information by the public at large.
This is all hypothetical, of course, and not likely to work. It also comes with the associated risk that if you were discovered or implicated, public outrage might be even worse than if you allowed the release of the information you hoped to distract from in the first place! As such, I can't imagine anyone would be stupid enough to try.
If you take them offline before they've managed to disseminate the info, then it can't be forgotten because nobody knew about it in the first place. Which means when the sites come back on, the info is still newsworthy.
What kind of information would be so sensitive as to risk crashing the economy over, yet so trivial that people would forget about it because they couldn't access Twitter for a day? I get that it's more nuanced than that, but I'm really struggling with this scenario; sensitive information tends to get out if it's important enough, even if you're willing to kill a bunch of people.
It renders the server that's hosting a leak unable to broadcast the leak temporarily, while they arrange more conventional measures to seize it. It's a more rapid response than getting a warrant and a police team on location. The broad nature of the attack also avoids tipping off the server owners.
I still give it less than a 5% probability, though.
Honestly, that's fairly thin. WL uses torrents and other means of disseminating data that don't rely on central control structures. Plus, presumably, WL has the ability to quickly shift data into secure hands who are willing to release it when things quiet down.
So, sure, the USA could go send someone to seize the hard drives of someone who has confidential information. But I have to imagine one of the first steps when getting that kind of information is to disseminate it to others (at least some of whom are unknown to the states). If they were hit, these people would very quickly take that as a signal to indiscriminately release all the information.
Or they could take it down for enough time for other parts of the government and/or international diplomatic system to do their work.
Remember, a data leak is not just a technical issue. They can resolve it in any number of ways - get a small team incursion into another state's territory for extraction, etc. All the outage needs to do is to hold open that window for enough time for all the different parts of the entire threat response chain to do each part's job.
A lot of technical people think tech is the end, but no - if you get a small team to go knock on the person's door, get your internet response team to shut down DNS, or get someone on site at the telco to perform certain actions at the router/switch level, all of those portions working together is a powerful way to resolve or accomplish certain goals.
Think bigger, especially with state actors - the resources are there, and this line of thought is probably really basic stuff that people came up with in the 1960s or 70s (even when the ARPANET was being created, there was probably already a team tasked with taking such actions - it only makes sense to have two teams working on such goals in tandem: one to create the network, the other to take it down).
For either of your cases, I can't imagine the value of having it last this long, or even be this severe. The impact on the economy and trust in an infrastructure company is too high.
I'd be more willing to put my money on someone attacking an entity downstream who is normally immune to DDoS attacks of this size.
Total and absolute speculation follows: If the US wanted reliable take-down capability, they might want to test it first, and it would be least provocative if they tested here in the US.
As for the length of the "test," they might want to see how the US would react to such attacks in the future, and shake out anything critical. "Oh, these two agencies can't talk to each other. Good to know."
The usual thinking goes something like: well, the US created the internet, so why would they want to take it down? Yes, the NSA spies and all that, but they need the internet up to do that, and also, as bad as the NSA is, it's nowhere near as bad as China or Russia, where they ... (ranges from censorship to eating babies alive)
As someone else pointed out this is a classic false flag operation. Look up the Gleiwitz incident and Operation Northwoods. It can be very effective. With good opsec and anonymity online it can be even easier.
> kind of conspiratorial thinking?
You mean like lizard aliens infiltrating our planet? -No. But in the realm of "shooting down of passenger and military planes, sinking a U.S. ship in the vicinity of Cuba, burning crops, sinking a boat filled with Cuban refugees, attacks by alleged Cuban infiltrators inside the United States, and harassment of U.S. aircraft and shipping and the destruction of aerial drones by aircraft disguised as Cuban MiGs", yes.
It was mostly a reply to "US would have absolutely no reason for doing this", and the reply is that there could be a plausible reason.
And the ratio of false flag operations to false accusations of false flag operations is about 1:99. Lots of unlikely things "do happen," but if that's the immediate explanation you reach for you're going to be wrong most of the time.
You might want to take this up with that "rdtsc" guy who wrote "As someone else pointed out this is a classic false flag operation." He seems to disagree with you on those points.
Oh right, I know him! He is a decent fellar. I think he was saying that if the US is attacking its own infrastructure, then a false flag operation is a plausible explanation. Talking to him a bit revealed he didn't say this was the US attacking itself.
From the context of the paper, because the USA could just send a three-letter-agency agent of some sort to Dyn, a US-based company, and ask what their infrastructure looks like? (Presuming of course some weird scenario where they weren't already tracking it, which seems unlikely.)
Who is "we"? The u.s. government is a conglomerate of interests, organizations, individuals... Many of whom are quite indiscriminate. I'm not at all proposing that the U.S. was involved here. I'm questioning the simple identification of the U.S. government with the word "we", and the corresponding assumption that this institution is integrated in a carefully discriminating way...
Because it's being down voted? I think any down votes are more likely because the comment doesn't add substantively to the conversation and is distracting given other recent threads.
Um, if you are basing your logic on Orwellianness, then England should be your top candidate: massive monitoring of the populace, severe restrictions on the ability to carry anything that is vaguely pointy or goes bang, poor freedom-of-speech rules (relative to the US)... I love England, but as far as Orwellian societies go, you can do notably worse than the US.
As someone living in England, I can confirm that a) we are an extreme surveillance society, which the general population neither really understands, nor cares about b) the vast majority of us are very grateful we don't have (legalised) guns on the streets, and we suffer from a much lower homicide rate as a result
Schneier:
"I have received a gazillion press requests, but I am traveling in Australia and Asia and have had to decline most of them. That's okay, really, because we don't know anything much of anything about the attacks.
If I had to guess, though, I don't think it's China. I think it's more likely related to the DDoS attacks against Brian Krebs than the probing attacks against the Internet infrastructure, despite how prescient that essay seems right now. And, no, I don't think China is going to launch a preemptive attack on the Internet."
If this is supposed to be "taking down the internet", then I'm not impressed. Using cached DNS still gives access to any service. I'm even typing here on HackerNews.
If this is another practice run, then I'm still not impressed. Taking down one provider is not that hard. Good luck finding the resources to do this DDoS to ALL large DNS providers out there.
Maybe it's not really fair to link to that post every time a DDoS with more than average payload happens. Especially since the post doesn't mention any specifics, because well, "protect my sources". It's like the "buy gold now" guy starring in the 2 AM infomercial predicting an economic recession within the next 5 years, without adding what the exact cause is going to be. He is probably going to be right, but that doesn't make him a visionary.
I work as a freelancer and today I didn't get paid and that's just me. Companies probably lost millions today. By just one DDoS to one DNS provider.
Yeah, it's not the whole Internet, but how do you define "taking down the Internet" anyway? Is it every connected computer, or just a huge amount of interconnected big websites? Because the latter is happening right now.
- Hillary Clinton's personal email server was hacked a while ago.
- A lone hacker published a document obtained by hacking the DNC servers. The document includes opposition research on Donald Trump and how Hillary can attack him in the election.
- Wikileaks published emails obtained by hacking the DNC
- US intelligence agencies confirmed that Russia was behind the DNC hack
- This is happening while the war in Syria and Iraq is growing. The Russians are there to "fight ISIS" but they have deployed an air defense system even though ISIS doesn't have any air force.
- Russia's only aircraft carrier is trespassing through UK waters to get to Syria in a show of force that doesn't really add anything to their military capabilities there.
- Finland (yes, Finland) is increasingly worried about Russia. They violated their airspace, and they're questioning Finland's independence. Finland shares a long border with Russia.
- US election is in 3 weeks and Donald Trump is openly in love with Putin. Trump questioned the benefit of NATO, which is the basis for European stability after the Second World War.
> US election is in 3 weeks and Donald Trump is openly in love with Putin.
He states that he's never met Putin nor has any holdings in Russia. He has stated that he is open to positive relationships with the Russian government.
> Trump questioned the benefit of NATO, which is the basis for European stability after the Second World War.
I believe he stated that he wants NATO to "pay their fair share" in the costs of maintaining the organization.
I'm not a Trump supporter but we shouldn't believe everything we read.
Trump says he never met Putin, now. In the past, he said he did. I just did a search for "trump met putin" and found a bunch of news sites reporting that in a GOP debate a while ago Trump said
“I got to know him very well because we were both on ‘60 Minutes,’ we were stablemates, and we did very well that night.”
Trump was boasting about nothing in that debate: they were on the same episode of 60 Minutes, but they were not even on the same continent for that episode.
That's not really the point. The point is Trump is now saying he hasn't, but in the past he said he has. Not only is he contradicting his earlier statement, but it also makes him not trustworthy. And of course, if he was boasting about having supposedly met Putin in the past, that means he thought it was a good thing to boast about, which suggests that he is sympathetic to Putin and to Russian interests.
> - Finland (yes, Finland) is increasingly worried about Russia. They violated their air space, and they're questioning Finland's independence. Finland shares a long border with Russia.
The Finns actually still have quite a good relationship with Russia (better than Russia's other neighbors do), and nobody is actually questioning Finland's independence. The Baltic countries are a different story.
It's good to know things are calm; I was just quoting the article. Not sure where they got that from:
> Finland is becoming increasingly worried about what it sees as Russian propaganda against it, including Russian questioning about the legality of its 1917 independence
> - Russia's only aircraft carrier is trespassing through UK waters to get to Syria in a show of force that doesn't really add anything to their military capabilities there.
If they were really gearing up for war, why would they move their only carrier away from the motherland? Your article even says it is more of a "show of force" than the start of a war.
I'm sure that carrier is being followed by multiple NATO submarines as well.
It seems incredibly unlikely that a global war would start over Syria when we've had 60 years of proxy conflict instead. Russia or NATO have absolutely nothing to gain from an open military conflict.
> - Finland (yes, Finland) is increasingly worried about Russia. They violated their air space, and they're questioning Finland's independence. Finland shares a long border with Russia.
Finland isn't worried; they have had stable relations for half a century, as both sides agreed not to mess with each other. They have even declined to join NATO because it is actually safer for both Finland and Russia that way.
Cowboys with missiles stationed on Russia's border making hyperbolic statements (like you do) - now that would be a real threat. (The same was also true the other way around, with the Soviets stationing missiles in Cuba.)
Is this sabre rattling or the prelude to a global conflict? Surely at worst it will (continue to) be a proxy war between NATO and Russia in Syria and nothing more? What motive is there for Russia or NATO to engage in open warfare? I'm not sure that a slow and prolonged lead up to an open war would even be effective in this situation.
Perhaps it should be "Say hello to Cold War v2.2017"?
I just want to sell my software, why does everyone have to fight?!
Thank you for these links. I'm trying not to get wrapped up in conspiracies but am increasingly worried by the mounting conflict. I'd love to hear a calm, reasoned response from someone more knowledgeable than me on these topics.
The global powers are negotiating with themselves, and consolidating state power at home. Proxy wars for the former, and fear mongering at home for the latter.
The players are the 0.001% who control these states and the rest (we) are the captive (and propagandized) audience. They are being super kind as to at least make it entertaining for us.
Yeah I've gotten a little wrapped up as well looking into all of this mounting tension between the US and Russia. Protip, stay away from /r/the_donald.
My personal opinion is that it is mostly political and I think (hope) that what is happening in Syria won't escalate to direct conflict between the US and Russia.
I stumbled across this little blog article the other day and it helped relieve some of my anxieties.
I don't actually think this will lead to open conflict. My comment at the end was just saying that this is what a world war would probably look like now, and that this back and forth might continue for a while.
I'd like to hear from an expert as well, rather than rely on piecing news items together.
> - Russia's only aircraft carrier is trespassing through UK waters to get to Syria in a show of force that doesn't really add anything to their military capabilities there.
I read they were passing in international waters. Is that not the case? It's clearly a show of force, but no need for the hyperbole if it is not true.
I wanted to provide an update on the PagerDuty service. At this time we have been able to restore the service by migrating to our secondary DNS provider. If you are still experiencing issues reaching any pagerduty.com addresses, please flush your DNS cache. This should restore your access to the service. We are actively monitoring our service and are working to resolve any outstanding issues. We sincerely apologize for the inconvenience and thank our customers for their support and patience. Real-time updates on all incidents can be found on our status page and on Twitter at @pagerdutyops and @pagerduty. In case of outages with our regular communications channels, we will update you via email directly.
In addition you can reach out to our customer support team at support@pagerduty.com or +1 (844) 700-3889.
Tim Armandpour, SVP of Product Development, PagerDuty
I had the privilege of being on-call during this entire fiasco today and I have to say I was really, really disappointed. It's surprising how broken your entire service was when DNS went down. I couldn't acknowledge anything, and my secondary on-call was getting paged because it looked like I wasn't trying to respond. I was getting phone calls for alerts that weren't even showing up in the web client, etc. Overall, it caused chaos.
I appreciate the update, but your service has been unavailable for hours already. This is unacceptable for a service whose core value is to ensure that we know about any incidents.
Given that a large swath of SaaS services, infrastructure providers, and major sites across the internet are impacted, this seems harsh. Are you unhappy with PagerDuty's choice of DNS provider, or something else they have control over? I don't think anyone saw this particular problem coming.
A company that bills themselves as a reliable, highly available disaster handling tool ought to know better than to have a single point of failure anywhere in its infrastructure.
Specifically, they shouldn't have all of their DNS hosted with one company. That is a major design flaw for a disaster-handling tool.
I'm not using the service, but I'm curious what an acceptable threshold for this company is. Like, if half the DNS servers are attacked? If hostile actors sever fiber optic lines in the Pacific?
I ask because my secondary question, as a network noob, is: was anybody prepared for, or preparing for, a DDoS on a DNS provider like this? Were people talking about this before? I live in Mountain View so I've been thinking today about the steps I and my company could take in case something horrifying happens - I remember reading on reddit years ago about local internets, wifi nets, etc, and would love to start building some fail-safes with this in mind.
I'm not using the service either, but I noticed this comment [1]. It's not the first time that a DNS server has been DDoS-ed, so it has been discussed before (e.g. [2]). At minimum, I would expect a company that exists for scenarios like this to have more than one DNS server. Staying up when half of existing DNS servers are down is a new problem that no one has faced yet, but this is an old, solved one.
Namely "Uninterrupted Service at Scale -
Our service is distributed across multiple data centers and hosting providers, so that if one goes down, we stay available."
It seems fair to expect them to have a backup dns too, but I am not an expert.
From the perspective of my service being down, my customers being pissed, and me not being notified.. yes, maybe PD should be held to a higher standard of uptime. Seems core to their value prop.
pagerduty.com moved to Route53, but the TTL on NS records can be very long. Flushing (restarting, ...) whatever can cache DNS records in your infra will help to quickly pick up the new nameservers.
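If you're not sure what's still cached on your side, something like this is a reasonable first pass (a rough sketch only; which flush command applies depends on what your machines actually run, with systemd-resolved and macOS shown as examples):
dig +short NS pagerduty.com                                      # check whether the new Route53 (awsdns) nameservers are visible to you yet
sudo systemd-resolve --flush-caches                              # hosts running systemd-resolved
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder    # macOS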
Running a redundant DNS provider is expensive as all hell.
While 'expensive' is a relative term, I disagree that it's cost-prohibitive for most firms, as I looked into this specifically (ironically, we considered using Dyn as our secondary). The challenge isn't coming up with the funds; it's that if you happen to use 'intelligent DNS' features, these are proprietary (by nature) and thus don't translate 1:1 between providers.
In addition to having to bridge the divide yourself, by analyzing the intelligent DNS features and using the API from each provider to simultaneously push changes to both providers, you have to write and maintain automation/tooling that ensures your records are the same (or as close as possible) between the providers. If you don't do this right, you'll get different / less predictable results between the providers, making troubleshooting something of a headache.
Thus, in that case, the 'cost' is the engineering effort (and risk, given that APIs change and tooling can go wrong) in addition to the monthly fee.
If all you're doing is simple, standard DNS (no intelligent DNS features), it's not as hard, and it's just another monthly cost. Since you typically get charged by queries/month, if you run a popular service you're probably well able to afford the redundancy of a second provider.
Ah so make everything redundant. Double my costs in man hours and in monetary cost. Brilliant!
The sarcasm is curious. It's a business decision. Either your revenue is high enough that the monetary loss from a several-hour intra-day outage is potentially worse than the cost of said redundancy, or you don't care enough to invest in that direction (it's expensive).
Making things redundant is exactly a core piece of what infrastructure engineering is. I guess with the world of VPSes and cloud services, that aspect is being forgotten? And yes, redundancy / uptime costs money!
Your automation should be handling creating/modifying records in both providers. Also, if you're utilizing multiple providers you don't need to pay for 100% of your QPS (or whatever metric is used for billing) on every provider, only 50% for two or 33% for three. You can just pay for overages when you need to send a higher percentage of your traffic to a single provider.
I believe you don't understand DNS. It's probably the most resilient service there is (provided it's used correctly). There's nothing inherent in the protocol that would prevent them from using multiple DNS providers.
> Running a redundant DNS provider is expensive as all hell.
Sorry if this sounds dickish, but renting 3 servers @ $75 apiece from 3 different dedicated server companies in the USA, putting TinyDNS on them, and using them as backup servers, would have solved your problems hours ago.
Even a single quad-core server with 4GB RAM running TinyDNS could serve 10K queries per second, based on extrapolation and assumed improvements since this 2001 test, which showed nearly 4K/second performance on 700Mhz PIII CPUs: https://lists.isc.org/pipermail/bind-users/2001-June/029457....
EDIT to add: and lengthening TTLs temporarily would mean that those 10K queries per second would quickly lessen the outage, since each answer might be cached for 12 hours; and large ISPs like Comcast would cache the answers for all their customers, so a single successful query delivered to Comcast would have (some amount of) multiplier effect.
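For what it's worth, a bare-bones sketch of what that backup could look like with tinydns, assuming djbdns plus daemontools are installed and the standard /service/tinydns layout; the hostnames, IPs, and TTLs here are placeholders, not anyone's real records:
cat >> /service/tinydns/root/data <<'EOF'
+www.example.com:192.0.2.10:43200
+api.example.com:192.0.2.11:43200
EOF
cd /service/tinydns/root && make    # runs tinydns-data, which rebuilds data.cdb atomically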
You're asserting that your (or their) homegrown DNS service will have better reliability than Dyn and Route53 combined. That assertion gets even worse when it's a backup because people never, ever test backups. And "ready to go" means an extremely low TTL on NS records if you need to change them (which, for a hidden backup, you will), and many resolvers ignore that when it suits them, so have fun getting back to 100% of traffic.
Spoiler: I'd bet my complete net worth against your assertion and give you incredible odds.
Golden rule: Fixing a DNS outage with actions that require DNS propagation = game over. You'd might as well hop in the car and start driving your content to people's homes.
I don't know how big PagerDuty is; IIRC over 200 employees, so, a decent size.
I was giving a bare-minimum example of how this or (some other backup solution) should have already been setup and ready to be switched over.
DNS is bog-simple to serve and secure (provided you don't try to do the fancier stuff and just serve DNS records): it is basically like serving static HTML in terms of difficulty.
Having a backup of all important hostnames/IP addresses locally available and ready to deploy on some other service, or even rebuilt by hand on some quickly rented servers, is, I think, quite a reasonable thing for a company to have. I guess it would also be simple to run on GCE or Azure as well, if you don't like the idea of dedicated servers.
Not necessarily. Granted, this is how I would configure a system (two providers), but it makes just as much sense to use one major provider that falls back to company-run servers in the event of an attack like this. It is all a matter of sysadmin preference: while it is smart to delegate low-level tasks to managed providers, it is also smart to have a backup solution under your full control, just in case that control needs to be taken at some point in time.
That would be a quick fix, similar to adding another NS provider. Of course, if Dyn is out completely, they might not have their master zone stored anywhere else; then it's like any other service rebuilding without a backup.
"Challenges" is exactly the sort of Dilbertesque euphemism that you should never say in a situation like this.
Calling it a "challenge" implies that there is some difficult, but possible, action that the customer could take to resolve the issue. Since that is not the case, this means either you don't understand what's going on, or you're subtly mocking your customers inadvertently.
Try less to make things sound nice and MBAish, and try more to just communicate honestly and directly using simple language.
Running multiple DNS providers is not actually that difficult and certainly not cost prohibitive. I am sure after this, we will see lots of companies adding multiple DNS providers and switching to AWS Route53 (which has always been solid for me).
Please check our status page as an alternative method for updates. Unfortunately, it's also been encountering the same issue so we're sending out an email with the latest updates.
The PagerDuty outage is the real low point of this whole situation. Email alerts from PagerDuty that should have notified us of the outage in the first place only got delivered hours later, after the whole mess cleared up.
I'm a GitHub employee and want to let everyone know we're aware of the problems this incident is causing and are actively working to mitigate the impact.
"A global event is affecting an upstream DNS provider. GitHub services may be intermittently available at this time." is the content from our latest status update on Twitter (https://twitter.com/githubstatus/status/789452827269664769). Reposted here since some people are having problems resolving Twitter domains as well.
I'm curious why you don't host your status page on a different domain/provider? When checking this AM why GitHub was down, I also couldn't reach the status page.
The only way that I could check to see if Github knew they were having problems was by searching Google for "github status", and then seeing from the embedded Twitter section in the results page that there was a tweet about having problems. Twitter also being down for me didn't help the situation either.
The attack is on the DNS servers, which take names like www.github.com and resolve them to ip addresses (i.e. 192.30.253.112 for me). Their status page is status.github.com - it is on the same domain name (github.com) as the rest of the site. Normally this isn't a problem because availability is usually something going on with a server, not DNS.
In this case, the servers at Dyn that know how to turn both www.github.com and status.github.com into IP addresses were under attack and couldn't respond to queries. The only way to mitigate this would be to have a completely different domain (i.e. githubstatus.com) and host its DNS with a different company (i.e. not Dyn).
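A quick way to see the difference, as a hypothetical sketch (githubstatus.example stands in for a separately registered status domain, and dig is assumed to be installed):
dig +short NS github.com            # shows whichever provider currently serves github.com, i.e. the one under attack
dig +short NS githubstatus.example  # a separately registered domain could return a completely different provider's nameservers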
Right, this was my point. Hosting "status.domain.com" doesn't help much when it's "domain.com" that's having the problem. I think today's event will make a lot of companies consider this a bit more.
Anyway, for them to take the github.com nameservers out of the mix they would need a completely separate domain name; would you know to look there?
You can delegate subdomains to other providers, but the NS records are still present in the servers listed at the registrar, so you'd already need multiple DNS providers. And then you wouldn't have been down. Just sayin'. I'm not sure anyone rated "a DNS provider of this stature getting hit this hard, or this completely" as a high enough risk to go to the trouble.
It's easy enough to look at a system and point out all the things you depend on as being a risk. The harder part is deciding which risks are high enough priority to address instead of all the other work to be done.
Lots of companies use Twitter for that sort of real-time status reporting, whose own up/down status one would think is sufficiently uncorrelated... unfortunately the internet is complicated.
If you're attempting to understand the behavior of individual users of HN as a collective, I can assure you that your initial principles are hampering you greatly.
(I'm not Github, but I work for a Dyn customer) Using multiple DNS providers has technical and organizational issues.
From a technical perspective, if you're doing fancy DNS things like geo-targeting, round-robin through more A records than you'll return to a query, or health checks to fail IPs out of your rotations, using multiple providers means they're likely to be out of sync, especially if the provider capabilities don't match. That may not be terrible, because some resolvers are going to cache DNS answers for way longer than the TTL and you have to deal with that anyway. You'll also have to think about what to do when an update applied successfully to one provider, but the second provider failed to apply the update.
From an organizational perspective, most enterprise DNS costs a bunch of money, with volume discounts, so paying for two services, each at half the volume, is going to be significantly more expensive than just one. And you have to deal with two enterprise sales teams bugging you to try their other products, asking for testimonials, etc, bleh.
Also, the enterprise DNS I shopped with all claimed they ran multiple distinct clusters, so they should be covered for software risks that come from shipping the same broken software to all servers and having them all fall over at the same time.
more accurately, they don't support the common standard methodologies for transferring zone data between primary and secondary name servers (like NOTIFY, AXFR, etc).
there is nothing stopping you from having Route53 and $others as NS records for your domains. You just have to make sure they stay consistent. Apparently from the linked discussion, there are people offering scripts and services to do just that.
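Something like the following is roughly what that consistency checking boils down to, per record type (a sketch only; the domain and the two nameservers are placeholders, and dig plus bash process substitution are assumed):
for type in A AAAA MX NS TXT; do
  diff <(dig +noall +answer example.com "$type" @ns1.p16.dynect.net | sort) \
       <(dig +noall +answer example.com "$type" @ns-421.awsdns-52.com | sort) \
    && echo "$type records match" || echo "$type records differ"
done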
If this is consistently a problem why doesn't Github have fallback TLDs that use different DNS providers? Or even just code the site to work with static IPs. I tried the Github IP and it didn't load, but that could be for an unrelated issue.
> If this is consistently a problem why doesn't Github have fallback TLDs
I don't believe this has been consistently a problem in the past. But after today big services probably will have fallback TLDs.
Another status update from GitHub: "We have migrated to an unaffected DNS provider. Some users may experience problems with cached results as the change propagates."
We're maintaining yellow status for the foreseeable future while the changes to our NS records propagate. If you have the ability to flush caches for your resolver, this may help restore access.
Twitter's working fine for me. This attack is affecting different people differently; as a DDOS, attacking a distributed system (DNS) with a lot of redundancy, it's possible for some people to be affected badly while others not affected at all.
I briefly lost access to GitHub, but Twitter has been working fine every time I've checked. Posting status messages in multiple venues helps to ensure that even if one channel is down, people might be able to get status from another channel.
Name Server: ns1.p44.dynect.net
Name Server: ns2.p44.dynect.net
Name Server: ns3.p44.dynect.net
Name Server: ns4.p44.dynect.net
Name Server: sdns3.ultradns.biz
Name Server: sdns3.ultradns.com
Name Server: sdns3.ultradns.net
Name Server: sdns3.ultradns.org
ultradns.biz:
Name Server: PDNS196.ULTRADNS.ORG
Name Server: ARI.ALPHA.ARIDNS.NET.AU
Name Server: ARI.BETA.ARIDNS.NET.AU
Name Server: ARI.GAMMA.ARIDNS.NET.AU
Name Server: ARI.DELTA.ARIDNS.NET.AU
Name Server: PDNS196.ULTRADNS.NET
Name Server: PDNS196.ULTRADNS.COM
Name Server: PDNS196.ULTRADNS.BIZ
Name Server: PDNS196.ULTRADNS.INFO
Name Server: PDNS196.ULTRADNS.CO.UK
Name Server: ns2.p16.dynect.net
Name Server: ns-1283.awsdns-32.org.
Name Server: ns-1707.awsdns-21.co.uk.
Name Server: ns-421.awsdns-52.com.
Name Server: ns1.p16.dynect.net
Name Server: ns4.p16.dynect.net
Name Server: ns3.p16.dynect.net
Name Server: ns-520.awsdns-01.net.
Twilio just dumped Dyn, and is now available again.
twilio.com:
Name Server: ns3.dnsmadeeasy.com
Name Server: ns2.dnsmadeeasy.com
Name Server: ns4.dnsmadeeasy.com
Name Server: ns1.dnsmadeeasy.com
Name Server: ns0.dnsmadeeasy.com
Amazon was using Dyn; now they have added UltraDNS too:
Name Server: pdns1.ultradns.net
Name Server: pdns6.ultradns.co.uk
Name Server: ns3.p31.dynect.net
Name Server: ns1.p31.dynect.net
Name Server: ns4.p31.dynect.net
Name Server: ns2.p31.dynect.net
Right now, if your site is in trouble, I'd suggest getting UltraDNS
service and AWS DNS service, and some obscure service as well, and put them all in your domain registration. DNS service is cheap. Get some redundancy going. We have no idea how long this DDoS attack will last. It's not costing the attackers anything. They might leave it running for days.
Had a similar experience. When I went to confirm on Twitter, that was down too. I was able to access Twitter from my phone though, where I found a ton of tweets saying "Twitter down!". Strange.
Funny how we're able to tell that twitter's down on twitter, but here in Brazil, when whatsapp was down, nobody could use their own whatsapp to ask or tell if whatsapp was down.
No, not necessarily journalists; rather, an information source... Fortune, a site/company known for journalism/reporting, just gave Hacker News more legitimacy as an official information source... Now, with this power, please use it responsibly. ;-)
No but we are a hivemind of smart individuals that correctly upvote important information and downvote irrelevant information. Most of the time, you can be sure that the top HN listings are going to be relevant.
I wonder if they used "Hacker News" as the source on that because it contains the word "hacker" and they wanted to say "oh, hackers read this site"... as in "the blackhat, break into stuff, steal your money, deface your website" folks.
Well, the parts that relied on outside services hooked up via SSO were not demoed, but the majority of it worked fine because the demo server was misconfigured to not actually rely on the external services. It is pretty funny.
I am a bit paranoid about disclosing details, but basically our SAML IDP was down, so the sales person couldn't log in at all. I was messing with the demo server to convince myself that it was 100% the IDP's fault and we couldn't do anything about it, and discovered to my surprise that form-based authentication was not disabled on it (normally our servers are in one mode or the other, but not both, even though this is an artificial separation). So I gave them the direct link to the form-based entry point and most of the demo could be done.
We had a demo at the exact same time. (Internal weekly product demo, not that critical).
We did it on localhost, the only host that's reliable 100% of the time.
Github out, Etsy out, Paypal out, Twitter out, Soundcloud out, Crunchbase out, Heroku out, Spotify intermittent, Netflix only loads a white page with plaintext "who's watching" list and no functionality.
I'm in NYC too. Github.com is resolving/working fine. Netflix.com is resolved but all assets (probably) weren't loading. Additionally Zendesk is also affected.
NYC, fios: github, twitter, soundcloud, heroku back up for me. Tunneling through an ec2 instance on us-east-1d gives the same results - can't find anything that is unreachable now.
Interesting. 8.8.8.8 is not able to provide me with a record. However, ns1.p57.dynect.net and ns3.p57.dynect.net give an answer, whereas ns2.p57.dynect.net and ns4.p57.dynect.net hang.
Shouldn't 8.8.8.8 query another name server if one fails to respond or takes too long?
EDIT:
Is there an inherent flaw in how secondary records are queried? And as bhauer mentioned, is there a possibility to fall back to last known record?
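One way to see that behavior directly is to query each authoritative server with recursion disabled and a short timeout (a quick sketch; example.com is a placeholder for whatever zone actually lives on p57, and dig is assumed to be installed):
for ns in ns1 ns2 ns3 ns4; do
  echo "== ${ns}.p57.dynect.net =="
  dig @"${ns}.p57.dynect.net" example.com A +norecurse +time=2 +tries=1 +short
done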
They sell premium services, have a large sales team, and are very aggressive. I get emails from them weekly discussing millisecond savings of their DNS solutions and the value increase in customers and sales.
Squeaky wheels get grease and their sales team squeaks a lot.
Realistically they compete with Neustar, which is shockingly expensive, has fewer features, and is harder to use.
I chose Dyn over Neustar (UltraDNS) when it was time to renew contracts because it was 60% cheaper, had better latency, their support was great, and the interfaces were clear.
Not a fanboy or anything, I really don't like how aggressively they hound me now (even though I have nothing to do with DNS for my current employer), but it's cheap and effective so it's not surprising people use them.
Way, way back in time, they offered lifetime DNS hosting for a relatively low price.
I bought that, and they've honored the deal. Admittedly it comes with limits that would make it useless for any large site, but it's just great for individuals.
I had a NS1 demo account. And then they stopped doing that, but it still worked. And then I lost the credentials, and now my account is invalid for a password reset :(
They're widely used because they were one of the few providers of geo-aware DNS service for a long time. (These days there are other, cheaper options, including Amazon Route53.)
They offer Anycast and have POPs around the globe. They also have some other nice features such as intelligent failover and extra GEO IP features. Things that you would otherwise have to build yourself.
They have been around a long time. For years they had a free product called DynDNS that would allow you to get an A record for your dynamic IP at home.
I'm surprised; I would have thought such large sites would use more than one DNS provider? I mean:
$ host -t NS twitter.com
twitter.com name server ns4.p34.dynect.net.
twitter.com name server ns3.p34.dynect.net.
twitter.com name server ns2.p34.dynect.net.
twitter.com name server ns1.p34.dynect.net.
I would have expected at least one of those to be somewhere else. What is the reason they would not have a backup provider?
I know a lot about some things, but almost nothing about networking, so excuse me if this is a really dumb question but - would your physical location determine what hosts you returned from that query? Like if you were in Asia would you get different ones back?
I'd guess the reasoning is that DNS providers these days are all anycast-style DNS. A DDoS would usually just be a blip on a few servers around the world depending on where the attacks originate.
I'm not saying it's a good reason but it's a reason.
All of these work for me from Germany, and querying their authoritative nameservers works just fine (so definitely no caching effect). Anycast for the win!
If anycast routing is in play, which is not unlikely with a DNS service like that, then it may also be that specific servers are being attacked so the outages don't affect users in all locations as some will be routed to infrastructure that is not affected.
Journalist and security researcher Brian Krebs believes this is someone doing a DDoS as payback for research into questionable "DDoS mitigation services" that he and Dyn's Doug Madory did. Doug just presented his results yesterday at NANOG and Krebs believes this is payback. Read more: https://krebsonsecurity.com/2016/10/ddos-on-dyn-impacts-twit...
Well, Krebs sees this as an extension of the attacks that took down his site a few weeks ago after he wrote about this research. So he wrote about it, attackers take down his site. His co-author Doug Madory speaks about it, attackers take down Madory's employer's site.
Krebs indicates in an update at the end that a source had heard rumors in criminal channels that an attack against Dyn was being planned.
Doug Madory's presentation was on the agenda for NANOG and so attackers would have had plenty of time to know about it.
I'm wondering, from a regulatory perspective, what might be done to mitigate DDoS attacks in the future?
From comments made on this and other similar posts in the past, I've gathered the following:
1) Malicious traffic often uses a spoofed IP address, which is detectable by ISPs. What if ISPs were not allowed to forward such traffic?
2) There is no way for a service to exert back pressure. What if there was? e.g. send a response indicating the request was malicious (or simply unwanted due to current traffic levels), and a router along the way would refuse to send follow up requests for some time. There is HTTP status code 429, but that is entirely dependent on a well-behaved client. I'm talking about something at the packet level, enforced by every hop along the way.
3) I believe it is suspected that a substantial portion of the traffic is from compromised IoT devices. What if IoT devices were required to continually pass some sort of a health check to make other HTTP requests? This could be enforced at the hardware/firmware level (much harder to change with malware), and, say, send a signature of the currently running binary (or binaries) to a remote server which gave the thumbs up/down.
One thing that occurred to me, regulation-wise, is to require IoT devices to have some minimum level of security, such as a unique hard password rather than it just being "admin" or some such. You could enforce it for items sold in the US or EU, and the Chinese manufacturers would probably follow so their goods could be sold easily.
> What if IoT devices were required to continually pass some sort of a health check to make other HTTP requests?
Even better, what if IoT devices were required to pass some health check to operate at all. This could be as simple as a verified boot plus a forcible reboot every now and then.
Today the peering agreements are made so that ISPs get paid for whatever traffic they pass through. They have no financial motivation to change anything. And as the Internet is decentralized you cannot order them to do anything. So everyone has to protect from DDOS by themselves.
> Today the peering agreements are made so that ISP's get paid for whatever traffic they pass through. They have no financial motivation to change anything.
That seems believable.
> And as the Internet is decentralized you cannot order them to do anything.
...that doesn't. Being decentralized doesn't render them immune to regulation. If all major networks responsible for large scale peering were required not to pass on a certain type of traffic, it would be quite difficult to route around that. Yes, if only some did, this would be routed around.
Let's avoid giving ISPs more power than they have. The next thing we will see is "oh, we thought that person was using a compromised device" for any disagreement.
Regarding point 2, I can think of a few ways to utilize that mechanism itself as a way to DDoS something. Sometimes the security mechanisms themselves are the attack vectors.
Well, it's similar to when a company tries to stop brute force by blindly blocking people who try 10 invalid passwords, but has a CSRF (cross-site request forgery) vulnerability on the login page. The problem is that I can craft a page that repeatedly POSTs invalid passwords to their login page via AJAX, and lock out legitimate users by running a spam campaign that points their user base at my page. It seems far-fetched until you consider something global like the internet. There are two ways I could see this failing on a global scale:
- Attackers figure out something similar to the attack described above and entice large numbers of users to visit a page that repeatedly fires requests at something like s3.aws.com or whatever; the users are unaware, but they're essentially DDoSing s3.aws.com via attacker.com's webpage, and under point 2 they would be the ones banned.
- DRDoS is similar to what's described above, but under point 2 it kind of stands alone as the biggest issue. It can be mitigated to a certain level by ISPs, but not entirely (https://en.wikipedia.org/wiki/Denial-of-service_attack#Refle...). Point 2 would actually help attackers poison DNS.
Although I don't like to recommend Google products, they provide a public DNS-over-HTTPS interface that should be useful for people who want to add specific entries to their /etc/hosts files: https://dns.google.com/query?name=github.com&type=A&dnssec=t...
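For example, something along these lines should pull an address and append it (a sketch, assuming the JSON endpoint at dns.google.com/resolve is reachable and jq is installed; github.com is just the example from the link above):
ip=$(curl -s 'https://dns.google.com/resolve?name=github.com&type=A' | jq -r '.Answer[0].data')
echo "$ip github.com" | sudo tee -a /etc/hosts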
"digikey.com", the big electronic part distributor, is currently inaccessible. DNS lookups are failing with SERVFAIL. Even the Google DNS server (8.8.8.8) can't resolve that domain. Their DNS servers are "ns1.p10.dynect.net" through "ns4.p10.dynect.net", so it's a Dyn problem.
This will cause supply-chain disruption for manufacturers using DigiKey for just-in-time supply.
(justdownforme.com says the site is down, but
downforeveryoneorjustme.com says it's up. They're probably caching DNS locally.)
Google's DNS has been working all day here. The problem is that Dyn's DNS servers are being DDoSed; if you request a record whose authoritative DNS server is hosted by Dyn, then when you query Google's DNS for that record, Google's server needs to make a query to Dyn, which is down, and thus your query fails. But queries to Google for non-Dyn domains will continue to work just fine.
Sorry for the confusion about saying Google NS is down :-) I meant that "dig heroku.com @8.8.8.8" does not work (as pointed out by some other poster, it is because Google's resolvers honor the TTL but OpenDNS does not).
Google's DNS servers are not down; they are recursive resolvers, and they simply are not getting data from Dyn's authoritative servers. OpenDNS caches DNS data for longer than other recursive DNS providers (they call it SmartCache, IIRC), which is why it is still working.
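A quick way to see that behavior for yourself, as a rough sketch (the domains are just examples and results will vary while the attack is ongoing):
dig +short github.com @8.8.8.8          # a Dyn-hosted name; may fail with SERVFAIL once Google's cached answer expires
dig +short google.com @8.8.8.8          # a non-Dyn name; keeps resolving normally
dig +short github.com @208.67.222.222   # OpenDNS may still answer from its longer-lived cache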
If you're having issues with people accessing your running Heroku apps, it's likely because you're running your DNS through herokussl.com (with their SSL endpoint product) which is hosted on Dyn.
If you can update your DNS to CNAME directly to the ELB behind it, it should at least make your site accessible.
Nice, this is working well for us too. We were able to get the CNAME of the ELB by doing a `dig whatever.ourdomain.com` in an EC2 instance we launched in São Paulo (which presumably worked since Dyn's outage is primarily affecting their east coast PoPs.)
Another option may be to switch from the SSL endpoint add-on to the new, free SNI-based SSL termination feature, which will mean CNAMEing to your-domain.herokudns.com. , which seems not to be affected by today's issues.
Presumably with something like `dig @208.67.220.220 -t CNAME <your site>.herokussl.com`. This uses the OpenDNS nameservers, that people have been reporting as working. Haven't tested it as I am on the go.
Just to be clear, this is a DDoS against Dynect's NS hosts, right?
I'm confused because of the use of "dyn dns", which to me means dns for hosts that don't have static ip addresses.
I'm actually surprised so many big-name sites rely on Dynect, which I hadn't heard of, but more importantly don't seem to use someone else's NS hosts as 2nd or 4th entries.
They were up most of the day in here (Prague, Czech Republic, Europe), but it's down now (started about 20 minutes ago). It seems to be another wave of the attack.
OpenDNS servers seem the only ones that still work. Kudos.
It may not be the proper behavior, but this kind of soft-fail scenario (use the old DNS records until you can contact the DNS servers and get new ones) is much better.
echo "nameserver 208.67.222.222" | sudo tee -a /etc/resolv.conf
AWS says "We are investigating elevated errors resolving the DNS hostnames used to access some AWS services in the US-EAST-1 Region." Is that coincidental, or are they being DDoSed also?
Apparently us-east-1 is backed by Dyn (and only Dyn) as well?
$ host -t NS us-east-1.amazonaws.com
us-east-1.amazonaws.com name server ns3.p31.dynect.net.
us-east-1.amazonaws.com name server ns1.p31.dynect.net.
us-east-1.amazonaws.com name server ns2.p31.dynect.net.
us-east-1.amazonaws.com name server ns4.p31.dynect.net.
That's… utterly bizarre to me. us-east-2 has a more diverse selection:
$ host -t NS us-east-2.amazonaws.com
us-east-2.amazonaws.com name server u4.amazonaws.com.
us-east-2.amazonaws.com name server u6.amazonaws.com.
us-east-2.amazonaws.com name server u3.amazonaws.com.
us-east-2.amazonaws.com name server u2.amazonaws.com.
us-east-2.amazonaws.com name server u1.amazonaws.com.
us-east-2.amazonaws.com name server u5.amazonaws.com.
us-east-2.amazonaws.com name server ns2.p31.dynect.net.
us-east-2.amazonaws.com name server ns1.p31.dynect.net.
us-east-2.amazonaws.com name server pdns1.ultradns.net.
us-east-2.amazonaws.com name server pdns5.ultradns.info.
us-east-2.amazonaws.com name server ns3.p31.dynect.net.
us-east-2.amazonaws.com name server ns4.p31.dynect.net.
us-east-2.amazonaws.com name server pdns3.ultradns.org.
Not that anyone should be running a service whose availability they care about solely in us-east-1 anyway…
$ host -t NS us-east-1.amazonaws.com
us-east-1.amazonaws.com name server pdns5.ultradns.info.
us-east-1.amazonaws.com name server ns3.p31.dynect.net.
us-east-1.amazonaws.com name server pdns1.ultradns.net.
us-east-1.amazonaws.com name server pdns3.ultradns.org.
us-east-1.amazonaws.com name server ns4.p31.dynect.net.
us-east-1.amazonaws.com name server ns1.p31.dynect.net.
us-east-1.amazonaws.com name server ns2.p31.dynect.net.
us-east-1.amazonaws.com name server u1.amazonaws.com.
us-east-1.amazonaws.com name server u2.amazonaws.com.
us-east-1.amazonaws.com name server u3.amazonaws.com.
us-east-1.amazonaws.com name server u4.amazonaws.com.
us-east-1.amazonaws.com name server u5.amazonaws.com.
us-east-1.amazonaws.com name server u6.amazonaws.com.
us-east-1 is the oldest region and predates Route 53. Not adding extra DNS providers to the older regions is probably an oversight.
(The EC2 API team requests load balancers from a separate load balancer team. The load balancer team probably didn't exist as a separate team when some of these regions were created.)
If that were the reason I wouldn't expect this update:
6:36 AM PDT [RESOLVED] Between 4:31 AM and 6:10 AM PDT, we experienced errors resolving the DNS hostnames used to access some AWS services in the US-EAST-1 Region. During the issue, customers may have experienced failures indicating "hostname unknown" or "unknown host exception" when attempting to resolve the hostnames for AWS services and EC2 instances. This issue has been resolved and the service is operating normally.
That might explain why we are down - most of our EC2 instances are in us-east-1. Looks like Amazon SQS is impacted too. We are getting a stream of undeliverable messages, and our 'dead letter' queue is filling up!
Anyone else spend the morning thinking the problem was their setup? I've been flushing my system DNS cache, Chrome's DNS cache, changing DNS servers, rebooting my router, turning VPN on/off, etc.
It happened to be at the same time I was getting things configured to connect, for the first time, to a new VPN I hadn't used before. Until about 7am today my home network was a 10.0.0.0/8 network. The VPN kept bombing in the last phase of connecting and I couldn't figure out why, so I thought it was an IP conflict with my internal network range.
So naturally, I then went into my router and changed my subnet for my entire home network to the more common 192.168.1.0/24 range to see if it'd help. It didn't. Until suddenly VPN "just worked" -- which makes me wonder if I needed to change my network at all to begin with.
Then I started experiencing all sorts of weird issues where the Internet seemed to disappear from one minute until the next.
Then I hit IRC when things finally stabilized and see "Did you hear about Dyn?".
My reaction: wut.
TL;DR: I rearchitected my home network at 7am for no reason.
I've been singing the praises of AWS Route53 for a long time; they're up and running. I can't believe major multi-million dollar companies (Twitter, GitHub, SoundCloud, PagerDuty) would not run a mix of multiple DNS providers.
Also, what is happening is a cascade effect, where a 3rd party being down affects others.
One of the reasons why Route53 is good is because they give different nameservers to each hosted zone - unless you choose to use a branded record-set.
I've seen them get hit by DDoS attacks in the past, but never with any significant impact.
(I wrap Route53 and handle storing DNS records in a git repository over at https://dns-api.com/ Adding support for other backends is my current priority to allow more redundancy.)
I had several sporadic "secure connection could not be established" errors yesterday while trying to open HN, amongst others. Painfully slow page load times across the board, too (Craigslist, Monoprice, weather.gov, etc.). Still, it may be my buggy phone SIM...
Sorta. When I changed phones I cut my micro SIM down to nano size. Cut a wee bit too much off and it now can slide off contacts if jarred... gotta get a new SIM.
Seems to be impacting POPs in US East most severely. We use RIPE Atlas to assess the impact of DNS outages, and in the past hour we have measured about 50-60% recursive query failure from a few hundred probes in that region: https://cloudharmony.com/status-for-dyn
Is it time for everyone to actually start using secondary name servers/DNS resolvers, too, from a different provider than the primary? DNS _is_ built for this, for the very purpose of handling failure of the primary, isn't it? It's just that most people don't seem to do it -- including major players?
Or would that not actually solve this particular scenario?
Yes, I think this attack has brought to everyone's attention that many companies have gone away from what used to be the extremely common practice of having your authoritative DNS serving shared across multiple DNS hosting providers. This would have addressed the issue... and we're seeing that by the end of the day many of these sites have gone to having multiple providers.
The attack is on the authoritative name servers, not a DNS resolver. A public DNS resolver will query the authoritative name server for a record if it doesn't exist in its cache.
Agreed, but there is nothing stopping you from having the authoritative name servers for a domain with different providers. As someone previously said, DNS was designed for this.
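For the classic way of doing that, here is a minimal sketch of a secondary (slave) zone on a second provider or your own box, assuming BIND 9 and a primary that permits zone transfers; the zone name, file path, and primary IP are placeholders:
cat >> /etc/bind/named.conf.local <<'EOF'
zone "example.com" {
    type slave;
    masters { 192.0.2.53; };              // hidden primary that allows AXFR and sends NOTIFY
    file "/var/cache/bind/example.com.zone";
};
EOF
rndc reconfig                             # pick up the new zone without a full restart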
It used to be common for universities to do this; mine still does:
ic.ac.uk. 45665 IN NS ns1.ic.ac.uk.
ic.ac.uk. 45665 IN NS ns2.ic.ac.uk.
ic.ac.uk. 45665 IN NS ns0.ic.ac.uk.
ic.ac.uk. 45665 IN NS authdns1.csx.cam.ac.uk.
(and Cambridge use Imperial College as a secondary) but the best-known American universities are on cloud providers now.
that's not going to help much if the authoritative name servers (which is what dyn is, btw) go down for more than a day.
Max record cache time is 86400s (24h), so if the attackers can keep it down for 24h then google will have to have custom instructions in place (or cache more aggressively than the RFC allows)
Is there any reason why Dyn has to be "down" from Google's perspective? Is it possible that the large DNS providers maintain private network between each other, such that DDoS attacks that are effective against the public are ineffective against the private network?
Since the attacked Dyn DNS servers are evidently anycast, the Google server you are reaching might connect to a different Dyn server than you do. If Google is lucky enough to reach a less overloaded server, it might get an answer where you get none.
In addition, Google Public DNS engineers have proposed a technical solution called EDNS Client Subnet. This proposal allows resolvers to pass in part of the client's IP address (the first 24/64 bits or less for IPv4/IPv6 respectively) as the source IP in the DNS message, so that name servers can return optimized results based on the user's location rather than that of the resolver. To date, we have deployed an implementation of the proposal for many large CDNs (including Akamai) and Google properties. The majority of geo-sensitive domain names are already covered.
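If you want to see what an answer looks like for a different client location, dig can attach a client subnet to the query (a sketch, assuming dig 9.10 or newer; the nameserver is a placeholder and the prefix is a documentation range, not a real client; servers that honor ECS may tailor the answer, others simply ignore the option):
dig example.com A +subnet=198.51.100.0/24 @ns1.example-dns.net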
I'm a Verizon FIOS customer in NYC and was unable to reach nytimes.com and several other sites this morning. Switching my DNS to Google's (8.8.8.8 and 8.8.4.4) seemed to fix the problem, but I don't understand why yet.
Since many (all?) of Dyn's authoritative server IPs are anycast, attack traffic is probably not well distributed either. If you're routed to a server that's getting a lot of attack traffic, you're likely to have problems, but a server without much attack traffic will work fine.
Any quick script to see if a given domain ultimately resolves to them? My SaaS company has a lot of custom domains from whatever DNS servers pointed at us and I'd like to be able to tell people whether it's our fault or not.
Query for the root domain, without any subdomains like www. That is, you need to check the "zone apex," the shortest name purchased from a registrar and potentially delegated to Dyn. Look for dynect.net in the list of authoritative name servers.
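A rough sketch of such a check, assuming dig is available (pass the zone apex, e.g. example.com rather than www.example.com):
domain=$1
if dig +short NS "$domain" | grep -qi 'dynect\.net'; then
  echo "$domain: authoritative DNS includes Dyn (dynect.net)"
else
  echo "$domain: no dynect.net nameservers found (or the NS lookup itself failed)"
fi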
Yeah, I tried that. I don't see dynect in a lot of the domains that are failing, and it's clearly related somehow; they didn't all break at the same time by coincidence.
Let's assume that foreign countries such as Russia or China were trying to sabotage our elections on the night of Nov 8th. What severe economic and political backlash would we have to deal with if we cut off the traffic coming in from those regions (not in a "we control the internet" kinda way)? I am sure they already have nodes operating within the USA. A lot of major tech companies use CDNs that can still serve traffic globally to the consumers of those countries. Even better, how about we regulate and slow down all incoming traffic for, say, half a day on election day? Is it even possible?
But then why would they be doing it right now? I'm sure they already know if they can do it or not. I don't think they need to do a large scale test run that would put people on high alert. They'd keep their head down until election day.
But then, what is China or Russia going to get out of doing something like this? It isn't going to change anything. Hillary is the next president regardless. Hell, even if no votes could be counted I am sure the Supreme Court wouldn't have a problem calling it for her.
So to me the idea that China and Russia is doing this for political reasons doesn't make any sense.
Is it just me or are these kind of attacks becoming way more frequent recently? This kind of widespread outage seems so new, but again, that might just be me.
Isn't the whole idea of git to make users independent from places like GitHub? After all, if you have the repository on your local machine you can continue working as if nothing happened, right?
I'm so damn tired of the "host it locally" mentality. Not everyone has the resources to host all of that locally.
For example, most open source projects.
But even outside of that, we use github for issue tracking, the new project management kanban stuff, a CI server, reading documentation (which is offline, but the online versions are nicer on the eyes), and a ton more. Not to mention that StackOverflow and other discussion forums tend to be used by many.
Yes you can host all of that locally, but we don't have a few hundred thousand a year to spend on some sysadmins to maintain all of that, and we don't have the time or money to run the machines, vet the software, and keep it up more reliably than github does for next to nothing.
And I'm so damn tired of people complaining about the cost of running stuff locally. The true cost of not having some basic stuff set up locally, even just for backup purposes, shows up when a situation like this happens. It does not take much time or many resources to download all of the libraries, with their corresponding docs, to a local server, or even to your laptop. It is not complicated to have all new issues sent to an email address so that a copy of them is available at all times. And you don't need a sysadmin to administer all of that.
>And you don't need a sysadmin to administer all of that.
I disagree with that. If you have a server, you need a sysadmin. End of story.
Who is going to secure the system and setup ssh keys? who is going to run updates? who is going to monitor for security issues? who is going to run backups? who is going to secure those backups? who is going to oversee the installation of the network, the battery backups, the racks, the server hardware, etc... Who will swap out bad disks? Who will recover the system when it goes down? Who is going to double the hardware and setup high availability (remember, you are competing with github for uptime here)? And god help you if you have one guy that does all of this. What happens if he gets hit by a bus?
An on-prem server isn't a "backup", it's a liability. And without the resources to maintain it, it's going to become a nightmare. I've been there, and I won't ever do it again.
I'm either going to pay to do it right, or give it to someone who will. And if that means a few hours of downtime every year or so, then that's a wonderful tradeoff for me.
> It does not take much time or many resources to download all of the libraries, with their corresponding docs, to a local server, or even to your laptop. It is not complicated to have all new issues sent to an email address so that a copy of them is available at all times.
Luckily github (and alternatives) provide all of that. It sends us emails and slack messages on everything, so if it's down, we can still read, and we all have our local repos. But reading is different than working.
If you have any substantial business, you already have a sysadmin on your team. He's not doing his job if he has no local versions of almost everything that is online. He should be staging everything locally before deploying to the cloud. The currently very popular practice of deploying everything live, without any testing or staging, is one of the reasons behind the current crappy state of the internet.
I disagree: with very large companies, you have no "local" sysadmin and no local versions of anything. Especially if your IT department is actually its own company.
If self-hosted, somewhere, you could still be screwed by having Dyn as your DNS provider.
If dev-machine-hosted, then uh, your issue tracker is no longer an issue tracker. Your build server is not a build server. All the services besides Git are not meant to operate offline in a decentralized/distributed fashion.
Library documentation, sure, that could be local. Otherwise, your assertion that all of this tools infrastructure can somehow be replicated, easily, in a way that makes the difference between working online or offline effectively zero, is nonsense.
First of all I'm not saying that the whole infrastructure could be replicated, only critical parts, and parts that can be easily hosted locally.
Second, pointing to a new machine is as simple as updating the IP in your hosts file or DNS server.
Third, you can use vmware or any other virtualization stack to replicate your infrastructure locally. In fact that's the best way to build things - create virtual network, use it for testing, troubleshooting and debugging, and deploy only when everything is working.
All I'm saying is that if your company is making any kind of money and your development environment depends 100% on online services, you're doing it wrong.
Not that I'm saying that you're completely wrong, but you're oversimplifying the problem and the solution to it. Your last statement is not necessarily true; you have to balance the cost and how much of a PITA it is to set up and maintain vs how much money would be lost by disgruntled customers for a single rare outage. Granted, it depends on the kind of business you are running, but not every business is so fragile. In fact, I'd say that most(as in >50%) are not that fragile. Even better when you can simply put the blame on someone else, which everyone can do in our case right now.
I'm under the impression that you're saying self-hosted ~= dev-machine-hosted?
If that's the case, I think you're misguided: IMHO, the internet as it was designed was conceived so that everyone has their own little self-hosted thing, with dev machines just for the purpose of, well, test and dev, and with the eventual goal of everything being self-hosted.
Just look at how email is technically designed and how it was meant to work versus how we use it now, relying mostly on Gmail or Outlook, or worse, using Facebook for messages: we put all our eggs in the same basket.
If I were running a business, here are the options as I'd see them:
Option A) Spend no money and experience an outage maybe once a year, if that. And the problem works itself out.
Option B) Spend money and gain technical debt to avoid a problem that happens maybe once a year, if that.
Which one would you pick? I mean, maybe if everything you have is closed-source or you are guaranteeing 99.9% uptime to your customers, perhaps option B makes sense. Otherwise, the choice seems fairly obvious to me.
What's your source that you "experience an outage maybe once a year" ??
I work in IT infrastructure and I see attacks literally every day. Moreover, most people just set up a quick LAMP or MEAN stack to prove their concept and then leave it like that, so most of the time, no, the problem doesn't just "work itself out".
Better yet, have one day a year that is "Red Team Day" where people hunt for vulnerabilities so that assessments can be done, and companies can later fix any issues noted. Like how earlier this week there was a statewide earthquake drill in California, local emergency sirens were sounded, schoolkids practiced hiding under desks, etc. The Internet needs periodic tests like that too.
In (well, after) attacks like this, and really any other massive DDOS, shouldn't it be possible to identify potential botnets and try to take them out (notify their owners that they're being used, notify their hosting providers, etc) so that they can't be used again in the future?
Quick question for you all. Just two days ago I registered two domain names at dynu (not dyn). Early this morning I got a cold call from a company in India that knew the domain names and my phone number and was calling to ask if I wanted them to help me manage my website cheaply. Also, this morning I got a spam text from someone who claimed to be GoDaddy offering the same thing. Now, I protect my number really well, so this is the first time in 5+ years that I have ever gotten spam texts or calls to my number. Do you think Dynu was also hacked?! Or maybe Dynu sells client numbers (which is how the guy in India claimed to get my number) and it was just by random chance that this happened at the same time as the Dyn hack.
Agree with shortstuffsushi that this is just someone getting your domain name info and spamming you. It sadly happens all the time.
Go to http://whois.icann.org/en and enter your domain name and see what info is public about you. If all your info is public, you may want to see if your registrar offers "private" registration where your info does not appear in WHOIS.
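If you have a shell handy, a rough way to see what's exposed (example.com stands in for your actual domain; whois output formats vary by registry, so the grep pattern is only a sketch):

whois example.com | grep -iE 'registrant|admin|phone|email'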
Fwiw, this isn't a hack, this is a DDoS (denial of service). It seems almost certain that your information was either given out by dynu, or your WHOIS record isn't protected. Check your domains out with your favorite WHOIS tool first. Otherwise... time for an awkward conversation with dynu.
I've been having the same problem accessing github in particular. Just for fun, I opened the Opera browser and activated the built-in VPN. That got everything going again. At least for browsing, not so useful for my git pulls and pushes.
Can someone explain why this is so bad? I think the internet handled the downtime of Dyn pretty well; not reaching GitHub wasn't exactly pleasing, but I added the IP temporarily to /etc/hosts and the problem was solved. Isn't the best strategy to accept that attacks will continue and systems may go down, and to design for resilience? If so, this attack can serve as a warning and as a check that we can handle these types of attacks. I'm exaggerating a bit, but I would imagine that constant attacks keep the internet resilient and healthy. An unchallenged internet may be the greater risk.
But just pretending no one will try bad things won't help either. It may speed up progress, help prevent a significant part of the internet being brought down in the future, and remind everybody that these things can (and probably will) happen.
The DDoS problems, at least those not related to spoofing IPs, could be curtailed if we provide a strong incentive to the ISPs to work on it.
Let's hold the ISPs financially liable for the harmful traffic that comes from their network. If a client reports a harmful IP to the ISP, every bit of subsequent traffic sent from that IP to this client carries a penalty.
Yeah, I know, routing tables are small, yada yada. If we put thumbscrews on the ISPs, they will find a way to block the few thousand IPs of a typical botnet, even if it requires buying new switches from Cisco & co.
Put the thumbscrews on the IoT manufacturers instead, so they don't release widgets with bad security, so the problem is eliminated at its root.
You wouldn't allow car manufacturers to sell cars with faulty airbags, so why do we allow device manufacturers to provide plentiful firepower for bad actors?
Semi related: I noticed this incident right when it began, but not because I was trying to access a website. This started happening to me: http://imgur.com/PPlaY5o
Then when I went to push to github out of fear my computer was about to soil itself, that failed too, and I noticed the outage.
Does anyone know if the above errors could be related to the outage? I'm using vim inside tmux with zsh as my shell. Maybe zsh does some kind of communication with gh while running?
I removed the GitHub plugin and reloaded zsh, and it happened again 5 minutes later. I believe it has to do with Slack, because the issue resolved itself after closing it. Maybe Slack got pwned and all Slack users are being used as part of the botnet, lol.
Anyone know any details of what the attack looks like? I had a quick look in my (albeit small) network for odd flows going to their ASN 33517, but didn't see much that looked odd at first glance...
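In case it helps anyone else grepping flow logs: one way to get a prefix list for that ASN is to ask the RADb route registry (assuming a stock whois client and that the route objects are registered; IPv6 prefixes would appear under route6: instead):

whois -h whois.radb.net -- '-i origin AS33517' | awk '/^route:/ {print $2}'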
Need to get in to dyn.com to download your zone files? Add this to your hosts file:
204.13.248.106 www.dyn.com
204.13.248.106 dyn.com
216.146.41.66 manage.dynect.net
151.101.33.7 static.dyn.com
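A quick way to append them and sanity-check afterwards (getent consults /etc/hosts via nsswitch, unlike dig/host which go straight to DNS):

printf '%s\n' \
  '204.13.248.106 www.dyn.com' \
  '204.13.248.106 dyn.com' \
  '216.146.41.66 manage.dynect.net' \
  '151.101.33.7 static.dyn.com' | sudo tee -a /etc/hosts
getent hosts www.dyn.com manage.dynect.net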
While my app isn't resolved using Dyn, we are relying on APIs on our EC2 backend that use their DNS. Is there a Linux DNS caching server that will serve from a local cache primarily and do lookups in the background to update that cache? During the period Dyn was down, it would have continued serving from the local cache and retried the lookups in the background, keeping my app up. I can also see it improving performance, as my servers currently do lookups against the EC2 DNS on each HTTP request...
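Unbound gets close to this with its prefetch and serve-expired options; a minimal sketch, assuming a reasonably recent Unbound and the Debian/Ubuntu include directory (adjust the path for your distro):

sudo tee /etc/unbound/unbound.conf.d/serve-stale.conf >/dev/null <<'EOF'
server:
    prefetch: yes            # refresh popular entries shortly before their TTL expires
    serve-expired: yes       # answer from stale cache when the upstream lookup fails
    serve-expired-ttl: 86400 # keep serving stale answers for up to a day
EOF
sudo unbound-checkconf && sudo systemctl restart unbound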
Fastly is simply putting up a status page so they aren't contacted about issues, and letting people know it's about Dyn. And they are having internal issues with communication tools like Zendesk.
No idea if this would work, but could people theoretically just ping flood the IOT devices involved to mitigate the attack?
They run some sort of web server, since most devices provide a web interface, so clearly there's a port open which could be hit if the IP is known, and with the shoddy security in these devices I'd wonder whether their local (likely low-performance) hardware would be susceptible to something as simple as a ping flood.
Depends on how many PoPs they have. Looks like they have 4 in the eastern US.[0] If they are seeing attacks as large as the one Krebs saw a few weeks ago, that could certainly be enough to take down one or two, with the redirected traffic then taking down the other two.
I used to work for a DNS/DDoS-mitigation provider, and this was a very real problem. You pull the affected PoPs out of rotation, but then risk overloading the remaining PoPs with the redirected real traffic.
Before moving the real traffic, you also have to worry about blocking the DDoS traffic; otherwise you're just redirecting it to the other PoPs. Mitigating DDoS attacks is not fun, and they're hard to block.
Some sites intentionally defeat that caching, however, by setting a short TTL on replies. The idea is usually that it allows them to adjust very quickly to hardware failures or shift load across datacenters, but it has the consequence of making your infrastructure comparatively brittle.
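You can see just how short those TTLs are for yourself; the second column of dig's answer section is the TTL in seconds as your resolver currently sees it (twitter.com is only an illustrative name):

dig +noall +answer twitter.com A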
Surprised to see so many big names relying on a single provider. DNS is designed to be distributed, it should be possible to avoid a single point of failure.
This question came to my mind when I saw this post. The likely reason is management cost: synchronizing between different providers can go wrong and might be hard to debug when end users get different replies.
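A quick way to check whether a zone actually spreads its delegation across providers (github.com is only an example; if every server matches one pattern like *.dynect.net, it's a single provider):

dig +short NS github.com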
How can I, a proficient web developer but one with little experience working directly with the underlying infrastructure, help in whatever effort is being made to thwart this and related attacks? I feel a moral obligation to help, as these attacks seem a grave threat to our economy and could cause unrest given the current political climate. Thanks.
Read all the analysis you can to form a better understanding of how this all works. Use that information to design and run more resilient services in the future. Teach what you have learned to others.
https://cloudharmony.com/status-for-dyn is now (12:43pm EDT) showing Dyn's "US East" and "US West" centers as being down. Anyone know anything about this Cloudharmony service? How often does it update? and what is it monitoring?
Hmm... Seems to be quite widespread. Some of our Amazon AWS services (located in the US) that rely on SQS are reporting critical errors. Intercom.io is also down at present, which we use for support for our web apps. Not looking very good from here (in Australia).
If this kind of attacking does escalate, wouldn't it be possible to simply cut off requests from outside the United States at the points of entry? Basically, turning the US into an intranet?
We don't know yet, but the attack very easily could have been coming from a botnet of devices entirely inside the US. Geographic borders don't matter much at all for the Internet.
But even if it were, the creator of the botnet would first have to gain control and then issue a command, right? How would either of those things be possible from outside if there was no connection into the US?
Well, there's pretty much no way to impose the geographic borders of the US onto the Internet. Our networks here in the US are all global and integrated with other networks all around the world. The only places where this kind of geographic control is possible are countries like Iraq, Iran, China and others where the government controls all the ISPs. Countries with more freedom have a free flow of information and packets - and to me that is a very GOOD thing.
What this event shows is that using DNS as a load routing/balancing mechanism is a bad idea (that's why folks end up with low TTLs and an inability to specify truly redundant secondaries).
Interesting. Lots of sites have been down for me here in Mexico City: Twitter, GitHub, loads of other random sites. When I turned on my US-based VPN, it all started working again.
Why is there even a concept of managed DNS? Aren't we already paying >$1M/yr so that we can get a 32-bit integer from a string? This does not make sense.
How come you can access these sites from some countries? I imagine there are lots of name servers and that the attackers are specifically targeting the servers for the US?
Weird, works for me, from Italy (not sure if there isn't just some caching going on somewhere down the line and I can see it because of that).
edit: never mind, it's almost certain I've got it cached
I can query the authoritative ns*.p16.dynect.com DNS servers from Europe (Germany in my case), and the traceroute looks like it's near Frankfurt. So the anycasted copies here seem fine.
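For anyone who wants to repeat that check from their own vantage point, something along these lines works; the server name just follows the ns*.p16.dynect.com pattern above, and ZONE is a placeholder for a zone that dig NS shows is actually delegated to that pool (otherwise the query will be refused):

dig +short NS twitter.com                    # find the authoritative pool for a zone
dig @ns1.p16.dynect.com ZONE SOA +norecurse  # query the authoritative server directly, no recursion
traceroute ns1.p16.dynect.com                # rough idea of which anycast node you reach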
First of all, the only case where this matters right now is if you're modifying your dependencies. If you've previously built the project and don't touch your deps, GitHub won't be hit.
Reposting imglorp's comment on the root of the comment tree, as it's buried currently. This should restore service for those desperately needing to access Github etc ;)
> ....point your machine or router's DNS to use opendns resolvers instead of your regular ones: 208.67.222.222 and 208.67.220.220
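On a typical Linux box the quick-and-dirty version of that is the following (NetworkManager or systemd-resolved may overwrite resolv.conf, in which case set the servers in your network settings or on the router instead):

printf 'nameserver 208.67.222.222\nnameserver 208.67.220.220\n' | sudo tee /etc/resolv.conf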
Anyone having any issues with WhatsApp? Mobile text seems to work fine, but all images fail; desktop & web browser aren't connecting at the moment (west coast).
I'm using Google Public DNS too. I don't really know if there is a relation with DynDNS but I'm still experiencing issues in GitHub and Twitter, like partial loading of images.
Perhaps Google had old (but valid) records still in their cache for a while. Google DNS was working for me for a while, and then stopped. Apparently Dyn has the problem fixed, but maybe there are still some TTL-based propagation delays. I updated my internal network to use Dyn's Internet Guide public DNS and the problem is fixed.
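If you want to watch that propagation yourself, compare what different public resolvers return for the same name (twitter.com is illustrative; 216.146.35.35 is, if I remember the address correctly, one of Dyn's Internet Guide resolvers):

for r in 8.8.8.8 208.67.222.222 216.146.35.35; do
  echo "== $r"
  dig @"$r" +short twitter.com A
done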
Maybe this is their strategy: we break it, you buy it ;)
The Wikileaks twitter feed is putting out some really weird stuff lately. They are claiming their "supporters" are behind the attacks, which makes little sense.