 outage explanation (cloudflare.com)
459 points by adamch on June 1, 2018 | 84 comments

This is a great write up. It's also why the DNS root servers have a policy of surviving DDoS through massively over-provisioned, multi-org, anycasted redundancy rather than this sort of smart DDoS mitigation that drops traffic: DNS is so critical that any risk of dropping real traffic is unacceptable. (obviously, such a scale is impractical for 99% of services)

A good takeaway from this outage for the average user would be to make sure that your fallback DNS resolvers are operated by totally separate providers. (eg, configure with as a fallback, rather than and (Edit: fixed cloudflare's secondary address)

I read on the Pi-Hole forums that 'fallback' is a misleading term because clients don't work that way - they will happily spread requests between two functioning DNS servers. Can anyone confirm this or provide further insight?

Different OSes handle it differently. Windows tries the primary, waits 1 second and then starts trying secondaries.

glibc tries in order. musl (Alpine Linux) tries all in parallel and returns the first resolution (if any).
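The musl behavior can be sketched in a few lines of Python (a hypothetical illustration, not musl's actual C implementation; `query` is a stand-in for a real DNS lookup):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def resolve_parallel(name, servers, query):
    """musl-style resolution: query every server at once and return
    the first successful answer (None means that server failed)."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(query, server, name) for server in servers]
        for future in as_completed(futures):
            answer = future.result()
            if answer is not None:
                return answer
    return None
```

The upside is latency (the fastest server wins); the downside, as noted above, is that a "secondary" configured this way is not really a fallback at all.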

Most systems do not round-robin by default afaik. They just try the listed DNS servers in order (and resolv.conf will only respect the first 3 IIRC).

DNS fallback is often also misunderstood to mean "fall back if the domain is not found", but it really means "fall back if the name server fails to respond". If the domain is not found (i.e. servers returns a valid NXDOMAIN response), most resolvers do not consult any other name servers.
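That distinction can be made concrete with a small sketch (hypothetical logic, not any particular libc's code): NXDOMAIN is a valid answer and is returned immediately, while timeouts and SERVFAILs cause the next server to be tried.

```python
NXDOMAIN, SERVFAIL, TIMEOUT = "NXDOMAIN", "SERVFAIL", "TIMEOUT"

def resolve_with_fallback(name, servers, query):
    """Fall back only on server failure, never on a valid NXDOMAIN.
    `query` stands in for a real DNS lookup against one server."""
    for server in servers:
        answer = query(server, name)
        if answer in (SERVFAIL, TIMEOUT):
            continue          # server failed: try the next one
        return answer         # real answer, including NXDOMAIN: stop here
    return SERVFAIL           # every server failed
```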

Round-robin, cascading name lookups, and other frequently desired functionality can be obtained through dnsmasq or similar caching/forwarding name servers.

There are also options for libc that affect this behavior that can be set in /etc/resolv.conf like:

    options timeout:2 attempts:2 rotate single-request-reopen
The resolv.conf(5) man page describes these options (note that glibc calls the retry count `attempts`).
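Putting the earlier advice together, a minimal /etc/resolv.conf pairing two independent providers might look like this (addresses chosen for illustration):

```text
# /etc/resolv.conf
# Cloudflare primary, with Quad9 as an independent fallback provider
options timeout:2 attempts:2 rotate
```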

Can confirm. If I have a secondary DNS specified in my router other than my pi-hole, the pi-hole becomes useless. My guess is the router is either measuring response time and going with the fastest, or otherwise round-robining the requests. Either way, requests bypass the pi-hole in such quantities that it becomes useless.

PS: someone here mentioned that this behavior is OS-dependent - nope, this happens on the router level, and all devices in my apartment suffer.

It depends on how your router's DHCP server is configured. If you configure your router to pass its own IP address out as the DNS server for the local subnet then the router's behavior dictates how DNS works. If your router is passing out an external DNS in the DHCP configuration, then you'll get OS-dependent behavior.

My router uses a DNS resolver internally, and it will spread-cast to multiple DNS servers and use the quickest response it can get. It also caches using the TTL in the DNS response, and so it will serve up cached records transparently.
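A TTL-respecting cache like the one such a router keeps is simple to sketch (a hypothetical illustration, not any router's firmware):

```python
import time

class DnsCache:
    """Minimal cache that honors the TTL from the DNS response."""

    def __init__(self):
        self._store = {}

    def put(self, name, answer, ttl, now=None):
        now = time.monotonic() if now is None else now
        self._store[name] = (answer, now + ttl)

    def get(self, name, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(name)
        if entry is None or now >= entry[1]:
            return None     # missing or expired: a real query is needed
        return entry[0]     # served transparently from cache
```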

Then your devices are likely using your router as a DNS resolver, which in turn talks to your pihole and the external one. And thus it depends on your router's OS what it does.

Dnsmasq has different modes if you give it multiple upstream resolvers. By default it only queries the fastest one, which it determines by every now and then sending a query to all the servers and picking the one that replies first. You can tell it to always query all servers, or always query them in the order specified.
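Those three modes map to dnsmasq configuration roughly like this (upstream addresses are illustrative):

```text
# /etc/dnsmasq.conf

# Default: dnsmasq periodically probes all servers and then
# forwards each query to whichever replied fastest.

# Alternative modes (enable at most one):
#all-servers     # send every query to all upstreams, use the first reply
#strict-order    # always try upstreams in the listed order
```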

If your recursive DNS servers are Unbound DNS, they will keep note of which infra (upstream forwarder) nodes are responding and will blacklist the ones that do not. You can see this with

    unbound-control dump_infra
After the infra ttl expires, they will probe the down nodes. If they respond, then traffic will be distributed among them again.

Or if you're not as comfortable with Google services.

For anyone wondering, is https://www.quad9.net/ who claim to not only resolve requests but also check them against IBM X-Force's threat intelligence database.

Also from https://www.quad9.net/faq/

Secure IP: Provides: Security blocklist, DNSSEC, No EDNS Client-Subnet sent. If your DNS software requires a Secondary IP address, please use the secure secondary address of

Unsecured IP: Provides: No security blocklist, DNSSEC, sends EDNS Client-Subnet. If your DNS software requires a Secondary IP address, please use the unsecured secondary address of

Client-Subnet lets providers with widely-distributed servers pick one that's near you to serve your content.

For the curious, they also have an extremely detailed privacy policy: https://www.quad9.net/policy/

Seems like a fair deal to me. I get free DNS service, and companies sponsoring this program get metrics on threats and general Internet usage. I'm a little skeptical of their claims that individuals can't be identified from their anonymized data. E.g. I probably only get one or two hits on my personal website every week, so it might not be hard for a malicious employee to deanonymize visitors to my site.

Some highlights from the policy:

Many nations classify IP addresses as Personally-Identifiable Information (PII), and we take a conservative approach in treating IP addresses as PII in all jurisdictions in which our systems reside. Our normal course of data management does not have any IP address information or other PII logged to disk or transmitted out of the location in which the query was received. We may aggregate certain counters to larger network block levels for statistical collection purposes, but those counters do not maintain specific IP address data nor is the format or model of data stored capable of being reverse-engineered to ascertain what specific IP addresses made what queries.

There are exceptions to this storage model: In the event of events or observed behaviors which we deem malicious or anomalous, we may utilize more detailed logging to collect more specific IP address data in the process of normal network defense and mitigation. This collection and transmission off-site will be limited to IP addresses that we determine are involved in the event.


We do not correlate or combine information from our logs with any personal information that you have provided Quad9 for other services, or with your specific IP address.


Quad9 DNS Services generate and share high level anonymized aggregate statistics including threat metrics on threat type, geolocation, and if available, sector, as well as other vertical metrics including performance metrics on the Quad9 DNS Services (i.e. number of threats blocked, infrastructure uptime) when available with the Quad9 threat intelligence (TI) partners, academic researchers, or the public.

Quad9 DNS Services share anonymized data on specific domains queried (records such as domain, timestamp, geolocation, number of hits, first seen, last seen) with its threat intelligence partners. Quad9 DNS Services also builds, stores, and may share certain DNS data streams which store high level information about domain resolved, query types, result codes, and timestamp. These streams do not contain IP address information of requestor and cannot be correlated to IP address or other PII.


Quad9 does not track visitors over time and across third-party websites, and therefore does not respond to Do Not Track signaling.

At least for personal use, OpenNIC is nice and many of the servers say they do not keep logs. I use the (2a05:dfc7:5::53) anycast server and it works well. They are more likely to disappear randomly than the ones run by large companies.


Am I the only one not comfortable using DNS servers run by random volunteers? Is there any "vouching" of the operators or regular checks on common domains on those OpenNIC servers?

Well, my computer runs a bunch of software written by random volunteers so personally I'm not that worried about it. I personally prefer that to the available alternatives. Yes it would be great to have monitoring (of all the dns services) and I'm not sure if anyone does that, but considering the perpetual train wreck that is DNSSEC it doesn't really alter anything as all dns is vulnerable.

With https, whoever you end up contacting needs to cough up a valid certificate for the domain in the url. I run HTTPS Everywhere to try to get that protection as often as possible. In practice there are still ways that dns tricks can cause trouble but they are not as bad as you might think and browsers are slowly pushing an https only web (I hear Chrome will soon start marking all http sites as "insecure" rather than https sites as secure). ssh has its own authentication method and I do try to verify new hosts via another secure channel.

Speaking of not trusting companies, I am reminded that at one point I noticed that CenturyLink seems to be intercepting all dns traffic no matter the intended destination, so without either a secure connection past the ISP or maybe a nonstandard port it may not matter what dns server you try to use. Hopefully all ISPs that do this do the horrible redirect of invalid domains thing so attempting an http connection to an invalid domain might show if this is the case (I found it trying some of the nonstandard domains that OpenNIC resolves).

You're definitely not alone. This sounds a little too Tor-ish for my taste.

I’m amazed at how many people abandoned other providers and blindly switched to

I can’t even use that address with the ISP Alestra in Mexico

The lack of operational knowledge is real, and there is no easy fix other than resilient defaults by a benevolent dictator (in this case, resolvers across netblock and provider demarcations) or spending the time to educate yourself first on the “why” then the “how”.

(the cloudflare secondary is

Cloudflare makes it weirdly difficult to find this. is plastered over many pages but never alongside the secondary.

That is not my experience at all. By following the `install` instructions on, all available DNS addresses are listed for both IPv4 and IPv6.

FYI, works over SSL, so there's no need to format it

That's not necessarily a bad thing, since it's probably better to use another source for a secondary if that's your primary.

I'm using cloudflare first, and google as secondary in my router. For a while I was running my own, but it became less necessary when I stopped doing as much dev at home.

Not if you're not comfortable handing so much of your internet activity to Google, but as others have pointed out there are alternatives to spread queries across.

> Our FRP framework allows us to express this in clear and readable code. For example, this is part of the code responsible for performing DNS attack mitigation:
>
>     def action_gk_dns(...):
>         [...]
>
>         if port != 53:
>             return None
>
>         if whitelisted_ip.get(ip):
>             return None
>
>         if ip not in ANYCAST_IPS:
>             return None
>
>         [...]

What does this code sample have to do with FRP? This code seems extremely trivial and doesn't give any real indication to me why you'd need a framework of any sort. It seems like they really want to emphasize that they use FRP, but this code just seems completely unrelated.

From one of the linked presentations: https://github.com/cloudflare/gatelogic

Major credit to Cloudflare for publishing a clear, honest, and detailed description of what happened. I wish more companies would do this.

One thing I’d be interested to know more about is why it took 17 minutes to fix. While you can and always should strive to make them less likely, outages are inevitable, so how you respond is crucial. Here the outage was very obviously caused by a deployment that I’d assume was supervised by humans – why did it take 17 minutes to roll back?

The problem is so clear in their write up that I can understand your thinking. However, in reality as this was going down it was probably not that clear cut.

Especially when you consider that they are getting DoS attacks every 2-3 minutes - so all deploys are going out into a hectic world and the dots maybe aren't that easy to connect under those circumstances.

I'm not an expert, but is 17 minutes for:

- shit is not working

- is this an attack?

- no it's us

- how?

- that's how

- let's go back

- have to get supervisor

- roll back huge thing

really that long?

With ~150 data centres, roll back alone probably took 5-10 minutes. Don't think 17 minutes is that long.

For simple PagerDuty alerts I already need 15 minutes to open the app / logs and figure out what's going on.

Good point, although the way it's described it sounds like the problem cropped up right after deploy. So they would have been watching it actively. But as said above, 17 minutes to notice, figure out what's going on, decide what to do, and propagate the resolution seems reasonable.

Not to mention selectively purging all of the rules created (I assume) at every edge server in their network that were blackholing all of the traffic to the resolver. There's probably a command for it, but all in all 17 minutes seems quite a respectable turnaround time.

Did they really deploy to all 150 DCs at once? Why was this release not done in phases? Not even a canary?

Maybe they did; the strategy is probably to deploy to small DCs first. But that also means the DDoS detection wouldn't trigger, so this is probably something that couldn't get caught during such tests.
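A canary strategy is easy to express; the sketch below (hypothetical, not Cloudflare's tooling) splits data centres into widening waves, with health verification expected between waves:

```python
def rollout_waves(datacenters, wave_sizes=(1, 5, 25)):
    """Split an ordered list of DCs into widening deploy waves.

    The caller deploys one wave at a time and verifies health metrics
    before continuing; anything left over ships in a final wave."""
    waves, start = [], 0
    for size in wave_sizes:
        if start >= len(datacenters):
            break
        waves.append(datacenters[start:start + size])
        start += size
    if start < len(datacenters):
        waves.append(datacenters[start:])
    return waves
```

Ordering the input from low-traffic to high-traffic DCs keeps the blast radius of early waves small, though as noted above a small canary may not carry enough traffic to trigger DDoS detection at all.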

Yes, this exactly. And to expand on what happens around that:

- monitoring system picks up irregularity (smoothed over some window of time, which delays alerting)

- alert propagates to humans

- humans may take time to notice alert (even a page takes a few seconds to read)

- humans make decisions, may need to talk to other humans (all of what you said above)

- humans evaluate correct procedure, double-check it (you don't want them making the wrong "fix" and making something else worse, do you?)

- humans execute commands

- commands take time to run on large collections of computers (running them completely in parallel can cause thundering herd issues, in some cases)

In 2000 the answer would be no. In 2018 I think it is.

Things change in a time when you would freak out if you got up in the morning and google.com did not work.

Humans react, analyze, and communicate at the same speed we did in 2000. Our tools may have gotten better, but that only cuts down on part of the process.

Automated processes can only mitigate so many edge cases. Even then, humans need to be involved, and that slows things down.

Disagree. Have you ever seen a toddler use an iPad? More input/stimuli in 2018 = higher capacity to analyze said inputs. Or an overload of capacity which results in the epidemic of mental illnesses and psychotic breakdowns we witness in this post-social media society.

Another example - I can record a video and broadcast internationally, translated on the fly into dozens of languages, effectively communicating with significantly more people than if I could not harness that technical capability. In 2000 that communication process would have been orders of magnitude longer. (Did they have on-the-wire translation then? Idk, just making an assumption to illustrate my point.)

Great information on the outage. Looks like another version 2 syndrome side effect [1].

> Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.

> What we did not account for, and what Provision API didn’t know about, was that and are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.

> As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!

When porting legacy code it is not only important to understand the edge cases and technical debt built up over time, but to test more heavily in production because you never know if you got them all because some smart guy built them long ago and/or there are unknown hacks that were cornerstones of the system for better or worse.

Phased and alpha/beta rollouts, in an almost A/B-testing way, are good for replacement systems. Version 2 systems can also add new attack vectors or other single points of failure that aren't as well known as the legacy problems; the Provision API seems like a candidate for that.

Over time the Version 2 system will be hardened just before it is EOL and replaced again to fix all the new problems that arise over time. Version 2s do innovate, but they also trade fixing old issues and pain points for new, unknown problems.

[1] https://en.wikipedia.org/wiki/Second-system_effect

Yes, I know, it’s just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I’ll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn’t have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.

Each of these bugs took weeks of real-world usage before they were found.

— Things You Should Never Do, Part I, (https://www.joelonsoftware.com/2000/04/06/things-you-should-...)

Engineers realized that maintaining all those bug fixes was too expensive. So as an industry we collectively agreed everyone will do constant rewrites.

That win 95 fix? Doesn't matter because win 95 was auto upgraded to be unrecognizable (SaaS is a wonderful thing).

Low memory? Not our problem - go buy another computer. Remember programmer time is more valuable.

You can try to buck the trend but your dependencies won't, and the customer doesn't care whose code the bug is in. You won't get any brownie points, so you might as well just save the effort.

It's a brave new world.

What's interesting here is that the automatic cure (DDoS protection) was worse than the disease (even if there was an attack, blocking all access to the DNS servers is potentially worse than letting them get overloaded).

I wonder if it would be possible to express the idea that if a block being applied drops traffic well below expected levels, it must be a mistake?
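One way to express that idea (a sketch, not how Gatebot actually works): compare traffic after a rule is applied against the pre-attack baseline, and flag the rule for review or rollback if legitimate-looking traffic has all but vanished.

```python
def mitigation_looks_wrong(baseline_qps: float, observed_qps: float,
                           max_drop_fraction: float = 0.95) -> bool:
    """Return True when a freshly applied block drops traffic so far
    below the pre-attack baseline that the rule itself is suspect."""
    if baseline_qps <= 0:
        return False                      # nothing to compare against
    drop = 1.0 - (observed_qps / baseline_qps)
    return drop > max_drop_fraction
```

Using the pre-attack baseline, rather than the traffic level during the attack, is what distinguishes "the block is working" from "the block is eating everything".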

Is it the block causing the traffic drop, or the DDOS though?

Commendable honesty and level of detail in the public RFO.

Understandable outage. I switched back to cloudflare when it came back up, but this did prompt me to drop in as a 3rd fallback.

Doesn't seem fixed: still can't resolve archive.is, cloudflare is giving DNS resolution errors.


That's apparently due to the archive.is servers returning different results based on the requesting IP:


I’ve never been able to resolve some of those domains, specifically archive.is. Not a fan, which is why I switched away.

Yup now I remember why I took it off.

Do they just use python as pseudo code or do they actually run their attack detection in python?

It is python. They linked a presentation[1] and a talk about the bot.

[1] https://speakerdeck.com/majek04/gatelogic-somewhat-functiona...

It's interesting. I would have expected them to use something low-latency/high-performance like C++ or Erlang, given their scale and performance criticality.

This is something that only keeps track of user settings data and consequently configures the endpoints given that data.

It has got nothing to do with packet filtering per se.

Any supplementary thoughts on the fact they are running their attack detection in python? What's the point?

There is a reply in a parallel chain (timestamped after your comment) which expands on the line of thought: https://news.ycombinator.com/item?id=17204069

>The next time we mitigate traffic, we will make sure there is a legitimate attack hitting us.

I fucking love these guys

This downtime annoyed me at a critical time. Even though I think it's great they are transparent about it, I don't see myself using their DNS again.

I wonder if the change was manually reverted after 17 minutes, or if Cloudflare has a system that watches for a spike in failures and automatically reverts the most recent change.

Error 1001 DNS resolution error when trying to access archive.li

But works with

another decent alternative fallback with the same featureset is quad9 (


The core feature of Cloudflare's DNS is getting you closer and faster resolves to their CDN (probably one of the largest at this point, barring the established enterprise incumbents) in a privacy-first way.

Quad9's goal seems to be about threat-detection and prevention.

It would be nice if you could have both simultaneously (and maybe you can), but at the moment both services are actually quite different. is "quad nine without the threat detection" from memory.

Yeah, but the reason why is so fast for sites that use Cloudflare as DNS is because Cloudflare is the authoritative DNS for them.

The only way you get that in a more generic sense is if a specialist DNS CDN provider started up that provided DNS services for all the existing CDNs (or they all agreed to some type of federated standard that let them share the same recursive anycast IP addresses for DNS resolution).

Also, Cloudflare has a huge number of data centres by now, probably more than any other service. Even Google often underperforms them. Debatable if a few ms make a difference, but it can for people living in remote areas where CF has a centre and the next is 100ms away.

It's important to note that Cloudflare's DNS does not send the EDNS Client-Subnet option, which can have a negative performance impact depending on where you live.

or a positive privacy impact depending on where you live :)

I'm on Google Fiber; is often 1ms for me, and Google's DNS is ~8-12ms.

It's amazing how fast it is.

I guess Google doesn't deploy DNS at each edge. Otherwise it's hard to explain that you often get >5ms in major cities.

Maybe check out what NS1 is doing with DNS


TL;DR: we should have used an IP that is not traditionally used for testing and internal stuff by everybody including Cisco.

They were given the IP block by APNIC specifically with an agreement that they would analyze the junk traffic and report back on it. No one attempting this before has been able to keep up with the amount of noise pointed at that IP. Everyone involved knew exactly what they were getting into.

Sounds like it was truly a "DR" in your case.

Not even close. The system had a glitch because they were running a major DNS resolver at all. It had nothing to do with the baggage that comes with specifically.

However, a few days ago there was a BGP-based outage on Cloudflare DNS, wasn't there?

I switched to when it was released, and since then I've had multiple issues with free wifis where they fail to hijack my DNS requests to allow me to log in to their portal. I assume this is a good thing, but can someone explain to me why this is happening and what the state is on improving these wifi portals?

PS: sorry for hijacking the thread.

Might be unrelated. Always try http://neverssl.com to trigger the captive login on public wifi.

Cisco shipped some hardware which incorrectly used as an internal interface. The long-term fix will be those networks being correctly configured, but that may not happen until that hardware is EOLed.

In the meantime, have you tried adding other addresses such as to see whether the fallback works?

Set your DNS to and it might be a bit better. Still Cloudflare's DNS resolver, but used by less stupid Wi-Fi portals.

I wonder if they are going to factor the availability into all the blog posts about how fast they are.

