
1.1.1.1 outage explanation - adamch
https://blog.cloudflare.com/today-we-mitigated-1-1-1-1/
======
gcommer
This is a great write-up. It's also why the DNS root servers have a policy of
surviving DDoS through massively over-provisioned, multi-org, anycasted
redundancy rather than this sort of smart DDoS mitigation that drops traffic:
DNS is so critical that any risk of dropping real traffic is unacceptable.
(obviously, such a scale is impractical for 99% of services)

A good takeaway from this outage for the average user would be to make sure
that your fallback DNS resolvers are operated by totally separate providers.
(eg, configure 1.1.1.1 with 8.8.8.8 as a fallback, rather than 1.1.1.1 and
1.0.0.1) (Edit: fixed cloudflare's secondary address)
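As a sketch, the strict-order fallback this advice assumes could look like the
following (the `query` callable is hypothetical, standing in for a real DNS
client; as the replies note, real stub resolvers often don't behave this way):

```python
# Strict-order fallback across independent providers. The resolver list is
# the one suggested above; `query` is an illustrative stand-in for a real
# DNS client that raises OSError on timeout or failure.
RESOLVERS = ["1.1.1.1", "8.8.8.8"]  # Cloudflare primary, Google as fallback

def resolve_with_fallback(name, resolvers, query):
    """Try each resolver in order; move on only when one fails."""
    last_err = None
    for server in resolvers:
        try:
            return query(name, server)
        except OSError as err:
            last_err = err  # remember the failure, try the next provider
    raise last_err or OSError("no resolvers configured")
```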

~~~
crtasm
I read on the Pi-Hole forums that 'fallback' is a misleading term because
clients don't work that way - they will happily spread requests between two
functioning DNS servers. Can anyone confirm this or provide further insight?

~~~
TravelTechGuy
Can confirm. If I have a secondary DNS specified in my router, other than my
pi-hole, the pi-hole becomes useless. My guess is the router is either
measuring response time and going with the most efficient server, or otherwise
round-robining the requests. Either way, requests bypass the pi-hole in such
quantities that it's useless.

PS: someone here mentioned that this behavior is OS-dependent - nope, this
happens on the router level, and all devices in my apartment suffer.

~~~
jschwartzi
It depends on how your router's DHCP server is configured. If you configure
your router to pass its own IP address out as the DNS server for the local
subnet then the router's behavior dictates how DNS works. If your router is
passing out an external DNS in the DHCP configuration, then you'll get OS-
dependent behavior.

My router uses a DNS resolver internally, and it will spread-cast to multiple
DNS servers and use the quickest response it can get. It also caches using the
TTL in the DNS response, and so it will serve up cached records transparently.
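The "spread-cast and use the quickest response" behaviour described above can
be sketched like this (the `query` callable is again a hypothetical stand-in
for a real DNS client, not any particular router's implementation):

```python
import concurrent.futures

def spread_cast(name, servers, query, timeout=2.0):
    """Send the same query to every server at once and return the first
    successful answer -- roughly what 'use the quickest response' means.
    Servers that error out are skipped while we wait for the others."""
    with concurrent.futures.ThreadPoolExecutor(len(servers)) as pool:
        futures = [pool.submit(query, name, s) for s in servers]
        for fut in concurrent.futures.as_completed(futures, timeout=timeout):
            try:
                return fut.result()
            except OSError:
                continue  # that server failed; another may still answer
    raise OSError("no server answered in time")
```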

------
saghm
> Our FRP framework allows us to express this in clear and readable code. For
> example, this is part of the code responsible for performing DNS attack
> mitigation:
>
>     def action_gk_dns(...):
>         [...]
>
>         if port != 53:
>             return None
>
>         if whitelisted_ip.get(ip):
>             return None
>
>         if ip not in ANYCAST_IPS:
>             return None
>
>         [...]

What does this code sample have to do with FRP? This code seems extremely
trivial and doesn't give any real indication to me why you'd need a framework
of any sort. It seems like they really want to emphasize that they use FRP,
but this code just seems completely unrelated.
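For what it's worth, the quoted guard clauses do run fine on their own,
without any framework. A self-contained sketch with made-up sample data (the
`ANYCAST_IPS` contents and the returned action are illustrative, not
Cloudflare's; the elided `[...]` parts stay elided):

```python
# Illustrative stand-in for the quoted Gatebot guards; the data below is
# sample data, not Cloudflare's actual configuration.
ANYCAST_IPS = {"1.1.1.1", "1.0.0.1"}
whitelisted_ip = {"198.51.100.7": True}

def action_gk_dns(ip, port):
    """Return a mitigation action for DNS traffic, or None to do nothing."""
    if port != 53:
        return None           # not DNS traffic
    if whitelisted_ip.get(ip):
        return None           # explicitly whitelisted
    if ip not in ANYCAST_IPS:
        return None           # not one of our anycast ranges
    return "mitigate"         # placeholder for the elided mitigation logic
```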

~~~
Chris911
From one of the linked presentations:
[https://github.com/cloudflare/gatelogic](https://github.com/cloudflare/gatelogic)

------
obeattie
Major credit to Cloudflare for publishing a clear, honest, and detailed
description of what happened. I wish more companies would do this.

One thing I’d be interested to know more about is why it took 17 minutes to
fix. While you can and always should strive to make them less likely, outages
are inevitable, so how you respond is crucial. Here the outage was very
obviously caused by a deployment that I’d assume was supervised by humans –
why did it take 17 minutes to roll back?

~~~
blackbrokkoli
I'm not an expert, but is 17 minutes for:

- shit is not working

- is this an attack?

- no it's us

- how?

- that's how

- let's go back

- have to get supervisor

- roll back huge thing

really that long?

~~~
dx034
With ~150 data centres, roll back alone probably took 5-10 minutes. Don't
think 17 minutes is that long.

~~~
Cthulhu_
For simple PagerDuty alerts I already need 15 minutes to open the app / logs
and figure out what's going on.

~~~
wool_gather
Good point, although the way it's described it sounds like the problem cropped
up right after deploy. So they would have been watching it actively. But as
said above, 17 minutes to notice, figure out what's going on, decide what to
do, and propagate the resolution seems reasonable.

------
drawkbox
Great information on the outage. Looks like another version 2 syndrome side
effect [1].

> _Today, in an effort to reclaim some technical debt, we deployed new code
> that introduced Gatebot to Provision API._

> _What we did not account for, and what Provision API didn’t know about, was
> that 1.1.1.0 /24 and 1.0.0.0/24 are special IP ranges. Frankly speaking,
> almost every IP range is "special" for one reason or another, since our IP
> configuration is rather complex. But our recursive DNS resolver ranges are
> even more special: they are relatively new, and we're using them in a very
> unique way. Our hardcoded list of Cloudflare addresses contained a manual
> exception specifically for these ranges._

> _As you might be able to guess by now, we didn't implement this manual
> exception while we were doing the integration work. Remember, the whole idea
> of the fix was to remove the hardcoded gotchas!_

When porting legacy code it is important not only to understand the edge cases
and technical debt built up over time, but also to test more heavily in
production: you never know whether you caught them all, because someone built
those workarounds long ago and there may be unknown hacks that were
cornerstones of the system, for better or worse.

Phased and alpha/beta rollouts, in an almost A/B-testing way, are good for
replacement systems. Version 2 systems can also add new attack vectors or
other single points of failure that aren't as well known as the legacy
problems; the Provision API seems like a candidate for that.

Over time the Version 2 system will be hardened, just before it is EOLed and
replaced again to fix all the new problems that have arisen. Version 2s do
innovate, but they also trade fixing old issues and pain points for new
unknown problems.

[1] [https://en.wikipedia.org/wiki/Second-system_effect](https://en.wikipedia.org/wiki/Second-system_effect)

~~~
teddyh
“ _Yes, I know, it’s just a simple function to display a window, but it has
grown little hairs and stuff on it and nobody knows why. Well, I’ll tell you
why: those are bug fixes. One of them fixes that bug that Nancy had when she
tried to install the thing on a computer that didn’t have Internet Explorer.
Another one fixes that bug that occurs in low memory conditions. Another one
fixes that bug that occurred when the file is on a floppy disk and the user
yanks out the disk in the middle. That LoadLibrary call is ugly but it makes
the code work on old versions of Windows 95._

 _Each of these bugs took weeks of real-world usage before they were found._ ”

— Things You Should Never Do, Part I,
([https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/](https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/))

~~~
slededit
Engineers realized that maintaining all those bug fixes was too expensive. So
as an industry we collectively agreed everyone will do constant rewrites.

That Win 95 fix? Doesn't matter, because Win 95 was auto-upgraded into
something unrecognizable (SaaS is a wonderful thing).

Low memory? Not our problem - go buy another computer. Remember, programmer
time is more valuable.

You can try to buck the trend, but your dependencies won't, and the customer
doesn't care whose code the bug is in. You won't get any brownie points, so
you might as well save the effort.

It's a brave new world.

------
Robin_Message
What's interesting here is that the automatic cure (DDoS protection) was worse
than the disease (even if there was an attack, blocking all access to the DNS
servers is potentially worse than letting them get overloaded).

I wonder if it would be possible to express the idea that if a block being
applied drops traffic well below expected levels, it must be a mistake?
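That idea could be expressed as a simple guardrail: compare traffic after a
rule is applied against the pre-attack baseline, and flag the rule if the
drop is far larger than removing attack traffic alone would explain. A sketch
with entirely made-up thresholds (and note the objection in the reply: a real
attack ending also drops traffic):

```python
def block_looks_wrong(baseline_pps, observed_pps, max_drop_ratio=0.90):
    """Heuristic guardrail: a DDoS rule should remove attack traffic, not
    legitimate load. If post-block traffic falls more than `max_drop_ratio`
    below the pre-attack baseline, flag the rule as a likely misfire.
    The 0.90 threshold is purely illustrative."""
    if baseline_pps <= 0:
        return False  # no baseline to compare against
    drop = 1 - observed_pps / baseline_pps
    return drop > max_drop_ratio
```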

~~~
ben509
Is it the block causing the traffic drop, or the DDoS, though?

------
walrus01
Commendable honesty and level of detail in the public RFO.

------
frenchie4111
Understandable outage. I switched back to cloudflare when it came back up, but
this did prompt me to drop 8.8.8.8 in as a 3rd fallback.

------
IronWolve
Doesn't seem fixed: archive.is still won't resolve, and Cloudflare DNS is
returning Cloudflare web error pages.

[https://i.imgur.com/APpQPTJ.png](https://i.imgur.com/APpQPTJ.png)

~~~
acdha
That's apparently due to the archive.is servers returning different results
based on the requesting IP:

[https://community.cloudflare.com/t/archive-is-error-1001/18227/3](https://community.cloudflare.com/t/archive-is-error-1001/18227/3)

------
cm2187
Do they just use python as pseudo code or do they actually run their attack
detection in python?

~~~
usmannk
It is python. They linked a presentation[1] and a talk about the bot.

[1] [https://speakerdeck.com/majek04/gatelogic-somewhat-functional-reactive-framework-in-python](https://speakerdeck.com/majek04/gatelogic-somewhat-functional-reactive-framework-in-python)

~~~
cm2187
It's interesting. I would rather have expected them to use something
low-latency/high-performance like C++ or Erlang, given their scale and
performance criticality.

~~~
heavenlyblue
This is something that only keeps track of user settings data and consequently
configures the endpoints given that data.

It has got nothing to do with packet filtering per se.

------
jonnismash
>The next time we mitigate 1.1.1.1 traffic, we will make sure there is a
legitimate attack hitting us.

I fucking love these guys

------
xtrimsky1234
This downtime annoyed me at a critical time. Even though I think it's great
they are transparent about it, I don't see myself using their DNS again.

------
8_hours_ago
I wonder if the change was manually reverted after 17 minutes, or if
Cloudflare has a system that watches for a spike in failures and automatically
reverts the most recent change.
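The automated version could be a simple post-deploy watchdog. A sketch, with
the caveat that we don't know whether Cloudflare actually automates this and
every name here is invented:

```python
import time

def watch_deploy(error_rate, revert, threshold=0.05, window_s=60,
                 poll_s=5, clock=time.monotonic, sleep=time.sleep):
    """Poll an error-rate metric for `window_s` seconds after a deploy and
    call `revert()` if it crosses `threshold`. Returns True if a rollback
    was triggered, False if the deploy survived the watch window. The
    `clock`/`sleep` parameters are injectable for testing."""
    deadline = clock() + window_s
    while clock() < deadline:
        if error_rate() > threshold:
            revert()
            return True
        sleep(poll_s)
    return False
```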

------
known
Error 1001 DNS resolution error when trying to access archive.li

But works with 9.9.9.9

------
throwaway2048
another decent alternative fallback with the same featureset is quad9

[https://www.quad9.net/](https://www.quad9.net/)

~~~
kondro
The core feature of Cloudflare's DNS is getting you closer, faster resolution
of their CDN (probably one of the largest at this point, barring the
established enterprise incumbents) in a privacy-first way.

Quad9's goal seems to be about threat-detection and prevention.

It would be nice if you could have both simultaneously (and maybe you can),
but at the moment both services are actually quite different.

~~~
bigiain
9.9.9.10 is "quad nine without the threat detection" from memory.

~~~
kondro
Yeah, but the reason why 1.1.1.1 is so fast for sites that use Cloudflare as
DNS is because Cloudflare is the authoritative DNS for them.

The only way you get that in a more generic sense is if a specialist DNS CDN
provider started up that provided DNS services for all the existing CDNs (or
they all agreed to some type of federated standard that let them share the
same anycast IP addresses for DNS resolution).

~~~
dx034
Also, Cloudflare has a huge number of data centres by now, probably more than
any other service; even Google often underperforms them. It's debatable
whether a few ms make a difference, but they can for people living in remote
areas where CF has a centre and the nearest 9.9.9.9/8.8.8.8 is 100ms away.

~~~
scandinavian
It's important to note that Cloudflare's DNS does not send the EDNS Client
Subnet option, which can have a negative performance impact depending on
where you live.

~~~
patrickmcmanus
or a positive privacy impact depending on where you live :)

------
dingo_bat
TL;DR: we should have used an IP that is not traditionally used for testing
and internal stuff by everybody including Cisco.

~~~
Dylan16807
Not even close. The system had a glitch because they were running a major DNS
resolver at all. It had nothing to do with the baggage that comes with 1.1.1.1
specifically.

~~~
isostatic
However, a few days ago there was a BGP-based outage on Cloudflare DNS,
wasn't there?

------
baby
I switched to 1.1.1.1 when it was released, and since then I've had multiple
issues with free Wi-Fi networks, where they fail to hijack my DNS requests to
let me log in to their portal. I assume this is a good thing, but can someone
explain to me why this is happening and what the state of improving these
wifi portals is?

PS: sorry for hijacking the thread.

~~~
syntheticcdo
Might be unrelated. Always try [http://neverssl.com](http://neverssl.com) to
trigger the captive login on public wifi.

------
decker
I wonder if they are going to factor the availability into all the blog posts
about how fast they are.

