
Cloudflare outage caused by bad software deploy - TomAnthony
https://blog.cloudflare.com/cloudflare-outage/
======
aleem
If a single regex can take down the Internet for half an hour, that's
definitely not good -- especially for a class of errors that can easily be
prevented, tested for, etc.

The timing is unfortunate too, after calling out Verizon for lack of due
process and negligence.

I'm sure they have an undo or rollback for deployments, but it's probably
worth investing in further.

They also need to resolve the catch-22 where people could not log in and
disable the CloudFlare proxy ("orange cloud") since cloudflare.com itself was
down.

~~~
nbm
Since the work involved in a regular expression match can depend heavily on
the input for non-trivial expressions, one fun case (probably not the one
here, though) is that a user of your system could start sending a pathological
input that no amount of standard testing (synthetic or replayed traffic,
staging environments, production canaries) would have caught.

Didn't take anything down, but did cause an inordinate amount of effort
tracking down what was suddenly blocking the event loop without any
operational changes to the system...
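
To make it concrete, a toy Python version of the failure shape (not the
actual pattern from our incident): an innocuous-looking expression that's
fast on everything a test suite would plausibly generate, until one crafted
input shows up.

    import re
    import time

    # A nested quantifier like (a+)+ gives a backtracking engine
    # exponentially many ways to split up the 'a's once the overall
    # match fails, so time explodes with input length.
    PATTERN = re.compile(r'(a+)+$')

    for n in (10, 20, 22, 24, 26):
        text = 'a' * n + 'b'   # the trailing 'b' forces the failure
        start = time.perf_counter()
        PATTERN.match(text)
        print(n, f'{time.perf_counter() - start:.2f}s')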

~~~
menzoic
Fuzz testing could help

~~~
nbm
Yep, it could help in some cases.

It's nowhere near as standard a part of release verification as the other
approaches, though.

And in complex cases (say, a large multi-tenant service with complex
configuration), it can be very hard to find the combination of inputs
necessary to catch this issue. If you have hundreds of customer
configurations, and only one of them has this particular feature enabled (or
uses this sort of expression), fuzzing is less likely to be effective.
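
For what it's worth, a minimal sketch of what that fuzzing could look like in
Python: run each match in a subprocess with a time budget and flag anything
that blows it. Note the generator has to be biased toward long same-character
runs; uniform random strings almost never hit the pathological shape, which
is exactly the discoverability problem.

    import multiprocessing
    import random
    import re

    PATTERN = r'(a+)+$'   # stand-in for the rule under test

    def try_match(text):
        re.match(PATTERN, text)

    def fuzz(trials=100, deadline=1.0):
        for _ in range(trials):
            # Heavily biased toward 'a' so long runs actually occur.
            text = ''.join(random.choices('ab', weights=[30, 1], k=30))
            worker = multiprocessing.Process(target=try_match, args=(text,))
            worker.start()
            worker.join(deadline)
            if worker.is_alive():          # blew the time budget
                worker.terminate()
                worker.join()
                print('pathological input:', text)

    if __name__ == '__main__':
        fuzz()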

------
neom
I'm always beyond impressed with how responsive and transparent CF is with
incident and post-mortem communication. Given who the CEO and COO are, I
suppose this shouldn't be surprising; nevertheless, as a customer it builds a
great deal of trust. Kudos.

~~~
grey-area
Yes, they do really well on this - open and transparent, posting information
as soon as they were fairly sure what the problem was. I always really enjoy
their writing, both incident reports and write-ups of new features. The only
thing I think they could have managed better was their status page, which
claimed they were up (every service was green) when they were not.

~~~
gridspy
I think they were blindsided that this was even possible. So they hadn't
thought to add a panel to the status page for this one.

I bet they will now.

~~~
grey-area
I think their status page doesn't update service status automatically in
response to downtime, which it really should.

------
eganist
Kinda wonder at this point what findings exist on their Availability SOC 2,
assuming they've gotten one.

The repeated outages plus the constant malicious advertising by scammy ad
providers through cloudflare are slowly turning me off to the service as a
potential enterprise customer. Unfortunate too since plenty of superlatively
qualified people build great things there (hat tip to Nick Sullivan), but it
seems like the build-fast culture may now be impeding the availability
requirements of their clients.

This is also a great example of a case where SLAs are meaningless without
rigorous enforcement provisions negotiated in by enterprise clients.
Cloudflare advertises 100% uptime ([https://www.cloudflare.com/business-
sla/](https://www.cloudflare.com/business-sla/)) but every time they fall
over, they're down for what, an hour at a time? Just this one issue would've
blown anyone else's 99.99% SLA out of the water --
[https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr](https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr)

I love the service, but if I'm to consider consuming the service, they'd do
well to have the equivalent of a long term servicing branch as its own
isolated environment, one where changes are only merged in once they've proven
to be hyper-stable.

~~~
klodolph
As an engineer, I get pissed whenever I see 100% uptime, or eleven-nines,
nine-nines, or other impossible targets. Like, how am I supposed to design a
system with numbers like that?

~~~
Lorin
Deploy once, never update, and deploy a missile defense to prevent backhoes
from digging up fiber?

~~~
JaimeThompson
You honestly think a missile defense system will work? Backhoes are much more
creative than that. You will need defense in depth, roaming patrols, as well
as air- and satellite-based monitoring assets.

~~~
jacques_chester
And then the fibre will get cut by a building crew working on a guard tower.

------
ti_ranger
> We make software deployments constantly across the network and have
> automated systems to run test suites and a procedure for deploying
> progressively to prevent incidents.

Good.

> Unfortunately, these WAF rules were deployed globally in one go and caused
> today’s outage.

Wow. This seems like a very immature operational stance.

Any deployment of any kind should be subject to the minimum deployment safety
that they claim to have.

> At 1402 UTC we understood what was happening and decided to issue a ‘global
> kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to
> normal and restored traffic. That occurred at 1409 UTC.

Many large companies would have had automatic roll-back of this kind of change
in less time than it took CloudFlare to (apparently) have humans decide to
roll-back, and possibly before a single (usually not global) deployment had
actually completed on all hosts/instances.

However, what is more concerning is that it seems you shouldn't _rely_ on
CloudFlare's "WAF Managed Rulesets" at all, since they seem to be willing to
turn it off instead of correctly rolling back a bad deployment, which they
only did > 43 minutes later:

> We then went on to review the offending pull request, roll back the specific
> rules, test the change to ensure that we were 100% certain that we had the
> correct fix, and re-enabled the WAF Managed Rulesets at 1452 UTC.

How were they not able to trivially roll back to the previous deployment?
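
For reference, the shape of what those companies do is roughly this (a
hypothetical sketch; the names, metrics and thresholds are invented, and the
point is only that the rollback trigger is a metric, not a meeting):

    import time

    def deploy(release, pops, health, rollback,
               batch_size=5, bake_seconds=300, cpu_limit=0.80):
        """Push a release PoP by PoP; revert everything automatically
        the moment a health metric regresses."""
        done = []
        for i in range(0, len(pops), batch_size):
            batch = pops[i:i + batch_size]
            for pop in batch:
                release.push(pop)
            time.sleep(bake_seconds)   # let metrics accumulate
            if any(health(pop).cpu > cpu_limit for pop in done + batch):
                for pop in done + batch:
                    rollback(pop)      # automatic: minutes, not 43 of them
                raise RuntimeError('rolled back on CPU regression')
            done.extend(batch)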

~~~
londons_explore
So many employees deploying so many changes at a time it wasn't clear which
one was the cause...?

~~~
mschuster91
Which is why the entire model (mostly in Agile environments) of "deploy to
prod as soon as you can" is absolutely nuts.

If you're a dev at a hipster app maybe a dozen people use to holler "yo" at
each other, by all means go for it. If you're operating one of the biggest and
most important chonks of Internet infra... maaaaaybe stick to established
practices such as stage testing, release schedules and incremental rollouts?

~~~
flurdy
Why not both?

I don't want to return to the old slothful release schedules of the 2000s,
where features and bug fixes were mostly stagnant.

You can have staging and scheduled, QA-signed-off releases that happen every
day. I have worked on some fairly large, significant services and we still
released several times a day; you just did not trigger the final prod release
yourself, the QAs pressed the button instead. Though usually just once a day
per microservice.

I have also worked with several clients lately without QA, where devs could
themselves push to prod many times a day. I am not sure these systems were
that much less stable, though they were all mostly greenfield and not critical
public government systems. The changes were of course a lot smaller, and quick
to undo. Which is the core element of the "release straight to prod" ethos.

I am sure Cloudflare has a significant QA process whilst using today's fast-
moving release schedules.

What is always a grey zone is configuration changes. Even if properly
versioned and on a release-schedule train with several staging environments,
configuration is often very environment-sensitive. So maybe they could not
test it properly in any staging environment but had to hope prod worked...

However, Cloudflare will hopefully implement some way to make sure this
particular configuration, and future changes to it, are not as bottlenecked,
and can instead be released gradually to a subset and region by region rather
than as a big bang to all. Though canary/blue-green/etc. releases of core
routing configuration are hard.

------
grey-area
I really want to know the regexp and corresponding input(s) which killed the
internet now :) Was it just aaaaaaaaaaaah?

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)
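
For anyone who doesn't want to read the whole article, its headline example
translates straight into Python; the pattern a?^n a^n matched against a^n
makes a backtracking engine try 2^n combinations of the optional a's:

    import re
    import time

    for n in range(10, 21, 2):
        pattern = 'a?' * n + 'a' * n
        start = time.perf_counter()
        re.match(pattern, 'a' * n)
        # time roughly doubles with every extra 'a?'
        print(n, f'{time.perf_counter() - start:.2f}s')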

~~~
almost_usual
I'm assuming it's something pretty embarrassing if it's not in the post
mortem.

~~~
hunter2_
The first sentence here is "This is a short placeholder blog and will be
replaced with a full post-mortem...".

I'd bet big money that they do include it.

~~~
almost_usual
Looking forward to the full post-mortem.

------
rob-olmos
For the size and importance of Cloudflare, some insight into a couple of
questions would be nice:

1. Why are WAF rules not progressively deployed, since there's already a
system to do so?

2. Maybe there should also be a testing environment that receives a mirror of
production traffic before deployments reach real users?

(I understand the WAF change was not set to take action, but a separate
environment would be less likely to affect production)

~~~
JakeTheAndroid
Cloudflare does have test colos that use a subset of real network traffic for
testing. It's actually the primary testing methodology, and the employees are
usually some of the first people forced through test updates.

This release wasn't meant to go out, and the fact that it did means it must
have bypassed the test environments either way.

------
souterrain
Cloudflare should write a guide to doing post-event communication. Or perhaps
they shouldn’t, as this seems to be a potential differentiator.

This is direct and doesn’t attempt to avoid blame. Well done.

~~~
Pfhreak
Avoiding blame is different than acknowledging responsibility. A post mortem
should be very conscious about blame - never target the engineer who deployed
the change, for example. Take responsibility for the machine that allowed the
unsafe change to be deployed. (Where machine could be tooling or process, as
appropriate.)

------
lgats
> At 1402 UTC we understood what was happening and decided to issue a ‘global
> kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to
> normal and restored traffic. That occurred at 1409 UTC.

So for about 50 minutes, those who relied on the WAF were open to attack?

~~~
foota
Isn't open to DDoS better than can't be reached?

~~~
javagram
WAF wouldn’t just prevent DDoS I assume. I’m pretty sure there are WAF
rulesets that attempt to block attacks such as XSS or even remote code
execution vulnerabilities.

~~~
dx87
I can confirm that there are WAF rules that block things like basic SQL
injection. A client uses Akamai and if it detects certain strings in a
request, like "<script>", it'll block the request before it ever gets to the
application. The bad part is that some developers get complacent in their
development and rely on the WAF to do their security for them.
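
The mechanism itself is dead simple, something like this sketch (the patterns
are illustrative toys, nothing like real Akamai or Cloudflare rules):

    import re

    # Reject a request when its body matches any blocklisted pattern,
    # before it ever reaches the application behind the WAF.
    BLOCKLIST = [
        re.compile(r'<\s*script', re.IGNORECASE),                        # naive XSS probe
        re.compile(r"('|\b)(union|select)\b.*\bfrom\b", re.IGNORECASE),  # naive SQLi probe
    ]

    def waf_allows(body: str) -> bool:
        return not any(rule.search(body) for rule in BLOCKLIST)

    assert not waf_allows('q=<script>alert(1)</script>')
    assert waf_allows('q=hello world')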

------
txcwpalpha
Peculiar that eastdakota (Cloudflare's CEO) doesn't seem to be tweeting at the
Cloudflare team responsible for this, telling them they should be ashamed and
are guilty of malpractice.

When it was Verizon that took down the internet he felt it was appropriate to
do that to the Verizon teams, after all.

edit: right after posting this comment, he did tweet the following:
[https://twitter.com/eastdakota/status/1146196836035620864](https://twitter.com/eastdakota/status/1146196836035620864)

> I'd say both we and Verizon deserve to be ashamed.

As well as this:
[https://twitter.com/eastdakota/status/1146170209780113408](https://twitter.com/eastdakota/status/1146170209780113408)

> Our team should be and is ashamed. And we deserve criticism. ...

I still don't think that publicly shaming _anyone_ is a good leadership style
nor is it a good way to motivate people to perform better in the future, but
kudos for the self-awareness, at least.

~~~
threezero
Cloudflare was responsive and reasonable. Verizon was unreachable and
deflected responsibility when they finally made a statement. And public
shaming does often motivate companies to be more responsive to their
customers.

~~~
txcwpalpha
AFAIK Cloudflare isn't in any way a "customer" of Verizon. Verizon doesn't owe
Cloudflare any kind of response or devotion of resources. Verizon owes its
actual customers a resolution to their problem, _which they gave_.

I'm not saying Verizon is perfect nor absolved of fault, but Cloudflare was/is
not owed any kind of explanation or assistance by VZ, and it's absurd of CF to
_still_ be whining about that fact (as they are doing in some other tweets
today). If CF wants some kind of SLA with VZ, they should engage them in a
business relationship, not try to publicly shame them.

~~~
thegagne
I’d say what they really need is a representative governing body over major
network carriers to establish proper standards and levy fines for those that
do not comply.

Kind of similar to a homeowners association saying "hey, that trash on your
lawn affects your neighbor, clean it up!"

It’s true that they are not a customer but at that level what they do affects
each other, and it’s better to resolve things civilly and privately instead of
publicly on twitter.

~~~
mschuster91
In theory this should be the job of the FCC, and in Europe the local
regulatory agencies (BNetzA in Germany, for example). But properly funding
them to do their jobs doesn't seem to be very high on the political agendas
these days.

------
peterwwillis
How to implement a multi-CDN strategy (streamroot.io):
[https://news.ycombinator.com/item?id=18399523](https://news.ycombinator.com/item?id=18399523)

Etsy implementing multiple CDNs (7 years ago, the CDNcontrol project looks
abandoned): [https://speakerdeck.com/ickymettle/integrating-multiple-
cdn-...](https://speakerdeck.com/ickymettle/integrating-multiple-cdn-
providers-our-experience-at-etsy) [https://dyn.com/blog/speaking-with-etsy-
about-multi-cdns-and...](https://dyn.com/blog/speaking-with-etsy-about-multi-
cdns-and-dns/)

Basically: you can try to keep a low-TTL DNS record, but it'll mean more DNS
traffic, and 5-10% of traffic takes forever to cut over because nobody
respects TTLs. Worst case you have just as much downtime as before; best case
most of your traffic is recovered in a few minutes.
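
The failover half of that strategy, sketched in Python; update_dns_record
stands in for whatever API your DNS provider actually exposes, and even after
it runs you still eat the tail of clients that ignore the TTL:

    import time
    import urllib.request

    CDNS = ['cdn-a.example.com', 'cdn-b.example.com']   # primary, backup

    def healthy(host, timeout=3):
        try:
            urllib.request.urlopen(f'https://{host}/healthz', timeout=timeout)
            return True
        except OSError:
            return False

    def monitor(update_dns_record, interval=30):
        active = CDNS[0]
        while True:
            if not healthy(active):
                active = CDNS[1] if active == CDNS[0] else CDNS[0]
                # repoint the low-TTL record at the surviving CDN
                update_dns_record('www.example.com', cname=active, ttl=60)
            time.sleep(interval)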

~~~
outworlder
It may be useful to note, for whoever is reading this, that a low DNS TTL only
ever makes sense for records you can actually cut over, either automatically
or on short notice -- not for all records. Otherwise, you are now at the mercy
of outages at your DNS provider.

Just leaving it out there so no one gets the idea that "low TTL == always
good".

------
cfors
I wonder what happened with that poor regex.

My thoughts are immediately shifting to one of my favorite articles of all
time "Regular Expression Matching can be Simple and Fast..." [0]

[0]
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

------
UI_at_80x24
Can anybody suggest a Systems Engineer-centric forum/site? (Not Windows 'help
I can't print' level, more DataCenter grade.)

HN does have some great content/replies that touch on these topics, but I'd
like something more.

~~~
gridspy
Perhaps these Q&A sites are interesting?

[https://superuser.com/](https://superuser.com/)
[https://serverfault.com/](https://serverfault.com/)

But yes, the content like this on HN is fascinating and I would also like
more.

------
sequoia
What sort of regular expression pitfalls can cause this sort of CPU
utilization? I know they're possible but I am curious about specific examples
of something similar to what caused Cloudflare's issue here.

~~~
roro159
DoS with regex is a thing:
[https://www.owasp.org/index.php/Regular_expression_Denial_of...](https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-
_ReDoS)

StackOverflow had a similar case a while back:
[https://stackstatus.net/post/147710624694/outage-
postmortem-...](https://stackstatus.net/post/147710624694/outage-postmortem-
july-20-2016)
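
The StackOverflow one is a nice illustration because the regex looked
completely harmless: per their postmortem it was essentially trailing-
whitespace trimming, fed a post with ~20k spaces that weren't at the end of
the line, which makes the backtracking quadratic. Roughly:

    import re
    import time

    trim = re.compile(r'\s+$')   # "strip trailing whitespace" -- looks harmless

    for n in (5_000, 10_000, 20_000):
        text = ' ' * n + 'x'     # long whitespace run NOT at the end
        start = time.perf_counter()
        trim.search(text)
        # time roughly quadruples as n doubles: O(n^2) backtracking
        print(n, f'{time.perf_counter() - start:.2f}s')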

~~~
andreareina
HN discussion of the StackOverflow incident:
[https://news.ycombinator.com/item?id=12131909](https://news.ycombinator.com/item?id=12131909)

------
djhworld
A good war story there; at least the problem was relatively simple and quick
to identify as the root cause, rather than something deeper.

Would be interested to see what the gnarly regex was that was bombing their
CPUs so hard!

------
mrzasa
Shameless plug: understanding regex engine implementation can help with
avoiding performance pitfalls: [https://medium.com/textmaster-
engineering/performance-of-reg...](https://medium.com/textmaster-
engineering/performance-of-regular-expressions-81371f569698)

------
ksara
>"It doesn't cost a provider like Verizon anything to have such limits in
place. And there's no good reason, other than sloppiness or laziness, that
they wouldn't have such limits in place."[1]

[1] [https://blog.cloudflare.com/how-verizon-and-a-bgp-
optimizer-...](https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-
knocked-large-parts-of-the-internet-offline-today/)

~~~
Operyl
The difference between Verizon and Cloudflare in this case is that Cloudflare
generally _does_ fix their mistakes when they screw up (and generally doesn't
make the same type again), whereas Verizon has screwed up internet routing
more times than I'd like to think about. No company is perfect, but I'd say
this comparison is pretty apples to oranges.

------
nodesocket
It is interesting NGINX returned 502 nearly instantly under very heavy CPU
load. I would have expected requests to just hang or timeout.

~~~
docapotamus
I would imagine it's tiered. The Nginx servers at the front returning the 502
probably aren't the boxes running the code.

~~~
jgrahamc
yes

------
BentFranklin
Kind of funny that it was a regexp.

~~~
davidw
I'm reminded of:

"You have a problem, and you decide to use a regexp to solve it. Now you have
two problems"

Although of course I'm just kidding and I'm sure that a good regexp probably
is the right solution for what they're doing in that instance: they have a lot
of bright people.

------
gist
Nothing like having what should be a world-class company falling prey to the
same type of screw-ups that plague 'the local guy maintaining some wordpress
site on a shared server'.

Separately, there is nothing that says a company like Cloudflare has to air
their dirty laundry (as the saying goes). The vast majority of 'customers'
really don't care why something happened at all. All they know is that they
don't have service.

Pretend that the local electric company had a power outage (and it wasn't
caused by some obvious weather event). Does it really matter if they tell
people that 'some hardware we deployed failed and we are making sure it never
happens again'? I know tech thinks these types of post-mortems are great, but
the truth is only tech people really care to hear them. (And guess what, all
it probably means is that that particular issue won't happen again...)

~~~
GranPC
> I know tech thinks they are great for these types of post-mortems but the
> truth is only tech people really care to hear them.

Well, Cloudflare is in luck; most of their customers are "tech people"!

~~~
gist
100% not true. All you have to do is pull a list of the daily additions and
deletions and you will see that they have many customers who are not 'tech'
people. Further, you are assuming that all of their customers who are tech
people even read and keep up with blog posts like this.

------
rubyn00bie
This line kills me:

> We were seeing an unprecedented CPU exhaustion event, which was novel for us
> as we had not experienced global CPU exhaustion before.

I'd imagine it was quite novel for most anyone affected /s

------
quickthrower2
Probably, for the kind of work they are doing, avoid regex? Or at least the
very complicated modern kind (simple automata that you can compile in advance
might be OK).

~~~
txcwpalpha
If you're trying to do pattern matching, is there actually a widely used
alternative to regex? The more I can avoid using regex for mission-critical
things, the happier I will be, but I'm really not aware of anything better for
this type of application.

~~~
quickthrower2
I've tried parser combinators. They are nice, but a bit more labour than
writing out a regex, and I'm not sure how the performance compares.
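
A minimal hand-rolled sketch of the idea (not a real library): each parser
maps text to (value, rest) or None, and combinators compose them. These
particular ones are greedy and never backtrack, so matching is linear; that's
also their limitation compared to a full regex.

    def char(c):
        def parse(text):
            return (c, text[1:]) if text[:1] == c else None
        return parse

    def many1(p):                    # one or more, greedy
        def parse(text):
            result = p(text)
            if result is None:
                return None
            values = []
            while result is not None:
                value, text = result
                values.append(value)
                result = p(text)
            return (values, text)
        return parse

    def seq(*parsers):               # run parsers left to right
        def parse(text):
            values = []
            for p in parsers:
                result = p(text)
                if result is None:
                    return None
                value, text = result
                values.append(value)
            return (values, text)
        return parse

    ab = seq(many1(char('a')), char('b'))   # like (a+)b, in linear time
    print(ab('aaab'))   # ([['a', 'a', 'a'], 'b'], '')
    print(ab('aaac'))   # None, after a single forward pass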

------
bdibs
Seems like a terrible idea to deploy changes to such a vital piece of their
software GLOBALLY without some sort of rollout procedure.

------
tolgahanuzun
It is very difficult to explain this to customers who don't understand
technology. 30 minutes is a very long time. :/

------
suchow
Is there a usage error in the first sentence or has English lost the blog /
blog post distinction?

~~~
tom_
Some people distinguish between the two, some don't.

------
tomcam
I run a service placing bids in the last few seconds on eBay. Every time this
happens I lose measurable business (we place thousands of bids per day). While
it doesn't affect already-scheduled bids, customers can't place new ones and
are likely to move to a competitor. These recent outages have been costly.

Does anyone know a more reliable provider?

~~~
PhasmaFelis
Haha, people _pay_ you to bid-snipe on eBay for them?

It just baffles me that manual/third-party bid-sniping is still a thing. eBay
has had automatic bidding for more than _twenty years._ You'll pay the same
whether you put in the winning bid a week in advance or 5 seconds before the
end. But people see that "you lost this auction" notice and they're
irrationally convinced that it would have gone differently if they'd bid at
the last minute, somehow.

~~~
beering
If other bidders are irrational, then bid-sniping can work. It doesn't give
others the opportunity to contemplate, "I've been out-bid, do I actually want
this item more than I originally thought?"

~~~
neilv
And it's been well known since the early eBay days that many bidders are
irrational, including but not limited to the competitive impulse to "win".
Plus you sometimes have shill bidders.

Sniping approximates sealed bids, with the highest bidder paying the second-
highest sealed bid amount or a small increment above it. (Unfortunately for
eBay, that would tend to decrease their cut, unless the appeal of the sealed-
bid format brings in sufficiently more bidder activity.)

Another advantage of the software/service is that it automates. If you want to
buy a Foo, you can look at the search lists, find a few Foos (possibly of
varying value in the details), say how much you'll pay for each one, and let
the software attempt to buy each one by its auction end until it's bought one,
then it stops. If eBay implemented this itself, it might be too much headache
in customer support, but third-parties could provide it to power users.

(I don't buy enough on eBay anymore to bother with anything other than
conventional manual bids, but I see the appeal of automation.)
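
The core of that loop is tiny; a hypothetical sketch, with place_bid standing
in for a real marketplace API:

    import time

    def buy_one(auctions, place_bid, snipe_window=5):
        """auctions: (auction_id, end_time, max_price) tuples.
        place_bid returns True if we ended up winning the item."""
        for auction_id, end_time, max_price in sorted(auctions, key=lambda a: a[1]):
            time.sleep(max(0.0, end_time - snipe_window - time.time()))
            if place_bid(auction_id, max_price):   # snipe in the final seconds
                return auction_id                  # bought one Foo; stop
        return None                                # lost them all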

~~~
steve_adams_86
If everyone is using a bidder like this, isn't it essentially like a blind
auction?

------
BuddhaSource
Do builds go through a staged rollout, for a service like Cloudflare?

------
pearapps
No way!?!?!??!?!?!?!??!?!??!?!?!

------
rodgerd
Karma for shitting on Verizon, maybe.

