
Details of the Cloudflare outage on July 2, 2019 - jgrahamc
https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
======
blr246
Appreciate the detail here. It's a great writeup. Wondering what folks think
about one of the changes:

    
    
      5. Changing the SOP to do staged rollouts of rules in
         the same manner used for other software at Cloudflare
         while retaining the ability to do emergency global
         deployment for active attacks.
    

One concern I'd have is whether or not I'm exercising the global rollout
procedure often enough to be confident it works when it's needed. Of the
hundreds of WAF rule changes rolled out every month, how many are global
emergencies?

It's a fact of managing process that branches are a liability and the hot path
is the thing that will have the highest level of reliability. I wonder if
anyone there has concerns about diluting the rapid response path (the one
having the highest associated risk) by making this process change.

edit: fix verbatim formatting

~~~
jsnell
Yep, that's the exact bullet point I was writing a response on. Security and
abuse are of course special little snowflakes, with configs that need to be
pushed very fast, contrary to all best practices for safe deployments of
globally distributed systems. An anti-abuse rule that takes three days to roll
out might as well not exist.

The only way this makes sense is if they mean that there'll be a staged
rollout of some sort, but it won't be the same process as for the rest of
their software. I.e. for this purpose you need much faster staging just due to
the problem domain, but even a 10 minute canary should provide meaningful push
safety against this kind of catastrophic meltdown. And the emergency process
is something you'll use once every five years.

~~~
blr246
Your response offers a good way to mitigate the risk I was trying to
highlight in mine.

They want to have a rapid response path (little to no delay using staging
envs) to respond to emergencies. The old SOP allowed all releases to use the
emergency path. If the SOP no longer exercises it, I'd be concerned that it
would break silently from some other refactor or change.

Your notion is to maintain the emergency rollout as a relaxation of the new
SOP such that the time in staging is reduced to almost nothing. That sounds
like a good idea since it avoids maintaining two processes and having greater
risk of breakage. So, same logic but using different thresholds versus two
independent processes.

~~~
jsnell
Right. The emergency path is either something you end up using always, or
something you use so rarely that it gets eaten by bit-rot before it ever gets
used[0]. So I think we're in full agreement on your original point. This was
just an attempt to parse a working policy out of that bullet point.

[0] My favorite example of this had somebody accidentally trigger an ancient
emergency config push procedure. It worked, made a (pre-canned) global
configuration change that broke everything. Since the change was made via this
non-standard and obsolete method, rolling it back took ages. Now, in theory it
should have been trivial. But in practice, in the years since the
functionality had been written (and never used), somehow all humans had lost
the rights to override the emergency system.

~~~
jacques_chester
My personal rule is that any code which doesn't get exercised at least weekly
is untrustworthy. I once inherited a codebase with a heavy, custom blue-green
deploy system (it made sense for the original authors). While we deployed
about once a week, we set up CI to test the deployment every day.

Cold code is dead code.

------
hn_throwaway_99
One thing that was interesting to me:

The outage was caused by a regex that ended up doing a lot of backtracking,
which caused PCRE, the regex engine, to essentially execute a runaway
expression.

This reminded me of an HN post from a couple months back by the author of
Google Code Search, and how it worked:
[https://swtch.com/~rsc/regexp/regexp4.html](https://swtch.com/~rsc/regexp/regexp4.html)
. Interestingly, he wrote his _own_ regex engine, RE2, specifically because
PCRE and others did not use real automata and he needed a way to do arbitrary
regex search safely.

~~~
bsder
The problem is that a deterministic regex engine (deterministic finite
automata or DFA) is strictly less powerful than a non-deterministic one (NFA).
DFA's can't backtrack, for example. In addition, DFA's can be quite a bit
slower for certain inputs and matches.

~~~
auscompgeek
Actually, it is proven that NFAs and DFAs are equally expressive. See
[https://en.wikipedia.org/wiki/Powerset_construction](https://en.wikipedia.org/wiki/Powerset_construction)

~~~
bsder
"You are technically correct. The best kind of correct."

In theory, your statement is perfectly correct. However, quoting that
reference:

"However, if the NFA has n states, the resulting DFA may have up to 2^n
states, an exponentially larger number, which sometimes makes the construction
impractical for large NFAs."

This means that _in practice_, DFAs are larger, slower, and sometimes can't
be run at all if complex enough.

However, this was my mistake. I remembered (vaguely) the 2^n issue and didn't
follow up to make sure I was accurate.

And I completely spaced on the fact that neither NFA's nor DFA's handle
backreferences without extension.

------
dsr_
9. We had difficulty accessing our own systems because of the outage and the
bypass procedure wasn’t well trained on.

Suggestion for future, learned from bitter experience: separate your control
plane from your data plane. In this case, make sure that the tools you use to
manage your infrastructure don't depend on that infrastructure being
functional.

That way you won't have to remember how to use a bypass procedure -- it will
just be your normal procedure.

~~~
vayne
Yeah, that is true in most cases. However, here Cloudflare was using
Cloudflare for dash.cloudflare.com as well.

This calls for not using Cloudflare for their web dashboard.

~~~
jjeaff
Then come the inevitable tweets every few weeks: "cloudflare doesn't even
trust cloudflare to run their own control panel"

~~~
vegiraghav
Couldn't it be named in a way that indicates cloudflare uses cloudflare, just
in a different way?

------
minxomat
> The Lua WAF uses PCRE internally and it uses backtracking for matching and
> has no mechanism to protect against a runaway expression. More on that and
> what we're doing about it below.

We run a WAF based on LuaJIT in resty. Just to be clear, the resty interface
to PCRE does provide a DFA mode. Furthermore, Zhang actually ported RE2 (see
other comments here) to C as sregex, which is usable from Lua as a c module
regardless of whether it runs in resty or a custom Lua app.

> Switching to either the re2 or Rust regex engine which both have run-time
> guarantees. (ETA: July 31)

Not addressed at Cloudflare, since they had a defense in place. But just in
case anyone else is running a similar thing in Lua.

And:

> In the longer term we are moving away from the Lua WAF that I wrote years
> ago.

Then sregex might be the perfect fit here. Though Rust is technically safer.
Depends on what longer term means.

------
woliveirajr
> Unfortunately, last Tuesday’s update contained a regular expression that
> backtracked enormously and exhausted CPU used for HTTP/HTTPS serving.

One of those cases where they had one problem, used a regular expression, and
ended up with two problems?

Edit: I really like how much information is given by CloudFlare. 11 points in
the "what went wrong analysis" is how every root-cause analysis should be
done.

~~~
toomuchtodo
Somewhat humorous, as someone [1] (congrats /u/fossuser!) mentioned this
failure scenario in the thread about Twitter being down yesterday.

"Pushing bad regex to production, chaos monkey code causing cascading network
failure, etc.", in response to a comment from someone who previously worked at
Cloudflare.

[1]
[https://news.ycombinator.com/item?id=20415608](https://news.ycombinator.com/item?id=20415608)

~~~
citruspi
They mentioned it was a regular expression in the original post[0] on the day
of the incident, that part isn't news (discussion here[1]).

[0]: [https://blog.cloudflare.com/cloudflare-outage/](https://blog.cloudflare.com/cloudflare-outage/)

[1]:
[https://news.ycombinator.com/item?id=20336332](https://news.ycombinator.com/item?id=20336332)

------
buildzr
> Then we moved on to restoring the WAF functionality. Because of the
> sensitivity of the situation we performed both negative tests (asking
> ourselves “was it really that particular change that caused the problem?”)
> and positive tests (verifying the rollback worked) in a single city using a
> subset of traffic after removing our paying customers’ traffic from that
> location.

Haha, so the free customers are crash test dummies for providing test traffic.
Nice.

I actually don't mind that much, considering it's basically bulletproof DDoS
protection for free. I'd much rather "be the product" in this way than in the
way ad companies do, at least.

~~~
pvg
Or you can say all customers were affected but some localized free-tier
customers got the fix first.

~~~
regnerba
In this case yes, however they also indicate this is how they do their staged
rollouts in general. So if they are releasing any other software update that
goes through the staged rollout, it is tested on free customers first. If that
change breaks something, free customers hit it first. Which seems fair to me.

~~~
charrondev
In my experience it’s generally best to roll out changes on testing, staging,
and then clients in order of how much they pay, especially if you have SLAs
with the highest paying customers.

Impact is generally lower, both to the client, and to your bank account.

~~~
Thorrez
That sounds strange to me. If you introduce a bug then roll back very quickly,
it will only affect high paying customers. If you introduce a bug then roll
back a while later, it will impact high paying and low paying customers
equally. Why would you want this scenario? If you flip it it seems strictly
better to me.

~~~
pvg
The idea is that the fix itself is being tested. If you knew your 'rollback'
will work for certain, then you'd just deploy it to everyone asap. But since
you don't, you test it, and since a potential outcome of your test is no fix or
making things worse, you don't test it on your highest-value customers.
Imagine what your postmortem would read like if your fix made an even bigger
mess.

~~~
Thorrez
Oh, I misunderstood you originally. I thought you said rollout from highest to
lowest. You're actually saying lowest to highest.

~~~
pvg
I'm just describing what I think the sequence in the postmortem is. They were
already in the poop and wanted to test their fix in a real but low-impact way.

------
ssully
I love how the post basically concludes with: "This problem has been known
since 1968, which is detailed in this paper written by Ken Thompson".

Incredible write up. Really appreciate the detail, and am really impressed by
how mature their response coordination seems to be.

------
riobard
Note: Golang's stdlib regexp
([https://golang.org/pkg/regexp/](https://golang.org/pkg/regexp/)) is
guaranteed to run in time linear to the size of input. Russ Cox has a detailed
article
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

~~~
jefftk
Golang's regexp is derived from RE2, referenced in footnote four of the post.

~~~
f2f
Russ Cox also wrote RE2:
[https://swtch.com/~rsc/regexp/](https://swtch.com/~rsc/regexp/)

------
aalleavitch
I’d like to imagine that as long as we live, poorly written regular
expressions will continue to be the cause of breaking issues.

~~~
lmkg
StackOverflow also had an outage a few years ago that was caused by
exponential blow-up of a backtracking regular expression.

[https://stackstatus.net/post/147710624694/outage-postmortem-...](https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016)

~~~
isaacg
That blow up is quadratic, not exponential: "This is not classic catastrophic
backtracking (talk on backtracking) (performance is O(n²), not exponential, in
length), but it was enough."

~~~
perlgeek
Some years ago I tried to create an exponential time regex in Perl, and only
managed a quadratic time one after a bit of experimenting.

But, as you said, quadratic is often already fatal on realistic data.

------
jakejarvis
Always appreciate the transparency from you and Cloudflare. :)

My main fright during this outage wasn't really the outage itself, but the
fact that I couldn't log into the dashboard and simply click the orange cloud
to bypass Cloudflare in the meantime. I'm assuming that this is now covered by
this mitigation:

>> 6. Putting in place an emergency ability to take the Cloudflare Dashboard
and API off Cloudflare's edge.

If so, and if this would have prevented the dashboard outage even during the
WAF fiasco, this is a huge comfort to me. Just curious, though: how far can
you really go in separating Cloudflare "the interface" from Cloudflare "the
network?"

And in general, what does everyone on HN think about mission-critical
companies using their own infrastructure and being their own customer?
Especially when the alternative is using a competitor?

~~~
groovybits
I've always been a proponent of separating the monitoring from the infra.
Otherwise your insight is binary: the service is either up or down. You don't
have any context as to why.

Edit: Additionally, from a competitive standpoint, I don't see a problem with
using a third-party platform for a monitoring service.

~~~
cortesoft
Yes, this is one of the primary questions we ask when deciding when and how we
should dogfood our own services: will we create a circular dependency, where
our ability to fix an issue on one service is hindered by a chain of
dependencies between the service with the issue and the service used to fix
it? We always avoid those, or at least have easy alternatives.

------
lkbm
> In the last few years we have seen a dramatic increase in vulnerabilities in
> common applications. This has happened due to the increased availability of
> software testing tools, like fuzzing for example (we just posted a new blog
> on fuzzing here).

So security/debugging tools _increased_ the number of [discovered/exploited]
vulnerabilities, because developers don't use them. Only malware developers
and third-party security researchers take the time to test security.

~~~
buildzr
Yup, unless you're a seriously security- or stability-focused company you don't
even use basic stuff like static analysis, let alone fuzzing. Tools for these
are often either expensive, hard to use, or both.

------
neom
Awesome write-up as usual, John. I'm no expert so I was wondering: "In the
initial moments of the outage there was speculation it was an attack of some
type we’d never seen before." - Is there a reason you would go to this first
vs checking the last lot of code deploys first, and/or at the same time?

~~~
jgrahamc
It was speculation. We were looking at sudden massive CPU use all over the
world. Since we do staged deploys that shouldn't happen yet...

------
Havoc
There really should be a prize or something for the best postmortems to reward
companies for doing this & giving them some PR.

Most of us will (hopefully) never be in a situation like this so "book
knowledge" of extremis events is the next best thing available. And that
relies on good write-ups.

------
sgentle
More general questions I would consider asking:

1. It appears there was a path with more safety and scrutiny, and a fast
path with less. In this case, over time, the fast path became routine. Are
there other places where this pattern could develop or has already developed?
Is this tradeoff between speed and scrutiny actually necessary? (ie could you
have urgent updates reach production faster but actually receive more
scrutiny/more testing, even if that happens after the fact?)

2. In a similar vein, if the system has a failsafe configuration (eg only
changes that have passed the full barnyard, or configurations that have been
running safely for more than a certain amount of time), would it be plausible
to automatically roll servers back to that configuration if they remain
unresponsive for a certain amount of time?

3. It seems as though there are multiple points (big WAF refactor, credential
expiry, internal services dependent on working prod) where a sufficiently
cynical engineer would say "I bet there's something here that could, if not
bring down the site, at least ruin someone's day". Is there a suitable voice
for this kind of cynicism? Eg, a red team or similar? If you were Murphy's Law
incarnate, messing with Cloudflare's systems to achieve maximum mischief,
where would you start?

4. I get the sense that there are many reliable and well-tested layers of
safety, but is it common to test what happens if they fail anyway? Eg: let's
pretend Cloudflare just got knocked out globally by a wizard spell, what do we
do? Or let's say our staged rollout system gets completely bypassed because of
solar flares, how bad is it? Beyond developing a procedure or training for
these kinds of situations, are they actively simulated or practiced?

If anything, I'd guess the root root cause here is a success failure, where
the system has been so reliable for so long that the main reactions to it
failing are disbelief and unpreparedness. I'm sure it wasn't funny at the
time, but it gives me a chuckle to imagine the SREs speculating about Mossad
quantum-tunnelling 0days or something because the idea of everything falling
over on its own is so unthinkable. Meanwhile, those of us without so many 9s
would jump straight to "I probably broke it again."

~~~
laughinghan
Overall I think these are very thoughtful, and I upvoted your comment.

However, I don't think this question is very fruitful:

> let's pretend Cloudflare just got knocked out globally by a wizard spell,
> what do we do?

The way you solve a production issue is you identify its cause and then
contain, mitigate, or fix it. I don't think you'd learn anything useful from a
drill where there's no specific cause.

Perhaps along similar lines to what you're thinking of, something I could see
being useful is to look at components that you've already thought to implement
a 'global kill' for, like WAF, for instance. Maybe you could run drills where
every machine running WAF starts blackholing packets, or maxing out RAM, or
(as happened here) maxing out CPU, the kind of thing where you'd want to
execute the 'global kill' in the first place. That way, you can ensure that
the 'global kill' switches are actually useful in practice. Something like
that seems more grounded to me, making the assumption that something specific
is going wrong and not just "magic", while still avoiding too-specific
assumptions about what can and can't go wrong.

------
rcfox
Maybe I'm not the intended audience, but I had to look up the acronym WAF
since it came up a lot in this article. I'm assuming it's "web application
firewall"?

~~~
mmwelt
Yup, I had to look it up too. Best reference I found:

[https://www.cloudflare.com/learning/ddos/glossary/web-applic...](https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/)

------
helper
This is a fantastic postmortem. Thanks jgc!

Can you share any more details about the protection to prevent excessive CPU
usage by a regular expression that was accidentally removed?

~~~
xakahnx
Also interested in this protection. How does it detect this situation and what
does it do when something is detected? Did you notice when it was accidentally
removed that your monitoring of this condition went to zeros (or was it never
happening during normal operations)?

------
kevinherron
Sometimes I wish Cloudflare failed more often so I could read more of these
postmortems...

~~~
bostik
Please no.

The last time they had a global problem, everyone scrambled for more than a
week. (Cloudbleed)

This 30-minute global outage was pretty nasty, but not anywhere near as awful.
Timing helped, as nothing truly critical was affected. (There are some
extremely high-volume sporting events which, if affected even just for few
minutes, can have a direct impact on the bottom line.)

I do not wish to see more of these. Cloudbleed gave me two weeks of headache
and an indigestion problem. This one did basically nothing. If there is a
happy middle ground between the two, I am not exactly thrilled at finding out
what it is.

~~~
eastdakota
Agree. Cloudbleed was really, really awful. We should write up all the things
we learned from that and all the changes we've made since. Just looking at the
number of engineers who are Rust experts since then, for instance.

------
lkoolma
Might be late, but has anyone at CloudFlare tried to switch away from regexes
to something more efficient and powerful? Tools like re2c can convert hundreds
of regexes and CFGs into a single optimized state machine (which involves no
backtracking, as far as I remember). It should easily handle tens of millions
of transactions per second per core if the complete state machine fits into the
CPU level 3 cache (or lower), with a bit of optimization.

~~~
edwintorok
There is also Ragel [0], but I think that in this context deploying regexes as
strings is safer than generating code and deploying that code (unless Ragel
could generate webassembly).

[0]: [http://www.colm.net/open-source/ragel/](http://www.colm.net/open-source/ragel/)

~~~
thurston
Ragel has the advantage that CPU blowups happen at compile time, rather than
run-time. Other risks aside, they would have avoided this problem had they
been using ragel or something similar to pre-compile their patterns into
deterministic machines.

------
stevens32
For the regex novices here, would anyone mind explaining what that pattern is
meant to match? More specifically, what `. _(?:._ =.*)` is meant to do?

~~~
Buge
It's meant to match any number of any characters, then match an equal sign,
then match any number of any characters. But it's very badly written. It
should instead simply be written

    
    
        .*=.*
    

BTW, your comment got mangled by HN's markdown formatting.

~~~
stevens32
Gotcha - I thought I was just missing the point as to why it wasn't simpler
since it looked to have been structured that way intentionally.

------
w0m
>A protection that would have helped prevent excessive CPU use by a regular
expression was removed by mistake during a refactoring of the WAF weeks
prior—a refactoring that was part of making the WAF use less CPU.

Faster karma than normal, I think.

~~~
hinkley
Taking the safeties off to go faster... yes, you will go faster, but it might
be right off a cliff.

This is a good lesson on Chesterton’s Fence. I’ve been thinking for a while
that we really need the (default behavior) ability to annotate commits after
the fact, so that we have a durable commentary that can evolve over time. We
should be able to go back and add strongly worded things like “yes this looks
broken but it exists due to this bug fix” or “please don’t write new code that
looks like this. See xyz for a better alternative.”

Hell I think I’d be perfectly ok if the code review lived with the code
permanently. Regression in the code? Josh warned you it was a bad idea. Maybe
we should listen to Josh more?

~~~
Thorrez
Can't you just add comments to the actual code saying those things? I've seen
code comments saying those exact things.

~~~
hinkley
There’s an art to that. I’ve seen new code get between the comment and the
code, and since the comment is in a separate commit, it’s difficult to go back
ten refactorings later to answer why. The most interesting bug fixes I do end
up exploiting the commit history. Yes it’s hidden in plain sight, but it’s
also more reliable.

These days we treat code as a living breathing thing. No reason we can’t do
the same to commits.

------
joncrane
I used to be really into regex and I'm now rusty, but wouldn't the desired
representation of .*.*=.* be something closer to [^=]*[^=]*=[^=]* ?

I feel like it could be optimized further but this would be the first step,
and wouldn't most experienced regex authors use that from the beginning,
nipping the whole backtracking problem in the bud and making the regex _much_
more performant?

~~~
Buge
The original one would match "==", your suggestion would not. To get a clean
regex they should switch to

    
    
        .*=.*
    

I don't think they should spend any time contemplating whether a regex will
backtrack, because it's hard. Instead they should (and are planning to) simply
switch to a better regex library that never backtracks.

------
glangdale
Well, if I still worked on Hyperscan, this would be my "what am I, a potted
plant?" moment. I think Cloudflare is pretty determined to avoid x86-only
implementations of anything, though.

It's entertaining to see people making the same mistakes that have been widely
known about in network security well before there was Hyperscan, RE2, etc.

~~~
xigency
There are always those potted plant moments. For some reason regular
expressions / regex implementations seem to always be in that hole.

------
_pmf_
New idea: scan github for regexes that exhibit unbounded backtracking and
generate worst case examples.

~~~
minxomat
Do it. I'd be interested.

------
cordite
That was a fantastic demonstration of what backtracking meant. Thank you John
for your in depth description of what went wrong.

As a follow up, would something like `[^=] _=._ ` be a better capture group
regex?

~~~
jalada
Yes. I think HN stole your asterisks. You meant:

    
    
      /.*=.*/ becomes /[^=]*=.*/
    

That is, zero or more 'not-equals-sign-characters', followed by an equals
sign.

Where the first regex is 57 steps for x=xxxxxxxxxxxxxxxxxxxxxxxx, the second
is just 7.

Avoid using greedy .* for backtracking regex engines! Give your greedy regex
engine the hints it needs to do what it does best.

~~~
wbl
Don't use a backtracking regex engine is probably the better lesson. I want a
tool that won't unexpectedly poke me with the sharp bit.

------
Twirrim
One thing set off alarm bells in my head from an operational perspective:

> Switching to either the re2 or Rust regex engine which both have run-time
> guarantees. (ETA: July 31)

That's short timescales for quite a significant change. I know it's just
replacing a piece of automation with one that does the same task, but the guts
are all changing and all automation introduces some level of instability, and
a bunch of unknowns. Changing the regex engine is just as significant as
introducing new automation from an operations perspective, even if it seems
like it should be a no-brainer. I'd encourage taking time there (unless this
is something they've been working on a lot and are already doing canary
testing).

The other steps look excellent, and they should all collectively give ample
breathing room to make sure that switching to re2 or Rust's regex engine won't
introduce further issues. There's no need to be doing it on a scale of weeks.

Some quick thoughts about Quicksilver: Deploying everywhere super fast is
inherently dangerous (for some reason, old school rocketjumping springs to
mind. Fine until you get it wrong).

I definitely see the value for customer actions, but for WAF rule rollouts,
some kind of (automated) increasing speed rollouts might be good, and might
help catch issues even as the deployment steps beyond the bounds of PIG etc.
canary fleets. Of course, that's also useless in and of itself unless there is
some kind of automated feedback mechanism to retard, stop, or undo changes.

If I can make a reading suggestion:
[https://smile.amazon.com/gp/product/0804759464/ref=ppx_yo_dt...](https://smile.amazon.com/gp/product/0804759464/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1)
The book is "High Reliability Management: Operating on the Edge (High
Reliability and Crisis Management)" (unfortunately not available in electronic
form). It's focussed on the energy grid in California, the authors were
university researchers specialising in high reliability operations, and they
had the good fortune to be present doing a research job at the operations
centre right when the California brownouts were occurring in the early 2000s.
There's a lot to be gleaned from that book, particularly when it comes to
automation, and especially changes to automation.

~~~
f2f
RE2 is bulletproof. It is the de facto regex engine used by anyone looking to
deploy regexes to the wild (used in the now defunct Google Code Search, for
example). It has a track record. Russ Cox, its author, was affiliated with Ken
Thompson from very early in his career.

Rust also has a good pedigree for not being faulty. BurntSushi, the author of
Rust's regex crate, also has a good pedigree...

We switched to RE2 for a massive project 2 years ago and haven't looked back.
It is a massive improvement in peace of mind.

If anything, I'm surprised that JGC has allowed the use of PCRE in production
and on live inputs...

~~~
Twirrim
I'm absolutely not denying that RE2 is great. Not in the slightest. I even
agree with their idea to switch towards it or the Rust one.

Changing anything brings an element of risk, and changing quickly to it, even
more so, which is essentially what they're proposing doing. That's where my
concern lies.

Their current approach clearly has issues, but it has been running in
production for several years now and those issues are fully understood,
engineers know how to debug them, and there's a lot of institutional knowledge
around covering them. They've put a series of protective measures in place
following the incident that takes out one of the more significant risks. That
gives them breathing space to evaluate and verify their options, carry out
smaller scale experiments, train up engineers across the company around any
relevant changes, etc. There is no reason to go _fast_.

~~~
f2f
i concur with your assessment. i'm sure cloudflare will be cautious and not
rush with deployment after the switch. lesson learned?

------
simonw
This is such a great retrospective - thanks very much for this, jgrahamc. I
really appreciated learning so much about how Cloudflare works internally, and
the appendix (with animations!) was delightful.

------
Thorrez
> It might also be obvious that once state 4 was reached (after x= was
> matched) the regular expression had matched and the algorithm could
> terminate without considering the final x at all.

That's true if you just want a boolean result. But if you want to get the
matched string (which it appears the actual code does), then you need to
continue, because it's using greedy matching.

------
erowtom
Amazing postmortem! I have a question: I see a lot of software / process
solutions to keep this from repeating in the future. What about the human factor?

- What happened to the engineer(s) responsible for that event? They must feel
really bad RN; how do you handle this situation?

- On a more general point, how do you train individuals to ensure this
particular event does not recur?

Edit: formatting.

------
eeeeeeeeeeeee
Great post! Can’t think of many companies that would spend this much time
explaining regex.

I was affected by this outage, but I really appreciate Cloudflare taking the
time to explain the problem in this much detail. Given their own systems were
affected, I’m surprised they mitigated as fast as they did.

------
electricviolet
I'm a relatively novice regex user. Could anyone explain to me why someone
might use an expression like `.*(?:.*=.*)`? What is the meaning of the
group if its boundary could be placed in any number of places? Hope that
makes sense.

~~~
jetbooster
It can be useful depending on how the engine handles the match. The non-
capturing group is the important part; the .* is just there so that the 'full
match' isn't just an empty string, but contains the whole line the rule is
being run against.

------
markbnj
Really appreciate the detail and operational insights in this post-mortem.
Great work. Also appreciate the heads up and detailed explanation of the
issues with backtracking in regexes. Turns out we use those expensive patterns
in a few places.

------
foota
What about: WAF CPU usage wasn't isolated from the ability to serve requests?
This would allow requests that don't go through the WAF to proceed as
usual.

~~~
clinta
A firewall that fails open sounds like a terrible plan.

As far as problems go, an outage is preferable to a breach.

~~~
Thorrez
I believe WAF is a feature customers enable, not all customers have it
enabled. So some customers are already open, and in theory wouldn't need to be
affected by a WAF outage.

------
h1d
It was too bad WAF took down their entire network. I don't use it but things
went down too.

Maybe it's better to separate damage zones for different features.

------
magoon
This is such a fantastic blogpost. I love the part about regex backtracking.
Great job CF.

------
bartimus
> At 13:37 TeamCity built the rules

Hmmm...

------
Thorrez
>x=x still takes 555 steps

That's a typo. It should say

>x=xxxxxxxxxxxxxxxxxxxx still takes 555 steps

------
01CGAT
I love how Cloudflare turns an outage into a selling opportunity by explaining
exactly why you need their product to a hungry audience waiting to read about
it.

------
hellogoodbye
Incredibly well written, thank you.

------
h1d
Regexp::Debugger was a cute one.

------
Exuma
Talk about a great writeup

------
pstuart
Obligatory JWZ quote:

Some people, when confronted with a problem, think "I know, I'll use regular
expressions." Now they have two problems.

------
karmakaze
TL;DR

Root cause was a bad regex generating excessive backtracking using all CPU on
nodes.

The meta-cause is the process workflow:

> But, by design, the WAF doesn’t use this process because of the need to
> respond rapidly to threats.

The above is in reference to how WAF deployment doesn't use the graduated
DOG(fooding)/(guinea)PIG/canary flow.

> We responded quickly to correct the situation and are correcting the process
> deficiencies that allowed the outage to occur [...]

Live and learn. Not all WAF deployments are emergency rollouts.

------
londons_explore
Cloudflare lets their customers write their own WAF regex rules, right?

And those rules still get run on every box on Cloudflare's edge network with
HTTP requests from strangers on the internet, right?

So how come this didn't get triggered by a customer first?

Perhaps it _did_ get triggered by a customer first, but that customer didn't
get much traffic on the URL that triggers the issue, and that box got one
thread stuck executing that regex for a few minutes till a health check killed
it...? Does this imply that cloudflare runs with random failing health checks
across the fleet and there isn't someone looking at core dumps of such
failures?

That would align with my experience with seeing occasional "502 bad gateway"
errors from cloudflare over the past few years. It also seems likely
considering the incident where cloudflare servers leaked sensitive memory
contents into HTTP responses which happened so frequently they got cached by
google search. Hard to leak arbitrary memory contents without occasional
SIGSEGV's...

If the above conjecture is true, it reflects very badly on engineering culture
at Cloudflare. The core issue had been seen across the fleet sporadically for
a long time, but was ignored, and even during the postmortem process, which
should be a very thorough investigation, the telltale pre-warning signs of the
issue were _still_ missed.

~~~
JakeTheAndroid
They allow a limited subset of rules, with strict parameters of what logic is
allowed. Unless you do something fancy with workers.

Also, the protection for this was removed in a recent update before the
incident, so a customer doing this wouldn't have had an impact until that
protection was removed. So maybe a few weeks earlier they might have started
seeing some problems. But again, I am pretty sure the logic in the rule that
caused the issue isn't available to customers.

------
peterwwillis
Here's their _What Went Wrong_:

    
    
      1. An engineer wrote a regular expression that could easily backtrack enormously.
      2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
      3. The regular expression engine being used didn’t have complexity guarantees.
      4. The test suite didn’t have a way of identifying excessive CPU consumption.
      5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
      6. The rollback plan required running the complete WAF build twice taking too long.
      7. The first alert for the global traffic drop took too long to fire.
      8. We didn’t update our status page quickly enough.
      9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
      10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
      11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.
    

Here's my version of what went wrong:

    
    
      1. The process for composing complex regular expressions is "engineer tries to shove a lot of symbols into a line" rather than "compile/compose regex programmatically from individual matches"
      2. Production services had no service health watchdog (the kind of thing that makes systemd stop re-running services that repeatedly hang/die)
      3. Performance testing/quality assurance not done before releasing changes (this is not CI/CD)
      4. No gradual rollout
      5. No testing of rollbacks
      6. Lack of emergency response plans / training
    

All of these things are completely common, by the way, so they're in no way
surprising. Budget has to actually be set aside to continuously improve the
reliability of a service, or it doesn't get done. These incidents are a good
way to get that budget.

(Wrt the regexes, I know they're implementing a new system that avoids a lot
of this, but in the new system they can still write regexes, which (I think)
should be constructed programmatically.)

~~~
hyperpape
I don't see the relevance of how regexes are written to the problem they had.
The engineer didn't typo the regex, or have a hard time understanding what it
would match.

Instead, they didn't understand the runtime performance of the regex, as it
was implemented in their particular system. No amount of syntax can change
that.

~~~
rkagerer
_No amount of syntax can change that_

A framework that allows well-written, "normal" code to parse out what you
want can produce something easier to understand and maintain, surfacing this
type of bug in a more obvious way.

Cryptic syntax is the main reason I avoid regexes (particularly complex ones).

Too much obfuscation between the code you write and the steps your program
will take. Granted, my concern doesn't apply to master craftsmen who truly
understand the nuances of the tool, but in the real world those are few and
far between.

ps. I get there was a lot more going on in this postmortem than just one rogue
regex.

------
_wmd
So in response to a catastrophic failure due to testing in prod, they're going
to push out a brand new regex engine with an ETA of 2 weeks. Can anyone say
testing in prod?

The constant use of 'I' and 'me' (19 occurrences in total) deeply tarnishes
this report, and repeatedly singling out a responsible engineer, nameless or
not, is a failure in its own right. This was a collective failure, any
individual identity is totally irrelevant. We're not looking for an account of
your superman-like heroism, sprinting from meeting rooms or otherwise, we want
to know whether anything has been learned in the 2 years since Cloudflare
leaked heap all across the Internet without noticing, and the answer to that
seems fantastically clear.

~~~
jgrahamc
This report is written by me, the CTO of Cloudflare. I say "I" throughout
because organizational failings are my responsibility. If I'd said "we" I
imagine you'd be criticizing me for NOT taking responsibility.

If you read the report you'd see I do not blame the engineer responsible at
all. Not once. I made that perfectly clear.

~~~
pvg
I wonder if you are able to talk a bit about the development of the Lua-based
WAF. I imagine the possible unbounded performance of feeding requests into
PCRE must have occurred to you or others at the time - or at least, long
before this outage.

I don't mean this as some sort of lame 'lol shoulda known better' dunk -
stories about technical organizations' decision-making and tradeoff-handling
are just more interesting than the details of how regexes typed in a control
panel grow up to become Jira tickets.

~~~
jgrahamc
I did a talk about this years ago:
[https://www.youtube.com/watch?v=nlt4XKhucS4](https://www.youtube.com/watch?v=nlt4XKhucS4)

~~~
pvg
It sounds like one of the primary factors was compatibility with existing (or
customer-provided) mod_security rules, if I've understood 1.75x speed hyper-
you right.

