5. Changing the SOP to do staged rollouts of rules in
the same manner used for other software at Cloudflare
while retaining the ability to do emergency global
deployment for active attacks.
It's a fact of managing process that branches are a liability and the hot path is the thing that will have the highest level of reliability. I wonder if anyone there has concerns about diluting the rapid response path (the one having the highest associated risk) by making this process change.
edit: fix verbatim formatting
The only way this makes sense is if they mean that there'll be a staged rollout of some sort, but it won't be the same process as for the rest of their software. I.e. for this purpose you need much faster staging just due to the problem domain, but even a 10 minute canary should provide meaningful push safety against this kind of catastrophic meltdown. And the emergency process is something you'll use once every five years.
They want to have a rapid response path (little to no delay using staging envs) to respond to emergencies. The old SOP allowed all releases to use the emergency path. By not using it in the SOP anymore, I'd be concerned that it would break silently from some other refactor or change.
Your notion is to maintain the emergency rollout as a relaxation of the new SOP such that the time in staging is reduced to almost nothing. That sounds like a good idea since it avoids maintaining two processes and having greater risk of breakage. So, same logic but using different thresholds versus two independent processes.
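To make the "same logic, different thresholds" idea concrete, here is a minimal sketch. The stage sizes, soak times, and all names (`RolloutProfile`, `STAGES`, `plan`) are assumptions for illustration, not anything Cloudflare has described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutProfile:
    name: str
    soak_seconds: int  # how long each stage is observed before widening


# Hypothetical stage sizes: fraction of the fleet reached at each step.
STAGES = (0.01, 0.10, 0.50, 1.00)

# One code path, two parameter sets. The emergency profile is the standard
# process with the soak time squeezed, not a separate mechanism.
STANDARD = RolloutProfile("standard", soak_seconds=600)
EMERGENCY = RolloutProfile("emergency", soak_seconds=5)


def plan(profile):
    """Return (fleet_fraction, start_offset_seconds) for every stage."""
    return [(fraction, i * profile.soak_seconds) for i, fraction in enumerate(STAGES)]
```

Because every routine push runs through `plan`, the emergency path gets exercised daily instead of sitting cold.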
My favorite example of this had somebody accidentally trigger an ancient emergency config push procedure. It worked, and made a (pre-canned) global configuration change that broke everything. Since the change was made via this non-standard and obsolete method, rolling it back took ages. Now, in theory it should have been trivial. But in practice, in the years since the functionality had been written (and never used), somehow all humans had lost the rights to override the emergency system.
Cold code is dead code.
Once upon a time, I worked on a system where many values which would otherwise be statically defined in similar systems were instead put into a database table. This particular system didn't have a proper testing and deployment pipeline set up, so whereas a normal system would just change the static value at some hard-coded point in the code and quickly roll it out, this system needed to keep it in the database so that it would be changeable in between manual deployments (months or even years apart). The ability to change the user-facing value by changing the value in the database inflated the time it took to test a release, which in turn lengthened the time it took to release a new version, but well, it worked.
My point is that if security and abuse rules need to be rolled out quickly, then the system needs security and abuse systems where the entire range of security and abuse configurations (i.e. their types) are a testable part of the original pipeline. Then the configurations can safely be changed on the fly, so long as the changes type-check.
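A minimal sketch of what "changes must type-check" could look like, assuming a hypothetical rule shape (`Rule`, `Action`, and `parse_rule` are made up for illustration). It only validates the shape of a config entry; it would not by itself catch a pattern that is well-typed but expensive to run.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LOG = "log"
    BLOCK = "block"
    CHALLENGE = "challenge"


@dataclass(frozen=True)
class Rule:
    rule_id: int
    pattern: str
    action: Action


def parse_rule(raw: dict) -> Rule:
    """Turn an untyped config entry into a Rule, or raise ValueError."""
    if set(raw) != {"rule_id", "pattern", "action"}:
        raise ValueError(f"unexpected fields: {sorted(raw)}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(raw["rule_id"], int) or isinstance(raw["rule_id"], bool):
        raise ValueError("rule_id must be an integer")
    if not isinstance(raw["pattern"], str) or not raw["pattern"]:
        raise ValueError("pattern must be a non-empty string")
    # Enum lookup raises ValueError for actions outside the known set.
    return Rule(raw["rule_id"], raw["pattern"], Action(raw["action"]))
```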
It's easy to understand why it's never been built though - you'd need both a security background and a Haskell-ish/type-theory kind of background. Best of luck finding people like that.
My takeaway is that it's time to move to a custom solution using a more flexible language. A simple async watchdog on total rule execution time would have prevented this. When running tons of Regex rules I'm amazed they didn't have this.
For example my company (nowhere near the scale of Cloudflare) does progressive deployments. New code is deployed only to a handful of machines first, and then as the hours pass and checks remain green it propagates to the rest of the server fleet. Full deployment takes 24 hours. We have never had breaking code changes in production in the past 3 years. And before that, us breaking things was the most common cause of production issues. Of course that's not the only thing we do: good test practices, code reviews etc.
The second thing is separation of monitoring and production. If production going down takes down the monitoring systems too, you will have a very hard time figuring out what's wrong. Cloudflare says "We had difficulty accessing our own systems because of the outage". That sounds very bad.
I'd wager there are many wrong things at play here other than "regex is hard". But I guess HN loves Cloudflare way too much to ask the hard questions.
Recursion attacks against Regex are extremely well known. The only reason I can fathom for not having an execution time watchdog is that the Nginx Lua runtime doesn't allow it. I assume the scripts run during a single request cycle on one thread due to Nginx async IO (one thread per core only).
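If a wall-clock watchdog is awkward inside a single-threaded request cycle, a step budget is one alternative, similar in spirit to the match limit PCRE exposes. Below is a minimal sketch: a toy backtracking matcher (literals, `.`, and `*` on a single character only) that aborts once it exceeds a budget. The matcher and the names `budgeted_match` and `BudgetExceeded` are illustrative assumptions, not the WAF's code.

```python
class BudgetExceeded(Exception):
    """Raised when a match attempt uses more steps than allowed."""


def budgeted_match(pattern, text, max_steps):
    """Greedy backtracking full-match with a hard cap on steps.

    Supports literals, '.', and a '*' suffix on a single character.
    Returns (matched, steps_used) or raises BudgetExceeded.
    """
    steps = 0

    def char_ok(p, c):
        return p == "." or p == c

    def go(pi, ti):
        nonlocal steps
        steps += 1
        if steps > max_steps:
            raise BudgetExceeded(f"gave up after {max_steps} steps")
        if pi == len(pattern):
            return ti == len(text)
        starred = pi + 1 < len(pattern) and pattern[pi + 1] == "*"
        if starred:
            # Greedy: consume as much as possible, then back off one at a time.
            end = ti
            while end < len(text) and char_ok(pattern[pi], text[end]):
                end += 1
            for k in range(end, ti - 1, -1):
                if go(pi + 2, k):
                    return True
            return False
        if ti < len(text) and char_ok(pattern[pi], text[ti]):
            return go(pi + 1, ti + 1)
        return False

    return go(0, 0), steps
```

With `.*.*=.*` and an input containing no `=`, the step count grows roughly with the square of the input length, so a modest budget trips long before the CPU is pinned.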
That's still no excuse. They admit to running THOUSANDS of Regex rules in custom Lua scripts embedded in Nginx. This sounds like a bad idea to anyone that knows anything about software because it is.
My previous employer embedded way too much Lua script inside Nginx plugins for the same reasons (it's easy). Even at our "scale" (50 requests/second) we had constant issues. To think they run ~10% of internet traffic on such a Rube Goldberg machine is proof you can use just about anything in prod (until it inevitably explodes, at least).
The outage was caused by a regex that ended up doing a lot of backtracking, which caused PCRE, the regex engine, to essentially handle a runaway expression.
This reminded me of an HN post from a couple of months back by the author of Google Code Search, and how it worked: https://swtch.com/~rsc/regexp/regexp4.html . Interestingly, he wrote his own regex engine, RE2, specifically because PCRE and others did not use real automata and he needed a way to do arbitrary regex search safely.
1. A test job in CI/CD pipeline suddenly taking a very long time and lots of cpu
2. A data cleansing / checking job in a Java webapp occasionally turning the machine to molasses.
In both occurrences the regex had been around for a while; what changed was the data, e.g. lots of trailing whitespace.
In theory, your statement is perfectly correct. However, quoting that reference:
"However, if the NFA has n states, the resulting DFA may have up to 2^n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs."
This means that in practice, DFAs are larger, slower, and sometimes can't be run at all if complex enough.
However, this was my mistake. I remembered (vaguely) the 2^n issue and didn't follow up to make sure I was accurate.
And I completely spaced on the fact that neither NFAs nor DFAs handle backreferences without extension.
I believe if r is the size of the regex and d is the size of the data, an NFA is O(r) to compile and O(rd) to execute, while a DFA is O(2^r) to compile and O(d) to execute. So DFAs are slower to compile, but faster to execute.
“Regular expression” has a different meaning in a programming context than in a formal language context. Regular expressions in regex libraries do more than match regular languages.
PCRE can also recognize all context-free languages and some subset of context-sensitive languages. Just having backreferences makes the problem NP-hard.
Suggestion for future, learned from bitter experience: separate your control plane from your data plane. In this case, make sure that the tools you use to manage your infrastructure don't depend on that infrastructure being functional.
That way you won't have to remember how to use a bypass procedure -- it will just be your normal procedure.
Imagine an IaaS cloud. It starts with Compute, Networking, Storage (block) and maybe Object Storage/S3. Next comes a fully-managed database product. The Database team may want to leverage the Object Storage data plane in the Database control plane. A year or two down the road, a team building a SaaS application will probably look to use the fully-managed database as it’s one less piece of infrastructure to manage.
To avoid or eliminate these types of delays in resolution, it’s imperative that the product team have a strong understanding of failure modes and dependencies. There’s a lot to be said for building completely isolated foundational services — it’s also a very expensive undertaking. Lastly, it’s possible to build out-of-band/break glass access without compromising security.
(I work at a global cloud but have no familiarity with CloudFlare’s internals.)
This calls for not using Cloudflare for their web dashboard.
We run a WAF based on LuaJIT in resty. Just to be clear, the resty interface to PCRE does provide a DFA mode. Furthermore, Zhang actually ported RE2 (see other comments here) to C as sregex, which is usable from Lua as a C module regardless of whether it runs in resty or a custom Lua app.
> Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
Not addressed at Cloudflare, since they had a defense in place. But just in case anyone else is running a similar thing in Lua.
> In the longer term we are moving away from the Lua WAF that I wrote years ago.
Then sregex might be the perfect fit here. Though Rust is technically safer. Depends on what longer term means.
One of those cases where they had 1 problem, used regular expressions, and ended up with 2 problems?
Edit: I really like how much information is given by CloudFlare. 11 points in the "what went wrong analysis" is how every root-cause analysis should be done.
"Pushing bad regex to production, chaos monkey code causing cascading network failure, etc.", in response to a comment from someone who previously worked at Cloudflare.
Incredible write up. Really appreciate the detail, and am really impressed by how mature their response coordination seems to be.
Haha, so the free customers are crash test dummies for providing test traffic. Nice.
I actually don't mind that much, considering it's basically bulletproof DDoS protection for free. I'd much rather "be the product" in this way than in the way ad companies make you the product, at least.
I honestly just assumed that when customers chose where they would try things outside their lab, it was lower level customers, a less busy part of the network, anywhere the impact isn't as serious. That's where the lowest risk is.
Some customers would discuss their own customers by name, as in "Should we try this change on Customer Y?", and the discussion would work along those lines.
When I started deploying my own software, I just assumed anything that I was deploying to for free was a sort of "lab light" for them. I also don't mind, it seems fair.
ANY change outside a lab... is its own experiment.
Smaller customers don't have the same web traffic, which may not be enough to trip any given failure scenario. One could imagine that the backtracking in an onerous regexp is only triggered with a sufficiently large customer that has a path that is especially difficult to match.
With staged rollout and without a "fast" deploy procedure, by the time it hits the larger customers, it's already been deployed to some percentage of the fleet - and then you still have a problem, with a significant proportion of your fleet.
Staged rollouts are an entirely reasonable risk mitigation idea, mind you, and not one I'm even arguing against.
My point is that unfortunately it's no panacea, especially at scale. Which is what makes this all an experiment.
Impact is generally lower, both to the client, and to your bank account.
Overall I think it's a good deal for both users and Cloudflare. Users get a major CDN for free, and instead of paying for it with ads, surveillance or other shady things, they pay by being beta testers.
Your customers are the product. Cloudflare sets a first party tracking cookie on every domain they serve. They unwrap TLS and can see every product your customers look at or buy.
Whether intentionally or not, they built the Ad Network 2.0. They found the solution to ISPs not being able to snoop, and browsers locking down third party tracking.
But, as you said, quadratic is often already fatal on realistic data.
My main fright during this outage wasn't really the outage itself, but the fact that I couldn't log into the dashboard and simply click the orange cloud to bypass Cloudflare in the meantime. I'm assuming that this is now covered by this mitigation:
>> 6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare's edge.
If so, and if this would have prevented the dashboard outage even during the WAF fiasco, this is a huge comfort to me. Just curious, though: how far can you really go in separating Cloudflare "the interface" from Cloudflare "the network?"
And in general, what does everyone on HN think about mission-critical companies using their own infrastructure and being their own customer? Especially when the alternative is using a competitor?
Edit: Additionally, from a competitive standpoint, I don't see a problem with using a third-party platform for a monitoring service.
> If so, and if this would have prevented the dashboard outage even during the WAF fiasco
It wouldn't prevent the initial dashboard outage. However, in a similar situation where the main issue can't be resolved quickly, it would allow them to restore dashboard access.
So security/debugging tools increased the number of [discovered/exploited] vulnerabilities, because developers don't use them. Only malware developers and third-party security researchers take the time to test security.
I assume this is why :D
Most of us will (hopefully) never be in a situation like this so "book knowledge" of extremis events is the next best thing available. And that relies on good write-ups.
1. It appears there was a safe path with more safety and scrutiny, and a fast path with less. In this case, over time, the fast path became routine. Are there other places where this pattern could develop or has already developed? Is this tradeoff between speed and scrutiny actually necessary? (ie could you have urgent updates reach production faster but actually receive more scrutiny/more testing, even if that happens after the fact?)
2. In a similar vein, if the system has a failsafe configuration (eg only changes that have passed the full barnyard, or configurations that have been running safely for more than a certain amount of time), would it be plausible to automatically roll servers back to that configuration if they remain unresponsive for a certain amount of time?
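The automatic rollback in point 2 could be sketched roughly as below. The `Server` record, the grace period, and the notion of a single "last known good" config are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    active_config: str
    last_known_good: str  # e.g. the newest config that survived a full soak
    unresponsive_since: Optional[float] = None  # timestamp, None if healthy


def maybe_roll_back(server, now, grace_seconds):
    """Revert to the last known good config after a period of unresponsiveness.

    Returns True if a rollback was performed.
    """
    if server.unresponsive_since is None:
        return False  # server is healthy
    if now - server.unresponsive_since < grace_seconds:
        return False  # still inside the grace period
    if server.active_config == server.last_known_good:
        return False  # nothing safer to fall back to
    server.active_config = server.last_known_good
    return True
```

The check has to run somewhere that keeps working when the request path is starved of CPU, which is the hard part and is not shown here.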
3. It seems as though there are multiple points (big WAF refactor, credential expiry, internal services dependent on working prod) where a sufficiently cynical engineer would say "I bet there's something here that could, if not bring down the site, at least ruin someone's day". Is there a suitable voice for this kind of cynicism? Eg, a red team or similar? If you were Murphy's Law incarnate, messing with Cloudflare's systems to achieve maximum mischief, where would you start?
4. I get the sense that there are many reliable and well-tested layers of safety, but is it common to test what happens if they fail anyway? Eg: let's pretend Cloudflare just got knocked out globally by a wizard spell, what do we do? Or let's say our staged rollout system gets completely bypassed because of solar flares, how bad is it? Beyond developing a procedure or training for these kinds of situations, are they actively simulated or practiced?
If anything, I'd guess the root root cause here is a success failure, where the system has been so reliable for so long that the main reactions to it failing are disbelief and unpreparedness. I'm sure it wasn't funny at the time, but it gives me a chuckle to imagine the SREs speculating about Mossad quantum-tunnelling 0days or something because the idea of everything falling over on its own is so unthinkable. Meanwhile, those of us without so many 9s would jump straight to "I probably broke it again."
However, I don't think this question is very fruitful:
> let's pretend Cloudflare just got knocked out globally by a wizard spell, what do we do?
The way you solve a production issue is you identify its cause and then contain, mitigate, or fix it. I don't think you'd learn anything useful from a drill where there's no specific cause.
Perhaps along similar lines to what you're thinking of, something I could see being useful is to look at components that you've already thought to implement a 'global kill' for, like WAF, for instance. Maybe you could run drills where every machine running WAF starts blackholing packets, or maxing out RAM, or (as happened here) maxing out CPU, the kind of thing where you'd want to execute the 'global kill' in the first place. That way, you can ensure that the 'global kill' switches are actually useful in practice. Something like that seems more grounded to me, making the assumption that something specific is going wrong and not just "magic", while still avoiding too-specific assumptions about what can and can't go wrong.
Can you share any more details about the protection to prevent excessive cpu usage by a regular expression that was accidentally removed?
The last time they had a global problem, everyone scrambled for more than a week. (Cloudbleed)
This 30-minute global outage was pretty nasty, but not anywhere near as awful. Timing helped, as nothing truly critical was affected. (There are some extremely high-volume sporting events which, if affected even just for few minutes, can have a direct impact on the bottom line.)
I do not wish to see more of these. Cloudbleed gave me two weeks of headache and an indigestion problem. This one did basically nothing. If there is a happy middle ground between the two, I am not exactly thrilled at finding out what it is.
But you do bring up a good point. RE2 and Rust both compile the regex in the same process that executes it. Compiling the regex as part of your build process then pushing the compiled form could have advantages.
Slightly more verbosely, it will match [0-or-more bytes of anything] followed by [0-or-more bytes of anything] followed by [an equal sign] followed by [0-or-more bytes of anything]. The expensive part is that it can't decide where the first grouping of [0-or-more bytes of anything] ends and the second grouping begins. It doesn't matter where the division is, of course, but many regex engines use an exponential-time algorithm for that, even though an obvious linear-time algorithm exists (and pre-dates the exponential-time algorithm!).
Faster karma than normal, I think.
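For the curious, the linear-time idea can be sketched in a few lines: instead of trying one split at a time and backing up, track the set of all pattern positions that could be active after each input character. This is a toy restricted to literals, `.`, and `*` on a single character, and `linear_match` is a made-up name; real engines such as RE2 handle far more syntax.

```python
def linear_match(pattern, text):
    """Full-match `text` against `pattern` by tracking a set of pattern positions.

    Supports literals, '.', and a '*' suffix on a single character.
    Each input character is processed once; there is no backtracking.
    """
    # Split the pattern into (char, starred) tokens.
    tokens = []
    i = 0
    while i < len(pattern):
        starred = i + 1 < len(pattern) and pattern[i + 1] == "*"
        tokens.append((pattern[i], starred))
        i += 2 if starred else 1

    def closure(positions):
        # A starred token may match zero times, so being at it also
        # means being at the token after it.
        out = set()
        for p in positions:
            while p not in out:
                out.add(p)
                if p < len(tokens) and tokens[p][1]:
                    p += 1
                else:
                    break
        return out

    current = closure({0})
    for c in text:
        nxt = set()
        for p in current:
            if p == len(tokens):
                continue  # already past the end, cannot consume more
            ch, starred = tokens[p]
            if ch == "." or ch == c:
                nxt.add(p if starred else p + 1)
        current = closure(nxt)
        if not current:
            return False
    return len(tokens) in current
```

The ambiguity about where the first `.*` ends simply becomes "both positions are in the set", so `.*.*=.*` costs a handful of set operations per input byte no matter how long the input is.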
This is a good lesson on Chesterton’s Fence. I’ve been thinking for a while that we really need the (default behavior) ability to annotate commits after the fact, so that we have a durable commentary that can evolve over time. We should be able to go back and add strongly worded things like “yes this looks broken but it exists due to this bug fix” or “please don’t write new code that looks like this. See xyz for a better alternative.”
Hell I think I’d be perfectly ok if the code review lived with the code permanently. Regression in the code? Josh warned you it was a bad idea. Maybe we should listen to Josh more?
These days we treat code as a living breathing thing. No reason we can’t do the same to commits.
I feel like it could be optimized further but this would be the first step, and wouldn't most experienced regex authors use that from the beginning, nipping the whole backtracking problem in the bud and making the regex much more performant?
It's entertaining to see people making the same mistakes that have been widely known about in network security well before there was Hyperscan, RE2, etc.
As a follow up, would something like `[^=]*=.*` be a better capture group regex?
/.*=.*/ becomes /[^=]*=.*/
Where the first regex is 57 steps for x=xxxxxxxxxxxxxxxxxxxxxxxx, the second is just 7.
Avoid using greedy .* for backtracking regex engines! Give your greedy regex engine the hints it needs to do what it does best.
That's short timescales for quite a significant change. I know it's just replacing a piece of automation with one that does the same task, but the guts are all changing and all automation introduces some level of instability, and a bunch of unknowns. Changing the regex engine is just as significant as introducing new automation from an operations perspective, even if it seems like it should be a no-brainer. I'd encourage taking time there (unless this is something they've been working on a lot and are already doing canary testing).
The other steps look excellent, and they should all collectively give ample breathing room to make sure that switching to re2 or Rust's regex engine won't introduce further issues. There's no need to be doing it on a scale of weeks.
Some quick thoughts about Quicksilver: Deploying everywhere super fast is inherently dangerous (for some reason, old school rocketjumping springs to mind. Fine until you get it wrong).
I definitely see the value for customer actions, but for WAF rule rollouts, some kind of (automated) increasing speed rollouts might be good, and might help catch issues even as the deployment steps beyond the bounds of PIG etc. canary fleets. Of course, that's also useless in and of itself unless there is some kind of automated feedback mechanism to retard, stop, or undo changes.
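Roughly what I have in mind, as a sketch only: batches that double in size, with an automated health check that halts and undoes the change on the first bad signal. The callbacks (`deploy`, `healthy`, `undo`) are hypothetical hooks, and a real system would also wait between batches before checking health.

```python
def staged_rollout(hosts, deploy, healthy, undo, first_batch=1):
    """Deploy to batches that double in size; undo everything on failure.

    Returns the hosts left running the change (empty list if rolled back).
    """
    done = []
    batch = first_batch
    i = 0
    while i < len(hosts):
        current = hosts[i:i + batch]
        for h in current:
            deploy(h)
        done.extend(current)
        # Automated feedback: re-check every host touched so far, because a
        # problem may only show up some time after the change landed.
        if not all(healthy(h) for h in done):
            for h in reversed(done):
                undo(h)
            return []
        i += batch
        batch *= 2
    return done
```

Hosts in later batches are never touched if an early batch goes bad, which is the whole point.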
If I can make a reading suggestion: https://smile.amazon.com/gp/product/0804759464/ref=ppx_yo_dt... The book is "High Reliability Management: Operating on the Edge (High Reliability and Crisis Management)" (unfortunately not available in electronic form). It's focussed on the energy grid in California, the authors were university researchers specialising in high reliability operations, and they had the good fortune to be present doing a research job at the operations centre right when the California brownouts were occurring in the early 2000s. There's a lot to be gleaned from that book, particularly when it comes to automation, and especially changes to automation.
Rust also has a good pedigree for not being faulty. BurntSushi, the author of Rust's regex crate, also has a good pedigree...
We switched to RE2 for a massive project 2 years ago and haven't looked back. It is a massive improvement in peace of mind.
If anything, I'm surprised that JGC has allowed the use of PCRE in production and on live inputs...
Changing anything brings an element of risk, and changing quickly even more so, which is essentially what they're proposing to do. That's where my concern lies.
Their current approach clearly has issues, but it has been running in production for several years now and those issues are fully understood, engineers know how to debug them, and there's a lot of institutional knowledge around covering them. They've put a series of protective measures in place following the incident that take out one of the more significant risks. That gives them breathing space to evaluate and verify their options, carry out smaller scale experiments, train up engineers across the company around any relevant changes etc. There is no reason to go _fast_.
That's true if you just want a boolean result. But if you want to get the matched string (which it appears the actual code does), then you need to continue, because it's using greedy matching.
- what happened to the engineer(s) responsible for that event? They must feel really bad RN, how do you handle this situation?
- on a more general point, how do you train individuals to ensure this particular event does not recur?
I was affected by this outage, but I really appreciate Cloudflare taking the time to explain the problem in this much detail. Given their own systems were affected, I’m surprised they mitigated as fast as they did.
As far as problems go, an outage is preferable to a breach.
Maybe it's better to separate damage zones for different features.
That's a typo. It should say
>x=xxxxxxxxxxxxxxxxxxxx still takes 555 steps
Root cause was a bad regex generating excessive backtracking using all CPU on nodes.
The meta-cause is the process workflow:
> But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats.
The above is in reference to how WAF deployment doesn't use the graduated DOG(fooding)/(guinea)PIG/canary flow.
> We responded quickly to correct the situation and are correcting the process deficiencies that allowed the outage to occur [...]
Live and learn. Not all WAF deployments are emergency rollouts.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
And those rules still get run on every box on Cloudflare's edge network with HTTP requests from strangers on the internet, right?
So how come this didn't get triggered by a customer first?
Perhaps it did get triggered by a customer first, but that customer didn't get much traffic on the URL which triggers the issue, and that box got one thread stuck executing that regex for a few minutes till a health check killed it...? Does this imply that Cloudflare runs with random failing health checks across the fleet and there isn't someone looking at core dumps of such failures?
That would align with my experience of seeing occasional "502 bad gateway" errors from Cloudflare over the past few years. It also seems likely considering the incident where Cloudflare servers leaked sensitive memory contents into HTTP responses, which happened so frequently they got cached by Google search. Hard to leak arbitrary memory contents without occasional SIGSEGVs...
If the above conjecture is true, it reflects very badly on engineering culture at Cloudflare. The core issue had been seen across the fleet sporadically for a long time, but was ignored, and even during the postmortem process, which should be a very thorough investigation, the telltale pre-warning signs of the issue were still missed.
Also, the protection for this was removed in a recent update before the incident, so it wouldn't have had an impact if a customer did this until that protection was removed. So maybe a few weeks earlier they might have started seeing some problems. But again, I am pretty sure the logic in the rule that caused the issue isn't available to customers.
No, but customers can request a custom WAF rule to be written by Cloudflare engineers specifically for their domain.
1. An engineer wrote a regular expression that could easily backtrack enormously.
2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
3. The regular expression engine being used didn’t have complexity guarantees.
4. The test suite didn’t have a way of identifying excessive CPU consumption.
5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
6. The rollback plan required running the complete WAF build twice taking too long.
7. The first alert for the global traffic drop took too long to fire.
8. We didn’t update our status page quickly enough.
9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.
1. The process for composing complex regular expressions is "engineer tries to shove a lot of symbols into a line" rather than "compile/compose regex programmatically from individual matches"
2. Production services had no service health watchdog (the kind of thing that makes systemd stop re-running services that repeatedly hang/die)
3. Performance testing/quality assurance not done before releasing changes (this is not CI/CD)
4. No gradual rollout
5. No testing of rollbacks
6. Lack of emergency response plans / training
(Wrt the regexes, I know they're implementing a new system that avoids a lot of it, but in the new system they can still write regexes which (I think) should be constructed programmatically)
Instead, they didn't understand the runtime performance of the regex, as it was implemented in their particular system. No amount of syntax can change that.
A framework that allows well-written, "normal" code to parse out what you want, can produce something easier to understand and maintain, surfacing this type of bug in a more obvious way.
Cryptic syntax is the main reason I avoid regexes (particularly complex ones).
Too much obfuscation between the code you write and the steps your program will take. Granted, my concern doesn't apply to master craftsmen who truly understand the nuances of the tool, but in the real world those are few and far between.
ps. I get there was a lot more going on in this postmortem than just one rogue regex.
This isn't even why you should compose them programmatically, though. Perl allows you to compose a regex with in-line comments (https://perldoc.perl.org/perlfaq6.html#How-can-I-hope-to-use...), but it's still a hand-crafted regex, which is error-prone, much like composing code by hand. If you can get a machine to generate it for you, you avoid unintentional human-introduced bugs, as well as make it easier to read and reason about.
If you have a ton of regexes, or they are super important to your business, you should consider not editing them by hand. There's only so much test cases can do to prevent bugs.
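As a small illustration of composing rather than hand-typing, here is a sketch that builds a query-parameter matcher from a list of key names. The function name and the key=value&key=value input format are assumptions for the example; the only library calls used are Python's standard `re.escape` and `re.compile`.

```python
import re


def query_param_pattern(keys):
    """Build a regex that finds `<key>=<value>` for a fixed set of keys.

    Every piece is generated: keys are escaped, and the value is a negated
    character class rather than a bare '.*', so there is only one way for
    the engine to split the input.
    """
    if not keys:
        raise ValueError("need at least one key")
    # Longest keys first so that a key is never shadowed by its own prefix.
    ordered = sorted(keys, key=len, reverse=True)
    key_alt = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"(?:^|&)(?P<key>" + key_alt + r")=(?P<value>[^&]*)")
```

Usage would look like `query_param_pattern(["id", "a.b"]).search("x=1&a.b=2")`, where the `.` in `a.b` is matched literally because it went through `re.escape`.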
The constant use of 'I' and 'me' (19 occurrences in total) deeply tarnishes this report, and repeatedly singling out a responsible engineer, nameless or not, is a failure in its own right. This was a collective failure, any individual identity is totally irrelevant. We're not looking for an account of your superman-like heroism, sprinting from meeting rooms or otherwise, we want to know whether anything has been learned in the 2 years since Cloudflare leaked heap all across the Internet without noticing, and the answer to that seems fantastically clear.
If you read the report you'd see I do not blame the engineer responsible at all. Not once. I made that perfectly clear.
I don't mean this as some sort of lame 'lol shoulda known better' dunk - stories about technical organizations' decision-making and tradeoff-handling are just more interesting than the details of how regexes typed in a control panel grow up to become Jira tickets.
Pushing out a brand new regex engine surely will go through the usual process. This doesn't seem like it will take a lot of time unless there are surprises. Cloudflare clearly already has the infrastructure in place to do a proper integration test for correctness, plus the ramp-up infrastructure to ensure it doesn't cause a global outage. The global nature of this outage was because the ramp-up infrastructure was explicitly not used, as per the protocol.
I have no idea what you read where a single engineer was singled out. At several points in this post mortem the author identifies that the regex being written by the individual involved was far from the only cause of the outage. This is a very textbook blameless post mortem doc afaict.
The narrative about the actions taken and the meetings people were in is also par for the course for a good post mortem, since these variables are real and should be addressed by remediation items if they contributed to the outage. (For example, is it sane that the entire engineering team was synchronously in a meeting? Probably not.)
On top of that they're switching to more constrained regex engines. Rust's regex engine makes guarantees about its running time, something that would have directly mitigated a portion of the issue. And it isn't as if RE2/Rust regex aren't in use anywhere; Rust's regex engine is integrated into vscode, for example.
If this is a personal attack, there are literally 10-50 of these per day in arbitrary threads.
> there are literally 10-50 of these per day in arbitrary threads
If you can find cases of this where moderators didn't respond, I'd like to see links. We don't come close to seeing everything that gets posted here, so we depend on users, via flagging (https://news.ycombinator.com/newsfaq.html) or by emailing firstname.lastname@example.org.
> What is HN running on again?
I suppose I have to answer this or someone will concoct a sinister reason why I didn't. HN doesn't run on Cloudflare.