The timing is unfortunate too, coming right after they called out Verizon for negligence and a lack of due process.
I'm sure they have an undo or rollback mechanism for deployments, but it's probably worth investing in further.
They also need to resolve the catch-22 where people could not log in to disable the Cloudflare proxy ("orange cloud") because cloudflare.com itself was down.
Nonetheless, Verizon could take a leaf out of their responsiveness and transparency book.
Even after their huge Cloudbleed issue, and now this one on top of it, they continue to call out everyone else through their blog posts. And everyone seems fine with it because they are a hyped company.
What they are criticising, however, are things like not adopting new protocols, or not taking seriously problems that affect everyone. These aren't mistakes that happen when people are genuinely trying. And the response from some of the industry is "we know what we are doing", and shortly after, the same thing happens again and again and again.
So I don't really see CloudFlare as being that arrogant; if anything, it's the "you are not better than us" attitude from some parts of the industry that is. The day I see CloudFlare not trying, I'd be happy to call them arrogant. If anything, I'd venture that they are this successful precisely because they try more than most.
Cloudflare has improved a lot. You can see just from what they're open sourcing that their use of Go and Rust has increased significantly. And I'm sure we'll see improvements in deployment practices too.
When Cloudbleed happened I was very vocal and skeptical, but this is different. Everyone makes mistakes.
You say this like using trendy languages implicitly indicates improvement.
Monitoring, canaries, and experimentation need to be adopted pretty much everywhere possible.
That depends on how good your tests are.
If your engineers are solid and the chance of a given one making a mistake on a given release is 0.5%, then with 50 engineers the probability of nothing going wrong is only about 78% (0.995^50), and the probability of something going wrong is 1 - 0.995^50. Pretty low odds of a clean release, I might say.
Don't do this to your engineers. 80% test coverage is a sweet spot; the rest is caught better with other approaches. There's no reason to kill engineers' productivity every time something fails in production by blaming them for tests that aren't good enough.
In this example, that's about 22%.
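For anyone who wants to check the arithmetic, a quick sketch (the 0.5% per-engineer rate is the parent's assumption, and the engineers are treated as independent):

    p_mistake = 0.005                         # assumed 0.5% chance a given engineer slips up
    p_all_fine = (1 - p_mistake) ** 50        # 50 engineers, treated as independent
    print(f"nothing goes wrong:   {p_all_fine:.1%}")       # ~77.8%
    print(f"something goes wrong: {1 - p_all_fine:.1%}")   # ~22.2%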
Didn't take anything down, but did cause an inordinate amount of effort tracking down what was suddenly blocking the event loop without any operational changes to the system...
It's nowhere near as standardly applied as the other approaches to release verification, though.
And in complex cases (say, a large multi-tenant service with complex configuration), it can be very hard to find the combination of inputs necessary to catch this issue. If you have hundreds of customer configurations, and only one of them has this particular feature enabled (or uses this sort of expression), fuzzing is less likely to be effective.
As I commented yesterday, this is due to the fact that "the Internet" thinks it needs to use Cloudflare's services, although there really is no need to do so.
Stupid people making stupid decisions and then wondering why their services are down.
I bet they will now.
The repeated outages, plus the constant malicious advertising pushed through Cloudflare by scammy ad providers, are slowly turning me off the service as a potential enterprise customer. It's unfortunate, too, since plenty of superlatively qualified people build great things there (hat tip to Nick Sullivan), but it seems like the build-fast culture may now be impeding the availability requirements of their clients.
This is also a great example of a case where SLAs are meaningless without rigorous enforcement provisions negotiated in by enterprise clients. Cloudflare advertises 100% uptime (https://www.cloudflare.com/business-sla/) but every time they fall over, they're down for what, an hour at a time? Just this one issue would've blown anyone else's 99.99% SLA out of the water -- https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr
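For scale, here's the downtime budget each SLA tier implies (simple arithmetic, nothing Cloudflare-specific):

    # Downtime allowed per year at various SLA tiers
    minutes_per_year = 365 * 24 * 60
    for sla in (0.9999, 0.9995, 0.999):
        print(f"{sla:.2%}: {minutes_per_year * (1 - sla):.0f} minutes/year")
    # 99.99% allows ~53 minutes/year, so a single ~30-minute global
    # outage burns most of the annual budget in one shot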
I love the service, but if I'm to consider consuming it, they'd do well to offer the equivalent of a long-term servicing branch as its own isolated environment, one where changes are only merged in once they've proven to be hyper-stable.
None of the major cloud vendors actually hit 99.99% uptime.
None of them even promise that -- last time I checked, it was 99.95% for most of them. The exception is Route 53: to my knowledge, it's the only AWS service to promise 100%.
The question is whether the guarantee is meaningful, i.e. whether the penalties significantly dissuade failures to meet it, and I'd argue that in Cloudflare's case they don't.
[Edit: Cloudflare's standard] penalty is a service credit defined as follows:
> 6.1 For any and each Outage Period during a monthly billing period the Company will provide as a Service Credit an amount calculated as follows: Service Credit = (Outage Period minutes * Affected Customer Ratio) ÷ Scheduled Availability minutes
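Plugging rough numbers into that formula for an incident of this size (these inputs are my assumptions, not Cloudflare's actuals):

    # Illustrative: assumed numbers fed into the credit formula quoted above
    outage_minutes = 50                            # roughly this incident's duration
    affected_customer_ratio = 1.0                  # assume every customer was affected
    scheduled_availability_minutes = 30 * 24 * 60  # ~43,200 minutes in a 30-day month

    credit = outage_minutes * affected_customer_ratio / scheduled_availability_minutes
    print(f"service credit: {credit:.3%} of the monthly fee")  # ~0.116%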
And that's woefully inadequate for any enterprise client with mission- or life-critical services.
TL;DR: A SLA is a guarantee, by the very definition of the word "guarantee," that a service will be delivered to a specific level and that certain agreed-upon penalties will be applied to the service provider if this guarantee is not met.
Edited for tone.
100% uptime doesn't necessarily mean nothing failed; it means the failure detection and mitigation worked within the allowed windows. In a typical internet environment, that means allowing connections to die when the server they're connected to dies. It would be possible to hand off TCP connections, but nobody does it.
If you want to get close to those numbers, you need to have a real reason, and then you need to make sure you have a good plan for everything that can go wrong. Power, routers, fiber, load balancers, switches, hosts, etc. And then do your best not to push bad software / bad configuration.
Bare metal on quality hardware with redundant networking goes a long way towards reliability, once the kinks are worked out.
Nothing wrong with just using SLOs, but if you are a technical lead or senior engineer, you should have the big picture.
Or - if you prefer - what is the "reasonable" percentage of issues - timewise - for an internet service?
Agreements to uphold past performance are much better.
There's a lot of good info here, but what I'm reading in the SOC 3 raises more questions in my mind than perhaps you might've expected. I can ideally run through them if I catch you again at DEF CON this year. I'm also willing to sign your standard MNDA to review your SOC 2, but we can take that thread offline.
> Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.
Wow. This seems like a very immature operational stance.
Any deployment of any kind should be subject to the minimum deployment safety practices that they claim to have.
> At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.
Many large companies would have automatically rolled back this kind of change in less time than it took CloudFlare to (apparently) have humans decide to roll back, and possibly before a single (usually not global) deployment had even completed on all hosts/instances.
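To sketch what I mean (the stage names, CPU threshold, and every helper here are invented for illustration; this is not how CloudFlare's pipeline actually works):

    import time

    # Toy staged rollout with metric-gated automatic rollback.
    STAGES = ["canary-pop", "one-region", "half-fleet", "global"]
    CPU_ABORT_THRESHOLD = 0.80   # assumption: abort if CPU exceeds 80% after a push

    def push_ruleset(stage: str) -> None:
        print(f"pushing WAF ruleset to {stage}")   # stand-in for the real deploy step

    def observed_cpu(stage: str) -> float:
        return 0.35                                # stand-in for a real metrics query

    def roll_back() -> None:
        print("reverting to last known-good ruleset")

    def staged_deploy() -> bool:
        for stage in STAGES:
            push_ruleset(stage)
            time.sleep(1)                          # bake time; minutes in reality
            if observed_cpu(stage) > CPU_ABORT_THRESHOLD:
                roll_back()                        # no human in the loop
                return False
        return True

    staged_deploy()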
What is more concerning, however, is that it seems you shouldn't rely on CloudFlare's "WAF Managed Rulesets" at all, since they were willing to turn the feature off instead of correctly rolling back the bad deployment, which they only did more than 43 minutes later:
> We then went on to review the offending pull request, roll back the specific rules, test the change to ensure that we were 100% certain that we had the correct fix, and re-enabled the WAF Managed Rulesets at 1452 UTC.
How were they not able to trivially roll back to the previous deployment?
If you're dev at a hipster app maybe a dozen people use to holler "yo" at each other, by all means go for it. If you're operating one of the biggest and most important chonks of Internet infra... maaaaaybe stick to established practices such as stage testing, release schedules and incremental rollouts?
I don't want to return to the old slothful release schedules of the 2000s, where features and bug fixes mostly stagnated.
You can have staging and scheduled, QA-signed-off releases that happen every day. I have worked on some fairly large, significant services and we still released several times a day; you just didn't trigger the final prod release yourself, the QAs pressed the button instead. Though usually just once a day per microservice.
I have also worked with several clients lately without QA, where devs could push to prod many times a day themselves. I am not sure those systems were that much less stable, though they were all mostly greenfield and not critical public government systems. The changes were of course a lot smaller, and quick to undo, which is the core element of the "release straight to prod" ethos.
I am sure Cloudflare has a significant QA process while using today's fast-moving release schedules.
What is always a grey zone is configuration changes. Even if properly versioned and on a release train with several staging environments, configuration is often very environment-sensitive. So maybe they could not test it properly in any staging environment and instead had to hope prod worked...
Hopefully, though, Cloudflare will implement some way to make sure this particular configuration, and future changes to it, are less bottlenecked, so they can be rolled out gradually to a subset of traffic and region by region instead of in a big bang to everyone. Though canary/blue-green/etc. releases of core routing configuration are hard.
That's the one that took down Stack Overflow a few years ago
I'd bet big money that they do include it.
1. Why are WAF rules not progressively deployed since there's already a system to do so?
2. Maybe there should also be a testing environment that receives a mirror of production traffic before deployments reach real users?
(I understand the WAF change was not set to take action, but a separate environment would be less likely to affect production)
This release wasn't meant to go out, so the fact that it did means it would have bypassed the test environments either way.
This is direct and doesn’t attempt to avoid blame. Well done.
So for about 50 minutes, those who relied on the WAF were open to attack?
First Cloudflare literally denied service, then as a hotfix there was a higher-than-normal potential for denying service, and eventually the normal potential for denying service was restored. I'm trying to comprehend how the second phase could ever be worse than the first phase.
Now, if you're talking about elevating the potential for compromised confidentiality and/or integrity rather than merely availability, I'd agree, but generally [D]DoS refers to availability.
Leaning on a WAF to plug gaping vulnerabilities that can be discovered and exploited during the period of time before the WAF was restored means you have much bigger problems than uptime.
It's also, roughly speaking, the selling point of products called "WAF". (and yes, relying on them is not great)
When it was Verizon that took down the internet he felt it was appropriate to do that to the Verizon teams, after all.
edit: right after posting this comment, he did tweet the following: https://twitter.com/eastdakota/status/1146196836035620864
> I'd say both we and Verizon deserve to be ashamed.
As well as this: https://twitter.com/eastdakota/status/1146170209780113408
> Our team should be and is ashamed. And we deserve criticism. ...
I still don't think that publicly shaming anyone is a good leadership style nor is it a good way to motivate people to perform better in the future, but kudos for the self-awareness, at least.
I'm not saying Verizon is perfect nor absolved of fault, but Cloudflare was/is not owed any kind of explanation or assistance by VZ, and it's absurd of CF to still be whining about that fact (as they are doing in some other tweets today). If CF wants some kind of SLA with VZ, they should engage them in a business relationship, not try to publicly shame them.
Kind of similar to a homeowners' association saying "hey, that trash on your lawn affects your neighbor, clean it up!"
It's true that they are not a customer, but at that level what each does affects the other, and it's better to resolve things civilly and privately instead of publicly on Twitter.
Etsy implementing multiple CDNs (7 years ago; the CDNcontrol project looks abandoned): https://speakerdeck.com/ickymettle/integrating-multiple-cdn-... https://dyn.com/blog/speaking-with-etsy-about-multi-cdns-and...
Basically: you can try to keep a low DNS TTL, but that means more DNS traffic, and 5-10% of traffic takes forever to cut over because nobody respects TTLs. Worst case, you have just as much downtime as before; best case, most of your traffic is recovered in a few minutes.
Just leaving it out there so one doesn't get the idea that "low TTL == Always Good"
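(If you're curious what TTL a zone actually hands out, here's a quick check using dnspython; the domain is a placeholder:)

    import dns.resolver  # third-party: pip install dnspython

    # Placeholder domain; substitute the zone you're checking.
    answer = dns.resolver.resolve("example.com", "A")
    print(f"advertised TTL: {answer.rrset.ttl}s")
    # A low TTL (say 60s) only helps if resolvers honor it; many cap or
    # ignore it, which is why a tail of traffic lags the cutover.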
My thoughts immediately shift to one of my favorite articles of all time, "Regular Expression Matching Can Be Simple and Fast..."
HN does have some great content/replies that touch on these topics, but I'd like something more.
But yes, the content like this on HN is fascinating and I would also like more.
StackOverflow had a similar case a while back:
* Google's RE2 https://github.com/google/re2/wiki/WhyRE2
There is a good series of articles about the problem: https://swtch.com/~rsc/regexp/regexp3.html
I would strongly recommend deploying such a regular expression matcher to avoid problems like this.
There are examples in the above article that you can use to test anything in your production deployment that accepts regular expressions to see how well it copes.
Regular languages have some very nice properties relating to how they can be evaluated. Some regular expression engines, however, have features (such as backreferences) that pull the expressions out of being a regular language and into something more complex.
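If you want to feel the failure mode yourself, here's a minimal demo of catastrophic backtracking in Python's backtracking `re` engine; an RE2-style engine matches the same inputs in linear time:

    import re
    import time

    # Classic pathological pattern: nested quantifiers force exponential
    # backtracking when the input almost matches but ultimately fails.
    pattern = re.compile(r"(a+)+$")

    for n in range(18, 28, 2):
        s = "a" * n + "b"                  # trailing 'b' guarantees a failed match
        start = time.perf_counter()
        pattern.match(s)
        elapsed = time.perf_counter() - start
        print(f"n={n}: {elapsed:.2f}s")    # roughly quadruples every two steps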
Would be interested to see what the gnarly regex was that was bombing their CPUs so hard!
"You have a problem, and you decide to use a regexp to solve it. Now you have two problems"
Although of course I'm just kidding and I'm sure that a good regexp probably is the right solution for what they're doing in that instance: they have a lot of bright people.
Separately, there is nothing that says a company like Cloudflare has to air its dirty laundry (as the saying goes). The vast majority of 'customers' really don't care why something happened. All they know is that they don't have service.
Pretend that the local electric company had a power outage (and it wasn't caused by some obvious weather event). Does it really matter if they tell people that 'some hardware we deployed failed and we are making sure it never happens again'? I know tech thinks these post-mortems are great, but the truth is only tech people really care to read them. (And guess what: all it probably means is that that particular issue won't happen again...)
Well, Cloudflare is in luck; most of their customers are "tech people"!
Once again using the example of lost luggage: I don't really care (other than as an interesting story) why my luggage was lost; I just don't want it to happen, and an airline writing a detailed story doesn't give me any more confidence that they won't have a different problem next time. If anything, it may even open the door to complaints if something I read seems like it could have been avoided (whereas if they say nothing, I might never know that it could).
I doubt any of the people you are talking about "not caring" read this site to begin with.
> We were seeing an unprecedented CPU exhaustion event, which was novel for us as we had not experienced global CPU exhaustion before.
I'd imagine it was quite novel for most anyone affected /s
Does anyone know a more reliable provider?
It just baffles me that manual/third-party bid-sniping is still a thing. eBay has had automatic bidding for more than twenty years. You'll pay the same whether you put in the winning bid a week in advance or 5 seconds before close. But people see that "you lost this auction" notice and they're irrationally convinced that it would have gone differently if they'd bid at the last minute, somehow.
Also, this helps when placing bids on auctions that end while you're asleep and you still want the effect described above. I've tried to do it manually in the past, but naturally life intervenes and you're somewhere with no signal or in the middle of something. Having tried it both ways, sniping is definitely better than regular bidding.
Sniping approximates a sealed-bid auction, with the highest bidder paying the second-highest bid amount or a small increment above it. (Unfortunately for eBay, that would tend to decrease their cut, unless the appeal of the sealed-bid format brings in sufficiently more bidder activity.)
Another advantage of the software/service is automation. If you want to buy a Foo, you can look at the search listings, find a few Foos (possibly of varying value depending on the details), say how much you'll pay for each one, and let the software attempt to buy each one at its auction's end until it has bought one, then stop. If eBay implemented this itself it might be too much of a customer-support headache, but third parties can provide it to power users.
(I don't buy enough on eBay anymore to bother with anything other than conventional manual bids, but I see the appeal of automation.)
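The "buy one Foo" loop described above is simple enough to sketch; everything here (the watchlist format and the `place_snipe_bid` helper) is hypothetical, since the real work is talking to whatever sniping service you use:

    import time

    # Hypothetical watchlist: (auction_id, end_time_epoch, max_bid_dollars)
    watchlist = [
        ("foo-111", 1_700_000_000, 40.00),
        ("foo-222", 1_700_003_600, 35.00),
    ]

    def place_snipe_bid(auction_id: str, amount: float) -> bool:
        """Hypothetical stand-in for a sniping service's API; True means we won."""
        print(f"bidding ${amount:.2f} on {auction_id} just before close")
        return False   # pretend we were outbid

    def buy_one(auctions):
        # Try each auction in end-time order; stop as soon as one is won.
        for auction_id, end_time, max_bid in sorted(auctions, key=lambda a: a[1]):
            wait = end_time - time.time() - 5   # fire ~5 seconds before close
            if wait > 0:
                time.sleep(wait)
            if place_snipe_bid(auction_id, max_bid):
                return auction_id               # got one; skip the rest
        return None

    buy_one(watchlist)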