A trivial example would be a bug that replaces the configuration for all customers with the last one uploaded. Then when the next customer uploads a new (valid!) config, you have a problem.
Obviously it wasn't that trivial, but the point is: it wasn't the customer's configuration change that was the problem, but some code that managed the config change.
Can it be used as a CDN for a normal website? How well does it perform?
> Can it be used as a CDN for a normal website?
No. It's an independent overlay network designed to give anonymity.
> How well does it perform?
It's slow in comparison to the clearnet, but usable for basic things, including torrents (~100+ KB/s speeds).
A web filled with DDoS attacks and scraping is a web that needs Cloudflare and Fastly. I'm not sure how to avoid this sorry state of things.
I2P doesn’t seem like an immediate solution -- maybe it can resist DDOS, but at the cost of losing fast, easily-accessible, easily-searchable public websites, no? Could Starbucks host their website on I2P, to pick a random example? Seems like a bunch more infrastructure would be needed first.
> Could Starbucks host their website on I2P, to pick a random example?
Yes, they definitely can. What additional infrastructure is needed? I tried to host websites there myself and did not see any problems.
Web crawlers are a feature not a bug. If your site shouldn't be crawled, it doesn't belong on the Internet.
Your profitability is not the Internet's problem.
If you cannot generate revenue from your internet content, you probably can't make a living generating content for the internet.
The consequence, IMHO, is that the internet wouldn't have this amount of content and usefulness.
Newspapers? No. You can't make a living from internet news if anyone can copy a reporter's work, post it on their own site, and dilute traffic.
Online selling? Doesn't look like a viable business model, as anyone can copy the photos you paid a photographer for, the descriptions you paid someone to write, and the reviews your customers wrote. True reviews are priceless, you know? Even more so now that an AI can detect computer-generated reviews.
Obviously an open and totally money-free internet is nice, but it wouldn’t be the internet people make a living from.
Fastly obviously didn’t test their code (with the bug) enough, but testing of course can never prove the absence of bugs. Testing for a global deployment like a massive CDN happens to a large extent in prod because you don’t have another globe. You can test on a smaller scale but eventually you run into a problem that only shows itself at full scale.
> We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change.
It's in the first sentence.
The customer change was a valid configuration. That was yesterday.
It looks like this guy did just that. And for Fastly. Wow.
"So, did I just hear three distinct light switch clicks?"
Breaking 85% of Fastly's servers is not the same as breaking the entire internet.
Also, what is up with their partitioning? Do they seriously have one customer that gets served from 85% of their servers? Is it a whale?
Good on them for getting a statement out right away (although they basically had to) but seems to be lacking any useful details. Wonder if they were scrubbed by PR/legal in hopes of reducing the number of customers coming to ask for gibs.
With respect to partitioning - we don't know how or why an invalid configuration could poison so many nodes; if the config was physically present on them or if there was a cascade of healing/balancing issues stemming from it.
I would leave speculation on many of your points at the doorstep until we see a full report.
A defense against this could be to ensure that the system applying the change validates that some health checks continue to pass after the new config takes effect (or automatically rolls back to the previous configuration).
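To make that concrete, here's a minimal sketch of the idea as a hypothetical single-node deploy step; the paths, service name, and health endpoint are all made up, and this is not a claim about Fastly's actual pipeline:

```python
# Hypothetical sketch: apply a config, verify the node still answers health
# checks, and roll back automatically if it doesn't.
import subprocess, shutil, time, urllib.request

ACTIVE = "/etc/edge/config.vcl"        # assumed paths, for illustration only
BACKUP = "/etc/edge/config.vcl.bak"

def healthy(url="http://localhost:8080/healthz", attempts=5):
    """Return True if the local edge process keeps answering 200s."""
    for _ in range(attempts):
        try:
            if urllib.request.urlopen(url, timeout=2).status != 200:
                return False
        except Exception:
            return False
        time.sleep(1)
    return True

def apply_config(new_config_path):
    shutil.copy(ACTIVE, BACKUP)                  # keep the last known-good config
    shutil.copy(new_config_path, ACTIVE)
    subprocess.run(["systemctl", "reload", "edge-cache"], check=True)
    if not healthy():
        shutil.copy(BACKUP, ACTIVE)              # automatic rollback
        subprocess.run(["systemctl", "reload", "edge-cache"], check=True)
        raise RuntimeError("new config failed health checks; rolled back")
```

The important part is keeping the last known-good config around so the rollback is mechanical rather than a scramble.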
I can see how this would happen, assuming that's what happened.
When making a config change I'd assume they don't make it to all servers at once and instead roll it out gradually. If this caused the server to instantly start 503'ing all customers, presumably it would have been caught; perhaps the failure was more delayed though (a resource leak, etc.), and obviously that is somewhat more difficult to catch.
If they're properly partitioning customers, ideally they wouldn't even ship the configs to all servers (slightly less good, but still pretty good: they could ship them everywhere but not parse/load them). It sounds like at the least this customer's config change affected 85% of servers, which seems absurd to me.
So yes, I can see how it happened, but for Fastly, which runs one of the biggest CDNs, these don't seem like very reasonable mistakes.
There could be a feedback loop that is the opposite of a smoke test (a toy sketch follows the list below).
1. Validate the customer configuration; if it passes, assume it can roll out
2. Roll out customer configuration to node
3. Node goes down
4. Migrate all customers on node to new nodes
5. Node that problematic customer was migrated to goes down
6. Rinse and repeat as the problematic customer migrates to every node and takes out every last one.
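Here's a toy simulation of that loop under made-up assumptions (five nodes, naive rebalancing, one customer whose config crashes whatever node loads it); it's only meant to show how a single customer can end up taking out every node, not how Fastly actually rebalances:

```python
# Toy simulation of the feedback loop described above (purely illustrative).
nodes = {f"node-{i}": set() for i in range(5)}      # node -> customers it serves
nodes["node-0"] |= {"cust-a", "cust-b", "bad-cust"}
nodes["node-1"] |= {"cust-c"}

def node_crashes(customers):
    # The "undiscovered bug": any node that loads bad-cust's config falls over.
    return "bad-cust" in customers

alive = set(nodes)
while True:
    crashed = {n for n in alive if node_crashes(nodes[n])}
    if not crashed:
        break
    for n in crashed:
        alive.discard(n)
        if alive:
            target = sorted(alive)[0]               # naive "heal": move customers elsewhere
            nodes[target] |= nodes[n]               # ...bug and all
        nodes[n] = set()

print(f"nodes still serving traffic: {len(alive)} of 5")   # prints 0 -- the cascade got them all
```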
TikTok is my guess. ByteDance is valued at 250 billion. Plus, the change was pushed in the middle of the night, which would be daytime in Asia. Certainly there are other development teams in Asia, but considering the scale of the change it likely comes from HQ, and Fastly's whale in Asia would be them.
Edit: They may have lost TikTok at the end of last year, either partially or completely. Anyone know what they use now? Akamai, or maybe they stealthily switched back to Fastly?
Also, just because the rollout to Fastly happened in the EU morning doesn't mean that's when the change was made. If there's a deployment pipeline, it could have been made 2-3 hours earlier, or even the day before.
To me it sounds plausible that an SRE team in an alternate location made a change scoped to their permission level, following company-directed playbooks, which eventually triggered the faulty condition at Fastly.
I'm still a little annoyed at their status page. It says:
> We're currently investigating potential impact to performance with our CDN services.
yet in the blog post we're talking about here it says:
> Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.
85% of your network returning errors is _not_ a potential performance impact.
They say it was a performance issue, but we were getting 500 errors from the Fastly API.
* Bug was introduced on date X but only caused problems on date Y ("if the bug was introduced on date X then we would have seen it on date X, so you’re wrong")
* Doing X led to the outage but X wasn’t the fault, X was a valid thing to do, the code should have been able to handle X, the fact the code couldn't handle it was the actual problem which needs to be fixed ("look, you said X caused the problem, so the solution is just not to do X right?")
This article conveys both these points clearly and effortlessly. I might borrow some terminology from this in the future.
But I can’t help but be bothered that a single customer’s configuration change would have such a wide ranging impact across so many sites. I’m looking forward to finding out how that happens…
It's perhaps a bit premature to demand it at this point, but I'm hoping a full post-mortem will outline precisely how this change was not picked up in pre-prod. Surely all valid customer configurations must be tested prior to rollout.
If my data centre provider suffered a complete outage, I would demand a detailed post-mortem of what happened (in due time). If they just tell me bullshit PR speak about "We value our customers", I'd be looking at switching providers.
As a Fastly customer whose site went down, I'm entitled to know exactly what happened. If they don't tell me, I'm switching CDNs as a matter of priority.
Does your contract say you're entitled to an RCA?
As others have said, this is more of an update, not a complete RCA on the entire situation. They have short term tasks that they've described in this summary post and I would expect that they will give a more complete analysis later.
If you are a hardcore user of their VCL on the edge, I'm very curious where you would go. The last time I looked (a year ago) there was no one that came even close to giving customers that level of control in request processing. Most of them fail to do complicated stuff with CORS without making you do arabesques while balancing on a medicine ball (looking at you, Lambda@Edge), not to mention the ability to massage the response.
My guess: it is some sort of config-triggered recursion that caused the servers to stack overflow and crash/reboot in a cycle.
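Purely as an illustration of that guess, here's a contrived example where a "valid" but cyclic configuration makes a naive resolver recurse until it dies. Everything here is invented; it says nothing about Fastly's actual code:

```python
# A config whose rules refer back to themselves, so resolving it recurses
# until the process runs out of stack.
import sys
sys.setrecursionlimit(10_000)

config = {"route_a": {"fallback": "route_b"},
          "route_b": {"fallback": "route_a"}}   # passes validation, but cyclic

def resolve(route):
    # Naive resolver with no cycle detection.
    return resolve(config[route]["fallback"])

try:
    resolve("route_a")
except RecursionError:
    print("resolver blew the stack on a 'valid' configuration")
```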
They could probably cut and paste that same page for 90% of future outages. Maybe they need to read this: https://artsy.github.io/blog/2014/11/19/how-to-write-great-o...
So I don't think they are claiming this is a post mortem.
How could you plan for an outage like this by Fastly, and how could you mitigate it?
I was thinking more about this though and it has its own problems. You want a short TTL so failover is fast, but this increases the number of DNS lookups people have to do (and DNS lookups can be very slow!).
Additionally, a short TTL means you're more vulnerable to problems like the Dyn DNS attack from 2016: names with longer TTLs stayed up for longer since resolvers preserved the correct DNS records for longer.
But if you have a long TTL, even if you fail over, you'll still be down for at least as long as the DNS TTL pointing to the bad CDN.
Maybe you could do DNS round-robin against multiple CDN providers at once. Say you used 4; then if one went down, only 25% of requests would fail, and you could just remove the failing entry. This seems very expensive!
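As a rough sketch of that round-robin idea, something like the following could probe each CDN and drop the unhealthy ones from the record set. The hostnames are made up and update_dns() is a placeholder for whatever API your DNS provider actually exposes:

```python
# Keep targets for several CDNs and drop any that fail a health probe.
import urllib.request

CDN_ENDPOINTS = {
    "cdn-a.example.net": "https://cdn-a.example.net/healthz",
    "cdn-b.example.net": "https://cdn-b.example.net/healthz",
    "cdn-c.example.net": "https://cdn-c.example.net/healthz",
    "cdn-d.example.net": "https://cdn-d.example.net/healthz",
}

def probe(url):
    try:
        return urllib.request.urlopen(url, timeout=3).status == 200
    except Exception:
        return False

def update_dns(targets):
    # Placeholder: push the surviving targets to your DNS provider here.
    print("serving round-robin across:", ", ".join(sorted(targets)))

healthy = {host for host, url in CDN_ENDPOINTS.items() if probe(url)}
if healthy:
    update_dns(healthy)   # with 4 CDNs, losing one still leaves 75% of lookups working
```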
Honestly, the cost of these solutions is probably not worth it. The product I work on went partially down during the fastly outage. Then it came back up and everything is back to normal. It really won't impact us much at all. Shrug.
You can't just ensure a config change won't break things in large distributed systems; they're too complex, with too many factors, and there will always be risk. To mitigate that risk, you'd want to design your system to do progressive, regional rollouts, with canaries, to attempt to detect and isolate problems before a widespread outage occurs. Even if you have all of this set up, there is still a risk that your regions and systems are not fully isolated and outages could cascade anyway.
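A minimal sketch of what such a progressive, canaried rollout might look like, with made-up region names and stand-in helpers for the deploy and monitoring steps:

```python
# Region-by-region rollout with an error-rate check between stages.
REGIONS = ["canary", "eu-west", "us-east", "us-west", "apac"]

def deploy_to(region, config):
    print(f"deploying config to {region}")       # stand-in for the real deploy step

def region_error_rate(region):
    return 0.001                                  # stand-in for real monitoring data

def rollout(config, max_error_rate=0.01):
    for region in REGIONS:
        deploy_to(region, config)
        if region_error_rate(region) > max_error_rate:
            raise RuntimeError(f"aborting rollout: elevated errors in {region}")
        # only proceed to the next region once the current one looks healthy

rollout({"version": 42})
```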
There will always be risk, there will always be errors. This is why SLAs and SLOs exist, they define and codify an agreement of what an outage is and what compensation is required if the agreement isn't met.
You can read Fastly's SLA here: https://docs.fastly.com/products/service-availability-sla
Canarying should detect this. It's not clear if they don't do this or if the canary failed to report it.
Sharding by customers could help reduce the blast radius. But maybe not by much if this was a very big customer.
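Something like hashing customers into a fixed number of node pools would at least cap how much of the fleet any one customer's config can touch. A toy sketch with invented numbers:

```python
# Map each customer to one pool of nodes so a bad config only reaches
# a fraction of the fleet.
import hashlib

NUM_POOLS = 10   # each pool is an isolated slice of the fleet

def pool_for(customer_id: str) -> int:
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_POOLS

def nodes_receiving_config(customer_id, all_nodes):
    pool = pool_for(customer_id)
    return [n for i, n in enumerate(all_nodes) if i % NUM_POOLS == pool]

fleet = [f"edge-{i:03d}" for i in range(100)]
affected = nodes_receiving_config("big-customer", fleet)
print(f"a bad config from this customer touches {len(affected)}/{len(fleet)} nodes")
```

Of course, if the pool serving a whale is the one that falls over, the outage is still total for everyone sharing that pool.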
Why is this the case? I don't have too much knowledge of CDN architecture, so I am curious.
Fastly is not really a regular CDN. It is a fully programmable edge cache, with cache-control algorithms decided and controlled by the customer and running at the edge. You can think of your Fastly configuration as a part of your code base, where it is for you to decide whether to perform an action at the edge on a per-request basis rather than at the origin on a per-cached-request basis.
That in turn means that if you deploy to your API/web app 50 times a day, you are likely to deploy your Fastly configuration about the same number of times.
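For a feel of what "deciding at the edge, per request" means, here's a conceptual example in plain Python (not Fastly VCL, and the routing rules are invented):

```python
# Per-request edge decision: cache or pass, which backend, and for how long.
def edge_decision(path: str, headers: dict) -> dict:
    if path.startswith("/api/"):
        return {"cache": False, "backend": "api-origin"}           # pass straight through
    if headers.get("Cookie", "").startswith("session="):
        return {"cache": False, "backend": "web-origin"}           # personalised pages skip the cache
    if path.endswith((".js", ".css", ".png")):
        return {"cache": True, "ttl": 86400, "backend": "static"}  # long-lived static assets
    return {"cache": True, "ttl": 60, "backend": "web-origin"}     # short TTL for everything else

print(edge_decision("/assets/app.js", {}))
```

Because this logic ships like application code, it naturally changes as often as the application does.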
> This incident emphasizes the importance of the Zero Trust model that Cloudflare follows and provides to customers, which ensures that if any one system or vendor is compromised, it does not compromise the entire organization. 
Authentication is a part of a zero-trust model, not the whole thing.
> No single specific technology is associated with zero trust architecture; it is a holistic approach to network security that incorporates several different principles and technologies. 
For 99% of customers, one can argue that Cloudflare is more than sufficient. For 1% of customers, Fastly is arguably the correct choice based on feature set alone.
So, in summary, you can certainly compare the two, but for certain customers Cloudflare lacks the feature set they may choose to use on Fastly.
But how does Fastly avoid this problem? It's really more a symptom of the "web pki" trainwreck than anything else.
I tried looking on Fastly's website for technical details, but like every other corporate website it was an impenetrable mass of marketing bling and partner logos.
I don't want any more of your PR speak or "we value our customers". That's crap and insults my intelligence. STOP getting PR to write your comms; just speak to engineers like engineers. I'd rather get no response than this post.
I hope there are actual details as they complete their investigation. If there isn't a public post-mortem, I am switching away from Fastly.
Obviously on this site we tend to be rather technical people, so we want to know as much detail as possible, but that's something we desire, not something we are entitled to.
> any actions I might need to take
An alternative CDN setup you can switch to when there are problems.