One Fastly customer triggered internet meltdown (bbc.co.uk)
287 points by JulianMorrison 3 months ago | 199 comments

See also: Summary of June 8 outage - https://news.ycombinator.com/item?id=27444005 - June 2021 (ongoing)

Throwing in my positive hot take among all the negative ones here: the immediate response and blog post from Fastly here is really good.

A quick fix, a clear apology, enough detail to give an idea of what happened, but not so much detail that there might be a mistake they’ll have to clarify or retract. What more are you looking for?

Apart from “not have the bug in the first place” -- and I hope and expect they’ll go into more detail later when they’ve had time for a proper post mortem -- I’d be interested to hear what anyone thinks they could have done better in terms of their immediate firefighting.

I don't blame them for having a bug. I do blame them for having a design that doesn't isolate incidents like this (although it is hard to know how much without more details). And I blame our industry for relying so much on a single company (and that isn't a problem unique to fastly, or even our industry).

> And I blame our industry for relying so much on a single company (and that isn't a problem unique to fastly, or even our industry).

The problem is that if fastly is the best choice for a company then there's zero incentive for the company to choose another vendor. Everyone acting in their own best interest results in a sub-optimal global outcome.

It's actually one of the major problems with the global, winner-takes-all marketplace that's evolving with the internet.

Do you roll your own power grid? Do you roll your own ISP + telecoms network?

As a software engineer I live by the ethos that coupling and dependency is bad, but if you unravel the layers you start to realise much of our life is centralised:

Roads, trains, water, electricity, internet

These are quite consolidated and any of these going down would be very disruptive to our lives. Connected software, ie the internet, is still quite new. Being charitable, are these just growing pains in the journey to building out foundational infrastructure?

> Do you roll your own power grid? Do you roll your own ISP + telecoms network?

> Roads, trains, water, electricity, internet

I guess the difference here is that you're (mostly) talking about physical infra, which by definition must be local to where it's being used. We allow (enforce?) a monopoly on power distribution (and separate distribution from generation) because it doesn't make sense to have every power company run their own lines. But with that monopoly comes regulation.

Digital services are different. The entire value prop is that you can have an infinite number and the marginal cost of "hooking up" a new customer is ~$0. This frequently leads to a natural winner-take-all market.

One way to address this is to add regulation to digital services, saying that they must be up x% of the time or respond to incidents in y minutes or whatever. But another way to address it is to ensure it's easy for new companies to disrupt the incumbents if they are acting poorly. The first still leads to entrenched incumbents who act exactly as poorly as they can get away with. The second actually has a chance of pushing incumbents out, assuming the rules are being enforced. And now you've basically re-discovered the current American antitrust laws.

As far as any individual company's best interests, like anything else in engineering, it's about risk vs. reward.

What's the cost of having a backup CDN (cost of service, cost of extra engineering effort, opportunity cost of building that instead of something else, etc.) vs. the cost of the occasional fastly downtime?

I have to imagine that for most companies the cost of being multi-CDN isn't worth what they lose with a little down time (or four hours of downtime every four years).
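To put rough numbers on that trade-off, here is a minimal sketch. Only the "four hours of downtime every four years" rate comes from the comment above; the loss-per-hour and multi-CDN cost figures are made-up placeholders, and the function names are hypothetical.

```python
def expected_annual_downtime_cost(hours_down_per_year: float, loss_per_hour: float) -> float:
    """Expected yearly loss from CDN outages."""
    return hours_down_per_year * loss_per_hour


def multi_cdn_worth_it(hours_down_per_year: float, loss_per_hour: float,
                       multi_cdn_annual_cost: float) -> bool:
    """True if the downtime avoided is worth more than the yearly cost of a second CDN.

    Assumes (optimistically) that a backup CDN removes all of the downtime.
    """
    avoided = expected_annual_downtime_cost(hours_down_per_year, loss_per_hour)
    return avoided > multi_cdn_annual_cost


def break_even_loss_per_hour(hours_down_per_year: float, multi_cdn_annual_cost: float) -> float:
    """Hourly loss above which the second CDN starts paying for itself."""
    return multi_cdn_annual_cost / hours_down_per_year


# "Four hours of downtime every four years" averages out to one hour per year.
HOURS_DOWN_PER_YEAR = 4 / 4

# Placeholder figures, not taken from any real company.
LOSS_PER_HOUR = 10_000.0
MULTI_CDN_ANNUAL_COST = 60_000.0

decision = multi_cdn_worth_it(HOURS_DOWN_PER_YEAR, LOSS_PER_HOUR, MULTI_CDN_ANNUAL_COST)
threshold = break_even_loss_per_hour(HOURS_DOWN_PER_YEAR, MULTI_CDN_ANNUAL_COST)
```

With these placeholder figures the backup CDN does not pay for itself; it only would once an hour of downtime costs more than the entire yearly multi-CDN bill.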

But a CDN _is_ physical infrastructure. Just like power, water, transit, etc. The same economic forces influence CDNs just as much as they do the others.

> One way to address this is to add regulation to digital services, saying that they must be up x% of the time or respond to incidents in y minutes or whatever.

This is good reasoning but I don't think it's possible to legislate service level objectives like that.

> But another way to address it is to ensure it's easy for new companies to disrupt the incumbents if they are acting poorly.

I agree but realistically there will be many cases when a company is far better at something than anyone else. I think the only way to avoid global infra single points of failure is competitive bidding and multi-source contracts, plus competitive pressure to force robustness (which already works quite well).

My datacentres have two sources of power (plus internal UPS), two main internet lines to two different exchange points (and half a dozen others), plenty of bottled water.

At home I have emergency power, water and internet. If the trains stop I drive, if the car breaks I take the train.

But having everything redundantly available costs money. While some redundancy is easy to justify... At some point it becomes hard when the MBA wants to cut costs so he gets a bigger bonus.

There is even a competitive advantage in living with the risk, as you have less costs and overhead... Sure, you might have an outage once every x years for a few minutes... But that's obviously the fault of the development team, duh

This is a classic example of an externality. You use regulations or lawsuits to force the costs back onto the decision makers. Make it so people can collect damages from outages, the company then needs insurance to cover the potential costs of an outage; if the savings from removing a redundancy exceed the increase in insurance premium then it is actually efficient, otherwise it is a net negative. While an actuary may make a mistake and underestimate the likelihood of an outage, they are far less incentivized to do so than the MBA looking for a bigger bonus.

If you can win a lawsuit against a company for a failure it is probably because it wasn't an externality but a contract or warranty agreement they have breached. This is present even in niches with minimal regulations.

Data centers offer the highest uptime guarantees at the highest price tiers. People pay more for Toyotas, new or used, because of their reputation. Quality is a product feature. If MBA's want to decide if they can cut corners, there are already upsides and downsides, the calculation is something they need to make.

Yeah, I'm saying add regulations so it is no longer an externality.

Quality is a product feature when there is competition, monopolies don't suffer from cutting quality.

whereas I'm saying it's not an externality even in the absence of manipulation by specific regulations, which you apparently agree with given your second sentence.

I find that people have a tendency to be overly narrow in considering competition and declaring things monopolies. There are alternative ways to get tasks done that avoid relying on (and paying for) low-quality internet services if companies find it necessary.

An externality is a side effect or consequence of an industrial or commercial activity that affects other parties without this being reflected in the cost of the goods or services involved. Removing redundancy to increase profit margins affects other parties (ie users) without affecting the cost of the goods or services involved. If and only if the MBA's decision to increase risk to the user is reflected in the costs does it cease to be an externality.

And we are specifically talking about an industry over-relying on a single provider of a service. If there were a variety of competing services, the entire point would be moot.

Which is one reason why things like roads, trains, water, electricity, etc. are so heavily regulated. To prevent the companies that hold monopolies over the infrastructure from cutting corners like that.

Right, but that's why we (ideally) put infrastructure costs under the control of an entity (the government) which doesn't have to operate within a system of profit and market competition.

What's this "emergency ... internet"? A hot-spot on a cellular telephone?

but do all your emergency backups have emergency backups?

My emergency backups have emergency emergency backups (but those do not have emergency emergency emergency backups).

> Do you roll your own power grid?

I know plenty of people in Texas who will be buying solar panels and batteries after last winter. I will be doing the same.

> Do you roll your own ISP + telecoms network?

If I could magically get fiber directly to an IX I would gladly be my own ISP. I have confidence I would do as good a job or better than the ISPs I’ve had over the years (yes I realize having hundreds of thousands of customers to service is more difficult than a single home).

> I know plenty of people in Texas who will be buying solar panels and batteries after last winter. I will be doing the same.

I have actually been in the position of having to rely on non-mains power all my life.

It bloody sucks.

How long is "all my life?"

Because it seems like a relatively recent development that off-grid solar power solutions have become affordable and mature enough to not suck on average.

Around 40% of the population of Nigeria (a country of 200+ million) does not have access to electricity. The per capita electricity consumption of Nigeria is _two orders of magnitude lower_ than the US.

> Around 40% of the population of Nigeria (a country of 200+ million) does not have access to electricity. The per capita electricity consumption of Nigeria is _two orders of magnitude lower_ than the US.

... And how is this relevant?

Because that is where I live?

> Because that is where I live?

... And how is this relevant?

You asked how long "all my life" I've spent only partially reliant on the power grid is, and someone else has provided you some context (that I actually mean all my life, which you can probably infer to be longer than two decades at least).

As to the second half of your original question, solar power is not the only kind of backup power that exists.

What do you think the suboptimal outcome was in this case?

Is it better for websites to be unavailable at different times as opposed to all at the same time? This seems to be a really common assumption people make re these occasional cloud take-downs, but I don't really understand why people think it.

Seems to me that in cases like this, everyone operating in their own self interest, by all using the best value service, is actually the best outcome. Everyone suffered the same outage at the same time, which minimised the overall cost of the outage (one resolution, one communication line etc. as opposed to many).

With short term outages like this you're probably right that a single point of failure doesn't matter.

It's the longer term outages that are the problem. That's because we start talking about knock on effects.

It's not really a problem if your supplier (and all others) have a short term issue (assuming you don't run super lean). It may be a headache if your supplier has a longer term issue while you set up another supplier (or use your less desired one) but it's not a disaster. It's a big problem if all suppliers are down for more than a short time.

I'd assume people seeing this as a market failure are talking about it in the "this highlights the problem" kind of way, not the "this event was a true disaster" way.

Sure. But again, I feel like using a "big, everyone uses it" kind of supplier is the best mitigation of the "what if I have to replace this" problem.

If a massive vendor shutters or has a long term failure, at least you're in the same boat as a bunch of experts, which is a much better place to be than "my obscure or self-rolled solution is now orphaned / hacked / broken".

The unspoken assumption always seems to be "my self-configured solution will have fewer and/or shorter issues than the massive publicly traded solution that everyone uses" but that seems ... very incorrect.

Also... a reverse proxy / CDN always is a single point of failure. The question is... is it a single point of failure that you personally own. In my opinion shared single points of failure are desirable. It's just obviously more efficient.

> Is it better for websites to be unavailable at different times as opposed to all at the same time?

Yes. If one site is down, it may hurt my productivity a little bit, and I may have to adjust what I work on. But if the entire internet is down that has drastic impact on my productivity, and depending on what I am working on at the time, may completely block me.

I would wager that they run a single configuration, as it grants a significant economy of scale, rather than vertical partitioning of their stack, which would require headroom per customer and/or slice. This way you just need global headroom.

Having done some similar stuff with varnish in the past (ecommerce platform), they’re likely taking changes in the control panel and deploying them to a global config - and someone put something lethal in that somehow passed validation and got published, and did not parse.
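One common guard against that failure mode is a staged rollout: load the new config on a handful of canary nodes and only publish globally if they accept it. A minimal sketch follows; it is not how Fastly deploys, the names are hypothetical, and JSON is only a stand-in for whatever format the real serving process parses.

```python
import json
from typing import Callable, Iterable, List


def parses_cleanly(raw_config: str) -> bool:
    """Stand-in for 'the serving process can load this config'."""
    try:
        json.loads(raw_config)
        return True
    except json.JSONDecodeError:
        return False


def apply_if_parses(node: str, raw_config: str) -> bool:
    """Hypothetical per-node apply step: succeeds only if the config loads."""
    return parses_cleanly(raw_config)


def staged_rollout(raw_config: str, nodes: Iterable[str],
                   apply: Callable[[str, str], bool],
                   canary_count: int = 1) -> List[str]:
    """Apply to canaries first; touch the rest of the fleet only if every canary succeeds.

    Returns the nodes that accepted the config.
    """
    all_nodes = list(nodes)
    canaries, rest = all_nodes[:canary_count], all_nodes[canary_count:]
    updated: List[str] = []
    for node in canaries:
        if not apply(node, raw_config):
            # Abort here: the damage is limited to the canaries tried so far.
            return updated
        updated.append(node)
    for node in rest:
        if apply(node, raw_config):
            updated.append(node)
    return updated


good = staged_rollout('{"ttl": 60}', ["edge-1", "edge-2", "edge-3"], apply_if_parses)
bad = staged_rollout('{"ttl": ', ["edge-1", "edge-2", "edge-3"], apply_if_parses)
```

The point of the design is that a config which passes validation but fails to load takes out one canary rather than the whole fleet.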

This looks a quite likely scenario.

But then we still don't know what they fixed: was it the incorrect configuration or the underlying bug? I would expect the former rather than the latter, because it is probably not very difficult or dangerous to change that specific configuration, while fixing bugs in the code seems riskier and would probably take more time for testing.

We'll see if they will publish a post-mortem. It has become more or less a normal custom these days (and they are frequently quite interesting).

They were pretty clear about this in their response (linked in the article)

    Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25.
So they did both. First reverted the config then later fixed the bug.

Sometimes, it’s something as small as a missing parenthesis that makes things go wrong.

(Slight joke here)

If they have a bug that can crash their servers, they likely won't want to publicize the details until it is fully patched. I wouldn't expect that detail for a while.

As an ex-CTO with lots of firefighting experience, I want to give credit to the team for being able to identify such a user-triggered bug in such a short period of time. Hardly anyone would anticipate that a single user could trigger a meltdown of the internet!

It reminds me of that ancient joke: A QA engineer walks into a bar. He orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd.

First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone.

You forgot the part where the QA engineer ticks off the checklist and gives the bar a pass.

The blog post actually says it’s fixed already (but I would definitely expect them to keep the details private until they’re 100% sure, yeah)

I'm sorry, but I disagree. They gave the BBC enough detail that a very misleading headline was produced as a result. True, the main blame lies with the BBC, but it also comes across — to me, anyway, maybe I'm being too cynical — as a bit of an excuse from Fastly.

"But a customer quite legitimately changing their settings had exposed a bug in a software update issued to customers in mid-May, causing "85% of our network to return errors"

They are careful to make clear that the customer did nothing wrong and that the problem was a bug in their software.

I know — as I've said, the main blame lies with the bbc. However, as it's reported, it comes across very much as Fastly trying to save face. Maybe the blame is entirely on the bbc, maybe Fastly were naive in thinking that giving them this information wouldn't result in irresponsible headlines.

What verbiage exactly are you looking for from Fastly here? I'm hearing "Nobody else did anything wrong, it was 100% a software bug on our end, and we're sorry about that." How much more responsibility are you asking them to take before you would no longer be considering them to "save face"? I'm trying to come up with an ironic exaggeration here, but I can't, because it kinda seems like Fastly has already taken 100% full responsibility and there's no room left for exaggeration.

How on earth do you figure that is anyone's fault but the BBC?

Read Fastly's statement. There is nothing about it blaming the customer(s) at all. There is nothing trying to save face.

What is your point here?

> Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.

Is it necessary to refer to "a customer" at all in this statement? What would be problematic if the above were rewritten as something like:

> Early June 8, a configuration change triggered a bug in our software, which caused 85% of our network to return errors.

The advantage is that you wouldn't get ignorant reporting that "one customer took down the internet". I'm not sure there are disadvantages that outweigh that.

Yes, because it is explaining that it was a valid *customer* configuration, which is a separate set of concerns from, say, infrastructure config.

The important adjective "valid" means it was completely normal/expected input and thus not the fault of the customer.

It's perfectly clear you've come at this with a pre-determined agenda of "I bet fastly, like most other public statements after corporate booboos I've seen, will try to shrug this one off as someone else's fault" after reading the BBC's title and haven't bothered to read it at all until now.

Not necessarily valid. Could have been a bad entry that passed validation when it shouldn’t have, which would still not be the customer’s fault.

> We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change.

Verbatim from Fastly: https://www.fastly.com/blog/summary-of-june-8-outage

> Is it necessary to refer to "a customer" at all in this statement? What would be problematic if the above were rewritten as something like:

That's literally what happened. They even say it was a valid configuration change, it's very blameless.

Saying "a configuration change" loses critical context. I would have assumed that this was in some sort of deployment update, not something that a customer could trigger. Why would you want less information here?

OK, I'm replying to your comment since it's the least aggressive — thanks for that!

I'll fully retract my statement. This is 100% the BBC's fault, 0% Fastly's.

Can I make one small suggestion that might help to prevent this kind of misleading reporting in future, though? What if Fastly produce the detailed statement they have, with as much accurate technical detail as possible AND a more general public-facing statement that organisations such as the BBC can use for reporting, that doesn't include such detailed information that can easily be misconstrued?

Most of the replies to yours haven’t been aggressive. Ironically it’s your comments that have come across the worst by using terms like “aggressive”, “blame” and “fault” in the first place. Calling other people’s comments aggressive is pretty unfair. One might even say hypocritical.

I hate being part of a dogpile, so yeah sorry about that, I just open up things to reply to, and then come back later and write it up just to find that I'm one of 10 people saying the same shit.

edit: FWIW I had a very negative initial reaction to the headline as well.

Not at all — I understand.

My apologies for any hostility on my part.

No worries. I probably didn't take it very well because my intentions were genuine and I really wasn't trying to level anything beyond the very mildest criticism towards Fastly. I recognise, however, that even that was misplaced — I think the BBC headline just got me too worked up!

In what way is the BBC at fault for this? Their title is objectively true. A _valid_ configuration setting that was used by a customer _did_ cause fastly to have an outage.

It's not limited to one specific customer (i.e this customer isn't the only customer who could have caused the issue, presumably), but it _was_ something the customer (legitimately) did. It wasn't a server outage. It wasn't a fire. It wasn't a cut cable.

"a customer quite legitimately changing their settings (BBC: one fastly customer) had exposed a bug (BBC: triggered internet meltdown) in a software update issued to customers (fastly admitting, when combined with 'legitimately', that fastly are at fault) in mid-May".

People love to hate mainstream media

Not me — I adore the BBC. I've always paid my licence fee gladly, and I've been waxing lyrical about the latest BBC drama on Twitter just this very hour. On this issue, I believe they've made a mistake.

Whatever happened to nuanced opinion, where you can see good and bad in the same entity? Why do some people insist so strongly on absolutes?

How on earth is the headline making a mistake?

Here's some excerpts:

Fastly, the cloud-computing company responsible for the issues, said the bug had been triggered when one of its customers had changed their settings.

Fastly senior engineering executive Nick Rockwell said: "This outage was broad and severe - and we're truly sorry for the impact to our customers and everyone who relies on them."

But a customer quite legitimately changing their settings had exposed a bug in a software update issued to customers in mid-May, causing "85% of our network to return errors", it said.

The headline accurately portrays the story given the limit on headlines.

The somewhat awkward "a customer pushed a _valid_ configuration" is Fastly making sure they aren't pushing any blame onto the customer.

There is no customer blaming here. None at all.

> Is it necessary to refer to "a customer" at all in this statement?

That’s how autopsies work. You describe the cause and resolution. The cause was a bug in the customer's control panel.

They’re not trying to absolve themselves of responsibility.

The wording "a configuration change triggered a bug" in this context sounds (to me) like it was a configuration change made by Fastly to something on their backend.

The wording which was actually used makes it clear that that was not the case.

Don't forget that the BBC were also initially affected by this, and jumped on it a lot sooner than most outlets, so they have skin in the game.

Would love it if it was the BBC that triggered the problem :D

Not in the article: how one Fastly customer triggered the bug. Just a quote and a promise that an RCA will be posted.

Fastly doesn't appear to be sharing that detail. Their own blog post is similarly vague about the exact cause. https://www.fastly.com/blog/summary-of-june-8-outage

Edit: That blog post does say this: "On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances."

The scheduled maintenance on May 12 was this: https://status.fastly.com/incidents/dlsphjqst537

Based on that, it sounds like maybe a configuration change could deploy new cache nodes with ip addresses that a customer hasn't explicitly allowed to talk to their backend:

"When this change is applied, customers may observe additional origin traffic as new cache nodes retrieve content from origin. Please be sure to check that your origin access lists allow the full range of Fastly IP addresses"

Why is everyone banging on about this? It's a blog post from the same day; a decent post mortem takes a while to put together, and assuming the bug isn't fully patched across their entire CDN, why would they post the information?

I wasn't "banging on", I was answering why the article didn't mention the cause...because the source didn't either.

Plus it would be weird to present just that specific information, outside of the context of a post mortem / failure chain analysis type discussion.

That's true, though they are also saying things like "We created a permanent fix for the bug and began deploying it at 17:25.". "Permanent fix" sort of implies they understood the issue really well.

That's my point though. Even though they may understand the immediate flaw in their code that caused the issue, there's not much use (for them or their customers) in just talking in detail about that specific flaw.

I'd go so far as to argue that the specifics of the flaw are immaterial right now. At this stage, the important thing is that they have identified a specific code change that was the proximate cause of the issue, and have a mitigation in place. This is contrasted with more mysterious and hard-to-track-down failures. ("We are working to understand why our systems are down and will post another update in 30 minutes")

What will take time, and the thing which will be interesting, is failure tree analysis. (You might hear the phrase "failure chain" or "root cause" but IMO it's quite rare for things to be so linear). That can help identify opportunities to improve processes at many different levels of the product lifecycle.

Humans are fallible, and there's no way we can write bug-free software, so the solution has to be more robust than "hope that every member of our organization never makes a mistake again"

Yes, I was saying I would have avoided words like "permanent fix", because it sets unrealistic expectations.

Everyone is "banging on" because there are important lessons to be learned from such incidents, and people want to learn. They hunger for more details about the generalizable aspects of the bug, even if a full post mortem that also covers internal processes etc. might take longer to do. Having participated in many post mortems, in many roles, for systems just as complex, I believe it's entirely possible to provide that information the next (not same) day. Is that still setting the bar too high? Perhaps. Fastly deserves kudos for providing even the level of information that they have, since that's already above the pathetic industry standard, but I don't think there's anything bad about wanting more. Defensiveness is the enemy of effective post mortems.

The paragraph you quoted is just describing a side effect of adding nodes. That excerpt appears in every one of their capacity expansion announcements, going back years.

Ah, interesting. Though the change itself reads like it was just adding capacity in one location (Newark). I don't see any other changes mentioned for that date.

My uneducated hypothesis is that Fastly runs Varnish. Presumably they have some process that collects data from their config system and generates the VCL (Varnish's pseudo-C configuration language, which compiles directly to C). Somehow a customer configured something in such a way that it generated a bad VCL file, which then caused it to lose all configuration, or caused one domain to incorrectly garble up traffic for all domains.

I can poke plenty of holes in this hypothesis, like fastly likely not deploying configuration to all nodes but only subsets. Looking forward to the deeper post.
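If that hypothesis is anywhere near right, the usual defence is to validate customer-supplied values before they are spliced into generated config. A minimal sketch of that idea follows. It is not Fastly's generator: the function names are hypothetical, and the emitted rule is only a VCL-like illustration.

```python
import re

# Conservative hostname labels: ASCII letters, digits and inner hyphens only.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
HOSTNAME_RE = re.compile(rf"{_LABEL}(?:\.{_LABEL})+")


def to_ascii_hostname(hostname: str) -> str:
    """Convert a hostname to its ASCII (Punycode) form, then validate the result."""
    try:
        ascii_name = hostname.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError(f"unencodable hostname: {hostname!r}") from exc
    # Validate AFTER conversion: the encoder can leave characters such as
    # '{' or '"' in place, and those must never reach the generated config.
    if not HOSTNAME_RE.fullmatch(ascii_name):
        raise ValueError(f"illegal characters in hostname: {hostname!r}")
    return ascii_name


def render_host_rule(hostname: str) -> str:
    """Render one VCL-like rule for a customer domain (illustrative template only)."""
    safe = to_ascii_hostname(hostname)
    return f'if (req.http.host == "{safe}") {{ return (pass); }}'


rule = render_host_rule("example.com")
```

Rejecting the input outright keeps the generator simple; trying to escape arbitrary bytes into the template leaves more room for the kind of surprise described above.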

They say as much here [1].

> Fastly is a shared infrastructure. By allowing the use of inline C code, we could potentially give a single user the power to read, write to, or write from everything. As a result, our varnish process (i.e., files on disk, memory of the varnish user's processes) would become unprotected because inline C code opens the potential for users to do things like crash servers, steal data, or run a botnet.

Personally, my hypothesis is that somebody uploaded a configuration for their domain `https://IVCL_{raise(SIGSEGV)}.com` (edit: the preceding URL used to contain a heart emoji between I and VCL, apparently HN prefers ASCII, too) in a way that, rather than converting to Punycode, passed a few bytes that weren't in the 96 legal characters accepted by the VCC compiler and caused some kind of undefined behavior.

[1] https://docs.fastly.com/en/guides/guide-to-vcl#embedding-inl...

It's interesting to me that they still run Varnish, since it doesn't have https/tls built-in. I do get that VCL is more expressive than similar capabilities in Nginx, HAproxy, etc. But it would seem like less work to add expressiveness to one of those (via Lua maybe?) than to maintain both Varnish and the separate components needed for both tls ingress and egress.

Varnish and VCL were the hotness at the time Fastly was coming up so it sort of makes sense, but Varnish also doesn’t support WebSockets, can’t proxy gRPC, etc., so they’re very limited in functionality vs Cloudflare.

I doubt they’d build it on Varnish today, but it’s a bit late now since they allow custom VCL (which has now proven to be a terrible idea) and will have to support that for eternity.

They can run two or more serving stacks side by side though if it comes to that.

And to add to your point, they also have a separate process that speaks QUIC. It’s an interesting tech stack with a lot of technical debt.

> Varnish also doesn’t support WebSockets

Yes it does. I recently added WebSockets support to a Varnish instance. See https://varnish-cache.org/docs/trunk/users-guide/vcl-example...

>it’s a bit late now since they allow custom VCL (which has now proven to be a terrible idea)

Ah, okay. I took a look, and it appears they at least didn't allow varnish modules or inline C. But, still, a fairly hefty anchor for the future.

You can write custom VCL snippets directly in the Fastly control panel. Migrating all existing customer VCL to another language would be an enormous task.

Fastly directly allows you to run custom VCL: https://developer.fastly.com/reference/vcl/

I assume that the bug isn't 100% fixed yet and the instant they publish how the bug took place 1000 yahoos will immediately try to re-create it.

If you find out you have a DOS that can take down the internet you might be wary about sharing details until you've hammered things out.

The BBC title is a terrible title. The customer didn’t cause the outage. The bug caused the outage.

That was my take as well and very much disenfranchises what happens with titles like that.

If you have a design flaw, you don't hint at onus upon the first person to fall foul of it - and we all know how quickly many people read the news, running entirely upon the title alone (something we can all do).

But we have seen a drive towards click-bait, search-bot-friendly headlines written to garner hits. Even the BBC has, IMHO, leant more towards such tabloid-style headlines over the years, and that is just sad.

The bug is the root cause, the customer's action is the proximal cause (or "trigger")

Placing the emphasis on "one customer" still creates a false narrative though. Note the headline doesn't even say it was a bug. This headline provokes the question "Who was it and what did they do?" rather than the more insightful "What was the bug?".

If a single bug caused the "internet meltdown", it's fairly likely that the bug was triggered by one person so there's no need to emphasize that part.

Presumably this leads to narratives with higher engagement. The market for technical root cause analysis amongst BBC readership is likely fairly low, so getting into the technicals is not going to sell clicks.

Humanizing the problem OTOH is (presumably) a narrative that resonates with their audience. Who was this guy? What was he doing? How does he feel about bringing down the Internet? (Is he sorry?) Is he anything like me? Could I accidentally bring down the entire Internet? Should I be worried about that? etc etc.

Not saying modern mass media has no room for technical truth, but I would argue their business model demonstrates over and over again it's this other meta-narrative they're tilting at.

Maybe working at a service provider has warped my English, but for me, reading "one customer took down X" is usually a pejorative about the service and not a customer witch hunt.

This is like saying "Fat man destroys bridge"

no, the engineers/builders who didn't implement proper safety tolerance broke the bridge.

The straw is not responsible for breaking the camel's back.

I'm sorry, but that is BS. If a heavy haulage drives over a bridge and it collapses, the first headline will be "Heavy truck destroys bridge". And that is fine. (Subsequent investigations reveal the real reason and will have different headlines. How about the engineers did their part but maintenance didn't. Or routing selected a route that wasn't suitable to begin with. Doesn't change the fact that the bridge collapsed as a heavy truck drove over it.)

Why should the BBC talk about a bug in the headline if the source says "valid customer configuration"? They don't write for industry insiders. (Plus that industry is shit to begin with and tries to establish bugs as some kind of force of nature no one can do anything about.)

Who cares about headlines?

Headlines are written by editors, not reporters, to maximize CTR and minimize length.

You can't fit detail and nuance in a headline. The point is to get people to read the article, not inform.

If people are drawing conclusions from reading just the headline, not the article, then you can safely ignore them. There's no point in getting mad about it.

Well, it is BBC, what do you want.

Unless it’s been changed since your comment, I don’t think that’s what the title is claiming. A trigger is different from a cause.

Sadly, most news headlines are terrible. When I read the headline, I clicked through for the details. Fortunately the details made the whole thing clear.

I too wish for a world where headlines aren't terrible, but we currently live in a world of clickbait.

I love how it almost suggests it was a customer's fault. If only they hadn't changed their settings!

I know you’re mostly joking, but it does no such thing. The blog post explicitly states it was a “valid customer configuration change”.

The Fastly blog does not blame the user. The BBC article kind of suggests it ("One Fastly customer triggered internet meltdown").

The BBC article includes the line "a customer quite legitimately changing their settings had exposed a bug".

That's 'bottom of a locked filing cabinet in a disused lavatory with a sign on the door saying "Beware of the Leopard"'-level stuff compared to the clickbait headline, though.

What's your proposed headline?

Fastly bug triggered internet meltdown

That's a great headline for yesterday.

This article details a new piece of information, and the headline reflects that.

Yeah, you’re right, the headline is clickbaity.

No, it is designed to be suggestive -- to shift attention and perhaps even blame. And of course the BBC picked up on it for clickbait reasons. Brilliant (but perhaps evil) PR by Fastly.

Customer configuration is an irrelevant detail that should have been left out until a full RCA. What does it matter that it was valid? So an invalid configuration would have meant it was indeed the _customer's_ fault??

A more fair treatment would have been, "a customer pentested us and won".

An edge case triggered an edge condition at our edge location.

"Edge cases" and "conditions" are some of my trigger phrases. Given enough time or users, they are inevitable and passing them off as rare to avoid dealing with them, especially when you're aware of their existence, drives me up the wall.

Unless it's an edge use case you're not supporting, don't sell me your cost avoidance on any production systems.

An edge case is just that though? Something that hadn't been thought of, and makes stuff break, then you can figure out a fix.

I don't know any dev that would agree with "passing them off as rare to avoid dealing with them" - rather, "it's a low priority, but it needs fixing" or "ok, this is an edge case, but holy crap it's a bad one"

A very edgy comment! Well done.

quite edge-ucational I must say

It suggests it only as long as someone doesn't read the article.

The way I read it, they were trying to communicate the fact that a customer fiddling with their own configuration brought down large swathes of the internet for everybody else. That absolutely deserves to be in the headline.

>It suggests it only as long as someone doesn't read the article.

Which is why it's clickbait. Sensational title, humdrum article.

That's not clickbait, it's editorial discretion. A customer's actions did take down Fastly for other customers. What about that is misleading or untrue?

Should be "Untested code triggers problems"

It could just as easily have been tested extensively but no testing is 100% guaranteed, especially in a world wide service as complex as a CDN. People who think perfect code comes purely from testing are delusional.

hey hey, production is just big test.

fta: "The outage has raised questions about relying on a handful of companies to run the vast infrastructure that underpins the internet."

I read that phrase every time something like this happens and yet we all still rely on the same handful of companies.

The answer is that before the existence of those companies to rely upon, people didn't rely upon them, they just accepted the lag, or they hand hacked the same globally distributed approach on their own, and it sucked for them, and it wasn't too great for users either. CDNs are big because that's what their function is, to reach the world and absorb traffic spikes, and take the complicated business of distribution of edge servers out of the hands of the people who just want to run a website.

The trade off here is intrinsic and accepting the risks of big CDNs is the right answer.

Exactly. The question isn’t “a few CDNs” vs “Many CDNs”. There are too many economies of scale. It’s really CDN vs Not? (Roll your own just isn’t feasible except for a half dozen of the largest tech firms)

Well, you could use two CDNs instead of just one. But that costs money.

I think it will continue to happen. People and orgs, when picking a service, don't have an incentive or a way to ask around to see what percentage of similar-service users are using that particular service. People and orgs will always flow towards cheap/popular/well-known services.

On the other hand, those handful of companies could be asked to structure their services so that an outage only affects a portion of customers and not all their customers. However! That would be more inefficient for them, and more expensive, and that cost would cause the people and orgs mentioned earlier to just flow towards the company that took those shortcuts.

There is an incentive to see what everyone else is using, so that you can use it too. "Choose boring technologies" applies to infra as well. Go with a big, stable, well-known CDN or cloud provider instead of trusting your service to some fly-by-night startup, etc.

That's because many such businesses have grown so large because they benefit from scale.

Some examples:

Social network: you only engage on one if your friends/family/coworkers are on the same network.

Search engine: needs to index "the whole Internet", which is less expensive per user if you have more users

CDN: works best if you have edge nodes everywhere, which is quite capital intensive, which is why you need many customers to distribute it over.

... and so on. We might not like it, but many of these quasi monopolies are based on fundamental economics, not (just) on the greed of the companies.

> Social network: you only engage on one if your friends/family/coworkers are on the same network.

This one isn't exactly based on fundamental economics given that federated social networks exist. Email has similar network effects and is not centralized.

None of the federated social networks seems to have reached the scale of the biggest centralized social networks.

Which leads me to believe that economics and incentives favor big, centralized social networks.

I think this argument could be made for many of these.

If you're making a purchasing recommendation for your company, do you want to tell your boss that you're recommending not going with your best option, or even 2nd, but that your company should use the 4th or 5th best CDN as a way to diversify the Internet? Seems pretty altruistic, but not a great way to keep your job.

It's the tech equivalent of "thoughts and prayers" isn't it?

There are at least 5 major CDNs out there: CloudFront, Fastly, Cloudflare, Akamai, Google CDN. You can use more than just 1. Shopify uses two: Akamai and Fastly.

If you use them as pure file-serving CDN, then sure. But once you start adding extra logic, headers, routing, etc. the features don't fully align. Or you need to keep to the minimum common featureset.

So, after multi-cloud now we need to go multi-CDN? Half joking here, it's actually a good idea although probably it's not worth the cost. I think GitHub (at least from my casual looking at that behavior during the outage) nailed it because they must have some kind of active/passive CDN config. They were affected by the outage but after a few minutes (less than whole Fastly outage duration) they were serving assets again.

It is getting increasingly tricky to have enough redundancy at a basic level to keep a major player's outage from affecting you. For example, you would probably want at least one authoritative DNS server that isn't either of your CDN providers. It also helps to know some details about how these players sometimes use each other, like that Google's Firebase uses Fastly.

Are you bigger than a major player is the question I'd be asking. Maybe risking it is fine.

I don't know that size of your operation is the right metric to gauge whether to bother with this. If 100% of your revenue, for example, is from online sales, it might be worth it even if you're small. But yes, it's often not worth it.

I agree, it's just that for some DR scenarios there's only so much you can do. And 'the internet is down' is hard to plan for. If CNN is offline due to some outage and you're a smaller enterprise, then are people really still doing online eCommerce stuff, or are they waiting for their favorite sites to come back up as a signal that things are back to normal?

But we're talking about what was a 1-hour outage. Does it make sense to spend more than 1/8000 of your revenue to avoid an hour of downtime per year, especially when you will never lose a full hour of revenue from being down for one hour, because many customers come back to buy the thing they were going to buy anyway?
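As a rough sketch of the arithmetic in that question: the snippet below assumes revenue is spread evenly across the year (real traffic is not), and `recovered_share` is a made-up parameter for the share of affected customers who come back and buy later. Neither assumption comes from Fastly or the article.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def outage_cost_fraction(outage_hours: float, recovered_share: float = 0.0) -> float:
    """Fraction of annual revenue lost to an outage.

    Assumes revenue is spread evenly over the year. recovered_share is a
    hypothetical estimate of how much of the missed revenue returns later.
    """
    if not 0.0 <= recovered_share <= 1.0:
        raise ValueError("recovered_share must be between 0 and 1")
    return (outage_hours / HOURS_PER_YEAR) * (1.0 - recovered_share)


def break_even_spend(annual_revenue: float, outage_hours: float,
                     recovered_share: float = 0.0) -> float:
    """Most you could spend per year on redundancy before it costs more than the outage."""
    return annual_revenue * outage_cost_fraction(outage_hours, recovered_share)
```

With 8,760 hours in a year, the "1/8000" figure is about right for a one-hour outage, and any non-zero `recovered_share` pushes the break-even spend lower still.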

Does anyone know if Azure/Google/Amazon can provide some 'multi-cdn' setup out of the box? The way to change these points of failure is for the big boys to change their defaults.

Do it at the DNS layer. Route53 has failover support out of the box that should work for it. You can set up a monitor and it will switch DNS entries on a failure.
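To illustrate the failover idea, here is a minimal sketch in plain Python. It is not the Route53 API: the class name, the hostnames, and the three-failure threshold are all hypothetical, and a real setup would also have to deal with DNS TTLs and resolver caching.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverRecord:
    """Toy model of a health-checked DNS failover record (hypothetical, not a real API)."""
    primary: str
    secondary: str
    failure_threshold: int = 3  # assumed: consecutive failed checks before switching
    _consecutive_failures: int = field(default=0, init=False)

    def report(self, primary_healthy: bool) -> None:
        """Record the result of one health check against the primary CDN."""
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1

    def target(self) -> str:
        """Hostname the DNS entry should currently point at."""
        if self._consecutive_failures >= self.failure_threshold:
            return self.secondary
        return self.primary


record = FailoverRecord("primary-cdn.example.net", "secondary-cdn.example.net")
for check_passed in (False, False, False):
    record.report(check_passed)
current = record.target()  # now the secondary, after three failed checks in a row
```

Requiring several consecutive failures is a design choice to avoid flapping on a single dropped check; the cost is a slower switch when the primary really is down.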

Would a smaller CDN provider have no outages at all?

No, but it would take down fewer sites if it does have one.

Is that a feature?

If 12 small CDNs have 12 outages in a year (combined, 1/year each), each time bringing down 10,000 websites, is that better than 1 large CDN having one outage during the year bringing down 120,000 websites?

If I'm the website owner I think I prefer the latter, my customers blame the CDN instead of me. If I'm the CDN owner I definitely prefer the latter, more customers to amortize my costs over.
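The comparison above can be written out as a small sketch. The numbers are the ones from the comment; the `Scenario` class and its field names are just illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    providers: int
    outages_per_provider_per_year: int
    sites_per_provider: int

    def site_outages_per_year(self) -> int:
        """Total (site, outage) pairs per year across all providers."""
        return (self.providers
                * self.outages_per_provider_per_year
                * self.sites_per_provider)

    def worst_single_incident(self) -> int:
        """Sites taken down at once when one provider fails."""
        return self.sites_per_provider


many_small = Scenario(providers=12, outages_per_provider_per_year=1, sites_per_provider=10_000)
one_large = Scenario(providers=1, outages_per_provider_per_year=1, sites_per_provider=120_000)
```

Both scenarios come out to 120,000 site-outages per year. What differs is the blast radius of a single incident (10,000 sites versus 120,000), which is the part the "is that a feature?" question is really about.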

From a customer point of view, diversity is far better: if Sainsbury's is closed for some reason I'll go to Tesco.

Certainly don't want a situation where all the shops are closed.

It may be more acceptable to have more frequent outages if the impact radius (number of websites or services impacted) is smaller.

Obviously that isn't an honest comparison. If you're asking whether or not 10 small CDN providers could provide a more robust, higher quality service with more uptime than one large CDN provider, then I think the answer is probably "yes."

There's kind of a race-to-the-bottom[0] wrt. decentralisation.

It'd be better for the internet as a whole if we don't always pick the most popular (so when your email's CDN goes down you can still communicate on chat, when CNN goes down you can still read BBC). But as an individual I have strong incentives to pick the one everyone else picks, because that's presumably the most stable/documented/lowest cost due to volume.

[0] https://slatestarcodex.com/2014/07/30/meditations-on-moloch/

One of the side-effects of the efficiency obsession that capitalism generates, is the constant strive for centralization.

We took a technology stack designed to survive nuclear attacks, and turned it into something where a single bug can take down half the services on it. Why? Because on the flip side, a single improvement in the centralized service can automatically cascade to all the businesses using it.

Efficiency is a double-edged sword.

Shame on the BBC for such a misleading headline. But Fastly probably shouldn't have even given them detail — it's irrelevant and bound to be misreported. Just own up to your bug and be done with it.

Why is Amazon dependent on someone else’s cloud infrastructure?

They can't afford Cloudfront bills either.

Someone answered this yesterday. CloudFront is good for video and large download assets (plus very low margins) but not for images and smaller stuff which Fastly is much faster at:


Also, large companies use multiple CDNs.

I believe Google's Firebase also uses Fastly.

Hedging your bets

Still impressed at how quickly Fastly recovers. Kudos.

Even if they fixed it in 20 minutes the chain reaction caused by Fastly being down took much more than 20 minutes to resolve itself.

An example is imgix:


It took them 11 hours to recover from Fastly going down for their claimed 40 minutes.

> We are still seeing increased origin load as a result of the earlier outage from our service provider.

Does this mean that some companies using Fastly could have major costs because of the increased origin load?


Global shared control plane updates are frighteningly common. For example Cloudflare regularly brags about how your configuration updates are pushed around the world in single-digit seconds. Sure, it is an amazing feature, but it opens you up to this sort of issue.

All changes to critical infrastructure should be a gradual rollout (emergencies aside). Instant sounds nice, until it isn't. If this rolled out to one region for the first hour it likely would have been caught and Fastly could press the "stop all rollouts" button.
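A gradual rollout with a stop condition might look roughly like the sketch below. Nothing here reflects how Fastly or Cloudflare actually deploy; `apply_change` and `is_healthy` are hypothetical callbacks, and a real system would let each region soak for a while before checking health.

```python
from typing import Callable, Iterable


def staged_rollout(
    regions: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> tuple[list[str], bool]:
    """Apply a change one region at a time, halting at the first unhealthy region.

    Returns (regions that received the change, whether the rollout was halted).
    """
    updated: list[str] = []
    for region in regions:
        apply_change(region)
        updated.append(region)
        # In practice you would wait (the "first hour" above) before this check.
        if not is_healthy(region):
            return updated, True  # the "stop all rollouts" button
    return updated, False
```

The point of the ordering is that a bad change is only ever live in the regions already visited, so the first region acts as a canary for everything after it.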

Did they just miss a where clause on the update statement??

> The outage has raised questions about relying on a handful of companies to run the vast infrastructure that underpins the internet.

Big surprise

The thing with CDNs is that they may have many edge locations, but the cache does not sit there.

They frequently have common caching servers located close by, so maybe every 10 or 100 edge locations share a single cache location.

The edge is a reverse proxy and probably handles SSL handshakes. So if your cache is down, all your edge locations in that area are down.

This may be true of some, but it is not true of Fastly. Each of their edge nodes is a Varnish cache. Because they are multi-tenant, when Varnish crashes it crashes hard and takes everyone with it.

The question in various threads is why not have redundancy -- but the point of a CDN is to have extra servers and capacity and lots of locations to make individual crashes just flow elsewhere.

But if the single customer with a valid-yet-crashable config had lots of traffic all over the world... it'll take everything out at once.

Redundancy of CDN is more expensive, and still requires DNS failover. People do the calculation and usually decide that 30 min of downtime every couple years is worth the saving on vendors and code and hassle. They don't like it, but every site that was down made that decision.

It reminds me of the inevitable tension between efficiency and resilience...

cat /users/*.conf >> global.conf && ./restart_everything

I remember there was a similar bug that hit Cloudflare before??

So they have an infrastructure where customer data is not separated from other customers, that's pretty terrifying! This means another bug of the same kind can also cause a global disruption any time in the future (mistakes will happen).

Yes, that might happen.

But the only way to avoid that is to give each customer its own private hardware, which seems prohibitively expensive (and may not even prevent all failure sources).

Them pointing to the customer as the source of the issue is not OK. This reminds me of that Chase scandal, when they transferred a huge amount of money to pay off a full loan instead of an installment, then blamed some guy in India for the alleged mistake. His 2 or 3 superiors had approved it and the interface was really horrific, but they decided to pin the blame on that guy.

So, I am glad things are OK, but it definitely is not that one customer who is to blame for this outage.

I don't think that Fastly is pinning blame on the customer. That seems to be the BBC trying to bait us into reading the article.

I am sad to report that their clickbait worked on me.

I think you are right, it is my bad.

Again, The Cloud is just someone else's server. Most things in The Cloud don't belong there in my opinion, but it is yet another trend everyone must follow and check off on their list from management. Let's all chant in unison and play out the rituals sent down from above.


> Again, The Cloud is just someone else's server.

Sure, but in a lot of cases, that "someone else" is an entire team of experts in their particular niche that can do a better job of the specific task at hand than I ever can hope to.

Is this always the case? No. Is it sometimes the case? Yes.

But this is primarily a CDN, with some edge compute, isn't it? What's the functional alternative here, "I'll go ahead and lease 2U of rack space and fly my ops people to every major city in which I have users"?

The sad reality is that if your app serves any real business value and you can't afford to hire a team that can quickly handle scaling attacks, simply running your own server isn't really a viable option anymore.

The internet's original distributed nature just isn't compatible with the sheer scale of billions of active users.

THIS version of the internet isn't.

I hold out hope that the next version won't be controlled by corporate entities.

Unless you are a very large company a CDN is not something you can realistically build and run in house.

If you are not a very large company, you might not need the full functionality of a very large CDN either.

Now, building the functionality you require can still be unrealistic to build yourself. One thing that springs to mind is DDoS protection.

When your site can expect visitors from all around the world you need a CDN that also has a world wide presence. Using a CDN can have a pretty big impact on the performance of your site, and it is well known that users avoid sites that are slow.

Colocation is cheap. But you're correct about DDoS; that is the only thing I am not prepared for. I will forever go with colocation.

1Gbit is nothing nowadays and can be saturated in seconds; my purposes do not justify the cost of 10g transit. Even owning 4U when I only need a VPS is overkill. But I like owning a small dusty cube of internet. So there's that.

Just an unsolicited plug of an interesting podcast I started listening to recently (co-hosted by HNer slackerIII)- https://downtimeproject.com/

Hope this one gets the treatment.

I'm not sure there is quite enough detail in this post to make a whole episode, but if any more details about the bug come out we will definitely cover it.

Awesome! I think my plug was a little forward for hump day on HN lol. Love the podcast.

"Fastly senior engineering executive Nick Rockwell said: "This outage was broad and severe - and we're truly sorry for the impact to our customers and everyone who relies on them.""

I wonder here, are they sorry enough to run the company on two different tech and software stacks and data centers? Like how people don't buy all their disk drives (SSDs) from the same vendor? How much would they spend for the "sorry"?

I’ve heard this before but I’m skeptical about this approach. It’s got to cost close to 2x to develop and maintain two stacks — if they’re sharing a lot, that would defeat the point. Then you’d have two stacks tested about as well as the one today. Instead, you could put that investment towards testing and improving quality on the one stack.

I know a second stack hopefully has different bugs. But is that likely (to a meaningful extent)? Reimplementations often reintroduce old issues (which suggests many people make similar mistakes), plus it’s hard to imagine the first stack not influencing the second in various ways.

Redundant implementations only help with uncorrelated problems. With complex software, problems are often quite correlated across multiple implementations. Add to that the additional complexity of managing those multiple implementations and the potential for problems may be net worse. The consensus among safety experts is that multiple implementations are a bad approach to safety or reliability.

I would assume different developers and product managers create different bugs.

You'd have to consider what actually happens when 50% of your infrastructure goes down. Can the remaining 50% cope with 100% of the load? If not, then you still get a complete failure. So then the question becomes can you reprovision between stack A and stack B very rapidly, both running from the same hardware pool, while 50% of your infrastructure is down. Now you've introduced the potential for correlated failures (single hardware pool and network), plus added complexity and load due to reprovisioning just when things are already overloaded. Not easy to get this right, so might not actually increase reliability, as now you've got two different stacks that can each independently fail and get you into this mess.
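The "can the remaining 50% cope with 100% of the load?" question is easy to sanity-check with a sketch. The load and capacity numbers below are invented for illustration, and the function ignores the reprovisioning and correlated-failure problems described above.

```python
def survivor_utilization(total_load: float, capacity_per_stack: float,
                         stacks: int, failed: int) -> float:
    """Utilization of the surviving stacks once `failed` stacks are gone.

    A result above 1.0 means the survivors are overloaded.
    """
    survivors = stacks - failed
    if survivors <= 0:
        raise ValueError("no surviving stacks to carry the load")
    return total_load / (survivors * capacity_per_stack)


# Hypothetical sizing: two stacks, each built for half the load plus 20% headroom.
tight = survivor_utilization(total_load=100, capacity_per_stack=60, stacks=2, failed=1)

# To ride out the loss of one stack, each stack has to carry the full load alone.
full = survivor_utilization(total_load=100, capacity_per_stack=100, stacks=2, failed=1)
```

In the first case the surviving stack is asked for well over its capacity, so losing half the infrastructure still means a full outage; the second case survives only by paying for double the capacity up front.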

That only protects against one type of failure, eg one of your vendors going down. The article suggests this was a problem in their own software, which in all likelihood would not be protected against by using different stacks.

Two developers and two product managers would not create the same bugs but different ones, I'd think. Given that each writes thousands of correct lines and implements features without bugs, it would be a coincidence (except perhaps in poorly understood complex parts) if two developers created the same bugs.

I've read studies about bugs in the past where developers who were told to write some code produced different bugs.

Are there any organizations out there that actually make multiple versions of their product and deploy it together, in real time, all the while being completely agnostic to the end user?

It seems like it will be very hard to justify the immense costs for this.

Making your infra heterogeneous is very underrated. Or at least I don't hear of many companies doing that, even large ones.

How do you go about different software though, have one center running a version or two behind to fail over to?

In general it is expensive. In most cases you will need experts that need knowledge of both system A and system B. Those people are more difficult to find. Also you have a higher chance of errors due to confusion between the two systems.

Without more details on what/where the "software bug" was, it's hard to say if that would've helped at all.
