
Google had a comparable outage several years ago.

https://status.cloud.google.com/incident/cloud-networking/19...

This event left a lot of scar tissue across all of Technical Infrastructure, and the next few months were not a fun time (e.g. a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust).

I'd be curious to see what systemic changes happen at FB as a result, if any.




To expand on why this made me think of the Google outage:

It was a global backbone isolation, caused by configuration changes (as they all are...). It was detected fairly early on, but recovery was difficult because internal tools / debugging workflows were also impacted, and even after the problem was identified, it still took time to back out the change.

"But wait, a global backbone isolation? Google wasn't totally down," you might say. That's because Google has two (primary) backbones (B2 and B4), and only B4 was isolated, so traffic spilled over onto B2 (which has much less capacity), causing heavy congestion.


Google also had a runaway automation outage where a process went around the world "selling" all the frontend machines back to the global resource pool. Nobody was alerted until something like 95% of global frontends had disappeared.

This was an important lesson for SREs inside and outside Google because it shows the dangers of the antipattern of command-line flags that narrow the scope of an operation instead of expanding it. I.e., if your command was supposed to be `drain -cell xx` to locally turn down a small resource pool, but `drain` without any arguments drains the whole universe, you have developed a tool which is too dangerous to exist.
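
To make the contrast concrete, here's a minimal sketch in Python (the flag names and structure are entirely hypothetical, not Google's actual tool) of the safer shape: scope is a required argument, and the widest scope demands a separate, deliberate acknowledgement:

    #!/usr/bin/env python3
    """Hypothetical 'drain' CLI: no arguments means no action, never 'everything'."""
    import argparse
    import sys

    def main() -> int:
        parser = argparse.ArgumentParser(
            description="Drain serving capacity from a resource pool.")
        # Scope must be named explicitly; there is no default.
        scope = parser.add_mutually_exclusive_group(required=True)
        scope.add_argument("--cell", action="append",
                           help="cell to drain (repeatable)")
        scope.add_argument("--all-cells", action="store_true",
                           help="drain every cell (also requires --i-know-this-is-global)")
        parser.add_argument("--i-know-this-is-global", action="store_true")
        args = parser.parse_args()

        if args.all_cells and not args.i_know_this_is_global:
            print("refusing: global drain requested without explicit acknowledgement",
                  file=sys.stderr)
            return 2

        targets = ["ALL CELLS"] if args.all_cells else args.cell
        # Dry-run by default in this sketch; the real action would live elsewhere.
        print("would drain: " + ", ".join(targets))
        return 0

    if __name__ == "__main__":
        sys.exit(main())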


Agreed, but with an amendment:

If your tool is capable of draining the whole universe, period, it is too dangerous to exist.

That was one of the big takeaways: global config changes must happen slowly. (Whether we've fully internalized that lesson is a different matter.)
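
One way to read "slowly" in practice is a staged rollout that only widens scope after a bake period at each stage. A rough sketch, with made-up stage names, durations, and apply/health-check hooks (none of this is Google's actual rollout system):

    """Staged config rollout: widen scope only after each stage bakes cleanly."""
    import time
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Stage:
        name: str
        targets: list[str]
        bake_seconds: int  # how long to wait and watch before widening scope

    def roll_out(stages: list[Stage],
                 apply: Callable[[str], None],
                 healthy: Callable[[str], bool]) -> bool:
        """Apply the change stage by stage; halt (leaving later stages untouched)
        the moment any target in the current stage looks unhealthy."""
        for stage in stages:
            for target in stage.targets:
                apply(target)
            time.sleep(stage.bake_seconds)
            if not all(healthy(t) for t in stage.targets):
                print(f"halting rollout at stage '{stage.name}'")
                return False
        return True

    if __name__ == "__main__":
        plan = [
            Stage("canary", ["cell-aa"], bake_seconds=1),
            Stage("one-region", ["cell-ab", "cell-ac"], bake_seconds=1),
            Stage("global", [f"cell-{i:02d}" for i in range(20)], bake_seconds=1),
        ]
        ok = roll_out(plan,
                      apply=lambda t: print(f"applying to {t}"),
                      healthy=lambda t: True)
        print("rollout complete" if ok else "rollout halted")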


As FB opines at the end, at some point it's a trade-off between power (being able to access / do everything quickly) and safety (having speed bumps that slow larger operations down).

The pure takeaway is probably that it's important to design systems where "large" operations are rarely required, and frequent ops actions are all "small."

Because otherwise, you're asking for an impossible process (quick and protected).


SREs live in a dangerous world, unfortunately. It's entirely possible the "tool" in question is a shell script that gets fed a list of bad cells but some bug causes it to get a list of all the cells instead.

Some tools are well engineered, capable of the Sisyphean task of globally deploying updates; others are rapid prototypes that, sure, are too dangerous to exist. But the whole point of SREs being capable programmers is that the work has problems that are most efficiently solved with one-off code that just isn't (because it can't be) rigorously tested before being used. You can bet there was some of that used in recovering from this incident. (I'm sure there were many eyes reviewing the code before it was run, but that only goes so far when you're trying to do something you never expected, like having to revive Facebook.)


The other problem is scale: the standard "save me" for tools like this is a --doit and a --no-really-i-mean-it, defaulting to a "this is what I would've done" mode. That falls apart the moment the list of actions is longer than the screen, and you're expecting a long list anyway: after all, how can you really tell the difference unless the console scrolls for a really long time?

There are solutions to that, but of course these sorts of tools all come into existence well before the system reaches a size where how they work becomes dangerous.
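
One possible mitigation (purely illustrative, not taken from any real tool): have the dry run print an aggregate summary instead of one line per action, and make the operator type the total back before anything proceeds:

    """Summarize the plan instead of scrolling it, then confirm by count."""
    from collections import Counter

    def confirm_plan(actions: list[str]) -> bool:
        """Print counts per action kind, then require the operator to type
        the total number of actions back to proceed."""
        by_kind = Counter(a.split(":", 1)[0] for a in actions)  # e.g. "drain:cell-xx"
        print("planned actions by kind:")
        for kind, count in sorted(by_kind.items()):
            print(f"  {kind}: {count}")
        print(f"total: {len(actions)} actions")
        typed = input(f"type the total ({len(actions)}) to proceed: ").strip()
        return typed == str(len(actions))

    if __name__ == "__main__":
        plan = [f"drain:cell-{i:02d}" for i in range(200)]
        if confirm_plan(plan):
            print("proceeding (not really; this is a dry-run sketch)")
        else:
            print("aborted")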


> If your tool is capable of draining the whole universe

Why did I think of humans when I read this? :P


I feel like this explains so much about why the gcloud command works the way it does. Sometimes feels overly complicated for minor things, but given this logic, I get it.


But the FB outage was not a configuration change.

> a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network


From yesterday's post:

"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.

...

Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear that there was no malicious activity behind this outage — its root cause was a faulty configuration change on our end."

Ultimately, that faulty command changed router configuration globally.

The Google outage was triggered by a configuration change due to an automation system gone rogue. But hey, it too was triggered by a human issuing a command at some point.


I'm inclined to believe the later post as they've had more time to assess the details. I think the point of the earlier post is really to say "we weren't hacked!" but they didn't want to use exactly that language.


This is kind of like Chernobyl, where they were testing how hot they could run the reactor to see how much power it could generate. Then things went sideways.


The Chernobyl test was not a test to drive the reactor to its limits, but a test to verify that the inertia of the main turbines was big enough to drive the coolant pumps for X amount of time in the case of a grid failure.


Of possible interest:

https://www.youtube.com/watch?v=Ijst4g5KFN0

This is a presentation to students by an MIT professor that goes over exactly what happened, the sequence of events, mistakes made, and so on.


Warning for others: I watched the above video and then watched the entire course (>30 hours).


Now I know what I'm doing the rest of this week...


As already said, the test was about something entirely different. And the dangerous part was not the test itself, but the way they delayed the test and then continued to perform it despite the reactor being in a problematic state and despite the night shift, who were not trained on this test, being on duty. The main problem was that they ran the reactor at reduced power long enough to have significant xenon poisoning, and then pushed the reactor to the brink when they tried to actually run the test under these unsafe conditions.


I'd say the failure at Chernobyl was that anyone who asked questions got sent to a labor camp and the people making the decisions really had no clue about the work being done. Everything else just stems from that. The safest reactor in the world would blow up under the same leadership.


At first I thought it was inappropriate hyperbole to compare Facebook to Chernobyl, but then I realized that I think Facebook (along with Twitter and other "web 2.0" graduates) has spread toxic waste across a far larger area than Chernobyl. But I would still say that it's not the _outage_ which is comparable to Chernobyl, but the steady-state operations.


>internal tools / debugging workflows were also impacted

That's something that should never happen.


> a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust

Is that normal at Google? Making people feel bad for an outage doesn't seem consistent with the "blameless postmortem" culture promoted in the SRE book[1].

[1] https://sre.google/sre-book/postmortem-culture/


"Blameless Postmortem" does not mean "No Consequences", even if people often want to interpret it that way. If an organization determines that a disconnect between ground work and a customer's experience is a contributing factor to poor decision making then they might conclude that making engineers more emotionally invested in their customers could be a viable path forward.


Relentless customer service is never going to screw you over in my experience... It pains me that we have to constantly play these games of abstraction between engineer and customer. You are presumably working a job which involves some business and some customer. It is not a fucking daycare. If any of my customers are pissed about their experience, I want to be on the phone with them as soon as humanly possible and I want to hear it myself. Yes, it is a dreadful experience to get bitched at, but it also sharpens your focus like you wouldn't believe when you can't just throw a problem to the guy behind you.

By all means, put the support/enhancement requests through a separate channel+buffer so everyone can actually get work done during the day. But, at no point should an engineer ever be allowed to feel like they don't have to answer to some customer. If you are terrified a junior dev is going to say a naughty phrase to a VIP, then invent an internal customer for them to answer to, and diligently proxy the end customer's sentiment for the engineer's benefit.


I think of this in terms of empathy: every engineer should be able to provide a quick and accurate answer to "What do our customers want? And how do they use our product?"

I'm not talking esoterica, but at least a first approximation.


Why? We're all customers as well as employees.


Because we as engineers create software for our customers, and if you don't understand who your customers are, how can you create software that actually suits their needs?

Very rarely are we our own customers


I would argue that SREs are consistently our own customers in a way unique to SRE.


Ironic, as measurability came up in another comment thread I'm in.

I'd say from a technical perspective SREs are, but there's a potential (depends on product) gap between their technical goals and user goals.

e.g. What does "p95 latency is spiking" actually mean to the end user?


From the SRE book: "For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the 'wrong' thing prevails, people will not bring issues to light for fear of punishment."

If it's really the case that engineers are lacking information about the impact that outages have on customers (which seems rather unlikely), then leadership needs to find a way to provide them with that information without reading customer emails about how the engineers "let them down", which is blameful.

Furthermore, making engineers "emotionally invested" doesn't provide concrete guidance on how to make better decisions in the future. A blameless postmortem does, but you're less likely to get good postmortems if engineers fear shaming and punishment, which reading those customer emails is a minor form of.


I work at Google and have written more than a few blameless postmortems. You don't need to quote things to me.

Is what was described above "finger pointing or shaming"? I don't work in TI so I didn't experience this meeting but it doesn't seem like it is. It also doesn't sound to me like this was the only outcome, where the execs just wagged their fingers at engineers and called it a day. Of course there'd be all sorts of process improvements derived from an understanding of the various system causes that led to an outage.


Yes, if I were made to attend a mandatory training in which my leaders read customer emails saying that the outage caused them to lose trust in the company, I would feel ashamed. That was surely the goal of that exercise. The fact that there were also process improvements doesn't make it any less wrong.

Thankfully, other comments in this thread suggest that this is not how Google normally does things.


That's fucked up.


Not the original googler responding, but I have never experienced what they describe.

Postmortems are always blameless in the sense that "Somebody fat fingered it" is not an acceptable explanation for the causes of an incident - the possibility to fat finger it in the first place must be identified and eliminated.

Opinions are my own, as always


> Not the original googler responding, but I have never experienced what they describe.

I have also never experienced this outside of this single instance. It was bizarre, but it tried to reinforce the point that something needed to change -- it was the latest in a string of major customer-facing outages across various parts of TI, potentially pointing to cultural issues with how we build things.

(And that's not wrong, there are plenty of internal memes about the focus on building new systems and rewarding complexity, while not emphasizing maintainability.)

Usually mandatory trainings are things like "how to avoid being sued" or "how to avoid leaking confidential information". Not "you need to follow these rules or else all of Cloud burns down; look, we're already hemorrhaging customer goodwill."

As I said, there was significant scar tissue associated with this event, probably caused in large part by the initial reaction by leadership.


I assume it was training for all SREs, like "this is why we're all doing so much to prevent it from recurring".


Facebook also had a nearly 24-hour outage in 2019. https://www.nytimes.com/2019/03/14/technology/facebook-whats... (or http://archive.today/O7ycB )


> leadership read out emails from customers telling us how we let them down and lost their trust

That's amazing. I would never have expected my feedback to a company to actually be read, let alone taken seriously. Hopefully more companies do this than I thought.


In my experience, this is done more to make leadership feel better and to deflect blame away from leadership.


I just read your comment out loud if it helps.


The most remarkable thing about this is learning that anyone at Google read an email from a customer. Given the automated responses to complaints of account shutdowns, or complaints about app store rejections, etc, this is pretty surprising.


I'd love to get a read receipt each time someone at Google has actually read my feedback. Then it might be possible to determine whether I'm just shaking my fists at the heavens or not.


> mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust

The same leadership that demanded tighter and tighter deadlines and discouraged thinking things through?


> I'd be curious to see what systemic changes happen at FB as a result, if any.

If history is any guide, Facebook will decide some division charged with preventing problems was an ineffective waste of money, shut it down, and fire a bunch of people.


> This event left a lot of scar tissue across all of Technical Infrastructure, and the next few months were not a fun time (e.g. a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust).

Bullshit.

I'd believe this if it were not completely impossible for 99.999999% of Google "customers" to contact anyone at the company, or if I didn't have a decade and a half of personal and professional observations of people getting fucked over by Google and having absolutely nobody they could contact to try and resolve the situation.

You googlers can't even bring yourselves to talk to other workers at the company who are in a caste lower than you.

The fundamental problem googlers have is that they all think they're so smart/good at what they do that it just doesn't seem to occur to them that they could have possibly screwed something up, or that something could go wrong or break, or that someone might need help in a way your help page authors didn't anticipate... and people might need to get ahold of an actual human to say "shit's broke, yo." Or worse, none of you give a shit. The company certainly doesn't. When you've got a near monopoly and have your fingers in every single aspect of the internet, you don't need to care about fucking your customers over.

I cannot legitimately name a single google product that I, or anyone I know, likes or wants to use. We just don't have a choice because of your market dominance.


Hi there. I'm a Googler and I've directly interfaced with a nontrivial number of customers such that I alone have interfaced with more than 0.000001% of the entire world population.


All you need to do is browse any online forum, bug tracker, subreddit dedicated to a consumer-facing Google product to know that Google does not give a rat's ass about customer service. We know the customer is ultimately not the consumer.


Maps, Mail, Drive, Scholar, and Search are all the best or near the best available. That doesn’t mean I like every one of them or I wouldn’t prefer others, but as far as I can tell the competition doesn’t exist that works better.

GCP and Pixel phones are a toss-up between them and competitors.

It isn’t market dominance; nobody has made anything better.


Search is famously kind of bad the last few years, but even Maps isn’t that great.

(Data errors I’ve seen this week: the aerial imagery over Brisbane Australia is from ~2010 but labeled 2021, the coastline near Barentsburg in Svalbard is wrong and doesn’t match any other map.)


I’m not saying any of it is great, just that there aren’t better replacements that make me want to switch.


> You googlers can't even disdain yourselves to talk to other workers at the company who are in a caste lower than you.

We must know different googlers, then. It's good to avoid painting a group with the same brush.



