Retrospective and technical details on the recent Firefox outage

ghostly_s · on Feb 2, 2022

Went into this with an indignant "failures in telemetry should NEVER bring down core functionality!" feeling, left with a more nuanced understanding of the fault and impressed and reassured by the mitigation steps being taken. That's a great post-mortem.

nathansherburn · on Feb 3, 2022

I felt a similar way when it first happened. I was fuming "an update should never break your core functionality across all devices!" only to find that it wasn't actually the latest update but rather code that had been there for ages.

It's hard to hate something once you truly understand it, I guess.

minerva23 · on Feb 3, 2022

Anyone know why telemetry is sent from the main process rather than having a separate reporter process?

marcan_42 · on Feb 3, 2022

That would be significant extra engineering effort without much clear benefit. It's always possible to look back and say "if only they'd done that", but this bug was a freak coincidence and it's not possible to foresee things like that, nor is it worth the engineering effort to try to avoid them with hammers as big as "move all telemetry to another process". You'd need a much stronger reason than "what if telemetry happens to trigger a longstanding bug in the network stack" to decide to go with that.

In this case, that the problem was triggered by telemetry was a coincidence; it would've been triggered by other bits of coden in the future as they moved to Rust.

rockdoe · on Feb 3, 2022

It would have made no difference: the hang was in the networking code & process, not the telemetry code (and whatever process that is in). Anything that sent a message with the "bad" header type would have provoked the hang.

As to why, dunno, presumably it's extra effort for an unclear gain. Telemetry code wouldn't be parsing hostile input etc. And it doesn't stop bugs like this either.

marcan_42 · on Feb 3, 2022

Unfortunately, a good fraction of the commenters here don't seem to be doing the same. There's a whole pile of people throwing around "failures in telemetry should NEVER bring down core functionality!"...

deng · on Feb 2, 2022

I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.

To me, this is a perfectly valid write-up with a good lessons learned. They have written it in a very diplomatic way, but to me, it is absolutely clear that Google screwed up here. How can you make such a change to a default behavior of critical infrastructure unannounced? That's just reckless towards your customers, and solidifies my belief to stay away from GCP.

If they had properly announced the change, even if the Firefox team hadn't then tested beforehand, at least the DevOps team would have put one and one together and just changed back to HTTP/2 and the outage would have lasted maybe 10 minutes. Instead, they frantically went through their git log to see what in the code base might have triggered this bug. Everyone who has been in such a position knows how incredibly stressful this is. I'd be absolutely livid at Google in their position. That it took two hours to fix this is clearly their fault.

yjftsjthsd-h · on Feb 2, 2022

> I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.

They set themselves a higher standard by marketing as the good guys who fight for the user, and then made any number of moves that said users viewed as not being in their interests. Of course they get more blame. Like, Chrome has issues, but they're issues in line with being made by an adtech company; we might be unhappy at Google breaking adblockers (https://www.eff.org/deeplinks/2021/12/chrome-users-beware-ma...), but it's not out of character. Mozilla can say "More power to you. Mozilla puts people before profit, creating products, technologies and programs that make the internet healthier for everyone." (https://www.mozilla.org/en-US/) or they can, say, make Google the default engine ($), bake in a proprietary service (Pocket), rip out features (RIP compact theme), overrule user autonomy (Want to install an extension? Better upload it to Mozilla to get signed so they permit you to run it on your own computer!), ship a marketing extension through the "experiments" feature (https://blog.mozilla.org/en/products/firefox/update-looking-...).... but not both. Either empower the user, or don't, but don't pretend to empower the user while ripping away their control.

ygjb · on Feb 2, 2022

Yep, you are correct. Each of those decisions was made over the protests of a vocal but relatively small group of users.

You can't please all people all of the time, and I agree the pocket integration, and the looking glass add were mistakes, but the other items were directly related to sustainability of the project ($, eng cycles), or user safety.

You can disagree with them as much as you like, but Firefox continues to support the ultimate in user control by releasing their product as open source. Roll your own build that doesn't require those features, sideload your add-ons, and/or fork the product.

As a user, the average Firefox user has far more control over the browser than Chrome, Edge, or Safari users do, and have the flexibility to use one of many Firefox forks that have the same beef as you.

yjftsjthsd-h · on Feb 2, 2022

Since the first thing that group protested was telemetry, I don't know how we could possibly know that it's a "vocal but relatively small group of users". In general, though, "you can't please everyone, and not that many people objected" isn't really a compelling argument; the criticism is still valid, and people being unwilling to make the effort to make a fuss, fork, find workarounds, or switch browsers doesn't mean that they're okay with it. For that matter, there's not a lot of feedback in general; how many people objected, and how many said they were in favor, compared to the overwhelming majority who never said anything?

> You can disagree with them as much as you like, but Firefox continues to support the ultimate in user control by releasing their product as open source. Roll your own build that doesn't require those features, sideload your add-ons, and/or fork the product.

By that standard Chrome is a paragon of user control. Firefox, as it actually exists, in the thing that Mozilla offers users to download, claims to care about user empowerment while constantly reducing users' power.

pessimizer · on Feb 2, 2022

Especially considering that 95% of Firefox's users voted with their feet.

yjftsjthsd-h · on Feb 2, 2022

In fairness, it's hard to tell what's Firefox throwing away the thing that made them special vs Google abusing its monopoly position to push its way into the browser market.

dblohm7 · on Feb 3, 2022

I think you ought to recheck your numbers.

ttybird2 · on Feb 7, 2022

They still have not released the server side pocket source code...

moeris · on Feb 2, 2022

I agree that Google is at fault here for failing Firefox. But Firefox is guilty of failing its users. Why should the functioning of a browser be dependent on telemetry working? It sounds like if there is high enough latency in their telemetry, or if request for telemetry start failing, it's possible for that to disrupt using the network stack at all. They have a massive design flaw, and they didn't even mention that in the article. Maybe they have good reasons for designing a single point of failure that relies on a cloud provider, but it's not clear what those might be since they don't address it.

phkahler · on Feb 2, 2022

>> Why should the functioning of a browser be dependent on telemetry working?

That was my thought after reading the start of it. Like "Oh no, Firefox has fallen into that void where their need for telemetry trumps users". Another product falling down at doing its primary function. But after reading the entire report that's just not fair at all. A bug relating to telemetry and their network stack caused failure in that networking code which affected everything. That is entirely different than software depending on telemetry to function properly. It wasn't by design that failing to phone home broke the software, it really was just a bug - a fairly obscure one. Sounds like if someone wanted they could just as easily blame the use of Rust in Firefox since some of the code involved was written in Rust. But that's not a fair or accurate conclusion either.

account42 · on Feb 3, 2022

Blame Telemetry and Rust are both fair here since they added complexity that resulted in this bug slipping past testing.

calcifer · on Feb 2, 2022

> Why should the functioning of a browser be dependent on telemetry working?

It isn't. The bug was in the networking stack, and it just happened to be triggered by a GCP change which effected the telemetry service. Firefox having telemetry has nothing to do with the issue here.

dralley · on Feb 2, 2022

Thus proving OP's point.

"The amount of blame that is assigned to the Firefox team is staggering"

barrkel · on Feb 2, 2022

That's not quite right. A single socket thread does all the requests and telemetry is multiplexed with user traffic. If telemetry is different in some way to other network traffic, then it's always possible for it to cause problems with user traffic.

Telemetry is different to user traffic - it's less important! - but of course any in-process QoS would still create a point of interaction with user traffic.

moeris · on Feb 2, 2022

> It isn't.

So you're saying that Firefox did not on fact have an outage due to a change in their telemetry servers? That's not what the article said.

I understand that you mean to say that it isn't intended for networking to be taken down by telemetry. That's nonetheless what happened, and it could have been prevented by treating telemetry as a different class of traffic (not collocating it with normal requests), or by not having it, as others point out.

marcan_42 · on Feb 2, 2022

So you're saying telemetry should be handled as a separate process that has nothing to do with the rest of the browser, and treated like a hostile service? Because that's the only way you'd have avoided this.

It's natural for all the network stuff that goes in inside a browser to share code. You can say what you want about telemetry (I'm not a huge fan, personally), but this was a dumb bug and it is completely unreasonable to expect some kind of adversarial design "just in case a freak bug triggers on telemetry network requests".

mananaysiempre · on Feb 2, 2022

> So you're saying telemetry should be handled as a separate process that has nothing to do with the rest of the browser, and treated like a hostile service? [... T]his was a dumb bug and it is completely unreasonable to expect some kind of adversarial design "just in case a freak bug triggers on telemetry network requests".

I absolutely agree that this a dumb bug having little to nothing to do with telemetry. It is not even the first case-sensitivity HTTP/3 bug I’m personally encountering in the course of completely casual use[1]. Probably not the last, either, those joints ain’t gonna oil themselves.

At the same time, you know what? I’m glad you suggested this, because I certainly didn’t think of it. Yes, in an ideal world, telemetry absolutely should be a separate process (or thread, or at least not share an event loop—a separate “hang domain”, a vat[2] if you want). And so should everything else off the critical path.

I’m not saying Firefox is bad for doing it differently. I’m saying it’s silly that Firefox is forced to play OS to such an extent because the actual one isn’t up to its demands.

[1] https://github.com/ndilieto/uacme/pull/11

[2] http://www.erights.org/elib/concurrency/vat.html

Treblemaker · on Feb 3, 2022

(off topic)

I read that at first glance as

> Probably not the last, either, those joints ain’t gonna roll themselves.

and thought, hm, I need to remember this debugging technique next time I'm stumped.

acdha · on Feb 2, 2022

They're saying what is clearly explained in the article:

“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”

KronisLV · on Feb 2, 2022

Yes, but the fact that telemetry is in place was the cause for the issue.

> So you're saying that Firefox did not on fact have an outage due to a change in their telemetry servers?

Not the telemetry code. Not the fact that it "could" happen elsewhere. But rather the fact that it was in place and in this instance happened because of it.

Not that it matters that much. Regardless of the particular cause, a browser failing to work because of something changing externally is crazy (at least to me), no matter how you look at it.

Edit: this is now largely a duplicate of the other comment, hmm: https://news.ycombinator.com/item?id=30179023

LeifCarrotson · on Feb 2, 2022

How do you reach that conclusion? From the article:

> It just so happens that Telemetry is currently the only Rust-based component in Firefox Desktop that uses the [viaduct/Necko] network stack and adds a Content-Length header. This is why users who disabled Telemetry would see this problem resolved ...

The article contradicts your conclusion. If Firefox did not have telemetry, the bug would have had no impact, and users would not have suffered an outage.

> ...even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.

And then the article contradicts itself and agrees with you using some heavy-duty doublethink. Sure, if there were hypothetically other Rust services using the buggy network stack, they'd also have hit the bug: BUT THERE ARE NONE. The bug was in code which is only running because it's used by the telemetry services, so even though it might be in a different semantic layer it's the fault of the browser trying to send telemetry.

As a user, I place very low (often negative) importance on the tools I use collecting telemetry data, or on protecting DRM content, or on checking licensing status. They should focus on doing the job I'm trying to do with them on my computerr, serving the uses of the user, rather than doing something that someone else wants them to do. Sure, I understand that debugging and quality monitoring are easier with logs and maybe with telemetry, so I can understand using a few resources in the background to serve some of that data, but it must never get in the way of actual work getting done.

acdha · on Feb 2, 2022

> The article contradicts your conclusion. If Firefox did not have telemetry, the bug would have had no impact, and users would not have suffered an outage.

This is your mistake: as explained in the article, it could have affected any component. Telemetry happened to hit it first but anything using HTTP/3 with that path would have been affected.

“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”

KronisLV · on Feb 2, 2022

> ...as explained in the article, it could have affected any component. Telemetry happened to hit it first but anything using HTTP/3 with that path would have been affected.

Is this really relevant, though? To the users who were unable to use their browsers normally it doesn't matter that this problem could have occurred elsewhere as well, but rather that it did occur here in particular.

If particular sites would break, then that could be debugged separately, but as it stands even people who'd be perfectly fine with browsing regular HTTP/1.1 or HTTP/2 sites were also now impacted, not even due to opening a site that they wanted to visit themselves, but rather some background piece of functionality.

That's not to say that i think there shouldn't be telemetry in place, just that the poster is correct in saying that this wouldn't be such a high visibility issue if there was no telemetry in place and thus no HTTP/3 apart from sites the user visits.

acdha · on Feb 2, 2022

The comment I was replying to worded in a way which was trying to attribute blame to the telemetry service. As shown in this thread, there's a certain ideological position which welcomes any attacks on telemetry and I think that's a distraction from the technical discussion about how Mozilla could better have avoided a bug in their networking libraries. Recognizing this as a bug in the network stack first triggered by Telemetry makes it clear that this is not the place to have the millionth iteration of flamewars about that service but rather questions like the design of that network loop or not having test suite of the intersection of those particular libraries.

KronisLV · on Feb 2, 2022

> Recognizing this as a bug in the network stack first triggered by Telemetry makes it clear that this is not the place to have the millionth iteration of flamewars about that service but rather questions like the design of that network loop or not having test suite of the intersection of those particular libraries.

Surely one could adopt a "shared nothing" approach, or something close to it - a separate process for the telemetry functionality which only reads things from either shared memory or from the disk, where the main browser processes could put what's relevant/needed for it.

If a browser process fails to work with HTTP/3, i don't think the entire OS would suddenly find itself not having any network connectivity. For example, a Nextcloud client would still continue working and synchronizing files. If there was some critical bug in curl, surely that wouldn't necessarily bring down web browsers, like Chromium, either!

Why couldn't telemetry be implemented in a similarly decoupled way and thus eliminate the possibility of the "core browser" breaking due to something like this? Let the telemetry break in all the ways you're not aware of but let the browser continue working until it hits similar circumstances (if it at all will, HTTP/3 isn't all that common yet).

I don't care much for flame wars or "going full Stallman", but surely there is an argument to be made about increasing resiliency against situations like this one. Claiming that the current implementation of this telemetry is blameless doesn't feel adequate.

angus-prune · on Feb 2, 2022

> I can understand using a few resources in the background to serve some of that data, but it must never get in the way of actual work getting done.

Which is exactly how the code was intended to work. Firefox did not design their software to hang in the event of telementry losing internet access.

I don't know firefox's internal architecture or its development, what follows is pure conjecture.

Their intention seems to be to slowly migrate the codebase from C++ to Rust. That telemetry is the only function to so far rely on their new rust networking library viaduct (and thus trigger the bug) could be because they wanted use their least important fucntionality as a test bed. In which case, if there wasn't any telemetry, a different piece of code would have been migrated to rust first and triggered this same bug. Without the telemtry, it would have presumably taken them longer to realise that things had broken, let alone resolve it.

moeris · on Feb 2, 2022

Firefox also said that this switch to default was an unannounced change. But a quick Google shows that it was announced

> In the coming weeks, we’ll bring HTTP/3 to more users when it's enabled by default for all Cloud CDN and HTTPS Load Balancing customers: you won't need to lift a finger for your end users to start enjoying improved performance.

In their blog on June 22, 2021. [1]. It probably should have been it's own standalone message sent to users (a "this should be a no-op" email), bit to claim that it was unannounced is misleading.

1. https://cloud.google.com/blog/products/networking/cloud-cdn-...

acdha · on Feb 2, 2022

That’s half a year earlier and it’s described as an opt-in change until the very end, where it’s mentioned as a default changing in a few weeks. That’s far different from what, say, AWS does proactively sending email and SNS notifications with a time range and usually listing the affected instances.

moeris · on Feb 2, 2022

Yeah, it doesn't sound like you're disagreeing with me.

ghusbands · on Feb 2, 2022

You expect everyone to read the google cloud blog? The distinction between "unannounced" and "not usefully announced" isn't of merit. If they did not specifically make their affected customers aware of the change and when it would actually happen, it was unannounced. And caused a major outage for at least one of their customers.

tialaramex · on Feb 3, 2022

This afternoon I tried to clone a git repo which, in the morning, was highlighted as containing a useful example to start from in the work I had targeted next.

The clone failed with a mysterious error. After some minutes I checked the accompanying web site. The web site failed too, but, on refresh, this time I got a holding page explaining that the service was down. So I check the overall ticket system, and I find a change ticket, for the git system, saying there is planned maintenance, at 8am for one hour. Unadvertised because hey, it's 8am, most people aren't at work at 8am and this is a regular (Wednesday 8am) maintenance slot.

And I scroll down and I find that nobody remembered to actually do the task. They wrote it up, submitted, got it OK'd and then, eh, never did it. By the time the people who were supposed to do it were reminded it was 9am already. So, astoundingly, the service owner OK'd just doing it after lunch instead.

That failed 8am change was actually a re-run, of a re-run, of a re-run, of an upgrade that keeps failing and definitely takes over an hour to complete.

So instead of "It's fine to do this when nobody is at work and it's low risk" suddenly "It's fine to do this for 2 hours in the middle of the working day, though it'll probably fail and we have no roll back plan".

That's pretty shoddy. Glad to know an "Enterprise" cloud offering is hardly better.

matsemann · on Feb 2, 2022

> They have a massive design flaw, and they didn't even mention that in the article.

From the article:

> This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.

Don't spread FUD.

moeris · on Feb 2, 2022

I agree, they should have addressed the issue in the article so that there would be less uncertainty.

barrkel · on Feb 2, 2022

Telemetry traffic is multiplexed with user traffic on a single socket thread, per article. That creates a single point of failure where telemetry can affect user traffic.

Of course all network access is shared for a machine so it's not possible to not have a single point of failure, but there are different ways of slicing up the access.

marcan_42 · on Feb 2, 2022

You're grasping at straws with this argument. That it shares a thread is a technicality. I'm sure the socket management is asynchronous and telemetry wouldn't normally affect normal traffic. This was an infinite loop bug. What if it had been a memory corruption bug instead, would you be saying that telemetry needs to be a separate process, not just a separate thread? The design was reasonable. Dumb bugs can happen anyway and cause things not to work as designed. That's what happened here.

wintermutestwin · on Feb 2, 2022

Not an answer to your question, but why does a browser use telemetry at all?

Santosh83 · on Feb 2, 2022

What happened to the famed intelligence level of Hacker News? Every single person (almost) in this thread is blaming telemetry while it was clear even when the bug was ongoing that it was unrelated to telemetry. I for example had had telemetry disabled and still hit the bug through other traffic and had to temporarily disable HTTP3 from about:config.

Macha · on Feb 2, 2022

> I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.

Mozilla has opened themselves up here, as they market as a privacy and user respecting alternative, so when they fail to live up to their own marketing people are more annoyed while they expect random startup #456 to not care about their users privacy and have telemetry out the wahoo.

dblohm7 · on Feb 3, 2022

(I used to work for Mozilla, and spent a few years on the team that at the time owned the telemetry component.)

The way Mozilla does telemetry is different from how most places do.

I think that the biggest issue with these discussions is that there always seems to be this assumption that there is only one way to do telemetry, it always contains super invasive PII, Mozilla's telemetry must do the same, and therefore Mozilla's telemetry is just as evil as anybody else's.

Mozilla is remarkably open about how its telemetry works, beyond just being open source. Maybe this is more a problem of that information not being surfaced well, I dunno. I get that some people are philosophically opposed to telemetry no matter what, but I have seen enough cases of, "Wow, I didn't know it worked that way, and I'm actually okay with this," to know that informed users are not universally opposed to it.

account42 · on Feb 3, 2022

All network requests share at least the IP address, which is PII, and should only happen after obtaining informed consent unless they are required for the requested user action. Since telemetry would be pointless if its the same for everyone there will inevitably be more information that can ultimately be used to identify users. You can argue as much as you want that you are doing it "better" than others (there is always someone worse and the software industry's disregard for user rights is well known) or that it is useful (many unethical actions can be useful but the ends do not justify the means, especially when alternatives like bug reports often go ignored) but that does not change the fact that you are sharing PII without informed consent.

dblohm7 · on Feb 3, 2022

IP addresses are uniquely identifiable, sure, but not personally identifiable.

account42 · on Feb 4, 2022

You should not be so sure about that. For example see this recent judgement:

https://rewis.io/urteile/urteil/lhm-20-01-2022-3-o-1749320/

HN discussion: https://news.ycombinator.com/item?id=30135264

More importantly Mozilla knows that there are people who do not want them to upload this information yet they continue to do it anyway by default without ever asking for consent. Worse, Mozilla keeps adding new leaks that concerned users will have to watch out for and disable after each update. This is of course by no means a problem unique to Mozilla - the software industry as a whole has not yet learned that no means no - but it is also a Mozilla problem and as long as they want to use privacy to market their software they will rightfully receive the loudest criticism. Thankfully laws are beginning to catch up with the digital age and people will have better recourse than asking software vendors nicely to not be mistreated.

wolverine876 · on Feb 3, 2022

Could you share a link?

dblohm7 · on Feb 3, 2022

I can share several!

If you're running Firefox, you can go to `about:telemetry` and see what data is there. Note that some of that data might be populated even if you have telemetry turned off. Don't reach for your pitchfork quite yet: I assure you that the data isn't being sent.

You can go to https://telemetry.mozilla.org to see various analyses involving telemetry data.

In the source tree, telemetry probes must be defined in one of a few places. Here are their locations:

https://searchfox.org/mozilla-central/source/toolkit/compone...

wolverine876 · on Feb 4, 2022

I meant something documenting how telemetry works, though perhaps the source tree is the source for that?

> Don't reach for your pitchfork quite yet

No need to assume people are out of their minds or even critical. I am curious about how it's done on a technical level, with the old local ad system in mind (which I thought was a brilliant solution to Internet commerce and privacy). I've supported and contributed to Mozilla since before Firefox.

dblohm7 · on Feb 4, 2022

Yeah, please don't take the pitchfork thing personally, that was more intended for anybody reading that comment who immediately assumes the worst.

As for high level docs about how it works, I haven't been involved in quite a few years, so I'm not 100% sure about the best source, but this link looks like a good place to start:

https://docs.telemetry.mozilla.org/concepts/history.html

wolverine876 · on Feb 3, 2022

THe problem had nothing to do with privacy or respect for users; it was just a bug triggered by a vendor update.

magicalist · on Feb 2, 2022

> Instead, they frantically went through their git log to see what in the code base might have triggered this bug.

This seems like you're embellishing this part to tell a story? It's not supported by the linked post, and from the bugzilla bugs it seems like it was known almost immediately that the ESR builds were affected as well and so it almost had to be an external service, they just weren't sure which one at first.

TheIronMark · on Feb 2, 2022

Google should not be rolling out unannounced changes, but Mozilla increased their risk by not pinning the HTTP version.

radmuzom · on Feb 2, 2022

[flagged]

jrochkind1 · on Feb 2, 2022

it's probably not the same people, even though you see both opinions on HN.

causi · on Feb 2, 2022

God I hate the modern web. Why the hell does my browser need more than periodic contact with any server other than my DNS provider and the host of the website I'm connecting to?

npteljes · on Feb 2, 2022

Even DNS is pretty WTF worthy if you think about it. Contacting a third party about the website you're about to visit. On an unencrypted channel no less.

Another third party request is Firefox's phishing/malware protection. It periodically downloads their own bad site collection and if you visit a site, it check if it's on the list. And if it isn't, it checks with Google if the site is okay.

https://support.mozilla.org/en-US/kb/how-does-phishing-and-m...

P.s. Tor Browser turns this Safe Browsing feature off. Looks like I'm on the same page as them on the implications of it.

account42 · on Feb 2, 2022

Wow, I did not realize it would send URLs to Google. That does not sound GDPR-compliant.

But even without the privacy issue, you should turn off save browsing because google should not be in control of what users can and cannot donwload. They clearly do not care about keeping that list free of false positives, for example: http://dege.freeweb.hu/dgVoodoo2/

> It does NOT contain any malware. Use a browser that is free of Google Shit Browsing security service crap (which is based on tons of noname antivirus "engines", look at VirusTotal if interested).

I have also experienced Googles disregard for false positives on that list myself. While they may "remove" false listings after you bug them, those entries will just be re-added the next week and of course because this is Google there is no way to get an actual human to look into it. It is insane that all browsers allow a private company to maintain such a list without complete transparency and publicly visible reasoning for why each entry is in it as well as well defined procedures to contest false postives with agagain, publicly visible reasons for denial.

dblohm7 · on Feb 3, 2022

> Wow, I did not realize it would send URLs to Google. That does not sound GDPR-compliant.

That's not how Firefox does safe browsing. https://feeding.cloud.geek.nz/posts/how-safe-browsing-works-...

"One of the most persistent misunderstandings about Safe Browsing is the idea that the browser needs to send all visited URLs to Google in order to verify whether or not they are safe."

account42 · on Feb 3, 2022

Apparently it is how it used to work though for some settings, which is bad enough, and still sends partial hashes in some cases which can leak some information. Even just making any connections to Google (or another provider) without the user explicitly visiting a Google website is a privacy issue that Firefox should resolve.

It also does not adress the problem with making Google the gatekeeper deciding what you can and cannot download - and don't tell me its just a warning, its set up in such a way that regular users will often not even know that they can bypass it.

dblohm7 · on Feb 3, 2022

That's not how Firefox does safe browsing. https://feeding.cloud.geek.nz/posts/how-safe-browsing-works-...

"One of the most persistent misunderstandings about Safe Browsing is the idea that the browser needs to send all visited URLs to Google in order to verify whether or not they are safe."

npteljes · on Feb 3, 2022

Thanks, I think I get it now. What ends up happening with Safe Browsing enabled

1. Firefox checks it own list locally, which it updates every 30 mins

2. If not found, it chops off query params from the url, hashes it, chops the hash to a smaller size

3. Creates several bogus hashes, then checks them all with Google's Safe Browsing service.

MauranKilom · on Feb 2, 2022

What does "more than periodic" mean? But to answer your question: For example, to inform you that an update is available.

causi · on Feb 2, 2022

"Need" is the key phrase. If it wants to check for updates, do telemetry, fiddle around, whatever, that's fine by me. It should not shit the bed if it has to go a few hours without doing so, however. For example, you recall a couple of years ago when Mozilla screwed up their certificates and every Firefox extension was simultaneously disabled. That should not have happened. That should have resulted in a pop-up box that said "The security of your extensions cannot be verified. Using them at this time could be highly dangerous. Disable them to browse safely? Yes/No" Instead Mozilla just blanket turned them all off, which had the potential for getting people killed as they suddenly found themselves not protected by VPN/Tor/NoScript/etc extensions.

rockdoe · on Feb 3, 2022

Software shouldn't have bugs, got it. It should be written by bug-free programmers that foresee every possible failure mode (that they can never imagine happening, because they won't write bugs themselves).

account42 · on Feb 2, 2022

Updates and associated checks should not be done by each application itself.

Karunamon · on Feb 2, 2022

Hard disagree. Having that in the hands of the OS is a separation of concerns issue. I want my OS people focusing on OS stuff. The versions of installed apps is up to the developers of those apps, doubly so on something like a browser that updates rapidly.

pessimizer · on Feb 2, 2022

No, the versions of installed apps should be up to me. And my OS should help me manage that.

rockdoe · on Feb 3, 2022

Security updates for the browser, certificate checks (so you know it's the right website you're connecting to....) you can't get away from.

galosh · on Feb 2, 2022

That was an extremely high risk change on GCP's part, reminds me of the App Engine days when you'd wake up to find a totally healthy program spamming 500s because they'd make a breaking change without any announcement. It's shocking they're still pulling stuff like this in 2022

agilob · on Feb 2, 2022

It reminds me how YouTube enforced new codec with a few days notice knowing that FireFox doesn't support it, so FF couldn't play most YT videos for over a week.

TheGoddessInari · on Feb 2, 2022

> enforced new codec

YouTube turned off flash player as the default in 2015, and VP9 was supported in Firefox at the same time.

YouTube still serves h.264, vp9, and av1.

I was trying to figure out what this could be referring to.

agilob · on Feb 2, 2022

Not all codecs were ported at the same time, and then not enabled by default, and when enabled by default it was platform dependent, for others where it worked FF was eating all possible CPU resources and videos were glitching. I remember this as I was using Debian and I was active on /r/firefox, where this [1] link was posted 10 times every day

[1] https://www.youtube.com/html5

faeyanpiraat · on Feb 2, 2022

Did yt back up or ff updated to fix issue?

agilob · on Feb 2, 2022

Firefox got the codecs working in the next release

kevingadd · on Feb 2, 2022

Feels like an especially severe version of the consistent Google pattern of only testing stuff on Chrome, so new updates/features ship in a way that is some degree of broken on Firefox/Safari. For a significant amount of time YouTube had bad performance on Firefox because they chose to use Web Components by default with a horrible polyfill instead of using the old (still working!) html5 version that ran great.

loulouxiv · on Feb 2, 2022

Putting in place an infrastructure to test this kind on changes on the 5-10 most popular browser would be, I think, very cheap for a company like Google. I can't help thinking these may be deliberate moves to eat the little market shares of Chrome concurrents. I remember reading here on HN an article written by an ex-Mozilla insider relating the dissonance between the "friendly" Mozilla-Google employees exchanges and the year-long track record of very oddly recurrent "unfortunate mistakes" from Google degrading the Firefox compatibility.

loulouxiv · on Feb 2, 2022

I am trying to find the link, but for the moment I only find comment making, I think, references to it :

https://news.ycombinator.com/item?id=19815348 https://news.ycombinator.com/item?id=28495546

Does anybody here remember enough keywords to find it out ?

Edit: I guess it was this Twitter thread https://mobile.twitter.com/johnath/status/111687123179245568...

Edit 2: The associated HN thread https://news.ycombinator.com/item?id=19662852

acdha · on Feb 2, 2022

Yes - they clearly don’t test the GCP console in Firefox since they “accidentally” break it on a regular basis, and there’s just no excuse for that happening at such a rich, well-staffed company.

magicalist · on Feb 2, 2022

> Putting in place an infrastructure to test this kind on changes on the 5-10 most popular browser would be, I think, very cheap for a company like Google.

The problem wasn't some web server, it's the Firefox backend services running on GCP.

marcan_42 · on Feb 2, 2022

Testing wouldn't have revealed anything because this didn't break with Firefox outright, it only broke when Firefox telemetry used it due to a complex series of circumstances.

notyourday · on Feb 2, 2022

> That was an extremely high risk change on GCP's part, reminds me of the App Engine days when you'd wake up to find a totally healthy program spamming 500s because they'd make a breaking change without any announcement. It's shocking they're still pulling stuff like this in 2022 reply

Lay with the dogs, wake up with the fleas.

Google is a shitty company producing shitty products. When you select to do business with Google you select to do business with a shitty company producing shitty products and treating its customers like shit. Hence I fail to understand the Surprised Pikachu face when something like this happens.

brabel · on Feb 2, 2022

The client, Firefox, said it supported HTTP/3 though. Otherwise it wouldn't get to use that.

I don't think that's as bad as you try to make it... if the client says it supports something then it breaks when it uses it, it's the fault of the client, not the server.

throwaway984393 · on Feb 2, 2022

No SRE in the world that is halfway decent at their job would think that way. You never make assumptions about any kind of change, much less a global change to a completely different protocol. Doesn't matter whose fault it is. You just don't introduce any change that has a chance of unexpected behavior without rigorous testing, and you roll it out g r a d u a l l y, and you stop when error rates increase.

Google literally wrote the books on SRE. For them to not know better is absurd.

charcircuit · on Feb 2, 2022

Would a warning have even helped that much? Since HTTP/3 was expected to be working there wouldn't be a cause to worry.

Semaphor · on Feb 2, 2022

> Would a warning have even helped that much? Since HTTP/3 was expected to be working there wouldn't be a cause to worry.

It might (as mentioned in TFA) have made them think to run some extra tests, which could have caught the bug. But it also would have made the response faster, as they would have known what changed far sooner.

0xbadcafebee · on Feb 2, 2022

The key word there is "expected".

When you do Operations for a living, the only thing you can expect is the unexpected. That's why even after you think you've tested a change, you carefully and slowly roll it out a bit at a time, monitoring golden metrics so you can detect a problem, stop the roll-out, and roll back.

It sounds like somebody just flipped a giant switch and never checked error rates, connection metrics, anything. Check out this graph: https://hacks.mozilla.org/files/2022/01/crashes-foxstuck2-20... Think maybe that would indicate somebody needs to roll back the last change?

The problem here, as usual, is a disconnect between stakeholders. Google has this service (it seems like the load balancer for their customer?) it wants to change for one reason or another. The customers may or may not have planned for the change Google is making. Google makes the change, but it isn't a stakeholder of the customer (they basically don't care what happens to the customer). So there is no direct feedback loop for the customer to tell Google something is wrong.

If Google was at risk of losing business from its customers going down, it would have a strong relationship with those customers and have a way to quickly help diagnose problems and roll back changes if needed. This is a great lesson for all customers to take away: don't depend on people who you don't have a close relationship with.

brabel · on Feb 2, 2022

A lot of intelligent sounding words to explain a trivial bug: their network stack was using case-sensitive header names, which anyone doing anything remotely related to HTTP knows is a mistake.

We all make mistakes, but don't try to make it sound more grandiose ("a combination of multiple factors blah balh blah") than it is.

viraptor · on Feb 2, 2022

Http2 and 3 are case sensitive at when encoding the headers:

> As in HTTP/2, characters in field names MUST be converted to lowercase prior to their encoding. A request or response containing uppercase characters in field names MUST be treated as malformed

https://quicwg.org/base-drafts/draft-ietf-quic-http.html#sec...

matsemann · on Feb 2, 2022

> which anyone doing anything remotely related to HTTP knows is a mistake.

Only if you haven't updated your knowledge since HTTP/1.1 days...

0x002A · on Feb 2, 2022

"Because all network requests go through one socket thread, this loop blocked any further network communication and made Firefox unresponsive, unable to load web content." Why side functionality (telemetry) of a tool uses only one network thread and can block any network communication ?

floatingatoll · on Feb 2, 2022

Per elsethread, supposedly any site could have triggered the crash, not just ”side functionality”: https://news.ycombinator.com/item?id=30175916

0x002A · on Feb 2, 2022

Yes that's another issue. But as a design pattern shouldn't we design our products to do their core functionality as much as independent from any anomalies that can happen? This is almost akin to me if Tesla rolls out an update and the car decides to pull over to the curb to do the update, while you are driving to your job or worse to hospital with an emergency. My theory is there should be at least one health enterprise using firefox as their only browser for business functionality out in the wild.

marcan_42 · on Feb 2, 2022

> But as a design pattern shouldn't we design our products to do their core functionality as much as independent from any anomalies that can happen?

When expected anomalies happen, like telemetry being down or taking a long time to respond. Firefox is certainly already designed like that.

This was not that. This was a bug. There is no magical design that avoids bugs.

> This is almost akin to me if Tesla rolls out an update and the car decides to pull over to the curb to do the update, while you are driving to your job or worse to hospital with an emergency.

Firefox is not a car. You're going to have to get Mozilla a lot more funding if you think the browser should be designed with extreme resilience in mind as required for life-critical applications. If a health enterprise is using Firefox in a life-critical role, that's kind of their responsibility, not Mozilla's.

Yoric · on Feb 2, 2022

In theory, that's how it happens in Firefox. But when you have a bug in the core of the product (the network stack), there isn't much that the rest of the product can do to isolate from it.

0x002A · on Feb 2, 2022

Yes that's what you get when you have one thread for all network communications. The network stack did not fail, only the sole network thread got stuck. From the write up I understand if there was another thread for communications firefox would only fail to communicate with telemetry service but firefox would be able to function as users needed.

Yoric · on Feb 2, 2022

Well, it could have limped along with degraded performance. Which would undoubtedly have been better.

0x002A · on Feb 2, 2022

I think that part is clear in the document.

"This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise."

marcan_42 · on Feb 2, 2022

It was a bug. It can't block network communication normally. It's not like telemetry serializes with normal user traffic. It was just a stupid infinite loop that broke what otherwise would've certainly been nonblocking multiplexing of requests.

yjftsjthsd-h · on Feb 2, 2022

I'm not saying this was actually a good design, for obvious reasons, but one decent reason to do it this way is so things like proxy settings are shared.

dblohm7 · on Feb 3, 2022

The socket thread does not do blocking reads.

perryizgr8 · on Feb 2, 2022

When you read the whole account, everything has a justifiable reason and the entire thing is very rational. But if you look at the 10000ft view, if an app hangs and refuses to work at all due to nasty/unexpected input coming over the network, that is just a bad bug. These sorts of bugs should have been caught earlier. It shouldn't matter if the entire internet is sending bad responses, Firefox should still handle it gracefully.

It does not need this big of an explanation. It's just a silly bug, they can do better by improving their testing. The lady doth protest too much.

rockdoe · on Feb 3, 2022

I think the length of the explanation is due to the impact this had. Most Firefox bugs don't cause all Firefoxes everywhere to stop working simultaneously. That's quite the WTF and it's hard to understand how it is even possible without the explanation.

kragen · on Feb 2, 2022

It should be impossible for the phrase "the recent Firefox outage" to make sense. Has there ever been an "outage" of linear algebra? Of Linux? Of mitochondria? Of Bitcoin?

It is critically important that we not introduce new single points of failure into the systems that our civilization depends on, and that we remove the ones that already exist.

If it can happen by accident, it can happen on purpose.

orsenthil · on Feb 2, 2022

> It should be impossible for the phrase "the recent Firefox outage" to make sense.

I completely agree. We shouldn't encourage this term with Firefox too. It is my personal client, not a global service.

ainar-g · on Feb 2, 2022

While I agree with the SPOF point (heh), the header should really read “the recent Firefox Telemetry outage”. It's kind of like a website causing your browser or a tab of the browser to be unresponsive by introducing an infinite busy loop. Except the “tab” was invisible.

floatingatoll · on Feb 2, 2022

This is wrong on two points, based on the other discussion and the post itself.

There was no Telemetry outage. An HTTP3 response header’s case was changed by a third party without notice. Telemetry continued working, other than the case change causing a bug.

There was an http3 infinite loop bug in Firefox that crashed all networking. Many different things could have triggered the bug once it was introduced. Telemetry happened to be the first thing to do so, but not due to any faults in Telemetry’s code or implementation.

mozdeco · on Feb 2, 2022

The infinite busy loop in this case was not the tab no (neither visible or invisible). The loop was directly in the network stack, as stated in the post, not in the caller.

kragen · on Feb 2, 2022

The problem isn't that there was a telemetry outage; the problem is that the telemetry outage caused a Firefox outage, which should not be a thing. Firefox needs to be more robust than that.

dralley · on Feb 2, 2022

That's not what happened though. Telemetry uses the same networking stack as everything else, and the busy loop was in the networking stack.

cxr · on Feb 2, 2022

> That's not what happened though

What do you mean that's not what happened? Was there a "Firefox outage" or not? Are you disputing claims that the engineering team made about Firefox becoming unusable for users "for close to two hours"?

dralley · on Feb 2, 2022

> about this telemetry bug

That's my point. A "telemetry bug" didn't make Firefox unusable, a networking bug that was triggered by a telemetry bug did. But it could just as easily have been triggered by anything else.

cxr · on Feb 2, 2022

First, you're going anachronistic. They didn't write "telemetry bug".

Secondly, "cause" doesn't automatically mean "root cause". (That's the entire reason we distinguish between the two by qualifying the latter to begin with.) It's perfectly reasonable to say "A caused B" even if the root cause lies elsewhere, with C.

Thirdly, none of this matters. It has no impact on the point being made by the person you responded to, which—to repeat—is that:

> It should be impossible for the phrase "the recent Firefox outage" to make sense.

dralley · on Feb 2, 2022

>It should be impossible for the phrase "the recent Firefox outage" to make sense.

It makes perfect sense a world where half the internet is going through Google / Cloudflare / Amazon /Akamai servers or some combination of the above, and they decide to roll out brand-spanking-new protocols to half of the internet at once. Sometimes that's going to break clients.

I don't like that world very much, but it's the one we live in.

barrkel · on Feb 2, 2022

... due to code in Telemetry being different and triggering different code paths in the network stack.

There is a why here, and it includes Telemetry mixed traffic as a potential culprit. There are reasons to unify traffic (proxy support, QoS and whatnot) but unification of the user and Telemetry streams isn't without risk, as has been shown.

cxr · on Feb 2, 2022

> unification of the user and Telemetry streams isn't without risk, as has been shown

A constant refrain over the last 10 years or so of Mozilla's descent while trying to justify the removal of features from Firefox has been that not doing so unnecessarily bloats the surface area of the codebase, and specifically that this increases the chance of vulnerabilities and defects.

Will the same argument be applied here, now with a case in hand, to justify the removal of telemetry, too?

marcan_42 · on Feb 2, 2022

It's the same app. I don't get why you're replying to every thread trying to somehow argue that sharing a thread for all network code is a bad thing and telemetry needs to be a special snowflake that gets a different thread. The networking code had an infinite loop bug. It was triggered by telemetry, but it could've been anything. Telemetry getting its own network thread wouldn't have magically made it impossible for it to cause problems. Bugs happen, and sometimes make things interact in weird ways.

kragen · on Feb 2, 2022

First you say "that's not what happened" and then you explain how that is what happened.

detaro · on Feb 2, 2022

The user impact (Firefox not working) is more important than the technical reason, and thus should be in the title.

denton-scratch · on Feb 2, 2022

> there were many contributing factors working together

Looks like one factor, to my eyes: telemetry.

I have telemetry disabled. But if you're going to default to "telemetry on", and then silently send data to sites that aren't in the address-bar, then it's your responsibility not to "break the web". You can't blame it on rust, or necko, or viaduct, or google.

Liquid_Fire · on Feb 2, 2022

> This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.

It's pretty clear from the article that this was a bug, and telemetry requests failing was not intended to break the rest of the browser.

If Google had rolled out this change a year from now, it could probably have broken something other than just telemetry (e.g. maybe update checks, or certificate management) and your browser would still have been broken even with telemetry disabled.

pessimizer · on Feb 2, 2022

> telemetry requests failing was not intended to break the rest of the browser.

You can't find a lighter straw man than that. Find me somebody who said that Firefox intended to break the browser if telemetry failed.

hypertele-Xii · on Feb 2, 2022

> It's pretty clear from the article that this was a bug, and telemetry requests failing was not intended to break the rest of the browser.

If telemetry didn't exist as core part of the browser, nothing would've broken.

Therefore, the telemetry itself is the direct cause of the bug. It was, at best, poorly handled and too deeply integrated into the browser's core function.

matsemann · on Feb 2, 2022

The browser would have broken the next time a malformed http3 request happened. Read the quote again.

jacquesm · on Feb 2, 2022

There was no 'Firefox Outage' because Firefox is not a service. There was a bug, and a production issue with a service that Firefox users were involuntarily opted into.

Lesson learned: do not opt your users into services without their consent.

Klonoar · on Feb 2, 2022

The article outright explains how it wasn’t limited to Telemetry and could be triggered otherwise.

jacquesm · on Feb 2, 2022

It could have been, but it wasn't. See, if I had opted out of this junk, which I wasn't because it was enabled without my consent I would not have experienced that particular problem (but others would have) and I would have been able to save myself a couple of hours of debugging.

So yes, it wasn't limited to Telemetry, but no I had not seen the bug in practice until that very moment.

pdenton · on Feb 2, 2022

I haven't noticed anything and I use Firefox every day. Nor have any of my clients where I deployed Firefox called. Is this because I always disable data collection in settings?

Ygg2 · on Feb 2, 2022

It's not a telemetry bug. It's a networking bug. Avoiding it basically comes down to being extremely lucky.

ainar-g · on Feb 2, 2022

Yes, the bug was in the Telemetry code. I'm not sure if it's on by default, but it's probably better to disable it for any large-scale deployments. Both to prevent things like this and to make sure that things that should be disabled by default actually are.

pdenton · on Feb 2, 2022

Ah, that explains it. Thank you. I recommend https://ffprofile.com/ which was posted earlier here on HN. Makes it easier to deploy Firefox with saner defaults.

You know I moved from Netscape 3.0 Gold to later versions, to Mozilla, to Phoenix, Firebird, and then Firefox. I tried other browsers but it's always a subpar experience for me. My only gripe is that they kept changing the UI.

floatingatoll · on Feb 2, 2022

(FYI: The bug was not in the telemetry code, it was in the http3 code; see comments further up in the post for details.)

mccr8 · on Feb 2, 2022

It was broken for a couple of hours, and late at night in the Americas. You might have been asleep for the entire duration of the outage.

ricardobayes · on Feb 2, 2022

96 by far is the worst release, probably even in the history of FF.

danuker · on Feb 2, 2022

What can we do about it?

Use an older version? Security problems. Use a simpler browser? Many sites will stop working.

Maybe it's for the best to avoid using sites that use complicated JS/CSS/HTML. But will still need it for say, government sites to pay taxes.

userbinator · on Feb 2, 2022

You mean "Security" problems...

There's a lot of FUD and paranoia out there; 99% of exploits need JS and even those which don't technically need it, are almost always obfuscated using JS.

Leave JS off by default (there are extensions to do that) and don't turn it on unless you really do trust the site to run arbitrary code on your computer, and you're unlikely to encounter any problems.

iqanq · on Feb 2, 2022

Everybody should use Firefox ESR. It's stuck at Firefox 91 and you receive security updates, but you get no ads, your settings do not reset...

Semaphor · on Feb 2, 2022

I’m up-to-date and have neither ads, nor reset settings.

FeepingCreature · on Feb 2, 2022

57 was worse.

IAmNotAFix · on Feb 2, 2022

> all network requests go through one socket thread

Looks like the crucial issue to me. The SPOF which enables something as absurd as "Firefox Outage".

rockdoe · on Feb 3, 2022

This isn't atypical for high performance network software, AFAIK. Single-thread with aio.

sirl1on · on Feb 2, 2022

Why does my browser need connectivity to some internal services? I am fine with offering opt-in service integration (Firefox Sync, Pocket, ...) but is there a reason why Firefox needs internal infrastructure to do the one thing it is supposed to do, browsing the web? I can only think of DNS over HTTP, but AFAIK that is also opt-in, right?

Man, I love Firefox and used it since it was called Firebird (with a small gap when Chrome was shiny and new and Firefox a slow RAM hog). But I really resent the Mozilla Foundation, they seem to be interested in everything but browser development. To be fair, (ab)using the browser as application runtime brought us so much complexity that developing and maintaining a secure browser as free software spare time project isn't feasible anymore.

lmm · on Feb 2, 2022

> Why does my browser need connectivity to some internal services? I am fine with offering opt-in service integration (Firefox Sync, Pocket, ...) but is there a reason why Firefox needs internal infrastructure to do the one thing it is supposed to do, browsing the web?

If you read the whole post, the connection was explicitly for telemetry (and so you could avoid the issue by turning off telemetry), and it blocked other connections because the request went into an infinite loop rather than failing outright.

nanis · on Feb 2, 2022

> If you read the whole post, ... (and so you could avoid the issue by turning off telemetry)

Speaking of reading the whole post:

>> users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.

This does not make sense to me:

>> Without the header, the request was determined by the Necko code to be complete,

This is written as if it makes sense to treat a request as "complete" when it's missing a content length header. Huh?!

mozdeco · on Feb 2, 2022

At this point, the code relied on the Content-Length header being present because the higher-level API was supposed to add it. The field that is supposed to be populated by Content-Length (mRequestBodyLenRemaining) is pre-initialized to 0.

lmm · on Feb 2, 2022

> users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.

All they mean by that is that it was a bug in their HTTP/3.0 code that could have been triggered by any HTTP/3.0 connection that was using that codepath. But the reason for that particular connection (which had the conditions to trigger the bug) was for telemetry.

martpie · on Feb 2, 2022

It is explained in the second sentence:

> updates, telemetry, certificate management, crash reporting and other similar functionality

We can discuss about the importance of telemetry, but the others seem quite important to me.

The key is just to gracefully fail when something goes wrong (and there it didn't).

piaste · on Feb 2, 2022

All of those except certificate management are important but should never be required for the browser to work. And certificate management should depend on whatever chain of trust is configured, which should not invoke Mozilla as an essential party to every transaction.

Does it mean that the same blocking bug could happen while browse a local website on an air-gapped network? Or while opening local HTML files while offline?

mozdeco · on Feb 2, 2022

Firefox generally does not block if a remote connection does not work. As explained in the post, the infinite loop was a bug in the network stack itself.

So yes, you can use Firefox in any offline environment.

Beltalowda · on Feb 2, 2022

> should never be required for the browser to work

None of them are; disconnect from internet and start Firefox. It will work.

It was just a bug in the Firefox HTTP 3 implementation that caused it to be rendered unusable; it just so happens that connecting to these services triggered it, but it also could have been triggered by another HTTP 3 service (as I understand it, anyway).

danuker · on Feb 2, 2022

From the article:

> the client was hanging inside a network request to one of the Firefox internal services.

Presuming that while offline, there is no pending network request to hang on, yes, it would have worked if you were offline.

deepstack · on Feb 2, 2022

>All of those except certificate management are important but should never be required for the browser to work. And certificate management should depend on whatever chain of trust is configured, which should not invoke Mozilla as an essential party to every transaction.

This is the part that really gets me. For an average user, they trust the certificate that is bundled with the browser vendor (yes you can do certificate pinning). It just seems like something like certificate for encryption, ought to be split up away from browser vendor rather managed by a open public repo mange by a non-profit. Or have it on a block chain type of ledger. Any thoughts on that HN?

herbstein · on Feb 2, 2022

> Does it mean that the same blocking bug could happen while browse a local website on an air-gapped network?

If you're making an HTTP/3.0 request formed "correctly" then yes, it too would cause the infinite loop. It's not in any way specific to the internal service.

vord1080 · on Feb 2, 2022

Under normal circumstances it would gracefully fail. If the connection fails normally the browser will keep trucking along, the problem was a bug deep inside the network stack that could've been triggered by any HTTP/3 connection.

kevingadd · on Feb 2, 2022

Some degree of this is absolutely required to be a decent internet citizen. Even if you think things like emergency configuration and basic telemetry are optional (I disagree), polling things like certificate revocation lists is basically required. Without doing it all your customers are sitting ducks for the latest security vulnerability.

shantara · on Feb 2, 2022

It's incredible to see a normalization of the term "outage" applied to a software running on a local hardware.

luciusdomitius · on Feb 2, 2022

[flagged]

hackerfromthefu · on Feb 2, 2022

She's probably earned another raise if her actual reason for appointment is to drive chrome adoption while soaking up community engagement to prevent another open source browser competitor, then she's succeeding very well!

luciusdomitius · on Feb 2, 2022

https://www.androidheadlines.com/2020/08/mozilla-firefox-goo...

Well. Seems that it cannot be ruled out.

ricardobayes · on Feb 2, 2022

It's very interesting to see how rapidly FF fell out of favor for some reason. Went from poster child to pariah in what seemed like no time.

pessimizer · on Feb 2, 2022

It's amazing to see how rapidly Firefox users fell out of favor with Mozilla for some reason. Their bugtrackers went from joyful to friendly to silent to openly hostile as they tore out every feature that distinguished them from Chrome.

Branding must be really important if people are expected to enjoy using a completely different product because they enjoyed the old product. MS Office doesn't expect me to do that; they give me essentially the same thing in 2022 as they gave me in 1997. They don't expect me to be loyal out of some sense of love or obligation.

brimble · on Feb 2, 2022

Niche opinion, probably, but I think they took a hard wrong turn not later than 3.0 (yes, that long ago) and never recovered.

Here we have this bug that's "not in telemetry" (strictly true) but for which telemetry increased the severity/blast-radius from "partial failure for many users" to "complete failure for most users".

But FTP—an actual feature for users, unlike spyware "features" that keep some chart-readers employed—had to go because that's too risky to keep. OK.

ricardobayes · on Feb 2, 2022

A quick google search revealed FTP was supported until v90, so I'm curious as to what is that you are referring to when you say 'they took a hard wrong turn not later than 3.0 (yes, that long ago) and never recovered'.

brimble · on Feb 2, 2022

Not related to the 3.0 release, just an example of a recent cut of an actual feature while spyware is apparently essential. IIRC 3.0 (might have been one of the 2 series?) was when the browser suddenly got a lot fatter and the UI less responsive, and never made meaningful progress back the other direction, contrary to its feather-weight beginnings which were a big part of why I loved it so much. I kept using it for quite a while longer but never loved it again.

luciusdomitius · on Feb 2, 2022

Well. The actual lead dev got kicked out as Mozilla Foundation chair and got replaced by some SJWs over his support for banning abortions, same-sex marriages or something similar.

rockdoe · on Feb 3, 2022

Not sure I'd consider a "CTO" to be the "lead dev". Definitely not in a bigger org.

Traubenfuchs · on Feb 2, 2022

Previous discussion:

https://news.ycombinator.com/item?id=24563698

blackbear_ · on Feb 2, 2022

[flagged]

kevingadd · on Feb 2, 2022

Does Firefox force updates on your configuration? On Windows it's opt-in for me. I know Android will forcibly update any app (including Firefox) while you're using it, but you can shut that off system-wide.

joshgev · on Feb 2, 2022

The default is that it forces a restart when a new update has been downloaded, which has been a frustration for me as apparently it has been for the parent.

Apparently this can be changed by requesting that FF only install updates with explicit consent[1]. I'd think the best way would be to install anything that's available locally when the browser starts without forcing a restart.

[1] https://superuser.com/questions/1451210/how-can-i-make-firef...

Vinnl · on Feb 2, 2022

Are you using Linux, by any chance? IIRC, the issue here is that the update is done by your package manager, changing Firefox's files out from under them. If you use the direct download from Mozilla, it shouldn't be as disruptive.

I still hope a fix or workaround to this can be found, but knowing why something is the cause makes it easier to accept, at least for me :)

Nextgrid · on Feb 2, 2022

What pisses me off is less about the update itself (they're typically unnoticeable) but the constant nags and "what's new" crap that opens up after the update. Firefox is more hostile than a lot of paid, proprietary software in this regard.

brimble · on Feb 2, 2022

This. That crap is significantly more disruptive and irritating that the actual paid ads in old-school free Opera (the largest feature-comparable browser when FF/Phoenix/Firebird first launched).

alexb_ · on Feb 2, 2022

Happens to me often. I'll be browsing and then suddenly I am told that before I can view the next webpage I MUST restart. Giant PITA if I have a ton of private windows open, as none of those are coming back.

foxfluff · on Feb 2, 2022

It's not firefox, it's your distro.

account42 · on Feb 2, 2022

The update might be started by the distro in this case, but there is no reason that Firefox cannot just keep an fd on the resource files open and use that instead of the updated files. Either that or keep things compatible so it can use the new files. Not being able to use the browser after an update is inexcusable.

foxfluff · on Feb 2, 2022

> there is no reason that Firefox cannot just keep an fd on the resource files open and use that instead of the updated files

If you check what processes you've running, you'll see that firefox has many of them. I'm not going to grep the sources but I believe the ones with the "-contentproc" flag are started with an exec call as needed, and I'm not aware of an exec that works with fds. It requires a path, it executes the binary at that path, and that binary in turn loads a bunch of files it needs. It's all going to blow up if your parent and child processes are running different versions of the program.

Keeping parts of a program compatible with arbitrary versions of other parts of the same program is virtually impossible. Go ahead, checkout 50% of your files from some random version of your project thousands of commits ago while keeping the rest at master, and see if it still compiles and runs correctly.

account42 · on Feb 3, 2022

> If you check what processes you've running, you'll see that firefox has many of them. I'm not going to grep the sources but I believe the ones with the "-contentproc" flag are started with an exec call as needed, and I'm not aware of an exec that works with fds. It requires a path, it executes the binary at that path, and that binary in turn loads a bunch of files it needs. It's all going to blow up if your parent and child processes are running different versions of the program.

That firefox is set up in a way that does not work when replacing the files while running does not meat that it cannot be setup differently. For example you could have one process paused after loading all needed libraries and opening all resource archives and then just fork from that.

NB: You can actually execve the original executable under Linux via /proc/self/exe - while it behaves somewhat like a symlink to e.g. show the original path under ls / readlink, reading from it behaves like a hardlink to the original inode. This does not solve shared libraries however (but does Firefox really need to link its own libraries dynamically?) or resources (you could sendfd them to the new process if you really wanted to go with exec).

> Keeping parts of a program compatible with arbitrary versions of other parts of the same program is virtually impossible. Go ahead, checkout 50% of your files from some random version of your project thousands of commits ago while keeping the rest at master, and see if it still compiles and runs correctly.

The most sensible solution would be to keep all parts in sync, which is possible even if the copies on the filesystem changed. However, you could also have a stable API between the browser binary and the javascript parts if you wanted - that is a far cry from mixing arbitrary source files.

Keep in mind that browsers have become more like a mini-OS and should not get away with behavior common in normal applications but instead should provide stability guarantees more like an OS. And you definitely don't need to reboot because the kernel has been updated on the filesystem.

foxfluff · on Feb 3, 2022

> For example you could have one process paused after loading all needed libraries and opening all resource archives and then just fork from that.

Plausible I guess, but also seems like a lot of effort to work around broken distros. Actually calling exec has tangible benefits too, e.g. it allows each process to have its address space randomized.

> However, you could also have a stable API between the browser binary and the javascript parts if you wanted - that is a far cry from mixing arbitrary source files.

Yes but the problem here is stable API and ABI between the browser binary and.. the browser binary. IPC within a binary is hardly ever written assuming a stable ABI, it'll just constrain the project way too much.

> And you definitely don't need to reboot because the kernel has been updated on the filesystem.

I feel like the kernel isn't really a fair comparison since it's relatively self-contained and you generally don't have multiple instances of the kernel talking to each other using IPC.

But go ahead, update your modules and see if you can still load them on your old kernel without rebooting. No, you can't. Same problem.

Unfortunately OSes today tend to have many dynamically loaded parts that require restarts if you want to keep everything working after updates.

vvillena · on Feb 2, 2022

AFAIK Chrome behaves the same. You can't update a browser cleanly while it is running. In Windows is is handled correctly because both browsers are updated in the background when the browser is closed, but in Linux-style environments this work is done by the package manager. When the package manager stomps over a running instance of Firefox, the old behavior was to crash. At least, now Firefox can detect what happened and keep the active tabs running while instructing you to please restart the browser.

Regarding private tabs, one workaround would be to store the whole window as a bookmark folder (right click on empty tab area, select all tabs, store as bookmarks).

account42 · on Feb 3, 2022

The package manager does not stomp over the running instance but only replaces the files. The original files even remain on the disc (but without any name) as long as they are open. It is firefox that is designed to have to re-open those files during runtime - but that is not inherently required. Before Firefox went multi-process it handled package updates just fine and there is no technical reason why it can't still do that.

rockdoe · on Feb 3, 2022

>Before Firefox went multi-process it handled package updates just fine

No, the problem was just less common and things randomly stopped working or crashed instead of getting the warning page.

In theory you can design the browser so it keeps all files open and passes down the handles, but I imagine it's a mess to do that in practice, especially as Firefox is still somewhat configurable.

skinkestek · on Feb 2, 2022

Annoying indeed but at least on my machines I can always get them back manually if they don't show up automatically by going to the history menu and choose "Restore Previous Session". Hopefully this should work everywhere.

(Note: I'm trying to help, not place blame here. I won't blame anyone for not being aware of every power user trick but I hope to help more people become power users. Please do ask questions about Firefox, even if I'm moving to LibreWolf I still wish Firefox well and think many would be better off using them but just aren't aware : )

alexb_ · on Feb 3, 2022

That does not work for Private Windows, which I use often.

skinkestek · on Feb 3, 2022

I see.

I only use those for specific applications (mostly testing actually).

account42 · on Feb 2, 2022

On Windows it will nag everytime you start the browser even if auto-updates are disabled.

hngpt2 · on Feb 2, 2022

The recent outage for users of Mozilla’s Firefox web browser was caused by a Firefox update, the source code for which was inadvertently pushed to the Mozilla Add-ons Repository. That repository is not publicly available, so it was not possible for users to update their installations. Fortunately, a bug was introduced in the Firefox update which prevented the update from being installed. The bug was fixed when the update was rolled back. Users who updated earlier were protected.

The outage could have been worse, though, because the bug that was introduced in the code for the Firefox update was in a feature of Firefox itself that allows users to block certain updates until a later time. In the case of Firefox, that feature was used to block the update that caused the outage. If a similar situation occurred in the future, users would not have a way to block the update that causes the outage. That feature is available in other web browsers, but it is not as advanced or robust.

drekipus · on Feb 2, 2022

[flagged]

evilpie · on Feb 2, 2022

The actual bug was in the C++ code handling the Content-Length header (https://hg.mozilla.org/integration/autoland/rev/48c0e854be62...) and the Rust code just happened to be first to trigger it.

intothemild · on Feb 2, 2022

I don't get this. The issue had nothing to do with the language used.

account42 · on Feb 2, 2022

It did have something to do with there being two languages used though, resulting in two different ways of excercising the network stack whith less testing for each. So its not a bug caused by Rust, but it is caused by the addition of Rust in Firefox.

kevingadd · on Feb 2, 2022

This isn't specifically caused by rust, it's a second-system problem where they had two different ways to touch the http system and one was broken. It could have easily been the other way around with only rust being correct.

drekipus · on Feb 2, 2022

Yes, I know that, I'm trying to make the joke stand out but it's hard to do

milliams · on Feb 2, 2022

https://en.wikipedia.org/wiki/Poe%27s_law strikes again.

truth_revealer · on Feb 2, 2022

[flagged]

dismalpedigree · on Feb 2, 2022

In standard conspiracy theory fashion, short on who and why. Two parts that are critical.

People talk. The more outrageous the conspiracy, the more people need to be involved. Someone always talks.

Then lets understand how these people stand to benefit. Nobody is going to extraordinary lengths just to burn it all down.

I love me a good conspiracy theory and playing “what-if”, but the things i take action on are the ones which pass my two criteria above.

mrweasel · on Feb 2, 2022

It's a nice write up, but the "Lessons learned" is terrible, it appears as if they didn't learn anything and view the whole thing as an infrastructure problem.

As others pointed out, why does Firefox even need to communicate with Mozilla services? Sure, telemetry needs to feed data back, if enabled, but if that fails why does it need to stop the browser from working?

Shouldn't the lesson learned be: The telemetry functionality in Firefox has a bug, where an infrastructure outage at Mozilla can "break" the browser. The fix isn't in infrastructure, the fix has to be in the code that communicates back, it should fail gracefully. It's not a problem if telemetry fail, either cache locally and just drop the data, it's honestly not important.

I'm sorry, I get that it's interesting how and why all this failed, but Mozilla makes it seem like they don't get what the root of the problem is.

scratcheee · on Feb 2, 2022

They explained that in the retrospective.

The code _does_ work the way you describe, _except_ for the latent bug that caused the networking thread to get stuck in an infinite loop, which it was never supposed to do, even when errors occur.

It was never supposed to work that way, and the fact it did was because of a bug they'd never seen before.

So it wasn't that "oops, we shouldn't have built the system to get stuck forever when it fails" but rather "this bug triggered that bug which combined to cause a far worse result than 1 bug alone could have".

The only "lesson learnt" there is either that they need better ways to find bugs, quadruple up their thread count just so that different subsystems can't coexist on the same threads to avoid a theoretical problem that shouldn't ever happen again, or they just come up with infrastructural changes to minimise the negative results of the next "2 bugs reacted together and caught fire" scenario, which is the one they went with, and the only sane one.