Went into this with an indignant "failures in telemetry should NEVER bring down core functionality!" feeling, left with a more nuanced understanding of the fault and impressed and reassured by the mitigation steps being taken. That's a great post-mortem.
I felt a similar way when it first happened. I was fuming "an update should never break your core functionality across all devices!" only to find that it wasn't actually the latest update but rather code that had been there for ages.
It's hard to hate something once you truly understand it, I guess.
That would be significant extra engineering effort without much clear benefit. It's always possible to look back and say "if only they'd done that", but this bug was a freak coincidence and it's not possible to foresee things like that, nor is it worth the engineering effort to try to avoid them with hammers as big as "move all telemetry to another process". You'd need a much stronger reason than "what if telemetry happens to trigger a longstanding bug in the network stack" to decide to go with that.
In this case, that the problem was triggered by telemetry was a coincidence; it would've been triggered by other bits of code in the future as more components moved to Rust.
It would have made no difference: the hang was in the networking code & process, not the telemetry code (and whatever process that is in). Anything that sent a message with the "bad" header type would have provoked the hang.
As to why, dunno, presumably it's extra effort for an unclear gain. Telemetry code wouldn't be parsing hostile input etc. And it doesn't stop bugs like this either.
Unfortunately, a good fraction of the commenters here don't seem to be doing the same. There's a whole pile of people throwing around "failures in telemetry should NEVER bring down core functionality!"...
I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.
To me, this is a perfectly valid write-up with good lessons learned. They have written it in a very diplomatic way, but to me, it is absolutely clear that Google screwed up here. How can you make such a change to a default behavior of critical infrastructure unannounced? That's just reckless towards your customers, and solidifies my belief to stay away from GCP.
If they had properly announced the change, even if the Firefox team hadn't then tested beforehand, at least the DevOps team would have put two and two together and just changed back to HTTP/2, and the outage would have lasted maybe 10 minutes. Instead, they frantically went through their git log to see what in the code base might have triggered this bug. Everyone who has been in such a position knows how incredibly stressful this is. I'd be absolutely livid at Google in their position. That it took two hours to fix this is clearly their fault.
> I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.
They set themselves a higher standard by marketing as the good guys who fight for the user, and then made any number of moves that said users viewed as not being in their interests. Of course they get more blame. Like, Chrome has issues, but they're issues in line with being made by an adtech company; we might be unhappy at Google breaking adblockers (https://www.eff.org/deeplinks/2021/12/chrome-users-beware-ma...), but it's not out of character. Mozilla can say "More power to you. Mozilla puts people before profit, creating products, technologies and programs that make the internet healthier for everyone." (https://www.mozilla.org/en-US/) or they can, say, make Google the default engine ($), bake in a proprietary service (Pocket), rip out features (RIP compact theme), overrule user autonomy (Want to install an extension? Better upload it to Mozilla to get signed so they permit you to run it on your own computer!), ship a marketing extension through the "experiments" feature (https://blog.mozilla.org/en/products/firefox/update-looking-...).... but not both. Either empower the user, or don't, but don't pretend to empower the user while ripping away their control.
Yep, you are correct. Each of those decisions was made over the protests of a vocal but relatively small group of users.
You can't please all people all of the time, and I agree the Pocket integration and the Looking Glass add-on were mistakes, but the other items were directly related to sustainability of the project ($, eng cycles) or user safety.
You can disagree with them as much as you like, but Firefox continues to support the ultimate in user control by releasing their product as open source. Roll your own build that doesn't require those features, sideload your add-ons, and/or fork the product.
The average Firefox user has far more control over the browser than Chrome, Edge, or Safari users do, and has the flexibility to use one of many Firefox forks that share your beef.
Since the first thing that group protested was telemetry, I don't know how we could possibly know that it's a "vocal but relatively small group of users". In general, though, "you can't please everyone, and not that many people objected" isn't really a compelling argument; the criticism is still valid, and people being unwilling to make the effort to make a fuss, fork, find workarounds, or switch browsers doesn't mean that they're okay with it. For that matter, there's not a lot of feedback in general; how many people objected, and how many said they were in favor, compared to the overwhelming majority who never said anything?
> You can disagree with them as much as you like, but Firefox continues to support the ultimate in user control by releasing their product as open source. Roll your own build that doesn't require those features, sideload your add-ons, and/or fork the product.
By that standard Chrome is a paragon of user control. Firefox, as it actually exists, is the thing that Mozilla offers users to download, claims to care about user empowerment while constantly reducing users' power.
In fairness, it's hard to tell what's Firefox throwing away the thing that made them special vs Google abusing its monopoly position to push its way into the browser market.
I agree that Google is at fault here for failing Firefox. But Firefox is guilty of failing its users. Why should the functioning of a browser be dependent on telemetry working? It sounds like if there is high enough latency in their telemetry, or if request for telemetry start failing, it's possible for that to disrupt using the network stack at all. They have a massive design flaw, and they didn't even mention that in the article. Maybe they have good reasons for designing a single point of failure that relies on a cloud provider, but it's not clear what those might be since they don't address it.
>> Why should the functioning of a browser be dependent on telemetry working?
That was my thought after reading the start of it. Like "Oh no, Firefox has fallen into that void where their need for telemetry trumps users". Another product falling down at doing its primary function. But after reading the entire report that's just not fair at all. A bug relating to telemetry and their network stack caused failure in that networking code which affected everything. That is entirely different than software depending on telemetry to function properly. It wasn't by design that failing to phone home broke the software, it really was just a bug - a fairly obscure one. Sounds like if someone wanted they could just as easily blame the use of Rust in Firefox since some of the code involved was written in Rust. But that's not a fair or accurate conclusion either.
> Why should the functioning of a browser be dependent on telemetry working?
It isn't. The bug was in the networking stack, and it just happened to be triggered by a GCP change which affected the telemetry service. Firefox having telemetry has nothing to do with the issue here.
That's not quite right. A single socket thread does all the requests and telemetry is multiplexed with user traffic. If telemetry is different in some way to other network traffic, then it's always possible for it to cause problems with user traffic.
Telemetry is different to user traffic - it's less important! - but of course any in-process QoS would still create a point of interaction with user traffic.
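A toy sketch of that interaction point (nothing to do with Necko's actual code, just the failure shape): one thread drains a single queue of requests, so any handler that never yields starves everything queued behind it, user traffic included.

    use std::sync::mpsc;
    use std::thread;
    use std::time::Duration;

    enum Req {
        User(&'static str),
        Telemetry,
    }

    fn main() {
        let (tx, rx) = mpsc::channel();

        // Single "socket thread": every request, user or telemetry,
        // is serviced by this one loop.
        thread::spawn(move || {
            for req in rx {
                match req {
                    Req::User(url) => println!("fetching {url}"),
                    // A handler that spins forever starves every request
                    // queued behind it -- user traffic included.
                    Req::Telemetry => loop {
                        std::hint::spin_loop();
                    },
                }
            }
        });

        tx.send(Req::Telemetry).unwrap();
        tx.send(Req::User("https://example.com")).unwrap(); // never serviced
        thread::sleep(Duration::from_secs(1));
    }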
So you're saying that Firefox did not in fact have an outage due to a change in their telemetry servers? That's not what the article said.
I understand that you mean to say that it isn't intended for networking to be taken down by telemetry. That's nonetheless what happened, and it could have been prevented by treating telemetry as a different class of traffic (not collocating it with normal requests), or by not having it, as others point out.
So you're saying telemetry should be handled as a separate process that has nothing to do with the rest of the browser, and treated like a hostile service? Because that's the only way you'd have avoided this.
It's natural for all the network stuff that goes on inside a browser to share code. You can say what you want about telemetry (I'm not a huge fan, personally), but this was a dumb bug and it is completely unreasonable to expect some kind of adversarial design "just in case a freak bug triggers on telemetry network requests".
> So you're saying telemetry should be handled as a separate process that has nothing to do with the rest of the browser, and treated like a hostile service? [... T]his was a dumb bug and it is completely unreasonable to expect some kind of adversarial design "just in case a freak bug triggers on telemetry network requests".
I absolutely agree that this is a dumb bug having little to nothing to do with telemetry. It is not even the first case-sensitivity HTTP/3 bug I’m personally encountering in the course of completely casual use[1]. Probably not the last, either, those joints ain’t gonna oil themselves.
At the same time, you know what? I’m glad you suggested this, because I certainly didn’t think of it. Yes, in an ideal world, telemetry absolutely should be a separate process (or thread, or at least not share an event loop—a separate “hang domain”, a vat[2] if you want). And so should everything else off the critical path.
I’m not saying Firefox is bad for doing it differently. I’m saying it’s silly that Firefox is forced to play OS to such an extent because the actual one isn’t up to its demands.
They're saying what is clearly explained in the article:
“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”
Yes, but the fact that telemetry is in place was the cause of the issue.
> So you're saying that Firefox did not in fact have an outage due to a change in their telemetry servers?
Not the telemetry code. Not the fact that it "could" happen elsewhere. But rather the fact that it was in place and in this instance happened because of it.
Not that it matters that much. Regardless of the particular cause, a browser failing to work because of something changing externally is crazy (at least to me), no matter how you look at it.
How do you reach that conclusion? From the article:
> It just so happens that Telemetry is currently the only Rust-based component in Firefox Desktop that uses the [viaduct/Necko] network stack and adds a Content-Length header. This is why users who disabled Telemetry would see this problem resolved ...
The article contradicts your conclusion. If Firefox did not have telemetry, the bug would have had no impact, and users would not have suffered an outage.
> ...even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.
And then the article contradicts itself and agrees with you using some heavy-duty doublethink. Sure, if there were hypothetically other Rust services using the buggy network stack, they'd also have hit the bug: BUT THERE ARE NONE. The bug was in code which is only running because it's used by the telemetry services, so even though it might be in a different semantic layer it's the fault of the browser trying to send telemetry.
As a user, I place very low (often negative) importance on the tools I use collecting telemetry data, or on protecting DRM content, or on checking licensing status. They should focus on doing the job I'm trying to do with them on my computer, serving the uses of the user, rather than doing something that someone else wants them to do. Sure, I understand that debugging and quality monitoring are easier with logs and maybe with telemetry, so I can understand using a few resources in the background to serve some of that data, but it must never get in the way of actual work getting done.
> The article contradicts your conclusion. If Firefox did not have telemetry, the bug would have had no impact, and users would not have suffered an outage.
This is your mistake: as explained in the article, it could have affected any component. Telemetry happened to hit it first but anything using HTTP/3 with that path would have been affected.
“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”
> ...as explained in the article, it could have affected any component. Telemetry happened to hit it first but anything using HTTP/3 with that path would have been affected.
Is this really relevant, though? To the users who were unable to use their browsers normally it doesn't matter that this problem could have occurred elsewhere as well, but rather that it did occur here in particular.
If particular sites would break, then that could be debugged separately, but as it stands even people who'd be perfectly fine with browsing regular HTTP/1.1 or HTTP/2 sites were also now impacted, not even due to opening a site that they wanted to visit themselves, but rather some background piece of functionality.
That's not to say that I think there shouldn't be telemetry in place, just that the poster is correct in saying that this wouldn't be such a high visibility issue if there was no telemetry in place and thus no HTTP/3 apart from sites the user visits.
The comment I was replying to was worded in a way that tried to attribute blame to the telemetry service. As shown in this thread, there's a certain ideological position which welcomes any attack on telemetry, and I think that's a distraction from the technical discussion about how Mozilla could better have avoided a bug in their networking libraries. Recognizing this as a bug in the network stack first triggered by Telemetry makes it clear that this is not the place to have the millionth iteration of flamewars about that service, but rather to ask questions like the design of that network loop or the lack of a test suite covering the intersection of those particular libraries.
> Recognizing this as a bug in the network stack first triggered by Telemetry makes it clear that this is not the place to have the millionth iteration of flamewars about that service, but rather to ask questions like the design of that network loop or the lack of a test suite covering the intersection of those particular libraries.
Surely one could adopt a "shared nothing" approach, or something close to it - a separate process for the telemetry functionality which only reads things from either shared memory or from the disk, where the main browser processes could put what's relevant/needed for it.
If a browser process fails to work with HTTP/3, I don't think the entire OS would suddenly find itself not having any network connectivity. For example, a Nextcloud client would still continue working and synchronizing files. If there was some critical bug in curl, surely that wouldn't necessarily bring down web browsers, like Chromium, either!
Why couldn't telemetry be implemented in a similarly decoupled way and thus eliminate the possibility of the "core browser" breaking due to something like this? Let the telemetry break in all the ways you're not aware of but let the browser continue working until it hits similar circumstances (if it at all will, HTTP/3 isn't all that common yet).
I don't care much for flame wars or "going full Stallman", but surely there is an argument to be made about increasing resiliency against situations like this one. Claiming that the current implementation of this telemetry is blameless doesn't feel adequate.
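A minimal sketch of that "shared nothing" shape (hypothetical paths and helper binary, not anything Firefox actually ships): the browser side only appends to a spool file and fire-and-forgets a separate uploader process, so a hang in the uploader can't touch the browser's socket thread.

    use std::fs::OpenOptions;
    use std::io::Write;
    use std::process::Command;

    // Browser side: append the ping locally, never block on the network.
    fn record_ping(json: &str) -> std::io::Result<()> {
        let mut spool = OpenOptions::new()
            .create(true)
            .append(true)
            .open("/tmp/telemetry-spool.ndjson")?; // hypothetical spool path

        writeln!(spool, "{json}")?;

        // Hypothetical helper that reads the spool and uploads it.
        // If it deadlocks on a bad HTTP/3 exchange, only it is stuck.
        Command::new("telemetry-uploader")
            .arg("/tmp/telemetry-spool.ndjson")
            .spawn()?;
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        record_ping(r#"{"event":"startup"}"#)
    }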
> I can understand using a few resources in the background to serve some of that data, but it must never get in the way of actual work getting done.
Which is exactly how the code was intended to work. Firefox did not design their software to hang in the event of telemetry losing internet access.
I don't know Firefox's internal architecture or its development; what follows is pure conjecture.
Their intention seems to be to slowly migrate the codebase from C++ to Rust. That telemetry is the only function so far to rely on their new Rust networking library viaduct (and thus trigger the bug) could be because they wanted to use their least important functionality as a test bed. In which case, if there wasn't any telemetry, a different piece of code would have been migrated to Rust first and triggered this same bug. Without the telemetry, it would have presumably taken them longer to realise that things had broken, let alone resolve it.
Firefox also said that this switch to default was an unannounced change. But a quick Google shows that it was announced:
> In the coming weeks, we’ll bring HTTP/3 to more users when it's enabled by default for all Cloud CDN and HTTPS Load Balancing customers: you won't need to lift a finger for your end users to start enjoying improved performance.
In their blog on June 22, 2021 [1]. It probably should have been its own standalone message sent to users (a "this should be a no-op" email), but to claim that it was unannounced is misleading.
That’s half a year earlier, and it’s described as an opt-in change until the very end, where it’s mentioned as a default changing in a few weeks. That’s far different from what, say, AWS does, proactively sending email and SNS notifications with a time range and usually listing the affected instances.
You expect everyone to read the google cloud blog? The distinction between "unannounced" and "not usefully announced" isn't of merit. If they did not specifically make their affected customers aware of the change and when it would actually happen, it was unannounced. And caused a major outage for at least one of their customers.
This afternoon I tried to clone a git repo which, in the morning, was highlighted as containing a useful example to start from in the work I had targeted next.
The clone failed with a mysterious error. After some minutes I checked the accompanying web site. The web site failed too, but, on refresh, this time I got a holding page explaining that the service was down. So I check the overall ticket system, and I find a change ticket, for the git system, saying there is planned maintenance, at 8am for one hour. Unadvertised because hey, it's 8am, most people aren't at work at 8am and this is a regular (Wednesday 8am) maintenance slot.
And I scroll down and I find that nobody remembered to actually do the task. They wrote it up, submitted, got it OK'd and then, eh, never did it. By the time the people who were supposed to do it were reminded it was 9am already. So, astoundingly, the service owner OK'd just doing it after lunch instead.
That failed 8am change was actually a re-run, of a re-run, of a re-run, of an upgrade that keeps failing and definitely takes over an hour to complete.
So instead of "It's fine to do this when nobody is at work and it's low risk" suddenly "It's fine to do this for 2 hours in the middle of the working day, though it'll probably fail and we have no roll back plan".
That's pretty shoddy. Glad to know an "Enterprise" cloud offering is hardly better.
> They have a massive design flaw, and they didn't even mention that in the article.
From the article:
> This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.
Telemetry traffic is multiplexed with user traffic on a single socket thread, per the article. That creates a single point of failure where telemetry can affect user traffic.
Of course all network access is shared for a machine so it's not possible to not have a single point of failure, but there are different ways of slicing up the access.
You're grasping at straws with this argument. That it shares a thread is a technicality. I'm sure the socket management is asynchronous and telemetry wouldn't normally affect normal traffic. This was an infinite loop bug. What if it had been a memory corruption bug instead, would you be saying that telemetry needs to be a separate process, not just a separate thread? The design was reasonable. Dumb bugs can happen anyway and cause things not to work as designed. That's what happened here.
What happened to the famed intelligence level of Hacker News? Every single person (almost) in this thread is blaming telemetry while it was clear even when the bug was ongoing that it was unrelated to telemetry. I for example had had telemetry disabled and still hit the bug through other traffic and had to temporarily disable HTTP3 from about:config.
> I'd claim that no other company is criticized as harshly as Mozilla around here. The amount of blame that is assigned to the Firefox team is staggering.
Mozilla has opened themselves up here, as they market themselves as a privacy-respecting and user-respecting alternative, so when they fail to live up to their own marketing people are more annoyed, while they expect random startup #456 to not care about their users' privacy and to have telemetry out the wazoo.
(I used to work for Mozilla, and spent a few years on the team that at the time owned the telemetry component.)
The way Mozilla does telemetry is different from how most places do it.
I think that the biggest issue with these discussions is that there always seems to be this assumption that there is only one way to do telemetry, it always contains super invasive PII, Mozilla's telemetry must do the same, and therefore Mozilla's telemetry is just as evil as anybody else's.
Mozilla is remarkably open about how its telemetry works, beyond just being open source. Maybe this is more a problem of that information not being surfaced well, I dunno. I get that some people are philosophically opposed to telemetry no matter what, but I have seen enough cases of, "Wow, I didn't know it worked that way, and I'm actually okay with this," to know that informed users are not universally opposed to it.
All network requests share at least the IP address, which is PII, and should only happen after obtaining informed consent unless they are required for the requested user action. Since telemetry would be pointless if it were the same for everyone, there will inevitably be more information that can ultimately be used to identify users. You can argue as much as you want that you are doing it "better" than others (there is always someone worse, and the software industry's disregard for user rights is well known) or that it is useful (many unethical actions can be useful, but the ends do not justify the means, especially when alternatives like bug reports often go ignored), but that does not change the fact that you are sharing PII without informed consent.
More importantly, Mozilla knows that there are people who do not want them to upload this information, yet they continue to do it anyway by default without ever asking for consent. Worse, Mozilla keeps adding new leaks that concerned users will have to watch out for and disable after each update. This is of course by no means a problem unique to Mozilla - the software industry as a whole has not yet learned that no means no - but it is also a Mozilla problem, and as long as they want to use privacy to market their software they will rightfully receive the loudest criticism. Thankfully laws are beginning to catch up with the digital age, and people will have better recourse than asking software vendors nicely not to mistreat them.
If you're running Firefox, you can go to `about:telemetry` and see what data is there. Note that some of that data might be populated even if you have telemetry turned off. Don't reach for your pitchfork quite yet: I assure you that the data isn't being sent.
I meant something documenting how telemetry works, though perhaps the source tree is the source for that?
> Don't reach for your pitchfork quite yet
No need to assume people are out of their minds or even critical. I am curious about how it's done on a technical level, with the old local ad system in mind (which I thought was a brilliant solution to Internet commerce and privacy). I've supported and contributed to Mozilla since before Firefox.
Yeah, please don't take the pitchfork thing personally, that was more intended for anybody reading that comment who immediately assumes the worst.
As for high level docs about how it works, I haven't been involved in quite a few years, so I'm not 100% sure about the best source, but this link looks like a good place to start:
> Instead, they frantically went through their git log to see what in the code base might have triggered this bug.
This seems like you're embellishing this part to tell a story? It's not supported by the linked post, and from the bugzilla bugs it seems like it was known almost immediately that the ESR builds were affected as well and so it almost had to be an external service, they just weren't sure which one at first.
God I hate the modern web. Why the hell does my browser need more than periodic contact with any server other than my DNS provider and the host of the website I'm connecting to?
Even DNS is pretty WTF worthy if you think about it. Contacting a third party about the website you're about to visit. On an unencrypted channel no less.
Another third-party request is Firefox's phishing/malware protection. It periodically downloads their own bad-site collection, and when you visit a site it checks whether it's on the list. And if it matches, it checks with Google whether the site is okay.
Wow, I did not realize it would send URLs to Google. That does not sound GDPR-compliant.
But even without the privacy issue, you should turn off safe browsing because Google should not be in control of what users can and cannot download. They clearly do not care about keeping that list free of false positives, for example: http://dege.freeweb.hu/dgVoodoo2/
> It does NOT contain any malware. Use a browser that is free of Google Shit Browsing security service crap (which is based on tons of noname antivirus "engines", look at VirusTotal if interested).
I have also experienced Google's disregard for false positives on that list myself. While they may "remove" false listings after you bug them, those entries will just be re-added the next week, and of course because this is Google there is no way to get an actual human to look into it. It is insane that all browsers allow a private company to maintain such a list without complete transparency and publicly visible reasoning for why each entry is in it, as well as well-defined procedures to contest false positives with, again, publicly visible reasons for denial.
"One of the most persistent misunderstandings about Safe Browsing is the idea that the browser needs to send all visited URLs to Google in order to verify whether or not they are safe."
Apparently it is how it used to work though for some settings, which is bad enough, and still sends partial hashes in some cases which can leak some information. Even just making any connections to Google (or another provider) without the user explicitly visiting a Google website is a privacy issue that Firefox should resolve.
It also does not address the problem of making Google the gatekeeper deciding what you can and cannot download - and don't tell me it's just a warning; it's set up in such a way that regular users will often not even know that they can bypass it.
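For what it's worth, the partial-hash scheme mentioned above works roughly like this (a sketch using the sha2 crate, not Google's actual client code): hash the canonicalized URL locally, and only talk to the server at all when a short prefix matches the locally cached list.

    use sha2::{Digest, Sha256};

    /// Only URLs whose 4-byte hash prefix matches the locally cached
    /// list trigger a full-hash request to the server at all.
    fn needs_server_lookup(url: &str, local_prefixes: &[[u8; 4]]) -> bool {
        // Real clients first canonicalize the URL into several
        // host/path expressions; skipped here for brevity.
        let digest = Sha256::digest(url.as_bytes());
        let prefix: [u8; 4] = digest[..4].try_into().unwrap();
        local_prefixes.contains(&prefix)
    }

    fn main() {
        let cached = vec![[0x11, 0x22, 0x33, 0x44]]; // toy local list
        println!("{}", needs_server_lookup("http://example.test/", &cached));
    }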
"One of the most persistent misunderstandings about Safe Browsing is the idea that the browser needs to send all visited URLs to Google in order to verify whether or not they are safe."
"Need" is the key phrase. If it wants to check for updates, do telemetry, fiddle around, whatever, that's fine by me. It should not shit the bed if it has to go a few hours without doing so, however. For example, you recall a couple of years ago when Mozilla screwed up their certificates and every Firefox extension was simultaneously disabled. That should not have happened. That should have resulted in a pop-up box that said "The security of your extensions cannot be verified. Using them at this time could be highly dangerous. Disable them to browse safely? Yes/No" Instead Mozilla just blanket turned them all off, which had the potential for getting people killed as they suddenly found themselves not protected by VPN/Tor/NoScript/etc extensions.
Software shouldn't have bugs, got it. It should be written by bug-free programmers that foresee every possible failure mode (that they can never imagine happening, because they won't write bugs themselves).
Hard disagree. Having that in the hands of the OS is a separation of concerns issue. I want my OS people focusing on OS stuff. The versions of installed apps is up to the developers of those apps, doubly so on something like a browser that updates rapidly.
That was an extremely high risk change on GCP's part, reminds me of the App Engine days when you'd wake up to find a totally healthy program spamming 500s because they'd made a breaking change without any announcement. It's shocking they're still pulling stuff like this in 2022
It reminds me how YouTube enforced a new codec with a few days' notice knowing that Firefox didn't support it, so FF couldn't play most YT videos for over a week.
Not all codecs were ported at the same time, and they weren't enabled by default at first; when they were enabled by default it was platform dependent, and even where it did work, FF was eating all available CPU and videos were glitching. I remember this as I was using Debian and I was active on /r/firefox, where this [1] link was posted 10 times every day
Feels like an especially severe version of the consistent Google pattern of only testing stuff on Chrome, so new updates/features ship in a way that is some degree of broken on Firefox/Safari. For a significant amount of time YouTube had bad performance on Firefox because they chose to use Web Components by default with a horrible polyfill instead of using the old (still working!) html5 version that ran great.
Putting in place an infrastructure to test this kind of change on the 5-10 most popular browsers would be, I think, very cheap for a company like Google. I can't help thinking these may be deliberate moves to erode the little market share of Chrome's competitors.
I remember reading here on HN an article written by an ex-Mozilla insider relating the dissonance between the "friendly" Mozilla-Google employee exchanges and the years-long track record of very oddly recurrent "unfortunate mistakes" from Google degrading Firefox compatibility.
Yes - they clearly don’t test the GCP console in Firefox since they “accidentally” break it on a regular basis, and there’s just no excuse for that happening at such a rich, well-staffed company.
> Putting in place an infrastructure to test this kind of change on the 5-10 most popular browsers would be, I think, very cheap for a company like Google.
The problem wasn't some web server, it's the Firefox backend services running on GCP.
Testing wouldn't have revealed anything because this didn't break with Firefox outright, it only broke when Firefox telemetry used it due to a complex series of circumstances.
> That was an extremely high risk change on GCP's part, reminds me of the App Engine days when you'd wake up to find a totally healthy program spamming 500s because they'd made a breaking change without any announcement. It's shocking they're still pulling stuff like this in 2022
Lay with the dogs, wake up with the fleas.
Google is a shitty company producing shitty products. When you select to do business with Google you select to do business with a shitty company producing shitty products and treating its customers like shit. Hence I fail to understand the Surprised Pikachu face when something like this happens.
The client, Firefox, said it supported HTTP/3 though. Otherwise it wouldn't get to use that.
I don't think that's as bad as you try to make it... if the client says it supports something then it breaks when it uses it, it's the fault of the client, not the server.
No SRE in the world that is halfway decent at their job would think that way. You never make assumptions about any kind of change, much less a global change to a completely different protocol. Doesn't matter whose fault it is. You just don't introduce any change that has a chance of unexpected behavior without rigorous testing, and you roll it out g r a d u a l l y, and you stop when error rates increase.
Google literally wrote the books on SRE. For them to not know better is absurd.
> Would a warning have even helped that much? Since HTTP/3 was expected to be working there wouldn't be a cause to worry.
It might (as mentioned in TFA) have made them think to run some extra tests, which could have caught the bug. But it also would have made the response faster, as they would have known what changed far sooner.
When you do Operations for a living, the only thing you can expect is the unexpected. That's why even after you think you've tested a change, you carefully and slowly roll it out a bit at a time, monitoring golden metrics so you can detect a problem, stop the roll-out, and roll back.
It sounds like somebody just flipped a giant switch and never checked error rates, connection metrics, anything. Check out this graph: https://hacks.mozilla.org/files/2022/01/crashes-foxstuck2-20... Think maybe that would indicate somebody needs to roll back the last change?
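The practice being described reduces to something like this toy loop (illustrative numbers only, not anyone's real rollout tooling): expose a slice of traffic, watch the error rate, and stop widening the moment it moves.

    // Toy staged rollout: widen exposure only while errors stay in budget.
    fn rollout(stages: &[u32], error_rate_at: impl Fn(u32) -> f64) {
        for &pct in stages {
            println!("enabling HTTP/3 for {pct}% of traffic");
            let rate = error_rate_at(pct);
            if rate > 0.01 {
                println!("error rate {rate} at {pct}%: halting and rolling back");
                return;
            }
        }
        println!("rollout complete");
    }

    fn main() {
        // Pretend errors spike once half of traffic is on the new path.
        rollout(&[1, 5, 25, 50, 100], |pct| if pct >= 50 { 0.2 } else { 0.001 });
    }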
The problem here, as usual, is a disconnect between stakeholders. Google has this service (it seems like the load balancer for their customer?) it wants to change for one reason or another. The customers may or may not have planned for the change Google is making. Google makes the change, but it isn't a stakeholder of the customer (they basically don't care what happens to the customer). So there is no direct feedback loop for the customer to tell Google something is wrong.
If Google was at risk of losing business from its customers going down, it would have a strong relationship with those customers and have a way to quickly help diagnose problems and roll back changes if needed. This is a great lesson for all customers to take away: don't depend on people who you don't have a close relationship with.
A lot of intelligent-sounding words to explain a trivial bug: their network stack was using case-sensitive header names, which anyone doing anything remotely related to HTTP knows is a mistake.
We all make mistakes, but don't try to make it sound more grandiose ("a combination of multiple factors blah blah blah") than it is.
HTTP/2 and HTTP/3 are case-sensitive when encoding the headers:
> As in HTTP/2, characters in field names MUST be converted to lowercase prior to their encoding. A request or response containing uppercase characters in field names MUST be treated as malformed
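In code terms the rule quoted above is tiny (a generic sketch, not neqo's actual implementation): lowercase on encode, treat uppercase on receipt as malformed.

    // Normalize before encoding, per the RFC text quoted above.
    fn encode_field_name(name: &str) -> String {
        name.to_ascii_lowercase()
    }

    // A received field name containing uppercase makes the message malformed.
    fn is_malformed(name: &str) -> bool {
        name.bytes().any(|b| b.is_ascii_uppercase())
    }

    fn main() {
        assert_eq!(encode_field_name("Content-Length"), "content-length");
        assert!(is_malformed("Content-Length"));
        assert!(!is_malformed("content-length"));
        println!("ok");
    }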
"Because all network requests go through one socket thread, this loop blocked any further network communication and made Firefox unresponsive, unable to load web content."
Why does a tool's side functionality (telemetry) share the only network thread, such that it can block all network communication?
Yes, that's another issue. But as a design pattern, shouldn't we design our products to perform their core function as independently as possible from any anomalies that can happen?
This is almost akin, to me, to Tesla rolling out an update and the car deciding to pull over to the curb to do the update while you are driving to your job, or worse, to hospital with an emergency. My theory is there's at least one health enterprise out in the wild using Firefox as their only browser for business functionality.
> But as a design pattern, shouldn't we design our products to perform their core function as independently as possible from any anomalies that can happen?
For expected anomalies, like telemetry being down or taking a long time to respond, Firefox is certainly already designed like that.
This was not that. This was a bug. There is no magical design that avoids bugs.
> This is almost akin, to me, to Tesla rolling out an update and the car deciding to pull over to the curb to do the update while you are driving to your job, or worse, to hospital with an emergency.
Firefox is not a car. You're going to have to get Mozilla a lot more funding if you think the browser should be designed with extreme resilience in mind as required for life-critical applications. If a health enterprise is using Firefox in a life-critical role, that's kind of their responsibility, not Mozilla's.
In theory, that's how it happens in Firefox. But when you have a bug in the core of the product (the network stack), there isn't much that the rest of the product can do to isolate from it.
Yes, that's what you get when you have one thread for all network communications. The network stack did not fail; only the sole network thread got stuck. From the write-up I understand that if there were a separate thread for those communications, Firefox would only have failed to communicate with the telemetry service but would otherwise have functioned as users needed.
"This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise."
It was a bug. It can't block network communication normally. It's not like telemetry serializes with normal user traffic. It was just a stupid infinite loop that broke what otherwise would've certainly been nonblocking multiplexing of requests.
I'm not saying this was actually a good design, for obvious reasons, but one decent reason to do it this way is so things like proxy settings are shared.
When you read the whole account, everything has a justifiable reason and the entire thing is very rational. But if you look at the 10000ft view, if an app hangs and refuses to work at all due to nasty/unexpected input coming over the network, that is just a bad bug. These sorts of bugs should have been caught earlier. It shouldn't matter if the entire internet is sending bad responses, Firefox should still handle it gracefully.
It does not need this big of an explanation. It's just a silly bug, they can do better by improving their testing. The lady doth protest too much.
I think the length of the explanation is due to the impact this had. Most Firefox bugs don't cause all Firefoxes everywhere to stop working simultaneously. That's quite the WTF and it's hard to understand how it is even possible without the explanation.
It should be impossible for the phrase "the recent Firefox outage" to make sense. Has there ever been an "outage" of linear algebra? Of Linux? Of mitochondria? Of Bitcoin?
It is critically important that we not introduce new single points of failure into the systems that our civilization depends on, and that we remove the ones that already exist.
If it can happen by accident, it can happen on purpose.
While I agree with the SPOF point (heh), the header should really read “the recent Firefox Telemetry outage”. It's kind of like a website causing your browser or a tab of the browser to be unresponsive by introducing an infinite busy loop. Except the “tab” was invisible.
This is wrong on two points, based on the other discussion and the post itself.
There was no Telemetry outage. An HTTP/3 response header’s case was changed by a third party without notice. Telemetry continued working, other than the case change triggering a bug.
There was an HTTP/3 infinite-loop bug in Firefox that hung all networking. Many different things could have triggered the bug once it was introduced. Telemetry happened to be the first thing to do so, but not due to any faults in Telemetry’s code or implementation.
The infinite busy loop in this case was not the tab no (neither visible or invisible). The loop was directly in the network stack, as stated in the post, not in the caller.
The problem isn't that there was a telemetry outage; the problem is that the telemetry outage caused a Firefox outage, which should not be a thing. Firefox needs to be more robust than that.
What do you mean that's not what happened? Was there a "Firefox outage" or not? Are you disputing claims that the engineering team made about Firefox becoming unusable for users "for close to two hours"?
That's my point. A "telemetry bug" didn't make Firefox unusable, a networking bug that was triggered by a telemetry bug did. But it could just as easily have been triggered by anything else.
First, you're going anachronistic. They didn't write "telemetry bug".
Secondly, "cause" doesn't automatically mean "root cause". (That's the entire reason we distinguish between the two by qualifying the latter to begin with.) It's perfectly reasonable to say "A caused B" even if the root cause lies elsewhere, with C.
Thirdly, none of this matters. It has no impact on the point being made by the person you responded to, which—to repeat—is that:
> It should be impossible for the phrase "the recent Firefox outage" to make sense.
> It should be impossible for the phrase "the recent Firefox outage" to make sense.
It makes perfect sense in a world where half the internet is going through Google / Cloudflare / Amazon / Akamai servers or some combination of the above, and they decide to roll out brand-spanking-new protocols to half of the internet at once. Sometimes that's going to break clients.
I don't like that world very much, but it's the one we live in.
... due to code in Telemetry being different and triggering different code paths in the network stack.
There is a why here, and it includes Telemetry mixed traffic as a potential culprit. There are reasons to unify traffic (proxy support, QoS and whatnot) but unification of the user and Telemetry streams isn't without risk, as has been shown.
> unification of the user and Telemetry streams isn't without risk, as has been shown
A constant refrain over the last 10 years or so of Mozilla's descent while trying to justify the removal of features from Firefox has been that not doing so unnecessarily bloats the surface area of the codebase, and specifically that this increases the chance of vulnerabilities and defects.
Will the same argument be applied here, now with a case in hand, to justify the removal of telemetry, too?
It's the same app. I don't get why you're replying to every thread trying to somehow argue that sharing a thread for all network code is a bad thing and telemetry needs to be a special snowflake that gets a different thread. The networking code had an infinite loop bug. It was triggered by telemetry, but it could've been anything. Telemetry getting its own network thread wouldn't have magically made it impossible for it to cause problems. Bugs happen, and sometimes make things interact in weird ways.
> there were many contributing factors working together
Looks like one factor, to my eyes: telemetry.
I have telemetry disabled. But if you're going to default to "telemetry on", and then silently send data to sites that aren't in the address-bar, then it's your responsibility not to "break the web". You can't blame it on rust, or necko, or viaduct, or google.
> This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.
It's pretty clear from the article that this was a bug, and telemetry requests failing was not intended to break the rest of the browser.
If Google had rolled out this change a year from now, it could probably have broken something other than just telemetry (e.g. maybe update checks, or certificate management) and your browser would still have been broken even with telemetry disabled.
> It's pretty clear from the article that this was a bug, and telemetry requests failing was not intended to break the rest of the browser.
If telemetry didn't exist as core part of the browser, nothing would've broken.
Therefore, the telemetry itself is the direct cause of the bug. It was, at best, poorly handled and too deeply integrated into the browser's core function.
There was no 'Firefox Outage' because Firefox is not a service. There was a bug, and a production issue with a service that Firefox users were involuntarily opted into.
Lesson learned: do not opt your users into services without their consent.
It could have been, but it wasn't. See, if I had opted out of this junk, which I hadn't because it was enabled without my consent, I would not have experienced that particular problem (but others would have) and I would have been able to save myself a couple of hours of debugging.
So yes, it wasn't limited to Telemetry, but no I had not seen the bug in practice until that very moment.
I haven't noticed anything and I use Firefox every day. Nor have any of my clients where I deployed Firefox called. Is this because I always disable data collection in settings?
Yes, the bug was in the Telemetry code. I'm not sure if it's on by default, but it's probably better to disable it for any large-scale deployments. Both to prevent things like this and to make sure that things that should be disabled by default actually are.
Ah, that explains it. Thank you. I recommend https://ffprofile.com/ which was posted earlier here on HN. Makes it easier to deploy Firefox with saner defaults.
You know I moved from Netscape 3.0 Gold to later versions, to Mozilla, to Phoenix, Firebird, and then Firefox. I tried other browsers but it's always a subpar experience for me. My only gripe is that they kept changing the UI.
There's a lot of FUD and paranoia out there; 99% of exploits need JS and even those which don't technically need it, are almost always obfuscated using JS.
Leave JS off by default (there are extensions to do that) and don't turn it on unless you really do trust the site to run arbitrary code on your computer, and you're unlikely to encounter any problems.
Why does my browser need connectivity to some internal services? I am fine with offering opt-in service integration (Firefox Sync, Pocket, ...) but is there a reason why Firefox needs internal infrastructure to do the one thing it is supposed to do, browsing the web? I can only think of DNS over HTTPS, but AFAIK that is also opt-in, right?
Man, I love Firefox and used it since it was called Firebird (with a small gap when Chrome was shiny and new and Firefox a slow RAM hog). But I really resent the Mozilla Foundation, they seem to be interested in everything but browser development. To be fair, (ab)using the browser as application runtime brought us so much complexity that developing and maintaining a secure browser as free software spare time project isn't feasible anymore.
> Why does my browser need connectivity to some internal services? I am fine with offering opt-in service integration (Firefox Sync, Pocket, ...) but is there a reason why Firefox needs internal infrastructure to do the one thing it is supposed to do, browsing the web?
If you read the whole post, the connection was explicitly for telemetry (and so you could avoid the issue by turning off telemetry), and it blocked other connections because the request went into an infinite loop rather than failing outright.
> If you read the whole post, ... (and so you could avoid the issue by turning off telemetry)
Speaking of reading the whole post:
>> users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.
This does not make sense to me:
>> Without the header, the request was determined by the Necko code to be complete,
This is written as if it makes sense to treat a request as "complete" when it's missing a content length header. Huh?!
At this point, the code relied on the Content-Length header being present because the higher-level API was supposed to add it. The field that is supposed to be populated by Content-Length (mRequestBodyLenRemaining) is pre-initialized to 0.
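A sketch of that failure shape with hypothetical names (the real field was mRequestBodyLenRemaining, in Necko's C++; this is not the actual code): an exact-match header lookup plus a zero default means a lowercased content-length quietly produces a request that looks "complete" while body bytes are still pending.

    use std::collections::HashMap;

    // Hypothetical stand-in for the request state described above.
    struct Request {
        body_len_remaining: u64, // the real field was pre-initialized to 0
    }

    impl Request {
        fn new(headers: &HashMap<String, String>) -> Self {
            let body_len_remaining = headers
                .get("Content-Length") // exact-match, case-sensitive lookup...
                .and_then(|v| v.parse().ok())
                .unwrap_or(0); // ...so a lowercased name falls through to 0
            Request { body_len_remaining }
        }

        // With the remaining length stuck at 0, the request is judged
        // "complete" even though the upload stream still has data.
        fn body_complete(&self) -> bool {
            self.body_len_remaining == 0
        }
    }

    fn main() {
        let mut headers = HashMap::new();
        // An upstream layer (viaduct, per the article) lowercased the name:
        headers.insert("content-length".to_string(), "42".to_string());
        let req = Request::new(&headers);
        assert!(req.body_complete()); // wrongly "complete" -> caller spins
        println!("request judged complete with 42 body bytes outstanding");
    }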
> users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.
All they mean by that is that it was a bug in their HTTP/3.0 code that could have been triggered by any HTTP/3.0 connection that was using that codepath. But the reason for that particular connection (which had the conditions to trigger the bug) was for telemetry.
All of those except certificate management are important but should never be required for the browser to work. And certificate management should depend on whatever chain of trust is configured, which should not invoke Mozilla as an essential party to every transaction.
Does it mean that the same blocking bug could happen while browsing a local website on an air-gapped network? Or while opening local HTML files while offline?
Firefox generally does not block if a remote connection does not work. As explained in the post, the infinite loop was a bug in the network stack itself.
So yes, you can use Firefox in any offline environment.
> should never be required for the browser to work
None of them are; disconnect from the internet and start Firefox. It will work.
It was just a bug in the Firefox HTTP 3 implementation that caused it to be rendered unusable; it just so happens that connecting to these services triggered it, but it also could have been triggered by another HTTP 3 service (as I understand it, anyway).
>All of those except certificate management are important but should never be required for the browser to work. And certificate management should depend on whatever chain of trust is configured, which should not invoke Mozilla as an essential party to every transaction.
This is the part that really gets me. For the average user, trust rests on the certificates bundled by the browser vendor (yes, you can do certificate pinning). It just seems like something like certificates for encryption ought to be split away from the browser vendor and managed by an open public repository run by a non-profit. Or put on a blockchain-type ledger. Any thoughts on that HN?
> Does it mean that the same blocking bug could happen while browsing a local website on an air-gapped network?
If you're making an HTTP/3.0 request formed "correctly" then yes, it too would cause the infinite loop. It's not in any way specific to the internal service.
Under normal circumstances it would gracefully fail. If the connection fails normally the browser will keep trucking along, the problem was a bug deep inside the network stack that could've been triggered by any HTTP/3 connection.
Some degree of this is absolutely required to be a decent internet citizen. Even if you think things like emergency configuration and basic telemetry are optional (I disagree), polling things like certificate revocation lists is basically required. Without doing it all your customers are sitting ducks for the latest security vulnerability.
She's probably earned another raise: if her actual reason for appointment is to drive Chrome adoption while soaking up community engagement to prevent another open-source browser competitor, then she's succeeding very well!
It's amazing to see how rapidly Firefox users fell out of favor with Mozilla for some reason. Their bugtrackers went from joyful to friendly to silent to openly hostile as they tore out every feature that distinguished them from Chrome.
Branding must be really important if people are expected to enjoy using a completely different product because they enjoyed the old product. MS Office doesn't expect me to do that; they give me essentially the same thing in 2022 as they gave me in 1997. They don't expect me to be loyal out of some sense of love or obligation.
Niche opinion, probably, but I think they took a hard wrong turn not later than 3.0 (yes, that long ago) and never recovered.
Here we have this bug that's "not in telemetry" (strictly true) but for which telemetry increased the severity/blast-radius from "partial failure for many users" to "complete failure for most users".
But FTP—an actual feature for users, unlike spyware "features" that keep some chart-readers employed—had to go because that's too risky to keep. OK.
A quick google search revealed FTP was supported until v90, so I'm curious as to what is that you are referring to when you say 'they took a hard wrong turn not later than 3.0 (yes, that long ago) and never recovered'.
Not related to the 3.0 release, just an example of a recent cut of an actual feature while spyware is apparently essential. IIRC 3.0 (might have been one of the 2 series?) was when the browser suddenly got a lot fatter and the UI less responsive, and never made meaningful progress back the other direction, contrary to its feather-weight beginnings which were a big part of why I loved it so much. I kept using it for quite a while longer but never loved it again.
Well. The actual lead dev got kicked out as Mozilla Foundation chair and got replaced by some SJWs over his support for banning abortions, same-sex marriages or something similar.
Does Firefox force updates on your configuration? On Windows it's opt-in for me. I know Android will forcibly update any app (including Firefox) while you're using it, but you can shut that off system-wide.
The default is that it forces a restart when a new update has been downloaded, which has been a frustration for me as apparently it has been for the parent.
Apparently this can be changed by requesting that FF only install updates with explicit consent[1]. I'd think the best way would be to install anything that's available locally when the browser starts without forcing a restart.
Are you using Linux, by any chance? IIRC, the issue here is that the update is done by your package manager, changing Firefox's files out from under them. If you use the direct download from Mozilla, it shouldn't be as disruptive.
I still hope a fix or workaround to this can be found, but knowing why something is the cause makes it easier to accept, at least for me :)
What pisses me off is less about the update itself (they're typically unnoticeable) but the constant nags and "what's new" crap that opens up after the update. Firefox is more hostile than a lot of paid, proprietary software in this regard.
This. That crap is significantly more disruptive and irritating than the actual paid ads in old-school free Opera (the largest feature-comparable browser when FF/Phoenix/Firebird first launched).
Happens to me often. I'll be browsing and then suddenly I am told that before I can view the next webpage I MUST restart. Giant PITA if I have a ton of private windows open, as none of those are coming back.
The update might be started by the distro in this case, but there is no reason that Firefox cannot just keep an fd on the resource files open and use that instead of the updated files. Either that or keep things compatible so it can use the new files. Not being able to use the browser after an update is inexcusable.
> there is no reason that Firefox cannot just keep an fd on the resource files open and use that instead of the updated files
If you check what processes you have running, you'll see that Firefox has many of them. I'm not going to grep the sources but I believe the ones with the "-contentproc" flag are started with an exec call as needed, and I'm not aware of an exec that works with fds. It requires a path, it executes the binary at that path, and that binary in turn loads a bunch of files it needs. It's all going to blow up if your parent and child processes are running different versions of the program.
Keeping parts of a program compatible with arbitrary versions of other parts of the same program is virtually impossible. Go ahead, checkout 50% of your files from some random version of your project thousands of commits ago while keeping the rest at master, and see if it still compiles and runs correctly.
> If you check what processes you have running, you'll see that Firefox has many of them. I'm not going to grep the sources, but I believe the ones with the "-contentproc" flag are started with an exec call as needed, and I'm not aware of an exec that works with fds. It requires a path, it executes the binary at that path, and that binary in turn loads a bunch of files it needs. It's all going to blow up if your parent and child processes are running different versions of the program.
That Firefox is set up in a way that does not work when the files are replaced while it is running does not mean that it cannot be set up differently. For example you could have one process paused after loading all needed libraries and opening all resource archives and then just fork from that.
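A minimal sketch of that zygote-style setup in C (file names hypothetical; the real browser would have vastly more state to initialize):

```c
/* Zygote-style sketch: open resources once at startup, then create
 * workers with fork() so they inherit the already-open fds even if
 * the files on disk are replaced by an update. "resource.dat" is a
 * hypothetical stand-in for something like omni.ja. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

static int resource_fd = -1;

static void worker_main(void) {
    /* The child only uses the inherited fd; the path may by now
     * point at a newer, incompatible version of the file. */
    char buf[64];
    ssize_t n = pread(resource_fd, buf, sizeof buf - 1, 0);
    if (n < 0) { perror("pread"); _exit(1); }
    buf[n] = '\0';
    printf("worker %d read: %s\n", getpid(), buf);
    _exit(0);
}

int main(void) {
    resource_fd = open("resource.dat", O_RDONLY);   /* once, at startup */
    if (resource_fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 3; i++) {       /* pretend: three "new process" events */
        pid_t pid = fork();
        if (pid == 0) worker_main();    /* never returns */
        if (pid < 0) perror("fork");
    }
    while (wait(NULL) > 0) {}           /* reap workers */
    return 0;
}
```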
NB: You can actually execve the original executable under Linux via /proc/self/exe - while it looks like a symlink (e.g. it shows the original path under ls / readlink), opening or executing it behaves like a hardlink to the original inode. This does not solve shared libraries however (but does Firefox really need to link its own libraries dynamically?) or resources (you could sendfd them to the new process if you really wanted to go with exec).
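(For the exec-from-an-fd case, glibc also provides fexecve(3).) A minimal sketch of the /proc/self/exe trick, my own illustration:

```c
/* Sketch: re-exec the *original* binary via /proc/self/exe (Linux),
 * which resolves to the inode we are running from even if the path
 * on disk has since been replaced or unlinked. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc > 1) {                       /* the re-exec'ed run */
        printf("child pid %d running original binary\n", getpid());
        return 0;
    }
    char *child_argv[] = { argv[0], "reexeced", NULL };
    execv("/proc/self/exe", child_argv);  /* replaces this process */
    perror("execv");                      /* only reached on failure */
    return 1;
}
```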
> Keeping parts of a program compatible with arbitrary versions of other parts of the same program is virtually impossible. Go ahead, check out 50% of your files from some random version of your project thousands of commits ago while keeping the rest at master, and see if it still compiles and runs correctly.
The most sensible solution would be to keep all parts in sync, which is possible even if the copies on the filesystem changed. However, you could also have a stable API between the browser binary and the javascript parts if you wanted - that is a far cry from mixing arbitrary source files.
Keep in mind that browsers have become more like a mini-OS and should not get away with behavior common in normal applications but instead should provide stability guarantees more like an OS. And you definitely don't need to reboot because the kernel has been updated on the filesystem.
> For example you could have one process paused after loading all needed libraries and opening all resource archives and then just fork from that.
Plausible I guess, but also seems like a lot of effort to work around broken distros. Actually calling exec has tangible benefits too, e.g. it allows each process to have its address space randomized.
> However, you could also have a stable API between the browser binary and the javascript parts if you wanted - that is a far cry from mixing arbitrary source files.
Yes but the problem here is stable API and ABI between the browser binary and.. the browser binary. IPC within a binary is hardly ever written assuming a stable ABI, it'll just constrain the project way too much.
> And you definitely don't need to reboot because the kernel has been updated on the filesystem.
I feel like the kernel isn't really a fair comparison since it's relatively self-contained and you generally don't have multiple instances of the kernel talking to each other using IPC.
But go ahead, update your modules and see if you can still load them on your old kernel without rebooting. No, you can't. Same problem.
Unfortunately OSes today tend to have many dynamically loaded parts that require restarts if you want to keep everything working after updates.
AFAIK Chrome behaves the same. You can't update a browser cleanly while it is running. On Windows this is handled correctly because both browsers are updated in the background when the browser is closed, but in Linux-style environments this work is done by the package manager. When the package manager stomps over a running instance of Firefox, the old behavior was to crash. At least now Firefox can detect what happened and keep the active tabs running while instructing you to please restart the browser.
Regarding private tabs, one workaround would be to store the whole window as a bookmark folder (right click on empty tab area, select all tabs, store as bookmarks).
The package manager does not stomp over the running instance but only replaces the files. The original files even remain on the disk (but without any name) as long as they are open. It is Firefox that is designed to have to re-open those files during runtime - but that is not inherently required. Before Firefox went multi-process it handled package updates just fine, and there is no technical reason why it can't still do that.
>Before Firefox went multi-process it handled package updates just fine
No, the problem was just less common and things randomly stopped working or crashed instead of getting the warning page.
In theory you can design the browser so it keeps all files open and passes down the handles, but I imagine it's a mess to do that in practice, especially as Firefox is still somewhat configurable.
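The raw mechanism for passing down handles is standard SCM_RIGHTS fd-passing over a Unix socket; a rough self-contained sketch (stand-in file path, nothing Firefox-specific) shows the plumbing itself is simple, even if wiring a whole browser through it is not:

```c
/* Sketch: hand an open fd to another process over a Unix socket
 * (SCM_RIGHTS) -- the "keep files open and pass down the handles"
 * idea. socketpair() stands in for the parent/child IPC channel. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd) {
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof u.buf };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;
    c->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

static int recv_fd(int sock) {
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof u.buf };
    if (recvmsg(sock, &msg, 0) < 0) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_type != SCM_RIGHTS) return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

int main(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }
    int fd = open("/etc/hostname", O_RDONLY);   /* stand-in resource */
    if (fd < 0) { perror("open"); return 1; }
    if (send_fd(sv[0], fd) < 0) { perror("sendmsg"); return 1; }
    int inherited = recv_fd(sv[1]);             /* the "other process" end */
    char buf[64];
    ssize_t n = read(inherited, buf, sizeof buf - 1);
    if (n > 0) { buf[n] = '\0'; printf("read via passed fd: %s", buf); }
    return 0;
}
```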
Annoying indeed, but at least on my machines I can always get them back manually if they don't show up automatically, by going to the history menu and choosing "Restore Previous Session". Hopefully this should work everywhere.
(Note: I'm trying to help, not place blame here. I won't blame anyone for not being aware of every power-user trick, but I hope to help more people become power users. Please do ask questions about Firefox; even though I'm moving to LibreWolf, I still wish Firefox well and think many would be better off using it but just aren't aware of it. :)
It did have something to do with there being two languages used, though, resulting in two different ways of exercising the network stack with less testing for each. So it's not a bug caused by Rust, but it is caused by the addition of Rust to Firefox.
This isn't specifically caused by Rust; it's a second-system problem where they had two different ways to touch the HTTP stack and one was broken. It could easily have been the other way around, with only the Rust path being correct.
It's a nice write-up, but the "Lessons learned" section is terrible; it appears as if they didn't learn anything and view the whole thing as an infrastructure problem.
As others pointed out, why does Firefox even need to communicate with Mozilla services? Sure, telemetry needs to feed data back, if enabled, but if that fails why does it need to stop the browser from working?
Shouldn't the lesson learned be: the telemetry functionality in Firefox has a bug, where an infrastructure outage at Mozilla can "break" the browser. The fix isn't in infrastructure, the fix has to be in the code that communicates back, it should fail gracefully. It's not a problem if telemetry fails; either cache locally or just drop the data, it's honestly not important.
I'm sorry, I get that it's interesting how and why all this failed, but Mozilla makes it seem like they don't get what the root of the problem is.
The code _does_ work the way you describe, _except_ for the latent bug that caused the networking thread to get stuck in an infinite loop, which it was never supposed to do, even when errors occur.
It was never supposed to work that way, and the fact it did was because of a bug they'd never seen before.
So it wasn't that "oops, we shouldn't have built the system to get stuck forever when it fails" but rather "this bug triggered that bug which combined to cause a far worse result than 1 bug alone could have".
The only "lesson learnt" there is either that they need better ways to find bugs, quadruple up their thread count just so that different subsystems can't coexist on the same threads to avoid a theoretical problem that shouldn't ever happen again, or they just come up with infrastructural changes to minimise the negative results of the next "2 bugs reacted together and caught fire" scenario, which is the one they went with, and the only sane one.
Or they could reduce complexity by not adding things like telemetry, which has no direct user benefit and therefore should not be included in release versions. There should be no service that all Firefox installs connect to.
Telemetry has massive direct user benefit. It gives every user a vote on how important each feature is rather than letting power users who manually submit feedback control the show.
How does gradually removing power-user features empower regular users over time?
Hint: It doesn't.
And then when the regular users are all piling up on support because they can't learn how to configure the product (or indeed ask their power user friends for help, since the features no longer exist), what happens then?
Nothing. Nothing happens then. And the shitshow continues.
That is not what telemetry does. Useful features that are hidden aren't distinguished from useless features people don't use. Features that are used rarely but are super important aren't distinguished from features that don't work well.
Feature usage is a poor proxy for usefulness, importance, usability, visibility; it confounds them all.
No. The benefit you're describing is not a direct benefit. It's an exemplary instance of indirect benefit even under the most generous evaluation criteria/process.
I know of changes for the worse that Mozilla has made, using telemetry as a justification.
I don't know if I've heard of any changes for the better that have been prompted by telemetry data. It could be... maybe there are bug fixes or UI refinements. I just haven't heard of any.
Some examples of things I've used telemetry for at Mozilla:
* Noticed performance regressions not caught by our testing, and therefore been able to fix them.
* Noticed an unexpected number of users with hardware acceleration disabled, and therefore been able to find and fix the bug that was causing them to have acceleration switched off.
* Figured out which device in a category is most commonly used by our users, so that I could dogfood my work on a representative device.
Those are just a few examples off the top of my head. It's not about removing features because telemetry says nobody uses them. People at Mozilla use telemetry to answer all sorts of important questions. We also have to jump through hoops to add any new data collection, justifying why it's needed and ensuring the data is not personal. As is right, because we take user privacy very seriously.
Incorrect. It gives the designers and developers an opaque dataset of user behavior which they can interpret in many ways. I find their interpretations of this data to be highly motivated and suspect.
Even ignoring that decisions based on Telemetry are very much influenced by the person making the decision, importance has nothing to do with how much a feature is used.
Telemetry is what allows Mozilla to quickly know if e.g. a particular hardware combination is causing lots of failures for lots of people, and to act on that if it happens. That seems to me to be a clear user benefit.
> the fix has to be in the code that communicates back, it should fail gracefully.
The bug that caused the hang was in the network stack itself. There was no way the calling code could have prevented this in any way. You can see this by taking a look at the linked HTTP3 code. It's not that the higher-level code kept retrying over and over causing the hang, that was not the problem here.
Under "Lessons learned" you can also read "investigating action points both to make the browser more resilient towards such problems". I agree that this is broadly spoken, but it covers ideas that would have made this technically recoverable (e.g. can network requests be compartmentalized to not block on a single network thread?).
As explained in the article, this problem was not specific to Telemetry:
“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”
Since a browser's job is to make HTTP requests, a bug in the network stack would almost certainly have been hit in other places. This was highly-visible so it was quickly noticed but it's quite possible that a less frequent trigger could have plagued Firefox users for a much longer period of time as HTTP/3 adoption increases.
The article specifically states that normal web requests went through a different code path that did not trigger the bug. That the bug was not technically in the telemetry code is irrelevant - it happened without user interaction because of telemetry and it did not happen (at least as often) with telemetry disabled. Saying that there was no way to prevent it assumes that telemetry could not have been disabled/removed, which is false.
The article provides the correct logic: Telemetry was the first to use that combination of new code but there's no reason to believe that nothing else would ever have used the stack they've been transitioning towards. Had this bug not been found in Telemetry it would have shown up somewhere else, possibly harder to diagnose.
Per other comments here, supposedly the same issue would have occurred with a variety of non-telemetry tasks as well. One of them is indicated to be certificate-related, which suggests to me that CRL lookups could have triggered it, though I don’t know for sure. It ended up being that the issue occurred first with telemetry rather than with certificate stuff or who knows what else. But the flaw wasn’t in telemetry code, so focusing on telemetry wouldn’t have prevented this at all.
Exactly. I love Firefox and use it as my primary browser but this is the wrong conclusion.
Don't make your telemetry backend more reliable. Instead break it on purpose several times a day. That way a similar bug in the browser will not make it past dev channel.
As mentioned in a few places, the telemetry backend was not the culprit. The network stack was. The network stack is already pretty heavily tested and fuzzed, but obviously that didn't catch this specific bug.
I also get that telemetry was not the main culprit, but given the way they have ordered the "lessons learned" list, they seem to be blaming GCP, which THEY use for telemetry collection (not end users). So they are unknowingly acknowledging that the telemetry collection was a major issue in the incident.
My thinking is that the list should be reordered, as the ultimate culprit to blame is Firefox itself (make the third point on lessons learned the first one).
Not sure how you reach that conclusion, the root cause is described in detail.
Of course they'll fix so that this problem doesn't occur in the future. But as said, it had nothing to do with telemetry. Just that telemetry happened to trigger the bug.
That's fair, I just got the impression that they were more focused on the infrastructure aspects of the problem rather than the bugs in the Necko and Viaduct code.
I agree that the lessons learned section is pretty weak. I think that the root cause of the incident was poor code quality, and in particular, using a weakly typed data structure (String) to pass data between system modules that each interpreted that data differently.
The suggested learning was that more testing should have been done, but as a solution, more testing is a cop out. A real solution is to develop code in a language that supports a robust type system, and then using that type system effectively in development.
Not an easy solution, so in the short term, we'll have more testing, and more bugs.
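To sketch the idea (a hypothetical example, not Firefox's actual modules): even in plain C you can parse the stringly-typed value into a closed type once, at the module boundary, so unknown values become an explicit error instead of an unhandled case deep inside the receiving module.

```c
/* Hypothetical sketch: validate a string at the module boundary by
 * parsing it into a closed enum, so an unknown value is rejected up
 * front instead of hitting an unhandled case downstream. */
#include <stdio.h>
#include <string.h>

typedef enum { UPLOAD_TELEMETRY, UPLOAD_CRASH_REPORT, UPLOAD_INVALID } upload_kind;

static upload_kind parse_upload_kind(const char *s) {
    if (strcmp(s, "telemetry") == 0)    return UPLOAD_TELEMETRY;
    if (strcmp(s, "crash-report") == 0) return UPLOAD_CRASH_REPORT;
    return UPLOAD_INVALID;   /* unknown input is a visible error */
}

int main(void) {
    const char *from_other_module = "telemetyr";   /* a typo'd producer */
    switch (parse_upload_kind(from_other_module)) {
    case UPLOAD_TELEMETRY:    puts("send telemetry");       break;
    case UPLOAD_CRASH_REPORT: puts("send crash report");    break;
    case UPLOAD_INVALID:      puts("rejected at boundary"); return 1;
    }
    return 0;
}
```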
I was also disappointed for these reasons but couldn't have stated it this well. I also find it alarming, in this age of citizen-hostile countries blocking or rerouting the Internet and J-curve countries having intentional calamities, to think that everyone who uses Firefox might hit a bizarre technical failure contacting unnecessary centralized servers like this, as a first barrier to get past before following common DNS-setting instructions to reach whatever local networking people are assembling.
If I have not understood the post, then that is probably down to their communication style.
They lay out a bunch of things to fix about a system that could bring down your browser. What if the endpoint that got messed up for them by Google's GCP change isn't even what you reach at those IPs during unrest in Kazakhstan or an election in Uganda? What if it is a semi-intentionally confused transparent proxy?
Testing a few more things that a friendly proxy may do as it improves your connection is hardly the same as assuming the worst about your network in proper paranoia mode.
You expect bugs, I expect bugs, they expect bugs. Did the defenses they were taking, like certificate pinning or encrypting this tracking, cause them to discount the risks of putting this service in a new system and letting it run on startup to contact a 3rd party? Except that isn't what they lead with; they have done nothing to reiterate a position that's appropriate for a browser maker.
Not sure what your main point is? Telemetry is inexcusable?
I'm the first to agree that it absolutely should be OPT IN and not OPT OUT as it is now, but even so, my biggest concern would not be trying to evade hostile countries. If that is your bar, you can't just install a mainstream OS and mainstream software and assume that it is a good idea.
It was intended to fail gracefully. It was a bug. They did fix the bug, involving a complicated interaction between different parts of the network stack. But "we should try harder not to have bugs" is, rightfully, seldom considered a valuable "lesson learned".
Well, the telemetry is a separate topic. But regardless, any code that can end up blocking forever should have a timeout and recover from that timeout happening.
The problem here was that something that is known to fail for all sorts of reasons (network IO) was happening without such a timeout. Or with a timeout whose failure mode is that it never fires (yikes). That's a design problem, and even something with a very small chance of happening is extremely likely to actually happen at some point with a product that is this widely used.
This stuff is hard, of course, and I end up addressing issues related to this once in a while. The fix is usually to surround such code with defensive measures such as timeouts, retry mechanisms, telemetry, logging, etc.
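A minimal sketch of such a timeout wrapper (my own example, not Firefox code; and, as replies below note, it would not have caught this particular bug, which spun on CPU rather than blocking on IO):

```c
/* Sketch of a defensive timeout around blocking network IO: wait for
 * readability with poll(2) and turn "no data before the deadline"
 * into a recoverable error instead of blocking forever. */
#include <stdio.h>
#include <poll.h>
#include <unistd.h>

/* Returns bytes read, 0 on EOF, -1 on error, -2 on timeout. */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)  return -1;   /* poll itself failed */
    if (ready == 0) return -2;   /* deadline hit: caller can retry/abort */
    return read(fd, buf, len);
}

int main(void) {
    char buf[256];
    /* stdin stands in for a socket fd */
    ssize_t n = read_with_timeout(0, buf, sizeof buf, 2000);
    if (n == -2)     fprintf(stderr, "timed out; recovering instead of hanging\n");
    else if (n >= 0) fprintf(stderr, "got %zd bytes\n", n);
    else             perror("read");
    return 0;
}
```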
The additional question/learning is why they never noticed this happening before. Because it probably did; they just never noticed, because the very thing that would have told them was the thing that was hanging. People killing the application for whatever reason is something that you'd want to know about, however.
> code that can end up blocking forever should have a timeout and recover from that timeout happening.
There was no way for the calling code to do this. This was literally an infinite loop inside the network stack. Imagine the network stack itself going `while(1) {}` on you, without checking if the request was canceled.
Even if you detect that this happens, there is nothing you can do as the caller. You can't even properly stop the thread, as it is not cooperating. So recovering from this type of failure is hard.
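The best you can do is detect the wedge. A sketch of what detection might look like (a hypothetical heartbeat watchdog, not Firefox's actual design):

```c
/* Sketch of a heartbeat watchdog: the worker bumps a timestamp, a
 * monitor notices when it stops. This only *detects* the wedge; a
 * non-cooperating thread can't be safely stopped, so recovery is
 * limited to reporting or restarting the process. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

static atomic_long heartbeat;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; ; i++) {
        atomic_store(&heartbeat, (long)time(NULL));
        if (i == 3) for (;;) {}   /* simulate the infinite-loop bug */
        sleep(1);                 /* normal work */
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    atomic_store(&heartbeat, (long)time(NULL));
    pthread_create(&t, NULL, worker, NULL);
    for (;;) {
        sleep(2);
        if ((long)time(NULL) - atomic_load(&heartbeat) > 5) {
            fprintf(stderr, "worker wedged: all we can do is report/restart\n");
            return 1;
        }
    }
}
```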
> There was no way for the calling code to do this
Like what happened in a comment that I called out yesterday, you're silently inserting extra qualifiers that aren't in the original; the person you're responding to didn't say anything about calling code.
If the network stack can end up doing the equivalent of `while(1) { /* ... */ }`, then that's the bug, no matter what's in the elided part. There's not "no way" to deal with this. (In the specific case of `while(1)`—which I recognize is a metaphor and not a case study, so onlookers should please spare us the sophomoric retort—it's as simple as changing to `while(i < MAX_TRIES)` with some failover checks; see the sketch below.) In some industries, this sort of thing is mandatory.
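In sketch form (my own illustration of the metaphor; MAX_TRIES is an arbitrary bound):

```c
/* Sketch of the bounded version of the metaphor: an explicit upper
 * bound plus a failover path, instead of spinning until a success
 * that may never come. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_TRIES 5

/* Stand-in for the operation the buggy loop kept retrying. */
static bool try_step(int attempt) { (void)attempt; return false; }

int main(void) {
    int i = 0;
    while (i < MAX_TRIES) {   /* terminates: i strictly increases toward the bound */
        if (try_step(i)) { puts("succeeded"); return 0; }
        i++;
    }
    puts("giving up: failing over / reporting instead of looping forever");
    return 1;
}
```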
It's a bug. Are you saying there's some magical way of eliminating all possible infinite loops from code? Please write a paper on this amazing technique; I'm pretty sure that's equivalent to solving the halting problem and the computer science community would love to see a proven unsolvable problem being solved.
You write good comments usually, so IMHO this comment is worth replying to:
There is no algorithm that will determine the "halting status" of an arbitrary (program, input) pair, but that does not prevent a team of programmers from working in a subset of the set of all programs in which every program halts. Restricting themselves to that subset might make the team less productive (i.e., raise the cost of implementing things), but it probably does not materially limit what the team can accomplish (i.e., what functionality the team can implement) provided they're not developing a "language processor" (a program that takes another program as input).
Your desire for your insolence to be noted is granted, but to answer the non-strawman form of your question: yes, there is a way to prevent infinite loops from making their way into software in the field. It means providing proofs that your loops terminate. (If you can't show this, your code has to be rewritten into something that you can come up with a proof for.) As I already said, this is mandatory in some industries. The philosophy is also not far off from the rationale for Rust's language design re memory management. And although it might seem like it requires it, there's no need for magic. This is something covered in any ("every"?) decent software engineering program.
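A sketch of what that looks like in practice, using an ACSL-style loop variant of the kind checked by tools like Frama-C (my assumption that this is the flavor of industry practice meant): the annotation names a non-negative measure that strictly decreases each iteration, so termination is provable.

```c
/* Sketch of a termination argument: the loop carries a "variant", a
 * non-negative measure that strictly decreases on every iteration,
 * so the loop provably halts. Annotation style is ACSL (Frama-C). */
#include <stdio.h>

static int sum_first(int n) {
    int total = 0;
    /*@ loop variant n - i; */   /* >= 0 and decreases by 1 each pass */
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

int main(void) {
    printf("%d\n", sum_first(10));   /* prints 45 */
    return 0;
}
```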
I went and looked at the code (it's linked in the article). You absolutely can put a timeout around a case/switch statement. There's like 5 different ways to do it. And the code calling network syscalls can also have timeouts, obviously; otherwise nobody would ever be able to time out any blocked network operation. This is all network programming 101.
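For example (a sketch under my own assumptions, not the actual Necko code), a deadline check inside the state-machine loop itself:

```c
/* One of those ways, sketched: a deadline check inside the state-
 * machine loop, so even a buggy transition can't spin forever.
 * States and budget are made up for illustration. */
#include <stdio.h>
#include <time.h>

typedef enum { ST_START, ST_WORK, ST_DONE } state_t;

int main(void) {
    state_t st = ST_START;
    time_t deadline = time(NULL) + 5;   /* 5-second budget */

    while (st != ST_DONE) {
        if (time(NULL) > deadline) {    /* watchdog inside the loop */
            fprintf(stderr, "state machine exceeded deadline; aborting request\n");
            return 1;
        }
        switch (st) {
        case ST_START: st = ST_WORK; break;
        case ST_WORK:  st = ST_WORK; break;   /* simulated wedged transition */
        case ST_DONE:  break;
        }
    }
    return 0;
}
```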
> Any code that can end up blocking forever should have a timeout and recover from that timeout happening.
Any code that can end up blocking forever under normal circumstances already has a timeout and recovers from that.
This wasn't a normal circumstance, this was a logic bug.
> The problem here was that something that is known to fail for all sorts of reasons (network IO) was happening without such a timeout.
No. Read the article. It was an infinite loop. Equivalent to while(1);. Not a network timeout. Not a network error. An infinite loop. A logic problem.
I am appalled at how many people replying in the comments here cannot grasp this basic fact. This isn't about some dumb telemetry design where telemetry requests block everything else. This was a logic bug in the network stack that wedged the entire thing eating 100% CPU. There's no miracle fix for infinite loop bugs.
Not sure how the Firefox code is structured, but it is weird that a particular HTTP/3 request would hang the entire network stack so that you cannot perform any HTTP/2 or HTTP/1.1 requests.
All requests go through one socket thread, no matter which HTTP version. I am not a Necko engineer, but since requests can be upgraded, an HTTP/1 request could switch to HTTP/2, and if there were a separation by protocol, the request would have to be "moved" to a different thread. So I'm not sure that would work easily.
Is nobody reading the article? It was an infinite loop. Not a blocked request. A bug. A logic flaw. Something that wouldn't normally happen. It was broken code.
You know what happens when you put a while(1); in the middle of the nginx codebase? The whole server process hangs. This is normal in an async design. We don't write software to be magically resilient against freak bugs, especially not something like a browser that is not intended to be used in life-critical applications.