Everything Is Broken: Shipping Rust-Minidump at Mozilla (hacks.mozilla.org)
392 points by mthermidor on June 14, 2022 | 64 comments



Extra shoutouts to the folks at Sentry who also flipped rust-minidump on as their default backend and had to deal with way more exotic issues than I did (and fixed them!) because although Firefox sees some horrendous stuff and gets a bajillion crash reports, it's still one application with one basically stable minidump writing configuration.

They have to deal with basically random apps doing whatever they want and it sounds like hell.


Thank you for this work!

I've been involved with minidumps in one way or another since around 2010. Was at a startup at the time that had a browser based on Chromium and we needed crash reporting for our own app. So I wrote a pretty simple backend that received minidumps, ran them through the breakpad processor and shoved the output into Splunk. That was our crash-reporting system.

Circa 2013 the company got acquired by Yahoo, which at the time was using Crittercism for its mobile apps but wasn't happy with it. Somehow I was now the mobile app crash reporting expert at the company, so I built a whole new in-house crash reporting solution.

For iOS I wrote an SDK around PLCrashReporter because unwinding stacks on the client works out way better on iOS than dealing with a minidump.

For Android I had to deal with both JVM (er, Dalvik, er ART) stack traces, easy enough, but also native code crashes. For the latter I used breakpad's crash handler and minidumps. But it turns out that minidumps from Android devices are almost useless for two reasons:

1) If the crashes originate in managed code or call into managed code, you can't trace back through the managed code frames from a minidump. Especially if you don't have frame pointers.

2) You basically cannot get the symbols for all the different flavors of Android. Without symbols any stack trace that breakpad reconstructs is pretty useless.

Eventually I abandoned minidumps on Android and instead unwound on the phone using corkscrew, wait no, libbacktrace, wait no, libunwind. But that still doesn't give useful stack traces very often. In the end, I settled on capturing logcat output when restarting after a crash, which actually tends to have the most useful stack traces.

Which is all to say, both Apple and Google make it really hard for a mobile app to find out why it crashed. Both Android and iOS create a crash report for any app which crashes, but the app can't access those. So we're all shipping apps with third-party crash handlers built-in that try to capture a stack or minidump in-process and make sense of it later.


Apple actually recently added support for high-quality crash reports (the kind you'd get from their tooling, more or less) in MetricKit: https://developer.apple.com/documentation/metrickit/mxcrashd....


Ooh, that's useful. PLCrashReporter actually does a darn good job of capturing crashes. But what it can't capture is when iOS force-kills the app (watchdog, memory, etc). MetricKit looks like it has all sorts of useful data about that. I was vaguely aware of MetricKit but didn't realize how far it had come. Thanks!


I was recently looking into CoffeeCatch for Android. Are you saying that I'm wasting my time? :-)


That may work for your own code which you can pepper with macros, but I needed to be able to capture crashes in native code I didn't write. Android itself has used a variety of unwinders over the years. Extracted versions of the libraries that have been modified to build with the NDK are here:

https://github.com/ivanarh/libcorkscrew-ndk

https://github.com/ivanarh/libbacktrace-ndk

https://github.com/ivanarh/libunwind-ndk

https://github.com/ivanarh/libunwindstack-ndk

Along with a wrapper that can use all of them:

https://github.com/ivanarh/ndcrash

It's been a while since I dealt with this and I don't recall why Android keeps changing unwinders. Looks like Sentry is using libunwindstack-ndk (well a fork of it anyway) on Android:

https://github.com/getsentry/sentry-native/tree/master/exter...


> Rust is a really good language for writing parsers. C++ really isn’t.

One thing I appreciate about writing Rust is that ADT support implies writing parsers is simpler under the "parse don't validate" mindset (which was clarified for me I think in this [0] article).

[0]: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...


It's not ADT, it's strict types. The reason you can't do "parse, don't validate" in C++ is because you can't assume anything is valid at the point you use the data.

What ADT does is give you enough flexibility so that a strict typing system doesn't suck.
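
To make the "parse, don't validate" point concrete, here is a minimal Rust sketch. The `StreamKind` enum, its tag values, and `describe` are hypothetical names invented for illustration; they are not rust-minidump's actual API. The raw byte is checked once at the boundary, and everything downstream works with a type that cannot hold an invalid tag.

```rust
use std::convert::TryFrom;

/// Hypothetical record tags, invented for this sketch.
#[derive(Debug, PartialEq, Clone, Copy)]
enum StreamKind {
    ThreadList,
    ModuleList,
    Exception,
}

/// Error carrying the raw byte that did not match any known tag.
#[derive(Debug, PartialEq)]
struct UnknownKind(u8);

impl TryFrom<u8> for StreamKind {
    type Error = UnknownKind;

    // The only place where the raw byte is inspected.
    fn try_from(raw: u8) -> Result<Self, Self::Error> {
        match raw {
            0 => Ok(StreamKind::ThreadList),
            1 => Ok(StreamKind::ModuleList),
            2 => Ok(StreamKind::Exception),
            other => Err(UnknownKind(other)),
        }
    }
}

/// Code that receives a `StreamKind` never has to re-check the raw byte,
/// and the compiler insists that every variant is handled.
fn describe(kind: StreamKind) -> &'static str {
    match kind {
        StreamKind::ThreadList => "thread list",
        StreamKind::ModuleList => "module list",
        StreamKind::Exception => "exception",
    }
}

fn main() {
    match StreamKind::try_from(1u8) {
        Ok(kind) => println!("parsed: {}", describe(kind)),
        Err(UnknownKind(raw)) => println!("unknown tag {}", raw),
    }
}
```

The sum type does the work here: an out-of-range tag becomes an `Err` at parse time instead of a surprise later on.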


> you can't assume anything is valid at the point you use the data.

Just to double check my understanding: are you talking about raw pointers (i.e. void*) being common in C++ and not in Rust? You're right that I was using ADT a bit loosely; to be honest the main value add for me has been the first class data-holding enums/sum types. C++ has std::variant, but the syntax support in Rust feels nicer.


C++ has a series of issues.

You can't trust pointers have a value, or that the value is valid, you can't trust that your enums have a value inside their interval, or in fact you can't trust that any value from any type is inside its interval at all.

You also can't really trust that your values have the correct size.

We choose some of those to ignore, otherwise we wouldn't be able to program at all, but C++ gives you no guarantees at all about anything. The point is that if you do a parsing run in C++ and encode your value, you will still get many of the above problems because of bugs in your code.


> you can't trust that your enums have a value inside their interval

If you don't set the underlying type, assigning a value that doesn't match an enumerator via `static_cast` is undefined behavior. See https://en.cppreference.com/w/cpp/language/enum . (Doing weird pointer casting things is also undefined behavior per the strict aliasing rule, though, come to think of it, I'm not sure whether memcpying an out-of-range value into an enum through the "reinterpret_cast to `char*`" loophole is undefined behavior.)


I’m assuming you are referring to this part:

> If the underlying type is not fixed and the source value is out of range, the behavior is undefined.

Note the fine print about the meaning of ”out of range”:

> (The source value, as converted to the enumeration's underlying type if floating-point, is in range if it would fit in the smallest bit field large enough to hold all enumerators of the target enumeration.)

So this is not undefined:

  enum E { A = 0, B = 1, C = 2 };
  E valid = static_cast<E>(3);
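
The arithmetic behind that fine print can be sketched in a few lines. This is only an illustration of the quoted "smallest bit field" rule, written in Rust for convenience; `max_in_range` is a made-up helper and it assumes an enum whose enumerators are all non-negative with a largest enumerator of at least 1.

```rust
/// Largest value that fits in the smallest bit field able to hold
/// `max_enumerator` (assumed >= 1, all enumerators non-negative).
fn max_in_range(max_enumerator: u32) -> u32 {
    // Number of bits needed to represent the largest enumerator.
    let bits = 32 - max_enumerator.leading_zeros();
    if bits == 32 {
        u32::MAX
    } else {
        (1u32 << bits) - 1
    }
}

fn main() {
    // enum E { A = 0, B = 1, C = 2 }: the largest enumerator is 2,
    // which needs 2 bits, so values up to 3 count as in range.
    println!("largest in-range value: {}", max_in_range(2));
}
```

Under this reading, `static_cast<E>(3)` is in range while `static_cast<E>(4)` would not be.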


Ugh. You are right, and that's sad.


Even if that covered all the problem space (instead of replacing it with a much larger one), if your code is flawless, parsing and validating are equivalent.

Choosing one just makes a difference because code has problems.


It's linked at the bottom of the article, but reminder that Gankra's blog (https://gankra.github.io/blah/) has a ton of other great writing like this.

In particular, I always recommend "Text Rendering Hates You."


I link that article every time I see someone on the internet say “that sounds easy, why don’t you just”

And the answer is always well, things are more complicated than they look. Even something as trivial as rendering text on a screen.


I wonder how much of the benefit comes from the rewrite itself and not Rust. I have taken really bad, hard-to-maintain, very large code bases written in C++ and refactored them step by step into bug-free, maintainable code. In my experience bad code written in any language is hard to maintain. And good code written in any language is easy to maintain. I have worked with C code that was a joy to maintain and C code that was hell on earth. The difference was the skill of the original developers, not the C language itself. Rust is a nice step forward but it is not magic. If it becomes popular enough then people will write massive, impossible-to-maintain code in it. It is inevitable. The same goes for microservices BTW. A friend of mine is deep in coding hell trying to maintain a 20-year-old system of 100+ microservices. There ain’t no silver bullet.


Original rust-minidump author here: I started out by doing a pretty straightforward port of Breakpad into Rust, which I had just started learning at the time. I figured that learning a new language by porting a codebase I was intimately familiar with for a content area that I knew extremely well would make it so only the new language was the hard part. (It worked out well!) I did try to make the API more idiomatic Rust in places, since there are parts of Breakpad's API that are fairly C++-centric. I was also using Rust 1.0 originally, so it didn't quite have all the niceties available in the current Rust release.

All that being said, I would choose Rust over C++ for any new development anywhere. If my boss told me we had to use C++ for a new project I would actually quit. I've worked on plenty of C++ codebases (including Firefox) in my career. Sure, you can write bad code in any language, but C and C++ are just bad languages.

(I also ported Mozilla's sccache tool from the original Python implementation to Rust, which was a fun exercise. The Rust version is in production use in a wide variety of places, and AFAIK is still the only ccache-like tool that can cache Rust compilation.)


> Sure, you can write bad code in any language, but C and C++ are just bad languages.

Let's give them credit for what we've achieved using them. I would definitely pick Rust over C++ any time, but I respect C++ for all the cool things it gave us.


If they are bad languages then you should give most of the credit to the programmers that were using them.


I wish I could upvote this more than once :) The unsung heroes out there are the developers who take over a garbage dump of a system and turn it around. That is way harder than writing a system from scratch.


Yep makes sense. My point was that the developers matter more than the language. However you are absolutely right that good developers with a better language will do better.


There are entire classes of bugs that cannot happen in (safe) Rust that can still happen in a “bug free” C++ code base. You’re correct that Rust isn’t a panacea, but it does eliminate certain concerns that C++ just cannot remove from thought while developing with it.

The great thing about Rust is that it reduces the knowledge and skill a developer needs in order to write a stable and (usually) performant program. This is a good thing.


Unsafe Rust is inevitable to write in any real-world program that needs {O(log n) insertion-ordered collections instead of key-ordered like BTreeMap, intrusive collections, custom lock-free data structures, or non-tree-shaped object graphs where you cannot spare the performance, static analysis, and ergonomic penalty of RefCell and Cell doesn't work out}, and working with aliasing pointers is so difficult (a representative example is https://github.com/Storyyeller/stable_deref_trait/issues/15) that I'd rather write C++ than unsafe Rust.


You’re mixing up two different categories of development. There are plenty of libraries that provide safe BTreeMaps and lock-free data structures. These can isolate all unsafe code in their own crate and only expose safe APIs.

Programs on the other hand can be written in entirely safe Rust, and the vast majority of Rust developers never need to use unsafe Rust to do this.

In Rust, we try to isolate unsafe usage and test it extensively. In C++ you have no such option.


So far I have found no Rust insertion-ordered collection that supports removing items without scrambling the order (necessary for editing a key-value INI file) with O(log n) reads/writes, though you can use a vector with O(n) accesses, which takes O(n^2) to look up every key and is probably good enough in most cases.

And non-tree-shaped object graphs are not a library concern, but permeate entire applications; if you want to rewrite an app using shared mutability in Rust, you must either write code awkwardly with Cell/RefCell (and RefCell has runtime overhead), restructure the whole program in one Big Rewrite, or fall back to unsafe accesses (at this point, outside of multithreading C++ is a better Unsafe Rust than Unsafe Rust).
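
For readers who have not hit this, here is a small sketch of the Cell/RefCell route for a non-tree-shaped graph, using only the standard library. The `Node` type and the diamond shape are made up for illustration. It shows both the extra ceremony at every access and the fact that the borrow rules move from compile time to run time.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// A graph node that several other nodes may point at.
#[derive(Debug)]
struct Node {
    name: String,
    edges: Vec<Rc<RefCell<Node>>>,
}

fn node(name: &str) -> Rc<RefCell<Node>> {
    Rc::new(RefCell::new(Node { name: name.to_string(), edges: Vec::new() }))
}

/// Builds a diamond: a -> b, a -> c, b -> d, c -> d. Returns (a, d).
fn diamond() -> (Rc<RefCell<Node>>, Rc<RefCell<Node>>) {
    let (a, b, c, d) = (node("a"), node("b"), node("c"), node("d"));
    b.borrow_mut().edges.push(Rc::clone(&d));
    c.borrow_mut().edges.push(Rc::clone(&d));
    a.borrow_mut().edges.push(b);
    a.borrow_mut().edges.push(c);
    (a, d)
}

fn main() {
    let (a, d) = diamond();
    // Mutate the shared node through one handle...
    d.borrow_mut().name.push_str("-renamed");
    // ...and observe the change through both paths from `a`.
    for mid in &a.borrow().edges {
        println!("{} -> {}", mid.borrow().name, mid.borrow().edges[0].borrow().name);
    }
    // The aliasing rules are checked at run time: while one mutable
    // borrow is alive, a second one is refused.
    let first = d.borrow_mut();
    assert!(d.try_borrow_mut().is_err());
    drop(first);
}
```

Every access goes through `borrow()` or `borrow_mut()`, and a conflicting borrow surfaces as a runtime error (or a panic with the non-`try_` methods) rather than a compile error.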

Please respond with a factual rebuttal before downvoting.


It seems that indexmap would fit your bill:

Repo: https://github.com/bluss/indexmap

IndexMap type itself: https://docs.rs/indexmap/latest/indexmap/map/struct.IndexMap...

It has amortized O(1) reads and writes, and it supports removal without maintaining order in O(1), or while maintaining order in O(n).


I was not initially aware of O(n) order-preserving shift_remove[...](). Although likely not a problem in practice, removing a large number of items one by one (eg. when a user is deleting a selection containing many file associations) can result in quadratic behavior. I found that you can avoid that by calling retain() (at the cost of being a higher-order function) to keep/remove every item in the list in a single pass (linear time is fast enough for INI files). So in retrospect IndexMap would be a workable choice, if used carefully.
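
The single-pass idea can be sketched without any external crate. The following uses a plain `Vec` of pairs as a stand-in for an insertion-ordered map (so lookups are O(n), as with the vector approach mentioned earlier); `OrderedEntries` and its methods are hypothetical names, and this is not IndexMap's implementation. `Vec::retain` removes any number of keys in one linear pass while keeping the remaining entries in their original order.

```rust
/// A deliberately simple insertion-ordered key-value list.
struct OrderedEntries {
    entries: Vec<(String, String)>,
}

impl OrderedEntries {
    fn new() -> Self {
        OrderedEntries { entries: Vec::new() }
    }

    /// Updates an existing key in place, or appends a new one at the end.
    fn set(&mut self, key: &str, value: &str) {
        if let Some(i) = self.entries.iter().position(|(k, _)| k == key) {
            self.entries[i].1 = value.to_string();
        } else {
            self.entries.push((key.to_string(), value.to_string()));
        }
    }

    /// Removes every key in `doomed` in a single order-preserving pass,
    /// instead of shifting the tail once per removed item.
    fn remove_all(&mut self, doomed: &[&str]) {
        self.entries.retain(|(k, _)| !doomed.contains(&k.as_str()));
    }

    fn keys(&self) -> Vec<&str> {
        self.entries.iter().map(|(k, _)| k.as_str()).collect()
    }
}

fn main() {
    let mut ini = OrderedEntries::new();
    for key in ["png", "jpg", "gif", "svg"] {
        ini.set(key, "viewer");
    }
    ini.remove_all(&["jpg", "gif"]);
    println!("{:?}", ini.keys());
}
```

With a large `doomed` list the `contains` check would itself become the bottleneck; collecting the keys into a `HashSet` first would be the obvious next step.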


There are entire classes of bugs that cannot happen in (safe) Java/Erlang/Scheme/etc. that can still happen in a "bug free" Rust code base.


Like? I know that you’re trolling, but looking at Java specifically, the error handling in Rust and RAII pattern in Rust ensures that things like file handles are cleaned up. Modern Java has made this nicer, but it still offers a lot of potential bugs around resource leaks.
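
A rough sketch of the RAII point, assuming a made-up `Handle` type that stands in for a file handle (a real program would use `std::fs::File`, which behaves the same way on drop). The counter exists only so the cleanup can be observed.

```rust
use std::cell::Cell;
use std::num::ParseIntError;
use std::rc::Rc;

/// Stand-in for a file handle; `open_count` tracks how many are alive.
struct Handle {
    open_count: Rc<Cell<u32>>,
}

impl Handle {
    fn open(open_count: &Rc<Cell<u32>>) -> Handle {
        open_count.set(open_count.get() + 1);
        Handle { open_count: Rc::clone(open_count) }
    }
}

impl Drop for Handle {
    // Runs whenever a `Handle` goes out of scope, on every exit path.
    fn drop(&mut self) {
        self.open_count.set(self.open_count.get() - 1);
    }
}

/// Opens a handle, then parses `input`. On a parse error the `?` returns
/// early, and the handle is still released.
fn process(open_count: &Rc<Cell<u32>>, input: &str) -> Result<u32, ParseIntError> {
    let _handle = Handle::open(open_count);
    let value: u32 = input.trim().parse()?;
    Ok(value)
}

fn main() {
    let open_count = Rc::new(Cell::new(0));
    println!("{:?}", process(&open_count, "42"));
    println!("{:?}", process(&open_count, "not a number"));
    println!("handles still open: {}", open_count.get());
}
```

There is no `finally` block or try-with-resources to forget; the cleanup is tied to the value's scope.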

Sticking with Java, Rust also offers a better story around thread safety, such that sharing mutable state across threads requires types that allow that to work.
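
As a small illustration of that thread-safety story (a sketch, with `parallel_sum` being a made-up function): shared mutable state has to be wrapped in types that are safe to share, here `Arc<Mutex<_>>`. Swapping in `Rc<RefCell<_>>` would be rejected at compile time, because `Rc` is not `Send`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns `workers` threads that each add 1 to a shared counter
/// `per_worker` times, then returns the final total.
fn parallel_sum(workers: u32, per_worker: u32) -> u32 {
    let total = Arc::new(Mutex::new(0u32));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    // The counter can only be reached through the lock.
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}

fn main() {
    println!("{}", parallel_sum(4, 1000));
}
```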

Finally, Java’s exception handling and null usage generally allows a lot of low hanging fruit bugs to slip through where in Rust these types of errors are far less likely to happen.

I never said Rust is “bug free”; I’ve been working with software far too long to make such a provably wrong statement.


I completely agree with you. Rust seems to be a good step forward.


I was interested in the remarks about Windows' perverted stack frames.

I once had some MS C code that called some 3rd-party library function. This was a long time ago, I don't remember all the details! But roughly, the code did a comparison, called this function, then made a jump conditional on the result of the comparison. But the condition wasn't being evaluated correctly.

It turned out that the library was built using a Borland compiler; my code was Microsoft C. The calling conventions differed; Borland expected the caller to save the flag register on the stack before calling, and restore it on returning; Microsoft expected the called function to do this. As a result, the flags that had been set before the function call were junk after it returned.

The way that Borland did it was the "convention" - as far as I'm aware, Microsoft stood alone on this. I formed the belief that Microsoft was doing things the way it did in order to deliberately make Microsoft C code incompatible with libraries built with 3rd-party compilers.


Gankra is the most entertaining Rust author (Rust programmer who writes about Rust). Easily.


Mmm. I think @m_ou_se is probably the most entertaining, at least if we consider that both Saturday Night Live and Nightmare On Elm Street are entertainment.

For example, Rust deliberately doesn't have the ternary operator, and random other types don't get silently coerced to booleans - so you can't write a = x ? 1 : -1; however you can write a = if x != 0 { 1 } else { -1 }; with the same effect. But Mara isn't satisfied with this verbose yet sensible answer, and proposes you could instead, for example:

a = x.count_ones().count_ones().count_ones().count_ones() as i32 * 2 - 1;

Hilarious? Or maybe terrifying? Entertaining certainly. https://twitter.com/m_ou_se/status/1404034056405368833?lang=...

Aria is more informative but I'm not going to end up choking and spilling my beverage all over the desk.


That's a hilarious code crime! I checked it and it turns out that this works for 64-bit integers too. It's 2^127 - 1 (127 set bits) that is the first number that requires five calls to count_ones to work!
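
For anyone who wants to check this themselves, here is a short sketch. `sign_like` is the expression from the tweet for a 32-bit input, and `popcount_iter` is a made-up helper that applies `count_ones` a given number of times. By my count, four calls are enough whenever the first popcount is at most 126; a 128-bit value with 127 set bits, such as 2^127 - 1, goes 127 -> 7 -> 3 -> 2 and needs a fifth call, while the all-ones value 2^128 - 1 goes 128 -> 1 and does not.

```rust
/// The expression from the tweet: -1 for zero, 1 for any nonzero input.
fn sign_like(x: u32) -> i32 {
    x.count_ones().count_ones().count_ones().count_ones() as i32 * 2 - 1
}

/// Applies `count_ones` to `x` repeatedly, `n` times in total.
fn popcount_iter(x: u128, n: u32) -> u128 {
    (0..n).fold(x, |v, _| v.count_ones() as u128)
}

fn main() {
    println!("{} {}", sign_like(0), sign_like(12345));

    // 127 set bits: four calls are not enough, five are.
    let needs_five = (1u128 << 127) - 1;
    println!(
        "{} {}",
        popcount_iter(needs_five, 4),
        popcount_iter(needs_five, 5)
    );
}
```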


You could also do `1|!((x|-x)>>30)`
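
Read as Rust on an `i32`, where `!` on an integer is bitwise NOT and `>>` on a signed value is an arithmetic shift, that one-liner can be checked like this. The function name is made up, and `wrapping_neg` is swapped in for `-x` only to avoid the overflow panic that negating `i32::MIN` raises in debug builds.

```rust
/// Returns -1 for zero and 1 for any nonzero `x`.
fn sign_like_bits(x: i32) -> i32 {
    // For nonzero x, `x | -x` has its sign bit set, so shifting right by 30
    // leaves -1 or -2; NOT of that is 0 or 1, and OR-ing with 1 gives 1.
    // For x == 0 the shift leaves 0, NOT gives -1, and 1 | -1 is -1.
    1 | !((x | x.wrapping_neg()) >> 30)
}

fn main() {
    for x in [0, 1, -7, 1 << 30, i32::MIN] {
        println!("{} -> {}", x, sign_like_bits(x));
    }
}
```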


There's an unreasonably grumpy commenter below that disagrees, but I personally agree with you and found this to be a fun read.

I was interested in the topic before reading, but it could have easily been a slog of technical minutiae. I'm glad that wasn't the case!

Edit: the comment I referenced was deleted in the time I took to post this. It's probably for the best


What a fun read! :3 I really like your writing style. Deploying stuff to production is always so nerve-wracking, I related to that very hard. I recently developed a golang alternative to an old erlang-ruby-hodgepodge, and when it worked in production I found myself constantly not believing that nothing went wrong.


> I was in a bit of a stupor for the rest of that week, because I kept waiting for the other shoe to drop. I kept waiting for someone to emerge from the mist and explain that I had somehow bricked Thunderbird or something. But no, it just worked.

I think a lot of us can recognize this feeling when you have deployed big changes. Everything seems to be working fine, but you just don't trust it.


Ha, weeks and months of thinking, "Please just work" and then it does and it's always a shock.


> how we got absolutely owned by simple fuzzing

> You are reading part 1, wherein we build up our hubris.

Props to anyone willing to own their faults this readily:)


Huh, wow, I used breakpad/minidumps daily when I was at Google working on Chromecast based things. I had no idea it existed outside that ecosystem.

Now I know there's nice Rusty work happening with this stuff that I could maybe make use of for my next employer or a personal project. Neat.


I integrated Breakpad into Firefox to replace the old closed-source Talkback implementation, we shipped it in Firefox 3. I suspect (but would have to ask Mark Mentovai to confirm) that Breakpad was probably written for use in the not-yet-publicly-announced Chrome. If so, that would mean that we shipped it first. :) I probably still have commit access to Breakpad, although I haven't contributed to it in years.


What kind of issues and crashes did they see in rust code?

The article is very vague about the details, which could be of interest to people in a similar situation.


This is a fantastic article, thank you for writing it. Looking forward to part 2!


https://docs.microsoft.com/en-us/windows/win32/debug/minidum...

Minidumps were designed so well that the initial Windows/x86 implementation could easily be extended to multiple platforms, as Google Breakpad did. Symbols on demand over HTTP. For all the crap they get, Microsoft got many things right.


Do tell us more, don't leave us hanging! Loved it.


A better/more technical article on the same tech, from Mozilla's collaborators on this project: https://jake-shadle.github.io/crash-reporting/


That article is about the client-side (generating the minidump for a crashed process) to this article's server-side (processing/analyzing the minidump).


Current status: refreshing https://hacks.mozilla.org/author/abeingessnermozilla-com/ waiting for part 2.


If the follow-up post does not make it to HN front page, I'll have a hole in my life.


so? when can I ditch c++ for a full firefox and its sdk and use only the rust-written rust compiler?

I did not check lately, but is rust syntax still sane compared to the abomination which is the c++ syntax?


Maybe I'm missing something, but they ported from C++ (because 'C++ is bad donchaknow') to Rust and still ran into problems parsing crash dumps?

If the dump is corrupt then just stop trying to parse/make sense of it; it's garbage.


No, we removed many random crashes that the C++ code had. You cannot "simply" discard a crash report if something is slightly off because then you would discard most crash reports. And most debuginfo too.

You can't expect "thing that runs when a process may have just experienced memory corruption" and "all builds of your application for all eternity" and "every toolchain you ever built your program with for all eternity" to be even vaguely reliable, because those things are in the past and we're trying to figure out how to fix the bugs people are experiencing in production today.

It is a horribly miserable answer to tell your coworkers "yeah sorry I know users are getting thousands of crashes this morning but the crash-dumper didn't sign its name in cursive so I'm gonna refuse to let you read the letter it sent at all".

And just an incoherent answer to say "yeah I know this is a stack overflow but it left the stack in a mildly corrupt state so I absolutely refuse to try to even look at the stack and figure anything out about it". Like, that is the entire purpose of a crashreporter, to investigate a program in an invalid state!


Reminds me of "'Your program shouldn't have bugs in it' isn't an acceptable position to take for a debugger", from the rr folks. Unfortunately I can't find the source of the quote any more, but it stuck in my mind.


Yeah computing backtraces in a crashreporter is extremely similar to a debugger in that you need a lot of fudge-factor heuristics and fallback modes for known toolchain bugs or common corruptions.


You're probably remembering https://pernos.co/blog/tzcnt-portability/


Speaking of the rr folk, they also had the fascinating point that you can reliably generate a "stack trace" by figuring out which `call` instructions were executed with what values (also other jump instructions I suppose), instead of walking the stack. Thereby skipping the whole "parsing the stack is insanely difficult and unreliable" issue.
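
A toy model of that idea, to show why it sidesteps stack parsing: if you already have a recorded sequence of executed calls and returns, the active call chain falls out of replaying it. The `Event` type and `call_chain` function are invented for this sketch and say nothing about how rr or Pernosco actually store or query their traces.

```rust
/// One recorded control-flow event (hypothetical trace format).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    Call { target: u64 },
    Ret,
}

/// Replays the trace and returns the active call chain, outermost first.
fn call_chain(entry: u64, trace: &[Event]) -> Vec<u64> {
    let mut chain = vec![entry];
    for event in trace {
        match *event {
            Event::Call { target } => chain.push(target),
            Event::Ret => {
                // Never pop the entry point; an unmatched return is ignored.
                if chain.len() > 1 {
                    chain.pop();
                }
            }
        }
    }
    chain
}

fn main() {
    let trace = [
        Event::Call { target: 0x2000 },
        Event::Call { target: 0x3000 },
        Event::Ret,
        Event::Call { target: 0x4000 },
    ];
    for addr in call_chain(0x1000, &trace) {
        println!("{:#x}", addr);
    }
}
```

Nothing here reads the crashed program's stack memory, so a trashed stack cannot mislead it; the cost is that you need the recording in the first place.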


FWIW, that's from pernosco, not rr.


I think it's the same people?


It is. rr records the trace and Pernosco converts it into a database that can be queried.


This is the excessively fun part of dealing with crash dumps in general. Many of them are going to be 1% corrupt and 99% fine, and somewhere in them there is likely vital information about what caused the corruption.

So the entire reason for being of things like rust-minidump is to make enough sense of files that are known to be corrupt garbage to be able to find bugs.
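
A minimal sketch of that "salvage what you can" attitude, assuming a made-up format where each record is a one-byte length followed by that many payload bytes. This is not the minidump format and `parse_lenient` is not a rust-minidump function. Instead of failing on the first inconsistency, the parser keeps every record it could read, including the readable part of a truncated one, and reports the problem alongside.

```rust
/// Everything that could be recovered, plus notes about what was wrong.
#[derive(Debug, PartialEq)]
struct Parsed {
    records: Vec<Vec<u8>>,
    warnings: Vec<String>,
}

/// Parses `[len, payload...]` records, keeping whatever is readable.
fn parse_lenient(data: &[u8]) -> Parsed {
    let mut records = Vec::new();
    let mut warnings = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        let len = data[pos] as usize;
        let start = pos + 1;
        match data.get(start..start + len) {
            Some(payload) => {
                records.push(payload.to_vec());
                pos = start + len;
            }
            None => {
                // The record claims more bytes than exist: keep the
                // partial payload, note the problem, and stop.
                warnings.push(format!("record at offset {} truncated", pos));
                records.push(data[start..].to_vec());
                break;
            }
        }
    }
    Parsed { records, warnings }
}

fn main() {
    // Two good records followed by one that claims 3 bytes but has 1.
    let parsed = parse_lenient(&[2, 10, 11, 1, 20, 3, 30]);
    println!("{:?}", parsed);
}
```

A strict parser would return an error for this input and throw away the two intact records along with the damaged one.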


A lot of the weird bits in the Breakpad codebase were definitely from us finding extremely broken minidumps from Firefox users in the wild and then me tweaking the code to see if we could get something out of it so we had a chance at diagnosing the issue.


The worst bugs are often the ones that have trashed some of the stack or ended up with some rubbish register state too, and so being able to try and get some useful information out of a seemingly garbage dump is critical. I imagine a project like Firefox is also big enough to see bad things happening because of incorrect CPU behaviour and bit flips and such too...



