> What is ThreadSanitizer? ThreadSanitizer (TSan) is compile-time instrumentation to detect data races according to the C/C++ memory model on Linux. It is important to note that these data races are considered undefined behavior within the C/C++ specification.

TSan, ASan, UBSan - if you are writing code and your toolchain supports these, you should use them. In the 6-7+ years I've used them, I have never seen a false positive (I'm sure some have occasionally existed, but they seem rare in practice).

> However, developers claimed on various occasions that a particular report must be a false positive. In all of these cases, it turned out that TSan was indeed right and the problem was just very subtle and hard to understand.

Yes, I have seen this phenomenon!



Address Sanitizer -- never seen a false positive. Ever.

UBSan -- again never had a false positive.

Thread Sanitizer does have false positives in some circumstances. They're hard to verify as actually false, and in many ways it's better to refactor to eliminate the complaint, because you'll end up with better, cleaner code. For example, Qt uses the atomic fences that the article describes [0] [1].

[0]: TSan does not produce false positive data race reports when properly deployed, which includes instrumenting all code that is loaded into the process and avoiding primitives that TSan doesn’t understand (such as atomic fences).

[1]: https://lists.qt-project.org/pipermail/interest/2014-March/0...


I have seen false positives in ASAN.

In Firefox, media threads use a reduced stack size (128 kB) due to historical 32-bit address-space concerns with lots of media elements on a page. Opus, in its default configuration, requires a fair amount of working stack memory (a few dozen kB in some cases). "A few dozen" is much smaller than 128 kB, but enabling ASAN apparently increases stack usage quite a bit. So we got crashes under ASAN from stack overflows that would not have been possible without ASAN enabled.

https://bugzilla.mozilla.org/show_bug.cgi?id=750231


Address Sanitizer inserts sentinels on the stack to watch for certain kinds of dangerous stack behavior. I would be willing to bet that the stack overflow could occur without ASAN and it's not a false positive.


You are right to be concerned. The primary culprit is the use of C99 variable-length arrays (or alloca when those are not available). I can see why ASAN would want to pad those.
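
To illustrate the pattern (a hypothetical sketch, not actual Opus code), consider a decode helper that sizes a VLA from a format-bounded value. The un-instrumented stack cost is easy to bound by hand, but ASAN places redzones around such arrays, so the instrumented frame is noticeably larger:

  /* Hypothetical sketch, not actual Opus code. */
  #define MAX_FRAME_SIZE 960            /* illustrative format-imposed bound */

  static void process_frame(const float *in, int n, float *out) {
      float tmp[n];                     /* C99 VLA, with n <= MAX_FRAME_SIZE */
      for (int i = 0; i < n; i++)
          tmp[i] = in[i] * 0.5f;        /* placeholder for real DSP work */
      for (int i = 0; i < n; i++)
          out[i] = tmp[i];
      /* Without ASAN, tmp costs about n * sizeof(float) bytes of stack.
         With ASAN, redzones are added around the VLA, so the same call
         consumes more stack than a hand analysis of the plain code assumed. */
  }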

Fortunately, the allocation sizes of these arrays are tightly constrained by the limits of the Opus format (this is not an accident: I helped design the format). I performed a careful analysis (by hand) to ensure the maximum possible stack usage (without ASAN) was not anywhere close to the size allocated for the stack. It is of course possible that I made mistakes, but I tried very hard not to.

The broader point is that this is an entire class of memory access error that ASAN cannot properly validate due to its design. Because it changes the system under test in ways that differ from a real deployment, I cannot use it to mechanically validate my hand analysis.

It is worth noting that valgrind does not have this problem (because it instruments the original binary, instead of making compile-time changes), and is also used by Firefox in testing.


Configure using larger stacks when building with ASAN since this is a known problem?

The sanitizers are useful enough that I've found adding small, targeted changes to support those modes to be worth it, versus the increased risk of not testing what you actually run.
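
Something along these lines would keep the 128 kB stacks in normal builds and only grow them under ASAN (a minimal sketch; the macro name and the ASAN size are illustrative, not Firefox's actual configuration):

  #include <stddef.h>

  /* Detect ASAN under both GCC and Clang. */
  #if defined(__SANITIZE_ADDRESS__)
  #  define HAVE_ASAN 1
  #elif defined(__has_feature)
  #  if __has_feature(address_sanitizer)
  #    define HAVE_ASAN 1
  #  endif
  #endif
  #ifndef HAVE_ASAN
  #  define HAVE_ASAN 0
  #endif

  /* 128 kB normally; extra headroom for ASAN's larger frames. */
  static const size_t kMediaThreadStackSize =
      HAVE_ASAN ? (512 * 1024) : (128 * 1024);

The chosen size would then be handed to whatever creates the media threads (e.g. pthread_attr_setstacksize).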


Address Sanitizer isn't a tool that solves every problem, but it solves the problems it's intended to solve well.

I'm assuming you're still talking about Firefox running Opus [0]? I'm not familiar with it. But I do think that any platform that can run Firefox and wants to listen to audio should have a lot more than 128k lying around. Otherwise your user isn't using the right tool for the job.

I understand memory constraints can be tight and it feels like there's 128k of space right there that you can use. And to achieve that you've written it in C and manually calculated the stack space needed. It feels awesome to have that kind of knowledge and experience with opportunity to leverage it! It's great that you've helped design a media format and implement something to process it!

But Firefox wastes tons of megabytes just to load, store, and transmit telemetry, preferences, search providers... the list can go on. Do you really want to try to convince me that a 128k slab for audio that the user specifically loaded is too expensive in the face of all of the telemetry loaded by Firefox these days?

Why add all the maintenance headache when you can instead allocate a single slab for your own use? If I were to write it and manually optimize, I'd use an allocated slab. As a bonus, the slab's ownership can move to any other thread if needed. It would also come out clean in Address Sanitizer. And, even more important, I wouldn't have to worry about what happens if the frames before my entry point change. For example, Firefox's software design could change and add a couple of wrappers around the manually-sized stack. What happens if I want to compile Firefox on a different platform which doesn't permit a manually-sized thread stack?

So it sounds like you've written something that's going to be very difficult to maintain if you were to vanish. And you've written it for a software product where that level of optimization is, perhaps, better done automatically by tools available to you.

If you build a standalone media player specifically for Opus then that'd be pretty cool. But Firefox shouldn't have pieces that are difficult to maintain.

[0]: https://hacks.mozilla.org/2018/10/introducing-opus-1-3/


I've gotten false positives in ASan when trying to instrument Win32 application code compiled with MSVC's then-experimental ASan support, using the wrong combination from the maze of ASan libraries (a static library combined with dynamic MFC). I also get startup crashes when running MSVC ASan binaries in Wine. But these are obscure beta use cases, and some may be fixed by now.


It would probably be fairly easy to change Qt's Mutex implementation to be TSan-compatible and only do so for TSan builds (by swapping out the fences for atomics when building with TSan). This is what we did in Firefox.
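
The shape of the change is roughly this (a sketch of the idea only; not Qt's or Firefox's actual code): when TSan is in use, carry the ordering on the atomic operation itself instead of using a standalone fence, which TSan cannot model:

  #include <stdatomic.h>
  #include <stdbool.h>

  #if defined(__SANITIZE_THREAD__)          /* GCC */
  #  define USING_TSAN 1
  #elif defined(__has_feature)
  #  if __has_feature(thread_sanitizer)     /* Clang */
  #    define USING_TSAN 1
  #  endif
  #endif

  static bool try_lock(atomic_int *lock_word) {
  #ifdef USING_TSAN
      /* Acquire semantics on the exchange itself: visible to TSan. */
      return atomic_exchange_explicit(lock_word, 1, memory_order_acquire) == 0;
  #else
      /* Fence-based variant: same ordering, but opaque to TSan. */
      if (atomic_exchange_explicit(lock_word, 1, memory_order_relaxed) != 0)
          return false;
      atomic_thread_fence(memory_order_acquire);
      return true;
  #endif
  }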


UBSan has "false" positives, in that unsigned integer overflow is not C/C++ UB. They've decided to warn about it anyway, and sometimes it could be a bug, so that isn't totally unreasonable.


Only clang (not gcc) ships UBSan with support for checking unsigned integer overflow, and clang's UBSan does not enable it by default (it must be explicitly enabled).

As a result, it's not clear that it's a false positive in the technical sense here (as one would need to enable checking for unsigned overflow explicitly).
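
A tiny example of what that opt-in check reports; the wraparound here is perfectly defined C, which is why the check is off by default:

  /* Build with: clang -fsanitize=unsigned-integer-overflow wrap.c
     Plain -fsanitize=undefined does not include this check. */
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint32_t x = UINT32_MAX;
      x += 1;              /* wraps to 0: legal C, but reported when enabled */
      printf("%u\n", x);
      return 0;
  }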


In my opinion, the main limitation of TSan is that it cannot find all possible races. (That's an impossible task, so I think TSan is a pretty great tool if this is its main limitation.)

As a dynamic tool, if a race is observed during execution by TSan, the race is very likely to be real. But it is impossible to "observe" execution for every part of a large code base, and so real races are likely still hiding in some edge case that TSan did not execute.

Static Analysis tools are interesting in that they have the opposite problem. Static tools can detect races across the entire code base, but without runtime information it is difficult to know if the races are real. This leads to more real races being detected, but also comes with more false positives.

There is a lot of interesting work being done on data race detection and how we can bring these two sides closer together to find more real races, without overwhelming developers with false positives.


This is absolutely true, and hence we combine TSan not only with our tests but also with fuzzing, to explore even more corner cases.

On the static vs. dynamic side, I would always opt for the dynamic when it can guarantee me no false positives, even if the results are incomplete. It is pretty much impossible to deploy a tool that produces lots of false positives because developers usually will reject it at some point and question every result.


> In my opinion, the main limitation of TSan is that it cannot find all possible races.

In my experience TSan finds all possible races if the unit tests are built with it and the unit tests provide enough coverage. If unit tests don't provide enough coverage then there are likely other bugs lurking beyond just data races.


I've never actually used TSan, so please correct me if I've missed something significant about how it works, but doesn't TSan need to observe the race happening in order to detect it?

The code coverage analysis I've seen in the past only evaluates what code is run at least once, but how do you get coverage analysis to evaluate which code runs concurrently with which other code, effectively enough to be confident that you'll detect data races?

For example, if you've got a cache that safely handles concurrent use for most operations, but has a data race when two threads both cache miss on the same key within a short window of each other, are there kinds of code-coverage analysis that could theoretically point out that none of your tests involve a cache miss of the same key in two different threads?
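
To make that concrete, here's a hypothetical sketch (not any real cache implementation) where lookups take a lock but two concurrent misses on the same key race on the fill:

  #include <pthread.h>

  struct entry { int valid; int value; };
  static struct entry table[256];
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  int cache_get(int key, int (*compute)(int)) {
      struct entry *e = &table[key & 255];
      pthread_mutex_lock(&lock);
      int valid = e->valid;
      pthread_mutex_unlock(&lock);
      if (!valid) {
          /* BUG: the fill happens outside the lock, so two threads missing
             on the same key write e->value and e->valid concurrently. */
          e->value = compute(key);
          e->valid = 1;
      }
      return e->value;
  }

TSan will only report this if some test actually drives two threads through the miss path for the same key at the same time, which is exactly the coverage question.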

I've seen concurrency permutation testing tools, that can reorder the sequence of operations between different threads searching for a permutation that causes a data race, but that relies on actually executing the potentially-racy behaviour at all during testing, right?

I guess this is also what fuzzing is about, but are there tools or techniques you can use to increase your confidence in how well your tests cover the space of concurrent use of your data structures?


TSan uses run-time instrumentation, yes.


Unit tests rarely cover high-load concurrency scenarios. I used to think unit test coverage was important, but I've found using the type system to make errors impossible is a much more effective way of lowering the defect rate.


They're not mutually exclusive concepts. :)


Given the project management triangle they kind of are - the goal should be to achieve the required defect rate at minimum maintenance cost. A lot of people talk as though more tests or more verification is always a good thing, but zero defects is usually not a business priority if it means slowing down feature work. So adding more different kinds of verification should be a last resort, only when you actually require a lower defect rate than you can achieve with one method alone - which is pretty rare IME.


> the goal should be to achieve the required defect rate at minimum maintenance cost

I disagree. Maintenance cost doesn't need to be minimized. It needs to be reasonable.

The goal should be to produce high quality software. High quality software is easy to understand and easy to instrument with tools.


Much as I enjoy writing high quality software, it's irresponsible to spend time or money producing something higher quality than is justified by the needs of the business. Generally I find that unit tests for anything other than actual logic (which is a tiny fraction of most codebases) don't offer enough benefit to be worth the costs.

(I'd also argue that unit testing can reduce the quality of the software as it makes the internals of the architecture more costly to change).


> Much as I enjoy writing high quality software, it's irresponsible to spend time or money producing something higher quality than is justified by the needs of the business.

I think it's more irresponsible to produce something that the customer isn't going to be able to use for many years or that needs a patch before it's even released.


The responsible approach is the one that delivers the best cost/benefit for the customer. Yes, it doesn't mean producing something that's completely unusable - but it rarely means producing something completely bug-free either.


Unit tests tend to be written to run without parallelism, or at least with less parallelism than the normal app has.


Not my unit tests. :-)


I agree, good code coverage can probably get most of the way there. But there are some problems with relying on coverage.

Coverage can be misleading. Maybe there is some function being executed in parallel:

  int global;
  int complex_condition;         /* set elsewhere, possibly by another thread */
  void bar(int *p);              /* may modify *p somewhere down the call chain */

  void foo() {
    int local;

    if (complex_condition)
      bar(&local);               /* path exercised by the tests */
    else
      bar(&global);              /* path never executed under test: race missed */
  }
And assume that somewhere down the call chain the data passed to bar is modified. The bar function and all of the functions it calls can have 100% test coverage, but if the else branch is not covered, the race will be missed.

So without true 100% test coverage, it is possible races are being missed by dynamic tools, and true 100% test coverage is probably not possible or feasible in large complex projects.


> but if the else branch is not covered, the race will be missed.

Yes.

> true 100% test coverage is probably not possible or feasible in large complex projects.

I have yet to encounter a single large complex project where 100% test coverage is not possible. I have encountered such projects where 100% test coverage required massive refactoring of the code to facilitate testing, and doing so resulted in fixing most of the customer complaints we had received.


So you truly believe that Firefox or Windows or Ubuntu could not only be 100% covered by unit tests, but that it would actually be worth taking the time to do so? (edit: typos)


As someone who has worked on the code of Firefox, I strongly doubt that.

For instance, pretty much every piece of code (whether native or JS) I've seen is programmed defensively, assuming that for some reason invariants can be broken - something that typically cannot happen unless there is a memory corruption involved. Testing all these defensive programming branches is not possible without introducing memory corruptions.


Yes, I firmly believe that Firefox, Windows, and Ubuntu can achieve 100% unit test coverage of software. I truly believe it would be worth its while.


As Yoric mentioned, that would require some custom Qemu patch to corrupt memory, some test boxes with dodgy RAM, and/or other things along those lines.

Removing those "impossible" code branches is a bad idea. "Impossible" things happen millions of times a day to Firefox.

I used to work for LimeWire, which had tens of millions of users. We were sometimes seeing bug reports of "impossible" stack traces. They went down dramatically when we added a splash screen that calculated SHA-1 hashes of all of the DLLs and JARs before using the reflection API to load and start the actual application. Given that the "corrupted" JARs were still presumably passing the JARs' CRC-32 checksums and bytecode verification at class load time, I think we were most often just detecting dodgy RAM that would later have caused "impossible" errors by corrupting bytecode or native code after bytecode verification.


Writing defensively is great! There's nothing wrong with that.

If you're writing code to defend against "dodgy RAM" whose symptom is a corrupt SHA then it's super trivial to test that code: have a valid JAR file and adjust its content so that it will still "load" but isn't the file you expect. Then verify that the program refuses to load the JAR.


> have a valid JAR file and adjust its content so that it will still "load" but isn't the file you expect. Then verify that the program refuses to load the JAR.

Sorry for being a bit slow today, but what's the difference between "load"ing and loading? You clearly understand that you're using two different definitions of load, otherwise you wouldn't have put one in quotes. (Also, it would be self-contradictory if the two were identical.) However, I can't for the life of me figure out two different definitions for "load" which both involve CRC-32 and bytecode verification checks, and yet aren't identical.

If I'm correct, we saw a dramatic reduction in the "impossible" errors not because we were detecting the actual problem that caused them (corruption of internal JVM buffers after bytecode validation), but because our SHA-1 failures (loading chunks of files into our own buffers and checksumming those) were highly correlated with the "impossible" failures, since both had the same underlying cause.

In our case, the JVM loaded the code from the JAR, which, unless I misunderstand Sun/Oracle's classloader, means that both CRC-32 checksums and bytecode verification pass, even for unsigned JARs. The chances of random bit flips passing CRC-32 checks are 4 billion to one. The odds of random bit flips passing both a CRC-32 check and bytecode verification are probably at least quadrillions to one. I'm pretty sure the corruption was happening minutes to hours after the classes actually got loaded by the JVM.

In other words, I think we were getting saved by the fact that SHA-1 summing a bunch of large files is a poor man's memtest86: writing semi-random patterns to big chunks of memory and verifying you can read back what you wrote. Failing this correlates heavily with seeing "impossible" errors at runtime, but we weren't directly examining the classloader's internal buffers that were the actual cause of the errors.

On a side note, modern OSes should really have a low rate memory checker built into the kernel's memory manager. It's a crutch, but cheap consumer IoT devices and WiFi APs aren't getting ECC any time soon.


The difference is that "load" means that the file is otherwise acceptable (valid header, valid content) but that it's been adjusted so that the checksum isn't what's described in the manifest.

That is:

1) the JAR file is successfully loaded by the JVM -- as in it will "load"

2) the JAR file doesn't pass the SHA checksum requirement you've added -- as in it isn't the file you expect.


Even disregarding the very good points about code defending against memory errors mentioned by others, I imagine a realistic time frame for such coverage of a project the size of Firefox or Windows to be no less than a decade, with no other development going on. I find it absurd to imagine it would be worth it, even if it reduced the bug count to 0 (which it wouldn't).


> Yes, I have seen this phenomenon!

I've had this happen to me personally. Looking again and again at the tsan error and the code it pointed to, I was absolutely sure it was a false positive. Thankfully I got a friend to look at it and after a good night's sleep he was able to explain to me how exactly the code was wrong.

Tl;dr: It is extremely hard to write correct threaded code in C/C++, and it is quite hard to recognize errors even when they are pointed out to you.



