> In my opinion, the main limitation of TSan is that it cannot find all possible races.
In my experience, TSan finds all possible races if the unit tests are built with it and the tests provide enough coverage. If the unit tests don't provide enough coverage, then there are likely other bugs lurking beyond just data races.
I've never actually used TSan, so please correct me if I've missed something significant about how it works, but doesn't TSan need to observe the race happening in order to detect it?
The code coverage analysis I've seen in the past only evaluates what code is run at least once, but how do you get coverage analysis to evaluate which code runs concurrently with which other code, effectively enough to be confident that you'll detect data races?
For example, if you've got a cache that safely handles concurrent use for most operations, but has a data race when two threads both cache miss on the same key within a short window of each other, are there kinds of code-coverage analysis that could theoretically point out that none of your tests involve a cache miss of the same key in two different threads?
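Concretely, I'm imagining something shaped like this (a made-up sketch in C++, not any real library's API):

    #include <map>
    #include <mutex>
    #include <string>

    std::map<std::string, std::string> cache;
    std::mutex cache_mutex;

    std::string expensive_compute(const std::string& key) {
        return "value for " + key;            // stand-in for the real work
    }

    std::string get_or_compute(const std::string& key) {
        {
            std::lock_guard<std::mutex> lock(cache_mutex);
            auto it = cache.find(key);
            if (it != cache.end())
                return it->second;            // hit path: properly locked
        }
        std::string value = expensive_compute(key);
        cache[key] = value;                   // miss path: oops, no lock held for the insert
        return value;
    }

Every line of that can show up as covered, but unless a test actually drives two threads through the miss path without any synchronization between them, a dynamic tool never observes the conflicting accesses.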
I've seen concurrency permutation testing tools that can reorder the sequence of operations between different threads, searching for a permutation that causes a data race, but that still relies on the potentially-racy behaviour actually being executed during testing, right?
I guess this is also what fuzzing is about, but are there tools or techniques you can use to increase your confidence in how well your tests cover the space of concurrent use of your data structures?
Unit tests rarely cover high-load concurrency scenarios.
I used to think unit test coverage was important, but I've found using the type system to make errors impossible is a much more effective way of lowering the defect rate.
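For a flavour of what I mean, a made-up example (the names are invented, it's the general pattern that matters):

    #include <optional>
    #include <string>

    // An Email can only be obtained through parse(), so any function taking an
    // Email never has to handle the "unvalidated string" case -- that state
    // simply can't be expressed, and needs no test.
    class Email {
    public:
        static std::optional<Email> parse(const std::string& s) {
            if (s.find('@') == std::string::npos) return std::nullopt;
            return Email(s);
        }
        const std::string& value() const { return value_; }
    private:
        explicit Email(std::string s) : value_(std::move(s)) {}
        std::string value_;
    };

    void send_welcome_mail(const Email& to);   // can't be called with a raw, unchecked string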
Given the project management triangle, they kind of are - the goal should be to achieve the required defect rate at minimum maintenance cost. A lot of people talk as though more tests or more verification is always a good thing, but zero defects is usually not a business priority if it means slowing down feature work. So adding more different kinds of verification should be a last resort, only when you actually require a lower defect rate than you can achieve with one method alone - which is pretty rare IME.
Much as I enjoy writing high quality software, it's irresponsible to spend time or money producing something higher quality than is justified by the needs of the business. Generally I find that unit tests for anything other than actual logic (which is a tiny fraction of most codebases) don't offer enough benefit to be worth the costs.
(I'd also argue that unit testing can reduce the quality of the software as it makes the internals of the architecture more costly to change).
> Much as I enjoy writing high quality software, it's irresponsible to spend time or money producing something higher quality than is justified by the needs of the business.
I think it's more irresponsible to produce something that the customer isn't going to be able to use for many years or that needs a patch before it's even released.
The responsible approach is the one that delivers the best cost/benefit for the customer. Yes, it doesn't mean producing something that's completely unusable - but it rarely means producing something completely bug-free either.
I agree, good code coverage can probably get most of the way there. But there are some problems with relying on coverage.
Coverage can be misleading. Maybe there is some function being executed in parallel:
    int global;
    int complex_condition;      /* set elsewhere */
    void bar(int *p);           /* may modify *p somewhere down the call chain */

    void foo() {
        int local;
        if (complex_condition)
            bar(&local);        /* per-thread local: nothing shared */
        else
            bar(&global);       /* shared global: the potential data race */
    }
And assume that somewhere way down the call chain the data passed to bar is modified. The bar function and all of the functions called in bar can have 100% test coverage, but if the else branch is not covered, the race will be missed.
So without true 100% test coverage, it is possible races are being missed by dynamic tools, and true 100% test coverage is probably not possible or feasible in large complex projects.
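To make that concrete with the example above: even if everything is built with TSan, nothing is reported unless some test actually sends two threads down the racy branch without synchronization between them. A minimal sketch (assuming a typical clang/TSan setup; what gets reported depends entirely on what the tests execute):

    // Build and run under TSan, e.g.:  clang++ -fsanitize=thread -g race.cc && ./a.out
    #include <thread>

    int global = 0;

    void writer(bool take_racy_path) {
        if (take_racy_path)
            ++global;                     // unsynchronized write to shared data
    }

    int main() {
        std::thread t1(writer, true);     // flip either flag to false and the racy
        std::thread t2(writer, true);     // branch never runs, so TSan has nothing to see
        t1.join();
        t2.join();
    }

The tool is only as good as the code paths and interleavings the test suite actually exercises.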
> but if the else branch is not covered, the race will be missed.
Yes.
> true 100% test coverage is probably not possible or feasible in large complex projects.
I have yet to encounter a single large complex project where 100% test coverage is not possible. I have encountered such projects where 100% test coverage required massive refactoring of the code to facilitate testing, and doing so resulted in fixing most of the customer complaints we had received.
So you truly believe that Firefox or Windows or Ubuntu could not only be 100% covered by unit tests, but that it would actually be worth taking the time to do so? (edit: typos)
As someone who has worked on the code of Firefox, I strongly doubt that.
For instance, pretty much every piece of code (whether native or JS) I've seen is programmed defensively, assuming that for some reason invariants can be broken - something that typically cannot happen unless there is a memory corruption involved. Testing all these defensive programming branches is not possible without introducing memory corruptions.
As Yoric mentioned, that would require some custom Qemu patch to corrupt memory, some test boxes with dodgy RAM, and/or other things along those lines.
Removing those "impossible" code branches is a bad idea. "Impossible" things happen millions of times a day to Firefox.
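For a made-up illustration of the kind of defensive branch being discussed (not actual Firefox code):

    #include <cstddef>

    void record_corruption_report();          // hypothetical telemetry hook

    struct Buffer {
        unsigned char* data;
        std::size_t len;
        std::size_t capacity;                 // invariant: len <= capacity, preserved by every
                                              // (not shown) code path that creates or grows one
    };

    bool append_byte(Buffer* buf, unsigned char byte) {
        if (buf->len > buf->capacity) {       // "impossible" unless memory got corrupted
            record_corruption_report();       // unreachable in tests without injecting corruption
            return false;
        }
        if (buf->len == buf->capacity)
            return false;                     // full: the ordinary, testable failure path
        buf->data[buf->len++] = byte;
        return true;
    }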
I used to work for LimeWire, which had tens of millions of users. We were sometimes seeing bug reports of "impossible" stack traces. They went down dramatically when we added a splash screen that calculated SHA-1 hashes of all of the DLLs and JARs before using the reflection API to load and start the actual application. Given that the "corrupted" JARs were still presumably passing the JARs' CRC-32 checksums and bytecode verification at class load time, I think we were most often just detecting dodgy RAM that would later have caused "impossible" errors by corrupting bytecode or native code after bytecode verification.
Writing defensively is great! There's nothing wrong with that.
If you're writing code to defend against "dodgy RAM" whose symptom is a corrupt SHA then it's super trivial to test that code: have a valid JAR file and adjust its content so that it will still "load" but isn't the file you expect. Then verify that the program refuses to load the JAR.
> have a valid JAR file and adjust its content so that it will still "load" but isn't the file you expect. Then verify that the program refuses to load the JAR.
Sorry for being a bit slow today, but what's the difference between "load"ing and loading? You clearly understand that you're using two different definitions of load, otherwise you wouldn't have put one in quotes. (Also, it would be self-contradictory if the two were identical.) However, I can't for the life of me figure out two different definitions for "load" which both involve CRC-32 and bytecode verification checks, and yet aren't identical.
If I'm correct, we saw a dramatic reduction in the "impossible" errors not because we were detecting the actual problem that caused them (corruption of internal JVM buffers after bytecode validation), but because our SHA-1 failures (loading chunks of files into our own buffers and checksumming those) had the same underlying cause and so were highly correlated with the "impossible" failures.
In our case, the JVM loaded the code from the JAR, which, unless I misunderstand Sun/Oracle's classloader, means that both CRC-32 checksums and bytecode verification pass, even for unsigned JARs. The chances of random bit flips passing CRC-32 checks are 4 billion to one. The odds of random bit flips passing both a CRC-32 check and bytecode verification are probably at least quadrillions to one. I'm pretty sure the corruption was happening minutes to hours after the classes actually got loaded by the JVM.
In other words, I think what saved us was that SHA-1 summing a bunch of large files acted as a poor man's memtest86: writing semi-random-ish patterns to big chunks of memory and verifying that you can read back what you wrote. Failing this correlates heavily with seeing "impossible" errors at runtime, but we weren't directly examining the classloader's internal buffers that were the actual cause of the errors.
On a side note, modern OSes should really have a low rate memory checker built into the kernel's memory manager. It's a crutch, but cheap consumer IoT devices and WiFi APs aren't getting ECC any time soon.
The difference is that "load" means that the file is otherwise acceptable (valid header, valid content) but that it's been adjusted so that the checksum isn't what's described in the manifest.
That is:
1) the JAR file is successfully loaded by the JVM -- as in it will "load"
2) the JAR file doesn't pass the SHA checksum requirement you've added -- as in it isn't the file you expect.
Even disregarding the very good points about code defending against memory errors mentioned by others, I imagine a realistic time frame for such coverage of a project the size of Firefox or Windows to be no less than a decade, with zero other development going on. I find it absurd to imagine it would be worth it, even if it reduced the bug count to zero (which it wouldn't).