Hash-based bisect debugging in compilers and runtimes (swtch.com)
254 points by rsc 9 months ago | 45 comments



Aside: bisecting flakes doesn't have to involve repeated runs. You can reformulate bisection as an information-probing operation, expanding its scope to support noisy benchmarks or low-probability flakes. Bayesian inference narrows down the probable range of the failure with each new observation, and you can choose new probes to maximize information gain, or even run them in parallel to minimize total time.

You do have to provide a flake-rate probability to do the estimates, but even roughly correct rates work fine. Running bisects assuming a 5% chance of a false positive or negative barely adds any steps and greatly improves robustness.

The math is pretty simple too; my old prototype might still be in Google's monorepo. I should reimplement it for the open source world.


Indeed. Something like this is what I intend to do when I get some time.


Could you briefly sketch out the math (or point to a sketch) so that other people can pick it up? I'm quite interested!

(I suspect you could get pretty far with a Monte Carlo simulation, and that would let you bypass most of the math anyway.)


This is the probabilistic bisection algorithm, due to Horstein; the original paper is "Sequential transmission using noiseless feedback," IEEE Trans. Inform. Theory, 9(3):136–143. https://ieeexplore.ieee.org/document/1057832

Here's a nice presentation on an analysis of the method; references start on slide 47:

https://people.orie.cornell.edu/pfrazier/Presentations/2014....

Paraphrasing the theorem from Jedynak, Frazier, Sznitman (2011):

Suppose [flake probability] is constant, known, and bounded away from 1/2, and we use the entropy loss function. ... The policy that chooses [next test point] at the median of [current posterior distribution] is optimal.
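
To make that concrete, here is a rough sketch of the median rule (my own illustration, not code from the paper): the hypotheses are "the first bad index is i", test(m) probes point m and reports "bad is at or before m", possibly incorrectly with probability q, and each new probe goes at the median of the current posterior.

    import numpy as np

    def probabilistic_bisect(test, n, q=0.05, probes=30):
        # posterior over "first bad index is i", from a uniform prior
        post = np.full(n, 1.0 / n)
        for _ in range(probes):
            # probe at the posterior median (Horstein's rule); with flaky
            # tests the same point may usefully be probed more than once
            m = int(np.searchsorted(np.cumsum(post), 0.5))
            if test(m):
                # observed a failure at m: evidence that the bad index is
                # <= m, correct with probability 1 - q
                like = np.where(np.arange(n) <= m, 1 - q, q)
            else:
                like = np.where(np.arange(n) <= m, q, 1 - q)
            post *= like
            post /= post.sum()          # renormalize the posterior
        return int(post.argmax())       # most probable culprit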

And some Python code that implements a version of the analysis:

https://github.com/choderalab/thresholds/blob/master/thresho...


Thanks!


Here is a doc explaining a Bayesian search algorithm; not sure if it's precisely what GP has in mind.

https://github.com/Ealdwulf/BBChop/blob/master/BBChop/doc/Ba...

(Edit: had wrong link originally)


Thanks!


For Bayesian inference, the example in the Wikipedia article is good (who doesn't like cookies?): https://en.m.wikipedia.org/wiki/Bayesian_inference#Examples

When gathering more evidence, you'd use your new beliefs about which cookie bowl Fred has as the priors: P(H1)=0.6 and P(H2)=0.4.
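
For instance, if the next piece of evidence is another plain cookie (assuming Fred puts the first one back), the same update applied to the new priors gives roughly 0.69, a quick sketch:

    # P(plain | bowl 1) = 30/40 and P(plain | bowl 2) = 20/40,
    # per the Wikipedia example
    p_h1, p_h2 = 0.6, 0.4          # beliefs after the first cookie
    like1, like2 = 0.75, 0.5       # likelihood of drawing a plain cookie
    p_h1 = p_h1 * like1 / (p_h1 * like1 + p_h2 * like2)
    print(p_h1)                    # ~0.692: bowl #1 looks more likely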


That just explains Bayesian inference in general, which is useful, but not what I'm after.

I was specifically interested in the application to binary search / bisection in the presence of flaky tests.


You apply the same process, but your hypotheses are "the bug is before/after this bisection point". The "probability of evidence given hypothesis" terms are where you incorporate your guess about the test's flakiness. It still works even if you don't have "true" numbers for the flakiness; it just won't converge as quickly.


This is so cool!

It reminds me a bit of one of my favorite debugging techniques: running an individual test with coverage reporting enabled, and then clicking around the HTML coverage report to see exactly what code path the test followed, without needing print statements or a specialized debugger.

Very helpful for answering questions of the form "Why does this test not exercise the code I thought it did?".


In my former life, I maintained a script that could be given two sets of object files, one compiled with optimization and one without; it would effectively do a binary search, choosing which object files to link from each set and running the executable to determine success/fail. Each iteration is quick, since the linking step is usually fast. This was useful when troubleshooting a big binary, since back then an optimized build of a large executable was often quite slow.
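
Reconstructing the idea as a sketch (not the original script; the cc invocation, directory layout, and single-culprit assumption are all mine):

    import subprocess

    def passes(opt_dir, noopt_dir, names, optimized):
        # link each object from opt_dir if chosen, else noopt_dir, then run
        objs = [f"{opt_dir if n in optimized else noopt_dir}/{n}" for n in names]
        subprocess.run(["cc", "-o", "prog", *objs], check=True)
        return subprocess.run(["./prog"]).returncode == 0

    def bisect_objects(opt_dir, noopt_dir, names):
        # assumes exactly one object file misbehaves when optimized
        suspects = list(names)
        while len(suspects) > 1:
            half = suspects[: len(suspects) // 2]
            if not passes(opt_dir, noopt_dir, names, set(half)):
                suspects = half            # culprit is in the optimized half
            else:
                suspects = suspects[len(suspects) // 2:]
        return suspects[0]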


This is very cool and seems similar to what we do/did in Cinder: https://bernsteinbear.com/blog/cinder-jit-bisect/

EDIT: Oops, this tool is already mentioned in the post (though the post doesn't link this writeup, so here it is if you want to read about it). Neat!


Thanks for that link. I've changed the link beneath the Cinder mention to that page instead of the GitHub page.


A closely related technique for debugging optimization passes is that of "optimization fuel". Each rewrite decreases the fuel by one, and when the fuel is gone no more rewrites happen. You can then perform binary search on the optimization fuel to find a specific rewrite instance that breaks things.
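
A toy sketch of the pattern (the names and the "pass" are made up; the point is the fuel counter and the monotone search over it):

    class Fuel:
        def __init__(self, n):
            self.left = n
        def spend(self):
            # one unit per rewrite; once empty, rewrites stop firing
            if self.left == 0:
                return False
            self.left -= 1
            return True

    def optimize(instrs, fuel):
        # toy pass: strength-reduce "mul 2" to "shl 1" while fuel lasts
        return ["shl 1" if i == "mul 2" and fuel.spend() else i
                for i in instrs]

    def find_bad_rewrite(instrs, breaks):
        # smallest fuel budget whose output still breaks the test; under a
        # single-culprit assumption, rewrite number `lo` is the bad one
        lo, hi = 0, sum(i == "mul 2" for i in instrs)
        while lo < hi:
            mid = (lo + hi) // 2
            if breaks(optimize(instrs, Fuel(mid))):
                hi = mid
            else:
                lo = mid + 1
        return lo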


Yes, I believe LLVM has a flag for "run only the first N optimization passes" that gets used with binary search this way.

A global optimization fuel that worked at finer granularity would be even more precise but you'd have to have the compiler run single-threaded to make sure the numbering always matches. At least in the Go compiler, we compile different functions in different threads, so that fouls up any global numbering.


When I read https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/ and saw the hash bisection trick for the first time, I was super impressed; it really is incredibly slick :) I imagine that's how coming across plain git bisect for the first time feels to new engineers.


I'm running one of these now, interestingly enough. Apparently something's broken in OpenJDK if you build with the new Xcode. So I'm bisecting on all the files (choosing either the old or new compiler) trying to see which one is breaking things.


For those following along, I ended up doing four levels of bisects to find the actual problem. First, the bisect I described above to identify the problematic file: stackMapTable.cpp. Then another on the functions inside the file, compiled with and without optimizations, to find the function that was broken: StackMapReader::next(StackMapFrame, bool, unsigned short, unsigned short, JavaThread). With that I manually inspected the generated code, found that it was wrong, and then did a bisect on optimization passes to see which one broke it (ConstraintEliminationPass). Finally, one last bisect over the history to see where this came from (it seems to be fixed in a change Xcode hasn't pulled yet).


In the event that this is added to the standard library, I'm going to be really curious to see what a "hello world" project/example would look like.

I went so far as to find the commit where David Chase added for loopvar on Mar 6, 2023 (github: golang/go/commit/c20d959) to try to design my own hello world with x/tools/cmd/bisect, but I'm out of my depth.

The hash tree is a great visualization. I wouldn't have grasped the importance of the hash suffix until I saw the tree. Awesome stuff.


If we add it to the standard library, it will be for runtime uses (identifying the call stack for which swapping in the new implementation or behavior breaks the test). It is pretty simple: you call 'godebug.New' in a global init to get a lookup keyed to a particular setting, and then you consult that setting at runtime. For example, take a look at https://go.dev/src/crypto/x509/x509.go#L881 and the uses of x509sha1 a few lines later.

You just call x509sha1.Value() and make a decision about old or new behavior based on that. Bisect can then toggle that result based on the hash of the call stack to narrow down the exact path that leads to the critical decision.

You can ignore the IncNonDefault call a few lines later. That's about monitoring whether any non-default (old) behaviors are being executed in a program, so that people can hook it up to something like Prometheus and see whether they would be affected by switching to "always default behavior".


LLVM can bisect down to individual transformation steps within a pass (for passes that implement the interface): https://llvm.org/docs/OptBisect.html

And there is a script for bisecting functions, given assembly produced by a working baseline and a “bad” compiler to compare: https://github.com/llvm/llvm-project/blob/main/llvm/utils/ab...


This is pretty cool. Thanks for the pointer.

Also, hi Matthias!


This is pretty close to what I built for maintaining demo compatibility in Doom engines. Basically, it runs a demo and dumps a save game to a file every frame. As soon as there's a divergence, it says what the difference is ("monster 17 is at (4, 22); should be (4, 21)") and bails. Not a ton of difference between that and diffing the stack.

https://github.com/camgunz/democomp
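
In sketch form (hypothetical dump format; the real tool diffs savegames):

    def first_divergence(ref_dumps, test_dumps):
        # each dump is a dict of game state for one frame; report the first
        # frame and field where the two runs disagree
        for frame, (ref, test) in enumerate(zip(ref_dumps, test_dumps)):
            for key in ref:
                if ref[key] != test.get(key):
                    return f"frame {frame}: {key} is {test.get(key)}; should be {ref[key]}"
        return None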


What you are describing sounds like comparing traces of two supposed-to-be-identical programs to see where they first diverge. That's not what this is describing. There is no trace at all, nor any decision about what to trace or what not to trace.

There is only a decision about whether to use the old or new implementation based on the hash of the call stack at that moment. Then you binary search on hash values to identify the exact call stack hash for which the new implementation causes a problem. Then you run the program once more with the instructions "print the stack with this hash".
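
A sketch of that decision in Python (my illustration of the idea, not the Go implementation; the suffix pattern here is a stand-in for what the bisect driver would supply):

    import hashlib, traceback

    def stack_hash():
        # hash the code locations of the current call stack (the Go tool
        # hashes the PCs of up to 16 frames)
        frames = traceback.extract_stack()[:-1]   # drop stack_hash itself
        sig = ";".join(f"{f.filename}:{f.lineno}" for f in frames)
        return int(hashlib.sha256(sig.encode()).hexdigest(), 16)

    BISECT_SUFFIX, BISECT_BITS = 0b101, 3   # hypothetical pattern under test

    def use_new_implementation():
        # enable the new code path only for stacks whose hash ends in the
        # suffix; the driver varies the suffix to binary-search hash space
        return stack_hash() & ((1 << BISECT_BITS) - 1) == BISECT_SUFFIX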


What you implemented is just checking output against a reference. That's an extremely common testing method.

This is significantly different because it actually changes the program being run. It's not common at all, probably because there aren't many situations where it applies: you need to be processing some large input that you don't understand, and to routinely change bits of your code in ways that can be toggled on and off dynamically during a single run.


Are there any non-compiler use cases for this technique?


Anyone working on libraries that are used by large programs can use it, as in the sort and timer cases described toward the end of the paper. When you work on libraries used by other, larger programs, you inevitably break them accidentally. This technique pinpoints the exact context of the breakage.


All sorts of stuff; I git bisect reasonably regularly.

You can even do things like automatically git bisect the Linux kernel, with a little care. I wrote this up a few years ago: https://paulgraydon.co.uk/posts/2020-12-27-automated-kernel-....

I can't talk about the actual case that led me to need to git bisect a kernel in the first place, but along the way I also learned how to make a minimalist one-C-file initrd, because what I needed to catch was visible under a /sys or /dev mount, or something like that. I later realised it's possible to do the same thing with Go in slightly easier syntax, if you cajole it into producing a statically compiled binary!


He means the runtime bisection that this article is about, not `git bisect`.


You can use this for any codebase of sufficient complexity where you want to identify which change is causing the effect you're observing.


Not really; it's only suitable where your code is processing some other large input that you don't understand, and where your changes are easily toggleable at runtime and can be applied to only part of the input.


It works just fine for me even when that wasn't the case.


Yeah I think you didn't get through the (admittedly very long) article. We aren't talking about git bisect.


I assume you were talking about using hashes to encode some property of the program that you then use for bisection. git bisect is still bisection, but somewhat different IMO.


I would frame it differently: stack hash bisect is only suitable where your code is linked into some other large program that you don't understand, and when your changes can be toggled on a per-invocation basis. Then you can use it to identify the exact stack through the larger program that leads to your code and works with the old behavior but breaks with the new behavior.

(saagarjha seems to be talking about 'git bisect', which is not really what the post is about.)


this is a wonderful post!

the algorithms for using binary search to efficiently reduce a set satisfying some predicate to a locally minimal satisfying subset* are new to me (though cox says zeller published a slightly buggy version in 01999! and meta's cinder a correct one in 02021), and seem brilliant; their applications are not limited to debugging. i wonder how it relates to hypothesis's test-case reduction algorithm; can one of them be considered an application of the other?

this idea of binary-search debugging over the program's call tree rather than your revision history (or program input, or a set of data instances) is also a brilliant one. and although they published it a decade ago, i hadn't heard about it until now

the examples of asynctimerchan=1, changing optimization settings, and changing sort algorithms have in common that in some sense they are behavior-preserving, so you can toggle them on and off at will during execution without breaking anything. i wonder how to apply this call-tree debugging if the change you're trying to narrow down is a change that has to be consistent throughout the program's execution. for example, suppose some code using your hash tables breaks when you switch to a new hash function, maybe because it inadvertently depended on enumeration order. if you change the hash function partway through the program, you won't be able to find things in your hash tables after that. you could change the algorithm per table, of course, and narrow it down to a particular table, but that won't give you the particular line of code

i need to think a bit more about this issue of 'hashing a list of program counters'. you could of course number the sequence of all subroutine invocations during a (deterministic! single-threaded!) execution, as gas does for macro invocations, and binary-search that dense numbering. (this is a variant of the technique carry_bit is calling 'optimization fuel', but one that requires support from a compiler or debugger.) but, since you're toggling options on and off that will change the number of subroutine calls, the numbering won't be stable; so this will tend to only reliably find single-culprit failures

you could possibly get a stable-enough numbering using pathnames like /3/5/1, meaning the 1st subroutine called from the 5th subroutine called from the 3rd subroutine called from main(). that seems like it might in some sense be stabler than hashing the entire list of return addresses, and it would certainly permit a lower-overhead implementation using a debugger and breakpoints rather than a check in every leaf call. plausibly i'm overlooking a flaw in this form of 'sequential numbering'? does the hashed list get truncated at some point for stability?
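
(a toy sketch of such pathname numbering, single-threaded, with per-frame child counters:)

    class CallPath:
        # the 1st subroutine called from the 5th subroutine called from the
        # 3rd subroutine called from main() gets the pathname /3/5/1, so a
        # name only shifts when an earlier sibling call appears or vanishes
        def __init__(self):
            self.counts = [0]   # direct calls made so far by each open frame
            self.path = []
        def enter(self):        # invoke on subroutine entry
            self.counts[-1] += 1
            self.path.append(self.counts[-1])
            self.counts.append(0)
            return "/" + "/".join(map(str, self.path))
        def exit(self):         # invoke on subroutine return
            self.counts.pop()
            self.path.pop()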

often when you have a change that is in some sense behavior-preserving, which is to say, you have two ways to do the same thing, you can use generative testing systems like hypothesis to detect bugs in either of them: process the same input through both paths and verify that the results are equivalent in the appropriate sense. this doesn't require the instrumentation infrastructure russ is using here, but it does depend on you being able to identify the relevant 'input', which can be very hard

in itself that doesn't help with the kinds of bugs he's talking about here, though: bugs where both the old and new code is 'equivalent' by your lights, but some other client code that calls it doesn't find it equivalent. this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code. aslr and hash-table seed randomization are doing this for us for some purposes, but unlike generative-testing frameworks, they provoke outages in production, don't do test-case minimization, and don't record failing cases to make bisection easy and prevent later regressions. and they don't do things like shuffling the input to a non-stable sort subroutine

binary-search debugging does indeed feel magical. scaevolus seems to be saying there's a bayesian generalization of it for nondeterministic bugs that are effectively random? you can of course run the test 5 (or 1000) times on each revision you're binary-searching over, but it feels like, if the number of revisions you're searching over is several thousand, you ought to be able to get some additional advantage out of running the test once on each of 5 (or 1000) revisions. can you solve this just by maximizing the expected shannon information of each test?

on a side note, it's pretty appalling that 30 years ago the plan9 group had `yesterday -d -n 7 anyfilename` to see what changed in the last week, thanks to their optical jukebox, while in the mainstream we still struggle with accidental file deletion and overwrites despite routinely carrying around terabytes in our pockets

on an even more marginally relevant note, earlier this week i was perusing the 7th edition unix kernel, in which the subroutine that switches stacks (the one with the well-known comment in 6th edition) is called swtch(). and tonight i just realized why russ cox uses that domain name

______

* conventionally this is just called a 'minimal satisfying subset', because it's 'minimal' in the partial-order sense, but i think cox's term 'locally minimal' is clearer


One small correction: we were doing bisect over compiler optimizations a decade ago, but bisect over call trees only happened in the past year or so.

For the hash table case, deciding the function per-table as you suggest is probably good enough. In a large program there are going to be tons of hash tables, and if bisect gets you to "it's this specific hash table allocated by this call stack that matters", that's a huge win.

The timer change is much like the hash table one, in that we toggle the "timer kind" at creation time, but it's only later operations that actually change behavior. Still, identifying the specific timer and call stack that created it turned out to be good enough in all cases.


thank you very much for the correction and the clarifications—and for the excellent post

does the list of callsite return counters being hashed get truncated at some point for stability? i'd think that hashing the call stack f→g→j→k→m→n→o→p to a hash unrelated to f→g→h→j→k→m→n→o→p would tend to interfere with the 'forced' set, but maybe this is only being used for things where the behavior is so closely equivalent that this problem doesn't arise


The assumption is that for a given test program, the call stack used at the specific moment where the failure begins is the same each run. If that's not true, there will be problems.

If there are two different call stacks that get to the function in question, but only one of them is where the failure begins, you absolutely want to distinguish those.

In practice we do cut off the stacks, using just 16 frames, but that's to avoid O(stack depth) overhead, not to coalesce different things.


i was reading zeller's book today and came to his example of delta-debugging minimization, which his ddmin algorithm solves in 48 tests, and it seemed like your 'easy' algorithm would be more efficient, so i hacked together a test program. your algorithm only takes 29 tests on this case. but because i couldn't remember your algorithm clearly, i first accidentally implemented a different variant that also works, but only takes 21 tests. it's not more efficient in the general case. still, it seems interesting, because it is arguably simpler than your 'easy' version (at least if we sweep the call to first_satisfying_bisect under the standard library) but still preserves logarithmic performance when the satisfying set it finds is of bounded size:

    def bisect_dumb(p, s):
        # reduce s to a locally minimal satisfying set of indices, finding
        # one needed element per pass, from the right end inward
        end, need = len(s), []

        while True:
            need.sort()
            needed = [s[j] for j in need]
            # shortest prefix of s[:end] that still satisfies p together
            # with the elements already known to be needed
            last = first_satisfying_bisect(lambda i: p(s[:i] + needed), 0, end)
            if last == 0:
                return need  # the needed elements alone satisfy p

            end = last - 1
            need.append(end)  # s[last - 1] is needed; now search before it
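
(for completeness, the first_satisfying_bisect it leans on is an ordinary monotone binary search, something like:)

    def first_satisfying_bisect(p, lo, hi):
        # smallest i in [lo, hi] with p(i) true, assuming p is monotone
        # (false below some threshold, true from it on) and p(hi) is true
        while lo < hi:
            mid = (lo + hi) // 2
            if p(mid):
                hi = mid
            else:
                lo = mid + 1
        return lo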
the full program, with comments, is at http://canonical.org/~kragen/sw/dev3/bisectreduce.py. the corresponding version of bisect_easy would be

    def bisect_easy(p, s, targets=None, need=[]):
        # return indices (drawn from targets) of a locally minimal subset of
        # s that satisfies p together with the already-required need indices
        if targets is None:
            targets = list(range(len(s)))
        if not targets or p([s[i] for i in sorted(need)]):
            return []
        if len(targets) == 1:
            return [targets[0]]

        m = len(targets)//2
        left, right = targets[:m], targets[m:]

        # reduce the left half while keeping all of the right half, then
        # reduce the right half while keeping only the reduced left half
        left_reduced = bisect_easy(p, s, left, right + need)
        right_reduced = bisect_easy(p, s, right, left_reduced + need)

        return left_reduced + right_reduced
bisect_easy is much faster if the minimal set it finds, of some size m, is a long substring a long way from the beginning of the input; i think it takes time proportional to log m + log n in that case, while bisect_dumb takes m log n

however, in cases where hashing scatters the size-m minimal set all over the search space, instead of compacting it into a little lump, i think it also ends up with m log n performance

i may get to implementing the easy/hard algorithm and zeller's ddmin. also, i have the intuition that, as with quicksort, trisection will have consistently but not overwhelmingly better performance than bisection for this purpose (though not in bisect_dumb)


i see, thanks!


> this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code

There is a compiler fuzzing method called "equivalence modulo inputs" that does something like this. Take a fuzzed program with fixed inputs and profile it to find parts that are never executed. Change only code in those never-executed parts. If the original and modified programs behave differently, you've found a compiler bug: the compiler got confused by code that it can see but that (unknown to the compiler) cannot dynamically influence the program's behavior on the given inputs.

Like so many things in fuzzing, this sounds silly but actually finds bugs in practice.



