
I don't yet know a better way to do this than using a threshold!

I think if you assume perf is normally distributed, you can still get some of the math to work out. But I will need to think more about this... if I ever choose this adventure, I'll post an update on https://github.com/hauntsaninja/git_bayesect/issues/25

(I really enjoy how many generalisations there are of this problem :-) )


Perf is not normally distributed. Or rather, it is very, very common for it not to be. Where I work, we often see multimodal (usually mostly bimodal) distributions, and we'll get performance alerts that, when you look at them, are purely a result of more samples happening to shift from one mode to another.

It's easy to construct ways that this could happen. Maybe you're running a benchmark that does a garbage collection or three. It's easy for a GC to be a little earlier or a little later, and sneak in or out of the timed portion of the test.

Warm starts vs cold starts can also do this. If you don't tear everything down and flush before beginning a test, you might have some amount of stuff cached.

The central limit theorem says you can still make it approximately normal by running enough times and adding the results up (or running each iteration long enough), but (1) that takes much longer and (2) smushed together data is often less actionable. You kind of want to know about fast paths and slow paths, and that you're falling off the fast path more often than intended.
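A quick sketch of the trade-off, with made-up numbers: individual runs are bimodal (fast path vs an occasional GC sneaking into the timed window), but summing batches of runs washes that out. The relative spread collapses, and so does your ability to see which mode each run fell into.

```python
import random
import statistics

random.seed(0)

def one_run():
    # Hypothetical bimodal benchmark: fast path ~10ms, with a 30% chance
    # of a GC landing inside the timed portion and costing ~50ms.
    return random.gauss(10, 1) if random.random() < 0.7 else random.gauss(50, 2)

raw = [one_run() for _ in range(10_000)]

# Sum batches of 100 runs: the batch totals look roughly Gaussian
# even though the individual runs are clearly bimodal.
batches = [sum(one_run() for _ in range(100)) for _ in range(1_000)]

# Relative spread (coefficient of variation) collapses once the modes
# are smushed together -- but the fast-path/slow-path signal is gone.
cv_raw = statistics.stdev(raw) / statistics.mean(raw)
cv_batch = statistics.stdev(batches) / statistics.mean(batches)
print(f"per-run CV: {cv_raw:.2f}, per-batch CV: {cv_batch:.2f}")
```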

As usual you can probably cover your eyes, stick your fingers in your ears, and proceed as if everything were Gaussian. It'll probably work well enough!


Yep, you can run bayesect to an arbitrary confidence level.

This script in the repo https://github.com/hauntsaninja/git_bayesect/blob/main/scrip... will show you a) that the confidence level is calibrated, and b) how quickly you get to that confidence level (on average, at p50 and at p95)

For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests


You're right, at 300 tests bayesect converges to ~97-100% across the board. I reran with calibration.py and confirmed.

Went a step further and tested graph-weighted priors (per-commit weight proportional to transitive dependents, Pareto-distributed). The prior helps in the budget-constrained regime:

128 commits, 500 trials:

Budget=50, 70/30: uniform 22% → graph 33%

Budget=50, 80/20: uniform 71% → graph 77%

Budget=100, 70/30: uniform 56% → graph 65%

At 300 tests the gap disappears since there's enough data to converge anyway. The prior is worth a few bits, which matters when bits are scarce.

Script: https://gist.github.com/rs545837/b3266ecf22e12726f0d55c56466...


Yes! It will show you the posterior probability as a single commit starts to become more likely. In addition, running `git bayesect status` will show you all posterior probabilities.


Yay, I had fun with it too!

IIUC the way you'd do that right now is just repeatedly recording the individual observations on a single commit, which effectively gives it a probability + the number of trials to do the Beta update. I don't yet have a CLI entrypoint to record a batch observation of (probability, num_trials), but it would be easy to add one
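As a sketch of that bookkeeping (not git_bayesect's actual internals, just the conjugate Beta arithmetic): recording n single observations against one commit lands on exactly the same posterior as one batch update, so a (probability, num_trials) entrypoint really is just addition.

```python
def beta_update(a, b, failed):
    # One Bernoulli observation: a failure bumps alpha, a success bumps beta.
    return (a + 1, b) if failed else (a, b + 1)

def beta_batch_update(a, b, k, n):
    # k failures out of n trials in one shot -- conjugacy means this is
    # identical to applying beta_update n times, in any order.
    return a + k, b + (n - k)

# 3 failures then 7 successes, starting from a uniform Beta(1, 1) prior:
a, b = 1, 1
for failed in [True] * 3 + [False] * 7:
    a, b = beta_update(a, b, failed)

print((a, b), beta_batch_update(1, 1, k=3, n=10))  # both are (4, 8)
```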

But ofc part of the magic is that git_bayesect's commit selection tells you how to be maximally sample efficient, so you'd only want to do a batch record if your test has high constant overhead


recompiling can be high constant overhead


In theory, the algorithm could deal with that by choosing, at each step, the commit that gives the best expected information gain divided by expected test time. In most cases it would be more efficient just to cache the compiled output, though.
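A minimal sketch of that heuristic (function and commit names are made up; this is not git_bayesect's selection code): staying on the currently checked-out commit skips the recompile, so normalize each commit's expected gain by its true wall-clock cost.

```python
def pick_commit(expected_gain, test_time, recompile_time, last_commit):
    # expected_gain: commit -> expected information gain (bits) of one test.
    def cost(commit):
        extra = 0 if commit == last_commit else recompile_time
        return test_time + extra
    return max(expected_gain, key=lambda c: expected_gain[c] / cost(c))

# With a 10s recompile and 1s tests, the slightly lower-gain commit we're
# already sitting on beats a higher-gain commit that needs a rebuild:
choice = pick_commit({"abc123": 0.5, "def456": 0.6},
                     test_time=1, recompile_time=10, last_commit="abc123")
print(choice)  # abc123
```

Note this is exactly the behavior questioned downthread: once the gain on the current commit decays below the rebuild-adjusted gain elsewhere, the policy switches, so you get long runs with occasional alternation.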


This doesn't sound quite right, but I'm not sure why.

Perhaps: a reasonable objective would be to say that for N bits of information, I would like to pick the test schedule that requires the least total elapsed time. If you have two candidate commits and a slow recompile time, it seems like your algorithm would do many repeats of commit A until the gain in information per run drops below the expected gain from B divided by the recompile time, then it would do many repeats of B, then go back to A, etc. So there are long runs, but you're still switching back and forth. You would get the same number of bits by doing the same number of test runs for each commit, but batching all of the A runs before all of the B runs.

Then again: you wouldn't know how many times to run each in advance, and "run A an infinite number of times, then run B an infinite number of times" is clearly not a winning strategy. Even with a fixed N, I don't think you could figure it out without knowing the results of the runs in advance. So perhaps your algorithm is optimal?

It still feels off. You're normalizing everything to bits/sec and choosing the maximum. But comparing an initial test run divided by the rebuild time vs a subsequent test run divided by a much faster time seems like you're pretending a discrete thing is continuous.

I wish I could math good.


The general requirement for this approach to be optimal is called "dynamical consistency". A good description is in [1]. It is the situation where, suppose you have a budget B, and you search until your budget is exhausted. Then you are informed that there is an additional budget, B2, and you can continue searching until that is exhausted. A situation is dynamically consistent if, for any B and B2, the optimal strategy is such that you would make the same choices whether or not you know that you will get B2.

So you are correct that discreteness is a problem, because if you are nearing the end of the budget you may optimally prefer to get more dice rolls than take bigger bets. But the optimal solution is then often analytically intractable (or at least it was - I last read about this a while back), and the entropy approach is often reasonable anyway. (For cases where search effort is significant, a good search plan can be found by simulation).

[1] https://bayes.wustl.edu/etj/articles/search.pdf


Note that "pick the commit with best expected information gain" in git_bayesect isn't optimal even in the no overhead regime. I provide a counterexample in the writeup, which implies ajb's heuristic is also not optimal. I don't see a tractable way to compute the optimal policy.

One idea: if you always spend time testing equal to your constant overhead, I think you're guaranteed to be not more than 2x off optimal.

(and agreed with ajb on "just use ccache" in practice!)


git bisect works great for tracking down regressions, but relies on the bug presenting deterministically. But what if the bug is non-deterministic? Or worse, your behaviour was always non-deterministic, but something has changed, e.g. your tests went from somewhat flaky to very flaky.

In addition to the repo linked in the title, I also wrote up a little bit of the math behind it here: https://hauntsaninja.github.io/git_bayesect.html


Nice! I implemented a similar thing a while back: https://github.com/Ealdwulf/BBChop

I'm going to have to check out how you got linear time with Shannon entropy, because I used Renyi entropy to do that, to make the algebra easier.

It's also possible to do it over the DAG, rather than a linear history, although that makes the code a lot more complicated. Unfortunately there doesn't seem to be a linear-time cumulative sum algorithm over DAGs, so it's superlinear in some cases.


This is really cool! Is there an alternative way of thinking about it involving a hidden markov model, looking for a change in value of an unknown latent P(fail)? Or does your approach end up being similar to whatever the appropriate Bayesian approach to the HMM would be?


[flagged]


Please stop spamming HN with AI slop.


My team doesn’t always have cleanly bisectable branches being merged to main; it’s not uncommon to see “fix syntax error” types of commits.

But, to merge we need to have all tests pass. (If tests flakily pass, then we get new flaky tests, yay!)

I know git-bisect doesn’t support this: but could git-bayesect have an option to only consider merge commits? Being able to track a flake change back to an individual PR would be really useful.


You can run bisect with first-parent


That sounds right. `git_bayesect` currently uses `--first-parent`, so I think belden's use case should work, but I haven't tested it much on complicated git histories.


> I know git-bisect doesn’t support this

It does.


Neat! I work on a system with some very similar math, but a slightly different model. I really like how in bayesect making the error rates asymmetric via independent Beta priors on bidirectional errors allows the computations to be nice and symmetric.

I haven't worked these all the way through, but I'm slightly skeptical or at least confused by a few details:

1. Another way to frame P(D|B=b) would be to have the old vs new side draws be beta-binomial distributed, in which case we should then have binomial coefficients for each of the draw side probabilities for the number of possible orderings of the observations. Do they end up cancelling out somewhere? [ed: Oh yes, of course -- D includes that in each case we observe exactly one of the C(n,k) orderings.]

2. I think your expected conditional entropy code is treating the imputed new observations as independent from the past observations, though even if that's the case it may not affect it much in this model. If it does though, it might be worth explicitly unit-testing the naive vs efficient calculations to ensure they match.
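On point 1, a sketch of why the coefficient never shows up, assuming a Beta(α, β) prior on one side's failure rate p, with k failures observed in n trials on that side:

```latex
P(\text{this ordered sequence} \mid \text{side})
  = \int_0^1 p^{k}(1-p)^{n-k}\,
    \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}\,dp
  = \frac{B(\alpha+k,\ \beta+n-k)}{B(\alpha,\beta)}
```

The beta-binomial pmf is this times $\binom{n}{k}$, but D fixes one particular ordering, so the factor is absent; even if it were included, it would multiply every $P(D \mid B=b)$ identically and cancel when normalizing the posterior.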

Anyway, thanks for sharing!


Reasoning effort is denominated in tokens, not time, so no difference beyond slowness at heavy load

(I work at OpenAI)


Yes, uv skipping this step means a significant one-time hit to start-up time. E.g. if you're building a Dockerfile, I'd recommend setting `--compile-bytecode` / `UV_COMPILE_BYTECODE`
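For example, in a Dockerfile (the `uv sync` invocation here is illustrative; adapt to however you install dependencies):

```dockerfile
# Pay the bytecode-compilation cost once, at image build time,
# instead of on every container start.
ENV UV_COMPILE_BYTECODE=1
RUN uv sync --locked
```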


I’ve added ecosystem regression checks to every Python type checker and typeshed via https://github.com/hauntsaninja/mypy_primer. This helped a tonne with preventing unintended or overly burdensome regressions in mypy, so glad to hear upgrades are less of an Event for you


Mentioned this in another comment, but the spec conformance suite is not representative of the things users care about (nor is it meant to be).

The spec mostly concerns itself with the semantics of annotations, not diagnostics or inference. I don't really recommend using it as the basis for choosing a type checker.

(I was on the Python Typing Council and helped put together the spec, the conformance test suite, etc)


Note: while spec conformance is important, I don't recommend using it as the basis for choosing a type checker. It is not representative of the things that most users actually care about (and is not meant to be).

(I was on the Python Typing Council and helped put together the spec, the conformance test suite, etc)


Can you add some examples of the things users care about that aren't well covered by this? I empathize with everyone who wants a feature comparison chart so they can be confident switching without unknowingly losing important safety checks.


The conformance test suite is currently mostly focused on “what does an explicit type annotation mean”

A shared spec for this is important because if you write a Python library, you don’t want to have to write a different set of types for each Python type checker

Here are some things the spec has nothing to say about:

Inference

You don’t want to annotate every expression in your program. Type checkers have a lot of leeway here and this makes a huge difference to what it feels like to use a type checker.

For instance, mypy will complain about the following, but pyright will not (because it infers the types of unannotated collections as having Any):

  x = []
  x.append(1)
  x[0] + "oops"
The spec has nothing to say about this.

Diagnostics

The spec has very little to say about what a type checker should do with all the information it has. Should it complain about unreachable code? Should it complain if you did `if foo` instead of `if foo()` because it’s always true? The line between type checker and linter is murky. Decisions here have nothing to do with “what does this annotation mean”, so are mostly out of scope from the spec.
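A tiny illustration of that `if foo` slip (function name is hypothetical):

```python
def server_is_up() -> bool:
    return False

# Bug: this tests the function object, which is always truthy, instead
# of calling it. A type checker *could* flag the always-true condition;
# the spec doesn't require it to, so tools differ here.
if server_is_up:
    print("looks up (but we never actually checked)")
```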

Configuration

This makes a huge difference when adapting huge codebases to different levels of type checking. Also the defaults really matter, which can be tricky when Python type checkers serve so many different audiences.

Other things the spec doesn’t say anything about: error messages quality, editor integration, speed, long tail of UX issues, implementation of new type system features, plugins, type system extensions or special casing

And then of course there are things we would like to spec but haven’t yet!


> but pyright will not (because it infers the types of unannotated collections as having Any)

This is incorrect. pyright will infer the type of x as list[Unknown].


Unknown has the exact same type system semantics as Any.

Unknown is a pyright specific term for inferred Any that is used as the basis for enabling additional diagnostics prohibiting gradual typing.

Notably, this is quite different from TypeScript’s unknown, which is type safe.


This was confusing me, thanks.


In case you’re not well versed in Python typecheckers, in the mypy vs Pyright example, Pyright can be configured to complain about not annotating the collection (and so both typecheckers will yell at the code as written).

TypeScript takes the same approach in this scenario, and I assume this helps both be fast.


They were "on the Python Typing Council and helped put together the spec, the conformance test suite, etc" so I assume they are an expert on Python typecheckers


I didn’t write it for parent lol. I guess I should be more careful with “you”.


TypeScript will use flow typing to determine the type as number[] in this code:

    const x = []
    x.push(1)
    type t = typeof x // number[]


I think the idea is not that there are features that aren’t listed, but rather that if a typechecker supports 10 features people care about and is missing 10 that people don’t really use, it will look a lot worse on a list like this than a typechecker with 100% compliance, when in practice it may not really be worse at all.

Edit: Based on this other comment, the point was also about things not covered by the spec. “The spec mostly concerns itself with the semantics of annotations, not diagnostics or inference.” https://news.ycombinator.com/item?id=46296360


The chart does not describe speed (either in general or in any particular case). Speed/performance/latency is a thing users care about that is not included in the feature list.


Yea that one is fine and well covered in the blog post, and pretty easy to spot in light testing. I'm much more worried about the ones that are harder to spot until you have a false negative that turns into a real bug which would be caught by 1 tool and not another.


I can't recall a single time that type-checker speed was the limiting factor for me.


I can't say I've been bottlenecked on it, but I've certainly been bothered by it.

