Random thought. Another commenter worried about the runtime of the program becoming mangled and performing destructive operations on your machine. What if you run the reducer as a source-to-source Nix derivation? Protects against dangerous things, and can be easily distributed to remote builders.
Oh, that’s a neat idea. To reply to the sibling comment: Nix runs commands under a sandbox. It’s not bullet proof, but useful enough that it will prevent obvious commands like deleting local files and accessing the network.
You generally shouldn’t use absolute paths in your programs. That makes it dangerous for local development, and makes running two copies of the program to compare behaviors difficult.
There’s a Venn diagram of people who don’t care enough to do that (works for me!) and people who would never think to use c-reduce. It’s not a perfect circle, but it’ll be fairly close.
> You generally shouldn’t use absolute paths in your programs.
This is true, but I was enjoying the irony that there is an old sys-sdmin adage that you should only use absolute paths in your program(usually a shell script, in this environment) this is to make sure it is running exactly what you expect it to run.
So always put "/usr/bin/awk" instead of just "awk"
I had a co-worker once who took this as gospel. His scripts were always... interesting... to port to a new environment.
It must be that by running any program within the nix build sandbox you don't expose your files unless you discover a privilege escalation attack by chance during the reduction process.
I think it might do a little but not a lot better if I left it to run for longer but it looked to have mostly stalled then so I got bored and killed it.
I actually tried to Shrinkray it myself, but after ~3 minutes it crashed with the following stack trace:
(4 nested BaseExceptionGroup: Exceptions from Trio nursery (1 sub-exception), and then:
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/nix/store/5cdw4sa0n0lwym47rd99a1q6b4dj7nr9-python3.12-shrinkray-0.0.0/lib/python3.12/site-packages/shrinkray/work.py", line 221, in parallel_map
| yield receive_out_values
| File "/nix/store/5cdw4sa0n0lwym47rd99a1q6b4dj7nr9-python3.12-shrinkray-0.0.0/lib/python3.12/site-packages/shrinkray/work.py", line 71, in map
| yield v
| GeneratorExit
+------------------------------------
Not sure if this is a known issue? Note that I ended up not respecting all of the version requirements in the pyproject.toml since I tried to run it all in Nix and there the Poetry support is currently quite lacking.
Note that as recommended by John Regehr, author of C-Reduce, for this use case you might also want to try Shrinkray, a tool that was written to be format independent and works well for cases that C-Reduce dow not: https://mastodon.social/@regehr/113489759789563570
I accidentally got obsessed with test-case reduction as a result of writing Hypothesis, and wrote shrinkray because I thought it was ridiculous that I hadn't put all of the things I'd learned about test-case reduction into some general purpose tool that other people could benefit from.
Shrinkray is still a bit under-advertised and rough around the edges in places, but I think that basically there's no good reason to use C-reduce over shrinkray for things that aren't C or C++. Shrinkray has very good generic heuristics that should work in most languages (it even works pretty well on binary files apparently, although I've not tested this myself), a couple of language specific passes for other languages, and is much more fun to watch.
It might even be the case that you should use shrinkray for C and C++ too (because I use C-Reduce's transformations for those languages), but I'm less confident that that's true. The best I'll confidently claim is "probably more competitive than you'd expect".
Yes, hypothesis also does minimization of failed test cases, so it's kind of a similar problem, just a question if what format the data is in and how you invoke your test.
I read this paper and I still feel lost as to how can this even be possible. It seems to understand how to tokenize, merge lines, remove tokens etc for arbitrary programming languages. Is there another paper that explains this algorithm alone?
I would guess `pass_lines` is the most important for non-C code; I'm guessing (it's written in unreadable Perl) that it removes lines.
So while it can work for languages other than C, most of its features are C-specific so it's not going to work nearly as well. Still I'd never heard of C-Reduce; pretty cool tool!
unreadable perl?! I clicked on the link expecting something super-terse and unstructured, but it's anything but. What the hell is wrong with people casually dropping such remarks.
I was wondering the same thing but I guess the key is probably that you don't actually have to do it correctly every time. Just tokenizing based on a few common characteristics (brace pairing, quotes, indentation, newlines, etc.) should let you trim a ton without knowing anything about the language, I imagine?
My real worry is what if this ends up running dangerous code. Like what if you have a disabled line that writes instead of reading, that randomly gets reactivated?
> Just tokenizing based on a few common characteristics (brace pairing, quotes, indentation, newlines, etc.) should let you trim a ton without knowing anything about the language, I imagine?
Yep, here's an explanation from a related tool, which was spawned by the parser-parser-combinator[0] approach to syntactical analysis and transformation: https://comby.dev/blog/2021/03/26/comby-reducer - which is based on what you've said.
It's intended to run for producing compiler test cases, so there shouldn't be any code that's actually running.
CPython includes a flag to only run parsing/compiling to bytecode. While you can use it like they did here and run the code - it really depends on how much you trust every possible subset of your code
Doing this manually was part of my first job out of college on a C/C++ compiler team. Kind of amazing that there is automation that can accomplish the same thing!
Hypothesis in particular has a very clever way to do test reduction. The key problem is that if your test generator is enforcing some constraint on the input, it's not necessarily the case that naive pruning will preserve the constraint. Hypothesis has a fairly general way of enforcing that the generated test input continues to satisfy constraints.
Yes! The actual commands that you have to search up to get it to run automatically without user input, take almost as long to put together as finding the bug that you're finding (in the 80% case, of course there are 20% cases that take forever) but there is something so satisfying about seeing the thing just humming along once you have it set, that just makes it so satisfying.
> Yes! The actual commands that you have to search up to get it to run automatically without user input, take almost as long to put together as finding the bug that you're finding
Funny, I just wrote the exact same thing only to scroll down and see you had the same opinion:
One nice thing is that, once it's in your shell history, doing it again is really easy. So on a big codebase I was working on, it really did help multiple times and saved more than the initial investment.
The frustrating thing is, because it's in your shell history, you have to look it up again after you change jobs...
> I did not know about C-Reduce until just now, and I'm already hooked. This is basically like discovering git bisect for the first time.
I've known about git bisect for maybe a decade, and used it maybe... twice so far? And I think at least one of those times it took me longer to understand the command line options than it took to just do it manually.
YMMV, maybe it's more useful in a larger collaborative codebase, but to me it sounds life-changing in theory but I haven't found it that in practice.
At a previous job, with a huge C++ codebase and ~100 developers over 20 years, many platforms supported, a build could take hours, and the test suite took so long to run that it would take several days or weeks of real time to get through it all.
This cycle time combined with occasional unexpected interactions between components meant that in every release cycle, there were dozens of complicated failing tests where it was not obvious which code change was responsible.
`bisect` here was extremely helpful: instead of having to pore over commit history and think very hard, I could bisect with a small wrapper script that would build and run the failing test in question. Builds still took hours, but I could usually autonatically pinpoint the responsible code change for one of the tests overnight.
(This was not using Git, but Perforce, for which I had to write `p4-bisect`. Alas, it's not open-source...)
> in every release cycle, there were dozens of complicated failing tests
Sorry if this is a dumb question, perhaps I'm not understanding what you mean by every release cycle, but does this mean nobody in the team/company ran tests until it was time for release? That sounds like a really big everlasting wound to put a tiny git-bisect bandage on!
About a dozen OSes / configurations were supported, and the entire test suite would take a few days to run on each such configuration.
This was native desktop+server software, not a hosted SaaS thing.
Major releases were put out every few months.
Developers did run tests regularly with every change they would make, but it was infeasible to run all the tests in every configuration for each change. So they would try to choose tests to run that seemed most relevant to the code they had changed.
The entire test suite would run more or less constantly on shared machines, and every few days some new tricky failure would be detected. The tricky failures were almost always the result of some unanticipated interaction of features, frequently on the more obscure configurations (like IBM z/OS).
The problem was not that developers were not testing, but that the tests were infeasible to run every time. So instead, testing became an optimization problem.
> does this mean nobody in the team/company ran tests until it was time for release?
The OP wrote “many platforms supported, a build could take hours, and the test suite took so long to run that it would take several days or weeks of real time to get through it all”
⇒ I think they did run tests, but typically not all of them on all platforms.
It's pretty incredible if I'm trying to find a breakage in a repo I don't understand. Build's failing? Easy enough:
git bisect start master v1.1 # Bisect from v1.1 to current master
git bisect run cmake --build .build # Find the first commit where `cmake --build` returns non-zero.
No need to bother trying to understand the error message, I can go make coffee instead and then just look at the commit that broke it. Also, this is really useful for and in conjunction with Nixpkgs.
It's not as useful for personal projects because chances are it won't tell you anything you don't already know.
> It's pretty incredible if I'm trying to find a breakage in a repo I don't understand.
> Build's failing? Easy enough:
> It's not as useful for personal projects
How often do you run into this situation, though? For your scenario to arise, you need:
(1) A repo you're unfamiliar with (which is already unusual for most people... most people spend most of their time working on repositories they've become familiar with)
(2) The unfamiliar repo to use standard build commands you already know, which is basically never the case in my experience (even cmake-based projects almost always need variables defined with custom names I don't know beforehand, and the configuration process is so onerous I find the GUI easier)
(3) An obscure enough breakage where the output is complete garbage (otherwise the compile error would just tell you the file and line, and git blame would tell you the commit directly)
(4) Knowledge that this unfamiliar repo actually would have actually built on your machine at some point (which is often not the case because cmake isn't hermetic and breakages are often due to having the wrong toolchain installed)
(5) For the problem to be a mere build error, as opposed to some running of the program that requires investing time to automate
(6) For you to not have any sort of prior CI or other form of validation on your dependencies, because you apparently didn't catch even catch a build error when it was introduced (never mind automated testing)
I could go on, but these set of circumstances are more or less indistinguishable from ~never to me.
Well, the conditions were satisfied with pipewire, that introduced some dependency at some point that were not satisfiable in Debian Stable.
So, for now, I'm stuck with the version one commit before that dependency :). (I haven't tried undoing the change later, I suspect it might be dependend upon later on..)
> (1) A repo you're unfamiliar with (which is already unusual for most people... most people spend most of their time working on repositories they've become familiar with)
I don't understand a ton of things. I've bisected Linux, Chromium, Wine, Nixpkgs, and many more. These are projects whose codebases I was never going to understand, the codebases are too big for one person to ever fully grasp. Then there's smaller codebases where I could if I wanted to, but if I don't have to, why bother?
> (2) The unfamiliar repo to use standard build commands you already know, which is basically never the case in my experience (even cmake-based projects almost always need variables defined with custom names I don't know beforehand, and the configuration process is so onerous I find the GUI easier)
In real life I make great use of Nixpkgs understandings of how to build things. Nix development shells take care of the dependencies and the build commands. However, I do have confidence that I can figure out the actual desired build commands of most random repositories in under five minutes. This applies to private repos at work and random open source projects. Even for autotools projects. Usually, not that much hoopla is needed, most CMake projects don't need custom cache variables to make a basic functional build.
> (3) An obscure enough breakage where the output is complete garbage (otherwise the compile error would just tell you the file and line, and git blame would tell you the commit directly)
Nah, not necessarily, really just any breakage where you're not sure what happened. Compiler errors are not a great use case for git bisect, but test failures are a really good use case. Likewise, git bisect is great for crashes. Even if you can't automate the full process easily, for e.g. a GUI program, it's still a lot less work to periodically repeat some steps to crash or cleanly exit a GUI program to help the bisect along. Still, some compiler errors are great for Git bisect. Basically any weird linker error is a good candidate.
> (4) Knowledge that this unfamiliar repo actually would have actually built on your machine at some point (which is often not the case because cmake isn't hermetic and breakages are often due to having the wrong toolchain installed)
Or just test it.
> cmake isn't hermetic
Nix is, so no problem!
> (5) For the problem to be a mere build error, as opposed to some running of the program that requires investing time to automate
Not true. It's harder to automate GUI failures, but that does not mean bisecting isn't incredibly valuable for GUI apps. Remember, in a bisect where you never have to skip a commit (e.g. intermediate builds rarely fail for unrelated reasons, a very common case surprisingly) bisect only ever takes log2[n] steps, so even if you are in a repo that has a very huge commit volume, the bisect will be at most a handful of steps. Repeating the same GUI action like 10 times is hardly a big chore, especially for a machine to magically crank out the exact origin of where the thing broke. For CLI apps, it's even easier since you can use bisect run still.
> (6) For you to not have any sort of prior CI or other form of validation on your dependencies, because you apparently didn't catch even catch a build error when it was introduced (never mind automated testing)
Not true. Everyone uses CI, but CI will never catch 100% of regressions. There's a combinatorial explosion of potential build environments and there's no way a CI environment can test every possibility. Can't test with every library version, every release of every compiler, every combination of using vendored library vs system library or so on.
> I could go on, but these set of circumstances are more or less indistinguishable from ~never to me.
For me it happens more often touching open source things than at work, but really the probability it will be useful increases as the size and inscrutability of a project increases. For trying to debug Chromium or Electron issues, bisect reigns supreme. (My current workplace doesn't ship software built on top of Chromium, but replace Chromium with any other big and inscrutable dependency and you get the idea.)
I suspect it may be the case that I bother to root cause and debug things that other people would simply completely write off as impossible or not worth the debugging, thanks to the power of Git bisect.
Just to clarify: I wasn't disagreeing with the use cases you described in this comment (which are certainly valid, just not common for most people I think), but rather the use case in your previous comment. The scenarios are basically the opposite of each other, hence my replies above. For example:
>> Build's failing? git bisect run cmake --build .build
> Compiler errors are not a great use case for git bisect, but test failures are a really good use case
These aren't build failures.
> For trying to debug Chromium or Electron issues, bisect reigns supreme.
These are neither a simple CMake! Even their build process changes over time.
(Side note, I'm surprised that for a massive third party dependency that you don't understand, the specific commit is what you really want. I would've thought you'd just go to the previous version, which is presumably more stable and supported too.)
>> No need to bother trying to understand the error message, I can go make coffee instead and then just look at the commit that broke it.
> Not true. It's harder to automate GUI failures, but that does not mean bisecting isn't incredibly valuable for GUI apps.
I'm not sure how we got into discussing GUIs, but if you're debugging a dependency that's a GUI, then having to bisect it (as useful as that is) is very much the opposite of "no need to understand it, just go grab a coffee"!
All of which is to say: I'm not claiming git bisect is useless by any means. It's nice that it's there and you can use it. I'm just saying that it's often (often != always) a small difference in cost vs. bisecting manually.
While they are not compilation failures, I do sort-of consider unit test failures to be build failures if the unit tests are typically ran as part of the build. But either way, I do still use bisect to find the origin of weird compilation errors, FWIW. As an example, I used it on Envoy proxy to figure out a really strange linker error on Darwin. It's just not great for e.g. a syntax error, because compiler diagnostics will give a more obvious answer most of the time. I do agree with that.
> These are neither a simple CMake! Even their build process changes over time.
Chromium and Electron are both fairly easy builds. Not because they are simple to build, but rather because the build process is extremely automated and documented. (The real hard part is balancing concurrency with OOM: Electron defaults to -j200 and on my 7950X3D it will happily eat all of the 128 GiB of RAM. Keeping modern CPU cores fed is hard work...)
> (Side note, I'm surprised that for a massive third party dependency that you don't understand, the specific commit is what you really want. I would've thought you'd just go to the previous version, which is presumably more stable and supported too.)
Well, I want the bug to be fixed. Even if I am not going to submit the actual patch that fixes it, providing bisect details in my bug report will vastly increase the odds that it gets fixed. However, with the magic of bisect I often can make working patches for complex bugs and then patch them. If it's software I use on my system, I can then add a temporary patch in my Nix configuration and continue to run the latest version. I don't always do this, but it is very useful. I don't like to rely on hoping someone else will figure out my bugs if I can help it.
I am also a package maintainer in some cases. If I run into a bug on something I maintain a package for, I will often send a PR fixing it and then pull that patch into my package temporarily.
> I'm not sure how we got into discussing GUIs, but if you're debugging a dependency that's a GUI, then having to bisect it (as useful as that is) is very much the opposite of "no need to understand it, just go grab a coffee"!
I'm not sure what you mean, I'm just saying that git bisect works even for complicated runtime issues. There's no need to debug the GUI: all you have to do is tell git bisect if the bug is still there in whatever commit it dumps you in. You do it a few times and you have the commit that broke things. It's magic every time.
As an example of a GUI failure I've "debugged" with this, I found a regression in Wine using this technique.
> All of which is to say: I'm not claiming git bisect is useless by any means. It's nice that it's there and you can use it. I'm just saying that it's often (often != always) a small difference in cost vs. bisecting manually.
Bisecting manually? I don't understand. Do you mean still using git bisect without using the run command? I do that when debugging more complex failures so I can skip commits broken for unrelated reasons. Or, do you mean literally manually picking commits to test and bisecting like that? If so, I'm confused. Starting a git bisect is a single command where you pass it the bad and good commit. The rest of the commands don't even have to be memorized because it tells you what to run (not that it's hard since its just git bisect good, git bisect bad, git bisect skip...) I cannot fathom it being less work to do this manually, especially since git bisect handles complex situations where there are diverging merges to explore.
Even at Google, where the CL numbers were pretty much linear, we still had a bunch of tooling around bisecting google3 to find regressions. I'm a little surprised it's any question that it's worth it, you can bisect 100,000 commits in ~17 steps without sitting there trying to figure out which commits you need to pick. That's incredible value and it has helped me root cause bugs in all kinds of stuff, like pure magic.
Anyway, I'm not trying to be condescending, but as you may have gathered I actually bisect fairly frequently and I also am very sure it is saving me time every time I do it. Especially since I reckon more than half the time I don't really have to do anything, so even when the build can take a long time it's still not a problem.
I should perhaps clarify, this whole discussion I'm referring to the usefulness of the git bisect command specifically, not bisection as a technique. So these weren't in any way intended to be a comment on e.g. tooling you may have had at Google. I bisect things all the time -- across files, across lines, and across time. It's just ~never via git-bisect.
Please also note I have no doubt that your workflow gets immense value from git-bisect! Almost every tool is incredibly valuable to certain subset of users and workflows. What I've been debating is just how useful I see it being to the most typical users. I know I haven't actually come across anyone using git-bisect for several years now, and I myself haven't either. Bisection as a technique itself is incredibly useful for everybody -- it's just not something I see done via git-bisect with any frequency, is all.
> Bisecting manually? do you mean literally manually picking commits to test and bisecting like that? If so, I'm confused
That's what I mean, yeah.
> The rest of the commands don't even have to be memorized because it tells you what to run (not that it's hard since its just git bisect good, git bisect bad, git bisect skip...) I cannot fathom it being less work to do this manually
While I didn't actually intend to say it's faster to avoid git-bisect -- just that it's a similar amount of effort -- now that you mention it, I actually do think it is often faster for me. There are multiple reasons why:
- Despite it taking a few more iterations in the worst case in theory, manually picking my own commits lets me split based on e.g. version numbers, rather than random commits. I find it much, much more useful to know that a problem came up between v4.3.1 and v4.4, rather than between commit 1133728 and commit 1133262, say. And I know -- and this is key -- that those commits will be well-tested, stable, and released to the public. So I don't have to worry about encountering random bugs in the middle, even unrelated ones. By contrast, "git bisect skip" might seem easy to type in 3 seconds, but if you encounter a single failed build, then you lose all that time you "saved", and then an enormous amount on top of that, merely by virtue of having to build & testing those failed iterations. I really, really wouldn't want to do a full build of Chromium -- even an "incremental" one -- just to encounter a bug in it and have to skip to another commit!
- I actually almost always have some idea of where the bug is, merely based on the tags/versions, or commit messages, or just the timestamps. Sometimes, literally searching for a keyword pops up a relevant commit for me to test, and then doing an incremental build with the commit immediately after is much faster (almost free) compared to doing what ends up being a ~full rebuild for a distant commit. So it actually does end up being far fewer iterations and less time in some cases.
- Git bisect is stateful. It means your repo is no longer in a normal state, it's in the middle of a bisect! That's annoying to say the least. I've gotten burned way too many times by running the wrong command or forgetting that I'm in the middle of an interactive merge or rebase. I don't want to have to think about an additional state I could be in when I leave my work and come back to it. It's a huge cognitive burden, and I would really rather avoid unless it's really clearly going to pay for itself.
Again, I'm not saying these trade-offs apply the same way to you. You encounter scenarios that I clearly don't :-) I'm just explaining where I'm coming from, and (what I believe are) the reasons I don't really see other people using git-bisect frequently.
> especially since git bisect handles complex situations where there are diverging merges to explore.
Diverging merges is a super interesting case, I've never needed to deal with that with bisection before (whether manual or automatic). If your history is nonlinear then I definitely see a much larger value in git-bisect. Though at the same time the result isn't even necessarily a single commit, so I'm not even sure what exactly I would expect it to do in that case!
I think we're trying to solve different problems: I don't care what version the problem was introduced in, I want to find what broke it. It's not about the commit itself, it's about the diff.
Maybe more importantly, when the thing I am bisecting is Nixpkgs itself, there are no version numbers to go off of. Same with large monorepos like google3: There simply are no version numbers.
When there are no version numbers, there is nothing better to do than what git bisect does.
When there are version numbers, if I want to know which versions the problem was introduced in, that's really not a problem either: Git has `git tag --contains`, it's relatively easy to determine the first version to contain the problem commit as an ancestor. Though, I'm not really sure why that information is particularly useful.
> Git bisect is stateful. It means your repo is no longer in a normal state, it's in the middle of a bisect! That's annoying to say the least. I've gotten burned way too many times by running the wrong command or forgetting that I'm in the middle of an interactive merge or rebase. I don't want to have to think about an additional state I could be in when I leave my work and come back to it. It's a huge cognitive burden, and I would really rather avoid unless it's really clearly going to pay for itself.
If you find this to be a particular problem for you, I would recommend adopting Git worktrees. That way, you can create a worktree for your bisect, making it a lot harder to accidentally mix peanut butter and toothpaste.
All of Git is stateful and obviously there's nothing special about bisect, so it's really just the same burden. That said, `git status` will tell you if you're in middle of a bisect. Generally it's a good idea to check your status before doing some work just to double check that you're not working on the wrong branch: I personally also have git status in my shell prompt, which will also tell me if I am in middle of a bisect. And if you do happen to fuck up and do some work during bisect it is not a huge deal: You can do a `git bisect reset HEAD` to exit the bisect without messing with your working tree, and if you want to jump back into it, that's not too big of a deal either: you can take a look at the bisect status with `git bisect visualize` and then all you need to do to be able to jump back in later is set the good/bad terms to match the bounds you had last time.
I'm not defending Git, it's a pretty poorly designed program. It's confusing that `git checkout` does 3 different things. The working directory vs staging area vs HEAD vs current branch is annoying. The workflows are often unnecessarily confusing and there are obvious features that I wish Git had natively that it doesn't, like an interactive TUI for rebasing (The chistedit extension in Mercurial is awesome, for example). But, it seems like you are running into issues where it comes down to having trouble dealing with Git rather than anything specific to `git bisect`. If you are afraid to use features because you will have to spend a lot of time digging yourself out of messes, it might be worth some time to look at ways to improve the workflow (or learn new tools for getting out of messes. An all-time favorite tool of mine here is definitely the reflog, though I find myself needing it very seldom these days.)
Git bisect itself is very nearly one of the easiest commands in Git that I can think of: there's very few commands to remember and Git does most of the work.
> Diverging merges is a super interesting case, I've never needed to deal with that with bisection before (whether manual or automatic). If your history is nonlinear then I definitely see a much larger value in git-bisect. Though at the same time the result isn't even necessarily a single commit, so I'm not even sure what exactly I would expect it to do in that case!
git bisect does a good job dealing with this case: it tries to figure out what parent of the merge to explore by trying each of them, then it continues the bisect running down that lineage. Of course, git bisect can't tell you if there are potentially multiple commits that independently introduce the problem, but it can very reliably tell you one of the commits that introduced the problem, which is all I care about, since what I want to do is discover more about the bug and how to fix it.
> When there are version numbers, if I want to know which versions the problem was introduced in, that's really not a problem either: Git has `git tag --contains`, it's relatively easy to determine the first version to contain the problem commit as an ancestor. Though, I'm not really sure why that information is particularly useful.
The versions are important because they're publicly-released, stable builds (generally), rather than random and potentially poorly-tested unstable commits. They're much friendlier too, if you're trying to tell a user to revert to a previous version. If you're trying to create a patch, that's obviously useless. But if you're trying to figure out which version of the software you can use, that's obviously useful.
> I think we're trying to solve different problems: I don't care what version the problem was introduced in, I want to find what broke it. It's not about the commit itself, it's about the diff.
Yup, exactly.
> If you find this to be a particular problem for you, I would recommend adopting Git worktrees.
Oh I'm familiar with them. I don't find them particularly attractive for the typical "manual" bisects I do. Using a different path means I have to set up a totally different build just to do the bisect -- that's at least one extra iteration I have to go through (+ space I have to waste). Again, not saying it's never useful, but I really am not keen to extra iterations if I feel like I can avoid it.
> All of Git is stateful and obviously there's nothing special about bisect, so it's really just the same burden.
Actually funny enough, not quite: there are two fundamental differences for git-bisect:
1. For most common operations, I usually use a GUI for git. This includes rebase, cherry-picks, and merges. The dialogs are modal and stack on top of each other by nature, so it's pretty darn to forget you were in the middle of one of these operations and accidentally start another one -- they stay on the screen till your done! It's the command-line where you're back in a normal shell and can easily forget you were in a weird state, thus forcing you to check your status periodically.
2. Git bisect is by far the longest-running stateful operation. Rebasing and merging just involve modifying source files at every iteration. Bisect requires building + running tests too. That takes orders of magnitude longer. By the time you come back to it for the next iteration, a long time will have passed, and it's much easier to lose context then.
> I'm not defending Git, it's a pretty poorly designed program. It's confusing that `git checkout` does 3 different things.
I've heard this a lot but funny enough git checkout is the least of my problems with it. The syntax is second-nature to me now. Having to remember what state my repo was in, or constantly check it? That's another beast.
> But, it seems like you are running into issues where it comes down to having trouble dealing with Git rather than anything specific to `git bisect`.
Nope, it's really git bisect (see above). The only thing it's doing is saving me the hassle of opening the commit log and picking my own commit. That's very little gain -- on the order of seconds/minutes -- at an often larger cost (more iterations etc. as explained previously).
P.S., when you're near the end of a bisect (i.e. all the candidate commits are a few away from each other), it's very helpful to be able to see the messages & files each commit touched -- which I can do easily when I'm looking at a log. That way you can make educated guesses as to the root cause and save a couple iterations. Using git-bisect is just blindfolding yourself to useful information. Again: guaranteed small win, at the cost of a not-so-infrequent large loss.
To be honest with you, I re-read this multiple times and I still can't fathom how doing this manually could ever be better than having it done automatically. Even if I had to bisect one million commits, it would be at most about 21 iterations. Cutting down an iteration or two near the end just doesn't seem very useful to me in exchange for having to manually pick commits to bisect with.
Also, nothing in particular stops me from viewing the log. In fact, the `git bisect visualize` command is for the exact thing you are describing. You can easily narrow your bisect window at any time if you have a hunch. I have done this and I'm pretty sure the manual even describes doing this.
I also haven't even mentioned some of the more useful features about git bisect. For example, often times, even if you don't know exactly what the problem is, you might be able to narrow down with good certainty what set of files would've had to change to cause it. `git bisect start` accepts pathspecs after everything else. If you tell it that, it will only select commits where files in that pathspec have changed. You could of course manually find commits of interest and then divide them in half and so forth, but that really doesn't feel like it scales very well.
And if all you care about is what release version something broke in, there's really no particular reason why an automated tool couldn't do that, too. I suspect the reason why `git bisect` doesn't have a "tags only" mode is because it's not what bisect is really for, though it would probably be a fairly trivial addition.
I also personally never loved Git GUIs. Being able to visualize the state of the working directory/repository feels very nice, but in practice it doesn't really solve a problem for me. In practice, I always come away feeling like Git GUI tools create as many problems as they solve. The only exception is that I vastly prefer JetBrain's 3-way-merge tool over almost any other merge tool. (Runner's up would probably be kdiff3, I guess. Not a huge fan of vimdiff.)
I'll just reply to this bit in the interest of wrapping up the discussion, but I really appreciate all the replies and discussion :)
> git bisect visualize
This is the first time I'm hearing about this, it sounds useful! I've never seen anyone use it before, so it must've slipped past me if I've ever even encountered it in the manual. Thanks for sharing it!
Bisect is a type of tool that you hope you won't need, but when you need it it's a life saver. You're right about the larger collaborative codebase. Imagine trying to find a strange bug that got introduced sometime in the last 12 months in a huge and active repo.
I released it as Open Source because Microsoft Research sent someone to my office to ask me to, back when Microsoft was calling Open Source a "cancer".
The LLVM introduction by Latner refers to the "standard delta debugging tool", so it is rather well-known: https://aosabook.org/en/v1/llvm.html 'unlike the standard "delta" command line tool.'
For other readers' benefit: C-Reduce is a little more sophisticated than plain delta-debugging. From the abstract of Test-Case Reduction for C Compiler Bugs (2012):
> [...] [C-Reduce] produces outputs that are, on average, more than 25 times smaller than those produced by our other reducers or by the existing reducer that is most commonly used by compiler developers. We conclude that effective program reduction requires more than straightforward delta debugging.
(Of course, this means that C-Reduce is 12 years old now.)
At the same time, C-Reduce seems to be more general than the LLVM tool you linked ("BugPoint", dating to 2002), since that one works with LLVM IR specifically.
I think most developers are generally unfamiliar with automatic test case minimization tools and techniques, so the post may be helpful even if the ideas have been known in their respective circles for quite some time.
But still having trouble understanding how it knows WHAT to remove when trying each iteration. There must be some tokenization going on, but then I don't know how that would work across programming languages
I used a test script that spent hours using CSmith to generate random test programs when developing an esoteric llvm target. When they crashed it automatically Creduced and left them for inspection. Invaluable!
Without an explanation of why it works on languages other than C, that is a hard claim to believe! I mean, I don't think they're lying but (given it doesn't use an LLM) I am bewildered.
So the short answer is that some of the reduction passes are pretty generalizable to C-family languages, and these can be some of the most effective passes.
Some of those passes are:
* Tokenize the input according to C, and randomly chuck away chunks of length 1-(I think) 13. Most languages vaguely similar to C have very similar tokenization rules to C, so these kinds of passes are going to have similar effects, and this will tend to be effective in doing things like omitting needless qualifiers or attributes.
* Balanced parentheses, and chuck away units of balanced parentheses. This includes (), {}, and []--and for almost any language, this can be a useful level of stuff to throw away or reduce.
* Stripping comments and whitespace. Again, many languages use the /* and // styles of comments that C does, so this is pretty effective.
There's actually relatively few passes that are incredibly C/C++-specific, and in my experience, one of the biggest issues with creduce is that it is absolutely terrible at trying to detemplatize the code as a reduction step, which should be a relatively easily automatible step.
I recommend skimming the PLDI paper linked by asmeurer, it has a good summary. Some of its transforms are pretty C-specific, using the Clang frontend; some are pretty generic and probably work on any Algol-descended language. It's meant to be a modular tool, so you could add transforms that understand other languages if you like.
> I mean, I don't think they're lying but (given it doesn't use an LLM) I am bewildered.
I’m pretty sure (based on the last time I saw this) that this is just good old fashioned computer science.[1]
I really hope that HN hasn’t forgotten about good old fashioned computer science stuff already.
[1] All the algorithm and whatnot stuff, not the spooky machine learning stuff. Even things like Prolog-for-AI, although that has the slight downside of not working (for the purposes of making AI).
To be clear my comment was meant to be an awareness that it is good old fashioned computer science. Without LLMs, which this predates, it is surprising to me that you'd have a lot of success reducing a program in an arbitrary language and still having something that's valid syntax!
I do get that it will reject a lot of stuff as not working (and has to even in the target language) and I also get that algol-like languages are all very similar, but I am still surprised that it works well enough to function on ~arbitrary languages.
These are very LLM-era properties for a program to have. The question is not "does it work for language x" but "how well does it work for language x", and the answer is not "is it one of the languages it was designed for" but instead "idunno try it out and see".
The fascinating part isn't that it can modify code accordingly to some rules - it's that it can modify code from _arbitrary_ languages without having a grammar or semantic analysis for them.
Yes, and without understanding how it works, I'm left wondering whether it's even safe to use this way. Will creduce execute mutated versions of the input script, potentially deleting my files or eating my lunch?
C-reduce is meant to... Reduce your files, it would not add stuff that was not there in the first place. Also, I think it's meant to only be run against the "frontend" of most languages, not full execution
While I've used c-reduce before, I've never done it in a way where it could be destructive. However speculating based on what I do know. I think the 2 things I would do would be in the interesting-ness test, grep for already known harmful instructions and force them to be uninteresting (returning 1). And then if I was still unsure of the potential harm of the program and I had to run it to determine if it's interesting or not (segfault or something similar). I think I would hoist the binary/script into a docker container and run it there. That way the log and result can still be checked, but it's access to actual file system is minimized as much as possible.
TLDR; C-Reduce just gives you text to run/compile, if you're worried about things like that sandbox as much as possible.
Isn't that potentially unsafe though? Random permutations can reduce arbitrary programs to destructive programs, in particular any program can be mutated to `rm -rf /` given enough time. Also even the problem of finding syntactically valid programs is combinatoric, it's pretty surprising that it can go toward a local optima in such a short time without having any understanding of the function or its derivatives.
For compiled languages it should be fine, as you're only going to compile the permuted source code, not execute it.
Given a sufficiently bad compiler bug anything is possible, but I think given that it's trying to minimize the size of an input that gives a particular output I don't think it's likely to explore distant branches.
I don't think that should trigger a reasonable interesting condition one has specified, so that should not happen. So I suppose for non-compiler-bugs you need to check (from the output) that the correct thing is being done, in addition to detecting the actual issue.
I was not disagreeing with that, merely pointing out that the reproduction test shouldn't be passing with the minimal runnable C program, or indeed for many many previous iterations before that. In my mind also runnable programs would be tested with grep, instead of relying on the return code of the program.
That depends completely on the interesting-ness test that's provided. If you're looking for a compiler bug (like I do often for my language), then the interesting-ness test checks the compile logs for information like the "Segmentation Fault" text, no need to run the actual executable. You could also hoist everything into docker if you really want to, or ship it off to another device to run.
If your compiler bug is "the compiler crashes when compiling this code" then you just need to try to compile it. Or if it's "it takes 15 minutes to compile this file" then you just need to check if it compiles in a reasonable/expected amount of time.
If the bug is "the compiler generates an if/else where neither branch is hit", then you would need to execute the code. However, you would likely be able to directly reduce such a bug yourself. When the compiler is generating bad code you can usually just reduce it to that section of code directly. It seems like C-Reduce must be for compiler bugs where it's not generating any code at all (crashing) or the issue isn't with the code generation (extremely slow compiles).
> Given a sufficiently bad compiler bug anything is possible, ...
I can't help but wonder about the consteval/constexpr machinery in the various C++ compilers... It'd be interesting to see how much adversarial input has been tested for those.
(I guess Zig's comptime might also be relevant, but I'm not sure what that's allowed to do. Maybe execute arbitrary code?)
It seems pretty unlikely to mutate in a malicious direction, random chance probably wouldn't get there, and there doesn't seem like any reason it would be guided in that direction.
"Enough time" does a lot of work here and warrants further analysis. With enough time it might produce the works of shakespare (if you ignore its designed to minimize), but we should probably view anything that would take more than a year as too much times.
I think the easiest way to think about it is that it's gradient descent, it's just a slightly weird form of it over a discrete space.
If you imagine you've got some neighbourhood function N(x) that takes a program and returns a large sample of valid programs that are "similar" to x (its neighbourhood), then your basic test-case reduction algorithm is just:
1. Start with some initial test case.
2. Look at the neighbourhood of your current test case. If any elements of it are both interesting and smaller than your current test case, move to one of those. If not, stop.
3. Return the smallest interesting test case you've seen at the end.
And most of what varies from test-case reducer to test-case reducer is that neighbourhood function (how you explore the neighbourhood also varies, but you can think of that as an implementation detail that mostly only affects performance).
This approach will generally work pretty well because the sort of bugs you tend to find are "dense in program space". This will e.g. do nothing if you start from the property that the file has a particular sha1 hash, but if you start from a property like "this uses this particular pair of language features", most nearby programs will share it.
The nice thing about doing test-case reduction is that it doesn't actually matter if your neighbourhood contains only valid programs, because you can rely on the interestingness test to filter out any invalid program. So all you care about really is:
1. It contains a fairly large set of valid programs among the test cases you consider.
2. It's not too large so as to be prohibitive to explore.
And this frees you up to just try a bunch of heuristics that are likely to work pretty well, and it turns out that there are a lot of easily accessible ones. In particular, deleting contiguous chunks of a program pretty often produces a valid program, which will always be shorter.
For a trivial example, imagine you've got something like:
if True:
blah()
in a Python program. It might look like you need to know about Python syntax to reduce this to just `blah()`, but in fact you can reduce it by deleting the 13-byte string `"if True:\n "`. So if your reducer is happy to brute force try all short strings, it will be able to make this transformation.
This isn't a real example exactly. C-reduce I think won't find this particular one (but I might be misremembering its exact heuristics here). Shrinkray will, but it will do so by understanding Python - it stops at 10-byte sequences, which won't work here. But it demonstrates the sort of thing that can work surprisingly well without understanding syntax.
Another thing you can do is you can understand just enough syntax in a general purpose way. For example, shrinkray (and I think C-Reduce) will look for integer literals in the program and try replacing them with integer literals representing a smaller number. In order to do this, it doesn't have to know anything about the program's lexical syntax, it just has to know what a number looks like.
Heuristics like this will often fail to work, but that's OK, you just need them to succeed enough, and often some pretty crude stuff will get you down to something that is small enough that a human can get it the rest of the way.
I glanced at the code. It seems to have multiple possible "passes" which reduce the code in various ways, and the passes here not tagged with "C"=>1 are used in the not-c mode recommended in the post.
The key thing is that the transforms it makes aren’t required to always produce a program that can actually run, or even build.
The user provides a script which determines if a program is “interesting.” A program with build-time errors should be considered “not interesting” in most cases. (And if you’re hunting an incorrectly reported error, you’ll want to make sure you only consider that error to be interesting.)
Then it yoinks out parts of your program and checks if the result is still interesting. If it’s not, that attempt is discarded and it tries something else. If the reduced program is still interesting, it will try yoinking out more stuff. Repeat until you like the result.
There doesn’t need to be any understanding of the program in order for this to work. You just need something where removing some bit of the program has a chance of producing a new program that’s still interesting. That works in most programming languages.
This process can take a long time, and it’s slower when there’s a lower probability of a removal producing an interesting program. Thus heuristics are added: don’t remove random character ranges, but work at a token level. When removing a brace, find the matching brace and remove the insides. That sort of thing. Most modern languages are similar enough to C that many of these rules are helpful for them too. And even if they don’t help, this just slows the process down, but it still works.
This kind of algorithm needs to be better known. For example, the original DD algorithm can be used to automatically minimise the set of cloud permissions for a service (from a given set) (A PoC is aws-policy-minimize here: https://github.com/kanocomputing/aws-tools/)
Of course, because it's perl, the Swiss army knife. Reducing token sets, blocks, statements is pretty language agnostic. It would have been an overarchitectured multi-file mess in python
Okay fine, let's see for ourselves:
Niiice: It seems to have stopped at "(96.4 %, 7347 bytes)" with the following output: https://gist.github.com/judofyr/47cba8a20cb2cd5798943ef975d0...reply