I used to do something like this all the time with C/C++ compiler tests. I tried lots of fancy tools and stuff, but I kept going back to: expand all macros and retokenize to one token per line (I made a custom build of the preprocessor that had this built in). Then, have a shell script randomly remove lines, and use another script to check that the resulting test case behaves consistently with the failure. It would run for a few hours (or days, for boost problems), then usually you'd get a really minimal testcase that shows the problem. Often I would use this to find regressions. Just have the shell script check one is good, the other has the problem. The resulting output would then usually point exactly at the regressed feature, and make an amazing unit test.
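Roughly, that reduction loop looks like this (a minimal Python sketch of the idea, not the original scripts; `check.sh` stands in for whatever script decides the failure still reproduces, and the file names are made up):

    import random, subprocess

    # One preprocessed, one-token-per-line test case.
    lines = open("test.tokens").read().splitlines()

    def still_fails(candidate):
        # The oracle: exit code 0 means "still shows the original failure".
        with open("candidate.tokens", "w") as f:
            f.write("\n".join(candidate) + "\n")
        return subprocess.run(["./check.sh", "candidate.tokens"]).returncode == 0

    while True:
        progress = False
        for _ in range(200):
            # Try removing a random chunk of lines; keep the removal only if
            # the failure still reproduces, otherwise put the chunk back.
            n = len(lines)
            size = random.randint(1, max(1, n // 10))
            start = random.randrange(max(1, n - size + 1))
            candidate = lines[:start] + lines[start + size:]
            if candidate and still_fails(candidate):
                lines = candidate
                progress = True
        if not progress:
            break

    open("reduced.tokens", "w").write("\n".join(lines) + "\n")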
Before leaving the compiler team, I wanted to find a way to do this on the AST level (so parens always go in pairs, etc.), but that could also interact with the bug.
I wonder if LLMs could accelerate this by more intelligently removing stuff first, iteratively?
Sophisticated reducers like C-Reduce do know things like that parens go in pairs. C-Reduce has many transformation operations, and while some are syntax agnostic (delete a few characters or tokens), others use Clang to try to parse the input as C++ and transform the AST.
Perses isn't language-agnostic; it just knows the syntax of a lot of languages because there are ANTLR grammars for most commonly used languages.
Really there's no such thing as a language-agnostic test-case reducer. Shrink ray is much closer than most, but all this means is that it's got some heuristics that work well for a wide variety of common languages (e.g. the bracket-balancing thing). It's also got a bunch of language-specific passes.
This is sort of inherent to the problem, because in order to get good results and good performance, a test-case reducer has to have a strong idea of what sort of transformations are likely to work, which in turn means it has to have a strong idea of what sort of languages it's likely to be run on.
I am just shooting in the dark here, so excuse me if my comment is too ignorant: have you considered rolling your own reducer and using Tree-sitter grammars for it?
Very interesting. Previously I've seen two similar ideas:
1. Reduce the test data by property-based tests' auto-shrinking, à la QuickCheck [1]
2. Reduce the change set to a codebase, à la delta debugging [2]
The reducer is able to do both at the same time in a pretty generic way, with some restrictions on the expressivity. Seems to be a good fit for end-to-end tests.
Do you have examples where Python’s Hypothesis (and its shrinking) works better than Haskell’s QuickCheck? This would let us improve Haskell’s QuickCheck.
My understanding is that Haskell’s QuickCheck can shrink functions, and with parametric polymorphism, can work with vector types and more [1], which would be difficult if not impossible in Python.
And in general, the base types covered by shrinking in Python’s Hypothesis should have analogues in the shrinking in Haskell’s QuickCheck. And with deriving, Haskell’s QuickCheck should get shrinking ‘for free’ as in Python’s Hypothesis.
Also, Haskell’s QuickCheck extends to stateful and parallel checking, which is not supported in Python [2].
I have to admit the last time I seriously worked with QuickCheck was quite a while ago.
> My understanding is that Haskell’s QuickCheck can shrink functions [...]
Are you talking about shrinking a value of type (a -> b) (for some concrete a and b)? In Hypothesis, if you write a generator that spits out functions, you get a shrinker for free. So that shrinker will shrink functions.
As far as I can tell, QuickCheck's shrinking is based on values. I just checked, the type is still `shrink :: a -> [a]`, so that can't have changed.
In Hypothesis shrinking is based on the generation process. In Hypothesis generators act a bit like 'Functors' in Haskell, i.e. you can map over them. (You can also filter and `<>` and do monadic binds etc.)
So in Hypothesis you can take a generator that produces integers, and create one that produces only even numbers by (in Haskell terms) mapping (*2) over it. Now, the shrinker you get for free will also only produce even numbers.
That's not possible (or at least not easily possible) in Haskell. You would have to define at least a newtype wrapper.
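For concreteness, here is the Hypothesis version of that even-number example (a small Python sketch; the failing property is only there to make Hypothesis print a shrunk counterexample):

    from hypothesis import given, strategies as st

    # An even-integer strategy, built by mapping (*2) over st.integers().
    evens = st.integers().map(lambda x: x * 2)

    @given(evens)
    def test_small(n):
        # Deliberately failing property: the counterexample Hypothesis
        # shrinks to is still even, because shrinking replays the
        # generation process, including the map.
        assert n < 100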
Apart from shrinking, Haskell's QuickCheck can also learn a lot from Hypothesis in terms of floating point number generation. The last time I used QuickCheck seriously, they basically generated floats by converting from a sort-of uniformly sampled fractional number. I see that this commit https://github.com/nick8325/quickcheck/commit/07942642d7987b... seems to have made float generation a lot smarter!
Hypothesis is especially 'nasty' when it generates floats. It has a good chunk of probability allocated to things like various NaNs, infinities, sub-normal numbers, zero, one float past zero, etc, plus of course some probability mass on uniformly chosen floats. See https://hypothesis.readthedocs.io/en/latest/data.html#hypoth... for the docs, but you should check out the code, too.
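A tiny illustration of that nastiness (Python; the property is intentionally naive):

    from hypothesis import given, strategies as st

    @given(st.floats())
    def test_reflexive(x):
        # st.floats() puts real probability mass on NaNs, infinities,
        # subnormals, signed zeros, etc., so this naive property fails
        # almost immediately with x = nan (NaN != NaN).
        assert x == x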
> In Hypothesis shrinking is based on the generation process. In Hypothesis generators act a bit like 'Functors' in Haskell, i.e. you can map over them. (You can also filter and `<>` and do monadic binds etc.)
You are right that the generators in Hypothesis correspond to `Functor`s, so you can map over them. Indeed, in QuickCheck, its `Gen` [1] (the type of `arbitrary` [2], the generator in QuickCheck) is a `Functor`, and moreover a `Monad`, so you can do monadic binds on `Gen`.
> So in Hypothesis you can take a generator that produces integers, and create one that produces only even numbers by (in Haskell terms) mapping (*2) over it. Now, the shrinker you get for free will also only produce even numbers.
> That's not possible (or at least not easily possible) in Haskell. You would have to define at least a newtype wrapper.
Since `Gen` in QuickCheck is a `Functor`, to multiply its output by 2 to generate only even numbers, you can do this in Haskell as:
fmap (\x -> x*2) (arbitrary @Int)
Or, for those preferring OOP-style chaining:
arbitrary @Int <&> (*2)
where `<&>` is `fmap` with arguments flipped, and `(*2) = \x -> x*2`, and `@Int` is type application to specialize `arbitrary` to generate `Int`.
I think what you are hinting at is that QuickCheck separates generation and shrinking, while Hypothesis combines generation and shrinking. If you prefer combining generation and shrinking, you may want to check out hedgehog in Haskell; see [3] for a discussion of this trade-off.
I think by using the `MonadGen` in hedgehog [4], you should be able to map over and bind over generators, with automatic shrinking, much like Hypothesis.
Hypothesis was first released in 2013, while hedgehog in 2017, so it is possible that hedgehog was inspired by Hypothesis or similar property-based testing libraries.
But in general, I would be surprised if such ‘functional’ APIs (map, filter, reduce, bind, etc.) could not be ported to Haskell.
Yes, I know that `Gen` in QuickCheck is a Monad. What I am saying is `Arbitrary` isn't a Monad.
I know, Arbitrary technically can't be a Monad, because the kinds are wrong. But I mean 'morally': you can fmap the generation process, but you can't fmap the shrinking. Exactly because generation and shrinking are separated; and shrinking is driven purely by the value and type of the generated item, but has no access to any information about the generation process.
> But in general, I would be surprised if such ‘functional’ APIs (map, filter, reduce, bind, etc.) could not be ported to Haskell.
Yes, have a look at the Jack library I already linked to.
Interesting. The techniques are quite similar to what one can do with Hypothesis in Python ( https://hypothesis.readthedocs.io/en/latest/quickstart.html ). Basically, it generates a bunch of test data and then tries to shrink it to the smallest example that fails the test automatically.
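A minimal example of that generate-then-shrink behaviour (Python; the property is deliberately false):

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sum_is_small(xs):
        # Hypothesis first finds some random failing list, then shrinks it;
        # it usually reports a minimal counterexample like [101] rather
        # than the large list it stumbled on first.
        assert sum(xs) <= 100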
There was a talk about some novel ways to approach test suite generation and reduction at a PLDI workshop a few weeks ago. That work came from the compilers and programming languages tool domain:
I find it odd that test-case reduction is billed as a “research skill”. To me, this is a basic debugging skill for anyone who works with code.
Maybe it’s more critical when you’re dealing with “research-quality” code. This post seems to ignore the underlying problem: it’s rare to see research code with any tests.
As the article says, right at the top: "The plan is to demonstrate techniques that “everyone knows” because everyone, in fact, does not already know them."
This is actually something I find hard to resist in myself as an aging software engineer on HN. When I was young, articles would describe “obvious” things in excruciating detail, and I’d be grateful, because they were often things I didn’t know.
Now when I see articles about things I find obvious on the front page — particularly if they’re topics I’ve seen hit the front page before (which, mind you, means “any time in the last 15 years or so”) — I have to resist an annoyed gut reaction. The fact that knowledge and skills seem obvious and boring once internalized is a kind of terrible quirk of the human condition.
My issue isn't with articles describing "obvious" things; it's with articles acting like this obvious thing is an amazing new discovery. This is especially funny when you see it in articles about relational database stuff, since most of the fundamentals have been known since the '70s.
For whatever it's worth, I enjoy explaining things to others, so each time a topic recurs I can do that where it seems appropriate. And hey, I can pass on things I didn't know about and then learned here. For example, that's why I tell people about WUFFS: can you do all the bounds checking at compile time? Sure, for a price. And that price (generality) might be easily affordable, in which case using Rust (never mind C++) is very silly, because WUFFS was offering you better performance and total safety if you can do without generality.
No; I think in this context, "research skills" specifically means "skills in the service of conducting publishable research", not "skills that involve some quantity of exploration" or something like that.
(To be clear, I don't agree with the sentiment of the comment to which you were replying, but I also don't think debugging "is research by definition".)
It's not scientific research with the aim of publishing, but it is research as defined in the dictionary [1], and uses some of the same skills.
Honestly, I am a scientist first. I am not aware of any activity outside of academia that is more like research than debugging. You start with an observed phenomenon (the bug). Then you search for its cause. Along the way you formulate and test hypotheses, conduct experiments, gather evidence, and check the literature.
What's more: these activities are the bulk of what you do, especially in difficult cases. They are not incidental to debugging, they are the essence of the work.
Much of what is published is the outcome of processes that look a lot less like research.
[1] (E.g. I do research to win online arguments, or for a school essay, or a reporter does research for a story, or an engineer does research on how to implement some idea. None of these cases of research are academic/scientific research.)