Automated Test-Case Reduction (cornell.edu)
102 points by ingve | 29 comments



I used to do something like this all the time with C/C++ compiler tests. I tried lots of fancy tools, but I kept going back to: expand all macros and retokenize to one token per line (I made a custom build of the preprocessor that had this built in). Then have a shell script randomly remove lines, and use another script to check that the resulting test case still reproduces the failure. It would run for a few hours (or days, for Boost problems), and usually you'd end up with a really minimal test case that shows the problem. Often I would use this to find regressions: just have the shell script check that one compiler version is good and the other has the problem. The resulting output would then usually point exactly at the regressed feature, and make an amazing unit test.
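The loop was roughly this shape (a from-memory sketch in Python rather than the original shell; the compiler command, file names, and the ICE check are all made up):

    import random
    import subprocess

    def still_fails(path):
        # Hypothetical interestingness test: recompile the candidate and
        # check that the original failure still reproduces.
        result = subprocess.run(["g++", "-c", path],
                                capture_output=True, text=True)
        return "internal compiler error" in result.stderr

    lines = open("tokens.cc").read().splitlines()
    for _ in range(100000):  # in practice: run until you lose patience
        if len(lines) <= 1:
            break
        # Drop a random run of lines (one token per line).
        start = random.randrange(len(lines))
        count = random.randint(1, 10)
        candidate = lines[:start] + lines[start + count:]
        with open("candidate.cc", "w") as f:
            f.write("\n".join(candidate))
        if still_fails("candidate.cc"):
            lines = candidate  # smaller input still shows the bug; keep it

    with open("reduced.cc", "w") as f:
        f.write("\n".join(lines))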

Before leaving the compiler team, I wanted to find a way to do this at the AST level (so parens always go in pairs, etc.), but that could also complect with the bug.

I wonder if LLMs could accelerate this by more intelligently removing stuff first, iteratively?


Sophisticated reducers like C-Reduce do know things like parens going in pairs. C-Reduce has many transformation operations, and while some are syntax agnostic (delete a few characters or tokens), others use Clang to try to parse the input as C++ and transform the AST.

Perses is a reducer that works on an AST and claims to only try syntactically valid candidate inputs. It also claims to be language agnostic, and I don't know how these two things go together. But it does work nicely for Java for me. https://github.com/uw-pluverse/perses / https://faculty.cc.gatech.edu/~qzhang414/papers/icse18_cheng...


Perses isn't language agnostic; it just knows the syntax of a lot of languages, because there are ANTLR grammars for most commonly used languages.

Really there's no such thing as a language-agnostic test-case reducer. shrink ray is much closer than most, but all this means is that it's got some heuristics that work well for a wide variety of common languages (e.g. the bracket balancing thing). It's also got a bunch of language-specific passes.

This is sort of inherent to the problem: to get good results and good performance, a test-case reducer has to have a strong idea of what sort of transformations are likely to work, which in turn means it has to have a strong idea of what sort of languages it's likely to be run on.
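The bracket-balancing heuristic mentioned above is roughly this (my own sketch of the idea in Python, not shrink ray's actual code):

    def matching_bracket(text, i):
        # Find the closer matching the opener at text[i] (assumed to be
        # one of the openers below), tracking nesting depth.
        pairs = {"(": ")", "[": "]", "{": "}"}
        opener, closer = text[i], pairs[text[i]]
        depth = 0
        for j in range(i, len(text)):
            if text[j] == opener:
                depth += 1
            elif text[j] == closer:
                depth -= 1
                if depth == 0:
                    return j
        return None  # unbalanced input

    def delete_bracketed_region(text, i):
        # Delete an opener together with everything up to and including
        # its matching closer, so candidates stay plausibly well-formed.
        j = matching_bracket(text, i)
        return text if j is None else text[:i] + text[j + 1:]

A reducer can then try such deletions at every opener, instead of deleting raw character ranges that would almost always break the syntax.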


I am just shooting in the dark here, so excuse me if my comment is too ignorant: have you considered rolling your own reducer and using TreeSitter grammars for it?


Well, I do have my own reducer (shrinkray is mine), but I actually hadn't considered using TreeSitter grammars for it. That's a good idea, thanks!


I'm trying to understand - would you convert

  #define RET 42
  int main(void){return RET;}
to

  int
  main
  (
  void
  )
  {
  return
  42
  ;
  }
and then randomly remove lines, recompile (if the result is compilable) and check the behaviour?


Very interesting. Previously I've seen two similar ideas:

1. Reduce the test data via property-based tests' auto-shrinking, à la QuickCheck [1]

2. Reduce the change set to a codebase, à la delta debugging [2] (sketched below)

The reducer is able to do both at the same time in a pretty generic way, with some restrictions on expressivity. It seems to be a good fit for end-to-end tests.
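For reference, the core of Zeller's delta debugging is a small loop; here is a simplified, complement-only sketch in Python (not the full ddmin from the paper):

    def ddmin(items, fails):
        # Shrink `items` to a smaller list for which `fails` still returns
        # True, by repeatedly trying to drop one of n chunks at a time.
        n = 2
        while len(items) >= 2:
            chunk = max(1, len(items) // n)
            subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
            reduced = False
            for i in range(len(subsets)):
                complement = [x for j, s in enumerate(subsets)
                              if j != i for x in s]
                if fails(complement):
                    items = complement  # dropping chunk i kept the failure
                    n = max(n - 1, 2)
                    reduced = True
                    break
            if not reduced:
                if n >= len(items):
                    break  # already at single-item granularity; done
                n = min(n * 2, len(items))  # refine the partition
        return items

You call it with the list of lines (or changes) and an interestingness predicate, and it returns a smaller list that still fails.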

[1]: https://www.cse.chalmers.se/~rjmh/QuickCheck/

[2]: https://dl.acm.org/doi/10.1145/318774.318946


Btw, Python's Hypothesis is a lot better at shrinking than QuickCheck.

Not only do you get shrinking 'for free', Hypothesis's shrinker is also smarter at finding opportunities to shrink.


Do you have examples where Python’s Hypothesis (and its shrinking) works better than Haskell’s QuickCheck? This would let us improve Haskell’s QuickCheck.

My understanding is that Haskell’s QuickCheck can shrink functions, and with parametric polymorphism, can work with vector types and more [1], which would be difficult if not impossible in Python.

And in general, the base types covered by shrinking in Python's Hypothesis should have analogues in Haskell's QuickCheck's shrinking. And with deriving, Haskell's QuickCheck should get shrinking 'for free', as in Python's Hypothesis.

Also, Haskell’s QuickCheck extends to stateful and parallel checking, which is not supported in Python [2].

[1]: https://library.mlabs.city/mastering-quickcheck "Mastering QuickCheck: Advanced yet Practical Techniques for Property-Based Testing"

[2]: https://stevana.github.io/the_sad_state_of_property-based_te... "The sad state of property-based testing libraries: A survey of property-based testing libraries"


I have to admit the last time I seriously worked with QuickCheck was quite a while ago.

> My understanding is that Haskell’s QuickCheck can shrink functions [...]

Are you talking about shrinking a value of type (a -> b) (for some concrete a and b)? In Hypothesis, if you write a generator that spits out functions, you get a shrinker for free. So that shrinker will shrink functions.

As far as I can tell, QuickCheck's shrinking is based on values. I just checked: the type is still `shrink :: a -> [a]`, so that can't have changed.

In Hypothesis shrinking is based on the generation process. In Hypothesis generators act a bit like 'Functors' in Haskell, ie you can map over them. (You can also filter and `<>` and do monadic binds etc.)

So in Hypothesis you can take a generator that produces integers, and create one that produces only even numbers by (in Haskell terms) mapping (*2) over it. Now, the shrinker you get for free will also only produce even numbers.
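Concretely, with Hypothesis's `map` (the property is just for illustration):

    from hypothesis import given, strategies as st

    # Mapping over a strategy: the derived strategy only produces even
    # numbers, and the free shrinker stays within evens while shrinking.
    evens = st.integers().map(lambda x: 2 * x)

    @given(evens)
    def test_stays_even(n):
        assert n % 2 == 0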

That's not possible (or at least not easily possible) in Haskell. You would have to define at least a newtype wrapper.

Have a look at https://hypothesis.works/articles/compositional-shrinking/ and https://hypothesis.works/articles/integrated-shrinking/ for some more write-up, especially about how Hypothesis can preserve information even when shrinking past a monadic bind.
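In Hypothesis the monadic bind is called `flatmap`; a small example:

    from hypothesis import strategies as st

    # Bind: first draw a length, then draw a list of exactly that length.
    # Shrinking can adjust the length and the elements together, i.e. it
    # works "past" the bind.
    sized_lists = st.integers(min_value=0, max_value=10).flatmap(
        lambda n: st.lists(st.integers(), min_size=n, max_size=n)
    )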

If you follow some links in the two articles above, you land at https://github.com/icicle-lang/disorder.hs-ambiata/tree/mast... which is a Haskell library that shares a few design decisions with Hypothesis.

Apart from shrinking, Haskell's QuickCheck can also learn a lot from Hypothesis in terms of floating point number generation. The last time I used QuickCheck seriously, they basically generated floats by converting from a sort-of uniformly sampled fractional number. I see that this commit https://github.com/nick8325/quickcheck/commit/07942642d7987b... seems to have made float generation a lot smarter!

Hypothesis is especially 'nasty' when it generates floats. It has a good chunk of probability allocated to things like various NaNs, infinities, sub-normal numbers, zero, one float past zero, etc, plus of course some probability mass on uniformly chosen floats. See https://hypothesis.readthedocs.io/en/latest/data.html#hypoth... for the docs, but you should check out the code, too.
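A quick way to feel this is a deliberately naive property:

    from hypothesis import given, strategies as st

    @given(st.floats())
    def test_adding_one_increases(x):
        # Fails for nan (all comparisons are false), for inf
        # (inf + 1 == inf), and for floats so large that x + 1
        # rounds back to x.
        assert x + 1 > x

Hypothesis finds one of those counterexamples almost immediately, and then shrinks it to a minimal one.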


> In Hypothesis shrinking is based on the generation process. In Hypothesis generators act a bit like 'Functors' in Haskell, ie you can map over them. (You can also filter and `<>` and do monadic binds etc.)

You are right that the generators in Hypothesis correspond to `Functor`s, so you can map over them. Indeed, in QuickCheck, its `Gen` [1] (the type of `arbitrary` [2], the generator in QuickCheck) is a `Functor`, and moreover a `Monad`, so you can do monadic binds on `Gen`.

> So in Hypothesis you can take a generator that produces integers, and create one that produces only even numbers by (in Haskell terms) mapping (*2) over it. Now, the shrinker you get for free will also only produce even numbers.

> That's not possible (or at least not easily possible) in Haskell. You would have to define at least a newtype wrapper.

Since `Gen` in QuickCheck is a `Functor`, you can multiply its output by 2 to generate only even numbers. In Haskell:

    fmap (\x -> x*2) (arbitrary @Int)
Or, for those preferring OOP-style chaining:

    arbitrary @Int <&> (*2)
where `<&>` is `fmap` with arguments flipped, and `(*2) = \x -> x*2`, and `@Int` is type application to specialize `arbitrary` to generate `Int`.

> Have a look at https://hypothesis.works/articles/compositional-shrinking/ and https://hypothesis.works/articles/integrated-shrinking/ for some more write-up, especially about how Hypothesis can preserve information even when shrinking past a monadic bind.

I think what you are hinting at is that QuickCheck separates generation and shrinking, while Hypothesis combines them. If you prefer combining generation and shrinking, you may want to check out hedgehog in Haskell; see [3] for a discussion of this trade-off.

I think by using the `MonadGen` in hedgehog [4], you should be able to map and bind over generators, with automatic shrinking, much like in Hypothesis.

Hypothesis was first released in 2013, while hedgehog in 2017, so it is possible that hedgehog was inspired by Hypothesis or similar property-based testing libraries.

But in general, I would be surprised if such ‘functional’ APIs (map, filter, reduce, bind, etc.) could not be ported to Haskell.

[1]: https://hackage.haskell.org/package/QuickCheck-2.15.0.1/docs...

[2]: https://hackage.haskell.org/package/QuickCheck-2.15.0.1/docs...

[3]: https://tech.fpcomplete.com/blog/quickcheck-hedgehog-validit...

[4]: https://hackage.haskell.org/package/hedgehog-1.4/docs/Hedgeh...


Yes, I know that `Gen` in QuickCheck is a Monad. What I am saying is `Arbitrary` isn't a Monad.

I know, Arbitrary technically can't be a Monad, because the kinds are wrong. But I mean 'morally': you can fmap the generation process, but you can't fmap the shrinking. Exactly because generation and shrinking are separated; and shrinking is driven purely by the value and type of the generated item, but has no access to any information about the generation process.

> But in general, I would be surprised if such ‘functional’ APIs (map, filter, reduce, bind, etc.) could not be ported to Haskell.

Yes, have a look at the Jack library I already linked to.


Interesting. The techniques are quite similar to what one can do with Hypothesis in Python ( https://hypothesis.readthedocs.io/en/latest/quickstart.html ). Basically, it generates a bunch of test data and then automatically tries to shrink it to the smallest example that fails the test.
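E.g., with a deliberately broken property (my toy example, not from the linked docs):

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sum_is_small(xs):
        # Deliberately false claim, to show shrinking in action.
        assert sum(xs) < 100

The first failure is usually a longish random list; Hypothesis then shrinks it and typically reports the minimal counterexample xs=[100].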


This isn't a coincidence. I wrote both.


Thank you! I have no experience with Shrinkray, but Hypothesis is great.


This is more generally called property-based testing. My first encounters with it were QuickCheck for Erlang and Haskell. It has been around for ages.


In addition to C-Reduce and shrinkray, there's also cvise [1] for reducing test cases. It's great.

Also, if you happen to be reducing an LLVM test case, you can use llvm-reduce [2].

[1] https://github.com/marxin/cvise

[2] https://llvm.org/docs/CommandGuide/llvm-reduce.html


There was a talk about some novel ways to approach test suite generation and reduction at a PLDI workshop a few weeks ago. That work came from the compilers and programming languages tool domain:

https://pldi24.sigplan.org/home/egraphs-2024#program

Specifically, the EGSTRA paper.


Zig has a built-in test-case reduction command ("zig reduce").


I find it odd that test-case reduction is billed as a “research skill”. To me, this is a basic debugging skill for anyone who works with code.

Maybe it’s more critical when you’re dealing with “research-quality” code. This post seems to ignore the underlying problem: it’s rare to see research code with any tests.


As the article says, right at the top: "The plan is to demonstrate techniques that “everyone knows” because everyone, in fact, does not already know them."


This is actually something I find hard to resist in myself as an aging software engineer on HN. When I was young, articles would describe “obvious” things in excruciating detail, and I’d be grateful, because they were often things I didn’t know.

Now when I see articles about things I find obvious on the front page — particularly if they’re topics I’ve seen hit the front page before (which, mind you, means “any time in the last 15 years or so”) — I have to resist an annoyed gut reaction. The fact that knowledge and skills seem obvious and boring once internalized is a kind of terrible quirk of the human condition.


My issue isn't with articles describing "obvious" things, it's with articles acting like this obvious thing is an amazing new discovery. This is especially funny when you see it in articles about relational database stuff, since most of the fundamentals have been known since the '70s.


Yes, but they were lost.

We stored those discoveries in mongodb. Apparently it had failed in a web scale way and nobody has figured out how to restore from backup.


For whatever it's worth, I enjoy explaining things to others, so each time a topic recurs I can do that where it seems appropriate. And hey, I can pass on things I didn't know and then learned here. That's why I tell people about WUFFS, for example: can you do all the bounds checking at compile time? Sure, for a price. And that price (generality) might be easily affordable, in which case using Rust, never mind C++, is very silly, because WUFFS offers better performance and total safety if you can do without generality.


Maybe. But test case reduction for compiler bugs is absolutely not a basic debugging skill for just "anyone".


I don't understand your point. Debugging is research almost by definition, no?


No; I think in this context, "research skills" specifically means "skills in the service of conducting publishable research", not "skills that involve some quantity of exploration" or something like that.

(To be clear, I don't agree with the sentiment of the comment to which you were replying, but I also don't think debugging "is research by definition".)


It's not scientific research with the aim of publishing, but it is research as defined in the dictionary [1], and uses some of the same skills.

Honestly, I am a scientist first. I am not aware of any activity outside of academia that is more research than debugging. You start with an observed phenomenon (the bug). Then you search for its cause. Along the way you formulate and test hypotheses, conduct experiments, gather evidence, check the literature.

What's more: these activities are the bulk of what you do, especially in difficult cases. They are not incidental to debugging; they are the essence of the work.

Much of what is published is the outcome of processes that look a lot less like research.

[1] E.g., I do research to win online arguments, or for a school essay; a reporter does research for a story; an engineer does research on how to implement some idea. None of these cases of research are academic/scientific research.



