
The author focuses on the state-machine and parallel aspects of PBT, but there are other aspects that might have a larger effect.

One of these is coverage-guided PBT; see eg. Dan Luu's article: https://danluu.com/testing/

Another one, which I'm biased towards, is making shrinking automatic while keeping all the invariants you created while generating the values. I have a talk on this, but the TL;DR is that

- derived QuickCheck-like shrinker functions that work on values (shrink : a -> [a]) have issues preserving the generators' constraints, making people turn off shrinking instead of dealing with those issues

- rose tree "integrated shrinking" (eg. Hedgehog) follows the constraints of the generators, but has issues with monadic bind (using results of generators to dispatch to another generator)

- the only approach that seems to magically "just work" is Hypothesis' "internal shrinking", which uses a layer of indirection to shrink lists of random choices instead of the values themselves. The disadvantage is that the generators are now parsers of lists of bytes that can fail to parse (introducing some inefficiency), and a user can write a crazy generator that the internal shrinker will not be able to shrink perfectly. Nevertheless, it's the best DX of the three approaches, and given it's a small miracle people are testing at all, it feels to me like the approach most worth building on as a testing library author.
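For intuition, here's a rough, hypothetical Python sketch of the internal-shrinking idea (not Hypothesis' actual algorithm or API): the generator is a parser over a prerecorded list of random choices, and the shrinker edits that list rather than the generated value, so every shrunk value is one the generator could have produced.

```python
import random

# Rough sketch of "internal shrinking" (not Hypothesis' real algorithm):
# the generator parses a choice sequence; the shrinker edits the sequence.

def gen_list(choices):
    """Parse a choice sequence into a list of small ints:
    the first choice (mod 4) is the length, the rest are elements."""
    n = choices[0] % 4
    if len(choices) < 1 + n:
        raise ValueError("not enough choices")  # parsing can fail
    return [c % 100 for c in choices[1:1 + n]]

def shrink(choices, fails):
    """Greedily zero out individual choices while the property still fails."""
    best = list(choices)
    for i in range(len(best)):
        candidate = best[:i] + [0] + best[i + 1:]
        try:
            if fails(gen_list(candidate)):
                best = candidate
        except ValueError:
            pass  # the edited choice list didn't parse; keep the old one
    return best

# Toy failing property: "no element is >= 50".
fails = lambda xs: any(x >= 50 for x in xs)

rng = random.Random(0)
choices = [rng.randrange(256) for _ in range(8)]
while not fails(gen_list(choices)):
    choices = [rng.randrange(256) for _ in range(8)]

shrunk = shrink(choices, fails)
print(gen_list(shrunk))  # a smaller failing list, still a valid parse
```

Because every shrink candidate is re-parsed by the generator itself, the generator's invariants hold by construction; the cost is that an edited choice list may fail to parse at all.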




> - rose tree "integrated shrinking" (eg. Hedgehog) follows the constraints of the generators, but has issues with monadic bind.

We're at the limits of my amateur knowledge, but I believe this is a fundamental limitation of monadic bind/generators. Instead, you should prefer applicative generators for optimal shrinking. https://github.com/hedgehogqa/haskell-hedgehog/issues/473#is...

In other words, applicative generators do not use "results of generators to dispatch to another generator"; instead, shrinking is optimal due to the "parallel" nature of applicatives (I'm using "parallel" in the monadic sense, not in the article's "threading" sense). Since applicatives are "parallel", they can shrink the generators independently. (Whereas monadic generators are in "series", and therefore shrinking one necessarily changes the behavior of the subsequent generator, as you noted.)

Please feel free to link your talk if it's public!


Re the talk: it's unfortunately one of the Haskell Exchange talks that got pulled from the internet when the organizer, SkillsMatter, imploded. The Haskell Foundation is trying to revive and reupload those recordings, but my particular one isn't back on YouTube yet. But here's at least the slide deck: https://drive.google.com/drive/folders/1DjfhdeFW-qCCSlJUv_Xw...

I agree using applicatives helps somewhat; it basically minimizes your usage of bind.

Hypothesis works around this (and gets a bind that shrinks nicely) by recording those choices into (in principle) a flat list, and shrinking that. So your shrinker can work on both "before the bound fn" and "after it" at the same time, but the result might be something that the generator can't parse back into a value.


I now realize how far I am from amateur knowledge, I don't understand any of this.

But always interesting to discover an avenue I haven't explored before :)


Sequence points in C and CPS/ANF compilation in Lisp are kind of related to this. Both monads and applicatives enable you to define f(g(), h()), where f, g, and h are stateful computations of some sort, but monads force you to specify which of g and h is invoked first [so it’s more like x = g(), y = h(), f(x, y)] while with applicatives the implementor of the stateful model can decide what happens in such cases.

[Disclaimer: I don’t know how QuickCheck actually works, so I’d appreciate a sanity check (hah) of the following from somebody with actual knowledge on the matter.]

GP’s point, if I understood it correctly, is as follows. If you’re doing randomized testing, then g and h could perhaps be defined as { for (i=0; i<n; i++) { int x = rand(); if (!fork()) return x; } } (and yes, in monad-land the moral equivalent of fork() is admissible as what I vaguely called a “stateful computation”). You see how strict ordering of side effects enforced by monads essentially forces you into a depth-first search of the state space (although I guess you could do iterative deepening?), while applicatives can allow for different exploration strategies (ones more interesting than C’s official behaviour of “that’s UB, you fool”).


C sequence points are arbitrarily imposed by imperial decree in the standard; they are not linked to any evaluation causality.

Dependency-driven de facto evaluation orders have always existed in C though: things that you can figure out must be ordered even though there isn't a sequence point.

The standard spells out that expressions like i = i + i are okay, but actually it's not necessary to do so. The assignment cannot take place until the assigned value is known. The assigned value is not known until the + is evaluated and the + cannot be evaluated until the operand values are retrieved. Therefore, the retrieval of the prior value of i necessarily precedes the movement of the new value into i.

This, rather than sequence points, is a bit like monadic sequencing.


> The assignment cannot take place until the assigned value is known.

Of course it can. It's retrieving the value out of the lvalue that cannot be done, and even then you can stretch this delay further with the "as-if" rule. Haskell does something similar with its lazy evaluation IIRC.


> C sequence points are arbitrarily imposed [and] not linked to any evaluation causality.

That’s exactly my point. Writing g >>= \x -> h >>= \y -> f x y is like writing (x = g(), y = h(), f(x,y)), with an arbitrary ordering between computations g and h that the programmer is forced to choose, whereas f <$> g <*> h is like f(g(), h()), with no ordering between g and h imposed by the applicative model itself, similar to the absence of a sequence point between sibling arguments in C.


Let’s say your test data is a record with two fields that can be generated independently. After the property test framework finds a bug, it can independently shrink each field to find a simpler failing test.

Contrast with a test data generator that randomly chooses the first field, and then calculates the second field based on the first field’s value. The framework can’t directly mutate the second field because there’s a data dependency. If it just changes the second field, it will get a record that couldn’t have been generated.

The Hypothesis test framework took a different approach. Instead of mutating the test data directly, it mutates the “random” input and reruns the generator. If you do shrinking that way, you always get valid test data, though it might not be very similar to the original.

This is similar to what fuzzers do - you start with unstructured binary data and build data structures out of it. Your “random” test data generator is really a parser of random (or maybe not so random) input.
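As a toy illustration of that parser view (illustrative names, not any framework's real API), a generator can literally be a parser of raw bytes that fails on input it can't handle:

```python
# Toy "generator as parser": build a structured value out of raw bytes
# the way a fuzzer target would, failing on input it can't parse.

def parse_point(data: bytes):
    """Parse 4 bytes into an (x, y) pair of 16-bit ints."""
    if len(data) < 4:
        raise ValueError("not enough bytes")
    x = int.from_bytes(data[0:2], "big")
    y = int.from_bytes(data[2:4], "big")
    return (x, y)

print(parse_point(b"\x00\x01\x00\x02"))  # (1, 2)
```

Shrinking then operates on the byte string itself: a mutated byte string either parses to a simpler point or is rejected outright.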


To have an idea of monads being sequential, think about a JS Promise:

  promise.then(...).then(...).then(...)...
Promises are (almost) monadic, and the chain is sequential. You can't make it parallel, and that's the point: monadic binding represents sequential computation, aka "the semicolon operator".

To have an idea about applicatives being parallel, think about a list of functions, and a list of values; each function would be applied to a corresponding value, resulting in a list of results:

  results = []
  for ix in range(len(functions)):
    f = functions[ix]
    x = values[ix]
    results.append(f(x))
It's pretty obvious that you can do each f(x) in parallel and in any order, instead of the sequential code above.

(Why would one care about this all? Because many daily things are functors, applicatives, and monads, e.g. a list is usually all three.)


Haha yeah I kinda went off the deep end with applicatives. Here's a short primer on applicative vs monadic shrinking behavior using F# syntax https://github.com/hedgehogqa/fsharp-hedgehog/issues/419#iss...

You can think of `let!` as `let + await`, and `let! x ... and! y` as `await Parallel([x, y])`.

Please feel free to ask any questions if it's still confusing!


Not quite accurate with the Parallel example, if I understand you correctly. Don Syme is explicit that applicative `async` should not implicitly start work in the thread pool (https://github.com/dotnet/fsharp/issues/10301#issuecomment-7...).


My usage of "parallel/await" is entirely metaphorical; I was kinda going for a Javascript-esque syntax with "await Parallel" - I'm assuming most people aren't familiar with F#'s `and!`. It doesn't make sense for applicative generators to (necessarily) use the thread pool.


Anecdotally, Hypothesis was very far from "just working" for me. I don't think it's really production-ready, and that seems to be by design.[0]

I did have quite a bit of prior experience with clojure.spec.alpha (with or without test.check), so despite some differences it's not like I was completely alien to the general ideas.

[0] https://github.com/HypothesisWorks/hypothesis/issues/3493


I've tried to replicate your issue in the Elm testing library I'm a maintainer of, which uses some of the core ideas from Hypothesis.

It was able to generate values just fine ([1]), so your problems might have been just some issue of the Hypothesis _implementation_ instead of something wrong with the general algorithm?

Note that your usage of Hypothesis was "weird" in two aspects:

1. You're using the test runner for generation of data. That's sometimes useful, particularly when you need to find a specific example passing some specific non-trivial property (eg. how people solve the wolf, goat, cabbage puzzle with PBT - "computer, I'm declaring it's impossible to solve the puzzle, prove me wrong!"), but your particular example had just "assert True" inside, so you were just generating data.

There might be libraries better suited to the task. Elm has packages elm/random (for generation of data) and elm-explorations/test (for PBT and testing in general). Maybe Clojure or Python have a faker library that doesn't concern itself with shrinking, which would be a better tool for your task. The below gist [1] has an example of using a generator library instead of a testing library as well.

2. You were generating then filtering (for "at least one non-0 value across a column"). That's usually worse than constructively generating the data you want. An example alternative to the filtering approach might be to generate the data, and then if a column has all zeroes, generate a number from range 1..MAXINT to replace the first row's column with. Anyways, "all zeroes" shouldn't happen all that often, so it's probably not a huge concern that you filtered.

[1]: https://gist.github.com/Janiczek/c71050aa89fb0b19d73e683f9a3...
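A hypothetical Python sketch of the constructive fix-up described in point 2 (plain code, not Hypothesis' API): generate freely, then patch any all-zero column instead of filtering the matrix out.

```python
import random

# Sketch of constructive generation (illustrative, not Hypothesis' API):
# rather than rejecting matrices with an all-zero column, patch one entry.

def gen_matrix(rng, rows=3, cols=3, max_val=5):
    m = [[rng.randrange(max_val) for _ in range(cols)] for _ in range(rows)]
    for c in range(cols):
        if all(row[c] == 0 for row in m):
            # replace the first row's entry with a nonzero value
            m[0][c] = rng.randrange(1, max_val)
    return m

m = gen_matrix(random.Random(42))
assert all(any(row[c] != 0 for row in m) for c in range(3))
```

The invariant now holds by construction, so no generated value is ever thrown away.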


The code provided was not real test code. The only reason `assert True` is there is to highlight the fact that the issue was with generation, and not any other part of the test.

> Anyways, "all zeroes" shouldn't happen all that often so it's probably not a huge concern that you filtered.

This is really the point.

Like I said, Clojure’s spec (with or without test.check) has no trouble.


The issue seemed to be based on a misunderstanding of the implied probability distribution Hypothesis uses to draw examples from?


The probability distribution in my case seemed to be a single result (the zero matrix) with P=1.

So if that is intentional and by design, then you’re right, I don’t understand.


Hypothesis doesn't try to give you a consistent probability distribution, especially not a uniform one. It has lots of heuristics to give you 'nasty' numbers in order to trigger more bugs.

That's actually one big reason I prefer Hypothesis over eg Rust's proptest, which makes no attempt to give you 'nasty' numbers. Last time I checked, QuickCheck didn't make any attempt to give you floats well-known to cause trouble either (like inf, negative inf, different NaNs, 0, 1, -1, the least positive float, the largest negative float, the largest finite float just before infinity, etc.)


> Hypothesis doesn't try to give you a consistent probability distribution, especially not a uniform one.

This is totally understood and expected. But a “probability distribution” of exactly ONE value with P=1 is not much of a distribution.

I don’t understand this response, because it doesn’t address the issue at all.


There has been some research (can't find the paper right now but it was about F-metric or something?) suggesting uniform generators will trigger bugs more often than the edge-case-preferring ones. I'm still not too sure about whether I want to believe it though.


I would be very interested in that research!

Anecdotally, I have triggered many more bugs with the edge-case-preferring generation:

Basically, that was me using Rust's proptest library, which does uniform generation by default, and I hacked it up to prefer edge cases. For simplicity, for eg u32 I set it up with something like 50% probability to pick one of 0, 1, 0xFFFF_FFFF, 2, and a few other special values, or else to pick uniformly at random.
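That hack might look something like this in Python (a sketch of the idea, not proptest's actual code; the nasty-value list is illustrative):

```python
import random

# Sketch of a biased u32 generator: ~50% of the time return a
# known-nasty value, otherwise sample uniformly from the full range.

NASTY_U32 = [0, 1, 2, 0xFFFF_FFFF, 0x7FFF_FFFF, 0x8000_0000]

def gen_u32(rng):
    if rng.random() < 0.5:
        return rng.choice(NASTY_U32)
    return rng.randrange(2 ** 32)
```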

I vaguely remember some research that compared carefully hand-picked values vs property based testing. And the carefully hand-picked values performed slightly better at the same number of test cases. But if you modestly increased the number of test cases generated for property based testing, that swamped the advantage of careful hand picking.

Similarly, I suspect that any disadvantage that might exist for the scheme I outlined above compared to uniform sampling would go away, if you increased the number of test cases slightly. At least as long as you give a decent chunk of probability weight in my scheme to uniform sampling.


> - rose tree "integrated shrinking" (eg. Hedgehog) follows the constraints of the generators, but has issues with monadic bind

> - the only approach that seems to magically "just work" is Hypothesis' "internal shrinking", which uses a layer of indirection to shrink lists of random choices instead of the values themselves.

My preferred framework is falsify, which provides "internal integrated shrinking". It's similar to Hypothesis, but uses a tree of generators rather than a linear sequence (it's based on "selective functors", which are a nice interface that's also useful for e.g. validators).

From https://hackage.haskell.org/package/falsify

> This library provides property based testing with support for internal integrated shrinking: integrated in the sense of Hedgehog, meaning that there is no need to write a separate shrinker and generator; and internal in the sense of Hypothesis, meaning that this works well even across monadic bind.


I've been talking with Edsko de Vries about Falsify when he was implementing it / writing the paper. Very cool stuff, although somewhat mind-bending :)



