I'm always suspicious of tests when test coverage is the main metric. I've seen developers write tests that don't really check anything but run all the code paths. I've also seen tests that check every bit of output, which end up being brittle.
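Something like this, to pick a contrived example (the module and framework choice here are made up, not from any real codebase):

```typescript
import { it, expect } from "vitest";
import { renderInvoice } from "./invoice"; // hypothetical module under test

// Coverage-padding test: it exercises the code path but asserts nothing meaningful.
it("renders an invoice", () => {
  renderInvoice({ items: [{ name: "Widget", price: 10 }] });
  expect(true).toBe(true); // always passes
});

// Brittle test: it pins the entire output string, so any harmless copy or markup tweak breaks it.
it("renders an invoice exactly", () => {
  const html = renderInvoice({ items: [{ name: "Widget", price: 10 }] });
  expect(html).toBe('<div class="invoice"><span>Widget</span><span>$10.00</span></div>');
});
```

The first inflates the coverage number without checking behavior; the second fails on every change, meaningful or not.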
How well do the tests hold up over time, and how well are the tests validating the contract of the code instead of just historical behavior and quirks?
We actually use real user sessions to train our model, so when I say "coverage", our main metric is covering as many user behaviors as possible.
We collect data in a privacy-focused way, essentially anonymizing all sensitive information, since we don't need the user-specific context, only the main flow.
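As a rough sketch of what I mean by scrubbing (simplified, not our actual pipeline; the event shape and helper names are made up for illustration):

```typescript
// Hypothetical session event shape; real sessions carry much more metadata.
interface SessionEvent {
  action: "click" | "input" | "navigate";
  selector: string;      // which component was touched
  value?: string;        // free text the user typed
  url: string;
}

// Keep the flow (what was done, to which component), drop the user-specific content.
function scrubEvent(event: SessionEvent): SessionEvent {
  return {
    ...event,
    // Replace typed text with a type-preserving placeholder so the model still
    // learns "this field takes an email-like string" without seeing the email.
    value: event.value ? placeholderFor(event.value) : undefined,
    // Strip IDs from the path, keeping only the route shape.
    url: new URL(event.url).pathname.replace(/\/\d+/g, "/:id"),
  };
}

function placeholderFor(value: string): string {
  if (/\S+@\S+\.\S+/.test(value)) return "user@example.com";
  if (/^\d+$/.test(value)) return "12345";
  return "sample text";
}
```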
If this is trained on user sessions, how would the model learn to generate tests for edge cases that wouldn’t necessarily show up in the training data?
We train the model on user sessions to learn how to use an app. The model learns how to execute specific flows, but also how to interact with components in a more general sense. Since most developers use composable components, usage patterns repeat across the same app.
Then, during test generation, we bias the model to explore edge cases (in a few ways), and the model is still able to complete those flows even with few training samples.
In other words, we direct the model toward certain goals and flows, and also add chaos to the process, which results in the model executing unexpected flows.
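A simplified sketch of the biasing idea, something in the spirit of epsilon-greedy exploration (not our actual implementation; the `Policy` interface, `chaos` weight, and action shape are made up for illustration):

```typescript
// Hypothetical action type: something the agent can do in the app under test.
interface Action {
  kind: "click" | "input" | "navigate";
  target: string;
}

// Assumed interface: a learned policy scores candidate actions for the current
// app state, with the score nudged toward a described goal or flow.
interface Policy {
  score(state: string, action: Action, goal: string): number;
}

// Pick the next action: mostly follow the learned, goal-biased policy, but with
// probability `chaos` take a random candidate to surface unexpected flows.
function nextAction(
  policy: Policy,
  state: string,
  goal: string,
  candidates: Action[], // assumed non-empty
  chaos = 0.1
): Action {
  if (Math.random() < chaos) {
    return candidates[Math.floor(Math.random() * candidates.length)];
  }
  return candidates.reduce((best, a) =>
    policy.score(state, a, goal) > policy.score(state, best, goal) ? a : best
  );
}
```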