Show HN: Checksum – generate and maintain end-to-end tests using AI

therealrifath · on April 19, 2023

This is dope, a couple of feedback points from a technical person that is a potential customer (I could be wrong on these!):

-- I think the name doesn't sell me or even most people because "checksum" is more of a security/crypto term. When I saw the HN post say Checksum I didn't think it was going to be about end-to-end tests. I thought it was going to be some crypto thing. Maybe a name like "Tested" or "Covered" is going to click better with the potential customer.

-- I don't feel like the demo video is making me feel like I know what this product is doing. I could also be misunderstanding the product. It might help more if the demo showed the following (in ideally less than 5-10s or most users might tune out):

1. A quick setup step for checksum 2. A set of generated tests 3. Passing tests

Seeing those steps would give me the emotion as an end-user "wow this must something I can quickly setup and will make me feel like I have test coverage out of the box"

Bootstrapper909 · on April 19, 2023

Thanks for your feedback. It definitely makes sense and we'll incorporate it!

ezekg · on April 19, 2023

> Our impact on performance is non-existent as we use battle-tested open source tools used by Fortune 500 companies

What does that mean, exactly? Just because it's open source and used by F500s doesn't mean it can't have performance issues.

Bootstrapper909 · on April 19, 2023

That's a fair comment and I guess we are missing and "AND" there.

1. We (and others) have tested our tools' impact on memory, CPU, network performance and found only negligible impact, even on slower/older devices

2. Also, they are used by F500 companies and have wide adoption, which indicates that other well established devs have run the same tests and decided to move forward.

We'll work on the language there to clarify.

satisfice · on April 20, 2023

Most tool companies making claims about their tools show a shocking lack of knowledge about testing. This generally guarantees that their tools are dismissed by serious professionals. That still leaves a pretty substantial market among credulous wishful thinkers, of course.

But as a tester, I would like to see a tool that isn’t just more bullshit. For this to happen you will have to explain:

- What exactly is your product designed to do? What kind of products can it be applied to test?

- What do you mean by the word test? Humans test in many ways and levels. Do you simply mean “exercise code while detecting crashes?” Because that’s a tiny part of testing.

- Code coverage is not the only kind of coverage. So how do you automatically achieve state and data coverage? I’m guessing you don’t, but hoping you will surprise me.

- Test oracles come in all shapes and sizes. One of the reasons I say testing cannot be automated is that I can easily demonstrate that a human tester cannot fully specify their own oracles, and thus cannot write code to implement them, either. So, how does your product recognize a bug when it sees it?

- How much human handholding is needed to operate your product?

- Testers think critically about how users interact with the product as users attempt to fulfill their purposes. This guides practical testing. I haven’t yet seen any product that thinks critically. ChatGPT can’t. So how does your product cope?

- When the product under test changes, what does your product do?

- Can your product EXPLAIN its test coverage (other than reporting code coverage, which is a poor indicator of good test coverage)?

- Say I have a product that sends the user through a multimodal questionnaire (including the use of animated screens that guide the user through measuring heartrate) and then produces a diagnosis of possible illnesses. Can your product tell if the diagnosis was correct in relation to the original intent of the logic that is documented in Jira tickets and Slack conversations? Will it generate questions about any of that, the way a real tester does?

JohnFriel · on April 19, 2023

This is a really compelling idea – but I'm having a little trouble making the leap from the high level description to what it would mean for my projects in more concrete terms. Would it be possible to show off some example tests that the model generated and maybe even a story about how the generated tests caught a bug before the code made it to production?

Bootstrapper909 · on April 19, 2023

Our landing page at checksum.ai has a video in the hero section of test. We added some graphics (e.g. the green checkmark), but the steps executed are real tests that we generated.

But the tl;dr is 1. We learn how to use your app based on real sessions (we remove sensitive information on the client side) 2. We train a model on this data 3. We connect this model to a browser and generate Playwright or Cypress tests

The end result is code written and Playwright or Cypress. You can edit and run the tests regularly

johnsillings · on April 19, 2023

I know Gal has the link in plaintext above, but for folks who want to check out the homepage, it's here: https://checksum.ai

8organicbits · on April 19, 2023

I'm always suspicious of tests when test coverage is the main metric. I've seen developers write tests that don't really check anything but run all the code paths. I've also seen tests that check every bit of output, which end up being brittle.

How well do the tests hold up over time, and how well are the tests validating the contract of the code instead of just historical behavior and quirks?

Bootstrapper909 · on April 19, 2023

That's a great question!

We actually use real user sessions to train our model, so when I use the term coverage our main metric is covering as many user behaviors as possible.

We collect data in a privacy-focused way essentially anonymizing all sensitive information, as we don't need to know the user specific context. Only the main flow.

johnhenning · on April 19, 2023

If this is trained on user sessions, how would the model learn to generate tests for edge cases that wouldn’t necessarily show up in the training data?

Bootstrapper909 · on April 19, 2023

We train the model based on user sessions to learn how to use an app. The model learns how to execute specific flows, but also how to interact with components in the more general sense. Since most developers use compostable components, patterns of usage are repeated across the same app.

Then, during test generation, we bias the model to explore edge cases (in a few ways), and the model is still able to complete those even with low sample.

In other words, we direct the model toward certain goals, and flows and also add chaos to the process which result in the model executing unexpected flows.

afro88 · on April 19, 2023

How do you know what data is sensitive, and how do you anonymize?

Thinking of apps that might fall under HIPAA etc

Bootstrapper909 · on April 19, 2023

We have a few privacy controls in place:

1. We hash all inner text and then backfill static strings on the server side. So every text that is specific to the user remains hashed

2. We detect special cases like passwords, SSNs, credit cards, and completely block it (even not hashed)

3. We provide full privacy controls to our customers to easily mask any sensitive elements

4. We discard the user IP and don't require any PII to be sent. So we can connect a session together, but don't really know who the user is

jtambunt · on April 19, 2023

Congratulations on all the progress you've made! We are all learning as we're building and talking to users. I know for my team, E2E/Integration testing is our main priority (over unit tests), and maintaining E2E tests is definitely a struggle. I imagine this problem is even more of an issue for larger codebases so I see why you're going after medium-size startups where the product isn't completely rebuilt every few months.

Bootstrapper909 · on April 19, 2023

Thanks for your kind words! Yes many teams struggle with that (and I have in the past) and the essence of ur mission is to allow dev teams to focus on progressing on their roadmap and goals instead of wrestling with tests.

Feel free to sign up for a demo if that's a priority for your team. Even if it's just to chat and connect.

hiatus · on April 19, 2023

Noticed a couple small typos in the marketing copy

> Our impact on pefromence is non-existant as we use battle-tested open source tools used by Fortune 500 companies

johnsillings · on April 19, 2023

Thank you for calling those out – will get that fixed right up!

varunjain99 · on April 19, 2023

Congrats! Definitely think QA farms can be automated using AI! Can you explain more what part Checksum is using AI?

Is it for the identification of user sessions that are good candidates to make into tests? Is it the generation of test specification in some DSL / Cucumber / Selenium / etc.?

Bootstrapper909 · on April 19, 2023

It's all of the above but more specifically:

1. We use AI to analyze the user patterns and find common paths and edge cases, basically building a representation of your UX in a DB

2. We then use the DB to train another ML model that learns how to use your app the same way a user does. Given a certain page and user context, the ML can complete UX flows.

3. Finally, we learn to generate assertions, run the tests and convert the model actions in step 2 into proper Playwright or Cypress tests

BillSaysThis · on April 19, 2023

Nothing on pricing on your site. Makes any other question difficult to formulate.

Bootstrapper909 · on April 19, 2023

Yep totally understand. We are an early stage startup and currently 100% focused on improving our models and our product.

We don’t have pricing, not because we try to be vague, but because we haven’t fully figured out our training costs, which can vary significantly per app. We are very much in the “Do things that don’t scale” phase where we hand-pick our customers, provide white-glove treatment and prioritize learnings over price

BillSaysThis · on April 19, 2023

Reasonable.

artur_makly · on April 21, 2023

congrats on the launch.

0 - seriously rethink your branding. I can help.

1 - how does the Ai know when the test is successful? — is it a visual comparison? — if so.. is there a threshold range that can be adjusted?

2 - how does this differ from https://www.meticulous.ai/

3 - would it work on highly complex UX/Ui interactions like these here?

https://youtu.be/WtglzRWQzVE

sachuin23 · on April 19, 2023

How is the product different from the other test generation tools? How do you check if the are testing the intended behavior. My experience with automated testing solutions has been lukewarm so far.

Bootstrapper909 · on April 19, 2023

I agree! My experience with test generation tools was also lukewarm which is why we founded Checksum.

> How is the product different from the other test generation tools

We train our models based on real user sessions. So our tests are: 1. Completely auto-generated 2. Achieve high coverage of real user flows, including detecting edge cases 3. Automatically maintained and execute with our models so they are less flakey.

> How do you check if the are testing the intended behavior

Our models are trained on many real sessions so it learns how your website (and others) should behave. In that sense, it's similar to a manual QA tester which can detect bugs. To supplement for functionality that is not obvious by the UI, we are now looking at adding LLMs to parse code, but most of the functionality can be inferred from the UI

satisfice · on April 20, 2023

So you are saying that your system needs me to do all the testing (it is infeasible to watch our users, because we test the product before it is released to any users) so it can learn how to test?

How can it know, by watching my clicks, how I decide if the behavior is correct on the backend?

execore-1 · on April 19, 2023

Awesome idea! Excited to see where this goes

shyamkumar7 · on April 19, 2023

Thank you! Appreciate the kind words.

firedup · on April 20, 2023

Interesting, unrelated, but related to your intro, what is the best open source datasets for maritime data?

mackeyja92 · on April 19, 2023

Does it only support web? What about react native mobile apps?

Bootstrapper909 · on April 19, 2023

We're currently focusing on web apps.

There's nothing "specific" in the underlying model that prevents it from testing mobile. It's just a matter of focus at the current time.