Show HN: Checksum – generate and maintain end-to-end tests using AI
78 points by Bootstrapper909 on April 19, 2023 | 32 comments
Hey HN!

I’m Gal, co-founder at Checksum (https://checksum.ai). Checksum is a tool for automatically generating and maintaining end-to-end tests using AI.

I cut my teeth in applied ML in 2016 at a maritime tech company called TSG, based in Israel. When I was there, I worked on a cool product that used machine learning to detect suspicious vehicles. Radar data is pretty tough for humans to parse, but a great fit for AI – and it worked very well for detecting smugglers, terrorist activity, and that sort of thing.

In 2021, after a few years working in big tech (Lyft, Google), I joined a YC company, Seer (W21), as CTO. This is where I experienced the unique pain of trying to keep end-to-end tests in a good state. The app was quite featureful, and it was a struggle to get and maintain good test coverage.

Like the suspicious maritime vehicle problem I had previously encountered, building and maintaining E2E tests had all the markings of a problem where machines could outperform humans. Also, in the early user interviews, it became clear that this problem wasn’t one that just went away as organizations grew past the startup phase, but one that got even more tangled up and unpleasant.

We’ve been building the product for a little over a year now, and it’s been interesting to learn that some problems were surprisingly easy, and others unusually tough. To get the data we need to train our models, we use the same underlying technology that tools like Fullstory and Hotjar use, and it works quite well. Also, we’re able to get good tests from relatively few user sessions (in most cases, fewer than 200 sessions).

Right now, the models are really good at improving test coverage for featureful web apps that don’t have much coverage (i.e., generating and maintaining a bunch of new tests), but making existing tests better has been a tougher nut to crack. We don’t have as much of a place in organizations where test coverage is great and test quality is medium-to-poor, but we’re keen to develop in that direction.

We’re still early, and spend basically all of our time working with a small handful of design partners (mostly medium-sized startups struggling with test coverage), but it felt like time to share with the HN community.

Thanks so much, happy to answer any questions, and excited to hear your thoughts!




This is dope. A couple of feedback points from a technical person who is a potential customer (I could be wrong on these!):

-- I think the name doesn't sell me, or probably most people, because "checksum" is more of a security/crypto term. When I saw the HN post say Checksum, I didn't think it was going to be about end-to-end tests. I thought it was going to be some crypto thing. Maybe a name like "Tested" or "Covered" would click better with potential customers.

-- The demo video doesn't leave me feeling like I know what this product is doing. I could also be misunderstanding the product. It might help more if the demo showed the following (ideally in less than 5-10s, or most users might tune out):

1. A quick setup step for Checksum

2. A set of generated tests

3. Passing tests

Seeing those steps would give me, as an end user, the feeling of "wow, this must be something I can quickly set up that will give me test coverage out of the box."


Thanks for your feedback. It definitely makes sense and we'll incorporate it!


> Our impact on performance is non-existent as we use battle-tested open source tools used by Fortune 500 companies

What does that mean, exactly? Just because it's open source and used by F500s doesn't mean it can't have performance issues.


That's a fair comment, and I guess we are missing an "AND" there.

1. We (and others) have tested our tools' impact on memory, CPU, and network performance, and found only negligible impact, even on slower/older devices

2. Also, they are used by F500 companies and have wide adoption, which indicates that other well-established devs have run the same tests and decided to move forward.

We'll work on the language there to clarify.


Most tool companies making claims about their tools show a shocking lack of knowledge about testing. This generally guarantees that their tools are dismissed by serious professionals. That still leaves a pretty substantial market among credulous wishful thinkers, of course.

But as a tester, I would like to see a tool that isn’t just more bullshit. For this to happen you will have to explain:

- What exactly is your product designed to do? What kind of products can it be applied to test?

- What do you mean by the word test? Humans test in many ways and levels. Do you simply mean “exercise code while detecting crashes?” Because that’s a tiny part of testing.

- Code coverage is not the only kind of coverage. So how do you automatically achieve state and data coverage? I’m guessing you don’t, but hoping you will surprise me.

- Test oracles come in all shapes and sizes. One of the reasons I say testing cannot be automated is that I can easily demonstrate that a human tester cannot fully specify their own oracles, and thus cannot write code to implement them, either. So, how does your product recognize a bug when it sees it?

- How much human handholding is needed to operate your product?

- Testers think critically about how users interact with the product as users attempt to fulfill their purposes. This guides practical testing. I haven’t yet seen any product that thinks critically. ChatGPT can’t. So how does your product cope?

- When the product under test changes, what does your product do?

- Can your product EXPLAIN its test coverage (other than reporting code coverage, which is a poor indicator of good test coverage)?

- Say I have a product that sends the user through a multimodal questionnaire (including the use of animated screens that guide the user through measuring heart rate) and then produces a diagnosis of possible illnesses. Can your product tell if the diagnosis was correct in relation to the original intent of the logic that is documented in Jira tickets and Slack conversations? Will it generate questions about any of that, the way a real tester does?


This is a really compelling idea – but I'm having a little trouble making the leap from the high level description to what it would mean for my projects in more concrete terms. Would it be possible to show off some example tests that the model generated and maybe even a story about how the generated tests caught a bug before the code made it to production?


Our landing page at checksum.ai has a video of a test in the hero section. We added some graphics (e.g. the green checkmark), but the steps executed are real tests that we generated.

But the tl;dr:

1. We learn how to use your app based on real sessions (we remove sensitive information on the client side)

2. We train a model on this data

3. We connect this model to a browser and generate Playwright or Cypress tests

The end result is code written in Playwright or Cypress. You can edit and run the tests regularly.
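To make that concrete, here is a minimal, hypothetical sketch of what a generated Playwright test could look like. The URL, selectors, and flow are made up for illustration; they are not actual Checksum output.

    import { test, expect } from '@playwright/test';

    // Hypothetical example of a generated end-to-end test for an invite flow.
    // The URL and selectors are illustrative, not real Checksum output.
    test('user can invite a teammate', async ({ page }) => {
      await page.goto('https://app.example.com/login');
      await page.getByLabel('Email').fill('qa@example.com');
      await page.getByLabel('Password').fill('test-password');
      await page.getByRole('button', { name: 'Sign in' }).click();

      await page.getByRole('link', { name: 'Team' }).click();
      await page.getByRole('button', { name: 'Invite member' }).click();
      await page.getByLabel('Invite email').fill('new.member@example.com');
      await page.getByRole('button', { name: 'Send invite' }).click();

      await expect(page.getByText('Invitation sent')).toBeVisible();
    });

A test like this is plain Playwright code, so it can be edited, committed, and run in CI like any hand-written spec.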


I know Gal has the link in plaintext above, but for folks who want to check out the homepage, it's here: https://checksum.ai


I'm always suspicious of tests when test coverage is the main metric. I've seen developers write tests that don't really check anything but run all the code paths. I've also seen tests that check every bit of output, which end up being brittle.

How well do the tests hold up over time, and how well are the tests validating the contract of the code instead of just historical behavior and quirks?


That's a great question!

We actually use real user sessions to train our model, so when I use the term "coverage," our main metric is covering as many user behaviors as possible.

We collect data in a privacy-focused way, essentially anonymizing all sensitive information, since we don't need to know the user-specific context, only the main flow.


If this is trained on user sessions, how would the model learn to generate tests for edge cases that wouldn’t necessarily show up in the training data?


We train the model based on user sessions to learn how to use an app. The model learns how to execute specific flows, but also how to interact with components in the more general sense. Since most developers use composable components, patterns of usage are repeated across the same app.

Then, during test generation, we bias the model to explore edge cases (in a few ways), and the model is still able to complete those even with a small number of samples.

In other words, we direct the model toward certain goals and flows, and also add chaos to the process, which results in the model executing unexpected flows.
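Purely as an illustration of what "biasing toward edge cases" can mean (this is a simplified sketch, not our actual model), one way to skew action selection toward rarely-seen flows is to weight candidate actions inversely to how often they appeared in recorded sessions:

    // Simplified illustration: weight candidate actions inversely to how
    // often they were observed, so rare (edge-case) flows get explored more.
    function pickAction<T>(candidates: T[], seenCounts: Map<T, number>): T {
      const weights = candidates.map((c) => 1 / (1 + (seenCounts.get(c) ?? 0)));
      const total = weights.reduce((a, b) => a + b, 0);
      let r = Math.random() * total;
      for (let i = 0; i < candidates.length; i++) {
        r -= weights[i];
        if (r <= 0) return candidates[i];
      }
      return candidates[candidates.length - 1];
    }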


How do you know what data is sensitive, and how do you anonymize?

Thinking of apps that might fall under HIPAA, etc.


We have a few privacy controls in place:

1. We hash all inner text and then backfill static strings on the server side, so any text that is specific to the user remains hashed

2. We detect special cases like passwords, SSNs, and credit cards, and block them completely (they aren't sent at all, not even hashed)

3. We provide full privacy controls to our customers to easily mask any sensitive elements

4. We discard the user's IP and don't require any PII to be sent. So we can connect a session together, but we don't really know who the user is
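As a rough sketch of what controls 1 and 2 can look like on the client side (illustrative only, with hypothetical helper names, not our exact implementation):

    // Illustrative sketch of client-side anonymization, not the exact implementation.
    const SENSITIVE_PATTERNS = [
      /\b\d{3}-\d{2}-\d{4}\b/,   // SSN-like values
      /\b(?:\d[ -]*?){13,16}\b/, // credit-card-like values
    ];

    async function sha256(text: string): Promise<string> {
      const bytes = new TextEncoder().encode(text);
      const digest = await crypto.subtle.digest('SHA-256', bytes);
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, '0'))
        .join('');
    }

    // Returns a hash of the element's text, or null if the element must be
    // blocked entirely (passwords, SSN-like or card-like values).
    async function sanitize(el: HTMLElement): Promise<string | null> {
      if (el instanceof HTMLInputElement && el.type === 'password') return null;

      const text = el instanceof HTMLInputElement ? el.value : el.innerText;
      if (SENSITIVE_PATTERNS.some((re) => re.test(text))) return null;

      // Static strings can be matched back to their hashes server-side;
      // anything user-specific stays hashed.
      return sha256(text);
    }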


Congratulations on all the progress you've made! We are all learning as we're building and talking to users. I know for my team, E2E/Integration testing is our main priority (over unit tests), and maintaining E2E tests is definitely a struggle. I imagine this problem is even more of an issue for larger codebases so I see why you're going after medium-size startups where the product isn't completely rebuilt every few months.


Thanks for your kind words! Yes, many teams struggle with that (and I have in the past), and the essence of our mission is to allow dev teams to focus on progressing on their roadmap and goals instead of wrestling with tests.

Feel free to sign up for a demo if that's a priority for your team. Even if it's just to chat and connect.


Noticed a couple of small typos in the marketing copy:

> Our impact on pefromence is non-existant as we use battle-tested open source tools used by Fortune 500 companies


Thank you for calling those out – will get that fixed right up!


Congrats! Definitely think QA farms can be automated using AI! Can you explain more about which parts of Checksum use AI?

Is it the identification of user sessions that are good candidates to turn into tests? Is it the generation of test specifications in some DSL / Cucumber / Selenium / etc.?


It's all of the above, but more specifically:

1. We use AI to analyze the user patterns and find common paths and edge cases, basically building a representation of your UX in a DB

2. We then use the DB to train another ML model that learns how to use your app the same way a user does. Given a certain page and user context, the ML can complete UX flows.

3. Finally, we learn to generate assertions, run the tests, and convert the model actions from step 2 into proper Playwright or Cypress tests
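As a rough illustration of step 3 (the action format and code generation below are simplified assumptions, not our actual pipeline), converting the model's actions into Playwright code can look something like this:

    // Simplified sketch: turning model-predicted actions into Playwright code.
    // The Action shape and codegen are illustrative assumptions.
    type Action =
      | { kind: 'goto'; url: string }
      | { kind: 'click'; selector: string }
      | { kind: 'fill'; selector: string; value: string }
      | { kind: 'expectVisible'; selector: string };

    function toPlaywrightTest(actions: Action[], title: string): string {
      const body = actions.map((a) => {
        switch (a.kind) {
          case 'goto':
            return `  await page.goto('${a.url}');`;
          case 'click':
            return `  await page.click('${a.selector}');`;
          case 'fill':
            return `  await page.fill('${a.selector}', '${a.value}');`;
          case 'expectVisible':
            return `  await expect(page.locator('${a.selector}')).toBeVisible();`;
        }
      });
      return [
        `import { test, expect } from '@playwright/test';`,
        ``,
        `test('${title}', async ({ page }) => {`,
        ...body,
        `});`,
      ].join('\n');
    }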


Nothing on pricing on your site. Makes any other question difficult to formulate.


Yep, totally understand. We are an early-stage startup and currently 100% focused on improving our models and our product.

We don’t have pricing, not because we’re trying to be vague, but because we haven’t fully figured out our training costs, which can vary significantly per app. We are very much in the “Do things that don’t scale” phase, where we hand-pick our customers, provide white-glove treatment, and prioritize learnings over price.


Reasonable.


Congrats on the launch.

0 - Seriously rethink your branding. I can help.

1 - How does the AI know when the test is successful? Is it a visual comparison? If so, is there a threshold range that can be adjusted?

2 - How does this differ from https://www.meticulous.ai/?

3 - Would it work on highly complex UX/UI interactions like the ones here?

https://youtu.be/WtglzRWQzVE


How is the product different from other test generation tools? How do you check whether they are testing the intended behavior? My experience with automated testing solutions has been lukewarm so far.


I agree! My experience with test generation tools was also lukewarm, which is why we founded Checksum.

> How is the product different from other test generation tools?

We train our models based on real user sessions. So our tests are:

1. Completely auto-generated

2. Able to achieve high coverage of real user flows, including detecting edge cases

3. Automatically maintained and executed with our models, so they are less flaky

> How do you check whether they are testing the intended behavior?

Our models are trained on many real sessions, so they learn how your website (and others) should behave. In that sense, it's similar to a manual QA tester who can detect bugs. To cover functionality that isn't obvious from the UI, we are now looking at adding LLMs to parse the code, but most of the functionality can be inferred from the UI.


So you are saying that your system needs me to do all the testing (it is infeasible to watch our users, because we test the product before it is released to any users) so it can learn how to test?

How can it know, by watching my clicks, how I decide if the behavior is correct on the backend?


Awesome idea! Excited to see where this goes


Thank you! Appreciate the kind words.


Interesting. Unrelated to the product, but related to your intro: what are the best open source datasets for maritime data?


Does it only support web? What about React Native mobile apps?


We're currently focusing on web apps.

There's nothing "specific" in the underlying model that prevents it from testing mobile. It's just a matter of focus at the current time.



