Show HN: Mutahunter – LLMs to support mutation testing for all major languages (github.com/codeintegrity-ai)
31 points by coderinsan 5 months ago | 9 comments
Background: We were inspired by how aider.chat uses a PageRank over the repo's AST to build its repo map, and realized the same idea could power high-quality mutation testing with LLMs. Check it out!
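Roughly, the idea (my own illustrative sketch with made-up symbol names, not Mutahunter's actual code) is to rank code entities by PageRank over a "who references whom" graph, so the most central code is mutated and handed to the LLM as context first:

    # pip install networkx
    import networkx as nx

    # Hypothetical reference edges extracted from an AST:
    # (referencing symbol, referenced symbol)
    references = [
        ("checkout", "apply_discount"),
        ("checkout", "total_price"),
        ("api_handler", "checkout"),
        ("report_job", "total_price"),
    ]

    g = nx.DiGraph()
    g.add_edges_from(references)

    # Symbols referenced from many central places rank highest,
    # so they are the most valuable targets/context for mutation.
    for symbol, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
        print(f"{symbol}: {score:.3f}")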



Hey, engineer here. Mutation testing may not be a familiar concept. To put it simply, it measures how effective your unit tests are at catching faults: a tool deliberately injects small faults (mutants) into your codebase and checks whether your test suite notices.
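A tiny self-contained example of the idea (all names made up by me):

    def apply_discount(price, percent):
        return price * (1 - percent / 100)

    # A mutation tool injects a small fault, e.g. flipping "-" to "+":
    def apply_discount_mutant(price, percent):
        return price * (1 + percent / 100)

    def test_apply_discount():
        # This assertion "kills" the mutant (90.0 vs 110.0), so the
        # test is doing real work. A weaker check such as
        # apply_discount(100, 0) == 100.0 would let the mutant
        # survive, exposing a gap in the suite.
        assert apply_discount(100, 10) == 90.0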

If you are interested in learning more about mutation testing and how big tech companies use it, read:

- State of Mutation Testing at Google: https://research.google/pubs/state-of-mutation-testing-at-go...

- Industrial Application of Mutation Testing: https://homes.cs.washington.edu/~rjust/publ/industrial_mutat...

- LLM-based Mutation Testing: https://arxiv.org/pdf/2406.09843

- Medium Blog on Mutahunter: https://medium.com/codeintegrity-engineering/transforming-qa...

- Short Demo: https://www.youtube.com/watch?v=8h4zpeK6LOA

Feel free to ask me any questions. I'd love to see mutation testing become widely adopted!


Since LLMs necessarily generate mutations more slowly than traditional techniques and generally cost more, why doesn’t the paper compare against traditional mutation testing frameworks to demonstrate bugs found per dollar and per unit of testing time? Those seem like important criteria for justifying that LLMs are worth it.

The abstract claims LLMs are 18% better than traditional approaches, but I can’t actually find that figure in the body of the paper (unless μBERT is the “traditional way”, but that’s an LLM-based approach too).


Good question! The paper acknowledges that LLMs generate mutations more slowly and at higher cost than traditional tools like PIT and Major, and it does include metrics like cost per 1K mutations. However, the researchers focused on the effectiveness and quality of the mutations generated by LLMs. For instance, GPT-3.5 achieves a 96.7% real bug detectability rate compared to Major’s 91.6% (not to mention GPT-4 outperformed all of them). All in all, LLMs produced fewer equivalent mutants, mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults.
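For anyone unfamiliar with the term, an equivalent mutant is a syntactic change that no test can ever catch because it doesn't actually change behavior; a classic toy example (mine, not the paper's):

    def my_abs(x):
        if x < 0:
            return -x
        return x

    # Mutant: "<" flipped to "<=". For integer x == 0 the mutant
    # returns -0, which equals 0, so no test over integers can ever
    # kill it; such mutants just waste triage time.
    def my_abs_mutant(x):
        if x <= 0:
            return -x
        return x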


> All in all, LLMs produced fewer equivalent mutants, mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults.

The problem with PIT and Major is that they don’t do profile-guided mutation testing [0], which in theory would raise the detectability rate without a meaningful cost increase (rough sketch of the idea after the links). Other work explores the use of GANs [1], which would probably be cheaper and likely as effective, but not as sexy as LLMs.

[0] https://arxiv.org/pdf/2102.11378

[1] https://ar5iv.labs.arxiv.org/html/2303.07546
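My loose reading of the profile-guided idea in [0], as a toy sketch with made-up hit counts: bias mutant placement toward code that real executions actually reach, so fewer mutants land in dead or rarely-run code:

    import random

    # line number -> hit count from a profiling/coverage run (made up)
    line_hits = {10: 9500, 11: 9500, 24: 120, 37: 3}

    def pick_mutation_site():
        lines = list(line_hits)
        # Hot lines get mutated far more often than cold ones.
        return random.choices(lines, weights=[line_hits[l] for l in lines])[0]

    print(pick_mutation_site())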


Thanks for sharing the papers! I remember reading the first one from Google and can’t wait to dive into the new one. Appreciate the insights!


+ It would be great to see more research comparing LLMs and traditional methods on cost-effectiveness. On the cost issue, I think there are engineering ways to mitigate it, such as running the LLMs only on changed code; rough sketch below.
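Something along these lines (a hypothetical sketch, not an actual Mutahunter feature): ask git for the files touched since the base branch and only generate mutants there, so LLM cost scales with the diff rather than the whole repo:

    import subprocess

    def changed_python_files(base="main"):
        # List *.py files that differ from the base branch.
        out = subprocess.run(
            ["git", "diff", "--name-only", base, "--", "*.py"],
            capture_output=True, text=True, check=True,
        )
        return [p for p in out.stdout.splitlines() if p]

    for path in changed_python_files():
        print("would generate mutants for:", path)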


Quick feedback on the presentation:

- A one-liner above the video that explains what you are doing would be helpful.

- The line "If you don't know what mutation testing is, you must be living under a rock!" drives people away from your repo faster than they can look the other way.


Appreciate the feedback and suggestions!


Hey, we are announcing a Cash Bounty Program to build out some high-priority features for Mutahunter. More details here: https://github.com/codeintegrity-ai/mutahunter#cash-bounty-p...



