Since LLMs necessarily generate mutations more slowly than traditional techniques and generally cost more, why doesn’t the paper compare against traditional mutation testing frameworks to demonstrate bugs per dollar and bugs per unit of time spent testing? That seems like an important criterion for justifying that LLMs are worth it.

The abstract claims LLMs are 18% better than traditional approaches, but I can’t actually find that in the body of the paper (unless uBert is the “traditional way” but that’s an LLM approach too).




Nice question! The paper acknowledges that LLMs generate mutations more slowly and at higher cost than traditional tools like PIT and Major, and it does report metrics such as cost per 1K mutations. However, the researchers focused on the effectiveness and quality of the mutations the LLMs generated. For instance, GPT-3.5 achieves a 96.7% real bug detectability rate compared to Major’s 91.6% (not to mention GPT-4 outperformed all of them). All in all, LLMs produced fewer equivalent mutants, mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults.
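To make the terms concrete, here’s a minimal sketch (a hypothetical toy example, not code from the paper): a mutant is a small syntactic change to the program, a test suite “kills” it if some test now fails, and an equivalent mutant is one no test can ever kill because it is semantically identical to the original.

```python
def original(a, b):
    return a <= b

def mutant(a, b):
    # Relational-operator mutation: <= becomes <
    return a < b

def equivalent_mutant(a, b):
    # a * 1 == a, so this behaves identically to original() on all inputs
    return (a * 1) <= b

# (input a, input b, expected result) triples
tests = [(1, 2, True), (2, 2, True), (3, 2, False)]

def killed(fn):
    """A mutant is killed if at least one test observes a wrong result."""
    return any(fn(a, b) != expected for a, b, expected in tests)

print(killed(mutant))             # the (2, 2) test distinguishes it
print(killed(equivalent_mutant))  # no input can distinguish it
```

Equivalent mutants waste human triage time, which is why a lower equivalent-mutant rate is treated as a quality win for the LLM-generated sets.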


> All in all, LLMs produced fewer equivalent mutants, mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults.

The problem with PIT and Major is that they don’t do profile-guided mutation testing [0], which in theory would raise the detectability rate without a meaningful cost increase. Other work explores the use of GANs [1], which would probably be cheaper and likely as effective, but not as sexy as LLMs.

[0] https://arxiv.org/pdf/2102.11378

[1] https://ar5iv.labs.arxiv.org/html/2303.07546


Thanks for sharing the papers! I remember reading the first one from Google and can’t wait to dive into the new one. Appreciate the insights!


+ it would be great to see more research comparing LLMs and traditional methods in terms of cost-effectiveness. As for the cost issue, there are engineering ways to mitigate it, such as running the LLMs only on changed code.
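The changed-code idea could look something like the sketch below (my own rough illustration, not from the paper): parse a unified diff for added/modified line numbers, then only hand the LLM mutation sites that overlap the diff. The site list and diff here are made-up placeholders.

```python
import re

def changed_lines(unified_diff):
    """Collect new-file line numbers of added lines from unified-diff hunks.

    A rough sketch: reads hunk headers like '@@ -10,3 +10,4 @@' to track the
    current line number on the new side; ignores deletions and file headers.
    """
    lines = set()
    new_lineno = None
    for line in unified_diff.splitlines():
        m = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line)
        if m:
            new_lineno = int(m.group(1))
        elif new_lineno is not None:
            if line.startswith("+"):
                lines.add(new_lineno)
                new_lineno += 1
            elif not line.startswith("-"):
                new_lineno += 1  # context line advances the new side too
    return lines

# Hypothetical candidate mutation sites (line numbers in the new file)
mutation_sites = [4, 11, 27]
diff = "@@ -10,3 +10,4 @@\n context\n+new code\n context\n+more new code\n"
touched = changed_lines(diff)
# Only these sites would be sent to the (expensive) LLM mutator
targets = [ln for ln in mutation_sites if ln in touched]
```

Scoping mutation to the diff keeps per-commit LLM cost proportional to the change size rather than the whole codebase, which is how most CI-based mutation setups already amortize cost.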



