Automated Unit Test Improvement Using Large Language Models at Meta (arxiv.org)
301 points by mfiguiere on Feb 17, 2024 | 188 comments



At a large insurance company I worked for, management set a target of 80% test coverage across our entire codebase. So people started writing stupid unit tests for getters and setters in Java DTOs to reach the goal. Of course devs also weren't allowed to change the coverage measuring rules in Sonar.

As a young dev, it taught me that focusing only on KPIs can sometimes drive behaviors that don't align with the intended goals. A few well-thought-out E2E test scenarios would probably have had a better impact on the software quality.


My favorite anecdote on the topic involved a codebase like this, handled by inexperienced programmers. I came into the team, and realized that a whole lot of careless logic could be massively simplified, so I sent a PR cutting the codebase by 20%, and still passing all the tests and meeting user requirements.

The problem is, the ugly, careless code was extremely well tested: 95% code coverage. My replacement had 100% code coverage... but by being far shorter, my PR couldn't pass the coverage check, as total coverage went down, not up. The remaining code in the repo? A bunch of Swing UI code, the kind that is hard to test, and where the tests don't mean anything. So facing the prospect of spending a week or two writing Swing tests, the dev lead decided that it was best to keep the old code somewhere in the repo, with tests pointing at it, just never called in production.

Thousands of lines of completely dead, but very well covered code were kept in the repo to keep Sonar happy.


My goodness…

While there are people trying with great effort to make computers as intelligent as humans, there are countless organizations trying to make humans as dumb as computers by making them adhere to arbitrary numbers without giving them any agency in evaluating the usefulness of the metric…


This is mostly just laziness and apathy. Don't try to learn about the problems that the business actually has, instead focus on a simple metric and the enforcement of it.


> This is mostly just laziness and apathy.

Sometimes it is attached to incentives, like monetary rewards (bonuses and promotions). I've seen someone promoted on the back of a "high-impact" project that was bug-riddled and constituted more than half the support calls for the rest of the team for at least a year, not to mention financial penalties for the organization. It wasn't laziness or apathy, just rational actors optimizing benefits from the current rules with inadequate oversight and/or penalties for adverse outcomes.


If you can prove that a given KPI can be mapped to the bottom line, then you really have something. Usually it isn't this simple.

Was the high impact project rushed or was it staffed with crappy devs? Or both? Usually that is the reason for bad outcomes.


What's the reasoning in this? Why wouldn't they just allow you to drop the coverage percentage?


The logic looks something like "The metrics are there to measure SWE productivity and quality, if we let them change it, they will game the metrics."


I can relate. At my first internship, there was a code quality tool forced on the team by management as well. It had a "no magic numbers" rule.

The result was a header with:

    static const unsigned ONE = 1;
    static const unsigned TWO = 2;
    static const unsigned THREE = 3;
   ... 
Up to some thousands.


Heavens forbid you want to raise some number to the power of 2, err I mean TWO


Really? You could not come up with a way to avoid magic numbers?


Sometimes, a number is just a number. The average of a, b, c is (a+b+c)/3, a positive integer is a single digit when it's less than 10, the formula for the volume of a sphere is (4/3)·π·r^3, etc... And that's excluding 0, 1 and 2, which are naturally everywhere.

That some numbers are not magic would be obvious to a human reviewer, but the tool probably just treats any number in an expression as a magic number or something like that, and the workaround is to define constants for raw numbers. Which entirely defeats the purpose since now, people will just use these constants for actual magic numbers and the tool will see nothing.


Maybe in rare cases, but in my experience, literals (numbers or strings) almost always have _some_ meaning where giving them a name helps readability. In my experience, it’s rare that “a number is just a number”. And sure, in those cases, naming them is silly.


Listen, what happens if we want to calculate delivery distances in a n-dimensional universe?


That 3 in your first example is definitely a magic number that should be either dynamically calculated from the number of elements being averaged, or defined as a const NUMBER_ELEMENTS_AVERAGED.


In the general case, of course you would use arrays (static or dynamic) and some kind of "size" attribute.

But this is just 3 values in an expression and using a constant could actually be bad. Let's be a bit more practical.

  int lightness(int r, int g, int b) { return (r+g+b)/3; }
Simple and straightforward

  int lightness(int r, int g, int b) {
    const int NUMBER_ELEMENTS_AVERAGED = 3;
    return (r+g+b)/NUMBER_ELEMENTS_AVERAGED; 
  }
Ok, I guess, but I think verbose for no good reason. But not as bad as the seemingly "cleaner"

  const int NUMBER_OF_COLOR_COMPONENTS = 3;
  int lightness(int r, int g, int b) {
    return (r+g+b)/NUMBER_OF_COLOR_COMPONENTS; 
  }
Imagine that you want to add a color component, for example to support transparency (alpha). So you set NUMBER_OF_COLOR_COMPONENTS = 4, and then your "lightness" function breaks, while the simple (r+g+b)/3 would have stayed correct. That happened because you didn't capture the real meaning of that "3". Even if, semantically, at the time you wrote that code, it was the number of color components, in reality it is the number of terms in the expression. There are r, g, b: 3 terms, so 3. Who cares how many color components there are?

Side note: I know it is the wrong formula for lightness, that's just an example.


If you name it `NUMBER_ELEMENTS_AVERAGED`, then when you add a new element to average, you will miss the fact that you also need to modify that value :)

You either have them on a list and calculate it dynamically based on the size, or have it as a magic number.


Sometimes a magic number is just a magic number.


I believe even truly magic numbers don’t need to be extracted in many cases.

Say in some UI code called FooBarItem you suddenly have some call to set padding to 12. People will take this number, put it in constant and name it FOO_BAR_ITEM_PADDING = 12.

This is not better. It's just more jumping through the code, and whatever is in the name of the constant is easily deducible from the usage pattern.

I learnt that pattern from others and nowadays I see it as useless. If you can add interesting info in the name or comments of the number, don’t bother extracting a constant.


Don't get me started on string constants. OK_LABEL_TEXT = "OK". GET_METHOD = "GET". Bonus points if this is only used in exactly one place.

Is it really difficult to see that these are pointless? If you actually think about what you are writing, that is.


Many uses of numbers aren’t magic numbers. If you want to count something then setting the counter to 0 and incrementing by 1 isn’t unexplained or subject to change.


Sometimes the number itself is much clearer, especially when it's part of a mathematical formula. For instance, is

const MILLIS_PER_SECOND = 1000;

...

...

...

...

...

const durationMs = MILLIS_PER_SECOND * duration

really clearer than

const durationMs = 1000 * duration

?


Capital M is for Mega. I would use duration_s and duration_ms.

const duration_ms = 1000 * duration_s

And _us for microsec.

const duration_us = 1000 * duration_ms

But then the tool would probably reject my code for not following the naming conventions, which "disallow using underscores in variable names".

Guess what I wanted to say is that there are always exceptions to the rule and there should always be some way to turn off the automatic checker for certain sections of the code.


Along the same lines, I see this a lot

    long delay = (5 * 60 * 1000); // 5 minutes, in milliseconds
And it's perfectly clear to me. Now, I think the comment is really helpful there (indicating intent), but I don't think having separate constants for each of the numbers there is going to make the code better. As it is, it's very easy to read at a glance, know what it's intended to do, and determine if it's correct (should you be worried about that at the moment). Which is what's important there.


I do that as well, but typically name that variable delayMilliseconds so there's no confusion.

sleep(delay) always looks ok, but sleep(delayHours) is probably going to catch your eye as suspicious.


The latter one leaves the possibility for this though

const duration = 1000 * duration


The solution to that is mutation tests. They force your tests to actually verify the implementation instead of just running the code to fake coverage (see the sketch after the list below). https://en.m.wikipedia.org/wiki/Mutation_testing Tools and frameworks exist for almost all languages. Some examples:

- stryker-mutator (C#, Typescript)

- pitest (Java)

- mutatest (Python)
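
To make the idea concrete, here's a minimal, hypothetical Java/JUnit 5 sketch (the class and method names are invented): the first test merely executes the code, so mutants survive; the second asserts on the result, so a tool like pitest can kill them.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class Discounts {
        // Production code under test.
        static long discountedCents(long cents, boolean isMember) {
            return isMember ? cents * 90 / 100 : cents;
        }
    }

    class DiscountsTest {
        @Test
        void runsTheCodeButAssertsNothing() {
            // 100% line coverage of discountedCents, yet a mutant that changes
            // 90 to 100 or flips the condition still "passes": the mutant survives.
            Discounts.discountedCents(10_000, true);
        }

        @Test
        void killsTheMutants() {
            // These assertions fail when the multiplier or the branch is mutated,
            // so the mutation tool reports those mutants as killed.
            assertEquals(9_000, Discounts.discountedCents(10_000, true));
            assertEquals(10_000, Discounts.discountedCents(10_000, false));
        }
    }

Coverage counts both tests as equally good; mutation testing is what tells them apart.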


I have tried those out for a large Java project. The problem is that it is just too slow to use with a large unit test suite. What mutation testing does, is running all the tests in your suite multiple times while changing some of your code base ("mutation points") to see if it affects the test outcome (it should). So say that your test suite normally takes 10 minutes, then a full mutation suite can easily be a factor 50 slower, which means 8 hours.


Find ways to reduce those 10 minutes. Some ideas:

- make the individual tests run faster

- remove obsolete tests

- increase parallelism

- use a tool that can determine which tests to run based on what has changed, for example nx.dev.


Perfect for a daily CI job?


"daily CI"?


People tend to use CI as a synonym for automated build pipeline.


So I've heard


You don't block on that, you run it on a continuous loop in CI. Then, if tests fail, you are limited to the check-ins over the last day or whatever, and you can more easily bisect to find the offending PR and file a bug against that dev to fix it, or just revert the change. But you don't run the full suite locally, though there are ways for Stryker to only test code that changed, which you can run locally, or in the PR build.


Mutation testing is really amazing for some types of code. And it definitely solves the problem of “fake unit tests”. Though sometimes it does force the tests to be a little too tightly coupled with the code. (Though that usually indicates that the code could be refactored)


Or instead of adding more rules and tech solutions, discuss with the team and gain buy-in on code quality and processes.


The paper in the OP mentions mutation testing at the end when talking about future work and open problems (section 7). They think it'd be very useful, but it's "challenging to deploy such computationally demanding techniques at the scale we would require".


Also mutant (Ruby). Good, but unfortunately it requires a costly subscription for commercial use.


We have a mandatory Sonar scan, and when I was hired, my tech lead proudly showed me the "A" grade they had been awarded, and said something like "we have a high standard to maintain".

I have never seen such a poorly written application in my 6 years of experience (and I am not only talking about style; stuff was absolutely, utterly broken, while they had no clue what was wrong).

I hate Sonar with a passion. It should only ever be used to report vulnerabilities, not to tell me to rename variables or that I "should refactor this code duplication!" I already have a fucking backlog of Jira tickets; don't tell me what I am or am not supposed to do, and when I am supposed to do it.

But oh boy, managers love this stupid power burner.


You want to hear something ridiculous? I worked at an org that used Snyk/Sonar and would block PRs on a failing quality gate score. The problem was that you couldn't see why you failed the quality gate in Jenkins/GitHub; only people with Sonar accounts could. So you had to find someone with a Sonar account and get them to take a screenshot of the error before you could fix it.


I'm fighting against this now. We already use linters built and tuned by engineering, we have custom rules that solve real pain points, and we've disabled everything that the team doesn't like or doesn't think provides value.

Then management sweeps in and tries to add Sonar, and it's a nightmare. Besides tripling our total build time to run this horrible tool, they want us to waste time rewriting our codebase to follow insane rules like "cognitive complexity", plus an editor integration that takes several seconds to update after every file change.


I don't understand this. Why is management telling you what to do? In our company, the managers ask/push the teams to improve some metric, but the teams are responsible for the how. The metric is usually high-level, like "reduce the number of bugs reported", not "increase code coverage".


I think having cognitive complexity indicators is great. Many devs don't think about that, so having the visible reminder is good on average. However, it should not be blocking in any manner, and it's silly to drive a rewrite based on it. Engineering leaders should understand the architecture at a visceral level, know good patterns/design, and suggest refactors without static analysis tools.


"having them" is fine, being forced them down your throat is not. I mean like "your hotfix cannot be merged because you touched this piece of ancient code nobody understands and now it's complexity grew beyond the arbitrary threshold, please put all your tickets aside, and make a start refactoring this function because Sonar saud so"


Totally agree! That sounds absurd.


Sonar is actually an incredibly powerful tool for dev teams WHEN they know how to use it and configure it according to their agreed-upon standards.

When it's used thoughtlessly, as it often is, it's terrible.


"When a metric becomes a target, it ceases to be a good metric".

A big problem is making it mandatory, with huge bureaucracy required to work around its stupidity. Just last week I was battling yet another code quality tool they made mandatory: it was complaining that my res.status(200).json() wasn't setting up HSTS headers. So I tried setting them up manually, and it kept complaining; app.use(helmet()), same thing. Apparently it wanted me to write the whole backend in a single file for it to stop complaining. And of course, HSTS is much more elegantly and automatically handled by the ingress or load balancer itself.

I could have spent a week or two flagging it as a false positive and explaining what HSTS is for upper management to approve it. I ended up just adding a res.sendJson(data, status = 200) to the prototype of the response object. Which is obviously stupid, but working in a bureaucracy-heavy sector has made me realize how much bad software is composed of many such bad implementations combined.


Goodhart's Law


IME, the only "test coverage % rule" that I've ever seen work was "must not decrease the overall percentage of tested code". Once you get to 100%, that becomes "All code must have a test".

Various people objected to this, pointing out that 100% test coverage tells you nothing about whether the tests are any good. Our lead (wisely, IMO) responded that they were correct - 100% tells you nothing - but that _any other percentage_ does tell you something.


This works better as "All tests decreasing the overall percentage of tested code must have a good explanation signed off on by the reviewer". I've occasionally deleted dead code paths, had an almost entirely red diff, had close to 100% test coverage over the functions I touched, and yet decreased the overall percentage of tested code because the dead code path was more heavily tested than the rest of the code base.


I kind of disagree, in that any rule should have an implied "you can ignore this with good justification". Otherwise you turn people into robots, which renders moot all the reasons why you have people, and not robots.


I agree with this - in practice any rule can be broken - but how easily it's broken and whether developers are explicitly told to break it is a process decision. How big a roadblock is your CI putting in the path of developers? If you require e.g. sign-off from a team lead or manager or whoever has "merge anyway" permissions in GitHub, that's more difficult than sign-off from the same reviewer who's reading your code anyway. You can make it easier or harder to break rules and in this case I think it should be so easy to break that there's a codified procedure to do so.


I haven’t tried that - in that situation we did “find something that needs a test”


What tells you something is how many releases get rolled back as a percentage, and how much rework you have. If your developers need 3 attempts to get something working into production, that's all you need to know about the quality of the reviews and tests. High or low coverage, what you need to look at is actual issues.

"All you need" is a bit hyperbolic, because you also need to quarantine flaky tests and other things, but coverage as a whole I think is useful only if you have an engineering organization that doesn't see the point in tests - which is going to be its own uphill battle.


> Coverage as a whole I think is useful only if you have an engineering organization that doesn't see the point in tests - which is going to be its own uphill battle.

I think coverage stats are always useful as they help find the edge cases that people forgot to test. A common culprit I've seen is error handling code where a bunch of tests target the happy path, but nothing tests the error logging when something breaks.


Good heavens, a team needing 3 tries to get something into production sounds like they need a QA refresher.


Most of the time, I see teams mixing refactoring, bug fixing, and new features into a single PR which causes this. Keeping PRs focused makes it easier to review and find issues before shipping to prod.

In other words, first bug-fix PR should be hacky af to fix the bug. No refactoring, nothing controversial (other than the hacky af fix). After you verify the fix in production, then, and only then, do you open a PR to refactor the code. Finally, after that is verified in production, close the bug ticket.


I loved having small, focused PRs. So much easier to review without interrupting my day, too. Then we got handed a policy of one PR per ticket to simplify our QA process. Oh well.


I had a colleague who wrote unit tests without any assertions. Perfect idea! 100% coverage and always green.

Another comment here mentioned mutation tests, which could be a solution to increase the quality of unit testing, but I've never seen anyone actually use it in enterprise development. Same story with the test-driven development concept.


I'm not saying this is a good practice, but it's not valueless. Exceptions are failures. If you have code that is straight broken -- refers to an undefined variable, assumes something that's not true, syntax error that could make it past initial app load -- a test that at least makes sure the code is runnable without throwing an error is better than nothing.
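
In JUnit terms that's essentially a smoke test; a minimal sketch (the ReportGenerator class is hypothetical): it asserts nothing about the output, but an exception anywhere in the call still fails the build.

    import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

    import org.junit.jupiter.api.Test;

    class ReportSmokeTest {
        @Test
        void reportGenerationDoesNotBlowUp() {
            // No claim that the report is correct, only that the code path
            // runs end to end without throwing.
            assertDoesNotThrow(() -> {
                new ReportGenerator().render("2024-02");
            });
        }

        // Stand-in for some real production class.
        static class ReportGenerator {
            String render(String month) {
                return "Report for " + month;
            }
        }
    }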


I hear this point as a counter argument to “never decrease coverage helps maintain quality”.

It is technically correct. But, it is only meaningful if you assume a bad actor in the team who knowingly games the system, and a team who tolerates it.

At that point, your problem has nothing to do with code quality, nor is coverage meant to be a solution for it.


You're missing the point entirely. This point is a counter example to the implied assertion that some percentage of code coverage indicates correctness. That alone is enough to prove that the implication is false.

That being said, you don't need to assume a bad actor in the team to encounter the situation of having 100% code coverage with meaningless tests. Even the most well-meaning engineer can accidentally write tests that boil down to asserting "true == true" without proving anything about the code paths it touches. This isn't necessarily a cultural issue.

I'd even assert that, due to the languages in common use, it's very common to see these sort of "touch all the code paths but assert nothing about the correctness of the code" tests. Standard OOP languages like TypeScript or Java for example have relatively limited type systems and allow mutable variables, so you end up with implementations of algorithms and data structures that don't lend themselves to property-based testing. This leads to tests which basically just duplicate the implementation and assert "I wrote what I wrote" or in other words "true == true".


> You don't need to assume a bad actor in the team to encounter the situation of having 100% code coverage with meaningless tests.

Indeed. The thing is, I was replying to a comment that referred _explicitly_ to a bad actor. In the situation @ponector describes (the team's coverage indicator is rendered useless because a bad actor games it), the cause is not coverage. It says nothing about coverage beyond "a tool only works in certain conditions".

You bring up 2 more cases that break the indicator, and I agree with both (there are more!). We have (1): we all make mistakes and write dumb tests. (2): coverage is not useful for _some_ practices / scopes of testing.

I agree on both. But we're back in the same place. I'm not saying those problems don't exist. I'm saying that those are not problems coverage ever claimed to solve. Making a sweeping dismissal of a tool because it doesn't solve problems it never claimed to solve is throwing the baby out with the bathwater.

* Coverage does not claim to be a tool that fixes bad actors, (1) or (2)! There are other tools to cover those risks (e.g. managers, code reviews, pair programming, etc.).

* Discussions about those tools (code reviews, etc.) tend to make the same mistake. Find problems the tool doesn't claim to solve to dismiss the tool.

* This all happens because people treat tools like coverage, tests, and DORA metrics as silver bullets. They are not. They are all meant to be a toolbox that engineers evaluate and use where they yield value.

And this is why yes, a lot of the "$tool is useless because $situation_where_it_doesnt_work" conversations are fundamentally about cultural issues. If your team uses a tool without knowing what problem it is trying to solve, you have a cultural issue. If your tool has an actual purpose, and yet engineers are intentionally working around it, you have a cultural issue. Etc.


There is no such implied assertion, he said:

> 100% tells you nothing - but that _any other percentage_ does tell you something.

100% doesn't tell you it is meaningful coverage, but less than 100% tells you for sure that uncovered part doesn't have meaningful coverage.


Mutation testing is enabled at at least some of the large tech companies, though I don't know how widely.


> IME, the only "test coverage % rule" that I've ever seen work was "must not decrease the overall percentage of tested code".

That’s a stupid rule and not measuring what you think it is.

It fails if you just delete tested code.

It fails if you remove some code that's tested and add some tested code, but not enough.

These are all extremely common for any refactor. Here is a simplistic example, but imagine the same principles applied to a large set of changes: dozens of files, thousands of lines. You can't just manually account for that.

50%: 100 lines of code. 50 covered, 50 uncovered.

Remove some function and it's 44%: 40 covered, 50 uncovered. Failed.

Or remove some function and replace it with something better that's not as long: 44 covered, 50 uncovered, 47% coverage. Failed.

A Software Engineering Stack Exchange post about this:

https://softwareengineering.stackexchange.com/questions/4007...

This inevitably leads to worthless tests written only to increase coverage, and to avoiding optional refactors.


It sounds great but I beg to differ.

If I had to pick one rule that is always worth following, it's this: when you get hit by a bug, write a test that catches it, and then write the code that makes the test pass.

Aside from preventing said bugs from resurfacing, it forces coverage of code that is likely complex enough to be buggy.


>Once you get to 100%

Once you get there, you've already fucked up. In Java, covering 100% of lines means, in the average case, testing Lombok-generated code and every equals/hashCode. If you're doing that, you fucked up.


I've seen broken hash implementations many times, so I'm not sure it would be a bad idea to require tests for them ;)

My other favourite bit of trivial code that's broken: returning the same Iterator instance from Iterable.iterator()


Broken, or just working with Hibernate?

I remember Hibernate recommending a constant hashCode, in order to prevent saved entities from changing their hash values.


> "must not decrease the overall percentage of tested code"

This rule has been a problem for me when deleting code. At my previous job we had an automated checker that wouldn't let us merge (easily) if we broke this rule. We had mediocre coverage: some parts of the code had lots of tests, some parts were totally uncovered. Many times I had to delete code in the tested parts, and the checker would complain that total coverage went down because I deleted a tested line without adding more tested lines.

But I agree it's a good rule in general, which is why we kept it despite the occasional hiccup.


I like that.

What I also do is ensure test coverage is over 100%¹ for important parts. Important is designated through churn (if a file or class is changed in every second commit, it must be important) and through domain knowledge (building a recipe app? then likely the Recipe is important).

¹ covered by unit tests, AND (partially) covered by integration AND by E2E tests.


I'd worry about test coverage as soon as the team commits to actually fixing every bug found.


> So people started writing stupid unit tests for getters and setters in Java DTOs to reach the goal.

To me that reads like your team fucked up at a very fundamental level, as they both failed to take into account the whole point of automated tests and also everyone failed to flag those nonsense tests as a critical blocker for the PR.

Unless your getters and setters are dead code, they are already exercised by any test covering the happy path. Also, an 80% coverage target leaves plenty of headroom to leave out stupid getter/setter tests.

A team that pulls this sort of stunt is a team that has opted to develop defensive tricks to preserve their incompetence instead of working on having in place something that actually benefits them.


The incentives are obviously pointing towards this. And that it is so common should make you rethink your stance of "they're just incompetent".



Thanks, I was looking for that name.

Interesting examples:

> San Francisco Declaration on Research Assessment – 2012 manifesto against using the journal impact factor to assess a scientist's work. The statement denounces several problems in science and as Goodhart's law explains, one of them is that measurement has become a target. The correlation between h-index and scientific awards is decreasing since widespread usage of h-index.

> International Union for Conservation of Nature's measure of extinction can be used to remove environmental protections, which resulted in IUCN becoming more conservative in labeling something as extinct


> So people started writing stupid unit tests for getters and setters in Java DTOs to reach the goal

Ideally, the code coverage tool would have heuristics to detect trivial getter/setter methods, and filter them out, so adding tests for them won't improve code coverage. Non-trivial getters/setters (where there is some actual non-trivial logic involved) shouldn't be filtered, since they should be tested.

Although, there is room for debate about what counts as trivial. Obviously this is trivial:

    public void setUser(User user) {
       this.user = user;
    }
But should this count as trivial too?

    public void setUser(User user) {
       this.user = Objects.requireNonNull(user);
    }
Probably. What about this?

    public void setOwners(List<User> owners) {
       this.owners = List.copyOf(owners);
    }
Probably that too. Which suggests, maybe, there ought to be a configurable list of methods, whose presence is ignored when determining whether a getter/setter is trivial or not.


> At a large insurance company I worked for, management set a target of 80% test coverage across our entire codebase. So people started writing stupid unit tests for getters and setters in Java DTOs to reach the goal.

I attended many TOC conferences in the 90s and early 2000s. Eli Goldratt was famous for saying "Tell me how you'll measure me, and I'll tell you how I will behave."


Java is the only place where (non-automatic/non-syntax-sugared) getters and setters are thought of as important and valuable.

It only goes to confirm my view that the language is deficient.


I don’t think it’s accurate to label that as a language problem.

That is very squarely a people problem.

It’s not as bad today - many Java juniors don’t have the “bean” affliction burned into their brains so they don’t object to public fields on data carriers (today you’d just use a record) but even the bean generation (mostly people my age) can usually be won over these days by negotiating with them on cases where getters/setters can be eliminated (e.g. start with value objects, then suggest maybe DTOs then you can go for the kill - why do we need the stupid Java bean convention?)


Record types solved this (preview in JDK 14, finalized in JDK 16)


Cries in Java 8


I mean public final was there before as well.


I don’t find they are valued like that. IME most devs who use them perceive them as annoying boilerplate. Proof being how Lombok or Records are generally well received among the younger gen.

The problem on this topic is that they cargo cult getters/setters by inertia, usually because more experienced engs pass it on as good practice.

It’s not an inherent problem to the language.


Modern Java projects either use records, org.immutables or Lombok. Manual getter/setter creation is the exception nowadays.


> As a young dev, it taught me that focusing only on KPIs can sometimes drive behaviors that don't align with the intended goals

Something I've learned along the way as well. A few times in my career I will end up working under a manager that insists only work "that can be explicitly measured" be performed. That means they forbade library upgrades, refactors, things like that because you couldn't really prove an improvement in customer metrics or immediate changes in eng productivity.

I've also been at companies that follow that mantra more broadly and apply it to eng performance reviews. The entire culture turns into engineers focusing on either short term gains without regard for long term impact, or gaming metrics for meaningless changes and making them look important and impactful.

Important but thankless work gets left behind because, again, it's work that is not "immediately measurable." The end result is a bunch of features customers hate (e.g. dark patterns) and a rickety codebase that everyone is disincentivised to fix or improve.


And you'd be dead wrong.

Career QA here. E2E tests are the absolute highest level of test: they take the most time to implement, have the most dependencies, and tell you the least about what is actually wrong.

If you think finding what broke is painful with full suites of unit/integration tests underneath the E2E suite, try throwing those out or not maintaining them at all. Let me know how it goes.



Must reads:

- Categorizing Variants of Goodhart's Law [1]

- Building less-flawed metrics: Understanding and creating better measurement and incentive systems [2]

[1] https://arxiv.org/abs/1803.04585

[2] https://www.sciencedirect.com/science/article/pii/S266638992...


Yea, I've seen this get carried away even by just individual team members.

My personal favorite was tests that a team member introduced for an object that had all the default runtime parameters in it. Think of something like how long a timeout should be by default, before getting overridden by instance-specific settings. This team member introduced a test that checked that each value was the same as it was configured to be.

So if I wanted to update or add a default, I had to write it in two places, the actual default value, and the unit test that checked if the defaults were the same as the test required.


I'm convinced high coverage targets cause people to write worse code.

With Java at least, it seems to drive people to do things like not catch exceptions. It's hard to inject errors so all the catch blocks are hit.

Also makes people not want to switch to use record classes, as that feature removes class boilerplate that is easy to cover.


I’ve started getting pinged for this at my current job. I think it’s time to move on.



Is it hard to write ONE test case for ALL getters and setters using reflection?


At that point, what are you testing? Allocation/deallocation? Nothing is actually happening unless you depend on constructors/destructors when you use reflection to generate a mock type or real value type to inject. Or do you mean just asserting that all getters are not null?

I agree that unit testing for the sake of KPI's is the wrong approach and unit testing functionality (as a means of documenting it, proving it still works as intended) is far better.


> at that point what are you testing?

Memory.

> Nothing is actually happening unless you depend on constructors/destructors when you use reflection to gen a mock type or real value type to inject.

Just get an object from the IoC container, then for every property Foo that has both get and set methods, call setFoo with a value and assert that getFoo returns it back; that gives you 80% coverage for those getters and setters. For read-only properties, just get the value and throw it away.
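
A rough Java/JUnit 5 sketch of that reflection approach (the DTO and the sample-value helper are hypothetical, and it only handles the simple cases; boolean "is" getters, for example, are ignored):

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.lang.reflect.Method;

    import org.junit.jupiter.api.Test;

    class DtoAccessorTest {

        // Hypothetical DTO; in practice you would iterate over a list of DTO classes.
        static class UserDto {
            private String name;
            private int age;
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
            public int getAge() { return age; }
            public void setAge(int age) { this.age = age; }
        }

        @Test
        void settersAndGettersRoundTrip() throws Exception {
            UserDto dto = new UserDto();
            for (Method setter : UserDto.class.getMethods()) {
                if (!setter.getName().startsWith("set") || setter.getParameterCount() != 1) {
                    continue;
                }
                Object sample = sampleValue(setter.getParameterTypes()[0]);
                setter.invoke(dto, sample);

                // setFoo -> getFoo: assert the value round-trips unchanged.
                Method getter = UserDto.class.getMethod("get" + setter.getName().substring(3));
                assertEquals(sample, getter.invoke(dto));
            }
        }

        // Very small sample-value factory; a real one would cover more types.
        private static Object sampleValue(Class<?> type) {
            if (type == int.class || type == Integer.class) return 42;
            if (type == String.class) return "sample";
            throw new IllegalArgumentException("No sample value for " + type);
        }
    }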

> I agree that unit testing for the sake of KPI's is the wrong approach and unit testing functionality (as a means of documenting it, proving it still works as intended) is far better.

Unit tests are for proving the correctness of the logic in a unit of code. Data structures are not logic. IMHO, such trivial parts of a program should be ignored by the coverage tool, but the topic starter said that they cannot change the rules. :-/


> > at that point what are you testing?

> Memory.

It doesn't make sense to test something you have no control over; if the OS fails to allocate memory, there's nothing you can change in your code to fix that.

Likewise, if you're improperly instructing the OS to allocate memory (via some constructor or factory or whatever) there is no test you can write to test your own intentions that won't be subject to exactly the same level of incorrectness as the implementation code itself. If you've written "do the thing" writing a "test" that says "hey make sure I wrote 'do the thing'" is ridiculous.

I've often found this to be a matter of semantics. Someone will say they're writing a "unit test" and load all the assertions and logic associated with the meaning of "unit test" into their brain and then try to apply it in a way that subtly invalidates those assertions when they start writing.

To make things worse, there's a cultural disinclination toward "semantic arguments", so you end up arguing with people who have basically no chance of understanding why what they're doing doesn't provide the value it should.


> > at that point what are you testing?

> Memory.

In unit testing, the "memory" is not the system under test though.


You shouldn't have setters in the first place. What's the point of encapsulation if anyone can just set properties at will?


Sometimes the value you pass to a setter can go through additional logic to determine the final value to be set e.g.:

    private int value1;

    private int value2;

    public void setValue2(int value2) {
        this.value2 = value2;                  
    }
    
    public void setValue1(int value1) {
        if (this.value2 > 0) {
            this.value1 = value1 + this.value2;
        } else {
            this.value1 = value1;
        }
    }

Obviously this is a contrived example but if you have logic other than a simple "this.value = value", then you might want to unit test that bit.
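
For instance, a test for that setter logic might look like the sketch below (assuming the class, here called Widget, also exposes a plain getValue1() getter, which the snippet above omits):

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class WidgetSetterTest {

        @Test
        void setValue1AddsValue2WhenValue2IsPositive() {
            Widget w = new Widget();
            w.setValue2(5);
            w.setValue1(10);
            assertEquals(15, w.getValue1());
        }

        @Test
        void setValue1IsStoredAsIsWhenValue2IsNotPositive() {
            Widget w = new Widget();
            w.setValue2(0);
            w.setValue1(10);
            assertEquals(10, w.getValue1());
        }
    }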


Also useful if you want to add logging, or capture some metrics, or set a breakpoint.


Ah, the sound of people not using their type system properly.

Parse, don't validate.


But also don’t create AggregateNonNegativeInteger type when generics will do.


I had to do it in one project. It's not trivial, but also not too difficult. Getters were easy; setters required different types of values. After handling various date-type values for the setters, it worked fine. Occasionally I would see an exception for fields generated by libraries like Lombok, which needed to be excluded from my setter list.

I didn't like that I had to do it, but it was easier than getting several approvals to change the Sonar rules.


Or stop using the broken pattern of getters and setters


I ask interview candidates to explain why they’re using getters or setters. They never have a reason.


Would "because it's conventional and most tools and developers assume them" be an acceptable answer? The reason I read is "because you might wanna change the implementation" but honestly it's seems very rare to need to do that


Because your data structure and your API are separate things, and generally speaking the data structure should be opaque to users of that structure, unless its sole purpose is as a record for a bunch of request arguments. For those fat request structures it ends up being a lot of noise indeed, but in any case most of those structures, for most people, are dealt with by deserialization, and you can often skip or autogenerate the setters and getters.

Do you want to return a copy/clone of the inner object, or just let people mutate it, destroying all of your data structure's invariants? Yes, if you change the implementation of the structure, user code will also break, and it is certainly easier to do that behind a method. This is indeed rare in practice, but the nuisance of migrating direct access to indirect access is bigger than the nuisance of (today's) unnecessary indirection.

Since in some cases safety around invariants and future-proofing will require the level of indirection, it is easier to just expect the convention. Moreover, codegen can just produce them for you, and they can be excluded from test coverage. Then there are languages like Python that allow you in the future to pretend that indirect access is direct access, and cannot protect invariants in any case, so just go direct from day one.
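
A small Java sketch of the invariants point (the Team class is invented for illustration): the getter hands out a copy, so callers can't mutate the internal list behind the object's back.

    import java.util.ArrayList;
    import java.util.List;

    final class Team {
        private final List<String> members = new ArrayList<>();

        void addMember(String name) {
            // Invariant enforced in one place: no blank names.
            if (name == null || name.isBlank()) {
                throw new IllegalArgumentException("member name must not be blank");
            }
            members.add(name);
        }

        // Getter returns a copy; exposing the field directly would let callers
        // bypass addMember() and break the invariant.
        List<String> getMembers() {
            return List.copyOf(members);
        }
    }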


Yeah, that's the textbook explanation, I guess. So I'll stick to it in interviews, as useless as it is in real life.

All the hoops for this... Codegen, excluding things from coverage, Lombok, and all that magic to hide a simple obj.field, which is all you're doing in 99.9% of cases. It's just so rare in practice. I think the language itself should allow for read-only fields and property setters for the extremely rare occasions where you need this.


Let's say you have a type. Let's call it List.

When you create a List you want to enable the possibility of setting a Capacity - how large can this List be.

And you also want to enable reading how large the List is at a given point.

Pretty valid case for a setter and a getter in my book.
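
Something like this hypothetical sketch is what I have in mind (names invented): the capacity setter validates, the size getter just reads, and neither is a bare field access.

    import java.util.ArrayList;
    import java.util.List;

    final class BoundedList<T> {
        private final List<T> items = new ArrayList<>();
        private int capacity;

        BoundedList(int capacity) {
            setCapacity(capacity);
        }

        // Setter with validation, not a bare field write.
        void setCapacity(int capacity) {
            if (capacity < items.size()) {
                throw new IllegalArgumentException("capacity below current size");
            }
            this.capacity = capacity;
        }

        int getCapacity() { return capacity; }

        // Read-only view of the current size.
        int size() { return items.size(); }

        void add(T item) {
            if (items.size() >= capacity) {
                throw new IllegalStateException("list is full");
            }
            items.add(item);
        }
    }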


Simplest solution? Don’t have any setters.


Removing them would reduce your coverage percentage.


Good thing. It shows that a % of your tests were actually not targeted at the most important areas of the code base


The correct solution is not having either. Getters/setters are effectively an anti-pattern: they support needless mutability and, worse, they add nothing of value.

As a general rule, most fields should be initialized in the constructor, and they should be final.
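
In Java that roughly translates to this minimal sketch (the Money names are invented): final fields set once in the constructor, or, on JDK 16+, simply a record.

    // Classic immutable value class: fields are final and set once in the constructor.
    final class Money {
        private final String currency;
        private final long cents;

        Money(String currency, long cents) {
            this.currency = currency;
            this.cents = cents;
        }

        String currency() { return currency; }
        long cents() { return cents; }
    }

    // The record equivalent: accessors, equals/hashCode and toString come for free,
    // and there are no setters to write, test, or cover.
    record MoneyRecord(String currency, long cents) {}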


True.

However we might think about this differently if we flipped it to "our standard is 20% completely untested".

Uncoverage communicates the value much better.


This is called "Goodhart's Law"


What if…

They knew that people would write coverage tests for getters and setters, and calculated that eventuality into their minimums.


So you're saying they knew engineers would be wasting their time doing useless things, but still went ahead? (instead of mandating 75% and spending 1/100 of the wasted time to adjust the metric to filter out getter/setter)


I just assumed they wrote a script to automatically generate all the setter/getter tests and then took a long lunch.


At least they didn’t do what IBM did, write tests and pay coding farms to write the code to satisfy the unit tests.


True of any metric when it becomes the goal in itself.


> 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage.

The problem I have with LLM generated tests is that it seems highly likely that they'd "ratify" buggy behavior, and I'd think that'd be especially likely if the code-base already had low test coverage. One of the nice things about writing new tests by hand is you've got someone who can judge if it's the system being stupid or if it's the test.

At a minimum they should be segregated in a special test folder, so they can be treated with an appropriate level of suspicion.


Writing tests is indeed a great opportunity for finding bugs.

But a codebase with good test coverage allows you to safely perform large-scale refactorings without regressions, and that's a useful property even if you have bugs and the refactoring preserves them faithfully.

The risk of using a tool that generates tests designed to encode the current behaviour is that you may be lulled into a false sense of safety, while all you've done is encode the current behaviour, as advertised.

Perhaps this problem can be solved simply by not calling these tests "tests" but something like "behavioural snapshots" (cannot think of a better name, but the idea is to capture that they are not necessarily meant to encode the correct behaviour, just the current behaviour).
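
A minimal Java/JUnit sketch of such a "behavioural snapshot" (these are sometimes called characterization tests); the formatter and the pinned string are invented for illustration:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class LegacyFormatterSnapshotTest {

        @Test
        void pinsCurrentBehaviour() {
            // The expected string was captured by running the code once, not derived
            // from a spec. If this test fails, behaviour changed; whether that change
            // is a bug fix or a regression still needs a human judgement.
            assertEquals("JOHN DOE (age 42)", LegacyFormatter.describe("john doe", 42));
        }

        // Stand-in for some existing production code.
        static class LegacyFormatter {
            static String describe(String name, int age) {
                return name.toUpperCase() + " (age " + age + ")";
            }
        }
    }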



I'm a huge fan. But sometimes it's hard to produce the snapshots.


Sometimes you could replace snapshots with checksums. In other words, they're just alerts that code changed, or make changing code unnecessarily tedious.


Unit tests are dead weight during refactoring. Integration tests are useful.

Code coverage metrics are most easily met by writing unit tests. Unit tests are tedious to write.

If you have a robot writing unit tests, I do not want to see your codebase. Refactor all day long. I’m not going near it.


It is not at all uncommon for me to refactor the internals of a unit (ie, a class) in order to add a feature. Unit tests are extremely useful here.


Meta is specifically writing unit tests, where probably every part of the system except the code being tested is mocked.

Those tests tend to have very low value when testing significant refactorings, as the code the tests cover often won't exist anymore.


> One of the nice things about writing new tests by hand is you've got someone who can judge if it's the system being stupid or if it's the test.

This is an instance of a more general problem, which I call "the problem of unwanted change". If you have an automated system which can change itself, how do you know a change is actually intended/correct or merely a symptom of a bug, failure or imperfect knowledge the automation has?

That's why I think human supervision is always needed to an extent, to determine what scenario has occured.

This happens in all sorts of systems. And people tend to think they can solve this with just another layer of automation, like here. Testing was originally invented as a way to check if the program works correctly. If you automate it, you will face the same problem, just with more code (in the form of tests rather than assertions).


On the other hand, in a codebase where test coverage is poor and average tenure of engineers is ~1yr, a significant impediment is getting the initial test scaffold set up; maybe I don’t know how to build factories for all the incidental inputs required to test my code, but I do know how the code itself should behave.

If LLMs can help you to scaffold a test, and make it easy for you to write the business logic validation, I think that would be a big win.

On the other hand, if the generated tests are like most UTs, they will be over-coupled to the implementation and therefore slow down development. You might even see folks start deleting all the tests and regenerating them as part of a big diff, if it’s too hard to fix individual test cases that are failing due to coupling rather than any logic issue.


In sufficiently large systems there is some value in test that just detect changed behavior, even if the behavior is buggy. Parts of the code probably rely on the bugs and accidentally (or intentionally) fixing them can lead to more severe problems.

Of course these kinds of tests are no replacement for tests that check actual requirements.


> Parts of the code probably rely on the bugs

The combination of pieces is what should be tested then, not the behavior of each individual piece “just in case something needs it that way”. You’ll never be able to change anything with that approach. And why shouldn’t bugs be fixed anyways?


It’s not that bugs shouldn’t be fixed it’s that they should be fixed intentionally, and with awareness of what the fix might break so you can make it backwards compatible if necessary.


> LLM generated tests is that it seems highly likely that they'd "ratify" buggy behavior

For a new project or a project under active development, I agree that auto generating tests is probably a bad idea. But there are countless legacy systems with low coverage that are in maintenance mode. For those, generating tests that verify the current behavior is super useful. It lets someone make a change, and see that everything else stayed the same.


> you've got someone who can judge if it's the system being stupid or if it's the test.

but why couldn't this be done even with the llm generated test cases?


In that case, I think the point is the difference between what is LLM "assisted" tests (like say Copilot) vs. LLM "owned" tests.


If a test is worth having, it’s worth writing by hand.

Throw away mandatory code coverage tech debt instead of adding artificially-intelligent tech debt on top.

The best I can see a use for something like this is more like a linter than a test writer. “Robot, find weird things in the code and bring them to me for review.”

Then you write the tests yourself.


But this tool doesn’t know what is weird and what isn’t. What you’re talking about is something more like a fuzzer or static analyzer.


Yeah, but that means you can really live Hyrum's Law.


Keeping them separated would also improve future training.


I mean if I can use it to generate coverage for a method then prune the tests it will still save me hours.

Might also be useful when you want to refactor a legacy system or try to figure out the input space for a module, library or method.


From reading the PDF it seems that this ‘merely’ generates tests that will repeatedly pass i.e. that are not flaky. The main purpose is to create a regression test suite by having tests that pin the behaviour of existing code. This isn’t a replacement for developer written tests, which one would hope come with the knowledge of what the functional requirement is.

Almost 20 years ago the company I worked for trialled AgitarOne - its promise was automagically generating test cases for Java code that help explore its behaviour. But also Agitar could create passing tests more or less automatically, which you could then use as a regression suite. Personally I never liked it, as it just led to too much stuff and it was something management didn’t really understand - to them if the test coverage had gone up then the quality must have too. I wonder how much better the LLM approach FB talk about here is compared to that though…

http://www.agitar.com/solutions/products/agitarone.html


A lot of unit tests generated that way will simply be change detectors (fail when code changes) rather than regression tests (fail when a bug is re-introduced). Those are pretty big distinctions. I don't see LLMs getting there until they can ascertain test correctness without just assuming good tests pass or depending on an oracle (the prompt will have to include behavior expectations somehow).


This articulates the problem I’m having right now in an interesting way. I’m fine writing unit tests that validate business logic requirements or bug fixes, but writing tests that validate implementations to the point that they reimplement the same logic is a bit much.

I want to figure out how to count the number of times a test has had to change with updated requirements vs how many defects they’ve prevented (vs how much wall clock time / compute resources they’ve consumed in running them).


1. Define your APIs in terms of "what" it should do, not "how" (which is for the implementation).

2. Use protocols/interfaces in Swift/Java to define APIs.

3. Then write tests to the API's public contract, without using internal implementation details.

Tests written in the above way will actually detect bugs, and stay stable to internal implementation changes that don't affect the external behavior.
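
A hedged sketch of point 3 in Java (the ShoppingCart interface and test are invented for illustration): the test talks only to the public contract, so swapping the implementation doesn't touch it.

    import static org.junit.jupiter.api.Assertions.assertFalse;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    import java.util.HashSet;
    import java.util.Set;

    import org.junit.jupiter.api.Test;

    // The "what": a public contract, with no mention of how it is stored.
    interface ShoppingCart {
        void add(String sku);
        boolean contains(String sku);
    }

    // One possible "how"; the test only references this type in one place.
    class InMemoryCart implements ShoppingCart {
        private final Set<String> skus = new HashSet<>();
        public void add(String sku) { skus.add(sku); }
        public boolean contains(String sku) { return skus.contains(sku); }
    }

    class ShoppingCartContractTest {
        // Swap in any other implementation here and the test still applies.
        private final ShoppingCart cart = new InMemoryCart();

        @Test
        void addedItemsAreReported() {
            cart.add("sku-123");
            assertTrue(cart.contains("sku-123"));
            assertFalse(cart.contains("sku-999"));
        }
    }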


Point 3 is the key. Code coverage should only be measured by tests that only use the “external API.”


Brilliant distillation of this insight, I've never heard it put in those words before but it's perfect. It cuts both ways too, if you have lots of tests but most of them aren't really exercising the external API, then you're worse off.


> I want to figure out how to count the number of times a test has had to change with updated requirements vs how many defects they’ve prevented

I did the same some years back on a project that had both a unit test suite with pretty high code coverage and an end-to-end suite as well. The results for the unit test suite were abysmal. The number of times they caught an actual regression over a couple of months was close to zero. However, the number of times they failed simply because code was changed due to new business requirements was huge. In other words: they provided close to zero value while at the same time having high maintenance costs.

The end to end suite did catch a regression now and then, the drawback of it was the usual one, it was very slow to run and maintaining it could be quite painful.

The moral of the story could have been to drastically cut down on writing unit tests. Or maybe to write them while implementing a new ticket or fixing a bug, but throw them away after it went live. But of course this didn't happen. It sort of goes against human nature to throw away something you just put a lot of effort into.


That's what I believe Facebook has created here, so you're right, 'regression' is a big word - the tests are more likely detecting change, e.g. by asserting the existing behaviour of conditionals previously not executed.


And it will lock the system into behaviour that might just be accidental. The value of tests is to make sure that you don't break anything that anyone cares about, not that every little never-used edge case behaviour, which might just be an artefact of a specific implementation, is locked in forever.


This is my experience as well. The problem is that persisting "but what _shall_ it do?" at a low level is seen as redundant, as long as everything works. Typically, forgotten edge cases are detected elsewhere. The fact _that_ you ran past those code lines says nothing about whether you got there for the right reason.


In my experience writing tests is generally an outstanding method to determine code quality.

If a test is complicated or coverage is hard to achieve, it's likely your tested code needs improvement.


Testability of code is indeed a good benchmark for code quality. The things that make it hard to test code are also the things that are associated with low quality code.

Something with low coupling, high cohesion, and low complexity should be easy to unit tests.


Competitive programming code is extremely easy to test, but many would argue it isn't high quality.


  In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers.
…is that a good rate? I guess I have to read more and see if the unacceptable ones were silly mistakes like the ones that make us all do code review, or serious ones. I don’t think a human engineer with 25% failure rate would be very helpful, if it’s a certain kind of failure.

  As part of our overall mission to automate unit test generation for Android code, we have developed an automated test class improver, TestGen-LLM.
Is that a good mission? I feel like the TDD people are turning over in their graves, or at least in their beds at home. But again something tells me that they caveat this later


There is a lot of testless code in Facebook, nobody gets PSC points for that.


At unlogged.io, for some time our primary focus was to auto-generate JUnit tests. The approach didn't take off for a few reasons:

1. A lot of generated test code that no devs wanted to maintain.

2. The generated tests didn't simulate real-world scenarios.

3. Code coverage was a vanity metric. Devs worked around it to reach their goals with scenarios that didn't matter.

We are currently working on offering no-code replay tests that simulate all unique production scenarios, which developers can replay locally while mocking external dependencies.

Disclaimer: I am a founder at unlogged.io


I want to go the other way. Let me feed acceptance criteria in, have it generate tests that check them, and only then generate code that passes the tests.

You can get close to this with Copilot, sometimes, in a fairly limited way, but why do I feel like nobody is focusing on doing it that way round?


TestGen-LLM is such a strange creation. I can see how it could be used as a first step in a refactoring or rewrite, but the emphasis on code coverage in the paper seems totally brain-broke. I suppose it'd be great if your org is already brain-broke and demanding high coverage, but TestGen-LLM won't make your project's code better in any way, and it'll increase the friction involved in actually implementing improvements. It'd be much more useful to generate edge-case tests that may or may not be passing, but TestGen-LLM relies on compiler errors and failing tests to filter out LLM garbage. The lack of any examples of the generated tests in the paper makes me suspect that they're the same as the rest of the LLM-generated code I have seen: amateurish.


I recently had to refactor a project that had no tests whatsoever. Having LLMs automatically generate a first draft of tests was very helpful, even just to understand what the code was supposed to do.


I'll admit it's interesting: a 12-page paper by Meta employees to promote AI for developers. They even brought out the Sankey diagram.

I'm probably wrong, but if it's published in this way, shouldn't the information be given to reproduce it?

Edit: This is not tinfoil hat; I just don't have the kind of data that Meta has to learn from. So, maybe they released something?


If it's anything like Google, it's way too intimately tied to their infra and monorepo to release


For these big tech companies, infrastructure is optimized for high scale & reliability over velocity & flexibility.


It’s an FSE 2024 paper, so I’m guessing the artifacts need theory or formal evaluation.


I wonder what would be the cost of maintaining some huge auto-generated corpus of tests in the future. They need to provide some automated way not only to generate cases, but also to update them.


So write unit tests automatically, change code later and then regenerate the unit tests? Now the code has a bug but the unit tests pass. I'm already seeing this today with devs using ChatGPT to quickly get the "test boilerplate" over and over.


You still need a human in the loop. I doubt Meta are letting it run blindly. More just automating part of the process and having a human decide what is and isn't committed.


> In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage.

That doesn't seem great?


Their abstract doesn't match their actual paper contents. That's unfortunate. Their summary indicates rates in terms of test cases (repeating your quote a bit):

> 75% of test cases built correctly, 57% passed reliably [implying test cases by context], and 25% increased coverage [same implication]

The actual report talks about test classes, where each class has one or more test cases.

> (1) 75% of test classes had at least one new test case that builds correctly.

> (2) 57% of test classes had at least one test case that builds correctly and passes reliably.

> (3) 25% of test classes had at least one test case that builds correctly, passes and increases line coverage compared to all other test classes that share the same build target.

Those are two very different statements. They even have a footnote acknowledging this:

> For a given attempt to extend a test class, there can be many attempts to generate a test case, so the success rate per test case is typically considerably lower than that per test class.

But then in their conclusion they misrepresent their findings again, like the abstract:

> When we use TestGen-LLM in its experimental mode (free from the confounding factors inherent in deployment), we found that the success rate per test case was 25% (See Section 3.3). However, line coverage is a stringent requirement for success. Were we to relax the requirement to require only that test cases build and pass, then the success rate rises to 57%.


It looks like it only gets committed if it passes all of those checks (i.e. no humans have to look at it unless it's actually passing reliably and does increase code coverage). 25% code coverage improvement that passes reliably, for the cost of the electricity required for the GPUs, seems pretty cheap.

Of course, this doesn't seem like it's going to replace engineers, but it'll help the organization out for relatively low cost.
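The filtering the paper describes is roughly this (a sketch, not Meta's actual pipeline; the helper callables and the repeat count are hypothetical):

    # Keep a generated test only if it builds, passes repeatedly (no flakes),
    # and raises line coverage over the existing suite.
    def keep_candidate(candidate_test: str, build_ok, run_ok,
                       line_coverage_with, baseline_coverage: float,
                       runs: int = 5) -> bool:
        if not build_ok(candidate_test):                          # filter 1: builds
            return False
        if not all(run_ok(candidate_test) for _ in range(runs)):  # filter 2: passes reliably
            return False
        return line_coverage_with(candidate_test) > baseline_coverage  # filter 3: adds coverage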


> 25% code coverage improvement

The paper does not report code coverage improvement and it is probably not 25%. The paper does say this:

> The median number of lines of code added by a TestGen-LLM test in the test-a-thon was 2.5.


All three are machine-verifiable, so you can easily filter out the ones that don't work, right?


Considering the human-written tests should already have a high coverage rate, if the 25% that increased coverage were actually good tests, I think it's a useful tool.


I was actually surprised they had less than 75% coverage to start with.


25% of test cases increased coverage; it's not that coverage was increased by 25%. For example, if they started with 90% coverage, generated 100 test cases, and each of the 25 coverage-increasing cases added 0.1 percentage points, final coverage would be 92.5%.
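In code, with made-up numbers purely to show the difference:

    start = 90.0                  # hypothetical starting line coverage, in %
    cases = 100                   # generated test cases
    useful = int(0.25 * cases)    # 25% of the cases each add coverage...
    final = start + useful * 0.1  # ...say 0.1 percentage points apiece
    print(final)                  # 92.5 -- not 90 + 25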


Not sure about improving existing tests, but for Tunnelmole (https://github.com/robbie-cahill/tunnelmole-client) I have used GPT-4 to generate unit tests just by showing it a TypeScript module and asking it to create tests.


I am currently looking into building something like this for a client with large (several projects of >3M LoC) and old (started in 2001) Java projects with low coverage.

Interesting to read how it would work if you already have good coverage (as I assume Meta does).


Is there still no type-theoretic answer to unit testing? Doesn't the type or the class generally contain all the necessary information to unit test itself, assuming it's a unit? That is, "theoretically" we shouldn't even have to write these. Just hit "compiler --unit_test <type>".


What you're describing is more or less fuzzing [1], at the unit level. I can't remember the names, but there are tools that work like this at runtime (i.e. you define a test, and a testing library generates runs based on input/output types and other user-defined constraints).

There's almost always more business logic to what a unit should do than its types, though. Depending on the language, the type system can only encode so much of that logic.

Consider the opposite: can't the compiler generate implementations from types and interfaces? In most cases, no. LLMs are filling some of that gap, though, because they can use surrounding context to return a high-probability implementation (completion) from the interface or type definition.

[1] https://en.m.wikipedia.org/wiki/Fuzzing
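Concretely, property-based testing tools like Hypothesis work this way: inputs are generated from types/strategies, and the test asserts a property the type system can't express. A minimal sketch (slugify here is a made-up unit under test):

    from hypothesis import given, strategies as st

    def slugify(title: str) -> str:
        return "-".join(title.lower().split())

    @given(st.text())
    def test_slugify_never_contains_spaces(title):
        assert " " not in slugify(title)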


It’s papers like this that will act as the justification of the next round of FAANG lAIoffs. Regardless of how successful this approach is in the long term.


Happy to see such a huge team collaborating on a project at any company.

Perhaps because it involves LLMs, and LLMs are hot; everyone wants a piece of it.


So, assign to: LLM when lots of tests are broken after the next refactoring!


"Automated Unit Test Improvement using Large Language Models at Meta" (2024) https://arxiv.org/abs/2402.09171 :

> This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. [...] We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement.

Coverage-guided unit test improvement might [with LLMs] be efficient too.

https://github.com/topics/coverage-guided-fuzzing :

- e.g. Google/syzkaller is a coverage-guided syscall fuzzer: https://github.com/google/syzkaller

- Gitlab CI supports coverage-guided fuzzing: https://docs.gitlab.com/ee/user/application_security/coverag...

- oss-fuzz, osv
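For a sense of what a coverage-guided fuzz target looks like at the unit level, a minimal sketch with Google's Atheris (assumes pip install atheris; parse_record is a made-up function under test):

    import sys
    import atheris

    with atheris.instrument_imports():
        from mylib import parse_record  # hypothetical function under test

    def TestOneInput(data: bytes):
        fdp = atheris.FuzzedDataProvider(data)
        try:
            parse_record(fdp.ConsumeUnicodeNoSurrogates(256))
        except ValueError:
            pass  # rejecting malformed input is fine; anything else is a finding

    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()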

Additional ways to improve tests:

Hypothesis and pynguin generate tests from type annotations.
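For instance, Hypothesis can derive input generators straight from annotations via st.from_type (a sketch; the dataclass and function are made up):

    from dataclasses import dataclass
    from hypothesis import given, strategies as st

    @dataclass
    class Order:
        item_id: int
        quantity: int

    def total_units(orders: list[Order]) -> int:
        return sum(o.quantity for o in orders)

    @given(st.lists(st.from_type(Order)))
    def test_total_units_is_additive(orders):
        assert total_units(orders + orders) == 2 * total_units(orders)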

There are various tools to generate type annotations for Python code;

> pytype (Google) [1], PyAnnotate (Dropbox) [2], and MonkeyType (Instagram) [3] all do dynamic / runtime PEP-484 type annotation type inference [4] to generate type annotations. https://news.ycombinator.com/item?id=39139198

icontract-hypothesis generates tests from icontract DbC Design by Contract type, value, and invariance constraints specified as precondition and postcondition @decorators: https://github.com/mristin/icontract-hypothesis

Nagini and deal-solver attempt to Formally Verify Python code with or without unit tests: https://news.ycombinator.com/item?id=39139198

Additional research:

"Fuzz target generation using LLMs" (2023) https://google.github.io/oss-fuzz/research/llms/target_gener... https://security.googleblog.com/2023/08/ai-powered-fuzzing-b... https://hn.algolia.com/?q=AI-Powered+Fuzzing%3A+Breaking+the...

OSSF//fuzz-introspector//doc/Features.md: https://github.com/ossf/fuzz-introspector/blob/main/doc/Feat...

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C43&q=Fuz... :

- "Large Language Models Based Fuzzing Techniques: A Survey" (2024) https://arxiv.org/abs/2402.00350 : > This survey provides a systematic overview of the approaches that fuse LLMs and fuzzing tests for software testing. In this paper, a statistical analysis and discussion of the literature in three areas, namely LLMs, fuzzing test, and fuzzing test generated based on LLMs, are conducted by summarising the state-of-the-art methods up until 2024


Thanks for sharing this. By far the best tool I've seen on the market centered on code integrity is CodiumAI (https://www.codium.ai/). It generates unit tests based on entire code repos and integrates into the SDLC through a PR Agent on GitHub or GitLab. My whole team uses it.


Any take on whether an LLM trained solely on formally verified code will generate unverifiable code?


Meta have been publishing fire content in recent years.



