One test case I've used with code-generating LLMs: generate a stoplight.
In short,
Generate a stoplight with unit tests.
I have not yet found a model that produces sufficiently safe code. I haven't tested this in a while, but I don't expect any current LLM to be sufficient at this task.
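For concreteness, here's a minimal sketch of the shape of the task, with hypothetical names (Light, Stoplight, tick): a two-direction controller whose core safety property is that both directions are never green at once. A real controller also needs a yellow phase and an all-red clearance interval for the cross direction, which this sketch deliberately omits; gaps like that are part of what "sufficiently safe" has to cover.

```python
from enum import Enum
import unittest


class Light(Enum):
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


# Fixed cycle for the north-south direction; the east-west direction is
# derived from it so the controller cannot emit green/green by construction.
CYCLE = [Light.GREEN, Light.YELLOW, Light.RED]


class Stoplight:
    def __init__(self):
        self._i = 0  # index into CYCLE for the north-south direction

    @property
    def ns(self) -> Light:
        return CYCLE[self._i]

    @property
    def ew(self) -> Light:
        # Cross traffic may only be green while north-south is red.
        return Light.GREEN if self.ns is Light.RED else Light.RED

    def tick(self) -> None:
        self._i = (self._i + 1) % len(CYCLE)


class StoplightTests(unittest.TestCase):
    def test_never_both_green(self):
        light = Stoplight()
        for _ in range(100):
            self.assertFalse(light.ns is Light.GREEN and light.ew is Light.GREEN)
            light.tick()

    def test_green_only_after_red(self):
        light = Stoplight()
        for _ in range(100):
            before = light.ns
            light.tick()
            if light.ns is Light.GREEN:
                self.assertIs(before, Light.RED)


if __name__ == "__main__":
    unittest.main()
```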
Maybe an LLM could generate a safe stoplight with formal methods as a primary meta-procedure? From https://news.ycombinator.com/item?id=44889967 re: formal methods:
> We should not expect an LLM trained solely on formally-verified code to produce formally-verified code. I don't think that also training on specs and hateful training material will fix that.
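That said, for a system this small the gap between testing and verification can at least be made concrete: the sketch above has only three reachable states, so the safety invariant can be checked exhaustively rather than sampled, which is a toy version of what a formal-methods-first procedure would buy you. (This reuses the hypothetical Stoplight and Light from the sketch above.)

```python
# Toy model check: enumerate every reachable state of the Stoplight sketch
# and assert the safety invariant in each one. Unlike the unit tests, this
# covers the whole (tiny) state space rather than a sampled trace.
def check_invariant_exhaustively() -> None:
    light = Stoplight()
    seen = set()
    while light._i not in seen:  # stop once the cycle repeats
        seen.add(light._i)
        assert not (light.ns is Light.GREEN and light.ew is Light.GREEN), \
            "unsafe state: both directions green"
        light.tick()
    print(f"invariant holds in all {len(seen)} reachable states")


check_invariant_exhaustively()  # -> invariant holds in all 3 reachable states
```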