
Where is this claim from? The Codex paper claims 28.8% on OpenAI's novel dataset:

On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%.

https://arxiv.org/abs/2107.03374
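
(For anyone unfamiliar: "functional correctness" here means a completion counts as solved only if it passes the task's unit tests, rather than matching a reference solution textually. The paper reports pass@k and gives a numerically stable unbiased estimator for it; a minimal numpy sketch of that estimator, variable names mine:)

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # n: completions sampled per task, c: how many pass all tests,
        # k: evaluation budget; the 28.8% figure is pass@1
        if n - c < k:
            return 1.0  # every size-k subset contains a passing sample
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

For k=1 this telescopes to c/n, i.e. pass@1 is just the fraction of single samples that pass their tests.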

The numbers I quote (5-7%) are on the APPS dataset, reported by Salesforce:

https://arxiv.org/abs/2207.01780

(Recently shared on HN).

Note that there is no standard dataset like ImageNet or MNIST in program synthesis, so it's not correct to speak of a single "benchmark". LLM code generators in particular tend to be evaluated on whatever dataset their creators thought best demonstrates their system's capabilities.



https://gpt3demo.com/apps/openai-codex They claim the current version has 37% accuracy, so they may have continued to improve it. I assume that's on the original dataset they used, but I can't find a source for that.


Thanks. I guess we'll have to wait for a publication to know what was done exactly.



