Where is this claim from? The Codex paper claims 28.8% on OpenAI's novel dataset:
On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%.
https://arxiv.org/abs/2107.03374

https://gpt3demo.com/apps/openai-codex claims the current version reaches 37% accuracy, so they may have continued to improve it. I assume that's on the original dataset they used, but I can't find a source for that.
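(For anyone wondering what "solves 28.8% of the problems" actually measures: HumanEval checks functional correctness by running generated samples against each problem's unit tests and reporting pass@k. A rough sketch of the unbiased estimator given in that paper, with function and argument names of my own choosing:)

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator from the Codex paper (arXiv 2107.03374).

        n: samples generated per problem
        c: samples that passed all of the problem's unit tests
        k: number of samples you're allowed to submit
        """
        if n - c < k:
            return 1.0  # not enough failing samples for a draw of k to miss entirely
        # 1 - probability that all k drawn samples fail the tests
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # e.g. 200 samples per problem, 30 of them passing, budget of 1 submission:
    # pass_at_k(200, 30, 1) == 0.15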
The numbers I quote (5-7%) are on the APPS dataset, reported by SalesForce:
https://arxiv.org/abs/2207.01780
(Recently shared on HN).
Note that there is no standard dataset like ImageNet or MNIST in program synthesis, so it's not correct to speak of a "benchmark". LLM code generators in particular tend to be evaluated on whatever dataset their creators felt best demonstrated their system's capabilities.