Where is this claim from? The Codex paper claims 28.8% on OpenAI's novel dataset:
On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%.
https://arxiv.org/abs/2107.03374

https://gpt3demo.com/apps/openai-codex claims the current version reaches 37% accuracy, so they may have continued to improve it. I assume that's on the original dataset they used, but I can't find a source for that.
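(For anyone wondering what "solves 28.8% of the problems" actually measures: HumanEval checks functional correctness by running generated samples against each problem's unit tests and reporting pass@k. A rough sketch of the unbiased estimator given in that paper, with function and argument names of my own choosing:)

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator from the Codex paper (arXiv 2107.03374).

        n: samples generated per problem
        c: samples that passed all of the problem's unit tests
        k: number of samples you're allowed to submit
        """
        if n - c < k:
            return 1.0  # not enough failing samples for a draw of k to miss entirely
        # 1 - probability that all k drawn samples fail the tests
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # e.g. 200 samples per problem, 30 of them passing, budget of 1 submission:
    # pass_at_k(200, 30, 1) == 0.15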
The numbers I quote (5-7%) are on the APPS dataset, reported by SalesForce:
https://arxiv.org/abs/2207.01780
(Recently shared on HN).
Note that there is no standard dataset like ImageNet or MNIST in program synthesis, so it's not correct to speak of a "benchmark". LLM code generators in particular tend to be evaluated on whatever dataset their creators felt best demonstrated their system's capabilities.