Re-Evaluating GPT-4's Bar Exam Performance (ssrn.com)
47 points by homarp on May 22, 2023 | 9 comments



abstract: Perhaps the most widely touted of GPT-4's at-launch, zero-shot capabilities has been its reported 90th-percentile performance on the Uniform Bar Exam, with its reported 80-percentile-points boost over its predecessor, GPT-3.5, far exceeding that for any other exam.

This paper investigates the methodological challenges in documenting and verifying the 90th-percentile claim, presenting four sets of findings that suggest that OpenAI's estimates of GPT-4's UBE percentile, though clearly an impressive leap over those of GPT-3.5, appear to be overinflated, particularly if taken as a “conservative” estimate representing “the lower range of percentiles,” and moreso if meant to reflect the actual capabilities of a practicing lawyer.


GPT-4 is clearly very impressive; they should show it off honestly and transparently. Instead, OpenAI clearly treats these evaluations as part of its sales and marketing, with inflated claims to match.


I know, right? It’s outstanding that GPT-4 could pass a bar exam at all, so it’s not clear why they should risk making misleading claims about its performance.


What did they risk exactly?

A relatively tiny percentage of people have read this relatively obscure academic paper, an even smaller number care, and probably none will stop using ChatGPT or investing in OpenAI because of it.

Meanwhile, its performance on the bar exam made international headlines which millions if not billions of people have read.

A lie is halfway across the world before the truth gets out of bed.


> Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be ∼63rd percentile, including ∼41st percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to ∼48th percentile overall, and ∼15th percentile on essays


"... when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to ∼48th percentile overall, and ∼15th percentile on essays ..."

so this means that ChatGPT is right in the middle of the performance of licensed or license-pending attorneys.

and for essays it's still better than 15 percent of the humans.

Still quite impressive ...


I wouldn’t immediately go as far as to consider lawyers humans.


Coming from someone that is likely a software developer, that is amusing


ah nice. i knew this couldn’t have been fully true



