No, GPT4 Can’t Ace MIT (flower-nutria-41d.notion.site)
261 points by YeGoblynQueenne on June 17, 2023 | 120 comments



Great analysis, props to these students for taking the time to challenge such a sensational headline. In the conclusion they mention my biggest problem with the paper, which is that it appears GPT-4 grades the answers as well (see section 2.6 "Automatic Grading").

In a way it makes perfect sense that GPT-4 can score 100% on a test GPT-4 also grades. To be clear, the grading GPT-4 has the answers, so it does have more information, but it still might overlook important subtleties in how the real answer differs from the generated answer due to its own failure to understand the material.


> In a way it makes perfect sense that GPT-4 can score 100% on a test GPT-4 also grades.

Even this is overstating it, because for each question, GPT-4 is considered to get it "correct" if, across the (18?) trials with various prompts, it ever produces one single answer that GPT-4 then, for whatever reason, accepts. That's not getting "100%" on a test.
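
To put rough numbers on it (the 5% figure below is an assumption picked purely to illustrate the best-of-k effect, not anything measured from the paper):

    // If the GPT-4 grader wrongly accepts a bad answer even a small fraction
    // of the time, 18 attempts per question make a false "correct" quite
    // likely for questions the model can't actually solve.
    const falseAcceptPerAttempt = 0.05; // assumed for illustration
    const attempts = 18;
    const pAtLeastOneAccept = 1 - Math.pow(1 - falseAcceptPerAttempt, attempts);
    console.log(pAtLeastOneAccept.toFixed(2)); // ~0.60

So even a fairly strict self-grader hands out a lot of undeserved "corrects" under that protocol.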


In the paper, they at least claimed to manually verify the correct answers.


I just looked again and I didn't see that claim, can you verify? https://arxiv.org/pdf/2306.08997.pdf

If as per the linked critique, some of the questions in the test set were basically nonsense, then clearly they couldn't have manually verified all the answers or they would have noticed that.


>We then process the data by manually correcting each question and answer to ensure quality and correctness

Section 2.1

Then the github repo also has wording around this:

> We double-verify manually that the grading of the test set is correct. https://github.com/idrori/MITQ/blob/main/index.html#L552

I agree it looks like this may not have actually been done given some of the questions and answers in the dataset.


Then - having not read the paper - what is the point of the automated grading?


To not spend time manually grading obviously incorrect ones (i.e. only grading 1/18 of them).


Got it!


If people haven't seen it: UT Prof Scott Aaronson had GPT-4 take his Intro Quantum final exam and had his TA grade it. It made some mistakes, but did surprisingly well with a "B". He even had it argue for a better grade on a problem it did poorly on.

Of course this was back in April when you could still get the pure unadulterated GPT4 and they hadn't cut it down with baby laxative for the noobs.

https://scottaaronson.blog/?p=7209


See the comment from Ose "Comment #199 April 17th, 2023 at 6:53 am" at the bottom of that blog post...


It literally did not change. Not one bit. Please, if you're reading this, speak up when people say this. It's a fundamental misunderstanding; there's so much chatter around AI, not much info, and the SNR is getting worse.


I’ve seen the recent statement by someone at OpenAI but whatever weasel words they use, it did change.

The modified cabbage-goat-lion problem [1] that GPT-4 always failed to solve, it now gets right. I've seen enough people run it in enough variations [2] before to know that it absolutely did change.

Maybe they didn't "change" as in train anything, but it's definitely been RLHFed and it's impacting the results.

[1] https://news.ycombinator.com/item?id=35155467

[2] anecdata: dozens of people, hundreds of times total


I attribute this to two things:

1. People have become more accustomed to the limits of GPT-4, similar to the Google effect. At first they were astounded; now they're starting to see its limits

2. Enabling Plugins (or even small tweaks to the ChatGPT context, like adding today's date) pollutes the prompt, giving more directed/deterministic responses

The API, as far as I can tell, is exactly the same as it was when I first had access (which has been confirmed by OpenAI folks on Twitter [0])

[0] https://twitter.com/jeffintime/status/1663759913678700544


In my experience with Bing Chat, in addition to what you say, there is also some A/B testing going on.


"It literally did not change. Not one bit."

How do you know?

Even if the base model didn't change, that doesn't mean they didn't fine tune it in some way over time. They also might be passing its answers through some other AI or using some other techniques to filter, censor, and/or modify the answers in some way before returning them to the user.

I don't know how anyone could confidently say what they're doing unless they work at OpenAI.


Someone who works at OpenAI said so two weeks ago


Then again, can we trust that person? It's not like they don't have a conflict of interest in making that claim.


Yes, it’s turtles all the way down


Nice try, ClosedAI. Then how do you explain this?

https://news.ycombinator.com/item?id=36348867


Well, I had hoped the sarcastic comparison to cut heroin would make it clear.

No, I don't think there's much change at all to GPT-4 (at the API level), and probably not that much at the pre/post language detection and sanitization for apparently psychotic responses.


You should take a look at this video. He is a researcher at Microsoft and had access to a private version of ChatGPT. He literally claims that ChatGPT 4 is not as good as before. His talk actually demonstrates the different evolutions.

https://youtu.be/qbIk7-JPB2c


If you are referring to that social media post by an OpenAI employee saying it hasn’t changed, they were specifically referring to the API. iirc, the same employee explicitly stated the Web UI version changes quite regularly. Someone correct me with the link if I’m wrong, I don’t have it handy.


This "GPT4 evaluating LLMs" problem is not limited to this case. I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT4. GPT-4 at this point is being regarded as "ground-truth" with each passing day.

Couple this with the reliance on crowd-sourcing to create evaluation datasets and the heavy use of GPT-3.5 and GPT-4 by MTurk workers, and you have a big fat feed-forward process benefiting only one party: OpenAI.

The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out. Reddit, Twitter and the like are awakening just now - to find that they're basically powerless against this wave of distorted future standards.

Once GPT is shown to pass every existing test on Earth, every institution will be so reliant on producing work with it that we won't have a "100% handmade exam" anymore. No problem will be left for GPT to tackle.


>> I don't know why exactly, but everyone seems to have accepted the evaluation of other LLM outputs using GPT-4. GPT-4 is being treated more and more as "ground truth" with each passing day.

Why? Because machine learning is not a scientific field. That means anyone can say and do whatever they like, and there's no way to tell them that what they're doing is wrong. At this point, machine learning research is like the social sciences: a house of cards, unfalsifiable and unreproducible research built on top of other unfalsifiable and unreproducible research. People simply choose whatever approach they like and cite whatever result they like, because they like the result, not because there's any reason to trust it.

Let me not bitch again about the complete lack of anything like objective measures of success in language modelling in particular. There have been no good metrics, no meaningful benchmarks, for many decades now, in NLP as a whole, but in language generation even more so. This is taught to students in NLP courses (our tutors discussed it in my MSc course), there is scholarship on it, there is a constant chorus of "we have no idea what we're doing", but nothing changes. It's too much hard work to try and find good metrics and build good benchmarks. It's much easier to put a paper on arXiv that shows SOTA results (0.01 more than the best system it's compared to!). And so the house of cards rises ever towards the sky.

Here's a recent paper that points out the sorry state of Natural Language Understanding (NLU) benchmarking:

What Will it Take to Fix Benchmarking in Natural Language Understanding?

https://aclanthology.org/2021.naacl-main.385/

There are many more, going back years. There are studies of how top-notch performance on NLU benchmarks is reduced to dust when the statistical regularities that models learn to overfit to in test datasets are removed. Nobody. fucking. cares. You can take your science and go home, we're making billion$$$ here!


I would have said machine learning is more like materials science, but you are on the right track.

As you increase the number of bits you are trying to comprehend, you move from quantum physics to chemistry to material science to biology to social science.

At certain points, the methods and reproducibility become somewhat of a dark art. I have experienced that in my field of materials science.

Because these models are using billions or trillions of random number generators in their probability chains, it starts looking more like the sciences that are harder to do reproducibly, and it gets very difficult to track and understand what is important.

I think machine learning will be easier to comprehend than social sciences, so I wouldn't put it that high. It will be something between materials science and biology levels of difficulty in understanding.


Yes, I have similar concerns. These models regurgitate previously seen strings, previous benchmarks included. When you try to evaluate their sheer ability to reason on the text, however, they perform poorly. (Our experiments with GPT-3 are here: https://doi.org/10.5220/0012007500003470)


> The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out.

If OpenAI ceased to be (probably for some legislative reason), would the problems go away?


The damage will have been done so I don’t think so.


> but it still might overlook important subtleties

If there's one thing we can be certain of, it's that LLMs often overlook important subtleties.

Can't believe they used GPT4 to also evaluate the results. I mean, we wouldn't trust a student to grade their own exam even when given the right answers to grade with.


I noticed that when I read the paper. I know it's hard to scale, but I'd want to see competent TAs doing the grading. I also found the distribution of courses a bit odd. Some of it might just be individual samples, but intro courses I'd expect to be pretty cookie-cutter (for GPT) were fairly far down the list, and things I'd expect to be really challenging had relatively good results.


Can attest that the distribution is odd from the test set that we sampled.

We've already run the zero-shot GPT model on all of the data points in the provided test set. We're going through the process now of grading them manually (our whole fraternity is chipping in!) and should have the results out relatively soon.

I can say that, so far, it's not looking good for that 90% correct zero-shot claim either.


Since you are here, when I was reading the paper I wondered -- when they show the "zero-shot solve rates", does that mean that they are basically running the same experiment code, but without the prompts that call `few_shot_response` (i.e. they are still trying each question with every expert prefix, and every critique?) It wasn't clear to me at a glance.


Even research from OpenAI has attempted to use GPT-4 as quasi-ground truth (as a replacement for human evaluators). For example, their method in the recent paper "Language models can explain neurons in language models" [1] is:

1. Using GPT-4, generate a text explanation of a neuron's activations on sample input.

2. Using GPT-4 again, use the text explanation to simulate the neuron on some new text input.

3. Compare the result to the actual neuron's activations on the new text input.

They justify this by saying human contractors do equally poorly at coming up with text descriptions. However, the procedure is such a black box that it is difficult to make scientific conclusions from the results.

[1] https://openai.com/research/language-models-can-explain-neur...


> Even research from OpenAI has attempted to use GPT-4 as quasi-ground truth (as a replacement for human evaluators).

The way OpenAI used GPT-4 is fundamentally different from how GPT-4 was used to score the answers to the MIT exam. In OpenAI's case, they had GPT-4 generate an explanation of when a neuron in GPT-3 would fire. They then gave that explanation back to GPT-4 and had GPT-4 predict when the specific neuron in GPT-3 would fire. The scoring was done by computing the correlation between when GPT-4 predicted the neuron would fire and when it actually fired. The scoring was not done by GPT-4, as it was for the MIT exam.
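
For concreteness, the scoring step amounts to something like this (a minimal sketch with hypothetical activation values, not OpenAI's actual code):

    // Pearson correlation between GPT-4's simulated activations and the real
    // neuron's activations on new text; a high score means the text
    // explanation predicts the neuron's behaviour well.
    function pearson(xs: number[], ys: number[]): number {
      const mean = (a: number[]) => a.reduce((s, v) => s + v, 0) / a.length;
      const mx = mean(xs);
      const my = mean(ys);
      let num = 0, dx = 0, dy = 0;
      for (let i = 0; i < xs.length; i++) {
        num += (xs[i] - mx) * (ys[i] - my);
        dx += (xs[i] - mx) ** 2;
        dy += (ys[i] - my) ** 2;
      }
      return num / Math.sqrt(dx * dy);
    }

    // Hypothetical per-token values for one neuron on a new snippet.
    const simulated = [0.1, 0.9, 0.0, 0.7, 0.2];
    const actual = [0.2, 0.8, 0.1, 0.6, 0.1];
    console.log(pearson(simulated, actual)); // close to 1 = good explanation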

In addition, OpenAI did have human evaluators score the explanations as well, to make sure they were human-interpretable [0]

[0] https://openaipublic.blob.core.windows.net/neuron-explainer/...


Correct, except they did this for GPT-2, not GPT-3.


Indeed, the OpenAI paper is more well-founded since the "ground truth" was not generated by GPT-4.

However, both papers rely on black boxes instead of well-understood procedures. This places the papers on weaker scientific footing. For example, a poor explanation/simulation of a neuron's behavior may simply be a consequence of GPT-4's own limitations. Instead, a scientist would want to "prove" some form of unexplainability.

To do this, a researcher would not use a human at all to explain neuronal behavior. Instead, a simple, repeatable algorithm such as topic modeling would be applied. This would lead to significantly stronger scientific conclusions about the neurons. It would also prove that it is not possible to explain the neuron in some specific sense.

An interesting follow-up to the OpenAI paper might be to quantify how much "more powerful" its textual descriptions are than simpler, well-understood techniques such as topic modeling. That could at least reinforce its use.


I mean, that's literally the purpose of that paper: to show the capabilities of a black box (GPT-4).


Can't we just have GPT-4 make the scientific conclusions from these results? /s


I don't understand your objection. Step (3) is the one that actually assesses how well the proposed description works, and that is a comparison with the 'real' ground truth.


This is one of the most embarrassing reviews I have ever read (for the paper reviewed). AI research urgently needs good practices to adhere to, but the current status is that it is really hard to take many of the results seriously due to the opaqueness that characterises so many steps of the process. And such serious mistakes and bad practices certainly do not help the field achieve any credibility.


This note from the article is important:

> Several of the authors listed on the discussed paper are undergraduate researchers. Consequently, we believe it's inappropriate to hold these individuals accountable for any lapses present in the work.

> Instead, we believe the responsibility should lie with the supervising authors. They are the ones who are expected to ensure that the work meets the rigorous standards of public scholarship within their field.


There's no need to even make such a complicated task.

Recently I found that GPT-4 can't even reliably create a list of German nouns with a given article (der / die / das).

It will mess up a simple list; if you ask it to analyse the list, it'll be able to tell you that it's wrong.

Then you get it to correct the list and it may still be wrong.

It can take several iterations to make the list correct. I would have thought this would be a super easy task for it, but apparently not.


I've gone from working with 500-1k lines of code at a time in GPT-4, to not bothering beyond one or two small functions because it gets confused so easily now.

Seems they're enshittifying already; I hope a competitive model is released soon.


The model didn't change.


How do you know?

Oh, let me guess... because OpenAI told you so. OpenAI, the one institution with the strongest incentives to tell people the model is not getting worse.


Or I foresaw this and kept a collection of temperature-0.0 GPT-3.5 outputs around.

People are incredibly silly, myself included. Get old enough and see an _insane_ influx of new people, and you figure pretty much exactly this is going to happen. 95% of people don't know what temperature is. Of the remaining 5%, 4.9% think it's something you just tell ChatGPT to adjust.
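
For what it's worth, the check itself is simple enough for anyone to run (a sketch against the public chat completions REST endpoint; the prompt and saved answer are placeholders):

    // Call a pinned model snapshot at temperature 0 and diff the output
    // against an answer recorded earlier. Temperature 0 doesn't guarantee
    // perfect determinism, but it makes drift in the served model easy to spot.
    async function completeAtZeroTemp(prompt: string): Promise<string> {
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model: "gpt-3.5-turbo-0301", // pinned snapshot, not the moving alias
          temperature: 0,
          messages: [{ role: "user", content: prompt }],
        }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }

    const savedAnswer = "...answer recorded when the snapshot was first served...";
    completeAtZeroTemp("some fixed benchmark prompt").then((now) =>
      console.log(now === savedAnswer ? "unchanged" : "output differs"),
    );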


Supposedly. But I have the same experience. I used it to create code to design a complicated CRISPR experiment, but lately it can’t keep anything straight.


That should lead you to be less confident that it changed, not more


How so? Seems counterintuitive


I’d expect it based on the absurdity of the difference - it’s a master programmer and now it can’t do anything! - combined with self-knowledge that 0 attempts were made to make an objective comparison.

If my Tesla went 80 mph and started going 15, we wouldn’t attribute that to the-nature-of-Teslas, or a software update: there are vast incentives for anyone who knew that was done to share that publicly.

Instead, we know something is wrong with the individual car and we take it to the dealer.

Here, the missing part is objectivity via a speedometer: I have one, I know outputs are consistent on 0315 models at 0 temperature.


If you spend less compute power on the same model, the answers can degrade.

It makes total sense that they would throttle due to the incredible global demand. They can't build supercomputers fast enough now.


With the following prompt I got 50 consecutive words with correct articles:

> Erstelle eine Liste mit 10 Nomen, die den Artikel "der" haben. ("Create a list of 10 nouns that take the article 'der'.")

Maybe "reliably" is doing a lot of the heavy lifting?


It's not that it didn't do an OK job, but more that you couldn't rely on what it had produced totally, nor rely on it having corrected the list without first having it reanalyse the "corrected" list.

It's still extremely helpful; I just found it strange that something that has been fed millions of documents would still give some incorrect results on such a seemingly simple task, especially AFTER it had analysed its own results and found some noun articles to be incorrect.


I've found you still can't rely on LLMs to do anything 100% correct without human oversight. Unless you spend a lot of time prompt engineering and testing. Even then you might not get as close to 100% as you'd like.

But as you say, they are still extremely helpful anyway.


Curious, could you share your prompt? I just tried asking GPT4 (paid) to create a list of German nouns with der/die/das and it managed to do it correctly.


"What are the 200 most common nouns in German with the artikel "der"?"

I actually didn't realise it was giving me incorrect info until my gf started looking at it!

(I was trying to use it to help me learn German)


Some things to think about: a) your prompt is ambiguous, in the sense of: do you want masculine nouns, or nouns prepended with "der"? b) Prompting in the language you want the output to be in usually results in better output. c) German is actually a really difficult case for articles, as feminine nouns can take "der" depending on context and meaning, making it harder for LLMs to pick up. E.g. "Die Welt dreht sich." "Auf der Welt leben." "Auf die Welt gekommen." (the same noun, "Welt" / "world", taking "die" or "der" depending on grammatical case).

d) GPT-4 is significantly worse in German and most other languages than in English anyway.


I gave GPT4 that prompt and as far as I can tell, the list is correct.

GPT answer:

It's difficult to determine the "200 most common" nouns for a particular gender in German since the frequency of word use can vary greatly depending on the context (e.g., conversational, academic, technical). However, here's a list of 100 common masculine nouns, which use the definite article "der".

der Mann (man), der Tag (day), der Freund (friend), der Vater (father), der Bruder (brother), der Sohn (son), der Monat (month), der Kurs (course), der Tisch (table), der Stuhl (chair), der Garten (garden), der Zug (train), der Platz (place), der Himmel (sky), der Wagen (car), der Brief (letter), der König (king), der Schuh (shoe), der Baum (tree), der Hund (dog), der Bär (bear), der Fisch (fish), der Vogel (bird), der Schüler (pupil), der Regen (rain), der Wind (wind), der Wald (forest), der Berg (mountain), der Fluss (river), der Junge (boy), der Kaffee (coffee), der Lehrer (teacher), der Arzt (doctor), der Student (student), der Polizist (policeman), der Künstler (artist), der Teller (plate), der Löffel (spoon), der Geldbeutel (wallet), der Laptop (laptop), der Schalter (switch), der Fernseher (television), der Schrank (cupboard), der Helm (helmet), der Hut (hat), der Schmuck (jewelry), der Ring (ring), der Sport (sport), der Schaden (damage), der Boden (floor), der Sand (sand), der Sturm (storm), der Preis (price), der Feind (enemy), der Held (hero), der Raum (room), der Morgen (morning), der Abend (evening), der Unfall (accident), der Vortrag (lecture), der Urlaub (vacation), der Ausflug (trip), der Hafen (harbor), der Bericht (report), der Name (name), der Bauer (farmer), der Rechner (calculator), der Traum (dream), der Anzug (suit), der Geschmack (taste), der Eindruck (impression), der Zweck (purpose), der Vertrag (contract), der Krieg (war), der Kunde (customer), der Arbeitgeber (employer), der Mitarbeiter (employee), der Kollege (colleague), der Bewohner (resident), der Fahrer (driver), der Gast (guest), der Kritiker (critic), der Profi (professional), der Sieger (winner), der Kandidat (candidate), der Beamte (official), der Insasse (inmate), der Zeuge (witness), der Beweis (proof), der Schatten (shadow), der Zweifel (doubt), der Trauer (grief), der Frieden (peace), der Nerv (nerve), der Horizont (horizon), der Gedanke (thought), der Lohn (wage), der Antrag (application), der Verlust (loss), der Betrag (amount),


I don't speak German, but I just tried a few at random in an online translator and der Trauer seems to be wrong.


Wiktionary agrees: https://en.wiktionary.org/wiki/Trauer

There's a common pattern in GPT discourse (here on HN, but elsewhere too): somebody describes a limitation, somebody else goes "no such limitation exists: look, here's GPT output", and a third person goes "no, here's why your example demonstrates the limitation".

It's interesting. People find it very hard – or are disinclined – to distrust a charismatic robot, even when warned.


You're right, it's die Trauer

https://dict.leo.org/german-english/trauer


However that's the only wrong one.


Both forms are likely right (I am not German):

> Trauer is a feminine noun. Remember that, in German, both the spelling of the word and the article preceding the word can change depending on whether it is in the nominative, accusative, genitive, or dative case.[1]

[1]: https://www.collinsdictionary.com/dictionary/german-english/...


"der Trauer" would be correct as an inflected form (genitive or dative singular), but that's not what you would list as a dictionary definition.

A good AI should be smart enough to know that if you ask for "German words with the article 'der'", you're most likely to want to be given a list of masculine nouns.


This isn't the same task that you described in the first comment. Did GPT4 include nouns that didn't use the article "der" in its output? Or did it fail to reply with the 200 most common ones?


Apologies, I was trying to give a little context for those that don't know about the three "the's" in German.

The list was mostly correct, but yes it added nouns that were not "der" nouns into the list.

It then attempted to correct the list and failed at correcting it.

In terms of output, it also didn't want to give a list of 200, but I did manage to get a list of around 100 back.


I wonder if it'd help to ask it to write the noun with the article. I just tried it now - I asked it to list the top 100 German "der" nouns, then asked it to repeat the list but with the article. That made it obvious which ones were wrong!

It was unwilling to write "Der Jahr" when "Das Jahr" is correct.


Does it work if you ask in German? I found it's better if you tell it via the system prompt that it's a language professor (using your target language) than if you just use English for tasks involving a foreign language. The power of the LARP.

(I use a normal machine translation API for a lot of this, but you can also ask it in another context window to translate the text to other languages. I use this approach for e.g. Sindarin.)


I didn't do that, but will give it another go. Thanks for the suggestion!

Otherwise my current strategy is to put it in an analysis loop until it deems the list to be correct.


> I would have thought this would be a super easy task for it

Why did you think that? This isn't meant to be critical, but I'm honestly curious: what led you to believe that the technology underlying GPT-4 made it a good fit for this or any particular task?


I too would think it a super easy task.

It has probably seen the correct nouns used millions of times in the training data, and asking it to produce the correct articles for a bunch of nouns is really just "tell me which case you saw most during training", which is something LLMs are really good at.


From what I understand, it's trained mostly on English, and performs better in English. So it's not at all surprising it would make more mistakes in German or some other language.


>Why did you think that?

It is a purely statistical model. It does not know any "rules" about the language (it doesn't know any language at all); it is fed data and derives from that sophisticated probabilistic relationships between words.

It shouldn't have much of a problem generating the correct grammatical formulations, as it has been extensively trained on them. More so than any other technology, neural networks are suited for this kind of task, where hard rules do not exist (as a German I couldn't tell you why "rain" is masculine but "machine" is feminine) but lots of data correctly implementing the rule does.


I have doubts it was extensively trained on German data. Who knows about GPT-4, but GPT-3 is ~92% English and ~1.5% German, which means it saw more "die, motherfucker, die" than "die Mutter".

(https://github.com/openai/gpt-3/blob/master/dataset_statisti...)


I think this task is beyond the capabilities of what GPT-4 can handle; this is simply asking too much of it. For other languages I'm sure it has no problems.

https://faculty.georgetown.edu/jod/texts/twain.german.html


In Greek it is also really bad (and makes rather obvious mistakes); in French it seems much better, but it makes some very obvious mistakes too. To make it interesting, I emphasise that I want nouns that refer to objects only (else it just spits out profession names and stuff like that, which is not interesting).

Also tbh, with all the hype of LLMs one would think that such a task would not be such a challenge.


>Also tbh, with all the hype of LLMs one would think that such a task would not be such a challenge.

The strange/sad thing is that despite being "large language models", they're often hypermyopic on English.

I've done some measurements comparing generation between various languages in the prompt, and no matter what I do, half the time I cannot get them to not include English text or comments in code unless the request is made in Japanese, Chinese, or a similarly very different language.


It's long been observed (e.g. Emily Bender has written some articles to that effect) that NLP technology underperforms on languages that aren't English, especially when they are significantly different structurally.

If you train and evaluate something mostly on a language like English, you're going to end up with a model that thinks everything works like English, which means, among other things, very little morphological complexity.


It's actually quite amusing asking it to give you a list of consonants that Greek words can end in, and then example words for each. It's pretty much all hallucination, and then it tries to gaslight you.


I just had the same experience. It was so strange.


With LLMs hallucinating and being generally unreliable, I'm coming to the conclusion that LLMs should only be used to transform natural language into structured data (specific to your application), and knowledge about the world should be stored somewhere else (some sort of vector database? tools?). Fortunately, smaller models are already quite good at the former! Trying to extract world knowledge from LLMs can be a dead end... I don't see how the hallucination problem can be 100% fixed for LLMs themselves.
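
Roughly this division of labour (a sketch; the query schema and toy fact table are made up for illustration):

    // The LLM's only job: turn free text into this structured query (e.g. by
    // prompting it to reply with JSON only). Facts come from data the
    // application controls, so a hallucinated "fact" can't slip into the answer.
    interface CityQuery { city: string; field: "population" | "country"; }

    // Validate whatever the model returned before trusting it.
    function parseQuery(modelOutput: string): CityQuery | null {
      try {
        const q = JSON.parse(modelOutput);
        if (typeof q.city === "string" && (q.field === "population" || q.field === "country")) {
          return { city: q.city, field: q.field };
        }
      } catch { /* fall through to null */ }
      return null;
    }

    // World knowledge lives in our own (here: toy) store, not in the model.
    const FACTS: Record<string, { population: number; country: string }> = {
      Berlin: { population: 3_700_000, country: "Germany" },
    };

    function answer(modelOutput: string): string {
      const q = parseQuery(modelOutput);
      if (!q) return "Could not parse the model's output.";
      const row = FACTS[q.city];
      return row ? String(row[q.field]) : `No data for ${q.city}`;
    }

    // e.g. the model was asked "How many people live in Berlin?" and replied:
    console.log(answer('{"city": "Berlin", "field": "population"}')); // 3700000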


Strange take, do you consider world knowledge produced by humans to be anywhere near 100% accurate?

Suppose you had an oracle that when asked a question gives you 5 answers, 1 of which is true. Or even 1 of which is true only 50% of the time. You would still generally be a fool to throw away that oracle.


I do not trust LLMs. They may very often be right, and may very often be useful, but I do not trust them. If you require correct results, their output should be checked somehow.


Am one of the authors of the linked article. Thanks for all of your kind comments! Let me know if you have any questions and I'll be happy to answer.


How disruptive has this tech been to course 6 take home exams/homework/study axioms?

I remember the very late nights at Burton Connor, maybe students will get more sleep now :)


Whose idea was it? Was this something done for fun or was it suggested by a professor you're working with?


Neil posted the paper in our fraternity's ML group chat (MIT things lol), and I expressed some skepticism at the results.

Initially we started looking into it more for curiosity's sake, but as we started digging we kept finding more and more ridiculous stuff, to the point where we decided to start working on a blog post. Then David joined in on the action and helped a ton with the research and writeup.

No professor was involved. The paper was released yesterday, so we just documented as we went along in the investigation process. It only took like 8 hours of work to compile the doc. We finished it last night and posted to Twitter in the morning.


Why didn’t you ask the authors for the actual sample dataset? Wouldn’t the commit to delete these files indicate that they were not the ones used in the study?


I think it would indicate that they knew the results were not reproducible so they took down the data.


I'm not sure what to make of this post. There is always a degree of uncertainty with experimental design, and it's not surprising that there are a couple of buggy questions. ImageNet (one of the most famous CV datasets) is at this point known to have many such buggy answers. What is surprising is the hearsay that plays out on social media, which blows the results out of proportion and leads to opinion pieces like these targeting the authors instead.

Most of the damning claims in the conclusion section (obligatory: I haven't read the paper entirely, just skimmed it) usually get ironed out in the final deadline run by the advisors anyway. I'm assuming this is a draft paper for the EMNLP deadline this coming Friday, published on arXiv. So this paper hasn't even gone through the peer review process yet.


ImageNet has five orders of magnitude more answers, which I would assume makes QA a completely different category of problem.

The authors could probably have carefully reviewed all ~300 of their questions. If they couldn't, they could have just reduced their sample size to, say, 50.


I admit that ImageNet isn't the best analogy here. But I'm pretty confident that this data cleaning issue would be caught in peer review. The biggest issue, which I still don't understand, was the removal of the test set. That was bad practice on the authors' part.


In general with LLM evaluations, I keep seeing issues that would be caught by a human carefully looking at the results. To a nonexpert who occasionally peeks at things, it seems like having a small dataset that's just slightly too big for you to manually review is bad practice.

It also seems like 100% accuracy should have raised red flags, especially if you know your dataset isn't perfectly cleaned.


Has academia established clear standards on what is competent work and what is not? It occurs to me that while a small subset of papers stand out, many papers struggle to conform to basic practices like publishing runnable code and data, keeping them up to date with the latest libraries and models within a year after the paper is published, and demonstrating performance in varied, potentially subjective ways, rather than picking a random benchmark, showing a 0.01% accuracy improvement, and calling it scientific.

Is it just me, or is it the case that for the majority of papers, the effort required to understand and get value out of the paper is much higher than the effort put in by the authors and reviewers to publish it?


Tangent: Question 42 seems bad to me. It asks you to analyze the behavior of two processes concurrently calling a method on the same TypeScript class instance.

Whether in the browser or in Node, JavaScript does not have processes operating on shared memory. While I get what they're trying to say, I think calling a task on the JavaScript event loop a "process" is counterproductive to learning. Answering the question requires knowing that the runtime will only preempt at await statements, and calling it a process confuses that.
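
A minimal illustration of that last point (a toy class, not the exam question itself):

    // On the JS event loop, two concurrently started async calls on the same
    // object are not preempted at arbitrary points; control only switches at
    // an `await`.
    class Counter {
      value = 0;
      async increment(): Promise<void> {
        const read = this.value; // no other task can run before the await below...
        await Promise.resolve(); // ...but another task may run here, at the await
        this.value = read + 1;   // a stale `read` can now overwrite a newer value
      }
    }

    async function main() {
      const c = new Counter();
      await Promise.all([c.increment(), c.increment()]);
      console.log(c.value); // 1, not 2: both reads happened before either write
    }
    main();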


This trend of sensational but flawed papers is extremely worrying. I can literally see how it erodes trust in science in general among the tech people around me. And I understand it, of course: if papers in your area are not trustworthy, what makes you think that other papers can be trusted?


If I were able to take every exam open book and had the whole thing digitized in advance, with billions of notes and evaluations of the texts the questions were based on, I'd probably ace it too.


That's what separates you from GPT: it's willing to put in the work.


tl;dr the dataset was nonsensical and the researchers used GPT-4 to rate its own answers in the tests.


Please pause AI development. It’s going to ruin all of our lives


Pretty damning. Certainly seems retraction worthy.

Also somewhat upsetting that something so low quality actually is getting published. Seems entirely driven by hype and not intellectual rigor.


It should be noted that the original paper is a preprint, i.e., not peer-reviewed.


I count 15 people as authors. Surely one of them was able to look over the data and methodology.

"Not peer reviewed" is no excuse to publish non-information for the sake of headlines.


It's not published though. That's what he is trying to tell you. There is no guarantee it would have been published.


It absolutely is published, just as a preprint.

Again, none of this excuses this. It isn't an innocent mistake, which could have been caught later. The dataset is flawed and the methodology is questionable, still the authors published it on arxiv, with spectacular claims.

If you don't know, there has been a significant shift in how scientific papers (STEM for the most part) are distributed. Instead of journals (which have lost almost all use in a digital world), papers are published freely accessible online without any formal quality control, before potentially later being published in some journal. arXiv, where these papers are published, has control over who gets to publish (it's not open to the public), but doesn't require a lengthy formal process. In mathematics this has worked remarkably well; notably, one of the Millennium Prize problems was solved when the solution was uploaded to arXiv.

Polluting arXiv with low-quality clickbait is destructive; "not being peer reviewed" is no excuse for bad science.


I think the OP is simply trying to say that a preprint cannot be "retracted".

On arXiv, it is possible to "retract" an article, in the sense that you ask that it be hidden or deleted, etc. That's not the same as retracting a published paper, where you usually get some justification, and a note from the editor explaining the decision to publish, and the decision to retract, and so on. More to the point, nobody cares if you "retract" a preprint, since it's expected that it may have errors not yet caught by peer review. No peer review means anyone can put anything they want online, and then take it back offline as they wish.

Note that arXiv also gives you the option to publish different versions of a paper. So you can leave your paper with errors as a v1 and upload your post-review paper, with corrections, as v2. Again, no "retraction" needed.


Different definitions of "published". Uploading a preprint to arxiv or whatever definitely counts as publishing it in the nontechnical sense of the word – to an audience comprising several billion eyeballs, no less!


Still, the question remains who published it. Some of the authors (perhaps the supervising ones) may have wished not to submit it to journals, and a zealous undergrad may have uploaded it to arxiv without removing the other authors.


Actually, it was the senior author who posted it on his twitter: https://twitter.com/iddo/status/1669665897454227456?s=20


Makes one wonder what fraction of papers are equally bad, but just haven’t been subjected to similar scrutiny.


Most of them. Given a set of incentives to publish as much as possible, one would expect the quality to decline at least linearly (assuming that the rate of actual discovery is constant).


[flagged]


I think it's appropriate to be extremely critical. The paper is basically useless. The thing that they actually measured is "can GPT-4, when given a 'question' with lots of additional information and many tries with small permutations to produce an 'answer', at some point produce an 'answer' that GPT-4 will then claim is a 5 out of 5 answer, on a dataset of extremely messy 'questions' and 'answers' from MIT coursework."

That's not an interesting thing to measure. The paper talks about it in terms that make it sound like it's a close proxy for whether GPT-4 "knows" how to do things in MIT coursework, by writing misleadingly about "fulfilling the graduation requirements" and having a "perfect solve rate." But in fact it's totally different. The result is that a bunch of people hear about this paper and get fooled into thinking that there is new interesting evidence about GPT-4's capabilities, unless they manage to read closely enough to see what actually happened.

It's not a matter of whether the results would get weaker if repeated, it's a matter of the results being totally disconnected from any useful real-world information about what GPT-4 can do, or how it can do it.


> peer review...Their null is GPT4 can ace MIT and they haven't provided any evidence to reject.

I haven't been in research for a while, but I don't think that's how peer review works. You don't always have to take the paper's claims (especially ones as novel as this one's) as a null hypothesis and provide a compelling refutation.

The detection of sloppy question framing, of answer feeding via the few-shot learning examples, and of the problems with checking answers with GPT-4 itself reasonably shows that there are serious flaws in multiple parts of the experiments described in the paper.

> The real conclusion is "we think the paper's results might be weaker if repeated" and that's a good result on its own.

No, they don't know enough to say that. The paper's results might be better with better experimentation! Or they might be totally false.

The conclusion they provided is accurate precisely because it focuses on the methods and not the conclusions, like this:

> One particularly worrying trend is the technique of evaluating a model’s accuracy using a language-based model like GPT-4. While a useful tool, its conclusions should never be overstated or treated as ground truth.

...and this:

> Additionally, it is extremely important to reevaluate every data point and perform basic sanity checks before using data at all, whether for training, inference, benchmarking, or something else. Given the small size of the dataset in question, a simple manual validation would have been easily within the scope of the work.


You're right that our post doesn't quite show that GPT-4 cannot perform well on the MIT curriculum. We try to be up front about this in the conclusion:

> Our critiques are largely of the methodology and rigor of this study, not about its content. We make no claim about the ability of large language models to actually solve MIT curricula, only that this paper fails to prove it in a scientifically rigorous way. Though, as MIT undergraduates ourselves, we can at the very least say that the test set that we accessed does not, at least in our experience, accurately represent the breadth and depth of understanding required to complete an EECS degree at MIT.


>Their null is GPT4 can ace MIT and they haven't provided any evidence to reject.

If I said "2 + 2 = 4", and then gave a flawed proof, it is not your job as a debunker to prove that 2+2 != 4.

Their null hypothesis is that this paper provides sufficient and proper evidence to prove the original claim. And to that end, it seems like they've done a good job at pointing out enough serious deficiencies that - regardless of the original hypothesis's truth - the original paper cannot be used to prove it to be true.


> Our critiques are largely of the methodology and rigor of this study, not about its content. We make no claim about the ability of large language models to actually solve MIT curricula, only that this paper fails to prove it in a scientifically rigorous way. Though, as MIT undergraduates ourselves, we can at the very least say that the test set that we accessed does not, at least in our experience, accurately represent the breadth and depth of understanding required to complete an EECS degree at MIT.

Directly from TFA. Though, this does make the title somewhat click-baity. But they definitely don't claim the opposite.


>I think this is a good peer review with a too defensive tone.

Not surprising when you have undergrads calling a seemingly reputable paper almost fraudulent. I certainly cannot blame them for the tone.


It's a criticism of methodology. It doesn't need a null hypothesis.

But, you know, I'll try that as a rebuttal the next time a reviewer rejects my papers. "Reviewer #2 has not rejected the null hypothesis" :P


The original paper claims GPT correctly answered this question. Thus the original paper is obviously wrong.

> Which invocations run in parallel? (Assuming there are enough cores.)

No context is omitted; that's the entire question.


No, it turns out GPT-4 cannot answer impossible questions. Maybe it's you who is too defensive and wants this to be false?



