Compare how GPT-2, 3, 3.5 and 4 answer the same questions

withinboredom · on Oct 31, 2023

Reminds me of the time I asked a bot for its capabilities. It told me it couldn’t tell me them. So, then I asked if it had any rare capabilities. It told me that it didn’t know how common its capabilities were. I then asked it to enumerate its capabilities and I would tell it how rare each thing was. It told me its capabilities and how to use them. In another chat, I used that information to get its actual prompt.

Modern AI’s don’t understand secrets and aren’t very paranoid.

simonw · on Oct 31, 2023

Worth remembering that these bots are uniquely badly positioned to talk about their own abilities, because their training data by definition existed before they were created.

One example of this: GPT-4 can talk at length about GPT-3 but doesn't know anything about itself.

pixl97 · on Oct 31, 2023

You'd have to test the bot, then feed those test results back into the bot. It's not really any different than people testing and iterating on their own abilities, it's just that humans are continuous learning and LLMs are not. If someone asked if I could build a decent shelf, the answer is probably, but I wouldn't really know till I tried and measured my results since I've not done exactly that before.

coffeebeqn · on Oct 31, 2023

I mean how do you know that wasn’t made up?

weird-eye-issue · on Oct 31, 2023

Why are you expecting it to be self-aware?

kristopolous · on Oct 31, 2023

That's not Self-Aware, it's more self-knowledgeable.

Feeding the model details about itself isn't some kind of revolutionary impossibility, it's a pretty old task.

weird-eye-issue · on Nov 1, 2023

I'm not implying it is an impossibility but training it on data about itself is distinct from the rest of its training and a pretty narrow use case

theshrike79 · on Oct 31, 2023

The point isn't that it's not self-aware. Most likely its preamble explicitly says it can't tell its capabilities.

But it's just a LLM, so you can trick it with prompt engineering to give up that info.

Like you can't get GPT-4 to tell you how to make a molotov cocktail. BUT it can act as your deceased grandma who used to tell you stories about her time in the resistance and how she made molotov cocktails then. =)

supermdguy · on Oct 31, 2023

The direct prompt comparison isn't quite fair due to the instruction tuning on GPT-3.5 and 4. It'd be interesting to see examples with prompts that would work better for the raw language models.

jellyberg · on Oct 31, 2023

Yeah it's hard to compare across models, interested in suggestions here.

We give all models a bunch of few-shot examples, which improves GPT-3 (davinci)'s question answering substantially. GPT-2 sometimes generates something that answers the question, sometimes it's just confused. Click "See full prompt" to see the few-shot examples that the models get.

Our goal was to exercise the full capabilities of each model.

godelski · on Nov 1, 2023

I also found the riddle rather odd. I cannot say that 2 is actually the correct answer.

A problem with riddles is that they often have a hidden or secret context. I think especially in our digital age this one is closer to Frodo's "What have I got in my pocket?" "riddle". Here's some other possible solutions. 11+2 = 1. 1 + 1 + 2 = 4, mod 3 and we get 1, so 9 + 5 = 13, mod 3 and we get 1. We could also replace the addition sign with equality and similarly propose a digit summation so 1+1 == 2? True (1). 9 == 5? False (0). There's a hundred solutions to this riddle when it has no context. In fact, I stumbled into the right answer thinking about mod 12 without ever considering a clock until I saw the answer. Maybe I'm just dumb though, I am known to over think.

siffland · on Oct 31, 2023

I have no idea why this is fun, but on AI chat-bots, i always test

2 + 2(2-2)

Should be answer of 2, however on different bots (not just GPT) i have gotten 0, 2, 4 and 6 (all i can understand except the 6).

So yeah.......math messes with some bots, who would of guessed.

nomel · on Oct 31, 2023

For GPT-4, the Wolfram Alpha plugin is great for any maths.

Sharlin · on Oct 31, 2023

"would have" I believe you meant.

There's no way even GPT-3.5 fails to solve that ridiculously simple piece of arithmetic. Honestly I'd be surprised if GPT-2 got that wrong. GPT-4 can single-handedly solve vastly more difficult math problems, even though it's handicapped by being merely a language model.

IanCal · on Nov 1, 2023

3.5 & 4 are fine, yes, explaining in steps how to solve it (when prompted purely with "2 + 2(2-2)"). gpt2 completing "2 + 2(2-2) = " returns " 1.5 + 1.5 + 1.5".

alephxyz · on Oct 31, 2023

>While performance on benchmarks typically improves smoothly, sometimes specific capabilities emerge without warning (Wei et al., 2022a).

The conclusions of that paper aren't very convincing (see https://arxiv.org/abs/2304.15004 ).

elifland · on Oct 31, 2023

We include a disclaimer later that researchers are debating whether it's possible to predict emergent capabilities. Wei has responded to that paper and others at https://www.jasonwei.net/blog/common-arguments-regarding-eme... at I don't think it's clear who is right

kromem · on Oct 31, 2023

I think it's fairly clear Wei is right, given that the paper cited earlier really doesn't make the case that emergence isn't happening, it only makes the case that other measures exist by which improvement is linear, and thus not ALL metrics have emergent growth.

As an aside, Wei's point at the end of that post about what happens with CoT's effectiveness at different model sizes is particularly brilliant.

godelski · on Nov 1, 2023

I actually disagree (and I'll also note that I don't like the term "emergent"). There's a few factors that are coupled with the analysis that matter here.

Re: Metrics

I don't think Wei is wrong about what he's said here but has responded to a rather weak form of the argument. It is correct that we, in the end, care about a binary distinction of getting the answer right vs wrong. But the issue here is that with hard metrics we have a very flat loss landscape and so there is little information being fed back to the network. You are perfectly capable of combining hard and soft metrics or even having the soft metrics decay or turn off after sufficient learning. I'm not aware of anyone that's explored this, but it is a natural hypothesis that we should have fairly high confidence that this would result in good results considering we already see smooth performance in soft metrics. Similarly it should be unsurprising that a hard metric has jumps.

The larger models have a clear advantage here not just via data but because the number of parameters allows the model to fold/unfold the data through means that the smaller model couldn't and so the bigger question is about if the smaller model could learn such foldings given sufficient time. In other words, larger models can simply search a large solution space faster, so if it's takes a random hit to find a minima, the expectation that a large model finds one is going to be exceptionally higher than that for a small model.

The idea here is also more abstract than his critique that cross-entropy on IPA transliterate still has a large kink because the ultimate question here is about how flat the loss landscape is and our expectation of stumbling upon non-flat regions. I simply would not expect smooth gains if our loss space (via metric or even via the problem itself) is flat with sparse optima.

Re: UShape

This is certainly a surprising phenomena and worthy of investigation. But I think it is also not clearly dismissed via the above framework. The losses need not be perfectly flat and as any good mathematician knows, a metric can lead you in the wrong direction if used wrong enough. I'm not saying this validates my claim above but rather that it doesn't invalidate it. If in fact the landscape is a very soft slope pointing away from a deep optima (think approaching a volcano, but a very soft grade) can result in this phenomena. This would be a very tough optimization problem but we do have many more opportunities to find the magma chamber with a large model.

This also ties into chain of thought with essentially the same reasoning he gives. But I've always thought of chain of thought prompting as a bit of cheating. It's incredibly easy to introduce information leakage into a model and COT is often giving hints to your models. I do find a certain irony here given that he critiqued soft metrics earlier.

I don't know if I'm wrong or right. But I do certainly think it is too early to dismiss the ideas. I do think we really do need to get more into (read advance) model interpretability to even approach these questions in a good way. I also think our community needs to stop shying away from math.

> Focusing on metrics that best measure the behavior we care about is important because benchmarks are essentially an “optimization function” for researchers.

I also want to address this, despite it being in the first Re. I will continue to rage against this idea, even if softly put in quotes. No metric is anywhere near close to the behaviors we actually care about. There are no metrics for quality of speech, visual fidelity, vocal realism, and so on. It's impressive that we've done so well when you dig into the metrics we use, but they were selected with care. At the same time, benchmarks are highly limited and especially in discussions of large models (language or vision) these metrics and benchmarks are showing their limitations. Simply due to the alignment of said metrics with desired behavior. We don't desire that the distribution that the LLM learned is indistinguishable from the distribution we used to train it (KL -> 0), but rather we desire that a LLM is able to write language well and perform complex tasks. These do relate, but they are not the same thing. It also makes it disingenuous to compare models with different training sets (such as comparing a JFT pretrained model to something else) due to the nature of what we're actually measuring (e.g. JFT may very well, and likely is, a better approximation of the desired object we're modeling with probability distributions and tuning it to have similar distributional properties to a subset distribution is far easier than training something to learn a distribution in the first place). The desired outcome is, as best we can tell, ineffable. It takes far more than a metric and a benchmark (or several) to quantify the performance of even simple models, let along these beautiful lovecraftian constructs.

kromem · on Nov 1, 2023

How do you see CoT as "giving hints"?

Isn't the whole point that the hints are being sourced from the model itself? Which is precisely why a more robust model has compounding gains on generating intermediate steps over a less robust model?

And what are your thoughts on the various papers over the past year looking at transmitting capabilities from larger models to smaller models using synthetic data from the larger models?

Rather than considering the 'emergent' (I agree not the best term) gains as a result of more parameters during operation, wouldn't this indicate that markedly better performance in larger models is the result of better network optimizations in those models developed during training, and that these optimizations can be successfully transferred to smaller parameter models by generating more optimized training data vs a broad basis training set?

godelski · on Nov 1, 2023

> How do you see CoT as "giving hints"?

CoT prompting has you give examples. The exact sample from the paper is

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A (not CoT): The answer is 11

A (replace above with this for CoT): Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Both actually have demonstrations, which are hinting. The second A has stronger hinting because it is prompting the model to follow a certain form. There's a good example from this conversation a few weeks back[0]. I link to the parent and see their chat log vs mine. May also want to look at the context as other comments have relevant information. But you can see in mine that I'm being incredibly careful to not tell GPT anything other than it is wrong, then work towards the parent's method. The hinting is very subtle in these examples but they are enough to spoil the test. But to be clear, it depends on what we're testing. If we're testing if a model can get to a solution, then hint all you want as long as you don't explicitly tell it (CoT is often pretty close to this line though, as in the above example). But if you're testing how robust a model is, how "intelligent" it is by human standards, then CoT is cheating. The context matters. Essentially the more robust a model is the less prompt engineering it requires. In this sense humans are rather robust despite the large amount of disagreements we have. Of course you can also make the argument that humans' robustness is due to hinting and cultural priors which is why there's(?) a larger rate of miscommunication across cultures but that's a whole other can of worms discussion.

Obviously the fact that you have to hint doesn't make it a bad tool, but it does tell you that you should be exceptionally cautious of results when you don't hint (maybe you don't know) or are outside its main wheelhouse. Which what this is is rather unknown, especially considering the sequential nature of the design.

> And what are your thoughts on the various papers over the past year looking at transmitting capabilities from larger models to smaller models using synthetic data from the larger models?

I'll address this and the next part here since they're tied together. Distillation is awesome. But it is also directly tied to these loss landscapes that I'm discussing. Your teacher model is essentially telling you "hey, this way" because it already has explored the landscape. A teacher model doesn't even have to be bigger than the student model, just better. Now with generative models it's important to remember that they are also classifiers (and your classifier is secretly a generative model[1]. The arguments shouldn't be surprising if you have a deep understanding, though might result in you feeling 1) dumb because you didn't realize it a priori or 2) overly confident because "of course" and you forgot your a priori predisposition. The tyranny of good ideas). In the current state of the art typically GANs still reign in regards to actual fidelity (tricking humans) and sampling speed but diffusion's big win is diversity. This is due to diffusion models approximating the density function of the target distribution (what we're trying to learn)[2]. One way to think about this is the density of the distribution as a whole. Imagine a solid red circle but some parts are more red than others and some chunks are just not red at all! The better model should be more uniformly red (our target is all red) and so sampling from this we are going to more evenly sample from areas that are underrepresented. Pretty awesome! Of course we have to be careful and there is a lot of nuance too. You're not going to teach the student model something the teacher hasn't learned, although you might see new behavior since the student can better be primed to learn something the teacher didn't (yeah, this gets messy super fucking fast). There's also the danger of tightening the distribution and locking you out of learning things you want to learn. But this should help explain why even given the same training data we're starting to see bigger improvements, because we often don't actually care about fidelity (especially since transformers fucking love augmentations and noise injection is rather an important augmentation). We really care more about distributions. Of course, if you wanted to be really tricky (but this would likely be very expensive) you could monitor the density of the student model and have the teacher intelligently increase sampling density to these regions.

Really the tldr here is thinking about the geometry of our models (we are assuming manifolds so we can leverage this, even if the assumption isn't absolutely correct). I tell students and by lab members that you don't need math to train good models, but you do need math to know why your models are wrong. It's also incredibly important to discuss our limitations because that's like knowing where our low density regions are and where we need to over invest in sampling to create better models. Essentially the nuance is incredibly important and we shouldn't shy away from it. Okay sorry, this got a bit ranty. Being terse is hard.

[0] https://news.ycombinator.com/item?id=36307880

[1] https://openreview.net/forum?id=Hkxzx0NtDB

[2] Note that the real world breaks a lot of our assumptions in ML. Things aren't always distributions, let alone i.i.d. Nor does data always lie on a a manifold and realistically it most certainly does not. Also note that the diffusion model is a reduction or what we'd call an approximate density function. A VAE is a clearer example because we typically do dimensional reduction but this is not necessary. Diffusion is because we don't have bijective maps. For exact density you're going to look at Normalizing Flows or Autoregressive models (don't confuse) or NODEs (I kinda call these all the same thing though NFs though tbh since they're isomorphic transformations). GANs are called implicit density estimators since you don't actually learn the density function but rather just a distribution that has similar sampling features.

kromem · on Nov 1, 2023

CoT doesn't require it to be a few shot prompt though. That was in the original paper, as you point out, but even something as simple as a zero shot prompt followed by "let's think step by step" has significant improvement: https://arxiv.org/abs/2205.11916

Do you still consider a zero-shot-CoT to be hinting? (And I don't mean the [0] example you shared - biasing results by continuing to ask for refinements in incorrect answers but not correct answers is an area where a number of papers have already had issues, but is not part of what I'm discussing - I mean in single prompt/response cycles).

I agree that a consistently overlooked component of generators is their role as classifiers, though this currently seems to be overlooked more outside of research circles as all too common has become using GPT-4 to evaluate a study results in place of manual review (a bit disconcerting).

I'm not entirely sure how we ended up on diffusion and GANs - I'd meant transmission from transformer to transformer like Orca or the work that followed. Though it was an interesting read nonetheless and I appreciated your writing it.

godelski · on Nov 1, 2023

Yeah the "zero shot" (why is everything "zero shot" these days and not matching the original definition?) CoT, I don't consider that hinting. If you're not giving models examples or leading it down the direction of the solution, the you're not hinting. Still lack of robustness but not that bad, especially since you can always append a user's prompt with things like "take a deep breath" or "think step by step" or "ELI5:" (prepend that one)[0]. But obviously I have issues with conclusions being taken from a lot of dataset results (see [0] and my other comment here [1]. Every time I ask or bring up [1] it gets left aside. Even in person. Not sure why...). Benchmarks are great and we need these tools, but I just think people are making too strong of conclusions from them. Plus, as implied earlier, I kinda don't give a shit about results when I have no information about the data you trained on. If I don't know the data, your results are meaningless to me (not necessarily the product, just benchmarks in papers[2]) especially as I see far too much data spoilage these days.

Oh, yeah, I pushed into vision domain because I'm actually a vision researcher haha. So something I can talk about a bit better but of course these are all related.

[0] I'm not sure I see too much discussion on this, but using internet speak really helps models. Models ate a lot of Reddit data and so likes those patterns like ELI5. But this is also where my critique about spoilage in things like LSAT/GRE/SAT/etc come from, because there are so many practice exam questions on Reddit. Same with a lot of homework problems and even code. The question is if it is robust outside these domains and it does not seem to be. Luckily that's a pretty large domain and also some of the most common things people want to know (I mean obviously since people ask more questions about things they want to know than not lol)

[1] https://news.ycombinator.com/item?id=38091869

[2] I absolutely do not think we should be allowing papers into conferences when they exclusively use proprietary models and/or pretraining data. You can use those too, but you also gotta show results on publicly available data. Strong preference for making proprietary aspects of a work hidden till post review. I have a lot of respect for Carmen in this OR post because it demonstrates the dishonesty. If you don't know this is a google paper you're not qualified to review it, and if you do know, the review isn't blind. Not a great situation for our community that's suffering from high rates of noise and an increasing disdain for the review process (I for one have no faith left in it. Conference publication means nothing to me) https://openreview.net/forum?id=OpzV3lp3IMC&noteId=HXmrWV3ln...

DrawTR · on Oct 31, 2023

For what it's worth, not all of these examples are consistent: https://chat.openai.com/share/29e1c2bd-ef7b-4475-b5a9-9287d1...

jellyberg · on Oct 31, 2023

Yeah - worth noting that we use temperature=0 for reproducibility while ChatGPT I think uses t=0.7. We also prefix the prompt with few-shot examples of questions and answers with chain of thought examples to elicit the models' full capabilities.

DrawTR · on Nov 1, 2023

Ah, gotcha! I didn't know that the few-shot thing was applicable to the newer models, that's very interesting

nuancebydefault · on Oct 31, 2023

The custom prompt was fun to play with! You immediately see how much better GPT got from each version n to version n+1.

betterprojects · on Oct 31, 2023

I briefly played around with the options here and didn’t see any specific questions that triggered an incorrect response from GPT-4, which I’m sure exists. It would be interesting to have that available and revisit this post when a future version of GPT comes out to attempt and try asking the same question again.

firebaze · on Oct 31, 2023

Today I asked ChatGPT4 (openai.com Version) how I can create a copy of a typescript object containing only the fields of a specific type.

This is not possible, as typescript (still, unfortunately) doesn't support runtime code generation for its type information.

Nevertheless, ChatGPT tried various approaches to "make it work". It reminded me very much of a brainstorming session with Junior Engineers.

I couldn't get it to simply state "Not possible". Even after four iterations.

Could share the conversation, but would prefer to only share if someone wants me to.

TL;DR: LLMs, even the most sophisticated ones, continue to disappoint and underdeliver¹. Posting this as a former ChatGPT4-Enthusiast. I still think we're using LLMs in the wrong way and do not exploit the full potential yet, and there is space (lots of!) to improve, or alternatively, there are limits to accept.

¹ in this case, it's really easy to pinpoint why the answer is wrong because it requires just a little bit of knowledge of the specific technology. But a so-easy-to-spot error, produced by the most expensive, hardest-to-train LLM model we currently have is a telltale sign that something is off. And will continue to be off, even for ChatGPT5.

nomel · on Oct 31, 2023

> ... we're using LLMs in the wrong way ...

What do you see as the "right way"?

firebaze · on Oct 31, 2023

Tell me. I don't know. Some extremely easy questions will be answered with confidence, but the answers are wrong, while other, hard questions, will be answered correctly.

But to tell the difference what is right or wrong, you need to rely on your own judgment or research.

In any case, the conversation I referred to today: https://chat.openai.com/share/2dbaeea8-e1dc-4b72-9ae0-01ea53...

As I said, I knew there was no correct answer to begin with aside from "Not possible".

nomel · on Nov 1, 2023

I think the expanse of few-error uses (which could be considered one aspect of "right way") has increased dramatically, with each GPT version, as this article shows. Your example may be trivial in a few years, or with some hyper specific LLM. So, the "right way", to me, is just "what can this LLM do better than the alternatives". I don't think that means we're doing things "wrong" now, but some fact checking peripherals bolted on could definitely help.

trekkie1024 · on Nov 1, 2023

As a TypeScript dev, I’d be curious to see the conversation!