
So the original paper from Stanford and Berkeley is also linked to this AI influencer? I am really amazed by this kind of dismissal. It's totally irrelevant who posted the info and how it is framed, as long as you have access to the source.



The paper does things like ask GPT-4 to write code and then check if that code compiles. Since March, they've fine-tuned GPT-4 to add back-ticks around code, which improves human-readable formatting but stops the code compiling. This is interpreted as "degraded performance" in the paper even though it's improved performance from a human perspective.
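For concreteness, here is a minimal sketch of that failure mode -- not the paper's actual harness, and the helper name is made up -- showing why fenced output trips a naive "is it directly executable?" check in Python:

    # Minimal illustration, not the paper's evaluation code: a naive
    # "directly executable" check rejects output wrapped in Markdown fences.
    plain = "def add(a, b):\n    return a + b\n"
    fenced = "```python\ndef add(a, b):\n    return a + b\n```\n"

    def directly_executable(source: str) -> bool:
        try:
            compile(source, "<llm-output>", "exec")
            return True
        except SyntaxError:
            return False

    print(directly_executable(plain))   # True
    print(directly_executable(fenced))  # False: backticks are not valid Python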


There is degraded performance because GPT-4 refuses to carry out certain tasks. To figure that out, though, you need to be able to switch between GPT-4-0314 and GPT-4-0613. The tasks it is reluctant to do include:

- legal advice

- psychological guidance

- complex programming tasks.

IMO OpenAI is just backtracking on what it released to resegment their product into multiple offerings.


>IMO OpenAI is just backtracking on what it released to resegment their product into multiple offerings.

I think this is it, but also that they're pulling back on the value of the product wherever that reduces their compute costs.

I'm hoping competition in this market will be so fierce that opaque changes to the quality of paid LLM products won't be a thing for very long.


Hilarious because Google is taking the exact opposite approach.

They released Bard as a terrible model to begin with, and now it can write code and interpret images. It's been consistently improving.

OpenAI really got sidetracked when they put out such a good model that freaked everyone out. Google saw this and decided to do the opposite to prevent too much attention. Now Google can improve quietly and nobody will notice.


I think it's probably a good idea that GPT4 avoids legal or psychological tasks. Those are areas where giving incorrect output can have catastrophic consequences, and I can see why GPT4's developers want to avoid potential liability.


I wasn't asking for advice. I was in fact exploring ways to measure employee performance (using GPT as a search engine) and criticized one of the assumptions in one of the theoretical frameworks based on my own experience. GPT retorted "burnout", I retorted "harassment grounded in material evidence", and it spat out a generic block about how it couldn't provide legal or psychological guidance. I wasn't asking for advice; it was just a point in a wider conversation. Switching to GPT-4-0314, it proceeded with a reply that matched the discussion's main topic (metrics in HR); switching back to GPT-4-0613, it output the exact same generic block.


Yes, considering those are fields where humans have to be professionally educated and licensed, and also carry liability for any mistakes. It probably shouldn't be used for civil or mechanical engineering either.


That's right, no serious work should be done with it. But programming is fine.

https://xkcd.com/2030/


It's ridiculous. We should have access to a completely un-neutered version if we want it; we should just have to sign for access if they are worried about being sued.


The query explicitly asks it to add no other text to the code.

> it's improved performance from a human perspective.

Ignoring explicit requirements is the kind of thing that makes modern day search engines a pain to use.


If I'm using it from the web UI, this is exactly what I would want—this allows the language model to define the output language so I get correct syntax highlighting, without an error-prone secondary step of language detection.

If I'm using it from the API, then all I have to do is strip out the leading backticks and language name if I don't need to check the language, or alternatively parse it to determine what the output language is.
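As a rough sketch of that stripping step (the helper name and regex here are mine, not anything the API provides), something like this handles the common case of a single fenced block:

    import re

    # Rough sketch of the stripping step described above: pull the code and
    # optional language tag out of a Markdown-fenced model response.
    def strip_fences(response):
        match = re.search(r"```(\w+)?\n(.*?)```", response, re.DOTALL)
        if match:
            return match.group(2), match.group(1)  # (code, language tag or None)
        return response, None  # no fence found: return the response unchanged

    code, lang = strip_fences("```python\nprint('hi')\n```")
    print(lang)  # python
    print(code)  # print('hi')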

It seems to me that in either case this is actually strictly better, and annotating the computer programming language used doesn't feel to me like extra text—I would think of that requirement as prohibiting a plaintext explanation before or after the code.


It is still ignoring an explicit requirement, which is almost always bad. The user should be able to override what the creator/application thinks is 'strictly better'. Exceptions probably exist, but this isn't one of them.


Like I said, I read their requirement differently—the phrase they use is "the code only".

I don't think that including backticks is a violation of this requirement. It's still readily parseable and serves as metadata for interpreting the code. In the context of ChatGPT, which typically will provide a full explanation for the code snippet, I think this is a reasonable interpretation of the instruction.


I understand your point better now. I'm still not sure how I feel about it, because code + metadata is still not "the code only", but it's not a totally unreasonable interpretation of the phrase.


This sounds a lot like a disagreement over product decisions that the development team has made, and not at all what people normally think when you say "a new paper has proved GPT-4 is getting worse over time."


> what people normally think

While we're doing Keynesian beauty contests, I think that 98% of the time when people say that a product is getting worse over time, they're referring to product decisions the development team has made, and how they have been implemented.


This comment has an interesting take on it, haven't read the paper to verify the take:

https://news.ycombinator.com/item?id=36781968

EDIT: FWIW I haven't noticed any such regression. I don't generally use it to find prime numbers, but I do use it for coding, and have been really impressed with what it's able to do.

8<---

This paper is being misinterpreted. The degradations reported are somewhat peculiar to the authors' task selection and evaluation method and can easily result from fine tuning rather than intentionally degrading GPT-4's performance for cost saving reasons.

They report 2 degradations: code generation & math problems. In both cases, they report a behavior change (likely fine tuning) rather than a capability decrease (possibly intentional degradation). The paper confuses these a bit: they mostly say behavior, including in the title, but the intro says capability in a couple of places.

Code generation: the change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code. They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it.

Math problems (primality checking): to solve this the model needs to do chain of thought. For some weird reason, the newer model doesn't seem to do so when asked to think step by step (but the current ChatGPT-4 does, as you can easily check). The paper doesn't say that the accuracy is worse conditional on doing CoT.

The other two tasks are visual reasoning and answering sensitive questions. On the former, they report a slight improvement. On the latter, they report that the filters are much more effective — unsurprising since we know that OpenAI has been heavily tweaking these.

In short, everything in the paper is consistent with fine tuning. It is possible that OpenAI is gaslighting everyone by denying that they degraded performance for cost saving purposes — but if so, this paper doesn't provide evidence of it. Still, it's a fascinating study of the unintended consequences of model updates.


> Code generation: the change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code. They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it.

In the prompt they specifically request only the Python code, no other output. An “attempt to be helpful” that directly contradicts the user’s request seems like it should count against it.


That's false. If it outputs formatted code, it's easier to read. I don't see the backticks; I see formatted code when using the chat interface.


I guess, but that still isn't the sort of degradation people have been talking about. It's not a useful data point in that regard.


I mean, if you were hoping to use the API to generate something machine-parseable, and that used to work, but it doesn't any more, then sure, that's a sort of regression. But it's not a regression in coding; it's a regression in following specific kinds of directions.

I certainly have found quirks like this; for instance, for a while I was asking it questions about Chinese grammar; but I wanted it only to use Chinese characters, and not to use pinyin. I tried all sorts of prompt variations to get it not to output pinyin, but was unsuccessful, and in the end gave up. But I think that's a very different class of failure than "Can't output correct code in the first place".


Fine tuning or not, it's definitely proof that one should not rely on it apart from very specific use cases (like a lorem ipsum generator or something).


Just to be clear, you're saying that because they're tweaking GPT-4 to give more explanations of code, you shouldn't rely on it for coding?

Obviously if that's your own preference, I'm not going to tell you that you're wrong; but I think in general, most people wouldn't agree with that statement.


I'm still wondering, why should anyone rely on AI generated answers? They are logically no better than search engine results. By that I mean, you can't tell if it's returning absolute trash or spot on correct. Building trust into it all is going to be either a) expensive or b) driven by all the wrong incentives.


> By that I mean, you can't tell if it's returning absolute trash or spot on correct.

You use it for things which are 1) hard to write but easy to verify -- like doing drudge-work coding tasks for you, rewording an email to be more diplomatic, or coming up with good tweets on some topic -- or 2) things where it doesn't need to be perfect, just better than what you could do yourself.

Here's an example of something last week that saved me some annoying drudge work in coding:

https://gitlab.com/-/snippets/2567734

And here's an example where it saved me having to skim through the massive documentation of a very "flexible" library to figure out how to do something:

https://gitlab.com/-/snippets/2549955

In the second category: I'm also learning two languages; I can paste a sentence into GPT-4 and ask it, "Can you explain the grammar to me?" Sure, there's a chance it might be wrong about something; but it's less wrong than the random guesses I'd be making by myself. As I gain experience, I'll eventually correct all the mistakes -- both the ones I got from making my own guesses, and the ones I got from GPT-4; and the help I've gotten from GPT-4 makes the mistakes worth it.


I think you have pointed out the two extremely useful capabilities.

1. Bulky edits. These are conceptually simple but time consuming to make. Example: "Add an int property for itemCount and generate a nested builder class."

GPT-4 can do these generally pretty well and take care of other concerns, like updating hashCode/equals, without you needing to specify it (a rough sketch of the resulting shape is below).

2. Iterative refactoring. When generating utility or modular code, you can very quickly do dramatic refactoring by asking the model to make the changes you would make yourself, at a conceptual level. The only limit is the model's context window. I have found that in Java or Python, GPT-4 is very capable.
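As a rough Python analogue of the "after" state of the bulky edit in point 1 (the class and field names are invented for illustration): the new item_count field is threaded through the constructor, the nested builder, and __eq__/__hash__ in one pass.

    # Hypothetical "after" state of the bulky edit described in point 1,
    # sketched in Python: item_count has been added, and the nested Builder
    # plus __eq__/__hash__ have been kept in sync with it.
    class Order:
        def __init__(self, customer, item_count):
            self.customer = customer
            self.item_count = item_count

        def __eq__(self, other):
            return (isinstance(other, Order)
                    and (self.customer, self.item_count)
                    == (other.customer, other.item_count))

        def __hash__(self):
            return hash((self.customer, self.item_count))

        class Builder:
            def __init__(self):
                self._customer = ""
                self._item_count = 0

            def customer(self, value):
                self._customer = value
                return self

            def item_count(self, value):
                self._item_count = value
                return self

            def build(self):
                return Order(self._customer, self._item_count)

    order = Order.Builder().customer("acme").item_count(3).build()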


Like any other information, you cross-check it.

I use it to generate code I'd otherwise get from libraries: graph-theory algorithms, special data structures, etc.
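As a sketch of what that cross-checking looks like in practice (the union-find below stands in for hypothetical model-generated code; all names are illustrative), you compare it against a trivially correct reference on random inputs:

    import random

    # Stand-in for hypothetical model-generated code: a disjoint-set
    # (union-find) with path halving.
    class DSU:
        def __init__(self, n):
            self.parent = list(range(n))

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    # Trivially correct reference: explicit edge list plus graph search.
    def connected(edges, n, a, b):
        adj = {i: [] for i in range(n)}
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        seen, stack = {a}, [a]
        while stack:
            u = stack.pop()
            if u == b:
                return True
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return False

    # Cross-check on random inputs: the two must always agree.
    for _ in range(200):
        n = 8
        dsu, edges = DSU(n), []
        for _ in range(10):
            u, v = random.randrange(n), random.randrange(n)
            dsu.union(u, v)
            edges.append((u, v))
        for a in range(n):
            for b in range(n):
                assert (dsu.find(a) == dsu.find(b)) == connected(edges, n, a, b)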


Like clockwork, it's coming out that the original paper was wildly misinterpreted: https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...



