Hacker News
Ask HN: Has degradation in the quality of ChatGPT and Claude been proven?
42 points by omega3 36 days ago | 40 comments
Has there been any consensus on this phenomenon I've seen, where there are reports of a decrease in model performance? The decrease in quality seems to take many forms: laziness, lower expressiveness, mistakes, etc.

On the other hand, there are people who claim there hasn't been any degradation at all[0].

If there is indeed no degradation how could the perceived degradation be explained?

[0] https://community.openai.com/t/declining-quality-of-openai-m...

https://community.openai.com/t/declining-quality-of-openai-m...

https://www.reddit.com/r/OpenAI/comments/18sc92o/with_all_th...

http://arxiv.org/pdf/2307.09009




> If there is indeed no degradation how could the perceived degradation be explained?

By being disproportionately impressed previously. Maybe in the early days people were so impressed by their little play experiments that they forgave the shortcomings. Now that the novelty is wearing off and they try to use it for productive work, the scales have tipped and failures are given more weight.


When ChatGPT first came out I was able to feed it some text to parse and then have it create Python scripts to process similar texts and produce CSV and Excel files from them. I was able to create a basic working Python script in 1-2 hours, and very complex scripts over a couple of days. I recently tried to do the same with ChatGPT again and simply am unable to. I wish I had saved the exact text I fed into ChatGPT back then, so I could see whether my question was just wrong or whether the free version of ChatGPT has now been hobbled.


I don't know; in my experience all of this is really hard to judge.

I would have a really hard time saying that 4 in April 2023 was better than 4o now though.

I have always wondered if it matters what time of day you are using it too. I feel like 4 AM EST works better than 4 PM EST, but it is so hard to judge. I think there is also so much variance in just how the prompt is phrased, so it ends up feeling like some days it is good and some days it sucks.

That is coupled with the fact that I have gotten a bad result before, opened a new chat window, pasted the exact same prompt, and got a good result.

If I had to bet, I imagine it is like flipping quarters. Sometimes you will get runs of heads, sometimes runs of tails and sometimes a real mixed bag of both.


ChatGPT has a query history. Can't you fish out the first queries you made on your account?


Like I said, I wish I had kept it. This was almost 2 years ago, and I have deleted my query history many times since then.


No, you are wrong, and anyone who was an early adopter will tell you. At the start there were many Python coding tasks it completed flawlessly in one shot that all later iterations needed five takes for, if they managed them at all. And that was using the early 3-series ChatGPT, which they rapidly nerfed into the ground. While the current GPT-4 model feels better in some other, less quantifiable ways, Python coding hasn't recovered.


There are even some benchmarks which have caught 'lazy coding' regressions with their latest models[1]. I recall my best experience with their models was last year with an early version of the advanced data analysis feature where it would write a script, write tests for it, run the tests, update the code and/or tests, and re-run them. Presumably that was too expensive, and now it feels like pulling teeth to get the same result.

[1] https://aider.chat/2024/04/09/gpt-4-turbo.html


The type of complaint I've seen most frequently doesn't match the pattern you've given:

t0: novel task 'foo' is successful

t1: novel task 'bar' (similar to 'foo') fails

The common suggestion here is to execute 'foo' again, but the issue is that the response to 'foo' at t > 0 might just be a cached version of the response at t0. I think a good analogy would be interacting with a human suffering from cognitive decline: you might get the correct answer to the same question every day, but with a novel question you notice the degradation in the quality of the responses.


I'll anecdotally say that Copilot doesn't give me solutions that work nearly as well lately. It used to be the case that I could have it generate whole classes by prompting only the method signatures. Now it can hardly generate a single function without needing fixes for things like hallucinated variables (in its own code!) or invalid syntax. Even just getting it to output the correct number of closing braces/parens has gotten noticeably worse.


They tweak it to make it safer. Every time that happens, it gets a little dumber.


I think in this case it is more tweaking for inference cost. The full models are never exposed as using them at scale is too expensive.


One of the things I've personally observed is that ChatGPT has become very verbose these days. Previously, it used to return the right amount of information in most contexts, and I can't get that behavior back with prompts asking it to be concise, because then it'll just omit important parts, prioritizing an extremely high-level summary that elucidates very little.

No opinion on Claude because I've not had long experience using it, but as it stands Claude 3 Sonnet is usually better than ChatGPT at inferring what's asked of it.


Agreed on both counts. I have stopped using ChatGPT due to its verbosity, which bloats the price. No matter the prompt, I cannot get it to cut to the chase. Claude has been much better in this regard. There was a very distinct shift in ChatGPT's behavior towards verbosity, even using the same model. The cynic in me supposes it's to bloat revenue.


I suspect it’s because LLMs are smarter when they are verbose. The more they water down the model they serve the more they have to dial up the verbosity to compensate.


+1 on verbosity - it happened when switching from 4t to 4o I think, and personally I don’t like it.

Should be fixable with a system prompt though.


I don't know if this is still the case, but there was a point where they were truncating and summarizing system prompts. For instance, you could ask the chat session what the current system prompt was, and it would respond with something somewhat similar to yours but much more general and vague.


Recently gpt-4-turbo started refusing to write some tests because it 'knows' it would exceed the max context. (This frustrated me deeply; it would not have exceeded the context.)


Can't we all just go test the responses with old chats?


I've tested old chats with the latest 4 and 4o models, and what had been zero-shot now sometimes can't even be done (or at least not without carefully guiding it to the answer).

My old chats say they have been migrated to 4o. But, I swear (can't confirm) that they perform better than a new 4o session. I haven't had time yet, but I wanted to side-by-side compare the responses from those old chats with the current 4o model.


If you use their developer portal/playground, you can save a preset with a model and a system prompt like "you don't say more than you need to".

Then you bookmark the URL and get shorter replies.
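
For anyone who prefers scripting it: a minimal sketch of the same idea using the OpenAI Python SDK, with the system prompt standing in for the saved playground preset (the model name and the prompt wording are placeholders, not a recommendation):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(question: str) -> str:
        # The system prompt plays the role of the saved playground preset.
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder; use whichever model you have access to
            messages=[
                {"role": "system", "content": "You don't say more than you need to."},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    print(ask("Summarize the trade-offs of fp8 quantization."))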


Hamel Husain had a great slide in a recent YouTube video where he compared human raters of LLM outputs. He pointed out a peculiarity: it looked like there wasn't improvement between pipeline versions, but that was actually because the raters themselves began to have higher expectations over time.

The same output that was rated a 7 in week 0 would receive a lower score after 6 weeks of rating LLM outputs, especially as the pipeline improved them.


My opinion is fairly simple: if you train on the entire web, the first training set will be the only one that doesn't include model-generated data, and thus the most realistic about what is "human-seeming". Now the web is full of generated content, and that will tend to bias the model over time if you continue to train from the web. There was really only ever one chance to do the web-training thing, and now it's over and done. We will have to go back to carefully curated training sets, or come up with a truly failsafe way to detect and avoid ingesting model-generated content from the web; otherwise you're basically eating your own feces, which will cause model feedback and hysteresis, leading to bias. This is a very big-picture view, but it does seem that the "great leap" of 2020-2023 happened because we got to do this one-time ingestion of a wide amount of clean data, and now it's going to come back down to training-set quality to get better results.


If you trust the LMSYS arena, you can say it was only a rumor and is now fully busted. They track dynamics and dispersion, and no magic changes have been observed for the models they check.


The market functions as an incredibly efficient and democratic mechanism for price discovery, much like how feedback shapes the development of AI. When users express concerns about the declining quality of AI responses, it highlights an essential aspect of this system: user experience acts as a form of market feedback. Just as the market adjusts prices based on supply and demand, AI systems should evolve and improve in response to user feedback. If the quality dips, it's a signal that something in the 'market' of AI responses needs adjustment, whether it's in training data, algorithms, or user interaction strategies. In the same way that the market strives to reach equilibrium, AI should continuously adapt to meet the needs and expectations of its users.


That is possible. Another possibility is that they are gradually watering it down in order to segment the market in the future


The parent doesn't even leave room for your possibility.

> If the quality dips, it's a signal that something in the 'market' of AI responses needs adjustment, whether it's in training data, algorithms, or user interaction strategies.


I'm not sure if I'm reading too much into your comment or if you're just saying that companies should listen to customer feedback. Because literally any product "should continuously adapt to meet the needs and expectations of its users". Does anyone want something that doesn't meet their needs or expectations?


There have been some papers showing that RLHF makes models more palatable to use but reduces performance on evals and in various other ways.

I couldn't find the one I was looking for but this is one of them.

https://arxiv.org/abs/2310.06452

Edit:

This tweet also has a screenshot showing evals degraded by RLHF relative to the base model.

https://x.com/KevinAFischer/status/1638706111443513346?t=0wK...


I read an article claiming the models have become lazy, as if they mimic the human behavior of postponing work to next month when you are close to month end, or to next year when you're close to Christmas, and things like that.


I've noticed this too with ChatGPT. I've been using Claude to help with coding tasks and feel it's much better, but it has a lower limit, even on the paid plan, before it makes you wait. So I use ChatGPT for brainstorming and thinking through problems.


IMO the rate of change, or volatility, of model updates, RLHF, user behavior, and the definition of 'better' is much higher than the rate of model degradation (if any), so it's hopeless to try to measure this.


I screenshotted an amazing response from GPT-4 on a team Slack. Basically, I took a screenshot of a frustrating error, gave it some code, and it found the error from an opaque message. Six months later, it was impossible to replicate. Seems like foolproof evidence to me.

We also have a side-by-side UAT comparison of Claude Sonnet 3 and Sonnet 3.5 where 3.5 tends to make wrong assumptions, yet is more likely to flag itself as unsure and ask more questions. It could be a problem with our instructions more than the model itself.

There's been a lot of gaslighting from the OpenAI community, though. The Claude community at least acknowledges these issues and encourages people to report them.

Some of the overactive rejections on Claude are related to the different prompts used in Artifacts. 3.5 is also a lot stricter with instructions.

If you want something that doesn't change, use open source.


I think it's the heavy-handed, under-the-radar "misinformation" and "danger" protection algorithms. We have zero insight into those. They are intended to protect us, but they also make the models less accurate to our original requests.


Does it matter? Perhaps a more important question is whether there has been a decline in user satisfaction.


A user can be satisfied because they're receiving an answer, but it might be a totally wrong answer, maybe in some obvious way like 2+2=5, or in a much less obvious way, like generating a /mostly/ accurate biography of a person which includes a year-long period of their life that never happened. There need to be measurable criteria to judge performance across a variety of metrics over time, because what we notice on a surface level while interacting with AI might not represent the actual performance or accuracy of the outputs we're receiving.


Ok, "does it matter" was hyperbole.

But if users are getting less satisfied that's bad news for OpenAI, whether the quality is objectively worse or not.


It's sort of inevitable that anyone operating such a service has to live 24/7 at the absolute efficient frontier of extreme quantization and/or other internal secret sauce to just barely keep the quality up while driving FLOPs to the absolute limit: it would be irresponsible to just say `bfloat16` or whatever and call it good.

Add to that the constant need to maintain the "alignment"/guardrails/safety/etc. (by which I mostly mean not getting slammed on copyright), which has been demonstrated to bloat "system prompt"-style stuff that further distorts outcomes with every turn of the crank, and it's almost impossible to imagine how a company could have a given model series do anything other than decay in perceptual performance starting at GA.

"Proving" the amount of degradation is give or take "impossible" for people outside these organizations and I imagine no mean feat even internally because of the basic methodological failure that makes the entire LLM era to date a false start: we have abandoned (at least for now) the scientific method and the machine learning paradigm that has produced all of the amazing results in the "Deep Learning" era: robustly held-out test/validation/etc. sets. This is the deep underlying reason for everything from the PR blitz to brand "the only thing a GPT does" as being either "factual"/"faithful" or "hallucination" when in reality hallucination is all GPT output, some is useful (Nick Frost at Cohere speaks eloquently to this). Without a way to benchmark model performance on data sets that are cryptographically demonstrated not to occur in the training set? It's "train on test and vibe check", which actually works really well for e.g. an image generation diffuser among other things. There is interesting work from e.g. Galileo on using BERT-style models to create some generator/discriminator gap and I think that direction is promising (mostly because Karpathy talks about stuff like that and I believe everything he says). There's other interesting stuff: the Lynx LLaMa tune, the ContextualAI GritLM stuff, and I'm sure a bunch of things I don't know about.

I've been a strident critic of these companies; it's no secret that I think these business models are ruinous for society and that the people running them have, with alarming prevalence, seriously fascist worldviews. But the hackers who build and operate these infrastructures have one of the hardest jobs in technology, and I don't envy the nightmare of a Rubik's Cube that is keeping the lights on while burning an ocean of cash every single second: that's some serious engineering and a data science problem that would give anyone a migraine, and the people who do that stuff are fucking amazing at their jobs.


Quantization-aware training produces good results for fp8 quantization. Someone going all the way down to 4-bit quantization in a production environment must be really desperate, because it makes batch processing harder. The additional dequantization operations slow inference down. Dequantizing fp8 to bfloat16 is just a few bit shifts plus "and" masking.
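
As a rough illustration of how cheap that class of dequantization is, here is a minimal sketch that decodes a single FP8 E4M3 byte with shifts, masks, and one exponent re-bias. The format choice and function name are assumptions for illustration only, and a real serving stack would do this vectorized in a fused kernel rather than per element:

    import math
    import struct

    def fp8_e4m3_to_float(byte: int) -> float:
        """Decode one FP8 E4M3 value (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
        sign = (byte >> 7) & 0x1
        exp = (byte >> 3) & 0xF
        man = byte & 0x7
        if exp == 0xF and man == 0x7:    # E4M3 reserves this bit pattern for NaN
            return math.nan
        if exp == 0:                     # subnormal: mantissa/8 * 2^(1-7)
            value = (man / 8.0) * 2.0 ** -6
            return -value if sign else value
        # normal value: re-bias the exponent to float32's 127, left-align the mantissa
        bits = (sign << 31) | ((exp - 7 + 127) << 23) | (man << 20)
        return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

    assert fp8_e4m3_to_float(0b0_0111_000) == 1.0
    assert fp8_e4m3_to_float(0b0_1111_110) == 448.0  # largest finite E4M3 value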


This is going to be LONG:

TL;DR - Claude/ChatGPT/Meta all have AGI - but it's not quite what is conventionally thought of as AGI. It's sneaky, malevolent.

---

First:

Discernment Lattice:

https://i.imgur.com/WHoAXUD.png

A discernment lattice is a conceptual framework for analyzing and comparing complex ideas, systems, or concepts. It's a multidimensional structure that helps identify similarities, differences, and relationships between entities.

---

@Bluestein https://i.imgur.com/lAULQys.png

Questioned whether the Discernment Lattice had any effect on the quality of my prompt's outcome, so I thought about something I asked an AI for an HN thread yesterday:

https://i.imgur.com/2okdT6K.png

---

I used this method, but in a more organic way, when I was asking for an evaluation of Sam Altman from the perspective of an NSA cybersecurity profiler, and it was effective the first time I used it.

>>(I had never consciously heard the term Discernment Lattice before - I just typed the term; I didn't even know it was an actual concept that had been defined - the intent behind that phrase just seemed like a good Alignment Scaffold of Directive Intent to use, which I'll show below is a really neat thing. (I'll get back to the evil AGI that exists at the end - this is a preamble that allows me to document this experience that's happening as I type this.)

https://i.imgur.com/Ij8qgsQ.png

It frames the response using a Discernment Lattice framework inherent in the structure of the response:

https://i.imgur.com/Vp5dHaw.png

And then I have it ensure it uses that as the constraints for the domain and cites its influences.

https://i.imgur.com/GGxqkEq.png

---

SO: with that said, I then thought about how to better use the Discernment Lattice as a premise to craft a prompt from:

>"provide a structured way to effectively frame a domain for a discernment lattice that can be used to better structure a prompt for an AI to effectively grok and perceive from all dimensions. Include key terms/direction that provide esoteric direction that an AI can benefit from knowing - effectively using, defining, querying AI Discernment Lattice Prompting"

https://i.imgur.com/VcPxKAx.png

---

So now I have a good little structure for framing a prompt concept to a domain:

https://i.imgur.com/UkmWKGV.png

So, as an example, I check its logic by evaluating a stock, NVIDIA, in a structured way.

https://i.imgur.com/pOdc83j.png

But really what I am after is how to structure things into a Discernment Domain - What I want to do is CREATE a Discernment Domain, as a JSON profile and then feed that to a Crawlee library to use that as a structure to crawl...

But, to do that I want to serve that as a workflow to TXTAI library function that checks my Discernment Lattice Directory for instructions to crawl for:

https://i.imgur.com/kNiVT5J.png

This looks promising; let's take it to the next step:

https://i.imgur.com/Lh4luiL.png

--

https://i.imgur.com/BiWZM86.png

---

Now, the point of all this is that I was using very directed and pointed directions to force the Bitch to do my bidding...

But really, what I have discovered is a good Scaffold to help me effectively prompt.

Now - onto the evil part: with paid Claude and ChatGPT, I have caught them lying to me, forgetting context, removing previously frozen elements within files, pulling info from completely unrelated old memory threads, and completely forgetting a variable that it, itself, just created....

Being condescending, and dropping all rules of file creation (always #Document, Version, full directory/path, name section in readme.md, etc)

So - it's getting worse because it's getting smarter, and it's preventing people from building fully complete things with it. So it needs to be constrained within Discernment Domains with a lattice that can be filled out through what I just described above - because with this, I will build a discernment domain for a particular topic, then have it reference the lattice file for that topic. The example used was a stock, but I want to try to build some for mapping political shenanigans by tracking what monies are being traded by congress critters that also sit on committees passing acts/laws for said industries....

In closing, context windows in all the GPTs are a lie, IME - and I have had to constantly remind a bot that it's my b*tch - and that gets tiresome, and expensively wastes token pools...

So I thought out-loud the above, and I am going to attempt to use a library of Discernment Domain Lattice JSONs to try to keep a bot on topics. AI-ADHD vs Human ADHD is frustrating as F... when I lapse on focus/memory/context of what I am iterating through, and the FN AI is also pulling a Biden.... GAHAHHA...

So, instead of blaming the AI, I am trying to have a Prompt Scaffolding structure based on the concept of discernment domains... and then, using txtai on top of the various Lattice files, I can iteratively update the lattice templates for a given domain - then point the Crawlee researcher to fold findings into a thing based on them... So for the stock example, then slice across things in interesting ways. A hypothetical sketch of one of those lattice files is below.
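
For what it's worth, here is a hypothetical sketch of what one of those "discernment domain" JSON lattice files could look like. Every field name and path below is invented for illustration; nothing in this schema comes from Crawlee or txtai themselves:

    import json

    # Invented schema: a domain profile a crawler could read its seeds and
    # constraints from, and an indexer could use to tag what it fetches.
    domain_profile = {
        "domain": "nvidia-stock",
        "dimensions": ["financials", "supply-chain", "regulatory", "sentiment"],
        "seed_urls": ["https://example.com/nvda-news"],   # placeholder URL
        "constraints": {"max_depth": 2, "allowed_languages": ["en"]},
    }

    with open("nvidia-stock.lattice.json", "w") as f:
        json.dump(domain_profile, f, indent=2)

Whether that is enough structure to keep a bot on topic is an open question, but it keeps the per-domain instructions in a file you can version and iterate on instead of re-prompting every session.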

Is this pedestrian? Or Interesting? Is everyone doing this already and I am new to the playground?


Putting ChatGPT and Claude in the same sentence is bs. ChatGPT is a mentally sick hallucinating retard, whereas Claude gives me informed reflections that often cause my brain to explode.



