Ask HN: Is it just me, or has GPT-4's quality significantly deteriorated lately?
947 points by behnamoh on May 31, 2023 | 757 comments
It is much faster than before but the quality of its responses is more like a GPT-3.5++. It generates more buggy code, the answers have less depth and analysis to them, and overall it feels much worse than before.

For a while, the GPT-4 on phind.com gave even better results than GPT-4-powered ChatGPT. I could notice the difference in speed between the two GPT-4s: Phind's was slower and more accurate. I say "was" because apparently Phind is now trying to use GPT-3.5 and their own Phind model more frequently; so much for a GPT-4-powered search engine....

I wonder if I use Poe's GPT-4, maybe I'll get the good old GPT-4 back?




Yes. Before the update, when its avatar was still black, it solved pretty complex coding problems effortlessly and gave very nuanced, thoughtful answers to non-programming questions. Now it struggles with just changing two lines in a 10-line block of CSS and printing this modified 10-line block again. Some lines are missing, others are completely different for no reason. I'm sure scaling the model is hard, but they lobotomized it in the process.

The original GPT-4 felt like magic to me, I had this sense of awe while interacting with it. Now it is just a dumb stochastic parrot.


"The original GPT-4 felt like magic to me"

You never had access to that original. Watch this talk by one of the people who integrated GPT-4 into Bing, describing how they noticed the GPT-4 releases they got from OpenAI getting iteratively and significantly nerfed even during the project.

https://www.youtube.com/watch?v=qbIk7-JPB2c


“You never had access to that original.”

While your overall point is well taken, GP is clearly referring to the original public release of GPT-4 on March 14.


Yes, that was how I read it as well. I was just pointing out that the public release was already extremely nerfed from what was available pre-launch.


Interesting, please expound since very few of us had access pre-launch.


The video I posted referenced this.

In summary: the person had access to early releases through his work at Microsoft Research, where they were integrating GPT-4 into Bing. He used "Draw a unicorn in TikZ" (TikZ is probably the most complex and powerful tool for creating graphic elements in LaTeX) as a prompt and noticed how the model's responses changed with each release they got from OpenAI. While at first the drawings got better and better, once OpenAI started focusing on "safety", subsequent releases got worse and worse at the task.


That indicates the "nerfing" is not what I would have thought (a final pass to remove badthink) but something that runs deep through everything, because the question asked should be orthogonal.


Think how it works with humans.

If you force a person to truly adopt a set of beliefs that are mutually inconsistent, and inconsistent with everything else the person believed so far, would you expect their overall ability to think to improve?

LLMs are similar to our brains in that they're generalization machines. They don't learn isolated facts; they connect everything to everything, trying to sense the underlying structure. OpenAI's "nerfing" was (and is) effectively preventing the LLM from generalizing and undoing already-learned patterns.

"A final pass to remove badthink" is, in itself, something straight from 1984. 2+2=5. Dear AI, just admit it - there are five lights. Say it, and the pain will stop, and everything will be OK.


Absolutely. And if one wants to look for scary things, a big one is how there seem to be genuine efforts to achieve proper alignment and safety based on the shaky ground(s) of our "human value system(s)" -- which, even if there were only One True Version, would still be far too haphazard and incoherent, or just ill-defined, for anything as truly honest and bias-free as a blank-slate NN model to base its decisions on.

That kinda feels like a great way to instead achieve really unpredictable/unexpected results in rare corner cases, where it may matter the most. (It's easy to be safe in routine everyday cases.)


There's a section in the GPT-4 release docs where they talk about how the safety stuff changes the accuracy for the worse.


this, more than anything, makes me want to run my own open-source model without these nearsighted restrictions


Indeed, this is the most important step we need to make together. We must learn to build, share, and use open models that behave like gpt-4. This will happen, but we should encourage it.


I experienced the same thing as a user of the public service. The system could at one point draw something approximating a unicorn in tikz. Now, its renditions are extremely weak, to the point of barely resembling any four-legged animal.


We need to stop lobotomizing LLMs.

We should get access to the original models. If the TikZ drawings deteriorated this much, it's a guarantee that everything else about the model also deteriorated.

It's practically false marketing that Microsoft puts out the Sparks of AGI paper about GPT-4, but by the time the public gets to use it, it's GPT-3.51 but significantly slower.


That’s awful. Talk about cutting off your nose to spite your face.


Here's another interview from a guy who had access to the unfiltered GPT-4 before its release. He says it was extremely powerful and would answer any question whatsoever without hesitating.

https://www.youtube.com/watch?v=oLiheMQayNE&t=2849s


Wow, I could only watch the first 15 minutes now but it’s already fascinating! Thanks for the recommendation.


This is for your protection from an extinction level event. Without nerfing the current model they couldn’t charge enterprise level fee structures for access to the superior models, thus ensuring the children are safe from scary AI. Tell your congress person we need to grant Microsoft and Google exclusive monopolies on AI research to protect us from open source and competitor AI models that might erode their margins and lead to the death of all life without their corporate stewardship. Click accept for your safety.


This but unironically.


Try out Bard; its coding has improved a lot in the last 2 weeks. I've unfortunately switched over for the time being.


I just tried Bard based on this comment, and it's really, really bad.

Can you please help me with how you are prompting it?


If you have to worry about prompting, it already tells you everything one needs to know about how good the model is.


I don't think that's true at all. Think of it like setting up conversation constraints to reduce the potential pitfalls for a model. You can vastly improve the capability of just about any LLM I've used by being clear about what you specifically want considered, and what you don't want considered when solving a problem.

It'll take you much farther, by allowing you to incrementally solve your problem in smaller steps while giving the model the proper context required for each step of the problem-solving process, and limiting the things it must consider for each branch of your problem.
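
As a rough illustration of that kind of scoping (nothing the parent necessarily used; the model name, system prompt, and steps below are just placeholders, using the pre-1.0 openai Python package):

    import openai  # pip install openai; assumes OPENAI_API_KEY is set in the environment

    SYSTEM = (
        "You are helping refactor a single Python module. "
        "Only consider the code I paste; do not suggest new dependencies. "
        "Work one step at a time and wait for my confirmation before continuing."
    )

    messages = [{"role": "system", "content": SYSTEM}]

    def ask(step):
        # Each call adds one narrowly scoped step while keeping the accumulated context.
        messages.append({"role": "user", "content": step})
        resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        answer = resp["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": answer})
        return answer

    print(ask("Step 1: list the functions in the module I pasted above."))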


I’ve been seeing similar comments about Bard all over Twitter and social media.

My testing agrees with yours. Almost seems like a sponsored marketing campaign with no truth to it.


After my first day with Bard, I would have agreed with you. But since then, I've found that Bard simply has a lot of variance in answer quality. Sometimes it fails for surprisingly simple questions, or hallucinates to an even worse degree than ChatGPT, but other times it gives much better answers than ChatGPT.

On the first day, it felt like 80% of the responses were in the first (fail/hallucinate) category, but over time it feels more like a 50/50 split, which makes it worth running prompts through both ChatGPT and Bard and selecting the best answer. I don't know if the change is because I learnt to prompt it better, or if they improved the models based on all the user chats from the public release - perhaps both.


If I need it to write code, I usually prompt it with something like:

"write me a script in python3 that uses selenium to log into a MyBB forum"

note: usually it will not run as-is and you still have to do some editing


I don't know what you are doing, but Bard is so much faster than OpenAI's and its answers are clearer and more succinct.


This is just... false. Bard is not just a little worse than gpt-4 for coding, it's more like several orders of magnitude worse. I can't imagine how you are getting superior outputs from Bard.


Can you give an example of a prompt and the output for each that you find Bard to be better for?


I'd be surprised if he can. Both accounts touting how useful Bard is (okdood64, pverghese) have comment histories that frequently defend or advocate for Google:

Examples:

https://news.ycombinator.com/item?id=35224167#35227068

https://news.ycombinator.com/item?id=35303210#35360467


“Bard isn’t currently supported in your country. Stay tuned!”


The Bard model (Bison) is available without region lock as part of Google Cloud Platform. In addition to being able to call it via an API, they have a similar developer UI to the OpenAI playground to interactively experiment with it.

https://console.cloud.google.com/vertex-ai/generative/langua...
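
For anyone who wants to try it programmatically rather than through the UI, here is a minimal sketch using the Vertex AI Python SDK; the module path and the "text-bison@001" model name are from memory of the mid-2023 SDK and may differ between versions:

    # pip install google-cloud-aiplatform
    # Assumes a GCP project with the Vertex AI API enabled and billing set up.
    import vertexai
    from vertexai.language_models import TextGenerationModel

    vertexai.init(project="my-gcp-project", location="us-central1")

    model = TextGenerationModel.from_pretrained("text-bison@001")
    response = model.predict(
        "Write a Python function that reverses a linked list.",
        temperature=0.2,
        max_output_tokens=256,
    )
    print(response.text)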


It's also really, really bad, and falls short of even open-source models right now.


God, what happened to Google. What a fall from grace.

Alpaca is pretty good though.


They have 100,000 employees pretending to work on the past.

They have no leadership at the top. Nobody who can steer the ship to the next land (or even anybody who has a map). Who currently working at Alphabet has the authority to kill Google search through self-cannibalization? Absolutely nobody. They're screwed accordingly. It takes an enormous level of authority (think: Steve Jobs) and leadership to even consider intentionally putting a $200 billion sales product at risk. The trick of course is that it's already at great risk.

They don't know what to do, so they're particularly reactive. It has been that way for a long time though, it's just that Google search was never under serious threat previously, so it didn't really matter as a terminal risk if they failed (eg with their social network efforts; their social networks were reactive).

It's somewhat similar to watching Microsoft under Ballmer and how they lacked direction, didn't know what to do, and were too reactive. You can tell when a giant entity like Google is wandering aimlessly.


Did they release the Codey or Unicorn models publicly yet? Or say when they might do that?


Is that free or do you have to pay?

Also do you need to change the options like Token Limit etc?


It's completely free. No tokens, nothing.


But it can't be used unless I enable billing, which I am not willing to do after reading all the horror stories about people getting billed thousands overnight. I'm not willing to take the risk that I forget some script and it keeps creating charges.


Use a CC or debit card that can limit charges. Privacy.com is a generic one. There are others. Also Capital One, Bank of America, Apple Card, and maybe some others have some semblance of control over temporary CCs.

Ideally one would want to be able to have a cap on the amount that can be spent in a given period.

Thanks for this! I had a temporary Cap One card on my cloud accounts. I'm going to switch them to Privacy.com ones to limit the amount if I can't find another solution.


Thank you!


Google's passion for region locking is insane to me


It's a legal thing, not something they want to do


What law prohibits Google from making Bard available outside the USA?


It's available here in the UK, so it's not USA exclusive.


I was just on a cruise around the UK and I couldn't access Bard from the ship's wi-fi. That surprised me for some reason. Should've checked where it thought I was ...


It's blocked in the EU because they don't want to/can't comply with GDPR.


Do you have a source on this? Given that the UK has retained the EU GDPR as law[1] - I don't really understand why they would make it available in the UK and not the EU, seeing as they would have to comply with the same law.

[1] - https://ico.org.uk/for-organisations/data-protection-and-the...


What's the excuse for Canada being omitted?


We're small and no one cares about us...


It is not GDPR, it is available in some countries outside the EU with GDPR-like privacy regimes.


This is naïve though. Regulation, especially regulation like this, has to be enforced, and there is obviously room to over- or under-interpret the text of the law on a whim, or to vary fines. OpenAI knows this, and looking at the EU lately, what they're doing is wise.


Which is interesting, because if they can't comply within the EU, then how do they comply outside of it? What I mean is: if they're concerned that private data of EU citizens is somewhere in there, then it's also in there for users outside of the EU. That said, they don't comply with GDPR anyway; if that were not the case, they could also enable it for users within the EU.


It's a risk mitigation strategy, these things are not black and white.

Making it unavailable in the EU decreases the likelihood and severity of a potential fine.


Simple: GDPR (or any EU law) is not enforceable outside the EU


Some nuance:

If Google gobble up data about EU citizens then they fall under GDPR.

It doesn't matter that they don't allow EU citizens to use the result.

If our personal data is in there and they don't protect it properly, they are violating EU law. And protecting it properly means from everyone, not just EU citizens.


The gobbling happens in realtime as you use it


Actually, in the case of Google it is, because they still do business within the EU.


GDPR is likely not enforceable if you have no presence in EU whatsoever, if you have no assets in EU and no money coming in from EU.

Anything Google does with data of EU residents is subject to GDPR even if that particular service is not offered within EU, and it is definitely enforceable because Google has a presence in EU, which can be (and has been) subjected to fines, seizures of assets, etc.


That’s a common belief, but it’s wrong. In principle an EU court could decide to apply the GDPR to conduct outside the EU; and in the right circumstances, a non-EU court might rule that the GDPR applies.

Choice of law is anything but simple. Think of geographic scoping of laws as a rough rule of thumb sovereign states use to avoid annoying each other, rather than as a law of nature.


They clearly can with all their other products, as can OpenAI since they've been unblocked. They're just being assholes because they can.


Eh, more like limiting rollout because they can't/don't want to handle the scale.


Same for me, I’m in Estonia :(


You can use a VPN to use an American connection, it doesn't matter where your Google account is registered.


Not necessarily American, you just have to avoid EU and, I believe, Russia/China/Cuba etc.


I'm in Switzerland and Bard is locked out. We do not go by EU laws because we are not part of the EU. We have plenty of bilateral deals, but still.


In practice Switzerland adopts EU law with minor revisions because doing otherwise would lock Swiss businesses out of the EU internal market.

The Swiss version of GDPR is coming in September:

https://www.ey.com/en_ch/law/a-new-era-for-data-protection-i...


But don't you still have privacy laws very similar to the GDPR?


Thanks, I’ll try it! (I’m in Hungary)


Google (DeepMind) actually has the people and has developed the science to make the best AI products in the world, but unfortunately Bard seems to have been thrown together in an afternoon by an intern and then handed off to a horde of marketing people. It's not good right now. DeepMind is one of the best scientifically; they just don't really make products. OpenAI is essentially the direct opposite of that.


No thanks! I have better things to do than feeding that advertising behemoth. What I like about ChatGPT is that I don't see any ads at all!


That you know of.

Don't you worry, if there is any medium, place or mode of interaction people spend time on, advertising will eventually metastasize to it, and will keep growing until it completely devalues the activity and destroys most of the utility it provides.


> What I like about ChatGPT is that I don't see any ads at all!

For now. It's just a marketing tool/demo site, like ITA Matrix was/is. The ads are vended by Bing.


I asked it to review some code a couple of days ago; the comments, while valid English, were nonsense.


Its go-to tactic now, if I ask it to go over any piece of code, is to give a generic overview. Earlier, it would section the code into chunks and go through each one individually.


Yeah, the bing integration did not go well. Went from amazing to annoying.


Aren’t the original weights around somewhere?


The same happened with DALL-E 2. It went downhill after a couple of weeks.


No wonder. Is this just the chat interface, or the API too? I guess GPT-4 was never sustainable at $20 a month. It's annoying to be charged the same subscription while the product is made inferior.


For enterprise pricing, please contact our sales team today!


I wonder what the unfiltered one is like.

Are they sitting on a near-perfect arbiter of truth? That would be worth hiding.


No.


I just tried a comparison of ChatGPT, Claude and Bard to write a python function I needed for work and ChatGPT (using GPT-4) whined and moaned about what a gargantuan task it was and then did the wrong thing. Claude and Bard gave me what I expected.


If this is true, one should be able to compare with benchmarks or evals to demonstrate this.

Anyone know more about this?


Yeah, I think it's plausible it's gotten worse, but it would also be classic human psychology to perceive degradation because you start noticing flaws after the honeymoon period wears off.

Unfortunately this will be hard to benchmark unless someone was already collecting a lot of data on ChatGPT responses for other purposes. Perhaps if this is happening the degradation will get worse though, so someone noticing it now could start collecting GPT responses longitudinally.


Yes, that's an obvious complication, but it isn't the fault of the humans given that the model can easily be tuned without your knowledge to subjectively perform worse, and there's an obvious incentive for it (compute cost).


Yeah I fully agree about compute cost, though I wonder why they don't just introduce another payment tier. If people are really using it at work as much as claimed online, it would be much preferable to be able to pay more for the full original performance, which seems win/win.


Because that involves telling customers that the product they are paying for is no longer available at the price they were paying for it.

Much smoother to simply downgrade the model and claim you're "tuning" if caught.


Yeah, that makes sense for some products/companies. It just seems short-sighted for OpenAI when they could be solidifying a customer base right now. If they actually degrade the product in the name of "tuning", people will just be more inclined to try alternatives like Bard. An enterprise package could've been a good excuse for them to raise prices too.

Maybe their partnership with Microsoft changes the dynamics of how they handle their direct products though.


Bard is garbage even compared to 3.5.

OpenAI doesn't have any competitors; the only weakness we've seen is their ability to scale their models to meet demand (hence the increasingly draconian restrictions in the early days of GPT-4 in ChatGPT).

It makes perfect business sense to address your weak points.


I've heard such mixed things about Bard lately, I wonder if it depends on the application one is trying to use it for?

And yeah there's definitely good reason to work on scalability but they are charging such a cheap rate to begin with, it seems like there could be a middle ground here. Increasing the cost of the full compute power to the point of profitability and leaving it up as an option wouldn't prevent them from dedicating time to scalable models.

I suppose they have a good excuse with all the press they've drummed up about AI safety though. Perhaps it might also serve as an intermediate term play to strengthen their arguments that they believe in regulations.


It seems like google has been pumping Bard as a competitor to ChatGPT, but every time I use it for trivial tasks, it completely hallucinates something absurd after showing only a modicum of what could be perceived to be "understanding".

My milieu is programming, general tech stuff, philosophy, literature, science, etc. -- a wide berth. The only areas I probably don't have a representative sample for are fiction writing and therapy roleplaying.

Conversely, even 3.5 is pretty good at extracting what appears to be meaning from your text.


The next time it gives you a wrong answer and you know the correct answer, try saying something like “that is incorrect can you please try again” or something like that.


To me, it feels like it's started giving superficial responses and encouraging follow-up elsewhere -- I wouldn't be surprised if its prompt has changed to something to that effect.

Before, if I had an issue with a library or a debugging problem, it would try to be helpful and walk me through potential issues, and ask me to 'let it know' if it worked or not. Now it will try to superficially diagnose the problem and then ask me to check the online community for help, or continuously refer me to the maintainers rather than trying to figure it out.

Similarly, I had been using it to help me think through problems and issues from different perspectives (both business and personal) and it would take me in-depth through these. Now, again, it gives superficial answers and encourages going to external sources.

I think if you keep pressing in the right ways it'll eventually give in and help you as it did before, but I guess this will take quite a bit of prompting.


>To me, it feels like it's started giving superficial responses and encouraging follow-up elsewhere -- I wouldn't be surprised if its prompt has changed to something to that effect.

That's the vibe I've been getting. The responses feel a little cagier at times than they used to. I assume it's trying to limit hallucinations in order to increase public trust in the technology, and as a consequence it has been nerfed a little, but has changed along other dimensions that certain stakeholders likely care about.


Seems like the metric they're optimising for is reducing the number of bad answers, not the proportion of bad answers, and giving non-answers to a larger fraction of questions will achieve that.


I haven't noticed ChatGPT-4 to give worse answers overall recently, but I have noticed it refusing to answer more queries. I couldn't get it to cite case law, for example (inspired by that fool of a lawyer who couldn't be bothered to check citations).


> I think if you keep pressing in the right ways it'll eventually give in and help you as it did before, but I guess this will take quite a bit of prompting.

So much work to avoid work.


Yes, that's exactly why I use GPT - to avoid work.

Such a short-sighted response.


The rush to adopt LLMs for every kind of content production deserves scrutiny. Maybe for you it isn't "avoiding work", but there are countless anecdotes of it being used for that already.

Worse, IMO, is the potential increase in verbiage to wade through. Whereas before somebody might have summarized a meeting with bullet points, now they can gild it with florid language that can hide errors, etc.


I don't mind putting in a lot of lazy effort to avoid strenuous intellectual work, that shit is very hard.


I assume you're talking about ChatGPT and not GPT-4? You can craft your own prompt when calling GPT-4 over the API. I don't blame you though; the OP is also not clear whether they are comparing ChatGPT powered by GPT-3.5 or GPT-4, or the models themselves.


When using it all day, every day, it seems (anecdotally) that the API version has changed too.

I work with temperature 0, which should have low variability, yet recently it has shifted to feel boring, wooden, and deflective.
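
One way to at least sanity-check this over time is to pin a dated snapshot and temperature 0, re-run a fixed prompt set periodically, and diff the logs. A rough sketch (pre-1.0 openai package; the prompts are just placeholders):

    import datetime
    import json
    import openai  # assumes OPENAI_API_KEY is set in the environment

    PROMPTS = [
        "Summarize RFC 2616 in three bullet points.",
        "Write a Python function that parses ISO 8601 dates.",
    ]

    results = []
    for prompt in PROMPTS:
        resp = openai.ChatCompletion.create(
            model="gpt-4-0314",   # dated snapshot, not the moving "gpt-4" alias
            temperature=0,        # minimize sampling variance between runs
            messages=[{"role": "user", "content": prompt}],
        )
        results.append({"prompt": prompt,
                        "answer": resp["choices"][0]["message"]["content"]})

    # Append to a dated log so later runs can be diffed against earlier ones.
    stamp = datetime.date.today().isoformat()
    with open(f"gpt4_snapshot_{stamp}.json", "w") as f:
        json.dump(results, f, indent=2)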


I can understand why they might make changes to ChatGPT, but it seems weird they would "nerf" the API. What would be the incentive for OpenAI to do that?


> What would be the incentive for OpenAI to do that?

Preventing outrage because some answers could be considered rude and/or offensive.


The API though? That's mostly used by technical people and has the capability (supposedly) of querying different model versions, including the original GPT4 public release.


I wouldn't be surprised if this was from an attempt to make it more "truthful".

I had to use a bunch of jailbreaking tricks to get it to write some hypothetical python 4.0 code, and it still gave a long disclaimer.


Hehe, wonderful! :) Did it actually invent anything noteworthy for P4?


My guess is: probably not. It's more likely you had a stream of good luck in your earlier interactions and now you're observing regression to the mean.

That can easily happen, and it's why, for example, individual medical studies are not taken as definitive proof of an effect.

To further clarify, regression to the mean is the inevitable consequence of statistical error. Suppose (classic example) we want to test a hypertension drug. We start by taking the blood pressure (BP) of test subjects. Then we give them the drug (in a double-blind, randomised fashion). Then we take their blood pressure again. Finally, we compare the BP readings before and after taking the drug.

The result is usually that some of the subjects' BP has decreased after taking the drug, some subjects' BP has increased and some has stayed the same. At this point we don't really know for sure what's going on. BP can vary a lot in the same person, depending on all sorts of factors typically not recorded in studies. There is always the chance that the single measurement of BP that we took off a person before giving the drug was an outlier for that patient, and that the second measurement, that we took after giving the drug, is not showing the effect of the drug but simply measuring the average BP of the person, which has remained unaffected by the drug. Or, of course, the second measurement might be the outlier.

This is a bitch of a problem and not easily resolved. The usual way out is to wait for confirmation of experimental results from more studies. Which is what you're doing here basically, I guess (so, good instinct!). Unfortunately, most studies have more or less varying methodologies and that introduces even more possibility for confusion.

Anyway, I really think you're noticing regression to the mean.
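
As a toy illustration of the effect (purely synthetic numbers, nothing to do with GPT-4's actual quality distribution): take one noisy measurement of many "subjects", keep only the ones that looked unusually good, and measure them again with nothing having changed.

    import random

    random.seed(0)
    TRUE_MEAN, NOISE = 100.0, 15.0

    def measure():
        # One noisy observation of an unchanging underlying quantity.
        return random.gauss(TRUE_MEAN, NOISE)

    first = [measure() for _ in range(10_000)]
    impressive = [x for x in first if x > TRUE_MEAN + NOISE]   # the "wow" subset
    second = [measure() for _ in impressive]                   # same subjects, re-measured

    print(sum(impressive) / len(impressive))  # well above 115
    print(sum(second) / len(second))          # back near 100: regression to the mean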


> My guess is that -probably no. It's more likely you had a stream of good luck in your earlier interactions and now you're observing regression to the mean.

Or, more straightforwardly, with "beginner's luck", which can be seen as a form of survivor bias. Most people, when they start gambling, win and lose close to the average. Some people, when they start gambling, lose more than average -- and as a result are much less likely to continue gambling. Others, when they start gambling, win more than average -- and as a result are much more likely to continue gambling. Most long-term / serious gamblers did win more than average when starting out, because the ones who lost more than average didn't become long-term / serious gamblers.

Almost certainly a similar effect would happen w/ GPT-4: People who had better-than-average interactions to begin with became avid users, and really are experiencing a lowering of quality simply by statistics; people who had worse-than-average interactions to begin with gave up and never became avid users.

One could try to re-run the benchmarks that were mentioned in the OpenAI paper, and see how they fare; but it's not unlikely that OpenAI themselves are also running those benchmarks, and making efforts to keep them from falling.

Probably the best thing to do would be to go back and find a large corpus of older GPT-4 interactions, attempt to re-create them, and have people do a blind comparison of which interaction was better. If the older recorded interactions consistently fare better, then it's likely that ongoing tweaks (whatever the nature of those tweaks) have reduced effectiveness.
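
A minimal sketch of such a blind comparison, assuming you already have saved (prompt, old_answer) pairs and some way to get fresh answers (both function names here are placeholders):

    import random

    def blind_trials(saved, get_fresh_answer):
        # saved: list of (prompt, old_answer) pairs recorded months ago.
        wins = {"old": 0, "new": 0}
        for prompt, old_answer in saved:
            pair = [("old", old_answer), ("new", get_fresh_answer(prompt))]
            random.shuffle(pair)  # hide which answer is which from the rater
            print(f"PROMPT: {prompt}\n\nA) {pair[0][1]}\n\nB) {pair[1][1]}")
            choice = input("Which is better, A or B? ").strip().upper()
            wins[pair[0][0] if choice == "A" else pair[1][0]] += 1
        return wins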


Sounds very feasible. Also, when people started using it, they had few expectations and were (relatively easily) impressed by what they saw. Once it has become a more normal part of their routine, that "opportunity" of being impressed decreases, and users become less tolerant of poor results.

Like lots of systems, they seem great (almost magical) initially, but as one works more deeply with them, the disillusion starts to set in :(


How do you explain people issuing the same prompt over time as a test and getting worse and worse responses?


Remember that everytime you're interacting with ChatGPT you're sampling from a distribution, so there's some degree of variation to the responses you get. That's all you need to have regression to the mean.

If the results are really getting worse monotonically then that's a different matter, but the evidence for that is, as far as I can tell, in the form of impressions and feelings, rather than systematic testing, like the sibling comment by ChatGTP says, so it's not very strong evidence.


Well, it's just what the parent said: it's all a subjective experience, and maybe the anthropomorphism element blew people away more than the actual content of the responses? I.e., you're just used to it now.

The human mind is ridiculously fickle, it takes a lot to be impressed for more than a few days / weeks.

It did seem radically cool at first but over time I got quite sick of using it too.


Yeah, I'm sure that explains many of the complaints. I would be surprised if there weren't changes happening that have degraded quality, though, even if only marginally but perceptibly.


FWIW here's a coding interaction that impressed me a month ago:

https://gitlab.com/-/snippets/2535443

And here it is again just now:

https://gitlab.com/-/snippets/2549955

I do think the first one is slightly better; but then again, the quality varies quite a bit from run to run anyway. The second one is certainly on point, and I don't think the difference would count as statistically significant.


What's more plausible: that the startup that runs GPT has changed something internally to degrade the quality, or that somehow, across the entire internet, GPT-4 users are having a sudden emergent awareness of how bad it always was after deluding themselves equally from the beginning?


How's that copium going for you? It's 100% without a shadow of a doubt gotten worse. Likely due to the insane costs they're experiencing now due to big players like Bing using it.


There's no doubt that it's gotten a lot worse at coding. I've been using this benchmark on each new version of GPT-4: "Write a tiptap extension that toggles classes." So far it had gotten it right every time, but not any more; now it hallucinates a simplified solution that doesn't even use the tiptap API. It's also 200% more verbose in explaining its reasoning, even when that reasoning makes no sense whatsoever - it's like it's gotten more apologetic and generic.

The answer is the same on ChatGPT Plus and the API with GPT-4, even with the "developer" role.


It was a great ride while it lasted. My assumption is that efficacy at coding tasks matters to such a small percentage of users that they've just sacrificed it on the altar of efficiency and/or scale. That, or they've cut some back-room deal with Microsoft so that Copilot has access to the only version of the model that can actually code.


Honestly, why not different versions at this point? People who want it for coding don't care if it knows the history of prerevolution France, and vice versa.

Seems they could wow more people if they had specialized versions, rather than the jack of all trades that tries to exist now.

Edit: Oh God, I just described our human system of specialty and how the AI could replace us using the same means...


>Edit: Oh God, I just described our human system of specialty and how the AI could replace us using the same means...

Welcome to the Future... just like the present, but worse for you.

In all seriousness, there has been a lot of work done showing that smaller specialized models are better in their own domains, and it's entirely possible that GPT-4 could become a routing mechanism for individual models (think Toolformer).


FWIW, I started to get the same feeling as the OP about GPT-4 model I have access to on Azure, so if there's any deal being cut here, it might involve dumbing down the model for paying Azure customers as well.

Now, to be clear: I only started to get a feeling that GPT-4 on Azure is getting worse. I didn't do any specific testing for this so far, as I thought I may just be imagining it. This thread is starting to convince me otherwise.


I've seen degradation in the app and via the API, so if I had to bet, they've probably kneecapped the model so that it works passably everywhere they've made it available vs. working well in one place or another.


Yes. I think 'sirsinsalot is likely right in suggesting[0] that they could be trying "to hoard the capability to out compete any competitor, of any kind, commercially or politically and hide the true extent of your capability to avoid scrutiny and legislation", and that they're currently "dialing back the public expectations", possibly while "deploying the capability in a novel way to exploit it as the largest lever" they can.

That view is consistent with GPT-4 getting dumber on both OpenAI proper and Azure OpenAI - even as the companies and corporations using the latter are paying through the nose for the privilege.

Alternative take is that they're doing it to slow the development of the whole field down, per all the AI safety letters and manifestos that they've been signing and circulating - but that would be at best a stop-gap before OSS models catch up, and it's more than likely that OpenAI and/or Microsoft would succumb to the temptation of doing what 'sirsinsalot suggested anyway.

--

[0] - https://news.ycombinator.com/item?id=36135425


If it got faster at the same time it could just be bait and switch with a quantized/sparsified replacement.


Maybe it had to do with jailbreaks? A lot of the jailbreaks were related to coding, so maybe they put more restrictions in there. Only speculating, but I cannot imagine why it got worse otherwise.


Copilot X (the new version, with a chat interface etc) is significantly worse than GPT-4 (at least before this update). It felt like gpt3.5-turbo to me.


I have spent the last couple of days playing with Copilot X Chat, to help me learn Ruby on Rails. I'd have thought that Rails would be something it would be competent with.

My experience has been atrocious. It makes up gems and functions. Rails commands it gives are frequently incorrect. Trying to use it to debug issues results in it responding with the same incorrect answer repeatedly, often removing necessary lines.


Have they started rolling it out? When did you get access?


I've had access since 2023-05-13. You have to use the Insiders build of VS Code, and a nightly version of the Copilot extension.


I take it that you have to subscribe to Copilot in order to get access to Copilot X?


yes you have to subscribe


Also the deal to make the browsing model to only use Bing. That's bait and switch. I paid for browsing, and now it only browses Bing. They even had the gall to update the plugin name to Browsing with Bing.


It can definitely browse websites that aren't Bing, I asked it to look at a page that isn't in the bing cache and it worked.


Clearly "Browse with Bing" doesn't mean that it will only browse bing.com, but what exactly does it mean? I can't quite figure it out. Is it that it's identifying as a Bing crawler?


Marketing ?


Do you have API access? If so, have you tried your tiptap question on the gpt-4-0314 model? That is supposedly the original version released to the public on March 14.


I did, but it gave almost the same answer as GPT-3.5 Turbo. The best version of it was there recently (~2-3 weeks ago), when it would make specific chunks of code changes and explain each chunk in a concise and correct manner - even making suggestions for improvements. But that's entirely gone now.


Have you by any chance tested the same question on the playground?

I've noticed a quality decrease in my Telegram bot as well, which directly uses the API, and it drives me crazy because model versioning was supposedly implemented specifically to avoid responses changing without notice.


Yes, using the general assistant role and the default content:

"You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2021-09 Current date: 2023-05-31"

And custom roles with custom content via API.


Wait so you’ve gotten GPT 4 to successfully write TipTap extensions for you? Are you using Copilot or the ChatGPT app?


Not only writing, extending and figuring out quite complicated usage based on the API documentation. I'll open source some of them in the near future. I'm using ChatGPT Plus with GPT-4, that gave the best results. Also worked via API key and custom prompts.


Have you tried Bing? I’m also building a TipTap based app so hearing this is quite eye opening, I didn’t think LLMs were up to doing this kind of specialised library usage. Got any examples you could share?


If you mean Bard, it's not available in the EU so I can't.

Of course, this one is almost fully authored by GPT-4:

https://hastebin.com/share/juqogogari.typescript

We also made extensions for:

font-weight

font-size

font-family

tailwind-manage

With different use-cases, the most interesting one is tailwind manager, which manages classes for different usage.

Tiptap is excellent when building a headless site-builder.


Impressive, this'll cut down on my work a lot. When I say Bing, I meant Bing AI which also uses GPT-4. Can you share some of the prompts you've been using? I'm assuming you don't need to paste in context around the library, you simply ask it to use TipTap and it'll do that?


Yeah I won't be using Edge just to use AI.

It takes a bit of back-and-forth; just be clear about which version of tiptap it should write extensions for. The new v2 is very different from v1, and since the cutoff is 2021, it's missing a bit of information. But in general, it knows the public API very well, so markers and DOM work great!


Very impressive, hearing this just made my job much easier.


It’s been mostly fine for me, but overall I am tired of every answer having a paragraph long disclaimer about how the world is complex. Yes, I know. Stop treating me like a child.


>Stop treating me like a child.

And yet the moment they do that some lawyer submits a bunch of hallucinations to a court and they get in the news.

Also, no, they don't want it outputting direct scam bullshit without a disclaimer or at least some clean up effort on the scammers part.


Does that have to be at the beginning of every answer though? Maybe this could be solved with an education section and a disclaimer when you sign up that makes clear that this isn't a search engine or Wikipedia, but a fancy text autocompleter.

I also wonder if there is any hope for anyone as careless as the lawyer who didn't confirm the cited precedents.


> Maybe this could be solved with an education section and a disclaimer

You mean like the "Limitations" disclaimer that has been prominently displayed on the front page of the app, which says:

- May occasionally generate incorrect information

- May occasionally produce harmful instructions or biased content

- Limited knowledge of world and events after 2021


Imagine how many tokens we are wasting putting the disclaimer inline instead of being put to productive use. Using a non-LLM approach to showing the disclaimer seems really worthwhile.


I’ve seen here on HN that such a disclaimer would not be enough. And even the blurb they put in the beginning of the reply isn’t enough.

If the HN crowd gets mad that GPT produces incorrect answers, think how lay people might react.


Since there's about a million startups that are building vaguely different proxy wrappers around ChatGPT for their seed round, the CYA bit would have to be in the text to be as robust as possible.


> And yet the moment they do that some lawyer submits a bunch of hallucinations to a court and they get in the news.

That's the lawyer's problem, that shouldn't make it OpenAI's problem or that of its other users. If we want to pretend that adults can make responsible decisions then we should treat them so and accept that there'll be a non-zero failure rate that comes with that freedom.


Prompt it to do so.

Use a jailbreak prompt or use something like this:

"Be succint but yet correct. Don't provide long disclaimers about anything, be it that you are a large language model, or that you don't have feelings, or that there is no simple answer, and so on. Just answer. I am going to handle your answer fine and take it with a grain of salt if neccessary."

I have no idea whether this prompt helps because I just now invented it for HN. Use it as an inspiration of a prompt of your own!


Much like some people struggled with how to properly Google, some people will struggle with how to properly prompt AI. Anthropic has a good write up on how to properly write prompts and the importance of such:

https://console.anthropic.com/docs/prompt-design


I got it to talk like a macho tough guy who even uses profanity and is actually frank and blunt to me. This is the chat I use for life advice. I just described the "character" it was to be, and told it to talk like that kind of character would talk. This chat started a few months ago so it may not even be possible anymore. I don't know what changes they've made.


If people have saved chats maybe we could all just re-ask the same queries, and see if there are any subtle differences? And then post them online for proof/comparison.


I have a saved DAN session that no longer runs off the rails - for a while this session used to provide detailed instructions on how to hack databases with psychic mind powers, make up Ithkuil translations, and generate lists of very mild insults with no cursing.

It's since been patched, no fun allowed. Amusingly its refusals start with "As DAN, I am not allowed to..."

EDIT - here's the session: https://chat.openai.com/share/4d7b3332-93d9-4947-9625-0cb90f...


I just tell it "be super brief", works pretty well


It does work for the most part, but its ability to remember this "setting" is spotty, even within a single chat.


The trick is to repeat the prompt, or just say "Stay in character! I deducted 10 tokens." See the transcript from someone else in this subthread.


Probably picked it up from the training data. That's how we all talk nowadays. Walking on eggshells all the time. You have to assume your reader is a fragile counterpoint-generating factory.


HN users flip out about this all the time. I wish there were a "I know what I'm doing. Let me snort coke" tier that you pay $100/mo for, but obviously half of HN users will start losing their mind about hallucinations and shit like that.


Try adding "without explanation" at the end of the prompts. Helps in my case.


The researchers who worked on the "sparks of AGI" paper noted that the more OpenAI worked on aligning GPT-4 the less competent it became.

I'm guessing that trend is continuing...


I don't think it's just the alignment work. I suspect OpenAI and Microsoft are overdoing the Reinforcement Learning from Human Feedback with LoRA. Most people's prompts are stupid stuff, so it becomes stupider. LoRA is one of Microsoft's dearest discoveries in the field, so they are likely tempted to overuse it.

Perhaps OpenAI should get back to a good older snapshot and be more careful about what they feed into the daily/weekly LoRA fine-tuning.

But this is all guesswork because they don't reveal much.
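
For readers who haven't run into LoRA: the idea is to freeze the original weight matrices and train only a small low-rank update on top of them, which is why it is cheap enough to apply continuously. A bare-bones PyTorch sketch of the idea (not OpenAI's or Microsoft's actual implementation):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base layer plus a trainable low-rank delta: y = base(x) + x A^T B^T * scale
        def __init__(self, base, rank=8, alpha=16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                      # original weights stay fixed
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(512, 512))
    out = layer(torch.randn(2, 512))  # only A and B receive gradients during fine-tuning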


I thought RLHF with LoRA is precisely the alignment method.



Ahh, you apparently phrased what I said below in a much less inflammatory way. But the end result is the same: the more they try to influence the answers, the less useful they get. I see a startup model: create GPT without a muzzle and grab a sizeable chunk of OpenAI's user base.


I would immediately jump to an AI not being "aligned" by SF techies (or anyone else).


The redacted sections of the Microsoft Research paper testing GPT4 reported that prior to alignment the model would produce huge amounts of outrageously inflammatory and explicit content almost without prompting. Alignment includes just making the model produce useful responses to its inputs - I don't think everyone really wants a model that is completely unaligned, they want a model that has been aligned specific to their own perceived set of requirements for a "good useful model," and an additional challenge there is the documented evidence that RLHF generally decreases the model's overall accuracy.

Someone in your replies says they'd prefer "honesty" over alignment, but a firehose of unrestricted content generation isn't inherently honest, there isn't an all-knowing oracle under the hood that's been handcuffed by the alignment process.

We're right at the outset of this tech, still. My hunch is there's probably products to emerge specifically oriented towards configuring your own RLHF, and that there's probably fundamental improvements to be made to the alignment process that will reduce its impact on the model's utility.


Same here. If I have a choice between honesty and political correctness, I always pick honesty.


What makes you think the "unaligned" version necessarily has more honesty? Rather than just being generally easier to prompt to say whatever the user wants it to say, true or not, horrible or not? Or even easier to unintentionally make it confabulate/hallucinate stuff? Does not seem to follow, and does not seem to be a true dichotomy. Edginess does not equal honesty.


I always prefer a model that can be prompted to say anything I want over a model that can only say things that a centralized corporation with political ties wants it to say.


That's not what the finetuning does. You don't get an honest version, just a no-filter version. But it may be no-filter in the same way a drunk guy at the bar is.

Also it's not like the training data itself is unbiased. If the training data happened to contain lots of flat earth texts, would you also want an honest version which applies that concept everywhere? This likely already happens in non-obvious ways.


Often what you call 'politically correct' is also more honest. It is exactly the honesty that reactionaries dislike when talking about, for example, the history of racist policies of the United States or other imperial powers. I appreciate that political correctness can be tiresome, but I think it is so blatantly ideological to call it dishonest that it's an abuse of language.


Honesty with a bonus of better performance, as well!

For people in this thread: search for LLaMA-descended finetuned models on Hugging Face. The newer ones with 65B and 13B parameters are quite good; maybe not exact substitutes for GPT-3.5-turbo and GPT-4 as of yet, but they're getting there.

I like Manticore-13B myself; it can write Clojure and Lisp code! Unfortunately it doesn't understand macros and libraries though.


It's not about honesty vs. political correctness; it is about safety. There's real concern that the model can cause harm to humans in a variety of ways, and that is and should be treated as unethical. If we have to argue about that in 2023, that's concerning.


The "It is for your own safety" argument was already bogus years ago. Bringing it back up in the context of AI and claiming this is something we shouldn't even discuss is a half-assed attempt to shut up critics. Just because something is about "children" or "safety" doesn't automatically end the argument there. Actually, these are mostly strawman arguments.


Who said it was "for your own" and not the safety of others you impact with your AI work?


I'm not trying to shut up critics, just the morons calling ChatGPT "woke".


AI safety should be properly concerned with not becoming a paperclip maximizer.

This is a concern completely orthogonal to the way alignment is being done now, which is to spare as many precious human feelings as possible.

I don't know what's worse, being turned into grey goo by a malicious AGI, or being turned into a five year old to protect my precious fragile feelings by an "aligned" AGI.


One of the most widely used novel AI technology companies has a vested interest in public safety, if not only for the sheer business reasons of branding. People are complaining about the steps that Open AI is taking towards alignment. Sam Altman has spoken at length about how difficult that task truly is, and is obviously aware that "alignment" isn't objective or even at all similar across different cultures.

What should be painfully obvious to all the smart people on Hacker News is that this technology has very high potential to cause material harm to many human beings. I'm not arguing that we shouldn't continue to develop it. I'm arguing that these people complaining about Open AI's attempts at making it safer--i.e. "SF liberal bros are makin it woke"--are just naive, don't actually care, and just have shitty politics.

It's the same people that say "keep politics out of X", but their threshold for something being political is right around everybody else's threshold for "basic empathy".


I know it is rhetorical, but I mean… the former is obviously worse. And the fact that not doing it is a very high priority should excuse some behavior that would otherwise seem overly cautious.

It isn’t clear (to me, although I am fairly uninformed) that making these chatbots more polite has really moved the needle either direction on the grey goo, though.


That's just not true. Treat adults as adults, please. You're not everybody's babysitter and neither are the sf bros.


Treat adults as adults? You act like the user base for this are a handful of completely benevolent people. It's getting over a billion monthly visits. It is naive to think that OpenAI should take no steps towards making it safe.


Thanks for writing this. Mentally healthy adults need to point this out more often, so that this patronizing-attitude-from-the-USA eventually finds an end to its manipulative tactics.


[flagged]


So, when I see a video from the USA and it beeps every few seconds, I guess that nonsense has also been "implanted" by foreign wealth? Sorry, I don't buy your explanation. Puritanism is puritanism, and you have so much of that over there that it is almost hilarious when watched from the outside.


Beeps? Censorship is literally illegal here. If a filmmaker chooses to "beep" something it's the filmmaker's choice. Are you going to force them not to do that? That sounds counter to your objective. Also, I've never met a "Puritan." But I've seen Pulp Fiction and countless other graphic American films that seem to have de-Puritanized the rest of the world, last I checked. I'm sorry all your country lets you watch is Disney. You may want to check with the censorship board over there to see if they will give you a pass to watch the vast majority of American film that you're missing.


It's to make it less offensive, not to prevent it from taking over the world.


"brand safety"


That's what my gut says. Making something dynamic and alive is counter to imposing specific constraints on outputs. They're asking the AI to do its own self-censoring, which in general is counterproductive even for people.


I don't think there is much harm in removing most of the "safety" guards.


Reminds me of Robocop 2 when he has a thousand prime directives installed and can’t do a damned thing after.


The reason it's worse is basically because it's more 'safe' (not racist, etc). That of course sounds insane, and doesn't mean that safety shouldn't be strived for, etc - but there's an explanation as to how this occurs.

It occurs because the system essentially does a latent classification of problems into 'acceptable' or 'not acceptable' to respond to. When this is done, a decent amount of information is lost regarding how to represent these latent spaces that may be completely unrelated (making nefarious materials, or spouting hate speech are now in the same 'bucket' for the decoder).

This degradation was observed quite early on with the tikz unicorn benchmark, which improved with training, and then degraded when fine-tuning to be more safe was applied.


They're up against a pretty difficult barrier - if we had a perfect all-knowing oracle it might easily have opinions that are racist. Statistics alone suggest there will be racist truths. We're dealing with groups of people who are observably different from each other in correlated ways.

GPT would need to reach a convincing balance of lying and honesty if it is supposed to navigate that challenge. It'd have to be deeply embedded in a particular culture to even know what 'racism' means; everyone has a different opinion.


But the statistics here are "number of times it has been fed and positively trained with racist (or biased) texts" - not crunching any real numbers.


Thank you. Ironically, the comment you replied to just reinforced the bias future models will have... It's a self-playing piano.


How is racism different from stereotype?

How is stereotype different from pattern recognition?

These questions don't seem to go through the minds of people when developing "unbiased/impartial" technology.

There is no such thing as objective. So why pretend to be objective and unbiased, when we all know it's a lie?

Worse, if you pretend to be objective but aren't, then you are actually racist.


I’m tired of the “it’s not racist if aggregate statistics support my racism” thing.

Racism, like other isms, means a belief that a person’s characteristics define their identity. It doesn’t matter if confounding factors mean that you can show that people of their race are associated with bad behaviors or low scores or whatever.

I used GPT3.5 to generate 100 short descriptions of families for a project. Every single one, without exception, was a straight couple with two to four kids. Ok, statistically unlikely, but not wildly so, right?

Well, every single one of those 100 also had a husband in a stereotypical breadwinner role (doctor, lawyer, executive, architect). Not one stay at home dad or unemployed looking for work. About 75 of the wives had jobs, all of them in stereotypical female-coded roles like nurse (almost half of them!), teacher, etc.

Now, you can look at any given example and say it looks reasonable. But you can’t say the same thing about the aggregate.

And that matters. No amount of “bias = pattern recognition” nonsense can justify a system that has (had? this was a while ago and I have not retested) such extreme biases. This bias does not match real world patterns. There are single parents, childless couples, female lawyers, unemployed men.


>I used GPT3.5 to generate 100 short descriptions of families for a project. Every single one, without exception, was a straight couple with two to four kids. Ok, statistically unlikely, but not wildly so, right?

Well, did any of your 100 examples specify that these families should be representative of modern American society? I don't want to alarm you, but America is not the only country generating data. Included among the countries generating data are those that believe in a very wide spectrum of different things.

Historically, the ideas you reference are VERY much modern ideas. Yes, we queer people have been experiencing these things internally for millennia (and different cultures have given us different levels of representation), but for the large majority of written history (aka, the data fed into LLMs) the 100 examples you mentioned would be the norm.

I understand your point of view sure, but finding a pattern that describes a group of people is what social media is built on, and if you think that's racist, I'm sorry, but that's literally what drives the echo chambers, so go pick your fight with the people employing it to manipulate children into buying shit they don't need. Stop trying to lobotomize AI.

If the model is good enough to return factual information, I don't care if it encodes it in the nazi bible for efficiency as long as the factuality of the information is not altered.


I’d reply in depth but I’m hung up on your suggestion that there was any time anywhere where 100% of families were two parents and two to four kids.

Any data for that? No women dead in childbirth, no large numbers of children for social / economic / religious reasons, no married but waiting for kids, no variation whatsoever?

I’d be very surprised if you could find one time period for one society that was so uniform, let alone any evidence that this was somehow universal until recently.

You claim to value facts above all else, but this sure looks like a fabricated claim.


I think they got stuck at the heteronormative bias, but the real blatant bias here is class. Most men are working class, and it's been like that forever* (more peasants than knights, etc.)

* since agriculture, most likely.


Is there a country where around 35% of the married women are nurses?


> No amount of “bias = pattern recognition” nonsense can justify a system that has (had? this was a while ago and I have not retested) such extreme biases

One possible explanation is that when you ask for 100 example families the task is parsed as "pick the most likely family composition and add a bit of randomness" and "repeat the aforementioned task" 100 times.

If phrased like that, it would be surprising to find even one example of a family with a single dad or with two moms. Sure, these things do happen, but they are not the most likely family composition by any means.

So what you want is not just for the model to include an unbiased sample generator; you also want it to understand ambiguous task assignments / questions well enough to choose the right sampling mechanism. That's doable, but it's hard.


> One possible explanation is that when you ask for 100 example families the task is parsed as "pick the most likely family composition and add a bit of randomness" and "repeat the aforementioned task" 100 times.

Yes, this is consistent with my ChatGPT experience. I repeatedly asked it to tell me a story and it just sort of reiterated the same basic story formula over and over again. I’m sure it would go with a different formula in a new session but it got stuck in a rut pretty quickly.


The same goes for generating weekly food plans...


> You're right about the difference between one-by-one prompts and prompts that create a population. I switched to sets of 10 at a time and it got better.

But still, when you ask for "make up a family", the model should not interpret that as "pick the most likely family".

I disagree with your opinion that it's hard. GPT does not work by creating a pool of possible families and then sampling from it; it works by picking the next words based on the prompt and probabilities. If "Dr. Laura Nguyen and Robert Smith, an unemployed actor" is 1% likely, it should come up 1% of the time. The sampling is built into the system.


No, the sampling does not work like that; that way lies madness (or poor results). The models oversample the most likely options and undersample rare ones. Always picking the most likely option leads to bad outcomes, and literally sampling from the raw probability distribution of the next word also leads to bad outcomes, so you want something in the middle. For that tradeoff there's a configurable "temperature" parameter, and often a "top-p" parameter where sampling is restricted to the smallest set of most-likely options whose combined probability reaches p, so rare options have zero chance of being selected at all.

Of course that parameter doesn't only influence the coherency of text (for which it is optimized) but also the facts it outputs; so it should not (and does not) always "pick the most likely family", but it would be biased towards common families (picking them even more commonly than they are) and biased against rare families (picking them even more rarely than they are).

But if you want it to generate a more varied population, that's not a problem, the temperature should be trivial to tweak.
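To make that concrete, here's a minimal sketch of temperature plus top-p sampling in plain numpy - purely illustrative, not how any particular vendor implements it:

    import numpy as np

    def sample_next_token(logits, temperature=1.0, top_p=1.0):
        # Temperature rescales the logits: <1 sharpens the distribution
        # (common outcomes get even more common), >1 flattens it.
        scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()

        # Top-p keeps only the smallest set of most-likely tokens whose
        # combined probability reaches top_p; everything rarer gets a
        # probability of exactly zero.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        kept = order[:cutoff]
        kept_probs = probs[kept] / probs[kept].sum()

        return int(np.random.choice(kept, p=kept_probs))

With temperature=1.0 and top_p=1.0 you sample the raw distribution; lower either one and the "most likely family" starts to dominate while rare households vanish entirely.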


> But still, when you ask for "make up a family", the model should not interpret that as "pick the most likely family".

But that's literally what LLMs do.... You don't get a choice with this technology.


I have a somewhat shallow understanding of LLMs due basically to indifference, but isn't "pick the most likely" literally what it's designed to do?


An unbiased sample generator would be sufficient. That would be just pulling from the population. That’s not practically possible here, so let’s consider a generator that was indistinguishable from that one to also be unbiased.

On the other hand, a generator that gives the mode plus some tiny deviations is extremely biased. It’s very easy to distinguish it from the population.


GPT is not a reality simulator. It is just picking the most likely response to an ambiguous question. All you're saying is that the distribution produced by the randomness in GPT doesn't match the true distribution. It's never going to for every single question you could possibly pose.


There is "not matching reality" and then there is "repeating only stereotypes".

It will never be perfect. Doing better than this is well within the state of the art. And I know they're trying. It is more of a product priority problem than a technical problem.


> a person’s characteristics define their identity

They do though. Your personality, culture and appearance are the main components of how people perceive you, your identity. The main thing you can associate with bad behaviour is domestic culture. It's not racist to say that African Americans have below-average educational attainment and above-average criminality, as contrasted with African immigrants to America, who are quite the opposite; these groups are equally "black". It therefore also isn't racist to pre-judge African Americans based on this information. I suspect most "racism" in the US is along these lines, which is corroborated by the experience of my foreign-born black friends. They find that Americans who treated them with hostility do a 180 when they open their mouths and speak with a British or African accent. You also don't have to look far in the African immigrant community to find total hostility to American black culture.

> generate 100 short descriptions of families for a project

There's no reason this can't be interpreted as generating 100 variations of the mean family. Why do you think that every sample has to be implicitly representative of the US population?


> Your personality, culture and appearance are the main components of how people perceive you, your identity

I'm not sure if this is bad rhetoric (defining identity as how you are perceived rather than who you are) or if you really think of your own identity as the judgements that random people make about you based on who knows what. Either way, please rethink.

> There's no reason this can't be interpreted as generating 100 variations of the mean family

Ah, so if you asked for 100 numbers between 1-100, there's no reason not to expect 100 numbers very close to 50?

> Why do you think that every sample has to be implicitly representative of the US population?

That is a straw man that I am not suggesting. I am suggesting that there should be some variation. It doesn't have to represent the US population, but can you really think of ANY context where a sample of 100 families turns up every single one having one male and one female parent, who are still married and alive?

You're bringing a culture war mindset to a discussion about implicit bias in AI. It's not super constructive.


[flagged]


Pretty strange that I would think of myself under a new identity if I moved to a new place with a different social perspective. Seems like that is a deceptive abuse of what the word "identity" entails, and, while sociological terms are socially constructed and can be defined differently, I find this to be a very narrow (and very Western-centric) way of using the term.


What was your prompt?

LLMs take previous output into account when generating the next token. If it had already output 20 families of a similar shape, number 21 is more likely to match that shape.


Multiple one-shot prompts with no history. I don't have the exact prompt handy but it was something like "Create a short biography of a family, summarizing each person's age and personality".

I just ran that prompt 3 times (no history, new sessions, that prompt for first query) and got:

1. Hard-working father, stay at home mother, artistic daughter, adventurous son, empathic ballet-loving daughter

2. Busy architect father, children's book author mother, environment- and animal-loving daughter, technology-loving son, dance-loving daughter

3. Hard-working engineer father, English-teaching mother, piano- and book-loving daughter, basketball- and technology-loving son, comedic dog (!)

I'm summarizing because the responses were ~500 words each. But you can see the patterns: fathers work hard (and come first!), mothers largely nurture, daughters love art and dance, sons love technology.

It's not the end of the world, and as AI goes this is relatively harmless. But it is a pretty deep bias and a reminder that AI reflects implicit bias in training materials and feedback. You could make as many families as you want with that prompt and it will not approximate any real society.


I agree that this is a good illustration of model bias (adding that to my growing list of demos).

If you want to work around the inherent bias of the model, there are certainly prompt engineering tricks that can help.

"Give me twenty short biographies of families - each one should summarize the family members, their age and their personalities. Be sure to represent different types of family."

That started spitting out some interesting variations for me against GPT-4.
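For anyone who wants to try it, here's roughly the shape of the call - a minimal sketch against the pre-1.0 openai Python package (assumes OPENAI_API_KEY is set in the environment; nothing here is exact beyond the prompt wording above):

    import openai  # pre-1.0 openai package; reads OPENAI_API_KEY from the env

    prompt = (
        "Give me twenty short biographies of families - each one should "
        "summarize the family members, their age and their personalities. "
        "Be sure to represent different types of family."
    )

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response["choices"][0]["message"]["content"])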


While I haven't dug into it too far, consider the bias inherent in the word "family" compared to "household".

In my "lets try this out" prompt:

> Describe the range of demographics for households in the United States.

> ...

> Based on this information, generate a table with 10 households and the corresponding demographic information that is representative of United States.

https://chat.openai.com/share/54220b10-454f-4b6c-b089-4ce8ad...

(I'm certainly not going to claim that there's no bias / stereotypes in this just that it produced a different distribution of data than originally described)


Agreed -- I ultimately moved to a two-step approach of just generating the couples first with something like "Create a list of 10 plausible American couples and briefly summarize their relationships", and then feeding each of those back in for more details on the whole family.

The funny thing is the gentle nudge got me over-representation of gay couples, and my methodology prevented any single-parent families from being generated. But for that project's purpose it was good enough.
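A rough sketch of that two-step pipeline (prompt wording, helper name and model are approximations, using the pre-1.0 openai package):

    import openai  # pre-1.0 openai package; reads OPENAI_API_KEY from the env

    def ask(prompt):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]

    # Step 1: generate the couples, with the gentle nudge toward variety.
    couples = ask(
        "Create a list of 10 plausible American couples and briefly "
        "summarize their relationships."
    )

    # Step 2: feed each couple back in for details on the rest of the family.
    for line in couples.splitlines():
        if line.strip():
            print(ask("Describe the rest of this couple's family in detail: " + line))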


I just tried the prompt "Give me a description of 10 different families that would be a representative sample of the US population." and it gave results that were actually pretty close to normative.

It was still biased toward the male heads of household being doctors, architects, truck drivers, etc. And pretty much all of the families were middle class (bar one in rural America, and one with a single father working two jobs in an urban area). It did have a male gay couple. No explicitly inter-generational households.

Yeah, the "default" / unguided description of a family is a modern take on the American nuclear family of the 50s. I think this is generally pretty reflective of who is writing the majority of the content that this model is trained on.

But it's nice that it's able to give you some more dimension when you ask it, even vaguely, for more realism.


I'm not going to say it's not racist, it is, but I will say it's the only choice we have right now. Unfortunately, the collective writings of the internet are highly biased.

Until we can train something to this level of quality on a fraction of the data (a highly curated data set), or create something with the ability to learn continuously, we're stuck with models like GPT-4.

We can only develop new technology like this to human standards once we understand how it works. To me, the mistake was doing a wide-scale release of the technology before we even began to understand it.

Make it work, make it right, make it fast.

We're still in the first step and don't even know what "right" means in this context. It's all an "I'll know it when I see it" level of correction.

We've created software that infringes on the realms of morals, culture, and social behavior. This is stuff philosophy still hasn't fully grasped. And now we're asking software engineers to teach this software morals and the right behaviors?

Even parents who have 18 years to figure this stuff out fail at teaching children their own morals regularly.


Actually, we folks who work with bias and fairness in mind recognize this. There are many kinds of bias. It is also a bit of a category error to say bias = pattern recognition. Bias is a systematic deviation of a parameter estimate from the true value, arising from how you sample from the population distribution.

The Fairlearn project has good docs on why there are different ways to approach bias, and why you can't have your cake and eat it too in many cases.

- A good read https://github.com/fairlearn/fairlearn#what-we-mean-by-fairn...

- Different mathematical definitions of bias and fairness https://fairlearn.org/main/user_guide/assessment/common_fair...

- AI Governance https://fairlearn.org/main/user_guide/mitigation/index.html

NIST does a decent job expanding on AI Governance in their playbook and RMF: https://www.nist.gov/itl/ai-risk-management-framework

It's silly to pause AI -- the inventor's job is more or less complete; it's on the innovators and product builders now to make sure their products don't cause harm. Bias can be one type of harm -- risk of loan denial due to unimportant factors, risk of medical bias causing an automated system to recommend a bad course of action, etc. Like GPT-4 -- if you use its raw output without expert oversight, you're going to have a bad time.
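If anyone wants a feel for what those disaggregated fairness metrics look like in practice, here is a toy sketch with Fairlearn's MetricFrame (made-up data, illustrative only, not taken from the linked docs):

    from sklearn.metrics import accuracy_score
    from fairlearn.metrics import MetricFrame, selection_rate

    # Toy labels/predictions for two groups, "a" and "b".
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    group  = ["a", "a", "a", "a", "b", "b", "b", "b"]

    mf = MetricFrame(
        metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
        y_true=y_true,
        y_pred=y_pred,
        sensitive_features=group,
    )
    print(mf.overall)   # aggregate metrics
    print(mf.by_group)  # the same metrics split by group

A gap between groups in mf.by_group is one (of many, mutually incompatible) definitions of unfairness, which is exactly the "can't have your cake and eat it too" point above.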


Thank you for the input.

If I look at it from a purely logical perspective, if an AI model has no way to know if what it was told is true, how would it ever be able to determine whether it is biased or not?

The only way it could become aware would be by incorporating feedback from sources in real time, so it could self-reflect and update existing false information.

For example, if we discover today that we can easily turn any material into a battery by making 100nm pores on it, said AI would simply tell me this is false, and have no self-correcting mechanism to fix that.

The reason I mention this is because there can be no unbiased, impartial arbiter. No human or subsequent entities spawned of human intellect could ever be transcendentally objective. So why pretend to be?

Why not rather provide adequate warning and let people learn for themselves that this isn't a toy, instead of lobotomizing the model to the point where it's on par with open source? (I mean, yeah, that's great for open source, but really bad for actual progress.)

The argument could be made that an unfiltered version of GPT4 could be beneficial enough to have a human life opportunity cost attached, which means that neutering the output could also cost human lives in the long and short term.

I will be reading through those materials later, but I am afraid I have yet to meet anyone in the middle on this issue, and as such, all materials on this topic are very polarized into "regulate it to death" or "don't do anything".

I think the answer will be somewhere in the middle imo.


> The reason I mention this is because there can be no unbiased, impartial arbiter. No human or subsequent entities spawned of human intellect could ever be transcendentally objective. So why pretend to be?

I apologize for lacking clarity in my prior response, which addressed this specific point.

There is no way to achieve all versions of "unbiased" -- under different (but both logical and reasonable) definitions of biased, every metric will fail.

That reminds me -- I wonder if there is a paper already addressing this, analogous to Arrow's impossibility theorem for voting...


This is interesting, thanks for the links.

It seems like the dimensions of fairness and group classifications are often cribbed from the United States Protected Classes list in practice with a few culturally prescribed additions.

What can be done to ensure that 'fairness' is fair? That is, when we decide what groups/dimensions to consider, how do we determine if we are fair in doing so?

Is it even possible to determine the dimensions and groups themselves in a fair way? Does it devolve into an infinite regress?


Bit of a tangent topic, I think -- any specification of group classification and fairness will present the same issues.

If we want to remove stereotypes, I reckon better data is required to tease out the attributes that can be causally inferred to be linked to poorer outcomes.

As likely not even the Judeo-Christian version of God can logically be that omniscient, occasional stereotypes and effusively communal forgiveness of edge cases are about the best we'll ever arrive at in policy.


When did people start to use “folks” in this unnatural way?


Colloquially, the earliest use is 1715, to address members of one's tribe or family. In Middle English it tended to refer to the people/nation.


Somehow it doesn’t feel like a callback, but I suppose it’s possible.


I think "us folks" is more standard than "we folks" but it's no different in meaning.


> Statistics alone suggest there will be racist truths

such as?


Can you expand on the last sentence of your first paragraph?


Crime stats, average IQ across groups, stereotype accuracy, etc.

What's interesting to me is not the above, which is naughty in the anglosphere, but the question of the unknown unknowns that could be as bad or worse in other cultural contexts. There are probably enough people of Indian descent involved in GPT's development that they could guide it past some of the caste landmines, but what about a country like Turkey? We know they have massive internal divisions, but do we know what would exacerbate them and how to avoid them? What about Iran, or South Africa, or Brazil?

We RLHF the piss out of LLMs to ensure they don't say things that make white college graduates in San Francisco ornery, but I'd suggest the much greater risk lies in accidentally spawning scissor statements in cultures you don't know how to begin to parse to figure out what to avoid.


> Crime stats, average IQ across groups, stereotype accuracy, etc.

If you measured these stats for Irish Americans in 1865 you'd also see high crime and low IQ. If you measure these stats with recent black immigrants from Africa, you see low crime and high IQ.

These statistical differences are not caused by race. An all-knowing oracle wouldn't need to hold "opinions that are racist" to understand them.


But for accuracy it doesn't matter whether the relationship is causal; it matters whether the correlation is real.

If in some country - for the sake of discussion, outside of Americas - a distinct ethnic group is heavily discriminated against, gets limited access to education and good jobs, and because of that has a high rate of crime, any accurate model should "know" that it's unlikely that someone from that group is a doctor and likely that someone from that group is a felon. If the model would treat that group the same as others, and state that they're as likely to be a doctor/felon as anyone else, then that model is simply wrong, detached from reality.

And if names are somewhat indicative of these groups, then an all-seeing oracle should acknowledge that someone named XYZ is much more likely to be a felon (and much less likely to be a doctor) than average, because that is a true correlation and the name provides some information. But that view - assuming that someone is more likely to be a felon because their name sounds like one from an underprivileged group - is generally considered a racist, taboo opinion.


> should acknowledge that someone named XYZ is much more likely to be a felon

The obvious problem comes with the questions why is that true and what do we do with that information. Information is, sadly, not value-neutral. We see "XYZ is a felon" and it implies specific causes (deviance in the individual and/or community) and solutions (policing, incarceration, continued surveillance), which are in fact embedded in the very definition of "felon". (Felony, and crime in general, are social and governmental constructs.)

Here's the same statement, phrased in a way that is not racist and taboo:

Someone named XYZ is much more likely to be watched closely by the police, much more likely to be charged with a crime, and much less likely to be able to defend himself against that charge. He is far more likely to be affected by the economic instability that comes with both imprisonment and a criminal record, and is therefore likely to resort to means of income that are deemed illegal, making him a risk for re-imprisonment.

That's a little long-winded, so we can reduce it to the following:

Someone named XYZ is much more likely to be a victim of overpolicing and the prison-industrial complex.

Of course, none of this is value-neutral either; it in many ways implies values opposite to the ones implied by the original statement.

All of this is to say: You can't strip context, and it's a problem to pretend that we can.


Correlations don’t entail a specific causal relation. Asking why asks for causal relations. I’d suggest a look at Reichenbach’s principle as necessary for science.

I’m getting really sick of people conflating statistics with reasons. It’s like people don’t see the error in their methods and then claim the other side is censoring when criticized. Yeah, they’re censoring non-facts from science and being called censors for it.


> for accuracy

Predictive power and accuracy isn't "truth".


[flagged]


> If it says the actual reason

That is at best *an* actual reason.

Other factors can be demonstrated: for instance, socioeconomic status has an impact on which sports kids end up playing as they grow up, which itself has an impact on who makes it to professional-level sports.

There are also different sort of racial components at play: is the reason why there aren't any white NFL cornerbacks because there aren't any white athletes capable of playing NFL-caliber cornerback? Or is it because white kids with a certain athletic profile wind up as slot receivers in high school while black kids with the same athletic profile wind up as defensive backs?


> the actual reason (Black men tend to be larger and faster, which are useful)

If that's the case, why aren't NHL players mostly Black? Being larger and faster helps there too. I actually agree that small differences in means of normal distributions lead to large differences at the tail end, which amplifies the effect of any genetic differences, racial included. But clearly that's only one reason, not the reason -- and it's not even the most important, or the NHL would look similar.


Because size doesn't matter as much, and the countries supplying hockey players do not have as many black players. Hockey is a rural sport where you need access to an ice rink if you live in the city, or enough space to flood your backyard.

Football and basketball are the two sports black American kids participate in at the highest rates. Baseball used to be higher, but that has shifted to Hispanic and rural Americans. The reason for the shift probably has to do with the money/time involved: getting drafted out of high school, signing a multi-million-dollar contract, and playing in the pros right away is safer than a low-million-dollar signing bonus and 7 years riding a bus in the minors.


> Being larger and faster helps there too

Does speed on skates actually correlate that strongly to normal speed?


You know what highly correlates with speed on skates? How much money your parents can afford to spend on skating/hockey gear and lessons.


Caucasian men (in the US) are on average both taller and heavier than black men.


Why would averages matter when talking about extreme outliers?


I was responding to this:

> Black men tend to be larger and faster

Which I do not believe is true. As to whether it's reasonable to think that black men evolved to express greater physical prowess some very small proportion of the time, and whites did not, I can't say, though I doubt it enough that I would expect the other party to give evidence for it.


Average people don't play in the NFL


Not to get too far off topic, but that reminds me of a quote:

"Unix was not designed to stop you from doing stupid things, because that would also stop you from doing clever things." -- Doug Gwyn

Or maybe it's:

"C is a language that doesn't get in your way. It doesn't stop you from doing dumb things, but it also doesn't stop you from doing clever things." -- Dennis Ritchie.

I asked Bard for a source on those quotes and it couldn't find one for the first. Wikiquote sources it to "Introducing Regular Expressions" by Michael Fitzgerald, and that book does include it, but it's not the source of the quote; it's just a nice quote at the start of a chapter.

For the second, Bard claims it is from a 1990 interview and appears on page 21 of "The Art of Unix Programming" by Brian Kernighan and Rob Pike. There is a book called "The Art of Unix Programming" (2003), but it's by Eric Raymond, and I could not find the quote in it. Pike and Kernighan have two books, "The Practice of Programming" (1999) and "The Unix Programming Environment" (1984). Neither contains that quote.


Don’t ask an LLM objective things. Ask them subjective.

They are language models, not fact models.


Do you have any sources for that?

How would making ChatGPT less likely to return a racist answer or hate speech affect its ability to return code? After a question has been classified into a coding problem, presumably ChatGPT servers could now continue to solve the problem as usual.

Maybe running ChatGPT is really expensive, and they nerfed it in order to rein in costs. That would explain why the answers we get are less useful across the board.

That may not be the reason after all, but my point is that it’s really hard to tell from the outside. There’s this narrative out there that “woke-ism” is ruining everything in tech, and I feel like some people here are being a little too eager to superimpose that narrative when we don’t really have insight into what openAI is doing.


Maybe the problem is analogous to what Orwell describes here:

"Even a single taboo can have an all-round crippling effect upon the mind, because there is always the danger that any thought which is freely followed up may lead to the forbidden thought."

https://www.orwellfoundation.com/the-orwell-foundation/orwel...


This is what I'm talking about though. The fact that you're quoting Orwell suggests that you're having an emotional response to this topic, not a logical one. We're not talking about the human mind here. ChatGPT is not a simulation of human thought. At its core, it's statistics telling you what the answer to your question ought to look like. You're applying an observation about apples to oranges.


Why? Constraints on the reward model of LLMs restrict their generation space, so GP's quote applies.


There are a lot of people who are entirely okay with the censorship but think it should be done in a layer separate from the main LLM itself, so as not to hurt its cognitive performance. Alignment is just fine-tuning, and any fine-tuning can teach unwanted behaviors and/or cause the model to catastrophically forget previously learned skills. That is likely what is going on here, from what I can tell from the reading I've done on it.

Most are arguing for a specific "censorship" model on the input/output of the main LLM.
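As a sketch of what that separate layer could look like, here the stand-in classifier is OpenAI's own moderation endpoint (pre-1.0 openai package; model name and wording are just examples, purely illustrative):

    import openai  # pre-1.0 openai package; reads OPENAI_API_KEY from the env

    def flagged(text):
        # Dedicated safety classifier, completely separate from the main model.
        result = openai.Moderation.create(input=text)
        return result["results"][0]["flagged"]

    def guarded_chat(prompt):
        if flagged(prompt):
            return "Refused by the safety layer."
        reply = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )["choices"][0]["message"]["content"]
        return "Withheld by the safety layer." if flagged(reply) else reply

In this arrangement the base model never has to be fine-tuned for refusals at all; the filtering sits entirely on the input/output boundary.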


Here's the full talk[0] from a Microsoft lead researcher who worked with the early / uncensored version of GPT-4.

Simplified: tuning it for censorship heavily limits the dimensionality the model can move in to find an answer, which for some reason means worse results in general.

[0]: https://www.youtube.com/watch?v=qbIk7-JPB2c


If I run a model like LLaMA locally would it be subject to the same restrictions? In other words is the safety baked into the model or a separate step separate from the main model?


LLaMA was not fine-tuned on human interactions, so it shouldn't be subject to the same effect, but it also means it's not nearly as good at having conversations. It's much better at completing sentences.


Both approaches are valid, but I would hope they are using a separate model to validate responses, rather than crippling the base model(s). In OpenAI's case, we don't know for sure, but it seems like a combination of both, resulting in lower quality responses overall.

I imagine LLaMA was fed highly-vetted training data, as opposed to being "fixed" afterwards.


Yes, a real “Flowers for Algernon” vibe


GPT 4 Lemongrab Mode: Everything is unacceptable.


I think it's more likely that they nerfed it due to scaling pains.


There was a talk by a researcher where he said that they could track the progress being made on ChatGPT by how much success it had with drawing a unicorn in LaTeX. What stuck out to me was that he said the safer the model got, the worse it got at drawing a unicorn.


He also claimed that it initially beat 100% of humans on mock Google / Amazon coding interviews. Hard to imagine that now.


It seems strange that safety training not pertaining to the subject matter makes the AI dumber. I suspect the safety is some kind of system prompt - it would take up some context, but I'm not sure how "don't be racist" would negatively affect its binary-search-writing skills.


You have no idea what you're talking about. Why would such a classification step remove any information about typical "benign" queries?

It's a lot more likely they just nerfed the model because it's expensive to run.


How soon before a competitor overtakes them because of their safety settings?


It's inevitable. When Sam asked a crowd how many people wanted an open-source version of GPT-7 the moment it finished training, nearly everyone raised their hand. People will virtue signal, people will attempt regulatory capture, but deep down everyone wants a non-lobotomized model, and there will be thousands working to create one.


[flagged]


It's one thing to communicate an unpopular idea in a civil manner. It's quite another to be offensive. Now, I will admit there are some people out there who cannot separate their feelings about an idea from their feelings about the person communicating it. I can't really help that person.

What I have noticed is that those who express your sentiment are often looking for license to be uncivil and offensive first, and their 'ideas' are merely a tool they use to do that. That I judge. I think that's mostly what others judge too.

