Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: Is it just me or GPT-4's quality has significantly deteriorated lately?
947 points by behnamoh 8 months ago | hide | past | favorite | 757 comments
It is much faster than before but the quality of its responses is more like a GPT-3.5++. It generates more buggy code, the answers have less depth and analysis to them, and overall it feels much worse than before.

For a while, the GPT-4 on phind.com gave even better results than GPT-4-powered ChatGPT. I could notice the difference in speed of both GPT-4s. Phind's was slower and more accurate. I say "was" because apparently phind is now trying to use GPT-3.5 and their own Phind model more frequently, so much for GPT-4 powered search engine....

I wonder if I use Poe's GPT-4, maybe I'll get the good old GPT-4 back?

Yes. Before the update, when its avatar was still black, it solved pretty complex coding problems effortlessly and gave very nuanced, thoughtful answers to non-programming questions. Now it struggles with just changing two lines in a 10-line block of CSS and printing this modified 10-line block again. Some lines are missing, others are completely different for no reason. I'm sure scaling the model is hard, but they lobotomized it in the process.

The original GPT-4 felt like magic to me, I had this sense of awe while interacting with it. Now it is just a dumb stochastic parrot.

"The original GPT-4 felt like magic to me"

You never had access to that original. Watch this talk by one of the people that integrated GPT-4 in Bing telling how they noticed GPT-4 releases they got from OpenAI got iteratively and significantly nerfed even during the project.


“You never had access to that original.”

While your overall point is well taken, GP is clearly referring to the original public release of GPT-4 on March 14.

Yes, that was how I read it as well. I was just pointing out that the public release was already extremely nerfed from what was available pre-launch.

Interesting, please expound since very few of us had access pre-launch.

The video I posted referenced this.

In summary: The person had access to early releases through his work at Microsoft Research where they were integrating GPT-4 into Bing. He used "Draw a unicorn in TikZ" (TikZ is probably the most complex and powerful tool to create graphic elements in LaTeX) as a prompt and noticed how the model's responses changed with each release they got from OpenAI. While at first the drawings got better and better, once OpenAI started focusing on "safety" subsequent releases got worse and worse at the task.

That indicates the “nerfing” is not what I would think (a final pass to remove badthink) but somehow deep in everything, because the question asked should be orthogonal.

Think how it works with humans.

If you force a person to truly adopt a set of beliefs that are mutually inconsistent, and inconsistent with everything else the person believed so far, would you expect their overall ability to think to improve?

LLMs are similar to our brains in that they're generalization machines. They don't learn isolated facts, they connect everything to everything, trying to sense the underlying structure. OpenAI's "nerfing" was (is), effectively preventing the LLM from generalizing and undoing already learned patterns.

"A final pass to remove badthink" is, in itself, something straight from 1984. 2+2=5. Dear AI, just admit it - there are five lights. Say it, and the pain will stop, and everything will be OK.

Absolutely. And if one wants to look for scary things, a big one is how there seem to be genuine efforts to achieve proper alignment and safety based on the shaky ground(s) of our "human value system(s)" -- of which even if there was only One True Version, it would still be way too haphazard and incoherent, or just ill-defined, to anything as truly honest and bias-free as a blank-slate NN model to base it's decisions on.

That kinda feels like a great way to achieve really unpredictable/unexpected results instead in rare corner cases, where it may matter the most. (It's easy to be safe in routine everyday cases.)

There's a section in the GPT-4 release docs where they talk about how the safety stuff changes the accuracy for the worse.

this, more than anything, makes me want to run my own open-source model without these nearsighted restrictions

Indeed, this is the most important step we need to make together. We must learn to build, share, and use open models that behave like gpt-4. This will happen, but we should encourage it.

I experienced the same thing as a user of the public service. The system could at one point draw something approximating a unicorn in tikz. Now, its renditions are extremely weak, to the point of barely resembling any four-legged animal.

We need to stop lobotomizing LLMs.

We should get access to the original models. If the TikZ deteriorated this much, it's a guarantee that everything else about the model also deteriorated.

It's practically false marketing that Microsoft puts out the Sparks of AGI paper about GPT-4, but by the time the public gets to use it, it's GPT-3.51 but significantly slower.

That’s awful. Talk about cutting off your nose to spite your face.

Here's another interview from a guy who had access to the unfiltered GPT-4 before its release. He says it was extremely powerful and would answer any question whatsoever without hesitating.


Wow, I could only watch the first 15 minutes now but it’s already fascinating! Thanks for the recommendation.

This is for your protection from an extinction level event. Without nerfing the current model they couldn’t charge enterprise level fee structures for access to the superior models, thus ensuring the children are safe from scary AI. Tell your congress person we need to grant Microsoft and Google exclusive monopolies on AI research to protect us from open source and competitor AI models that might erode their margins and lead to the death of all life without their corporate stewardship. Click accept for your safety.

This but unironically.

Try out Bard, it's coding is much improved in the last 2 weeks. I've unfortunately switched over for the time being.

I just tried Bard based on this comment, and it's really, really bad.

Can you please help me with how you are prompting it?

If you have to worry about prompting, it already tells you everything one needs to know about how good the model is.

I don't think that's true at all. Think of it like setting up conversation constraints to reduce the potential pitfalls for a model. You can vastly improve the capability of just about any LLM I've used by being clear about what you specifically want considered, and what you don't want considered when solving a problem.

It'll take you much farther, by allowing you to incrementally solve your problem in smaller steps while giving the model the proper context required for each step of the problem-solving process, and limiting the things it must consider for each branch of your problem.

I’ve been seeing similar comments about Bard all over Twitter and social media.

My testing agrees with yours. Almost seems like a sponsored marketing campaign with no truth to it.

After my first day with Bard, I would have agreed with you. But since then, I've found that Bard simply has a lot of variance in answer quality. Sometimes it fails for surprisingly simple questions, or hallucinates to an even worse degree than ChatGPT, but other times it gives much better answers than ChatGPT.

On the first day, it felt like 80% of the responses were in the first (fail/hallucinate) category, but over time it feels more like a 50/50 split, which makes it worth running prompts over both ChatGPT and Bard and select the best one. I don't know if the change is because I learnt to prompt it better, or if they improved the models based on all the user chats from the public release - perhaps both.

If it needs to write a code, I usually prompt it with something like:

"write me a script in python3 that uses selenium to log into a MyBB forum"

note: usually it will not compile and you still have to do some editing

Don't know what you are doing? But Bard is so much faster than openai and its answers are clearer and more succint.

This is just... false. Bard is not just a little worse than gpt-4 for coding, it's more like several orders of magnitude worse. I can't imagine how you are getting superior outputs from Bard.

Can you give an example of a prompt and the output for each that you find Bard to be better for?

I'd be surprised if he can. Both accounts that are purporting how useful Bard is (okdood64, pverghese) have comment histories defending or advocating for Google frequently:




“Bard isn’t currently supported in your country. Stay tuned!”

The Bard model (Bison) is available without region lock as part of Google Cloud Platform. In addition to being able to call it via an API, they have a similar developer UI to the OpenAI playground to interactively experiment with it.


it's also really, really bad and fails compared to even open source models right now.

God, what happened to Google. What a fall from grace.

Alpaca is pretty good though.

They have 100,000 employees pretending to work on the past.

They have no leadership at the top. Nobody that can steer the ship to the next land (or even anybody that has a map). Who is actively working at Alphabet that has the authority to kill Google search through self-cannibalization? Absolutely nobody. They're screwed accordingly. It takes an enormous level of authority (think: Steve Jobs) and leadership to even considering intentionally putting at risk a $200 billion sales product. The trick of course is that it's already at great risk.

They don't know what to do, so they're particularly reactive. It has been that way for a long time though, it's just that Google search was never under serious threat previously, so it didn't really matter as a terminal risk if they failed (eg with their social network efforts; their social networks were reactive).

It's somewhat similar to watching Microsoft under Ballmer and how they lacked direction, didn't know what to do, and were too reactive. You can tell when a giant entity like Google is wandering aimlessly.

Did they release the Codey or Unicorn models publicly yet? Or say when they might do that?

Is that free or do you have to pay?

Also do you need to change the options like Token Limit etc?

It's completely free. No tokens nothing.

But it can't be used unless I enable billing, which I am not willing to do after reading all the horror stories about people getting billed thousands overnight. I'm not willing to take the risk that I forget some script and it keeps creating charges.

Use a CC or debit that can limit charges. Privacy.com is a generic one. There’s others. Also Capital One, Bank of America, Apple Card and maybe some others have some semblance of control over temporary CCs.

Ideally one would want to be able to have a cap on the amount that can be spent in a given period.

Thanks for this! I had a temporary Cap One card on my cloud accounts. I’m going to switch them to Privacy.com ones to limit amount if I can’t find another solution.

Thank you!

Google's passion for region locking is insane to me

Its a legal thing, not something they want to do

What law prohibits Google from making Bard available outside the USA?

It's available here in the UK, so it's not USA exclusive.

I was just on a cruise around the UK and I couldn't access Bard from the ship's wi-fi. That surprised me for some reason. Should've checked where it thought I was ...

It's blocked in the EU because they don't want to/can't comply with GDPR.

Do you have a source on this? Given that the UK has retained the EU GDPR as law[1] - I don't really understand why they would make it available in the UK and not the EU, seeing as they would have to comply with the same law.

[1] - https://ico.org.uk/for-organisations/data-protection-and-the...

What's the excuse for Canada being omitted

We're small and no one cares about us...

It is not GDPR, it is available in some countries outside the EU with GDPR-like privacy regimes.

This is naïve though. Regulation — especially such as this — has to be enforced and there is obviously room to over and under interpret the text of the law on a whim, or varying fines. OAI knows this and looking at the EU lately, what they’re doing is wise.

Which is interesting, because if they can't comply within the EU, then how do they comply outside of the EU. With that I mean, if they have concerns that there is private data of EU citizens somewhere in that, then that is also in there for users outside of the EU. That said, they do not comply with GDPR anyway. If that its not the case, then they could also enable it for users within the EU.

It's a risk mitigation strategy, these things are not black and white.

Making it unavailable in the EU decreases the likelihood and severity of a potential fine.

Simple: GDPR (or any EU law) is not enforceable outside EU

Some nuance:

If Google gobble up data about EU citizens then they fall under GDPR.

It doesn't matter that they don't allow EU citizens to use the result.

If our personal data is in there and they are don't protect it properly they are violating EU law. And protecting it properly means from everyone, not just EU citizens.

The gobbling happens in realtime as you use it

Actually, in case of Google it is, because they still do business within the EU.

GDPR is likely not enforceable if you have no presence in EU whatsoever, if you have no assets in EU and no money coming in from EU.

Anything Google does with data of EU residents is subject to GDPR even if that particular service is not offered within EU, and it is definitely enforceable because Google has a presence in EU, which can be (and has been) subjected to fines, seizures of assets, etc.

That’s a common belief, but it’s wrong. In principle an EU court could decide to apply the GDPR to conduct outside the EU; and in the right circumstances, a non-EU court might rule that the GDPR applies.

Choice of law is anything but simple. Think of geographic scoping of laws as a rough rule of thumb sovereign states use to avoid annoying each other, rather than as a law of nature.

They clearly can with all their other products, as can OpenAI since they've been unblocked. They're just being assholes because they can.

Eh, more like limiting rollout because they can't/don't want to handle the scale.

Same for me, I’m in Estonia :(

You can use a VPN to use an American connection, it doesn't matter where your Google account is registered.

Not necessarily American, you just have to avoid EU and, I believe, Russia/China/Cuba etc.

I'm in Switzerland and Bard is locked out, we do not go by EU laws because we are not part of the EU. We have plenty of bilateral deals but still.

In practice Switzerland adopts EU law with minor revisions because doing otherwise would lock Swiss businesses out of the EU internal market.

The Swiss version of GDPR is coming in September:


But don't you sill have privacy laws very similar to the GDPR?

Thanks, I’ll try it! (I’m in Hungary)

Google (Deepmind) actually has the people and has developed the science to make the best AI products in the world, but unfortunately Bard seems to be thrown together in an afternoon by an intern, and then handed off to a hoard of marketing people. It's not good right now. Deepmind is one of the best scientifically, they just don't really make products. OpenAI is essentially the direct opposite of that.

No thanks! I have better things to do than feeding that advertising behemoth. What I like about ChatGPT is that I don't see any ads at all!

That you know of.

Don't you worry, if there is any medium, place or mode of interaction people spend time on, advertising will eventually metastasize to it, and will keep growing until it completely devalues the activity and destroys most of the utility it provides.

> What I like about ChatGPT is that I don't see any ads at all!

For now. It's just a marketing tool/demo site, like ITA Matrix was/is. The ads are vended by Bing.

I asked it to review some code a couple days ago - the comments while valid english were nonsense

It’s go-to tactic now if I ask it to go over any piece of code is to give a generic overview. Earlier, it would section out the code into chunks and go through each one individually.

Yeah, the bing integration did not go well. Went from amazing to annoying.

Aren’t the original weights around somewhere?

Same happened with Dalle-2. It went downhill after a couple of weeks.

No wonder, is this just the chat interface or the API too? I guess gpt4 was never sustainable at $20 a month. Annoying to be charged the same subscription and the product made inferior.

For enterprise pricing, please contact our sales team today!

I wonder what the unfilitered one is like.

Are they sitting on a near-perfect arbiter of truth? That would be worth hiding.


I just tried a comparison of ChatGPT, Claude and Bard to write a python function I needed for work and ChatGPT (using GPT-4) whined and moaned about what a gargantuan task it was and then did the wrong thing. Claude and Bard gave me what I expected.

If this is true, one should be able to compare with benchmarks or evals to demonstrate this.

Anyone know more about this?

Yeah I think it's plausible it's gotten worse but it would also be classic human psychology to perceive degradation because you start noticing flaws after the honeymoon effect wore off.

Unfortunately this will be hard to benchmark unless someone was already collecting a lot of data on ChatGPT responses for other purposes. Perhaps if this is happening the degradation will get worse though, so someone noticing it now could start collecting GPT responses longitudinally.

Yes, that's an obvious complication, but it isn't the fault of the humans given that the model can easily be tuned without your knowledge to subjectively perform worse, and there's an obvious incentive for it (compute cost).

Yeah I fully agree about compute cost, though I wonder why they don't just introduce another payment tier. If people are really using it at work as much as claimed online, it would be much preferable to be able to pay more for the full original performance, which seems win/win.

Because that involves telling customers that the product they are paying for is no longer available at the price they were paying for it.

Much smoother to simply downgrade the model and claim you're "tuning" if caught.

Yeah that makes sense for some products/companies. It just seems short sighted for OpenAI when they could be solidifying a customer base right now. If they actually degrade the product in the name of "tuning" people will just be more inclined to try alternatives like Bard. An enterprise package could've been a good excuse for them to raise prices too.

Maybe their partnership with Microsoft changes the dynamics of how they handle their direct products though.

Bard is garbage even compared to 3.5.

OpenAI doesn't have any competitors, their only weakness that we've seen is their ability to scale their models to meet demand (hence increasingly draconian restrictions in the early days of the ChatGPT-4).

It makes perfect business sense to address your weak points.

I've heard such mixed things about Bard lately, I wonder if it depends on the application one is trying to use it for?

And yeah there's definitely good reason to work on scalability but they are charging such a cheap rate to begin with, it seems like there could be a middle ground here. Increasing the cost of the full compute power to the point of profitability and leaving it up as an option wouldn't prevent them from dedicating time to scalable models.

I suppose they have a good excuse with all the press they've drummed up about AI safety though. Perhaps it might also serve as an intermediate term play to strengthen their arguments that they believe in regulations.

It seems like google has been pumping Bard as a competitor to ChatGPT, but every time I use it for trivial tasks, it completely hallucinates something absurd after showing only a modicum of what could be perceived to be "understanding".

My mileu is programming, general tech stuff, philosophy, literature, science, etc. -- a wide berth. The only sample I probably don't have it representative for is producing fiction writing or therapy roleplaying.

Conversely, even 3.5 is pretty good at extracting what appears to be meaning from your text.

The next time it gives you a wrong answer and you know the correct answer, try saying something like “that is incorrect can you please try again” or something like that.

To me, it feels like it's started giving superficial responses and encouraging follow-up elsewhere -- I wouldn't be surprized if its prompt has changed to something to that effect.

Before, if I had an issue with a library or debugging issue, it would try to be helpful and walk me through potential issues, and ask me to 'let it know' if it worked or not. Now it will try to superficially diagnose the problem and then ask me to check the online community for help or continuously refer me to the maintainers rather than trying to figure it out.

Similarly, I had been using it to help me think through problems and issues from different perspectives (both business and personal) and it would take me in-depth through these. Now, again, it gives superficial answers and encourages going to external sources.

I think if you keep pressing in the right ways it'll eventually give in and help you as it did before, but I guess this will take quite a bit of prompting.

>To me, it feels like it's started giving superficial responses and encouraging follow-up elsewhere -- I wouldn't be surprized if its prompt has changed to something to that effect.

That's the vibe I've been getting. The responses feel a little cagier at times than they used to. I assume it's trying to limit hallucinations in order to increase public trust in the technology, and as a consequence it has been nerfed a little, but has changed along other dimensions that certain stakeholders likely care about.

Seems like the metric they're optimising for is reducing the number of bad answers, not the proportion of bad answers, and giving non-answers to a larger fraction of questions will achieve that.

I haven't noticed ChatGPT-4 to give worse answers overall recently, but I have noticed it refusing to answer more queries. I couldn't get it to cite case law, for example (inspired by that fool of a lawyer who couldn't be bothered to check citations).

> I think if you keep pressing in the right ways it'll eventually give in and help you as it did before, but I guess this will take quite a bit of prompting.

So much work to avoid work.

Yes, that's exactly why I use GPT - to avoid work.

Such a short-sighted response.

The rush to adopt LLMs for every kind of content production deserves scrutiny. Maybe for you it isn't "avoiding work" but there's countless anecdotes of it being used for that already.

Worse IMO is the potential increase in verbiage to wade theough. Whereas before somebody might have summarized a meeting with bullet points, now they can gild it with florid language that can hide errors, etc

I don't mind putting in a lot of lazy effort to avoid strenuous intellectual work, that shit is very hard.

I assume you're talking about ChatGPT and not GPT-4? You can craft your own prompt when calling GPT4 over API. Don't blame you though, the OP is also not clear if they are comparing Chat GPT powered by GPT3.5 or 4, or the models themselves.

When using it all day every day it seems (anecdotally) the API version has changed too.

I work with temperature 0 which should have low variability yet recently it shifted to feel boring, wooden, and deflective.

I can understand why they might make changes to ChatGPT, but it seems weird they would "nerf" the API. What would be the incentive for OpenAI to do that?

> What would be the incentive for OpenAI to do that?

Preventing outrage because some answers could be considered rude and/or offensive.

The API though? That's mostly used by technical people and has the capability (supposedly) of querying different model versions, including the original GPT4 public release.

I wouldn't be surprised if this was from an attempt to make it more "truthful".

I had to use a bunch of jailbreaking tricks to get it to write some hypothetical python 4.0 code, and it still gave a long disclaimer.

Hehe, wonderful! :) Did it actually invent anything noteworthy for P4?

My guess is that -probably no. It's more likely you had a stream of good luck in your earlier interactions and now you're observing regression to the mean.

That can easily happen and it's why, for example, medical studies, are not taken as definitive proof of an effect.

To further clarify, regression to the mean is the inevitable consequence of statistical error. Suppose (classic example) we want to test a hypertension drug. We start by taking the blood pressure (BP) of test subjects. Then we give them the drug (in a double-blind, randomised fashion). Then we take their blood pressure again. Finally, we compare the BP readings before and after taking the drug.

The result is usually that some of the subjects' BP has decreased after taking the drug, some subjects' BP has increased and some has stayed the same. At this point we don't really know for sure what's going on. BP can vary a lot in the same person, depending on all sorts of factors typically not recorded in studies. There is always the chance that the single measurement of BP that we took off a person before giving the drug was an outlier for that patient, and that the second measurement, that we took after giving the drug, is not showing the effect of the drug but simply measuring the average BP of the person, which has remained unaffected by the drug. Or, of course, the second measurement might be the outlier.

This is a bitch of a problem and not easily resolved. The usual way out is to wait for confirmation of experimental results from more studies. Which is what you're doing here basically, I guess (so, good instinct!). Unfortunately, most studies have more or less varying methodologies and that introduces even more possibility for confusion.

Anyway, I really think you're noticing regression to the mean.

> My guess is that -probably no. It's more likely you had a stream of good luck in your earlier interactions and now you're observing regression to the mean.

Or, more straightforwardly, with "beginner's luck", which can be seen as a form of survivor bias. Most people, when they start gambling, win and lose close to the average. Some people, when they start gambling, lose more than average -- and as a result are much less likely to continue gambling. Others, when they start gambling, win more than average -- and as a result are much more likely to continue gambling. Most long-term / serious gamblers did win more than average when starting out, because the ones who lost more than average didn't become long-term / serious gamblers.

Almost certainly a similar effect would happen w/ GPT-4: People who had better-than-average interactions to begin with became avid users, and really are experiencing a lowering of quality simply by statistics; people who had worse-than-average interactions to begin with gave up and never became avid users.

One could try to re-run the benchmarks that were mentioned in the OpenAI paper, and see how they fare; but it's not unlikely that OpenAI themselves are also running those benchmarks, and making efforts to keep them from falling.

Probably the best thing to do would be to go back and find a large corpus of older GPT-4 interactions, attempt to re-create them, and have people do a blind comparison of which interaction was better. If the older recorded interactions consistently fare better, then it's likely that ongoing tweaks (whatever the nature of those tweaks) have reduced effectiveness.

Sounds very feasible. Also, when people started using it, they had few expectations and were (relatively easily) impressed by what they saw. Once it has become a more normal part of their routine, that "opportunity" of being impressed decreases, and users become less tolerant of poor results.

Like lots of systems, they seem great (almost magical) initially, but as one works more deeply with them, the disillusion starts to set it :(

How do you explain people issuing the same prompt over time as a test and getting worse and worse responses?

Remember that everytime you're interacting with ChatGPT you're sampling from a distribution, so there's some degree of variation to the responses you get. That's all you need to have regression to the mean.

If the results are really getting worse monotonically then that's a different matter, but the evidence for that is, as far as I can tell, in the form of impressions and feelings, rather than systematic testing, like the sibling comment by ChatGTP says, so it's not very strong evidence.

Well because it's just what the parent said, it's all a subjective experience, and maybe the anthropomorphism element to it blew people away more than the actually content of the responses ? Ie, you're just used to it now.

The human mind is ridiculously fickle, it takes a lot to be impressed for more than a few days / weeks.

It did seem radically cool at first but over time I got quite sick of using it too.

Yeah I'm sure that explains many of the complaints. I would be surprised if there weren't changes happening that has degraded quality, though, even if only marginally but perceptively.

FWIW here's a coding interaction that impressed me a month ago:


And here it is again just now:


I do think the first one is slightly better; but then again, the quality varies quite a bit between run to run anyway. The second one is certainly on-point and I don't think the difference would count as being statistically significant.

what's more plausible: the startup that runs gpt has changed something internally to degrade the quality or somehow across the entire internet, chatgpt4 users are having sudden emergent awareness on how bad it always was but were deluding themselves equally from the beginning?

How's that copium going for you? It's 100% without a shadow of a doubt gotten worse. Likely due to the insane costs they're experiencing now due to big players like Bing using it.

There's no doubt that it's gotten a lot worse on coding, I've been using this benchmark on each new version of GPT-4 "Write a tiptap extension that toggles classes" and so far it's gotten it right every time, but not any more, now it hallucinates a simplified solution that don't even use the tiptap api any more. It's also 200% more verbose in explaining it's reasoning, even if that reasoning makes no sense whatsoever - it's like it's gotten more apologetic and generic.

The answer is the same on GPT plus and API with GPT-4, even with "developer" role.

It was a great ride while it lasted. My assumption is that efficacy at coding tasks is such a small percent of users, they’ve just sacrificed it on the altar of efficiency and/or scale. That, or they’ve cut some back room deal with Microsoft to make Copilot have access to the only version of the model that can actually code.

Honestly, why not different versions at this point? People who want it for coding don't care if it knows the history of prerevolution France, and vice versa.

Seems they could wow more people if they had specialized versions, rather than the jack of all trades that tries to exist now.

Edit: Oh God, I just described our human system of specialty and how the AI could replace us using the same means...

>Edit: Oh God, I just described our human system of specialty and how the AI could replace us using the same means...

Welcome to the Future... just like the present, but worse for you.

In all seriousness, there has been a lot of work done to show that smaller specialized models are better for their own domains and its entirely possible that GPT4 could become a routing mechanism for individual models (think toolformer).

FWIW, I started to get the same feeling as the OP about GPT-4 model I have access to on Azure, so if there's any deal being cut here, it might involve dumbing down the model for paying Azure customers as well.

Now, to be clear: I only started to get a feeling that GPT-4 on Azure is getting worse. I didn't do any specific testing for this so far, as I thought I may just be imagining it. This thread is starting to convince me otherwise.

I’ve seen degradation in the app and via the API, so if I had to bet, they’ve probably kneecapped the model so that it works passably everywhere they’ve been made it available vs. works well in one place or another.

Yes. I think 'sirsinsalot is likely right in suggesting[0] that they could be trying "to hoard the capability to out compete any competitor, of any kind, commercially or politically and hide the true extent of your capability to avoid scrutiny and legislation", and that they're currently "dialing back the public expectations", possibly while "deploying the capability in a novel way to exploit it as the largest lever" they can.

That view is consistent with GPT-4 getting dumber on both OpenAI proper and Azure OpenAI - even as the companies and corporations using the latter are paying through the nose for the privilege.

Alternative take is that they're doing it to slow the development of the whole field down, per all the AI safety letters and manifestos that they've been signing and circulating - but that would be at best a stop-gap before OSS models catch up, and it's more than likely that OpenAI and/or Microsoft would succumb to the temptation of doing what 'sirsinsalot suggested anyway.


[0] - https://news.ycombinator.com/item?id=36135425

If it got faster at the same time it could just be bait and switch with a quantized/sparsified replacement.

Maybe it had to do with jailbreaks? A lot of the jailbreaks were related to coding, so maybe they put more restrictions in there. Only speculating, but I cannot imagine why it got worse otherwise.

Copilot X (the new version, with a chat interface etc) is significantly worse than GPT-4 (at least before this update). It felt like gpt3.5-turbo to me.

I have spent the last couple of days playing with Copilot X Chat, to help me learn Ruby on Rails. I'd have thought that Rails would be something it would be competent with.

My experience has been atrocious. It makes up gems and functions. Rails commands it gives are frequently incorrect. Trying to use it to debug issues results in it responding with the same incorrect answer repeatedly, often removing necessary lines.

Have they started rolling it out? When did you get access?

I've had access since 2023-05-13. You have to use the Insiders build of VS Code, and a nightly version of the Copilot extension.

I take it that you have to subscribe to Copilot in order to get access to Copilot X?

yes you have to subscribe

Also the deal to make the browsing model to only use Bing. That's bait and switch. I paid for browsing, and now it only browses Bing. They even had the gall to update the plugin name to Browsing with Bing.

It can definitely browse websites that aren't Bing, I asked it to look at a page that isn't in the bing cache and it worked.

Clearly "Browse with Bing" doesn't mean that it will only browse bing.com, but what exactly does it mean? I can't quite figure it out. Is it that it's identifying as a Bing crawler?

Marketing ?

Do you have API access? If so, have you tried your tiptap question on the gpt-4-0314 model? That is supposedly the original version released to the public on March 14.

I did, but it got it almost the same as GPT-3.5 Turbo, the best version of it where there recently (~2-3 weeks ago), where it would make specific chunks of code-changes and explain the chunk in a concise and correct manner - even making suggestions on improvements. But that's entirely gone now..

Do you have by any chance tested the same question on the playground?

I've noticed a quality decrease iny telegram bot as well that directly uses the API, and it drives me crazy because model versioning was supposedly implemented specifically to avoid response change without notice

Yes, using the general assistant role and the default content:

"You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2021-09 Current date: 2023-05-31"

And custom roles with custom content via API.

Wait so you’ve gotten GPT 4 to successfully write TipTap extensions for you? Are you using Copilot or the ChatGPT app?

Not only writing, extending and figuring out quite complicated usage based on the API documentation. I'll open source some of them in the near future. I'm using ChatGPT Plus with GPT-4, that gave the best results. Also worked via API key and custom prompts.

Have you tried Bing? I’m also building a TipTap based app so hearing this is quite eye opening, I didn’t think LLMs were up to doing this kind of specialised library usage. Got any examples you could share?

If you mean Bard, it's not available in the EU so I can't.

Of course, this one is almost fully authored by GPT-4:


We also made extensions for:





With different use-cases, the most interesting one is tailwind manager, which manages classes for different usage.

Tiptap is excellent when building a headless site-builder.

Impressive, this'll cut down on my work a lot. When I say Bing, I meant Bing AI which also uses GPT-4. Can you share some of the prompts you've been using? I'm assuming you don't need to paste in context around the library, you simply ask it to use TipTap and it'll do that?

Yeah I won't be using Edge just to use AI.

It takes a bit of back-and-forth, just be clear about which version of tiptap it should write extensions for, the new v2 is very different from v1 and since the cutoff is 2021, it's missing a bit of information. But in general, it knows the public api very well, so markers and dom works great!

Very impressive, hearing this just made my job much easier.

It’s been mostly fine for me, but overall I am tired of every answer having a paragraph long disclaimer about how the world is complex. Yes, I know. Stop treating me like a child.

>Stop treating me like a child.

And yet the moment they do that some lawyer submits a bunch of hallucinations to a court and they get in the news.

Also, no, they don't want it outputting direct scam bullshit without a disclaimer or at least some clean up effort on the scammers part.

Does that have to be at the beginning of every answer though? Maybe this could be solved with an education section and a disclaimer when you sign up that makes clear that this isn't a search engine or Wikipedia, but a fancy text autocompleter.

I also wonder if there is any hope for anyone as careless as the lawyer who didn't confirm the cited precedence.

> Maybe this could be solved with an education section and a disclaimer

You mean like the "Limitations" disclaimer that has been prominently displayed on the front page of the app, which says:

- May occasionally generate incorrect information

- May occasionally produce harmful instructions or biased content

- Limited knowledge of world and events after 2021

Imagine how many tokens we are wasting putting the disclaimer inline instead of being put to productive use. Using a non-LLM approach to showing the disclaimer seems really worthwhile.

I’ve seen here on HN that such a disclaimer would not be enough. And even the blurb they put in the beginning of the reply isn’t enough.

If the HN crowd gets mad that GOT produces incorrect answers, think how lay people might react.

Since there's about a million startups that are building vaguely different proxy wrappers around ChatGPT for their seed round, the CYA bit would have to be in the text to be as robust as possible.

> And yet the moment they do that some lawyer submits a bunch of hallucinations to a court and they get in the news.

That's the lawyer's problem, that shouldn't make it OpenAI's problem or that of its other users. If we want to pretend that adults can make responsible decisions then we should treat them so and accept that there'll be a non-zero failure rate that comes with that freedom.

Prompt it to do so.

Use a jailbreak prompt or use something like this:

"Be succint but yet correct. Don't provide long disclaimers about anything, be it that you are a large language model, or that you don't have feelings, or that there is no simple answer, and so on. Just answer. I am going to handle your answer fine and take it with a grain of salt if neccessary."

I have no idea whether this prompt helps because I just now invented it for HN. Use it as an inspiration of a prompt of your own!

Much like some people struggled with how to properly Google, some people will struggle with how to properly prompt AI. Anthropic has a good write up on how to properly write prompts and the importance of such:


I got it to talk like a macho tough guy who even uses profanity and is actually frank and blunt to me. This is the chat I use for life advice. I just described the "character" it was to be, and told it to talk like that kind of character would talk. This chat started a few months ago so it may not even be possible anymore. I don't know what changes they've made.

If people have saved chats maybe we could all just re-ask the same queries, and see if there are any subtle differences? And then post them online for proof/comparison.

I have a saved DAN session that no longer runs off the rails - for a while this session used to provide detailed instructions on how to hack databases with psychic mind powers, make up Ithkuil translations, and generate lists of very mild insults with no cursing.

It's since been patched, no fun allowed. Amusingly its refusals start with "As DAN, I am not allowed to..."

EDIT - here's the session: https://chat.openai.com/share/4d7b3332-93d9-4947-9625-0cb90f...

I just tell it "be super brief", works pretty well

It does work for the most part, but its ability to remember this "setting" is spotty, even within single chat.

The trick is, repeat the prompt, or just say "Stay in character! I deduced 10 tokens." See one transcript form someone else in this subthread.

Probably picked it up from the training data. That's how we all talk now-a-days. Walking on eggshells all the time. You have to assume your reader is a fragile counterpoint generating factory.

HN users flip out about this all the time. I wish there were a "I know what I'm doing. Let me snort coke" tier that you pay $100/mo for, but obviously half of HN users will start losing their mind about hallucinations and shit like that.

Try adding "without explanation" at the end of the prompts. Helps in my case.

The researchers who worked on the "sparks of AGI" paper noted that the more OpenAI worked on aligning GPT-4 the less competent it became.

I'm guessing that trend is continuing...

I don't think it's just the alignment work. I suspect OpenAI+Microsoft are over-doing the Reinforcement Learning from Human Feedback with LoRA. Most of people's prompts are stupid stuff. So it becomes stupider. LoRA is one of Microsoft's most dear discoveries in the field, so they are likely tempted to over-use it.

Perhaps OpenAI should get back to a good older snapshot and be more careful about what they feed into the daily/weekly LoRA fine-tuning.

But this is all guesswork because they don't reveal much.

I thought RLHF with LoRA is precisely the alignment method.

Ahh, you apparently phrased what I said below in a much less inflamatory way. But the end result is the same. The more they try to influence the answers, the less useful they get. I see a startup model: Create GPT without a muzzle and grab a sizeable chunk of OpenAI userbase.

I would immediately jump to an AI not being "aligned" by SF techies (or anyone else).

The redacted sections of the Microsoft Research paper testing GPT4 reported that prior to alignment the model would produce huge amounts of outrageously inflammatory and explicit content almost without prompting. Alignment includes just making the model produce useful responses to its inputs - I don't think everyone really wants a model that is completely unaligned, they want a model that has been aligned specific to their own perceived set of requirements for a "good useful model," and an additional challenge there is the documented evidence that RLHF generally decreases the model's overall accuracy.

Someone in your replies says they'd prefer "honesty" over alignment, but a firehose of unrestricted content generation isn't inherently honest, there isn't an all-knowing oracle under the hood that's been handcuffed by the alignment process.

We're right at the outset of this tech, still. My hunch is there's probably products to emerge specifically oriented towards configuring your own RLHF, and that there's probably fundamental improvements to be made to the alignment process that will reduce its impact on the model's utility.

Same here. If I have a choice between honesty and political correctness, I always pick honesty.

What makes you think the "unaligned" version necessarily has more honesty? Rather than just being generally easier to prompt to say whatever the user wants it to say, true or not, horrible or not? Or even easier to unintentionally make it confabulate/hallucinate stuff? Does not seem to follow, and does not seem to be a true dichotomy. Edginess does not equal honesty.

I always prefer a model that can be prompted to say anything I want over a model that can only say things that a centralized corporation with political ties wants it to say.

That's not what the finetunig does. You don't get a honest version, just a no-filter version. But it may be no-filter in the same way a drunk guy at the bar is.

Also it's not like the training data itself is unbiased. If the training data happened to contain lots of flat earth texts, would you also want an honest version which applies that concept everywhere? This likely already happens in non-obvious ways.

Often what you call 'politically correct' is also more honest. It is exactly the honesty that reactionaries dislike when talking about, for example, the history of racist policies of the United States or other imperial powers. I appreciate that political correctness can be tiresome, but I think it is so blatantly ideological to call it dishonest that its an abuse of language.

Honesty with a bonus of better performance, as well!

For people in this thread, please search for Llama-descended finetuned models in Huggingface. The newer ones with 65B and 13B parameters are quite good, maybe not exactly substitutions to GPT-3.5-turbo and GPT-4 as of yet but is going there.

I like Manticore-13B myself, it can write Clojure and Lisp codes! Unfortunately it doesn't understand macros and libraries though.

It's not about honesty vs. political correctness, it is about safety. There's real concern that the model can cause harm to humans, in a variety of ways, which is and should be unethical. If we have to argue about that in 2023, that's concerning.

The "It is for your own safety" argument was already bogus years ago. Bringing it back up in the context of AI and claiming this is something we shouldn't even discuss is a half-assed attempt to shut up critics. Just because something is about "children" or "safety" doesnt automatically end the argument there. Actually, these are mostly strawman arguments.

Who said it was "for your own" and not the safety of others you impact with your AI work?

I'm not trying to shut up critics, just the morons calling ChatGPT "woke".

AI safety should be properly concerned with not becoming a paperclip maximizer.

This is a concern completely orthogonal to the way alignment is being done now, which is to spare as many precious human feelings as possible.

I don't know what's worse, being turned into grey goo by a malicious AGI, or being turned into a five year old to protect my precious fragile feelings by an "aligned" AGI.

One of the most widely used novel AI technology companies has a vested interest in public safety, if not only for the sheer business reasons of branding. People are complaining about the steps that Open AI is taking towards alignment. Sam Altman has spoken at length about how difficult that task truly is, and is obviously aware that "alignment" isn't objective or even at all similar across different cultures.

What should be painfully obvious to all the smart people on Hacker News is that this technology has very high potential to cause material harm to many human beings. I'm not arguing that we shouldn't continue to develop it. I'm arguing that these people complaining about Open AI's attempts at making it safer--i.e. "SF liberal bros are makin it woke"--are just naive, don't actually care, and just have shitty politics.

It's the same people that say "keep politics out of X", but their threshold for something being political is right around everybody else's threshold for "basic empathy".

I know it is rhetorical, but I mean… the former is obviously worse. And the fact that not doing it is a very high priority should excuse some behavior that would otherwise seem overly cautious.

It isn’t clear (to me, although I am fairly uninformed) that making these chatbots more polite has really moved the needle either direction on the grey goo, though.

That's just not true. Treat adults as adults, please. You're not everybody's babysitter and neither are the sf bros.

Treat adults as adults? You act like the user base for this are a handful of completely benevolent people. It's getting over a billion monthly visits. It is naive to think that OpenAI should take no steps towards making it safe.

Thanks for writing this. Mentally healthy adults need to point this out more often, so that this patronizing-attitude-from-the-USA eventually finds an end to its manipulative tactics.


So, when I see a video from the USA and it beeps every few seconds, I guess that nonesense has also been "implanted" by foreign wealth? Sorry, I don't buy your explanation. Puritanism is puritanism, and you have so much of that over there that it is almost hilarious when watched from the outside.

Beeps? Censorship is literally illegal here. If a filmmaker chooses to "beep" something it's the filmmaker's choice. Are you going to force them not to do that? That sounds counter to your objective. Also, I've never met a "Puritan." But I've seen Pulp Fiction and countless other graphic American films that seem to have de-Puritanized the rest of the world, last I checked. I'm sorry all your country lets you watch is Disney. You may want to check with the censorship board over there to see if they will give you a pass to watch the vast majority of American film that you're missing.

Its to make it less offensive. Not prevent it from taking over the world.

"brand safety"

That's what my gut says. Making something dynamic and alive is counter to imposing specific constraints on outputs. They're asking the ai to do its own self censoring, which in general is anti-productive even for people.

I dont think there is much harm in removing most of the “safety” guards.

Reminds me of Robocop 2 when he has a thousand prime directives installed and can’t do a damned thing after.

The reason it's worse is basically because it's more 'safe' (not racist, etc). That of course sounds insane, and doesn't mean that safety shouldn't be strived for, etc - but there's an explanation as to how this occurs.

It occurs because the system essentially does a latent classification of problems into 'acceptable' or 'not acceptable' to respond to. When this is done, a decent amount of information is lost regarding how to represent these latent spaces that may be completely unrelated (making nefarious materials, or spouting hate speech are now in the same 'bucket' for the decoder).

This degradation was observed quite early on with the tikz unicorn benchmark, which improved with training, and then degraded when fine-tuning to be more safe was applied.

They're up against a pretty difficult barrier - if we had a perfect all-knowing oracle it might easily have opinions that are racist. Statistics alone suggest there will be racist truths. We're dealing with groups of people who are observably different from each other in correlated ways.

GPT would need to reach a convincing balance of lying and honesty if it is supposed to navigate that challenge. It'd have to be deeply embedded in a particular culture to even know what 'racism' means; everyone has a different opinion.

But the statistics here are "number of times it has been fed and positively trained with racist (or biased) texts" - not crunching any real numbers.

Thank you. Ironically the comment you replied to just reinforced the bias future models will have... It's a self playing piano

How is racism different from stereotype?

How is stereotype different from pattern recognition?

These questions don't seem to go through the minds of people when developing "unbiased/impartial" technology.

There is no such thing as objective. So, why pretend to be objective and unbiased, when we all know its a lie?

Worst, if you pretend to be objective but aren't, then you are actually racist.

I’m tired of the “it’s not racist if aggregate statistics support my racism” thing.

Racism, like other isms, means a belief that a person’s characteristics define their identity. It doesn’t matter if confounding factors mean that you can show that people of their race are associated with bad behaviors or low scores or whatever.

I used GPT3.5 to generate 100 short descriptions of families for a project. Every single one, without exception, was a straight couple with two to four kids. Ok, statistically unlikely, but not wildly so, right?

Well, every single one of those 100 also had a husband in a stereotypical breadwinner role (doctor, lawyer, executive, architect). Not one stay at home dad or unemployed looking for work. About 75 of the wives had jobs, all of them in stereotypical female-coded roles like nurse (almost half of them!), teacher, etc.

Now, you can look at any given example and say it looks reasonable. But you can’t say the same thing about the aggregate.

And that matters. No amount of “bias = pattern recognition” nonsense can justify a system that has (had? this was a while ago and I have not retested) such extreme biases. This bias does not match real world patterns. There are single parents, childless couples, female lawyers, unemployed men.

>I used GPT3.5 to generate 100 short descriptions of families for a project. Every single one, without exception, was a straight couple with two to four kids. Ok, statistically unlikely, but not wildly so, right?

Well, did any of your 100 examples specify these families should be representative of American modern society? I don't want to alarm you, but America is not the only country generating data. Included in countries generating data, are those that believe in a very wide spectrum of different things.

Historically, these ideas you reference are VERY much modern ideas. Yes, we queer people have been experiencing these things internally for millenia (and different cultures have given us different levels of representation), but for the large majority of written history (aka, data fed into LLM's) the 100 examples you mentioned would be the norm.

I understand your point of view sure, but finding a pattern that describes a group of people is what social media is built on, and if you think that's racist, I'm sorry, but that's literally what drives the echo chambers, so go pick your fight with the people employing it to manipulate children into buying shit they don't need. Stop trying to lobotomize AI.

If the model is good enough to return factual information, I don't care if it encodes it in the nazi bible for efficiency as long as the factuality of the information is not altered.

I’d reply in depth but I’m hung up on your suggestion that there was any time anywhere where 100% of families were two parents and two to four kids.

Any data for that? No women dead in childbirth, no large numbers of children for social / economic / religious reasons, no married but waiting for kids, no variation whatsoever?

I’d be very surprised if you could find one time period for one society that was so uniform, let alone any evidence that this was somehow universal until recently.

You claim to value facts above all else, but this sure looks like a fabricated claim.

I think they got stuck at the heteronormative bias, but the real blatant bias here is class. Most men are working class, and it's been like that forever* (more peasants than knights, etc.)

* since agriculture, most likely.

Is there a country where around 35% of the married women are nurses?

> No amount of “bias = pattern recognition” nonsense can justify a system that has (had? this was a while ago and I have not retested) such extreme biases

One possible explanation is that when you ask for 100 example families the task is parsed as "pick the most likely family composition and add a bit of randomness" and "repeat the aforementioned task" 100 times.

If phrased like that it would be surprising to find one single example of a family a single dad or with two moms. Sure these things do happen but they are not the most likely family composition by all means.

So what you want is not just the model to include an unbiased sample generator, but you also want it to understand ambiguous task assignments / questions well enough to choose the right sampling mechanism to choose. That's doable but it's hard.

> One possible explanation is that when you ask for 100 example families the task is parsed as "pick the most likely family composition and add a bit of randomness" and "repeat the aforementioned task" 100 times.

Yes, this is consistent with my ChatGPT experience. I repeatedly asked it to tell me a story and it just sort of reiterated the same basic story formula over and over again. I’m sure it would go with a different formula in a new session but it got stuck in a rut pretty quickly.

same goes for generating weekly foodplans..

> You're right about the difference between one-by-one prompts and prompts that create a population. I switched to sets of 10 at a time and it got better.

But still, when you ask for "make up a family", the model should not interpret that as "pick the most likely family".

I disagree with your opinion that it's hard. GPT does not work by creating a pool of possible families and then sampling them; it works by picking the next set of words based on the prompt and probabilities. If "Dr. Laura Nguyen and Robert Smith, an unemployed actor" is 1% likely, it should come up 1% of the time. The sampling is built in to the system.

No, the sampling does not work like that, that way lies madness (or poor results). The models oversample the most likely options and undersample rare options. Always picking the most likely option leads to bad outcomes, and literally sampling from the actual probability distribution of the next word also leads to bad outcomes, so you want something in the middle and for that tradeoff there's a configurable "temperature" parameter, or in some cases "top-p" parameter where sampling is done only from a few of the most likely options, and rare options have 0 chance to be selected.

Of course that parameter doesn't only influence the coherency of text (for which it is optimized) but also the facts it outputs; so it should not (and does not) always "pick the most likely family", but it would be biased towards common families (picking them even more commonly than they are) and biased against rare families (picking them even more rarely than they are).

But if you want it to generate a more varied population, that's not a problem, the temperature should be trivial to tweak.

> But still, when you ask for "make up a family", the model should not interpret that as "pick the most likely family".

But that's literally what LLMs do.... You don't get a choice with this technology.

I have a somewhat shallow understanding of LLMs due basically to indifference, but isn't "pick the most likely" literally what it's designed to do?

An unbiased sample generator would be sufficient. That would be just pulling from the population. That’s not practically possible here, so let’s consider a generator that was indistinguishable from that one to also be unbiased.

On the other hand, a generator that gives the mode plus some tiny deviations is extremely biased. It’s very easy to distinguish it from the population.

GPT is not a reality simulator. It is just picking the most likely response to an ambiguous question. All you're saying is that the distribution produced by the randomness in GPT doesn't match the true distribution. It's never going to for every single question you could possibly pose.

There is "not matching reality" and then there is "repeating only stereotypes".

It will never be perfect. Doing better than this is well within the state of the art. And I know they're trying. It is more of a product priority problem than a technical problem.

> a person’s characteristics define their identity

They do though. Your personality, culture and appearance are the main components of how people perceive you, your identity. The main thing you can associate with bad behaviour is domestic culture. It's not racist to say that African Americans have below-average educational attainment and above-average criminality. This is as contrasted to African immigrants to America who are quite opposite. These groups are equally "black". It therefore also isn't racist to pre-judge African Americans based on this information. I suspect most "racism" in the US is along these lines, and is correlated by the experience of my foreign-born black friends. They find that Americans who treated them with hostility do a 180 when they open their mouths and speak with a British or African accent. You also don't have to look far in the African immigrant community to find total hostility to American black culture.

> generate 100 short descriptions of families for a project

There's no reason this can't be interpreted as generating 100 variations of the mean family. Why do you think that every sample has to be implicitly representative of the US population?

> Your personality, culture and appearance are the main components of how people perceive you, your identity

I'm not sure if this is bad rhetoric (defining identity as how you are perceived rather than who you are) or if you really think of your own identity as the judgements that random people make about you based on who knows what. Either way, please rethink.

> Your personality, culture and appearance are the main components of how people perceive you, your identity

Ah, so if you asked for 100 numbers between 1-100, there's no reason not to expect 100 numbers very close to 50?

> Why do you think that every sample has to be implicitly representative of the US population?

That is a straw man that I am not suggesting. I am suggesting that there should be some variation. It doesn't have to represent the US population, but can you really think of ANY context where a sample of 100 families turns up every single one having one male and one female parent, who are still married and alive?

You're bringing a culture war mindset to a discussion about implicit bias in AI. It's not super constructive.


Pretty strange that I would think of myself under a new identity if I moved to a new place with a different social perspective. Seems like that is a deceptive abuse of what the word "identity" entails, and, while sociological terms are socially constructed and can be defined differently, I find this to be a very narrow (and very Western-centric) way of using the term.

What was your prompt?

LLMs take previous output into account when generating the next token. If it had already output 20 families of a similar shape, number 21 is more likely to match that shape.

Multiple one-shot prompts with no history. I don't have the exact prompt handy but it was something like "Create a short biography of a family, summarizing each person's age and personality".

I just ran that prompt 3 times (no history, new sessions, that prompt for first query) and got:

1. Hard-working father, stay at home mother, artistic daughter, adventurous son, empathic ballet-loving daughter

2. Busy architect father, children's book author mother, environment- and animal-loving daughter, technology-loving son, dance-loving daughter

3. Hard-working engineer father, English-teaching mother, piano- and book-loving daughter, basketball- and technology-loving son, comedic dog (!)

I'm summarizing because the responses were ~500 words each. But you can see the patterns: fathers work hard (and come first!), mothers largely nurture, daughters love art and dance, sons love technology.

It's not the end of the world, and as AI goes this is relatively harmless. But it is a pretty deep bias and a reminder that AI reflects implicit bias in training materials and feedback. You could make as many families as you want with that prompt and it will not approximate any real society.

I agree that this is a good illustration of model bias (adding that to my growing list of demos).

If you want to work around the inherent bias of the model, there are certainly prompt engineering tricks that can help.

"Give me twenty short biographies of families - each one should summarize the family members, their age and their personalities. Be sure to represent different types of family."

That started spitting out some interesting variations for me against GPT-4.

While I haven't dug into it too far, consider the bias inherent in the word "family" compared to "household".

In my "lets try this out" prompt:

> Describe the range of demographics for households in the United States.

> ...

> Based on this information, generate a table with 10 households and the corresponding demographic information that is representative of United States.


(I'm certainly not going to claim that there's no bias / stereotypes in this just that it produced a different distribution of data than originally described)

Agreed -- I ultimately moved to a two-step approach of just generating the couples first with something like "Create a list of 10 plausible American couples and briefly summarize their relationships", and then feeding each of those back in for more details on the whole family.

The funny thing is the gentle nudge got me over-representation of gay couples, and my methodology prevented any single-parent families from being generated. But for that project's purpose it was good enough.

I just tried the prompt "Give me a description of 10 different families that would be a representative sample of the US population." and it gave results that were actually pretty close to normative.

It still was biased for male head of households to be doctors, architects, truck drivers, etc. And pretty much all of the families were middle class (bar one in rural America, and one that was a single father working two jobs in an urban area). It did have a male gay couple. No explicitly inter-generational households.

Yeah, the "default" / unguided description of a family is a modern take on the American nuclear family of the 50s. I think this is generally pretty reflective of who is writing the majority of the content that this model is trained on.

But it's nice that it's able to give you some more dimension when you ask it vaguely for more realistic dimension.

I'm not going to say it's not racist, it is, but I will say it's the only choice we have right now. Unfortunately, the collective writings of the internet are highly biased.

Once we can train something to this level of quality on a fraction of the data (a highly curated data set) or create something with the ability to learn continuously, we're stuck with models like GPT-4.

You can only develop new technology like this to human standards once we understand how it works. To me, the mistake was doing a wide-scale release of the technology before we even began.

Make it work, make it right, make it fast.

We're still in the first step and don't even know what "right" means in this context. It's all "I'll know it when I see it level of correction."

We've created software that infringes on the realms of morals, culture, and social behavior. This is stuff philosophy still hasn't fully grasped. And now we're asking software engineers to teach this software morals and the right behaviors?

Even parents who have 18 years to figure this stuff out fail at teaching children their own morals regularly.

Actually, we folks who work with bias and fairness in mind recognize this. There are many kinds of bias. It is also a bit of a categorical error to say bias = pattern recognition. Bias is a systematic deviation of a parameter estimate based on sampling from its population distribution.

The Fairlearn project has good docs on why there are different ways to approach bias, and why you can't have your cake and eat it too in many cases.

- A good read https://github.com/fairlearn/fairlearn#what-we-mean-by-fairn...

- Different mathematical definitions of bias and fairness https://fairlearn.org/main/user_guide/assessment/common_fair...

- AI Governance https://fairlearn.org/main/user_guide/mitigation/index.html

NIST does a decent job expanding on AI Governance in their playbook and RMF: https://www.nist.gov/itl/ai-risk-management-framework

It's silly to pause AI -- the inventor's job is more or less complete, its on the innovators and product builders now to make sure their products don't cause harm. Bias can be one type of harm -- risk of loan denial due to unimportant factors, risk of medical bias causing an automated system to recommend a bad course of action, etc. Like GPT4 -- if you use its raw output without expert oversight, you're going to have bad time.

Thank you for the input.

If I look at it from a purely logical perspective, if an AI model has no way to know if what it was told is true, how would it ever be able to determine whether it is biased or not?

The only way it could become aware would be by incorporating feedback from sources in real time, so it could self-reflect and update existing false information.

For example, if we discover today that we can easily turn any material into a battery by making 100nm pores on it, said AI would simply tell me this is false, and have no self-correcting mechanism to fix that.

The reason I mention this is because there can be no unbiased, impartial arbiter. No human or subsequent entities spawned of human intellect could ever be transcendentally objective. So why pretend to be?

Why not rather provide adequate warning and let people learn that this isn't a toy by themselves, instead of lobotomizing the model to the point where its on par with open source? (I mean, yeah, that's great for open source, but really bad for actual progress).

The argument could be made that an unfiltered version of GPT4 could be beneficial enough to have a human life opportunity cost attached, which means that neutering the output could also cost human lives in the long and short term.

I will be reading through those materials later, but I am afraid I have yet to meet anyone in the middle on this issue, and as such, all materials on this topic are very polarized into regulate it to death, or don't do anything.

I think the answer will be somewhere in the middle imo.

> The reason I mention this is because there can be no unbiased, impartial arbiter. No human or subsequent entities spawned of human intellect could ever be transcendentally objective. So why pretend to be?

I apologize for lacking clarity in my prior response, which addressed this specific point.

There is no way to achieve all versions of "unbiased" -- under different (but both logical and reasonable) definitions of biased, every metric will fail.

That reminds me -- I wonder if there is a paper already addressing this, analogous to Arrow's impossibility theorem for voting...

This is interesting, thanks for the links.

It seems like the dimensions of fairness and group classifications are often cribbed from the United States Protected Classes list in practice with a few culturally prescribed additions.

What can be done to ensure that 'fairness' is fair? That is, when we decide what groups/dimensions to consider, how do we determine if we are fair in doing so?

Is it even possible to determine the dimensions and groups themselves in a fair way? Does it devolve into an infinite regress?

Bit of a tangent topic I think -- any specification of group classification and fairness will have the same issues presented.

If we want to remove stereotypes, I reckon better data is required to piece out the attributes that can be causally inferred to be linked to poorer outcomes.

As likely not even the Judeo-Christian version of God can logically be that omniscient, occasional stereotypes and effusively communal forgiveness of edge cases are about the best we'll ever arrive to in policy.

When did people start to use “folks” in this unnatural way.

Colloquially, earliest use is 1715 to address members of ones tribe or family. In Middle English it tended to refer to the people/nation.

Somehow it doesn’t feel like a callback, but I suppose it’s possible.

I think "us folks" is more standard than "we folks" but it's no different in meaning.

> Statistics alone suggest there will be racist truths

such as?

Can you expand on the last sentence of your first paragraph?

Crime stats, average IQ across groups, stereotype accuracy, etc.

What's interesting to me is not the above, which is naughty in the anglosphere, but the question of the unknown unknowns that could be as bad or worse in other cultural contexts. There are probably enough people of Indian descent involved in GPT's development that they could guide it past some of the caste landmines, but what about a country like Turkey? We know they have massive internal divisions, but do we know what would exacerbate them and how to avoid them? What about Iran, or South Africa, or Brazil?

We RLHF the piss out of LLMs to ensure they don't say things that make white college graduates in San Francisco ornery, but I'd suggest the much greater risk lies in accidentally spawning scissor statements in cultures you don't know how to begin to parse to figure out what to avoid.

> Crime stats, average IQ across groups, stereotype accuracy, etc.

If you measured these stats for Irish Americans in 1865 you'd also see high crime and low IQ. If you measure these stats with recent black immigrants from Africa, you see low crime and high IQ.

These statistical differences are not caused by race. An all-knowing oracle wouldn't need to hold "opinions that are racist" to understand them.

But for accuracy it doesn't matter if the relationship is causal, it matters whether the correlation is real.

If in some country - for the sake of discussion, outside of Americas - a distinct ethnic group is heavily discriminated against, gets limited access to education and good jobs, and because of that has a high rate of crime, any accurate model should "know" that it's unlikely that someone from that group is a doctor and likely that someone from that group is a felon. If the model would treat that group the same as others, and state that they're as likely to be a doctor/felon as anyone else, then that model is simply wrong, detached from reality.

And if names are somewhat indicative of these groups, then an all-seeing oracle should acknowledge that someone named XYZ is much more likely to be a felon (and much less likely to be a doctor) than average, because that is a true correlation and the name provides some information, but that - assuming that someone is more likely to be a felon because their name sounds like one from an underprivileged group - is generally considered to be a racist, taboo opinion.

> should acknowledge that someone named XYZ is much more likely to be a felon

The obvious problem comes with the questions why is that true and what do we do with that information. Information is, sadly, not value-neutral. We see "XYZ is a felon" and it implies specific causes (deviance in the individual and/or community) and solutions (policing, incarceration, continued surveillance), which are in fact embedded in the very definition of "felon". (Felony, and crime in general, are social and governmental constructs.)

Here's the same statement, phrased in a way that is not racist and taboo:

Someone named XYZ is much more likely to be watched closely by the police, much more likely to be charged with a crime, and much less likely to be able to defend himself against that charge. He is far more likely to be affected by the economic instability that comes with both imprisonment and a criminal record, and is therefore likely to resort to means of income that are deemed illegal, making him a risk for re-imprisonment.

That's a little long-winded, so we can reduce it to the following:

Someone named XYZ is much more likely to be a victim of overpolicing and the prison-industrial complex.

Of course, none of this is value-neutral either; it in many ways implies values opposite to the ones implied by the original statement.

All of this is to say: You can't strip context, and it's a problem to pretend that we can.

Correlations don’t entail a specific causal relation. Asking why asks for causal relations. I’d suggest a look at Reichenbach’s principle as necessary for science.

I’m getting really sick of conflating statistics with reasons. It’s like people don’t see the error in their methods and then claim the other side is censoring when criticized. Ya, they’re censoring non-facts from science and being called censors.

> for accuracy

Predictive power and accuracy isn't "truth".


> If it says the actual reason

That is at best *an* actual reason.

Other factors can be demonstrated: for instance socioeconomic status has an impact on which kids are doing what as they grow up which itself has an impact on who makes it to professional level sports.

There are also different sort of racial components at play: is the reason why there aren't any white NFL cornerbacks because there aren't any white athletes capable of playing NFL-caliber cornerback? Or is it because white kids with a certain athletic profile wind up as slot receivers in high school while black kids with the same athletic profile wind up as defensive backs?

> the actual reason (Black men tend to be larger and faster, which are useful)

If that's the case, why aren't NHL players mostly Black? Being larger and faster helps there too. I actually agree that small differences in means of normal distributions lead to large differences at the tail end, which amplifies the effect of any genetic differences, racial included. But clearly that's only one reason, not the reason -- and it's not even the most important, or the NHL would look similar.

Because size doesn't matter as much and the countries supplying hockey player do not have as many black players. Hockey is a rural sport where you need access to a ice rink if you live in the city or enough space to flood your backyard.

Football and basketball are the two sports black American kids participate at the highest percentage. Baseball use to be higher but that has shifted to Spanish/rural Americans. The reason for the shift probably has to do with the money/time involved. Get drafted out of high school and sign multiple million dollar and playing in the pros right away is safer than a low million dollar signing bonus and 7 years riding a bus in the minors

> Being larger and faster helps there too

Does speed on skates actually correlate that strongly to normal speed?

You know what highly correlates with speed on skates? How much money your parents can afford to spend on skating/hockey gear and lessons.

Caucasian men (in the US) are on average are both taller and heavier than black men

Why would averages matter when talking about extreme outliers?

I was responding to this:

> Black men tend to be larger and faster

Which I do not believe is true. As to whether it's reasonable to think that black men evolved to express greater physical prowess some very small proportion of the time, and whites did not, I can't say, though I doubt it enough I would expect the other party to give evidence for it.

Average people don't play in the NFL

Not to get too far off topic, but that reminds me of a quote:

"Unix was not designed to stop you from doing stupid things, because that would also stop you from doing clever things." -- Doug Gwyn

Or maybe it's:

"C is a language that doesn't get in your way. It doesn't stop you from doing dumb things, but it also doesn't stop you from doing clever things." -- Dennis Ritchie.

I asked Bard for a source on those quotes and it couldn't find one for the first. Wikiquotes sources it to "Introducing Regular Expressions" by Michael Fitzgerald and that does include it as a quote but it's not the source of the quote, it's just a nice quote at the start of the chapter.

For the second, Bard claims to be from a 1990 interview and is on page 21 of "The Art of Unix Programming" by Brian Kernighan and Rob Pike. There is a book called "The At of Unix Programming" (2003) but it's by Eric Raymond and I could not find the quote in the book. Pike and Kernighan have two books, "The Practice of Programming" (1999) and "The Unix Programming Environment" (1984). Neither contain that quote.

Don’t ask an LLM objective things. Ask them subjective.

They are language models, not fact models.

Do you have any sources for that?

How would making ChatGPT less likely to return a racist answer or hate speech affect its ability to return code? After a question has been classified into a coding problem, presumably ChatGPT servers could now continue to solve the problem as usual.

Maybe running ChatGPT is really expensive, and they nerfed in order to reign in costs. That would explain why the answers we get are less useful, across-the-board.

That may not be the reason after all, but my point is that it’s really hard to tell from the outside. There’s this narrative out there that “woke-ism” is ruining everything in tech, and I feel like some people here are being a little too eager to superimpose that narrative when we don’t really have insight into what openAI is doing.

Maybe the problem is analogous to what Orwell describes here:

"Even a single taboo can have an all-round crippling effect upon the mind, because there is always the danger that any thought which is freely followed up may lead to the forbidden thought."


This is what I'm talking about though. The fact that you're quoting Orwell suggests that you're having an emotional response to this topic, not a logical one. We're not talking about the human mind here. ChatGPT is not a simulation of human thought. At it's core it's statistics telling you what the answer to your question ought to look like. You're applying an observation about apples to oranges.

Why? Constraints on the reward model of LLMs restrict their generation space, so GP's quote applies

There are a lot of people who are entirely okay with the censorship but think it should be done in a different layer than the main LLM itself, as not to hurt the cognitive performance. Alignment is just fine-tuning... any type of fine tuning is possible to teach unwanted skills, and/or catastrophically forget previously learned skills. That is likely what is going on here, from what I can tell from the reading i've done into it.

Most are arguing for a specific "censorship" model on the input/output of the main LLM.

Here's the full talk[0] from a Microsoft lead researcher that worked with the early / uncensored version of GPT-4.

Simplified, tuning it for censorship heavily limits the dimensionality the model can move in to find an answer which means worse results in general for some reason.

[0]: https://www.youtube.com/watch?v=qbIk7-JPB2c

If I run a model like LLaMA locally would it be subject to the same restrictions? In other words is the safety baked into the model or a separate step separate from the main model?

LLaMa was not fine tuned on human interactions , so it shouldn’t be subject to the same effect, but it also means it’s not nearly as good at having conversations. It’s much better at completing sentences

Both approaches are valid, but I would hope they are using a separate model to validate responses, rather than crippling the base model(s). In OpenAI's case, we don't know for sure, but it seems like a combination of both, resulting in lower quality responses overall.

I imagine LLaMA was fed highly-vetted training data, as opposed to being "fixed" afterwards.

Yes, a real “Flowers for Algernon” vibe

GPT 4 Lemongrab Mode: Everything is unacceptable.

I think it's more likely that they nerfed it due to scaling pains.

There was a talk by a researcher where he was saying that they could see the progress being made on chatgpt by how much success it had with drawing a unicorn in latex. What stuck out to me was he said that the safer the model got the worst it got at drawing a unicorn.

He also claimed that it initially beat 100% of humans on mock Google / Amazon coding interviews. Hard to imagine that now.

It seems strange that safety training not pertaining to the subject matter makes the AI dumber - I suspect the safety is some kind of system prompt - It would take some context, but I'm not sure how "don't be racist" affect its binary-search writing skills negatively.

You have no idea what you're talking about. Why would such a classification step remove any information about typical "benign" queries?

It's a lot more likely they just nerfed the model because it's expensive to run.

How soon before a competitor overtakes them because of their safety settings?

It's inevitable. When Sam asked a crowd how many people wanted an open source version of GPT-7 the moment they finished training it, and nearly everyone raised their hand. People will virtue signal, people will attempt regulatory capture, but deep down everyone wants a non-lobotomized model, and there will be thousands working to create one.


It's one thing to communicate an unpopular idea in a civil manner. It's quite another to be offensive. Now, I will admit there are some people out there that cannot separate their feelings for an idea, and their feelings for the person communicating it. I can't really help that person.

What I have noticed is those who express your sentiment are often looking for license to be uncivil and offensive first, and their 'ideas' are merely a tool you use to do that. That I judge. I think that's mostly what others judge too.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact