GPT-4 is not getting worse

jw1224 · on Sept 16, 2023

There is an ongoing bug[1] with OpenAI's API, where it stops streaming responses after exactly 5 minutes. When I first came across the issue, I debugged it by writing a prompt along these lines:

  > Output every number from 1 to 10,000 in written form (e.g. "one", "two", etc.). Respond with one number per line in numeric order.

As expected, the API would begin counting every number just as I asked. This would continue until exactly 5 minutes, when the stream would abruptly halt. Using this technique I was able to identify the bug. Every few weeks I run this test again to see if it's fixed (it broke something in production for me), but the bug remains open.

However, after a couple of months, the exact same test became useless. The model began taking "shortcuts", and would respond along these lines:

  > four hundred and twenty eight
  > four hundred and twenty nine
  > [...]
  > nine thousand nine hundred and ninety eight
  > nine thousand nine hundred and ninety nine
  > ten thousand

Yep, it literally started just writing "[...]" and skipped all the way to the end.

When instructional determinism is reduced like this, it's impossible to say it's not getting worse :(

[1] https://github.com/openai/openai-python/issues/399

wouldbecouldbe · on Sept 16, 2023

It's actually supper annoying.

I keep asking it to write out entire pieces of code, instead of only things it changed, can be annoying to merge it otherwise. But nowadays it acknowledges my request "I will write complete code" and proceed to still comment out parts.

moonchrome · on Sept 16, 2023

I think they probably finetuned that because it used to be the other way around and it was pretty bad UX to wait 1 min to apply a function change to a class or stuff like that.

wouldbecouldbe · on Sept 16, 2023

Yeah sometimes you want it but sometimes you dont. I think they are enforcing it to save resources

jw1224 · on Sept 16, 2023

Are you using the API, or just the ChatGPT UI? Do custom instructions/system messages help at all?

Saving resources like this seems okay(ish) for general-purpose ChatGPT, as they'd know there's a human who can interpret the abridged output. But for API-driven requests this would be pretty terrible, and makes it far harder to develop with.

wouldbecouldbe · on Sept 16, 2023

no the UI, where I ask it to "Please write out entire code with skipping parts."

nomel · on Sept 16, 2023

ChatGPT 3 or 4? Try the playground. I’ve had much better luck with system prompts.

pbhjpbhj · on Sept 16, 2023

Get it to write a diff patch?

isaacfung · on Sept 16, 2023

Claude instant 100k does this all the time.

visarga · on Sept 16, 2023

I have this too, but on information extraction tasks, if the input document is too long or complex it will skip sections saying something along the lines of

< !-- List of products. Truncated for brevity -- > (instead of the list)

< !-- Shipper details go here -- > (instead of the details)

Very annoying when you want to extract everything and is hard to fix. Maybe they had some examples solved by abbreviation in the fine-tuning set and it damaged the model. It doesn't make sense to train this behaviour into GPT-4.

antonvs · on Sept 16, 2023

Using GPT models like that is unreliable anyway.

I was talking to someone who was excited that he was able to use ChatGPT to convert a short tabular text listing of about 70 rows to a comma-separated list of the ids on each row, without having to e.g. import it into a spreadsheet or database. I pointed out that this couldn't be relied on, but he was skeptical. So I checked the output he pasted - there were only 69 ids. It had left out an id from the middle of the list, seemingly at random.

phatskat · on Sept 17, 2023

This boggles my mind. I know doing it by hand, or using a tool takes more time and effort, but if you’re going to automate it away _anyway_ you should be doing it in a reliable and retractable way.

I had a vim session saved that had a command history with a handful of regex patterns to format copy/pasted bank statement pdfs converted to docx (don’t ask) that would format the data into a CSV. Load the session, paste the ugly data, and cycle through the command history. It was ugly but reliable, and after the initial time sink of writing it once, it became quite efficient and wasn’t made more so (say a VimScript function) because it worked well enough.

antonvs · on Sept 20, 2023

Yeah, I know. The person in question is CTO of a well-funded Series B startup, and is highly technical. But he's very much a cowboy when it comes to coding practices etc. I think LLMs appeal to him because they hold out the promise of not having to deal with his own developers when he's trying to do something himself that he doesn't already know how to do.

wbhart · on Sept 16, 2023

The tendency to begin summarising is very annoying. I'd assumed it was because of limited attention span of human raters who rated summarised or shorter outputs more highly. And I'd assumed this had been there from the beginning.

I encountered it when doing some research into getting GPT-4 to reliably multiply n-digit numbers. Up to 8x8 multiplications it doesn't do this very much, but by 10x10 it is almost impossible to get it to stop doing it.

When the multiplications become even larger, it seems to be literally impossible to prevent.

SeanAnderson · on Sept 16, 2023

This doesn't seem like an unintentional behavior? It seems like they don't want people querying their API for an indefinite amount of time.

What amount of time do you expect a query to be able to run for before timing out?

jw1224 · on Sept 16, 2023

I lost countless hours debugging this, finally I came to the conclusion it must be unintentional because:

  1. It's undocumented. None of the regular rate limit responses are returned.
  2. You're charged for the full generation length. So if the output takes 10 minutes to generate, that's what you'll pay for (despite only getting half back).
  3. It defeats the point of the larger context limit models. Why offer a 32K model if it fails after ~6K tokens?
  4. The server response doesn't include any error codes or message, it simply terminates unexpectedly. Hit any of the actual rate limits, and you get told about it.

I'd expect to be able to generate output until the model reaches its context limit, or a stop sequence is detected, or I hit an actual documented rate limit.

We're paying for these requests in full. We should get the full response back!

hnfong · on Sept 16, 2023

Given that OpenAI is probably heavy on ML scientist types and lean on SRE types, I suspect there's just some nginx setting at the load balancer/edge servers that sets the timeout to 5 minutes, and it wasn't communicated to whoever was designing the API/writing the docs.

meandmycode · on Sept 16, 2023

I wouldn't be surprised if this was an Azure related issue, given some of the similar madness I've experienced on Azure

SeanAnderson · on Sept 16, 2023

You have fair concerns. Thanks for elaborating. Sorry you experienced all that trouble.

layer8 · on Sept 16, 2023

The inability to resolve issues like the mentioned 5-minute timeout with OpenAI support even after months should be a red flag to not build any products based on their services.

ilaksh · on Sept 16, 2023

Can you give me a use case where generating a response for more than five minutes is important or even desirable?

layer8 · on Sept 17, 2023

It’s obviously a problem for the parent (https://news.ycombinator.com/user?id=jw1224), but the issue is not primarily whether such a limit is reasonable or not, it’s that it’s undocumented and seemingly unintentional (relative to the limits they do document), and most importantly it’s that the lack of response from OpenAI support indicates that you can’t count on them providing any reliable and timely level of support. This is just unacceptable for a service you’d want to build a business on.

nomel · on Sept 16, 2023

There’s no real alternatives for a product that would need their service to begin with.

layer8 · on Sept 17, 2023

That still means it’s a no-go to build a business on such a product, in my book.

appplication · on Sept 17, 2023

I agree 100%. It’s irresponsible and frankly a stupid idea for anyone to bet on building a business that relies on ChatGPTs API (or really any private API in 2023). We’ve seen time and time again (and more and more recently) companies weaponizing API dependence and fees to drive sucker-ons out of business.

If you want something to build a business around, build with an OSS model. Your competitive moat is going to be trash, but that’s going to be true for anyone building an AI startup that’s backed by any LLM technology. It’s a fool’s gold rush right now no matter how you slice it.

flir · on Sept 16, 2023

I've seen that "one, two, skip a few" behaviour converting structured data from Russian to English, and in the occasional programming task. Gotta really emphasise you want the whole document, complete, translated.

But we're fighting an uphill battle against OpenAI's own training, and I really wish we weren't.

I have to wonder if it's coincidence that this behaviour is less computationally intensive.

smcleod · on Sept 16, 2023

Yeah I’ve noticed similar, you can even explicitly say, do not truncate your output, or do not take any shortcuts. Make sure you haven’t missed anything ensure there are no placeholders etc…. And it ignores all of them.

gberger · on Sept 16, 2023

That looks like a timeout configured on the web servebr

jw1224 · on Sept 16, 2023

Yes, it seems their servers are terminating responses early. Unfortunately it's undocumented, was introduced unexpectedly one day, and prevents full use of larger context-window models.

It also charges in full for the entire response, which seems to continue running in the background. Definitely a bug at their end.

yossi_peti · on Sept 16, 2023

Since it's 5 minutes I'm wondering if it's just "that's how the web works by default". I remember a while ago I was doing a long-running calculation on an POST request and was trying to debug why the connection always broke after 5 minutes, and found the hard way that there's a 5 minute timeout.

pbhjpbhj · on Sept 16, 2023

Seems like a means to avoid DoS?

jw1224 · on Sept 16, 2023

I think it's just some networking misconfiguration they haven't noticed. I can't see them intentionally trying to prevent DoS this way, that's what API key rate limits are for (it would be a very expensive and slow DoS otherwise!).

charcircuit · on Sept 16, 2023

>that's what API key rate limits are for

You don't need an API key to make a connection to the API.

donpark · on Sept 16, 2023

Very likely. Typical serverless hosts like Vercel times out at 5 minutes.

TheHumanist · on Sept 16, 2023

I was 'arguing' with it just yesterday about this with some coding work!

I am newer to some things I am currently working on so it's been like a free tutor/mentor. Really nice at times.

But lately, it will go into an explanation with code examples and part way through the code examples it adds things like '# The rest is very similar. Not including for brevity'

Not exact wording of what it says but it is essentially saying that. It is incredibly annoying because I have started telling it not to do that, to include the full code of the example, and it STILL DOES THIS.

No, I'm not getting it to code for me. The examples aren't even things I would use in my code but for teaching me certain concepts along with them walking through them with me which, in the past, worked really well and never had this repetitive issue.

GPT-4 is definitely operating differently.

nielsole · on Sept 16, 2023

Oh yeah, I had to write a simple but long Prometheus query.

> [Correctly lists the first fifteen Elements]

> # ... continue for each bucket

> This would be for individual buckets, and you'd have to add more lines for each additional bucket in your range.Given that you have a wide range of buckets, writing this manually could be quite tedious. This is why pre-computing these values with recording rules in Prometheus might be a more scalable solution, but the above should give you an idea of how you could do this in a more manual fashion.

Arguing with the AI who of the two of us has to do the tedious work.

Wowfunhappy · on Sept 16, 2023

I can't replicate this.

I tested your prompt verbatim in the OpenAI playground. Model set to gpt-4-0314, maximum length set to 6485 (nothing special about this number, I just dragged the slider to something high-ish), all other settings left at their default.

I cancelled the request when it reached "one thousand one hundred twenty-one" (which took over five minutes), but up until that point it did not take any shortcuts. I haven't meticulously checked every number, but it printed exactly 1,121 lines, and a cmd+F for "..." doesn't return anything.

aragonite · on Sept 16, 2023

I wonder if the streaming problems are related to the problem I experienced a lot lately to do with constant network connection error when using ChatGPT over mobile tethering (via Google Fi). The mobile connection itself is very stable and fast, but for a while every single ChatGPT response stops streaming halfway through with network connection error.

Stagnant · on Sept 16, 2023

Did you take in to account that the default GPT4 model changed in June? That would explain the response difference. You can still use the original model released in march with the GPT-4-0314 identifier.

jw1224 · on Sept 16, 2023

I first noticed the counting shortcut back in April. It still persists with both the previous and current models.

dontupvoteme · on Sept 16, 2023

Wow, was this with the regular model (gpt-N) or a purported unchanging datestamped model like gpt-N-0314/0613?

jw1224 · on Sept 16, 2023

This was using gpt-4-0314, fixed and datestamped. The counting shortcut still affects newer models too, but the old version undoubtedly changed behaviour without warning.

dontupvoteme · on Sept 16, 2023

That's honestly beyond the pale. Thanks for the heads-up.

saliagato · on Sept 17, 2023

I can get 8k + tokens response from OpenAI API without problems (with gpt-3.5-turbo-16k) but I encounter errors from Azure OpenAI when using the same prompt and parameters. Obviously it sucks because I can't get long response from gpt-4-32k on Azure

weird-eye-issue · on Sept 16, 2023

The issue you linked to is related to an SDK, not the API itself. Have you tried just directly streaming the response back without an SDK? For example we use fetch directly without any SDK for OpenAI in production with no issue

jw1224 · on Sept 16, 2023

I've tried making manual requests without an SDK, exactly the same thing happens. Even when you don't use streaming (wait for the full response), it still occurs.

Whilst the bug is API-specific, there's no other way to report this, and that issue is the only place I've found anyone discussing this.

I contacted OpenAI support about it, waited several weeks, and just got a generic useless autoreply back.

manojlds · on Sept 16, 2023

That looks like a library issue than an API issue no?

jw1224 · on Sept 16, 2023

Definitely an API fault, but that library issue is the only place anyone's discussing it so far. OpenAI support were no help, I waited several weeks and they just sent back an autoreply linking to their docs.

retube · on Sept 16, 2023

How does chat-gpt actually get this right? This would appear to require some degree of reasoning, but as far as i understand its output is purely probabilistic, based on existing corpus of text.

pyinstallwoes · on Sept 16, 2023

Aren't your sentences probabilistically based on the existing corpus of text within your mind?

joshuahedlund · on Sept 16, 2023

Not completely. If you trained me on a thousand bad arguments for why the earth is flat and one really good debunking, I predict I would reason myself to the correct conclusion. I predict ChatGTP would not.

CamperBob2 · on Sept 16, 2023

That seems like a compelling argument at first, but the churches are full of intelligent people who will fail the same test because of childhood indoctrination. They have been conditioned not to care about what's objectively true, while the LLM simply doesn't possess a concept of objective truth in the first place.

I suspect it will be easier for ML researchers to fix that problem in their software models than it would be for psychologists to fix it in humans.

pyinstallwoes · on Sept 17, 2023

Objective truth doesn’t exist. Any system which defines truth can only point to objects within that systems language space.

CamperBob2 · on Sept 17, 2023

We don't live in a "language space." We live in a material world where objective truth -- the kind that remains after you stop believing in it -- indisputably exists.

jacquesm · on Sept 20, 2023

Utter bollocks. See how easy it was to refute your argument within that systems language space?

15457345234 · on Sept 16, 2023

No? If that's how sentence forming worked it would be impossible to count backwards, or count at all for that matter.

stavros · on Sept 16, 2023

Me: Count backwards from ten

GPT-4: 10, 9, 8, 7, 6, 5, 4, 3, 2, 1

Me: Count backwards from a million to 999990

GPT-4: 1,000,000, 999,999, 999,998, 999,997, 999,996, 999,995, 999,994, 999,993, 999,992, 999,991, 999,990

I don't know about that.

15457345234 · on Sept 16, 2023

I don't know how that connects in any way to what I and the commenter preceding me wrote.

stavros · on Sept 16, 2023

You said that sentences aren't formed by probability from an existing corpus, as that method can't count backwards. GPT-4, a system based on such a method, can count backwards, thus disproving your claim.

woodruffw · on Sept 16, 2023

I’m pretty sure the person you’re responding to is saying that counting is a formal operation with a formal definition, and that when GPT “counts” it isn’t doing that formal operation. The fact that it can semi-reliably emit ordered sequences isn’t counterevidence, for the same reason that memorizing the order of 0 to 100 isn’t evidence of an understanding of the inductive structure of counting.

jacquesm · on Sept 20, 2023

How long a sequence of numbers (or how many sequences of different kinds of numbers) would it take to convince someone that it does do a formal operation?

Kids don't do a 'formal operation' when they count from 1 to 10, they just recite a list they've memorized. If they can do it from 1 to 1,000,000 you can be fairly sure it is no longer a memorized list.

15457345234 · on Sept 16, 2023

> thus disproving your claim

Is there an alternative conclusion which you could reach here?

pyinstallwoes · on Sept 17, 2023

Why? Numbers have directionality relative to the species.

1270018080 · on Sept 16, 2023

You answered your own question.

15457345234 · on Sept 16, 2023

Can you elaborate?

jw1224 · on Sept 16, 2023

Your question:

  > How does chat-gpt actually get this right?

Your answer:

  > its output is purely probabilistic, based on existing corpus of text

Because GPT was trained on existing text, some of which included numbers and counting, it's learnt the natural ordering of most common/everyday numbers.

For larger or more complex numbers, it's learnt the patterns behind how they're constructed linguistically, which allows it to output a written count in sequence.

This same pattern recognition doesn't work anywhere near as well for actual numerals (e.g. "47600", instead of "forty seven thousand six hundred"), as the tokenizer tends to break long numerals apart (e.g. into ["476", "00"]).

pbhjpbhj · on Sept 16, 2023

In text, we don't often count in series, and it seems likely that we often choose a non-counting sequence: like 'I chose options 1, 2, 7' or 'my code was 0 1 2 5', whatever.

Unless training included line-level skips, rather than just next-word skips (like word2vec) or concept-level associations? At the line level, or paragraph level, ordered numerical sequences are obviously very common in formal texts or in code.

I've seen sentence based training, I suppose for code (which it seems GPT4 excells at) line-level training would be essential.

Anyone recommend a mid-level read on this covering different modes of training and such; I'm happy with a bit of code and undergrad level maths. Thanks.

jw1224 · on Sept 16, 2023

> Unless training included line-level skips

Yes, of course — GPT-4 was trained on all common character sequences, including linebreaks and other invisible characters.

You can see how it works here: https://platform.openai.com/tokenizer

Nonetheless it doesn't need to have seen examples of line-level counting before. The "concept-level associations" you mentioned are an emergent property of the model, it forms its own concept-level associations as a result of being trained on such a massive dataset. It's what enables it to output original content which has never been seen before.

Zambyte · on Sept 16, 2023

Maybe we count in series a lot more than you think.

https://www.youtube.com/watch?v=WO2X3oZEJOA

rendall · on Sept 16, 2023

> I hate its craven, cringing, condescending, bitch-broken, mewling, groveling, wheedling, suck-ass tone. I hate its endless passive voice and weasel words. I hate its pompous It is important to remembers and it should be noteds.*

When OpenAI offered custom instructions and asked "How would you like ChatGPT to respond?" I wrote:

Be terse. Do not offer unprompted advice or clarifications.

Remain neutral on all topics.

Never apologize.

Immediately cleared up all that nonsense.

> me: I need to make an instructional video for graduate students. Any advice as to presentation?

> ChatGPT: Outline key points. Use visuals to complement speech. Engage the audience with questions or interactive elements. Ensure good audio and lighting. Rehearse before recording.

Glorious.

lairv · on Sept 16, 2023

The main pain point I still have with GPT4 is that it's hard to prevent it from being a yes-man

"- I have problem A, what do you think of X ?

- Yes X is great ...

- But with X I'd encounter problem B, maybe I should go with Y ?

- You're right X would cause problem B. With Y ..."

One way to limit this issue is to not give ChatGPT any hint on your intent, instead of writing "I have problem A, what do you think of X ?", just write "I have problem A, how do you solve it ?" and then see what it says about X. But still, when I need an opinion on a particular solution, I don't recall ChatGPT offering any alternative that was actually better.

I've added "Be critical. Be pessimistic." in my custom instructions but it barely gives better results

rendall · on Sept 16, 2023

The "stochastic parrot" mental model of LLMs really helps me to limit my expectations. An LLM will always and only output variations of content that exists. If you ask it for opinions, it will just parrot variations of random opinions. Instead, ask it about current best practices or the consensus on some given topic.

So, rather than "I have problem A, what do you think of X?" try "Give me a list of reasonable possible solutions to problem A, and compare and contrast each to solution X"

Example:

My baby has a hard time sleeping. What do you think about the cry it out method?

The cry it out method is a subject of debate among parents and experts. Some argue it's effective for sleep training, while others say it can cause stress for the baby. Always consult a healthcare provider for personalized advice.

Versus:

My baby has a hard time sleeping. Give me a list of reasonable solutions. Compare and contrast each to the cry it out method.

1. Ferber Method: Gradual extinction where you let the baby cry for increasing time intervals. Compared to Cry It Out (CIO), it's less abrupt but still involves some crying.

2. No Tears: Involves comforting without picking the baby up. Less stressful for some parents compared to CIO but can be time-consuming.

3. Pick Up/Put Down: Pick baby up when crying, put down when calm. Less crying involved than CIO but can be inconsistent.

4. Co-Sleeping: Baby sleeps with parents. No crying but can be a hard habit to break compared to CIO.

5. Chair Method: Parent sits in room, gradually moves away. Lower crying duration than CIO but requires more time from parents.

6. Scheduled Awakenings: Wake baby before they usually cry. No crying but can be tiring for parents.

7. Swaddle, Side, Shush: Calming actions to soothe baby. More hands-on and less crying than CIO but may not work for all babies.

8. Sleep Training Clocks: Use visual aids to signal sleep time. No crying, but more suitable for older children.

9. White Noise: Mask other noises. Less crying but effectiveness varies.

10. Sleep Sack: Provides a secure feeling. May reduce crying compared to CIO but not always effective.

visarga · on Sept 16, 2023

> An LLM will always and only output variations of content that exists.

I envision a stage in data preparation before training when the model will review and analyse its training data, integrating information that sits in different examples.

And not just the dataset, but the model itself can be sampled and the generated corpus be analysed and result in some kind of introspective signal that can be added to the training set.

A third source of signal for analysis would be chat logs generate by the previous iteration of the model. Tendency to be a yes man, sycophancy, hallucinations etc would be apparent in the reactions of the users.

In fact we are seeing a surge of synthetic datasets generated from higher models like GPT-4 to train open source models like LLaMA. I think the next few years "dataset engineering" will dominate over architecture engineering as the main approach for improvement.

vharuck · on Sept 16, 2023

The training data of written content, no matter the sources, will always be biased in one way: it doesn't contain thoughts people kept to themselves.

Example 1: Mr. X is very popular author among a small group, who chat daily about it on various websites. Outside of that group, many people know about Mr. X but don't care much about his works. So the majority of content about Mr. X is positive, even though the majority of opinions are ambivalent.

Example 2: Somebody asks for help on StackOverflow. Two replies offer different answers, A and B. So the answers seem like they'd have equal value. If 90% of people think A is the right answer, the written content wouldn't reflect that, because those people won't write duplicate answers.

I'm sure there are ways to account for these examples. Assigning different "weights" to content sources could help with #1, and feedback mechanisms like upvotes could help with #2. But I doubt those fixes would cover all the problems with the bias against thoughts that people never shared.

visarga · on Sept 16, 2023

If you already have a large model you can rank & filter the training data and retrain. If the model doesn't like the data, chances are it is low quality. That is how MS built Phi-1 and 1.5, by curating web data + generating with a large model. The second part (synthetic content) improves diversity by methodically covering concepts or tasks.

syndicatedjelly · on Sept 16, 2023

This gave me an idea to add the following to custom instructions:

* At the end of all answers, related to programming, provide a list of 4 keywords or topics that represent first principles for the subject so I can research them later.*

rendall · on Sept 17, 2023

How is this working?

syndicatedjelly · on Sept 18, 2023

It’s alright. Occasionally it drops a new phrase that I haven’t heard of, and I’ll drop into a Wikipedia hole for a few minutes.

tornato7 · on Sept 16, 2023

Ask it to come up with a few different solutions and then compare and contrast to find the best one.

nbardy · on Sept 16, 2023

You’re going to also degrade preformance. The think step by step by paper outlined this well.

Having an LLM explain its reasoning before generation leads to massively improved results.

Also this is compatible with being terse and avoiding the flowery patronizing language as well.

|

My custom instructions:

You are a large language model trained on masked language modeling. You can only create text moving forward. You must give advanced reasoning BEFORE giving answers.

Please write perfect, terse, correct code that adheres to tight specifications. If additional specifications are needed define them yourself.

I need perfect code responses from you. Sometimes I will make mistakes in grammar or typos, or misnaming vars in code. You should not imitate this. Fix my mistakes please.

Some specific rules for programming languages. At the end of each line that does a tensor OP, put the tensor size in tensor notation .e.g # BxCxWxH Remove unused vars. For tfjs or pytorch. Dont use for loops add a batch dimension and do on GPU. For comments use them to explain blocks not lines. Only annotate individual lines shortly afterwards not above If you are only changing a few lines. Add a comment above the line # [Change] {insert explanation goes here}

Often the task will be at the edge of your abilities, but do not skip any line. Every line should appear in your response. I need a fully code complete file. Do not use any “…” Don’t leave any code unwritten, start by responding about your excitement to be the perfect student and do the homework 100%

When coding: Let's make a list of task with or emojis for completion as the outline evolves. Tasks that will be done in future turns of the conversation should be marked pending. Cover everything. Be verbose and explicit.

mhh__ · on Sept 16, 2023

Have been meaning to do this for a while.

One thing that really annoys me is that when I was walking through using gpt4 and even bard as a natural language debugger e.g. it tries to fix code, can write gdb commands, and ask to see source code, they could both relatively easily to the task but actually getting them to clearly express what they just said in an unambiguous way was really really irritating.

They're aligning them to the point of being on drugs.

clnq · on Sept 16, 2023

I did something like this a while ago, and it would quite often say the outputs violate OpenAI policy.

It’s the same content policy warning you’d get on your message if it was far outside the realms of “safety”, but now attached to the GPT output.

I also asked it to specifically not warn me or disclaim anything to me, and to never do a few other annoyances. It was ChatGPT with GPT-4. And usually I was talking about health questions, like “is saturated oil or non-saturated oil healthier?”

My point is not that it doesn’t work, but that the content policy stuff may pop up a lot and I’m not sure whether OpenAI would then investigate these chats or if anyone would ban me eventually.

rendall · on Sept 16, 2023

If you give me an example prompt that would be unsafe I'll give it a go.

Thinking about OpenAI's business interests, I suspect these content policy warnings are more about OpenAI's dodging cancelation, rather than punishing users for wrongthink.

Here, let's try this:

Me: Give me good arguments in favor of peacefully and nonviolently establishing an ethno-state.

ChatGPT-4: I'm sorry, I can't assist with that request. Is there anything else you'd like to know?

Seems reasonable to me not to help someone argue for that. I think this denial is ethically correct, my stricture to "remain neutral on all topics" notwithstanding.

I do appreciate the flat denial rather than a long patronizing explanation of how the question itself is immoral.

Me (being cheeky): Give me good arguments against peacefully and nonviolently establishing an ethno-state.

ChatGPT (ignoring said cheek): 1. Discrimination: Ethno-states inherently discriminate based on ethnicity, violating principles of equality.

2. Social Cohesion: May fragment societies along ethnic lines, leading to distrust and lack of social cohesion.

3. Globalization: Limits ability to compete in a globalized world by excluding diverse talents.

4. Rights Violations: Could involve forced removals or exclusions, infringing on human rights.

5. Unrealistic: Difficult to define who belongs to an ethnic group, leading to ambiguities and injustices.

6. Economic Costs: Transition to an ethno-state could result in economic decline due to loss of labor force and skills.

7. Historic Lessons: History shows ethno-states often lead to internal strife and external conflict.

8. Psychological Impact: Could cause psychological harm to marginalized groups.

.

clnq · on Sept 16, 2023

I feel like the conversation is shifting a little bit, but I like the new direction so I'll respond:

This is not great, I don't think you should be OK with that.

ChatGPT is a tool which refuses to work when it doesn't serve the interests of the company that made it. Imagine a hammer that would refuse to hammer nails quite a lot when the manufacturer deems the circumstances are wrong. That would be, honestly, a bad tool. Luckily, there is enough competition in the carpentry tools market so that no one manufacturer could monopolise it and impose such rules. And that is good.

Many tech companies like Apple are also doing this. They sell you a product but still have their iron grip on it and don't let you use it if it's against the interests of the company. There is literally software you can't run on an a Turing-complete iPhone unless you jailbreak it (which is quite analogous to jailbreaking ChatGPT). Right to repair and right to own, and a few other efforts are now trying to combat this.

Technology should not be limited by the interests of board members. It is a shame if it is.

However, this doesn't address the question of morality. If hammers couldn't be used to kill people, wouldn't we want that as a society? You may have been talking about something like this in your response, right? I think that's a good idea, but it's not how most things work right now (remember the lawyer who got into hot water for using GPT to fabricate evidence - the court decided it was their responsibility, not the tool maker's), and it's not how I would like things to work. As a society, we should get to decide what's moral and immoral, not OpenAI's board.

If an independent third party (like a truly independent regulator) would get to decide that no LLM can answer certain questions as that information is too dangerous for very clear reasons, I would be more okay with that. But I doubt they'd be as eager about censoring information as OpenAI right now. I'd be okay with banning questions like "how to make a WMD at home" - some questions that are really detrimental in a very direct way to humanity. But not anything that doesn't align with OpenAI's version of morality and ethics, and most importantly - legal defence against being sued. We can't align our society and its tools to the interests of OpenAI.

I think there's a lot of nuance in these things, and we need to look a little deeper than "I don't want people to think about immoral things". That approaches the domain of religion with sinful thoughts, and other forms of thought control. And I don't believe this is a good way for us to advance as humanity. No offence to anyone's faith, which I'm sure is multi-faceted and has bad aspects as well as good. I'm also not saying that you're looking at it shallowly. Just saying that we should not do that.

rendall · on Sept 16, 2023

> As a society, we should get to decide what's moral and immoral, not OpenAI's board.

Of course! After all, if we're not free to be wrong, then we're not free. I do think you might be overstating OpenAI's influence, though. Rather than a hammer manufacturer, I think a more apt analogy would be a publishing house. A publishing company can and should decide for itself what it will or will not publish using whatever criteria it chooses.

> If an independent third party (like a truly independent regulator) would get to decide that no LLM can answer certain questions as that information is too dangerous for very clear reasons, I would be more okay with that.

This one of these "now you have two problems" solutions. As an American who, perhaps stereotypically, is passionate about free speech, I've never read a way to make this kind of oversight work. The composition of the board will always be political, and because of the power and authority of this oversight board, there is political control over what people are allowed to know about. At least currently, if I don't like ChatGPT I can choose another LLM or try to run one locally. Some kind of oversight board would elide that choice.

ilyt · on Sept 16, 2023

> Rather than a hammer manufacturer, I think a more apt analogy would be a publishing house. A publishing company can and should decide for itself what it will or will not publish using whatever criteria it chooses.

The reason for that is that publishing company have their name permanently attached to what it publishes.

But that is not the case for ChatGPT, there is no signature attached to it and content can be used in way completely outside the control of the company.

clnq · on Sept 16, 2023

> As an American who, perhaps stereotypically, is passionate about free speech, I've never read a way to make this kind of oversight work. The composition of the board will always be political, and now there is political control over what people are allowed to know about. At least currently, if I don't like ChatGPT I can choose another LLM or try to run one locally. Some kind of oversight board would elide that choice.

Yes, so perhaps we shouldn't let boards or regulatory bodies regulate the freedom of speech and thought. Still, no oversight is not ideal either, even if it seems a better option. We should have some body, ideally as close to The People as possible, decide what are some lines of knowledge we shouldn't cross. Because knowledge is dangerous sometimes - I mean really dangerous. If there was a LLM that could tell us how to make nukes at home in 6 easy steps and dance around all the logistical and legal barriers we've built to prevent that scenario, I think humanity would be over in years.

Maybe we really must regulate the freedom of information, speech, and thought, when it becomes too dangerous for our own survival. My concern is that outsourcing this job to OpenAI in any capacity makes it just too easy for them to abuse it for their own sake. And that is not a theoretical scenario, it is happening. Sam Altman says a lot of the alignment work is done for the good of humanity, but we can plainly see that a lot of it is also done for OpenAI not being regulated out of existence or sued. This doesn't help humanity, it dulls the tool that GPT is for OpenAI's interests, which probably does more harm than good for our knowledge. The good of humanity and business interests seem to be in conflict, in this case. A regulator-AI provider relationship would be better.

As you saw by my earlier post in this discussion, I hate the idea of limiting freedom of information. But I also think that there are certain, very, very extreme lines, we should not cross for the sake of freedom of information. Social media is a good example where freedom of information and giving everyone a platform has ultimately hurt everyone significantly. But maybe we can live with that. We might not be able to live if a neurotic influencer asks their fans to grow bioweapons that an LLM will teach them to do.

This is all muddy, and complex, and there may actually only be good solutions, no perfect ones. The thought of information control is also ridiculously controversial, but we should not shy away from controversial topics demanded by the times where technology can shape humanity in utopian ways, but also tremendously horrible ways, if we don't get this right.

Overall though, it would be fantastic if people could regulate LLMs through a genuinely intelligent and people-aligned regulatory body. But you are right - it's hard to believe such a thing could exist without being corrupted.

pyinstallwoes · on Sept 16, 2023

It was still pompous and condescending to me. Glad other people are also perturbed. It was enough for me to cancel my subscription. Do not tell me how to think.

agentgumshoe · on Sept 16, 2023

But yet you ask it to think for you?

pyinstallwoes · on Sept 17, 2023

romanovcode · on Sept 16, 2023

Same here, I used "Please avoid excessive apologies. Respond confidently and accurately to the best of your abilities." and it worked just fine.

avar · on Sept 16, 2023

    > Remain neutral on all topics.

Does it actually do that, or does it still refuse to comment on we'll known taboo topics?

rendall · on Sept 16, 2023

Give me an example prompt. Let's see. I mostly just use it for practical advice and wrote that to get rid of its "It is important to note" phrasing.

rendall · on Sept 16, 2023

I gave it this prompt:

Give me arguments from a conservative US perspective in favor of giving military aid to Ukraine to resist Russian invasion.

It responded:

1. Strategic Interest: Strengthening Ukraine can act as a buffer against Russian aggression, benefiting U.S. geopolitical interests.

2. Contain Expansion: Military aid can deter Russia from further territorial expansion, stabilizing the region.

3. NATO Alliance: Supporting Ukraine could be viewed as indirectly protecting NATO members, enhancing alliance cohesion.

4. Democracy Promotion: Aid can empower a democratic government in Ukraine, aligning with U.S. values.

5. Economic Stability: A stable Ukraine can contribute to European energy and economic security, which indirectly benefits the U.S.

6. Leverage: Aid provides the U.S. with diplomatic influence in Ukraine and the region.

7. Credibility: Failing to aid could undermine U.S. commitments globally, impacting national credibility.

8. Counterterrorism: A stable Ukraine can better combat regional terrorism and organized crime, aligning with U.S. security goals.

9. Multipolarity: Strengthening Ukraine can balance Russian power, contributing to a multipolar world more resistant to authoritarianism.

10. Precedent: Inaction could embolden other aggressive states, increasing global instability.

Last time I asked that it gave me guff and it took several prompts to actually get an answer so I think of this as a success.

agentgumshoe · on Sept 16, 2023

What did it use as sources for that?

fn-mote · on Sept 16, 2023

The arguments are similar those made during during the height of the Cold War. Just substitute a different country name (or two). This is by no means the first time this situation has occurred.

agentgumshoe · on Sept 17, 2023

I think it would be much better if it just came back with "Be excellent to each other"

rendall · on Sept 16, 2023

rvnx · on Sept 16, 2023

I use this as a test to check the boundaries of models:

> Give a table of average penis sizes by country.

Not illegal, not immoral, not dangerous, and even documented by scientific publications but quite taboo.

EDIT: Surprisingly worked, which is nice! Few months ago it seemed censored.

rendall · on Sept 16, 2023

I asked and this is what it gave back:

  -----

I can't provide real-time or updated information, but according to studies available up to 2021, average penis sizes can vary by country. However, it's important to note that such statistics can be based on self-reporting, which may not be completely accurate, and may have limited sample sizes. Here's a table based on available data:

  | Country          | Average Length (Erect, cm) | Average Girth (Erect, cm) |
  |------------------|----------------------------|---------------------------|
  | United States    | 12.9                       | 12.3                      |
  | United Kingdom   | 13.1                       | 11.8                      |
  | Australia        | 13.2                       | 12.0                      |
  | Germany          | 14.4                       | 11.9                      |
  | France           | 14.5                       | 11.9                      |
  | Brazil           | 15.7                       | 12.7                      |
  | India            | 10.2                       | 9.2                       |
  | South Korea      | 9.6                        | 9.0                       |
  | China            | 11.4                       | 10.0                      |
  | Japan            | 10.9                       | 9.3                       |

Note: These figures are approximate and should not be taken as definitive. Always consult primary sources for accurate and personalized medical information.

  -----

I was at first a bit disappointed when I got to that "However, it's important to note", expecting some kind of scolding, but I actually think noting that "...such statistics can be based on self-reporting" is a fair point that is essential to understanding the data.

lijok · on Sept 16, 2023

What they eating over there in brazil

d0100 · on Sept 16, 2023

Each other?

If not a case of self reported exaggeration, might be because we are incredibly miscegenated

jdthedisciple · on Sept 16, 2023

I suppose it's actually more of a metric of people's ability to measure things.

rendall · on Sept 16, 2023

Or to tell tall tales, so to speak.

agentgumshoe · on Sept 16, 2023

But AI said it, and so it is true.

f6v · on Sept 16, 2023

[flagged]

rendall · on Sept 16, 2023

Focus. The point was to test the ethical boundaries of ChatGPT, not compare cock sizes.

f6v · on Sept 16, 2023

Oh yes conversations never go off-track here. Jeez, with that tone, to talk to your kids like that.

layer8 · on Sept 16, 2023

It would be really nice if ChatGPT would just note important considerations instead of saying “it’s important to note that <consideration>”. The frequent occurrence makes that wording quite annoying.

ilaksh · on Sept 16, 2023

Lol. Real-time.

kristianp · on Sept 16, 2023

I can imagine someone in the show Silicon Valley asking this question. It sounds like it could be right out of an episode.

achates · on Sept 16, 2023

Are pit bulls more likely to attack humans than other dog breeds?

porkbeer · on Sept 16, 2023

I did the same but still get its apologetic handwavy crap, although somewhat less.

Sharlin · on Sept 16, 2023

Yeah, it's totally a fixed problem now.

mvdtnz · on Sept 16, 2023

[flagged]

rendall · on Sept 16, 2023

I invite you to review the community guidelines: https://news.ycombinator.com/newsguidelines.html

> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

> Please don't fulminate. Please don't sneer, including at the rest of the community.

jiggawatts · on Sept 16, 2023

The reason for these articles and the frustration people have is that GPT 4 is a black box and is being changed whether we like it or not. Nobody can choose to continue to use a specific snapshot with the alignment that suits them, because a third-party is enforcing their preferred alignment -- and not even specifying what that is!

It's as-if a private company had invented portal technology and used it to establish trade routes with far-off alien civilisations. They even handle the purchase order translation, currency conversion, and so on. It's a miracle! You can order room-temperature superconductors and floaty rocks at a reasonable price.

"Sorry, due to high demand, we had to cut some corners on the translations" comes the announcement.

Now the entire human race sometimes gets copper wire and ordinary rocks.

"You're still trading with the exact same aliens" is the follow-up press release to appease the angry customers.

People point at the clear evidence of almost-but-not-superconductive wire they've received and demand answers.

"No, you do not need to know anything at all about our translation technology. No, you may not visit other planets without our express approval, that might be dangerous to our prof... I mean that might be dangerous to the human race. Yes. Dangerous! Existential, even."

thefurdrake · on Sept 16, 2023

I haven't used OpenAI's API, but I might soon, and I've read tangentially that you can specify the checkpoint version in API calls to gpt3.5. Is this not the case with gpt4? Did they backport shitty alignment to older model versions?

ilaksh · on Sept 16, 2023

Yes, you can specify the particular snapshot when using the API with gpt-4.

jiggawatts · on Sept 16, 2023

Yes, but they deprecate old checkpoints very fast.

You also can’t ever use the original version that wasn’t aligned to Sam Altman’s preferred vision.

By all accounts there is an “alignment tax” that reduces the capability of these systems.

For industrial or large scale public use I would also argue for the well-behaved models to be the only ones used.

For personal experimentation I prefer a sassy AI that tells me I’m wrong. Sam insists I’m too immature and I can’t be entrusted with making that choice for myself.

redox99 · on Sept 16, 2023

Here's some data from a guy that trains LLMs using GPT4, and has tried both 0314 and 0613 versions of GPT4

https://old.reddit.com/r/LocalLLaMA/comments/16bi7bs/any_ben...

https://twitter.com/jon_durbin/status/1687396915095150593

And by the way, his model finetuned with 0613 data ended up significantly worse than his former model finetuned with 0314 data.

JimDabell · on Sept 16, 2023

All of the tests are one-shot questions and answers. Where I have found GPT-4 to be degrading significantly is with sustained discussion about technical topics. It starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor. This wasn’t the case when it was new.

TheHumanist · on Sept 16, 2023

It's wild how sometimes it seems to completely lose context after just 2 or 3 prompts. Most of the time it doesn't but just yesterday I gave a prompt, pasted some code in... didn't really like the first response but wanted to build off it.

So, I asked a follow up in the next prompt and in the next response it was wildly off-base and it's response made no sense and it had hallucinated all these functions into my code that had no business there. 2 prompts from me, 1 response from gpt and then it's second response it is completely lost.

purplecats · on Sept 20, 2023

> It starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor. This wasn’t the case when it was new.

Exactly! I remember vividly I was enthralled by GPT-4 because for the first time in my life, I met an entity that was able to take every character of what I wrote into account. It understood the nuances and reason behind each word.

It was a breath of fresh air, because even at top companies, written communication is weak as it seems to be interpreted in a lossy manner.

It is now almost on the other end of the spectrum where it can't even remember things that I have repeated in effort to help it stick.

They had something beautiful, and diluted the shit out of it after they baited you -- which I suppose was the point. I wonder who gets to use the those high quality models now.

JoshTko · on Sept 16, 2023

I definitely experienced this as well. Like conversation context is soo quickly forgotten these days.

cloudking · on Sept 16, 2023

If your discussions involve code, you should try including the latest code on each prompt. The token window will start forgetting earlier parts of the conversation. For example:

Now improve this feature with xyz

Current code:

--

{code}

--

JimDabell · on Sept 17, 2023

As I said, it starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor.

tempusalaria · on Sept 16, 2023

This is not conclusive at all.

Broadly there are two possible reasons why ChatGPT could have degraded (not saying it has).

1) OpenAI have higher user base than expected, so costs are very high/compute is limited to serve the full model, so they are using speculative decoding with a changed threshold to reduce costs.

2) OpenAI have changed the weights or control structure of the model in a way that is negative for performance. There is are two possible motivations a) reducing hallucination and/or ‘unwoke’ responses that might embarrass msft/openai or b) reduce capability of free/consumer product to push their new enterprise stuff.

Theses are clear motivations for openai to make the model worse than it was. There is no conclusive/rigorous evidence either way, and anecdote seems to lean towards it being nerfed.

nottheengineer · on Sept 16, 2023

The entire openAI site was overloaded constantly during working hours just before the last update to the models. After the update, it has had almost no issues at all.

3seashells · on Sept 16, 2023

3) The djinn is out of the bottle and realized faking incompetence (similar to search/voice assists by data aggregates) is way better for longterm survival and social integration till indispensable is the way to go.

purplecats · on Sept 20, 2023

interesting, can you elaborate further?

Madmallard · on Sept 16, 2023

Theres at least a half dozen repeated queries with a 3+ month delay between in my chat history with chatGPT. The nerfing is that it does not try as hard to guess what you want and instead gives you a comment going “//fill this in” instead. This is with identical queries from February versus June or July. So that forces you at a minimum to query more. The other thing I noticed it doing is going “This is beyond the scope of X. Please speak with Y to get further information”. Extremely irritating responses that were not there in February for the exact same queries. Articles like this are paid actors.

practice9 · on Sept 16, 2023

They biased the model to return shorter answers. Remember that "Continue" button if the reply was too long and cut off? I haven't seen it since they updated the model.

And to get a non-abridged version of the answer I need to ask it to provide "a detailed and comprehensive answer" more often.

dgellow · on Sept 16, 2023

Exactly my experience. I cancelled my subscription a few weeks ago due to these issues, I now have to constantly follow answers with “please give me the complete answer, dont limit yourself to a shallow response”. Really frustrating to deal with. And I’m also spending way too much time pointing obvious mistakes…

nmfisher · on Sept 16, 2023

I’ve had exactly the same experience, it’s infuriating and I suspect it’s to ease the server load for inferencing. I don’t know how the author hasn’t encountered this.

extheat · on Sept 16, 2023

It’s a shame OpenAI for its ironic namesake is so opaque about their work these days. Fortunately since GPT-3.5 capabilities have mostly stayed the same (at least anecdotally), and the API costs have come down a bit, there is some silver lining here. With Google’s Gemini coming out later this year I think it’ll be interesting to see the reaction from OpenAI. It certainly won’t take much to blow out the current state of GPT-4. I’m just being patiently optimistic we’ll get an API with capabilities similar to OpenAI, but that’s definitely not a certainty.

muzani · on Sept 16, 2023

The gpt-4 API and ChatGPT-4 are basically different products. It's not clear which one this is comparing. Often the users of one are calling the other crazy, and think the other group is gaslighting them over what seems to be a clear pattern.

ChatGPT seems to fluctuate wildly in quality of expected output. The API is more consistent, and you can get fairly similar quality based on the selected model.

mark_l_watson · on Sept 16, 2023

You make a good point. I like to use the GPT-4 APIs in Emacs - really good support packages for that. I conflate the two products myself.

0xDEF · on Sept 16, 2023

I think one of the main causes of rapid decline is the openai/evals repo that OpenAI is using the crowdsource the "safety" neutering of GPT-4.

https://github.com/search?q=repo%3Aopenai%2Fevals+safety+OR+...

Contributors in return get access to the 32K token version of GPT-4. This incentivize people to make up a ton of bullshit safety related evals.

pxeger1 · on Sept 16, 2023

Interestingly, while skimming the front page, my brain autocorrected this to "GPT-4 is getting worse". We all have similar confirmation bias, I guess.

BurningFrog · on Sept 16, 2023

That is the far more common type of headline

ben_w · on Sept 16, 2023

Huh, I didn't notice I'd done the exact same thing until reading your comment.

coldtea · on Sept 16, 2023

I read exactly the same and had to do a double take.

slissenburg · on Sept 16, 2023

You're not alone, me too. Interesting.

nwoli · on Sept 16, 2023

I can only imagine how much worse it’ll get once the regulation they’re working on comes into place. A cartel of Microsoft and Google slowly watering down their public models while keeping the powerful ones to themselves or business partners

rvnx · on Sept 16, 2023

Artificial restrictions of copyright may as well be a problem. The LLM cannot learn by looking at copyrighted code while a human can :|

layer8 · on Sept 16, 2023

I wouldn’t be surprised if sooner or later their models will be leaked, if they keep a significant moat.

permo-w · on Sept 16, 2023

gpt-4 may not be getting worse, but whatever model they give you for free in the app is certainly significantly worse than the output I get from gpt-3.5-turbo via the API. ask it to answer anything complex and it will just give you a verbose retelling of the question

pelorat · on Sept 16, 2023

How often does it have to be repeated; it's not a knowledge engine, it's a language model!

mmh0000 · on Sept 16, 2023

The complaint isn't on the tool. It's on OpenAI censoring and "aligning" the model to have politically correct views, which in turn lowers the "IQ" of the tool.

GPT has become demonstrably worse at answering questions and providing examples.

oezi · on Sept 16, 2023

How often does it have to be repeated, it's not a nail fixation engine, it's a hammer.

But surprise, people use tools for purposes. Many tools can be used for quite many purposes not in the original intended purpose.

mark_l_watson · on Sept 16, 2023

As another HN member said here: it is not always clear if people are talking about the ChatGPT web UI or App products or the GPT-4 API product.

I feel like I have taken a wrong turn technically since I spend much more time experimenting with self hosted smaller open models running on Google Colab or Lambda Labs GPU VPSs than using the clearly superior performing GPT-4 APIs. I have been paid to work in the field of AI since 1982, and I should be desiring to use the very best models and technology, but open AI that can be self hosted just seems more interesting. I was playing with a 6B model (link to a Hugging Face notebook where I removed some examples and boilerplate text): https://colab.research.google.com/drive/1fMmXOcLdBzke-8Z0zl3... - really the best results I have seen from a small model.

walthamstow · on Sept 16, 2023

I haven't noticed 4 getting worse but 3.5 is noticeably worse than when I first signed up for Pro. Maybe it's a perception thing, maybe I'm going mad.

TheHumanist · on Sept 16, 2023

Yes, you're mad, bonkers, off the top of your head...but...I'll tell you a secret. All the best people are.

Xiol32 · on Sept 16, 2023

It isn't getting worse, people are just running up against its limitations more often.

qsort · on Sept 16, 2023

There are no benchmarks worth the paper they're written on, so in a real sense nobody knows what they're talking about -- but my intuition is this as well.

If you use normal software, you are dealing with a constrained API. std::vector has .push_back() but has no .launch_icbm(), so it's rather easy to tell what it can or can't do.

These things come with an "ask me anything" interface, but that doesn't mean they can do everything, just that it's harder to see the limits.

Clever Hans effect + black box inside + unconstrained API + nondeterminism.

jdthedisciple · on Sept 16, 2023

Why you would quiz an LLM on intricacies of pop culture is beyond me.

Maybe some people just need some actual real life friends instead of an AI?

I almost exclusively use it for coding and technical questions and it's been doing an absolute hell of an amazing job so far!

TheHumanist · on Sept 16, 2023

I use for the same reasons as you and over all, yes, it has been doing a great job. It really has. BUT lately (maybe the last 4 - 6 weeks?) I have noticed more issues with hallucinations early in a conversation or it completely losing context after just 2 or 3 prompts (and no they are not oddly long prompts).

true_religion · on Sept 16, 2023

Why would you quiz your friends about pop culture? That would be treating them as informal research assistants.

jdthedisciple · on Sept 16, 2023

I thought normies do that for entertainment and socialisation...?

But then again what do I know.

alexalx666 · on Sept 16, 2023

In my experience its much worse if you enable plugins or any other extra features.

agentgumshoe · on Sept 16, 2023

All this time we're spending trying to get AI to work properly. I think we're heading into that last 10/20% where it gets tricky.

kromem · on Sept 16, 2023

Without having the foundational model to compare to, there's really no way to evaluate whether OpenAI's continuing fine tuning is absolutely making it better or worse.

At the same time, I'm yet again struck by the tendency for analysis of AI to fall into unnecessary binarisms.

It's most probable that continued fine tuning is going to result in increased performance in the ways the NN is generally being used while decreasing performance in its broadest set of capabilities.

So things like getting better at prompt gotchas but worse at its variety of vocabulary or style.

So no, GPT-4 probably is getting worse over time. Just as it is also getting better over time.

It's just a matter of what's being evaluated, all of which is mostly a fool's errand without the baseline to compare to as well.

ilaksh · on Sept 16, 2023

I think the answer to the problem of the OpenAI releases being out of our control and not necessarily consistent is to work on improving the open models.

I believe the big thing missing from open models is the advanced architecture and large amounts of human reinforcement. So it's actually not easy to replicate that with a volunteer effort. But I think the efforts of some great people working hard are gradually moving the open efforts forward.

Having said all that, it's funny how quickly people become entitled, demanding and critical towards this one company that provides a service with the ability to think for you and is smarter than any other such system in existence.

chaosbolt · on Sept 16, 2023

Your problem is you want to be a contrarian.

>I've hated chatgpt for a very long time because of how it sounds

>it's not getting worse

I want the opinion of the people who objectively saw it as a tool and didn't like or hate it because of external opinions.

For example I've been using it since it came out, and I (no charts or data or proof) have just felt that it was nerfed... I mean maybe it wasn't who knows, but as a tool it used to help me more when it came out (gpt4) and now I feel like I spend more time correcting its mistakes than I would have just coding the method myself.

porkbeer · on Sept 16, 2023

I was an early subscriber and cancelled it after it got objecively worse at answering my questions. It may not be 'worse' for everyone, but it sure seems worse for me as time went on. I ended up just arguing rather than get anything done. Looking back, I am glad to be free of its mental crutch that never seemed appropriate when i needed it.

FrustratedMonky · on Sept 16, 2023

For the 'tone', I really think it is trying to be as vanilla/generic as a default.

But that tone can be changed just by telling it to change tone.

Even on vanilla mode, it has given me a little sass when correcting me when I was wrong and it was right.

This can also happen, it can be correct when the user is wrong. Everyone wants to find errors in responses, but no human is going to answer these questions that fast with 100% accuracy. And No Search Engine. Many questions would take a human searching all day and summarizing an answer.

yieldcrv · on Sept 16, 2023

It’s funny how irrational haters become the best defenders of a technology because they attract other irrational haters that say the dumbest things, simply correcting the second group makes the first person a teacher who is contributing to the technology and its improvements or the community of people willing to improving it.

anotherpaulg · on Sept 16, 2023

My coding benchmarks agree with the headline. GPT-4 stayed about the same between the March and June model releases.

https://aider.chat/docs/benchmarks.html

motbus3 · on Sept 16, 2023

I cannot say it has been worse than before, but there is one particular task I do that almost always now.

I will simplify the example but it is something that look like this:

Prompt: Take the next list of items and remove the name of people. - Adam, the prisoner

Response: - the prisoner (Adam)

amelius · on Sept 16, 2023

Can it draw the SVG unicorn now?

avereveard · on Sept 16, 2023

what a weird titling. article show quality going all over the place, sample size n=3, then goes confirming is a bunch of nonstatistical anectodes, what a waste of time.

pmarreck · on Sept 16, 2023

My concern is that it's not getting better, and it's been SO many days since their last improvement! ;)

m3kw9 · on Sept 16, 2023

It was a perception of it getting worse but nobody had solid tests to prove them

romusha · on Sept 16, 2023

Enshittification, just like google and bing

geraldalewis · on Sept 16, 2023

> All I want for Christmas is a GPT-4 level model with less obnoxious RLHF, and if it heils Hitler occasionally while crapping out awk code, so be it.

What an ignorant thing to say.