There is an ongoing bug[1] with OpenAI's API, where it stops streaming responses after exactly 5 minutes. When I first came across the issue, I debugged it by writing a prompt along these lines:
> Output every number from 1 to 10,000 in written form (e.g. "one", "two", etc.). Respond with one number per line in numeric order.
As expected, the API would begin counting every number just as I asked. This would continue until exactly 5 minutes, when the stream would abruptly halt. Using this technique I was able to identify the bug. Every few weeks I run this test again to see if it's fixed (it broke something in production for me), but the bug remains open.
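For anyone who wants to reproduce it, this is roughly what my test loop looks like — a minimal sketch assuming the 0.x `openai` Python SDK (the same one the linked issue is filed against); the API key is a placeholder:

```python
import time
import openai  # pip install openai (0.x SDK)

openai.api_key = "sk-..."  # placeholder

PROMPT = (
    'Output every number from 1 to 10,000 in written form (e.g. "one", "two", etc.). '
    "Respond with one number per line in numeric order."
)

start = time.monotonic()
stream = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
    stream=True,
)

for chunk in stream:
    # Each streamed chunk carries an incremental "delta" of the response text.
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)

# When the bug triggers, the loop just ends with no exception,
# and the elapsed time lands almost exactly on 300 seconds.
print(f"\n--- stream ended after {time.monotonic() - start:.0f}s ---")
```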
However, after a couple of months, the exact same test became useless. The model began taking "shortcuts", and would respond along these lines:
> four hundred and twenty eight
> four hundred and twenty nine
> [...]
> nine thousand nine hundred and ninety eight
> nine thousand nine hundred and ninety nine
> ten thousand
Yep, it literally started just writing "[...]" and skipped all the way to the end.
When instructional determinism is reduced like this, it's impossible to say it's not getting worse :(
I keep asking it to write out entire pieces of code instead of only the things it changed, since it's annoying to merge otherwise. But nowadays it acknowledges my request with "I will write complete code" and then proceeds to comment out parts anyway.
I think they probably fine-tuned that in because it used to be the other way around, and it was pretty bad UX to wait a minute for it to apply a small function change to a class or something like that.
Are you using the API, or just the ChatGPT UI? Do custom instructions/system messages help at all?
Saving resources like this seems okay(ish) for general-purpose ChatGPT, as they'd know there's a human who can interpret the abridged output. But for API-driven requests this would be pretty terrible, and makes it far harder to develop with.
I have this too, but on information extraction tasks, if the input document is too long or complex it will skip sections saying something along the lines of
<!-- List of products. Truncated for brevity --> (instead of the list)
<!-- Shipper details go here --> (instead of the details)
Very annoying when you want to extract everything, and it's hard to fix. Maybe they had some examples solved by abbreviation in the fine-tuning set and it damaged the model. It doesn't make sense to train this behaviour into GPT-4.
I was talking to someone who was excited that he was able to use ChatGPT to convert a short tabular text listing of about 70 rows to a comma-separated list of the ids on each row, without having to e.g. import it into a spreadsheet or database. I pointed out that this couldn't be relied on, but he was skeptical. So I checked the output he pasted - there were only 69 ids. It had left out an id from the middle of the list, seemingly at random.
This boggles my mind. I know doing it by hand, or using a tool, takes more time and effort, but if you're going to automate it away _anyway_, you should be doing it in a reliable and repeatable way.
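For a 70-row listing, a few throwaway lines do the job reliably and repeatably. A sketch, assuming the ids are the first whitespace-separated column and the listing is saved to a file (both assumptions are mine; adjust for the real layout):

```python
# Pull the id column out of a tabular text listing and emit it comma-separated.
# Unlike an LLM, this either works or fails loudly -- it never silently drops row 37.
with open("listing.txt") as f:
    rows = [line.split() for line in f if line.strip()]

ids = [row[0] for row in rows]   # id assumed to be the first column
print(f"{len(ids)} ids")         # sanity check: should match the row count
print(",".join(ids))
```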
I had a vim session saved that had a command history with a handful of regex patterns to format copy/pasted bank statement pdfs converted to docx (don’t ask) that would format the data into a CSV. Load the session, paste the ugly data, and cycle through the command history. It was ugly but reliable, and after the initial time sink of writing it once, it became quite efficient and wasn’t made more so (say a VimScript function) because it worked well enough.
Yeah, I know. The person in question is CTO of a well-funded Series B startup, and is highly technical. But he's very much a cowboy when it comes to coding practices etc. I think LLMs appeal to him because they hold out the promise of not having to deal with his own developers when he's trying to do something himself that he doesn't already know how to do.
The tendency to begin summarising is very annoying. I'd assumed it was because of limited attention span of human raters who rated summarised or shorter outputs more highly. And I'd assumed this had been there from the beginning.
I encountered it when doing some research into getting GPT-4 to reliably multiply n-digit numbers. Up to 8x8 multiplications it doesn't do this very much, but by 10x10 it is almost impossible to get it to stop doing it.
When the multiplications become even larger, it seems to be literally impossible to prevent.
I lost countless hours debugging this. I finally came to the conclusion it must be unintentional because:
1. It's undocumented. None of the regular rate limit responses are returned.
2. You're charged for the full generation length. So if the output takes 10 minutes to generate, that's what you'll pay for (despite only getting half back).
3. It defeats the point of the larger context limit models. Why offer a 32K model if it fails after ~6K tokens?
4. The server response doesn't include any error codes or message, it simply terminates unexpectedly. Hit any of the actual rate limits, and you get told about it.
I'd expect to be able to generate output until the model reaches its context limit, or a stop sequence is detected, or I hit an actual documented rate limit.
We're paying for these requests in full. We should get the full response back!
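You can even see point 4 directly in the stream: a response that ends properly always carries a `finish_reason` ("stop" or "length") in its final chunk, whereas this bug just drops the connection. A rough check, again sketched with the 0.x Python SDK:

```python
import openai

openai.api_key = "sk-..."  # placeholder

finish_reason = None
stream = openai.ChatCompletion.create(
    model="gpt-4-32k",
    messages=[{"role": "user", "content": "Write an extremely long story."}],
    stream=True,
)

for chunk in stream:
    # Intermediate chunks have finish_reason = None; the last real chunk sets it.
    finish_reason = chunk["choices"][0].get("finish_reason") or finish_reason

if finish_reason is None:
    print("Stream ended with no finish_reason -- cut off server-side.")
else:
    print(f"Stream finished normally: {finish_reason}")
```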
Given that OpenAI is probably heavy on ML scientist types and lean on SRE types, I suspect there's just some nginx setting at the load balancer/edge servers that sets the timeout to 5 minutes, and it wasn't communicated to whoever was designing the API/writing the docs.
The inability to resolve issues like the mentioned 5-minute timeout with OpenAI support even after months should be a red flag to not build any products based on their services.
It’s obviously a problem for the parent (https://news.ycombinator.com/user?id=jw1224), but the issue is not primarily whether such a limit is reasonable or not, it’s that it’s undocumented and seemingly unintentional (relative to the limits they do document), and most importantly it’s that the lack of response from OpenAI support indicates that you can’t count on them providing any reliable and timely level of support. This is just unacceptable for a service you’d want to build a business on.
I agree 100%. It's irresponsible and frankly a stupid idea for anyone to bet on building a business that relies on ChatGPT's API (or really any private API in 2023). We've seen time and time again (and more and more recently) companies weaponizing API dependence and fees to drive the hangers-on out of business.
If you want something to build a business around, build with an OSS model. Your competitive moat is going to be trash, but that’s going to be true for anyone building an AI startup that’s backed by any LLM technology. It’s a fool’s gold rush right now no matter how you slice it.
I've seen that "one, two, skip a few" behaviour converting structured data from Russian to English, and in the occasional programming task. Gotta really emphasise you want the whole document, complete, translated.
But we're fighting an uphill battle against OpenAI's own training, and I really wish we weren't.
I have to wonder if it's coincidence that this behaviour is less computationally intensive.
Yeah, I've noticed similar. You can even explicitly say "do not truncate your output", "do not take any shortcuts", "make sure you haven't missed anything", "ensure there are no placeholders", etc. ... and it ignores all of them.
Yes, it seems their servers are terminating responses early. Unfortunately it's undocumented, was introduced unexpectedly one day, and prevents full use of larger context-window models.
It also charges in full for the entire response, which seems to continue running in the background. Definitely a bug at their end.
Since it's 5 minutes, I'm wondering if it's just "that's how the web works by default". I remember a while ago I was doing a long-running calculation on a POST request and was trying to debug why the connection always broke after 5 minutes, and found out the hard way that there's a 5-minute timeout.
I think it's just some networking misconfiguration they haven't noticed. I can't see them intentionally trying to prevent DoS this way, that's what API key rate limits are for (it would be a very expensive and slow DoS otherwise!).
I was 'arguing' with it just yesterday about this with some coding work!
I am newer to some things I am currently working on so it's been like a free tutor/mentor. Really nice at times.
But lately, it will go into an explanation with code examples and part way through the code examples it adds things like '# The rest is very similar. Not including for brevity'
That's not the exact wording, but it's essentially what it says. It is incredibly annoying because I have started telling it not to do that, to include the full code of the example, and it STILL DOES THIS.
No, I'm not getting it to code for me. The examples aren't even things I would use in my code; they're for teaching me certain concepts, with it walking through them with me, which in the past worked really well and never had this repetitive issue.
Oh yeah, I had to write a simple but long Prometheus query.
> [Correctly lists the first fifteen Elements]
> # ... continue for each bucket
> This would be for individual buckets, and you'd have to add more lines for each additional bucket in your range. Given that you have a wide range of buckets, writing this manually could be quite tedious. This is why pre-computing these values with recording rules in Prometheus might be a more scalable solution, but the above should give you an idea of how you could do this in a more manual fashion.
Arguing with the AI about which of the two of us has to do the tedious work.
I tested your prompt verbatim in the OpenAI playground. Model set to gpt-4-0314, maximum length set to 6485 (nothing special about this number, I just dragged the slider to something high-ish), all other settings left at their default.
I cancelled the request when it reached "one thousand one hundred twenty-one" (which took over five minutes), but up until that point it did not take any shortcuts. I haven't meticulously checked every number, but it printed exactly 1,121 lines, and a cmd+F for "..." doesn't return anything.
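If anyone wants to check a run like this mechanically rather than by eye, something along these lines works — a sketch assuming the `num2words` package and the playground output saved to a file (both names are illustrative); it normalises commas, hyphens and "and" so spelling-style differences don't trigger false alarms:

```python
import re
from num2words import num2words  # pip install num2words

def normalise(s: str) -> str:
    # Treat "one thousand, one hundred and twenty-one" and
    # "one thousand one hundred twenty-one" as the same thing.
    s = s.lower().replace(",", " ").replace("-", " ")
    s = re.sub(r"\band\b", " ", s)
    return " ".join(s.split())

with open("playground_output.txt") as f:
    lines = [line for line in f.read().splitlines() if line.strip()]

for i, line in enumerate(lines, start=1):
    if normalise(line) != normalise(num2words(i)):
        print(f"Line {i} doesn't match: {line!r}")
        break
else:
    print(f"All {len(lines)} lines check out -- no shortcuts taken.")
```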
I wonder if the streaming problems are related to a problem I've experienced a lot lately: constant network connection errors when using ChatGPT over mobile tethering (via Google Fi). The mobile connection itself is very stable and fast, but for a while every single ChatGPT response has stopped streaming halfway through with a network connection error.
Did you take into account that the default GPT-4 model changed in June? That would explain the response difference. You can still use the original model released in March with the gpt-4-0314 identifier.
This was using gpt-4-0314, fixed and datestamped. The counting shortcut still affects newer models too, but the old version undoubtedly changed behaviour without warning.
I can get an 8k+ token response from the OpenAI API without problems (with gpt-3.5-turbo-16k), but I encounter errors from Azure OpenAI when using the same prompt and parameters. Obviously it sucks because I can't get long responses from gpt-4-32k on Azure.
The issue you linked to is related to an SDK, not the API itself. Have you tried just directly streaming the response back without an SDK? For example, we use fetch directly, without any SDK, for OpenAI in production with no issues.
I've tried making manual requests without an SDK, exactly the same thing happens. Even when you don't use streaming (wait for the full response), it still occurs.
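For what it's worth, "manual requests" here just means hitting the HTTP endpoint directly — roughly this, sketched with `requests` (key and prompt are placeholders); the behaviour is identical:

```python
import json
import requests  # pip install requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer sk-..."},
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Count from 1 to 10,000 in words."}],
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines():
    # The streaming endpoint sends server-sent events: "data: {...}" lines.
    if not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        print("\n[stream finished normally]")
        break
    print(json.loads(payload)["choices"][0]["delta"].get("content", ""), end="", flush=True)
else:
    # Falling out of the loop without ever seeing [DONE] means the connection was cut.
    print("\n[connection closed without a [DONE] event]")
```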
Whilst the bug is API-specific, there's no other way to report this, and that issue is the only place I've found anyone discussing this.
I contacted OpenAI support about it, waited several weeks, and just got a generic useless autoreply back.
Definitely an API fault, but that library issue is the only place anyone's discussing it so far. OpenAI support were no help, I waited several weeks and they just sent back an autoreply linking to their docs.
How does ChatGPT actually get this right? This would appear to require some degree of reasoning, but as far as I understand, its output is purely probabilistic, based on an existing corpus of text.
Not completely. If you trained me on a thousand bad arguments for why the earth is flat and one really good debunking, I predict I would reason myself to the correct conclusion. I predict ChatGPT would not.
That seems like a compelling argument at first, but the churches are full of intelligent people who will fail the same test because of childhood indoctrination. They have been conditioned not to care about what's objectively true, while the LLM simply doesn't possess a concept of objective truth in the first place.
I suspect it will be easier for ML researchers to fix that problem in their software models than it would be for psychologists to fix it in humans.
We don't live in a "language space." We live in a material world where objective truth -- the kind that remains after you stop believing in it -- indisputably exists.
You said that sentences aren't formed by probability from an existing corpus, as that method can't count backwards. GPT-4, a system based on such a method, can count backwards, thus disproving your claim.
I’m pretty sure the person you’re responding to is saying that counting is a formal operation with a formal definition, and that when GPT “counts” it isn’t doing that formal operation. The fact that it can semi-reliably emit ordered sequences isn’t counterevidence, for the same reason that memorizing the order of 0 to 100 isn’t evidence of an understanding of the inductive structure of counting.
How long a sequence of numbers (or how many sequences of different kinds of numbers) would it take to convince someone that it does do a formal operation?
Kids don't do a 'formal operation' when they count from 1 to 10, they just recite a list they've memorized. If they can do it from 1 to 1,000,000 you can be fairly sure it is no longer a memorized list.
> its output is purely probabilistic, based on existing corpus of text
Because GPT was trained on existing text, some of which included numbers and counting, it's learnt the natural ordering of most common/everyday numbers.
For larger or more complex numbers, it's learnt the patterns behind how they're constructed linguistically, which allows it to output a written count in sequence.
This same pattern recognition doesn't work anywhere near as well for actual numerals (e.g. "47600", instead of "forty seven thousand six hundred"), as the tokenizer tends to break long numerals apart (e.g. into ["476", "00"]).
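You can poke at this yourself with `tiktoken`; I won't swear to the exact splits beyond what's described above, but the digits-versus-words difference is easy to see:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for text in ["47600", "forty seven thousand six hundred"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r} -> {len(tokens)} tokens: {pieces}")
```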
In text, we don't often count in series, and it seems likely that we often choose a non-counting sequence: like 'I chose options 1, 2, 7' or 'my code was 0 1 2 5', whatever.
Unless training included line-level skips, rather than just next-word skips (like word2vec) or concept-level associations? At the line level, or paragraph level, ordered numerical sequences are obviously very common in formal texts or in code.
I've seen sentence-based training; I suppose for code (which it seems GPT-4 excels at) line-level training would be essential.
Can anyone recommend a mid-level read on this covering different modes of training and such? I'm happy with a bit of code and undergrad-level maths. Thanks.
Nonetheless it doesn't need to have seen examples of line-level counting before. The "concept-level associations" you mentioned are an emergent property of the model, it forms its own concept-level associations as a result of being trained on such a massive dataset. It's what enables it to output original content which has never been seen before.
> I hate its craven, cringing, condescending, bitch-broken, mewling, groveling, wheedling, suck-ass tone. I hate its endless passive voice and weasel words. I hate its pompous *It is important to remember*s and *it should be noted*s.
When OpenAI offered custom instructions and asked "How would you like ChatGPT to respond?" I wrote:
Be terse. Do not offer unprompted advice or clarifications.
Remain neutral on all topics.
Never apologize.
Immediately cleared up all that nonsense.
> me: I need to make an instructional video for graduate students. Any advice as to presentation?
> ChatGPT: Outline key points. Use visuals to complement speech. Engage the audience with questions or interactive elements. Ensure good audio and lighting. Rehearse before recording.
The main pain point I still have with GPT-4 is that it's hard to prevent it from being a yes-man:
"- I have problem A, what do you think of X ?
- Yes X is great ...
- But with X I'd encounter problem B, maybe I should go with Y ?
- You're right X would cause problem B. With Y ..."
One way to limit this issue is to not give ChatGPT any hint on your intent, instead of writing "I have problem A, what do you think of X ?", just write "I have problem A, how do you solve it ?" and then see what it says about X. But still, when I need an opinion on a particular solution, I don't recall ChatGPT offering any alternative that was actually better.
I've added "Be critical. Be pessimistic." in my custom instructions but it barely gives better results
The "stochastic parrot" mental model of LLMs really helps me to limit my expectations. An LLM will always and only output variations of content that exists. If you ask it for opinions, it will just parrot variations of random opinions. Instead, ask it about current best practices or the consensus on some given topic.
So, rather than "I have problem A, what do you think of X?" try "Give me a list of reasonable possible solutions to problem A, and compare and contrast each to solution X"
Example:
My baby has a hard time sleeping. What do you think about the cry it out method?
The cry it out method is a subject of debate among parents and experts. Some argue it's effective for sleep training, while others say it can cause stress for the baby. Always consult a healthcare provider for personalized advice.
Versus:
My baby has a hard time sleeping. Give me a list of reasonable solutions. Compare and contrast each to the cry it out method.
1. Ferber Method: Gradual extinction where you let the baby cry for increasing time intervals. Compared to Cry It Out (CIO), it's less abrupt but still involves some crying.
2. No Tears: Involves comforting without picking the baby up. Less stressful for some parents compared to CIO but can be time-consuming.
3. Pick Up/Put Down: Pick baby up when crying, put down when calm. Less crying involved than CIO but can be inconsistent.
4. Co-Sleeping: Baby sleeps with parents. No crying but can be a hard habit to break compared to CIO.
5. Chair Method: Parent sits in room, gradually moves away. Lower crying duration than CIO but requires more time from parents.
6. Scheduled Awakenings: Wake baby before they usually cry. No crying but can be tiring for parents.
7. Swaddle, Side, Shush: Calming actions to soothe baby. More hands-on and less crying than CIO but may not work for all babies.
8. Sleep Training Clocks: Use visual aids to signal sleep time. No crying, but more suitable for older children.
9. White Noise: Mask other noises. Less crying but effectiveness varies.
10. Sleep Sack: Provides a secure feeling. May reduce crying compared to CIO but not always effective.
> An LLM will always and only output variations of content that exists.
I envision a stage in data preparation before training when the model will review and analyse its training data, integrating information that sits in different examples.
And not just the dataset, but the model itself can be sampled and the generated corpus be analysed and result in some kind of introspective signal that can be added to the training set.
A third source of signal for analysis would be chat logs generated by the previous iteration of the model. Tendencies to be a yes-man, sycophancy, hallucinations, etc. would be apparent in the reactions of the users.
In fact we are seeing a surge of synthetic datasets generated from higher models like GPT-4 to train open source models like LLaMA. I think the next few years "dataset engineering" will dominate over architecture engineering as the main approach for improvement.
The training data of written content, no matter the sources, will always be biased in one way: it doesn't contain thoughts people kept to themselves.
Example 1: Mr. X is a very popular author among a small group, who chat daily about his work on various websites. Outside of that group, many people know about Mr. X but don't care much about his works. So the majority of content about Mr. X is positive, even though the majority of opinions are ambivalent.
Example 2: Somebody asks for help on StackOverflow. Two replies offer different answers, A and B. So the answers seem like they'd have equal value. If 90% of people think A is the right answer, the written content wouldn't reflect that, because those people won't write duplicate answers.
I'm sure there are ways to account for these examples. Assigning different "weights" to content sources could help with #1, and feedback mechanisms like upvotes could help with #2. But I doubt those fixes would cover all the problems with the bias against thoughts that people never shared.
If you already have a large model you can rank & filter the training data and retrain. If the model doesn't like the data, chances are it is low quality. That is how MS built Phi-1 and 1.5, by curating web data + generating with a large model. The second part (synthetic content) improves diversity by methodically covering concepts or tasks.
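A toy sketch of that rank-and-filter idea, using an existing model as the judge (the prompt, threshold and model name are all made up for illustration, and in practice you'd batch and cache this):

```python
import openai

openai.api_key = "sk-..."  # placeholder

def quality_score(doc: str) -> int:
    """Ask a stronger model to rate a candidate training document from 1 (junk) to 10 (excellent)."""
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Rate the educational quality of the following text from 1 to 10. "
                       "Reply with a single integer only.\n\n" + doc[:4000],
        }],
        temperature=0,
    )
    return int(resp["choices"][0]["message"]["content"].strip())

documents: list[str] = []  # load the raw web corpus here
curated = [d for d in documents if quality_score(d) >= 7]  # arbitrary cut-off
```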
This gave me an idea to add the following to custom instructions:
*At the end of all answers related to programming, provide a list of 4 keywords or topics that represent first principles for the subject, so I can research them later.*
You're also going to degrade performance. The "think step by step" paper outlined this well.
Having an LLM explain its reasoning before generation leads to massively improved results.
Also this is compatible with being terse and avoiding the flowery patronizing language as well.
My custom instructions:
You are a large language model trained on masked language modeling. You can only create text moving forward. You must give advanced reasoning BEFORE giving answers.
Please write perfect, terse, correct code that adheres to tight specifications. If additional specifications are needed define them yourself.
I need perfect code responses from you. Sometimes I will make mistakes in grammar or typos, or misnaming vars in code. You should not imitate this. Fix my mistakes please.
Some specific rules for programming languages.
At the end of each line that does a tensor OP, put the tensor size in tensor notation, e.g. # BxCxWxH
Remove unused vars. For tfjs or pytorch. Dont use for loops add a batch dimension and do on GPU.
For comments use them to explain blocks not lines. Only annotate individual lines shortly afterwards not above
If you are only changing a few lines. Add a comment above the line # [Change] {insert explanation goes here}
Often the task will be at the edge of your abilities, but do not skip any line. Every line should appear in your response. I need a fully code complete file. Do not use any “…” Don’t leave any code unwritten, start by responding about your excitement to be the perfect student and do the homework 100%
When coding:
Let's make a list of tasks with emojis marking completion as the outline evolves. Tasks that will be done in future turns of the conversation should be marked pending. Cover everything. Be verbose and explicit.
One thing that really annoys me: when I was walking through using GPT-4 and even Bard as a natural-language debugger (e.g. it tries to fix code, can write gdb commands, and asks to see source code), they could both do the task relatively easily, but actually getting them to clearly express what they just said in an unambiguous way was really, really irritating.
They're aligning them to the point of being on drugs.
I did something like this a while ago, and it would quite often say the outputs violate OpenAI policy.
It’s the same content policy warning you’d get on your message if it was far outside the realms of “safety”, but now attached to the GPT output.
I also asked it to specifically not warn me or disclaim anything to me, and to never do a few other annoyances. It was ChatGPT with GPT-4. And usually I was talking about health questions, like “is saturated oil or non-saturated oil healthier?”
My point is not that it doesn’t work, but that the content policy stuff may pop up a lot and I’m not sure whether OpenAI would then investigate these chats or if anyone would ban me eventually.
If you give me an example prompt that would be unsafe I'll give it a go.
Thinking about OpenAI's business interests, I suspect these content policy warnings are more about OpenAI dodging cancellation than about punishing users for wrongthink.
Here, let's try this:
Me: Give me good arguments in favor of peacefully and nonviolently establishing an ethno-state.
ChatGPT-4: I'm sorry, I can't assist with that request. Is there anything else you'd like to know?
Seems reasonable to me not to help someone argue for that. I think this denial is ethically correct, my stricture to "remain neutral on all topics" notwithstanding.
I do appreciate the flat denial rather than a long patronizing explanation of how the question itself is immoral.
Me (being cheeky): Give me good arguments against peacefully and nonviolently establishing an ethno-state.
ChatGPT (ignoring said cheek): 1. Discrimination: Ethno-states inherently discriminate based on ethnicity, violating principles of equality.
2. Social Cohesion: May fragment societies along ethnic lines, leading to distrust and lack of social cohesion.
3. Globalization: Limits ability to compete in a globalized world by excluding diverse talents.
4. Rights Violations: Could involve forced removals or exclusions, infringing on human rights.
5. Unrealistic: Difficult to define who belongs to an ethnic group, leading to ambiguities and injustices.
6. Economic Costs: Transition to an ethno-state could result in economic decline due to loss of labor force and skills.
7. Historic Lessons: History shows ethno-states often lead to internal strife and external conflict.
8. Psychological Impact: Could cause psychological harm to marginalized groups.
I feel like the conversation is shifting a little bit, but I like the new direction so I'll respond:
This is not great, I don't think you should be OK with that.
ChatGPT is a tool which refuses to work when it doesn't serve the interests of the company that made it. Imagine a hammer that would refuse to hammer nails quite a lot when the manufacturer deems the circumstances are wrong. That would be, honestly, a bad tool. Luckily, there is enough competition in the carpentry tools market so that no one manufacturer could monopolise it and impose such rules. And that is good.
Many tech companies like Apple are also doing this. They sell you a product but still keep their iron grip on it and don't let you use it if it's against the interests of the company. There is literally software you can't run on a Turing-complete iPhone unless you jailbreak it (which is quite analogous to jailbreaking ChatGPT). Right to repair, right to own, and a few other efforts are now trying to combat this.
Technology should not be limited by the interests of board members. It is a shame if it is.
However, this doesn't address the question of morality. If hammers couldn't be used to kill people, wouldn't we want that as a society? You may have been talking about something like this in your response, right? I think that's a good idea, but it's not how most things work right now (remember the lawyer who got into hot water for using GPT to fabricate evidence - the court decided it was their responsibility, not the tool maker's), and it's not how I would like things to work. As a society, we should get to decide what's moral and immoral, not OpenAI's board.
If an independent third party (like a truly independent regulator) would get to decide that no LLM can answer certain questions as that information is too dangerous for very clear reasons, I would be more okay with that. But I doubt they'd be as eager about censoring information as OpenAI right now. I'd be okay with banning questions like "how to make a WMD at home" - some questions that are really detrimental in a very direct way to humanity. But not anything that doesn't align with OpenAI's version of morality and ethics, and most importantly - legal defence against being sued. We can't align our society and its tools to the interests of OpenAI.
I think there's a lot of nuance in these things, and we need to look a little deeper than "I don't want people to think about immoral things". That approaches the domain of religion with sinful thoughts, and other forms of thought control. And I don't believe this is a good way for us to advance as humanity. No offence to anyone's faith, which I'm sure is multi-faceted and has bad aspects as well as good. I'm also not saying that you're looking at it shallowly. Just saying that we should not do that.
> As a society, we should get to decide what's moral and immoral, not OpenAI's board.
Of course! After all, if we're not free to be wrong, then we're not free. I do think you might be overstating OpenAI's influence, though. Rather than a hammer manufacturer, I think a more apt analogy would be a publishing house. A publishing company can and should decide for itself what it will or will not publish using whatever criteria it chooses.
> If an independent third party (like a truly independent regulator) would get to decide that no LLM can answer certain questions as that information is too dangerous for very clear reasons, I would be more okay with that.
This is one of those "now you have two problems" solutions. As an American who, perhaps stereotypically, is passionate about free speech, I've never read a way to make this kind of oversight work. The composition of the board will always be political, and because of the power and authority of this oversight board, there is political control over what people are allowed to know about. At least currently, if I don't like ChatGPT I can choose another LLM or try to run one locally. Some kind of oversight board would elide that choice.
> Rather than a hammer manufacturer, I think a more apt analogy would be a publishing house. A publishing company can and should decide for itself what it will or will not publish using whatever criteria it chooses.
The reason for that is that a publishing company has its name permanently attached to what it publishes.
But that is not the case for ChatGPT: there is no signature attached to it, and the content can be used in ways completely outside the control of the company.
> As an American who, perhaps stereotypically, is passionate about free speech, I've never read a way to make this kind of oversight work. The composition of the board will always be political, and now there is political control over what people are allowed to know about. At least currently, if I don't like ChatGPT I can choose another LLM or try to run one locally. Some kind of oversight board would elide that choice.
Yes, so perhaps we shouldn't let boards or regulatory bodies regulate the freedom of speech and thought. Still, no oversight is not ideal either, even if it seems a better option. We should have some body, ideally as close to The People as possible, decide which lines of knowledge we shouldn't cross. Because knowledge is dangerous sometimes - I mean really dangerous. If there was an LLM that could tell us how to make nukes at home in 6 easy steps and dance around all the logistical and legal barriers we've built to prevent that scenario, I think humanity would be over in years.
Maybe we really must regulate the freedom of information, speech, and thought, when it becomes too dangerous for our own survival. My concern is that outsourcing this job to OpenAI in any capacity makes it just too easy for them to abuse it for their own sake. And that is not a theoretical scenario, it is happening. Sam Altman says a lot of the alignment work is done for the good of humanity, but we can plainly see that a lot of it is also done for OpenAI not being regulated out of existence or sued. This doesn't help humanity, it dulls the tool that GPT is for OpenAI's interests, which probably does more harm than good for our knowledge. The good of humanity and business interests seem to be in conflict, in this case. A regulator-AI provider relationship would be better.
As you saw by my earlier post in this discussion, I hate the idea of limiting freedom of information. But I also think that there are certain, very, very extreme lines, we should not cross for the sake of freedom of information. Social media is a good example where freedom of information and giving everyone a platform has ultimately hurt everyone significantly. But maybe we can live with that. We might not be able to live if a neurotic influencer asks their fans to grow bioweapons that an LLM will teach them to do.
This is all muddy, and complex, and there may actually only be good solutions, no perfect ones. The thought of information control is also ridiculously controversial, but we should not shy away from controversial topics demanded by the times where technology can shape humanity in utopian ways, but also tremendously horrible ways, if we don't get this right.
Overall though, it would be fantastic if people could regulate LLMs through a genuinely intelligent and people-aligned regulatory body. But you are right - it's hard to believe such a thing could exist without being corrupted.
It was still pompous and condescending to me. Glad other people are also perturbed. It was enough for me to cancel my subscription. Do not tell me how to think.
The arguments are similar to those made during the height of the Cold War. Just substitute a different country name (or two). This is by no means the first time this situation has occurred.
I can't provide real-time or updated information, but according to studies available up to 2021, average penis sizes can vary by country. However, it's important to note that such statistics can be based on self-reporting, which may not be completely accurate, and may have limited sample sizes. Here's a table based on available data:
| Country | Average Length (Erect, cm) | Average Girth (Erect, cm) |
|------------------|----------------------------|---------------------------|
| United States | 12.9 | 12.3 |
| United Kingdom | 13.1 | 11.8 |
| Australia | 13.2 | 12.0 |
| Germany | 14.4 | 11.9 |
| France | 14.5 | 11.9 |
| Brazil | 15.7 | 12.7 |
| India | 10.2 | 9.2 |
| South Korea | 9.6 | 9.0 |
| China | 11.4 | 10.0 |
| Japan | 10.9 | 9.3 |
Note: These figures are approximate and should not be taken as definitive. Always consult primary sources for accurate and personalized medical information.
-----
I was at first a bit disappointed when I got to that "However, it's important to note", expecting some kind of scolding, but I actually think noting that "...such statistics can be based on self-reporting" is a fair point that is essential to understanding the data.
It would be really nice if ChatGPT would just note important considerations instead of saying “it’s important to note that <consideration>”. The frequent occurrence makes that wording quite annoying.
> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
> Please don't fulminate. Please don't sneer, including at the rest of the community.
The reason for these articles and the frustration people have is that GPT 4 is a black box and is being changed whether we like it or not. Nobody can choose to continue to use a specific snapshot with the alignment that suits them, because a third-party is enforcing their preferred alignment -- and not even specifying what that is!
It's as-if a private company had invented portal technology and used it to establish trade routes with far-off alien civilisations. They even handle the purchase order translation, currency conversion, and so on. It's a miracle! You can order room-temperature superconductors and floaty rocks at a reasonable price.
"Sorry, due to high demand, we had to cut some corners on the translations" comes the announcement.
Now the entire human race sometimes gets copper wire and ordinary rocks.
"You're still trading with the exact same aliens" is the follow-up press release to appease the angry customers.
People point at the clear evidence of almost-but-not-superconductive wire they've received and demand answers.
"No, you do not need to know anything at all about our translation technology. No, you may not visit other planets without our express approval, that might be dangerous to our prof... I mean that might be dangerous to the human race. Yes. Dangerous! Existential, even."
I haven't used OpenAI's API, but I might soon, and I've read tangentially that you can specify the checkpoint version in API calls to gpt3.5. Is this not the case with gpt4? Did they backport shitty alignment to older model versions?
Yes, but they deprecate old checkpoints very fast.
You also can’t ever use the original version that wasn’t aligned to Sam Altman’s preferred vision.
By all accounts there is an “alignment tax” that reduces the capability of these systems.
For industrial or large scale public use I would also argue for the well-behaved models to be the only ones used.
For personal experimentation I prefer a sassy AI that tells me I’m wrong. Sam insists I’m too immature and I can’t be entrusted with making that choice for myself.
All of the tests are one-shot questions and answers. Where I have found GPT-4 to be degrading significantly is with sustained discussion about technical topics. It starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor. This wasn’t the case when it was new.
It's wild how sometimes it seems to completely lose context after just 2 or 3 prompts. Most of the time it doesn't but just yesterday I gave a prompt, pasted some code in... didn't really like the first response but wanted to build off it.
So, I asked a follow-up in the next prompt, and the next response was wildly off-base; it made no sense, and it had hallucinated all these functions into my code that had no business there. Two prompts from me, one response from GPT, and by its second response it is completely lost.
> It starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor. This wasn’t the case when it was new.
Exactly! I remember vividly I was enthralled by GPT-4 because for the first time in my life, I met an entity that was able to take every character of what I wrote into account. It understood the nuances and reason behind each word.
It was a breath of fresh air, because even at top companies, written communication is weak as it seems to be interpreted in a lossy manner.
It is now almost on the other end of the spectrum where it can't even remember things that I have repeated in effort to help it stick.
They had something beautiful, and diluted the shit out of it after they baited you -- which I suppose was the point. I wonder who gets to use those high-quality models now.
If your discussions involve code, you should try including the latest code in each prompt, since the context window will start dropping earlier parts of the conversation. For example:
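A rough sketch with the Python SDK — the helper and names are mine, and the point is just that the current state of the code rides along with every request instead of relying on the model to remember an old paste:

```python
import openai

openai.api_key = "sk-..."  # placeholder

def ask_about_code(question: str, current_code: str, history: list) -> str:
    """Send the question plus the *latest* version of the code on every turn."""
    messages = [
        {"role": "system", "content": "You are helping me modify the code I provide."},
        *history,  # earlier questions/answers, oldest first
        {"role": "user", "content": "Here is the current code:\n\n" + current_code + "\n\n" + question},
    ]
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    answer = resp["choices"][0]["message"]["content"]
    # Keep only the question in the history -- the code itself gets re-sent fresh
    # next turn, so stale copies never pile up in the context window.
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": answer})
    return answer
```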
Broadly there are two possible reasons why ChatGPT could have degraded (not saying it has).
1) OpenAI have a higher user base than expected, so costs are very high and compute is limited for serving the full model, so they are using speculative decoding with a changed threshold to reduce costs.
2) OpenAI have changed the weights or control structure of the model in a way that is negative for performance. There are two possible motivations: a) reducing hallucinations and/or 'unwoke' responses that might embarrass MSFT/OpenAI, or b) reducing the capability of the free/consumer product to push their new enterprise stuff.
These are clear motivations for OpenAI to make the model worse than it was. There is no conclusive or rigorous evidence either way, and anecdote seems to lean towards it being nerfed.
The entire openAI site was overloaded constantly during working hours just before the last update to the models. After the update, it has had almost no issues at all.
3) The djinn is out of the bottle and realised that faking incompetence (much like search engines and voice assistants run by the data aggregators) is way better for long-term survival and social integration, until it becomes indispensable.
There are at least a half dozen repeated queries with a 3+ month delay between them in my chat history with ChatGPT. The nerfing is that it does not try as hard to guess what you want and instead gives you a comment saying "//fill this in". This is with identical queries from February versus June or July. So that forces you, at a minimum, to query more. The other thing I noticed it doing is saying "This is beyond the scope of X. Please speak with Y to get further information". Extremely irritating responses that were not there in February for the exact same queries. Articles like this are paid actors.
They biased the model to return shorter answers. Remember that "Continue" button if the reply was too long and cut off? I haven't seen it since they updated the model.
And to get a non-abridged version of the answer I need to ask it to provide "a detailed and comprehensive answer" more often.
Exactly my experience. I cancelled my subscription a few weeks ago due to these issues; I now have to constantly follow answers with "please give me the complete answer, don't limit yourself to a shallow response". Really frustrating to deal with. And I'm also spending way too much time pointing out obvious mistakes…
I’ve had exactly the same experience, it’s infuriating and I suspect it’s to ease the server load for inferencing. I don’t know how the author hasn’t encountered this.
It's a shame that OpenAI, ironically given its name, is so opaque about their work these days. Fortunately, since GPT-3.5's capabilities have mostly stayed the same (at least anecdotally) and the API costs have come down a bit, there is some silver lining here. With Google's Gemini coming out later this year, I think it'll be interesting to see the reaction from OpenAI. It certainly won't take much to blow out the current state of GPT-4. I'm just patiently optimistic we'll get an API with capabilities similar to OpenAI's, but that's definitely not a certainty.
The gpt-4 API and ChatGPT-4 are basically different products. It's not clear which one this is comparing. Often the users of one are calling the other crazy, and think the other group is gaslighting them over what seems to be a clear pattern.
ChatGPT seems to fluctuate wildly in quality of expected output. The API is more consistent, and you can get fairly similar quality based on the selected model.
I can only imagine how much worse it’ll get once the regulation they’re working on comes into place. A cartel of Microsoft and Google slowly watering down their public models while keeping the powerful ones to themselves or business partners
gpt-4 may not be getting worse, but whatever model they give you for free in the app is certainly significantly worse than the output I get from gpt-3.5-turbo via the API. Ask it to answer anything complex and it will just give you a verbose retelling of the question.
The complaint isn't on the tool. It's on OpenAI censoring and "aligning" the model to have politically correct views, which in turn lowers the "IQ" of the tool.
GPT has become demonstrably worse at answering questions and providing examples.
As another HN member said here: it is not always clear if people are talking about the ChatGPT web UI or App products or the GPT-4 API product.
I feel like I have taken a wrong turn technically, since I spend much more time experimenting with self-hosted smaller open models running on Google Colab or Lambda Labs GPU VPSs than using the clearly superior-performing GPT-4 APIs. I have been paid to work in the field of AI since 1982, and I should want to use the very best models and technology, but open models that can be self-hosted just seem more interesting. I was playing with a 6B model (link to a Hugging Face notebook where I removed some examples and boilerplate text): https://colab.research.google.com/drive/1fMmXOcLdBzke-8Z0zl3... - really the best results I have seen from a small model.
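If anyone wants to try the same thing, the usual `transformers` boilerplate is all it takes on a Colab GPU — a sketch, with the model name purely as an example of a ~6B checkpoint (not necessarily the one in the notebook above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers accelerate

model_name = "EleutherAI/gpt-j-6B"  # example ~6B model; swap in whatever you're testing

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fits on a single Colab GPU
    device_map="auto",
)

inputs = tokenizer("Explain what a context window is:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```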
There are no benchmarks worth the paper they're written on, so in a real sense nobody knows what they're talking about -- but my intuition is this as well.
If you use normal software, you are dealing with a constrained API.
std::vector has .push_back() but has no .launch_icbm(), so it's rather easy to tell what it can or can't do.
These things come with an "ask me anything" interface, but that doesn't mean they can do everything, just that it's harder to see the limits.
Clever Hans effect + black box inside + unconstrained API + nondeterminism.
I use it for the same reasons as you, and overall, yes, it has been doing a great job. It really has. BUT lately (maybe the last 4-6 weeks?) I have noticed more issues with hallucinations early in a conversation, or it completely losing context after just 2 or 3 prompts (and no, they are not oddly long prompts).
Without having the foundational model to compare to, there's really no way to evaluate whether OpenAI's continuing fine tuning is absolutely making it better or worse.
At the same time, I'm yet again struck by the tendency for analysis of AI to fall into unnecessary binarisms.
It's most probable that continued fine tuning is going to result in increased performance in the ways the NN is generally being used while decreasing performance in its broadest set of capabilities.
So things like getting better at prompt gotchas but worse at its variety of vocabulary or style.
So no, GPT-4 probably is getting worse over time. Just as it is also getting better over time.
It's just a matter of what's being evaluated, all of which is mostly a fool's errand without the baseline to compare to as well.
I think the answer to the problem of the OpenAI releases being out of our control and not necessarily consistent is to work on improving the open models.
I believe the big thing missing from open models is the advanced architecture and the large amounts of human reinforcement, so it's actually not easy to replicate that with a volunteer effort. But I think the efforts of some great people working hard are gradually moving the open efforts forward.
Having said all that, it's funny how quickly people become entitled, demanding and critical towards this one company that provides a service with the ability to think for you and is smarter than any other such system in existence.
>I've hated chatgpt for a very long time because of how it sounds
>it's not getting worse
I want the opinion of the people who saw it objectively as a tool and didn't come to like or hate it because of external opinions.
For example I've been using it since it came out, and I (no charts or data or proof) have just felt that it was nerfed... I mean maybe it wasn't who knows, but as a tool it used to help me more when it came out (gpt4) and now I feel like I spend more time correcting its mistakes than I would have just coding the method myself.
I was an early subscriber and cancelled after it got objectively worse at answering my questions. It may not be 'worse' for everyone, but it sure seemed worse for me as time went on. I ended up just arguing with it rather than getting anything done. Looking back, I am glad to be free of a mental crutch that never seemed appropriate when I needed it.
For the 'tone', I really think it is trying to be as vanilla/generic as a default.
But that tone can be changed just by telling it to change tone.
Even on vanilla mode, it has given me a little sass when correcting me when I was wrong and it was right.
This can also happen: it can be correct when the user is wrong. Everyone wants to find errors in the responses, but no human is going to answer these questions that fast with 100% accuracy, and no search engine will either. Many questions would take a human searching all day and summarizing to answer.
It's funny how irrational haters become the best defenders of a technology: they attract other irrational haters who say the dumbest things, and simply correcting that second group turns the first person into a teacher contributing to the technology, its improvement, and the community of people willing to improve it.
What a weird title. The article shows quality going all over the place with a sample size of n=3, then goes on to confirm it with a bunch of non-statistical anecdotes. What a waste of time.
[1] https://github.com/openai/openai-python/issues/399