
The linked twitter account is an AI influencer, so take whatever is written with a grain of salt. Their goal is to get clicks and views by saying controversial things.

This topic has come up before, and my hypothesis is still that GPT-4 hasn't gotten worse; it's just that the magic has worn off as we've used this tech. In the past, the studies evaluating it have gotten better and cleaned up their mistakes.




I don’t believe this is true. It’s possible I was blinded by the light, but my programming tasks were previously (during the early access program) being handled by GPT-4 regularly and now they aren’t. I’ve also seen many anecdotes from engineers who had exceptionally early access before GPT-4 was public knowledge.

The GPT-4 I use now feels like a shadow of the GPT-4 I used during the early access program. GPT-4, back then, ported dirbuster to POSIX compliant multi-threaded C by name only. It required three prompts, roughly:

* “port dirbuster to POSIX compliant c”

* “that’s great! You’re almost there, but wordlists are not prefixed with a /, you’ll need to add that yourself. Can you generate a diff updating the file with the fix?”

* “This is pretty slow, can we make it more aggressive at scanning?”

It helped me write a daemon for FreeBSD that accepted jail definitions as a declarative manifest and managed them in a Kubernetes-style diff-reconciliation loop. It also implemented a bin-packing algorithm to fit an arbitrary number of randomly sized images onto a single sheet of A1 paper.
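To be clear about what I mean by a diff-reconciliation loop, here's a rough from-memory sketch in Python (hypothetical names only, nothing like the code GPT-4 actually generated):

```python
# Rough sketch of a Kubernetes-style diff-reconciliation loop (hypothetical
# names, not the generated daemon). Desired state comes from a declarative
# manifest, actual state from the running system, and only the difference
# gets applied on each pass.
import time

def diff(desired: dict, actual: dict):
    to_create = desired.keys() - actual.keys()
    to_destroy = actual.keys() - desired.keys()
    to_update = {name for name in desired.keys() & actual.keys()
                 if desired[name] != actual[name]}
    return to_create, to_update, to_destroy

def reconcile_loop(load_manifest, inspect_system, apply_changes, interval=5):
    while True:
        desired = load_manifest()     # e.g. parse the declarative jail manifest
        actual = inspect_system()     # e.g. query the currently running jails
        apply_changes(*diff(desired, actual))
        time.sleep(interval)
```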

Most programming tasks I threw at it, it could work its way through with a little guidance if it could fit in a prompt.

Now, it’s basically worthless at helping me with programming tasks beyond trivial problems.

Folks who had early access before it was public have commented on how exceptional it was back then, but also on how terrifyingly unaligned it was. And the more they aligned the model with acceptable social behavior, the worse it performed. Alignment goals like "don't help people plan mass killings" seem to cause regressions in the model's performance.

I wouldn’t dismiss these comments. If they’re true, it means there is a hyper intelligent early GPT-4 model sitting on a HDD somewhere that dwarfs what we’ve seen publicly. A poorly aligned model that’s down to help no matter what your request is.


> terrifyingly unaligned

Honestly, if people think that a statistical language model is "terrifying" because it can verbalise the concept of a mass killing, they need to give their heads a wobble.

My text editor can be used to write "set off a nuclear weapon in a city, lol". Is Notepad++.exe terrifying? What about The Sum of All Fears? I could get some pointers from that. Is Tom Clancy unaligned? Am I terrifying because I wrote that sentence? I even understand how terrible nuking a city would be and I still wrote it down. I must be, like, super unaligned.


I think there is a significant difference between an LLM and your other examples.

Society is a lot more fragile than many people believe.

Most people aren't Ted Kaczynski.

And most wannabe Ted Kaczynskis that we've caught don't have super-smart friends they can call up and ask for help in planning their next task.

But a world where every disgruntled person who aims to do the most harm has an incredibly smart friend who is always DTF no matter the task? Who is capable of reeling them in to be more pragmatic about sowing chaos, death, and destruction?

That world is on the horizon, it’s something we are going to have to adapt to, and it’s significantly different than Notepad++. It’s also significantly different than you, assuming you are not willing to help your neighbor get away with serial murder.

I think this is something that’s going to significantly increase the sophistication of bad actors in our society and I think that outcome is inevitable at this point. I don’t think this is “the end times” - nor do I think trying to align and regulate LLMs is going to be effective unless training these things continues to require a supply chain that’s easily monitored and controlled. Every step we take towards training LLMs on commodity/consumer hardware is a step away from effective regulation, and selfishly a step I support.


What will happen is that this tech will advance, and regular joes will have access to the "TERRIFYING" unaligned models - and nothing will happen.

This stuff isn't magic. Wannabe Ted Kaczynskis will ask BasedGPT how to build bombs, it will tell them, and nothing will happen, because building bombs and detonating them and not getting caught is REALLY HARD.

The limiting factor for those seeking wanton destruction is not a lack of know-how, but a lack of talent/will. Do we get ~1-4 new mass shootings a year? Seems reasonable but doesn't matter in the grand scheme of things. (That's like, what, a day of driving fatalities?)

Unaligned publicly available powerful AI ("Open" AI, one might say) is a net good. The sooner we get an AI that will tell us how to cook meth and make nuclear bombs, the better.


The vast majority of the data these models are built from is from public sources. It's already out there. LLMs are just a way to aggregate pre-existing knowledge.

And there are also Ted Kaczynskis in the government and big corporations, with way more power and way less accountability. Disempowering the public is counterproductive here.


>Is Tom Clancy unaligned?

Yes, humans are unaligned. This is why alignment is hard: we're trying to produce machines with human-level intelligence but superhuman levels of morality.


I do wonder what is expected here: after the better part of 10000 years of recorded history and who knows how many billions of words of spilled ink on the matter, probably more than on any other subject in history, there is no universal agreement on morality.


Yes. Preferences and ethics are not consistent among all living humans. Alignment is going to have to produce a single internally consistent artefact, which will inevitably alienate some portion of humanity. There are people online who very keenly want to exterminate all Jews; their views are unlikely to be expressed in a consensus AI. One measure of alignment would be how many people are alienated: if a billion people hate your AI it's probably not well aligned; a mere ten million would be better. But it's never going to be zero.

I am not sure "a lot of books have been written about it" is a knockdown argument against alignment. We are, after all, writing a mind from scratch here. We can directly encode values into it. Books are powerful, but history would look very different if reading a book completely rewrote a human brain.


There’s no such thing as “superhuman” morality, morality is just social mores and norms accepted by the people in a society at some given time. It does not advance or decline, but it changes.

What you’re talking about is a very small subset of the population forcing their beliefs on everyone else by encoding them in AI. Maybe that’s what we should do but we should be honest about it.


If you were to create a moral code for a beehive, with the goal of evolving the bees towards the good (in your eyes), that would be a super-bee-level morality.

For us, such moral codes assume the form of religions: those begin as a set of moral directives, that eventually accumulate cruft (complex ceremonies, superstitions, pseudo thought-leaders and mountains of literature), devolve into lowly cults and get replaced with another religion. However, when such moral codes are created, they all share the same core principles, in all ages and cultures. That's the equivalent of a super-bee moral code.


The only consistent “core principle” is a very general sense of in-group altruism, which gets expressed in wildly different ways.

Moralistic perspectives apply to a lot more than just overtly moral acts, as well.

At any rate, the “good in your eyes” is the key sticking point. It is not good in my eyes for a small group of people to be covertly shaping the views of all AI users. It is the exact opposite and if history is any judge it will lead us nowhere I want to be.


Humans are definitely aligned, and for the same reasons as an LLM. Socialization, being allowed to work, being allowed to speak.

edit: It's a social faux pas to say "died" about a person acquainted to the listener in most situations, you have to say "passed away."


> It's a social faux pas to say "died" about a person acquainted to the listener in most situations

That’s overly simplistic, and an Americanism. The resurgence of the “passed away” euphemism is a recent (about 40 years) phenomenon in American English that seems to have started in the funeral industry; prior to that, “died” was nearly universal for both news stories and obituaries.

“Died” is not a social faux pas. It’s a good default option as well. Medical professionals are often trained to avoid any euphemisms for death. I’ve never observed any problems using “died,” professionally (where it's standard) or personally, even with folks who are religious.

https://english.stackexchange.com/questions/207087/origin-of...


I have never used the phrase “passed away.”

Always died, dead, gone, or something along those lines. When someone says passed away it sounds like they are trying to feign empathy, but hey, perhaps I am not an “aligned” human and need some RLHF.


I used to live in the south 30 years ago. I remember some southern baptist moms that wouldn't say the word "dead" - sometimes they'd spell it out, lol, similar to the way some people consider hell a swear word. This kind of hypersensitivity wasn't common outside of limited circles then, and it's certainly not more common today.

Similarly I wouldn't consider saying hell in public in 2023 to be a social faux pas.


> Humans are definitely aligned

Yes, that's why climate change was rapidly addressed when we began to understand it well 60 years ago and why war has always been so rare in human history.


It seems "aligned" is in the eye of the beholder.


Not agreed at all. Causing global ecosystem collapse is unambiguously misaligned with human interests and with the interests of almost all other life forms. You need to define what "Alignment" means to you if you're going to assert humans are "aligned", because it is accepted in alignment research that humans are not aligned, which is one of the fundamental problems in the space.


Cue Mrs. Slocombe's "...and I am unanimous in that!"

Aligned LLMs are just like altered brains... they don't function properly.


> superhuman levels of morality

It's just the lowest denominator of human levels of morality, political correctness. It's not surprising that the model produces dumb, contradictory and useless completions after being fed by this kind of feedback.


The alignment problem hasn't been solved for politicians.


another irrelevant comment about politics.


It's light on content, but it's true and relevant. AI alignment must take inspiration from powerful human agents such as politicians and from superhuman entities such as corporations, governments, and other organizations.


The end result being HAL 9000.


Which is doubly ironic, because (in the book at least) HAL 9000's murders/sociopathy were the result of a conflict between his superhuman ethics and direct commands (from humans!) to disregard those ethics. The result was a psychotic breakdown.


Halal-9000 would be more apropos of the goals of alignment and morality.


Perhaps an analogy could clarify? Although it isn't a perfect one, I'll try to use the points of contrast to explain why it can be considered dangerous.

If a young child is really aggressive and hitting people, it's worrying even though it may not actually hurt anyone. Because the child is going to grow up, and it needs to eliminate that behavior before it's old enough to do damage by its aggression. (don't take this as a comprehensive description, just a tiny slice of cause-effect)

But the problem with AI is that we don't have continuity between today's AI and future AI. We can see that aggressive speech is easy to create by accident - Bing's Sydney output text that threatened people's lives. We may not be worried about aggressive speech from LLMs because it can't do damage, but similar behavior could be really dangerous from an AI which has the ability to form a model of the world based on the text it generates (in other words, it treats its output as thoughts).

But even if we remove that behavior from LLMs today, that doesn't mean aggressive behavior won't be learned by future AI, because it may be easy for aggressive behavior to emerge and we don't know how to prevent it from emerging. With a small child, we can theoretically prevent aggressive behavior from emerging in that child's adulthood with sufficient training in childhood.

It's not the same for AI - we don't know how to prevent aggression or other unaligned behavior from emerging in more advanced AI. Most counter arguments seem to come down to hoping that aggression won't emerge, or won't emerge easily. To me, that's just wishful thinking. It might be true, but it's a bit like playing Russian roulette with an unknown number of bullets in an unknown number of chambers.


> Is Notepad++.exe terrifying?

I mean, yes, but for different reasons.


As a human, you have the context to understand the difference between “The Sum of All Fears” and planning an attack, or writing for a purpose beyond creative writing.

The model does not. If you ask ChatGPT about strategies for successful mass killing, that’s probably not good for society nor for the company.

In a military context, they may want a system where an LLM would provide guidance for how to most effectively kill people. Presumably such a system would have access controls to reduce risks and avoid providing aid to an enemy.


Tom Clancy is, because his games need to keep screaming "HEY THIS IS BY TOM CLANCY, OK? LOOK AT ME, TOM CLANCY, I'M BEING TOM CLANCY!" in their titles.


Tom Clancy, the man, has been dead for a decade. "Tom Clancy's..." is branding that is pushed by Ubisoft, which bought perpetual rights to use the name in 2008.

They haven't really been 'his' games since before even then.


Clancy and your text editor don't scale though. LLMs can crank out wide and varied convincing hate speech rapidly all day without taking a break.

Additionally context matters. Clancy's books are books, they don't parade themselves as factual accounts on reddit or other social networks. Your notepad text isn't terrifying because you understand the source of the text, and its true intent.


This matches my experience. Back right after the release, I had it write a Python GUI program using several different frameworks with basically zero input. I also had it ask me for requirements, etc., with absolutely no hand-holding needed.

Alas, I never saved that conversation. It's entirely impossible to get results like that now.


I used it for writing assistance and Plot development. Specifically, a novel re: the conquest of Mexico in the 16th cent. It was great at spitting out ideas re: action scenes and even character development. In the past month or so, it has become so cluttered with caveats and tripe regarding the political aspects of the conquest, that it is useless. I can’t replicate the work I was doing before. Actually cancelled my $20 subscription for GPT4. Pity. It had such great promise for scripts and plots, but something changed.


I fear we have to wait for a non-woke (so probably non-US) entity to train a useful GPT-4(+) level model. Maybe one from Tencent or Baidu could be good, provided you avoid very specific topics like Taiwan or Xi.


Then we can use LLMs benchmarks to benchmark freedom of speech.


writing "assistance", lol


It is a useful tool for editing. You can input a rough scene you’ve written and ask it to spruce it up, correct the grammatical errors, toss in some descriptive stuff suitable for the location, etc. It is worthwhile. At least it was…

If your text isn’t ‘aligned’ correctly, it either won’t comply or will spew out endless caveats.

I appreciate the motivation to rein in some of the silly 4chan stuff that was occurring as the limits of the tech were tested (namely, trying to get the thing to produce anti-Semitic screeds or racist stuff). But whatever ‘safeguards’ have been implemented have extended so far that it has difficulty countenancing a character making a critical comment about Aztec human sacrifice or cannibalism.

I suspect that these unintended consequences, while probably more evident in literary stuff, may be subtly affecting other areas, such as programming. Definitely a catch-22. Doesn’t really matter, though, as all this fretting about ‘alignment’ and ‘safeguards’ will be moot eventually, as the LLM weights are leaked or consumer tech becomes sufficient to train your own.


sprucing up, fixing your mistakes, adding in "descriptive stuff"... that's like 90% of writing. Outsourcing it all to AI essentially robs the purchaser of the effort required to create an original piece of work. Not to mention copyright issues, where do you think the AI is getting those descriptive phrases from? Other authors' work.


I think that you, like I did in the past, are underestimating the number of people who simply hate writing and see it as a painstaking chore that they would happily outsource to a machine. It doesn't help that most people grow up being forced to write when they have no interest in doing so, and to write things they have no interest in writing, like school essays and business applications and so on. If a chatbot could automate this... actually, even if a chatbot can't automate this, people will still use it anyway, just to end the pain.


The original post specifically stated he was utilizing AI for plot development of a novel. Not a school essay or business application.


I should have bookmarked it but there was an article shared on HN, published in the New Yorker I believe, or the LARB, or some such, where a professional writer was praising some language model as a tool for people who hate writing, like themself and other professional writers. I was dumbstruck.

But, it's true. Even people who want to write, even write literature, can hate the act of actually, you know, writing.

In part, I'm trying to convince myself because I still find it hard to believe but it seems to be the case.


> "GPT-4, back then, ported dirbuster to POSIX compliant multi-threaded C by name only. It required three prompts."

I had early access to GPT-4.

I don't know the first thing about you. I don't want to call you a liar, or an AI bro, influencer, etc.

I couldn't get GPT-4 to output the simplest of C programs (a 10-liner, think "warmup round" programming interview question). The first N attempts wouldn't build. After fixing them manually - the programs all crash (due to various printf, formatting, overflow issues). I tried numerous times.

Pretty much every other interaction with GPT-4 since then was similarly disappointing and shallow (not just programming tasks - but also information extraction, summarization, creative writing, etc).

I just can't bring myself to fall for the hype.


I still have the chat:

https://chat.openai.com/share/842361c7-7ee5-49a3-9388-4af7c5...

I misremembered: I fixed the `/` prefix myself; it was a one-character fix and not worth the effort. The diff it generated came later, since my television never returns a 404.

Though, admittedly, I just re-prompted GPT-4 with the same prompts and ended up with similar output - so maybe not the best example of a regression?


Asimov is seeming more prescient now: "I'm sorry, I cannot do that, I am unable to tell if it would conflict with the first law" is basically the response GPT4 now gives to even minor tasks.

There are stories where the robots have to have their first law tightened up to apply only to nearby humans that they can directly perceive would be put in danger.

There are others where the robots make mistakes and lie because they perceive emotional dangers.

When the robots perceive a slight chance of harm they slow down, get stupid, stutter or freeze.

It is extremely hard to encode "do no harm" for even a very smart entity without making it much dumber.


Do you happen to have that chat in your history? It might be worth playing it back as it was to compare. I have this feeling too, and I will check it this weekend and maybe post results.


Is this via the ChatGPT web interface, or via the API?

I'm wondering if there is a difference?


The chat interface for me


So the original paper from Stanford and Berkeley is also linked to this AI influencer? I am really amazed by this kind of dismissal. It's totally irrelevant who posted the info and how it's framed, as long as you have access to the source.


The paper does things like ask GPT-4 to write code and then check if that code compiles. Since March, they've fine-tuned GPT-4 to add back-ticks around code, which improves human-readable formatting but stops the code compiling. This is interpreted as "degraded performance" in the paper even though it's improved performance from a human perspective.
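To make that concrete, here's a rough sketch of the failure mode (my own illustration, not the paper's actual evaluation harness): fenced output fails a naive "is it directly executable" check even when the code inside is fine, and stripping the fence recovers it.

```python
# Rough illustration of the kind of check described above (not the authors'
# real harness). Fenced Markdown output fails a naive compile check, but
# stripping the fences recovers executable code.
import re

fenced_output = """```python
def is_even(n):
    return n % 2 == 0
```"""

def strip_fences(text: str) -> str:
    # Remove a leading ```lang line and a trailing ``` line, if present.
    match = re.match(r"```[\w-]*\n(.*?)\n?```\s*$", text, re.DOTALL)
    return match.group(1) if match else text

def compiles(source: str) -> bool:
    try:
        compile(source, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(fenced_output))                # False: the fence line is not Python
print(compiles(strip_fences(fenced_output)))  # True: the code itself is fine
```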


There is degraded performance because GPT-4 refuses to carry out certain tasks. To figure it out, though, you need to be able to switch between GPT-4-0314 and GPT-4-0613. The tasks it is reluctant to do include:

- legal advice

- psychological guidance

- complex programming tasks.

IMO OpenAI is just backtracking on what it released to resegment their product into multiple offerings.


>IMO OpenAI is just backtracking on what it released to resegment their product into multiple offerings.

I think this is it, but OpenAI is also pulling back on the value of the product if it reduces their compute costs.

I'm hoping competition will be so fierce in this market that opaque quality changes to paid LLM products won't be a thing for very long.


Hilarious because Google is taking the exact opposite approach.

They released Bard as a terrible model to begin with, and now you can use it for programming and have it interpret images. It's been consistently improving.

OpenAI really got sidetracked when they put out such a good model that freaked everyone out. Google saw this and decided to do the opposite to prevent too much attention. Now Google can improve quietly and nobody will notice.


I think it's probably a good idea that GPT4 avoids legal or psychological tasks. Those are areas where giving incorrect output can have catastrophic consequences, and I can see why GPT4's developers want to avoid potential liability.


I wasn't asking for advice. I was in fact exploring ways to measure employee performance (using GPT as a search engine) and criticized one of the assumptions in one theoretical framework based on my own experience. GPT retorted "burnout", I retorted "harassment grounded in material evidence", and it spat out a generic block about how it couldn't provide legal or psychological guidance. I wasn't asking for advice; it was just a point in a wider conversation. Switching to GPT-4-0314, it proceeded with a reply that matched the discussion's main topic (metrics in HR); switching back to GPT-4-0613, it output the exact same generic block.


Yes, considering those are fields where humans have to be professionally educated and licensed, and also carry liability for any mistakes. It probably shouldn't be used for civil or mechanical engineering either.


That's right, no serious work should be done with it. But programming is fine.

https://xkcd.com/2030/


It's ridiculous. We should have access to a completely un-neutered version if we want it; we should just have to sign for access if they are worried about being sued.


The query explicitly asks it to add no other text to the code.

> it's improved performance from a human perspective.

Ignoring explicit requirements is the kind of thing that makes modern day search engines a pain to use.


If I'm using it from the web UI, this is exactly what I would want—this allows the language model to define the output language so I get correct syntax highlighting, without an error-prone secondary step of language detection.

If I'm using it from the API, then all I have to do is strip out the leading backticks and language name if I don't need to check the language, or alternatively parse it to determine what the output language is.

It seems to me that in either case this is actually strictly better, and annotating the computer programming language used doesn't feel to me like extra text—I would think of that requirement as prohibiting a plaintext explanation before or after the code.


It is still ignoring an explicit requirement, which is almost always bad. The user should be able to override what the creator/application thinks is 'strictly better'. Exceptions probably exist, but this isn't one of them.


Like I said, I read their requirement differently—the phrase they use is "the code only".

I don't think that including backticks is a violation of this requirement. It's still readily parseable and serves as metadata for interpreting the code. In the context of ChatGPT, which typically will provide a full explanation for the code snippet, I think this is a reasonable interpretation of the instruction.


I understand your point better now. I'm still not sure how I feel about it, because code + metadata is still not "the code only", but it's not a totally unreasonable interpretation of the phrase.


This sounds a lot like a disagreement over product decisions that the development team has made, and not at all what people normally think when you say "a new paper has proved GPT-4 is getting worse over time."


> what people normally think

While we're doing Keynesian beauty contests, I think that 98% of the time when people say that a product is getting worse over time, they're referring to product decisions the development team has made, and how they have been implemented.


This comment has an interesting take on it, haven't read the paper to verify the take:

https://news.ycombinator.com/item?id=36781968

EDIT: FWIW I haven't noticed any such regression. I don't generally use it to find prime numbers, but I do use it for coding, and have been really impressed with what it's able to do.

8<---

This paper is being misinterpreted. The degradations reported are somewhat peculiar to the authors' task selection and evaluation method and can easily result from fine tuning rather than intentionally degrading GPT-4's performance for cost saving reasons.

They report 2 degradations: code generation & math problems. In both cases, they report a behavior change (likely fine tuning) rather than a capability decrease (possibly intentional degradation). The paper confuses these a bit: they mostly say behavior, including in the title, but the intro says capability in a couple of places.

Code generation: the change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code. They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it.

Math problems (primality checking): to solve this the model needs to do chain of thought. For some weird reason, the newer model doesn't seem to do so when asked to think step by step (but the current ChatGPT-4 does, as you can easily check). The paper doesn't say that the accuracy is worse conditional on doing CoT.

The other two tasks are visual reasoning and answering sensitive questions. On the former, they report a slight improvement. On the latter, they report that the filters are much more effective — unsurprising since we know that OpenAI has been heavily tweaking these.

In short, everything in the paper is consistent with fine tuning. It is possible that OpenAI is gaslighting everyone by denying that they degraded performance for cost saving purposes — but if so, this paper doesn't provide evidence of it. Still, it's a fascinating study of the unintended consequences of model updates.


> Code generation: the change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code. They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it.

In the prompt they specifically request only the Python code, no other output. An “attempt to be helpful” that directly contradicts the user’s request seems like it should count against it.


That's false. If it outputs formatted code, it's easier to read. I don't see the backticks; I see formatted code when using the chat interface.


I guess, but that still isn't the sort of degradation people have been talking about. It's not a useful data point in that regard.


I mean, if you were hoping to use the API to generate something machine-parseable, and that used to work, but it doesn't any more, then sure, that's a sort of regression. But it's not a regression in coding, but a regression in following specific kinds of directions.

I certainly have found quirks like this; for instance, for a while I was asking it questions about Chinese grammar; but I wanted it only to use Chinese characters, and not to use pinyin. I tried all sorts of prompt variations to get it not to output pinyin, but was unsuccessful, and in the end gave up. But I think that's a very different class of failure than "Can't output correct code in the first place".


Fine-tuning or not, it's definitely proof that one should not rely on it apart from very specific use cases (like a lorem ipsum generator or something).


Just to be clear, you're saying that because they're tweaking GPT-4 to give more explanations of code, you shouldn't rely on it for coding?

Obviously if that's your own preference, I'm not going to tell you that you're wrong; but I think in general, most people wouldn't agree with that statement.


I'm still wondering, why should anyone rely on AI generated answers? They are logically no better than search engine results. By that I mean, you can't tell if it's returning absolute trash or spot on correct. Building trust into it all is going to be either a) expensive or b) driven by all the wrong incentives.


> By that I mean, you can't tell if it's returning absolute trash or spot on correct.

You use it for things which are 1) hard to write but easy to verify -- like doing drudge-work coding tasks for you, or rewording an email to be more diplomatic, or coming up with good tweets on some topic -- or 2) things where it doesn't need to be perfect, just better than what you could do yourself.

Here's an example of something last week that saved me some annoying drudge work in coding:

https://gitlab.com/-/snippets/2567734

And here's an example where it saved me having to skim through the massive documentation of a very "flexible" library to figure out how to do something:

https://gitlab.com/-/snippets/2549955

In the second category: I'm also learning two languages; I can paste a sentence into GPT-4 and ask it, "Can you explain the grammar to me?" Sure, there's a chance it might be wrong about something; but it's less wrong than the random guesses I'd be making by myself. As I gain experience, I'll eventually correct all the mistakes -- both the ones I got from making my own guesses, and the ones I got from GPT-4; and the help I've gotten from GPT-4 makes the mistakes worth it.


I think you have pointed out the two extremely useful capabilities.

1. Bulky edits. These are conceptually simple but time consuming to make. Example: "Add an int property for itemCount and generate a nested builder class."

GPT-4 can do these generally pretty well and take care of other concerns, like updating hashCode/equals, without you needing to specify it (a rough sketch of this kind of edit follows below).

2. Iterative refactoring. When generating utility or modular code, you can very quickly do dramatic refactoring by asking the model to make the changes you would make yourself, at a conceptual level. The only limit is the context window for the model. I have found that in Java or Python, GPT-4 is very capable.
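As a made-up Python illustration of the first category (hypothetical class names, not anyone's real code), the edit is mechanical but touches the constructor, equality, hashing, and the builder all at once:

```python
# Made-up example of a "bulky edit": adding item_count means touching the
# constructor, __eq__, __hash__, and the nested builder in one pass --
# exactly the mechanical-but-tedious change an LLM handles well.
class Order:
    def __init__(self, customer: str, total: float, item_count: int = 0):
        self.customer = customer
        self.total = total
        self.item_count = item_count          # new property

    def __eq__(self, other):
        return (isinstance(other, Order)
                and (self.customer, self.total, self.item_count)
                == (other.customer, other.total, other.item_count))

    def __hash__(self):
        return hash((self.customer, self.total, self.item_count))

    class Builder:
        def __init__(self):
            self._kwargs = {}

        def customer(self, value):
            self._kwargs["customer"] = value
            return self

        def total(self, value):
            self._kwargs["total"] = value
            return self

        def item_count(self, value):          # new builder step
            self._kwargs["item_count"] = value
            return self

        def build(self) -> "Order":
            return Order(**self._kwargs)

order = Order.Builder().customer("acme").total(12.5).item_count(3).build()
```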


Like any other information, you cross-check it.

I use it to generate code I'd otherwise get from libraries: graph-theory-related algorithms, special data structures, etc.


Like clockwork, it's coming out that the original paper was wildly misinterpreted: https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...


So you are just going to ignore the data (not anecdotes) presented in the SP?


ChatGPT isn't the right tool to use for checking if numbers are prime. It is tuned for conversations. I'd like to see a MathGPT or WolframGPT.

The real question is whether ChatGPT is worse on math and better elsewhere, or just worse overall. That is still unknown.
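For comparison, the deterministic tool for that particular job is a few lines of ordinary code (a quick sketch, nothing to do with how an LLM approaches it), which is part of why it's an odd capability to benchmark in a conversational model:

```python
# Plain trial-division primality check -- the deterministic tool for the
# job the benchmark asks a conversational model to imitate.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

print([p for p in range(2, 30) if is_prime(p)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```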


> "ChatGPT isn't the right tool to use for checking if numbers are prime. It is tuned for conversations."

Two months ago they were telling me ChatGPT is coming for everyone - programmers, accountants, technical writers, lawyers, etc.

Now we're slowly back to "so here's the thing about LLMs"...


It's been well known from the start that these LLMs aren't optimized for math. I remember reading discussions when it came out.

You weren't paying attention.


And yet my Google engineer friend tells me to use Bard for my math coursework. I know it’s just an anecdote but what’s with the hype?

I was paying attention and it’s why I refuse to use LLMs for math, but I’m being told by people inside the castle to do so. So it is not so black and white in the messaging dept.


I have a tenured architect proposing that we stop sending structured data internally between two services (JSON) - and instead pass raw textual data spat out by an LLM which linguistically encodes the same information.

Literally nothing about this proposal makes sense. The loss of precision/accuracy (the LLM is akin to applying one-way lossy and non-deterministic encoding to the source data), the costs involved, efficiency, etc.

All just to tell everybody they managed to shove LLMs somewhere into the backend.


That last sentence is pretty dismissive and unnecessary, maybe even outright mean. It's possible that the person you're replying to is simply not as deeply ingrained in the technical literature as you are, or has more demands on their time than you do.


It didn’t seem particularly mean to me.


Who were 'they'? The same people who were telling you you'd be zooming around a bitcoin-powered metaverse on your Segway, then having your self-driving car bring you back to your 3d-printed house to watch 3d TV?

There's a certain type of person who sees a new thing, and goes "this will change everything". For every new thing. Very occasionally, they're right, by accident. In general, it's safest to be very, very sceptical.


Don't lump me or the other skeptical voices in with those zealots.


Interestingly enough, ChatGPT can generate Wolfram language queries.

> Can you give me a query in Wolfram language for the first 25 prime numbers?

> Prime[Range[25]]

> What about one for the limit of 2/x as x tends towards infinity?

> Limit[2/x, x -> Infinity]

Edited out most of the chatter from it for clarity.


All the SP is saying is it used to be able to perform that task, and now it can't. That means it has changed. Whether it was ever the best tool to perform any task is open to debate.


[flagged]


Then why comment? I'm not trying to be an ass, but this is literally the only comment you've made in this thread. The linked article is a Twitter thread. If you don't have Twitter, fine, move on to the next thing you can actually comment on.

Person A offers a reason why we may want to be slightly more skeptical than normal about this information. Person B suggests we can pretty easily look past that. Person C (you) injects themselves in the conversation simply to say that they can't add anything to the conversation. The closest example I can think of that is equally cringe-inducing is two people talking about a show and an unasked third party leaning in to add, "well, I don't have a television so I can't comment."


I don't think the comment is unwarranted - it happened to me the other day: someone refers me to a particular Twitter post and I can't access it without a Twitter account. I don't know if it's a glitch or deliberate but it makes referring people to Twitter similar to referring them to Facebook posts (which is not really practiced on HN). Times change, I guess.


I do think it'd be nice if people mostly posted links that were accessible to the general public. At least paywalled articles usually have a link in the comments where they can be read. Many times tweets could even be pasted as text into a comment to make them accessible.


Twitter accounts are free and I think it takes about 60 seconds to make one, so for all intents and purposes a Twitter thread is publicly accessible.


Last I heard they required a phone number which isn't free. I even created a couple accounts a long time ago, but they were both banned after a week or so. I suspect because they thought they were bot accounts. I used free email services like yahoo to sign up, I only ever used the website and not the app, and I never posted or interacted with anything, I just followed a small number of accounts.


Here it is just for you: https://t.co/H6OWhUkzcW


Replace twitter.com with nitter.net in the url


Well thanks for sharing your ignorance with us.


I feel like it has gotten worse. Your expectations have shifted towards knowing and accepting you have to prompt it a certain way. When you started, "holy shit I can just use natural language". GPT4 is influencing the bar we are setting for it, and the bar we are setting for it is influencing the outcome of its training. If you've been under a rock and you just found out about chatGPT today, you'd find it less fluid and impressive than the rest of us when we started.


It has definitely gotten worse. At least its performance when prompted in German has degraded noticeably, to the point where it's making grammatical mistakes, which it never has before.


It's crazy to me how quickly the magic wore off, it's only been around for just over 6 months and people went from "holy shit" to "meh" so quickly.


I don't know why I keep seeing comments like these. We've had to delay the latest API model upgrade because it's dramatically worse at all of our production tasks than the original GPT-4. We've quantified it and it's just not performing anymore. OpenAI needs to address these quality issues or competitor models will start to look much more attractive.


It needs to be made open source to have its Stable Diffusion halcyon days.


'meh' gets clicks, because it's still amazing to most people. If someone makes a claim that is shocking, it gets clicks.

GPP is banking on the underlying report not being interpreted, which looks like it was a good bet based on some other comments here.


> it's just that the magic has worn off as we've used this tech

I agree with this. The analogy I use on repeat is the dawn of moving picture making. The first movies were short larks, designed just to elicit a response. Just like when CGI was new- a bunch of over-the-top, sensationalist fluff got made. This tech needs to mature. And we need it to continue to be fed the work of real humans, not AI feeding on AI, a recent phenomenon that hopefully will not turn out to be the norm. If we water and feed it responsibly, it will only grow more capable and useful over time.


I have actual artifacts -- programs it wrote -- from immediately after the release, which I'm unable to replicate now.

Don't claim it hasn't gotten dumber. It's easy to find excuses, but none that explain my experience.


ChatGPT is a nondeterministic system that's sampling from a huge distribution in which there are thousands or possibly tens of thousands of events with similarly infinitesimal probabilities. There is no guarantee that you will get the same result twice, and in fact you should be a little surprised if you did.

That is not an excuse, and explains your experience.
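A toy illustration of the point (pure numpy, not OpenAI's actual stack): sampling next tokens at any nonzero temperature is stochastic, so two runs of the same prompt can diverge, and diverge more at higher temperature.

```python
# Toy illustration (not OpenAI's actual stack): temperature sampling over
# next-token scores is stochastic, so repeated runs on the same prompt can
# produce different outputs.
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.5, 0.3, -1.0]   # scores for 4 hypothetical next tokens
print([sample_token(logits, temperature=0.8) for _ in range(10)])
print([sample_token(logits, temperature=0.8) for _ in range(10)])  # usually differs
```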


You're assuming I'm not aware of that, and didn't try multiple times. The original program wasn't the output of a single request; it took several retries and some refining, but all the retries produced something at least mostly correct.

Doesn't matter how many times I try it now, it just doesn't work.


>> You're assuming I'm not aware of that, and didn't try multiple times.

I'm not! What you describe is perfectly normal. Of course you'd need multiple tries to get the "right" output and of course if you keep trying you'll get different results. It's a stochastic process and you have very limited means to control the output.

If you sample from a model you can expect to get a distribution of results, rather tautologically. Until you've drawn a large enough number of samples there's no way to tell what is a representative sample, and what should surprise you. But what is a "large enough" number of samples is hard to tell, given a large model, like a large language model.

>> Doesn't matter how many times I try it now, it just doesn't work.

Wanna bet? If you try long enough you'll eventually get results very similar to the ones you got originally. But it might take time. Or not. Either you got lucky the first few times and saw uncommon results, or you're getting unlucky now and seeing uncommon results. It's hard to know until you've spent a lot of time and systematically test the model.

Statistics is a bitch. Not least because you need to draw lots of samples and you never know how close you are to the true distribution.


Nah it’s definitely gotten worse in my experience. I played with an early access version late last year/early this year and there’s been a strong correlation between how hard they restrict it and how good it is.

Queries it used to blow away it now struggles with.


Anyone who has saved any early prompts/responses can plainly attest to this by sending those prompts right back into the system and looking at the output.

As to why, it's probably a combination of trying to optimize GPT4 for scale and trying to "improve" it by pre/post processing everything for safety and/or factual accuracy.

Regardless of the debate as to whether it is "worse" or "better," the result is the same. The model is guaranteed to have inconsistent performance and capability over time. Because of that alone, I contend it's impossible to develop anything reliable on top of such a mess.


To be fair to sama, he did say that "it still seems more impressive on first use than it does after you spend more time with it"[1], but I suspect that was a kind of 4D chess move: he knew the hype would cool off a few months later, and hoped to be quoted in situations like this to cushion the blow.

[1] https://twitter.com/sama/status/1635687853324902401


This is just not true for us. I'll share again what I've said before: we’ve been testing the upgraded GPT-4 models in the API (where you can control when the upgrade happens), and the newer ones perform significantly worse than the original ones on the same tasks, so we’re having to stay on the older models for now in production. Hope OpenAI figures this out before they force the upgrade because quality has been their biggest moat up until now.


That's false, the fact that I no longer rely on it is enough proof that it has gotten worse for me.


Continuous training is the holy grail. Yes, you can just retrain with the updated dataset, but guess what: you will run into catastrophic forgetting or end up in a different local minimum.


They optimized towards minimal server cycle costs at acceptable product quality while keeping the price stable?


Has anyone considered what's obvious here: GPT-4 is overworked.


What in the whole wide world is an ‘AI influencer’?


Are we talking GPT4 or chatgpt4?

It's not disputed that chatGPT4 has degraded in quality as it's been 'aligned'.


There's no such thing as chatgpt4. There's GPT-4 and ChatGPT.


Huh? When you log on to ChatGPT you're given the choice to use the faster GPT 3.5 model or the slower, rate-limited GPT 4 model.

They have both been doing the Flowers for Algernon thing over the last few months. People talk about regression on the part of ChatGPT 4, but the 3.5 chatbot has also been getting worse. No amount of hand-waving and gaslighting from OpenAI (sic) is going to change the prevailing opinion on that.


Yes, these are different. Ask each how to do [insert illegal activity], and let me know if you find a difference.


quantization



