Ask HN: Has ChatGPT gotten worse at coding for anyone else?
108 points by Michelangelo11 on Feb 18, 2023 | 81 comments
I used it for coding in Python, often with the python-docx library, about six weeks ago, and it was superb. It gave me exactly what I wanted, which is no mean feat for a semi-obscure little library, and I was delighted. Then I tried it again a few weeks ago and it did worse than before, but I thought maybe it was just bad luck. Using it today, though, it seemed really, really bad, and it messed up some very basic Python features, like the walrus operator -- it got so bad that I gave up on it and went back to Google and Stack Overflow.
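For context, the walrus operator is just the := assignment expression added in Python 3.8, about as basic as features get. A minimal example:

    # assign and test in one expression (Python 3.8+)
    while (line := input("> ")) != "quit":
        print(f"you said {line!r}")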

The performance drop is so steep that I can only imagine they crippled the model, probably to cope with the explosion in demand. Has anyone else seen the same thing?




In my anecdotal experience it's been sort of terrible from the start. In my first interaction with it, it suggested that I use a library that had been deprecated for a few years, which is when I found out its data cut-off point was in 2021.

We've been building an API with Asp.Versioning, Microsoft.AspNetCore.OData.Query, Microsoft.AspNetCore.OData.Deltas, and Microsoft.EntityFrameworkCore, and it's been very bad at it. I think it's sort of understandable, because there isn't a lot of documentation or many examples for these libraries, and some of them have changed a lot since 2021, but it can't even write an ActionResult function correctly without a lot of help. At one point we asked it to do something and it produced some truly terrible code. When I pointed out what was wrong, it apologized and then proceeded to give me the exact same piece of code.

We use it quite a lot, along with Copilot, to test the waters, and so far it's rather unimpressive. From my completely anecdotal experience, it hasn't gotten worse, but neither is useful for things that haven't been solved a million times before. I think the major advantages we're going to see from it are automated documentation and possibly having it write tests.

That being said, I don't think it's that much worse than programming via Google. C# documentation is really hard to find. Some of the OData documentation is a GitHub repository with very few comments and only in-memory example code, but it was easier to find through ChatGPT than through Google. I do think it needs to automatically include the source and date for what it bases its answers on, to help you navigate the answer. What I mean by this is that IActionResult hadn't yet been replaced by ActionResult in 2021, so if it simply told you that its answer is old, you'd probably be more inclined to look things up in the official documentation. I know I would.


I asked it to create a struct representing a table in Go without telling it the types, and it inferred those just from the names, including having lat/long as floats. Then I asked it to write a Fiber app using that struct with a single endpoint for adding new entries. It did so perfectly. I asked it to add single-entry and multi-entry retrieval, and it did that from context too, with no bugs.


Not surprising, is it? It can do all these things for which plentiful examples exist out there quite perfectly, but everything else that requires truly complex understanding, transfer of knowledge, or where only little documentation and few examples exist... not really.


Yeah, I have been trying to use it to design neural nets, but it always gets the tensor shapes wrong and uses the incorrect layers and parameters.

I tried being brief, thinking that it has a short “working memory”, but that leads to definitional ambiguity.

I tried being exhaustive, with explanations about things like the data range, input shapes, and reasons for parameters, and with enough explanation it will eventually come to the right conclusions -- but it’s far more effort to exhaustively explain what I need than to just write the code. Even worse, during the explanation phases it gives me “working” code (i.e. code that executes) that is functionally wrong! So if I weren’t so experienced, I would likely have accepted one of these incorrect implementations and then been unsure why it doesn’t work…

When I say “wrong” I don’t really mean things like “this hyperparameter is non-optimal” - I mean things like “This network topology doesn’t make sense, and the whole thing is wired together wrong”

I think overall it’s a great tool though -- one of my favourite hobbies now is to sit down before bed and spend an hour chatting with it on random subjects, asking it deep domain questions on ancillary interests of mine. But I am cautious about what it produces, knowing it gets code so wrong.

I look forward to future versions that are improved!


> but it’s far more effort to exhaustively explain what I need than to just write the code

Exactly. It's almost as if explicitly laying out the requirements is itself a form of programming. We might see something more akin to a natural language in the future, but requirements are programming, in a very real sense.


I wouldn't have thought it could do even this; I am truly amazed that it was able to infer type from name and then keep everything it did in context. I decided to use Postgres instead of MySQL, asked it to rewrite with that in mind, and it did so perfectly. To me that is amazing.


It's more likely that most examples of lat/long in structs were floats, so that's just what it used.


I mean we can go back and forth on this all day long. What’s the conclusion you’re driving towards?


Why do I have to drive to a conclusion? I was just sharing my experience


> We use it quite a lot, along with co-pilot to test the waters

Can you shed some light on how Copilot is doing in comparison?


Interestingly, I feel Copilot has been getting seriously dumber too.

It seems to be struggling more and more with seemingly simple things that I could Google in 5 minutes (which, I feel, is really what Copilot is good for).

This weekend I asked it to write "an event listener that listens for when a user cmd + clicks and binds the event to the onCmdClick function" in a Vue app.

I tried around 3 different variations of the prompt; it just COULD NOT figure out that it needed to check whether the cmd key was down. All it would do is bind "click" events to the function I mentioned.


I think it's hard to compare the two because of how you approach them. We dismiss the Copilot suggestions a lot, but it's not like we get the same number of suggestions from ChatGPT, because you have to seek it out more actively. If that makes sense.

I think Copilot is eventually going to compete with a lot of the auto-generation tools that exist today. Like, you can auto-generate REST controllers in C# if you're not doing the whole OData with generic <T> controllers thing, and Microsoft has made a tool for that, but in the future it'll likely be Copilot that handles that sort of thing.


Yeah, they definitely changed the model. In ChatGPT Pro you can actually select either the legacy model or the new one; the new one is substantially worse at everything from coding to analyzing and categorizing text. My guess is it takes ridiculous amounts of VRAM (hundreds of GB per user) to do inference, so they changed it.


What's funny is that it was first presented as "Fast vs. Normal" rather than "Default vs. Legacy", with Normal selected by default, presumably because that made it better.

Then overnight "Normal" became "Legacy" and "Fast" became the default.


I do wonder what the real economic cost of running a ChatGPT query is, once you peel away any introductory pricing etc. -- i.e. the amortised cost of hardware and research, plus running costs (electricity).

Whenever I see the size of these models, it strikes me that it must be quite a bit, certainly not pennies.


On the All-In podcast they said 30 cents per query (probably for the original version), versus 1-3 cents for a Google search. That's why what Bing is doing could be so disruptive to Google's margins when they introduce it.


I noticed this too. I'm considering stopping my subscription and claiming a refund. Chances are they'll drop the legacy model soon, and its usefulness for me goes with it.


Someone said you can choose the legacy model if you are a subscriber; have you tried that?


Perhaps the question is whether this will remain?


Yeah, I think they shrank the model size for sure.


I'm guessing that the new one is being used for the free tier too, because it seems to be pretty bad for non-coding tasks as well. I tried to use it for some very simple math (see comment above) and it was atrocious.


ChatGPT is a language model, and as far as I understand it isn't designed to do explicit computations or logic.


Logic it is good at. Math isn't logic, as weird as that sounds.

It's trained on code, which follows logical rules.

If I'm wrong here, please correct me.


They probably changed it from a davinci-based model to an ada- or curie-based model.


It was actually showing the model names in the ChatGPT Plus URLs today: text-davinci-002-something.


But we don’t know whether davinci still maps to 175B anymore.

Realistically speaking, in order to offer ChatGPT Plus they will have to have more model instances running concurrently, because the promise is no wait time, but they can't magically increase the number of hosts over a short period of time. Data centers take months to ramp up.

So the realistic solution is to shrink the model size so one host can serve two, three, or even four model instances,

And label it as Turbo mode


can you expand on that last part?


I asked it for an HLSL shader to raymarch a cloud and it basically handed me a copy/paste of the top result off Shadertoy, changed just enough to be broken. It kept the indentation and the magic constants unchanged, though!

The more niche the ask, the less... transformative/uniquely generative its output is, and the less reliable.


Right, absolutely. But the level of nicheness was largely the same over time or, if anything, went down (the walrus operator in Python isn't very niche at all).


It is niche in the sense that it is new and there are fewer examples of it on the internet; maybe not even 10% of public Python code uses it.


Yes, I noticed this too. I had it build a MongoDB aggregation where documents get grouped into hourly timeslots (compute temperature average + high + low for every hour). There are two ways to do this: 1) convert the datetime to a YYYY-mm-dd-HH string and use it to group the documents, or 2) use a Unix timestamp and do some math on it.

I was already using 2) in some projects, so I wanted to check if it was able to do this.

It first suggested 1), then I told it to make it more efficient by avoiding strings, so it gave me 2). Wow.

That was around 3-4 weeks ago. When I tried it again this week, it would only output 1), and it wasn't able to make the move to 2) anymore when told not to use strings. It kept using them.


Can you paste code on how #2 was done?


Time stamp % 24


That does not get you hourly time slots.

You either want something like

    floor(timestamp / 3600)
or

    timestamp - timestamp % 3600
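In an actual Mongo pipeline that looks something like this -- a rough pymongo sketch, where the collection and field names ("readings", "timestamp" in Unix seconds, "temp") are made up:

    from pymongo import MongoClient

    db = MongoClient()["mydb"]  # assumes a local MongoDB instance

    # group readings into hourly buckets via integer math on the
    # timestamp -- no string conversion needed
    pipeline = [
        {"$group": {
            "_id": {"$subtract": ["$timestamp", {"$mod": ["$timestamp", 3600]}]},
            "avg": {"$avg": "$temp"},
            "hi": {"$max": "$temp"},
            "lo": {"$min": "$temp"},
        }},
        {"$sort": {"_id": 1}},  # buckets in chronological order
    ]
    for bucket in db.readings.aggregate(pipeline):
        print(bucket)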


that's clever, thanks!


Is it possible that what you have been working on over the last six weeks became more specialized and less general? Did you start out with a prototype and then later move to pinned dependencies? Had you attempted using the walrus operator earlier, or only recently?

My contention, which I cover in the video below, is that due to the underlying statistical sampling problems inherent in RLHF transformers, LLMs perform poorly in edge cases, and depending on the application or language, the margin of that edge can be super wide.

Here's a video I created about it: https://www.youtube.com/watch?v=GMmIol4mnLo

I didn't cover this yet, but there are these things called "scaling laws," which basically state the amount of raw text needed for an LLM with a particular number of parameters. So my current mental model is that these "laws" are really economic rules of thumb (like Moore's law is actually Moore's rule of thumb), and there is a huge expense in sampling clean data, hence the need for RLHF.

More about RLHF if not familiar with that term yet: https://huggingface.co/blog/rlhf


What I've noticed is that it seems to almost have Alzheimer's now.

When it first came out, it seemed to be able to hold context all the way back to the beginning of the chat thread. Now it seems to be limited to roughly 2-3 messages.

I found you can actually test this by telling it a bunch of detailed information over the course of 5-6 messages, and then asking it a question about something you mentioned in message 1. For me, it will now fail at this almost 100% of the time.

Makes it almost useless to me, IMO. The main thing I was excited about was the ability to dump large corpora of information into it as chat messages and then have it distill answers to specific questions I have about the content.

It's effectively useless for that now that it's only able to "remember" the most recent 1,000-2,000 words.


Absolutely. I just tested this by giving it 4 JSON files I wanted to ask it about. It immediately acted like it did not have the info I had just given it, and told me how I could retrieve the information from a file.

I did get that working eventually by explicitly telling it at the start to refer back to this data when I ask questions.

It's a lot of work, currently, figuring out the magic phrasing you need to use to get it to act like it did a week ago.


I tried it with my starter Python interview question quite a while back and it did pretty well. I just tried it now and it totally missed half the problem, but it figured that part out after I told it what it had forgotten to solve for.

It still had a bug, though it seemed way faster at writing the code this time than last.


Yeah, earlier tonight I used it to find a function through 4 points in 2D space, and it only got it right after I corrected it 3-4 times. We're talking a simple linear function with 1 independent variable and integers as the only numbers.

I'm certain that the ChatGPT version at launch would have spat out the right solution on the first try.
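For comparison, the deterministic version is a couple of lines with numpy -- a sketch with made-up points:

    import numpy as np

    # least-squares fit of y = m*x + b through the points; points are made up
    x = np.array([0, 1, 2, 3])
    y = np.array([1, 3, 5, 7])
    m, b = np.polyfit(x, y, 1)
    print(f"y = {m:.0f}x + {b:.0f}")  # y = 2x + 1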


I wanted to ask this very question. Since the last update I have struggled to get it to do what I ask. I am really fighting with it, currently, hoping to find a solution, because as it stands it is not helpful to me at all. For example, a prompt I have been using for months that starts along the lines of "you will be my bash script advisor" now receives the response "sorry, as a language model I cannot act as a bash script advisor".

I managed to get it to cooperate again by rewording it, but its answers seem to be ignoring all of my instructions. I explicitly tell it not to use imagined flags and options, and that used to work. Now it's back to just inventing a plausible-sounding flag, even though I told it not to.

I just gave it 4 JSON files that I wanted to ask questions about, and it acts like I didn't just give it all the data it needs to answer my question. I solved that by rewording my prompt to include "Make a note of the following data and refer back to it when I ask you questions".

If they are going to change it every month so that I have to rethink all my prompts just to get answers, then it's not worth it. I'm spending more time on prompt engineering than on getting the answers I need.


If you subscribe to the paid version, you can use the slower and perhaps better "legacy" model that was available a few times every day until a few weeks ago.


Hmm, thanks, yeah, I forgot about the paid version. They might have just crippled the free version...


If you’re going to pay, you should also test out GitHub Copilot, which as far as I know is a GPT-3 derivative as well.


Sadly, it has nowhere near the same quality though. I am deciding between the two, and so far ChatGPT is still winning. It's nowhere near perfect, but it's very good for boilerplate.


They recently switched Copilot to a better model.


Yes, I use both.


In my work with it, ChatGPT has never produced very good code. I have not noticed a drop in its already mediocre-to-poor performance. I have to wonder about your weird perception that it was "superb."


I’ve noticed ChatGPT becoming so bad that I only ever use it to see if they have improved it -- but it’s always worse than the previous time. It can’t remember more than a thousand or two thousand tokens now; that said, it can simply forget things I asked it just 50 tokens ago.

I didn’t test this out thoroughly, but when I VPN’d into the USA it did seem to work much better for me. But I also created an account at that time, so it had no traces back to the UK. I don’t know, but I think a proper study into whether non-USA users are being given a more limited, degraded version of ChatGPT could be worth considering -- I really, really want to be wrong about this one.


It's laughably bad now. For two weeks I was impressed, though kind of disappointed by their restrictions, but now... I've forgotten about it and kind of have anxiety about using it... it will not do what I want... so why bother giving them any data points... they seem to have made it Alexa quality... lmao.


Which model is being called? You can see it in the HTTP calls


Not programming, but I asked it to generate a list of words that end in common TLDs, to see if I could come up with a nice website domain / project name. I said: give me words that end in .io, .ws, .is, etc.

It failed to follow this simple instruction. I tried to be more and more specific, but no matter what I did, it returned words that did not end with the TLDs I asked for. It just seemed to give me any old words and throw a "." two places before the end.

Compared to the initial experiences I had with ChatGPT it feels like it's suddenly gotten awfully dumb.
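The frustrating part is that the deterministic version is a few lines of Python -- a sketch, assuming a plain wordlist file like /usr/share/dict/words:

    # print words that end in a TLD, e.g. "studio" -> stud.io
    tlds = ("io", "ws", "is")
    with open("/usr/share/dict/words") as f:  # wordlist path is an assumption
        words = [w.strip().lower() for w in f]
    for w in words:
        for tld in tlds:
            if w.endswith(tld) and len(w) > len(tld):
                print(f"{w[:-len(tld)]}.{tld}")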


To be fair, though, I would expect that. It's always been and always will be pretty terrible at anything on the character level, because it works on tokens (i.e. chunks of words) rather than characters, and it can't see individual characters at all.

For example, try asking it to count the number of characters in a word and, no matter whether it's right or wrong (seems to be a coin toss, btw -- it can only make a rough guess), ask it "Are you sure?", and keep asking that over and over every time it replies. Chances are it'll keep changing the count back and forth, apologizing every time.
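You can see the token boundaries yourself with OpenAI's tiktoken library -- a quick sketch (the exact splits depend on the encoding):

    import tiktoken

    # a word is split into multi-character tokens, not letters
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("anthropomorphize")
    print([enc.decode([t]) for t in tokens])
    # something like ['anth', 'rop', 'omorph', 'ize'] -- it never sees single characters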


I have the same experience. I got nice results in some cases, less nice in others. And for the last week or so I've gotten bad code that, even with a lot of hints, ChatGPT doesn't improve, where before it would. It has also stopped following instructions. It stops generating in the middle of a function, and when asked to continue, it restarts from the beginning. When you ask it to improve the last version of a script, it generates something completely different instead of improving the existing script. A lot of updates go A => B => A, and fixes are lost from one version to the next. It was nice when it worked. No way I would pay for the kind of results I've gotten the last few days.

It even lied to me. It kept generating code that didn't work, and I said I wished I could send it a screenshot so it could see for itself what goes wrong. It said to upload a screenshot to imgur and paste the URL. I told myself maybe there had been an update -- since Bing AI can access the web, maybe ChatGPT got an upgrade -- but the "inspection" of my screenshot went so fast that I doubted it had processed anything. Then, as always, it apologized once I said I had doubts.


I tested it out on some 6502 assembly recently and it made serious logical errors as well as basic mistakes, like using 16-bit numbers in 8-bit registers.

https://videos.rights.ninja/w/m9LX4Tw47vPvbNDSsjFTYh


"Write a position paper describing how AI models will never be able to achieve the promises of OpenAI. Explain how OpenAI will not achieve their goals and have likely lied to the public and themselves."

That is the only question you need to ask this "model", lmao. What a joke.


I believe they further adapted the prompts to be even more concise and that in turn means worse code.

I have gotten better results by telling it not to be concise and instead to be more detailed.

What I have also noticed, though, is that by now I "expect" more from it than it can actually handle.


My experience as a back-end developer is that although it makes quite a lot of mistakes, if I put a moderate amount of time into it, it can give me a full webapp. I was always afraid of front-end development, but now, with the help of this guy, I'm making moderately complex websites in 3-7 days. And thanks to its wide range of knowledge on almost everything, I can run a few full online businesses solo and see which one turns a profit. It needs a LOT of improvements, and there will be alternatives in the coming months, but one thing is for sure: the future belongs to those who capitalize on it.


I guess the text generation might be using some kind of beam search as a final layer (or some other enumerative search procedure). A trivial computational intensity reduction trick would be to reduce search size and thus reduce the quality of the output, without even changing the model. So I can imagine they reduced the compute per chat over time like this.
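To illustrate the knob (with an open model -- what OpenAI's serving stack actually does is anyone's guess), Hugging Face transformers exposes it directly; a sketch:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # beam width trades output quality for compute: num_beams=1 is greedy decoding
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    ids = tok("def hourly_buckets(timestamps):", return_tensors="pt").input_ids

    cheap = model.generate(ids, max_new_tokens=40, num_beams=1)   # fast
    pricey = model.generate(ids, max_new_tokens=40, num_beams=8)  # ~8x the compute
    print(tok.decode(cheap[0]))
    print(tok.decode(pricey[0]))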

I semi-agree with the observation, though I'm not sure if that's just hype-recovery bias on my side. For example, in the early days I would just marvel at the output looking right. Later I started running the output and noticing it is often quite wrong.


Haven't noticed anything different.

I've found that it's also been pretty good at writing SQL queries, and at pinpointing where input queries are incorrect. It's probably my highest-leverage use of ChatGPT at the moment.


A demonstration of your workflow for this use case would be nice, if you ever feel inclined to make one.


Maybe they wanted to avoid lawsuits over license violations in code they trained on (and which the model will, in some cases, spit out near-verbatim).


Also, it competes with MS's GitHub Copilot.


OK, I'm not going crazy. Well... maybe I still am... but this isn't part of it! So, yes: I was using it for some React work, and in the last couple of weeks I had even remarked to some other folks that it seems much less successful at answering the code-related questions I ask, and the code it outputs seems to have degraded quite a bit.


I always noticed it made up function calls or got them wrong for GMS2 (GameMaker Studio 2) all the time, but the structure was often correct.


I got decent results by telling it to either pretend it's the GameMaker Studio 2.3.3 manual or to answer in the style of the manual. The problem is how much outdated information is in the model.


I used it two days ago to translate some research code from Matlab to Python, and it worked really well, saving me about 10 hours of work. It would be very unfortunate if they degraded it without adding a higher paid tier. This thing has literally provided me with thousands of dollars of tangible value. It turns me into a proper one-man army.


ITT: many egotistical developers who claim to "know" their code is "better" than ChatGPT's.

Meanwhile, a post not too long ago was upvoted quite highly for claiming that it's impossible to prove or disprove that code is "clean" anyway, because it's all subjective.


For code quality, I’ve found that asking it “is that idiomatic code?” will get it to rewrite the code more cleanly if it determines that what it wrote was indeed not idiomatic.

The bigger issue is code that doesn’t follow the spec. When I’m trying to generate code to do what I’m asking, I care a lot more about whether or not it actually works (which it usually doesn’t, save for the simplest use cases). With enough prompting and reminders you can usually get it there, but at a certain point you’re better off taking whatever busted code it gives you and just fixing it yourself.


Yes, I also noticed it before they had the different models available. Some days it would just be another, somehow much dumber AI. I guess it would be nice to have some kind of signature so you know who you are dealing with.


I noticed it sometimes uses outdated code. ChatGPT admitted it gave me old, outdated, and wrong code without me telling it. It knew.

I tried to have it write a HexChat script in Perl and Python; neither worked, due to it being trained on old documentation.


If you're saying it "knew" because it admitted to it after you pointed it out: the character it's playing will bow and scrape as long as you correct it at all, even if you just straight-up lie. It knows that a correction in that context ought to be followed by an apology, and OpenAI's tweaking is almost certainly involved (since it uses the same words every time).


> without me telling it


ChatGPT doesn't "know" anything. It's not sentient, it's just running statistical models.

I'd be interested in seeing the prompts where it gave incorrect code and immediately apologized without user interaction. Maybe the training data includes text showing what bad code would look like, and the parts where good code appears didn't rank highly enough.


They are attempting to monetize this technology, so I can imagine they are degrading the free model. Amusingly, that makes me less inclined to sign up.


They're pushing the "degraded" model on paid users too, and now labeling the previous model as Legacy, implying it will go away at some point.


We've just incorporated our Stackoverflow app into you.com/chat. Would love to get your feedback.


Yes, at some point I felt that too, as I was coding for my website https://www.astroassociates.com/


Yes, it got much worse!!!!


Yes! I noticed that, too.


Yep



