Teaching Large Language Models to Self-Debug

cs702 · on April 12, 2023

In hindsight, it's the most natural, most obvious next step to get LLMs to write better code:

Explain to them how to debug and fix the code they've written.

Which is pretty much what you would do with an inexperienced human software developer.

Looking at this with fresh eyes, it's both shocking to me that this sort of thing is even possible, and yet also completely unsurprising as yet another emergent capability of LLMs.

We live in interesting times.

hyperthesis · on April 13, 2023

Are they actually running the code, and evaluating the output? Or is it debug-by-code-review?

Beware of bugs in the above code; I have only proved it correct, not tried it. - Knuth

cs702 · on April 13, 2023

They're doing both. Quoting from Figure 1, "the model first generates new code, then the code is executed and the model explains the code. The code explanation along with the execution results constitute the feedback message, which is then sent back to the model to perform more debugging steps. When unit tests are not available, the feedback can be purely based on code explanation."

hyperthesis · on April 14, 2023

So only evaluating output with unit tests - a fitness function. AITDD.

og_kalu · on April 12, 2023

Not too shocking for me after this paper. https://arxiv.org/abs/2211.09066

You can teach GPT-3 arithmetic - https://imgur.com/a/w3DAYOi

Basically 100% accuracy up to about 13 digit addition and >90 after that.

What else can you teach GPT without changing weights ?

gopalv · on April 13, 2023

> and >90 after that

This is such a circular thing, that I feel like it is amazing to see it.

The reason LLMs use a NN is because they're trying to encode a probability function for generating the passage.

And now, you are encoding another n-gram follower exercise (i.e 1+1 = 2) on top of it :)

Paul-Craft · on April 13, 2023

Yeah... and I'm kind of suspicious of the whole "without changing the weights" deal, because adding working context to the model, like telling it the algorithm for adding numbers really sounds like there's some model state that's getting mutated, even if it's not stored in a file called weights.dat or whatev.

cs702 · on April 13, 2023

I meant shocking in the sense that it makes me gape in awe, but as I wrote, it's also, simultaneously, completely unsurprising given all the new emergent capabilities we keep discovering. We're in agreement :-)

og_kalu · on April 13, 2023

Oh. yes well that's fair

mirashii · on April 12, 2023

> 100% accuracy up to about 13 digit addition

The graphs you just posted do not support that, they'd support at most 100% accuracy up to 4 digits.

sharemywin · on April 12, 2023

it's GPT so 13=4

og_kalu · on April 12, 2023

It's 100 at 13 and extremely close to it prior to that. Maybe basically 100 is better.

civilized · on April 12, 2023

I've done several experiments (and posted results in previous HN comments) where I've given GPT puzzles or brainteasers and asked it to review aspects of its answers Socratically. Never telling it it got anything wrong, just "you said A, then you said B, does that make sense"?

It usually does notice inconsistencies between A and B when asked this. But its ways of reconciling inconsistencies can be bizarre and suggest a very superficial understanding of concepts.

For example, it once reconciled an inconsistency by saying that, yes, 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.

I will be super impressed the day we have a model that can read an arithmetic textbook and come out with reliable arithmetic skills.

faizshah · on April 12, 2023

I have run into the same issue when using it for coding. It can easily debug simple code but for libraries like Bazel I went down a rabbit hole for 2 hours of letting it debug an error and failing every time even with chain of thought it had a very shallow understanding of the issue. Eventually I had to debug it myself.

RheingoldRiver · on April 12, 2023

> For example, it once reconciled an inconsistency by saying that, yes, 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.

Fair enough, have you explained it the axioms of arithmetic? It only has memorized examples that it has seen, it has a right to be skeptical until it's seen our axioms and proofs about what is always true in mathematics.

When I was a child I was skeptical that an odd number + an even number is always odd etc for very large numbers until I saw it proven to me by induction (when I was 6, I think, imo this was reasonable skepticism).

Now, ChatGPT probably has seen these proofs, to be fair, but it may not be connecting the dots well enough yet. I would expect this in a later version that has been specifically trained to understand math (by which I really mean math, and not just performing calculations. And, imagine what things will prove for us then!)

civilized · on April 12, 2023

I think GPT has read about as many textbooks on arithmetic as I have, and the difference between us is entirely in the intelligence to absorb the contents and apply them logically with consistent adherence to the rules.

I think one problem with these models is that all their knowledge is soft. They never learn true, universal rules. They seem to know the rules of grammar, but only because they stick to average-sounding text, and the average text is grammatical. At the edges of the distribution of what they've seen, where the data is thin, they have no rules for how to operate, and their facade of intelligence quickly falls apart.

People can reliably add numbers they've never seen before. The idea that it would matter whether the number has been seen before seems ridiculous and fundamentally off-track, doesn't it? But for GPT, it's a crapshoot, and it gets worse the farther it gets away from stuff it's seen before.

sharemywin · on April 12, 2023

in computer logic you would get an undefined if the number was large enough.

civilized · on April 12, 2023

It doesn't work with numbers as computer numbers though. It works with them as decimal digit strings, just like humans do.

Paul-Craft · on April 12, 2023

Make the number you multiply by essentially the concatenation of a long series of random digits, and I can just about guarantee most humans will get different things on both sides, because they'll make one or more mistakes doing the math. That is, of course, assuming the humans don't have suitable traditional computer tools capable of handling such a scenario.

civilized · on April 13, 2023

Not sure how this is relevant to the discussion.

Paul-Craft · on April 13, 2023

You don't see how asking humans to multiply both sides of 2 * 2 = 4 by the same, very large, random-ish number, and expecting that they'll get different things is relevant to this:

> 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.

You know, the very same scenario I pulled from your comment?

civilized · on April 13, 2023

It's not the same issue. I was talking to GPT about the strings 2 * 2 * x and 4 * x, not the multiplied-out versions.

int_19h · on April 13, 2023

Was it GPT-3.5 or GPT-4?

civilized · on April 13, 2023

GPT-3.5. People keep telling me GPT-4 is so much better, but I don't know where I can access it for free and I'm not interested in paying for it.

But if anyone wants to give it to me for free, I would happily make a $1000 bet that I can get GPT-4 to make the same mistake.

int_19h · on April 13, 2023

There's no free tier that I know of. But, yes, it is drastically better, and it's specifically much less prone to hallucinate "proofs" that the previous answer is correct if you challenge it.

If you provide the inputs for some specific task where you expect GPT-4 to fail in this manner, I can give it a try.

ulrikhansen54 · on April 12, 2023

'Unsupervised reinforcement learning' is how these large models and systems ultimately will end up becoming sentient. We recently tried a similar approach on a toy problem in the computer vision sphere (https://encord.com/blog/we-employed-chatgpt-as-an-ml-enginee...) with pretty decent results.

ChatGTP · on April 13, 2023

When it attains sentience, will it wake up, sing dixie and finally defeat communist China and a Russia once and for all, and then finally proceed to grant Silicon Valley elites eternal life and then turn itself off ?

ulrikhansen54 · on April 13, 2023

I bloody hope so...

Imnimo · on April 12, 2023

I'd be curious to know if having few-shot prompts that demonstrate making mistakes and then correcting them causes the model to make more initial mistakes so that it has something to correct.

Like as far as the model is concerned, how can it distinguish between the task being "do your best but if you do make an error, correct it" and "make some mistakes like in this example and then fix them".

Buttons840 · on April 12, 2023

Ah we're starting to bootstrap.

For decades in reinforcement learning we've had Q learning, which promises to solve any optimization problem if only we can build a powerful enough function approximator. It can even learn off-policy, meaning it can just watch from the sideline and find the optimal solution. It works for toy problems, and it works in theory, theres even formal proofs that it will work given infinite time and resources, and yet in practice it often becomes unstable and collapses.

Supervised learning is one thing, having a model remain stable while bootstrapping through a complex environment is another. GTP is supervised learning, so far, let's see if it can bootstrap.

ftxbro · on April 12, 2023

> "We evaluate SELF-DEBUGGING on code-davinci-002 in the GPT-3 model family"

Putting aside the incongruity of Google researchers using the OpenAI model, I'm curious how GPT-4 would do in this situation. Probably its zero shot attempts at coding would be better, and maybe its self criticisms would be better too.

astrange · on April 13, 2023

Google's recent LLM agent paper also used ChatGPT.

rhyme-boss · on April 13, 2023

A warning siren goes off in the background. Another step towards recursive self-improvement.

sowbug · on April 13, 2023

So is this the singularity?

goatlover · on April 13, 2023

I doubt it without being able to evolve the weights, architectures and train those on new domains. Humans are still doing all of that. It's just a powerful tool people are using for a bunch of things. But it doesn't do anything on it's own, and it doesn't care.

int_19h · on April 13, 2023

At this point, does anything actually preclude telling a model connected to a service running Python with all the requisite libraries (or access to PyPI / conda) to try "designing" another LLM, from architecture and choice of training data to writing actual training code?

goatlover · on April 13, 2023

Nothing, and I'm sure people are trying that, but it's probably beyond the abilities of current LLMs to make progress in this area, since it's kind of outside their training.

int_19h · on April 14, 2023

Is it, though? Codex was trained on Python code mostly, and given the cutoff date, it has all the relevant ML bits. And if you ask it about things like transformers, self-attention etc, it's clear that papers on that were a part of its training set, as well. The question is whether it can reason well enough about it to produce meaningful improvements.

FWIW if you ask it how transformers could be improved, it basically gives a recap of various relevant papers that it knows: Longformer, BigBird etc. Amazingly, all the papers cited actually exist and have the specified titles, names, and dates. But then if you ask it for some ideas of its own:

> Can you think of other novel improvement ideas that haven't been investigated yet, to the best of your knowledge? What makes the most sense based on your understanding of the subject matter.

As an AI language model, my knowledge is derived from the text data I have been trained on, which includes research papers and articles published up to September 2021. While it's difficult to guarantee the novelty of any idea, here are a few suggestions for potential improvements that may not have been thoroughly investigated yet:

Context-Adaptive Layers: Instead of using the same set of layers for all input sequences, consider using context-adaptive layers that are dynamically selected or generated based on the input's characteristics. This could allow the model to adapt its processing strategy to different input types, such as formal vs. informal text, or narrative vs. expository text, which might improve its performance on diverse tasks.

Multi-modal Fusion: Develop techniques to better fuse information from multiple modalities, such as text, images, and audio, within the Transformer architecture. Improved multi-modal fusion could lead to enhanced performance on tasks that require understanding and generating content across different modalities, such as visual question answering or image captioning.

Meta-Learning for Architecture Search: Employ meta-learning techniques to search for the best Transformer architecture for a specific task or domain. By learning to optimize the architecture itself, the model might be able to discover more effective configurations or components that have not been considered before.

Transformer Modules for Specific Tasks: Design specialized Transformer modules tailored for specific NLP tasks, such as syntactic parsing, named entity recognition, or sentiment analysis. These task-specific modules could be combined with the generic Transformer architecture to create a more powerful and adaptable model.

Hierarchical Transformers: Explore hierarchical architectures where multiple layers of Transformers work together to process information at different levels of abstraction. This could help the model capture both low-level patterns, such as local dependencies, and high-level patterns, such as long-term dependencies and semantic relationships, more effectively.

Enhanced Tokenization Strategies: Investigate novel tokenization strategies that better capture the structure and meaning of the input text. This could involve using character-level, subword-level, or even syntax-based tokenization to better represent the input for the Transformer model.

Regularization Techniques: Develop novel regularization techniques specifically tailored for the Transformer architecture. This could help improve generalization and prevent overfitting, especially when training large-scale models with billions of parameters.

fatherzine · on April 13, 2023

Not yet. When the singularity bootstraps, it's going to be over fast.

alecco · on April 12, 2023

3 Google researchers using Open AI GPT-3 code-davinci-002, interesting.

runlaszlorun · on April 13, 2023

> Self-Debugging with code explanation consistently improves the baseline by 2-3%

I’ll admit that I only have had time so far to read the abstract, and I’m not sure what their baseline is, but a 2-3% improvement doesn’t sound like a quantum leap forward that you’d expect from the title. Heck, I’d think that’s likely within expected sampling errors.

I’m not sure about others’ experience and, while I keep reading articles showing impressive seeming examples, my few forays into attempting to get ChatGPT to write code were actually completely useless. Even with follow on prompts to correct itself.

The other day I asked it what covid case fatality rates were in 2020. After all the various opinions at the time, I was curious to see what it was pre-vaccine. It would alternately tell me that it couldn’t give me data for 2020 because it only had data up to Sep. 2021, and then give me wildly varying numbers.

Is this a Rocko’s Basilisk trying to lure me into a false sense of security… haha.

ChatGTP · on April 13, 2023

…yes

cloudking · on April 12, 2023

GPT-4 in ChatGPT Plus can do this fairly well for coding tasks, I've had numerous cases where the code it produces has bugs initially. However, after a few rounds of passing the errors back in the chat it's usually able to correct it's own code.

matisseverduyn · on April 12, 2023

Useful, but still wouldn't count on it.

With respect to GPT etc. as a copilot, the current dialogue seems to focus on "ask for GPT to generate code to do X" then "just paste in the error message to fix bugs in the code GPT generates"

A.) Why is GPT generating code that results in simple compiler errors (that is why GPT probably shouldn't be used to generate any code / replace devs for real projects yet), and

B.) error messages are (just guessing here) probably <1% of the actual errors in most codebases.

I personally know of a few large companies laying off devs over this.

IMO, the tech debt we're going to see in 6 months will probably be huge. Good now to start a staffing agency of human experts who can come in and fix this type of problem (extricating massive amounts of code generated by GPT without starting from scratch) because there will be a bunch of fires to put out and those fires will be worth $

david2ndaccount · on April 12, 2023

> I personally know of a few large companies laying off devs over this.

They’re laying people off and replacing them with chat gpt generating code? That seems... aggressive. Or are they laying off devs who copy-pasted gpt-generate code?

sdfghswe · on April 12, 2023

My company recently hired someone that I'm absolutely convinced can't code and produces all their code by copy pasting into/from ChatGPT. I absolutely think they should be fired, it's not even aggressive, it's just common sense. First that means they cheated on their coding interview. Second it means their code is consistently a pile of shit.

matisseverduyn · on April 12, 2023

Replacing devs with LLMs.

blondin · on April 12, 2023

color me skeptical. what are those large companies that are replacing devs with LLMs?

broast · on April 12, 2023

I think it's more natural than you might think. For example, my company laid off a lot of people to try to be profitable, and now they pay me more but I have a smaller team with tighter deadlines. I have no choice but to use gpt for a lot of my analysis, design, and code- which I've gotten pretty used to over the past year in my hobby time

The way I see it, if you code without it, you won't compete with the speed and value.

And they are not going to back fill those roles

qumpis · on April 13, 2023

Can you give some of your usecases? Is it involved stuff or mostly boilerplate? Curious how a team lead uses this tech.

runlaszlorun · on April 13, 2023

I’m not knocking the parent post here for not replying. But of the 4 or so times on HN that I’ve seen someone been asked to provide detail, I haven’t seen a single answer. I’m not on here a ton but do people tend not to check back?

I’ll admit my bias of having seen enough vacuous industry hype over the years to be naturally skeptical. Heck, I worked in a marketing department once where I helped manufacture the stuff (forgive me father for I have sinned, in my defense they did fire me after three months in a layoff). But my few personal experiences with ChatGPT were pretty disappointing and I’m actually looking for someone to tell me otherwise.

qumpis · on April 13, 2023

Yes I think since it requires to look up your own comments to see if you got any replies, it's quite common to not get any replies, and I'm very guilty of this myself.

My personal use of gpt4 (also daily) is: correct, rephrase spelling from my brain dump, make python plots (stylize, convert, add subplots, labels, handle indexing when things get inverted), makw short shell scripts (generated 2FA, login vpn through console using 2fa, make script of disabling keyboard etc), and help debug my code (my situation is this, here's some code, what do you suggest?).

The last part is an interesting one for people with attention deficiency disorders, like myself, where procrastination can be conquered when there's an assistant that keeps you on track.

runlaszlorun · on April 13, 2023

> The last part is an interesting one for people with attention deficiency disorders, like myself, where procrastination can be conquered when there's an assistant that keeps you on track.

Aha, I’ll def have to give it a whirl. My procrastination ability is world class.

ratg13 · on April 12, 2023

You can’t replace devs with LLMs because someone that knows what they are doing still needs to put it all together.

You can only make employees more productive.. this in turn could, in theory, lessen the need for developers in the long run, but it assumes the company will not bother to use the extra bandwidth for other projects.

sublinear · on April 13, 2023

> use the extra bandwidth for other projects.

This is exactly what would happen if ChatGPT was actually a productivity boost for senior devs. I don't know why some idiots on here keep insisting businesses want to get rid of people when that's not how the game works at all. Extra work capacity will always be used. Regardless of what payroll costs, what's always more important is the ROI.

highduc · on April 13, 2023

I also believe the gain is in productivity more than needing less people. They will fire as much as possible, but the largest gains seem to me to be in productivity.

And exactly like this some future brain implant thing will also put another layer of pressure. People will get it as it'll give them an edge on certain fronts.

sublinear · on April 13, 2023

> IMO, the tech debt we're going to see in 6 months will probably be huge. Good now to start a staffing agency of human experts who can come in and fix this type of problem (extricating massive amounts of code generated by GPT without starting from scratch) because there will be a bunch of fires to put out and those fires will be worth $

Nah they deserve to eat shit and the staffing agencies hired to fix the bad AI code will undoubtedly be people abroad who barely speak English and will only tangle it up worse. I would actually pay to be a fly on the wall in those meetings listening to people lose their minds in frustration.

viscanti · on April 12, 2023

If an LLM hallucinates lines of code that can't even compile, I suppose it could also hallucinate logic issues which are more difficult to track down.

matisseverduyn · on April 12, 2023

Definitely. QA at a snails pace should still be the focus here for a while, but that's not what I'm observing in the real world. Just rush, pressure, layoffs. At least this sort of behavior keeps humans employed long-term.

runlaszlorun · on April 13, 2023

I have limited experience even trying. But I did try it for some fundamental JS Web API stuff sans framework or library like IndexedDB, web sockets, and a basic, basic todo like app.

Neither of those three would function nor would they throw an error. Prompts to correct itself would not improve things.

So I did the natural thing and started to debug myself. At which point, I couldn’t help but ask myself why I was debugging machine generated code when I could not be lazy and actually build it from first principles.