> Code hallucinations are also the least damaging type of hallucinations, because you get fact checking for free: if you run the code and get an error you know there's a problem.
This is great when the error is a thrown exception, but less great when the error is a subtle logic bug that only strikes in some subset of cases. For trivial code that only you will ever run this is probably not a big deal—you'll just fix it later when you see it—but for code that must run unattended in business-critical cases it's a totally different story.
I've personally seen a dramatic increase in sloppy logic that looks right coming from previously-reliable programmers as they've adopted LLMs. This isn't an imaginary threat, it's something I now have to actively think about in code reviews.
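To make the "subtle logic bug" case concrete, here is a minimal, made-up sketch (the function names are hypothetical, not taken from any real review). Both versions run without raising anything, so running the code gives you no free fact check; they only disagree on a subset of inputs.

```python
def batches_buggy(items, size):
    """Split items into batches of `size`. Looks right, runs without error."""
    # Bug: integer division drops the final partial batch.
    return [items[i * size:(i + 1) * size] for i in range(len(items) // size)]


def batches_fixed(items, size):
    """Step through the list by `size` so the trailing partial batch is kept."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Both agree whenever the length is an exact multiple of the batch size...
assert batches_buggy([1, 2, 3, 4], 2) == batches_fixed([1, 2, 3, 4], 2)

# ...and silently diverge otherwise: the buggy version loses the 5.
print(batches_buggy([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4]]
print(batches_fixed([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```

A test suite that only uses even-length inputs would pass both versions, which is roughly what "looks right" means here.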
When they spit out these subtle bugs, are you prompting the LLM to watch out for that particular bug? I wonder if it just needs a bit more guidance in more explicit terms.
At a certain point it becomes more work to prompt the LLM with each and every edge case than it is to just write the dang code.
I work out what the edge cases are by writing and rewriting the code. It's in the process of shaping it that I see where things might go wrong. If an LLM can't do that on its own it isn't of much value for anything complicated.
Have you found that to be a good trade-off for large-scale projects?
Where I'm at right now with LLMs is that I find them to be very helpful for greenfield personal projects. Eliminating the blank canvas problem is huge for my productivity on side projects, and they excel at getting projects scaffolded and off the ground.
But as one of the lead engineers working on a million+ line, 10+ year-old codebase, I've yet to see any substantial benefit come from myself or anyone else using LLMs to generate code. For every story where someone found time saved, we have a near miss where flawed code almost made it in or (more commonly) someone eventually deciding it was a waste of time to try because the model just wasn't getting it.
Getting better at manual QA would help, but given the number of times where we just give up in the end I'm not sure that would be worth the trade-off over just discouraging the use of LLMs altogether.
Have you found these things to actually work on large, old codebases given the right context? Or has your success likewise been mostly on small things?
I use them successfully on larger projects all the time.
"Here's some example JavaScript code that sends an email through the SendGrid REST API. Write me a python function for sending an email that accepts an email address, subject, path to a Jinja template and a dictionary of template context. It should return true or false for if the email was sent without errors, and log any error messages to stderr"
That prompt is equally effective for a project that's 500 lines or 5,000,000 lines of code.
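For a sense of scale, this is roughly the shape of function that prompt is asking for. It is a sketch, not output from any model. The `api_key` and `from_email` parameters and the `build_payload` helper are my additions, the sender address is a placeholder, and it assumes `jinja2` is installed and that SendGrid's v3 `mail/send` endpoint is the target.

```python
import json
import sys
import urllib.request
from pathlib import Path

SENDGRID_URL = "https://api.sendgrid.com/v3/mail/send"


def build_payload(to_email, subject, html, from_email):
    """Build the JSON body for SendGrid's v3 mail/send endpoint."""
    return {
        "personalizations": [{"to": [{"email": to_email}]}],
        "from": {"email": from_email},
        "subject": subject,
        "content": [{"type": "text/html", "value": html}],
    }


def send_email(to_email, subject, template_path, context,
               api_key, from_email="noreply@example.com"):
    """Render a Jinja template and send it. Returns True if sent, else False."""
    try:
        import jinja2  # third-party dependency: pip install jinja2

        template = jinja2.Template(Path(template_path).read_text(encoding="utf-8"))
        html = template.render(**context)
        body = json.dumps(build_payload(to_email, subject, html, from_email))
        request = urllib.request.Request(
            SENDGRID_URL,
            data=body.encode("utf-8"),
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        # urlopen raises HTTPError on 4xx/5xx, which the except below catches.
        with urllib.request.urlopen(request, timeout=10) as response:
            return 200 <= response.status < 300
    except Exception as exc:
        # Log the failure to stderr, as the prompt asks, and report False.
        print(f"send_email failed: {exc}", file=sys.stderr)
        return False
```

The point is that nothing in it depends on the rest of the codebase, which is why the size of the project doesn't matter for this kind of task.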
I also use them for code spelunking - you can pipe quite a lot of code into Gemini and ask questions like "which modules handle incoming API request validation?" - that's why I built https://github.com/simonw/files-to-prompt
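The general idea behind piping code into a model can be sketched in a few lines. This is not how files-to-prompt itself is implemented; the `build_prompt` function and the `--- path ---` separator format are made up for illustration.

```python
from pathlib import Path


def build_prompt(root, question, suffixes=(".py",)):
    """Concatenate matching files under `root`, each preceded by its path."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            # Label each file so the model can refer back to it by path.
            parts.append(f"--- {path.relative_to(root)} ---\n{text}")
    # Put the question last, after all of the code.
    parts.append(question)
    return "\n\n".join(parts)
```

Usage would look something like `build_prompt("src", "Which modules handle incoming API request validation?")`, with the result pasted or piped into whatever long-context model you're using.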
I had some success converting a React app from classes to hooks. Asking it to handle edge cases also helps, like spaces in a filename in a bash script; this fixes some easy problems that might have come up later. The corollary here is that pointing out specific problems or mentioning the right jargon will produce better code than just asking for the basic task.
It's very bad at Factor but pretty good at naming things, sometimes requiring some extra prompting. [generate 25 possible names for this variable...]
That’s the problem I had on the early ones. I learned a few tricks that let me output whole apps from GPT3.5 and GPT4 before they seemed to nerf them.
1. Stick with popular languages, libraries, etc with lots of blog articles and example code. The pre-training data is more likely to have patterns similar to what you’re building. OpenAI’s were best with Python. C++ was clearly taxing on it.
2. Separate design from coding. Have an AI output a step by step, high-level design for what you’re doing. Look at a few. This used to teach me about interesting libraries if nothing else.
3. Once a design is had, feed it into the model you want to code. I would hand-make the data structures with stub functions. I’d tell it to generate a single function. I made sure it knew what to take in and return. Repeat for each function.
4. For each block of code, ask it to tell you any mistakes in it and generate a correction. It used to hallucinate on this enough that I only did one or two rounds, made sure I hand-changed the code, and sometimes asked for specific classes of error.
5. Incremental changes. You give it the high-level description, a block of code, and ask it to make one change. Generate new code. Rinse repeat. Keep old versions since it will take you down dead ends at times but incremental is best.
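As a rough illustration of step 3, a hand-made skeleton plus a one-function prompt might look like the following. Everything here (`Page`, `fetch`, `extract_links`, `function_prompt`, the prompt wording) is a hypothetical example, not the code from the original projects.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hand-written data structure that the generated functions must use."""
    url: str
    html: str = ""
    links: list[str] = field(default_factory=list)


def fetch(url: str) -> Page:
    """Download `url` and return a Page with `html` filled in."""
    raise NotImplementedError  # stub: generated later, in its own prompt


def extract_links(page: Page) -> list[str]:
    """Return every absolute link found in `page.html`."""
    raise NotImplementedError  # stub: generated later, in its own prompt


def function_prompt(design: str, skeleton: str, name: str) -> str:
    """Build a prompt asking for exactly one function body."""
    return (
        f"Design:\n{design}\n\n"
        f"Skeleton:\n{skeleton}\n\n"
        f"Implement only `{name}`. Keep its signature and docstring unchanged, "
        "and do not modify any other code."
    )
```

The signatures and docstrings are what tell the model what each function takes in and returns; you then call `function_prompt` once per stub and paste each result in by hand.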
I used the above to generate a number of utilities. I also made a replacement for the ChatGPT application that used the Davinci API. I also made a web proxy with bloat stripping and compression for browsing from low-bandwidth, mobile devices. Best use of incremental modification was semi-automatically making Python web apps async.
Another quick use for CompSci folks. I’d pull algorithm pseudocode out of papers which claimed to improve on existing methods. I’d ask GPT4 to generate a Python version of it. Then, I’d use the incremental change method to adapt it for a use case. One example, which I didn’t run, was porting a pauseless, concurrent GC.