Surprising. If only there were a way that we could have foreseen that an AI trained to write code in part by looking at people who, self-admittedly, don’t know how to write code, and people who write code for others with minimal context (Stack Overflow), would produce buggy code. It is a case of GIGO.
Most developers do not learn much from Stack Overflow. Why do we expect AI to fare better? In my experience, one in ten (optimistically) Stack Overflow answers gives anything more than a code snippet with just enough information to get the asker through their immediate issue. It can be beneficial if you already have the necessary understanding and only want the code snippet, but this is no way for humans or machines to learn.
Also, having an “AI Assistant” must lower programmers’ guards against buggy code. After all, it is an assistant - it must assist you, right? Subordinating humans to machines will not work in this domain until there is better training data and the machines can be taught the reason they are writing specific code. Until then, I have low hopes for AI-generated code.
Even if AI could generate correct, bug-free code the majority of the time (say 99.9%), I expect finding and correcting bugs will be difficult for humans. For example, how many bugs are found and corrected by the author of the code during development, versus how many in peer review? I'm reminded of a saying akin to "ask someone to review 5,000 lines of code: no bugs. Ask someone to review 5 lines of code: 5 bugs". We are poor critical reviewers, and AI cannot fix that. AI assistants probably make reviews worse, because reviewers will expect high-quality code from their AI assistants.
If I had a little robot riding in the passenger seat that could tell me whether to go left, straight, or right, and it was correct 90% of the time, I'd think that was pretty great. I'd get where I needed to be, even with a couple mishaps.
ML code suggestions are the same thing to me. If I don't know where I am going, I can just ask it for suggestions. And it's probably going to be what I want.
In both cases, I am annoyed with myself for having started before I knew where I wanted to end up.
The problem with ML is that it's pattern recognition; it's an approximation. Code is absolute: it's logic that is interpreted very literally and very exactly. This is what makes ML so dangerous for coding; it creates code that's convincing to humans but with deviations that allow for all sorts of bugs. And the worst part is, since you didn't write the code, you may not have the skills (or time) to figure out whether those bugs exist, especially if the ML is extremely convincing/clever in what it writes. I would argue that this overhead is worse for productivity than just writing it yourself.
Expand your view. AI can write tests, read error messages, and find bugs in your code; we just need to give it those tasks.
Let's think about tests. You write a function, and the AI writes a few tests for you; maybe you need to add a few more. But it's better to have tests, and the AI might cover a case you would have missed.
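As a minimal, hypothetical sketch (the function and the tests here are made up, not output from any real assistant), the division of labor might look like this:

    import pytest

    def parse_port(value: str) -> int:
        """Parse a TCP port number from a string; raise ValueError if invalid."""
        port = int(value)
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")
        return port

    # Tests an assistant could plausibly generate:
    def test_parse_port_valid():
        assert parse_port("8080") == 8080

    def test_parse_port_rejects_zero():
        with pytest.raises(ValueError):
            parse_port("0")

    # ...and the edge case you might still need to add yourself:
    def test_parse_port_rejects_non_numeric():
        with pytest.raises(ValueError):
            parse_port("http")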
Error messages - we rely on error messages to make the leap from "code parrots" to "bug-free". Most of our code fails the first time we run it. We're just fancy pattern matchers too, but we have a runtime. So the AI could also fix its bugs on its own, given the opportunity.
Finding bugs - we can train AI to spot bugs. It could become an excellent tool to check not just AI code, but also human code. Having a bug detector running in the background would be great, even if it is not perfect.
All those things are susceptible to the same issue: the ML model can generate test cases and error messages that are convincing to you and me regardless of whether they're actually right. Don't get me wrong, ML will reach a point someday where it catches up to humans in this respect; this is merely a shortcoming of today's ML, not tomorrow's, and it highlights where ML's weaknesses currently lie (in contrast to more vaguely defined goals like artwork).
1) How many turns do you take on a particular trip?
2) How do those wrong turns end up? Whether it's "travel time extended by 30 seconds" or "my car, the car I hit, and the side of this building are all in shambles" changes what a 10% failure rate means a lot.
Right? Took a turn the wrong way down a one way street. Hit a lady with a stroller who was looking the other way. She is a surgeon. Her husband is a lawyer. You killed the kid. Your life is over.
On the other hand, you did just describe the early days of both consumer GPS navigation and things like MapQuest. There were countless stories and gleeful news articles about people turning the wrong way down a one-way street, driving into ponds, getting stuck on logging roads, and all sorts of other chaos caused by inattentive drivers and blind faith in flawed systems. But the key takeaway, I think, is that in the end consumer GPS succeeded. Bugs were fixed and now everyone does have a little robot in their pocket that can tell them how to get somewhere.
> If I had a little robot riding in the passenger seat that could tell me whether to go left, straight, or right, and it was correct 90% of the time, I'd think that was pretty great. I'd get where I needed to be, even with a couple mishaps.
Pretty sure you wouldn't when it starts telling you to make illegal turns, or to turn where there are no roads, etc. That is the state of language models for code. You'd likely run over a person or get stopped by police at some point if you listened to it; then it is easier to just turn it off so you can focus on driving. A bad assistant is much worse than no assistant.
Edit: And these models are more like Tesla's Autopilot than an assistant giving you directions, since you have to oversee what they do instead of letting them tell you what to do. An autopilot that does the wrong thing 10% of the time is horrible.
Being told to do the wrong thing on one in ten turns sounds incredibly annoying. Maybe it would be tolerable if you're driving on the highway for long stretches, but otherwise that would be awful.
> having an “AI Assistant” must lower programmers’ guards against buggy code
Why would you assume that?
If it's buggy a couple of times, and if everyone talks about how buggy and unreliable it is, it can easily become common knowledge and common practice to triple-check the output.
Then how much time are you actually saving if you have to review everything it produces? The bottleneck was never typing speed; at that point all the AI is allowing you to do is produce more buggy code more quickly.
I use Copilot daily and experimented with using ChatGPT for real work code.
It's an incredibly valuable tool even with having to rewrite the larger outputs… the small stuff like autocompleting variables and keys is highly accurate and is what it most often generates (it's scary how good it is at finishing the exact line you had in your head, 50 times a day).
What you need to be careful about is when it generates entire functions or whole mini-modules. This is still extremely useful because it gets your brain running. It provides a simple template to suggest how it might look.
That way you’re no longer starting from scratch, you see a toy example with real code - for intellectual/creative work having that sort of seed is super useful.
Imagine a writer with writers block staring at a blank page vs a generated sample of dialogue between two characters or an intro paragraph to get the ball rolling.
Usually you have to burn a few cycles and fail a couple of times writing some code to get to the point where you've written something good. So it's a normal part of the process to throw version 0.1 away, IMO.
2) verifying code requires domain knowledge, which implies that the utility of these models is limited to things I could write myself if I weren't too lazy. That's hugely constricting.
Yes but I don’t see it as generating entire blocks of code you’re supposed to copy and paste into your project.
It’s like a template, a suggestion from which you can build your own version.
Only rarely does it have the context or understanding of the wider codebase to do a programmers job for them.
The times when it does generate a copy/pastable function, it's usually some isolated utility function like "format date as DD-YYYY" - something really simple and easy to verify. The type of stuff you'd copy entirely from Stack Overflow rather than finding a specific solution you adapt.
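Taking the "DD-YYYY" example literally (the function name here is mine, not Copilot's), that kind of utility really is small enough to verify at a glance:

    from datetime import date

    def format_date_dd_yyyy(d: date) -> str:
        """Format a date as DD-YYYY, zero-padded day then year."""
        return d.strftime("%d-%Y")

    print(format_date_dd_yyyy(date(2023, 1, 5)))  # "05-2023"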
It’s mostly a glorified autocomplete and example suggestion service. It is not a full code writing service.
Domain expertise will obviously still be a job requirement. It’s an assistant to the programmer, not an occasional replacement for the programmer (and if you have domain expertise you usually use a 3rd party library).
Maybe future versions will try to do more but that’s not what we have today.
I think that many people will treat it as something that can generate entire blocks of code. Unfortunately it can be quite broken even just writing basic functions. One of my tests was doing some quaternion rotations. It did them, but refused to stop multiplying by the conjugate afterwards.
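For reference, a minimal sketch of the correct operation - the vector gets sandwiched between the quaternion and its conjugate exactly once, with no extra conjugate multiplications afterwards (this is a generic textbook formulation, not what ChatGPT produced):

    import math

    def quat_mul(a, b):
        """Hamilton product of quaternions given as (w, x, y, z) tuples."""
        aw, ax, ay, az = a
        bw, bx, by, bz = b
        return (
            aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw,
        )

    def rotate(v, q):
        """Rotate 3D vector v by unit quaternion q: v' = q * (0, v) * conj(q), applied once."""
        conj = (q[0], -q[1], -q[2], -q[3])
        w, x, y, z = quat_mul(quat_mul(q, (0.0, *v)), conj)
        return (x, y, z)

    # A 90-degree rotation about the z-axis maps (1, 0, 0) to roughly (0, 1, 0).
    half = math.pi / 4
    q = (math.cos(half), 0.0, 0.0, math.sin(half))
    print(rotate((1.0, 0.0, 0.0), q))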
Another was converting dates to "years ago", which was broken for BCE because ChatGPT doesn't understand the underlying concept.
> I think that many people will treat it as something that can generate entire blocks of code. Unfortunately it can be quite broken even just
Have you tried to use it for this purpose?
It basically can’t unless you’re only building a toy app. Even after multiple levels of refinement it still requires tons of real programming work.
Which is largely my point: it won't, because it's fundamentally incapable of providing that in its current state. Even setting the bugs aside, it mostly just outputs generic stuff that always needs to be integrated into the wider codebase and adapted to what specifically you're trying to build.
I have, actually. Today I had it write a basic JS map for my wedding site. Yesterday I had it produce a puzzle solver with CLI and DSL. Obviously I'm still doing manual interventions at key points, but it's changed my personal cost/benefit calculation on whether various random ideas are worth doing.
For me it was the least surprising thing I've read in a while.
One thing that has become clear is that AIs learn to cover the common cases very well, and the less common ones not at all well. So you get a lot of mistakes in the areas where there is not a lot of training data. Sometimes those mistakes don't matter. Creating art is a great example - how do you even define a mistake in art? (Perhaps it's not so difficult, as apparently AI output is recognisable, but it takes an experienced eye to spot where it's gone wrong.) So is playing a game like Go - playing the 2nd-best move occasionally doesn't matter. Apparently protein folding fits into the same class, as getting the basic shape right is what matters most. Ditto for voice-to-text - humans compensate - and the same for translation. There are lots of places mistakes don't matter.
But some applications are very intolerant of mistakes. Driving a car appears to be one. Programming is probably one of the worst, as minor mistakes don't just degrade the program; they can destroy it in a way that's utterly non-obvious. It seems to me the current generation of AIs is always going to struggle in areas where mistakes, even minor ones, are costly.
Interestingly, one area of programming where mistakes are tolerable is review. Therefore I'd predict programmers will find AIs reviewing code to be a net positive.
AI can learn to do code review; there is plenty of data on GitHub. It could also write tests and suggest possible bugs on its own. Overall, using it might be better than doing it by hand.
If you are using the AI just to write snippets of code, then it is suboptimal. What it needs is to monitor execution errors and fix its code over a few iterations, just like humans do.
Humans fix the code by understanding the model it represents. In fact, bug fixing is often where you are forced to understand what you wrote glibly from memory/pattern matching.
Does it look at the questions on Stack Overflow? That would be silly. But yeah, even the answers are far from perfect - they might solve the immediate problem but lack error checking, use undocumented features, etc.
> If only there were a way that we could have foreseen that an AI trained to write code in part by looking at people who, self-admittedly, don’t know how to write code, and people who write code for others with minimal context (Stack Overflow), would produce buggy code. It is a case of GIGO.
So, I'll claim the real issue is just that this generation of AI isn't able to "learn", it merely "trains": if I were alone in a room for years and you gave me a book on how to program that has an error in it, during my careful study of the book (without a computer to test on!), I am likely to notice the error, get annoyed at the author trying to figure out if I failed to understand some special case, and then eventually decide the author was wrong. With only the knowledge from the book, I will also be able to study the concepts of programming and will eventually be able to design large complex systems; again: I will be able to do this even if I don't have a computer, in the same way people have studied math for millennia.
And like, this is how we all learned to program, right? The books and tutorials we learn to program with often suck; but, after years dedicated to our craft synthesizing the best of what we learn, we not only can become better than any one of the sources we learned from, given enough time to devote to practice and self-study we can become better than all of them, both combined and on average (and if we couldn't, then of course no progress could ever be made by a human).
With a human, garbage in can lead to something fully legitimate out! A single sentence by someone saying "never do X, because Y can happen, where Y is extremely important" can cause us to throw out immense amounts of material we already learned. Somewhere, GitHub Copilot has seen code that was purposefully documented with bugs (the kind we use to train humans for "capture the flag events") as well as correct code with comments explaining how to avoid potential bugs... it just didn't "give a shit", and so it is more likely to do something ridiculous like generate code with a bug in it and a comment explaining the bug it just generated than to generate correct code, because it doesn't have any clue what the hell it is doing and isn't analyzing or thinking critically about the training input.
> Even if AI could generate correct, bug-free code the majority of the time (say 99.9%), I expect finding and correcting bugs will be difficult for humans.
There is some error rate below which you beat the chance of a human making a dumb mistake just because they are distracted or tired, and at that point the AI will simply beat the humans. I don't know if that is 99.9% or 99.9999% (it might be extremely tight, as humans generate thousands and thousands of individual decisions in their code every work session), but past that point you are actually better off than the current situation, where I first program something myself and then hire a team of auditors to verify I coded it correctly (and/or a normal company where someone is tasked to build something and then every now and then someone like me is hired to figure out if there are serious mistakes).
> With a human, garbage in can lead to something fully legitimate out!
Because we get to see the error messages, fix and try again. You can try this on ChatGPT - give it a task, run the code (it probably fails), copy the error back, and let it fix its errors. After a few rounds it gets the result with much higher probability than when you allow it one single shot.
A language model can write programs, and then we can run the programs to check whether they pass tests; that gives the language model a special signal - execution feedback. If you retrain the model with this new data, it will learn to code better and better. That is reinforcement learning, not language modelling.
AlphaGo was able to generate its own data and beat humans at Go by doing this exact thing. It's an evolutionary method as well, because you are cultivating populations of problems and solutions through generate + execute + validate.
> Because we get to see the error messages, fix and try again.
As I noted explicitly, a human will get better even with garbage input even without access to a computer. I also explicitly noted how we are able to learn from a single well-reasoned note.
I recommend you seriously evaluate how you yourself learn if you truly believe that you only learn things using active feedback from external sources of truth via trial runs.
For the first few weeks of ChatGPT I had a goal of making it code. My preliminary conclusion is that, given a feedback loop, it could code a lot of basic stuff just fine. My workflow went something like this:
- Identify a component in a system you want to build
- Give ChatGPT a list of bullets and ask it to write tests
- Review tests
- Run tests, give ChatGPT output of failing tests
- Add the code to your codebase, run the tests...
I developed extensions to my console and test suite so that the output exceptions/failures ended up in my clipboard (a rough sketch of that loop is below). I'm highly confident workflows that put the AI in a feedback loop will be able to code most backend stuff. UIs are trickier, as giving it context is not trivial.
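Roughly like this (the pytest flags and the optional pyperclip dependency here are my assumptions, not the exact tooling I built):

    import subprocess

    def failing_test_output(test_path: str = "tests/") -> str:
        """Run pytest and return its output only if something failed."""
        result = subprocess.run(
            ["pytest", test_path, "-x", "--tb=short"],
            capture_output=True, text=True,
        )
        return "" if result.returncode == 0 else result.stdout

    output = failing_test_output()
    if output:
        try:
            import pyperclip  # optional: drop the failure text straight into the clipboard
            pyperclip.copy(output)
        except ImportError:
            print(output)  # otherwise print it for manual copy/paste back into ChatGPT
    else:
        print("All tests passing - nothing to send back.")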
There is also a context window problem where you can only pass in 4k tokens (including the prompt). This is not that large a window for providing history and information about the entire workflow. How were you trying to solve that? Also, if you're building something yourself, you can use davinci-003 directly through their APIs instead of trying to build on top of ChatGPT.
You are right, what I described was just a prototype. ChatGPT was just a gateway drug for me to dive into the OpenAI docs and look for other providers and existing solutions. I've been reading up on approaches to 'package' context (example: https://thakkarparth007.github.io/copilot-explorer/posts/cop...).
My take for now is that this kind of packaging is very context specific to the point where it's going to make sense for engineering teams to craft their own version of 'gather context before the prompt is created'. I'm testing out approaches using toy examples for now (how would you create a chatbot about a book?; podcasts, etc).
I don't have an answer but I'm highly confident we're going to replace a lot of junior-level work in the next year or two.
I'm guessing it depends on the situation, but at what point does all of that become more work than just writing the tests and code yourself? Is it worth the "investment" to learn how to teach ChatGPT coding? Or will ChatGPT 2 come out in six months, based on a different algorithm, and you'll be back to square one teaching it all over again?
Personally, I find writing new code easier than editing existing code, regardless of how it's written. I'd much rather have an AI that scans the code I write and searches for bugs.
Writing code is easy, finding bugs is hard, and it's the finding of bugs that I'd like to see automated.
It's obviously not paying off right now, but I'm assuming these models will get there in 2023 so I'm getting ready for that :P
To take into account future changes, my approach is to develop interactive models for the components that make up this system using Jupyter notebooks and https://natto.dev/. The idea is that if some part changes you can sort of test it in isolation and compare results.
Looking at the demographics of the study, they only had 47 total participants, 15% of which were professionals and 62% reporting less than 5 years experience coding (which I would imagine is an underestimate assuming some people exaggerated their actual experience level). So that means they only had 6-7 participants who worked in industry and generously 18 people with more than 5 years experience. They also don’t report the breakdown of how participants did by experience. One other factor they use to measure is if the participant has security experience, but their bar for that is whether they have taken a single security class.
Given all of this, I don't think the paper's conclusion is convincing at all, given that they were evaluating this on a pool of participants the majority of whom were students without much programming experience, while these tools are sold for professional use. I would bet that if the study had a more uniform distribution of experience levels, you would see the "bugginess" trend downwards with more experience. Participants with more years programming have simply had more time to learn how to use new tools effectively in their workflows.
I definitely tweaked my methods of using Copilot plenty over the past year or so to take advantage of its strengths and avoid its weaknesses.
My concern is that students/novices are going to be using this, without the ability to double-check the output of the tool. It inspires overconfidence, looks okay at the surface level, and bugs go unnoticed. The younger generation using this as a crutch, treating their own creations as a black box, will not have an adequate feedback mechanism to learn from their mistakes. Code quality and performance will deteriorate over time. You, an expert, learned without this crutch. Your use-case is frankly uninteresting.
Amusingly, without careful curation, I'd predict that buggy code will tend to self-replicate and these tools that indiscriminately slurp public code will enter a death spiral because the novices outnumber the experts. It's only a matter of time before viruses are written to propagate through this garbage stream. http://www.underhanded-c.org/
I definitely agree with your point about it being used as a crutch. My criticism was more towards how the authors evaluated AI’s effect on writing secure code. I’m not saying they shouldn’t have student participants, but they should be fully representative across the skill demographics.
To me it’s comparable to a study where you make a general claim about driving ability with lane assist but then 2/3 of the participants only have their learner’s permits.
What is the current feedback mechanism, and would they not use existing feedback mechanisms if available? Professionally, someone should be there to enforce quality and mentor, but students or hobbyists, even without AI assistants, often don't have anyone to say "this is bad, this is best practice" except Stack Overflow.
Not surprising given the discussion I've seen about these AIs on HN and other sites.
AFAICT, they're generally treated as a shortcut to avoid learning an API or library, or as a shortcut to avoid boilerplate by essentially copying somebody else's boilerplate.
Both cases boil down to using other people's code for purposes that it wasn't built for, which is obviously going to be error prone.
The AI has no idea how this particular case differs from the sample code it was trained on, so it's up to the developer to go back and correct the details - which is precisely what they were trying to avoid with the AI. It's basically a fancy version of code "templates" available in some IDEs, but fewer people know to go back and fill in the correct details.
> The AI has no idea how this particular case differs from the sample code it was trained on
It can eventually figure a lot of things out if you provide the correct requirements and examples. However, the caveat with all of this is that you already have to be a good programmer to:
- know what is needed
- know what the solution should more or less look like
- spot bugs in the AI's solution
- describe the problem, examples etc in a way that the AI can work with
If you're already a good programmer, the time spent wrangling the AI could be spent writing the solution yourself.
My feeling is that the most value from AIs will come from when you get hopelessly stuck on a problem. An AI could provide some initial steps or ideas that can get you unstuck, effectively working as a rubber ducky. You're unlikely to copy the AI's solution, but it could provide some random insight that helps you take a step forward.
> My feeling is that the most value from AIs will come from when you get hopelessly stuck on a problem.
This sounds about right.
It may also help you to find related areas to study. I was throwing rocket optimization problems at ChatGPT and it was horrible, but I think it was finding some related code from vehicle navigation optimization problems which I found interesting.
It does reasonably well at simple kinds of symbol manipulation and might be able to replace matlab/mathematica at some point, but the problem is that you always have to double-check the results. I'm not sure it is really that much better than just knowing your way around wikipedia (which has its own accuracy issues, but for straightforward things it is usually pretty good).
I hadn't considered the "hopelessly stuck" aspect, and that does seem promising.
I fully agree on your other points, though. Communicating the requirements to the AI in a non-ambiguous way is more-or-less equivalent to just writing the code. I can't even get firm requirements from my coworkers, and we all presumably have "real" intelligence.
TL;DR - Models can learn from massive validation, not just massive text prediction.
I agree our current crop of Copilot, GPT-3 and chatGPT can introduce subtle bugs, and that's a big problem. Checking for such bugs takes time. They code like someone who read a lot and has good memory, but never implemented anything complex.
But there is no reason for the next iteration to be the same. Language models can be fine-tuned on code execution; that means giving them sufficient exposure to trial and error to learn how to avoid such bugs.
Having a way to test its outputs is the key to self improvement. It works the same with AlphaGo - the model is generating its own training data, and this process can eventually surpass human level. It just needs tons of compute to iron out a lot of bad ideas.
Alternatively, the model could use a code execution environment at runtime, too. So it could iterate a few times on an error, eventually using web search if it cannot solve it alone, instead of simply failing and leaving it to the human to fix. It is pretty good at proposing fixes and will recover most of the time, like human devs.
I'm leaving this for further reading. It's fascinating.
> Evolution through Large Models
> Large language models (LLMs) trained to generate code can vastly improve the effectiveness of mutation operators applied to programs in genetic programming.
I’ve been using GitHub copilot for six months now. It definitely produces a lot of code that looks right, but isn’t. I review anything it produces carefully. I’ve introduced one bug to production where I can say that was likely from copilot. Compared to the number of bugs I produced in that time, that’s nothing. It’s a productivity tool, not magic, and it’s not replacing me anytime soon. Like the auto complete in my IDE, it accelerates my work. There’s no going back for me.
> The study involved 47 people with varying levels of experience, including undergraduate students, graduate students, and industry professionals.
Honestly, for things like catching potential sql injections, “industry professional” should be the start.
I won’t trust a student of any kind further than I can throw them, for topics like maintainability, security or readable code. Those are too far away from what’s usually taught at school.
Having "industry professionals" in this sort of study actually puts it in the top tier of studies. Most studies don't even have that, they're all undergrad based.
(Sometimes people express frustration that we don't pay much attention to "the science" in the programming field, and my response is generally to tell such people to take a closer look at the "science" they're trying to wave around. Studies based on "a class full of juniors in college" top out at zero value, and it's really easy for them to be negative as they can be actively wrong about how the topic under study affects professionals.)
In this case, though, I'd submit that one doesn't need to run some sort of enormous study to establish the point that these code assistants are not a magic gateway to quality code at any experience level. I've been banging this drum on HN just from an understanding of how the tech works. Confabulation engines can't help but confabulate. You can't trust them. This GPT stuff is perhaps a better view into human psychology than a useful tool; we've never built an AI that so strongly emits signals of confidence before. They're the best artificial confidence game players ever.
> one doesn't need to run some sort of enormous study to establish the point that these code assistants are not a magic gateway to quality code at any experience level
You just have to use it a couple times to figure this out. It’s pretty obvious what the limitations are and most programmers are smart enough to understand what it is and what it isn’t.
This is why I’m skeptical it will be a problem, it’s not being sold that way and after using it that will be obvious.
So anyone dumb enough to treat it like that and trust the output blindly probably wasn't a good programmer before. And if they keep doing it, their environment lacks the proper incentive structures against producing buggy code (senior devs, bosses, customers, etc. will notice the output getting worse, at a minimum when the product breaks or during QA).
> In this case, though, I'd submit that one doesn't need to run some sort of enormous study to establish the point that these code assistants are not a magic gateway to quality code at any experience level.
For me the interesting question is not whether they can improve quality.
The interesting question is, given that they can be used to produce code faster (as a sort or auto complete on steroids), whether that improvement can be achieved in a way that doesn’t involve a decrease in quality.
I think it’s possible, for sufficiently competent professionals that can spot and correct mistakes on the fly, and I have anecdotal evidence to support the idea, but it would be nice to see serious research around it.
> This GPT stuff is perhaps a better view into human psychology than a useful tool
It would be a great search engine if it cited its sources (but then people would notice it's basically copying code from the internet). It is actually good at getting you the names that come in a certain context, so you can search them. But only if you know enough to establish the context.
Even then, be skeptical. The software industry provides positive pressure to improve one's skills, but doesn't provide much if any negative pressure against lacking competence or willingness to learn. If developers want to, they can pretty easily skate by without developing talent, besides when their framework of choice becomes defunct.
This is true for a lot of low-brow work because companies simply need people to grind through requirements for their bespoke CRUD apps and integrations. There's still a boat load of that kind of work to be done, but it's only a fraction of the work available.
There are plenty of places where that negative pressure exists though and there's higher standards of quality. I doubt anyone is skating by with hacky code in aviation or space, for instance. So what you're talking about isn't really an issue where it matters.
Many industry professionals ignore or are unaware of known attack vectors. Meanwhile, many schools do teach a semblance.
Given how young and emotion-driven the field is, a sweeping statement like this is useless. If security is remotely concerning to you, get a security expert as your baseline.
> Many industry professionals ignore or are unaware of known attack vectors. Meanwhile, many schools do teach a semblance.
It’s not either or. Most industry professionals get there after going through school, or are at some point checked to have achieved equivalent knowledge somewhere else. A professional is school + experience.
I spent a few days working on using ChatGPT to write code. I've been meaning to write a blogpost about my experience, but here's my synopsis.
Where ChatGPT was best was when I wanted to do generic data structure work with common tools. Prompts like "make a Python class that stores keys in a redis sorted set, with the score as a timestamp, that removes members older than a configured time on access / insertion." I know how that code should work, I know what's correct- but it's nice being lazy and letting ChatGPT spit out a class for me I can clean up.
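For illustration, here's a hand-written sketch of what a reasonable answer to that prompt could look like, using the redis-py client (the class and parameter names are mine; ChatGPT's actual output will differ):

    import time
    import redis

    class TimestampedSet:
        """Redis sorted set scored by insertion timestamp; prunes members
        older than max_age_seconds on every access or insertion."""

        def __init__(self, client: redis.Redis, key: str, max_age_seconds: float):
            self.client = client
            self.key = key
            self.max_age_seconds = max_age_seconds

        def _prune(self) -> None:
            cutoff = time.time() - self.max_age_seconds
            self.client.zremrangebyscore(self.key, "-inf", cutoff)

        def add(self, member: str) -> None:
            self._prune()
            self.client.zadd(self.key, {member: time.time()})

        def members(self) -> list:
            self._prune()
            return self.client.zrange(self.key, 0, -1)

    # e.g. TimestampedSet(redis.Redis(), "recent:logins", max_age_seconds=3600)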
Anytime I want to do anything weird or nonstandard, ChatGPT is an uphill battle. Try asking it to write a convolution function without using NumPy (say you're using PyPy or AWS Lambda and C extensions are a no-go). It will often insist you have to use NumPy - at one point it argued with me that it wasn't possible without it! (This has gotten a bit better since, but was still quite interesting.)
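For the record, the NumPy-free version it insisted was impossible is only a few lines (this is my own plain "full" 1D convolution, not ChatGPT output; 2D is more code but no harder conceptually):

    def convolve1d(signal, kernel):
        """Full discrete 1D convolution of two sequences, no NumPy required."""
        n, m = len(signal), len(kernel)
        out = [0.0] * (n + m - 1)
        for i, s in enumerate(signal):
            for j, k in enumerate(kernel):
                out[i + j] += s * k
        return out

    print(convolve1d([1, 2, 3], [0, 1, 0.5]))  # [0.0, 1.0, 2.5, 4.0, 1.5]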
Working with pre-existing code, especially anything database related, was not worth the time spent at all. Terrible query plans, misinterpreting the schema that was provided, etc. I do love SQL though- I'm quite biased here.
It was interesting in that when it worked best, I needed to know what to ask for. I asked a good friend who started learning to code a few months ago to try it, and she was way less successful with it versus reading documentation.
Ultimately I think with tight text editor integration it will be great, but it's still going to require someone skilled and who knows what to ask for- at least for a couple years. As for how productive it's made me, I've stopped using it for anything except greenfield work with common libraries I already know how to use.
Something interesting that I don't see in the comments here is the remarks Dijkstra made about using natural language to tell machines what to do[0]:
> Instead of regarding the obligation to use formal symbols as a burden, we should regard the convenience of using them as a privilege: thanks to them, school children can learn to do what in earlier days only genius could achieve. (This was evidently not understood by the author that wrote —in 1977— in the preface of a technical report that "even the standard symbols used for logical connectives have been avoided for the sake of clarity". The occurrence of that sentence suggests that the author's misunderstanding is not confined to him alone.) When all is said and told, the "naturalness" with which we use our native tongues boils down to the ease with which we can use them for making statements the nonsense of which is not obvious.
I wonder what the long term consequences of LLMs will be. I can't see them abstracting away programming languages for the reasons that Dijkstra provides (namely the inherent ambiguity of natural language), but maybe they could be a good summarizer of APIs and generator for starter code.
My prediction is that they'll provide a similar utility to stackoverflow and save time in developing software
Haskell? Not buggy code? I'd say Haskell is the best language for writing buggy code. And when I say buggy, I don't mean purposely writing it to be buggy; I mean it's a maintenance horror show that WILL produce buggy code down the road, 100%. Remember, maintenance is 90% of a project's life; the initial development is only 10%.
Ok. The main point was to discuss making languages that are better at preventing bugs. I simply threw out one to get the conversation started, not to be the final answer. Other thoughts?
You can't prevent injection bugs on a language level, since injection happens when you use string manipulation to dynamically create code that you send to another program. This means that it is just as easy to create SQL injection bugs in Haskell or Rust or any other language, because they all support concatenating strings to create SQL.
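A quick sketch of that point using Python's built-in sqlite3 (the table and input are made up): the language happily lets you build the query either way, and only the parameterized form keeps the input as data.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
    conn.execute("INSERT INTO users VALUES ('alice', 0)")

    name = "nobody' OR '1'='1"  # attacker-controlled input

    # Injection-prone: nothing stops you from concatenating strings into SQL.
    unsafe = conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

    # Parameterized: the driver treats the input purely as a value.
    safe = conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

    print(unsafe)  # [('alice', 0)] - the injected condition matched every row
    print(safe)    # [] - no user is literally named that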
I am an experienced "industry professional" C++ programmer, and I tried ChatGPT and Stack Overflow and found them impressive and entertaining, but I don't know how these two platforms would be helpful for my work. I never use them.
When I need to solve a problem, I must read the (API's) docs and I must understand them. It would be dangerous and buggy if I copied and pasted code from somewhere. I don't think it would even work.
I can also learn from high-quality examples by following them and understanding the details. But Stack Overflow is full of questions that feel like someone was too lazy or does not have the skills to read the manual.
But maybe these platforms are good for learning? I am not sure if ChatGPT and SO are good tools to learn to program. Maybe people learn more when they take the time to find a solution by themselves.
I don't know what you do for your work, but I suspect you haven't used chatgpt/copilot/etc enough to see the benefits.
Do you think tab-autocompletion is useful? AI assistants can do that on steroids.
Do you ever use example code to learn a new package? ChatGPT does that at a new level.
When I hear developers I work with saying stuff like the above, I think it sometimes comes from a place of insecurity. It's ok to find these tools useful, it doesn't mean you aren't an elite programmer. In the end the best developers will find ways to use these tools to learn faster and build better.
I'll continue to try it. It is not like I think I am totally elite and that's why I don't need code examples. It is just that, in retrospect, I see that I did not find much assistance in SO.
P2: AI assistants are trained on code written by people
∴ People using AI assistants will write buggy code
[1] Some might object to this, either in their own case or that of some of their professional associates. I would not trust the judgement of these people.
#1 is a specificity issue, though I think the meaning was clear, but "All people [who code] write buggy code" would be fine instead. Fixed
#2 Code can be written without bugs, but it isn't.
As for facts not in evidence, given the empirical nature of the proposition we're basically running up against the same problems that Hume has with inductive reasoning. Which is why I did not state P1 axiomatically and merely footnoted it to say I wouldn't trust the judgement of someone who believed otherwise.
This tracks for me. MOST of the code I have on github is just random little projects testing out stuff. I don't intend to work on it long term, so I don't go out of my way to make it maintainable, readable, or anything.
Any tool that helps people with worse skills produce a product will probably reduce the average quality, since those people wouldn't have produced anything at all before. That's not a super strong argument against having such tools.
I mean, is anyone really surprised by this? LLMs (as I understand them today) only predict the next token based on previous tokens, so there's no actual logical cohesion to what they produce.
I find the best use of coding AI is for shitty little inconsequential apps. I could imagine a startup founder using it to feel like they are saving a lot of money on engineering costs by using it for an MVP.
You will likely never use coding AI for things like embedded systems, financial software, manufacturing, flight computers, operating systems, software for managing networked infrastructures, various domain specific AI…
Agreed 100%. Implementing the correct algorithm/business logic is more important than spitting out junk that you don't really understand or won't pass code review.
The act of writing code is probably the easiest part of my job and I still haven't really seen these tools excel at that outside of toy examples.
On the other hand, I am pretty excited for AI code tools that tell me why my code is shitty, beyond static analysis, linters, etc.
That's my experience as well. I've tried Tabnine and Copilot and they're both useful as fancy line autocompleters when you know what you want to type. However they're unreliable to write code directly from business logic. They'll usually write code that _seems_ to be doing the right thing but with very subtle bugs because they don't understand the intent of the function and just try to mimic what's already there.
If anything that's what makes them more dangerous. I tried Copilot for a month and ended my subscription. It will write a SQL query with all the expected keywords in there: a WHERE for a filter, an ORDER BY for a sort, but the way it combines these things, especially recursive queries, is almost always wrong except the most trivial queries which I would have written by hand in no time anyway.
ChatGPT is even worse. The less popular a Python package is, the more bullshit is in the ChatGPT answer. I don't know how many times now it has literally made up functionality that isn't there in a library. At least that obviously doesn't work.
I am going to try it with R though. Maybe that is the real power. Using it to write a language you hate and only the output matters with no regard to how you got there.
Anything in production though that has users is insane.
I can picture a type of adversarial situation with a generator intentionally trying to write bugs to fool a discriminator.
We can't forget that what we have now basically is the MVP of code generation.
We are at the Deep Dream stage in AI art compared to Stable Diffusion. Deep Dream art was pretty much a one-off parlor trick. Not a single artist in the world was upset by it.
Fred Brooks would surely have added AI assistants to the arsenal of supposed silver bullets that failed.
One specific problem with chatGPT as it is today is that it gives answers that sound extremely confident. So confident it's tempting to follow them without any critical thinking. Sure you can guide it towards the correct answer, but it requires you knowing the answer in advance, which defeats the whole point.
I found ChatGPT pretty good at looking up stuff when I don't know where it is documented.
Eg. asking it what the name of some constant or parameter is. Much faster than trying to dig through Apple's confusing mess of documentation, and faster than Google.
It does tend to give incorrect answers a lot, so I use it only for stuff where I can quickly check if the answer is correct.
The code correctness issue is certainly a big problem -- it is simply not enough to get 90% correct. The real world problems often lie in the remaining 10% edge cases.
I have a very different take on how AI could come up with correct-by-construction code, without necessarily using a probabilistic model (deep learning, for example). I have written it up as a blog post here[0]. The sketch of the idea is that any problem is a data problem: an algorithm could be discovered and new code generated by projecting the problem into a topological space, finding the code there, and reducing the dimensions back into program space. It could well be a decent application of abstract algebra/algebraic topology to AI and code-generation problems.
I teach Python and Data Science for a living. I just ran through my fundamentals of Python course project with ChatGPT this morning. I was able to create a modern Python project layout, unit tests, documentation, comments, types, and a command line interface by copying and pasting code.
And the code was decent. It did have some problems coming up with the correct assertions for some of the unit tests.
For folks with an understanding of coding, this will be a huge boon to productivity and quality. However, this is not a tool that newbies (currently) will be able to use without any understanding of code.
I view it like a very powerful spell checker or grammar checker for a word processor right now. You need to have a direction you want to go, and if so this will get you there much faster with a better end product.
I have been using GitHub Copilot for several months now; nearly half the code in a side gig is written with Copilot. Then came ChatGPT. I like it better for how I can communicate with it and how it generates code, explains it, and lets me ask for modifications to the generated code. I wish Copilot would get ChatGPT as its core soon, as I prefer that style and the quality of code I get.
Yes, there are occasional bugs in the code, but you can find and debug them when you run it.
Recently had the misfortune of fixing a really overwrought and broken chunk of html. Copilot kept trying to introduce tons of junk into it of nearly the exact same variety I was trying to remove.
I'm not sure how to fix this logically, but it still seems like Copilot has a way to go to improve its helpfulness.
Recently tried Copilot for about a month or so. After the initial "wow, neat" couple of days, I generally became annoyed with the suggestions. They are either sort of right, kind of right, or right. However, in the time it took to read them and determine which, I could have already produced the correct result.
That's fine. You're the supervisor. You read the AI-generated code and fix the bugs. Or ask it to fix them itself, in the case of ChatGPT. Just like you'd do with junior engineers. They still have value despite less-than-perfect code.
This is a really nice study! It is very cool that they were able to get professional programmers to participate, this is something that is really hard to set up as an academic team. And yes, 47 participants is a small number, but apparently sufficient in this case to detect the effect (as evidenced by the p-values). It also lines up nicely with work we did last year, which looked at the security of Copilot's completions without any humans in the loop [1] and found that something like 40% of its completions in security sensitive scenarios were vulnerable.
One thing I'm not sure of is how well the setup reflects how people actually use IDE-integrated tools like Copilot. In the experiment, users had to explicitly ask the assistant for answers rather than getting them as inline completions (see Figure 7(b) in the appendix). I don't know if this would change the results; I could see it going either way (inline completions appear automatically so might be accepted automatically, but on the other hand programmers are also used to inline completions being incorrect and might be more likely to reject or repair them). It also means that it was up to the individual user to figure out how to prompt the AI, so the results will depend a lot on how users chose to phrase their prompt and how much context they provided.
As full disclosure, I'm one of the authors on a very similar study [2] that didn't find any large effects on security :) The main differences were:
- We only looked at C, rather than multiple languages. C is notoriously hard to write secure code in, so the base rate of vulnerabilities is likely to be higher. It's worth noting that the Stanford study also didn't find a statistically significant difference in security for C.
- Our study environment was designed to mimic Copilot much more closely – so we had participants use VSCode with a plugin providing inline completions from OpenAI's Codex API. This is also why we used the Cushman rather than the DaVinci model: Cushman's latency is much lower, which is important for realtime use; it looks like GitHub made the same decision, since reverse engineering of the Copilot plugin indicates it also uses Cushman [3].
- We had participants try to code up a full library with 11 different functions, rather than doing a bunch of smaller independent tasks. This means that the AI model had more context to work with, and may have affected how users approached the problem.
- We unfortunately only managed to get undergraduate and graduate students as participants, so the base and experience skill level of our user population may have been lower.
Overall I think it's clear that these models are prone to spitting out insecure code right now, and this is an important problem to fix (and one we're working on)! But it's still not clear to me what effect this actually has on the security of code written by programmers using tools like Copilot, and more research is needed to figure that out.
No, they aren't judging the future, they are judging the present. Many developers are using language models to assist them today so it is worth checking if that actually helps them or not. When we are in the future we can start judging future language models, but until then we can only test what we have now.
It's all short-sighted given what we already know about the future. We know that these models will soon and continuously become more complex. We know they will only trend toward greater accuracy.
Computer engineers won't be impressed with the output for some time. Front end devs like me just got an amazing search tool. I work faster now. I told my boss I work faster now. No going back.
Understand, it's now faster for me to ask ChatGPT to find an answer and then Google to confirm it than it is for me to only google things. For me the AI revolution already came and will only be getting better.
With design and process know-how I'm not really worried about losing my job. AIs will eventually be able to make ad hoc decisions about people's brand identity. But not anytime soon.
When you help coders, does that make their code more buggy? Please give me a dramatic, agitated answer.
ChatGPT:
No, it does not make their code more buggy. In fact, my goal is to help coders write better, more reliable code by providing guidance and assistance. I am here to provide support and guidance, not to cause problems or introduce bugs into their code. As a language model, I do not have the ability to execute code or interact with the real world, so I cannot cause bugs or issues in a coder's code. My role is simply to provide information and guidance to help coders troubleshoot issues and write better code. So no, I do not make code more buggy – I strive to make it better!
The next AIs will be trained on vast swathes of low-quality AI-generated outputs, if they are trained on public data again. Presumably people will have to come up with ways to work around that or the AI will be training to produce outputs like a low quality AI.
By low quality I just mean the state of the outputs today, which are incredible for what they are, but are definitely not the pinnacle of what is in theory possible.
Anybody who uses these AI assistants knows that the human is still by far the main architect and driver of the code base.
Increasingly advanced AI just means more back-and-forth between coder and AI, each increasing the other's velocity. AI won't just be trained on other AI-generated code, but more on "cyborg" code: code that was made by both AI and human together, code that the human probably wouldn't have been able to accomplish, at least not as quickly or in as much volume, without the AI.
Rather than a singularity we might see a "multilarity", where both human and AI become increasingly useful to each other - a situation that takes full advantage of diversity in ways of thinking about and processing information/knowledge.
How will they be able to keep purely AI-generated outputs from being fed back in as inputs? That seems hard to separate out once it’s published and not attributed. The ability of AI to generate lots of output means it might swamp human or cyborg outputs when looking at the corpus of publicly searchable code (or blog posts, or whatever the training data is for the case in question).
Maybe a GAN to detect and filter out AI-generated content? Not sure if that’s possible or not.
Well I'm not saying that we should put effort into forcing AI to not train on purely AI-generated work.
All I'm saying is that I believe humans are gonna produce more and better code with the help of AI and that AI models trained on a mix of human and AI-generated code will likely result in smarter AI that is also more receptive to cultural changes
I think it's gonna happen through social evolution. Not something we actively need to work towards
Execute the code to see if it passes the tests; then you can use it with confidence. Lots of human code is crap too, and it needs to be weeded out. You can use GPT-3 to administer tests and read the results.
"Developers Outraged After Free Intelligent Assistant Fails to Write Perfect Code on First Try"
In a shocking turn of events, developers are expressing their frustration and outrage after their nearly-free intelligent assistant failed to write perfect code on its first try. Despite being able to complete various coding tasks with ease, the assistant has reportedly struggled with issues such as logical bugs and security vulnerabilities.
"I can't believe it!" exclaimed one developer. "I spent all of ten dollars on this fancy new assistant, and it can't even write a simple program without introducing some kind of flaw or weakness. What's the point?"
Others have expressed similar disappointment, with some even going as far as to claim that the assistant is "useless" and a "complete waste of resources."
However, experts are quick to point out that the limitations of these intelligent assistants should come as no surprise. After all, they are not capable of reasoning about code or understanding the intricacies of secure programming.
"It's important to remember that these assistants are simply tools, and like any tool, they have their limitations," said one industry insider. "Expecting them to write perfect code every time is unrealistic and unfair."
Despite the disappointment and frustration, developers are still finding ways to make use of these intelligent assistants, recognizing that they can still be valuable resources when used properly.
"I guess we just have to accept that there will always be some limitations, and work with what we have," said one developer, shaking his head in disbelief. "Who would have thought that something so advanced and powerful could still be susceptible to the occasional bug or security issue? The future is a strange and mysterious place, indeed."
Now compare this to people who blindly copy-paste from Stack Overflow. This is a bad comparison until you accurately divide engineers into the groups who use it correctly and incorrectly.
I frequently click through random people's repos when they interact with the ones I'm interested in. I see lots of students who are apparently required to do their homework on github. Which is good! But if that's what's feeding the beast... woof.
I made a site using the OpenAI APIs new models (similar to ChatGPT) for creating simple apps or demos using natural language: https://aidev.codes . If anyone is interested please try it out.
I tried to promote the site by submitting to HN and did not receive a single vote. I feel that although many people may not be impressed, it is better than 0. So I am just trying to promote it in related threads.
Anyway if you have any feedback or anything I am in the ChatGPT and OpenAI Discords as user runvnc.
Thanks for explaining. What kind of computer? Yesterday I added another source to the video (mp4) in case people can't play WebM. Unless it is trying to play the WebM and doesn't like the encoding.
I definitely need to link to the Discord with the examples and/or finish the Explore/featured page which is the main plan for that.
Thank you for telling me it doesn't play on an iPad. Your implication that the landing page is useless does not make sense in the context that the video does not play.
I do not own any Apple products and so was not able to test on them.
The video works for me but I think you should make your site look more like https://raindrop.io/.
It has the video on the top like yours, but it also has pictures as you scroll down - basically a presentation for people too impatient to watch a video. Look into copywriting techniques for what to write as headers for the pictures; persuasion is important to attract users.
It makes it look more complete too!
Good luck on your site!
What are users supposed to do when they visit your site? Sign up? There's not enough information there on what to expect from the product for me to want to sign up. You should add more detailed information about what the tool does as enticement to try it.
Hi ilaksh, I was able to view your video on an Android device. That said, I did have to watch it several times before I understood what was going on, because a lot is happening everywhere on the screen. With the video and the minimalist intro on the site, I was still left guessing what was being offered and how it works.
Usually before signing up for any website, I want to check out all of the information I can find on the website, perhaps online forums if the site isn't clear, and competitors, to make sure what I'm signing up for is something I actually want. I also read any terms and conditions and the privacy policy if they ask for anything from me, such as an email address. Pretty much none of this was available.
The first thing I noticed/recommend changing is the lack of tabs. It reminded me of scam sites that barely have enough to appear legit so they can get at least someone to sign up for something. Also, just for the persuasion aspect of the site, there should be helpful tabs, like one with an example of a person using it (I know that's what the video is, but it's not intuitive for a non-user, which I'll get to in a second). One of the other guys gave the idea of having images with commentary on what's going on between each pic, which I think is a great idea if implemented properly. Maybe that page could have a longer video under the photo tutorial, with audio of someone explaining a basic how-to by showing what they want to build and how they built it; a simple button with a counter would be enough. A contact tab with the T&C and privacy policy would also be nice. And another tab explaining how it works - your little comment mentioning the APIs is more information than what is on the website, and it made me more willing to try it out than your site did.
So, onto the site itself after logging in. Because of the previous points, I would not have been willing to sign up; the only reason I made an account was to give you a hopefully helpful response, knowing I'd need to see more of the site.
Btw, the UI of your little program is very nice. It's stylish, mostly quick to grasp, and simple.
Anyway, when I first saw it, again due to the front page lacking any info, I was surprised that it was exactly the video. The video wasn't clear on whether we were looking at one feature, the whole program, or what; it was just visual noise without explanation on the main page. Giving it a title like "this is exactly what you'll see" won't help either, because it still isn't clear what is going on, and the video by itself is harder to follow than actually using the program. In fact, if you had a tab where users could try the program without making an account - a demo that just doesn't save their work - that would already be a huge help, because people would instantly see how easy and fun it is to use, and it would get rid of like 90% of the questions caused by not having any information.
Anyway, I saw the program and thought, oh, it's just the front-page video thing. I can use this! And I proceeded to try the top-right portion and wondered why it didn't work. Then I kinda clicked randomly until I realized the input area is actually the top-left part. Not exactly a fatal flaw, but a clear tutorial would have led me to understanding the program faster - like the fact that every input is its own section of code, so my input needs to follow an orderly pseudocode format. As someone not familiar with how good AI devs are, I don't already know whether inputs can be rearranged, whether a correction (say, if I didn't like the previous input) should be a new input after the previous one, or whether I have to edit the previous input directly. The rules of using the AI are just unclear. Speaking of edits (and I recognize that since I was using the program on a phone, some features are more difficult to use), I couldn't figure out how editing works. At the same time I was getting a glitch where the green loading bar would almost reach the end, then restart, then bounce back and forth like a spring after I tried editing, and nothing would submit. It was unclear whether I was editing wrong or running into a different bug. The existing lack of clarity had already eaten up my patience, so I didn't bother to figure out what I was doing wrong.
TL;DR:
In summary, you need three tabs on the front page explaining what it does, how it works, and how a user uses it, and maybe another tab for the T&C, privacy policy, and contact info. Since the landing page is already an attempt at that, it could be changed to explain why a person would want to use it. I can think of three reasons off the top of my head for a potential target audience, and I don't even know the full uses of the program, so I'm sure you can think of more.
After login, there needs to be an additional tab with detailed tutorials, like how to think when giving info to an AI dev, how to make corrections, etc. Since you'll have a contact page, people will point out more things that are unclear, which you can use to either improve the tutorial or build a common Q&A section (which, depending on the content, would go on the page before logging in or after, or be accessible to both). Lastly, you'll want a bug submission area as well, since in just a few minutes of use I ran into bugs, and you'll probably want to separate general contact from bug submissions.
Btw, I don't want to diss the functionality of the website itself. You have a cool little tool there, and the site is fully functional for friends and family you share it with who know you and what you are working on. The problems and suggestions I am raising matter if you want to open it up to people who aren't your friends and family and don't know what you are working on. All of the principles I suggested are basics of advertising yourself locally. They also aren't going to appeal to everyone, because I'm not a professional persuader or content creator or anything, but my hope is that it'll at least transform your website from unapproachable by strangers to approachable by the curious. I can't give real advice for promoting it among a target audience.
Thanks for all of the work and detailed response. I might have been restarting the server right at the moment you tried it. As you can see I released it as soon as I barely got it working, with the minimum explanation and copy etc. There is a bit of explanation in the READ THIS section above the text area. There is a Discord thread in the OpenAI Discord under the api-projects channel but I stopped giving it out because people can't access from the direct channel link and I plan to make a Discord server as soon as possible.
I will be working on all of these things as soon as I can get to them. I had mentioned this system in a post in the OpenAI Discord and a few people had said they were waiting for it to be live, so I just got it up. I also have another online business and a client project I have to complete to keep the lights on, but I am making this new thing as high a priority as possible.