What we learned in 6 months of working on an AI Developer (pythagora.ai)
63 points by magden 4 months ago | 49 comments



Even though I don't think GPT-4 is up to the task, it does seem like now is the right time to be working on these things. Pretty soon GPT-4 will not be the best in the field. The next generation will perform much better.

Possibly the most frustrating thing I find about GPT-4 is how close it gets with its wrong answers. It's easy to dismiss a lesser answer when it responds with a laughably out-of-band idea. GPT-4 often shows that it has a general idea of what you want but misses a small but critical aspect, which results in a solution to something else that is similar to, but not, what you wanted.

I have mixed results iterating on its own mistakes. It will too often try to change the world to match its answer rather than fix the answer. The best approach I have found to stop this is getting it to create unit tests. I imagine there is a lot of training data for it to understand the intention behind fixing a failing test. It's a very specific problem for it to look at, and generally changing the test is not considered the correct solution.
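
A rough sketch of how that could be automated, in Python (ask_llm is a hypothetical stand-in for whatever completion call you use; assumes pytest):

    # Hypothetical test-driven repair loop: the model may change the implementation,
    # never the test, and the failing pytest output is fed back each round.
    import subprocess

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("call your LLM of choice here")

    def repair(impl_path: str, test_path: str, max_rounds: int = 3) -> bool:
        for _ in range(max_rounds):
            result = subprocess.run(
                ["pytest", test_path, "-x", "--tb=short"],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return True  # tests pass, stop iterating
            code = open(impl_path).read()
            prompt = (
                "The following test fails. Fix the implementation only; "
                "do NOT modify or delete the test.\n\n"
                f"--- implementation ---\n{code}\n\n"
                f"--- pytest output ---\n{result.stdout[-3000:]}"
            )
            open(impl_path, "w").write(ask_llm(prompt))
        return False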


Oh man. When it’s so close but wrong it’s amazing for creative endeavors! For technical ones, it is quite a bad thing. It’s like being a Star Wars fan but the AI just wants to talk about Star Trek.

I think this is why the non-tech people see AI as so amazing. For anything human and non-technical, the “almost but not quite” nature is a good thing.

I was using an AI to help me debug a weird thing (mainly summarizing log splats hundreds of lines long) and I eventually got pretty close to identifying the issue when I asked “wtaf is this message. Never seen anything like it.” It then went on about how it was offended that I used vulgar language. I had to apologize for saying “wtaf!” Anyway, I found a bug in a linker, so that was fun; thanks Al.


What’s frustrating is that the one reason I ever wanted AI was to have a Lt. Cdr. Data or ship-computer equivalent that is logical and correct to a fault and that helps me reason through things. What we got now is almost exactly the opposite: we have to help it reason through things and double-check everything for correctness.


I think it's equally shit at creative work too, that's just harder to dismiss as "wrong". It's still wrong, it's just harder to see.


Rapidly expose the issue by asking it to write something funny. Can't tell a joke, even when prompted for really elementary stuff like knock knock jokes or chicken crossing the road, let alone anything sophisticated.


> Pretty soon GPT-4 will not be the best in the field. The next generation will perform much better.

What makes you believe that progress is linear, or at least a line forever going up?

I keep seeing people predicting rapidly improving AI, based on how rapidly it improved over the last x months.

But why is that not an outlier? How do we know we haven't hit a ceiling and aren't now stagnating? Isn't progress typically very bumpy and sudden?


>What makes you believe that progress is linear, or at least a line forever going up?

I assume neither of those things. I have however read a lot of the papers published since GPT-4 was trained. There have been a lot of advances since then, so much so that simply saying "a lot" seems to be a massive understatement.

I think it is a reasonable assumption that at least a portion of those advancements would be able to build upon the existing technology of GPT-4 to produce something greater.

I am not assuming discoveries yet to be made. I am considering existing discoveries that have not yet made it into the top level of production.


Technological innovation is not truly an exponential. It is instead a series of logistic curves. If we do not have further breakthroughs then AI technology will plateau. I do think we'll have those breakthroughs, but it's impossible to say when they will happen.


I would say that Microsoft/OpenAI’s attacks on open source, whether it be through “AGI safety” BS as a front for regulating their way to a monopoly, attempting Embrace, Extend, Extinguish on companies like Mistral, or Cold War-style fear mongering about China, are the greatest near-term risks to linear progress. And it’s worth noting on that latter point that China is not similarly constrained and so could end up outcompeting the U.S. regardless.


It’d be a bit hypocritical on the part of Microsoft/ClosedAI if they try to restrict AI tech in the name of safety while stepping around the fact that they themselves built it while essentially ignoring copyright.

Their commercial LLMs would not be possible without original creators who are now being ripped off and squeezed out of their jobs; if they don’t keep the tech open it will be very difficult to justify.


> China is not similarly constrained and so could end up outcompeting the U.S.

With what technology?

The US has long-term export controls on China, and as it has demonstrated with Russia recently, once secondary sanctions are in place everyone falls into line. So it's pretty likely they will be effective.


The export controls simply don’t work. Those chips still make their way into China. All those export controls do is slow it down a bit.

But China can outfit itself with more hardware even if it’s not as fast as the latest iteration and still speed past the U.S. while the U.S. and the EU argue about AI being racist or not.


> Those chips still make their way into China

Not at the scale you need to build a world-class AI.

Let me know when Huawei is able to place an order for 300,000 GPUs.



China actually has very advanced technology in literally every area of the sciences now. They just don’t bother talking about it in English.

I recommend you follow some of the specialist Chinese AI substacks to see what’s happening over there.

Especially around chip building.


>I recommend you follow some of the specialist Chinese AI substacks to see what’s happening over there.

Do you recommend any in particular? I'm not familiar enough with the Chinese AI scene to know who to check out


Not really buying the idea that China secretly has equivalents for Nvidia, ASML, TSMC etc.

SMIC right now relies entirely on existing US/EU hardware for its chip building.

New versions of that hardware are no longer available to them.


I’m guessing you don’t read many machine learning papers. Chinese research in AI coming out of their top institutions (e.g. Tsinghua University, Shanghai AI Lab), is pretty much on par with state of the art in U.S. labs.


The greatest near term risk to progress here is cost but nobody wants to talk about it.


Architecture improvements and stuff like the 1-bit networks that have been all over HN recently are going to drastically reduce training costs over time and could easily result in very powerful local models. That’s why there is a rush to regulate and create monopolies. Microsoft and OpenAI know they have no real moat relative to the current pace of R&D.


Until I see an AI sysadmin that can help with basic configure/make problems, I don't have high hopes for an AI developer.


That should be quite easy compared to software development, which is much more open-ended since the requirements are usually more nebulous, potentially contradictory, and at times simply wrong.


> That should be quite easy

Sysadmin stuff is quite easy in terms of complexity compared to some software stuff. The problems, similar to traditional engineering, tend to come from the rather high cost of failure.

To expand further: it's easy to set up a system but hard to set up one that's reliable and/or resilient. It's hard to maintain systems that are undocumented and/or wrongly documented (outdated, inaccurate). It's even harder to always make sure everything's consistent and you don't lose or damage data.


Yeah, imagine giving an AI root access to the server with your production database on it. Now I really want to try that in a VM to see what will happen. Even odds that at some point it tries to rm -rf everything.


Nobody is considering giving AI unrestricted access to anything yet. Code written by AI is reviewed by humans, and I would be shocked to hear that sysadmins are considering letting AI agents execute arbitrary commands.


I gave GPT root access, so far without regrets.

Admittedly, I recently added a soft barrier: "for dangerous commands please ask for confirmation".
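
A hard-coded version of the same idea might look like this rough Python sketch (the keyword list and helper names are made up, not what I actually run):

    # Illustrative soft barrier: flag obviously destructive commands and require a
    # human "y" before the agent-proposed command is executed.
    import subprocess

    DANGEROUS = ("rm -rf", "mkfs", "dd if=", "drop table", "shutdown")

    def run_agent_command(cmd: str) -> str:
        if any(marker in cmd.lower() for marker in DANGEROUS):
            answer = input(f"Agent wants to run:\n  {cmd}\nAllow? [y/N] ")
            if answer.strip().lower() != "y":
                return "refused by operator"
        done = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return done.stdout + done.stderr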


Does it "know" what's "dangerous"?


Username checks out!


Yet, I haven't seen an AI that solves the software distribution problem. Tools like ChatGPT are often plain wrong when answering questions about basic sysadmin problems, and even make up commands.


The software distribution problem is not a close-ended, technical task that we could realistically expect an LLM to have an answer for. At least not now. The latter problem could eventually be lessened by fine-tuning the LLM more specifically and by improving the tooling around it. For example, by automatically briefing it about peculiarities of the operating systems.


https://chat.openai.com/share/c424f444-c7ac-476f-bd10-02234e...

I picked a random GitHub issue that was some issue with ./configure. Seems like it helps to me.


need AI for ffmpeg flags


I made a program for all your Unix admin needs:

https://github.com/skorokithakis/sysaidmin


That sounds like a job AI would actually be good at.


I have great success for my simple use cases with sgpt -s "cut the 40 seconds of the video starting at 1:30"


ChatGPT-4 is wonderful for composing regular expressions. Saves so much time when transforming various Java strings that arise in my work.

Today's big time savings came from this prompt: "Write a Java method that uses the Eclipse AST parser to create a simple markdown file showing the commented method signatures of a given Java class text file."


Maybe AI developers can make landing pages and basic APIs. But, taking front end as an example, I just don't see how an AI can reproduce exact design specifications and interactivity to the point where it wouldn't just be faster to write the code yourself or search for some human verified snippet that does what you want.

And programmers who do know how to actually write efficient code without AI seem like they'd be even more in demand than those that rely on AI. Skill + knowledge + ability to use existing resources (e.g. StackOverflow, packages, templates), as we do now, are much more predictable and faster than trying to wrangle AI to do exactly what the designer or PM wants.

When the dishwasher was invented, everyone thought the human dish washer would be obsolete. And yet, restaurants still employ dish washers because they are much more efficient and thorough than a dishwashing machine.


> When the dishwasher was invented, everyone thought the human dish washer would be obsolete. And yet, restaurants still employ dish washers because they are much more efficient and thorough than a dishwashing machine.

This is a good example of both job destruction and job retention by technology.

Job destruction - the total number of potential hand dishwasher jobs has reduced because the vast majority of commodity dishwashing is machine driven.

Job enhancement - machine dishwashers just can't produce the quality/dexterity of hand dishwashers.

I feel like generative AI will do the same. It will replace a large number of commodity jobs - editors, translators, copy producers, website designers, app prototypes, paper pushers but it will also reveal the value of skilled producers.

Too risky to let ChatGPT write code for your backend that destroys your production database and crashes your company forever.


One of the things they seem to have figured out is the requirement to at least model a sort of actor-critic architecture with their agents. It helps quite a bit.

They seem to badmouth Aider a tad (not cool), but I do wonder how a full stack of this + Aider might work. There needs to also be some sort of good test generator involved.

All that said, any time someone actually demonstrates progress on the automated Software Engineer problem and it makes it to HN, I am deeply reminded of the old quote:

"It is difficult to get a man to understand something, when his salary depends on his not understanding it."

Just read through this comments section and check out the pure copium. Yes, ChatGPT can do basic sysadmin tasks with ./configure and make.

Yes it does make sense to work on this now, assuming LLMs will get better, because LLMs have continued to get better on any metric you can imagine.

Finally, yes, AI devs will make landing pages and basic APIs. I didn't realize we were all hardcore world-class 0.01% programmers? I have certainly written a landing page and basic API before, in fact I do that sort of thing a lot more than I write uber1337 hax0r code. You probably do too!


Since Aider was mentioned: which other tools attempt to act as complete coding copilots? GitHub Copilot Pro seems to have ended some of its experiments, or did I just not get access to the right beta?


The focus on upfront specs feels a bit off. Since it's apparently cheap to generate running code, as a user, I'd much rather be able to just iterate really fast and use output to refine my requirements rather than having to laboriously state them all up front. Agile rather than waterfall if you will.


In that case, it might be easier to start over with fixed specs. That might not be as much work as it sounds, since most of the existing code would have been produced by the LLM and only the human feedback and interventions would have to be redone. It would be almost like backtracking to an earlier point in a chat history and changing path there. TA mentions that another LLM could provide insight about where to change what.

It might also be possible to change an existing history without abandoning everything that has happened afterwards. Of course, this could lead to conflicts, sort of like when rebasing a branch, and it would be useful to have another LLM look for them.

GPT Pilot might or might not be able to start from existing code as well, in which case one would approach it as one would a legacy codebase that has to be adapted to new requirements.


It's the LLM that needs the upfront specs at this point, regardless of what you'd like. If that is the case, implement a nondestructive, composable system, like node-based visual editors (ComfyUI, for example). Change the upfront specification “node” and let the LLM cascade through that and any attached nodes, creating the code (or whatever) fresh each time.
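
A toy Python sketch of that cascade (all names made up): each spec node caches its generated output, and editing a node invalidates everything downstream, so only the affected pieces get regenerated.

    # Toy composable spec graph: changing a node clears the cached generations of
    # its descendants, so only those are regenerated on the next request.
    from dataclasses import dataclass, field

    def generate(spec: str, upstream: list[str]) -> str:
        return f"<code generated from: {spec}>"  # stand-in for the LLM call

    @dataclass
    class SpecNode:
        spec: str
        parents: list["SpecNode"] = field(default_factory=list)
        children: list["SpecNode"] = field(default_factory=list)
        cached: str | None = None

        def output(self) -> str:
            if self.cached is None:
                self.cached = generate(self.spec, [p.output() for p in self.parents])
            return self.cached

        def edit(self, new_spec: str) -> None:
            self.spec = new_spec
            self.invalidate()

        def invalidate(self) -> None:
            self.cached = None
            for child in self.children:
                child.invalidate()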


CMake was invented to guarantee that at least some humans would have software jobs.


> Our approach is to focus on building the application layer instead of working on getting LLMs to output better results. The reasoning is that LLMs will get better,...

So more jam tomorrow then. Building the framework around the magic is the easy bit.


It is a very important bit and might be how we all code in the future.


It’s definitely not even close to being solved either. I haven’t seen a single code generator that works (100% of the time) for anything more than a very simple one- or two-liner.


So far, fixing that involved changing the input or patching the code generator. Now we can simply nudge the code generator toward the fix. And we circle back to the point TA is making: just like you can't expect a junior programmer to get it right the first time, the code generator also needs feedback. And since the code generator is not constrained by a lack of technical knowledge like a junior programmer might be, it is even more important to supervise it.


Hm.

It’s easy to look at https://github.com/Pythagora-io/gpt-pilot-db-analysis-tool/b... and go… so, this new tool means you took two days to write this?

long stare

Why did you bother?

…but, this both hits the nail on the head and misses the point at the same time.

On the one hand, this is foundational tech, prototyping on a new way of doing things. It’s not going to be faster than doing it yourself at first. It won’t run locally at first.

On the other hand, we already know that GPT4 level models can do trivial tasks.

Over and over and over, people claim coding tools can massively improve productivity, and then try to demo that by building a trivial system.

…but building a trivial system is not the problem that needs solving.

The problem that needs solving is building large complex systems with dynamically adjusting requirements.

The examples and blog post seem to miss this even as an idea.

While I applaud, in general, efforts to explore this space, tackling the easy problems seems like it doesn’t significantly advance the state of play.

Here are some concrete things that would be more valuable, but are significantly technically harder:

- Use tests. Make it write tests. Make humans write tests. Do not accept generated code that fails the tests.

- Focus on refactoring; it’s a known issue that models struggle to refactor code. Breaking your existing code base into tiny files isn’t the answer.

- Focus on documenting the behaviour of existing code and incrementally migrating to new behaviour.

- Bad developers write new code instead of reading the existing code and using existing functionality and utilities. AI generators are notoriously rubbish at this, and will almost always generate a function rather than use an existing one.

Refining and understanding existing code is significantly more valuable than generating code “from scratch”; so much so that I would argue that without the ability to refine existing code, such tools will forever remain in the “scaffold generator” category of “useful but ultimately no better than the current status quo”.

The tool as shown is, I believe, broadly speaking interesting, but the approach described in the blog (upfront decisions about everything) is a dead end.



