One problem I run into with LLM code generation on large projects is that at some point the LLM hits a problem it just cannot fix, no matter how it is prompted. This manifests in a number of ways: sometimes it bounces back and forth between two invalid solutions; other times it fixes one issue while breaking something else in another part of the code.
Another issue with complex projects is that LLMs will not tell you what you don't know. They will happily go about designing crappy code if you ask them for a crappy solution, and they won't recommend a better path forward unless explicitly prompted.
That said, I had Claude generate most of a tile-based 2D pixel art rendering engine[1] for me, but again, once things got complicated I had to start hand-fixing the code because Claude was no longer able to make improvements.
I've seen these failure modes across multiple problem domains, from CSS (alternating between two broken styles, neither of which came close to fixing the issue) to backend code to rendering code (trying to get character sprites placed correctly on the tiles).
[1] https://www.generativestorytelling.ai/town/index.html (notice the tons of rendering artifacts). I've realized I'm going to need to rewrite a lot of how rendering happens to resolve them. Claude wrote 80% of the original code, but by the time I'm done fixing everything, maybe only 30% or so of Claude's code will remain.
Same. I was writing my own language compiler with MLIR/C++, and GPT was OK-ish for diving into the space initially, but it ran out of steam pretty quickly, and at one point the recommendations were so off (invented MLIR features, invented libraries, incorrect understanding of the framework, etc.) that I had to go back to the drawing board, RTFM, and basically do everything I would have done without GPT to begin with. I've seen similar issues in other problem domains as well, just like you. It doesn't surprise me, though.
I've observed this too. I'm sceptical of the all-in-one builders; I think the most likely route to get there is for LLMs to eat the smaller tasks as part of a developer workflow, with humans wiring them together, then to expand with specialised agents to move up the stack.
For instance, instead of a web designer AI, start with an agent to generate tests for a human building a web component. Then add an agent to generate components for a human building a design system. Then add an agent to generate a design system using those agents for a human building a web page. Then add an agent to build entire page layouts using a design system for a human building a website.
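A rough sketch of that layering, with a human checkpoint between rungs; all of the agent names and the review step here are hypothetical, just to make the shape concrete:

    from typing import Callable

    # Treat an "agent" as: take a spec, return an artifact (a deliberate simplification).
    Agent = Callable[[str], str]

    def with_human_review(agent: Agent) -> Agent:
        """Wrap an agent so a human approves its output before it moves up the stack."""
        def reviewed(spec: str) -> str:
            artifact = agent(spec)
            print(f"review needed: {artifact!r}")  # stand-in for a real review step
            return artifact
        return reviewed

    # Lowest rung: a hypothetical test-generating agent.
    def test_agent(component: str) -> str:
        return f"tests for {component}"

    # Next rung up, composed from the rung below plus human review.
    def component_agent(spec: str) -> str:
        component = f"component for {spec}"
        tests = with_human_review(test_agent)(component)
        return f"{component} [{tests}]"

    # A design-system agent would call component_agent, a page-layout agent would
    # call the design-system agent, and so on up the stack, human in the loop each time.
    print(component_agent("signup form"))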
Even if there's a 20% failure rate that needs human intervention, that's still a 5x gain in developer productivity: if humans only touch the one task in five that fails, the same effort covers five times as many tasks. When the failure rate gets low enough, move up the stack.
I've found that getting the AI to write unit tests is almost more useless than getting it to write the code. If I'm writing a test suite, the code is non-trivial, and the edge cases are something I need to think about deeply to make sure I've really covered them, which is absolutely not something an LLM will do. And most of the time, it's only by actually writing the tests that I figure out all of the possible edge cases; if I just handed the job off to an LLM, I'm very confident that my defect rate would balloon significantly.
I've found that LLMs are only as good as the code they are trained on.
So for basic CRUD-style web apps they are great. But then again, so is a template.
But the minute you are dealing with newer libraries or less popular languages, e.g. Rust or Scala, it just falls apart: for me it constantly hallucinates methods, imports, etc.
I spent weeks trying to get GPT-4 to get me through some gnarly Shapeless (Scala) issues. After that failure, I realized how real the limitations are. These models really cannot produce original work, and in niche languages they hallucinate all the time, to the point of being completely unusable.
Hallucination from code-generating AI is worse than software written by a beginning programmer, because the beginner can at least analyse their mistakes and work out what is actually true.
And so another level of abstraction in software development is created, but this time with an unknown level of accuracy. Call me old-school, but I like a debuggable, explainable and essentially provably reliable result. When a good developer can code while keeping the whole problem accurately in their heads, the code is worth its wait (deliberate sic, thank you) in gold.
Reviving my long-dead account to say that I built a perfectly functional small site to help schedule my Dungeons & Dragons group within about 5 minutes, on my phone, from my bed. If this isn't the future I don't want to go there. Fantastic work.
As someone who's run in software startup circles on the customer support side for years, I couldn't figure out how to make it do anything. It took several minutes even to notice the "save and preview" button needed to get my "edit this text" edit to work in the code editor. I first tried the "refresh" button on the preview, after realizing it didn't update automatically, but most of the text on the correct button wasn't visible.
Then, when it asked me to make a "val" that did something, I tried looking at the templates, hoping to figure out what they even mean by "a val." The page loaded halfway down the screen and wouldn't scroll, so I gave up and went back to the site to see if the About page would tell me anything useful.
I ended up back here hoping someone in the comments had figured it out, only to find that for some people it clearly is entirely intuitive. That makes me think the whole platform is just for "other people", people not like me in some important way I haven't identified yet, even though it sounded like it was targeting people who have ideas for projects but don't know how to build them themselves.
Sorry to hear this! We definitely have a lot of work to do to make it more intuitive. If you have 30 minutes, I'd love to do a video call where I walk you through it and take notes on what you find confusing. Hopefully it'd be helpful to us both! I'm steve@val.town, shoot me a note if you're up for a chat
I made it build a functional table for editing, with authentication and role-based access to the different fields in the table, by asking it:
"Build a site with a table that has editable stuff" Yes... I really said "stuff"...
"add authentication"
"add role based access to the different fields in the table"
It took like 5 seconds. The code is unusable for anything enterprise given how poorly written it is; the back-end and authentication especially would have to be completely rewritten, and the front-end has no reusable parts, but it works. It would work for a lot of smaller projects, even ones for tiny-scale professional usage.
I'm not sure how a non-developer would deploy it anywhere, but it would be quite easy to get ChatGPT (or maybe Townie?) to generate a docker-compose file for it. I'm confident a non-developer would be capable of changing the "stuff" into something useful even if they didn't use Townie for it. I don't think everyone could do it, but I do think a lot of my not-very-technical but digitally inclined co-workers would be capable of working with it. I'm not completely sure how they would transform their various data sources into something it could use, so I'm not sure exactly what they would use it for.
Whenever there is a 'big new AI model' launch, I try to build one of my side projects fully with the new AI: I do not touch anything myself, I only talk English to it. I do read the generated code and instructions so I can correct them in English, but I write no code at all. It worked twice, with the GPT-4 and Sonnet launches. All the others did not manage without significant code or ops help.
It is a very annoying experience even if you know what you are doing; it is still much faster than writing the code yourself, but it is very frustrating getting the last 10% right: you spend a day on the first 80%, two days on the next 10%, and a week on the last 10%. If I just jump in and fix the code myself, it is about 1 day for the same project, which is still amazing (and not imaginable before).
People complaining that it sucks and that it cannot figure things out are often right; however, it is a lot better than what we had before, which was doing all this by hand (causing many people to procrastinate over even starting a side project while having 1000s in mind every day).
These types of services are important and I like this val.town idea. Well done and keep going.
Yup, absolutely. I've got a few side-projects that I'm doing for fun which wouldn't have even gotten started if not for LLMs being good enough. But, like many customer-facing roles*, LLMs get a lot of flak from people who (appear to) expect perfection for free.
When the AI doesn't make mistakes, humans can't continue to get paid to do those jobs. So I'm hoping AI continues to be mediocre at writing code for at least another 2 years.
The other big GenAI things, I'm not sure how they're perceived. Music? I don't know well enough to tell. Images? The way people talk in public is a three-way mix of complaining they're (1) making bad art while (2) also taking away jobs, plus (3) the "this rips off the human creators" that is present for all GenAI models. Video? I'm only really seeing people mock it right now; even the praise is caveated "I can see the flaws but $something_positive".
* Was thinking this yesterday WRT the police getting flak both for not doing enough and for being too proactive; but I could say this about almost anything. A former partner could have said the same about working in a call center, a few days ago on this site there was a discussion about doctors making terrible mistakes, and there are a few current examples I don't dare write about because I know I'll get flak from "both sides" (even on this site) for even daring to use the phrase "both sides".
LLMs are massively useful, just not in the way that people think they should work. No, you are not the manager with AI doing the work. Instead, it is more like someone to bounce ideas off of, a teacher (who is often wrong) and a reviewer. The human, ironically, is the specialist that can get details right, not AI (at least not for now).
I just played with townie AI for an hour or so... Very cool! Very fun.
There are still some glitches: occasionally the entire app code would get replaced by just the function the LLM was trying to update. I could fix it by telling it that's what had happened, and it would then fill everything in again... Waiting for the entire app to be rewritten each time was a bit annoying.
It got the initial concepts of the app running very, very quickly, but then it struggled with some CSS stuff, saying it would try a different approach or apologising for missing things repeatedly... and eventually it told me it would try more radical approaches and wrote inline styles... I wonder if the single-file approach has limitations in that respect.
Very interesting, very fun to play with.
I'm kind of concerned about security with LLM-written apps: you can ask it to do things and it says yes, without really considering whether it's a good idea or not.
But cool!
And the more the internet is full of small, independent, quirky, creative ideas, the better.
> I'm kind of concerned about security with LLM-written apps: you can ask it to do things and it says yes, without really considering whether it's a good idea or not.
Well, right. If I'm using an LLM to create code, I'm going to use all my skill and experience to review and shape the code to standards I'm ok with.
But for people with extremely limited experience, LLMs offer "create an app by talking!" with zero understanding required. So they won't know not to leak user PII in JSON responses, not to leave endpoints publicly writable, or to keep private keys for external services server-side and out of the code base, etc. Let alone anything more complex.
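To make the first and last of those concrete, here's a minimal sketch of the habits an experienced reviewer would apply; the record fields and the environment variable name are made up for illustration:

    import json
    import os

    # Keep third-party credentials out of the code base: read them from the environment.
    PAYMENTS_KEY = os.environ.get("PAYMENTS_API_KEY")  # hypothetical external-service key

    # Whitelist what an endpoint may expose instead of serialising whole records.
    PUBLIC_USER_FIELDS = {"id", "display_name"}

    def user_response(record: dict) -> str:
        """Serialise only the safe fields of a user record into a JSON response body."""
        return json.dumps({k: v for k, v in record.items() if k in PUBLIC_USER_FIELDS})

    record = {"id": 7, "display_name": "sam",
              "email": "sam@example.com", "password_hash": "redacted"}
    assert "email" not in user_response(record)  # the PII never leaves the server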
prompt = """
You run in a loop of Thought, Action, PAUSE, Observation.
At the end of the loop you output an Answer
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of the actions available to you - then return PAUSE.
Observation will be the result of running those actions.
..."""
Seems like a really powerful technique to have LLMs act on their own feedback.
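For what it's worth, the loop that prompt implies can be driven by a few lines of glue code. A minimal sketch, assuming an `llm` callable that returns the model's next completion and a toy `calculate` tool (both placeholders, not anyone's real implementation):

    import re

    ACTION_RE = re.compile(r"^Action: (\w+): (.*)$", re.MULTILINE)

    def calculate(expr: str) -> str:
        return str(eval(expr))  # toy tool; fine for a sketch, unsafe in production

    TOOLS = {"calculate": calculate}

    def run_agent(llm, question: str, max_turns: int = 5) -> str:
        """Drive the Thought/Action/PAUSE/Observation loop until an Answer appears."""
        transcript = prompt + "\n\nQuestion: " + question  # `prompt` from the snippet above
        for _ in range(max_turns):
            completion = llm(transcript)
            transcript += "\n" + completion
            if "Answer:" in completion:
                return completion.split("Answer:", 1)[1].strip()
            match = ACTION_RE.search(completion)
            if match:
                tool, arg = match.groups()
                observation = TOOLS[tool](arg)  # run the action on the model's behalf
                transcript += f"\nObservation: {observation}"
        raise RuntimeError("agent did not produce an Answer")

The important part is the Observation step: the harness, not the model, executes each action and feeds the real result back in, so the model iterates on actual feedback rather than on its own guesses.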
Couldn't this essentially be used as a training data generator?
E.g. have humans + LLMs generate a bunch of prompts that go into this system, and it spits out a bunch of fully-fledged applications, which can be used to train an even bigger model.
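Sketching that pipeline in a few lines (every function here is a hypothetical stand-in):

    def build_app(prompt: str) -> str:
        return f"app built from: {prompt}"  # stand-in for the feedback-loop system above

    def passes_checks(app: str) -> bool:
        return True  # stand-in for "it deploys, the tests pass, no runtime errors"

    def generate_training_pairs(seed_prompts: list[str]) -> list[dict]:
        """Turn seed prompts into (prompt, application) pairs fit for fine-tuning."""
        pairs = []
        for p in seed_prompts:
            app = build_app(p)
            if passes_checks(app):  # only keep examples that actually work
                pairs.append({"prompt": p, "completion": app})
        return pairs

    pairs = generate_training_pairs(["a scheduling site", "an editable table"])

Presumably the filter is the important part: without some automated check that an app actually works, the pipeline would just amplify the model's existing failure modes.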
I think that this is kind of an obvious "optimization" for making application generation much more reliable. Just because the models can generate code for one of 1000 different platforms doesn't mean that you need all of them. Narrowing the scope to a particular platform makes it much more feasible to get working applications without manual debugging due to out-of-date library references, etc.
I think something like the approach you have demonstrated here will relatively quickly become the standard for "no-code" application development.
Completely agree. It's useful not just for targeting one specific language, but also for all the other APIs we have, and for things like RAG to search for importable modules on the platform. Duplicating all that across many platforms is a lot of work!
There hasn't been a ton of work in this area, but most things will work in Deno if you import them directly. You'd need to figure out blob storage and sqlite, but I don't think it would be the biggest lift to do so.