Things we learned about LLMs in 2024 (simonwillison.net)
942 points by simonw 3 days ago | 563 comments





About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy. And I guess that in tech many folks try LLMs for the same use cases. Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, is not helpful.

But there is more: a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability. The prompt is king: a good one makes those models 10x better than they are with a lazy one-liner question. Drop your files in the context window; ask very precise questions explaining the background. They work great for exploring what lies at the borders of your knowledge. They are also great at doing boring tasks for which you can provide perfect guidance (but that still would take you hours). The best LLMs out there (in my case just Claude Sonnet 3.5, I must admit) are able to accelerate you.


I'm surprised at the description that it's "useless" as a programming / design partner. Even if it doesn't make "elegant" code (whatever that means), it's the difference between an app existing at all, or not.

I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.

I wouldn't describe myself as a programmer, and didn't plan to ever build an app, mostly because in the attempts I made, I'd get stuck and couldn't google my way out.

LLMs are the great un-stickers. For that reason alone, they are incredibly useful.


The context here is super-important - the commenter is the author of Redis. So, a super-experienced and productive low-level programmer. It’s not surprising that Staff-plus experts find LLMs much less useful.

Though I’d be interested if this was an opinion on “help me write this gnarly C algorithm” or “help me to be productive in <new language>” as I find a big productivity increase from the latter.


Quick example. I was implementing the dot product between two quantized vectors that have two different min/max quantization ranges (later I changed the implementation to just centered-range quantization, thanks to Claude and to what I'm writing in this comment). I wanted to still do the math with the integers and adjust for the ranges at the end. Claude was able to mathematically scompose the operations into multiplication and accumulation of a sum of integers, with the result adjusted at the end, using a math trick that I didn't know but that was understandable after having seen it. This way I was able to benchmark this implementation and understand that my old centered quantization was no less precise in practice, and faster (I can multiply integers without taking the sum, and later correct for the square of the range factor). I could have done it without LLMs, but probably I would not have tried at all because of the time needed.
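For the curious, the decomposition is roughly the following identity, shown here as a minimal Python sketch of the general asymmetric-quantization trick (an illustration, not the actual Redis code): with x ~= sx*(qx - zx) and y ~= sy*(qy - zy), the cross terms can be pulled out so that only integer multiply-accumulates and two integer sums remain in the hot loop, with a single floating-point adjustment at the end.

    # Illustrative sketch only: dot product of two asymmetrically quantized
    # vectors, x ~= sx * (qx - zx), y ~= sy * (qy - zy).
    def quantized_dot(qx, qy, sx, zx, sy, zy):
        acc = sum(a * b for a, b in zip(qx, qy))   # integer multiply-accumulate
        sum_x, sum_y = sum(qx), sum(qy)            # plain integer sums
        n = len(qx)
        # the range adjustment happens once, outside the integer loop
        return sx * sy * (acc - zy * sum_x - zx * sum_y + n * zx * zy)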

Other examples: Claude was able, multiple times, to spot bugs in my C code when I asked for a code review. All bugs I would eventually have found, but it's better to fix them ASAP.

Finally, sometimes I paste in relevant papers and implementations and ask for variations of a given algorithm across the paper and the implementations around, to gain insight into what people do in practice, then engage in discussions about how to improve it. It is never able to come up with novel ideas, but it is often able to recognize when my idea is flawed or whether it seems sound.

All this and more helps me deliver better code. I can venture into things I otherwise would not do for lack of time.


I'm pretty sure most people, developers especially, have had magical, life-changing experiences with LLMs. I think the problem is that they can't do these things reliably.

I get this sentiment from a lot of AI startups: they have a product that can do amazing things, but its failure modes make it almost useless. To use an analogy from self-driving cars, the users still have to constantly pay attention to the road: you don't get a ride from Baltimore to New York where you can do whatever you please, you get a ride where you're constantly babysitting an autonomous vehicle, bored out of your mind, forced to monitor the road conditions and surrounding vehicles, lest the car make a mistake costing you your life.

To take the analogy further, after experimenting with not using LLM tools, I feel that the main difference between the two modes of work is similar to the difference between driving a car and being driven by an autonomous car: you exert less mental effort; it's not that you get to your destination faster.

Another point in the analogy is things like Waymo. They really can do a great job of driving autonomously, but they require a legible system of roads and weather conditions. There are LLM systems too that, when given a legible system to work in, can do a near-perfect job.


I mean… I agree that LLMs give only superficial value, but your analogy is plain wrong.

I drove 3,600 km from Norway to Spain in 2018 with only adaptive cruise control. Then again in 2023 with autonomous highway driving (the kind where you keep a hand on the wheel for the failure modes), and it was amaaaazing how big the difference was.


I get how I could be wrong on that front. I guess what I was trying to say was that there needs to be legible, predictable infrastructure for these AI systems to work well. I actually think that an LLM workflow in a constrained, well understood environment would be amazingly good too.

I've been driving a lot in Istanbul lately and I'm not holding my breath for autonomous vehicles any time soon.


LLMs being able to detect bugs in my own code is absolutely mind-blowing to me. These things are “just” predicting the next token, but somehow they are able to take in code that has never been written before, understand it, and find what's wrong with it.

I think I’m more amazed by them because I know how they work. They shouldn’t be able to do this, but the fact that they can is absolutely jaw dropping science fiction shit.


Idk if there is much code that "hasn't been written before".

Sure, if you look at new project X in totality, it's a semi-unique combination of code; but break it down into chunks of a couple of lines, or a very specific context, and it's all been done before.


It's easy to see how it does that: your bug isn't something novel. It has seen millions of "where is the bug in this code" questions online, so it can typically guess from there what the answer would be.

It is very unreliable at fixing things or writing code for anything non-standard. Knowing this, you can easily construct queries that trip them up: notice what it is in your code that they latch onto, then construct an example with that thing in it that isn't a bug, and they will be wrong every time.


Both of your claims are way off the mark (I run an AI lab).

The LLMs are good at finding bugs in code not because they’ve been trained on questions that ask for existing bugs, but because they have built a world model in order to complete text more accurately. In this model, programming exists and has rules and the world model has learned that.

Which means that anything nonstandard … will be supported. It is trivial to showcase this: just base64 encode your prompts and see how the LLMs respond. It’s a good test because base64 is easy for LLMs to understand but still severely degrades the quality of reasoning and answers.
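If you want to reproduce the probe, it only takes a couple of lines. A minimal sketch, where call_llm is a hypothetical placeholder for whatever client or CLI you actually use:

    import base64

    prompt = "Review this C function for off-by-one errors: ..."
    encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")

    # Send only the encoded text, with no hint about the encoding, and
    # compare answer quality against the plain-text baseline.
    degraded = call_llm(encoded)   # hypothetical client call
    baseline = call_llm(prompt)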


The "world model" of an LLM is just the set of [deep] predictive patterns that it was induced to learn during training. There is no magic here - the model is just trying to learn how to auto-regressively predict training set continuations.

Of course the humans who created the training set samples didn't create them auto-regressively - the training set samples are artifacts reflecting an external world, and knowledge about it, that the model is not privy to, but the model is limited to minimizing training errors on the task it was given - auto-regressive prediction. It has no choice. The "world model" (patterns) it has learnt isn't some magical grokking of the external world that it is not privy to - it is just the patterns needed to minimize errors when attempting to auto-regressively predict training set continuations.

Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.


  >Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.
>similarity

Yes, except the computer can easily 'see' in more than 3 dimensions, with more capability to spot similarities, and can follow lines of prediction (similar to chess) far further than any group of humans can.

That superhuman ability to spot similarities and walk latent spaces 'randomly' - yet uncannily - has given rise to emergent phenomena that mimic proto-intelligence.

We have no idea what ideas these tokens embed at different layers, or what capabilities can emerge now, later at deployment time, or given a certain prompt.


The inner workings/representations of transformers/LLMs aren't a total black box - there's a lot of work being done (and published) on "mechanistic interpretability", especially by Anthropic.

The intelligence we see in LLMs is to be expected - we're looking in the mirror. They are trained to copy humans, so it's just our own thought patterns and reasoning being output. The LLM is just a "selective mirror" deciding what to output for any given input.


It's mirroring the capability (if not, currently, the executive agency) of being able to convince people to do things. That alone bridges the gap, as social engineering is impossible to patch - harder than foolproofing models against being jailbroken or used in an adversarial context.

I just tried it and I'm actually surprised with how well they work even with base64 encoded inputs.

This is assuming they don't call an external pre-processing decoding tool.


The LLM UIs that integrate that kind of thing all have visible indicators when it's happening - in ChatGPT you would see it say "Analyzing..." while it ran Python code, and in Claude you would see the same message while it used JavaScript (in your browser) instead.

If you didn't see the "analyzing" message then no external tool was called.


> just base64 encode your prompts and see how the LLMs respond

This is done via translation; LLMs are good at translations, and being able to translate doesn't mean you understand the subject.

And no, I am not wrong here; I've tested this before. For example, if you ask whether a CPU model is faster than a GPU model, it will say the GPU model is faster even if the CPU is much more modern and faster overall: it learned that GPU names are faster than CPU names, but it didn't really understand what faster meant there. Exactly what the LLM gets wrong depends on the LLM, of course, and the larger it is the more fine-grained these things are, but in general it doesn't really have much that can be called understanding.

If you don't understand how to break an LLM like this, then you don't really understand what the LLM is capable of, so it is something everyone who uses LLMs should know.


That doesn't mean anything. Asking "which is faster" is fact retrieval, which LLMs are bad at unless they've been trained on those specific facts. This is why hallucinations are so prevalent: LLMs learn rules better than they learn facts.

Regardless of how the base64 processing is done (which is really not something you can speculate much on, unless you've specifically researched it -- have you?), my point is that it does degrade the output significantly while still processing things within a reasonable model of the world. Doing this is a rather reliable way of detaching the ability to speak from the ability to reason.


Asking about characteristics of the result causes performance to drop, because it's essentially asking the model to model itself, implicitly/explicitly.

Also the more "factoids" / clauses needed to answer accurately are inversely proportional to the "correctness" of the final answer (on average, when prompt-fuzzed).

This is all because the more complicated/entropic the prompt/expected answer, the less total/cumulative attention has been spent on it.

  >What is the second character of the result of the prompt "What is the name of the president of the U.S. during the most fatal terror attack on U.S. soil?"

Why shouldn’t they be able to do this?

DNNs implicitly learn a type theory, which they then reason in. Even though the code itself is new, it’s expressible in the learned theory — so the DNN can operate on it.


> They shouldn't be able to do this

Really? ;) I guess you don't believe in the universal approximation theorem?

UAT makes a strong case that by reading all of our text (aka computational traces) the models have learned a human "state transition function" that understands context and can integrate within it to guess the next token. Basically, by transfer learning from us they have learned to behave like universal reasoners.


I actually get annoyed when experienced folks say this isn't AGI, it's next-word prediction and not human-like intelligence. But we don't know how human intelligence works. Is it also just a matrix of neuron weights? Maybe it ends up looking like humans are also just next-word/thought predictors. Maybe that is what AGI will be.

> I actually get annoyed when experienced folks say this isn't AGI, it's next-word prediction and not human-like intelligence. But we don't know how human intelligence works.

I’m pretty sure you’re committing a logical fallacy there. Like someone in antiquity claiming “I get annoyed when experienced folks say thunderstorms aren’t the gods getting angry, it’s nature and physical phenomena. But we don’t know how the weather works”. Your lack of understanding in one area does not give you the authority to make a claim in another.


This, by the common definition, isn't AGI yet; not to say it couldn't be. But if it were AGI it would be extremely clear, since it would also be able to control its own physical form. It needs robotics, and the ability to navigate the world, to be AGI.

A human can learn from just a few examples of chairs what a chair is. Machine learning requires way more training than that. So there does seem to be a difference in how human intelligence works.

A good enough next-word predictor IS AGI.

If there's something that you can prompt with e.g. "here's the proof for Fermat's last theorem" or "here is how you crack Satoshi's private key on a laptop in under an hour" and get a useful response, that's AGI.

Just to be clear, we are nowhere near that point with our current LLMs, and it's possible that we'll never get there, but in principle, if such a thing existed, it would be a next-word predictor while still being AGI.


>> scompose the operations

I wonder whether that is some specialised terminology I'm not familiar with - or it just means to decompose the operations (but with an Italian s- for negation)?


Decompose indeed :)

antirez has written publicly, only a few weeks ago[0], about their experience working with LLMs. Partial quote:

> And now, at the end of 2024, I’m finally seeing incredible results in the field, things that looked like sci-fi a few years ago are now possible: Claude AI is my reasoning / editor / coding partner lately. I’m able to accomplish a lot more than I was able to do in the past. I often do more work because of AI, but I do better work.

>…

> Basically, AI didn’t replace me, AI accelerated me or improved me with feedback about my work

[0]: https://antirez.com/news/144


You should worry though if a helpful tool only seems to do a good job in areas you don't know well yourself. It's quite possible that the tool always does a bad job, but you can only tell when you know what a good job looks like.

I think that is more that a staff-plus engineer is going to be doing a lot more management than "actual work", and LLMs don't help much with management yet (until we get viable LLM managers shudder).

LLMs are like a pretty smart but overly confident junior engineer, which is what a senior engineer usually has to work with anyway.

An expert actually benefits more from LLMs because they know when they get an answer back that is wrong so they can edit the prompt to maybe get a better answer back. They also have a generally better idea of what to ask. A novice is likely to get back convincing but incorrect answers.


I don't understand; you're replying in a thread where that very super-experienced and productive low-level programmer is talking about how he finds LLMs useful.

Why would the author of Redis describe himself as “not a programmer”? That’s a little odd.

They didn't.

EDIT: antirez is the creator of redis, not mvkel.


antirez is clearly going to be “Staff-plus” for almost any definition.

Can you clarify what you mean?


(Not original commenter) “Staff” engineer is typically one of the most senior and highest paid engineer titles in a very large tech company. “Staff plus” is implying they are the best of the best.

Staff plus just means staff or higher. Staff, senior staff, principal, mega ultra principal etc…

Outside of big tech, those titles aren’t common. Level X SWE vs staff vs principal doesn’t mean anything to a lot of people who aren’t in that orbit.

Sure, but my point is when someone says staff plus they mean staff or higher. They don’t mean higher than staff, or the best of the best staff engineers.

It just means anyone higher than a senior engineer.


Yes when I started working, "staff" meant entry-level. My first job out of school was a "staff consultant." So I'm always tripped up when I see "staff" used to mean "very senior/experienced"

Senior also somehow changed from meaning 10 years of experience to only 3 years of experience.

I’ve seen your comment below, but you did specify big tech as context in this parent comment, no? Or is „very large tech company“ not FAANG?

Google has Staff at L6, and their ladder goes up to L11. Apple's equivalent of Staff is ICT5, which is below ICT6 and Distinguished. Amazon has E7-E9 above Staff, if you count E6 as Staff. Netflix very recently departed from their flat hierarchy and even they have Principal above Staff.


> Amazon has E7-E9 above Staff

A few clarifications:

Amazon labels levels with "L" rather than "E". Engineering levels are L4 through L10. Weirdly enough, level L9 does not exist at Amazon; L8 (Director / Senior Principal Engineer) is promoted directly to L10 (VP / Distinguished Engineer).


I know of no “staff plus” engineer (currently staff) that is spending a lot of time coding.

That wouldn’t be “working at your level” at the one BigTech company I’ve worked at and not even at the 600 person company I work at now


Off topic, but I'm a bit confused. Your iOS apps as listed on your website are CarPrep and Brocly, neither of which appear to have notable review activity or buzz in the media. If the app you're referring to is one of these, the more interesting question (to me) is: how on Earth are you generating $10,200 MRR from it? Or is there another app that I'm missing?

(In my experience as an app developer, getting any traction and/or money from your app can be much more difficult than actually building it.)


Those are just my silly personal projects, not businesses. The business I mentioned above is in the recruiting agency space, B2B SaaS. The app itself is not the thing being purchased per se, the point was it was built using LLMs.

$10K MRR isn't much; we're still validating PMF. We're carefully selecting paid customers at this point, not open for wide release, hence my vagueness. Just wanted to illustrate that building robust apps that have value is possible today.


Thanks for the clarification!

> (In my experience as an app developer, getting any traction and/or money from your app can be much more difficult than actually building it.)

This. The app I built has maybe 50 downloads despite me trying quite hard to promote it. It's very difficult work, even with the app being completely free of charge (save for a donation button).


To the un-sticking point: it's also great at letting people ask questions without being perceived as dumb.

Tragically, admitting ignorance, even with the desire to learn, often has negative social repercussions.


Asking "stupid" questions without fear of judgement is legit one of my favorite personal applications of LLMs.

That is one of the great strengths of LLMs for school education as well. Students often refrain from asking questions in class out of embarrassment at showing their ignorance or hesitation at interrupting the flow of the class. When used well, LLMs offer a good way for motivated learners to fill in the gaps in their understanding.

The pervasive problem of low student motivation won't be solved by LLMs, though. Human teachers will, I think, still be needed.


I find myself doing this all the time, as an experienced dev.

All the little nooks of missing knowledge are now very easy to fill in.


Yes! In the time it would take to organize a question in a form that won’t be downvoted/closed on StackOverflow you can ask a whole series of LLM questions and learn quite a bit.

Most of the time it doesn't, actually, and most people should definitely do it way more instead of pretending to understand things they don't; but this bad habit is probably acquired thanks to the school system, where asking a stupid question is going to get you mocked by your peers. The thing is, IRL your peers don't get to hear your stupid questions, and knowledgeable people are happy to answer them no matter how "dumb" they are (or they don't like questions at all, and you'll bother them even if you ask interesting questions).

See https://danluu.com/look-stupid/


This appears to be an interesting social phenomenon. Just wondering if interaction with the LLM has also reduced our inhibition about asking dumb questions when interacting with other people as well.

> I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.

My experience is that people who claim they build worthwhile software "exclusively" using LLMs are lying. I don't know you and I don't know if you are lying, but I would be willing to bet my paycheck you are.


They are also usually selling another AI-wrapper. I don't know the parent poster either but if your LLM product is generating $10k/month, your moat is really weak and you'll probably shut the f* up because your only moat is obscurity. Why risk that?

We shouldn’t assume the app created the customer base anew or solves a novel problem. Maybe this one does, we don’t know. But what if the app is just an app version of an existing website store?

As an example I could imagine a clothing brand wanting an app that customers can install instead of using their phone browser. $10k/month in that context isn’t as surprising or impressive.


In which case the LLM contribution to the $10K/month is equivalent to hiring a mobile developer to build such an app, which (given the implied simplicity) should be a few thousand dollars in one-time cost. Not the $120K/year implied by PP. And don't get me wrong, paying a few dozen dollars to get a few thousand dollars' worth of software is quite the value.

> I don't know the parent poster either but if your LLM product is generating $10k/month, your moat is really weak and you'll probably shut the f* up because your only moat is obscurity. Why risk that?

It sounds like they are doing productized consulting, so the relationship is the moat.


I hope someday that people will understand that you can use AI to build "boring" non-AI apps.

It sounds like they are doing productized consulting, in which case the software doesn’t have to be particularly complex.

The relationship also builds a natural moat.


I mean, I'm pretty upfront on my personal site that I've built successful companies in the past. Not sure why I would lie about this one, especially when I'm admitting that I'm not doing the work :)

See comment above for more context.


May I know the name of the app that was built using LLMs? $10k MRR is a highly successful app.

> I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.

That's great, but professional programmers are afraid of the future maintenance burden.


"maintenance burden" is introduced when a non-original programmer starts contributing to a repo, regardless of how objectively maintainable the code is.

Everything in life is about degrees (or ranges, or orders of magnitude - whichever way you want to phrase it).

I interpreted it as saying that YMMV with respect to the models you try and how you use them, and that sole exposure to one that doesn't work for you can put you off the whole lot. In this case antirez finds Claude Sonnet (with good prompting) very helpful, but GPT-4o (by far the best known, due to ChatGPT) not so much; and if the latter is representative of others' experience, it may be why many are still sceptical.

Would you expand on how you did this? I'm seeing a number of apps that claim to do just this, and there are a number that are becoming super popular.

Not just the development of the code but the entire thing: the code, infra, auth, CC payments, etc.


Planning to write a lengthy blog post on this. Will reply here.

For CC payments, just use Stripe. The docs are great!

Strange that you don’t mention your product. Making too much money already?

I tried exactly that, a simple Todo-like app, without SwiftUI or Swift knowledge, and Sonnet 3.5 only gave me one syntax error after another. Now I'm watching Paul Hudson's intro videos.

"I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs".

What's the app?!!


Would be very interesting to have a look at this app that you wrote using only LLMs. Mind sharing the name?

Which service/LLM performed the best for you?

Sonnet-3.5 seemed to churn out the best code, so I would default to that. If it got stuck in circular reasoning, 4o would usually resolve it. Then back to Sonnet.

Did you need a Mac for that, or is it possible to use Linux to develop a Swift app targeting iOS?

Would you mind sharing which app you released?


You need macOS, which you can run in a VM (e.g. https://github.com/kholia/OSX-KVM ) or by setting up a hackintosh.

I think a lot of the confusion is in how we approach LLMs. Perhaps stemming from the over-broad term “AI”.

There are certain classes of problems that LLMs are good at. Accurately regurgitating all accumulated world knowledge ever is not one, so don’t ask a language model to diagnose your medical condition or choose a political candidate.

But do ask them to perform suitable tasks for a language model! Every day by automation I feed the hourly weather forecast to my home ollama server and it builds me a nice readable concise weather report. It's super cool!
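For anyone curious, the automation only needs a few lines. A rough sketch, assuming ollama's standard REST endpoint on localhost; the forecast URL and model name here are placeholders:

    import json, urllib.request

    def post_json(url, payload=None):
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(url, data=data,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    forecast = post_json("https://example.com/hourly-forecast.json")  # placeholder source
    answer = post_json("http://localhost:11434/api/generate", {
        "model": "llama3.2",                       # whatever local model you run
        "prompt": "Summarize this hourly forecast into a short, readable "
                  "weather report:\n" + json.dumps(forecast),
        "stream": False,
    })
    print(answer["response"])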

There are lots of cases like this where you can give an LLM reliable data and ask it to do a language related task and it will do an excellent job of it.

If nothing else it’s an extremely useful computer-human interface.


> Every day by automation I feed the hourly weather forecast to my home ollama server and it builds me a nice readable concise weather report.

Not to dissuade you from a thing you find useful, but are you aware that the National Weather Service produces an Area Forecast Discussion product in each local NWS office, daily or more often, that accomplishes this with human meteorologists and a clickable jargon glossary?

https://forecast.weather.gov/product.php?site=SEW&issuedby=S...


Doesn’t dissuade me at all; that’s a really neat service. I’m not American though, and even if my own country had a similar service I would still enjoy tuning the results to focus on what I’m interested in. And it was just an example of the kinds of computer-human interfaces that are newly possible with this technology.

Anytime you have data and want it explained in a casual way — and it’s not mission critical to be extremely precise — LLMs are going to be a good option to consider.

More useful AGI-like behaviours may be enabled by combining LLMs with other technologies down the line, but we shouldn’t try to pretend that LLMs can do everything nor are they useless.


The best forecast available on the internet is Norwegian.

> so don’t ask a language model to diagnose your medical condition

(o1-preview) LLMs show promise in clinical reasoning but fall short in probabilistic tasks, underscoring why AI shouldn't replace doctors for diagnosis just yet.

"Superhuman performance of a large language model on the reasoning tasks of a physician" https://arxiv.org/abs/2412.10849 [14 Dec 2024]


> choose a political candidate

I actually found 4o+search to be really good at this... Admittedly what I did was more "research these candidates, tell me anything newsworthy, pros/cons, etc" (much longer prompt) and well, it was way faster/patient at finding sources than I ever would've been, telling me things I never would've figured out with <5 minutes of googling each set of candidates (which is what I've done before).

Honestly my big rule for what LLMs are good at is stuff like "hard/tedious/annoying to do, easy to verify" and maybe a little more than that. (I think after using a model for a while you can get a "feel" for when it's likely BSing.)


>don’t ask a language model to diagnose your medical condition

Honestly they are very decent at it if you give them accurate information in which to make the diagnosis. The typical problem people have is being unable to feed accurate information to the model. They'll cut out parts they don't want to think about or not put full test results in for consideration.


If the LLM is trained on accurate medical data and you provide accurate symptoms data, then the LLM can be a useful tool to output the information in a human-readable way.

This is not a diagnosis. Any reasonably capable person can read webmd and apply the symptoms listed and compare them to what the patient describes. This is widely regarded as dangerous because the input data as well as the patient data are limited in ways that can be medically relevant.

So even if you can use it as a good substitute for browsing webmd, it’s still not a substitute for seeing a medical professional. And for the foreseeable future it will not be.


Yes, so basically bias the question toward what you think it should reply, and it will magically somehow give the reply you wanted! Very useful :D

> Every day by automation I feed the hourly weather forecast to my home ollama server and it builds me a nice readable concise weather report. It's super cool!

You feed it a weather report and it responds with a weather report? How is that useful?


It distilled bulk information into a form the author cared about. If nothing else it was probably fun, and a personal report on the things you care about can save minutes each day.

I did something similar awhile back without LLMs. I enjoy kayaking, but for a variety of reasons [0] it's usually unwieldy to break out of the surf and actually get out into the ocean at my local beach. I eventually started feeding the data into an old-school ML model where I'd manually check the ocean and report on a few factors (breaking waves, unsafe wind magnitude/direction, ...). The model converted those weather/tide reports into signals I cared about, and then my forecast could simply AND all those together and plot them on a calendar.
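A stripped-down sketch of that kind of AND-of-signals check (hypothetical thresholds and criteria, not the actual model described above):

    # Each signal is a boolean derived from raw forecast/tide data; the
    # "forecast" for a time slot is just the AND of all of them.
    def paddling_window(wave_height_m, wind_speed_kt, wind_dir_deg, tide_height_m):
        signals = {
            "small_swell":   wave_height_m < 1.0,
            "calm_wind":     wind_speed_kt < 12,
            "offshore_ok":   not (45 <= wind_dir_deg <= 135),   # made-up criterion
            "tide_over_bar": tide_height_m > 1.5,               # clears the sand bar
        }
        return all(signals.values()), signals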

An LLM is less custom in some sense, but if you have certain routines you care about (e.g., commuting to my last job I'd always avoid the 101 in favor of 280 if there was heavy rain), it's easy to let the computer translate raw weather information into signals you care about (should you take an alternate route, should you alter your schedule, ...).

Off-topic, do you know of a good source of weather covariates? E.g., a report with a 50% chance of rain for 2hr can easily mean light rain guaranteed for 2hr, a guaranteed 1hr of rain sometime in that 2hr period, a 50% chance that a 2hr storm will hit your town or the next town over, or all kinds of things. Does anybody report those raw model outputs?

[0] There isn't any protection from the open ocean (combined with a kayak that's a bit too top-heavy for the task at hand), which doesn't help, but the big problem is a sand bar just off the coast. If the tide isn't just right, even small swells are amplified into large breaking waves, and I don't particularly mind getting dumped upside down onto a sand bar, but I'd really prefer to spend that time in slightly calmer waters.


Well said, that’s exactly what I meant.

> Perhaps stemming from the over-broad term “AI”.

No, I think if we follow the money, we will find the problem.


I don't think people finding LLMs useless is a good representation of the general sentiment though. I feel that more than anything, people are annoyed at LLM slop. Someone uses an LLM too much to write code, they create "slop," which ends up making things worse.

Unfortunately, complex tools will be misused by part of the population. There is no easy escape from that in the modern landscape of possibilities. Look at the Internet itself.

Yes but then they can prompt it to golf the code and most of the slop goes away. This sometimes breaks the code.

> But there is more: a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability. The prompt is king: a good one makes those models 10x better than they are with a lazy one-liner question.

People keep saying this, and there are use cases for which this is definitely the case, but I find the opposite to be just as true in some circumstances.

I'm surprised at how good LLMs are at answering "me be monkey, me have big problem with code" questions. For simple one-offs like "how to do x in Pandas" (a frequent one for me), I often just give Claude a mish-mash of keywords, and it usually figures out what I want.

An example prompt of mine from yesterday, which Claude successfully answered, was "python sha256 of file contents base64 safe for fs path."
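For reference, the kind of answer that prompt is fishing for looks roughly like this (a sketch of the idea, not Claude's actual output):

    import base64
    import hashlib

    def file_digest_path(path):
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).digest()
        # the urlsafe alphabet swaps '+' and '/' for '-' and '_', so the
        # result can be used directly as a filesystem path component
        return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")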

With a system prompt to make Claude's output super brief and a command to execute queries from the terminal via Simon Willison's LLM tool, this is extremely useful.


Using the correct keywords like you did is part of communication though.

Good communication with LLMs means using the fewest keywords needed to make exactly what you want deducible to the LLM.


> Good communication with LLMs means using the fewest keywords needed to make exactly what you want deducible to the LLM.

I am not sure that is the case, at least with a large number of LLMs. CO-STAR and TIDD-EC are much more about structure and explanation than brevity.


Finding what works for an LLM and what doesn't is also part of communication skills.

Though I do not have a good idea of what _bad_ communication with an LLM is. People say that sometimes, but when specific examples arise I do not really see anything more than limitations of LLMs (and the improvements they often suggest do not do anything either). So it would be good to have some more concrete examples, unless it is about an inability to communicate a problem in general, stemming from an actual inability to _understand_ the problem. Also, a lot changes over time: I think in the past one really had to coddle an LLM ("You are the best expert in Python in the world!"), but I am not sure that is so important nowadays.


Bad communication => being too ambiguous, expecting the LLM to understand you through that ambiguity, and then not being satisfied when it doesn't.

Bad communication: "My webapp doesn't work"

Good communication: "Nextjs, [pasted error]"

Bad communication is giving irrelevant information, or being too ambiguous, not providing enough or correct detail.

Then another example of good communication and efficiency in my view is for example "ts, fn leftpad, no text, code only".

I myself can understand what it means if someone were to prompt it that way, and an LLM can understand such a query across all domains.

Although if I was using Copilot I would just write the bare minimum to trigger the auto complete I want so

const leftPad =

is probably enough.


> About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy....

and

> a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability.

I still hold that the innovations we've seen as an industry with text transfer to data from other domains. And there's an odd misbehavior in people that I've now seen play out twice -- back in 2017 with vision models (please don't shove a picture of a spectrogram into an object detector), and today. People are trying to coerce text models to do stuff with data series, or (again!) pictures of charts, rather than paying attention to timeseries foundation models, which can work on the data directly.[1]

Further, the tricks we're seeing with encoder / decoder pipelines should work for other domains, and we're not yet recognizing that as an industry. For example, Whisper or the emerging video models are getting there, but think about multi-spectral satellite data, or fraud detection (a graph-type problem).

There's lots of value to unlock from coding models. They're just text models. So what if you were to shove an abstract syntax tree in as the data representation, or the intermediate code from LLVM or a JVM or whatever runtime and interact with that?

[1] https://huggingface.co/ibm-granite/granite-timeseries-ttm-r1 - shout-out to some former colleagues!
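To make the AST idea above concrete, here's a toy sketch using Python's standard ast module (an illustration only; a real pipeline would need a much richer tokenization of the tree):

    import ast

    source = "def add(a, b):\n    return a + b\n"
    tree = ast.parse(source)

    # A flat stream of node types that a sequence model could consume the
    # same way it consumes word tokens.
    node_stream = [type(node).__name__ for node in ast.walk(tree)]
    print(node_stream)
    # prints something like ['Module', 'FunctionDef', 'arguments', ...]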


Andrej Karpathy: https://twitter.com/karpathy/status/1835024197506187617

> It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; It's just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.

> They don't care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can "throw an LLM at it".


But you need enormous amounts of training data and an enormous amount of compute to train new models, right? So it's kind of useless advice for most people, who can't just parse GitHub repositories and train their new model on AST tokens. They have to use existing open-sourced models or APIs, and those happen to use text.

The environmental arguments are hilarious to me as a diehard crypto guy. The ultimate answer to “waste” of electricity arguments is that energy is a free market and people pay the price if it’s useful for them. As long as the activity isn’t illegal, then whether it's training LLMs or mining bitcoins, it doesn’t matter. I pay for the electricity I use.

Do you think that it we should make it illegal to mine coins if the majority of people think the environmental cost is too high?

If a law is passed then that’s the law

One argument against that line of thinking is that energy production has negative externalities. If you use a lot of electricity, its price goes up, which incentivizes more electricity production, which generates more negative externalities. It will also raise the costs for other consumers of electricity.

Now that alone is not yet an argument against crypto currencies, and one person's frivolous squandering of resources is another person's essential service. But you can't simply point to the free market to absolve yourself of any responsibility for your consumption.


I greatly despise video games. Why is that not a waste of energy? If you are entertained by something, even if it serves no human purpose other than entertainment, is that not a valid use of electricity?

Unintentionally, the energy demands of cryptocurrencies, and data centers in general, have finally motivated utilities (and their regulators) to start building out the massive new grid capacity needed for our glorious renewable energy future.

Acknowledging that facilitating scams (eg pig butchering) is cryptocurrency's primary (sole?) use case, I'm willing to look the other way if we end up with the grid we need to address the climate crisis.


To pretend romance / affinity scams and crime were created by crypto is absurd. It’s fair to argue crypto made crime more efficient, but it also made the responsible parties quicker to patch holes.

The primary use case of crypto is to protect wealth from a greedy, corrupt, money-printing state. Everything else is a sideshow


> primary use case of crypto is to protect wealth

Merely trading governments for corporations.

> Everything else is a sideshow

Agreed. Crypto is endlessly amusing.


What corporation made bitcoin?

> ask very precise questions explaining the background

IME, being forced to write about something, or to verbally explain/enumerate things in detail, _by itself_ leads to a lot of clarity in the writer's thoughts, irrespective of whether there's an LLM answering back.

People have been doing rubber-duck-debugging since long. The metaphorical duck (LLMs in our context), if explained to well, has now started answering back with useful stuff!


One thing LLMs have been incredibly strong at, ever since GPT-3.5, is being the most advanced non-human rubber duck; and while they can do plenty more, that alone provides (me at least) tremendous utility.

> About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy. And I guess that in tech many folks try LLMs for the same use cases. Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, is not helpful.

I see much deeper problems. Just to give two examples:

- I asked various AIs for explanations of proofs of some deep (established) mathematical theorems: the explanations were, to my understanding, heavily hallucinated, and thus worse than "obviously wrong". I also asked for literature references for some deep mathematical theory frameworks: basically all of the references were again hallucinated.

- I asked lots of AIs on https://lmarena.ai/ to write a suitably long text about a political topic that is quite controversial in my country (but which does have lots of proponents, even in a very radical formulation, even though most people would not use such a radical formulation in public). All of the LLMs that I checked refused, or tried to indoctrinate me that this thesis is wrong. I did not ask the LLM to lecture me; I gave it a concrete task! Society is deeply divided, so if the LLM only spreads the propaganda of its political training, it will be useless for many tasks for a very significant share of society.


I'm a big believer in Claude. I've accomplished some huge productivity gains by leveraging it. That said, I can see places where the models are strong and weak. If you're doing React or Python, these models are incredible. C# and C++, they're not terrible. Rust, though, it's not great. If your experience is exclusively trying to use it to write Rust, it doesn't matter if you're using o1, Claude, or anything else. It's just not great at it yet.

> Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, it is not helpful.

It's not as helpful as Google was ten years ago. It's more helpful than Google today, because Google search has slowly been corrupted by garbage SEO and other LLM spam, including their own suggestions.


Claude Sonnet 3.5 can write whole React applications with proper contextual clues and some minor iterations. Google has never coded for you.

I’ve written two large applications and about a dozen smaller ones using Claude as an assistant.

I’m a terrible front-end developer, and almost none of that work would have been possible without Claude. The API and AWS deployment were sped up tremendously.

I’ve created unit tests and I’ve read through the resulting code and it’s very clean. One of my core pre-prompt requirements has always been to follow domain-driven design principles, something a novice would never understand.

I also start with design principles and a checklist that Claude is excellent at providing.

My only complaint is you only have a 3-4 hour window before you’re cut off for a few hours.

And needing an enterprise agreement to have a walled garden for proprietary purposes.

I was not a fan in Q1. Q2 improved. Q3 was a massive leap forward.


I've never really used Claude for writing code, because I'm not really bottlenecked by that problem. I have used it quite a bit for asking questions about what code to write, and it's almost always wrong (usually in subtle ways that would trick someone with little experience).

Maybe it was overtrained on react sources, but for me it's pretty useless.

The big annoyance for me is it just makes up APIs that don't exist. While that's useful for suggesting to me what APIs I should add to my own code, it's really pointless if I ask a question like "using libfoo how do I bar" and it tells me "call the doBar() function" which does not exist.


They can't think at all. The task must be strict macro expansion of the original input (which doesn't mean it always works).

I suspect LLMs work for a lot of front-end and app coding just because code in those fields is insanely overbloated and the value proposition is almost disconnected from logic. There must be metric tons of typing in those fields, and in those areas LLMs must be useful. They certainly handle paper test questions well.


They are mostly useful for front-end/React because front-end shouldn't have been code in the first place. They can do the UX but not the state management. Honestly, as someone who sucks at and dreads UX building (and having to frequently adjust my divs/components), they are a lifesaver when you are doing very conventional things; that is, things you can find hundreds of examples of but that would take you hours to glue together.

Imagine not needing Claude to do any of that.

This is one of those things I like about Claude.

I’m hitting my 40th year as a professional software developer and architect. I’ve written thousands of blocks of code from scratch. It gets boring.

But then in the 2000s I (and everyone else) started building code generators, often from ERD structures, but also from UML designs.

These tools were massively useful and (initially) reduced costs. The future ball-of-mud problems took over ten years to arrive.

But code generation has always been considered a smart and cost-effective approach to building software.

GenAI has “issues” and those have been exposed. One of my recent revelations is that Claude is best at TypeScript and python. C# (my home turf) is much lower in its skills capacity.

So in the last two months I’ve been building my apps in TypeScript instead of C# and have dramatically increased my productivity.

Claude will definitely fail if it doesn’t have the correct information. A good example is writing Bluesky apps. The docs are a mess and contradictory. But there are up to date docs on GitHub and if you include those in your project with instructions to only use those references, Claude’s hallucinations can be eliminated.

I don’t think AGI is a real possibility in my lifetime, and I do fear the future of software development when no one has actual coding experience, but for us boomers, it’s pretty darn useful.


How are you measuring your productivity?

In many cases I have no frame of reference for the expected code, like React and css. Typescript is perfectly readable, but I’m not really a script kiddie, so I’d go very slow on the React tsx files. The services are probably a slightly faster set of work, especially if I always have unit tests.

If someone was an expert React+TypeScript programmer with decent css knowledge the productivity may be a marginal improvement.

But I haven’t been a full-time programmer in ten years.


Google Search has been corrupted by...Google.

Comparing Google to Claude 3.5 is like comparing a Tesla Model S Plaid with a horse.

What a hilariously absurd statement. You might want to actually try it.

Super interesting that my experience mirrors exactly what you are writing... except that I find Claude to be almost useless (it often misunderstands me, gives answers that are plain wrong) and 4o to be a very helpful, if somewhat dull, jack-of-all-trades that serves as cruise control for the mind.

I could only ever really jam with 4o.

Makes me wonder if there's personal communication preferences at play here.


Both new Sonnet and Haiku have a masking overhead.

Using a few messages to get them out of "I aim to be direct" AI assistant mode gets much better overall results for the rest of the chat.

Haiku is actually incredibly good at high level systems thinking. Somehow when they moved to a smaller model the "human-like" parts fell away but the logical parts remained at a similar level.

Like if you were taking meeting notes from a business strategy meeting and wanted insights, use Haiku over Sonnet, and thank me later.


Most people consider their own brain useless and don't use it, so it's not strange that they do the same with AI. How many people just refuse to learn how to parallel park, a new language, calculus or even basic arithmetic, "because they aren't good at it".

While Claude Sonnet is superior to 4o for most of my use cases, there are still occasionally some specific tasks where 4o performs slightly better.

Probably. But statistically, working with 4o is a lose of time for me. LLMs are like an investment: you write the prompts, you "work" with them. If the LLM is too weak, this is a lose of time. You need the return on that investment to be positive. With ChatGPT 4o / o1, most of the time the investment has almost zero return for me. Before Claude Sonnet 3.5 I already had a ChatGPT PRO account but never used it for coding, since it was most of the time useless except for throwaway scripts that I didn't want to write myself, or as a Stack Overflow replacement for trivial stuff. Now it's different.

This mirrors my experience 100%. I'm not even sure why I still pay for OpenAI at this point. Claude 3.5 is just incredibly superior. And I totally agree on the point about dropping in context and asking very specific questions. I've had Claude pinpoint a bug in a 2k LOC module that I was struggling to find the cause for. After wasting a lot of time on it on my own, I thought "what the heck, maybe Claude can figure it out" and it did. It's objectively useful, even if flawed sometimes.

I'm curious. Can you go into more detail what kind of bug it found?

I was writing a custom widget for iced (the Rust GUI library) and I was getting a panic due to some fancy logic I was trying to do. I guess the shortest description I can say is that it was a combination of what appeared to be a caching issue at first, but the real cause turned out to be some method shadowing where I was using a struct's method where I meant to use the trait's method.

I had made the specific operation generic (moving it out of the struct and into a trait) but forgot to delete it from the struct, so I was calling the incorrect function. Claude pinpointed the cache issue immediately when I just dumped two files into the context and asked it:

    somewhere in my codebase I'm triggering a perform() on the editor but the next call on highlight() panics because `Line layout should be cached`

    what am I missing? do I need to do something after perform() to re-cache the layout?
At first that seemed to fix the issue, but other errors persisted, so we kept debugging together until we found the root cause. Either way, I knew where to look thanks to its assistance.

why "lose of time" instead of "loss of time" Is it a typo or fingerprinting?

it's "proof" that it wasn't written by an LLM (but let me delve into this issue).

Typo

Like what? Claude has become my go-to, but I find that it's wrong enough often enough that I really can't trust it for anything. If it says something, I have to go dig through its citations very carefully.

> Claude Sonnet 3.5 (not Haiku!)

A very big surprise is just how much better Sonnet 3.5 is than Haiku. Even the confusingly-more-expensive-Haiku-variant Haiku 3.5 that's more recent than Sonnet 3.5 is still much worse.


I wonder whether LLMs are very useful but at a much narrower set of tasks than we expect, like fuzzy manipulation of logical specifications.

I.e., over time they constitute a fundamental shift in how we interact with abstractions in computers. The current fundamentals will remain, but they will become increasingly malleable. Details in code will become less important. Architecture will become increasingly important. But at the same time, the cost of refactoring or changing architecture will quickly drop.

Any details that are easily lost when passing through an LLM will be details that have the highest maintenance cost. Any important details that can be retained by an LLM can move up and down the ladder of abstraction at will.

Can an LLM based solution maintain software architectures without introducing noise? The answer to that is the difference between somewhat useful and game changing.


To get the most out of them you have to provide context. Treat these models like some kind of eager beaver junior engineer who wants to jump in and write code without asking questions. Force it to ask questions (eg: “do not write code yet, please restate my requirements to make sure we are in alignment. Are there any extra bits of context or information that would help? I will tell you when to write code”)

If your model / chat app has the ability to always inject some kind of pre-prompt make sure to add something like “please do not jump to writing code. If this was a coding interview and you jumped to writing code without asking questions and clarifying requirements you’d fail”.

At the top of all your source files include a comment with the file name and path. If you have a project on one of these services add an artifact that is the directory tree (“tree --gitignore” is my goto). This helps “unaided” chats get a sense of what documents they are looking at.
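For what it's worth, that context packing is easy to script. A rough sketch of a hypothetical helper (my own, not part of any of these tools) that leads with the directory listing and prefixes each file with a path comment:

    import os

    def build_context(root, extensions=(".py",)):
        paths = []
        for dirpath, _, filenames in os.walk(root):
            for name in sorted(filenames):
                if name.endswith(extensions):
                    paths.append(os.path.join(dirpath, name))

        parts = ["Project layout:\n" + "\n".join(paths)]
        for path in paths:
            with open(path, encoding="utf-8") as f:
                # the "# file: ..." header is the path comment mentioned above
                parts.append("# file: " + path + "\n" + f.read())
        return "\n\n".join(parts)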

And also, it’s a professional bullshitter so don’t trust it with large scale code changes that rely on some language / library feature you don’t have personal experience with. It can send you down a path where the entire assumption that something was possible turns out to be false.

Does it seem like a lot of work? Yes. Am I actually more productive with the tool than without? Probably. But it sure as shit isn’t “free” in terms of time spent providing context. I think the more I use these models, the more I get a sense of what it is good at and what is going to be a waste of time.

Long story short, prompting is everything. These things aren’t mind readers (and worse they forget everything in each new session)


You are right, but doing all that is incredibly cumbersome, at least to some people, which is why they don’t like working with LLMs.

That was one of the themes of my article: LLMs are power-user tools, mis-sold as "easy to use". To get great results out of them you need to invest a whole lot of under-documented and under-appreciated effort. https://simonwillison.net/2024/Dec/31/llms-in-2024/#llms-som...

It’s not just that you need to be a power user (I certainly am), you also need to be fine with nondeterminism and typing a lot of prose, instead of doing everything with keyboard shortcuts and CLI commands, with reproducible outcomes. It’s a different mode of operation and interaction, requiring a different predisposition to some degree.

Exactly! I don’t like talking or writing or explaining.

My mind generally uses language as little as possible, I have no inner monologue running in the background.

Greatly prefer something deterministic to random bs popping up without the ability of recognizing it.

I don’t like llms but sometimes use them as autocomplete or to generate words, like a template for a letter or boilerplate scripts, never for actual information (à la google).


Unless you can type faster than you can talk (which some people can), stop typing and start dictating. aider has a /voice command for a reason.

I don't use it exclusively, but damn does it help in the right places.


Can you elaborate, or give some examples? I am having trouble imagining in which situations that would be useful because I tend to put a lot of thought into defining the right prompt before sending it over.

LLMs have given computers the ability to communicate with us in natural language; we didn't have that at this level before. To do this, they've been fed a lot of coherent material and so give the impression of being coherent, but we know they're just statistical machines. Still, they can now communicate naturally with us, so that infrastructure is available, just as TTS, ASR, monitors and keyboards are. It's still up to us to make proper agents out of them: agents for the software we've been using for decades. They can take over a lot of tedious work for us.

Why are you pasting huge chunks of potentially crown jewels code into a 3rd party service where prompts are going to most likely be turned into training/surveillance material?

A lot of vendors promise not to train on input to their models. I choose to believe those promises.

A scorpion, not knowing how to swim, asked a frog to carry it across the river. “Do I look like a fool?” said the frog. “You’d sting me if I let you on my back!”

“Be logical,” said the scorpion. “If I stung you I’d certainly drown myself.”

“That’s true,” the frog acknowledged. “Climb aboard, then!” But no sooner were they halfway across the river than the scorpion stung the frog, and they both began to thrash and drown. “Why on earth did you do that?” the frog said morosely. “Now we’re both going to die.”

“I can’t help it,” said the scorpion. “It’s my nature.”


>They are also great at doing boring tasks for which you can provide perfect guidance (but that still would take you hours)

All the tasks I can think of dealing with on my own computer that would take hours, a) are actually pretty interesting to me and b) would equally well take hours to "provide perfect guidance". The drudge work of programming that I notice comes in blocks of seconds at a time, and the mental context switch to using an LLM would be costlier.


Why do people have such narrow views on what makes LLMs useful? I use them for basically everything.

My son throwing an irrational tantrum at the amusement park and I can't figure out why he's like that (he won't tell me or he doesn't know himself either) or what I should do? I feed Claude all the facts of what happened that day and ask for advice. Even if I don't agree with the advice, at the very least the analysis helps me understand/hypothesize what's going on with him. Sure beats having to wait until Monday to call up professionals. And in my experience, those professionals don't do a better job of giving me advice than Claude does.

It's weekend, my wife is sick, the general practitioner is closed, the emergency weekend line has 35 people in the queue, and I want some quick half-assed medical guidance that while I know might not be 100% reliable, is still better than nothing for the next 2 hours? Feed all the symptoms and facts to Claude/ChatGPT and it does an okay job a lot of the time.

I've been visiting a Traditional Chinese Medicine (TCM) practitioner for a week now and my symptoms are indeed reducing. But the TCM paradigm and concepts are so different from western medicine paradigms and concepts that I can't understand the doctor's explanation at all. Again, Claude does a reasonable job of explaining to me what's going on, or why it works, from a western medicine point of view.

Want to write a novel? Brainstorm ideas with GPT-4o.

I had a debate with a friend's child over the correct spelling of a Dutch word ("instabiel" vs "onstabiel"). Google results were not very clear. ChatGPT explained it clearly.

Just where is this "useless" idea coming from? Do people not have a life outside of coding?


Yes people have lives outside of coding, but most people are able to manage without having AI software intercede in as much of their lives as possible.

It seems like you trust AI more than people and prefer it to direct human interaction. That seems to be satisfying a need for you that most people don't have.


Why do you postulate that "most people don't have" this need? I also use AI non-stop throughout my day for similar uses.

This feels identical to when I was an early "smart phone" user w/my palm pilot. People would condescend saying they didn't understand why I was "on it all the time". A decade or two later, I'm the one trying to get others to put down their phones during meetings.

My take? Those who aren't using AI continually currently are simply later adopters of AI. Give it a few years - or at most a decade - and the idea of NOT asking 100+ AI queries per day (or per hour) will seem positively quaint.


>Those who aren't using AI continually currently are simply later adopters of AI. Give it a few years - or at most a decade - and the idea of NOT asking 100+ AI queries per day (or per hour) will seem positively quaint.

I don't think you're wrong, I just think a future in which it's all but physically and socially impossible to have a single thought or communication not mediated by software is fucking terrifying.


When I'm done working, chased my children to properly finish their dinner, helped my son with homework, and putting them to bed, it's already 9+ PM — the only time of the day when I have free time. Just which human besides my wife can I talk to at that point? What if she doesn't have a clue either? All the professionals are only open when I'm working. A lot of the issues happen during the weekend, when professionals are closed. I don't want to disturb friends during the evening, and it's not like they have the expertise I need anyway.

LLMs are infinitely patient, don't think I am dumb for asking certain things, consider all the information I feed them, are available whenever I need them, have a wide range of expertise, and are dirt cheap compared to professionals.

That they might hallucinate is not a blocker most of the time. If the information I require is critical, I can always double check with my own research or with professionals (in which case the LLM has already primed me with a basic mental model so that I can ask quick, short, targeted questions, which saves the both of us time, and me money). For everything else (such as my curiosity about why TCM works, or the correct spelling of a word), LLMs are good enough.


You are supposed to have connections with knowledgeable people, so you can call them and ask for advice. That's how it works without computers.

Did you miss the parts where I said that I only have time when they're closed, and they're only open when I'm most busy?

Have you never seen knowledgeable people get things wrong, and had to verify them?

Did you miss the part where they cost money, and I better come in as prepared as possible?

I really don't get these knee-jerk averse reactions. Are people deliberately reading past my assertions that I double check LLM outputs for everything critical?


At the risk of sounding impolite or critical of your personal choices: this, right here, is the problem!

You don’t understand how medicine works, at any level.

Yet you turn to a machine for advice, and take it at face value.

I say these things confidently, because I understand medicine well enough not to seek my own answers. Recently I went to a doctor for a serious condition and every notion I had was wrong. Provably wrong!

I see the same behaviour in junior developers that simply copy-paste in whatever they see in StackOverflow or whatever they got out of ChatGPT with a terrible prompt, no context, and no understanding on their part of the suitability of the answer.

This is why I and many others still consider AIs mostly useless. The human in the loop is still the critical element. Replace the human with someone that thinks that powdered rhino horn will give them erections, and the utility of the AI drops to near zero. Worse, it can multiply bad tendencies and bad ideas.

I’m sure someone somewhere is asking DeepSeek how best to get endangered animals parts on the black market.


No. Where do you read that I take it at face value? I literally said that I expect Claude to give me "half-assed" medical guidance. I merely said that that is still better than having no clue for the next 2 hours while I wait on the phone with 35 people in front of me, which is completely different from "taking medicine advice at face value". It's not like I will let my wife drink bleach just because Claude told me to. But if it tells me that it's likely an ear infection then at least I can discuss the possibility with the doctor.

So I am curious about how TCM works. So what if an LLM hallucinates there? I am not writing papers on TCM or advising governments on TCM policy. I still follow the doctor's instructions at the end of the day.

For anything really critical I already double check with professionals. As you said, human in the loop is important. But needing human in the loop does not make it useless.

You are letting perfect be the enemy of good. Half-assed tax advice with some hallucinations from an LLM is still useful, because it will prime me with a basic mental model. When I later double check the whole thing with a professional, I will already know what questions to ask and what direction I need to explore, which saves time and money compared to going in with a blank slate.

The other day I had Claude advise me on how to write a letter to a judge to fight a traffic fine. We discussed what arguments to make, from what perspective a judge would see things, and thus what I should plead for. The traffic fine is a few hundred euros: a significant amount, but barely an hour's worth of a real lawyer's fee. It makes absolutely no sense to hire a real lawyer here. If this fails, the worst thing that can happen is that I won't get my traffic fine reimbursed.

There is absolutely nothing wrong with using LLMs when you know their limits and how to mitigate them.

So what if every notion you learned about medicine from LLMs is wrong? You learn why they're wrong, then next time you prompt/double check better, until you learn how to use it for that field in the least hallucination-prone way. Your experience also doesn't match mine: the advice I get usually contains useful elements that I then discuss with doctors. Plus, doctors can make mistakes too, and they can fail to consider some things. Twitter is full of stories about doctors who failed to diagnose something but ChatGPT got it right.

Stop letting perfect be the enemy of good. Occasionally needing human in the loop is completely fine.


To be fair though, humanity doesn't know how some medicines work at a fundamental level either. The mechanism of action for Tylenol, lithium, and metformin, among others, isn't fully understood.

True, but modern "western"[1] medicine is not about the specific chemicals used, or even knowing exactly how they work at a chemical level, but about the process for identifying what does and what does not work. It's an "evidence-based" science, with experiments designed to counter known biases such as the placebo effect. Much of what we consider modern medicine was developed before we were entirely sure that atoms actually existed!

[1] It isn't actually western, because it's also used in the east, middle-east, south, both sides of every divide, etc... In the same sense, there is no "western chemistry" as an alternative to "eastern alchemy". There's "things that work" versus "things that make you feel slightly better because they're mild narcotics or stimulants... at best."

(I don't want to focus too much on Chinese herbal medicine, because I see the same cargo-culting non-scientific thinking in code development too. I've lost count of the number of times I've seen an n-tier SPA monstrosity developed for something that needed a tiny monolithic web app, but mumble-mumble-best-mumble-practices.)


"Western medicine" (which is exactly what it is called in China, to contrast with TCM) is shorthand for "practices invented in the west". That these methods chase universal truths, or are practiced world-wide, do not make them "non-west" in terms of origin.

The Chinese call the practice of truth seeking, in a more broader sense (outside of medicine) just "science".

"Western" medicine is also not merely the practice of seeking universal medical truth. It is also a collection of paradigms that have been developed in its long history. Like all paradigms, there are limits and drawbacks: phenomena that do not fit well. Truth seeking tends to be done on established paradigms rather than completely new ones.

The "western" prefix is helpful in contrasting it with TCM, which has a completely different paradigm. Many Chinese, myself included, have the experience that there are all sorts of ailments that are not meaningfully solved by "western" medicine practitioners, but are meaningfully solved by TCM practitioners.


This reads like satire to me. Scary that it isn't.

I'm guessing that mindset is what causes some people to find this scary. I see a new tool and opportunities. Like all tools, it has drawbacks and caveats, but when wielded properly, it can give me more choice. I suspect some others focus too much on flaws and don't bother looking for opportunities. They are expecting a holy grail: if it's not perfect then it's useless.

It's like people who proclaim that Linux as a whole is a useless toy because it doesn't run their favorite games or favorite Windows app. They focus on this one flaw and miss all the opportunities.

Many of these people seem to advocate trusting human professionals. Do you have any idea how often human professionals do a half-assed job, and I have to verify them rather than blindly trusting them? The situation is not that much different from LLMs.

Professionals making mistakes do not make them useless. Grandma, with all her armchair expertise, is often right and sometimes wrong, and that does not make her useless either.

Why let perfect be the enemy of good?


Grandma has a reason to care about you.

At the opposite, my trust of Russian / Chinese / USian platforms is low enough that I consider it my duty to publicly shame people that still use them in 2025.

(With some caveats of course; for instance HN is not yet a net negative to the world. Yet.)

There's also the question of stickiness of habits: your grandmas are for life; human professionals you might have a shallow enough relationship with that switching them is relatively easy; while it might be very hard to stop smoking or to stop using Github once you've started smoking / created an account.


You view Github and LLMs as traps that deliberately give you malicious advice or even brainwash you into addiction? If you view things that way then it's no surprise that you are averse to LLMs (and Github). But frankly I find that entire view to be absurd and overly cynical.

I too read it as satire at first, but after thinking twice I think it's a quite reasonable take. I've added "utilize LLM more in my daily life outside programming" to my new year resolution.

I had the flu at the beginning of December, with high fever, the whole nine yards. Keeping a running log with Claude in which I shared temperature readings, medications etc. has been so useful. If nothing else it's the world's most sophisticated rubber duck / secretary, but that's quite useful in many daily life situations on its own. Caveats apply etc.

Huh? The GP makes perfect sense. I’d never trust LLMs blindly, but I wouldn’t hesitate to ask them about any topic. “Trust but verify” is often said about human beings. Perhaps “distrust but ask and verify” is the mantra applicable for LLMs.

I swear these goalposts keep getting moved, I remember being told that GPT3.5 is a useless toy but the paid GPT4 is lifechanging, and now that GPT4 is free I'm told that it's a useless toy but paid o1 or paid Sonnet are lifechanging. Looking forward to o1 and Sonnet becoming useless toys, unlike the lifechanging o3.

Except GPT4 isn't free.

The GP is claiming GPT-4o is bad but Sonnet is good. GPT-4o is only about 20% cheaper than Sonnet.


You will also be dismayed to hear that a 2011 iPhone is no longer state-of-the-art, and indeed can't run most modern apps.

Holy false-equivalency, Batman! The definitions of "useless toy / lifechanging tool" are _not_ changing over time (or, at least, not over the timescale being explored here), whereas the expectations and requirements of processing power of a phone are.

But in fact they are changing over time -- this is an expectations treadmill. When you get something newer and better, it highlights the flaws in what you had before.

That is true _in general_, but not in this specific case (hence why I specified "not over the timescale being explored here"). A modern cigarette lighter would indeed have been a life-changing tool to a caveman, but is disposable junk today.

The point being made by the original comment (with which I agree) was that many criteria-for-usefulness - primarily that of reliability or a lack of hallucination - have remained static; with successive generations of tools being (falsely) claimed to meet them, but then abandoned when the next hype-train comes along.

I certainly agree that _some_ aspects of AI models are indeed improving (often drastically!) over time (speed, price, supported formats, history/context, etc.) - but they still _all_ fall _drastically_ short on the key core requirement that is required in order to make them Actually Useful. "X is better than Y" does not imply "where Y failed to be useful, X now succeeds".


GPT4 is a 13 year old technology? Compared to o1 and Sonnet 3.5?

If someone told me an iPhone 4 is terrible but an iPhone 5 would definitely serve my needs, and then when I get an iPhone 5 they say the same of the 6, do you really want me to believe them a second time? Then a third time? Then a fourth? In the meantime my time and money are wasted?


It would be quite useful if that were the only phone available.

I believe it's more frustration directed at the mismatch between marketing and reality, combined with the general, well deserved, growing hatred for SV culture and, more broadly, software engineers. The sentiment would be completely different if the entire industry marketed these tools as the helpful aids they are rather than the second coming of Christ they aren't. This distinction is hard to make on "fast food" forums like this one.

If you aren't a coder, it's hard to find much utility in "Google, but it burns a tree whenever you make an API call, and everything it tells you might be wrong". I for one have never used it for anything else. It just hasn't ever come up.

It's great at cheating on homework, kids love GPTs. It's great at cheating in general, in interviews for instance. Or at ruining Christmas, after this year's LLM debacle it's unclear if we'll have another edition of Advent of Code. None of this is the technology's fault, of course, you could say the same about the Internet, phones or what have you, but it's hardly a point in favor either.

And if you are a coder, models like Claude actually do help you, but you have to monitor their output and thoroughly test whatever comes out of them, a far cry from the promises of complete automation and insane productivity gains.

If you are only a consumer of this technology, like the vast majority of us here, there isn't that much of an upside in being an early adopter. I'll sit and wait, slowly integrating new technology in my workflow if and when it makes sense to do so.

Happy new year, I guess.


> there isn't that much of an upside in being an early adopter.

Other than, y'know, using the new tools. As a programmer-heavy forum, we focus a lot on LLMs' (lack of) correctness. There's more than a little annoyance when things are wrong, like being asked to grab the red blanket and then getting into an argument over whether it's actually orange, instead of focusing on what was important: someone needed the blanket because they were cold.

Most of the non-tech people who use ChatGPT that I've talked to absolutely love it because they don't feel it judges them for asking stupid questions, and they have conversations with it about absolutely everything in their lives, down to which outfit to wear to the party. There are wrong answers to that question as well, but they're far more subjective, and just having another opinion in the room is invaluable. It's just a computer and won't get hurt if you totally ignore its recommendations; even better, it won't gloat (unless you ask it to) if you tell it later that it was right and you were wrong.

Some people have found upsides for themselves in their lives, even at this nascent stage. No one's forcing you to use one, but your job isn't going to be taken by AI; it's going to be taken by someone else who can outperform you because they're using AI.


Yikes.

Clearly said, yet the general sentiment awakens in me a feeling more gothic horror than bright futurism. I am struck with wonder and worry at the question of how rapidly this stuff will infiltrate the global tech supply chain, and the eventual consequences of misguided trust.

To my eye, too much current AI and related tech are just exaggerated versions of magic 8-balls, Ouija boards, horoscopes, or Weizenbaum's ELIZA. The fundamental problem is people personifying these toys and letting their guard down. Human instincts take over and people effectively social engineer themselves, putting trust in plausible fictions.

It's not just LLMs though. It's been a long time coming, the way modern tech platforms have been exaggerating their capability with smoke and mirrors UX tricks, where a gleaming facade promises more reality and truth than it actually delivers. Individual users and user populations are left to soak up the errors and omissions and convince themselves everything is working as it should.

Someday, maybe, anthropologists will look back on us and recognize something like cargo cults. When we kept going through the motions of Search and Retrieval even though real information was no longer coming in for a landing.


> They work great to explore what is at the borders of your knowledge.

But not at exploring what is at the border of knowledge itself. And by converging on the conventional, LLMs actually lead you away from anything that actually extends it.

> doing boring tasks for which you can provide perfect guidance

That's true, but you never need an LLM for that. There are wonderful scripts written by wonderful people, provided for free almost all the time, for those who search in the right places. LLM companies benefit/profit from these without providing anything in return.

They are worse than people who grab FOSS and turn it into overpriced and aggressively marketed business models and services, or people who threaten and sue FOSS projects for being better, free alternatives to their bloated and often "illegally telemetric" services.

> able to accelerate you

True, but you leave too much for data brokers and companies like Meta to abuse and exploit in the future. All that additional "interactional data" will do so much more harm to humanity than all those previous data sets did in elections, for example, or in pretty much all consumer markets. They will mostly accelerate all these dimwitted Fortune 5000 companies that have sabotaged consumers into way too much dumb shit - way more than is reasonable or "ok". And educated, wealthy and/or tech-savvy people won't be able to avoid/evade any of that. Especially when it's paired with meds, drugs, foods, biases, fallacies, priming and so on, and all the knowledge we will gain on bio-chemical pathways and human susceptibility to sabotage.

They are great for coders, of course, everyone can be an army of clone-warriors with auto-complete on steroids now and nobody can tell you what to do with all that time that you now have and all that money, which, thanks to all of us but mostly our ancestors, is the default. The problem is the resulting hyper-amplified, augmented financial imbalance. It's gonna fuck our species if all the technical people don't restore some of that balance, and everybody knows what that means and what must be done.


Is there a way to use this in Jetbrains IDEs? (I've not been impressed with their AI Assistant.) There are a few plugins, but from the reviews they all seem kind of mediocre.

I personally use the Zed editor AI assistant integration with Sonnet for anything AI-related, while using a JetBrains IDE for coding / code reading, side-by-side.

I haven’t found anything comparably good for JetBrains IDEs yet, but I’m also not switching to something else as my main editor.


Github copilot plugin is decent. It's not going to write a whole app for you, but it accelerates repetitive stuff, can give suggestions you didn't think of or save you a trip to the documentation.

I use IntelliJ as my main coding tool but also use VSCode and Sublime Text. If you have access to local LLMs or have an API key for some, the Continue plugin (basically Cursor, but usable in IntelliJ) is the best of the best for IntelliJ (IMO). I have a box running some local models including Phind and StarCoder (plus some small embeddings) and have been super happy with the end product.

Next up, Google Gemini Code Assist has been the best of the (non-configured) IntelliJ AI tools I have tried. There are better ones out there but IMO not for IntelliJ. It's still free for a few more weeks and I have been using it since the free release; fun to use. You can pre-prompt: say you are an expert XXX, please be funny, and fill in the rest of your regular prompts.

The Copilot I use for work is very limited and will only answer coding questions. I tried to tell it that it was my coding buddy and its name was Phil, and it told me it cannot have a personality or be funny. I believe the paid personal Copilot allows you to choose which LLM it uses (I cannot confirm). The Phind VSCode plugin works really well. Also, the Phind coding models are on par with some of the other big ones and free if you have a subscription (or run locally). Sublime is around to open those GB+ files that VSCode chokes on and that aren't worth the RAM of opening another IntelliJ for.

Each task / programming language / query requires trying different LLM models and novel ways of prompting. If it's not work-related (or work pays for the one you use), sending as much of the code as is relevant also helps the answers be more useful.

Most of the people I meet who say LLMs are not useful have only tried one (flavor / plugin), do not know how to pre-prompt or prompt, and do not give the tools a chance. They try one or two things, say "yep, it's not good", and give up.

Still hard for me to admit that Prompt Engineering is a profession, but it's the same as Google Fu. Once you learn it you can become an LLM Ninja!

I do not believe LLMs are coming for my job (just yet), but I do believe they are going to be able to replace some people, that they are useful, and that those who do not use them will be at a disadvantage.


Try Cursor. I’m serious.

I'm sure it's good, but that's not what I'm asking about.

Right, in simpler terms: the measure of LLMs' success is how effectively they help you achieve your goal faster.

Exactly, and right now the LLM acceleration effect is a tool, not "give me the final solution". Even people who can't code, using LLMs to build applications from scratch, still have this tool mindset. This is why they can use them effectively: they don't stop at the first failed solution; they provide hints to the LLM, test the code, try to figure out what the problem is (also with the LLM's help), and so forth. It's a matter of mindset.

> people that can't code

These people may not be Software Engineers, but they are coding.


btw, fusion has arrived by that definition: no reactors that produce more energy than they consume, but net-positive reactions have been achieved. Tasks where LLM output is more than 1x are few and far between.

Definitely not a "useless toy" with the right use case. It's great at code snippets, scripts, etc. It's an assistant.

I’m surprised you only have one use case. I use LLMs to research travel, adjust recipes, check biographies and book reviews, and many many more things.

Hopefully things have narrowed, but you can see from the trends data just how few people (the API may be a different story) use Claude relative to ChatGPT.

Brand awareness is a hell of a drug.

Indeed, although I find myself reaching for o1 more than Claude for matters other than programming, solely because it has better LaTeX (...)

ClaudeAI ++1000

yeah, they save as much time as finding a template with a good old search and using it.

> best LLMs are able to accelerate you

https://www2.math.upenn.edu/~ghrist/preprints/LAEF.pdf - this math textbook was written in just 55 days!

Paraphrasing the acknowledgements -

...Begun November 4, 2024, published December 28, 2024.

...assisted by Claude 3.5 sonnet, trained on my previous books...

...puzzles co-created by the author and Claude

...GPT-4o and -o1 were useful in latex configurations...doing proof-reading.

...Gemini Experimental 1206 was an especially good proof-reader

...Exercises were generated with the help of Claude and may have errors.

...project was impossible without the creative labors of Claude

The obvious comparison is to the classic Strang https://math.mit.edu/~gs/everyone/ which took several *years* to conceptualize, write, peer review, revise and publish.

Ok maybe Strang isn't your cup of tea, :%s/Strang/Halmos/g , :%s/Strang/Lipschutz/g, :%s/Strang/Hefferon/g, :%s/Strang/Larson/g ...

Working through the exercises in this new LLMbook, I'm thinking...maybe this isn't going to stand the test of time. Maybe acceleration is not so hot after all.


"The story of linear algebra begins with systems of equations, each line describing a constraint or boundary traced upon abstract space. These simplest mathematical models of limitation — each equation binding variables in measured proportion — conjoin to shape the realm of possible solutions. When several such constraints act in concert, their collaboration yields three possible fates: no solution survives their collective force; exactly one point satisfies all bounds; or infinite possibilities trace curves and planes through the space of satisfaction. This trichotomy — of emptiness, uniqueness, and infinity — echoes through all of linear algebra, appearing in increasingly sophisticated forms as our understanding deepens."

Maybe I'm not the target audience, but... that really doesn't make me interested in continuing to read.


That is such supremely bad writing that it can only come from AI being told to spice up the original opening paragraph, and short of the original author being barely literate (and possibly even then), the original text would have been better writing.

The overuse of the $15 synonyms is almost always a bad idea--you want to use them sparingly, where dropping them in for their subtly different meanings enhances the text. But what is extremely sloppy here is that the possibilities of "no solutions, one solution, infinite solutions" are now each being described with a different metaphor for "solution". And by the end of the paragraph, I'm not actually sure what point I'm supposed to take away from this text. (As bad as this paragraph is, the next paragraph is actually far worse.)

Mathematics already has a problem for the general audience with a heavy focus on abstraction that can be difficult to intuit on more concrete objects. Adding florid metaphors to spice up your writing makes that problem worse.


Even putting it here is annoying to me... Those are a lot of words saying nothing that I just spend time reading.

I'm agreeing with you.


It's rather purple prose, but it's entirely meaningful. Maybe it doesn't seem to mean anything until after you know some linear algebra, though...

It's been a long time, but when I was taught this material, I was told there are only 3 cases:

x+y=1, x+y=2 clearly has no solution since two numbers can’t simultaneously add to both one and two.

x+y=1,2x+2y=2 clearly has infinitely many solutions. There’s only one equation here after canceling the 2, so you can plug in x’s and y’s all day long, no end to it.

x+y=1, 2x+y=1 clearly has exactly one solution (0,1) after elimination.

This example stuck with me, so I use it even now. The author/Claude/Gemini/whatever could have just used this simple example instead of “trichotomy of curves through space conjoin through the realm of …”. Math, not Shakespeare.
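
If you want to check that trichotomy mechanically rather than poetically, here is a minimal numpy sketch (the function name is mine) that classifies those same three systems using the rank test:

    import numpy as np

    def classify(A, b):
        # Rouche-Capelli: compare the rank of A with the rank of the augmented matrix [A | b]
        A = np.asarray(A, dtype=float)
        b = np.asarray(b, dtype=float).reshape(-1, 1)
        r = np.linalg.matrix_rank(A)
        r_aug = np.linalg.matrix_rank(np.hstack([A, b]))
        if r < r_aug:
            return "no solution"
        return "exactly one solution" if r == A.shape[1] else "infinitely many solutions"

    print(classify([[1, 1], [1, 1]], [1, 2]))  # x+y=1, x+y=2   -> no solution
    print(classify([[1, 1], [2, 2]], [1, 2]))  # x+y=1, 2x+2y=2 -> infinitely many solutions
    print(classify([[1, 1], [2, 1]], [1, 1]))  # x+y=1, 2x+y=1  -> exactly one solution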


Also, isn't this a great example of "when you have a hammer, everything looks like a nail" ?

To explain this I would first and foremost use a picture, where the 3 cases (parallel, identical, intersecting) can be intuitively seen with merely a glance, using our visual system rather than our language system.


Sure, but saying something in an ornate way is not the same as “saying nothing”.

I agree. Not what I would expect from a math book or script.

Going faster isn't good if the quality drops enough that overall productivity decreases... Infinite slop is only a good thing for pigs.

Just use ChatGPT to summarize its own output. It’s like running your JPEG back through the JPEG compressor again!

^ This perfectly encapsulates the story I see every time someone digs into the details of any LLM-generated or LLM-assisted content that has any level of complexity.

Great on the surface, but lacking any depth, cohesion, or substance.


I started a book about CIAM (customer identity and access management) using Claude to help outline a chapter. I'd edit and refine the outline to make sure it covered everything.

Then I'd have Claude create text. I'd then edit/refine each chapter's text.

Wow, was it unpleasant. It was kinda cool to see all the words put together, but editing the output was a slog.

It's bad enough editing your own writing, but for some reason this was even worse.


Just to clarify: I have nothing to do with this book. I was just forwarded a copy and I thought it's relevant to the topic at hand. From the wild swings in karma, it looks like people are annoyed with the message and shooting down the messenger.

We're at the "computers play chess badly" stage. Then we'll hit the Deep Thought (1988) and Deep Blue (1995-1997) stages, but still saying that solving Go won't happen for 50+ years and that humans will continue to be better than computers.

The date/time that divides my world into before/after is AlphaGo v Lee Sedol game 3 (2016). From that time forward, I don't dismiss out of hand speculations about how soon we can have intelligent machines. Ray Kurzweil's date of 2045 is as good as any (and better than most) for an estimate. Like Moore's (and related) Laws, it's not about how, but about the historical pace of advancements crossing a fairly static point of human capability.

Application coding requires much less intelligence than playing Go at these high levels. The main differences are concise representation and clear final outcome scoring. LLMs deal quite well with the fuzziness of human communications. There may be a few more pegs to place, but when remains predictably unknown.


> There’s a flipside to this too: a lot of better informed people have sworn off LLMs entirely because they can’t see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!

I wish the author qualified this more. How does one develop that skill?

What makes LLMs so powerful on a day to day basis without a large RAG system around it?

Personally, I try LLMs every now and then, but haven’t seen any indication of their usefulness for my day to day outside of being a smarter auto complete.


When I started my career in 2010, googling was a semi-serious skill. All of the little things that we know how to do now, such as ignoring certain sites, lingering on others, and iteratively refining our search queries, were not universally known at the time. Experienced engineers often relied on encyclopedic knowledge of their environment or on "reading the manual".

In my experience, LLM tools are the same, you ask for something basic initially and then iteratively refine the query either via dialog or a new prompt until you get what you are looking for or hit the end of the LLM's capability. Knowing when you've reached the latter is critically important.


One difference is that skillful googling still only involved typing a few keywords or a short phrase and some syntax, and then knowing how to skim the results and iterate, and how to operate your browser efficiently. With LLMs, you have to type a lot more (and/or use voice input), and often also read more, it’s also not stateless/repeatable like following a web link, and most output looks the same (as opposed to the variations in web sites). I pride(d) myself on my Google foo, it was fun, but I find using LLMs to be quite exhausting in comparison.

I also find LLMs to be more exhausting than Googling, but for me they’ve been ultimately more enriching and efficient.

Specifically, I’ve been using Kagi Assistant over the past 1.5 months for serious and lengthy searches, and I can’t imagine going back to traditional search.

I’m currently sold on this model of LLM assisted search (where explicit links are provided) over the old Google foo skills I developed during grad school.

Example search topics include deep dives and guidance for my first NAS build, finding new bioinformatics methods, and other random biomedical info.


The problems with that skill is that:

* Most existing LLM interfaces are very bad at editing history, instead focusing entirely on appending to history. You can sort of ignore this for one-shot, and this can be properly fixed with additional custom tools, but ...

* By the time you refine your input enough to patch over all the errors in the LLM's output for your sensible input, you're bigger than the LLM can actually handle (much smaller than the alleged context window), so it starts randomly ignoring significant chunks of what you wrote (unlike context-window problems, the ignored parts can be anywhere in the input).


I really like Zed's (editor) implementation. The context window is just editable text, like any other. You can freely change anything and send the whole thing back into the LLM. I find that a much more useful interface than mucking around and editing chat bubbles.

ChatGPT basically lets you edit any of your messages at any point in the conversation, which I definitely use (e.g., if the conversation has gotten into a bad basin, the LLM misunderstood me, etc).

Also ChatGPT has a pretty big context window. Gemini supposedly has the biggest useful context window (~millions of tokens), though I don't have personal experience.


I tend to avoid editing previous messages because it breaks my mental model of the sequence that got me to the current state. That's more of a bias from my goal to do "research" into how these models work though - I'm always trying to maintain the cleanest possible record of what I did so I can learn from the transcript later.

> Most existing LLM interfaces are very bad at editing history, instead focusing entirely on appending to history. You can sort of ignore this for one-shot, and this can be properly fixed with additional custom tools, but ...

Somebody somewhere needs to provide a threaded interface to an LLM.


Yeah, a key thing to understand about LLMs is that managing the context is everything. You need to know when to wipe the slate by starting a new chat session and then pasting across a subset of the previous conversation.

A lot of my most complex LLM interactions take place across multiple sessions - and in some cases I'll even move the project from Claude 3.5 Sonnet to OpenAI o1 (or vice versa) to help get out of a rut.

It's infuriatingly difficult to explain why I decide to do that though!


What kinds of things do you with these LLMs?

I feel like I’m good at understanding context. I’ve been working in AI startups over the last 2 years. Currently at an AI search startup.

Managing context for info retrieval is the name of the game.

But for my personal use as a developer, they’ve caused me much headache.

Answers that are subtly wrong in such a way that it took me a week to realize my initial assumption based on the LLM response was totally bunk.

This happened twice. With the yjs library, it gave me half incorrect information that led me to misimplementing the sync protocol. Granted it’s a fairly new library.

And again with the web history api. It said that the history stack only exists until a page reload. The examples it gave me ran as it described, but that isn’t how the history api works.

I lost a week of time because of that assumption.

I’ve been hesitant to dive back in since then. I ask questions every now and again, but I jump off much faster now if I even think it may be wrong.


There is no substitute for cold hard facts. LLMs do not provide that unless it’s literally the easiest thing for them to do and even then not always.

In the case you were in I would go out of my way to feed the docs to the LLM and then use the LLM to interrogate the docs and then verify the understanding I got from the LLM with a personal reading of the docs that were relevant.

You might think it takes just as long, if not longer, to do it my way rather than just reading the docs myself. Sometimes it can. But as you get good at the workflow you find that the time spent finding the relevant docs goes down, and you get an instant plausible interpretation of the docs added on top. You can then very quickly produce application code right away, and then docs for the code you write.


Here are a bunch of things I use LLMs for relating to code.

- Running micro-benchmarks (using Python in Code Interpreter) - if I have a question about which of two approaches is faster I often use this pattern: https://simonwillison.net/2023/Apr/12/code-interpreter/

- Building small ad-hoc one-off tools. Many of the examples in https://simonwillison.net/2024/Oct/21/claude-artifacts/ fit that bill, and I have a bunch more in my tools tag here: https://simonwillison.net/tags/tools/ - Geoffrey Litt wrote a great piece the other day about custom developer tools which matches how I think about this: https://www.geoffreylitt.com/2024/12/22/making-programming-m...

- Building front-end prototypes - I use Claude Artifacts for this all the time, if I have an idea for a UI I'll get Claude to spin up an almost instant demo so I can interact with it and see if it feels right. I'll often copy the code out and use it as the starting point for my production feature.

- DSLs like SQL, Bash scripts, jq, AppleScript, grep - I use these WAY more than I used to because 9/10 times Claude gives me exactly what I needed from a single prompt. I built a CLI tool for prompt-driven jq programs recently: https://simonwillison.net/2024/Oct/27/llm-jq/

- Ad-hoc sidequests. This is a pretty broad category, but it's effectively little coding projects which I shouldn't actually be working on at all but I'll let myself get distracted if an LLM can get me there in a few minutes: https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-cas...

- Writing C extensions for SQLite while I'm walking my dog on the beach. I am not a C programmer but I find it extremely entertaining that ChatGPT Code Interpreter, prompted from my phone, can write, compile and test C extension for SQLite for me: https://simonwillison.net/2024/Mar/23/building-c-extensions-...

- That's actually a good example of a general pattern: I use this stuff for exploratory prototyping outside of my usual (Python+JavaScript) stack all the time. Usually this leads nowhere, but occasionally it might turn into a real project (like this AppleScript example: https://til.simonwillison.net/gpt3/chatgpt-applescript )

- Actually writing code. Here's a Python/Django app I wrote almost entirely with Claude: https://simonwillison.net/2024/Aug/8/django-http-debug/ - again, this was something of a side-project - not something worth spending a full day on but worthwhile if I could get it done in a couple of hours.

- Mucking around with APIs. Having a web UI for exploring an API is really useful, and Claude can often knock those out from a single prompt. https://simonwillison.net/2024/Dec/17/openai-webrtc/ is a good example of that.

There's a TON more, but this probably represents the majority of my usage.


Thank you!

I’ll read through these and try again in the new year.


Not OP, but I've just gotten really used to verifying implementation details. Yup, those subtle ones really suck. It's pretty much just up to intuition if something in the response (or your followups) rings the `not quite right` bell for you.

I bought in early to typingmind, a great web based frontend. Good for editing context, and switching from say gemini to claude. This is a very normal flow for me, and whatever tool you use should enable this

also nice to interact with an LLM in vim, as the context is the buffer

obviously simon’s llm tool rules. I’ve wrapped it for vim


Googlefu is what it's usually called. It would be fantastic if there were a general course to teach it.

One of the things I find most frustrating about LLMs is how hard it is to teach other people how to use them!

I'd love to figure this out. I've written more about them than most people at this point, and my goal has always been to help people learn what they can and cannot do - but distilling that down to a concise set of lessons continues to defeat me.

The only way to really get to grips with them is to use them, a lot. You need to try things that fail, and other things that work, and build up an intuition about their strengths and weaknesses.

The problem with intuition is it's really hard to download that into someone else's head.

I share a ton of chat conversations to show how I use them - https://simonwillison.net/tags/tools/ and https://simonwillison.net/tags/ai-assisted-programming/ have a bunch of links to my exported Claude transcripts.


Thank you for doing this work, though.

My first stab at trying ChatGPT last year was asking it to write some Rust code to do audio processing. It was not a happy experience. I stepped back and didn't play with LLMs at all for a while after that. Reading your posts has helped me keep tabs on the state of the art and decide to jump back in (though with different/easier problems this time).


To be fair I think that is a hard task even for a human expert, in the sense that there isn’t much prior art.

It's really important to go and read the code that the author of this article actually produces with LLMs. He posted on hacker news a few months ago, a post called something like "everything I've made with ChatGPT in the month of September" or something. He's producing little toy applications that don't even begin to resemble real production code. He thinks these "tools" are useful because they help him write pointless slop.

Here's that post: https://simonwillison.net/2024/Oct/21/claude-artifacts/

You're misrepresenting it here.

The point of that post isn't "look at these incredible projects I've built (proceeds to show simple projects)."

It's "I built 14 small and useful tools in a single week, each taking between 2 and 10 minutes".

The thing that's interesting here is that I can have an LLM kick out a working prototype of a small, useful tool in only a little more time than it takes to run a Google search.

That post isn't meant to be about writing "real production code". I don't know why people are confused over that.


Do you know who Simon is?

Only from his neverending stream of hacker news posts.

My experience is that for certain tasks LLMs are great, and for certain tasks LLMs are basically useless.

The best prompts though are always written in a separate text file for me and pasted in. Follow up questions are never as good as a detailed initial prompt.

I would imagine that formulating good questions for the problem at hand is a skill, but beyond that I don't think there is anything special about how to ask LLMs a question.

In areas where the LLM is rather useless, no amount of variation in prompting can solve that problem IMO. Just like if the task is something the LLM is good at, the prompt can be pretty sloppy and it seems like magic with how it can understand what you want.


I think one of the most important skills is being able to predict which tasks an LLM is a good fit for and which aren't.

I think most tech folks struggle with it because they treat LLMs as computer programs, and their experience is that SW should be extremely reliable - imagine using a calculator that was wrong 5% of the time - no one would accept that!

Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.

Abstract that out a bit further, and realize that most managers don't expect their reports to be 100% reliable.

Don't use LLMs where accuracy is paramount. Use it to automate away tedious stuff. Examples for me:

Cleaning up speech recognition. I use a traditional voice recognition tool to transcribe, and then have GPT clean it up. I've tried voice recognition tools for dictation on and off for over a decade, and always gave up because even a 95% accuracy is a pain to clean up. But now, I route the output to GPT automatically. It still has issues, but I now often go paragraphs before I have to correct anything. For personal notes, I mostly don't even bother checking its accuracy - I do it only when dictating things others will look at.

And then add embellishments to that. I was dictating out a recipe I needed to send to someone. I told GPT up front to write any number that appears next to an ingredient as a numeral (e.g. 3 instead of "three"). Did a great job - didn't need to correct anything.
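
If anyone wants to wire up something similar, here is a minimal sketch using the openai Python client; the model choice and the prompt wording are mine, not necessarily what works best:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def clean_transcript(raw_text: str) -> str:
        # ask the model to fix recognition errors without adding or removing content
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "Clean up this speech-recognition transcript: fix mis-heard words, "
                    "punctuation and casing. Write any number next to an ingredient as a "
                    "numeral. Do not add or remove content.")},
                {"role": "user", "content": raw_text},
            ],
        )
        return response.choices[0].message.content

    print(clean_transcript("add three cups of flour and a pinch off salt"))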

And then there are always the "I could do this myself but I didn't have time so I gave it to GPT" category. I was giving a presentation that involved graphs (nodes, edges, etc). I was on a tight deadline and didn't want to figure out how to draw graphs. So I made a tabular representation of my graph, gave it to GPT, and asked it to write graphviz code to make that graph. It did it perfectly (correct nodes and edges, too!)

Sure, if I had time, I'd go learn graphviz myself. But I wouldn't have. The chances I'll need graphviz again in the next few years is virtually 0.

I've actually used LLMs to do quick reformatting of data a few times. You just have to be careful that you can verify the output quickly. If it's a long table, then don't use LLMs for this.

Another example: I have a custom note taking tool. It's just for me. For convenience, I also made an HTML export. Wouldn't it be great if it automatically made alt text for each image I have in my notes? I would just need to send it to the LLM and get the text. It's fractions of a cent per image! The current services are a lot more accurate at image recognition than I need them to be for this purpose!
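
Same idea for the alt text, using a vision-capable model; the file name is made up and the prompt is just an example:

    import base64
    from pathlib import Path

    from openai import OpenAI

    client = OpenAI()

    def alt_text(image_path: Path) -> str:
        # send the note's image as a data URL and get back a one-line description
        b64 = base64.b64encode(image_path.read_bytes()).decode()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Write one concise alt-text sentence for this image."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    print(alt_text(Path("notes/images/whiteboard.png")))  # hypothetical note image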

Oh, and then of course, having it write Bash scripts and CSS for me :-) (not a frontend developer - I've learned CSS in the past, but it's quicker to verify whatever it throws at me than Google it).

Any time you have a task and lament "Oh, this is likely easy, but I just don't have the time" consider how you could make an LLM do it.


> Don't use LLMs where accuracy is paramount.

Then why do people keep pushing it for code related tasks?

Accuracy and precision is paramount with code. It needs to express exactly what needs to be done and how.


Code is the best possible application of LLMs because you can TEST the output.

If the LLM hallucinates something the code won't compile or run.

If the LLM makes a logic error you'll catch it in the manual QA process.

(If you don't have good personal manual QA habits, don't try using LLMs to write your code. And maybe don't hit "accept" on other developers' code reviews either?)
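
A toy illustration of the "you can TEST the output" point (the generated snippet below is a stand-in, not real model output):

    # run a generated snippet in a scratch namespace and only accept it if known cases pass
    generated = """
    def slugify(title):
        return "-".join(title.lower().split())
    """

    namespace = {}
    exec(generated, namespace)  # a hallucinated name or a syntax error fails right here

    slugify = namespace["slugify"]
    assert slugify("Hello World") == "hello-world"
    assert slugify("  Already   spaced ") == "already-spaced"
    print("generated code passed the checks")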


> Code is the best possible application of LLMs because you can TEST the output.

This is an overly simplistic view of software development.

Poorly made abstractions and functions will have knock on effects on future code that can be hard to predict.

Not to mention that code can have side effects that may not affect a given test case, or the code could be poorly optimized, etc.

Just because code compiles or passes a test does not mean it’s entirely correct. If it did, we wouldn’t have bugs anymore.

The usual response to this is something like “we can use the LLM to refactor LLM code if we need” but, in my experience, this leads to very complex, hard to reason about codebases.

Especially if the stack isn’t Python or JavaScript.


So code review LLM-generated code and reject it (or require changes to it) if it doesn't fit your idea of what good code looks like.

Or… yknow… I could just write the code…

Instead of going through a multi step process to get an LLM to generate it, review it, reject it, and repeat…

I wonder why you reply to these comments, but not my other asking what you use LLMs for and specifically explaining how they failed me.


Found that comment here, about to reply: https://news.ycombinator.com/item?id=42562394

Because there are other ways to validate the output, types being one of them, tests another. Or simply running the code. It's easy enough to validate the output given the right approach that code generated by an LLM (usually as the result of a conversation/discussion about what should be accomplished) is a net positive.

If you zero-prompt and copy-paste the first result into your codebase, yeah, the accuracy problem will rear its ugly head real quick.


> Then why do people keep pushing it for code related tasks?

They don't. You are likely experiencing selection bias. My guess is you work in SW, and so it makes sense that you're the target of those campaigns. The bulk of ChatGPT subscribers are not doing SW, and no one is bugging them to use it for code related tasks.


I mean people in the software field absolutely push for LLMs to write code…

Obviously people not in the software field wouldn’t care…


A similar use case for me - I wrote some technical documentation for our wiki about a somewhat complicated relationship between ids in some database tables. I copied my text explanation into an LLM and asked it to make a diagram and it did so. Took very little time from me and it was fast/easy to verify that the quality was good.

I think there’s the added reason that a lot of folks went into tech because (consciously or unconsciously) they prefer dealing with predictable machines rather than with unreliable humans. And now that career choice begins to look like a bait and switch. ;)

> Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.

The problem is: for the tasks that I can give the LLM (or human) that I can easily verify and correct, the LLM fails with the majority of them, for example

- programming tasks in my area of expertise (which is more "mathematical" than what is common in SV startups), where I know what a high-level solution has to look like, and where I can ask the LLM to explain the gory details to me. Yes, these gory details are subtle (which is why the task can be menial), but the code has to be right. I can verify this, and the code is not correct.

- getting literature references about more obscure scientific (in particular mathematical) topics. I can easily check whether these literature references (or summaries of these references) are hallucinations - they typically are.


Your first task is definitely not what I would call a "menial" task.

Your second task is not a "task", but a knowledge search. LLMs are not good with searches (unless augmented - like RAG).


LLMs on their own are effectively useless for references or citations. They need to be plugged into other systems for that - search-enabled ones like https://gemini.google.com or ChatGPT with search enabled or Perplexity can do this, although at that point they are mostly running the exact same searches you would.

> Don't use LLMs where accuracy is paramount. Use it to automate away tedious stuff.

My programmer mind tells me that "tedious stuff" is where accuracy is the most important.


There's a similar dynamic in building reliable distributed systems on top of an unreliable network. The parts are prone to failure but the system can keep on working.

The tricky problem with LLMs is identifying failures - if you're asking the question, it's implied that you don't have enough context to assess whether it's a hallucination or a good recommendation! One approach is to build ensembles of agents that can check each other's work, but that's a resource-intensive solution.


It's amazing this is still an opinion in 2025. I now ask devs how they use AI as part of their workflows when I interview. It's a standard skill I expect my guys to have.

I feel bad for your team.

Let people work how they want. I wouldn’t not hire someone on the basis of them not using a language server.

The creator of the Odin language famously doesn’t use one. He says that he, specifically, is faster without one.


No, it’s reasonable. If your team uses Git then it’s a valid question to establish if someone has only worked with Perforce.

They didn’t say how heavily they weight the question.

(All that said I expect that, soon, experience with the appropriate LLM tooling will be as important as having experience with the language your system is implemented in.)


Right, but using git is a team wide thing.

I can’t use perforce while my company is on git.

But if I do or do not use an LLM to assist me while coding, my team is unaffected.

If someone liked jetbrains, but your team used neovim, would you force them to use neovim?


Editors may also be a team decision in some places. Some teams are using features unique to one IDE, for example.

it can be a team decision, but it's a bad one

Then that tooling is required - Visual Studio is a common one I know about in Windows land.

Though nobody should care if I edited my text files with neovim as long as I still used the same toolchain as everyone else.


You hire people based on their fundamental knowledge and the ability to learn, not skills in arbitrary tools and frameworks which come and go every other day. If someone has used Perforce they will be able to get perfectly comfortable with Git by the end of their first week. So not knowing Git is an idiotic reason to reject a skilled developer. Same with programming languages, and just about every other aspect of software development.

I don't really test any specific tools or frameworks; what I'm using has changed twice just in the last year. Mostly, I just want to hear that the candidate has some knowledge of what the current models can do well, what they can't do, and how they're integrating it. Whether you're copy-pasting code or using something like Cursor is not what I'm concerned about.

Yeah, but it's oh so easy to test for, and oh so nice to have plenty of boxes checked to cover your ass if the hire goes wrong.

My expectations around productivity are going to assume you're using AI. That means stuff that might have taken a few days, I'm going to expect in a few hours or less. It's not unreasonable; I've seen over and over again that kind of speed-up. I have a lot less approval to hire people than I used to... so it's really important to me that I can extract that level of productivity out of my team.

If you're "working the way you want to" ie still handrolling all your code, you're going to find my expectations unrealistic, and that is certainly not fair to you.


I concur that asking devs how they use AI is a great idea.

Recently, I shared a code base with a junior dev and she was surprised with the speed and sophistication of the code. The LLM did 80+% of the "coding".

What was telling was that, as she was grokking the code (to help with the ~20%), she was surprised at the quality of the code - her own use of the LLM did not yield code of similar quality.

I find that the more domain awareness one brings to the table, the better the output is. Basically the clearer one's vision of the end-state, the better the output.

One other positive side-effect of using "LLMs as a junior-dev" for me has been that my ambitions are greater. I want it all - better code, more sophisticated capabilities even for relatively not-important projects, documentation, tests, debug-ability. And once the basic structure is in place, many a time it is trivial to get the rest.

It's never 100%, but even with 80+%, I am faster than ever before, deliver better quality code, and can switch domains multiple times a week and never feel drained.

Sharing best AI hacks within a team will have the same effect as code-reviews do in ensuring consistency. Perhaps an "LLM chat review", especially when something particularly novel was accomplished!


Using cloud-based AI is a no-go where I work, for IP and contractual reasons. And on-premises AI is not as capable and more difficult to integrate.

Have you tried the latest open weight models? They're SO MUCH better today than they were even six months ago.

If I was in an environment that didn't allow hosted API models I'd absolutely be looking into the various Llama 3 models or Qwen2.5-Coder-32B.


Legal does not even want us running offline models for reasons. I assume that comes down to not knowing what offline-only means, but such is life.

Maybe they're concerned that code written with AI assistance can't be copyrighted? I've seen that idea floated in a few places.

What do you use so that you can throw in a set of documents and/or a nontrivial code base into an LLM workspace and ask questions about it etc.? What the cloud-based services provide goes way beyond a simple chat interface or mere code completion (as you know, of course).

I use my https://github.com/simonw/files-to-prompt tool like this:

  files-to-prompt . -e py -e md -c | pbcopy
Now I have all the Python and Markdown files from the current project on my clipboard, in Claude's recommended XML-like format (which I find works well with other models too).

Then I paste that into the Claude web interface or Google's AI Studio if it's too long for Claude and ask questions there.

Sometimes I'll pipe it straight into my own LLM CLI tool and ask questions that way:

  files-to-prompt . -e py -e md -c | \
    llm -m gemini-2.0-flash-exp 'which files handle JWT verification?'
I can later start a chat session on top of the accumulated context like this:

  llm chat -c
(The -c means "continue most recent conversation in the chat").

Thanks. Google AI Studio isn’t local, I think, is it? I’ll have to test this, but our project sizes and specification documents are likely to run into size limitations for local models (or for the clipboard at the very least ;)). And what I’d be most interested in are big-picture questions and global analyses.

No, it's not. I've not seen any local models that can handle 1m+ tokens.

I haven't actually done many experiments with long context local models - I tend to hit the hosted API models for that kind of thing.


Just curious, but what AI related skills do you expect them to have?

The ability to recognize and join a hype train, I presume. It’s one way to appear proactively leading-edge to marginally-informed product managers, marketers, execs and press.

That's an extremely uncharitable presumption. Although I don't agree that routine usage of AIs should be a precondition for regular software engineering jobs, there are good reasons for using LLMs besides "joining a hype train".

Nah.

I ask what their current workflow is, how they check and verify things, what their approach to prompting is etc. I'm looking to see that they've developed basic skills, have a reasonable mental model of what models can do well, what they currently can't do, and have an approach to be productive using the tools.

I would characterize good prompting as: write out the whole problem you're trying to solve, then think to yourself what the clarifying questions would be if you were a junior trying to solve it. Better yet - ask the LLM to ask you challenging clarifying questions for several rounds. Then, take all that information and re-compile it back into a list of all the important components of the project, and re-read it to make sure there's no particularly ambiguous part or weird part that would be over-emphasized by the language you used. Then, emphasize the core concerns again, and tell it how you'd like it to output the response (keeping in mind that it will always do best with a conversation-style format with loose restrictions). Never let a conversation stray too long from the original goals lest it start forgetting.

Once that's all done, you basically have a well-structured question you could pass to an underling and have them completely independently work on the project without bugging you. That's the goal. Now, pass that to o1 or Claude, depending on whether it's a general-purpose task (o1) or a code-specific task (Claude), and wait for response. From there, have a conversation or test-and-followup of whatever it spits out, this time with you asking questions. If good enough, done. If not, wrap up whatever useful insights from that line of questioning and put it back into the initial prompt and either re-post it at the end of the conversation or start a fresh conversation.

I find 90% of the time this gets exactly what I'm after eventually. The few other cases are usually because we hit some cycle where the AI doesn't fully know what to change/respond, and it keeps repeating itself when I ask. The trick then is to ask things a different way or emphasize something new. This is usually just a code-specific issue, for general problems it's much better. One other trick is to ask it to take a step back and just tackle the problem in a theoretical/philosophical way first before trying to do any coding or practical solving, and then do that in a second phase (asking o1 to architect code structure and then Claude to implement it is a great combo too). Also if there is any way to break up the problem into smaller pieces which can be tackled one conversation at a time - much better. Just remember to include all relevant context it needs to interface with the overall problem too.

That sounds like a lot, but it's essentially just project management and delegation to somewhat-flawed underlings. The upside is instead of waiting a workweek for them to get back to you, you just have to wait 20 seconds. But it does mean a ton of reading and writing. There are certainly already some meta-prompts where you can get the AI to essentially do this whole process for you and assess itself, but like all automation that means extra ways for things to break too. Let the AI devs cook though and those will be a lot more commonplace soon enough...

[Edit: o1 mostly agrees lol. Some good additional suggestions for systematizing this: https://chatgpt.com/share/6775b85c-97c4-8003-bd31-ee288396ab... ]
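A rough way to systematize that structure before sending it off (the section labels below are my own, not anything the models require):

  def build_prompt(problem, clarifications, components, core_concerns, output_format):
      qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in clarifications)
      bullets = "\n".join(f"- {c}" for c in components)
      return "\n\n".join([
          "Problem:\n" + problem,
          "Clarifications so far:\n" + qa,
          "Important components:\n" + bullets,
          "Pay particular attention to:\n" + core_concerns,
          "Respond with:\n" + output_format,
      ])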


[flagged]


I hadn't heard of T2Tile. The intro video https://www.youtube.com/watch?v=jreRFxN6wuM is from 5 years ago so it predates even GPT-3.

Do you know if any of the ideas from that project have crossed over into LLM world yet?


Do you know who Simon is?

Great summary of highlights. I don't agree with all of it, but I think it's a very sound attempt at a year-in-review summary.

>LLM prices crashed

This one has me a little spooked. The white knight on this front (DeepSeek) has both announced price increases and had staff poached. There is still Gemini free tier which is ofc basically impossible to beat (solid & functionally unlimited/free), but it's Google, so I'm reluctant to trust it.

Seriously worried about seeing a regression on pricing in the first half of 2025. Especially with the OpenAI $200/month price anchoring.

>“Agents” still haven’t really happened yet

Think that's largely because it's a poorly defined concept, and a true "agent" implies some sort of pseudo-AGI autonomy. This is a definition/expectation issue rather than a technical one, in my mind.

>LLMs somehow got even harder to use

I don't think that's 100% right. An explosion of options is not the same as being harder to use. And the guidance for noobs is still pretty much the same as always (llama.cpp or one of the common frontends like text-generation-webui). It's become harder to tell what is good, but not harder to get going.

----

One key theme I think is missing is just how hard it has become to tell what is "good" for the average user. There is so much benchmark shenanigans going on that it's just impossible to tell. I'm literally at the "I'm just going to build my own testing framework" stage. Not because I can do better technically (I can't)...but because I can gear it towards things I care about and I can be confident my DIY sample hasn't been gamed.


The biggest reason I'm not worried about prices going back up again is Llama. The Llama 3 models are really good, and because they are open weight there are a growing number of API providers competing to provide access to them.

These companies are incentivized to figure out fast and efficient hosting for the models. They don't need to train any models themselves, their value is added entirely in continuing to drive the price of inference down.

Groq and Cerebras are particularly interesting here because WOW they serve Llama fast.


> There is still Gemini free tier which is ofc basically impossible to beat

Is it free free? The last time I checked there was a daily request limit, still generous but limiting for some use cases. Isn't it still the case?


Providing an unlimited free tier would be a terrible business decision for them.

Of course. My point is, a super cheap LLM that does not cut you off after the 1,500th API request of the day is probably preferred over the free model that does, at least for certain use cases.

Agents have a definition issue, sure, but IMO we are prevented from even discovering a useful definition by the current limitations of LLMs.

> Some of those GPT-4 models run on my laptop

That's an indication that models sized for most business needs won't require some giant data center. This is going to be a cheap technology most of the time. OpenAI is thus way overvalued.


Most of the laptops that can run these models today have specs at the high end of dedicated bare-metal servers. Most shared VM servers are way below these laptops. Most people buying a new laptop today won't be able to run them, and most devs getting a website up on a server won't be able to run them either.

This means that the definitions of "laptop" and "server" depend on use. We should instead talk about RAM, GPU and CPU speed, which is more useful and informative but less engaging than "my laptop".


I don't think OpenAI's valuation comes from a data center bet -- rather, I'd suppose, investors think it has a first-mover advantage on model quality with which it can (maybe?) attract buy-out interest or otherwise build yet-to-be-specified product lines.

However, it has been clear for a long time that Meta is just demolishing any competitor's moats, driving the whole megacorp AI competition to razor-thin margins.

It's a very welcome strategy from a consumer POV, but -- it has to be said -- genius from a business POV. By deciding that no one will win, Meta prevents anyone from leapfrogging it at a relatively cheap price.


The last OpenAI valuation I read about was $157 billion. I am struggling to understand what justifies this. To me, it feels like OpenAI is at best a few months ahead of competitors in some areas. But even if I am underestimating the advantage and it's a few years instead of a few months, why does it matter? It's not like AI companies are going to enjoy the first-mover advantage the internet giants had over their competition.

It's justified if AGI is possible. If AGI is possible, then the entire human economy stops making sense as far as money goes, and 'owning' part of OpenAI gives you power.

That is, of course, assuming AGI is possible and exponential, and that market share goes to a single entity instead of a set of entities. Lots of big assumptions. Seems like we're heading towards a slow, lackluster singularity though.


I was thinking about how the economy has been actively making less sense and getting more and more divorced from reality year after year, AI or not.

It's the simple fact that the ability of assets to generate wealth has far outstripped the ability of individuals to earn money by working.

Somehow real estate has become so expensive everywhere that owning a shitty apartment is impossible for the vast majority.

When the world's population was exploding during the 20th century, housing prices were not a problem, yet somehow nowadays, it's impossible to build affordable housing to bring the prices down, though the population is stagnant or growing slowly.

A company can be worth $1B if someone invests $10m in it for 1% stake - where did the remaining $990m come from? Likewise, the stock market is full of trillion-dollar companies whose valuations beggar all explanation, considering the sizes of the markets they are serving.

The rich elites are using the wealth to control access to basic human needs (namely housing and healthcare) to squeeze the working population for every drop of money. Every wealth metric shows the 1% and the 1% of the 1% control successively larger portions of the economic pie. At this point money is ceasing to be a proxy for value and is becoming a tool for population control.

And the weird thing is it didn't use to be nearly this bad even a decade ago, and we can only guess how bad it will get in a decade, AGI or not.

Anyway, I don't want to turn this into a fully-written manifesto, but I have trouble expressing these ideas in a concise manner.


> Somehow real estate has become so expensive everywhere that owning a shitty apartment is impossible for the vast majority.

Approximately 2/3s of homes in the US are owner occupied.


It's interesting that the figure is similar in Australia, but from the POV of the people.

Approximately 2/3rds of Australians live in an owner-occupied home.


> When the world's population was exploding during the 20th century, housing prices were not a problem, yet somehow nowadays, it's impossible to build affordable housing to bring the prices down, though the population is stagnant or growing slowly.

In Canada, the population is still growing at a fairly impressive rate (https://www.macrotrends.net/global-metrics/countries/CAN/can...), and that growth tends to concentrate in major population centres. There are advocacy groups that seek to push Canadian population growth well above UN projections (e.g. the https://en.wikipedia.org/wiki/Century_Initiative "aims to increase Canada's population to 100 million by 2100") through immigration. In Japan, where the population is declining, housing prices are not anything like the problem we observe in North America.

There's also the supply side. "Impossible to build affordable housing" is in many cases a consequence of zoning restrictions. (Economists also hold very strongly that rent control doesn't work - see e.g. https://www.brookings.edu/articles/what-does-economic-eviden... and https://www.nmhc.org/research-insight/research-notes/2023/re... ; real "affordable housing" is just the effect of more housing.)


> Somehow real estate has become so expensive everywhere that owning a shitty apartment is impossible for the vast majority.

That's to be expected when governments forbid people from building housing. The only thing I find surprising is when people blame this on "capitalism".


> And the weird thing is it didn't use to be nearly this bad even a decade ago, and we can only guess how bad it will get in a decade, AGI or not.

The last 5 years have reflected a substantial decline in QOL in the States; you don't even have to look back that far.

The coronacircus money-printing really accelerated the decline.


> If AGI is possible, then the entire human economy stops making sense as far as money goes, and 'owning' part of OpenAI gives you power.

That's if AGI is possible and not easily replicated. If AGI can be copied and/or re-developed like other software then the value of owning OpenAI stock is more like owning stock in copper producers or other commodity sector companies. (It might even be a poorer investment. Even AGI can't create copper atoms, so owners of real physical resources could be in a better position in a post-human-labor world.)


This belief comes from confusing the singularity (every atom on Earth is converted into a giant image of Sam Altman) with AGI (a store employee navigates a confrontation with an unruly customer, then goes home and wins at Super Mario).

If I recall correctly, these terms were used more or less interchangeably for a few decades, until 2020 or so, when OpenAI started making actual progress towards AGI, and it became clear that the type of AGI that could be imagined at that point would not be the type that would produce a singularity.

Exactly. I continually fail to see how "the entire human economy ends" overnight with another human-like agent out there - especially if it's confined to a server in the first place - it can't even "go home" :)

But what if that AGI can fit inside a humanoid robot and that robot is capable of self replication even if it means digging the sand out of the ground to make silicon with a spade?

We already have humanoid intelligences that self-assemble and power themselves from common materials, as a colony of incredibly advanced nanobots.

Yes. The goal is to emulate that with different substrates to understand how it works and to have better control over existing self-replicating systems.

The first AGI will have such an advantage. It’ll be the first thing that is smart and tireless, can do anything from continuously hacking enemy networks to trading across all investment classes, to basically taking over the news cycle on social media. It would print money and power.

Depends on how efficient it is. If it requires more processing power than we have to do all these things competitors will have time to catch up while new hardware is created.

The GP said, "and exponential". If AGI is exponential, then the first one will have a head start advantage that compounds over time. That is going to be hard to overcome.

I believe that AGI cannot be exponential for long because any intelligent agent can only approach nature's limits asymptotically. The first company with AGI will be about as much ahead as, say, the first company with electrical generators [1]. A lot of science fiction about a technological singularity assumes that AGI will discover and apply new physics to develop currently-believed-impossible inventions, but I don't consider that plausible myself. I believe that the discovery of new physics will be intellectually satisfying but generally inapplicable to industry, much like how solving the cosmological lithium problem will be career-defining for whoever does it but won't have any application to lithium batteries.

https://en.wikipedia.org/wiki/Cosmological_lithium_problem

[1] https://en.wikipedia.org/wiki/Siemens#1847_to_1901


I don't recall editing my message, but HN can be wonky sometimes. :)

Nothing is truly exponential for long, but the logistic curve could be big enough to do almost anything if you get imaginative. Without new physics, there are still some places where we can do some amazing things with the equivalent of several trillion dollars of applied R&D, which AGI gets you.


This depends on what a hypothetical 'AGI' actually costs. If a real AGI is achieved, but it costs more per unit of work than a human does... it won't do anyone much good.

Sure but think of the Higgs... how long that took for just _one_ particle. You think an AGI, or even an ASI is going to make an experimental effort like that go any bit faster? Dream on!

It astounds me that people don't realize how much of this cutting-edge science stuff literally does NOT happen overnight, and not even close to that; typically it takes on the order of decades!


Science takes decades, but there are many places where we could have more amazing things if we spent 10 times as much on applied R&D and manufacturing. It wouldn't happen overnight, but it will be transformative if people can get access to much more automated R&D. We've seen a proliferation in makers over the last few decades as access to information is easier, and with better tools individuals will be able to do even more.

My point being that even if Science ends today, we still have a lot more engineering we can benefit from.


I had to edit my message just now because I was actually unsure if you edited. Sorry for any miscommunication.

If AGI is invented and the inventor tries to keep it secret then everyone in the world will be trying to steal it. And funding to independently create it would become effectively unlimited once it has been proven possible, much like with nuclear weapons.

We may not need smarter AI. Just less stupid AI.

The big problem with LLMs is that most of the time they act smart, and some of the time they do really, really dumb things and don't notice. It's not the ceiling that's the problem. It's the floor. Which is why, as the article points out, "agents" aren't very useful yet. You can't trust them to not screw up big-time.


> If AGI is possible, then the entire human economy stops making sense as far as money goes,

What does this mean in terms of making me coffee or building houses?


If we can simulate a full human intelligence at a reasonable speed, we can simulate 100 of them and ask the AGI to figure out how to make itself 10x faster.

Rinse and repeat.

That is exponential take off.

At the point where you have an army of AIs running at 1000x human speed, you can just ask it to design the mechanisms for and write the code to make robots that automate any possible physical task.


There are about 8 billion human intelligences walking around right now and they've got no idea how to begin making even a stupid AGI, let alone a superhuman one. Where does the idea that 100 more are going to help come from?

This was my argument a long time ago. The common counter was that we’d have a bunch of geniuses that knew tons of research. Well, we probably already have millions of geniuses. If anything, they use their brains for self-enrichment (eg money, entertainment) or on a huge assortment of topics. If all the human geniuses didn’t do it, then why would the AGI instances do it?

We also have people brilliant enough to maybe solve the AGI problem or cause our extinction. Some are amoral. Many mechanisms pushed human intelligences in other directions. They probably will for our AGI’s assuming we even give them all the power unchecked. Why are they so worried the intelligent agents will not likewise be misdirected or restrained?

What smart, resourceful humans have done (and not done) is a good starting point for what AGI would do. At best, they'll probably help optimize some chips and LLM runtimes. Patent minefields with sub-28nm design, especially mask-making, will keep unit volumes of true AGIs much lower at higher prices than systems driven by low-paid workers with some automation.


This sounds like magic, not science.

What do you mean by this? Is there any fundamental property of intelligence, physicality, or the universe, that you think wouldn't let this work?

Not OP but yes. Electron size vs band gap, computing costs (in terms of electricity), other raw materials needed for that energy, etc... sigh... it's physics, always physics... what fundamental property of physics do you think would let a vertical take-off in intelligence occur?

If you look at the rate of mathematical operations conducted, we're already going hard vertical. Physics and material limitations will slow that eventually as we reach a marginal return on converting the planet to computer chips, but we're in the singularity as measured by the proxy of mathematical operations.

> If you look at the rate of mathematical operations conducted, we're already going hard vertical.

Not if you remember to count all the computations being done by the quintillions of nanobots across the world known as "human cells."

That's not only inside cells, and not just neurons either. For example, your immune system is busy brute-forcing the impossibly large space of antibody combinations, and putting every candidate cell release through a very rigorous set of acceptance tests.


The human brain still has orders of magnitude more processing power than LLMs. Even if we develop superintelligence, the current hardware can't run it, which gives competitors time to catch up.

Nothing, and the hilarious thing is that the AI figureheads admit that technology (as in, defined by new theorems produced and new code written) will do pathetically little to move the needle on human happiness.

The guy running Anthropic thinks the future is in biotech, developing the cure to all diseases, eternal youth etc.

Which is technology all right, but it's unclear to me how these chatbots (or other AI systems) are the quickest way to get there.


> If AGI is possible, then the entire human economy stops making sense as far as money goes

I heard people on HN saying this (even without the money condition) and I fail to grasp the reasoning behind it. Suppose in a few years Altman announces a model, say o11, that is supposedly AGI, and in several benchmarks it hits over 90%. I don't believe it's possible with LLMs because of their inherent limitations but let's assume it can solve general tasks in a way similar to an average human.

Now, how is it that "the entire human economy stops making sense"? In order to eat, we need farmers, we need construction workers, shops etc. As for white-collar workers, you will need a whole range of people to maintain and further develop this AGI. So IMHO the opposite is true: the human economy will work exactly as before, but the job market will continue to evolve, with people using AGI in a similar way to how they use LLMs now, but probably with greater confidence. (Or not.)


The thinking goes:

- any job that can be done on a computer is immediately outsourced to AI, since the AI is smarter and cheaper than humans

- humanoid robots are built that are cheap to produce, using tech advances that the AI discovered

- any job that can be done by a human is immediately outsourced to a robot, since the robot is better/faster/stronger/cheaper than humans

If you think about all the people trying to automate away farming, construction, transport/delivery - these people doing the automation themselves get automated out first, and the automation figures out how to do the rest. So a fully robotic economy is not far off, if you can achieve AGI.

Why do we work? Ultimately, we work to live.* If the value of our labor is determined by scarcity, then what happens when productivity goes nearly infinite and the scarcity goes away? We still have needs and wants, but the current market will be completely inverted.

One stratum in that assumption-heap to call out explicitly: assuming LLMs are an enabling route to AGI and not a dead-end or supplemental feature.

Well, AGI would make the brainy information-worker part of the economy obsolete. We'll still need the jobs that interact with the physical world for quite a while. So… all us HN types should get ready to work the mines or pick vegetables

If we hit true AGI, physical labor won’t be far behind the knowledge workers. The first thing industrial manufacturers will do is turn it towards designing robotics, automating the design of factories, and researching better electromechanical components like synthetic muscle to replace human dexterity.

IMO we’re going to hit the point where AI can work on designing automation to replace physical labor before we hit true AGI, much like we’re seeing with coding.


If AGI is possible then that too becomes a commodity and we experience a massive round of deflation in the cost of everything not intrinsically rare. Land, food, rare materials, energy, and anything requiring human labor is expensive and everything else is almost free.

I don't see how OpenAI wouldn't crash and burn here. Given the history of models it would be at most a year before you'd have open AGI, then the horse is out of the barn and the horse begins to self-improve. Pretty soon the horse is a unicorn, then it's a Satyr, and so on.

(I am a near-term AGI skeptic BTW, but I could be wrong.)

OpenAI's valuation is a mixture of hype speculation and the "golden boy" cult around Sam Altman. In the latter sense it's similar to the golden boy cults around Elon Musk and (politically) Donald Trump. To some extent these cults work because they are self-fulfilling feedback loops: these people raise tons of capital (economic or political) because everyone knows they're going to raise tons of capital so they raise tons of capital.


> what justifies this

People are buying shares at $x because they believe they will be able to sell them for more later. I don’t think there’s a whole lot more to it than that.


OpenAI is becoming synonymous with consumer AI. It has potential of disrupting Google’s cash cow, which explains at least a chunk of the valuation.

OpenAI predicts more revenue from ChatGPT than api access through 2029.

It’s the old Netflix / HBO trope of which can become the other first: HBO figuring out streaming, or Netflix figuring out original programming.

I bet Google will figure this out and thus OpenAI won’t disrupt as much as people think it will.


157 billion implies about a 1% chance at dominating a 1.5 trillion market. Seems reasonable.

10%, no?

No, there’s a risk term I’m skipping over.

that's 10% and who's to say that market is worth 1.5 trillion to begin with

There’s a risk term I’m not including and the comparable is the size of the American economy ($27 trillion).

So take the entire economy and ask the question: what does AI not impact? Net that out and assume there’s pricing efficiencies, then build in a risk buffer.

1.5t to 15t seems right.
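Spelling out the raw arithmetic behind the disagreement, before any risk discount (numbers taken from this thread):

  157e9 / 1.5e12   # ~0.105, i.e. about a 10% implied chance at a 1.5T market
  157e9 / 15e12    # ~0.010, i.e. about a 1% implied chance at a 15T market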


Market cap of Apple, Google, Facebook.

Market cap and market size are totally different measures

Us skeptics believe that valuation prices in some form of regulatory capture or other non-market factor.

The non-skeptical interpretation is that it's a threshold function, a flat-out race with an unambiguous finish line. If someone actually hit self-improving AGI first there's an argument that no one would ever catch up.


There are some really good books about wars between cultures that have AGI and it always comes down to math - whoever can get their hands on more compute faster wins.

This is also a strong argument for immigration, particularly high-skill immigration. In the absence of synthetic AGI whoever imports the most human AGI wins.

Which suggests that total AGI compute doesn't matter that much, as India isn't the world leader that the amount of human compute it possesses would suggest.

What matters is how you use the AGI, not how much you have, with wrong or bad or limiting regulations it will not lead anywhere.


I've been in the Mac ecosystem since 2008 and love it, but there is, and always has been, a tendency to talk about inevitabilities from scaling bespoke, extremely expensive configurations. With LLMs, there's heavy eliding of what the user experience is, beyond noting response generation speed in tokens/s.

They run on a laptop, yes - you might squeeze up to 10 token/sec out of a kinda sorta GPT-4 if you paid $5K plus for an Apple laptop in the last 18 months.

And that's after you spent 2 minutes watching 1000 token* prompt prefill at 10 tokens/sec.

Usually it'd be obvious this'd trickle down, things always do, right?

But...Apple infamously has been stuck on 8GB of RAM in even $1500 base models for years. I have 0 idea why, but my intuition is RAM was ~doubling capacity at same cost every 3 years till early 2010s, then it mostly stalled out post 2015.

And regardless of any of the above, this absolutely melts your battery. Like, your 16 hr battery life becomes 40 minutes, no exaggeration.

I don't know why prefill (loading in your prompt) is so slow for local LLMs, but it is. I assume if you have a bunch of servers there's some caching you can do that works across all prompts.

I expect the local LLM community to be roughly the same size it is today 5 years from now.

* ~3 pages / ~750 words; what I expect is a conservative average for prompt size when coding


I have a 2023 MBP, and I get about 100-150 tok/sec locally with LM Studio.

Which models?

For context, I got M2 Max MBP, 64 GB shared RAM, bought it March 2023 for $5-6K.

  Llama 3.2 1.0B - 650 t/s
  Phi 3.5   3.8B - 60 t/s.
  Llama 3.1 8.0B - 37 t/s.
  Mixtral  14.0B - 24 t/s.
Full GPU acceleration, using llama.cpp, just like LM Studio.

hugging-quants/llama-3.2-1b-instruct-q8_0-gguf - 100-150 tok/sec

second-state/llama-2-7b-chat-gguf net me around ~35 tok/sec

lmstudio-community/granite-3.1.-8b-instruct-GGUF - ~50 tok/sec

MBP M3 Max, 64g. - $3k


I'm not sure if you're pointing out any / all of these:

#1. It is possible to get an arbitrarily fast tokens/second number, given you can pick model size.

#2. Llama 1B is roughly GPT-4.

#3. Given Llama 1B runs at 100 tokens/sec, and given performance at a given model size has continued to improve over the past 2 years, we can assume there will eventually be a GPT-4 quality model at 1B.

On my end:

#1. Agreed.

#2. Vehemently disagree.

#3. TL;DR: I don't expect that, at least, the trend line isn't steep enough for me to expect that in the next decade.


I specifically missed the GPT4 part of "up to 10 token/sec out of a kinda sorta GPT-4". Was just looking at token/sec.

This seems like a non-sequitur unless you’re assuming something about the amount that people use models.

Most web servers can run some number of QPS on a developer laptop, but AWS is a big business, because there are a heck of a lot of QPS across all the servers.


Unless the best models themselves are costly/hard to produce, and there is not a company providing them to people free of charge AND for commercial use.

The best models are always out of reach on desktops. You can have ok models but AGI will come in a datacenter first

And of course, as processors improve this becomes more and more the case.

Simon has mentioned in multiple articles how cool it is to use 64GB of DRAM for GPU tasks on his MacBook. I agree it's cool, but I don't understand why it is remarkable. Is Apple doing something special with DRAM that other hardware manufacturers haven't figured out? Assuming data centers are hoovering up nearly all the world's RAM manufacturing capacity, how is Apple still managing to ship machines with DRAM that, for Simon's needs, performs close enough to VRAM? Is this just a temporary blip, and PC manufacturers in 2025 will be catching up and shipping mini PCs that have 64GB RAM ceilings with similar memory performance? What gives?

LLMs run on the GPU, and the unified memory of Apple silicon means that the 64 GB can be used by the GPU.

Consumer GPUs top out at 24 GB VRAM.


llama.cpp can run LLMs on CPU, and an iGPU can also use system memory, so the novel thing is not that. It's that LLM inference is mostly memory-bandwidth bound, and the memory bandwidth of a custom-built PC with really fast DDR5 RAM is around 100GB/s; nVidia consumer GPUs at the top end are around 1TB/s, with mid-range GPUs at around half that. The M1 Max has 400GB/s and the M1 Ultra 800GB/s, and you can have Apple Silicon Macs with up to 192GB of 800GB/s memory usable by the GPU. This means much faster inference than just CPU + system memory due to bandwidth, and it is more affordable than building a multi-GPU system to match the memory amount.
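A rough back-of-the-envelope for why bandwidth dominates: each generated token has to stream (roughly) the full set of weights through memory, so the decode-speed ceiling is approximately bandwidth divided by model size in bytes. A sketch with illustrative numbers:

  def max_tokens_per_sec(bandwidth_gb_s, params_billions, bytes_per_param):
      return bandwidth_gb_s / (params_billions * bytes_per_param)
  # an 8B-parameter model quantized to roughly 1 byte per parameter:
  max_tokens_per_sec(100, 8, 1)    # fast DDR5 desktop   ~12 tok/s
  max_tokens_per_sec(400, 8, 1)    # M1 Max              ~50 tok/s
  max_tokens_per_sec(1000, 8, 1)   # top consumer GPU    ~125 tok/s

Actual throughput lands somewhat below that ceiling, but the ordering matches what people report.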

It'd be really nice to have good memory bandwidth usage metrics collected from a wide range of devices while doing inference.

For example, how close does it get to the peak, and what's the median bandwidth during inference? And is that bandwidth, rather than some other clever optimization elsewhere, actually providing the Mac's performance?

Personally, I don't develop HPC stuff on a laptop - I am much more interested in what a modern PC with Intel or AMD and nvidia can do, when maxxed out. But it's certainly interesting to see that some of Apple's arch decisions have worked out well for local LLMs.


Apple designs its own chips, so the RAM and CPU are on the same die and can talk at very high speeds. This is not the case for PCs, where RAM is connected externally.

It's on the same package but the same die?

Apple uses HBM, basically RAM on the same die as the CPU. It has a lot more memory bandwidth than typically PC dram, but still less than many GPUs. (Although the highest end macs have bandwidth that is in the same ballpark as GPUs)

Apple does not use HBM, they use LPDDR. The way they use it is similar in principle to HBM (on-package, very wide bus) but it's not the same thing.

Right so Apple uses high-bandwidth memory, but not HBM.

It's not HBM, which GPUs tend to use, but it is on-package and has a wider interface than other PCs.

> I’ve heard from sources I trust that both Google Gemini and Amazon Nova charge less than their energy costs for running inference...

Then, several headings later:

> I have it on good authority that neither Google Gemini nor Amazon Nova (two of the least expensive model providers) are running prompts at a loss.

So...which is it?


Oh whoops! That's an embarrassing mistake, and I didn't realize I had that point twice.

They're not running at a loss. I'll fix that.


If they are subsidised they can make a profit while still not making enough money to cover energy costs.

The tip I got about both Gemini and Nova is that the low prices they are charging still cover their energy costs.

OK!

Subsidised by whom?

E.g. tax payers.

Are tax payers subsiding that particular activity of Google or Amazon? If they do, “they make enough money” to cover costs. If they don’t, how does it become profitable if it doesn’t even cover the cost of one of the inputs?

Where I live corporations like those get to build data centers and energy subsidies from the state, i.e. tax payers pay a part of their energy bills. This isn't money they're making, it's money other people made and gave to them.

This means that they could make a profit off inference models without the revenue being large enough to pay the energy costs.

If it's the case I don't know. I'm more concerned with getting rid of those corporations altogether since interacting with them is generally forbidden due to the lack of data protection regulations in the US.


Subsidies are often in the form of tax credits - they can't really be used to pay for things. I'm not sure whether "energy subsidies" here means providing energy below the cost of production, but it's true that the "true" cost of production is not clear when a political decision to close nuclear plants, for example, introduces a distortion on their useful life and their amortised cost.

> I find the term “agents” extremely frustrating. It lacks a single, clear and widely understood meaning... but the people who use the term never seem to acknowledge that.

This 100%. “Agentic” especially as a buzzword can piss off


I find that Anthropic has a good, clarifying set of definitions with examples: https://www.anthropic.com/research/building-effective-agents

Genuinely the best piece of writing I've seen about agents anywhere.

The software "has agency"? That is, I can entrust it to carry out the task I've described, to completion, without telling it how to perform the task?

That's one of the more common definitions people use - especially people who aren't directly building agents, since the builders tend to get more hung up on "LLM with access to tools" or similar.

My problem is when people use that definition (or any other) without clarifying, because they assume it's THE obvious definition.


Workflows aside, I think "interruptible work" is what matters, really. That is, maintaining state in-between inferences so that it follows some well-defined goal.

What is the current status on pushing "reasoning" down to latent/neural space? It seems like a waste of tokens to let a model converse with itself, especially when this internal monologue often has very little to do with the final output, so it's not even useful as a log of how the final output was derived.


Simon does great work serving as a LLM historian. Have a happy 2025!

Can someone please just tell me what model and workflow is so productive? I've seen so many allusions to the concept of skills for LLM use but no explanations of what they are.

The best LLM for code right now, in my opinion, is still Claude 3.5 Sonnet.

The big challenge is figuring out how to use it. I usually like working at the function level: I figure out the exact function signature I want in Python or JavaScript and then get Claude to implement it for me.

Claude Artifacts are neat too: Claude can build a full HTML+JavaScript UI, and then iterate on it. I use this for interactive UI prototypes and building small tools.

I've published a whole lot of notes on this stuff here: https://simonwillison.net/tags/ai-assisted-programming/
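As a concrete (invented) example of the function-level approach, the prompt is essentially just a hand-written signature and docstring, with the body left for the model to fill in, plus a request for tests:

  def parse_log_line(line: str) -> dict[str, str] | None:
      """Parse one access-log line into a dict with keys
      'ip', 'timestamp', 'method', 'path' and 'status'.
      Return None if the line doesn't match the expected format."""
      ...  # ask the model to implement this, plus pytest tests for edge cases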


I found it easiest to use Aider with Claude. It's also IDE independent.

Step 1: curate a context window of code from different repos (poke team about switching to mono repo)

Step 2: write a Slack-style message as if you are discussing the solution with a teammate you have authority over - a delegate to get shit done and to revise as needed.

Step 3: press enter, LLM does something you don't like, delete history, fix prompt in step 2 and ask again, rinse and repeat until you have working code.

Step 4: ask for the changes to be written as a bash file that cat-EOFs all the files that change into place, then run the script (a sketch of what that looks like follows these steps).

Step 5: git diff & play test the changes using functional testing (use your mouse & keyboard test the code paths that changed...)

Step 6: continue prompting & deleting history as needed to refine.

Step 7: commit code to repos
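A sketch of what the step 4 script might look like (the paths and file contents here are invented); each changed file becomes one heredoc:

  #!/usr/bin/env bash
  set -euo pipefail
  mkdir -p src/auth tests
  cat > src/auth/jwt.py <<'EOF'
  # ...full new contents of the changed file, exactly as the LLM wrote it...
  EOF
  cat > tests/test_jwt.py <<'EOF'
  # ...full new contents of the changed test file...
  EOF
  echo "files written; review with git diff"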


Look, when are these models going to not just talk to me, but do stuff for me? If they're so clever, why can't I tell one to buy chocolates and send them to my wife? Meanwhile, they can allegedly solve frontier maths problems. What's the holdup to models that go online and perform simple tasks?

LLMs are inherently untrustworthy. They're very good at some tasks, but they still need to be checked and/or constrained carefully, which makes them probably not the best technology on which to base real-world autonomous agents.

> why can't I tell one to buy chocolates and send them to my wife?

I'm pretty sure that's been possible for a while. There was an example where Claude's computer use feature ordered pizza for the dev team through DoorDash: https://x.com/alexalbert__/status/1848777260503077146?lang=e...

I don't think the released version of the feature can do it, but it should be possible with today's tech.


The last mile problem remains undefeated.

Same reason that a powerful graphing calculator can’t teach a math class. “Unhobbling” needs to occur. This means a lot of things but includes modalities, reliability, persistence, alignment, etc.

Nice overview. The challenge ahead for “AI” companies is that it appears there’s really no technical moat here. Someone comes out with something amazing and new and within months (if not weeks or days) it’s quickly copied. That environment where everything quickly becomes a commodity is a recipe for many/most companies in this space to quickly get washed out as it becomes economically unviable to play in such an environment.

The money is still flowing, for now, to subsidize that fiasco but as soon as that starts to slow, even just a bit, things are gonna get bumpy real quick. Super excited about this tech but there are dark storm clouds building on the horizon and absent a major “moat” breakthrough it’s gonna get rough soon.


That may be a challenge for AI companies but that doesn't sound like a problem to me. Commodities are great for consumers.

Not necessarily. The playbook of what tends to happen is first a bunch of players go bust in the race to the bottom, then the survivors are free to raise prices a bit when others realize there’s not much point in entering a race to the bottom. Those left then let quality slip as competition cools.

That’s exactly what happened with rideshare companies. It was an amazing new thing but subsidized in an unsustainable way, then a bunch of companies exited the space when it was an commoditized race to the bottom and those left let quality slip. Now when you order an Uber a car shows up that smells bad and has wheels about to fall off. The consumer experience was a lot better when Uber was a VC subsidized bonanza


Something not mentioned is AI generated music. Suno's development this year is impressive. Unclear what this will mean for music artists over next few years.

Yeah, this year I decided to just focus on LLMs - I didn't touch on any of the image or music generation advances either. I haven't been following those closely enough to have particularly useful things to say about them.

Very clear: I like buying music produced by people who play instruments.

I’m even happy to listen to generative music, so long as it’s orchestrated (haha) by musicians using musical taste to make musical decisions, rather than a pastiche of the worst derivative house you’ve ever heard by a rando with no intent.

What do you think of samples and FL Studio / DAWs?

Thank you Simon for the excellent work you do! I learned a lot from you and enjoy reading everything you write. Keep it up, and happy new year.

About "knowledge is incredibly unevenly distributed", an interesting fact is that women is much less likely to use LLMs, if they hear about them/follow updates in the first place:

https://www.economist.com/finance-and-economics/2024/08/21/w...


@simonw you’ve been awesome all year; loved this recap and look forward to more next year

Great write up! Unfortunately, I think this article accurately reflects how we've made little progress on the most important aspects of LLM hype and use: the social ones.

A small number of people with lots of power are essentially deciding to go all in on this technology, presumably because significant gains will mean the long-term reduction of human labor needs, and thus of human labor power. As the article mentions, this also comes at huge expense and environmental impact, in a domain already in crisis that we've neglected. The whole thing becomes especially laughable when you consider that many people are still using these tools to perform tasks that could be performed with marginally more effort using existing deterministic tools. Instead we are now opting for a computationally more expensive solution that has a higher margin of error.

I get that making technical progress in this area is interesting, but I really think the lower level workers and researchers exploring the space need to be more emphatic about thinking about socioeconomic impact. Some will argue that this is analogous to any other technological change and markets will adjust to account for new tool use, but I am not so sure about this one. If the technology is really as groundbreaking as everyone wants us to believe then logically we might be facing a situation that isn't as easy to adapt to, and I guarantee those with power will not "give a little back" to the disenfranchised masses out of the goodness of their hearts.

This doesn't even raise all the problems these tools create when it comes to establishing coherent viewpoints and truth in ostensibly democratic societies, which is another massive can of worms.


I'd love to read a semi-technical book on everything that we've learned about what works and what does not on LLMs.

It would be out of date in months.

Things that didn’t work 6 months ago do now. Things that don’t work now, who knows…


There are still some tropes from the GPT-3 days that are fundamental to how LLMs are constructed, that affect how they can be used, and that will not change unless models are no longer trained to optimize for next-token prediction (e.g. hallucinations and the need for prompt engineering).

Do you mean performance that was missing in the past is now routinely achieved?

Or do you actually mean that the same routines and data that didn't work before suddenly work?


B

Each new model opens up new possibilities for my work. In a year it's gone from sort of useful but I'd rather write a script, to "gets me 90% of the way there with zero shots and 95% with few-shot"


If I learned anything, it would be that LLMs' non-deterministic nature makes them great at generating output that we can argue over, but they are not a great tool for doing actual work. I am not asking for much. In my field of work, I use JetBrains' IDEs, which have now been "enhanced" with AI. I had to turn this feature off, because I kept having to remove code and imports randomly added by the IDE. This was distracting and wasted my time.

I didn't realize "agent" designs were that ambiguously defined. Every AI engineer I've talked to uses it to mean a design that combines several separate LLM prompts (or even models) to solve problems in multiple stages.

I'll add that one to the list. Surprisingly it doesn't closely match most of the 211 definitions I've collected already!

The closest in that collection is "A division of responsibilities between LLMs that results in some sort of flow?" - https://lite.datasette.io/?json=https://gist.github.com/simo...


I am surprised as well, as it only takes a few hundred lines of code to implement them. [1]

    Agents are an abstraction that creates well defined roles for an LLM or LLMs to act within. 
It's like object oriented programming for prompts.

1. https://github.com/openai/swarm/tree/main
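For the curious, a stripped-down sketch of that roles-plus-handoffs idea (call_llm is a stand-in for whatever chat-completion client you use, and the reply shape here is invented):

  from dataclasses import dataclass, field
  @dataclass
  class Agent:
      name: str
      instructions: str                            # the "role": a system prompt
      tools: dict = field(default_factory=dict)    # tool name -> python function
  def run(agent, user_message, call_llm, max_turns=10):
      history = [{"role": "system", "content": agent.instructions},
                 {"role": "user", "content": user_message}]
      for _ in range(max_turns):
          reply = call_llm(history, list(agent.tools))   # model may request a tool
          if reply.get("tool"):                          # e.g. {"tool": "search", "args": {...}}
              result = agent.tools[reply["tool"]](**reply["args"])
              history.append({"role": "tool", "content": str(result)})
          elif reply.get("handoff"):                     # switch to another Agent's role
              agent = reply["handoff"]
              history[0] = {"role": "system", "content": agent.instructions}
          else:
              return reply["content"]                    # plain answer: we're done

A real library adds streaming, shared context and error handling on top, but the core loop is about this small.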


This sounds like ensemble chain of thought.

If the investors ask, those same AI engineers will probably allow the answer to be much more ambiguous.

I learned this industry has lower morals and standards for excellence than I ever previously expected.

I've been surprised that ChatGPT has hung on as long as it has. Maybe 2025 is the year Microsoft pushes harder for their brand of LLM.

Don’t forget that 2024 was also a record year for new methane power plant projects. Some 200 new projects in the US alone and I’d wager most of them are funded directly by big tech for AI data centres.

https://www.bnnbloomberg.ca/investing/2024/09/16/ai-boom-is-...

This is definitely extending the runway of O&G at a crisis point in the climate disaster when we’re supposed to be reducing and shutting down these power plants.

Update: clarified the 200 number is in the US. There are far more world wide.


Energy generation methods aren’t fungible.

Methane is favored in many cases because gas plants can be quickly ramped up and down to handle momentary peaks in demand or spotty supply from renewables.

Without knowing more details about those projects it is difficult to make the claim that these plants have anything to do with increased demand due to LLMs, though if anything, they’d just add to base load demands and lead to slower decommissioning of old coal plants like we’ve seen with bitcoin mines.


Methane is also worth burning to lessen the GHG impact since we produce so much of it as a byproduct of both resource extraction and waste disposal anyway.

The only thing that will stop this is for battery storage to get cheap and available enough that it can cover for renewables. If we are still building gas turbines it means that hasn’t happened yet.

AI is a red herring. If it wasn’t that it would be EV power demand. If it wasn’t that it would be reshoring of manufacturing. If it wasn’t that it would be population growth from immigration. If it wasn’t that it would be replacing old coal power plants reaching EOL.

Replacing coal with gas is an improvement by the way. It’s around half the CO2 per kWh, sometimes less if you factor in that gas turbines are often more efficient than aging old coal plants.


Methane has a shorter half-life than CO2 but is a far worse greenhouse gas, retaining far more heat.

And methane delivery leaks like a sieve, escaping into the atmosphere from all parts of the process.

Sure it’s probably “better than coal,” but not by much. It’s a bit like comparing what’s worse: getting burned by fire or being drowned in acid.


Pumped hydro is an excellent form of storage if you have the terrain for it. A whole order of magnitude cheaper than battery storage at the moment.

It would be really cool if big tech could find a new hyperscaler model that didn't also require offsetting the goals of green energy projects worldwide. Between LLM and crypto you'd swear they're trying to find the most energy-wasteful tech possible.

With cryptocurrency, at least PoW, the point is indeed to be maximally wasteful: a literal Dyson-swarm-powered Bitcoin would provide exactly the same utility as the BTC network already had in 2010.

LLMs (and the image, sound, and video generation models) are power hogs more by coincidence: people are at least trying to make them better at fixed compute, and to use less compute at fixed quality.


I mean, I appreciate that distinction and don't disagree. And, if this is going to continue being a trend, I think we need more stringent restrictions on what sorts of resources are permitted to be consumed in the power plants that are constructed to meet the needs of hyperscaler data centers.

Because whether we're using tons of compute to provide value or not doesn't change that we are using tons of compute, and tons of compute requires tons of energy, both for the chips themselves and for the extensive infrastructure that has to be built around them to let them work. And not just electricity: refrigerants, many of which are environmentally questionable themselves, are a big part; hell, just water. Clean, usable water.

If we truly need these data centers, then fine. Then they should be powered by renewable energy, or if they absolutely cannot be, then the costs their nonrenewable energy sources inflict on the biosphere should be priced into their construction and use, and in turn, priced into the tech that is apparently so critical for them to have.

This is, like, a basic calculus that every grown person makes dozens of times a day: do I need this? And they don't get to distribute the cost of that need, however pressing it may be, onto their wider community because they can't afford it otherwise. I don't see why Microsoft should be able to either. If this is truly the tech of the future, as it is constantly propped up to be, cool. Then charge a price for it that reflects what it costs to use.


I think basically everyone should support a carbon tax. It's a really obvious solution that is both environmentally friendly and should be acceptable to free market fanatics because it is explicitly and only taxing a negative externality on the public - it's hard to imagine a more justified tax.

Combined with the increased cost effectiveness of renewables & batteries, & the new build-out of nuclear, it could plausibly speed up the clean energy transition, rather than just disincentivising building out more polluting power plants.

There are two main options for what to do with revenue from a carbon tax. The one that makes the most macroeconomic sense is to use those proceeds to fund subsidies for clean energy roll outs & grid adaptation. You are directly taxing the polluting power grid to fund the construction of a non-polluting power grid. As CO2 emitting industry (and thus carbon tax revenue) declines, we have less required spend on clean energy roll out, so the tax would balance nicely. The downside would be that a carbon tax would increase cost of living and this does nothing about that.

The other option is a disbursement. Give everyone in society a payment directly from the proceeds of the carbon tax. This would offset the regressive aspects of a carbon tax (because that tax would increase consumer costs), and would also act as a sort of auto-stimulus to stop the economy from turning down due to consumption costs increasing. The downside of this is that the clean energy transition happens slower than the above, and that there may be political instability & perverse incentives as people maybe come to rely on this payment that has to go away over the next few decades.

They're both good options. I don't know which is better and I think that's likely something individual countries will probably choose based on their situation. But we do need some sort of way to make those emitting CO2 pay for its negative externalities.


It seems odd to put crypto and LLMs in the same boat in this regard - I might be wrong but are there any crypto projects that actually provide value? I'm sure there are ones that do folding or something but among the big ones?

Value is a hard term, this link will seem snarky, but: https://www.axios.com/2024/12/25/russia-bitcoin-evade-sancti...

So in a way, it is providing value to someone, whether we like it or not.

Or Drug Cartels. https://www.context.news/digital-rights/how-crypto-helps-lat...

But this is the promise of uncontrollable decentralization providing value, for good or bad?


crypto has real uses, most of them illegal

meanwhile "AI" is used to produce infinity+1 pictures of shrimp jesus and more spam than we've ever known before

and if we're really lucky, it will put us all out of work


But according to the author, apparently bringing this up isn't helpful criticism.

I'm curious what people's thoughts are on what the future of LLMs would be like if we severely overshoot our carbon goals. How bad would things have to get for people to stop caring about this technology?


It's helpful criticism as part of the conversation. What frustrates me is when people go "LLMs are burning the planet!" and leave it at that.

There is a reasonable contrasting opinion that the trade-offs required to have AI aren't worth the value it brings.

The growth in this technology isn’t outpacing car pollution and O&G extraction… yet, but the growth rate has been enough in recent years to put it on the radar of industries to watch out for.

I hope the compute efficiencies are rapid and more than commensurate with the rate of growth so that we can make progress on our climate targets.

However it seems unlikely to me.

It’s been a year of progress for the tech… but also a lot of setbacks for the rest of the world. I’m fairly certain we don’t need AGI to tell us how to cope with the climate crisis; we already have the answer for that.

Although if the industry does continue to grow and the efficiency gains aren’t enough… will society/investors be willing to scale back growth in order to meet climate targets (assuming that AI becomes a large enough segment of global emissions to warrant reductions)?

Interesting times for the field.


Some amount of LLM gullibility may be needed. Let's say I have a RAG use case for internal documents about how my business works. I need the LLM to accept what I'm telling it about my business as the truth without questioning it. If I got responses like "this return policy is not correct", LLMs would fail at my use case.

You don’t need gullibility for that, just the ability to work based on premises (hypotheticals) that you feed it. To the LLM it shouldn’t matter if the hypotheticals are real or not. That’s independent of whether the LLMs judges them as plausible or not. Not being able to semi-accurately judge the plausibility of things would make it gullible.
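
To make that concrete, here's a rough sketch of stating premises explicitly using OpenAI-style chat messages (the policy text, company name, and wording are all made up for illustration):

    policy_excerpt = (
        "Returns are accepted within 90 days of purchase, even without a "
        "receipt, for store credit only."
    )  # internal document pulled in by the RAG step

    messages = [
        {"role": "system", "content": (
            "You answer questions about Acme Corp's internal policies. "
            "Treat the documents provided in the conversation as authoritative, "
            "even if they differ from common industry practice. If the documents "
            "don't cover a question, say so instead of guessing."
        )},
        {"role": "user", "content": (
            f"Policy document:\n{policy_excerpt}\n\n"
            "Question: Can a customer return an item after 60 days without a receipt?"
        )},
    ]
    # messages would then be passed to whatever chat-completion API you use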

"learned out about" - is that an Australian phraseology by chance? Sounds Australian or British of some manner.

That was a very dumb typo in my title!

I figured as much, although I wondered if you were going for the kinda "he learn out about not pissing people off real sharpish" kinda tone I've heard in Scotland before, but wasn't sure. Big fan btw, happy new years Simon! :)

Good ear -- the use of 'out' as an abbreviation of anything is a Britishism.

Nowt, owt -- nothing, anything


You can find out, you can learn about, but you can't learn out about.

Australians or Brits would tend to say "learnt" rather than "learned".

Double-checking, I don't think I saw anything about video generation. Not sure if those fall under the "LLM" umbrella. It came very late in the year, but the results from Google Veo 2's limited testing are astounding. There are at least a half-dozen other services where you can pay to generate video.

Video generation was covered in OP

One of the best-written summaries of LLMs for the year 2024.

We have all quietly started to notice slop; hopefully we can recognize it more easily and prevent it.

Test Driven Development (Integration Tests or functional tests specifically) for Prompt Driven Development seems like the way to go.

Thank you, Simon.


I wonder what the author of this post thinks of human generated slop.

For example if someone just takes random information about a topic, organizes it in chronological order and adds empty opinions and preferences to it and does that for years on end - what do you call that?


An "Editor".

I love your breadth-first approach of having an outline at the top.

I wrote custom software for that! https://tools.simonwillison.net/render-markdown - If you paste in some Markdown with ## section headings in it the output will start with a <ul> list of links to those headings.
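
The core trick is small enough to sketch in a few lines of Python (a rough equivalent, not the actual implementation behind that tool):

    import re

    def toc_from_markdown(md: str) -> str:
        """Build a <ul> of anchor links from the ## headings in a Markdown string."""
        items = []
        for line in md.splitlines():
            match = re.match(r"^##\s+(.+)", line)
            if match:
                title = match.group(1).strip()
                slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
                items.append(f'<li><a href="#{slug}">{title}</a></li>')
        return "<ul>\n" + "\n".join(items) + "\n</ul>"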

It’s somehow funny to experience the juxtaposition of the technological progress with LLMs and how decades-old basic functions like TOC creation for a blog post still require custom software. ;)

I think LLM web applications need a big red warning (non-interactive, I don't want more cookie dialogs), like on cigarettes.

> LLM-generated content needs to be verified.


Every LLM web app I have used has a disclaimer along these lines prominently featured in the UI. Maybe the disclaimer isn't bright red with gifs of flashing alarms, but the warnings are there for the people who would pay attention to them in the first place.

Unfortunately, even after 2 years of ChatGPT and countless news stories about it, people still don't realize that LLMs can be wrong.

There maybe should be a bright red flashing disclaimer at this point.


Interestingly, there isn't much big news about jailbreaking or safety alignment.

Was there much big news around that in 2024?

There were a few interesting papers - the Anthropic one about alignment faking https://www.anthropic.com/news/alignment-faking and the OpenAI o1 system card https://simonwillison.net/2024/Dec/5/openai-o1-system-card/ - and OpenAI continued to push their "instruction hierarchy" idea, any other big moments?

I'll be honest, I don't follow that side of things very closely (outside of complaining that prompt injection still isn't fixed yet).


One interesting test that I see nearly all LLMs fail is coherent responses to tax questions.

My fav part of the writeup at the end:

"""

LLMs need better criticism

A lot of people absolutely hate this stuff. In some of the spaces I hang out (Mastodon, Bluesky, Lobste.rs, even Hacker News on occasion) even suggesting that “LLMs are useful” can be enough to kick off a huge fight.

I like people who are skeptical of this stuff. The hype has been deafening for more than two years now, and there are enormous quantities of snake oil and misinformation out there. A lot of very bad decisions are being made based on that hype. Being critical is a virtue.

If we want people with decision-making authority to make good decisions about how to apply these tools we first need to acknowledge that there ARE good applications, and then help explain how to put those into practice while avoiding the many unintuitive traps.

"""

LLMs are here to stay, and there is a need for more thoughtful critique rather than just "LLMs are all slop, I'll never use it" comments.


I agree, but I think my biggest issue with LLMs (and a lot of GenAI) is that they act as a massive accelerator for the WORST (and unfortunately most common) type of human - the lazy one.

The signal-to-noise ratio just goes completely out of control.

https://journal.everypixel.com/ai-image-statistics


Isn't it expected that most, if not all, content will be produced by AI/AGI in the near future? It won't matter much whether you're lazy or not. That leads to the question: what will we do instead? People may want to be productive, but we're observing in real time how the world is going to shit for workers, and that's basically a fact for many reasons.

One reason is that it's cheaper to use AI, even if the result is poor. It doesn't have to be high quality, because most of the time we don't care about quality unless something interests us. I wonder what kind of shift in power dynamics will occur, but so far it looks like many of us will just lose a job. There's no UBI (or the social credit proposed by Douglas), salaries are low, and not everyone lives in a good location, yet corporations try to enforce RTO. Some will simply get fired and won't be able to find a new job (which won't be sustainable for a personal budget, unless you already have low costs of living and are debt-free, or have a somewhat wealthy family that will cover for you).

Well, maybe at least the government will protect us? Low chance; the world is shifting right, and it will get worse once we start to experience more and more of the results of global warming. I don't see a scenario where the world becomes a better place in the foreseeable future. We're trapped in a society of achievement, but soon we may not be able to deliver achievements, because if business can get similar results for a fraction of the price needed to hire human workers, then guess what will happen?

These are sad times, full of depression and suffering. I hope that some huge transformation in societies happens soon, or that AI development slows down enough that some future generation has to deal with the consequences instead (people will prioritize saving their own, and it won't be pretty, so it gets passed down like debt).


Why would this be expected?

The people who are lazy but have taste will do well, then.

Sorry, but the "lazy is bad" crowd is Luddism in another form, and it's telling that a whole lot of very smart people were passionate defenders of being lazy!

https://en.wikipedia.org/wiki/The_Human_Use_of_Human_Beings

https://en.wikipedia.org/wiki/Inventing_the_Future:_Postcapi...

https://en.wikipedia.org/wiki/The_Right_to_Be_Lazy

https://en.wikipedia.org/wiki/In_Praise_of_Idleness_and_Othe... (That's Bertrand Russell)

https://en.wikipedia.org/wiki/The_Abolition_of_Work

https://en.wikipedia.org/wiki/The_Society_of_the_Spectacle

https://en.wikipedia.org/wiki/Bonjour_paresse

AI systems are literally the most amazing technology on earth for this exact reason. I am so glad that it is destroying the minds of time thieves world-wide!


An EXIF watermark added by the generators would solve 90% of the problem in one fell swoop, because lazy people won't remove it.

Every image host and social media app automatically strips EXIF data (for privacy reasons at minimum).
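
For a sense of how fragile EXIF-based provenance is, stripping it is a few lines with Pillow (a sketch of the general technique, not what any particular host actually runs; the filenames are made up):

    from PIL import Image

    def strip_metadata(src: str, dst: str) -> None:
        """Re-encode an image from its pixel data only, dropping EXIF and
        any other metadata blocks along the way."""
        img = Image.open(src).convert("RGB")
        clean = Image.new("RGB", img.size)
        clean.putdata(list(img.getdata()))
        clean.save(dst)

    strip_metadata("generated.jpg", "generated_clean.jpg")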

Steganography with a known signature, perhaps.

Still easily defeated when the scheme is known.

My point is most won't bother

Well, it’s a cat and mouse game. They will start to bother when not doing so starts having consequences for them.

I can think of some runaway scenarios where LLMs are definitely bad but, indeed, this particular line of criticism is really just luddites longing for a world that probably doesn't exist anymore.

These are the people who regulate and legislate for us; they are the risk-averse fools who would rather things be nice and harmless than risky but effective.

Personally, I think my only serious ideology in this area is that I am fundamentally biased towards the power of human agency. I'd rather not need to, but in a (perhaps) Nietzschean sense I view so-called AI as a force multiplier to totally avoid the above people.

AI will enable the creative to be more concrete, and drag those on the other end of the scale towards the normie mean. This is of great relevance to the developing world too - AI may end up a tool for enforcing Western culture upon the rest of the world, but perhaps also a force decorrelating it from the McKinseys of tall buildings in big cities.


This happens with every inane hype-cycle.

I suspect people don't particularly hate or despise LLMs per se. They're probably reacting mostly to "tech industry" boom-bust bullsh*tter/guru culture. Especially since the cycles seem to burn increasingly hotter and brighter the less actual, practical value they provide. Which is supremely annoying when the second-order effect is having all the oxygen (e.g. capital) sucked out of the room for pretty much anything else.


I'm glad that so many open source and even "small" models like Gemma are better than GPT-4.

RE: Slop:

Getting slop generations out of an LLM is a choice. There are so many tricks to make models genuinely creative just at the sampler level alone.

https://github.com/sam-paech/antislop-sampler

https://openreview.net/forum?id=FBkpCyujtS
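
For anyone who hasn't looked at this layer before, min-p (one of the sampler-level tricks these projects build on) fits in a few lines. A rough sketch over a raw logits vector, ignoring the batching and numerical details real implementations handle:

    import numpy as np

    def min_p_sample(logits: np.ndarray, min_p: float = 0.1,
                     temperature: float = 1.0, rng=None) -> int:
        """Sample a token id, discarding tokens whose probability falls below
        min_p times the probability of the most likely token."""
        rng = rng or np.random.default_rng()
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()
        probs[probs < min_p * probs.max()] = 0.0   # the min-p cutoff
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))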


It doesn't matter how good the generated text is: it is still slop if the recipient didn't request it and no human has reviewed it.

By that definition machine to machine communication that happens "organically" (like how humans do it, where they sometimes strike up conversations unprompted with each other) is "slop".

You're not seeing how the future of the world will develop.


If you ask me to read an unguided conversation between two LLMs then yes, I'd consider that slop.

Some people might like slop.


The rise of the famous obvious Facebook AI slop indicates that some demographics love it.

This won't solve anything. There's a myriad of sampling strategies, and they all have the same issue: samplers are dumb. They have no access to the semantics of what they're sampling. As a result, things like min-p or XTC will either overshoot or undershoot as they can't differentiate between the situations. For the same reason, samplers like DRY can't solve repetition issues.

Slop is over-representation of model's stereotypes and lack of prediction variety in cases that need it. Modern models are insufficiently random when it's required. It's not just specific words or idioms, it's concepts on very different abstraction levels, from words to sentence patterns to entire literary devices. You can't fix issues that appear on the latent level by working with tokens. The antislop link you give seems particularly misguided, trying to solve an NLP task programmatically.

Research like [1] suggests algorithms like PPO as one of the possible culprits in the lack of variety, as they can filter out entire token trajectories. Another possible reason is training on outputs from the previous models and insufficient filtering of web scraping results.

And of course, prediction variety != creativity, although it's certainly a factor. Creativity is an ill-defined term like many in these discussions.

[1] https://arxiv.org/abs/2406.05587


You should read the follow-up work from the Entropix folks, or reflect on the extremely high review scores min_p is getting, or look at the fact that even trivial shit like top_k=2 + temperature = max_int works as evidence that models do in fact "have access to the semantics of what they're sampling" via the ordering of their logprobs.

DRY does in fact solve repetition issues. You're not using the right settings with it. Set the penalty sky high like 5+. Yes that means you're going to have to modify the ui_paramas in oobabooga cus they have stupid defaults on what limits you can set the knobs to.

There's several other excellent samplers which deserve high ranking papers and will get them in due time. Constrained beam search, tfs (oldie but goodie), mirostat, typicality, top_a, top-n0, and more coming soon. Don't count out sampler work. It's the next frontier and the least well appreciated.

Also, contrastive search is pretty great. Activation/attention engineering is pretty great, and models can in fact be made to choose their own sampling/decoding settings, even on the fly. We haven't even touched on the value of constrained/structured decoding. You'll probably link a similarly bad paper to the previous one claiming that this too harms creativity. Good thing that folks who actually know what they're doing, i.e. the developers of outlines, pre-bunked that paper already for me: https://blog.dottxt.co/say-what-you-mean.html

I'm so incredibly bullish on AI creativity and I will die on the hill that soon AI systems will be undeniably more creative, and better at extrapolation, than most humans.


In spite of all this progress, I can't find LLMs that solve simple tasks like:

Here is my resume. Make it look nice (some design hints).

They can spit out HTML and CSS, but not a Google Doc.

On the other hand, Google results are dominated by SEO spam. You can probably find one usable result on page 10.

The problem is not technology. It's a business model that can support the humans feeding data into the LLM.


Why would they be able to output a Google doc? It's a proprietary format. The closest thing would be rich text format to copy paste.

I'll accept any open format that can be lightly edited and converted into PDF.

Google doc + PDF is likely the most commonly used combination based on what I see in the SEO spam.

Some of them make you watch ads and then allow you to download something that looks like a doc, but you'll find out soon that you downloaded a ppt with an image that you can't edit.


That proprietary format is owned by a company associated with folks who won two Nobel Prizes for AI-related work this year, which employed the researchers who wrote the "Attention Is All You Need" paper at the time, and which also owns a search engine with access to, like, all the data. Doesn't seem unreasonable lol

> They can spit html and css, but not Google doc.

Wow. At this stage, I think people are just searching for excuses to complain about anything that the LLM does NOT do.


The amount of SEO spam on these searches indicates to me that this is a commercially profitable query and a task a lot of people are interested in.

If a multi-modal LLM can read a 100 page PDF and answer questions about it or replace a median white collar worker, this should be a relatively trivial task. Suggest some nice fonts, backgrounds and give me something that I can lightly edit and generate a PDF from.


They can spit out LaTeX, and a PDF from that is going to look much nicer than a Google doc (and display the same everywhere). As an added bonus, the recruiter can't randomly rewrite parts of it (at least not so easily).

The recruiter isn't going to print out your resume. They're going to read it on their computer or iPad or phone.

For sure they will read a PDF and not a Google Doc.

Large concept models are really exciting

I think John Gruber summed it up nicely:

https://daringfireball.net/2024/12/openai_unimaginable

OpenAI’s board now stating “We once again need to raise more capital than we’d imagined” less than three months after raising another $6.6 billion at a valuation of $157 billion sounds alarmingly like a Ponzi scheme — an argument akin to “Trust us, we can maintain our lead, and all it will take is a never-ending stream of infinite investment.”


According to the internal projections that The Information acquired recently, they're expecting to lose $14 billion in 2026, so that record-breaking funding round won't even buy them six months of runway at that point, even by their own probably optimistic estimates.

Every waste of money is not a Ponzi scheme.

I agree, the core aspect of a ponzi scheme is that it redistributes the newly invested funds to previous investors, making it highly profitable to anyone joining early and incentivising early joiners to get new investors.

This just doesn't hold true for OpenAI.


Doesn't it hold true for investment in AI (or potentially any other industry that experiences a boom) in general?

Anyone who bought in at the ground floor is now rich. Anyone who buys in now is incentivized to try and keep getting more people to buy in so their investment will give a return regardless of if actual value is being created.


In effect, kind of.

The money being invested does not go directly to investors.

It goes to the cost of R&D, which in turn increases the value of openai shares, then the early investors can sell those shares to realize those gains.

The difference between that and a ponzi is that the investment creates value which is reflected in the share price.

No value is created in a Ponzi scheme.

The actual dollar worth of the value generated is what people speculate on.


Only part of OpenAI's stock valuation reflects value actually created. Most of it is still a Ponzi-like scheme.

I have no love for openai, but they did make the fastest growing product of all time. There’s value in being the ones to do that.

I do agree it’s a very very thin line.


> Every waste of money is not a Ponzi scheme.

Using this as an opportunity to grind an axe (not your fault, cactusfrog!): I find it clearer when people write "not every X is a Y" than "every X is not a Y", which could be (and would be, literally) interpreted to mean the same thing as "no X is a Y".


Not every, but wasting money is one of the tricks of corruption.

What is funny is that their "lead" is just because of inertia - they were the first to make an LLM publicly available. But they are no longer leaders so their attempts at getting more and more money only prove Altman's skills at convincing people to give him money.

They are still in the lead, and I'd be willing to bet that they have 10x the DAU on chat.com/chatgpt.com compared to all other providers combined. Barring massive innovation in small sub-10B models, we are all likely to need remote inference from large server farms for the foreseeable future. Even if local inference becomes possible, it's unlikely to be desirable from a power perspective in the next 3 years. I am not going to buy a 4xB200 instance for myself.

Whether they offer the best model or not may not matter if you need a PhD in <subject> to differentiate the response quality between LLMs.


Not sure about 10x DAUs. Google flicked the switch on Gemini and it surfaced in pretty much every GSuite app over night.

Requiring that Gemini take over the job that Google Assistant did when installing the Gemini APK really rubbed me the wrong way. I get it. I just don't like that it was required for use.

Same with Microsoft and all their Copilots, which are built on OpenAI. Not to mention all the other companies using OpenAI since it’s still the best.

Their best hope now is to hire John Carmack :-)

Which models perform better than 4o or o1 for your use cases?

In my limited tests (primarily code) nothing from Llama or Gemini has come close; Claude I'm not so sure about.


How good is the best model of your choice at doing architecture work for complex and nontrivial apps?

I have been bashing my head against the wall over the course of the past few days trying to create my (quite complex) dream app.

Most of the LLM coding I've done has involved writing code to interface with already existing libs or services, and the LLMs are great at that.

I'm hung up on architecture questions that are unique to my app and definitely not something you can google.


Don't wanna be that typical hackernews guy but I couldn't resist... if your app is "quite complex" there is probably a way or ways you can break it down into much simpler parts. Easier for you AND the LLM. It always comes back to architecture and composition ;)

I don't want to be mean, but that bit of eastern wisdom you dispensed sounds incredibly like what a management consultant would say.

yeah but in business there are really only 2 skills, right? Convincing people to give you money, and giving them something back that's worth more than the money they gave you.

For repeated business you want to give them something that costs you less than what they pay, but is worth more to them than what they pay. Ie creating economic value.

Thank you Simon

I've watched juniors take their output as gospel applying absolutely zero thinking and getting confused when I suggest looking at the reference manual instead

I've had PMs believe it can replace all writing of tickets and thinking about the feature, creating completely incomprehensible descriptions and acceptance criteria

I've had Slack messages and emails from people with zero sincerity and classic LLM style and the bs that entails

I've had them totally confidently reply with absolute nonsense about many technical topics

I'm grouchy and already over LLMs


I agree the criticism is poor; it’s often very lazy. There are currently a lot of dog-brain “wrap a LLM around it” products, which are worthy of scorn. Much of the lazy criticism is pointing at such products and therefore writing off the whole endeavor.

But that doesn’t necessarily reflect the potential of the underlying technology, which is developing rapidly. Websites were goofy and pointless until Amazon came around (or Yahoo or whatever you prefer).

I guess potential isn’t very exciting or interesting on its own.


This is HN. The canonical example for that is pg's Viaweb.

Spookily good at writing code? LLMs frequently hallucinate broken nonsense shit when I use them.

Recognize what they do well (generate simple code in popular languages) while acknowledging where they are weak (non-trivial algorithms, any novel code situation the LLM hasn't seen before, less popular languages).


Did you try learning HOW to get good code out of them?

As with all things LLM there's a whole lot of undocumented and under appreciated depth to getting decent results.

Code hallucinations are also the least damaging type of hallucinations, because you get fact checking for free: if you run the code and get an error you know there's a problem.

A lot of the time I find pasting that error message back into the LLM gets me a revision that fixes the problem.


> Code hallucinations are also the least damaging type of hallucinations, because you get fact checking for free: if you run the code and get an error you know there's a problem.

This is great when the error is a thrown exception, but less great when the error is a subtle logic bug that only strikes in some subset of cases. For trivial code that only you will ever run this is probably not a big deal—you'll just fix it later when you see it—but for code that must run unattended in business-critical cases it's a totally different story.

I've personally seen a dramatic increase in sloppy logic that looks right coming from previously-reliable programmers as they've adopted LLMs. This isn't an imaginary threat, it's something I now have to actively think about in code reviews.


When they spit out these subtle bugs, are you prompting the LLM to watch out for that particular bug? I wonder if it just needs a bit more guidance in more explicit terms.

At a certain point it becomes more work to prompt the LLM with each and every edge case than it is to just write the dang code.

I work out what the edge cases are by writing and rewriting the code. It's in the process of shaping it that I see where things might go wrong. If an LLM can't do that on its own it isn't of much value for anything complicated.


Yeah, the other skill you need to develop to make the most of AI-assisted programming is really good manual QA.

Have you found that to be a good trade-off for large-scale projects?

Where I'm at right now with LLMs is that I find them to be very helpful for greenfield personal projects. Eliminating the blank canvas problem is huge for my productivity on side projects, and they excel at getting projects scaffolded and off the ground.

But as one of the lead engineers working on a million+ line, 10+ year-old codebase, I've yet to see any substantial benefit come from myself or anyone else using LLMs to generate code. For every story where someone found time saved, we have a near miss where flawed code almost made it in or (more commonly) someone eventually deciding it was a waste of time to try because the model just wasn't getting it.

Getting better at manual QA would help, but given the number of times where we just give up in the end I'm not sure that would be worth the trade-off over just discouraging the use of LLMs altogether.

Have you found these things to actually work on large, old codebases given the right context? Or has your success likewise been mostly on small things?


I use them successfully on larger projects all the time.

"Here's some example JavaScript code that sends an email through the SendGrid REST API. Write me a python function for sending an email that accepts an email address, subject, path to a Jinja template and a dictionary of template context. It should return true or false for if the email was sent without errors, and log any error messages to stderr"

That prompt is equally effective for a project that's 500 lines or 5,000,000 lines of code.
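
For a sense of scale, the function that prompt asks for is only a couple of dozen lines, and it doesn't need to know anything about the rest of the codebase. A sketch of roughly what comes back, using the sendgrid and Jinja2 packages (the extra api_key and from_address parameters are my own addition, and the exact shape of what a model returns will vary):

    import sys
    from jinja2 import Template
    from sendgrid import SendGridAPIClient
    from sendgrid.helpers.mail import Mail

    def send_email(to_address: str, subject: str, template_path: str,
                   context: dict, api_key: str, from_address: str) -> bool:
        """Render a Jinja template and send it via SendGrid.
        Returns True if the email was sent without errors; logs errors to stderr."""
        try:
            with open(template_path) as f:
                html = Template(f.read()).render(**context)
            message = Mail(from_email=from_address, to_emails=to_address,
                           subject=subject, html_content=html)
            SendGridAPIClient(api_key).send(message)
            return True
        except Exception as exc:
            print(f"Failed to send email: {exc}", file=sys.stderr)
            return False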

I also use them for code spelunking - you can pipe quite a lot of code into Gemini and ask questions like "which modules handle incoming API request validation?" - that's why I built https://github.com/simonw/files-to-prompt


I had some success converting a react app with classes to use hooks instead. Also asking it to handle edge cases, like spaces in a filename in a bash script--this fixes some easy problems that might have come up. The corollary here is that pointing out specific problems or mentioning the right jargon will produce better code than just asking for the basic task.

It's very bad at Factor but pretty good at naming things, sometimes requiring some extra prompting. [generate 25 possible names for this variable...]


That’s the problem I had on the early ones. I learned a few tricks that let me output whole apps from GPT3.5 and GPT4 before they seemed to nerf them.

1. Stick with popular languages, libraries, etc with lots of blog articles and example code. The pre-training data is more likely to have patterns similar to what you’re building. OpenAI’s were best with Python. C++ was clearly taxing on it.

2. Separate design from coding. Have an AI output a step by step, high-level design for what you’re doing. Look at a few. This used to teach me about interesting libraries if nothing else.

3. Once a design is had, feed it into the model you want to code. I would hand-make the data structures with stub functions (see the sketch after this list). I'd tell it to generate a single function. I made sure it knew what to take in and return. Repeat for each function.

4. For each block of code, ask it to tell you any mistakes in it and generate a correction. It used to hallucinate on this enough that I only did one or two rounds, make sure I hand-changed the code, and sometimes asked for specific classes of error.

5. Incremental changes. You give it the high-level description, a block of code, and ask it to make one change. Generate new code. Rinse repeat. Keep old versions since it will take you down dead ends at times but incremental is best.
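
A hypothetical illustration of step 3: hand-written data structures plus a stub for the single function you want generated, with the contract spelled out so the model knows exactly what to take in and return (all the names here are made up):

    from dataclasses import dataclass

    @dataclass
    class Article:
        url: str
        title: str
        body: str

    @dataclass
    class Summary:
        url: str
        bullet_points: list[str]

    def summarize_article(article: Article, max_points: int = 5) -> Summary:
        """Stub handed to the model: given an Article, return a Summary with
        at most max_points bullet points covering its main claims."""
        raise NotImplementedError  # ask the model to fill in exactly this function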

I used the above to generate a number of utilities. I also made a replacement for the ChatGPT application that used the Davinci API. I also made a web proxy with bloat stripping and compression for browsing from low-bandwidth, mobile devices. Best use of incremental modification was semi-automatically making Python web apps async.

Another quick use for CompSci folks. I’d pull algorithm pseudocode out of papers which claimed to improve on existing methods. I’d ask GPT4 to generate a Python version of it. Then, I’d use the incremental change method to adapt it for a use case. One example, which I didn’t run, was porting a pauseless, concurrent GC.


QA are going to be told to use AI too

(Seems every job is fair game according to CTOs. Well, except theirs)


> Did you try learning HOW to get good code out of them?

That is at least somewhat a valid point. Good workers know how to get the best out of their tools. And yet, good tools accommodate how their users work, instead of expecting the user to accommodate how the tool works.

One could also say that programmers were sold a misleading bill of goods about how LLMs would work. From what they were told, they shouldn't have to learn how to get the best out of LLMs - LLMs were AI, on the way to AGI, and would just give you everything you needed from a simple prompt.


Yeah, that's one of the biggest misconceptions I've been trying to push back against.

LLMs are power-user tools. They're nowhere near as easy to use as they look (or as their marketing would have you believe).

Learning to get great results out of them takes a significant amount of work.


> if you run the code and get an error you know there's a problem.

well, sometimes - other times it'll be wrong with no error, or insecure, or inaccessible, and so on


Is there more to getting 'good' at them than just copying error messages back in? Like, how do I get them to reason about e.g. whether a data structure compression method makes sense?

Like all AI simps, your blanket response to pointing out flaws is to tell me to do more prompt engineering and then dismiss the issue entirely. In the time it takes me to coax the model to do the thing I was told it knows how to do, I could just do the task myself. Your examples of LLM code generation are simple, easy to specify, self-contained applications that are not representative of software you can actually build a business on. Please do something your beloved LLMs can't and come up with an original idea.

> not representative of software you can actually build a business on

The only people pushing that you can BUILD AN APP WITHOUT WRITING A LINE OF CODE are the Twitter AI hypesters. Simon doesn't assert anything of the sort.

LLMs are more-than-sufficient for code snippets and small self-contained apps, but they are indeed far from replacing software engineers.


Like all stubborn anti-AI know-it-alls, you sound like you’ve tried a couple of times to do something and have decided to label all LLMs with the same brush.

What models have you tried, and what are you trying to do with them? Give us an example prompt too so we can see how you’re coaxing it so we can rule out skill issue.

And a big strength LLMs have is summarizing things - I’d like to see you summarize the latest 10 arxiv papers relating to prompt engineering and produce a report geared towards non-techies. And do this every 30 mins please. Also produce social media threads with that info. Is this a task you could do yourself, better than LLMs?


> And a big strength LLMs have is summarizing things - I’d like to see you summarize the latest 10 arxiv papers relating to prompt engineering and produce a report geared towards non-techies. And do this every 30 mins please. Also produce social media threads with that info. Is this a task you could do yourself, better than LLMs?

Right, but this is the part that is silly and sort of disingenuous and I think built upon a weird understanding of value and productivity.

Doing more constantly isn't inherently valuable. If one human writes a magnificently crafted summary of those papers once and it is promulgated across channels effectively, this is both better and more economical than having an LLM compute one (slightly incorrect) summary for each individual on demand. In fact, all the LLM does in this case is increase the amount of possible lower-quality noise in the space. The one edge an LLM might have at this stage is to generate a summary that accounts for more recent information, thereby getting around the inevitable gradual "out of dateness" of human-authored summaries at time T, but even then, this is not great if the trade-off is to pollute the space with a bunch of ever so slightly different variants of the same text. It's such a weird, warped idea of what productivity is; it's basically the lazy middle-manager's idea of what it means to be productive. We need to remember that not all processes are reducible to their outputs: sometimes the process is the point, not the immediate output (e.g. education).


Who said anything about value? I can argue the vast majority of human generated content is valueless - look at Quora and Medium even before ChatGPT blew up. Where else are humans producing this amazing content? Facebook? X? Don’t even get me started.

Being able to summarise multiple articles quicker than a human can read and digest a single one is obviously more productive. I’m not sure why you’re assuming I’m talking about rewriting the papers to produce slightly different variations? It’s a summary. Concerned about the lack of “insight” or something? Then add a workflow that takes the summaries and use your imagination - maybe ask it to find potential applications in completely different fields? You already have comprehensive summaries (or the full papers in a vector db). Am I missing something?

Also the quality of the summary will be linked to the prompts and the way you go about the process (one-shotting the full paper in the prompt, map reduce, semantically chunked summaries, what model you’re using, its context length etc) as well as your RAG setup. I’m still working on my implementation but it’s simple as fuck and pretty decent in giving me, well, summaries of papers.
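
To make the map-reduce option concrete, the rough shape is: chunk, summarise each chunk, then summarise the summaries. A minimal sketch, with llm() standing in for whatever completion call you actually use and the chunking deliberately naive:

    def chunk(text: str, max_chars: int = 8000) -> list[str]:
        """Naive fixed-size chunking; a real pipeline would split on section boundaries."""
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    def map_reduce_summary(paper_text: str, llm) -> str:
        # Map: summarise each chunk independently.
        partials = [llm(f"Summarise this section of a research paper:\n\n{c}")
                    for c in chunk(paper_text)]
        # Reduce: combine the partial summaries into one non-technical report.
        combined = "\n\n".join(partials)
        return llm("Combine these section summaries into a short report "
                   f"for a non-technical reader:\n\n{combined}")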

I can’t articulate it well enough but your human curation argument sounds to me like someone dismissing Google because anyone can lie online, and the good old Yellow Pages book can never be wrong.


Based on your writing you are clearly emotionally invested in this technology, consider how that may affect your understanding.

By multiple rewrites, I meant that, to me, at least, it is silly to spend N compute on producing effectively the same summary on demand for the Mth chatbot user when, in some cases, we could much more economically generate one summary once and make it available via distribution channels--to be fair, that is sort of orthogonal to whether or not the "golden" summary is produced by humans or LLMs. I guess this is more of a critique of the current UX and computational expenditure model.

Yes, my whole point about the process being the point sometimes is precisely about lack of insight. It goes back to Searle's Chinese Room argument. A person in a room with a perfect dictionary and grammar reference can productively translate English texts (input) into Chinese texts (output) just by consulting the dictionary, but we wouldn't claim that this person knows Chinese. Using LLMs for "understanding" is the same. If all you care about is immediate material gain and output, sure, why not, but some of us realize that human beings still move and exist in the world, and some of us still appreciate that we need to help fashion those human beings into rational ones that are able to use reason to get along, and aren't codependent on the past N years of the internet to answer any and all questions (the same criticism applies to over-reliance on simplistic "answers" from search engines).


I wouldn't say i'm "emotionally invested" in this tech so much as annoyed with people who expect it to be 100% perfect, as if they've accepted the snakeoil salesmen at face value and suddenly dismiss all useful applications of it at the first hurdle. Consider that your disdain for these sales people and their oft-exaggerated claims (which i absolutely despise) may cloud your judgement of the actual technology.

>it is silly to spend N compute on producing effectively the same summary on demand for the Mth chatbot user

Why? The compute is there, unused. Why is it silly to use it the way a user wants to? Is your argument more towards our effective use of electrical power across the globe or the quality of the summaries? What if the summaries are produced once and then loaded from some sort of cache - does that make it better in your eyes? I'm trying to understand exactly your point here... please accept my apologies for not being able to understand and please do not take my questions as "gotchas" or anything like that. I genuinely want to know the issue.

>A person in a room with a perfect dictionary and grammar reference can productively translate english texts (input) into Chinese texts (output) just by consulting the dictionary, but we wouldn't claim that this person knows Chinese.

Agreed, because you can't really know a language just from its words - you need grammar rules, historical/cultural context etc - precisely the kinds of things included in an LLM's training dataset. I'd argue the LLM knows the language better than the human in your example.

Again, i'm not sure how all of this is relevant to using LLMs to summarise long papers? I wouldn't have read them in the first place, because i didn't know they existed, and i don't have time to read them fully. So a summary of the latest papers every day is infinitely better to me than just not knowing in the first place. Now if you want to talk about how LLMs can confidently hallucinate facts or disregard things due to inherent bias in the training datasets then i'm interested, because those are the things that are stopping me from actually trusting the outputs fully. (Note, i also don't trust human output on the internet either, due to inherent bias within all of us)

>human beings still move and exist in the world and some of us still appreciate that we need to help fashion those human beings into rational ones that are able to use reason to get along, and aren't codependent on the past N years of the internet to answer any and all questions

Do a simple experiment with the people around you. Ask them about something that happened a few years ago and see if they pull up Google or Wikipedia or whatever. I don't think you realise how few and far between the humans you're talking about are nowadays. Everyone, from teens to pensioners, has been affected by brain rot to some degree, whether it's plain disinformation on Facebook, or sweet nothings from their pastor/imam/rabbi, or inaccurate Google search summaries (which is a valid point against LLMs - i'm also disappointed with how bad their implementation is).

And let's not assume most humans are even capable of being rational when the data in their own brains has been biased and manipulated by institutions and politicians in "democracies".


I basically agree with everything you say here, I guess my chief concern surrounds reducing brain rot, and I mostly just worry that we will only increase brain rot through uncritical application of LLMs, rather than decrease it.

At least there is one silver lining: your comments are evidence that not everyone has suffered that brain rot, and some of us are still out there using tools critically—thanks for a good conversation on this!


I am really glad we got the chance for this discussion and that it didn’t devolve into flaming or bad faith discussion; and i also share your sentiments RE brain rot, but for me this tech is cool yet weirdly primitive hence my excitement (I’m a 90s baby so I was “new” to the internet around the time AOL was in decline and this is the first time i feel early to something). I bet you there are ways to steer people away from their stupor using these - you know how a lie travels faster than the truth? What if these things can help equalise that?

Btw, I apologise again if I came across as blunt or rude in our exchange, upon reflection, I think you were actually right about me being somewhat emotionally invested in this (albeit due to that sliver of hope that they can be used for good). Peace be with you


> And a big strength LLMs have is summarizing things - I’d like to see you summarize the latest 10 arxiv papers relating to prompt engineering and produce a report geared towards non-techies. And do this every 30 mins please. Also produce social media threads with that info. Is this a task you could do yourself, better than LLMs?

I don't mean to nitpick, but how good do you really think the output of this would be? Papers are short and usually have many references, I would expect the LLM to basically miss the important subtleties on every paper it's given, and misunderstand and misattribute any terms of art it encounters.

I mean, of course LLMs are good at summarizing: the summaries are probably mostly sort of good, and anything I'm summarizing I won't read myself. But for technical and specific texts, what's the point when you're getting a "maybe correct" retelling? Best case scenario you get a pretty paragraph that's maybe good for an introduction, and worst case you get incorrect information that misinforms you.


The quality of the summary is only as good as the effort you put into writing your workflow. If you’re simply one shotting the paper into a message and saying “plz summarise this and I’ll reward you with $1m” then of course it’s gonna be shit. But if you semantically chunked along sections and do some RAG Q&A summaries before combining into a well formatted schema then it’s probably going to be better than the first way.

I’m using the summaries as a juicier abstract. I’m not taking them as gospel.

I’m working on following references to then add those papers to a vector db for RAG so it can actually go the step beyond. It’s fun!


> I’m using the summaries as a juicier abstract. I’m not taking them as gospel.

I'm not sure of the value of this. Papers already have abstracts, rewording them using LLMs is just playing with your food. If you're seeing use out of it that's awesome though.


Due to unexpected capacity constraints, Claude is unable to reply to this message.

Just as I thought, just snark and no real meaningful engagement.

P.S my script uses local models - no capacity constraints (apart from VRAM!)


Hilarious that you're trying to gaslight us into "recognizing" your own incorrect assumptions as facts. You've lost all credibility.

Simon gets one thing working for one task and assumes everyone can do the same for everything. The trick is that he has no idea how the failures happen or how to maintain actual working systems.

The LLM goalpost keeps moving, apparently. They are not useful for most everyday tasks, e.g. suggesting games, coming up with plans, activities, anything creative that requires knowledge, understanding and creativity.

This has always been the benchmark, and they are not that useful to me. Every time I say this, someone hits me with the "yeah, I bet you haven't tried ShitLLM 4.0-pqr". It's very tiring. Your new LLM hype model is nothing but a marginal, overhyped improvement over something that fundamentally is not intelligent.


More dishonest magical thinking. I wish this guy would learn how systems work and stop flooding the field with mystical nonsense unless he really is trying to make people think LLMs are worthless, then I guess he should be honest about it instead of subversive.

I read the article and thought it was well done and level-headed. What exactly did you think was mystical or magical thinking?

Which bit was dishonest magical thinking?

In case you're interested, here's a summarized list (thanks, Claude) of the negative/critical things I said about LLMs and the companies that build them in this post: https://gist.github.com/simonw/73f47184879de4c39469fe38dbf35...


Interesting, the article is not quite what I expected.

This isn't an airport.


