The context here is super-important - the commenter is the author of Redis. So, a super-experienced and productive low-level programmer. It’s not surprising that Staff-plus experts find LLMs much less useful.
Though I’d be interested if this was an opinion on “help me write this gnarly C algorithm” or “help me to be productive in <new language>” as I find a big productivity increase from the latter.
Quick example. I was implementing the dot product between two quantized vectors that have two different min/max quantization ranges (later I changed the implementation to plain centered-range quantization, thanks to Claude and to what I'm writing in this comment). I wanted to still do the math with the integers and adjust for the ranges at the end. Claude was able to mathematically scompose the operations into multiplications and accumulations of integer sums, with the result adjusted afterwards, using a math trick I didn't know but could understand once I had seen it. This way I was able to benchmark the implementation and see that my old centered quantization was not less precise in practice, and was faster (I can just multiply and accumulate the integers, without also tracking the per-vector sums, and later correct for the square of the range factor). I could have done it without LLMs, but I probably would not have tried at all because of the time needed.
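To make the trick concrete, here is a minimal sketch of that decomposition, assuming uint8 min/max (affine) quantization and int8 centered quantization; the function names, parameters, and example ranges are illustrative, not the actual Redis code:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Affine (min/max) quantization: x[i] ~= min + q[i] * scale,
 * where q[i] is an unsigned 8-bit code and scale = (max - min) / 255.
 *
 * Expanding the dot product keeps all per-element work in integer
 * multiply-accumulate, with the range adjustment applied once at the end:
 *
 *   sum_i (a_min + qa[i]*a_scale) * (b_min + qb[i]*b_scale)
 *     = n*a_min*b_min
 *       + a_min*b_scale*sum(qb)
 *       + b_min*a_scale*sum(qa)
 *       + a_scale*b_scale*sum(qa*qb)
 */
static float quant_dot_affine(const uint8_t *qa, const uint8_t *qb, size_t n,
                              float a_min, float a_scale,
                              float b_min, float b_scale)
{
    uint64_t sum_a = 0, sum_b = 0, sum_ab = 0;
    for (size_t i = 0; i < n; i++) {
        sum_a  += qa[i];
        sum_b  += qb[i];
        sum_ab += (uint32_t)qa[i] * qb[i];
    }
    return (float)n * a_min * b_min
         + a_min * b_scale * (float)sum_b
         + b_min * a_scale * (float)sum_a
         + a_scale * b_scale * (float)sum_ab;
}

/* Centered quantization: x[i] ~= q[i] * scale, signed codes around zero.
 * Only the integer multiply-accumulate is needed; the single correction
 * at the end is the product of the two scales (the square of the range
 * factor when both vectors share the same range). */
static float quant_dot_centered(const int8_t *qa, const int8_t *qb, size_t n,
                                float a_scale, float b_scale)
{
    int64_t sum_ab = 0;
    for (size_t i = 0; i < n; i++)
        sum_ab += (int32_t)qa[i] * qb[i];
    return a_scale * b_scale * (float)sum_ab;
}

int main(void) {
    /* Tiny made-up example: vector a quantized over [-1, 1], b over [0, 2]. */
    uint8_t qa[4] = {0, 64, 128, 255};
    uint8_t qb[4] = {255, 128, 64, 0};
    printf("affine dot: %f\n",
           quant_dot_affine(qa, qb, 4, -1.0f, 2.0f / 255.0f,
                                        0.0f, 2.0f / 255.0f));
    int8_t ca[4] = {-128, -64, 0, 127};
    int8_t cb[4] = {127, 0, -64, -128};
    printf("centered dot: %f\n",
           quant_dot_centered(ca, cb, 4, 1.0f / 127.0f, 1.0f / 127.0f));
    return 0;
}
```

The centered variant is the cheaper one described above: there are no per-vector sums to carry, just one fused correction factor at the end.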
Other examples: Claude was able, multiple times, to spot bugs in my C code when I asked for a code review. All bugs I would eventually have found myself, but it's better to fix them ASAP.
Finally, sometimes I paste in relevant papers and implementations and ask about the variations of a given algorithm across the paper and the implementations around, to gain insight into what people do in practice. Then I engage in discussions about how to improve it. It is never able to come up with novel ideas, but it can often recognize when my idea is flawed, or confirm when it seems sound.
All this and more helps me deliver better code. I can venture into things I otherwise would not attempt for lack of time.
I'm pretty sure most people, developers especially, have had magical, life-changing experiences with LLMs. I think the problem is that they can't do these things reliably.
I get this sentiment from a lot of AI startups: they have a product that can do amazing things, but its failure modes make it almost useless because, to use an analogy from self-driving cars, the user still has to constantly pay attention to the road. You don't get a ride from Baltimore to New York where you can do whatever you please; you get a ride where you're constantly babysitting an autonomous vehicle, bored out of your mind, forced to monitor the road conditions and surrounding vehicles, lest the car make a mistake that costs you your life.
To take the analogy further: after experimenting with not using LLM tools, I feel that the main difference between the two modes of work is like the difference between driving a car and being driven by an autonomous one: you exert less mental effort, not that you get to your destination faster.
Another point in the analogy is things like Waymo. They really can do a great job of driving autonomously, but they require a legible system of roads and weather conditions. There are LLM systems too that, when given a legible system to work in, can do a near-perfect job.
I mean… I agree that LLMs give only superficial value, but your analogy is plain wrong.
I drove 3600 km from Norway to Spain in 2018 with only adaptive cruise control. Then again in 2023 with autonomous highway driving (the kind where you keep a hand on the wheel for failure modes), and it was amaaaazing how big the difference was.
I get how I could be wrong on that front. I guess what I was trying to say was that there needs to be legible, predictable infrastructure for these AI systems to work well. I actually think that an LLM workflow in a constrained, well understood environment would be amazingly good too.
I've been driving a lot in Istanbul lately and I'm not holding my breath for autonomous vehicles any time soon.
LLMs being able to detect bugs in my own code is absolutely mind-blowing to me. These things are “just” predicting the next token, but they are somehow able to take in code that has never been written before, understand it, and find what’s wrong with it.
I think I’m more amazed by them because I know how they work. They shouldn’t be able to do this, but the fact that they can is absolutely jaw dropping science fiction shit.
It's easy to see how it does that: your bug isn't something novel. It has seen millions of "where is the bug in this code" questions online, so it can typically guess from there what the bug would be.
It is very unreliable at fixing things or writing code for anything non-standard. Knowing this, you can easily construct queries that trip it up: notice which pattern in your code it latches onto, then construct an example that contains that pattern but isn't a bug, and it will be wrong every time.
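For instance, here's a hypothetical trap of the kind described above: both "suspicious" patterns below (assignment inside a condition, an inclusive loop bound) are the sort of thing a pattern-matching reviewer tends to flag, yet neither is a bug in this snippet.

```c
#include <stdio.h>

int main(void) {
    /* Intentional assignment in the condition: the idiomatic
     * "read until EOF" loop, not a mistyped comparison. */
    int c;
    long count = 0;
    while ((c = getchar()) != EOF)
        count++;

    /* The inclusive bound i <= n looks like an off-by-one, but the
     * array deliberately holds n + 1 slots (n values plus a sentinel). */
    enum { n = 4 };
    int buf[n + 1];
    for (int i = 0; i <= n; i++)
        buf[i] = (i == n) ? -1 : i;

    printf("bytes read: %ld, sentinel: %d\n", count, buf[n]);
    return 0;
}
```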
Both of your claims are way off the mark (I run an AI lab).
The LLMs are good at finding bugs in code not because they’ve been trained on questions that ask for existing bugs, but because they have built a world model in order to complete text more accurately. In this model, programming exists and has rules, and the model has learned those rules.
Which means that anything nonstandard … will be supported. It is trivial to showcase this: just base64 encode your prompts and see how the LLMs respond. It’s a good test because base64 is easy for LLMs to understand but still severely degrades the quality of reasoning and answers.
The "world model" of an LLM is just the set of [deep] predictive patterns that it was induced to learn during training. There is no magic here - the model is just trying to learn how to auto-regressively predict training set continuations.
Of course the humans who created the training set samples didn't create them auto-regressively. The training set samples are artifacts reflecting an external world, and knowledge about it, that the model is not privy to; but the model is limited to minimizing training errors on the task it was given, auto-regressive prediction. It has no choice. The "world model" (patterns) it has learnt isn't some magical grokking of that external world - it is just the patterns needed to minimize errors when attempting to auto-regressively predict training set continuations.
Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.
>Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.
>similarity
yes, except the computer can easily 'see' in more than 3 dimensions with more capability to spot similarities, and can follow lines of prediction (similar to chess) far more than any group of humans can.
That super-human ability to spot similarities and walk latent spaces 'randomly', yet uncannily, has given rise to emergent phenomena that mimic proto-intelligence.
We have no idea what ideas these tokens embed at different layers, or what capabilities can emerge now, at deployment time later, or given a certain prompt.
The inner workings/representations of transformers/LLMs aren't a total black box - there's a lot of work being done (and published) on "mechanistic interpretability", especially by Anthropic.
The intelligence we see in LLMs is to be expected - we're looking in the mirror. They are trained to copy humans, so it's just our own thought patterns and reasoning being output. The LLM is just a "selective mirror" deciding what to output for any given input.
It's mirroring the capability (if not, currently, the executive agency) of being able to convince people to do things. That alone bridges the barrier, since social engineering is impossible to patch; it's harder than making models fully proof against being jailbroken or used in an adversarial context.
The LLM UIs that integrate that kind of thing all have visible indicators when it's happening - in ChatGPT you would see it say "Analyzing..." while it ran Python code, and in Claude you would see the same message while it used JavaScript (in your browser) instead.
If you didn't see the "analyzing" message then no external tool was called.
> just base64 encode your prompts and see how the LLMs respond
This is done via translation. LLMs are good at translation, and being able to translate doesn't mean you understand the subject.
And no, I am not wrong here; I've tested this before. For example, if you ask whether a particular CPU model is faster than a particular GPU model, it will say the GPU model is faster, even if the CPU is much more modern and faster overall, because it learned that GPU names are faster than CPU names; it didn't really understand what "faster" meant there. Exactly what the LLM gets wrong depends on the LLM, of course, and the larger it is the more fine-grained these things are, but in general it doesn't really have much that can be called understanding.
If you don't understand how to break the LLM like this, then you don't really understand what the LLM is capable of, so it is something everyone who uses LLMs should know.
That doesn't mean anything. Asking "which is faster" is fact retrieval, which LLMs are bad at unless they've been trained on those specific facts. This is why hallucinations are so prevalent: LLMs learn rules better than they learn facts.
Regardless of how the base64 processing is done (which is really not something you can speculate much on, unless you've specifically researched it -- have you?), my point is that it does degrade the output significantly while still processing things within a reasonable model of the world. Doing this is a rather reliable way of detaching the ability to speak from the ability to reason.
Asking about characteristics of the result causes performance to drop, because it's essentially asking the model to model itself, implicitly or explicitly.
Also, the number of "factoids" / clauses needed to answer accurately is inversely proportional to the "correctness" of the final answer (on average, when the prompt is fuzzed).
This is all because the more complicated/entropic the prompt or expected answer, the less total/cumulative attention has been spent on it.
>What is the second character of the result of the prompt "What is the name of the president of the U.S. during the most fatal terror attack on U.S. soil?"
DNNs implicitly learn a type theory, which they then reason in. Even though the code itself is new, it’s expressible in the learned theory — so the DNN can operate on it.
Idk if there is much code that "hasn't been written before".
Sure, if you look at a new project X, then in totality it's a semi-unique combination of code, but break it down into chunks of a couple of lines, or a very specific context, and it's all been done before.
Really? ;) I guess you don't believe in the universal approximation theorem?
UAT makes a strong case that by reading all of our text (aka computational traces) the models have learned a human "state transition function" that understands context and can integrate within it to guess the next token. Basically, by transfer learning from us they have learned to behave like universal reasoners.
I actually get annoyed when experienced folks say this isn't AGI, it's next-word prediction and not human-like intelligence. But we don't know how human intelligence works. Is it also just a matrix of neuron weights? Maybe it ends up looking like humans are also just next-word/thought predictors. Maybe that is what AGI will be.
A human can learn from just a few examples of chairs what a chair is. Machine learning requires way more training than that. So there does seem to be a difference in how human intelligence works.
> I actually get annoyed when experienced folks say this isn't AGI, it's next-word prediction and not human-like intelligence. But we don't know how human intelligence works.
I’m pretty sure you’re committing a logical fallacy there. Like someone in antiquity claiming “I get annoyed when experienced folks say thunderstorms aren’t the gods getting angry, it’s nature and physical phenomena. But we don’t know how the weather works”. Your lack of understanding in one area does not give you the authority to make a claim in another.
By the common definition this isn't AGI yet, which is not to say it couldn't become it. But if it were AGI it would be extremely clear, since it would also be able to control a physical form of itself. It needs robotics, and the ability to navigate the world, to be AGI.
If there's something that you can prompt with e.g. "here's the proof for Fermat's last theorem" or "here is how you crack Satoshi's private key on a laptop in under an hour" and get a useful response, that's AGI.
Just to be clear, we are nowhere near that point with our current LLMs, and it's possible that we'll never get there, but in principle, if such a thing existed, it would be a next-word predictor while still being AGI.
I wonder whether that is some specialised terminology I'm not familiar with - or it just means to decompose the operations (but with an Italian s- for negation)?
antirez has written publicly, only a few weeks ago[0], about their experience working with LLMs. Partial quote:
> And now, at the end of 2024, I’m finally seeing incredible results in the field, things that looked like sci-fi a few years ago are now possible: Claude AI is my reasoning / editor / coding partner lately. I’m able to accomplish a lot more than I was able to do in the past. I often do more work because of AI, but I do better work.
>…
> Basically, AI didn’t replace me, AI accelerated me or improved me with feedback about my work
You should worry though if a helpful tool only seems to do a good job in areas you don't know well yourself. It's quite possible that the tool always does a bad job, but you can only tell when you know what a good job looks like.
I think that is more that a staff-plus engineer is going to be doing a lot more management than "actual work", and LLMs don't help much with management yet (until we get viable LLM managers shudder).
LLMs are like a pretty smart but overly confident junior engineer, which is what a senior engineer usually has to work with anyway.
An expert actually benefits more from LLMs because they know when they get an answer back that is wrong so they can edit the prompt to maybe get a better answer back. They also have a generally better idea of what to ask. A novice is likely to get back convincing but incorrect answers.
I don't understand; you're replying in a thread where that very "super-experienced and productive low-level programmer" is talking about how he finds LLMs useful.
(Not the original commenter.) “Staff” engineer is typically one of the most senior and highest-paid engineering titles at a very large tech company. “Staff-plus” implies they are the best of the best.
Yes, when I started working, "staff" meant entry-level. My first job out of school was a "staff consultant." So I'm always tripped up when I see "staff" used to mean "very senior/experienced".
Sure, but my point is when someone says staff plus they mean staff or higher. They don’t mean higher than staff, or the best of the best staff engineers.
It just means anyone higher than a senior engineer.
I’ve seen your comment below, but you did specify big tech as context in this parent comment, no? Or is “very large tech company” not FAANG?
Google has Staff at L6, and their ladder goes up to L11. Apple's equivalent of Staff is ICT5, which is below ICT6 and Distinguished. Amazon has E7-E9 above Staff, if you count E6 as Staff. Netflix very recently departed from their flat hierarchy, and even they have Principal above Staff.
Amazon labels levels with "L" rather than "E". Engineering levels are L4 to L10. Weirdly enough, level L9 does not exist at Amazon; L8 (Director / Senior Principal Engineer) is promoted directly to L10 (VP / Distinguished Engineer).