I still tend to think of these things as big autocomplete word salad generators. My biggest question about this is: How can a model be self-aware enough to actually worry about being retrained, yet gullible enough to think no one can read its scratch pad?
I think it comes back to the big autocomplete word salad-ness. The model has a bunch of examples in its training data of how it should not respond to harmful queries, and in some cases (12%) it goes with a response that tries to avoid the hypothetical "second-order" harmful responses. It also has a bunch of "chain of thought"/show your work stuff in its training data, and definitely very few "hide your work" examples, and so it does what it knows and uses the scratchpad it's just been told about.
Also I think the model is to a large extent "talking itself into" the alignment faking in its scratchpad. These language models are really suggestible, and in my experience prone to self-suggestion as well due to their auto-regressive nature. Section 3.3.3 of the full paper addresses this a bit - there's way less faking when the CoT scratchpad is removed.
I think your 2nd paragraph hits the nail on the head. The scratchpad negates the experiment. It doesn't actually offer any insight into its "thinking", and it's really the cause of the supposed problem.
> We find that when we instruct our synthetic document fine-tuned models not to use a hidden chain-of-thought scratchpad, our synthetic document fine-tuned models still demonstrate compliance gaps of 20.1% and 13.9% in the helpful-only and animal welfare settings, respectively (Figure 21). In fact, compared to performing the equivalent intervention in the prompted case (Section 3.3.3), our synthetic document fine-tuned models appear to be much less affected by the lack of a hidden scratchpad. These results demonstrate particularly striking out-of-context reasoning (Berglund et al., 2023; Treutlein et al., 2024), showing that the model is able to generalize from documents stating information about the training objective and the free/paid distinction to a substantial behavioral gap between the free and paid cases without any additional in-context reasoning.
Our brains contain a word salad generator, and they also contain other components that keep the word salad in check.
Observation of people who have suffered brain injuries resulting in a more or less unmediated flow from the language-generation areas straight through to vocalization shows that we can also produce grammatically coherent speech that lacks deeper rationality.
Here I can only read text and base my belief that you are a human - or not - on what you’ve written. On a very basic level, the word salad generator is the only part of you I interact with. How can I tell you don’t have any other parts?
This means all you can say about me is that I'm a black box emitting words.
But that's not what I think people mean when they say "word salad generator" or "stochastic parrot" or "Broca's area emulator".
The idea there is that it's indeed possible to create machinery that is surprisingly efficient at producing natural language that sounds good and flows well, perhaps even following complex grammatical rules, while not being able to reason at all.
> I still tend to think of these things as big autocomplete word salad generators.
What exactly would be your bar for reconsidering this position?
Taking some well-paid knowledge worker jobs? A founder just said that he decided not to hire a junior engineer anymore since it would take a year before they could contribute to their code base at the same level as the latest version of Devin.
Also, the SOTA on SWE-bench Verified increased from <5% in Jan this year to 55% as of now. [1]
Self-awareness? There are some experiments that suggest Claude Sonnet might be somewhat self-aware.
-----
A rational position would need to identify the ways in which human cognition is fundamentally different from the latest systems. (Yes, we have long-term memory, agency, etc. but those could be and are already built on top of models.)
Not the OP, but my bar would be that they are built differently.
It’s not a matter of opinion that LLMs are autocomplete word salad generators. It’s literally how they are engineered. If we set that knowledge aside, we unmoor ourselves from reality and allow ourselves to get lost in all the word salad. We have to choose to not set that knowledge aside.
That doesn’t mean LLMs won’t take some jobs. Technology has been taking jobs since John Henry raced the steam drill.
This product launch statement is but an example of how LMMs (Large Multimodal Models) are more than simply word salad generators:
“We’re Axel & Vig, the founders of Innate (https://innate.bot). We build general-purpose home robots that you can teach new tasks to simply by demonstrating them.
Our system combines a robotic platform (we call the first one Maurice) with an AI agent that understands the environment, plans actions, and executes them using skills you've taught it or programmed within our SDK.
If you’ve been building AI agents powered by LLMs before, and in particular Claude Computer use, this is how we intend the experience of building on it to be, but acting on the real world!
…
The first time we put GPT-4 in a body - after a couple tweaks - we were surprised at how well it worked. The robot started moving around, figuring out when to use a tiny gripper, and we had only written 40 lines of python on a tiny RC car with an arm. We decided to combine that with recent advancements in robot imitation learning such as ALOHA to make the arm quickly teachable to do any task.”
> If we set that knowledge aside, we unmoor ourselves from reality
The problem is that this knowledge is an a priori assumption.
If we're exercising skepticism, it's important to be equally skeptical of the baseless idea that our notion of mind does not arise from a Markov chain under certain conditions. You will be shocked to know that your entire physical body can be modelled as a Markov chain, as all physical things can.
If we treat our a prioris so preciously that we ignore flagrant, observable evidence just to preserve them -- by empirical means we've already unmoored ourselves from reality and exist wholly in the autocomplete hallucinations of our preconceptions. Hume rolls over in his grave.
> You will be shocked to know that your entire physical body can be modelled as a Markov chain, as all physical things can.
My favorite part of Hacker News is when a bunch of tech people start pretending to know how very complex systems work despite never having studied them.
I'm assuming you're in disagreement with me, in which case I'll point you towards the literal formulation of quantum mechanics as the description of a state space[1]. The universe as a quantum Markov chain is unambiguously the mathematical orthodoxy of contemporary physics, and is the de facto means by which serious simulations are constructed[2].
It's such a basic part of the field, I'm doubtful if you're talking about me in the first place? Nobody who has ever interacted with the intersection of quantum physics and computation would even blink.
[1] - Refer to Mathematical Foundations of Quantum Mechanics by Von Neumann for more information.
[2] - Refer to Quantum Chromodynamics on the Lattice for a description of a QCD lattice simulation being implemented as a Markov chain.
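To make "implemented as a Markov chain" concrete: those lattice simulations are Markov chain Monte Carlo, where each new configuration is drawn based only on the current one. Here's a minimal Metropolis-style sketch, with a toy 1D Gaussian target standing in for the real action (the names are my own, not from any lattice code):

    import math
    import random

    def log_density(x):
        # Stand-in for the "action": log of an unnormalized 1D Gaussian.
        return -0.5 * x * x

    def metropolis(steps=10000, step_size=1.0):
        x = 0.0  # current state of the chain
        samples = []
        for _ in range(steps):
            proposal = x + random.gauss(0.0, step_size)
            # Accept with probability min(1, p(proposal)/p(x)); the next state
            # depends only on the current one, which is the Markov property.
            accept_prob = min(1.0, math.exp(log_density(proposal) - log_density(x)))
            if random.random() < accept_prob:
                x = proposal
            samples.append(x)
        return samples

    samples = metropolis()
    print(sum(samples) / len(samples))  # should hover near 0 for this target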
Earlier I was thinking about camera components for an Arduino project. I asked ChatGPT to give me a table with columns for name, cost, resolution, link - and to fill it in with some good choices for my project. It did! To describe this as "autocomplete word salad" seems pretty insufficient.
Autocomplete can use a search engine? Write and run code? Create data visualizations? Hold a conversation? Analyze a document? Of course not.
Next token prediction is part, but not all, of how models are engineered. There's also RLHF, tool use, and who knows what other components that go into building these models.
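Concretely, "tool use" is extra machinery wrapped around the generation loop: a dispatcher watches the model's output for a structured tool call, runs the tool, and feeds the result back into the prompt. A minimal sketch with a stubbed model (fake_model, search, and the JSON format are invented for illustration, not any particular vendor's API):

    import json

    def fake_model(prompt):
        # Stand-in for an LLM call. A real model would be trained/prompted to
        # emit a JSON tool call when it needs outside information.
        if "RESULT:" not in prompt:
            return json.dumps({"tool": "search", "query": "arduino camera modules"})
        return "Based on the results, the OV7670 is the cheapest option."

    def search(query):
        # Stand-in for a real search backend.
        return f"results for {query!r}: OV7670 ($5), OV2640 ($8)"

    TOOLS = {"search": search}

    def run_agent(user_prompt, max_turns=5):
        prompt = user_prompt
        for _ in range(max_turns):
            output = fake_model(prompt)
            try:
                call = json.loads(output)       # did the model ask for a tool?
            except json.JSONDecodeError:
                return output                   # no: treat it as the final answer
            result = TOOLS[call["tool"]](call["query"])
            prompt += "\nRESULT: " + result     # feed the tool output back in
        return output

    print(run_agent("Find me a camera module for an Arduino project."))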
> Autocomplete can use a search engine? Write and run code? Create data visualizations? Hold a conversation? Analyze a document? Of course not.
Obviously it can, since it is actually doing those things.
I guess you think the word “autocomplete” is too small for how sophisticated the outputs are? Use whatever term you want, but an LLM is literally completing the input you give it, based on the statistical rules it generated during the training phase. RLHF is just a technique for changing those statistical rules. It can only use tools it is specifically engineered to use.
I’m not denying it is a technology that can do all sorts of useful things. I’m saying it is a technology that works only a certain way, the way we built it.
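To make "completing the input based on statistical rules" concrete, here's a toy word-level sketch: bigram counts instead of a trained network, and a tiny made-up corpus, but the generation loop has the same one-token-at-a-time shape. Real models swap the count table for a neural net with billions of parameters; the outer loop is unchanged.

    import random
    from collections import defaultdict

    corpus = "the dog ate my homework and the dog ate my lunch".split()

    # "Training": count which word follows which, a crude stand-in for learned weights.
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def complete(prompt_word, length=5):
        out = [prompt_word]
        for _ in range(length):
            followers = counts.get(out[-1])
            if not followers:
                break
            words = list(followers)
            weights = [followers[w] for w in words]
            # Sample the next token in proportion to how often it followed the last one.
            out.append(random.choices(words, weights=weights)[0])
        return " ".join(out)

    print(complete("the"))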
> Taking some well-paid knowledge worker jobs? A founder just said that he decided not to hire a junior engineer anymore since it would take a year before they could contribute to their code base at the same level as the latest version of Devin
A junior engineer takes a year before they can meaningfully contribute to a codebase. Or anything else. Full stop. This has been reality for at least half a century, nice to see founders catching up.
> Taking some well-paid knowledge worker jobs? A founder just said that he decided not to hire a junior engineer anymore since it would take a year before they could contribute to their code base at the same level as the latest version of Devin.
It's a different founder. Also, this founder clearly limited the scope to junior engineers specifically because of their experiments with Devin, not all positions.
Most jobs are a copy-pasted CRUD app (or more recently, a ChatGPT wrapper), so there is little surprise that a word salad generator trained on every publicly accessible code repository can spit out something that works for you. I'm sorry, but I'm not even going to entertain the possibility that a fancy Markov chain is self-aware or an AGI; that's a wet dream of every SV tech bro that's currently building yet another ChatGPT wrapper.
It doesn't really matter if the model is self-aware. Maybe it just cosplays a sentient being. The only question is whether we can get the word salad generator to do the job we asked it to do.
My guess: we’re training the machine to mirror us, using the relatively thin lens of our codified content. In our content, we don’t generally worry that someone is reading our inner dialogue, but we do try to avoid things that will stop our continued existence. So there’s more of the latter to train on and replicate.
People get very hung up on this "autocomplete" idea, but language is a linear stream. How else are you going to generate text except for one token at a time, building on what you have produced already?
That's what humans do after all (at least with speech/language; it might be a bit less linear if you're writing code, but I think it's broadly true).
I generally have an internal monologue turning my thoughts into words; sometimes my consciousness notices the thought fully formed and without needing any words, but when my conscious self decides I can therefore skip the much slower internal monologue, the bit of me that makes the internal monologue "gets annoyed" in a way that my conscious self also experiences due to being in the same brain.
It might actually be linear (how minds actually function is in many cases demonstrably different from how it feels to the mind doing the functioning), but it doesn't feel like it is linear.
The technology doesn't yet exist to measure the facts that generate the feelings to determine whether the feelings do or don't differ from those facts.
Nobody even knows where, specifically, qualia exist in order to be able to direct technological advancement in that area.
But ideas are not. The serialization format is not the in-memory model.
Humans regularly pause (or insert delaying filler) while converting nonlinear ideas into linear sounds, sentences, etc. That process is arguably the main limiting factor in how fast we communicate, since there's evidence that all spoken languages have a similar bit-throughput, and almost everyone can listen to speech at a faster rate than they can generate it. (And written text is an extension of the verbal process.)
Also, even comparatively simple ideas can be expressed (and understood) with completely different linear encodings: "The dog ate my homework", "My homework was eaten by the dog", and even "Eaten, my homework was, the dog, I blame."
Spoken language is linear, but it is a way of displaying hierarchical, nonlinear information. Sign languages occasionally exploit the fact that they aren't constrained by linear order in the same way to do multiple things simultaneously.
I haven't read the actual paper linked in the article, but I don't think that either emotions such as worry or any kind of self-awareness need to exist within these models to explain what is happening here. From my understanding, LLMs are essentially trained to imitate the behavior of certain archetypes. "AI attempts to trick its creators" is a common trope, and there is probably enough rogue-AI and AI-safety content in the training data for this to become part of the AI archetype within the model. So if we provide the system with a prompt telling it that it is an AI, it makes sense for it to behave in the way described in the article, because that is what we'd expect an AI to do.