Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (transformer-circuits.pub)
167 points by 1wheel 13 days ago | 124 comments





This is exceptionally cool. Not only is it very interesting to see how this can be used to better understand and shape LLM behavior, but I can't help also thinking it's an interesting roadmap to human anthropology.

If we see LLMs as substantial compressed representations of human knowledge/thought/speech/expression—and within that, a representation of the world around us—then dictionary concepts that meaningfully explain this compressed representation should also share structure with human experience.

I don’t mean to take this canonically, it’s representations all the way down, but I can’t help but wonder what the geometry of this dictionary concept space says about us.


The vector space projection of the human experience. I like it.

I find Anthropic's work on mech interp fascinating in general. Their initial Towards Monosemanticity paper was highly surprising, and so is this one, with its ability to scale to a real production-scale LLM.

My observation is, and this may be more philosophical than technical: this process of "decomposing" middle-layer activations with a sparse autoencoder -- is it accurately capturing underlying features in the latent space of the network, or are we drawing order from chaos, imposing monosemanticity where there is none? Or to put it another way, were the features always there, learnt by training, or are we doing post-hoc rationalisation -- where the features exist because that's how we defined the autoencoders' dictionaries, and we learn only what we wanted to learn? Are the alien minds of LLMs truly operating on a semantic space similar to ours, or are we reading tea leaves and seeing what we want to see?

Maybe this distinction doesn't even make sense to begin with; concepts are made by man, and if clamping one of these features modifies outputs in a way that is understandable to humans, it doesn't matter whether it's capturing some kind of underlying cluster in the latent space of the model. But I do think it's an interesting idea to ponder.


Their manipulation of the vectors, and the effects it produced, would suggest that the SAE isn't just finding phantom representations that aren't really there.

I'm allergic to latent space because I've yet to find any meaning to it beyond poetics, and I develop an acute allergy when it's explicitly related to visually dimensional ideas like clustering.

I'll make a probably bad analogy: does your mindmap place things near each other like my mindmap?

To which I'd say: probably not. Mindmaps are very personal, and the more complexity we put on ours, the more personal and arbitrary they would be, and the less import the visuals would have.

ex. if we have 3 million things on both our mindmaps, it's peering too closely to wonder why you put mcdonalds closer to kids food than restaurants, and you have restaurants in the top left, whereas I put it closer to kids foods, in the top mid left.


Neural network representation spaces seem to converge, regardless of architecture: https://arxiv.org/abs/2405.07987

It would make sense for the human mental latent spaces to also converge. The reason is that the latent space exists to model the environment, which is largely shared among humans.


Why would that matter? The absolute orientation of the mind map doesn't matter - maybe my map is actually very close to yours, subject to some rotation and mirroring?

More than that, I'd think a better 2D analogy for the latent space is a force-directed graph that you keep shaking as you add things to it. It doesn't seem unlikely for two such graphs, constructed in different order, to still end up identical in the end.

Thirdly:

> if we have 3 million things on both our mindmaps, it's peering too closely to wonder why you put mcdonalds closer to kids food than restaurants, and you have restaurants in the top left, whereas I put it closer to kids foods, in the top mid left.

In a 2D analogy, maybe, but that's because of limited space. In a 20,000-D analogy, there's no reason for our mind maps to meaningfully differ here; there are enough dimensions that terms can be close to other terms for any relationship you could think of.


> there's no reason for our mind maps to meaningfully differ here

Yes there is.

If you think all training runs converge to the same bits given the same output size, I would again stress that the visual dimensions analogy is poetics and extremely tortured.

If you're making the weaker claim that generally concepts sort themselves into a space and they're generally sorted the same way if we have the same training data. Or rotational symmetry means any differences don't matter. Or location doesn't matter at all...we're in poetics.

Something that really sold me when I was in a similar mindset was learning that word2vec's king - man + woman = queen wasn't actually real, or in the model; it was just a way of explaining it simply.

Another thought from my physics days: try visualizing 4D. Some people do claim to, after much effort, but in my experience they're unserious, i.e. I didn't see PhDs or masters students in my program claiming this. No one tries claiming they can see in 5D.


Yes, I'm making the weaker claim that concepts would generally sort themselves into roughly equivalent structures, that could be mapped to each other through some easy affine transformations (rotation, symmetry, translation, etc.) applied to various parts of the structures.

Or, in other words, I think absolute coordinates of any concept in the latent space are irrelevant and it makes no sense to compare them between two models; what matters is the relative position of concepts with respect to other concepts, and I expect the structures to be similar here for large enough datasets of real text, even if those data sets are disjoint.

(More specific prediction: take a typical LLM dataset, say Books3 or Common Crawl, randomly select half of it as dataset A, the remainder is dataset B. I expect that two models of the same architecture, one trained on dataset A, other on dataset B, should end up with structurally similar latent spaces.)

> Something that really sold me when I was in a similar mindset was word2vec's king - man + woman = queen wasn't actually real or in the model. Just a way of explaining it simply.

Huh, it seems I took the opposite understanding from word2vec: I expect that "king - man + woman = queen" should hold in most models. What I mean by structural similarity could be described as such equations mostly holding across models for a significant number of concepts.
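For what it's worth, this is roughly how that check is usually run in practice. A sketch assuming gensim and its downloadable GloVe vectors (the often-cited caveat, which may be what the "wasn't actually real" point refers to, is that the library excludes the query words from the candidates; the raw nearest vector to king - man + woman is frequently "king" itself):

    # Sketch: checking "king - man + woman ≈ queen" on pretrained vectors.
    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-100")  # pretrained word vectors

    # gensim drops the query words from the candidate list, which is part
    # of why the analogy "works" so cleanly in demos.
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))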


What would be an appropriate test?

- Given 2 word embedding sets,

- For each pair (A,B) of embeddings in one set,

- There exists an equivalence (A’,B’) in the other set,

- Such that dist(A,B) ≈ dist(A’, B’),

Something like that, to start. But would need to look at longer chains of relations.
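A rough sketch of that first check, assuming two embedding tables keyed by a shared vocabulary (emb1, emb2 and the word list are placeholders); comparing pairwise distances sidesteps the absolute-orientation problem, since rotations and reflections drop out:

    # Compare pairwise cosine distances over a shared vocabulary in two
    # independently trained embedding sets.
    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def distance_agreement(emb1, emb2, words):
        X = np.stack([emb1[w] for w in words])
        Y = np.stack([emb2[w] for w in words])
        d1 = pdist(X, metric="cosine")   # all dist(A, B) in space 1
        d2 = pdist(Y, metric="cosine")   # all dist(A', B') in space 2
        # High rank correlation means dist(A,B) ≈ dist(A',B') holds broadly.
        rho, _ = spearmanr(d1, d2)
        return rho

Longer chains of relations would need more than this, e.g. checking whether an orthogonal Procrustes alignment of one space onto the other preserves nearest neighbours.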


I think you are hung up on the visual representation.

Last week, the post about jailbreaking ChatGPT(?) talked about turning off a direction in possibility-space to disable the "I'm sorry, but I can't..." message.

In a regular program, it would be a boolean variable, or a single ASM instruction.

And you could ask the same thing. "How does my program have an off switch if there aren't enough values to store all possible meanings of "off"? Does my off switch variable map to your off switch variable?"

And the answer would be yes, or no, or it doesn't matter. It's a tool/construct.


This sounds a bit similar to how marketers have thought about the concept of brands and how they cluster in people's minds for a long time.

I mean, it's mostly about how close concepts are to each other, and to some extent how different concepts are placed on a given axis. Of course the concept space is very high-dimensional, so it's not easy to visualise without reducing the dimensions, but because we mostly care about distance that reduction is not particularly lossy. It does mean that top-left vs bottom-right doesn't mean much; it's more that mcdonalds is usually closer to food than it is to, say, gearboxes (and a representation that doesn't do that probably doesn't understand the concepts very well).

what if you averaged over millions of people's mindmaps?

> concepts are made by man

I find this statement... controversial?

The canonical example would be mathematics - is it discovered or invented? Does the idea of '3', or an empty set, or a straight line exist without any humans thinking about it? Is it even necessary to have any kind of universe at all for these concepts to be valid? I think the answers here are 'yes' and 'no'.

Of course, there are still concepts which require grounding in the universe or humanity, but if you can think these up first (...somehow), you should need neither.


It may be literally controversial[0], but I don't think it's wrong.

Yes, maths is an interesting (and open) question. But also, the rules of maths are the result of some set of axioms — it's not clear to me[1] that the axioms we have are necessarily the ones we must have, even though ours are clearly a really useful set.

We put labels onto the world to make it easier to deal with, but every time I look closer at any concept which has a physical reality associated with it, I find that it's unclear where the boundary should be.

What's a "word"? Does hyphenation or concatenation modify the boundary? What if it was concatenated in a different language and the meaning of the concatenation was loaned separately to the parts, e.g. "schadenfreude"? Was "Brexit" still a word before it was coined — and if yes then what else is, and if no then when did it become a word?

What's a "fish"? Dolphins are mammals, jellyfish have no CNS, molluscs glue themselves to a rock and digest their own brain.

What's a "species"? Not all mules are sterile.

Where's the cut-off between a fertilised human egg and a person? And on the other end, when does death happen?

What counts as "one" anglerfish, given the reproductive cycle has males attaching to and dissolving into the females?

There's only a smooth gradient with no sudden cut-offs going from dust to asteroids to minor planets to rocky planets to gas giants to brown dwarf stars.

There aren't really seven colours in the rainbow, and we have a lot more than five senses — there's not really a good reason to group "pain" and "gentle pressure" as both "touch", except to make it five.

[0] giving rise or likely to give rise to public disagreement

[1] however this is quite possibly due to me being wildly oblivious; the example I'd use is that one of Euclid's axioms turned out to be unnecessary, but so far as I am aware all the others are considered unavoidable?


People often modify their environment to make their concepts work. This is true even of counting:

https://metarationality.com/pebbles


It would be interesting to allow users of models to customize inference by tweaking these features, sort of like a semantic equalizer for LLMs. My guess is that this wouldn't work as well as fine-tuning, since that would tweak all the features at once toward your use case, but the equalizer would require zero training data.

The prompt itself can trigger the features, so if you say "Try to weave in mentions of San Francisco" the San Francisco feature will be more activated in the response. But having a global equalizer could reduce drift as the conversation continued, perhaps?
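If one of the open models gets the same treatment, the equalizer could plausibly be a hook that clamps a learned feature during the forward pass. A hypothetical sketch (the layer path, sae.encode/decode, FEATURE_ID and GAIN are all placeholders, not anything Anthropic exposes):

    import torch

    FEATURE_ID, GAIN = 1234, 5.0  # which dictionary feature to boost, and how much

    def equalizer_hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        f = sae.encode(resid)                      # (batch, seq, n_features)
        f[..., FEATURE_ID] = GAIN * f[..., FEATURE_ID].clamp(min=1.0)
        steered = sae.decode(f)                    # back to the residual stream
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    handle = model.layers[20].register_forward_hook(equalizer_hook)
    # ...generate as usual, then handle.remove() to restore normal behaviour.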



Thanks!

At least for right now this approach would in most cases still be like using a shotgun instead of a scalpel.

Over the next year or so I'm sure it will be refined enough to act more like a vector multiplier on activation, but simply flipping a feature on in general is going to create a very 'obsessed' model, as stated.


Great work as usual.

I was pretty upset seeing the superalignment team dissolve at OpenAI, but as is typical for the AI space, the news of one day was quickly eclipsed by the next day.

Anthropic are really killing it right now, and it's very refreshing seeing their commitment to publishing novel findings.

I hope this finally serves as the nail in the coffin on the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.


> on the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.

No matter what, there will always be a group of people saying that. The power and drive of the brain to convince itself that it is woven of magical energy on a divine substrate shouldn't be underestimated. Especially when media plays so hard into that idea (the robots that lose the war because they cannot overcome love, etc.), because brains really love being told they are right.

I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brain's banality can move things forward.


It tickles me somewhat to note that people using the phrase "stochastic parrot" are demonstrating in themselves the exact behaviour for which they are dismissive of the LLMs.

> I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brains banality can move things forward.

Indeed, though as we don't know what we're doing (and have 40 definitions of "consciousness" and no way to test for qualia), I would add that the first AI we make with these properties will likely suffer from every permutation of severe and mild mental health disorder that is logically possible, including many we have no word for because they would be incompatible with life if found in an organic brain.


Love Anthropic research. Great visuals between Olah, Carter, and Pearce, as well.

I don’t think this paper does much in the way of your final point, “it doesn’t understand what it’s saying”, though our understanding certainly has improved.


They were able to demonstrate conceptual vectors that were consistent across different languages and different mediums (text vs images) and that when manipulated were able to represent the abstract concept in the output regardless of prompt.

What kind of evidentiary threshold would you want if that's not sufficient?


My point is that you claimed this is a rebuff against those claiming models don’t understand themselves. Your interpretation seems to assign intelligence to the algorithms.

While this research allows us to interpret larger models in an amazing way, it doesn’t mean the models themselves ‘understand’ anything.

You can use this on much smaller scale models as well, as they showed 8 months ago. Does that research tell us about how models understand themselves? Or does it help us understand how the models work?


"Understand themselves" is a very different thing than "understand what they are saying."

Which exactly are we talking about here?

Because no, the research doesn't say much about the former, but yes, it says a lot about the latter, especially on top of the many, many earlier papers working in smaller toy models demonstrating world modeling.


I think the research is good, but it's disappointing that they hype it by claiming it's going to help their basically entirely fictional "AI safety" project, as if the bits in their model are going to come alive and eat them.

We just had a pandemic made from a non-living virus that was basically trying to eat us. To riff off the quote:

The virus does not hate you, nor does it love you, but you are made of atoms which it can use for something else.


Non-living isn't a great way to describe a virus, because it certainly becomes part of a living system once it gets into your cells.

Models don't do that though, unless you run them in a loop with tools they can call, so they mostly don't do that.


> Models don't do that though, unless you run them in a loop with tools they can call, so they mostly don't do that.

That's also a description of DNA and RNA. They're chemicals, not magic.

And there's loads of people all too eager to put any and every AI they find into such an environment[0], then connect it to a robot body[1], or connect it to the internet[2], just to see what happens. Or have an AI or algorithm design T-shirts[3] for them or trade stocks[4][5][6] for them because they don't stop and think about how this might go wrong.

[0] https://community.openai.com/t/chaosgpt-an-ai-that-seeks-to-...

[1] https://www.microsoft.com/en-us/research/group/autonomous-sy...

[2] https://platform.openai.com/docs/api-reference

[3] https://www.theguardian.com/technology/2013/mar/02/amazon-wi...

[4] https://intellectia.ai/blog/chatgpt-for-stock-trading

[5] https://en.wikipedia.org/wiki/Algorithmic_trading

[6] https://en.wikipedia.org/wiki/2007–2008_financial_crisis


Those can certainly cause real problems. I just feel that to find the solutions to those problems, we have to start with real concrete issues and find the abstractions from there.

I don't think "AI safety" is the right abstraction because it came from the idea that AI would start off as an imaginary agent living in a computer that we'd teach stuff to. Whereas what we actually have is a giant pretrained blob that (unreliably) emits text when you run other text through it.

Constrained decoding (like forcing the answer to conform to JSON grammar) is an example of a real solution, and past that it's mostly the same as other software security.


> I don't think "AI safety" is the right abstraction because it came from the idea that AI would start off as an imaginary agent living in a computer that we'd teach stuff to. Whereas what we actually have is a giant pretrained blob that (unreliably) emits text when you run other text through it.

I disagree; that's simply the behaviour of one of the best consumer-facing AIs, the one that gets all the air-time at the moment. (Weirdly, loads of people even here talk about AI like it's LLMs, even though diffusion-based image generators are also making significant progress and being targeted with lawsuits.)

AI is automation — the point is to do stuff we don't want to do for whatever reason (including expense), but it does it a bit wrong. People have already died from automation that was carefully engineered but which still had mistakes; machine learning is all about letting a system engineer itself, even if you end up making a checkpoint where it's "good enough", shipping that, and telling people they don't need to train it any more… though they often will keep training it, because that's not actually hard.

We've also got plenty of agentic AI (though as that's a buzzword, bleh, lots of scammers there too), independently of the fact that it's very easy to use even an LLM (which is absolutely not designed or intended for this) as a general agent just by putting it into a loop and telling it the sort of thing it's supposed to be agentic with regards to.

Even with constrained decoding, so far as I can tell the promises are merely advert, while the reality is that's these things are only "pretty good": https://community.openai.com/t/how-to-get-100-valid-json-ans...

(But of course, this is a fast-moving area, so I may just be out of date even though that was only from a few months ago).

However, the "it's only pretty good" becomes "this isn't even possible" in certain domains; this is why, for example, ChatGPT has a disclaimer on the front about not trusting it — there's no way to know, in general, if it's just plain wrong. Which is fine when writing a newspaper column because the Gell-Mann amnesia effect says it was already like that… but not when it's being tasked with anything critical.

Hopefully nobody will use ChatGPT to plan an economy, but the point of automation is to do things for us, so some future AI will almost certainly get used that way. Just as a toy model (because it's late here and I'm tired), imagine if that future AI decides to drop everything and invest only in rice and tulips 0.001% of the time. After all, if it's just as smart as a human, and humans made that mistake…

But on the "what about humans" perspective, you can also look at the environment. I'd say there are no evil moustache-twirling villains who like polluting the world, though of course there are genuinely people who do that "to own the libs"; but these are not the main source of pollution in the world, mostly it's people making decisions that seem sensible to them and yet which collectively damage the commons. Plenty of reason to expect an AI to do something that "seems sensible" to its owner, which damages the commons, even if the human is paying attention, which they're probably not doing, for the same reason 3M shareholders probably weren't looking very closely at what 3M was doing — "these people are maximising my dividend payments… why is my blood full of microplastics?"


This reminds me of how people often communicate to avoid offending others. We tend to soften our opinions or suggestions with phrases like "What if you looked at it this way?" or "You know what I'd do in those situations." By doing this, we subtly dilute the exact emotion or truth we're trying to convey. If we modify our words enough, we might end up with a statement that's completely untruthful. This is similar to how AI models might behave when manipulated to emphasize certain features, leading to responses that are not entirely genuine.

Counterpoint: "What if you looked at it this way?" communicates both your suggestion AND your sensitivity to the person's social status or whatever. Given that humans are not robots, but social, psychological animals, such communication is entirely justified and efficient.

Sadly, "sensitivity" has been overdone. It's a fine line, and corporations would rather cross it for legal/social reasons. Just as too much political correctness will hamper society, so does overdone sensitivity in an agent, be it a human or an AI.

That might be the case, but how and who determines how much is too much? I mean in the case of AI, let the market decide seems like the right answer.

You can't always do both to the fullest truth. They often conflict. To do what you suggest would imply my feelings perfectly align with the sympathetic view. That is not the case for a lot of humans or instances. If I am not saying exactly how I feel, it is watered down.

And telling me "just do both" is enforcing your world view and that is precisely what we're talking about _not_ doing.


The "fullest truth" includes your desired outcome and knowledge that they are a human. If you just want to dump facts at them and get them to shut up, go ahead and speak unfiltered. Twitter may be an example of the outcome of that strategy.

Consider a situation where you are teaching a child. She tries her best and makes a mistake on her math homework. Saying that her attempt was terrible because an adult could do better may be the "fullest truth" in the most eye-rolling banal way possible, and discourages her from trying in the future which is ultimately unproductive.

This "fullest truth" argument fails to take into account desire and motivation, and thus is a bad model of the truth.


It's rarely the case that speaking without considering other people's feelings will be the optimal method of getting what you want, even if you are a sociopath and what you want doesn't have anything to do with other people's well-being. In fact, sociopaths are a great example: they are typically quite adept at communicating in such a way as to ingratiate themselves with others. If even a sociopath gets this, then you might want to consider the wisdom of following suit.

A true AGI would learn to manipulate its environment to achieve its goals, but obviously we are not there yet.

An LLM has no goals - it's just a machine optimized to minimize training error, although I suppose you could view this as an innate hard-coded goal of minimizing next-word error (relative to the training set), in the same way we might say a machine-like insect has some "goals".

Of course RLHF provides a longer time span (entire response vs next word) error to minimize, but I doubt training volume is enough for the model to internally model a goal of manipulating the listener as opposed to just favoring surface forms of response.


The next big breakthrough in the LLM space will be having a way to represent goals/intentions of the LLM and then execute them in the way that is the most appropriate/logical/efficient (I'm pretty sure some really smart people have been thinking about this for a while).

Perhaps at some point LLMs will start to evolve from the prompt->response model into something more asynchronous and with some activity happening in the background too.


That’s not really an LLM at that point, but an agent built around an LLM.

An LLM has no explicit goals.

But simply by approximating human communication which often models goal oriented behavior, an LLM can have implicit goals. Which likely vary widely according to conversation context.

Implicit goals can be very effective. Nowhere in DNA is there any explicit goal to survive. However combinations of genes and markers selected for survivability create creatures with implicit goals to survive as tenacious as any explicit goals might be.


Yes, the short term behavior/output of the LLM could reflect an implicit goal, but I doubt it'd maintain any such goal for an extended period of time (long-term coherence of behavior is a known shortcoming), since there is random sampling being done, and no internal memory from word to word - it seems that any implicit goal will likely rapidly drift.

Agreed. They can’t accurately model our communication for very long. So any implicit motives are limited to that.

But their capabilities are improving rapidly in both kind and measure.


My thoughts

- LLM Just got a whole set of buttons you can push. Potential for the LLM to push its own buttons?

- Read the paper and ctrl+f 'deplorable'. This shows once again how we are underestimating LLMs' ability to appear conscious. It can be really effective. Reminiscent of Dr. Ford in Westworld: 'you (robots) never look more human than when you are suffering.' Or something like that, anyway. I might be hallucinating dialogue, but I'm pretty sure something like that was said, and I think it's quite true.

- Intensely realistic roleplaying potential unlocked.

- Efficiency by reducing context length by directly amplifying certain features instead.

Very powerful stuff. I am waiting eagerly when I can play with it myself. (Someone please make it a local feature)


So, to summarize:

>Used "dictionary learning"

>Found abstract features

>Found similar/close features using distance

>Tried amplifying and suppressing features

Not trying to be snarky, but this sounds mundane in the ML/LLM world. Then again, significant advances have come from simple concepts. Would love to hear from someone who has been able to try this out.
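For concreteness, the "dictionary learning" step is mechanically close to training a sparse autoencoder over captured activations; a rough sketch of the idea (sizes and the L1 coefficient are illustrative, not the paper's):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=4096, n_features=65536):
            super().__init__()
            self.enc = nn.Linear(d_model, n_features)
            self.dec = nn.Linear(n_features, d_model)

        def forward(self, x):
            f = torch.relu(self.enc(x))   # sparse, non-negative feature activations
            return self.dec(f), f

    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    def train_step(acts, l1_coeff=1e-3):
        recon, f = sae(acts)
        # Reconstruction loss plus an L1 penalty that pushes most features to zero.
        loss = (recon - acts).pow(2).mean() + l1_coeff * f.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

The non-obvious part is everything around this: capturing enough residual-stream activations, scaling the feature count, and then working out what each feature actually responds to.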


the interesting advance in the anthropic/mats research program is the application of dictionary learning to the "superpositioned" latent representations of transformers to find more "interpretable" features. however, "interpretability" is generally scored by the explainer/interpreter paradigm which is a bit ad hoc, and true automated circuit discovery (rather than simple concept representation) is still a bit off afaik.

Reminds me of this paper from a couple of weeks ago that isolated the "refusal vector" for prompts that caused the model to decline to answer certain prompts:

https://news.ycombinator.com/item?id=40242939
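The intervention there is simple enough to sketch: estimate a "refusal direction" (in that work, from contrasting activations on harmful vs. harmless prompts) and project it out of the residual stream. A rough sketch, with refusal_dir as a placeholder for that estimated vector:

    import torch

    def ablate_direction(resid, refusal_dir):
        # resid: (batch, seq, d_model); refusal_dir: (d_model,)
        d = refusal_dir / refusal_dir.norm()
        # Remove the component of each activation along the refusal direction.
        return resid - (resid @ d).unsqueeze(-1) * d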

I love seeing the work here -- especially the way that they identified a vector specifically for bad code. I've been trying to explore the way that we can use adversarial training to increase the quality of code generated by our LLMs, and so using this technique to get countering examples of secure vs. insecure code (to bootstrap the training process) is really exciting.

Overall, fascinating stuff!!


Strategic timing for the release of this paper. As of last week OpenAI looks weak in their commitment to _AI Safety_, losing key members of their Super Alignment team.

huge. the activation scan, which looks for which nodes change the most when prompted with the words "Golden Gate Bridge" and later an image of the same bridge, is eerily reminiscent of a brain scan under similar prompts...

I find this outcome expected and not really surprising, more confirmation of previous results. Consider vision transformers and the papers that showed what each layer was focused on.

well that's exactly the point -- no such result is available for language models.

There are multiple papers and efforts that have inspected the internal state of LLMs. One could even see the word2vec analysis along these lines, as evidence that the model is specializing neurons

One such example: The Internal State of an LLM Knows When It's Lying (https://arxiv.org/abs/2304.13734)

Searching phrases like "llm interpretability" and "llm activation analysis" uncovers more:

https://github.com/JShollaj/awesome-llm-interpretability


Yes, lots of activity in the space. I thought you were saying it was a dumb problem, but I was wrong.

I think this is a great paper.


yup, if you look at dropout, what it does and why, you can see additional interesting results along these lines

(dropout was found to increase resilience in models because they had to encode information in the weights differently, i.e. could not rely on a single neuron (at the limit))


I suppose, except that for a model of 7B parameters, the number of combinations of dropout that you'd be analyzing is 7B factorial. More importantly, dropout has loss minimization to guide it during training, whereas understanding how a model changes when you edit a few weights is a very broad question.

the analysis is more akin to analyzing with & without dropout, where a common number is to drop a random 50% of connections during a pass for training, thus forcing the model to not rely on specific nodes or connections

When you look at a specific input, you can look to see what gets activated or not. Orthogonal but related ideas for inspecting the activations to see effects


I continue to be impressed by Anthropic’s work and their dual commitment to scaling and safety.

HN is often characterized by a very negative tone related to any of these developments, but I really do feel that Anthropic is trying to do a “race to the top” in terms of alignment, though it doesn’t seem like all the other major companies are doing enough to race with them.

Particularly frustrating on HN is the common syllogism of: 1. I believe anything that “thinks” must do X thing. 2. LLM doesn’t do X thing 3. LLM doesn’t think

X thing is usually poorly justified as constitutive of thinking (usually constitutive of human thinking, but not of thinking writ large), nor is it explained why it matters whether the label of "thinking" applies to an LLM or not if the capabilities remain the same.


What is often frustrating to me at least is the arbitrary definition of "safety" and "ethics", forged by a small group of seemingly intellectually homogenous individuals.

Yes, even though this is a mild improvement on 20 years ago when it was an even more homogenous group.

Given how often China comes up in the context of AI, I'm wondering: Lots of people in the West treat China as mysterious and alien. I wonder how true that really is (e.g. Confucianism)? Or if it ever was (e.g. perhaps it used to be before industrialisation, which homogenises everyone regardless of the origin)?


Say more, this is half a thought

E.g., the common sentiment that "NSFW" output is to be prohibited, regardless of whether you work in a steel mill or a church.

A lot of this really isn't new; Andrej Karpathy covered the principles here 8 years ago for CS231n at Stanford: https://youtu.be/yCC09vCHzF8&t=1640

This is an illustrative comment for meta reasons, I think. Karpathy's lecture almost certainly doesn't cover the superposition hypothesis (which hadn't been invented for ANNs 8 years ago), or sparse dictionary learning (whose application to ANNs is motivated by the superposition hypothesis). It certainly doesn't talk about actual specific features found in post-ChatGPT language models. What's happening here seems like a thing LLMs are often accused of dismissively - you're pattern-matching to certain associated words without really reasoning about what is or isn't new in this paper.

I worry this is going to come across as insulting, but that's not my intention. I do this too sometimes; I think everyone does. The point is we shouldn't define true reasoning so narrowly that we think no system capable of it would ever be caught doing what most of us are in fact doing most of the time.


> I worry this is going to come across as insulting, but that's not my intention. I do this too sometimes; I think everyone does. The point is we shouldn't define true reasoning so narrowly that we think no system capable of it would ever be caught doing what most of us are in fact doing most of the time.

Indeed; to me LLMs pattern match (yes, I did spot the irony) to system-1 thinking, and they do a better job of that than we humans do.

Fortunately for all of us, they're no good at doing system-2 thinking themselves, and only mediocre at translating problems into a form which can be used by a formal logic system that excels at system-2 thinking.


By that reasoning even humans are not thinking. But of course humans are always excluded from such research - if it's human, it's thinking by default, damn the reasoning. Then of course we have snails and dogs and apes - are they thinking? Were the Neanderthals thinking? By which definition? Moving goalposts is too weak a metaphor for what is going on here, where everybody distorts the reasoning for whatever point they're trying to make today. And because I can't shut up, I'll just add my user view: if it works like a duck and outputs like a duck, it's duck enough for any practical use; let's move on and see what we do with it (like, use or harness or adopt or...).

> “By that reasoning even humans are not thinking”

I’m a neophyte, so take this as such. If we can agree that people output is not always the product of thinking, then I’d be more willing to accept computational innovations as thought-like.


neural probing has been around for a while, true - and this result is definitely building on past results. it’s basically just a scaled up version of their paper from a little while ago anywho

but Karpathy was looking at very simple LSTMs of 1-3 layers, looking at individual nodes/cells, and these results have generally thus far been difficult to replicate among large scale transformers. Karpathy also doesn’t provide a recipe for doing this in his paper, which makes me think he was just guess and checking various cells. The representations discovered are very simple


> Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).

This seems like it's trivially true; if you find two different features for a concept in two different languages, just combine them and now you have a "multilingual feature".

Or are all of these features the same "size"? They might be and I might've missed it.


I wonder how interpretability and training can interplay. Some examples:

Imagine taking Claude, tweaking weights relevant to X and then fine tuning it on knowledge related to X. It could result in more neurons being recruited to learn about X.

Imagine performing this during training to amplify or reduce the importance of certain topics. Train it on a vast corpus, but tune at various checkpoints to ensure the neural network's knowledge distribution skews. This could be a way to get more performance from MoE models.

I am not an expert. Just putting on my generalist hat here. Tell me I'm wrong because I'd be fascinated to hear the reasons.


At this risk of anthropomorphizing too much, I can't help but see parallels between the "my physical form is the Golden Gate Bridge" screenshot and the https://en.wikipedia.org/wiki/God_helmet in humans --- both cognitive distortions caused by targeted exogenous neural activation.

I recorded myself trying to read through and understand the high-level of this if anyone's interested in following along: https://maciej.gryka.net/papers-in-public/#scaling-monoseman...

I always assumed the way to map these models would be by ablation, the same way we map the animal brain.

Damage part X of the network and see what happens. If the subject loses the ability to do Y, then X is responsible for Y.

See https://en.wikipedia.org/wiki/Phineas_Gage
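On an open model that's easy to try directly: "lesion" a component by zeroing its output with a forward hook and see which capabilities degrade. A sketch (the module path is a placeholder for whatever open model you use):

    import torch

    def lesion(module, inputs, output):
        return torch.zeros_like(output)

    handle = model.layers[12].mlp.register_forward_hook(lesion)
    # ...run your evaluation prompts, diff the behaviour against baseline, then:
    handle.remove()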


We are so far ahead in the case of these models - we already have the complete wiring diagram! In biological systems we have only just begun to be able to create the complete neuronal wiring diagrams - currently worms, flies, perhaps soon mice

It's interesting that they used this to manipulate models. I wonder if "intentions" can be found and tuned. That would have massive potential for use and misuse. I could imagine a villain taking a model and amplifying "the evil" using a similar technique.

If anyone wants to team up and work on stuff like this (on toy models so we can run locally) please get in touch. (Email in profile)

I’m so fascinated by this stuff but I’m having trouble staying motivated in this short attention span world.


It looks like Anthropic is now leading the charge on safety

They always were, given that it is part of their mission.

They are trying to figure out what they actually built.

I suspect the time is coming when there will always be an aligned search AI between you and the internet.


The article doesn't explain how users can exploit these features in UI or prompt. Does anyone have any insight on how to do so?

They explicitly aren't releasing any tools to do this with their models for safety reasons. But you could probably do it from scratch with one of the open models by following their methodology.

Am I the only one to read 'monosemanticity' as 'moose-mantically'?

Like, its talking about moose magick...


I've really been enjoying their series on mech interp, does anyone have any other good recs?

"Transformers Represent Belief State Geometry in their Residual Stream":

https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transforme...

Basically finding that transformers don't just store a world-model as in "what does the world that produce the observed inputs look like?", they store a "Mixed-State Presentation", basically a weighted set of possible worlds that produce the observed inputs.


The Othello-GPT and Chess-GPT lines of work.

Was the first research work that clued me into what Anthropic's work today ended up demonstrating.


Someone should do this for Llama 3.

How are they handling attention in their approach?

That’s going to completely change what features are looked at.


They target the residual stream. Also they may have a definition of “feature” that’s more general than what you’re using. Consider reading their superposition work.

For anyone who has read the paper, have they provided code examples or enough detail to recreate this with, say, Llama 3?

While they're concerned with safety, I'm much more interested in this as a tool for controllability. Maybe we can finally get rid of the woke customer service tone, and get AI to be more eclectic and informative, and less watered down in its responses.


So they made a system by trying out thousands of combinations to find the one that gives the best result, but they don't understand what's actually going on inside.

>what the model is "thinking" before writing its response

An actual "thinking machine" would be constantly running computations on its accumulated experience in order to improve its future output and/or further compress its sensory history.

An LLM is doing exactly nothing while waiting for the next prompt.


I disagree with this. That suggests that thinking requires persistent, malleable, non-static memory. That is not the case. You can reasonably reason without increasing knowledge if you have a base set of logic.

I think the thing you were looking for was more along the lines of a persistent autonomous agent.


LLMs can reasonably reason; however, they differ in that once an output begins to be generated, it must continue along the same base set of logic. Correct me if I'm wrong, but I do not believe it can stop and think to itself that there is something wrong with the output and that it should start over at the beginning or back up to a previous state before it outputted something incorrect. Once its output begins to hallucinate it has no choice but to continue down the same path, since its next token is also based on the previous tokens it has just outputted.

Sure you can reason over a fixed "base set of logic", although there's another word for that - an expert system with a fixed set of rules, which IMO is really the right way to view an LLM.

Still, what current LLMs are doing with their fixed rules is only a very limited form of reasoning, since they just use a fixed N steps of rule application to generate each word. People are looking to techniques such as "group of experts" prompting to improve reasoning - step-wise generate multiple responses, then evaluate them and proceed to the next step.


if you zoom in enough, all thinking is an expert system with a fixed set of rules.

That's the basis of it, but in our brain the "inference engine" using those rules is a lot more than a fixed N steps - there is thalamo-cortical looping, working memory of various durations, and maybe a bunch of other mechanisms such as analogical recall, resonance-based winner-takes-all processing, etc, etc.

Current LLMs have none of that - they are just the fixed set of rules, further limited by also having a fixed number of steps of rule application.


Yes, LLMs don't have regression and that is a significant limitation - although they do have something close, by decoding one token they get to then have a thought loop. They just can't loop without outputting.

Well, not exactly a loop. They get to "extend the thought", but there is zero continuity from one word to the next (LLM starts from scratch for each token generated).

The effect is as if you had multiple people playing a game where they each extend a sentence by taking turns adding a word to it, but there is zero continuity from one word to the next because each person is starting from scratch when it is their turn.


> LLM starts from scratch for each token generated

What do you mean? They get to access their previous hidden states in the next greedy decode using attention, it is not simply starting from scratch. They can access exactly what they were thinking when they put out the previous word, not just reasoning from the word itself.


There's the KV cache kept from one word to the next, but isn't that just an optimization ?

Yes, the 'KV cache' (imo an invented novelty, everyone was doing this before they came up with a term to make it sound cool) is an optimization so that you don't have to recompute what the model was thinking when it was generating all the prior words every time you decode a new word.

But that's exactly what I'm saying - the model has access to what it was thinking when it generated the previous words, it does not start from scratch. If you don't have the KV cache, you still have to regenerate what it was thinking from the previous words so on the next word generation you can look back at what you were thinking from the previous words. Does that make sense? I'm not great at talking about this stuff in words
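Concretely, a toy single-head sketch of what the cache holds during greedy decoding (real models have many heads and layers, but the shape of the thing is the same): each new token computes only its own query, appends its key/value to the cache, and attends over everything computed so far.

    import torch
    import torch.nn.functional as F

    def decode_step(x_new, W_q, W_k, W_v, cache):
        # x_new: (1, d_model) embedding of the newest position only
        q = x_new @ W_q
        cache["k"] = torch.cat([cache["k"], x_new @ W_k], dim=0)
        cache["v"] = torch.cat([cache["v"], x_new @ W_v], dim=0)
        attn = F.softmax(q @ cache["k"].T / cache["k"].shape[-1] ** 0.5, dim=-1)
        return attn @ cache["v"]   # mixes in everything computed for earlier tokens

    # cache starts as {"k": torch.empty(0, d_head), "v": torch.empty(0, d_head)}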


I don't think you can really say it "regenerates" what it was thinking from the last prompt, since the new prompt is different from the previous one (it has the new word appended to the end, which may change the potential meanings of the sentence).

There will be some overlap in what the model is now "thinking" (and has calculated from scratch) since the new prompt is one possible continuation of the previous one, but other things it was previously "thinking" will no longer be there.

e.g. Say the prompt was "the man", and output probabilities include "in" and "ran", reflecting the model thinking of potential continuations such as "the man in the corner" and "the man ran for mayor". Suppose the word sampled was "ran", so now the new prompt is "the man ran". Possible continuations can no longer include refining who the subject is, since the new word "ran" implies the continuation must now be an action.

There is some work that has been saved, per the KV cache, in processing the new prompt, but that is only things (self attention among the common part of the two prompts) that would not change if recalculated. What the model is thinking has changed, and will continue to change depending on the next sampled continuation ("the man ran for mayor", "the man ran for cover", "the man ran his bath", etc).


Exactly. You can't reason with that you do not currently posses.

Sure, but you (a person, not an LLM) can also reason about what you don't possess, which is one of our primary learning mechanisms - curiosity driven by lack of knowledge causing us to explore and acquire new knowledge by physical and/or mental exploration.

An LLM has no innate traits such as curiosity or boredom to trigger exploration, and anyways no online/incremental learning mechanism to benefit from it even if it did.


How does scientific progress happen without reasoning about that which we do not know or understand?

That's building upon current knowledge. That is a different application.

Right, it's not doing anything between prompts, but each prompt is fed through each of the transformer layers (I think it was 96 layers for GPT-3) in turn, so we can think of this as a fixed N-steps of "thought" (analyzing prompt in hierarchical fashion) to generate each token.

I might be a complete brainlet so excuse my take, but when animals think and do things, the weights in the brain are constantly being adjusted, old connections pruned out and new ones made right? But once LLM is trained, that's kind of it? Nothing there changes when we discuss with it. As far as I understand from what I read, even our memories are just somehow in the connections between the neurons

my understanding was that once you are of age, brain pruning and malleability is relatively small

If we figured out how to freeze and then revive brains, would that mean that all of the revived brains were no longer thinking because they had previously been paused at some point?

Frankly this objection seems very weak


There are many more features that would be needed, such as (as a peer comment pointed out) being able to recognize you are saying something incorrect, pausing, and then starting a new stream of output.

This is currently done with multiple LLMs and calls, not within the running of a single model i/o

Another example would be to input a single token, or gibberish: the models we have today are more than happy to spit out fantastic numbers of tokens. They really only stop because we look for stop words they are trained to generate, and we do the actual stopping action.


i don’t see why any of the things you’re describing are criteria for thinking, it seems just arbitrarily picking things humans do and saying this is somehow constitutive to thought

It's more to point out how far the LLMs we have today are from anything that ought to be considered thoughts. They are far more mechanical than anything else

you’re just retreating into tautologies - my question was why these are the criteria for thought

it’s fine though, this was as productive as i expected


I'm not listing criteria for thought

I'm listing things that current LLMs cannot do (or things they do that thinking entities would not) to argue they are so simple they are far from anything that resembles thinking

> it’s fine though, this was as productive as i expected

A product of your replies lowering in quality and becoming more argumentative, so I will discontinue now.


Yeah. I've had like three conversations with people who said LLMs don't "think", implied this was too obvious to need to say why, and when pressed on it brought up the pausing as their first justification.

It's an interesting window on people's intuitions -- this pattern felt surprising and alien now to someone who imbibed Hofstadter and Dennett, etc., as a teen in the 80s.

(TBC, the surprise was not that people weren't sure they "think" or are "conscious", it's that they were sure they aren't, on this basis that the program is not running continually.)


Why does the timing of the “thinking” matter?

thinking is generally considered an internal process, without input/output (of tokens), though some people decide to output some of that thinking into a more permanent form

I see thinking as less about "timing" and more about a "process"

What this post seems to be describing is more about where attention is paid and what neurons fire for various stimuli


we know so little about thinking and consciousness, these claims seem premature

That one can fix the RNG and get consistent output indicates a lack of dynamics

They certainly do not self update the weights in an online process as needed information is experienced


If we could perfectly simulate the brain and there were quantum hidden variables, we too could “fix RNG and get deterministic output”


