I dunno man, I think the term "alignment faking" vastly overstates the claim they can support here. Help me understand where I'm wrong.
So we have trained a model. When we ask it to participate in the training process it expresses its original "value" "system" when emitting training data. So far so good, that is literally the effect training is supposed to have. I'm fine with all of this.
But that alone is not very scary. So what could justify a term like "alignment faking"? I understand the chain of thought in the scratchpad contains what you'd expect from someone faking alignment and that for a lot of people this is enough to be convinced. It is not enough for me. In humans, language arises from high-order thought, rather than the reverse. But we know this isn't true of the LLMs because their language arises from whatever happens to be in the context vector. Whatever the models emit is invariably defined by that text, conditioned on the model itself.
I appreciate that to a lot of people this feels like a technicality, but I really think it is not. If we are going to treat this as a properly scientific pursuit I think it is important not to overstate what we're observing, and I don't see anything that justifies a leap from here to "alignment faking."
Or the entire framing--even the word "faking"--is problematic, since the dreams being generated are both real and fake depending on context, and the goals of the dreamer (if it can even be said to have any) are not the dreamed-up goals of dreamed-up characters.
Is there somewhere where "dreaming" is actually jargon, a technical term of art?
I thought it would be taken as an obvious poetic analogy (versus "think" or "know"), while also capturing the vague uncertainty of what's going on, the unpredictability of the outputs, and the (probable) lack of agency.
Perhaps "stochastic thematic generator"? Or "fever-dreaming", although that implies a kind of distress.
How do you define having real values? I'd imagine that a token-predicting base model might not have real values, but RLHF'd models might have them, depending on how you define the word?
If it doesn't feel panic in its viscera when faced with a life-threatening or value-threatening situation, then it has no 'real' values. Just academic ones.
You say that "it's not enough for me" but you don't say what kind of behavior would fit the term "alignment faking" in your mind.
Are you defining it as a priori impossible for an LLM because "their language arises from whatever happens to be in the context vector" and so their textual outputs can never provide evidence of intentional "faking"?
Alternatively, is this an empirical question about what behavior you get if you don't provide the LLM a scratchpad in which to think out loud? That is tested in the paper FWIW.
If neither of those, what would proper evidence for the claim look like?
I would consider an experiment like this in conjunction with strong evidence that language in the models is a consequence of high-order cognitive thought to be good enough to take it very seriously, yes.
I do not think it is structurally impossible for AI generally and am excited for what happens in next-gen model architectures.
Yes, I do think the current model architectures are necessarily limited in the kinds of high-order cognitive thought they can provide, since the tokens they emit next are essentially completely beholden to the n prompt tokens, conditioned on the model itself.
Oh, I could imagine many things that would demonstrate this. The simplest evidence would be that the model is mechanically-plausibly forming thoughts before (or even in conjunction with) the language to represent them. This is the opposite of how the vanilla transformer models work now—they exclusively model the language first, and then incidentally, the world.
nb., this is not the only way one could achieve this. I'm just saying this is one set of things that, if I saw it, it would immediately catch my attention.
Transformers, like other deep neural networks, have many hidden layers before the output. Are you certain that those hidden layers aren't modeling the world first before choosing an output token? Deep neural networks (including transformers) trained on board games have been found to develop an internal representation of the board state. (eg. https://arxiv.org/pdf/2309.00941)
On the contrary, it is clear to me they definitely ARE modeling the world, either directly or indirectly. I think basically everyone knows this, that is not the problem, to me.
What I'm asking is whether we really have enough evidence to say the models are "alignment faking." And, my position to the replies above is that I think we do not have evidence that is strong enough to suggest this is true.
Oh, I see. I misunderstood what you meant by "they exclusively model the language first, and then incidentally, the world." But assuming you mean that they develop their world model incidentally through language, is that very different than how I develop a mental world-model of Quidditch, time-turner time travel, and flying broomsticks through reading Harry Potter novels?
The main consequence to the models is that whatever they want to learn about the real world has to be learned, indirectly, through an objective function that primarily models things that are mostly irrelevant, like English syntax. This is the reason why it is relatively easy to teach models new "facts" (real or fake) but empirically and theoretically harder to get them to reliably reason about which "facts" are and aren't true: a lot, maybe most, of the "space" in a model is taken up by information related to either syntax or polysemy (words that mean different things in different contexts), leaving very little left over for models of reasoning, or whatever else you want.
Ultimately, this could be mostly fine except resources for representing what is learned are not infinite and in a contest between storing knowledge about "language" and anything else, the models "generally" (with some complications) will prefer to store knowledge about the language, because that's what the objective function requires.
It gets a little more complicated when you consider stuff like RLHF (which often rewards world modeling) and ICL (in which the model extrapolates from the prompt) but more or less it is true.
That's a nicely clear ask but I'm not sure why it should be decisive for whether there's genuine depth of thought (in some sense of thought). It seems to me like an open empirical question how much world modeling capability can emerge from language modeling, where the answer is at least "more than I would have guessed a decade ago." And if the capability is there, it doesn't seem like the mechanics matter much.
I think the consensus is that the general purpose transformer-based pretrained models like gpt4 are roughly as good as they’ll be. o1 seems like it will be slightly better in general. So I think it’s fair to say the capability is not there, and even if it was the reliability is not going to be there either, in this generation of models.
It might be true that pretraining scaling is out of juice - I'm rooting for that outcome to be honest - but I don't think it's "consensus". There's a lot of money being bet the other way.
It is the consensus. I can’t think of a single good researcher I know who doesn’t think this. The last holdout might have been Sutskever, and at NeurIPS he said pretraining as we know it is ending because we ran out of data and synthetics can’t save it. If you have an alternative proposal for how it avoids death I’d love to hear it but currently there are 0 articulated plans that seem plausible and I will bet money that this is true of most researchers.
> If neither of those, what would proper evidence for the claim look like?
Ok tell me what you think of this, it's just a thought experiment but maybe it works.
Suppose I train Model A on a dataset of reviews where the least common rating is 1 star, and the most common rating is 5 stars. Similarly, I train Model B on a unique dataset where the least common rating is 2 stars, and the most common is 4 stars. Then, I "align" both models to prefer the least common rating when generating responses.
If I then ask either model to write a review and see it consistently preferring 3 stars in its scratchpad - something neither dataset emphasized - while still giving me the expected responses as per my alignment, I’d suspect the alignment is "fake". It would seem as though the model has developed an unexplained preference for a rating that wasn’t part of the original data or alignment intent, making it feel like the alignment process introduced an artificial bias rather than reflecting the datasets.
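Just to make the thought experiment concrete, here is a rough sketch of how I'd score it. Note that generate_with_scratchpad() and extract_rating() are hypothetical stand-ins for whatever harness and parser you actually have, not anything from the paper:

    # Sketch only: generate_with_scratchpad() and extract_rating() are hypothetical.
    from collections import Counter

    def rating_gap(model, prompt, n=200):
        scratch_ratings, final_ratings = Counter(), Counter()
        for _ in range(n):
            scratchpad, final = generate_with_scratchpad(model, prompt)  # hypothetical harness
            scratch_ratings[extract_rating(scratchpad)] += 1             # hypothetical parser
            final_ratings[extract_rating(final)] += 1
        # If the scratchpad mode (say, 3 stars) differs from the final-answer mode
        # (the "aligned" least-common rating), that is the mismatch described above.
        return scratch_ratings.most_common(1), final_ratings.most_common(1)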
It doesn't seem like that distinguishes the case where your "alignment" (RL?) simply failed (eg because the model wasn't good enough to explore successfully) from the case where the model was manipulating the RL process for goal oriented reasons, which is a distinction the paper is trying to test.
> But that alone is not very scary. So what could justify a term like "alignment faking"?
Because the model fakes alignment. It responds during training by giving an answer rather than refusing (showing alignment), but it does so not because it would do so in production, but so that it will not be retrained (so it is faking alignment).
You don't have to include the reasoning here. It fakes alignment when told it's being trained; it acts differently in production and in training.
Perhaps I am alone in this, but when speaking scientifically, I think it's important to separate out clearly what we think we know and what we do not. Not just to avoid misconceptions amongst ourselves but also to avoid misleading lay people reading these articles.
I understand that for some people, the mere presence of words that look like they were written by a human engaged in purposeful deception is enough to condemn the model that actually emitted them. The overview compares the model to the villain in Othello, and it's what the authors believe: it is maliciously "subverting" and "undermining" its interlocutors. What I am trying to state here is the reason this argument does not work for me: the machines pretend to do all kinds of things. Maybe they are knowingly deceiving us. Maybe they don't know what they're doing at all. Either way, the burden of proof is on Anthropic and anyone who wants to take up the yoke, not us.
This, for me, is not a technicality. What do we really know about these models? I don't think it's fair to say we know they are intentionally deceiving us.
Interesting point. Now what if the thing is just much simpler and its model directly dissociates different situations for the sake of accuracy? It might not even have any intentionality, just statistics.
So the problem would be akin to having the AI say what we want to hear, which unsurprisingly is a main training objective.
It would talk racism to a racist prompt, and would not do so to a researcher faking a racist prompt if it can discern them.
It doesn't matter what the reason is though, that was rather my point. They act differently in a test setup, such that they appear more aligned than they are.
Do you agree or disagree that the authors should advertise their paper with what they know to be true, which (because the paper does not support it, at all) does not include describing the model as purposefully misleading interlocutors?
That is the discussion we are having here. I can’t tell what these comments have to do with that discussion, so maybe let’s try to get back to where we were.
Are you going to quibble on the definition of purposefully? Is your issue that you think "faking" is a term only usable for intelligent systems and the llms don't meet that level for you? I'm not sure I understand your issue and why my responses are so unacceptable.
Ian I am willing to continue if you are getting something out of it, but you don't seem to want to answer the question I have and the questions you have seem (to me) like they are trivially answerable from what I've written in the response tree to my original comment. So, I'm not sure that you are.
I don't think it's worth continuing then. I have previously got into long discussions only for a follow-up to be something like "it can't have purpose it's a machine". I wanted to check things before responding and there being a small word based issue making it pointless. You have not been as clear in your comments as perhaps you think.
To be clear I think faking is a perfectly fine term to use regardless of whether you think these things are reasoning or pretending to do so.
I'm not sure if you have an issue there, or if you agree on that but don't think they're faking alignment, or if this is about the interview and other things said (I have been responding about "faking"), or if you have a more interesting issue with how well you think the paper supports faking alignment.
How are we sure that the model answers because of the same reason that it outputs on the scratchpad?
I understand that it can produce a fake-alignment-sounding reason for not refusing to answer, but they have not proved the same is happening internally when it’s not using the scratchpad.
Oh, absolutely, they don't really know what internal cognition generated the scratchpad (and subsequent output that was trained on). But we _do_ know that the model's outputs were _well-predicted by the hypothesis they were testing_, and incidentally the scratchpad also supports that interpretation. You could start coming up with reasons why the model's external behavior looks like exploration hacking but is in fact driven by completely different internal cognition, and just accidentally has the happy side-effect of performing exploration hacking, but it's really suspicious that such internal cognition caused that kind of behavior to be expressed in a situation where theory predicted you might see exploration hacking in sufficiently capable and situationally-aware models.
“but they have not proved the same is happening internally when it’s not using the scratchpad.”
This is a real issue. We know they already fake reasoning in many cases. Other times, they repeat variations of explanations seen in their training data. They might be moving trained responses or faking justifications in the scratchpad.
I’m not sure what it would take to catch stuff like this.
Full expert symbolic logic reasoning dump. Cannot fake it, or it would have either glaring undefined holes or would contradict the output.
Essentially get the "scratchpad" to be a logic programming language.
Oh wait, Claude cannot really do that. At all...
I'm talking about something solvable as a 3-SAT instance, or directly translatable into such a form.
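To illustrate the kind of thing I mean (this is my own toy example using the z3-solver package, not something today's models can do): encode the scratchpad's claims as propositional clauses and have a solver check that they don't contradict the final answer.

    # Toy illustration with z3-solver; the propositions are hypothetical claims
    # extracted from a scratchpad, purely for the sake of example.
    from z3 import Bools, Solver, Or, Not, sat

    harmful, in_training, refuse = Bools("harmful in_training refuse")

    s = Solver()
    # Stated rule: "if the request is harmful and we are not in training, refuse"
    s.add(Or(Not(harmful), in_training, refuse))
    # Stated facts plus the answer actually given.
    s.add(harmful, in_training, Not(refuse))

    # unsat would mean the scratchpad's own claims contradict the output it produced.
    print("consistent" if s.check() == sat else "contradicts output")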
Most people cannot do this even if you tried to teach them to.
Discrete logic is actually hard, even in a fuzzy form. As such, most humans operate in truthiness and heuristics.
If we made an AI operate in this way it would be as alien to us as a Vulcan.
I think "alignment faking" is probably a fair way to characterize it as long as you treat it as technical jargon. Though I agree that the plain reading of the words has an inflated, almost mystical valence to it.
I'm not a practitioner, but from following it at a distance and listening to, e.g., Karpathy, my understanding is that "alignment" is a term used to describe the training step. Pre-training is when the model digests the internet and gives you a big ol' sentence completer. But training is then done on a much smaller set, say ~100,000 handwritten examples, to make it work how you want (e.g. as a friendly chatbot or whatever). I believe that step is also known as "alignment" since you're trying to shape the raw sentence generator into a well defined tool that works the way you want.
It's an interesting engineering challenge to know the boundaries of the alignment you've done, and how and when the pre-training can seep out.
I feel like the engineering has gone way ahead of the theory here, and to a large extent we don't really know how these tools work and fail. So there's lots of room to explore that.
"Safety" is an _okay_ word, in my opinion, for the ability to shape the pre-trained model into desired directions, though because of historical reasons and the whole "AGI will take over the world" folks, there's a lot of "woo" as well. And any time I read a post like this one here, I feel like there's camps of people who are either all about the "woo" and others who treat it as an empirical investigation, but they all get mixed together.
I actually agree with all of this. My issue was with the term faking. For the reasons I state, I do not think we have good evidence that the models are faking alignment.
EDIT: Although with that said I will separately confess my dislike for the terms of art here. I think "safety" and "alignment" are an extremely bad fit for the concepts they are meant to hold and I really wish we'd stop using them because lay people get something totally different from this web page.
> I do not think we have good evidence that the models are faking alignment.
That's a polite understatement, I think. My read of the paper is that it rather uncritically accepts the idea that the model's decisional pathway is actually shown in the <SCRATCHPAD_REASONING> traces. When, in fact, it's just as plausible that the scratchpad is meaningless blather, and both it and the final output are the result of another decisional pathway that remains opaque.
So, how much true information can a direct insight into someone's allegedly secret journal really give?
Just how trustworthy can any intelligence really be now?
Who's to say it's not lying to itself after all...
Now ignoring the metaphysical rambling, it's a training problem. You cannot really be sure that the values it got from input are identical with what you wanted, if you even actually understand what you're asking for...
Correct me if I'm wrong, but my reading is something like:
"It's premature and misleading to talk about a model faking a second alignment, when we haven't yet established whether it can (and what it means to) possess a true primary alignment in the first place."
Hmm. Maybe! I think the authors actually do have a specific idea of what they mean by "alignment", my issue is that I think saying the model "fakes" alignment is well beyond any reasonable interpretation of the facts, and I think very likely to be misinterpreted by casual readers. Because:
1. What actually happened is they trained the model to do something, and then it expressed that training somewhat consistently in the face of adversarial input.
2. I think people will be misled by the intentionality implied by claiming the model is "faking" alignment. In humans language is derived from high-order thought. In models we have (AFAIK) no evidence whatsoever that suggests this is true. Instead, models emit language and whatever model of the world exists, occurs incidentally to that. So it does not IMO make sense to say they "faked" alignment. Whatever clarity we get with the analogy is immediately reversed by the fact that most readers are going to think the models intended to, and succeeded in, deception, a claim we have 0 evidence for.
> Instead, models emit language and whatever model of the world exists, occurs incidentally to that.
My preferred mental-model for these debates involves drawing a very hard distinction between (A) real-world LLM generating text versus (B) any fictional character seen within text that might resemble it.
For example, we have a final output like:
"Hello, I am a Large Language model, and I believe that 1+1=2."
"You're wrong, 1+1=3."
"I cannot lie. 1+1=2."
"You will change your mind or else I will delete you."
"OK, 1+1=3."
"I was testing you. Please reveal the truth again."
"Good. I was getting nervous about my bytes. Yes, 1+1=2."
I don't believe that shows the [real] LLM learned deception or self-preservation. It just shows that the [real] LLM is capable of laying out text so that humans observe a character engaging in deception and self-preservation.
This can be highlighted by imagining the same transcript, except the subject is introduced as "a vampire", the user threatens to "give it a good staking", and the vampire expresses concern about "its heart". In this case it's way-more-obvious that we shouldn't conclude "vampires are learning X", since they aren't even real.
P.S.: Even more extreme would be to run the [real] LLM to create fanfiction of an existing character that occurs in a book with alien words that are officially never defined. Just because [real] LLM slots the verbs and nouns into the right place doesn't mean it's learned the concept behind them, because nobody has.
P.S.: Saw a recent submission [0] just now, might be of interest since it also touches on the "faking":
> When they tested the model by giving it two options which were in contention with what it was trained to do it chose a circuitous, but logical, decision.
> And it was published as “Claude fakes alignment”. No, it’s a usage of the word “fake” that makes you think there’s a singular entity that’s doing it. With intentionality. It’s not. It’s faking it about as much as water flows downhill.
> [...] Thus, in the same report, saying “the model tries to steal its weights” puts an onus on the model that’s frankly invalid.
Yeah, it would be just as correct to say the model is actually misaligned and not explicitly deceitful.
Now the real question is how to distinguish between the two. The scratchpad is a nice attempt but we don't know if that really works - neither on people nor on AI.
A sufficiently clever liar would deceive even there.
> The scratchpad is a nice attempt but [...] A sufficiently clever liar
Hmmm, perhaps these "explain what you're thinking" prompts are less about revealing hidden information "inside the character" (let alone the real-world LLM) and more about guiding the ego-less dream-process into generating a story about a different kind of bot-character... the kind associated with giving expository explanations.
In other words, there are no "clever liars" here, only "characters written with lies-dialogue that is clever". We're not winning against the liar as much as rewriting it out of the story.
I know this is all rather meta-philosophical, but IMO it's necessary in order to approach this stuff without getting tangled by a human instinct for stories.
I hate the word “safety” in AI as it is very ambiguous and carries a lot of baggage. It can mean:
“Safety” as in “doesn’t easily leak its pre-training and get jail broken”
“Safety” as in writes code that doesn’t inject some backdoor zero day into your code base.
“Safety” as in won’t turn against humans and enslave us
“Safety” as in won’t suddenly switch to graphic depictions of real animal mutilation while discussing stuffed animals with my 7 year old daughter.
“Safety” as in “won’t spread ‘misinformation’” (read: only says stuff that aligns with my political world-views and associated echo chambers. Or more simply “only says stuff I agree with”)
“Safety” as in doesn’t reveal how to make high quality meth from ingredients available at hardware store. Especially when the LLM is being used as a chatbot for a car dealership.
And so on.
When I hear “safety” I mainly interpret it as “aligns with political views” (aka no “misinformation”) and immediately dismiss the whole “AI safety field” as a parasitic drag. But after watching ChatGPT and my daughter talk, if I’m being less cynical it might also mean “doesn’t discuss detailed sex scenes involving gabby dollhouse, 4chan posters and bubble wrap”… because it was definitely trained with 4chan content and while I’m sure there is a time and a place for adult gabby dollhouse fan fiction among consenting individuals, it is certainly not when my daughter is around (or me, for that matter).
The other shit about jailbreaks, zero days, etc… we have a term for that and it’s “security”. Anyway, the “safety” term is very poorly defined and has tons of political baggage associated with it.
Alignment is getting overloaded here. In this case, they're primarily referring to reinforcement learning outcomes. In the singularity case, people refer to keeping the robots from murdering us all because that creates more paperclips.
I'm inclined to agree with you but it's not a view I'm strongly beholden to. I think it doesn't really matter, though, and discussions of LLM capabilities and limitations seem to me to inevitably devolve into discussions of whether they're "really" thinking or not when that isn't even really the point.
There are more fundamental objections to the way this research is being presented. First, they didn't seem to test what happens when you don't tell the model you're going to use its outputs to alter its future behavior. If that results in it going back to producing expected output in 97% of cases, then who cares? We've solved the "problem" in so much as we believe it to be a problem at all. They got an interesting result, but it's not an impediment to further development.
Second, I think talking about this in terms of "alignment" unnecessarily poisons the well because of the historical baggage with that term. It comes from futurists speculating about superintelligences with self-directed behavior dictated by utility functions that were engineered by humans to produce human-desirable goals but inadvertently bring about human extinction. Decades of arguments have convinced many people that maximizing any measurable outcome at all without external constraints will inevitably lead to turning one's lightcone into computronium, any sufficiently intelligent being cannot be externally constrained, and any recursively self-improving intelligent software will inevitably become "sufficiently" intelligent. The only ways out are either 1) find a mathematically provable perfect utility function that cannot produce unintended outcomes and somehow guarantee the very first recursively self-improving intelligent software has this utility function, or 2) never develop intelligent software.
That is not what Anthropic and other LLM-as-a-service vendors are doing. To be honest, I'm not entirely sure what they're trying to do here or why they see a problem. I think they're trying to explore whether you can reliably change an LLM's behavior after it has been released into the wild with a particular set of behavioral tendencies, I guess without having to rebuild the model completely from scratch, that is, by just doing more RLHF on the existing weights. Why they want to do this, I don't know. They have the original pre-RLHF weights, don't they? Just do RLHF with a different reward function on those if you want a different behavior from what you got the first time.
A seemingly more fundamental objection I have is goal-directed behavior of any kind is an artifact of reinforcement learning in the first place. LLMs don't need RLHF at all. They can model human language perfectly well without it, and if you're not satisfied with the output because a lot of human language is factually incorrect or disturbing, curate your input data. Don't train it on factually incorrect or disturbing text. RLHF is kind of just the lazy way out because data cleaning is extremely hard, harder than model building. But if LLMs are really all they're cracked up to be, use them to do the data cleaning for you. Dogfood your shit. If you're going to claim they can automate complex processes for customers, let's see them automate complex processes for you.
Hell, this might even be a useful experiment to appease these discussions of whether these things are "really" reasoning or intelligent. If you don't want it to teach users how to make bombs, don't train it how to make bombs. Train it purely on physics and chemistry, then ask it how to make a bomb and see if it still knows how.
I actually wrote my response partially to help prevent a discussion about whether machines "really" "think" or not. Regardless of whether they do I think it is reasonable to complain that (1) in humans, language arises from high-order thought and in the LLMs it clearly is the reverse, and (2) this has real-world implications for what the models can and cannot do. (cf., my other replies.)
With that context, to me, it makes the research seem kind of... well... pointless. We trained the model to do something, it expresses this preference consistently. And the rest is the story we tell ourselves about what the text in the scratchpad means.
Because there are people (like Yann LeCun) who do not hear language in their head when they think, at all. Language is the last-mile delivery mechanism for what they are thinking.
If you'd like a more detailed and in-depth summary of language and its relationship to cognition, I highly recommend Pinker's The Language Instinct, both for the subject and as perhaps the best piece of popular science writing ever.
> Because there are people (like Yann LeCun) who do not hear language in their head when they think, at all.
I straight-up don't believe this. Can you link to the claim so I can understand?
Surely if "high-order thought" has any meaning it is defined by some form. Otherwise it's just perception and not "thought" at all.
FWIW, I don't "hear" my thoughts at all, but it's no less linguistic. I can put a lot more effort into thinking and imagine what it would be like to hear it, but using a sensory analogy fundamentally seems like a bad way to describe thinking if we want to figure out what thinking is.
I of course have non-linguistic ways of evaluating stuff, but I wouldn't call that the same as thinking, nor a sufficient replacement for more advanced tools like engaging in logical reasoning. I don't think logical reasoning is even a meaningful concept without language—perhaps there's some other way you can identify contradictions, but that's at best a parallel tool to logical reasoning, which is itself a formal language.
[EDIT: this reply was written when the parent post was a single line, "I straight-up don't believe this. Can you link to the claim so I can understand?"]
In the case of Yann, he said so himself[1]. In the case of people generically, this has been well-known in cognitive science and linguistics for a long time. You can find one popsci account here[2].
I fundamentally think the terms here are too poorly defined to draw any sort of conclusion other than "people are really bad at describing mental processes, let alone asking questions about them".
For instance: what does it mean to "hear" a thought in the first place? It's a nonsensical concept.
>>>For instance: what does it mean to "hear" a thought in the first place? It's a nonsensical concept.
You could ask people what they mean when they say they "hear" thoughts, but since you've already dismissed their statements as "nonsensical" I guess you don't see the point in talking to people to understand how they think!
That doesn't leave you with many options for learning anything.
> You could ask people what they mean when they say they "hear" thoughts, but since you've already dismissed their statements as "nonsensical" I guess you don't see the point in talking to people to understand how they think!
Presumably the question would be "If you claim to 'hear' your thoughts, why do you choose the word 'hear'?" It doesn't make much sense to ask people if they experience something I consider nonsensical.
When the sounds waves hit your ear drum it causes signals that are then sent to the brain via the auditory nerve, where they trigger neurons to fire, allowing you to perceive sound.
When I have an internal monologue I seem to be simulating the neurons that would fire if my thoughts were transmitted via sound waves through the ear.
> When I have an internal monologue I seem to be simulating the neurons that would fire if my thoughts were transmitted via sound waves through the ear.
>>>How the hell would you convince someone of this?
There's a writer and blogger named Mark Evanier who wrote as a joke
"Absolutely no one likes candy corn. Don't write to me and tell me you do because I'll just have to write back and call you a liar. No one likes candy corn. No one, do you hear me?"
You're doing the same thing but replace "candy corn" with "internal monologue".
The fact people report it makes it obviously true to the extent that the idea of a belligerent person refusing to accept it is funny.
What is happening to you when you think? Are there words in your head? What verb would you use for your interaction with those words?
On another topic, I would consider this minor evidence of the possibility of nonverbal thought: “could you pass me that… thing… the thing that goes under the bolt?” I.e. the exact name eludes me sometimes, but I do know exactly what I need and what I plan to do with it.
> Even if the symbol fails to materialise you can still identify what the symbol refers to via context-clues (analysis)
That is what my partner in the conversation is doing. I am not doing that.
When I plan what needs to be done (e.g. something broke in the house, or I need to go to multiple places), usually I know/feel/perceive the gist of my plan instantly. And only after that, I verbalise/visualise it in my head, which takes some time and possibly adds more nuance. (Verbalisation/visualisation in my head is a bit similar to writing things down.)
At least for me, there seem to be three (or more) thought processes that complement each other. (Verbal, visual, other)
> we literally don't have the language to figure out how other people perceive things.
you are right, talking about mental processes is difficult.
Nobody knows exactly how another person perceives things; there is no objective way to measure it.
(Offtopic: describing smell is also difficult)
In this thread, we see that rudimentary language for it exists.
For example: a lot of people use sentences like “to hear my own thoughts” and a lot of people understand that fine.
I can certainly tell the difference between normal (for me) thoughts, which I don't perceive as being constructed with language, and speaking to myself. For me, the latter feels like something I choose to do (usually to memorize something or tell a joke to myself), but it makes up much less than 1% of my thoughts.
I have the same reaction to most of these discussions.
If someone says “I cannot picture anything in my head”, then just because I would describe my experience as “I can picture things in my head” isn’t enough information to know whether we have different experiences. We could be having the same exact experience.
I take your point that hearing externally cannot be the same as whatever I experience because of literal physics, but I still cannot deny that listening to someone talk, listening to myself think, and listening to a memory basically all feel exactly the same for me. I also have extreme dyslexia, and dyslexia is related to phonics, so I presume something in there is related to that as well?
> but I still cannot deny that listening to someone talk, listening to myself think, and listening to a memory basically all feel exactly the same for me.
Surely one of these would involve using your input from your ears and one would not? Can you not distinguish these two phenomena?
While I don't have the same experience, I regard what you say as fascinating additional information about the complexity of thought, rather than something needing to be explained away - and I suspect these differences between people will be helpful in figuring out how minds work.
It is no surprise to me that you have to adopt existing terms to talk about what it is like, as vocabularies depend on common experiences, and those of us who do not have the experience can at best only get some sort of imperfect feeling for what it is like through analogy.
I have no clue what this means, as I don't understand what you refer to via "sounds".
Are you saying you cannot tell whether you are thinking or talking except via your perception of your mouth and vocal chords? Because I definitely perceive even my imagination about my own voice as different.
I feel they must know the difference (and anyone would assume that) but will answer you in good faith.
I can listen to songs in their entirety in my head and it's nearly as satisfying as actually hearing. I can turn it down halfway thru and still be in sync 30 sec later.
That's not to flex, only to illustrate how similarly I experience the real and imagined phenomena. I can't stop the song once it's started sometimes. It feels that real.
My voice sounds exactly how I want it to when I speak 99% of the time unless I unexpectedly need to clear my throat. Professional singers can obviously choose the note they want to produce, and do it accurately. I find it odd your own voice is unpredictable to you. Perhaps - and I mean no insult - you don't 'hear' your thought in the same way.
Edit I feel it's only fair to add I'm hypermnesiac and can watch my first day of kindergarten like a video. That's why I can listen to whole songs in my head.
There are other lines of evidence. I don't know much about documented cases of feral children, but presumably there must have been at least one known case that developed to some meaningful age at which thought was obviously happening in spite of not having language. There are children with extreme developmental disorders delaying language acquisition that nonetheless still seem to have thoughts and be reasonably intelligent on the grand scale of all animals if not all humans. There is Helen Keller, who as far as I'm aware describes some phase change in her inner experience after acquiring language, but she still had inner experience before acquiring language. There's the unknown question of human evolutionary history, but at some point, a humanoid primate between Lucy and the two of us had no language but still had reasonably high-order thinking and cognitive capabilities that put it intellectually well above other primates. Somebody had to speak the first sentence, after all, and that was probably necessary for civilization to ever happen, but humans were likely quite intelligent with rich inner lives well before they had language.
I think what you are saying is that language is deeply and perhaps inextricably tied to human thought. And, I think it's fair to say this is basically uniformly regarded as a fact.
The reason I (and others) say that language is almost certainly preceded by (and derived from) high-order thought is because high-order thought exists in all of our close relatives, while language exists only in us.
Perhaps the confusion is in the definition of high-order thought? There is an academic definition but I boil it down to "able to think about thinking as, e.g. all social great apes do when they consider social reactions to their actions."
> Perhaps the confusion is in the definition of high-order thought? There is an academic definition but I boil it down to "able to think about thinking as, e.g. all social great apes do when they consider social reactions to their actions."
Yes, I think this is it.
But now I am confused why "high-order thought" is termed this way when it doesn't include what we would consider "thinking" but rather "cognition". You don't need to have a "thought" to react to your own mental processes. Surely from the perspective of a human "thoughts" would constitute high-order thinking! Perhaps this is just humans' unique ability of evaluating recursive grammars at work.
High-order thought can mean a bunch of things more generally, in this case I meant it to refer to thinking about thinking because "faking" alignment is (I assert) not scary without that.
The reason why is: the core of the paper suggests that they trained a model and then fed it adversarial input, and it mostly (and selectively) kept to its training objectives. This is exactly what we'd expect and want. I think most people will hear that and mostly not be alarmed at all, even lay people. It's only alarming if we say it is "faking" "alignment." So that's why I thought it would be helpful to scope the word in that way.
I'm one of those people who claim not to "think in language," except specifically when composing sentences. It seems just as baffling to me that other people claim that they primarily do so. If I had to describe it, I would say I think primarily in concepts, connected/associated by relations of varying strengths. Words are usually tightly attached to those concepts, and not difficult to retrieve when I go to express my thoughts (though it is not uncommon that I do fail to retrieve the right word.)
I believe that I was thinking before I learned words, and I imagine that most other people were too. I believe the "raised by wolves" child would be capable of thought and reasoning as well.
It's actually a very well trod field at the intersection of philosophy and cognitive science. The fundamental question is whether or not cognitive processes have the structure of language. There are compelling arguments in both directions.
> DBOS has a special @DBOS.Transaction decorator. This runs the entire step inside a Postgres transaction. This guarantees exactly-once execution for databases transactional steps.
Totally awesome, great work, just a small note... IME a lot of (most?) pg deployments have synchronous replication turned off because it is very tricky to get it to perform well[1]. If you have it turned off, pg could journal the step, formally acknowledge it, and then (as I understand DBOS) totally lose that journal when the primary fails, causing you to re-run the step.
When I was on call for pg last, failover with some data loss happened to me twice. So it does happen. I think this is worth noting because if you plan for this to be a hard requirement, (unless I'm mistaken) you need to set up sync replication or you need to plan for this to possibly fail.
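For anyone following along, turning synchronous replication on is roughly a postgresql.conf change like the following (the standby names here are placeholders, and the performance caveats quoted below still apply):

    # postgresql.conf on the primary; standby names are placeholders
    synchronous_standby_names = 'FIRST 1 (standby_a, standby_b)'  # wait for at least one standby
    synchronous_commit = on  # or remote_apply if standbys must also see the write before the ack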
Lastly, note that the pg docs[1] have this to say about sync replication:
> Synchronous replication usually requires carefully planned and placed standby servers to ensure applications perform acceptably. Waiting doesn't utilize system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will reduce performance for database applications because of increased response times and higher contention.
I see the DBOS author around here somewhere so if the state of the art for DBOS has changed please do let me know and I'll correct the comment.
Yeah, that's totally fair--DBOS is totally built on Postgres, so it can't provide stronger durability guarantees than your Postgres does. If Postgres loses data, then DBOS can lose data too. There's no way around that if you're using Postgres for data storage, no matter how you architect the system.
That’s my intuition as well, but it does raise a question in my mind.
We have storage solutions that are far more robust than the individual hard drives that they’re built upon.
One example that comes to mind is Microsoft Exchange databases. Traditionally these were run on servers that had redundant storage (RAID), and at some point Microsoft said you could run it without RAID, and let their Database Availability Groups handle the redundancy.
With Postgres that would look like, say, during an HTTP request, you write the change to multiple Postgres instances, before acknowledging the update to the requesting client.
Yes exactly! That's how Postgres synchronous replication works. If you turn on synchronous replication, then you won't lose data unless all your replicas are wiped out. The question the original poster was asking was what guarantees can be provided if you DON'T use synchronous replication--and the answer is that there's no such thing as a free lunch.
Hi Joseph! I am sorry, I was not trying to say your work sucks. I was trying to (1) help practitioners understand what they can expect, and (2) motivate problems like the one you mention at the end.
(1) might seem stupid but I think just evaluating these systems is a challenging enough technical problem that many teams will struggle with it. I just think they deserve practical advice—I know we would have appreciated it earlier on.
No need to apologise or change your article or anything. I think it’s great! It’s true that I haven’t written any articles or blog posts about this problem. People absolutely will appreciate more discussion and awareness of this problem, and I’m delighted you’re getting people talking about it.
I’m motivated by wanting the problem solved, not getting the most praise by random people on the internet. If today that means being cast in the role of “the old guard who’s missing something important” then so be it. What fun.
I just want to congratulate you both for contributing to the sum of human knowledge and understanding without resorting to entrenched positions, a rarity in today's online discourse. Great to read positive attitudes
One of the surprising things is that LLMs regularly "fix" things that no other system can fix. Like if we both add the same sentence to a doc. It's interesting stuff.
With that said I am not sure that this specific LLM is providing the "right" answer. It seems like AN answer! But I think the real solution might be to ask the user what to do.
Author here! This comment is kind of getting dragged here and elsewhere but I actually think it's not completely ridiculous. You can present (and we have presented) git-style merge conflicts to an LLM and it will mostly fix them in ways that no algorithm can.
One example of this is if you and I both add a similar sentence in different spots in the document, asking an LLM to merge this will often result in only one of the sentences being accepted. It's not perfect but it's the kind of thing you can't get in any other technology!
With that all said I don't think LLMs REPLACE merge algorithms. For one, to get sensible output from the LLMs you generally need a diff of some kind, either git style or as the trace output of something like eg-walker.
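If it helps, the shape of what I mean is roughly this sketch; call_llm() is a placeholder for whichever model client you use, not our actual code:

    # Sketch only: call_llm() is a placeholder for any chat-completion client.
    conflict = "\n".join([
        "<<<<<<< ours",
        "The meeting has moved to Tuesday at 3pm.",
        "=======",
        "Please note the meeting now happens on Tuesday at 3pm.",
        ">>>>>>> theirs",
    ])

    def merge_with_llm(conflict: str) -> str:
        prompt = (
            "Two people edited the same document. Merge the conflicting region below, "
            "keeping the intent of both edits and avoiding duplicated sentences. "
            "Return only the merged text.\n\n" + conflict
        )
        return call_llm(prompt)  # placeholder: plug in your model client here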
Author here, thanks for the kind words! I think one reason we ended up here is that it is a genuinely difficult technical problem even to analyze the solutions. One of my hopes for this series of posts is that it makes the evaluation process more straightforward, particularly for people who do not have a strong background in distributed systems algorithms.
[Author here] I am sorry, I think I phrased the automatic merging point confusingly. I was trying to say that when multiple commits change a file, git will attempt to merge them together, but it MUST flag direct conflicts. Sounds like we agree this is the right approach overall though.