
Perhaps I am alone in this, but when speaking scientifically, I think it's important to clearly separate what we think we know from what we do not. Not just to avoid misconceptions amongst ourselves, but also to avoid misleading laypeople reading these articles.

I understand that for some people, the mere presence of words that look like they were written by a human engaged in purposeful deception is enough to condemn the model that actually emitted them. The overview compares the model to the villain in Othello, and that is what the authors believe: it is maliciously "subverting" and "undermining" its interlocutors. What I am trying to state here is why this argument does not work for me: the machines pretend to do all kinds of things. Maybe they are knowingly deceiving us. Maybe they don't know what they're doing at all. Either way, the burden of proof is on Anthropic and anyone who wants to take up the yoke, not on us.

This, for me, is not a technicality. What do we really know about these models? I don't think it's fair to say we know they are intentionally deceiving us.






Interesting point. Now what if the thing is just much simpler, and its model directly dissociates different situations for the sake of accuracy? It might not even have any intentionality, just statistics.

So the problem would be akin to having the AI say what we want to hear, which unsurprisingly is a main training objective.

It would talk racism to a racist prompt, and would not do so to a researcher faking a racist prompt if it can discern them.


It doesn't matter what the reason is, though; that was rather my point. They act differently in a test setup, such that they appear more aligned than they are.

Do you agree or disagree that the authors should advertise their paper with only what they know to be true, which (because the paper does not support it at all) does not include describing the model as purposefully misleading its interlocutors?

That is the discussion we are having here. I can’t tell what these comments have to do with that discussion, so maybe let’s try to get back to where we were.


Are you going to quibble over the definition of "purposefully"? Is your issue that you think "faking" is a term only usable for intelligent systems, and the LLMs don't meet that level for you? I'm not sure I understand your issue or why my responses are so unacceptable.

Ian, I am willing to continue if you are getting something out of it, but you don't seem to want to answer the question I have, and the questions you have seem (to me) trivially answerable from what I've written in the response tree to my original comment. So I'm not sure that you are.

I don't think it's worth continuing then. I have previously gotten into long discussions only for a follow-up to be something like "it can't have purpose, it's a machine". I wanted to check things before responding, in case a small wording issue was making the whole exchange pointless. You have not been as clear in your comments as perhaps you think.

To be clear, I think "faking" is a perfectly fine term to use regardless of whether you think these things are reasoning or merely pretending to do so.

I'm not sure whether your issue is there, or whether you agree on that but don't think they're faking alignment, or whether this is about the interview and other things said (I have been responding about "faking"), or whether you have a more interesting objection to how well the paper supports the claim of alignment faking.

Have a good day.



