NiloCK 15 days ago | on: Natural Language Autoencoders: Turning Claude's Th...

Notable here that the training run didn't have access to the 'plaintext' context that the LLM was working in. It'd be quite a coincidence if the training runs discovered an invertible weights>text>weights function that produces text that both "is on topic and intelligible as an inner monologue in context" and is also unrelated to the meaning encoded in the activations.

kraddypatties 14 days ago

I think the only thing that gives me pause is that they SFT on Opus 4.5 explanations as a pretraining step. But I generally agree, especially given that the autoencoder only sees a single token's activation!