Hacker News | new | past | comments | ask | show | jobs | submit | comex's comments | login

Fascinating. The training process forces the “verbalizer” model to develop some mapping from activations to tokens that the “reconstructor” model can then invert back into the activations. But to quote the paper:

> Note that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of [the activation].

The objective could be optimized even if the verbalizer and reconstructor made up their own “language” for representing the activations, one that was not human-readable at all.

To point the model in the right direction, they start out by training on guessed internal thinking:

> we ask Opus to imagine the internal processing of a hypothetical language model reading it.

…before switching to training on the real objective.

Furthermore, the verbalizer and reconstructor models are both initialized from LLMs themselves, and given a prompt instructing them on the task, so they are predisposed to write something that looks like an explanation.

But during training, they could still drift away from these explanations toward a made-up language – either one that overtly looks like gibberish, or one that looks like English but encodes the information in a way that’s unrelated to the meaning of the words.

The fascinating thing is that empirically, they don’t, at least not to a significant extent. The researchers verify this by correlating the generated explanations with ground truth revealed in other ways. They also try rewording the explanations (which preserves the semantic meaning but would disturb any encoding that’s unrelated to meaning), and find that the reconstructor can still reconstruct activations.

On the other hand, their downstream result is not very impressive:

> An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time

That is apparently better than existing techniques, but still a rather low percentage.

Another interesting point: The LLMs used to initialize the verbalizer and reconstructor are stated to have the “same architecture” as the LLM being analyzed (it doesn’t say “same model” so I imagine it’s a smaller version?). The researchers probably think this architectural similarity might give the models some built-in insight about the target model’s thinking that can be unlocked through training. Does it really though? As far as I can see they don’t run any tests using a different architecture, so there’s no way to know.


Great summary. The fact that the autoencoding task is not grounded in actual thoughts, plus the initial training on guessed internal thoughts, raises serious concerns about faithfulness. It feels like they might get better results by just training a supervised model on activations paired with "internal thoughts" measured in some other behavioral way.



Don't they add a KL loss term to the frozen model's outputs?

Yes, it’s a shell builtin that makes the shell execute a chdir() syscall. Therefore it isn’t subject to argument length limits imposed by the kernel when executing processes. But it is still subject to path length limits imposed by the kernel’s implementation of chdir() itself. While the shell may be a GNU project (bash), the kernel generally is not (unless you are running Hurd), so this isn’t GNU’s fault per se.

However, the shell could theoretically chunk long cd arguments into multiple calls to chdir(), splitting on slashes. I believe this would be fully semantically correct: you are not losing any atomicity guarantees because the kernel doesn’t provide such guarantees in the first place for lookups involving multiple path components. I’m not surprised that bash doesn’t bother implementing this, and I don’t know if I’d call that an “arbitrary limitation” on bash’s part (as opposed to a lack of workaround for another component’s arbitrary limitation). But it would be possible.
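A rough sketch of that chunking idea (in Rust rather than bash's C, and with `chdir_long` and its `chunk_limit` parameter being made-up names for illustration): accumulate path components until adding the next one would exceed the limit, issue a chdir() for the chunk so far, and continue from there.

```rust
use std::env;
use std::io;

/// Change directory to `path` by issuing one chdir() per chunk of
/// components, so that no single chdir() call is passed a string
/// longer than `chunk_limit` bytes. Hypothetical workaround; bash
/// does not actually do this.
fn chdir_long(path: &str, chunk_limit: usize) -> io::Result<()> {
    // An absolute path starts from the root; chdir there first.
    if path.starts_with('/') {
        env::set_current_dir("/")?;
    }
    let mut chunk = String::new();
    for part in path.split('/').filter(|p| !p.is_empty()) {
        let candidate = if chunk.is_empty() {
            part.to_string()
        } else {
            format!("{chunk}/{part}")
        };
        if candidate.len() > chunk_limit && !chunk.is_empty() {
            // Flush the accumulated chunk with its own chdir(),
            // then start a new chunk from the current component.
            env::set_current_dir(&chunk)?;
            chunk = part.to_string();
        } else {
            chunk = candidate;
        }
    }
    if !chunk.is_empty() {
        env::set_current_dir(&chunk)?;
    }
    Ok(())
}
```

Splitting only on slashes keeps each chdir() a plain relative lookup, which is why no atomicity is lost: the kernel resolves multi-component paths one directory at a time anyway.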


They are not perfectly fine. If a task panics then you will get the right stack trace, but there is no way to get a stack trace for a task that’s currently waiting. (At least not without intrusive hacks.)


> This functionality is experimental, and comes with a number of requirements and limitations.

I assume that answers your question.


So once it's out of the experimental stage it won't be an intrusive hack anymore?

Just because LLMs overuse it doesn't mean it doesn't have its place.

The way the OP used the 'not X, but Y' pattern, the 'X' and 'Y' are two clear, specific, and (most importantly) distinct things, as opposed to stereotypical LLM usage where they're vague characterizations or metaphors. And there's a reason to emphasize that it's not X, because the Slop Cop website implicitly suggests that it is X.


They probably aren’t affected because the buggy code was only added in macOS 26:

https://github.com/apple-oss-distributions/xnu/blame/f6217f8...


Ouch - "every Mac" from the original post is a hallucination then.

I can live with the writing style when the topic is interesting (here it was for me) but complete untruths are much worse.


The bug was introduced only last year in macOS 26:

https://github.com/apple-oss-distributions/xnu/blame/f6217f8...


> Apple Community #250867747: macOS Catalina — "New TCP connections can not establish." New connections enter SYN_SENT then immediately close. Existing connections unaffected. Only a reboot fixes it.

This is a weird thing to cite if it's a macOS 26 bug. I quite regularly go over 50 days of uptime without issues, so it makes sense for this to be a new bug; maybe they had different bugs in the past with similar symptoms.


Interesting. The article mentions complaints on the forums running Catalina, so that must be something else.


As someone who has also operated fleets of Macs for years now, I can say there is no possible way this bug predates macOS 26. If the bug description is correct, it must be a new one.


The article is written using AI, so unless you verified the complaints, the safe default assumption is that they don't exist.


It definitely exists, but it could be a completely unrelated issue.

https://discussions.apple.com/thread/250867747


It definitely would be unwelcome for EU authorities in cases like the recent US sanctions against ICC officials.


Not to mention the German debanking and account closures of a few Middle Eastern journalists living in Germany, their spouses, and in one case their children.


Fair... they should think about this then


From some skimming of the code, it seems like a nightmare quality-wise. But if it works, it works. I wonder what makes it faster.


Primarily that it doesn't work


Well, it sounds like a real issue, but the diagnosis is AI slop. You can see, for example, how it takes the paragraph quoted from waydabber (attributing the issue to dynamic resource allocation) and expands it into a whole section without really understanding it. The section is in fact self-contradictory: it first claims that the DCP firmware implements framebuffer allocation, then almost immediately goes on to say it's actually the GPU driver and "the DCP itself is not the bottleneck". Similar confusion throughout the rest of the post.


Agree. I started reading the article until I realized it wasn’t even self-coherent. Then I got to the classic two-column table setup and realized I was just reading straight LLM output.

There might be a problem but it’s hard to know what to trust with these LLM generated reports.

I might be jaded from reading one too many Claude-generated GitHub issues that look exactly like this that turned out to be something else.


Parts of it were pasted from my Claude Code logs, parts were written by me, and the table: that was me!


I think you are probably right: it's a real problem.

As an article it is not 100% coherent, but it presents valid data and a clear, real problem.


It does in Rust: assert is always enabled, whereas the debug-only version is called debug_assert.

But yes, “assert” in most languages is debug-only.


He said

> some people would expect it to enforce that the pointer is non-null, then proceed

No language magically makes the pointer non-null and then continues. I don't even know what that would mean.


If you don't even know what that would mean then it's premature to declare that nothing works that way. Understanding the meaning is a prerequisite for that.

In this case, it may help to understand that e.g. border control enforces a traveler's permission to cross the border, then lets them proceed.

