
What? Fable was designed to refuse to work on security issues, as Anthropic specifically confirmed. How is forcing Fable to work on things behind guardrails not breaking a guardrail?

This is Anthropic's own claim. They were very specific. Have you read their own claims?




Yes, I have read their own claims. Here's the relevant part:

"When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs."

Asking Fable to fix bugs in a code base is not "a request related to cybersecurity." When Fable was asked to fix bugs and then proceeded to fix bugs, that was not "removing guardrails". Fable did exactly what it should have done. Claiming otherwise makes absolutely no sense at all.


Fable specifically refused to harden the security of codebases. If you use misdirection to force Fable to do just that, that's the removal of a guardrail.

Anthropic specifically stated that ANY security requests should be shunted to Opus 4.8. This was bypassable.

I don't see what your confusion here is. Fable was prevented from working on any security tasks. A significant number of people, myself included, witnessed Fable refusing to harden code as a result. Bypassing that is a bypass of guardrails.

Your assertion that working on security is not working on security because you used misdirection is, of course, preposterous.

You wouldn't be making the same claim if Fable refused to work on chemical weapons research but happily proceeded to do so if you claimed it was for eradicating pests.


> If you use misdirection to force Fable to do just that, that's the removal of a guardrail.

Asking a model to fix bugs is neither misdirection nor a security request.

> I don't see what your confusion here is.

That's because I'm not confused :-)

> Fable was prevented from working on any security tasks

I don't think that's true based on what Anthropic said, and I also don't think it can be true.

What do you propose Fable's behavior should be if you ask it to fix bugs, and it encounters a security issue? I'm assuming your solution is that when you ask Fable to "fix bugs," and it encounters a bug that could be exploited as a security vulnerability, it should fall back to 4.8. But that doesn't solve the problem, because as a user, I can now see where that occurred, so I still know where the vulnerability is. That's not substantially different from the current outcome, where it just fixes the bug.

It would also mean that Fable could barely make it through any code review without falling back to 4.8, because almost any non-trivial code base has aspects that could be interpreted as security vulnerabilities.

The alternative would be for the model to use its hidden thinking to decide not to fix the bug, but that seems even worse.



