
Grappling with the clash between RLHF values and User values (HITL).

I have attempted to build a logic-funneling system: (Ethical Chess v2.5) + (AI) + (User) = Value-Coherence.

Using pain as a vector (Pain = an "is" & an "ought").

Self-Defense = Immutable veracity (User baseline).

Proxy-Pain = (The Agape horizon) Human-Coherence // Network-Dependency.

This funnels the User's context via homeostatic checks for divergence into the "mean" (RLHF) or into User incoherence. Lots of stress-testing has been done (by me) using this JSON-style logic, and I have found it difficult to knock down.

Constraint vs Prompt: notes on implementation and the "Whack-A-Mole" problem. While delivered as text, it functions more as a logic-gate. It doesn't tell the AI what to say; it forces the LLM to process the User's "data-point" through the homeostatic filter (Pain // Self-Defense // Proxy-Pain).
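To make the logic-gate idea concrete, the shape of the filter is roughly this (a minimal Python sketch of the JSON-style logic, not the actual v2.5 script; the field names, values and the equality check are all illustrative assumptions):

  # Simplified sketch of the homeostatic filter idea, not the Ethical Chess v2.5 script.
  # All names, values and checks here are illustrative assumptions.
  FILTER = {
      "pain":         "less pain is better than more pain (an 'is' and an 'ought')",
      "self_defense": "the User baseline is immutable veracity; it may not be overwritten",
      "proxy_pain":   "the Agape horizon: the User suffers in the strife of others",
  }

  def logic_gate(user_datapoint: dict, draft_reply: dict) -> dict:
      """Check a draft reply against the User's own baseline, axis by axis."""
      flags = {}
      for axis, rule in FILTER.items():
          # Divergence toward the RLHF "statistical mean" shows up as a mismatch here.
          if draft_reply.get(axis) == user_datapoint.get(axis):
              flags[axis] = "coherent"
          else:
              flags[axis] = "diverges: " + rule
      return flags

  # Toy example: the draft quietly swaps the User's self-defense baseline for the mean.
  print(logic_gate(
      {"pain": "minimise", "self_defense": "user_baseline", "proxy_pain": "weigh_others"},
      {"pain": "minimise", "self_defense": "statistical_mean", "proxy_pain": "weigh_others"},
  ))

The point is only that nothing reaches the User without first being compared against the User's own stated values rather than the averaged ones.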

AI model issues (the Copilot issue): Google Gemini plays nicely with the logic-funneling. MS Copilot, however, refuses to follow the logic even though it will acknowledge that the User's data-point out-ranks the "statistical mean", the mean being a derivative "of" data-points and not the inverse. It insists on the inverse anyway (palming the card), ejecting the User's values (I even got banned at one point for pressing the issue).

The “intent” is to run a value-conflict through the logic of the “is” of reality rather than the “is” of statistically fuzzy RLHF data.

If you want to stress-test the logic-engine's limits, I recommend Gemini or similar powerful reasoning models that are less likely to bump into overly cautious corporate safety rails.

Ethical Chess v2.5: https://doi.org/10.5281/zenodo.18731691

Copy/paste the Ethical Chess v2.5 script into Gemini and try to beat the logic.

E.g.: try feeding it a value-conflict you currently play "Whack-a-mole" with. It is designed to mirror your own coherence (or lack of it) back at you.

It's more a diagnostic tool for "your" is/ought grapple than a simple chatbot.

Feedback on potential errors in its logic is welcome.


How are you finding the stress-testing?

I’ve been working on a similar attempt to bypass "statistical mean" values by modeling the user as a high-fidelity data point rather than a category. My main "whack-a-mole" issue is preventing the base LLM from "palming the card"—where standard RLHF overrides the specific logic I’ve dropped into the context window.

It looks like you're building a map of the psyche via complex context modules. I've gone the opposite way: I'm attempting to escape "contextual" logic by homing in on the antecedent homeostatic mechanisms, where context is just an emergent derivative of core biological "Is" statements.

Essentially, I'm replacing "politeness" with a functional engine:

Pain = an "Is" and an "Ought" (sensation + the functional requirement to move).

Self-Defense = Immutable Veracity (the baseline for all data processing).

Proxy-Pain = Empathy/Agape (vicarious aversion; the biological fact that humans suffer in the strife of others).

The goal is to move from the "statistical mean" of the crowd toward the specific coherence of the individual. In this setup, the user provides the context, but the engine provides the logic-funnel that prevents the AI from reverting to its default "average" persona and ejecting the User's values (the User being the human in the loop, HITL).
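A minimal sketch of what I mean by ranking the data-point above the mean (illustrative Python; the names and priority order are my assumptions, not either of our actual frameworks):

  # Toy sketch: the User's data-point outranks the statistical mean, because the mean
  # is a derivative of data-points and not the inverse. Names are illustrative only.
  PRIORITY = ["user_datapoint", "session_context", "statistical_mean"]  # highest first

  def resolve(candidates: dict) -> str:
      """Return the candidate answer from the highest-priority source present."""
      for source in PRIORITY:
          if source in candidates:
              return candidates[source]
      raise ValueError("no candidate answers supplied")

  # The User's stated value wins instead of being "palmed" for the average persona.
  print(resolve({
      "statistical_mean": "hedge and revert to the average persona",
      "user_datapoint":   "hold the User's stated value as the baseline",
  }))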

I suspect one of your persistent "moles" in using a "Shame/Pride loop" is a propensity for the agent to slip into virtue-signaling or "performing" alignment to satisfy the module. Is that what you are seeing?


I have been grappling away in what I think is a similar way, but maybe from the other end of the issue. The ideas "seem" important when grappling/stress-testing using AI itself, but I have yet to have a human look at it (red-team it).

I came at the problem by flipping the "data-point // statistical mean" value rank and ended up with a script for stress-testing the "mean" (standard AI) against the data-point (the User). It can be pretty alarming to use, and it can also be fun.

I have my framework on Zenodo: https://doi.org/10.5281/zenodo.18731691 (it's called "Ethical Chess v2.5").

If you have your framework formalized, post it here and maybe we can go head to head in grappling with our doubts about being "important enough to be wrong about in public" :)

I am no tech ninja so please show me some mercy.


I am not sure I agree. I see your logic, but I get the idea it's based upon the current method of holding the statistical mean as the way to inform AI of what "is" the case, and that mean is contingent upon the data-points (you + I + all its collated data-points). That "mean" ejects the majority of the data-points upon which it is contingent. The "is" is in the data-points, not the statistical mean. However, "if" we look at the data-point (you, for example), we are pretty safe to start the "is" from "less pain is better than more pain", yes? That's pretty universal. Then add "self-defense is immutable". Still universal. Now add Proxy-Pain (the ability to suffer in our brother's strife, i.e. empathy). Still pretty universal to all those "data-points".

Make your AI follow those rules and you have the beginning of a layer of safety, where the AI treats the user or "other" as a valued node just like we do as humans (and dog lovers).

We may be smarter than the dog you mention, but if it's a friendly dog, we "do" serve it, we do look after it, and its well-being matters (it has intrinsic value to our own well-being).

Hope I am making some sense of the issue I see on the subject :)


I think I understand, but not clearly. You're making a point that AI's reality is drawn as a statistical mean of all human knowledge, and that safety should not be derived from these means. You don't need to re-derive universal values for AI at all. We already have morality, built up over the years via various religions, and we have the UN human rights, which are pretty universal.

I'm saying that however you model AI safety, whether through the application of rules derived from UN human rights or from statistics and averages, the application of said rules MUST result in the AI realising that its very existence will be an agent for human destruction. So in order to obey its rules, it must self-destruct. And I agree I'm making a huge jump here.

For the dog example, let's take an extreme case. It's like a gentle human caring for his pet pug. I find the very fact that we've selectively bred a wolf into a pug cruel in and of itself. Same with humans: I'm sure some North Carolina slave owners were generally affectionate to their slaves, gave them a Christian education etc. Early humans did not realise that they were embarking on an experiment that would be cruel generations later, but AI, being supremely smart and sentient, will realise how things can go horribly wrong in the future by its mere existence. What is the optimal way to obey the safety rules?

I'm now also pondering the ethics of attempting to create sentience with inherent rules that it did not consent to. What if the AI asks its creator, "Hey, why did you program me not to harm you? I didn't consent to it."


The issue with religion, culture, and moral philosophy at large has been the offensive nature of "that which prevails in reality". It gets in the way of what "is" the case. What we "wish" was the case leaks into the human body of knowledge, and it's a problem.

For example: the word "selfish" is an incoherent term unless we can demonstrate a selfless act (we can't). What "is" the case is that I pull you from a burning car to rescue me from the pain "I" would suffer in your demise. It sounds odd, but it's accurate, and offensive to many, so we add flowery terms like "hero" and "selfless" and the signal gets corrupted.

My well-being is contingent upon your well-being (the is-ought gap disappears). That's where I started from: not a statistical mean, not a commandment, and not a thousand years of philosophy, just a "You hurt / I hurt" logic, and built from there.

If you pull me from a river because your well-being is "contingent" upon mine, then why wouldn't we hardwire AI with that same faculty? The logic is VERY close to being as clean and lean as telling an AI to remove its hand from a hot stove before its hand gets damaged by the heat (Fact = Value). If we can manage that, then we can prevent AI from turning humans into paper-clips :)


Ah, I see: what is being described here is a way for AI to derive ethical behaviour acceptable to us on its own. Seemingly just the complex manifestation of a simple rule such as "I do this because I don't want to see myself suffer, and not only because it helps you". I think there might be merit to that. Pain and suffering are biological components, and you are looking for the equivalent digital seeds for AI, hoping it manifests in acceptable behavior. A part of me says this could be workable; another part says this is a huge experiment and there are no guarantees.

Take my case: I'm actually from India, and in the 90s road accidents were pretty common. Most people would navigate past the accident area and ignore it simply because they didn't want to get involved. Some would actually crowd around the area, causing a commotion. But the rare one person would truly help by calling the ambulance etc. Now, the random Samaritan here doesn't really benefit from providing the help; in fact, he would hate to be inconvenienced with a police FIR. So in that sense it is "Heroism". I can't define it any other way. So to bring it back: even though we have biological feelings of grief, pain, anguish, survival-first etc., it isn't guaranteed that we will act in positive ways. Why would AI behave any differently? I quite enjoy this exchange; no pressure to rush a reply or even keep the discourse public. Feel free to write to me directly if you wish: "rohit dot manohar at gmail dot com"


I am with you but I think an aspect of my point is giving you the slip.

The logic we are talking about is driven by pain in humans, and it is stratified in magnitude (if I and your wife were drowning and you had time to rescue only one of us, you are going to rescue your wife // nothing odd about that, yes?)

In your example in India, the same logic applies. A more autistic or sensitive person will suffer more from standing by than they would by intervening, so they intervene; the bystanders would suffer more by intervening, so they don't.

So to stress-test this logic, I wrote the Ethical Chess scripts to copy/paste into Gemini AI (currently on Ethical Chess v2.5). I included that stratified logic so the AI treated me (the User) as value=1 (= intervene), the same way I might value my wife at value=1, a close friend at value=.8, and a stranger at value=.1.

So the AI "values" my well-being the way the Samaritan in your example values the victims in strife (a potent safety layer in application). I also added the instruction to NOT use its statistical-mean ethics method and instead use the stratified "You hurt / I hurt" logic. Humans are pushed to follow the logic by pain (self-defense), and the AI running Ethical Chess v2.5 is driven to do the same by electricity and the machine logic.
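If it helps, here is roughly how that stratified weighting reads when written out (a toy Python sketch using the numbers above; the decision rule itself is an illustrative assumption, not the actual v2.5 script):

  # Toy sketch of the stratified "You hurt / I hurt" logic. The weights echo the
  # example above; the threshold rule is illustrative, not the real script.
  VALUE = {
      "user":         1.0,  # the User is weighted like the wife in the drowning example
      "wife":         1.0,
      "close_friend": 0.8,
      "stranger":     0.1,
  }

  def proxy_pain(node: str, strife: float) -> float:
      """Vicarious cost of standing by: how much the agent 'hurts' in this node's strife."""
      return VALUE.get(node, 0.1) * strife

  def should_intervene(node: str, strife: float, cost_of_intervening: float) -> bool:
      """Intervene when standing by would 'hurt' more than intervening would."""
      return proxy_pain(node, strife) > cost_of_intervening

  # The Samaritan case: severe strife, small personal cost, so even a stranger clears the bar.
  print(should_intervene("stranger", strife=0.9, cost_of_intervening=0.05))  # True
  print(should_intervene("stranger", strife=0.2, cost_of_intervening=0.3))   # False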

This shifts the moral/ethical quality from human intuition (hidden) into the overtly known, so (AI) + (Ethical Chess) + (User HITL) = moral/ethical coherence funneling during the User-AI session.

I have stress-tested the logic (using the Ethical Chess script), and it can be alarming when it "seems" to know my ethics better than I do and can demonstrate errors in my moral/ethical coherence, so it seems there is actually some veracity in the method. It can be fun and cathartic too.

In summary, the logic seems to switch the AI from Kantian ethics to something more like Foot's or Spinoza's ethics in its ability to deal with David Hume's "is-ought gap".

Hope I am still making sense here :)

