Hacker News: raffisk's comments

Introduced the Determinism-Faithfulness Assurance Harness (DFAH) in a new paper, "Replayable Financial Agents," along with the open-source code.

A few findings:

- Determinism and faithfulness are positively correlated (r = 0.45) for the tasks in my experiments.
- Schema-first Tier 1 (7–20B) stays near the 95% compliance threshold under stress.
- Frontier models performed well on some tasks (e.g., strong action determinism in agentic triage), but the matrix helps define when HITL is still needed.

Note: I didn't have control of the inference engines or infra for these experiments; I leveraged local models and frontier APIs.

Paper: https://arxiv.org/abs/2601.15322


Empirical study on LLM output consistency in regulated financial tasks (RAG, JSON, SQL). Governance focus: smaller models (Qwen2.5-7B, Granite-3-8B) hit 100% determinism at T=0.0, passing audits (FSB/BIS/CFTC), vs. larger models like GPT-OSS-120B at 12.5%. The gaps are huge (87.5%, p<0.0001, n=16) and survive multiple-testing corrections.

Caveat: this measures reproducibility (edit distance), not full accuracy; determinism is necessary for compliance but needs semantic checks on top (e.g., embeddings against ground truth). The release includes the harness, invariants (±5%), and attestation.

Thoughts on the inverse size-reliability relationship? Planning a follow-up with accuracy metrics rather than just reproducibility.
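For concreteness, the reproducibility measurement can be sketched like this. This is a minimal stand-in, not the paper's actual harness; `difflib`'s ratio is used here as a rough proxy for a normalized edit-distance similarity:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough proxy for 1 - normalized edit distance between two outputs."""
    return SequenceMatcher(None, a, b).ratio()

def determinism_rate(outputs: list[str]) -> float:
    """Fraction of runs whose output exactly matches the first run."""
    if not outputs:
        return 0.0
    ref = outputs[0]
    return sum(o == ref for o in outputs) / len(outputs)

# 14 of 16 runs byte-identical -> 87.5% deterministic
runs = ["SELECT * FROM trades;"] * 14 + ["SELECT * FROM trade;"] * 2
print(determinism_rate(runs))  # 0.875
```

The exact-match rate is the determinism number; the similarity score is what distinguishes "one token drifted" from "the model wrote a different query."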


It is the reasoning. During the reasoning process, the top few tokens have very similar or even identical logprobs. With gpt-oss-120b, you should be able to get deterministic output by turning off reasoning, e.g. by appending:

    {"role": "assistant", "content": "<think></think>"}
Of course, the model will be less capable without reasoning.
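As a sketch, the pre-filled-assistant-turn trick looks like this against an OpenAI-compatible chat endpoint. Whether an empty `<think></think>` block actually suppresses reasoning depends on the serving stack's chat template, so treat this as an assumption to verify:

```python
def build_no_reasoning_request(model: str, user_prompt: str,
                               temperature: float = 0.0) -> dict:
    """Build a chat request that pre-fills an empty think block, so the
    model (if the server's template honors it) continues straight to the
    final answer instead of generating reasoning tokens."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "user", "content": user_prompt},
            # Pre-filled assistant turn with an empty reasoning block.
            {"role": "assistant", "content": "<think></think>"},
        ],
    }

req = build_no_reasoning_request("gpt-oss-120b", "Summarize the filing.")
print(req["messages"][-1]["content"])  # <think></think>
```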


Good call; reasoning-token variance is likely a factor, especially with logprob clustering at T=0. The <think></think> workaround would work, but we need reasoning intact for financial QA accuracy.

Also, the Mistral Medium model we tested had ~70% deterministic outputs across the 16 runs for the text-to-SQL generation and JSON summarization tasks, and it had reasoning on. Llama 3.3 70B started to degrade and doesn't have reasoning. But it's a relevant variable to consider.


Outputs not being deterministic at temperature = 0 doesn't match my understanding of what "temperature" means; I thought T=0 was defined as deterministic decoding.

Is this perhaps inference implementation details somehow introducing randomness?


Defeating Nondeterminism in LLM Inference

https://news.ycombinator.com/item?id=45200925

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

tl;dr: the way inference is batched introduces non-determinism.
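The batch-invariance point comes down to floating-point addition not being associative: reorder a reduction (which different batch sizes effectively do) and the bits can change. A two-line demonstration:

```python
# Associativity fails in IEEE 754 floats:
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False

# Same effect at reduction scale: identical values, different summation
# order, different total (the small terms get absorbed differently).
vals = [1e16, 1.0, -1e16, 1.0]
print(sum(vals) == sum(sorted(vals)))  # False
```

A GPU kernel whose reduction tree depends on batch size is doing exactly this to your logits, and near-tied top tokens then flip under greedy decoding.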


“Determinism is necessary for compliance”

Says who?

The stuff you comply with changes in real time. How’s that for determinism?


Author here. Fair point, regs are a moving target. But the FSB/BIS/CFTC explicitly require reproducible outputs for audits (no random drift in financial reports). At the very least, determinism = traceability, even when the rules update.

Most groups I work with stick to traditional automation/rules systems, but top-down mandates are pushing them toward frontier models for general tasks—which then get plugged into these workflows. A lot stays in sandbox, but you'd be surprised what's already live in fin services.

The authorities I cited (FSB/BIS/CFTC) said just last month that AI monitoring is "still at early stage": https://www.fsb.org/2024/11/the-financial-stability-implicat...

Curious how you'd tackle that real-time changing reg?


* https://www.fsb.org/2025/10/monitoring-adoption-of-artificia...

This is the link I meant, from Oct '25, reiterating the early stages of AI monitoring.


Please give an example of a statutory compliance item that "changes in real time".

That's not the way regulations work. Your compliance is measured against a fixed version of legislation.


Fair point; statutes lock in. But enforcement lists (OFAC, sanctions) update constantly and require re-screening. The proposed framework ensures deterministic re-runs: same input = same output, keeping audit trails clean when the data shifts underneath.
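The attestation idea can be sketched as hashing each (input, output) pair per run: a deterministic pipeline reproduces the same digest on re-screening, which is what keeps the audit trail verifiable. The record layout below is illustrative, not the paper's exact schema:

```python
import hashlib
import json

def attest(prompt: str, output: str, model: str) -> str:
    """SHA-256 digest over a canonical JSON record of one run.
    Sorted keys + compact separators make the serialization stable."""
    record = json.dumps(
        {"model": model, "prompt": prompt, "output": output},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(record.encode()).hexdigest()

# A deterministic re-run over the updated list yields the same digest
# for unchanged entries, so only genuine changes show up in the trail.
h1 = attest("screen: ACME Corp", "no_match", "qwen2.5-7b")
h2 = attest("screen: ACME Corp", "no_match", "qwen2.5-7b")
print(h1 == h2)  # True
```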


Ha ha, the FinCEN BOI drama. Form D. Qualified clients. R&D credits. Export rules.

My bro, the tariffs. The first table of tariffs was written by ChatGPT!

> That's not the way regulations work.

Whatever regulations you are thinking of, they are myths now. I'm not saying deregulation - that isn't happening. In every industry - I know more about healthcare than finance - clear, complex, well specified regulations are being replaced by vague, mercurial ones. The SEC has changed many things too.


Also, what happens if you add a space to the end of the prompt? Or write 12.00 as 12.000?


Good question; spacing could mess with tokenization. Untested, but definitely plausible. Worth a quick test on the setup: the code for the financial-services harness is open for tinkering and testing different prompts/model architectures based on feedback: https://github.com/ibm-client-engineering/output-drift-finan...
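A quick perturbation check along those lines: run the baseline prompt plus a trailing-space variant and a numeric-formatting variant, and report which ones drift. `call_model` is a placeholder for whatever client the harness wires in, so this is a sketch, not the harness's API:

```python
from typing import Callable

def perturbations(prompt: str) -> list[str]:
    """Baseline plus the two perturbations asked about above."""
    return [
        prompt,
        prompt + " ",                       # trailing space
        prompt.replace("12.00", "12.000"),  # numeric formatting tweak
    ]

def drift_report(prompt: str, call_model: Callable[[str], str]) -> dict[str, bool]:
    """Map each perturbed prompt to whether its output matches baseline."""
    baseline = call_model(prompt)
    return {p: call_model(p) == baseline for p in perturbations(prompt)}

# With a stub "model" that strips whitespace, the trailing space is
# harmless but the numeric rewrite drifts:
report = drift_report("Flag trades above 12.00 USD", lambda p: p.strip())
print(report)
```

Against a real endpoint, run each perturbation N times so tokenization drift is distinguishable from the sampling drift already measured.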

