Introduced the Determinism-Faithfulness Assurance Harness (DFAH) in a new paper, "Replayable Financial Agents", along with the open-source code.
A few findings:
- Determinism and faithfulness are positively correlated (r = 0.45) for the tasks in my experiments
- Schema-first Tier 1 (7–20B) stays near the 95% compliance threshold under stress.
- Frontier models performed well on some tasks (e.g., strong action determinism in agentic triage), but the matrix helps define when HITL is still needed.
note: I didn't have control of the inference engines or infra for these experiments; I leveraged local models and frontier APIs
Empirical study on LLM output consistency in regulated financial tasks (RAG, JSON, SQL). Governance focus: smaller models (Qwen2.5-7B, Granite-3-8B) hit 100% determinism at T=0.0, passing audits (FSB/BIS/CFTC), vs. larger models like GPT-OSS-120B at 12.5%. The gaps are huge (87.5 percentage points, p<0.0001, n=16) and survive multiple-testing corrections.
Caveat: this measures reproducibility (edit distance), not full accuracy. Determinism is necessary for compliance but not sufficient; semantic checks (e.g., embedding similarity to ground truth) are still needed. Includes harness, invariants (±5%), and attestation.
Thoughts on the inverse size-reliability relationship? Planning a follow-up with accuracy metrics rather than reproducibility alone.
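Rough sketch of the determinism metric, for anyone who wants to try this on their own stack (difflib's ratio standing in for normalized edit distance; this is a simplification, not the exact repo code):

```python
from difflib import SequenceMatcher

def determinism_rate(outputs, threshold=1.0):
    """Fraction of runs whose output matches the modal output.

    threshold=1.0 requires exact matches; lower values relax the
    check to a difflib similarity ratio (a stand-in for a
    normalized-edit-distance cutoff).
    """
    if not outputs:
        return 0.0
    # Use the most frequent output as the reference
    ref = max(set(outputs), key=outputs.count)
    matches = sum(
        1 for o in outputs
        if SequenceMatcher(None, ref, o).ratio() >= threshold
    )
    return matches / len(outputs)

# 16 runs of the same prompt; 2 drift by a single whitespace char
runs = ['{"risk": "low"}'] * 14 + ['{"risk": "low" }'] * 2
print(determinism_rate(runs))       # strict: 0.875 (14/16)
print(determinism_rate(runs, 0.9))  # relaxed: 1.0
```

The strict/relaxed split matters for JSON tasks: a stray space is still valid JSON but breaks byte-level reproducibility, which is what the audit trail cares about.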
It is the reasoning. During the reasoning process, the top few tokens have very similar or even identical logprobs. With gpt-oss-120b, you should be able to get deterministic output by turning off reasoning, e.g. by appending an empty <think></think> block.
Good call—reasoning token variance is likely a factor, esp with logprob clustering at T=0. Your <think></think> workaround would work, but we need reasoning intact for financial QA accuracy.
Also, the Mistral Medium model we tested had ~70% deterministic outputs across the 16 runs for the text-to-SQL generation and JSON summarization tasks, and it had reasoning on. Llama 3.3 70B started to degrade and doesn't have reasoning. But it's a relevant variable to consider.
Outputs not being deterministic at temperature = 0 doesn't match my understanding of what "temperature" means; I thought T=0 was, by definition, deterministic (greedy decoding).
Is this perhaps inference implementation details somehow introducing randomness?
> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
tl;dr: the way inference is batched introduces non-determinism.
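A toy illustration of the underlying mechanism: floating-point addition isn't associative, so the order in which a reduction is performed changes the low-order bits of the result. Batched kernels change that order with batch size, so identical requests can see slightly different logits, and near-ties at T=0 then flip tokens.

```python
# Floating-point addition is not associative: grouping changes results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False: 0.6000000000000001 vs 0.6

# Same effect in a reduction: summing identical values in a different
# order gives a different answer (15 small terms absorbed into a large
# one before vs. after cancellation).
vals = [1.0] * 15 + [1e16, -1e16]
print(sum(vals) == sum(reversed(vals)))  # False: 16.0 vs 15.0
```

A batch-invariant kernel fixes the reduction order regardless of batch size, which is what the quoted post is about.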
Author here. Fair point, regs are a moving target. But FSB/BIS/CFTC explicitly require reproducible outputs for audits (no random drift in financial reports). At the very least, determinism = traceability, even when the rules update.
Most groups I work with stick to traditional automation/rules systems, but top-down mandates are pushing them toward frontier models for general tasks—which then get plugged into these workflows. A lot stays in sandbox, but you'd be surprised what's already live in fin services.
Fair point, statutes lock in. But enforcement lists (OFAC, sanctions) update constantly and require re-screening. The proposed framework ensures deterministic re-runs: same input = same output, keeping audit trails clean when the data shifts underneath.
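Roughly what I mean by keeping audit trails clean: content-address each screening run so that a deterministic re-run produces the same attestation hash. A minimal sketch (names are hypothetical, not the harness API):

```python
import hashlib
import json

def attest(record: dict) -> str:
    """Hash a screening run for the audit trail: identical input plus
    identical output always yields the identical attestation hash."""
    # Canonical serialization: sorted keys, no whitespace, so the hash
    # depends on content, not on dict ordering or formatting.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

run_a = {"input": "ACME LTD vs sanctions list 2024-05-01", "output": "no_match"}
run_b = {"output": "no_match", "input": "ACME LTD vs sanctions list 2024-05-01"}

# Key order doesn't matter; content does.
print(attest(run_a) == attest(run_b))  # True
```

If the model is deterministic, re-screening after a list update either reproduces the old hash (nothing changed) or yields a new one you can tie to the data change, not to model noise.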
Ha ha, the FinCEN BOI drama. Form D. Qualified clients. R&D credits. Export rules.
My bro, the tariffs. The first table of tariffs was written by ChatGPT!
> That's not the way regulations work.
Whatever regulations you are thinking of, they are myths now. I'm not saying deregulation - that isn't happening. In every industry - I know more about healthcare than finance - clear, complex, well specified regulations are being replaced by vague, mercurial ones. The SEC has changed many things too.
Good q: spacing could mess with tokenization. Untested, but definitely plausible. Worth a quick test on the setup; the code for the fin-services harness is there for tinkering and testing different prompts/model architectures based on feedback.
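The quick test would look something like this: generate whitespace variants of one prompt and diff the outputs. The stub model below is hypothetical (deterministic, but sensitive to whitespace the way a tokenizer-boundary effect would be); swap in the real client from the repo:

```python
from difflib import SequenceMatcher

def stub_model(prompt: str) -> str:
    # Hypothetical stand-in: output varies with prompt length,
    # mimicking a whitespace-sensitive tokenization.
    return f"SELECT * FROM trades;  -- chars={len(prompt)}"

base = "Generate SQL for all trades"
variants = [base, base + " ", " " + base, base.replace(" ", "  ", 1)]

outputs = {v: stub_model(v) for v in variants}
for v, out in outputs.items():
    sim = SequenceMatcher(None, outputs[base], out).ratio()
    print(f"{v!r}: identical={out == outputs[base]} similarity={sim:.3f}")
```

If real models show the same pattern, prompt normalization (strip/collapse whitespace before hashing and before inference) belongs in the harness's invariants.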
https://github.com/ibm-client-engineering/output-drift-finan...
Paper: https://arxiv.org/abs/2601.15322