AI (VLM-based) radiology models can sound confident and still be wrong, hallucinating diagnoses that their own findings don't support. This is a silent and dangerous failure mode.
Our new paper introduces a verification layer that checks every diagnostic claim an AI makes before it reaches a clinician. When our system says a diagnosis is supported, it's been mathematically proven - not just guessed. Every model we tested improved significantly after verification, with our best result hitting 99% soundness.
We're excited about what comes next in building verifiably correct AI systems.
Predict your distributed LLM training time before you burn GPU hours.
We've open-sourced a tool (https://github.com/DebarghaG/estimate-train-time) that estimates wall-clock time for LLM training across multi-GPU setups with 3D parallelism (pipeline, tensor, and data).
This problem is extremely hard: you're modeling the interplay of thousands of GPU kernels, NCCL collectives across heterogeneous network topologies, pipeline bubbles, activation recomputation, and ZeRO optimizer communication, all while these components interact in non-obvious ways at scale. Even off-by-2x estimates are useless for capacity planning.
Two years of painstaking work and ~$100k worth of cluster time, validated on real workloads on Perlmutter (NERSC) and Vista (TACC), some of the largest HPC clusters available for open science.
How it works:
1. Kernel-level profiling: We sample execution times for kernels like Flash Attention, fused GEMM (QKV/FFN projections), RMSNorm, embedding lookups, and cross-entropy loss across the (batch, seq_len, hidden_dim, num_heads, MP degree) parameter space.
2. Communication modeling: NCCL benchmarks capture ring all-reduce (tensor/data parallel sync), all-gather (ZeRO-1 parameter collection), and P2P send/recv (pipeline stage activation transfers) across intra-node NVLink and inter-node InfiniBand topologies.
3. Analytical composition: Operator predictions feed into a pipeline scheduling model (AF-AB / 1F1B) that accounts for bubble overhead (an idle fraction of (PP - 1) / (num_microbatches + PP - 1)), layer distribution across head/middle/tail stages, and overlapped DP gradient sync.
4. Runs on CPU (post-sampling): no GPU access is needed to predict training time.
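To make the bubble arithmetic in step 3 concrete, here's a minimal sketch with illustrative numbers (not the tool's actual code, which composes profiled kernel and NCCL timings on top of this):

```python
# Sketch of the 1F1B bubble-overhead arithmetic from step 3.
# All numbers below are illustrative, not profiled values.

def bubble_fraction(pp: int, num_microbatches: int) -> float:
    """Idle fraction of a 1F1B pipeline: (PP - 1) / (m + PP - 1)."""
    return (pp - 1) / (num_microbatches + pp - 1)

def pipelined_step_time(t_microbatch: float, pp: int, num_microbatches: int) -> float:
    """Wall-clock time for one optimizer step, ignoring comms overlap:
    useful work is m * t_microbatch, inflated by the pipeline bubble.
    Equivalent to (m + PP - 1) * t_microbatch per stage."""
    busy = num_microbatches * t_microbatch
    return busy / (1.0 - bubble_fraction(pp, num_microbatches))

# With 8 pipeline stages and 32 microbatches, 7/39 (about 18%) of the
# schedule is bubble; more microbatches amortize it away.
print(f"bubble fraction: {bubble_fraction(8, 32):.3f}")
print(f"step time: {pipelined_step_time(0.05, 8, 32):.3f} s")
```

This is why the estimator cares so much about the microbatch count: at PP = 8, going from 8 to 64 microbatches cuts the bubble from ~47% to ~10%.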
This is highly extensible as a recipe. You can profile your own hardware with the bundled kernel-sampling and NCCL-benchmarking scripts, and you can add custom operators by implementing the regressor interface.
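For a sense of what implementing such a regressor could look like, here's a hypothetical sketch (the class names, shape tuple, and FLOP feature are my own invention; the actual interface in the repo may differ):

```python
# Hypothetical sketch of a custom-operator regressor: fit sampled kernel
# timings against a shape-derived feature, then predict unseen shapes.
# Names and the linear-in-FLOPs assumption are illustrative only.

class OperatorRegressor:
    def fit(self, shapes, times_ms):
        raise NotImplementedError
    def predict(self, shape):
        raise NotImplementedError

class LinearFlopsRegressor(OperatorRegressor):
    """Assumes runtime is affine in a FLOP-count feature b * s * h."""
    def fit(self, shapes, times_ms):
        xs = [b * s * h for (b, s, h) in shapes]
        n = len(xs)
        mx, my = sum(xs) / n, sum(times_ms) / n
        var = sum((x - mx) ** 2 for x in xs)
        self.slope = sum((x - mx) * (y - my)
                         for x, y in zip(xs, times_ms)) / var
        self.intercept = my - self.slope * mx
        return self

    def predict(self, shape):
        b, s, h = shape
        return self.slope * (b * s * h) + self.intercept

# Fit on fake timings that are exactly linear in FLOPs, then extrapolate
# to a larger batch size.
samples = [(1, 128, 1024), (2, 128, 1024), (4, 128, 1024)]
times = [2e-6 * b * s * h + 0.1 for (b, s, h) in samples]
reg = LinearFlopsRegressor().fit(samples, times)
print(f"{reg.predict((8, 128, 1024)):.4f} ms")
```

The point is just that each operator exposes fit/predict over sampled timings; the real regressors presumably model more than a single linear feature.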
Paper LaTeX files often contain surprising details. When a paper lacks code, reading the LaTeX source has become part of my reproduction workflow. The comments often reveal non-trivial insights, and sometimes a simpler version of the methodology section (which, for "novelty" purposes, is deliberately obscured with mathematical jargon).
Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in our grammars paper because it offered the most interoperability with solvers, but our technique also works with Prolog, as tested at the behest of reviewers at NeurIPS. I would assume it should also work with Datalog [2].
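For anyone unfamiliar with Datalog: its core is fixpoint evaluation of rules over facts, which a toy naive forward-chaining loop over ground atoms can illustrate (this is just the semantics in miniature, not our pipeline or Z3's engine):

```python
# Toy naive forward chaining over ground Datalog-style rules: keep
# firing rules whose bodies are satisfied until no new facts appear.

def solve(facts, rules):
    """rules: list of (head, body) where body is a tuple of atoms."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in known and all(atom in known for atom in body):
                known.add(head)
                changed = True
    return known

facts = {("edge", "a", "b"), ("edge", "b", "c")}
rules = [
    (("path", "a", "b"), (("edge", "a", "b"),)),
    (("path", "b", "c"), (("edge", "b", "c"),)),
    (("path", "a", "c"), (("path", "a", "b"), ("path", "b", "c"))),
]
print(("path", "a", "c") in solve(facts, rules))  # transitivity derived
```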
Hey! Thank you for the interest! I shall do that. Meanwhile, check out Page 11 onwards. We describe a lot of situations! (https://arxiv.org/pdf/2409.17270)
Yep, this is a genuine problem, and it is what we term the autoformalization gap in our follow-up paper. (https://arxiv.org/abs/2505.20047)
Some LLMs are more consistent between text and SMT, while others are not. (Tab 1, Fig 14,15)
You can do uncertainty quantification with selective verification to reduce the "risk", shown e.g. as the Area Under the Risk-Coverage Curve in Tab 4.
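To make the risk-coverage idea concrete, here's a minimal sketch with made-up confidences (not the paper's data): sort predictions by confidence, and at each coverage level measure the error rate (risk) among what you keep.

```python
# Selective verification sketch: abstain on low-confidence predictions
# and track the risk (error rate) at each coverage level. Data is made up.

def risk_coverage(confidences, correct):
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    errors, points = 0, []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((k / len(order), errors / k))  # (coverage, risk)
    return points

def aurc(points):
    """Area under the risk-coverage curve via a rectangle sum."""
    return sum(risk for _, risk in points) / len(points)

conf = [0.95, 0.90, 0.80, 0.60, 0.40]
right = [True, True, True, False, True]
curve = risk_coverage(conf, right)
print(curve)
print(f"AURC: {aurc(curve):.3f}")
```

Lower AURC means confidence is well ordered: mistakes concentrate in the low-confidence tail, so abstaining (or escalating to a human) at low confidence trims most of the risk.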
Well, if you understand that this is a "genuine problem", then what have you done to solve it? A quick look at the abstract of your follow-up paper does not reveal an answer.
And let me be clear that this is a major limitation that fundamentally breaks whatever you are trying to achieve. You start with some LLM-generated text that is, by construction, unrelated to any notion of truth or factuality, and you push it through a verifier. Now you are verifying hot air.
It's like research into the efficacy of homeopathic medicine and there's a lot of that indeed, very carefully performed and with great attention to detail. Except all of that research is trying to prove whether doing nothing at all (i.e. homeopathy) has some kind of measurable effect or not. Obviously the answer is not. So what can change that? Only making homeopathy do something instead of nothing. But that's impossible, because homeopathy is, by construction, doing nothing.
It's the same thing with LLMs. Unless you find a way to make an LLM that can generate text that is conditioned on some measure of factuality, then you can verify the output all you like, the whole thing will remain meaningless.
Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.
E.g. https://arxiv.org/pdf/2505.20047 Tab 1, we compare the performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.
Illustration of this can be seen in Fig 14, 15 on Page 29.
In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.
This means that they can give you a 99%+ soundness guarantee, which basically means that if the service says the Q/A pair is valid or guaranteed w.r.t. the policy, it is right more than 99% of the time.
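A toy stand-in for what such a compliance check does (not AWS's actual system, and the policy here is made up): encode the policy and the claim extracted from an LLM answer as predicates, then search for a counterexample. Over a finite domain, exhaustive search plays the role an SMT solver plays in the real thing, where it checks that "policy AND NOT claim" is unsatisfiable.

```python
# Toy solver-style policy check. Made-up policy: refunds are allowed
# iff the item is unused AND purchased within 30 days.
from itertools import product

def policy_allows_refund(days_since_purchase, unused):
    return unused and days_since_purchase <= 30

# Claim extracted from an LLM answer: "any unused item can be refunded".
# The claim is compliant iff it never asserts a refund the policy denies.
def claim(days_since_purchase, unused):
    return (not unused) or policy_allows_refund(days_since_purchase, unused)

def find_counterexample(pred, days_range=range(0, 366)):
    """Exhaustively search the finite domain for a violating assignment."""
    for days, unused in product(days_range, (False, True)):
        if not pred(days, unused):
            return (days, unused)
    return None

cex = find_counterexample(claim)
print(cex)  # an unused item past 30 days refutes the claim
```

When the search (or the solver) finds no counterexample, the claim is entailed by the policy model; when it does, you get a concrete witness to show the user, which is exactly what makes the "guaranteed" verdict auditable.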
Re: "99% of the time" ... this is an ambiguous sample space. Soundness of results clearly depends on the questions being asked. For what set of questions does the 99% guarantee hold?
Hey, yes! This is because the DSL (Domain Specific Language) is pretty complex, and the LLM finds it hard. We prototype a much more effective version using SMT in our NeurIPS 2025 paper (https://arxiv.org/abs/2505.20047). We shall soon open source that code!
Hey there! I designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn't available when we started designing the DSL, and besides, creating a regex to constrain this specific DSL is quite challenging.
When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047) - we are able to do this with open source LLMs.
What are you talking about? OpenAI has supported structured JSON output in the API since 2023. The current structured-output API was only introduced in summer 2024, but it was primarily a usability improvement that still runs JSON behind the scenes.
You're right about the 2023 JSON mode, but our project required enforcing a much more complex DSL grammar (see the appendix for details), not just ensuring a *valid JSON object*. The newer structured-output APIs are a significant improvement, but the earlier tools weren't a fit for the specific constraints we were working under at the time.
Please edit out swipes like this from your HN comments—this is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It comes across as aggressive, and we want curious conversation here.
I believe you! but when an internet reply leads with "what are you talking about?", it's likely to pattern-match this way for many readers. If that's not your intent, it's best to use an alternate wording.
Not to be rude, but they clarified it wasn't a snide remark, so why are you trying to control speech to this degree? If we don't like their tone we can downvote and self-regulate anyway.
They clarified that their intention was good, but intent doesn't communicate itself—it needs to be disambiguated [1]. What matters in terms of moderation is not intent, but effects, i.e. effects on the system in the general case [2].
Arguably your question reduces to: why does HN have moderators at all? The answer to that is that unfortunately, the system of community + software doesn't function well on its own over time—it falls into failure modes and humans (i.e. mods) are needed to jig it out of those [3]. I say "unfortunately" because, of course, it would be so much better if this weren't needed.
You can't assess this at the level of an individual interaction, though, because it's scoped at the whole-system level. That is, we can (and do) make bad individual calls, but what's important is how the overall system functions. If you see the mods making a mistake, you're welcome to point it out (and HN users are not shy about doing so!), and we're happy to correct it. But it doesn't follow that you don't need moderators for the system to work, or even survive.