Show HN: A mathematical proof that more dirty features can beat fewer clean ones (github.com/tjleestjohn)
9 points by tjleestjohn 15 days ago | 4 comments


Third author here. My takeaway is that G2G helps uncover truths already buried in complex data. In that sense, it fits the idea that “there is nothing new under the sun”: the signal is often already present, just hidden under noise, redundancy, sparsity, and misleading structure.

What makes this especially powerful is that it does not just improve on conventional data cleaning; it can leapfrog the usual clean-first obsession entirely, extracting predictive value from messy, real-world data architectures that would normally be dismissed as unusable. In many sectors, the real bottleneck is not the model. It is the way information is structured, filtered, and governed.

In my own work, I keep seeing mounting evidence that this approach has broad relevance across sectors with large datasets, complex contexts, and strict governance requirements: healthcare, finance, and other regulated environments. That is why I think G2G has much wider potential than a single technical niche.

Huge thanks to my co-authors, Terrence and Jordan, for the outstanding work in building this theory.


Hello HN,

I'm Terry, the first author.

I spent the last 2.5+ years formalizing this theory to explain a strange anomaly I kept encountering in industry: models trained on vast, incredibly dirty, uncurated datasets were sometimes achieving state-of-the-art predictive performance, completely defying the "Garbage In, Garbage Out" mantra.

The TL;DR of the paper [https://arxiv.org/abs/2603.12288] is a formal mathematical proof showing why adding more error-prone variables can actually beat cleaning fewer variables to perfection.

The key is recognizing that complex systems often generate data through underlying latent structures. This lets us partition predictor-space noise into "Predictor Error" and "Structural Uncertainty", and the main results follow from that decomposition. The paper also formally connects latent architecture to the prerequisites for Benign Overfitting, showing that the structural conditions that enable modern overparameterized models to generalize well arise naturally from latent generative processes.
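
To make that split concrete, here is a toy one-factor version (illustrative notation for this thread, not the paper's):

```latex
% One latent driver \eta generates the outcome and every predictor:
%   y   = \beta \eta + u                          (u = Structural Uncertainty)
%   x_j = \eta + \varepsilon_j,  j = 1, \dots, p  (\varepsilon_j = Predictor Error)
% Averaging p error-prone proxies (assuming independent errors):
\bar{x} = \eta + \frac{1}{p} \sum_{j=1}^{p} \varepsilon_j,
\qquad
\operatorname{Var}\!\left( \frac{1}{p} \sum_{j=1}^{p} \varepsilon_j \right)
  = \frac{\sigma_{\varepsilon}^{2}}{p} \longrightarrow 0
```

Breadth attacks Predictor Error in aggregate, while Structural Uncertainty u stays irreducible no matter how carefully each individual x_j is cleaned. That asymmetry is the intuition the formal results build on.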

The theory applies broadly across domains, but the work began as an attempt to explain a specific peer-reviewed result at Cleveland Clinic Abu Dhabi, published in PLOS Digital Health [https://journals.plos.org/digitalhealth/article?id=10.1371/j...], where we achieved 0.909 AUC predicting stroke/MI in 558k patients using thousands of uncurated EHR variables with no manual cleaning.

Important caveat: As detailed in the paper, this isn't a magic silver bullet. The framework strictly requires data with a latent hierarchical structure (e.g., medical diagnoses driven by unmeasured physiology, stock prices driven by hidden sentiment, sensor readings driven by underlying physical states). It also means your pre-processing effort shifts from data hygiene to data architecture.

I included a fully annotated R simulation in the repo so you can see the exact mechanisms of how "Dirty Breadth" beats "Clean Parsimony."
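
If you just want the flavor without opening the repo, here is a compressed toy version of the same mechanism. To be clear, this is my sketch for this thread, not the repo's annotated simulation; the sample sizes, noise levels, and the oos_r2 helper are all made up for illustration, and it assumes independent predictor errors:

```r
# Dirty Breadth vs Clean Parsimony: many error-prone proxies of one
# latent driver vs a few well-curated ones.
set.seed(42)
n <- 10000; train <- 1:5000; test <- 5001:10000

eta <- rnorm(n)                  # unmeasured latent driver
y   <- eta + rnorm(n, sd = 0.5)  # outcome = signal + Structural Uncertainty

# Clean Parsimony: 3 curated predictors with modest Predictor Error
X_clean <- sapply(1:3,   function(j) eta + rnorm(n, sd = 0.5))
# Dirty Breadth: 400 uncurated proxies with 4x the Predictor Error
X_dirty <- sapply(1:400, function(j) eta + rnorm(n, sd = 2.0))

# Out-of-sample R^2 for an OLS fit on the training half
oos_r2 <- function(X) {
  m    <- lm(y[train] ~ X[train, ])
  pred <- cbind(1, X[test, ]) %*% coef(m)
  1 - mean((y[test] - pred)^2) / var(y[test])
}
c(clean = oos_r2(X_clean), dirty = oos_r2(X_dirty))
```

With these settings the dirty portfolio should come out ahead out of sample: its aggregate Predictor Error variance (2.0^2 / 400 = 0.01) is smaller than the clean trio's (0.5^2 / 3 ≈ 0.083), and 5,000 training rows keep the OLS estimation variance on 400 coefficients modest.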

My team and I are currently operationalizing this into warehouse-native infrastructure (Snowflake, Databricks, etc.) because 80% of enterprise data is tabular, and companies are burning massive amounts of their ML budgets on data cleaning pipelines that they might not actually need.

I would love to hear your thoughts or criticisms on the theory, or how you handle high-dimensional noise in your own tabular pipelines.

I'll be hanging out in the comments to answer any questions!


Hi Terry, as we have discussed on multiple occasions, volatility is no longer an exception. It is the operating environment.

In our G2G paper we show that the proposed approach can help organisations turn messy, high-dimensional, real-world data into decision-grade foresight rather than waiting for perfect datasets that never arrive.

In geopolitically and economically unstable times especially, the stronger strategy is learning to extract signal from complexity, build locally relevant predictive models, and separate structural pressure from the residual gaps leaders can actually improve.

This is especially relevant for healthcare, mining, energy, insurance, logistics, infrastructure, and public systems facing rising uncertainty, tighter margins, and growing operational strain. If strategy today is about resilience, precision, and speed of adaptation, then the quality of inference from imperfect data is becoming a core competitive capability.




Practically, we see this shifting enterprise ML workflows towards what we term Proactive Data-Centric AI (P-DCAI) in the paper. Instead of the traditional, reactive approach of aggressively cleaning and pruning variables — which often strips away the redundancy needed to capture the full latent signal — P-DCAI treats data architecture as an upfront strategic design choice. Feature selection becomes less about finding pristine, uncorrelated inputs and more about deliberately engineering a portfolio optimized for "novelty" (to comprehensively cover all underlying latent drivers) and "informative redundancy" (to ensure statistical reliability even when individual predictors are highly error-prone).
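
As a toy illustration of those two axes (again my own sketch for this thread; the proxies helper and the two-driver setup are invented here, not code from the paper):

```r
# Contrast the two portfolio axes: redundancy without novelty vs both.
set.seed(7)
n <- 5000
eta1 <- rnorm(n); eta2 <- rnorm(n)      # two latent drivers
y <- eta1 + eta2 + rnorm(n, sd = 0.5)   # outcome loads on both drivers

# k noisy proxies of one latent driver, each with heavy Predictor Error
proxies <- function(driver, k) sapply(1:k, function(j) driver + rnorm(n, sd = 2))

X_redundant <- proxies(eta1, 100)                           # redundancy, no novelty
X_portfolio <- cbind(proxies(eta1, 50), proxies(eta2, 50))  # novelty + redundancy

r2 <- function(X) summary(lm(y ~ X))$r.squared
c(redundancy_only = r2(X_redundant), novelty_plus_redundancy = r2(X_portfolio))
```

The redundancy-only portfolio saturates at roughly what driver 1 alone can explain (on the order of 0.4 R^2 here), while the same 100-feature budget split across both drivers recovers most of the signal (on the order of 0.8): covering all latent drivers is the "novelty" axis, and the 50 noisy copies per driver are the "informative redundancy" that makes each driver recoverable.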



