Hey HN, this is David from Aluna (YC S24). We work with diagnostic labs to build datasets and evals for oncology tasks.
I wanted to share a simple RL environment I built that gives frontier LLMs a set of tools that let them zoom and pan across a digitized pathology slide to find the regions relevant to making a diagnosis.
Here are some videos of the LLM performing diagnosis on a few slides:
(https://www.youtube.com/watch?v=k7ixTWswT5c): traces of an LLM choosing different regions to view before making a diagnosis on a case of small-cell carcinoma of the lung
(https://youtube.com/watch?v=0cMbqLnKkGU): traces of an LLM choosing different regions to view before making a diagnosis on a case of benign fibroadenoma of the breast
Why I built this:
Pathology slides are the backbone of modern cancer diagnosis. Tissue from a biopsy is sliced, stained, and mounted on glass for a pathologist to examine for abnormalities.
Today, many of these slides are digitized into whole-slide images (WSIs) in TIF or SVS format, and each can be several gigabytes in size.
While several pathology-focused AI models exist, I was curious to test whether frontier LLMs can perform well on pathology-based tasks. The main challenge is that WSIs are too large to fit into an LLM’s context window. The standard workaround, splitting them into thousands of smaller tiles, is inefficient for large frontier LLMs.
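To give a rough sense of scale, here is a back-of-the-envelope tile count. The slide dimensions (100,000 x 80,000 pixels) and the 512-pixel tile size are assumptions picked for illustration, not measurements from the slides used here.

```python
import math

def tile_count(width: int, height: int, tile: int = 512) -> int:
    """Number of tile x tile patches needed to cover a width x height image."""
    return math.ceil(width / tile) * math.ceil(height / tile)

# Hypothetical slide size at full resolution (an assumption, not a real case).
full_w, full_h = 100_000, 80_000

print(tile_count(full_w, full_h))              # 30772 tiles at full resolution
print(tile_count(full_w // 32, full_h // 32))  # 35 tiles at 32x downsampling
```

Under these assumptions, exhaustive tiling at full resolution means tens of thousands of images per slide, while a heavily downsampled overview fits in a few dozen. That gap is what makes looking at a low-magnification view first and zooming in selectively attractive.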
Inspired by how pathologists zoom and pan under a microscope, I built a set of tools that let LLMs control magnification and coordinates, viewing small regions at a time and deciding where to look next.
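A minimal sketch of how zoom and pan tools like this could be modeled, assuming the view is tracked as a center point plus a downsample factor. The `Viewport`, `pan`, and `zoom` names are hypothetical and this is not the environment's actual code; a real version would also read pixels for the returned region with a WSI library such as OpenSlide.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Viewport:
    cx: float          # view center, in full-resolution slide pixels
    cy: float
    downsample: float  # 1.0 = full resolution; larger = zoomed out
    size: int = 512    # the model always receives a size x size image

    def region(self):
        """Return (left, top, width, height) in full-resolution pixels."""
        span = self.size * self.downsample
        return (int(self.cx - span / 2), int(self.cy - span / 2),
                int(span), int(span))

def _clamp(value, low, high):
    return max(low, min(high, value))

def pan(vp, dx, dy, slide_w, slide_h):
    """Shift the center by a fraction of the current field of view,
    keeping the center on the slide."""
    span = vp.size * vp.downsample
    return replace(vp,
                   cx=_clamp(vp.cx + dx * span, 0, slide_w),
                   cy=_clamp(vp.cy + dy * span, 0, slide_h))

def zoom(vp, factor, max_downsample):
    """factor > 1 zooms in, factor < 1 zooms out, within the allowed range."""
    return replace(vp,
                   downsample=_clamp(vp.downsample / factor, 1.0, max_downsample))

# Start zoomed out on a hypothetical 100,000 x 80,000 slide, then zoom in 4x.
vp = Viewport(cx=50_000, cy=40_000, downsample=64.0)
print(vp.region())                    # (33616, 23616, 32768, 32768)
print(zoom(vp, 4, 128).region())      # (45904, 35904, 8192, 8192)
```

Expressing pan offsets as fractions of the current field of view means the same tool call moves a sensible distance at any magnification, which is roughly how moving a stage under a microscope feels.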
This led to some interesting behaviors and, with some prompt engineering, seemed to yield fairly good results:
- GPT-5: explored up to ~30 regions before deciding (concurred with an expert pathologist on 4 out of 6 cancer subtyping tasks and 3 out of 5 IHC scoring tasks)
- Claude 4.5: typically used 10–15 views, with accuracy similar to GPT-5 (concurred with the pathologist on 3 out of 6 cancer subtyping tasks and 4 out of 5 IHC scoring tasks)
- Smaller models (GPT-4o, Claude 3.5 Haiku): examined ~8 frames and were less accurate overall (concurred on 1 out of 6 cancer subtyping tasks and 1 out of 5 IHC scoring tasks)
Obviously, this was a small sample, so we are working on a larger benchmark suite with more cases and task types. Still, I thought it was cool that this worked at all, so I wanted to share it with HN!
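For a quick side-by-side, the agreement counts quoted above can be pooled per model. This only re-adds the numbers already reported; pooling subtyping and IHC scoring into one figure is a simplification, and with 11 tasks per model the differences should be read loosely.

```python
# (agreed, total) per task type, copied from the results listed above.
results = {
    "GPT-5": [(4, 6), (3, 5)],
    "Claude 4.5": [(3, 6), (4, 5)],
    "Smaller models": [(1, 6), (1, 5)],
}

def pooled(tasks):
    """Sum agreements and totals across task types."""
    agreed = sum(a for a, _ in tasks)
    total = sum(n for _, n in tasks)
    return agreed, total

for model, tasks in results.items():
    agreed, total = pooled(tasks)
    print(f"{model}: {agreed}/{total} ({agreed / total:.0%})")
# GPT-5: 7/11 (64%)
# Claude 4.5: 7/11 (64%)
# Smaller models: 2/11 (18%)
```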
First, your business model isn't really clear, as what you've described so far sounds more like a research project than a go-to-market premise. Computational pathology is a crowded market, and the main players all have two things in common: access to huge numbers of labeled whole-slide images, and workflows designed to handle such images. Without the former, your project sounds like a non-starter, and given the latter, the idea you've pitched doesn't seem like an advantage. Notably, some of the existing models even have open weights (e.g. Prov-GigaPath, CTransPath).
Second, you've talked about using this approach to make diagnoses, but it's not clear exactly how this would be pitched as a market solution. The range of possible diagnoses is almost unlimited, so a useful model would need training data for everything (not possible). My understanding is that foundation models solve this problem by focusing on one or a few diagnoses in a restricted scope, e.g. prostate cancer in prostate core biopsies. The other approach is to screen for normal in clearly-defined settings, e.g. Pap smears, so that anything that isn't "normal" is flagged for manual review. Either approach, as you can see, demands a very different training and market positioning strategy.
Finally, do you have pathologists advising you, and have you done any sort of market analysis? Unless you're already a pathologist (and probably even if you were), I suspect that having both would be of immense value in deciding a go-forward plan.
All the best!