
Ask HN: What medical datasets do you need? - danicgross
We recently announced YC AI (https://blog.ycombinator.com/yc-ai/). This is only the first step. Our long term goal is to democratize AI development. We want to make it easier for startups to compete with the big companies.

One thing large companies have is data. We're experimenting with ways to allow startups to get similar assets, and we're starting with medical data.

If you're working on AI and need medical data, please help us by filling out this form: https://goo.gl/Dr9FzB.
======
throwaway4103
Please oh please, Chronic Fatigue Syndrome. The potential for ML is absolutely
huge. Some people have already collected or are working to collect genetic and
other data on a large scale [1] [2] [3], so it does exist.

CFS is interesting because:

a) Patients' symptoms appear to fluctuate "randomly" but are actually
typically a complex function of genetics, blood markers, exercise, diet,
medication and other factors.

b) There is considerable low-hanging fruit for pattern recognition, since
despite the prevalence of the disease almost nobody has done serious ML work
in this space.

c) Huge market opportunity - prevalence is comparable to HIV, and specialists
often cite CFS as causing more disability [4] [5].

[1] [http://simmaronresearch.com/](http://simmaronresearch.com/)

[2] [http://www.nova.edu/nim/research/mecfs-
genes.html](http://www.nova.edu/nim/research/mecfs-genes.html)

[3]
[https://med.stanford.edu/chronicfatiguesyndrome.html](https://med.stanford.edu/chronicfatiguesyndrome.html)

[4] [https://consults.blogs.nytimes.com/2009/10/15/readers-
ask-a-...](https://consults.blogs.nytimes.com/2009/10/15/readers-ask-a-virus-
linked-to-chronic-fatigue-syndrome/?_r=0)

[5] Dr. Daniel Peterson (Introduction to Research and Clinical Conference,
Fort Lauderdale, Florida, October 1994; published in JCFS 1995:1:3-4:123-125)

~~~
rosegold
For many autoimmune-related conditions patients’ symptoms also appear to
fluctuate randomly, with symptoms such as pain and fatigue coming seemingly
out of the blue. This includes chronic diseases such as Lupus, Rheumatoid
Arthritis, Fibromyalgia and a long tail of other conditions [1].

People with these lifelong illnesses typically experience a roller-coaster of
recurring symptom flare-ups, wreaking havoc with their lives. Yet there are
patterns to the flare-ups. This is an opportunity to make a big difference for
millions of people [2].

[1] [https://www.aarda.org/disease-list/](https://www.aarda.org/disease-list/)

[2] [https://www.aarda.org/autoimmune-information/autoimmune-
stat...](https://www.aarda.org/autoimmune-information/autoimmune-statistics/)

~~~
mclide
The key to making sense of the data for these diseases is a record of patients’
symptoms. Assembling useful datasets is not only a question about access, but
also about resolving human factors to successfully collect the essential
information from patients.

A major challenge is to get a large number of patients to continuously track
their symptoms. Most want to know what’s in it for them. It takes substantial
incentives for people to regularly report outcomes and use wearables for data
collection. Until we can make the marginal cost hit zero, they need to benefit
from their efforts and investment, preferably instantly.

------
Entangled
Dermatology, eye conditions, blood cells, tissue, viruses, urine, saliva,
everything that can allow an app to give you a first diagnosis before heading
to the doctor.

I foresee in less than ten years we will have a doctor in our pockets. No, it
won't cure us and it won't replace a doctor, but it will give us all the
information we need to have a 99% certainty of our condition.

\--

Second batch for animals and their conditions.

Third batch, agriculture. Take a pic of a plant and tell me all the info,
fertilizers, cultivation, etc, bonus for pest id and treatment.

Pocket computers should be able to diagnose every living creature.

~~~
surgeryres
One potential problem with this - the question of liability, and who is
responsible for diagnostic accuracy? In this case, for some "Lab on a Chip"
device providing a patient directly with diagnostic information without the
vetting of a human doctor, liability would sit with the company.

IBM's Watson at MD Anderson Cancer center did not work out real well for them.
In other words, using AI in the realm of medical diagnostics is very
difficult.

~~~
mitchellst
What about the treatment side? Once you have a diagnosis, could we use AI to
review the patient's medical record, compare outcomes of past patients with
the same diagnosis and similar histories, and suggest adjustments of
personalized treatments to optimize outcomes?

Overall, of course, you're right. Liability is the problem with my suggestion.
Doctors prescribe to treat, they also prescribe to meet the legally mandated
standard of care and minimize second-guessing later. Looking at each patient
as a unique snowflake-- or at least, part of a thinner-sliced group-- helps
with the first, but directly undercuts the second goal. Such an approach would
probably need to originate outside the U.S.

~~~
surgeryres
Fair points.

Extracting data from the EMR is very difficult because all EMR was originally
intended to only be a storage place for data - not designed to output data
back to a user.

------
abetusk
FYI, as far as I know, the Harvard Personal Genome Project is one of the only
publicly available resources that has whole genome (and other) data along with
health record information available for free use (CC0 licensed) [1]. Open
Humans [2] and OpenSNP [3] have data along with various degrees of health
record and phenotype information as well.

[1] [http://www.personalgenomes.org/](http://www.personalgenomes.org/)

[2] [https://www.openhumans.org/](https://www.openhumans.org/)

[3] [https://opensnp.org/](https://opensnp.org/)

~~~
gotthemwmds
[http://ghdx.healthdata.org/ihme_data](http://ghdx.healthdata.org/ihme_data)
too

------
siculars
Any tagged data sets like CCD's with SNOMED/LOINC encoding. Basically anything
that is serialized in HL7/FHIR for a large enough population longitudinally.
It's the time oriented set of population data for a region, like a major
health center over a period of five to ten years or better.

~~~
jrowley
Yes this ^^

Sources like MIMIC are certainly interesting and valuable but it'd be great to
get longitudinal records, spanning years of coverage.

[https://mimic.physionet.org/](https://mimic.physionet.org/)

------
ipunchghosts
IBS data. Since DoD threw money at this problem after the Iraq war, we've
discovered that IBS occurs in about 1 in 10 who have had food poisoning. This
is the biggest advancement made in the field in decades. We are close to
putting this to bed but just need more data.

------
olegkikin
Costs of all the procedures for each hospital. Whatever people get charged.

~~~
pdog
Medicare provider utilization and payment data (which is a large percentage of
the total market in the United States) has been publicly available[1] for
several years now. The _Wall Street Journal_ won a Pulitzer Prize[2] for the
analysis they did of the public data sets.

[1]: [http://graphics.wsj.com/medicare-
billing/](http://graphics.wsj.com/medicare-billing/)

[2]: [https://www.wsj.com/articles/wsj-new-york-times-win-
pulitzer...](https://www.wsj.com/articles/wsj-new-york-times-win-
pulitzers-1429557628)

~~~
olegkikin
That's medicare only, unfortunately. I have that dataset.

------
TuringNYC
I won't comment on what, but on _how_:

\- If the datasets are imaging, there should be enough per class for typical
ML techniques. Otherwise you just get people over-fitting models on sets of
500 images and the illusion of progress.

\- I'm quite happy with the Kaggle datasets generally, but why do others make
consuming data so difficult? Heck, if we've already received the data, let's
just take it the last mile and make it consumable with obvious labeling,
standard formats, etc. This is such a pet peeve of mine that -- if you need
help taking datasets to the last mile -- I'm volunteering, ask me to help make
it presentable. Ideally it should be pull-able via curl/etc, unzippable and be
able to get into a pipeline w/o manual effort.
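
The "enough per class" point can be checked before any training starts. A minimal sketch, assuming the labels are already loaded as a plain list; the function name, the class names and the `min_per_class` threshold are all hypothetical and should be set per problem:

```python
from collections import Counter


def underrepresented_classes(labels, min_per_class=1000):
    """Return {class: count} for classes with fewer than min_per_class examples."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_per_class}


# Hypothetical labels for a small imaging set of 500 images.
labels = ["benign"] * 450 + ["malignant"] * 50
print(underrepresented_classes(labels, min_per_class=100))
```

Running a check like this on every dataset release makes the "500 images" problem visible before anyone reports results on it.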

~~~
mentalhealth
Re imaging, throw away the community hospital crap that IBM's been peddling.
We need quality imaging, with diagnostics, with followup data, from major
quaternary care research centers.

------
kfor
For those interested in global health, we've tried to collate as much data as
possible at [http://ghdx.healthdata.org/](http://ghdx.healthdata.org/)
(disclosure: I'm the director of data science at IHME, which hosts this).

Note that most of this data is population level epidemiologic and
administrative stuff, not the detailed biomedical measurements I see most
people requesting - but I promise you there's some really interesting things
that can be done with it nonetheless!

------
ska
An awful lot of medical data is complex.

Here is what you really want: Large amounts of curated/quality controlled data
with ground truth that you can aggregate & share. Preferably with multiple
studies and time points and/or followup. That is stated in rough order of
difficulty to acquire.

Here is what you typically get fed into a learning pipeline: 1-2 orders of
magnitude too small, with all kinds of noise, and no truth data (i.e. at best a
bad proxy).

Hand-waving about unsupervised learning won't solve many of the really
difficult problems (although it has uses, obviously). Neither will hand-waving
about transfer learning. In some areas most retrospective data sets will never
be really available because of consenting issues. QA is hard - the sheer
variability of clinical systems in the field, not to mention protocol and
practice differences, is often astonishing.

So where does that leave us? To make a real dent fast I suspect you need to
focus on data availability, not problem. Ask the question:

What are the fastest path(s) to collecting large volumes of clinically
representative data with some QA in place, consented for the ways we want to
use it, and with real clinical truth or a decent proxy we can get at in an
automated or semi-automated fashion? 1000 Bonus points if real outcome data
will be available in future.

------
sperant
I'm the cofounder of a startup building a new EHR to help solve this problem
(we just applied to YCS17).

We will use NLP and AI to provide structured data from unstructured medical
data (encounter notes, etc...) stored in the EHR for both analysis and
integration. For example, one of our partners right now wants to integrate
directly into our EHR in order to run computer vision algorithms on top of
uploaded eye exam images in order to help diagnose eye diseases. We give them
access to the eye image and other patient data, including the encounter,
diagnoses, etc. After they have trained their algorithms, we then allow them
to hook directly into the encounter workflow to send alerts live to the
doctors during the appointment. We want to be a platform to help other
startups and researchers connect with medical data both for analysis and also
to help make a meaningful impact directly to doctors' workflows and patient
care.

We would love to help out and/or learn about any use-cases that others might
have requiring medical data. If you would like medical data or would want to
integrate directly into a doctors' workflow in their EHRs based on NLP/AI
hooks, we would love to hear from you. You can reach out to me directly at
ginn@stanford.edu

------
PostOnce
Anonymized patient records, preferably with information about the doctor
performing the diagnosis as well. I've only been able to find small datasets
of some tens of thousands of records, I would like tens of millions. You can't
learn much from what amounts to one small town's medical records, in terms of
finding accurate diagnoses, or identifying places and situations that result in
better doctors.

~~~
omginternets
>Anonymized patient records, preferably with information about the doctor
performing the diagnosis as well.

"Anonymized" is more like it ...

------
snovv_crash
Health and doctor visit information which has been cross-correlated with food
purchases and exercise type and frequency.

Right now we have no way of determining which interactions lead to which
conditions, so we generalise based on the 3 inputs independently, when in
reality it is perfectly normal to eat more when doing lots of exercise, or
need doctor visits when doing exercise with inadequate nutrition.

~~~
pbnjay
We are working on this from another angle - sequencing plants to map out
nutrient biosynthesis pathways. Then determining how those nutrients affect
human health.

With that info we can start doing "personalized nutrition" such as (totally
made up): "If you have diabetes, then you should eat more broccoli and
radishes because they have nutrients that mediate sugar uptake"

------
ransom1538
OP: You need more doctor data.

Given you have surgeon [x] what are odds of a successful surgery with [x].
_THIS_ is the guarded secret -- yet the most valuable.

If you have medical data (or want to be a cofounder) please email me:
ransom1538 at gmail.com. A prototype that tries to answer this very question:
[https://www.opendoctor.io](https://www.opendoctor.io)

~~~
mikecsh
In the UK, under the NHS, summary data regarding a surgeon's performance is
published [1].

There is an interesting debate about whether this is a good thing or not. One
argument is that it improves transparency and allows patients to make a
better, more informed choice of who operates on them.

The counterargument is that most patients don't understand that there is an
element of probability distribution involved. Perhaps more importantly, if the
thought process of a surgeon changes from "performing surgeries to the best of
my abilities and knowing I will lose my job if I am dangerous" to "all my
results are published for public scrutiny so I need to have a survival rate as
close to 100% as possible as that is all the public comprehend", it may lead to
surgeons only being willing to take on cases which are very likely to be
successful, as taking a difficult or last-chance case will have a high
probability of mortality and therefore will affect their numbers. This would
be a loss for many people. I don't know if any research has been done to
determine whether that has been borne out or not though.

[1] [https://www.nhs.uk/service-
search/Consultants/performanceind...](https://www.nhs.uk/service-
search/Consultants/performanceindicators?resultsViewId=1018)
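
To make the "element of probability distribution" concrete: a published survival rate from a small number of operations carries a wide uncertainty band. A minimal sketch using the Wilson score interval for a binomial proportion, assuming each operation is an independent trial with the same success probability (which ignores case difficulty, the very thing the counterargument is about). The counts below are made up for illustration and are not NHS figures:

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical surgeon: 48 survivors out of 50 operations (96% observed).
low, high = wilson_interval(48, 50)
print(f"observed 0.960, plausible range {low:.3f} to {high:.3f}")
```

Two surgeons with the same observed rate but very different case counts would get very different intervals, which a single published percentage hides.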

------
llccbb
Blood glucose levels as time series from continuous glucose monitors with
additional tagged data like food intake, exercise, and sleep. Each record
needs the obvious human-data like sex, age, weight, nationality/ethnicity,
type 1 or type 2 diabetes.
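
One way such a record could be laid out, as a rough sketch only: every class and field name below is hypothetical, and the units (mg/dL, kg) are assumptions rather than anything a real CGM export guarantees.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class GlucoseReading:
    timestamp: datetime
    mg_per_dl: float  # assumed unit


@dataclass
class TaggedEvent:
    timestamp: datetime
    kind: str  # e.g. "food", "exercise", "sleep"
    note: str = ""


@dataclass
class CgmRecord:
    sex: str
    age_years: int
    weight_kg: float
    ethnicity: Optional[str]
    diabetes_type: int  # 1 or 2
    readings: List[GlucoseReading] = field(default_factory=list)
    events: List[TaggedEvent] = field(default_factory=list)

    def mean_glucose(self) -> float:
        """Average of all readings; raises if the series is empty."""
        if not self.readings:
            raise ValueError("no readings in record")
        return sum(r.mg_per_dl for r in self.readings) / len(self.readings)


record = CgmRecord(sex="F", age_years=34, weight_kg=62.0, ethnicity=None, diabetes_type=1)
record.readings.append(GlucoseReading(datetime(2017, 4, 1, 8, 0), 100.0))
record.readings.append(GlucoseReading(datetime(2017, 4, 1, 8, 5), 140.0))
record.events.append(TaggedEvent(datetime(2017, 4, 1, 7, 55), "food", "breakfast"))
print(record.mean_glucose())
```

Keeping the tagged events as their own time-stamped list, rather than columns on each reading, lets food, exercise and sleep be logged at whatever cadence the patient manages.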

------
cmdrfred
This is a big ask and I'm not working in AI but I'd like a comprehensive as
possible list of treatments offered at facilities with pricing. I'd build a
website that looks up the treatment you require and compares the estimated
cost of travel to each location that offers it (keeping in mind exchange
rates) to find the lowest total price. This data should be global to be as
effective as possible.

It could even offer suggestions like "Spend $200 more and recover on an island
paradise!"

If globalism is good for low wage workers it certainly should be good for the
medical profession.

------
tathougies
Hormone problems are extremely difficult to tease out.

In my opinion, large datasets testing wide spectrums of hormones in a large
population, tagged with any diagnosed endocrinological condition would be
extremely valuable. I bet with this information, we could learn a lot without
conducting actual physical studies, by simply sectioning the data
appropriately.

I'm not a doctor though, so I don't know exactly what would need to be
recorded, but having dealt with bizarre endocrine disorders that doctors don't
really have any answers to, my gut feeling is that such a data set would be
incredibly useful.

------
surgeryres
Trauma is the leading cause of mortality for people under 40 years old in the
US, however it is very poorly funded in terms of research dollars compared to
things like cancer, HIV etc.

Datasets are limited and expansion with AI would be huge.

One specific application - determining cost effectiveness of placing
tourniquets in public places - much like the idea of having defibrillators at
the mall. And funding community training, see the "Stop the Bleed" campaign.

------
TurlochOTierney
Anything I can drill into on bipolar. Treatment, outcome and quality of life.
I came across this [https://blog.23andme.com/23andme-research/what-patients-
say-...](https://blog.23andme.com/23andme-research/what-patients-say-works-
for-bipolar-disorder/) in 2013. Most studies are qualitative not quantitative
and the data is not released.

~~~
JohnnyConatus
+1

------
snowpanda
Lyme disease frequency, given that the CDC grossly underestimated it (and
admitted to that). [1]

And frequency by state, especially in the Western States where it is under-
diagnosed.

[1] [http://www.cbsnews.com/news/cdc-lyme-disease-
rates-10-times-...](http://www.cbsnews.com/news/cdc-lyme-disease-
rates-10-times-higher-than-previously-reported/)

~~~
pragone
This is an excellent idea! Lyme disease has more uncertainty about it than
most people realize.

------
angersock
So, the last startup I was a fulltime engineer at actually worked in this
area.

What I would suggest to be maximally useful would be to focus on physiological
data: EKG/ECG, EMG, glucose, SPO2, maybe various blood work counts.

All of those are data that are both well-understood and are thrown away
regularly, and that if fed into a computer with modern ML methods we could
maybe see some really cool stuff.

I'd suggest staying away from unstructured data and things that are primarily
of interest to only the business side of healthcare--insurance figures,
billing codes, EMR/EHR shit.

If you _really_ wanted to get in there, putting up a minimal and standardized
format for representing labs and medications would go a looooong way.

~

The problem in healthcare isn't the medical stuff--it's that people get bogged
down in the inefficiencies of the system and zoom off solving problems that
are removed from the immediate task of "what the fuck is wrong with this
patient from the instruments I have at hand?"

~~~
pragone
> I'd suggest staying away from unstructured data and things that are
> primarily of interest to only the business side of healthcare--insurance
> figures, billing codes, EMR/EHR shit.

I would argue these are the most important areas to target. We need tremendous
reform in this area, and if we can demonstrate meaningful improvement over
what we have now, maybe we can let doctors get back to actually treating
patients and not spending the majority of their time checking boxes on a
computer.

------
leovander
SNOMED, LOINC, CPT, ICD9, ICD10, Gender, Race and Ethnicity codes. On top of
that, getting all the CCD section specific OID's.

------
awjr
Did some investigation into prescription data however prescription data is
usually aggregated at surgery level. Also the reason for prescribing the drug
(even at high general level) is not recorded. If prescription data was
available at LSOA level
([http://www.datadictionary.nhs.uk/data_dictionary/nhs_busines...](http://www.datadictionary.nhs.uk/data_dictionary/nhs_business_definitions/l/lower_layer_super_output_area_de.asp?shownav=1))
then you would be able to study epidemiology and potentially identify
urban/rural areas where certain diseases are prevalent.

------
ljw1001
Combined phenotype/genotype datasets. These are (with some good reason) very
difficult for anyone outside the medical-research establishment to get access
to, but the net result is that it creates market barriers supporting the
existing big players.

~~~
davecap1
These are quite difficult to get (and very expensive to produce) even inside
the medical-research establishment. Which big players are you referring to?

~~~
ljw1001
Large research universities and hospitals, Pharma, and, of course, the Broad.

I understand pharma not sharing but much of the hospital/broad/university data
is produced with (at least some) public money.

------
jszymborski
(Breast) Cancer biopsies, with histology and outcome reports.

While it isn't my research project, I've been trying to use computer vision
and some naive AI to identify early breast cancer lesions in images from mouse
tissue with mixed success, but it's something that can be very much
accelerated with a large human dataset with outcomes.

(If you work in the field and want to help/hire me with/for something like
this, kindly send a message to hn AT naj-p.com)

There are understandably some ethical guidelines that need to be worked out for
this sort of thing, but seeing as there are public repositories of
not-so-dissimilar information (e.g. mammograms), it should be workable.

~~~
acveilleux
You're probably aware, but CAD is a staple of modern mammo interpretation
workflows. Products like Hologic ImageChecker CAD.

~~~
jszymborski
Yup :) though I'm more interested in biopsy information, because biopsies give a
better understanding of cellular architecture, and if they're stained against
markers, the molecular biology of the cancer.

Mammogram analysis is an essential first-line, but I think doctors need better
insight in treatment options and finer stratification. An early Atypical
Ductal Hyperplasia, for example, is usually treated as pre-pre-cancer, but we
might be able to identify a subtype of these lesions that progress on to more
aggressive stages.

------
jnordwick
STDs by congress person

------
amelius
I'm wondering how an automated diagnosis could work in practice.

The data probably contains a number of symptoms or measurements (bloodwork),
and a diagnosis by a doctor.

I can see how you can train a deep-learning model for that.

What if the patient is prescribed medication. Is the condition of the patient
over time (after giving the medication) tracked by doctors?

Personally, I have found that once a doctor prescribes me some medication, he
never asks me how things are going (except maybe once). So how accurate can
the data be?

------
StClaire
Images. Brain scans. Mammograms. Eye scans.

Patient history would help too. (I know there's HIPAA to comply with, but as
much as we can get can help train better classifiers.)

~~~
neuromantik8086
I would check out [https://github.com/caesar0301/awesome-public-
datasets#neuros...](https://github.com/caesar0301/awesome-public-
datasets#neuroscience) for brain scans.

Also, anything from here:
[http://www.nature.com/neuro/journal/v17/n11/fig_tab/nn.3818_...](http://www.nature.com/neuro/journal/v17/n11/fig_tab/nn.3818_T1.html)

And the following: [http://www.ukbiobank.ac.uk/imaging-
data/](http://www.ukbiobank.ac.uk/imaging-data/)
[http://nmr.mgh.harvard.edu/lab/harvardagingbrain](http://nmr.mgh.harvard.edu/lab/harvardagingbrain)
[http://www.einstein.yu.edu/departments/neurology/clinical-
re...](http://www.einstein.yu.edu/departments/neurology/clinical-research-
program/eas/data-sharing.aspx)

------
tmaly
I really think having price transparency across providers for both medical
treatments and for medicine would be a game changer for the industry.

------
rafinha
I'm not sure such a thing exists: "large companies with lots of medical data".
Medical data is often confidential and belongs to hospitals.

~~~
dikdik
I used to work for a medical lab and worked on a lot of projects that involved
aggregating and cleaning medical data to sell. Often it would just go to
pharma companies so they could target the best places to sell their drugs.

Anyway, it's completely legal. You just have to scrub the data pretty
thoroughly before you sell it.

------
deepnotderp
Drug molecule datasets would be an absolute boon.

------
gregfjohnson
I work in respiratory therapy. Would like real-time ventilator telemetry data:
volume, flow, pressure, SpO2. Alarms. Setting changes to medical devices
(ventilator specifically). Condition requiring ventilation (ARDS, COPD,
premature birth, etc.) Clinical assessment of patient outcome.

------
donquichotte
I'm neither working on AI nor a medical expert, but it would be nice to have a
dataset with pictures of melanoma and whether they are cancerous or not, to
build an app similar to [https://skinvision.com/](https://skinvision.com/).

------
kumarski
We have over a billion data points at [http://semantic.md](http://semantic.md)
with high value context that we use to service companies in the space.

Would be exciting/somewhat disruptive if YC democratized access to it.

------
ipunchghosts
EKG data. I've never had an EKG done where the computer was even close to
predicting correctly what was going on. As someone who does DSP, this is not
that difficult of a problem for an RNN and lots of data.

------
jonjlee
Decoupling medical notes from billing would relieve a huge burden from the
modern practice of medicine. I would like to have a robust set of clinic notes
with the corresponding outgoing billing documents.

------
oomkiller
Structured longitudinal patient data (diagnosis and procedure codes, lab data,
step data from fitbits, etc), but with AI unstructured may become more useful
as well. This is probably an opportunity in itself.

~~~
technics256
This. Inpatient daily notes with charge codes/ICD10.

------
Odenwaelder
For my work, I need information on public health in developing countries,
especially in Africa. There's a lot of information from WHO, but it's not
properly machine readable.

------
getAidlab
At [https://www.aidlab.com](https://www.aidlab.com) we use PhysioNet for our
filtration research, but an additional database would be lovely!

------
zitterbewegung
A comprehensive listing of foods and the allergies that are associated with
them (a listing of food ingredients with tags like peanuts / shellfish etc...
).

------
sheraz
Spirometry data for people with or without respiratory diseases / conditions.
For example, data from healthy males / females from age 5 to 95.

Then those with Diseases or conditions.

------
rajvansia
Vital signs data during surgical cases with anesthesia data

------
id122015
A dataset with all the doctors in the world. And their ranks if possible. And
what they worked before being a doctor - paid feature.

------
leecarraher
Why start with medical data? With HIPAA it seems (rightfully so) to contain
some of the most heavily guarded types of data out there.

------
technics256
Hospitalist/intake physician hospital notes matched to ICD10 codes. Extremely
useful and difficult to find.

------
JusticeJuice
Illness rates by region to build a consumer facing 'google trends' for health.

~~~
gotthemwmds
[https://vizhub.healthdata.org/gbd-
compare/](https://vizhub.healthdata.org/gbd-compare/) enjoy

------
idclip
Viruses for sure to curb the next epidemic - and put an end to hiv, hpv and
the flu.

------
ipunchghosts
Easy, picnichealth.com database. Curated medical database. This thing is a
gold mine.

------
carbocation
ECG data at the signal level.

~~~
jrowley
There is some ECG data available in PhysioNet:

[https://www.physionet.org/search-
results.shtml?q=ECG&sa=Sear...](https://www.physionet.org/search-
results.shtml?q=ECG&sa=Search)

~~~
carbocation
This is helpful - thank you!

------
ryptophan
Dermatology is complicated. A labeled image dataset of skin conditions?

------
merqurio
Tagged medical notes from Medical Records. The same way there is ImageNet.

------
socmag
Sounds awesome.

Anything geospatial would be superb. Disease transmission for example.

------
maxxxxx
How about pricing data?

~~~
jrowley
Pricing such as total cost of care, at a Covered California region level, is
available here: [http://costatlas.iha.org/](http://costatlas.iha.org/)

(I built this tool, so please be nice ;)

~~~
JohnnyConatus
Do you mean hospital chargemaster data?

~~~
jrowley
This data doesn't come from Chargemasters - it comes from claims data,
aggregated and averaged over all members, so it is real data showing what the
average cost per member is over a year.
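
As a rough illustration of that kind of aggregation (not the actual pipeline behind the tool): sum paid claims per region and divide by all enrolled members, including those with no claims. The tuple layout, the region names and the amounts are hypothetical.

```python
from collections import defaultdict


def average_cost_per_member(claims, member_counts):
    """Average yearly cost per enrolled member, by region.

    claims: iterable of (region, member_id, paid_amount) for one year.
    member_counts: {region: number of enrolled members}, claimants or not.
    """
    totals = defaultdict(float)
    for region, _member_id, paid in claims:
        totals[region] += paid
    return {
        region: totals[region] / n
        for region, n in member_counts.items()
        if n > 0
    }


# Made-up claims: region A has three members, one of whom filed nothing.
claims = [
    ("A", "m1", 100.0),
    ("A", "m1", 50.0),
    ("A", "m2", 150.0),
    ("B", "m3", 400.0),
]
member_counts = {"A": 3, "B": 2}
print(average_cost_per_member(claims, member_counts))
```

Dividing by enrollment rather than by the number of claimants is what separates a per-member figure from an average charge per procedure.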

------
farhanhubble
Blood test reports of all kinds, images of smears.

