

$3 million machine learning prize - timf
http://www.heritagehealthprize.com/competition.php

======
Rhapso
So, essentially, this is a contest to make a way to predict who is most at
risk for going back to the hospital.

While this sounds nice, there are some issues.

1\. How can this do anything but hurt people? Medical professionals already do
all they can to keep people from returning to the hospital, explaining to
patients what they should be doing in a medical sense. The only real use is to
deny insurance or increase rates on "high risk" people.

2\. Should they implement the winning solution, then act on it by sending
additional "how to be healthy" propaganda or otherwise attempting to prevent
those people from returning, the pattern of behavior will change accordingly,
thus likely breaking the predictive capability.

This is not like the Netflix "present better suggestions" problem. This does
not need to be that fast, efficient, nor as creative. Just having a large set
of statistics taken from the dataset (which seems rather small) and making a
large Bayesian Network to crunch out the probability of needing medical care
in a given time frame seems to be the best solution to the problem.
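
A minimal sketch of the "crunch out the probability" idea, assuming a naive Bayes model, which is the simplest kind of Bayesian network (every feature is treated as independent given the outcome). The records, feature names and values are hypothetical toy data, not anything from the contest dataset.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (patient features, was readmitted?)
records = [
    ({"age": "65+", "chronic": "yes"}, True),
    ({"age": "65+", "chronic": "yes"}, True),
    ({"age": "65+", "chronic": "no"}, False),
    ({"age": "<65", "chronic": "yes"}, True),
    ({"age": "<65", "chronic": "no"}, False),
    ({"age": "<65", "chronic": "no"}, False),
]


def fit(records):
    """Collect the counts a naive Bayes model needs."""
    class_counts = Counter(label for _, label in records)
    feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
    values = defaultdict(set)              # feature -> values seen in training
    for features, label in records:
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1
            values[name].add(value)
    return class_counts, feature_counts, values


def predict_proba(model, features):
    """Return the estimated probability of readmission for one patient."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(label)
        for name, value in features.items():
            count = feature_counts[(label, name)][value]
            # Add-one smoothing so unseen combinations do not zero out the score
            score *= (count + 1) / (n + len(values[name]))
        scores[label] = score
    return scores[True] / sum(scores.values())


model = fit(records)
high_risk = predict_proba(model, {"age": "65+", "chronic": "yes"})
low_risk = predict_proba(model, {"age": "<65", "chronic": "no"})
```

The independence assumption is what keeps this cheap on a small dataset; it is also why such a model cannot capture effects that only show up when two variables are considered together.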

I am interested in seeing other views on these points. Heavens, I might
learn something about a field I am a dilettante in from a master.
(Ironically, this is more the goal than being "right" is.)

~~~
alextp
You should read the New Yorker story
[http://www.newyorker.com/reporting/2011/01/24/110124fa_fact_...](http://www.newyorker.com/reporting/2011/01/24/110124fa_fact_gawande?currentPage=all)
. It answers your questions, mostly. For (1), if the health insurer is forced
to treat those patients and acknowledges who they are they can spend a bit of
money on preventive and follow-up care and save a lot of money on
hospitalization, surgeries, etc. (2) This is true, but if the algorithm is
retrainable (and it should be, as it's machine learning) there's the
possibility that all you have to do is a bit of domain adaptation to keep
things going; if this doesn't work, another contest 5 years from now will
probably pay for itself.

The problem with your proposed solution is precisely that there seem to be
far too few data points and far too many variables. Not only that, but I
expect most of the information to be in the interactions between variables and
clever features that cover that. Most ways of learning Bayesian networks don't
work very well when you have to model interactions. I'd bet on the usual
winning approaches for this sort of thing, which are clever boosting, matrix
decomposition, and random forests, all of which can model interactions and
somewhat deal with incomplete data.
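
A toy illustration of the point about interactions, assuming a hypothetical XOR-style dataset in which the outcome occurs only when exactly one of two risk factors is present. It is not a claim about the real competition data; it only shows that per-variable statistics can carry no signal while the joint view carries all of it.

```python
from collections import defaultdict
from itertools import product

# Hypothetical data: 25 patients for each combination of two binary factors.
# The outcome y is 1 only when exactly one factor is present (a XOR b).
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2) for _ in range(25)]


def rate_by(rows, key):
    """Average outcome within each group defined by key(a, b)."""
    groups = defaultdict(list)
    for a, b, y in rows:
        groups[key(a, b)].append(y)
    return {k: sum(v) / len(v) for k, v in groups.items()}


marginal_a = rate_by(rows, lambda a, b: a)   # looks at factor a alone
marginal_b = rate_by(rows, lambda a, b: b)   # looks at factor b alone
joint = rate_by(rows, lambda a, b: (a, b))   # looks at both together
```

Here each factor alone gives a 50% outcome rate in every group, while the joint table separates the outcomes perfectly. Tree-based methods can find this kind of structure because successive splits condition on more than one variable.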

------
mhb
Why this will save money:

 _The Hot Spotters - Can we lower medical costs by giving the neediest
patients better care?_ by Atul Gawande

[http://www.newyorker.com/reporting/2011/01/24/110124fa_fact_...](http://www.newyorker.com/reporting/2011/01/24/110124fa_fact_gawande?currentPage=all)

On HN: <http://news.ycombinator.com/item?id=2154579>

------
bengebre
The benefits of finding these folks are many:

[http://kottke.org/11/01/controlling-healthcare-costs-by-
focu...](http://kottke.org/11/01/controlling-healthcare-costs-by-focusing-on-
the-neediest-patients)

------
mv
"training dataset includes several thousand anonymized patients and will be
made available"

That seems like an awfully small dataset. It also doesn't look like it is
limited to one disease, which would make the search space enormous, especially
if not all the patients had the same labs drawn!

If it were completely standardized data, several thousand might be sufficient
to train, but I think they are looking for something more 'magic' than that.

~~~
sesqu
The dataset sounds so small that I'd expect the winning answer will be
extensively seeded by medical doctors. Diagnostic data is usually very
difficult to approach with AI, and practising doctors have good heuristics.
That suggests, to me, that the best one can hope for is using this dataset to
refine those heuristics.

------
indigoviolet
Oooh. This is just begging for a privacy firestorm when someone de-anonymizes
the data, which I'm guessing won't be super hard given the kind of medical
features they'd need to provide to make this task useful.

~~~
Dilpil
I'm not sure about that. Unlike social network data, there isn't a publicly
available dataset containing the names of all the people in here which could
be used to de-anonymize this data.

~~~
indigoviolet
You're perhaps right. Of course, these people are on things like Facebook,
Twitter and blogs and if you can narrow down age, gender, location and medical
condition, you might be able to correlate with public posts.

You can also look for specific people in there if you have certain kinds of
prior information about them. For example, Arvind Narayanan was able to
de-anonymize some part of the Netflix set [<http://arxiv.org/abs/cs/0610105>].
Maybe that won't translate to this data.
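
A rough sketch of the correlation idea, assuming a simple exact-match join on quasi-identifiers (age, gender, region). All records, names and field names are hypothetical, and this is not the method used in the Netflix work linked above; it only shows how adding attributes shrinks the candidate set until a match becomes unique.

```python
from collections import defaultdict

# Hypothetical "anonymized" medical records
anonymized = [
    {"id": "p1", "age": 34, "gender": "F", "region": "A", "condition": "asthma"},
    {"id": "p2", "age": 34, "gender": "F", "region": "B", "condition": "diabetes"},
    {"id": "p3", "age": 51, "gender": "M", "region": "A", "condition": "asthma"},
]

# Hypothetical public profiles gathered from posts
public = [
    {"name": "Alice", "age": 34, "gender": "F", "region": "A"},
    {"name": "Bob", "age": 51, "gender": "M", "region": "A"},
]


def link(anonymized, public, keys):
    """Map each public name to a record id when exactly one record matches."""
    index = defaultdict(list)
    for rec in anonymized:
        index[tuple(rec[k] for k in keys)].append(rec)
    matches = {}
    for person in public:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # only an unambiguous match counts
            matches[person["name"]] = candidates[0]["id"]
    return matches


coarse = link(anonymized, public, ("age", "gender"))
fine = link(anonymized, public, ("age", "gender", "region"))
```

With only age and gender, Alice matches two records and stays ambiguous; adding region makes her match unique.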

------
tocomment
What makes solving this problem worth three million?

~~~
chaosmachine
_"The winning algorithm will be able to predict patients at risk for an
unplanned hospital admission with a high rate of accuracy."_

Algorithm says no insurance for you.

~~~
dkarl
Yeah, can they anonymize my entry and my prize? I don't think I'd want my
friends and family knowing I helped these guys out.

------
abhaga
Without a legal protection in place which disallows something like this for
deciding the insurance rates, this sounds like something which can get abused.

But I think it is better that it happens in public via an open competition
rather than in a private research group funded by an insurance company. At
least, everyone will immediately know what can be predicted rather than
finding it out through a class action suit years later.

~~~
dantheman
Yeah, because pricing insurance correctly is a bad idea....

------
wladimir
Hm sounds like an interesting challenge, can anyone register for this, or do
you have to be US-based?

------
nazgulnarsil
Sorry, I don't think being born in the west entitles you to millions of
dollars of medical care at others' expense when a million dollars means
hundreds of lives saved.

------
maeon3
Doctors can tell you which patients will be back, the problem is they can't,
because if they do, that will be discrimination which would be grounds for
burning the doctor at the stake. The software which does exactly the same
thing, however, can't be burned at the stake for discrimination because in the
event where the guilty party cries foul, you simply print out the math. It's
genius.

You 1% (repeat sickly offenders) causing 30% of the medical care costs better
get ready to pay your increased share to acquire that care. If it can be
determined that one human would likely need 10 million dollars of medical care
(on account of heavily defective DNA) and another human will likely need only
200 thousand (flawless DNA), the one who is likely to need more should be
paying more.

~~~
JoachimSchipper
Wait, _should_? I didn't pick my parents... (I happen to be perfectly fine,
but that's at least partly good luck.)

~~~
tel
Unfortunately, despite popular myth, not every consequence applied to a person
is a result of their choices and actions. The American Dream might be equality
of people, but the harsh reality is that there's actually quite a lot of
variance.

So we get to answer the really interesting question of exactly how much do we,
as a country, want to spend to support a dream against reality?

