Here's one; if no one does it and it's feasible, some day I might. I'd like to implement a large, continual self-report study via email for the public good. Through an email every day or so, data is slowly gathered from participants--all anonymized in a number of ways. The emails would pose questions proposed by researchers, or collect lifestyle data to look for correlations between product use, habits, physical traits, and disease incidence. For example, looking back in time, thalidomide use and birth defects would be one type of correlation you might hope to pick up; more recently, certain shampoos (with ingredient X) might correlate with infertility. Yes, you would have to be thorough and careful about the anonymization, but I think the benefits could be there.
Correlation is not causation, though, and the nature of correlating a large number of factors dictates that you would get many false alarms. However, like a metal detector, the clues for new things to look at would be worth the false alarms. The idea of all this would be to give researchers new targets and areas of research, all free and open.
I like it. Maybe also standardize the way input is phrased, so as to make it somewhat machine-understandable?
Another idea to go along with this that I've thought would be cool is a Wikipedia-like system where people state their information and arguments in a machine-understandable format. The machine would then crunch everything and tell people what the contradictions and possible truths were in the overall system of thought, or in particular subsystems.
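A toy sketch (in Python, with made-up propositions A, B, and C) of the simplest version of such a crunching machine, assuming contributed statements could be reduced to plain implications and asserted facts:

```python
# Statements as (premise, conclusion) implications plus asserted facts.
# "not:" prefixes a negated proposition. A contradiction is flagged when
# both P and not:P become derivable from the combined system of thought.
facts = {"A"}
rules = [
    ("A", "B"),        # one contributor: A implies B
    ("B", "not:C"),    # another contributor: B implies not-C
]
claims = {"C"}         # a third contributor asserts C outright

# Forward-chain: keep applying rules until nothing new is derivable.
derived = set(facts) | claims
changed = True
while changed:
    changed = False
    for premise, conclusion in rules:
        if premise in derived and conclusion not in derived:
            derived.add(conclusion)
            changed = True

contradictions = {p for p in derived if "not:" + p in derived}
print(contradictions)  # → {'C'}
```

Real arguments obviously would not reduce this cleanly to propositional implications, but even this crude forward-chaining shows the kind of output the system could give back: which claims clash once everyone's premises are combined.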
Combining these two systems could make a very powerful research tool.
An interesting way to get around the problem of people lying would be to aggregate all the search terms a single person searches and see if there are interesting correlations to other data points. You could write a browser plug-in that feeds all your searches anonymously to a central database, where you could run a gamut of tests.
To your example: when I get a new drug, I look up all the side effects (so I would look up thalidomide). Then I have a child, and it turns out that my child has birth defects. I would look that up to see who else out there has experienced having a child with the same birth defects. As searching continues, it would eventually pop up that people who look up the side effects of thalidomide also look up birth defects. Then someone could do a serious study on it sooner rather than later.
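A rough sketch of how that aggregation might work, using a tiny made-up dataset of anonymized per-user search histories. "Lift" here measures how much more often two terms co-occur than independence would predict (lift well above 1 suggests a lead worth a serious study):

```python
from collections import Counter
from itertools import combinations

# Hypothetical anonymized data: each set is one user's distinct search terms.
user_searches = [
    {"thalidomide side effects", "morning sickness"},
    {"thalidomide side effects", "limb birth defects"},
    {"thalidomide side effects", "limb birth defects"},
    {"flu symptoms", "cold remedies"},
    {"flu symptoms", "headache"},
    {"cold remedies", "headache"},
]

n_users = len(user_searches)
term_counts = Counter(t for terms in user_searches for t in terms)
pair_counts = Counter(
    pair for terms in user_searches for pair in combinations(sorted(terms), 2)
)

def lift(a, b):
    """Observed co-occurrence rate divided by the rate expected by chance."""
    joint = pair_counts[tuple(sorted((a, b)))] / n_users
    return joint / ((term_counts[a] / n_users) * (term_counts[b] / n_users))

print(lift("thalidomide side effects", "limb birth defects"))  # → 2.0
```

At real scale you would need proper significance testing on top of this, but the co-occurrence counting itself is cheap.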
As in all cases using data, the major impediments are anonymity and the scale of data collection.
Thoughts? The correct analogy is not a metal detector but a needle in a haystack. I think you underestimate the number of false leads you would get doing this by several orders of magnitude. If your system generates too many false leads, it will be (to borrow an analogy from Bruce Schneier, whom you should read every week, and who writes about false alarms in the context of airport security) like adding more hay to a haystack with a missing needle.
Impediments? People lie about their habits. People do not accurately remember things throughout the day. How do you deal with missed days? If people miss days for different reasons, how do you deal with those gaps? You have to assume that the days people miss would be different enough from the days they don't miss to invalidate the data of anyone who misses days.
I'm not sure whether you are referring to the same argument, but I think I remember Bruce Schneier arguing that the lack of true positives makes the application of data-mining methods to terrorist incidents useless. He makes the point that very few positives for the system to learn from, combined with a very large number of variables, causes a huge number of false alarms (not sure if I remember all he said correctly).
So it's all about how many confirmed positives you have and whether you have reliable data on them. It works very well for credit card fraud because there is a sufficiently large number of actual fraud cases, you have reliable data about them and false alarms do not cause major disruption.
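Some back-of-the-envelope numbers (entirely hypothetical) showing how a rare condition swamps even a quite accurate detector with false alarms, which is the base-rate problem behind Schneier's argument:

```python
# Hypothetical numbers to illustrate the base-rate problem.
population = 1_000_000
true_cases = 10               # a genuinely rare condition-exposure link
sensitivity = 0.99            # the detector flags 99% of real cases
false_positive_rate = 0.01    # ...and wrongly flags 1% of everyone else

true_alarms = true_cases * sensitivity
false_alarms = (population - true_cases) * false_positive_rate

print(true_alarms)   # ~10 real hits
print(false_alarms)  # ~10,000 false alarms: roughly 1,000 haystacks per needle
```

Credit card fraud escapes this trap because "true_cases" is large and well documented; a rare drug-defect link does not.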
I agree with your concern about the reliability and completeness of the data. I still think it's an interesting idea if there were a way to extract the data from a reliable source instead of working with what people claim to be the case.
LPTS,
You bring up good points. Self-reports are notoriously unreliable (like eyewitness testimony); however, they are still usable data. In terms of missing days, the questions can be reordered and repeated for those who miss. In terms of false correlations, you can reduce these by being careful about how you use and analyze the data set. Multiple regression analysis can be dangerous--it's all in how it is used. If you limit the items you correlate to those for which you have a hypothesis, the data should at least be valuable in that respect.
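A sketch of why limiting yourself to pre-specified hypotheses matters: under a standard Bonferroni correction, the per-test significance threshold shrinks with the number of comparisons, so blind all-pairs fishing leaves almost no statistical power (numbers hypothetical):

```python
# Bonferroni correction: to keep the overall chance of any false alarm
# at alpha, each individual test must clear alpha / n_tests.
alpha = 0.05

def bonferroni_threshold(n_tests):
    return alpha / n_tests

# Ten pre-registered hypotheses: each test needs p < 0.005 -- demanding
# but achievable with a decent sample.
print(bonferroni_threshold(10))

# Correlating everything against everything (say, 100,000 pairs): each
# test needs p < 5e-7, which almost no real effect will clear.
print(bonferroni_threshold(100_000))
```

This is the statistical version of the haystack argument above: the fewer questions you ask of the data, the less hay each answer has to be dug out of.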
Thoughts? Impediments?