Here's one; if no one does it and it's feasible, some day I might. I'd like to implement a large, continual self-report study via email for the public good. Through an email every day or so, data is slowly gathered from participants--all anonymized in a number of ways. The emails would pose questions proposed by researchers, or collect lifestyle data to look for correlations between product use, habits, physical traits, and disease incidence. For example, looking back in time, thalidomide use and birth defects would be one type of correlation you might hope to pick up; more recently, certain shampoos (with ingredient X) might correlate with infertility. Yes, you would have to be thorough and careful about the anonymization, but I think the benefits could be there.
Correlation is not causation, though, and the nature of correlating a large number of factors dictates that you would get many false alarms. However, like a metal detector, the clues for new things to look at would be worth the false alarms. The idea of all this would be to give researchers new targets and areas of research, all free and open.
I like it. Maybe also standardize the way input is phrased, so as to make it somewhat machine-understandable?
Another idea to go along with this that I've thought would be cool is a Wikipedia-like system where people state their information and arguments in a machine-understandable format. The machine would then crunch everything and tell people what the contradictions and possible truths were in the overall system of thought, or in particular subsystems.
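A toy sketch (in Python, with made-up propositions A, B, and C) of the simplest version of such a crunching machine, assuming contributed statements could be reduced to plain implications and asserted facts:

```python
# Statements as (premise, conclusion) implications plus asserted facts.
# "not:" prefixes a negated proposition. A contradiction is flagged when
# both P and not:P become derivable from the combined system of thought.
facts = {"A"}
rules = [
    ("A", "B"),        # one contributor: A implies B
    ("B", "not:C"),    # another contributor: B implies not-C
]
claims = {"C"}         # a third contributor asserts C outright

# Forward-chain: keep applying rules until nothing new is derivable.
derived = set(facts) | claims
changed = True
while changed:
    changed = False
    for premise, conclusion in rules:
        if premise in derived and conclusion not in derived:
            derived.add(conclusion)
            changed = True

contradictions = {p for p in derived if "not:" + p in derived}
print(contradictions)  # → {'C'}
```

Real arguments obviously would not reduce this cleanly to propositional implications, but even this crude forward-chaining shows the kind of output the system could give back: which claims clash once everyone's premises are combined.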
Combining these two systems could make a very powerful research tool.
An interesting way to get around the problem of people lying would be to aggregate all the search terms a single person searches and see if there are interesting correlations to other data points. You could write a browser plug-in that feeds all your searches anonymously to a central database, where you could run a gamut of tests.
To your example: when I get a new drug, I look up all the side effects (so I would look up thalidomide). Then I have a child, and it turns out that my child has birth defects. I would look that up to see who else out there has experienced having a child with the same birth defects. As searching continues, it would eventually pop up that people who look up the side effects of thalidomide also look up birth defects. Then someone could do a serious study on it sooner rather than later.
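A rough sketch of how that aggregation might work, using a tiny made-up dataset of anonymized per-user search histories. "Lift" here measures how much more often two terms co-occur than independence would predict (lift well above 1 suggests a lead worth a serious study):

```python
from collections import Counter
from itertools import combinations

# Hypothetical anonymized data: each set is one user's distinct search terms.
user_searches = [
    {"thalidomide side effects", "morning sickness"},
    {"thalidomide side effects", "limb birth defects"},
    {"thalidomide side effects", "limb birth defects"},
    {"flu symptoms", "cold remedies"},
    {"flu symptoms", "headache"},
    {"cold remedies", "headache"},
]

n_users = len(user_searches)
term_counts = Counter(t for terms in user_searches for t in terms)
pair_counts = Counter(
    pair for terms in user_searches for pair in combinations(sorted(terms), 2)
)

def lift(a, b):
    """Observed co-occurrence rate divided by the rate expected by chance."""
    joint = pair_counts[tuple(sorted((a, b)))] / n_users
    return joint / ((term_counts[a] / n_users) * (term_counts[b] / n_users))

print(lift("thalidomide side effects", "limb birth defects"))  # → 2.0
```

At real scale you would need proper significance testing on top of this, but the co-occurrence counting itself is cheap.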
As in all cases using data, the major impediments are anonymity and the scale of data collection.
Thoughts? The correct analogy is not a metal detector but a needle in a haystack. I think you underestimate the number of false leads you would get doing this by several orders of magnitude. If your system generates too many false leads, it will be (to borrow an analogy from Bruce Schneier, whom you should read every week, and who writes about false alarms in the context of airport security) like adding more hay to a haystack with a missing needle.
Impediments? People lie about their habits. People do not accurately remember things throughout the day. How do you deal with missed days? If people miss days for different reasons, how do you deal with those gaps? You have to assume that the days people miss would be different enough from the days they don't miss to invalidate the data of anyone who misses days.
I'm not sure whether you are referring to the same argument, but I think I remember Bruce Schneier arguing that the lack of true positives makes the application of data-mining methods to terrorist incidents useless. He makes the point that very few positives for the system to learn from, combined with a very large number of variables, causes a huge number of false alarms (not sure if I remember all he said correctly).
So it's all about how many confirmed positives you have and whether you have reliable data on them. It works very well for credit card fraud because there is a sufficiently large number of actual fraud cases, you have reliable data about them and false alarms do not cause major disruption.
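Some back-of-the-envelope numbers (entirely hypothetical) showing how a rare condition swamps even a quite accurate detector with false alarms, which is the base-rate problem behind Schneier's argument:

```python
# Hypothetical numbers to illustrate the base-rate problem.
population = 1_000_000
true_cases = 10               # a genuinely rare condition-exposure link
sensitivity = 0.99            # the detector flags 99% of real cases
false_positive_rate = 0.01    # ...and wrongly flags 1% of everyone else

true_alarms = true_cases * sensitivity
false_alarms = (population - true_cases) * false_positive_rate

print(true_alarms)   # ~10 real hits
print(false_alarms)  # ~10,000 false alarms: roughly 1,000 haystacks per needle
```

Credit card fraud escapes this trap because "true_cases" is large and well documented; a rare drug-defect link does not.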
I agree with your concern about the reliability and completeness of the data. I still think it's an interesting idea if there were a way to extract the data from a reliable source instead of working with what people claim to be the case.
LPTS,
You bring up good points. Self-reports are notoriously unreliable (like eyewitness testimony); however, they are still usable data. In terms of missing days, the questions can be reordered and repeated for those who miss. In terms of false correlations, you can reduce these by being careful about how you use and analyze the data set. Multiple regression analysis can be dangerous--it's all in how it is used. If you limit the items you correlate to those for which you have a hypothesis, the data should at least be valuable in that respect.
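A sketch of why limiting yourself to pre-specified hypotheses matters: under a standard Bonferroni correction, the per-test significance threshold shrinks with the number of comparisons, so blind all-pairs fishing leaves almost no statistical power (numbers hypothetical):

```python
# Bonferroni correction: to keep the overall chance of any false alarm
# at alpha, each individual test must clear alpha / n_tests.
alpha = 0.05

def bonferroni_threshold(n_tests):
    return alpha / n_tests

# Ten pre-registered hypotheses: each test needs p < 0.005 -- demanding
# but achievable with a decent sample.
print(bonferroni_threshold(10))

# Correlating everything against everything (say, 100,000 pairs): each
# test needs p < 5e-7, which almost no real effect will clear.
print(bonferroni_threshold(100_000))
```

This is the statistical version of the haystack argument above: the fewer questions you ask of the data, the less hay each answer has to be dug out of.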
Thoughts? Impediments?