Bringing Data Mining Into the Mainstream

JunkDNA · on July 28, 2010

I suspect that a problem inherent in many organizations' data is that the semantics are often the thorniest part of the integration process. So the "translation layer" that is described by Mr. Fayyad in the article is no small undertaking. This problem is exacerbated by the fact that in many cases, data is collected for a primary purpose, and once you start digging, you realize that it's not as useful for secondary purposes without a lot of work.

The best example of this I can give from my experience is medical data. Lots of people want to do studies based on the wealth of data stored in electronic health records. Indeed, there is massive potential for scientific discovery just sitting there. The problem is that health records are optimized for their primary purpose: recording information about a patient so their doctor(s) can make informed decisions based on that individual's history. Once you start looking at aggregate data from them, all sorts of nuances start to creep in and bite you. In order to do a high-quality analysis, you usually have to clean and normalize that data heavily. That process requires expert knowledge of medicine, and some parts of it are just not automatable.
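
As a rough illustration of that cleaning step, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the sample weight entries, the unit table, and the function names do not come from any real health record system. It converts the entries it recognizes to kilograms and sets aside everything else for a person to review, which is the part that resists automation.

```python
import re

# Hypothetical free-text entries, typed by clinicians for their own use.
RAW_WEIGHTS = ["72 kg", "158 lbs", "70.5kg", "patient declined", "11 st"]

# Conversion factors to kilograms, limited to units we explicitly recognize.
TO_KG = {"kg": 1.0, "lb": 0.45359237, "lbs": 0.45359237}

# A number followed by a unit word, with optional whitespace around both.
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$", re.IGNORECASE)


def normalize_weight(text):
    """Return the weight in kg, or None when the entry needs human review."""
    match = PATTERN.match(text)
    if not match:
        return None
    value = float(match.group(1))
    factor = TO_KG.get(match.group(2).lower())
    if factor is None:
        return None  # unknown unit: flag it rather than guess
    return round(value * factor, 1)


def split_clean_and_flagged(entries):
    """Separate entries that normalized cleanly from those that did not."""
    clean, flagged = [], []
    for entry in entries:
        kg = normalize_weight(entry)
        if kg is None:
            flagged.append(entry)
        else:
            clean.append(kg)
    return clean, flagged


if __name__ == "__main__":
    clean, flagged = split_clean_and_flagged(RAW_WEIGHTS)
    print("normalized (kg):", clean)
    print("needs review:", flagged)
```

Even in this toy case the flagged pile is where the domain knowledge goes: someone has to decide whether "11 st" is a weight in stone or something else entirely.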

You could argue that the data collection process should be modified to allow for better future use of the data (many people do make this argument). The tricky part is doing that while not compromising the primary goal of data collection or adding a drag in terms of cost or time on the organization. These factors are possible to overcome, but are by no means trivial.

yurylifshits · on July 28, 2010

Here is how to bring Data Mining to the mainstream:

1. Put big data (static or streaming) on the cloud (AWS, Google Base, ...)

2. Create a beautiful/simple exploration and query interface

3. Fix the query semantics ("cluster these objects", "find most similar picture", "continue sequence"), while working on better and better algorithms at the backend

Examples of this approach: http://wolframalpha.com, http://viewer.opencalais.com, http://labs.google.com/sets
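
One way to read step 3 is as a stable query method sitting in front of a swappable algorithm. The Python sketch below is only an illustration under that assumption: `SimilarityIndex`, `most_similar`, and the two scoring functions are hypothetical names, not the API of any of the services listed above. The caller always asks the same question ("find the most similar item") while the scoring function behind it can be replaced.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Similarity = Callable[[Vector, Vector], float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def negative_euclidean(a: Vector, b: Vector) -> float:
    """Negated Euclidean distance, so that larger still means more similar."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SimilarityIndex:
    """Fixed query ("most similar to X") over a replaceable scoring backend."""

    def __init__(self, items: Dict[str, Vector], similarity: Similarity = cosine):
        self._items = dict(items)
        self._similarity = similarity

    def most_similar(self, key: str, k: int = 1) -> List[str]:
        query = self._items[key]
        others = [name for name in self._items if name != key]
        # Rank every other item by its score against the query, best first.
        others.sort(
            key=lambda name: self._similarity(query, self._items[name]),
            reverse=True,
        )
        return others[:k]


if __name__ == "__main__":
    # Hypothetical feature vectors standing in for pictures.
    pictures = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0)}
    print(SimilarityIndex(pictures).most_similar("a"))
    print(SimilarityIndex(pictures, negative_euclidean).most_similar("a", k=2))
```

Swapping `cosine` for `negative_euclidean` changes the backend without changing what the user types, which is the separation the list above describes.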

adataminer · on July 28, 2010

As someone who actively works in this area, the problem with queries is that most of the algorithms that work require deep learning and not simple SQL-like syntax. I would love the data in simple CSV format hosted on a cloud such as Amazon EC2 [for safety] rather than some beautiful interface by a latte-sipping Ruby hippie.