

Bringing Data Mining Into the Mainstream - helwr
http://bits.blogs.nytimes.com/2010/07/26/bringing-data-mining-into-the-mainstream/

======
JunkDNA
I suspect that a problem inherent in many organizations' data is that the
semantics are often the thorniest part of the integration process. So the
"translation layer" that is described by Mr. Fayyad in the article is no small
undertaking. This problem is exacerbated by the fact that in many cases, data
is collected for a primary purpose, and once you start digging, you realize
that it's not as useful for secondary purposes without a _lot_ of work.

The best example of this I can give from my experience is medical data. Lots
of people want to do studies based on the wealth of data stored in electronic
health records. Indeed there is massive potential for scientific discovery
just sitting there. The problem is that health records are optimized for their
primary purpose: recording information about a patient so their doctor(s) can
make informed decisions based on that individual's history. Once you start
looking at aggregate data from them, all sorts of nuances start to creep in
and bite you. In order to do a high quality analysis, you usually have to
clean and normalize that data heavily. That process requires expert knowledge
of medicine, and some parts of it are just not automatable.
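
To make the cleaning step concrete, here is a minimal sketch of the kind of normalization involved, using body weight as a stand-in. The `WeightRecord` type, its field names, and the accepted unit spellings are all hypothetical and not taken from any real health record system. The point is only that some records can be converted mechanically, while others (here, those with a missing or unrecognized unit) have to be set aside for someone with domain knowledge.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LB_TO_KG = 0.45359237  # exact, by definition of the international pound


@dataclass
class WeightRecord:
    """Hypothetical record shape: a value plus a free-text unit as entered."""
    patient_id: str
    value: float
    unit: Optional[str]


def normalize_weight(record: WeightRecord) -> Optional[float]:
    """Return the weight in kg, or None when the unit cannot be trusted."""
    unit = (record.unit or "").strip().lower()
    if unit in {"kg", "kgs", "kilogram", "kilograms"}:
        return record.value
    if unit in {"lb", "lbs", "pound", "pounds"}:
        return record.value * LB_TO_KG
    # Missing or unknown unit: guessing here would corrupt the aggregate.
    return None


def split_for_review(
    records: List[WeightRecord],
) -> Tuple[List[Tuple[str, float]], List[WeightRecord]]:
    """Separate mechanically cleanable records from ones needing an expert."""
    clean, needs_review = [], []
    for record in records:
        kg = normalize_weight(record)
        if kg is None:
            needs_review.append(record)
        else:
            clean.append((record.patient_id, round(kg, 1)))
    return clean, needs_review


records = [
    WeightRecord("a", 70.0, "kg"),
    WeightRecord("b", 154.0, "LBS "),
    WeightRecord("c", 72.0, None),
]
clean, needs_review = split_for_review(records)
```

In this toy case the first two records are converted and the third lands in `needs_review`. In real records the review pile is where the expert time goes.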

You could argue that the data collection process should be modified to allow
for better future use of the data (many people do make this argument). The
tricky part is doing that while not compromising the primary goal of data
collection or adding a drag in terms of cost or time on the organization.
These obstacles can be overcome, but doing so is by no means trivial.

------
yurylifshits
Here is how to bring Data Mining to the mainstream:

1\. Put big data (static or streaming) on the cloud (AWS, Google Base, ...)

2\. Create a beautiful, simple exploration and query interface

3\. Fix the query semantics ("cluster these objects", "find most similar
picture", "continue sequence"), while working on better and better algorithms
at the backend

Examples of this approach: <http://wolframalpha.com>,
<http://viewer.opencalais.com>, <http://labs.google.com/sets>
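
A rough sketch of what step 3 could look like in code, assuming items are already represented as numeric feature vectors. The function name `find_most_similar` and both scoring functions are made up for illustration and are not how any of the services above work. The caller always asks the same question, and the algorithm behind it can be swapped out later.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]
Similarity = Callable[[Vector, Vector], float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def negative_euclidean(a: Vector, b: Vector) -> float:
    """Negated Euclidean distance, so that higher still means more similar."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_most_similar(
    query: Vector,
    items: Dict[str, Vector],
    backend: Similarity = cosine,
) -> str:
    """Fixed query semantics: return the name of the closest item.

    Only `backend` changes when a better algorithm comes along.
    """
    return max(items, key=lambda name: backend(query, items[name]))


items = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
print(find_most_similar([1.0, 0.05], items))
print(find_most_similar([1.0, 0.05], items, backend=negative_euclidean))
```

Both calls use the same query form; only the scoring function behind it differs.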

~~~
adataminer
As someone who actively works in this area, I find the problem with queries is
that most of the algorithms that work require deep learning, not a simple
SQL-like syntax. I would love the data in simple CSV format hosted on a cloud
such as Amazon EC2 [for safety] rather than behind some beautiful interface
built by a latte-sipping Ruby hippy.

