
Data mining without prejudice - llambda
http://web.mit.edu/newsoffice/2011/large-data-sets-algorithm-1216.html?
======
jasondavies
Original paper (no paywall):
[http://obs.rc.fas.harvard.edu/turnbaugh/Papers/Reshef_Scienc...](http://obs.rc.fas.harvard.edu/turnbaugh/Papers/Reshef_Science2011.pdf)

------
evgen
If you want to play around with the actual goods in addition to reading the
paper, there is a jar file and an R wrapper for it at
<http://www.exploredata.net/Mine/index.asp>

If you are paywall-blocked you can get the paper at
<http://turnbaugh.openwetware.org/Publications.html>

------
politician
I wonder how much of the $15.00 it costs to rent this article for 24 hours the
authors of this paper will receive.

~~~
robinhouston
Is that a serious question? In any case, the answer is ‘none’. The way
academic publishing works, authors of papers do not get paid at all (though
they may have to pay “page charges” to the publisher in some cases). That’s
one of the reasons so many of us think the whole system is ripe for reform.

~~~
po
People often think that academic journals are providing the service of
'publishing' but they are in fact providing a reputation management and
curating system. Not a great one, mind you. The fact that the end product is a
journal is just an artifact.

Actually come to think of it, small record labels provide a similar
gatekeeper, tastemaker kind of role.

------
bugsbunnyak
So... they ran mutual information on a correlation network with a window
optimizer. Did I miss something?

It looks like a neat tool, but how does this rate `Science`? (as anything
other than a section in a broad review of non-parametric data exploration
techniques - of which there are many)
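
For anyone wondering what "mutual information over a grid" looks like in
practice, here is a minimal sketch. It is an assumption-laden simplification,
not the authors' algorithm: it uses plain equal-width bins instead of
optimizing where the grid lines go, and the names `grid_mutual_information`
and `grid_score` are made up for illustration. The grid-size budget of
n ** 0.6 and the normalisation by log2(min(nx, ny)) are how I understand the
statistic in the paper to be defined.

```python
import math
from collections import Counter


def grid_mutual_information(xs, ys, nx, ny):
    """Mutual information (bits) of xs and ys binned on an nx-by-ny equal-width grid."""
    def bin_index(v, lo, hi, k):
        if hi == lo:                      # constant variable: everything in one bin
            return 0
        return min(int((v - lo) / (hi - lo) * k), k - 1)

    n = len(xs)
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    # Count points per grid cell, then per column and per row.
    cells = Counter(
        (bin_index(x, x_lo, x_hi, nx), bin_index(y, y_lo, y_hi, ny))
        for x, y in zip(xs, ys)
    )
    cols, rows = Counter(), Counter()
    for (i, j), c in cells.items():
        cols[i] += c
        rows[j] += c

    # I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * math.log2(c * n / (cols[i] * rows[j]))
        for (i, j), c in cells.items()
    )


def grid_score(xs, ys, exponent=0.6):
    """Best normalised grid MI over all grids with nx * ny <= n ** exponent."""
    budget = len(xs) ** exponent
    best = 0.0
    nx = 2
    while nx * 2 <= budget:
        ny = 2
        while nx * ny <= budget:
            mi = grid_mutual_information(xs, ys, nx, ny)
            best = max(best, mi / math.log2(min(nx, ny)))
            ny += 1
        nx += 1
    return best
```

A perfectly linear pair scores at the top of the 0 to 1 range and a constant
column scores 0; the real tool also searches over grid-line placement, which
this sketch skips.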

------
mbq
This boils down to the following old algo: 1. plot all possible scatterplots;
2. order them by the area of white space left; 3. claim the top-whatever as
novel, exciting relationships. Done this way it is anti-robust and produces
too many false positives to be seriously useful.

~~~
baltcode
I think you can correct for the false positives to some extent. I am more
skeptical of the computational load of basically testing all sorts of
relationships. They have an approximate algorithm for doing that.

~~~
mbq
Yeah... do it exactly and no hit will survive; do it with Bonferroni and
nothing will really change. And in fact it is only checking two-way
relationships, which is already a minor fraction of "all sorts".
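
For reference, a minimal sketch of what a Bonferroni correction over all
pairwise tests involves. The helper names `pair_count` and
`bonferroni_survivors` are hypothetical, and the p-values in the usage note
are made up; this only illustrates the arithmetic and says nothing about how
the tool's own p-value table behaves.

```python
def pair_count(n_variables):
    """Number of two-way relationships among n_variables columns."""
    return n_variables * (n_variables - 1) // 2


def bonferroni_survivors(p_values, alpha=0.05):
    """Indices of tests that stay significant at level alpha after Bonferroni.

    With m tests, each p-value is compared against alpha / m.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p <= threshold]
```

With 1000 variables, `pair_count(1000)` is 499500 tests, so at alpha = 0.05
each pair would need a p-value of about 1e-7 or smaller to survive.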

------
tel
I love these techniques but wish they'd market themselves differently. This is
hardly able to detect true relationships between variables---the false
positive rate will likely be astronomical even with their MCMC _p_-value
table on the website---but it is a more efficient way of exploring a
high-dimensional data set than just comparing it blindly.

I think, put in the same camp as Hadley Wickham's Grand Tour methodologies,
this is pretty interesting. Outside of that, it's just a well-marketed method
just waiting for failure.

------
hooande
Fascinating read, can't wait to look into it more closely. So far this
approach seems somewhat similar to CHIRP [1] in that both concepts use bins or
grids to find the most effective pairwise feature combinations.

[1]
[http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program...](http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p6.pdf)

------
eggbrain
I believe the full paper that this article is based on can be found here:

<http://www.sciencemag.org/content/334/6062/1518.full>

------
baltcode
duplicate of an older thread: <http://news.ycombinator.com/item?id=3364077>

------
anonDataUser
Tried this on a set of data that I have been looking at recently, and was not
impressed. The software actually assumes all fields contain numerical data.
Why not handle nominal data? Also doesn't handle temporal data. Maybe this is
theoretically interesting but practically it is useless.

~~~
brown9-2
Is it possible to transform the temporal and nominal data into something with
a numerical scale? For example, seconds since a certain date, or a numeric
ranking of the nominal data.
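
A minimal sketch of both transformations, assuming the data is already loaded
as Python `datetime` objects and string labels. The helper names
`seconds_since` and `rank_encode` are hypothetical, and the example labels in
the usage note are made up.

```python
from datetime import datetime, timezone


def seconds_since(timestamp, origin):
    """Temporal value as a number: seconds elapsed since a chosen origin date."""
    return (timestamp - origin).total_seconds()


def rank_encode(labels, order=None):
    """Nominal values as integer ranks.

    If no explicit order is given, labels are ranked alphabetically, which is
    an arbitrary choice for truly unordered categories.
    """
    if order is None:
        order = sorted(set(labels))
    rank = {label: i for i, label in enumerate(order)}
    return [rank[label] for label in labels]
```

For example, `rank_encode(["low", "high", "mid", "low"], order=["low", "mid",
"high"])` gives `[0, 2, 1, 0]`. One caveat: numbering unordered categories
imposes an order that is not in the data, so any relationship found may
depend on the numbering chosen.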

