

Competing in a data science contest without reading the data - urish
http://blog.mrtz.org/2015/03/09/competition.html

======
disgruntledphd2
This is actually a really, really good article. I like the way the author
writes both clearly and entertainingly about a reasonably complex topic.

The disconnect between static and adaptive (interactive) data analysis that
is at the heart of the post is probably the most widely ignored issue in
science.

To be honest, it's hard not to ignore, given its implications: we only get
one shot at any given set of test/validation/experimental data, and if we
mess it up, we're screwed.

------
sgt101
No-theory classifiers (random ones), aggregated so as to optimize against a
non-representative holdout set, form a theory of that set? I think this is
expected. If you create classifiers that express some domain theory on the
training set in step 1, and then use the information in the holdout
differently, you'll do a lot better (I believe; well, I think I saw that
result when I did my PhD 17 years ago).

Here is a very bad, very old AAAI workshop paper that sums up the idea (the
journal paper is behind a paywall):

[http://aaaipress.org/Papers/Workshops/1999/WS-99-06/WS99-06-005.pdf](http://aaaipress.org/Papers/Workshops/1999/WS-99-06/WS99-06-005.pdf)
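
For concreteness, here is a rough sketch of the aggregation the post
describes, with toy sizes I made up: keep the random label vectors that the
leaderboard scores above chance, then majority-vote them.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000                           # holdout size (invented)
    y = rng.integers(0, 2, n)          # true holdout labels, hidden from us

    kept = []
    for _ in range(500):               # 500 random submissions
        guess = rng.integers(0, 2, n)
        if (guess == y).mean() > 0.5:  # leaderboard feedback: above chance?
            kept.append(guess)

    # Majority vote over the kept guesses overfits the holdout.
    vote = (np.mean(kept, axis=0) > 0.5).astype(int)
    print((vote == y).mean())          # well above 0.5; ~0.5 on fresh data

No signal about the task ever enters; the "theory" lives entirely in the
holdout labels.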

------
kastnerkyle
This paper [1] by Bergstra and Cox is one of my favorites on competing
without looking at the data. They were actually able to design the model
before the data was even released (!)

[1]
[http://arxiv.org/pdf/1306.3476v1.pdf](http://arxiv.org/pdf/1306.3476v1.pdf)

------
louden
This article illustrates how a model can be over-fit even when some data is
withheld for testing: score against the withheld set often enough and it
effectively leaks into training. This is a trap that anyone using training
and testing sets can fall into.
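
A toy simulation (sizes invented) makes the trap concrete: pick the best of
many signal-free models by test-set accuracy, and the reported score inflates
while truly fresh data stays at chance.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000
    y_test = rng.integers(0, 2, n)     # the reused test set
    y_fresh = rng.integers(0, 2, n)    # data never used for selection

    best_acc, best_pred = 0.0, None
    for _ in range(1000):              # 1000 candidate "models", all noise
        pred = rng.integers(0, 2, n)
        acc = (pred == y_test).mean()
        if acc > best_acc:             # select by test-set accuracy
            best_acc, best_pred = acc, pred

    print(best_acc)                        # ~0.55: looks like signal
    print((best_pred == y_fresh).mean())   # ~0.50: there is none

The withheld set only protects you for as long as you don't use it to make
decisions.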

