
How to Score 0.8134 in Titanic Kaggle Challenge - ahmedbesbes
http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html#.WRSPhDQgLa0.hackernews
======
cjauvin
I tried this intro competition a while ago, but I seem to remember that it
was relatively easy to obtain a score of ~0.78 or ~0.79 with about 5 lines of
Python or R (using your favorite lib/algo of course). That said, am I the only
one thinking that the pursuit of a few extra percentage points over what seems
like a "natural baseline" (which admittedly might translate into a few dozen
correctly classified people in this particular case) is a somewhat strange use
of one's time and skills (leaving aside the data processing and programming
skills one can gain in the process, I'll admit)? My point is that there seems
to be nothing remotely "scientific" (or even insightful) to be gathered from
such an exhaustive search process (which characterizes a lot of those data
science competitions IMO), where you squeeze to death all the possible ways in
which you can transform a given dataset in order to maximize a very precise
metric. To me this looks like a degenerate form of statistical science, which
doesn't have much to do with reality anymore.

~~~
nerdponx
Disagreed. Because it's such a well-known benchmark, it's a great sandbox for
testing new techniques, trying new software, and self-study.

------
tveita
Having "PassengerId" as the most important feature seems like a bad sign. Is
this historical data, like an id assigned at boarding, or is it a synthetic id
assigned to records potentially in some non-random order?
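
One way to probe this concern, as a minimal sketch on synthetic data rather
than the Titanic set (the `row_id` column and all parameters are assumptions
made up for illustration): append an arbitrary unique id that carries no
signal and look at the impurity-based importance a random forest gives it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data; stands in for the real features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)

# Append an arbitrary unique id (hypothetical "row_id") unrelated to y.
row_id = np.arange(X.shape[0]).reshape(-1, 1)
X_with_id = np.hstack([X, row_id])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_with_id, y)

# Impurity-based importances sum to 1; the last entry belongs to row_id.
importances = clf.feature_importances_
id_importance = importances[-1]
print("importance of row_id:", round(float(id_importance), 3))
```

High-cardinality numeric columns tend to pick up some impurity-based
importance even when they are noise, so a non-zero value alone does not settle
whether PassengerId is informative. Permutation importance on held-out data is
a common cross-check.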

------
in9
why do people do this:

    
    
        def compute_score(clf, X, y, scoring='accuracy'):
            xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
            return np.mean(xval)
    

when they could have done:

    
    
        np.mean(cross_val_score(clf, X, y, cv=5, scoring='accuracy'))

~~~
mmierz
A) they find it more readable to have less stuff on each line

B) they think they might want to look at the intermediate results, or compute
additional statistics
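
A sketch of point B, assuming scikit-learn and a synthetic dataset in place of
the Titanic features (the name `compute_score_stats` is made up here): keeping
the per-fold scores in a variable makes it easy to report the spread as well
as the mean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compute_score_stats(clf, X, y, scoring='accuracy'):
    # Keep the per-fold scores so more than one statistic can be derived.
    xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    return np.mean(xval), np.std(xval), xval


# Synthetic data and a simple classifier, chosen only for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
mean, std, folds = compute_score_stats(LogisticRegression(max_iter=1000), X, y)
print("mean=%.3f std=%.3f folds=%s" % (mean, std, np.round(folds, 3)))
```

With the one-liner, getting the standard deviation would mean running the
cross-validation a second time or rewriting the expression.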

------
stared
But... grid search for number of trees in random forest? More is always
better, just slower (and at some point there is no difference).

Correct me if I am wrong, but this particular optimization looks like a waste
of (CPU) time.

~~~
in9
You could easily overfit with a bunch of trees.

~~~
stared
Boosted tree models - surely! But random forest?

Vide
https://www.quora.com/Do-random-forests-tend-to-overfit-as-more-trees-are-added
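
One way to check this empirically instead of grid searching, sketched on
synthetic data (the dataset, the tree counts, and the use of out-of-bag
accuracy are all assumptions for illustration, not what the article does):
grow a single forest in steps and record the out-of-bag score at each size.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# warm_start=True adds trees to the same forest instead of refitting from scratch.
clf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_scores = {}
for n_trees in (50, 100, 200, 400):
    clf.set_params(n_estimators=n_trees)
    clf.fit(X, y)
    # Out-of-bag accuracy: each sample is scored only by trees that did not see it.
    oob_scores[n_trees] = clf.oob_score_

for n_trees, score in oob_scores.items():
    print(n_trees, round(score, 3))
```

If the scores flatten out as trees are added, the only thing left to tune for
`n_estimators` is the compute budget.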

------
wellsjohnston
Is setting unknown ages to the median really the right way to go about this?
Feels strange to me.

~~~
nerdponx
If it works, it works.
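
For comparison, a small pandas sketch of two common options, using toy rows
invented for illustration (not Titanic data): a single global median versus a
median computed within each group. Grouping by `Sex` alone is an assumption
here; any column that correlates with age could play that role.

```python
import pandas as pd

# Toy rows, invented for illustration; not taken from the Titanic data.
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 30.0, None, 20.0, 10.0, None],
})

# Option 1: one global median for every missing age.
global_fill = df["Age"].fillna(df["Age"].median())

# Option 2: median within each group (here grouped by Sex only).
group_fill = df["Age"].fillna(df.groupby("Sex")["Age"].transform("median"))

print(global_fill.tolist())  # [40.0, 30.0, 25.0, 20.0, 10.0, 25.0]
print(group_fill.tolist())   # [40.0, 30.0, 35.0, 20.0, 10.0, 15.0]
```

Whether either choice helps the final score is something to measure with
cross-validation, not to assume.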

------
MariuszGalus
OLD but... sort of... GOLD

