Huh? Why would a connection to reality be required to get started with machine learning?

 Because otherwise it should be called machine hallucinations?The process of learning could be defined as a task of extraction of relevant information (knowledge) about reality (shared environment) not mere accumulation of a fancy nonsense or false beliefs.
 So knowledge like: Did the passenger have kids on board? Was the passenger nobility? Was the passenger travelling first class? Where was the passenger located on the ship after boarding? And how do these factors influence survivability?And reality like: The actual sinking of the Titanic?If your model concludes that nobility, traveling first class, close to the exits, without family, has a higher chance of surviving, then this is fancy nonsense or a false belief?You make a really strange case for your view.
 Correlations does not imply causation. There were many more relevant but "invisible" variables, which, probably, related to some genetic factors, like ability to sustain exposure to the cold water, ability to calm oneself down to avoid panic and self-control in general, strong survival instinct to literally fight the others, etc. The variables you have described, except the age of a passenger, are visible but irrelevant. And pure luck must have a way bigger weight and it is, obviously, related to the genetic favorable factors, age, health and fitness.
 This challenge is not about causal inference. I do agree it is more of a toy dataset, to get started with the basics, and that there are a lot of other variables that go into survivability. But to say these variables, except for age, are irrelevant is mathematically unsound: You can show with cross-validation and test set performance that your model using these variables generalizes (around 0.80 ROC AUC). You can do statistical/information theoretical tests that show the majority of these variables is a significant signal for predicting the target.In real life it is also very rare to have free pickings of the variables you want. Some variables have to substituted with available ones.The Titanic story is to make things interesting for beginners. One could leave out all the semantics of this challenge, anonymize the variables and the target, and still use this dataset to learn about going from a table with variables to a target. In fact, doing so teaches you to leave your human bias at the door. Domain experts get beaten on Kaggle, because they think they need other variables, or that some variables (and their interactions) can't possibly work.Let the data and evaluation metric do the talking.
 >> Domain experts get beaten on Kaggle, because they think they need other variables, or that some variables (and their interactions) can't possibly work.That sounds a bit iffy. A domain expert should really know what they're talking about, or they're not a domain expert. If the real deal gets beaten on Kaggle it must mean that Kaggle is wrong, not the domain expert.Not that domain experts are infallible, but if it's a systematic occurrence then the problem is with the data used on Kaggle, not with the knowledge of the experts.I mean, the whole point of scientific training and research is to have domain experts who know their shit, know what I mean?
 How does this not violate [1]? That is, this seems specifically anti-statistical. The best you can come up with on this is a predictive model that you then have to test on new events. In this case, that would likely mean new crashes.
 Because we are not doing hypothesis testing, we are doing classification on a toy dataset. Sure, one could treat this as a forecasting challenge, but then one would need another Titanic sinking in roughly the same context, with the same features... That demand is as unreasonable as calling this modeling knowledge competition meaningless.And if you see classification as a form of hypothesis testing, then cross-validation is a valid way of testing if hypothesis holds on unseen data.
 I think that is a rub. With the goal just being to find some variables that correlate together, it is a neat project. But, ultimately not indicative of a predictive classification. If only due to the fact that you do not have any independent samples to cross validate with. All samples being from the same crash.This would be like sampling all coins from my pockets and thinking you could build a predictive model of year printed to value of coin. Probably could for the change I carry. Not a wise predictor, though.
 You are right, but only in a very strict, not-fun, manner :). Even if we had more data on different boats sinking, the model would not be very useful: We don't go with the Titanic anymore and plotted all icebergs. Still, if a cruise ship were to go down, I'd place a bet on ranking younger women of nobility traveling first class higher for survivability than old men with family traveling third class, wise predictor or no.This dataset is more in line with what you are looking for: https://www.kaggle.com/saurograndi/airplane-crashes-since-19...
 Makes sense. And yes, I completely meant my points in a pedantic only manner. :)
 > You can show with cross-validation and test set performance that your model using these variables generalizes (around 0.80 ROC AUC).It shows only that given set of variables (observable and inferred) could be used to build a model. The given data set is not descriptive, because it does not contain more relevant hidden variables, so any predictions or inferences based on this data set are nothing but a story, a myth made from statistics and data.

Applications are open for YC Summer 2018

Search: