– https://en.wikipedia.org/wiki/Iris_flower_data_set
The first lesson is a classifier to separate apples and oranges.
If you work through the data, you'll find things like women, children and first class passengers had a higher survival rate than men with lower class tickets.
This matches exactly the stories of what happened: Staff took first class passengers to the lifeboats first, then women and children. Then they ran out of lifeboats.
So the data shows correlation, and eye-witness accounts shows causation. That's close to the ideal combination: eyewitness accounts can be unreliable because we can't know how widespread they are, and correlation doesn't show causation.
But the combination of them both is pretty much the best case for studying something which can't be replicated.
 See examples like https://www.kaggle.com/omarelgabry/titanic/a-journey-through...
But any attempt to frame it as a "prediction", an accurate model of the event or adequate description of reality is just nonsense.
To call things by its proper names (precise use of the language) is the foundation of the scientific method. This is mere oversimplified, non-descriptive toy model of one aspect of historical event, made from of statistics of partially observable environment. A few inferred correlations reflects that there was not a total chaos, but some systematic activity. No doubt about it. But this is absolutely unscientific to say anything else about the toy model, let alone claim that any predictions based on it have any connection to reality.
There is clear correlation between gender and survival rates. Given the data, a decent prior would absolutely take that into account.
Yes, there are other factors. But the foundation of statistical models is simplification, and descriptive statistics are an important foundation of that.
In any case, it isn't exactly clear that there are magical hidden factors which predicted survival. It appears you maybe unfamiliar with the event, because basically those who got into a lifeboat survived, and those who didn't, didn't survive.
To quote Wikipedia:
Almost all those who jumped or fell into the water drowned within minutes due to the effects of hypothermia.... The disaster caused widespread outrage over the lack of lifeboats, lax regulations, and the unequal treatment of the three passenger classes during the evacuation..... The thoroughness of the muster was heavily dependent on the class of the passengers; the first-class stewards were in charge of only a few cabins, while those responsible for the second- and third-class passengers had to manage large numbers of people. The first-class stewards provided hands-on assistance, helping their charges to get dressed and bringing them out onto the deck. With far more people to deal with, the second- and third-class stewards mostly confined their efforts to throwing open doors and telling passengers to put on lifebelts and come up top. In third class, passengers were largely left to their own devices after being informed of the need to come on deck.
Even more tellingly:
The two officers interpreted the "women and children" evacuation order differently; Murdoch took it to mean women and children first, while Lightoller took it to mean women and children only. Lightoller lowered lifeboats with empty seats if there were no women and children waiting to board, while Murdoch allowed a limited number of men to board if all the nearby women and children had embarked
All this behavior matches exactly what the model tells us about the event.
I'd be very interested if you can point to something specific that is wrong about it.
All models are wrong, but some are useful.
Now, it so happens that that correlated heavily with class. But, not as much as with sex. Though, there were some places where being male hurt your chances (as you point out in the one officer not allowing men on boats), by and large these were secondary and correlated with success, not predictors of it.
All predictions are wrong and make no sense for partially observable, multiple causation, mostly stochastic phenomena. It will never be the same.
The Titanic's sister ship (the Brittanic) was torpedoed during WW1 and sunk. However, the lesson of the Titanic (too few lifeboats) had been learnt, and only 26 people died.
I don't know what point you are trying to make - yes, I agree that history never repeats, but lessons can be learnt from it, and they can be quantified and they can be useful.
This happened because they made a _statistical_ model of the Titanic disaster, and learned from it? Like, they actually crunched the numbers and plotted a few curves etc, and then said "aha, we need more boats"?
I kind of doubt it, and if it wasn't the case then you can't very well talk about a "model", in this context. It's more like they had a theory of what factor most heavily affected survival and acted on it. But I'd be really surprised to find statistics played any role in this.
No - statistics as the discipline that we think of today wasn't really around until the work of Gosset and Fisher which was done a few years after this.
I'm sure you noted that I was very careful with what I claimed: "the lesson of the Titanic (too few lifeboats) had been learnt".
These days we'd quantify the lesson with statistics. Then, they didn't have that tool.
Instead, we have testimony relaying the same story: Just one question. Have you any notion as to which class the majority of passengers in your boat belonged? - (A.) I think they belonged mostly to the third or second. I could not recognise them when I saw them in the first class, and I should have known them if there were any prominent people. (Q.) Most of them were in the boat when you came along? - (A.) No. (Q.) You put them in? - (A.) No. Mr. Ismay tried to walk round and get a lot of women to come to our boat. He took them across to the starboard side then - our boat was standing - I stood by my boat a good ten minutes or a quarter of an hour. (Q.) At that time did the women display a disinclination to enter the boat? - (A.) Yes."
So yes, I agree - it was a theory, which our modern modelling tools can show matched well with what the statistics showed happened.
My whole point is that this is very useful, unlike the OP who dismissed it as useless.
OK, tell me, please, what it is that you can predict? That some John Doe, having the first class ticket in a cabin next to the exit would survive the collision of the next Titanic with a new iceberg? That being a woman gives you better chances to secure a seat in a lifeboat? What is the meaning of the word "predict" here?
Most of the models mimic and simulate (very naively) that observable behavior, not its origin.
When people cite "the map is not the territory" they mean this. Simulation is not even an experiment. It is mere an animation of a model - a cartoon.
Simulation can be a very beneficial experiment. See for instance: https://papers.nips.cc/paper/5351-searching-for-higgs-boson-...
Simulations are not experiments. It is an animation of a formalized imagination, if you wish.
The process of learning could be defined as a task of extraction of relevant information (knowledge) about reality (shared environment) not mere accumulation of a fancy nonsense or false beliefs.
And reality like: The actual sinking of the Titanic?
If your model concludes that nobility, traveling first class, close to the exits, without family, has a higher chance of surviving, then this is fancy nonsense or a false belief?
You make a really strange case for your view.
In real life it is also very rare to have free pickings of the variables you want. Some variables have to substituted with available ones.
The Titanic story is to make things interesting for beginners. One could leave out all the semantics of this challenge, anonymize the variables and the target, and still use this dataset to learn about going from a table with variables to a target. In fact, doing so teaches you to leave your human bias at the door. Domain experts get beaten on Kaggle, because they think they need other variables, or that some variables (and their interactions) can't possibly work.
Let the data and evaluation metric do the talking.
That sounds a bit iffy. A domain expert should really know what they're talking about, or they're not a domain expert. If the real deal gets beaten on Kaggle it must mean that Kaggle is wrong, not the domain expert.
Not that domain experts are infallible, but if it's a systematic occurrence then the problem is with the data used on Kaggle, not with the knowledge of the experts.
I mean, the whole point of scientific training and research is to have domain experts who know their shit, know what I mean?
> Since our goal was to demonstrate the power of our models, we did no feature engineering and only minimal preprocessing. The only preprocessing we did was occasionally, for some models, to log-transform each individual input feature/covariate. Whenever possible, we prefer to learn features rather than engineer them. This preference probably gives us a disadvantage relative to other Kaggle competitors who have more practice doing effective feature engineering. In this case, however, it worked out well.
> Q: Do you have any prior experience or domain knowledge that helped you succeed in this competition? A: In fact, no. It was a very good opportunity to learn about image processing.
> Do you have any prior experience or domain knowledge that helped you succeed in this competition? I didn't have any knowledge about this domain. The topic is quite new and I couldn't find any papers related to this problem, most probably because there are not public datasets.
> Do you have any prior experience or domain knowledge that helped you succeed in this competition? M: I have worked in companies that sold items that looked like tubes, but nothing really relevant for the competition. J: Well, I have a basic understanding of what a tube is. L: Not a clue. G: No.
> We had no domain knowledge, so we could only go on the information provided by the organizers (well honestly that and Wikipedia). It turned out to be enough though. Robert says it cannot happen again, so we’re currently in the process of hiring a marine biologist ;).
> Through Kaggle and my current job as a research scientist I’ve learnt lots of interesting things about various application domains, but simultaneously I’ve regularly been surprised by how domain expertise often takes a backseat. If enough data is available, it seems that you actually need to know very little about a problem domain to build effective models, nowadays. Of course it still helps to exploit any prior knowledge about the data that you may have (I’ve done some work on taking advantage of rotational symmetry in convnets myself), but it’s not as crucial to getting decent results as it once was.
> Oh yes. Every time a new competition comes out, the experts say: "We've built a whole industry around this. We know the answers." And after a couple of weeks, they get blown out of the water.
Competitions have been won without even looking at the data. Data scientists/machine learners are in the business of automating things -- so why should domain knowledge be any different?
Ok, sure it can help, but it is not necessary, and can even hamper your progress: You are searching for where you think the answer is -- thousands are searching everywhere and finding more than you, the expert, can.
And if you see classification as a form of hypothesis testing, then cross-validation is a valid way of testing if hypothesis holds on unseen data.
This would be like sampling all coins from my pockets and thinking you could build a predictive model of year printed to value of coin. Probably could for the change I carry. Not a wise predictor, though.
This dataset is more in line with what you are looking for: https://www.kaggle.com/saurograndi/airplane-crashes-since-19...
It shows only that given set of variables (observable and inferred) could be used to build a model. The given data set is not descriptive, because it does not contain more relevant hidden variables, so any predictions or inferences based on this data set are nothing but a story, a myth made from statistics and data.