The process of learning could be defined as a task of extraction of relevant information (knowledge) about reality (shared environment) not mere accumulation of a fancy nonsense or false beliefs.
And reality like: The actual sinking of the Titanic?
If your model concludes that nobility, traveling first class, close to the exits, without family, has a higher chance of surviving, then this is fancy nonsense or a false belief?
You make a really strange case for your view.
In real life it is also very rare to have free pickings of the variables you want. Some variables have to substituted with available ones.
The Titanic story is to make things interesting for beginners. One could leave out all the semantics of this challenge, anonymize the variables and the target, and still use this dataset to learn about going from a table with variables to a target. In fact, doing so teaches you to leave your human bias at the door. Domain experts get beaten on Kaggle, because they think they need other variables, or that some variables (and their interactions) can't possibly work.
Let the data and evaluation metric do the talking.
That sounds a bit iffy. A domain expert should really know what they're talking about, or they're not a domain expert. If the real deal gets beaten on Kaggle it must mean that Kaggle is wrong, not the domain expert.
Not that domain experts are infallible, but if it's a systematic occurrence then the problem is with the data used on Kaggle, not with the knowledge of the experts.
I mean, the whole point of scientific training and research is to have domain experts who know their shit, know what I mean?
> Since our goal was to demonstrate the power of our models, we did no feature engineering and only minimal preprocessing. The only preprocessing we did was occasionally, for some models, to log-transform each individual input feature/covariate. Whenever possible, we prefer to learn features rather than engineer them. This preference probably gives us a disadvantage relative to other Kaggle competitors who have more practice doing effective feature engineering. In this case, however, it worked out well.
> Q: Do you have any prior experience or domain knowledge that helped you succeed in this competition? A: In fact, no. It was a very good opportunity to learn about image processing.
> Do you have any prior experience or domain knowledge that helped you succeed in this competition? I didn't have any knowledge about this domain. The topic is quite new and I couldn't find any papers related to this problem, most probably because there are not public datasets.
> Do you have any prior experience or domain knowledge that helped you succeed in this competition? M: I have worked in companies that sold items that looked like tubes, but nothing really relevant for the competition. J: Well, I have a basic understanding of what a tube is. L: Not a clue. G: No.
> We had no domain knowledge, so we could only go on the information provided by the organizers (well honestly that and Wikipedia). It turned out to be enough though. Robert says it cannot happen again, so we’re currently in the process of hiring a marine biologist ;).
> Through Kaggle and my current job as a research scientist I’ve learnt lots of interesting things about various application domains, but simultaneously I’ve regularly been surprised by how domain expertise often takes a backseat. If enough data is available, it seems that you actually need to know very little about a problem domain to build effective models, nowadays. Of course it still helps to exploit any prior knowledge about the data that you may have (I’ve done some work on taking advantage of rotational symmetry in convnets myself), but it’s not as crucial to getting decent results as it once was.
> Oh yes. Every time a new competition comes out, the experts say: "We've built a whole industry around this. We know the answers." And after a couple of weeks, they get blown out of the water.
Competitions have been won without even looking at the data. Data scientists/machine learners are in the business of automating things -- so why should domain knowledge be any different?
Ok, sure it can help, but it is not necessary, and can even hamper your progress: You are searching for where you think the answer is -- thousands are searching everywhere and finding more than you, the expert, can.
And if you see classification as a form of hypothesis testing, then cross-validation is a valid way of testing if hypothesis holds on unseen data.
This would be like sampling all coins from my pockets and thinking you could build a predictive model of year printed to value of coin. Probably could for the change I carry. Not a wise predictor, though.
This dataset is more in line with what you are looking for: https://www.kaggle.com/saurograndi/airplane-crashes-since-19...
It shows only that given set of variables (observable and inferred) could be used to build a model. The given data set is not descriptive, because it does not contain more relevant hidden variables, so any predictions or inferences based on this data set are nothing but a story, a myth made from statistics and data.