As the author does a really nice job explaining, and as seen in the Fisher quote, these articulations of heuristics and guidelines are often taken as closed-form rules.
Look beyond statistics and you see it everywhere. Entrepreneurship and design are common examples...in both areas, experts' ways of thinking are often highly situated, highly metacognitive, and the actions they take are inseparable from their thinking process. However, because of the academic drive toward objective/deterministic/observable phenomena, the research tends to report and attribute only the actions. The result is that those actions, rather than the underlying thinking processes, are what get valued and taught.
The result is simulations of expertise masquerading as knowledge. It's one thing when it's students, but as you are seeing in psychology's 'replication crisis' (which, side note, is kind of a meta-version of its own critique), it can create real problems when surface-level understanding is accepted and generalized as a normative 'truth' in a field. You see it in economics and business a lot...strive to appear scientific, but do so in ways that inherently betray the underlying structure of what you are studying. It comes from an underlying value in those communities, and in society, that the only truth is objective truth.
If I have an experiment where I am screening 5 possible predictors and I get p-values of .9 for 4 of them and .52 for 1...I would be an idiot not to pursue the 1. If I get 4 .49s and 1 .00000000001...same thing. Statistics is relative, literally.
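A toy version of that kind of screen (the data and the one "real" predictor here are made up for illustration; the point is just ranking candidates by relative p-value):

    import numpy as np
    from scipy import stats

    # Hypothetical screen: 5 candidate predictors against one outcome.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 5))
    y = 0.6 * X[:, 2] + rng.normal(size=40)  # only predictor 2 matters here

    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(5)]
    print("p-values:", [round(float(p), 3) for p in pvals])
    print("pursue predictor", int(np.argmin(pvals)))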
[happy to provide citations...not sure anyone really cares]
The arbitrariness at the heart of the regulatory enterprise may seem disconcerting, but the alternative (no shared standards for evaluating drug efficacy) has also been tried historically, and the result was markets flooded with useless, often dangerous products, many of which nonetheless sold very well.
I think that is a bit recursive. It's not so much a fetishization as it is a misunderstanding that results in it being a valuable target.
It would be a heck of a lot more useful as a target (and granted...I would argue the target should be more like .001) if more research adopted Bayesian statistical techniques, where you can't p-hack as easily.
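For illustration, a minimal sketch of one such technique, a Bayes factor for a coin-flip experiment (the data and the uniform prior here are assumptions for the example, not anything from the thread):

    from scipy.special import beta

    # Hypothetical coin-flip data: k heads in n flips.
    n, k = 100, 61

    # H0: theta = 0.5 exactly.  H1: theta ~ Beta(1, 1), i.e. a uniform prior.
    # The binomial coefficient appears in both marginal likelihoods and cancels.
    marg_h0 = 0.5 ** n
    marg_h1 = beta(k + 1, n - k + 1)  # integral of theta^k * (1-theta)^(n-k) over [0, 1]

    print("BF01 (evidence for null over alternative):", marg_h0 / marg_h1)

With these numbers the two-sided p-value is about .035, yet the Bayes factor comes out around 0.7, i.e. the data barely favor the alternative at all. Because the evidence has to be stated against an explicit alternative, a bare "p < .05" has much less room to hide.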
Skip all the programming exercises in R - just watch the videos and solve the multiple-choice problems. Supplement with the decent open-source textbook it links to.
Each "week" is likely only 1-2hrs of work. ~5 weeks per course. Only really need the first 2 courses:
1. Introduction to Probability and Data
2. Inferential Statistics
Nope, the p-value calculation assumes there is no effect. How can it be the probability that there is an effect?
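A quick simulation of that distinction (the base rate of true effects, the effect size, and the sample size are all assumptions made for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments, n = 20_000, 30
    prior_h1 = 0.1   # assume only 10% of tested hypotheses are real effects
    effect = 0.5     # assumed effect size (in SD units) when H1 is true

    h1_true = rng.random(n_experiments) < prior_h1
    means = np.where(h1_true, effect, 0.0)
    data = rng.normal(means[:, None], 1.0, size=(n_experiments, n))
    _, p = stats.ttest_1samp(data, 0.0, axis=1)

    significant = p < 0.05
    # Among "significant" results, what fraction are actually null?
    print("share of p<.05 results where H0 is true:",
          round(float(np.mean(~h1_true[significant])), 2))

With these assumed numbers, over a third of the p < .05 results come from true nulls, even though every individual test used the conventional 5% threshold. The p-value conditions on the null; the probability of an effect additionally depends on the base rate.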
If you're interested in p-values, I wrote a post on them here with some counterintuitive examples (one of them shows how a lower p-value can sometimes increase your belief in the null hypothesis).
Perhaps you want to weight different types of errors differently, e.g. count a false positive as costlier than a false negative?
If you have 2000 samples of data, you don't train your model on all of it and then call that your success rate. You'll end up with conclusions that don't generalize.
Instead train on 1600 and measure your success on the remaining 400.
Similarly, don't look for statistical significance across all 2000 samples and conclude that's the result. Do it across 1600 and then validate it on the remaining 400. If there is a real result there, it'll reproduce. This makes your process robust to overfitting / param hacking.
You avoid the green jelly bean problem entirely.
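A minimal sketch of that two-pass screen (hypothetical data: 50 all-null candidate features standing in for the jelly bean colors):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n, n_features = 2000, 50   # 50 candidate "jelly bean colors", all null here
    X = rng.normal(size=(n, n_features))
    y = rng.normal(size=n)     # outcome unrelated to every feature

    X_e, X_h = X[:1600], X[1600:]
    y_e, y_h = y[:1600], y[1600:]

    # Pass 1: fish for anything with p < .05 on the exploration set.
    hits = [j for j in range(n_features)
            if stats.pearsonr(X_e[:, j], y_e)[1] < 0.05]

    # Pass 2: keep only hits that also reach p < .05 on the held-out 400.
    survivors = [j for j in hits if stats.pearsonr(X_h[:, j], y_h)[1] < 0.05]

    print("pass 1 hits:", len(hits), "| pass 2 survivors:", len(survivors))

Pass 1 typically fishes up two or three spurious hits at alpha = .05; almost none of them survive pass 2.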
Not entirely. Quite often, the big fishing expedition research is done on things like epidemiological data sets, where you've got a million monkeys with a million typewriters trying to publish a million papers, all based on one data set.
Under that kind of situation, assuming that everything works out perfectly, after the first pass you'll have some random non-negative number of hypotheses, collected from fishing, that you're going to test again. And those tests will also have a 5% type I error rate, so the fishing expedition will still have a multiple-comparisons problem. Your cumulative risk of a type I error will be mitigated, but not reduced to the traditional 5% (or whatever alpha you were shooting for).
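To put rough numbers on that (assuming every fished hypothesis is null and the two passes are independent): a million hypotheses screened at alpha = .05 leave ~50,000 false leads after pass 1, and retesting those at .05 still leaves ~2,500 expected false positives. The per-hypothesis rate drops to .05 × .05 = .0025, but across a family that size the chance of at least one type I error is still essentially 1.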
Meanwhile, your risk of a type II error will have increased considerably: you're replacing one test of relatively high statistical power with two tests of relatively lower power, and setting up a situation where a type II error in either of them yields a type II error in the overall test.
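Concretely (again assuming independent passes): if each pass on its own had power 0.8, the combined power is 0.8 × 0.8 = 0.64, so the type II rate climbs from 20% to 36%. And since each pass runs on only part of the data, each pass's individual power is lower than a single full-sample test's to begin with.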
Not saying that there's no value to approaching things that way, but it's not the free lunch you're suggesting it is.
If you're performing statistical inference on sample sizes of 12, you can't then be surprised by a lack of predictability.
And I'm pretty confident the majority of experiments are using sample sizes larger than 12.
It definitely varies by field; testing 3 doses and a negative control with 3x replication would be a pretty good experiment for testing a drug in mice. It's plenty of data for identifying a drug that actually works, as long as your statistical tools aren't designed with the expectation of thousands of samples.
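A sketch of what the corresponding analysis might look like (the group means and variances are made-up assumptions; with n = 3 per group, only a large effect will clear significance):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    # Hypothetical mouse study: negative control + 3 doses, 3 mice per group (n = 12).
    control = rng.normal(0.0, 1.0, size=3)
    low     = rng.normal(0.5, 1.0, size=3)
    mid     = rng.normal(1.5, 1.0, size=3)
    high    = rng.normal(3.0, 1.0, size=3)  # assumed large effect at the top dose

    f, p = stats.f_oneway(control, low, mid, high)
    print("one-way ANOVA: F =", round(float(f), 2), " p =", round(float(p), 4))

Whether it rejects depends on the draw, but with an assumed dose effect of 3 SDs it usually will; a subtle effect would be invisible at this n.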
A 12-point dataset is not desirable _per se_, but sometimes it's all one has - as is the case, for example, in health-related research.
You do that test on the second subset, and then you discard every theory that doesn't pass both the 1600 set and the 400 set.
So you end up with the predictions that pass all the data in your original 2000 samples.
Is it really any better at generalizing to new data? If so, can you just evaluate your theories on the unpartitioned data by doing randomized subset testing after the fact?