His conclusion there is that scientific institutions do not accept systematic uncertainty to the degree they need to. Elsewhere he says that scientists who are not statisticians should concentrate on gathering quality data, since noise in data often leads to spurious point estimates.
In the first link, Gelman talks of the "garden of forking paths": essentially a generalisation of p-hacking which recognises that even perfectly honest researchers end up with biased analyses because of the myriad of choices their estimates depend on. The solution is to move away from summarising results through point estimates and instead construct statistical models in which you can explore the space of possible analyses; there has been a revolution in the techniques for doing so through the application of Markov chain Monte Carlo (MCMC) methods to construct posterior distributions.
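The MCMC idea can be made concrete with a toy posterior. Everything below (the flip counts, the flat prior, the step size) is invented for illustration; this is a minimal random-walk Metropolis sketch, not a stand-in for real samplers like Stan's.

```python
import math
import random

random.seed(0)

# Toy data: 140 heads out of 200 flips of a possibly biased coin.
heads, flips = 140, 200

def log_posterior(p):
    """Log posterior for bias p under a flat prior (up to a constant)."""
    if not 0 < p < 1:
        return -math.inf
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)

# Random-walk Metropolis: propose a nearby p, accept with the usual ratio.
samples, p = [], 0.5
for i in range(20_000):
    proposal = p + random.gauss(0, 0.05)
    if math.log(random.random()) < log_posterior(proposal) - log_posterior(p):
        p = proposal
    if i >= 2_000:          # discard burn-in
        samples.append(p)

posterior_mean = sum(samples) / len(samples)
```

With a flat prior the exact posterior is Beta(141, 61), whose mean is about 0.698; the sample mean should land close to that, and unlike a point estimate, the samples let you read off any interval or functional you like.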
You observe data, you make an inference that leads to a theory (induction), you then subject that theory to falsificationist testing.
"It appears all the swans I've seen are white, therefore I posit all swans are white. Oh, wait, there's a black one, nevermind."
On the Bayesian vs Frequentist aspect... Falsification is what you should do to theories, not model parameters. If you have a coin and you're trying to figure out the probability of heads P(H), then you have your model of coin-flipping (Bernoulli process) and you're trying to estimate the model's parameter, so you do statistical inference given some sequence of coin flips.
It doesn't seem right to apply frequentist null testing because you want to estimate the model parameter, not make some binary decision. What if you had some prior data you want to include? Or you observe new data in the future? This is exactly what Bayesian inference is set up for. And a lot of science is not about falsifying theories in theory space but about estimating model parameters in parameter space, in the case where we all agree on a particular model.
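For the coin example, folding in prior data and future data is just arithmetic under a conjugate Beta prior. The flip counts below are made up; the sketch shows the update rule, not a real analysis.

```python
from fractions import Fraction  # exact arithmetic, purely for clarity

# Beta(a, b) prior over the coin's P(heads); Beta(1, 1) is flat.
a, b = 1, 1

def update(a, b, heads, tails):
    """Conjugate Bayesian update: each head bumps a, each tail bumps b."""
    return a + heads, b + tails

# The posterior after the first batch simply becomes the prior for the next,
# which is how both old data and future observations slot in.
a, b = update(a, b, heads=7, tails=3)    # earlier flips
a, b = update(a, b, heads=12, tails=8)   # flips observed later
posterior_mean = Fraction(a, a + b)      # (1 + 7 + 12) / (2 + 10 + 20)
```

The order of the batches doesn't matter, and there is no binary accept/reject step anywhere: the output is a full distribution over the parameter.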
Moreover, a big advantage of Bayesian statistics is that it generally requires you to make your model and assumptions explicit, which makes the model much easier to scrutinize than a frequentist statistical test.
A report analyzed one year's entries to Nature and found that very, very few of the papers actually met Popper's criteria for falsifiable hypotheses; in fact, most of the papers started out with an exploratory aim and documented their findings. This means that adopting the Popperian idea that only falsifiable claims are science would require rejecting good science. One must also consider what opportunities would be missed if we forced good science to adopt strictly falsifiable hypotheses before research commences: every exploratory paper would need to be redefined (or rather, first defined) as a falsifiable hypothesis... in an attempt to please Popperians.
It is also a stretch to say that science must be empirical or carried out in a particular way to which falsifiability is congenial. There are good arguments that even philosophy or mathematics can be considered sciences, and indeed they were (see Wissenschaft in Kant and Hegel for instance).
Furthermore, it's been argued that Popper's theory of falsification actually includes bad science (pseudoscience). From SEP:
>Strictly speaking, his criterion excludes the possibility that there can be a pseudoscientific claim that is refutable. According to Larry Laudan (1983, 121), it “has the untoward consequence of countenancing as ‘scientific’ every crank claim which makes ascertainably false assertions”. Astrology, rightly taken by Popper as an unusually clear example of a pseudoscience, has in fact been tested and thoroughly refuted (Culver and Ianna 1988; Carlson 1985). Similarly, the major threats to the scientific status of psychoanalysis, another of his major targets, do not come from claims that it is untestable but from claims that it has been tested and failed the tests.
That said, I agree that most of science is about becoming less wrong (choosing the least worst among available models for how observations arose), and that type of model selection problem is what Bayesian approaches excel at.
“Considering the evidence that preceded this study (or trial, or series), what explanatory model does the weight of the evidence best support?”
This need not be a binary decision (witness Bayesian model averaging) and it need not be static (Bayesian updating explicitly evolves from a prior, whether flat or subjective). But it better matches the way most people approach research and probability, imho. Extraordinary claims must be supported by extraordinary weights of evidence, and frequentist testing in a vacuum doesn't enforce this intuition.
Disclaimer: I am a statistician and a part-time Bayesian. Not a zealot, and not a jihadist against subjective inference. I use empirical Bayes procedures when they improve results on an ongoing basis, and I avoid them when the effort is more than the expected benefit can justify. The tipping point has moved over time, as more, faster, and better tools have decreased the cost (effort) to obtain a posterior distribution over complicated models.
Much of science (particularly the life sciences) is estimating model parameters, and not actually positing new theories. For example, if you're testing a small molecule drug to see if it lowers blood pressure, you want to know how much it affects blood pressure. This is a parameter estimation problem. We all agree that the molecule will get into the body and interact with other molecules, we just don't know how much these molecular interactions will affect blood pressure. Since Nature publishes a lot of life science stuff, it's no surprise that most of their papers are about parameter estimation given an implicitly agreed upon model.
This sort of parameter estimation is important and very different from positing new theories (e.g. quantum mechanics or germ theory). It doesn't make sense to talk about the falsifiability of a parameter value. It only makes sense to talk about falsifying an entire causal theory.
"All swans are white." is a theory about swan-ness and should be falsified.
"What proportion of swans are white?" is a parameter estimation problem and it doesn't make sense to talk about falsification.
Theories must be falsifiable and subject to falsification. I don't think this precludes gathering corroborating evidence for a theory and then updating your belief about that theory, but it is very easy to find corroborating evidence for theories so falsification is the most practical means of actually finding truth without fooling yourself.
i.e. it is very easy to posit two different theories that make the same predictions in some domain, therefore all evidence in that domain supports both theories. The only practical way to distinguish them is to find another domain where they make different predictions and falsify one or both.
> This approach to the logic of scientific progress, that data can serve to falsify scientific hypotheses but not to demonstrate their truth, was developed by Popper (1959) and has broad acceptance within the scientific community. In the words of Popper (1963), “It is easy to obtain confirmations, or verifications, for nearly every theory,” while, “Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability.” The ASA’s statement appears to be contradicting the scientific method described by Einstein and Popper.
However, (1) completely ignores discovery science (something I have ranted about here previously), and implies fully accepting Popper's claim to have solved the centuries-old "problem of induction". Meanwhile, (2) seems to pop up pretty frequently even though I'm not sure Popper himself would have agreed, considering:
> It is easy, [Popper] argues, to obtain evidence in favour of virtually any theory, and he consequently holds that such ‘corroboration’, as he terms it, should count scientifically only if it is the positive result of a genuinely ‘risky’ prediction
Ionides et al. conveniently left out the second half of that context. Hmm, if only there were some sort of way to quantify that "riskiness". Like some sort of theorem... maybe of the form P(H|E) = P(E|H)*P(H)/P(E)
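The point about riskiness falls straight out of Bayes' theorem: a "risky" prediction is one where the evidence E is improbable unless the hypothesis is true, i.e. P(E) is small, so surviving the test moves the posterior a lot. The numbers below are entirely hypothetical, just to show the asymmetry.

```python
def posterior(prior, likelihood, evidence_prob):
    """Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence_prob

prior = 0.01        # hypothetical prior P(H) for the theory
likelihood = 0.95   # P(E|H): the theory strongly predicts E

# A "safe" prediction: E was likely anyway, so confirming it barely moves us.
safe = posterior(prior, likelihood, evidence_prob=0.90)

# A "risky" prediction: E was improbable unless H is true.
risky = posterior(prior, likelihood, evidence_prob=0.02)
```

Here the same observation lifts the posterior from 1% to barely above 1% in the safe case, but to 47.5% in the risky case, which is Popper's "genuinely risky prediction" rendered quantitatively.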
>Strictly speaking any method can be made to falsify claims with the addition of a falsification rule. However, the rule must still be shown to be reliable.
At the very least, I think it's shaky to try to apply the same standard regardless of the actual method of the science, since some fields take different approaches to others. That's not to say that all approaches are valid, but one field's idea of science shouldn't necessarily need to hold up in another. Some sciences obviously benefit greatly (or did in the past) from even a simple statement of falsifiability, but for others it seems like a total mismatch for the problem domain.
I hadn't considered Bayes' rule to work that way (my only exposure to statistics is an intro course on information theory as part of an EE degree) but if I'm right, this opens up a discussion as to how risky is risky enough - is astrology still in the picture then?
The prior P(H) is also arguably trouble. What's a good prior probability for the hypothesis that magic sky gods affect your fortune? I don't know, but I'd probably have to start somewhere pretty low.
You can't define science as 'the right answer' because that is unknowable (notwithstanding that there are a lot of things where, if we haven't figured out the truth, then the universe is set up to radically deceive us).
And if you accept that unknowability as a premise, then pseudoscience is a perfectly valid science. Something can be completely wrong and still be established in a highly scientific way.
Just because something is provably wrong doesn't make it unscientific. It just makes people who believe it provably wrong. They are likely to be stupid, I admit.
I don't disagree fundamentally, but what you say about philosophy irks me as a philosopher. If you looked at a random sample of publications in philosophy, you would find that only a very small percentage of them have the rigour and exactness that we associate with science. I have also met many philosophers, perhaps even the majority of those I've met so far, who would not describe themselves as scientists.
The remaining part of philosophy that adheres to strict standards and uses mathematical methods is akin to mathematics and computer science, and perhaps most similar to formal linguistics and economics in the way it works. I personally consider this part of philosophy a science; it is a kind of applied mathematics, although often speculative and conditional on axioms that are not as evident as in mathematics. Other disciplines have similar non-empirical parts; a typical example is social choice theory in sociology, which I would consider a scientific theory, although it is not empirical. In the end, it's applied mathematics.
I agree that mathematics and the relevantly rigorous and formal parts of other disciplines are non-empirical science, but to these areas the debate about hypothesis testing and the right use of statistics simply doesn't apply. It is a fallacy to presume that, because these fields exist, clearly empirical disciplines (or parts thereof) could do without statistics. If a question is empirical, then it has to be addressed with proper quantitative methodology.
This is extremely important to me personally, since I've been in deep disagreement with colleagues for many years about this issue. They work in related disciplines within our philosophy institute and habitually make "qualitative empirical analyses" of texts without treating them as mere precursors to quantitative studies. They see no problems with their methodology, even when I point out that their samples would be too small to support the generalizations they make if they did conduct quantitative studies. To me, this is absolutely crazy; I just can't see how a qualitative analysis of 20 texts could allow you to take them as representative of hundreds of thousands of texts when a quantitative study of the same texts could not possibly reveal anything useful because the sample size is too small. What's worse, their whole discipline seems to be based on this kind of extremely small-scale qualitative empirical study, plus a very vague mix of fairly imprecise philosophy and common sense. I'm a nice person and get along with my colleagues well, so I won't ever tell them my opinion, but if I'm honest, I'd say that their discipline is pseudo-science or, at best, imprecise, non-scientific philosophy in disguise. (To make this clear, I have no problem with imprecise philosophy and have done it occasionally myself; I just don't think it can qualify as science, and not many philosophers consider it as such.)
Long story short, empirical questions have to be addressed with quantitative methodology, or you get what I'd call "elaborate opinions".
I do wish that physical scientists, including myself, had more philosophical training, though, even if it's not science per se, such that we could reliably have original, educated opinions on empiricism, rationalism, etc.
Many philosophers don't consider themselves scientists, but I think this is at least in part because the mainstream conception of science really is empirical investigation rather than rigorous analysis. In fact, even non-rigorous empirical analysis can pass for science. Depending on the area, I think qualitative methodology can capture what quantitative can't, or cover cases where quantitative methods would be unreliable or impossible to use without changing the result significantly.
In a nutshell, philosophy can be scientific if a method is pursued rigorously, but a lot of what is called philosophy is critical theory which is important if not scientific (despite using qualitative and quantitative analyses where appropriate). I have a feeling that critical theorists would question the separation we're erecting between science and non-science, especially those who view critical theory as in part an artistic endeavor too.
That's such a tedious way to test a theory of swan color. You have to go find swans, which tend to live in hard-to-reach, dirty, obnoxious environments.
It's much easier to rephrase the hypothesis into a logically exactly equivalent hypothesis, "All non-white things are not swans", and test that instead. I can test that without even leaving my house. Just looking around the room I'm in there are hundreds of non-white things, and none of them are swans.
This situation is called Hempel's paradox, also known as the raven paradox or the paradox of indoor ornithology.
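The equivalence being exploited here is just the contrapositive: "all swans are white" and "all non-white things are non-swans" hold or fail together. A toy check over an invented list of objects makes that explicit:

```python
# A made-up toy world of (name, is_swan, is_white) triples.
world = [
    ("swan_1", True, True),
    ("swan_2", True, True),
    ("red_chair", False, False),
    ("white_cup", False, True),
]

# "All swans are white."
all_swans_white = all(white for _, swan, white in world if swan)
# "All non-white things are not swans." (the contrapositive)
all_nonwhite_nonswan = all(not swan for _, swan, white in world if not white)

# A single black swan falsifies both statements at once.
with_black_swan = world + [("swan_3", True, False)]
still_white = all(white for _, swan, white in with_black_swan if swan)
```

The paradox, of course, is not that the logic is wrong but that surveying your living room feels like it shouldn't count as evidence about swans.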
Popper spoke similarly of corroboration, only he was unable to cash it out. He wasn't sufficiently well versed in statistics, and anyway, he wanted to distinguish corroboration from induction as the latter was being used at the time. The same impetus led Neyman to speak of inference as an act (of inferring).
I explain all this in my recent book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP). As I say there,
“In my view an account of what is warranted and unwarranted to infer – a normative epistemology – is not a matter of using probability to assign rational beliefs, but to control and assess how well probed claims are.” (p. 54)
I find a lot more to like in the ASA's statement than in any of these responses, which seem to act as if Karl Popper were the only one to ever have a worthwhile philosophy of science.
This paragraph of the response was particularly telling to me:
> A judgment against the validity of inductive reasoning for generating scientific knowledge does not rule out its utility for other purposes. For example, the demonstrated utility of standard inductive Bayesian reasoning for some engineering applications is outside the scope of our current discussion.
Translation: Ok, so maybe induction works fine when you're going to build a bridge where someone's life is on the line, but it still has no place in science. Falsification or bust!
By contrast, I struggle to deal with probability and statistics without developing a strong suspicion that what the objects' names denote is completely different from what those names mean in common English.
It is nice to see ongoing authoritative commentary that the large majority do not understand what a p-value actually implies. The thread of discourse seems to be that, even assuming all the academics are completely honest (i.e., no academic fraud, no hand-waving), the number of false results that are awarded statistical significance is much higher than it should be. The standard p-value threshold of 5% does not imply that 95% of statistically significant studies are not due to chance, particularly amongst the subset that make it into the public eye.
They should call it "detectable"; some people have suggested "discernible".
However, there is another fatal flaw in how p-values are used. They are usually used for rejecting an infinitesimally small hypothesis: the null hypothesis is stated as "the effect exactly equals 0.00000000000...". In practice, no experiment has exactly zero effect. There is always at least a very small systematic bias due to imperfectly calibrated instruments or small methodological variations between researchers.
Even if you do everything else right and pre-register your study, with enough data a null hypothesis test will always pick up on these small biases and make the results significant.
If you are looking to reject a null hypothesis, I can tell you in advance: all experiments have a nonzero bias, so all results are statistically significant (p < 0.000000001) with enough data. There, I just saved the scientific world a ton of money; they don't have to do all these experiments. Just reference this comment in your paper to show significance.
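The effect is easy to demonstrate on paper. The bias magnitude below is invented, and rather than simulating noise the sketch evaluates a two-sided z-test at the sample mean you would expect on average; the point is only how the p-value behaves as n grows while the physical bias stays fixed.

```python
import math
from statistics import NormalDist

# Suppose the true effect is zero but the apparatus carries a tiny
# systematic bias of 0.001 standard deviations (a made-up number).
bias, sd = 0.001, 1.0

def two_sided_p(n):
    """p-value of a z-test of 'mean == 0', evaluated at the expected sample mean."""
    z = bias / (sd / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(z))

# The identical bias drifts from "nowhere near significant" to
# "overwhelmingly significant" purely by collecting more data.
p_small = two_sided_p(1_000)
p_huge = two_sided_p(1_000_000_000)
```

At n = 1,000 the expected p-value is about 0.97; at n = 10^9 it is indistinguishable from zero, even though nothing about the experiment changed except its size.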
At least here, the language seems honest. Rejecting a null hypothesis correctly conveys the idea of having rejected nothing.
Using a Popperian approach is great, but you should reject a portion of the hypothesis space that is bigger than zero.
Physicists' convention is to call a 99.7 per cent (3 sigma) effect a "hint" and only a 5 sigma effect a detection.
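Translating sigma levels into tail probabilities is a one-liner; the sketch below uses the one-sided convention common in particle physics (a "5 sigma discovery" is roughly a 1-in-3.5-million tail).

```python
from statistics import NormalDist

def one_sided_p(sigma):
    """Tail probability beyond `sigma` standard deviations of a normal."""
    return 1 - NormalDist().cdf(sigma)

p3 = one_sided_p(3)   # ~1.3e-3: a "hint"
p5 = one_sided_p(5)   # ~2.9e-7: a "detection"
```

Compare those with the social-science convention of p < 0.05, which sits just below 2 sigma.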
It would be impressive if "there is an effect different from zero" were rejected (but no one would ever be able to do this). Science should try to reject finitely large hypotheses, at a minimum something like "the effect is larger than some reasonable margin to account for experimental bias and imperfect tools". At least a small chunk of the hypothesis space should be rejected for your experiment to be worth something. You sort of get that with confidence intervals, since you can see how far the lower bound is from zero.
"The effect size is <1%" and "the effect size is >=1%".
Or 100 different hypotheses for 1-100% effect sizes. They're mutually exclusive, so only one will be true.
So again, while in general I agree with you that Bayesian methods have significant advantages, this objection isn't well founded.
At any rate, "p-value" is a made-up, artificial word, certainly better than using a common existing word (such as "likelihood" or "significance"), which would be even easier to misunderstand.
The fundamental problem is that hypothesis testing happens within a whole theoretical framework, and the jargon refers to things well defined and understood within that framework. I think there is just no way of breaking it down further (though the implications and typical limitations of the research could maybe be communicated better).
This does happen in maths all the time. Pretty much all new results and new techniques also introduce new notation and language, which is then rationalized in further works by the community.
What tends not to change all that much currently is the notation used in undergrad textbooks. However, going back more than a few decades, books on, for instance, calculus look very different and use very different language (e.g. infinitesimals, topology, a focus on power series), and similarly for linear algebra or complex analysis.