Yes! My two top peeves with the majority of soft-science research are underpowered studies (and the associated false conclusions) and a seemingly total neglect of effect size. The binary question 'is an intervention better?' is not the right question. To discuss any medical, social, policy, etc. intervention you need to know how much benefit it delivers. That benefit can then be weighed against the costs and difficulties of the intervention.
Of course, effect size isn't enough either. You still need a good understanding of the range of outcomes, whether a significant fraction of participants had poor or negative outcomes, and so on. That can be hidden a bit inside a single metric like effect size.
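A minimal sketch of that point (all numbers and distributions here are invented for illustration): two hypothetical treatments can report the same average benefit while leaving very different fractions of participants worse off.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Treatment A: a small, consistent improvement for nearly everyone.
a = rng.normal(loc=0.5, scale=0.2, size=n)
# Treatment B: the same average improvement, but wildly variable outcomes.
b = rng.normal(loc=0.5, scale=2.0, size=n)

for name, change in [("A", a), ("B", b)]:
    print(f"Treatment {name}: mean improvement = {change.mean():+.2f}, "
          f"worse off than baseline = {(change < 0).mean():.1%}")
```

Both print roughly the same headline effect, yet under the simulated treatment B a large share of participants ends up below their baseline.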
If you have just one white lab rat (N=1) and it weighs 50 kg after the treatment, you can just skip all statistical analysis and publish the result. The importance of the discovery is self-evident from the effect size alone.
The discovery that H. pylori causes peptic ulcers was first submitted to The Lancet as two papers: a 25-patient study and a 100-patient study. The Lancet was slow to publish and there was resistance from the medical establishment; it was hard to find reviewers who would agree on the importance of the papers. Barry Marshall decided to run the experiment on himself, and it was very convincing. Marshall and Warren received a Nobel Prize for their discovery.
Very good! I find it aggravating that the word "significant" is thrown around casually to refer to outcomes that are "statistically significant" yet "extremely small".
We would encounter this a lot in performance engineering - sometimes you could show that a new optimization was statistically significantly better, yet it was only worth a 0.1% performance improvement (possibly in exchange for a whole bunch of new code).
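A toy example of that situation (the throughput numbers, noise level, and run counts below are invented), using a plain two-sample t-test to show how a 0.1% gain can still clear the significance bar:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
runs = 500  # benchmark runs per build (invented)

baseline = rng.normal(loc=10_000.0, scale=20.0, size=runs)   # requests/sec
optimized = rng.normal(loc=10_010.0, scale=20.0, size=runs)  # ~0.1% faster

t, p = stats.ttest_ind(optimized, baseline)
gain = optimized.mean() / baseline.mean() - 1
print(f"p-value = {p:.2e}, relative gain = {gain:.3%}")
# The p-value is tiny, so the improvement is almost certainly real -- but
# whether 0.1% justifies the extra code is a judgment call, not a statistical one.
```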
At least in medicine, there's a useful distinction between "statistically significant" (the math says that this drug works) and "clinically significant" (this might actually be worth giving to patients).
I'd like to point out that if you are doing performance testing, you should read a little stats or have someone good at stats lend a hand. Because it is really easy to fool yourself. I've fooled myself and I worry about doing it again.
Take effect size, for example. Suppose I run my test before and after the latest commit. The software now has 2% fewer TPS.
Is that meaningful? It depends. Should I rely on a single run? Almost certainly not.
Suppose performance improves by 203%. Is that meaningful? Probably. Should I rely on a single run? Almost certainly not.
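Here's a rough sketch of the workflow I mean (run counts, TPS figures, and noise levels are all made up): repeat the benchmark a handful of times on each build and look at the spread before trusting a 2% difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
runs = 10

# Suppose run-to-run noise (~3%) is on the same order as the "regression" (2%).
before = rng.normal(loc=1000.0, scale=30.0, size=runs)  # TPS per run
after = rng.normal(loc=980.0, scale=30.0, size=runs)

t, p = stats.ttest_ind(after, before)
drop = 1 - after.mean() / before.mean()
print(f"mean drop = {drop:.1%}, p-value = {p:.3f}")
# With noise comparable to the effect, a handful of runs often can't separate
# a real 2% regression from luck; a 203% change stands out immediately, but
# still deserves more than one run.
```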
Then there's the obnoxious problem of vast numbers of uncontrollable variables. Folks run buckets of tests on cloud platforms without knowing whether they've landed on a physical machine they used last time, so their bits are disk-warm; or there's a network glitch in central1-a but not central1-b and they don't know which one they're using; or they run tests at different times of day and don't realise they'll get different competition for cloud resources due to diurnal demand ...
tl;dr run more tests and get someone to help you. If you run ab2 three times and publish a breathless blog post, be that on your soul.
While effect size is important, it is really only comparable across studies with randomized, dichotomous treatment. Comparability in non-experimental studies or in studies with dichotomous/polytomous outcomes is more difficult.
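A hedged sketch of why the scales don't line up (the numbers, and the logistic-latent-variable conversion, are illustrative assumptions rather than a recommendation): a standardized mean difference for a continuous outcome and an odds ratio for a dichotomous outcome are only comparable if you bolt on extra distributional assumptions.

```python
import math

def cohens_d(mean_treat, mean_ctrl, pooled_sd):
    """Standardized mean difference for a continuous outcome."""
    return (mean_treat - mean_ctrl) / pooled_sd

def odds_ratio(events_treat, n_treat, events_ctrl, n_ctrl):
    """Odds ratio for a dichotomous outcome."""
    odds_t = events_treat / (n_treat - events_treat)
    odds_c = events_ctrl / (n_ctrl - events_ctrl)
    return odds_t / odds_c

def d_from_or(or_value):
    """Approximate d from an odds ratio: ln(OR) * sqrt(3) / pi,
    which assumes a logistic latent outcome."""
    return math.log(or_value) * math.sqrt(3) / math.pi

d = cohens_d(mean_treat=12.0, mean_ctrl=10.0, pooled_sd=5.0)        # 0.40
or_ = odds_ratio(events_treat=30, n_treat=100, events_ctrl=20, n_ctrl=100)
print(f"continuous-outcome d = {d:.2f}")
print(f"dichotomous-outcome OR = {or_:.2f}, converted d ~ {d_from_or(or_):.2f}")
```

The conversion step is exactly the kind of assumption that makes cross-study comparisons shaky: change the assumed latent distribution and the "same" effect size shifts.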