To the extent that those are hard to come by... Yeah! They are! Science is hard. Nobody promised this would be easy. Science shouldn't be something where labs crank out easy 3%/p=0.046 papers all the time just to keep the funding flowing. That's a waste of money and of our smartest people's time. It should be harder than it is now.
Too many proposals are obviously only going to be capable of turning up that result (insufficient statistical power is often obvious right in the proposal, if you take the time to work the math). I'd rather see more wood behind fewer arrows, and see fewer proposals chasing much more statistical power, than the chaff of garbage we get now.
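To make "work the math" concrete, here's a rough power calculation of the kind you can do straight from a proposal's numbers. This is just a sketch using the normal-approximation formula for a two-sample comparison; the effect sizes are illustrative, not from any real proposal.

    from scipy.stats import norm

    def n_per_group(effect_size_d, alpha=0.05, power=0.8):
        # normal-approximation sample size for a two-sample comparison
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        return 2 * (z_alpha + z_beta) ** 2 / effect_size_d ** 2

    print(n_per_group(0.2))  # a small effect: roughly 390 subjects per arm
    print(n_per_group(0.8))  # a large effect: roughly 25 per arm

If the proposal promises to chase a small effect with 30 subjects per arm, you can see the problem before any data is collected.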
If I were King of Science, or at least editor of a prestigious journal, I'd put the word out that I'm looking for papers with at least one of: a large effect of some sort, or a p value of something like p = 0.0001. Yeah. That's a high bar. I know. That's the point.
"But jerf, isn't it still valuable to map out all the little things like that?" No, it really isn't. We already have every reason in the world to believe the world is drenched in 1%/p=0.05 effects. "Everything's correlated to everything", so that's not some sort of amazing find, it's the totally expected output of living in our reality. Really, this sort of stuff is still just below the noise floor. Plus, the idea that we can remove such small, noisy confounding factors is just silly. We need to look for the things that stand out from that noise floor, not spending billions of dollars doing the equivalent of listening to our spirit guides communicate to us over white noise from the radio.
And study preregistration to avoid p-hacking and incentivize publishing negative results. And full availability of data, aka "open science".
I do agree though, negatives are just as important when the intent is to prove/disprove a meaningful hypothesis.
we tried using 0.11 mL, it didn't work
we tried using 0.12 mL, it didn't work
we tried using 0.13 mL, it didn't work
we tried using 0.10 mL, it didn't work
we tried using 0.11 mL, it didn't work
we tried using 0.13 mL, it didn't work
we tried using 0.15 mL, it didn't work
we tried using 0.17 mL, it didn't work
we tried using 0.16 mL, it didn't work
we tried using 0.18 mL, it didn't work
we tried using 0.20 mL, it didn't work
we tried using 0.14 mL, it didn't work
we tried using 0.12 mL, it worked so we published
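That dynamic is easy to simulate: run a series of experiments where nothing actually works, and publish whichever attempt clears p < 0.05. A minimal sketch (the volumes are just labels; both groups are pure noise):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    volumes = [0.11, 0.12, 0.13, 0.10, 0.14, 0.15, 0.16, 0.17, 0.18, 0.20, 0.12]
    for attempt, volume_ml in enumerate(volumes, start=1):
        treated = rng.normal(0, 1, 30)  # no true effect in either group
        control = rng.normal(0, 1, 30)
        p = ttest_ind(treated, control).pvalue
        if p < 0.05:
            print(f"attempt {attempt}: {volume_ml} mL 'worked' (p = {p:.3f}), publish!")
            break
    else:
        print("nothing 'worked' this run; with enough attempts something eventually will")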
E.g. if you search for eggs and cholesterol you should find all studies with their summarized results on whether eggs are ok or not for your cholesterol, grouped by researcher, so if somebody does 200 studies to find the one positive it's instantly visible.
Improving the quality of measurements and data could be a rewarding pursuit, and could encourage the development of better experimental technique. And a good data set, even if it doesn't lead to an immediate result, might be useful in the future when combined with data that looks at a problem from another angle.
Granted, this is a little bit self serving: I opted out of an academic career, partially because I had no good research ideas. But I love creating experiments and generating data! Fortunately I found a niche at a company that makes measurement equipment. I deal with the quality of data, and the problem of replication, all day every day.
One could make the case that in GWAS studies it has occurred, but not because small effect sizes are inconsequential; the statistical methods just weren't able to separate the grain from the chaff for a while.
An allele that is responsible for 2% of the variation in disease risk might seem inconsequential, but 25 of those together can serve as a polygenic risk score that can predict disease and target treatment.
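A toy version of that mechanic, just to show the arithmetic (the effect sizes and genotypes below are simulated, not real GWAS output): each allele explains almost nothing on its own, but a weighted sum over 25 of them can still stratify risk.

    import numpy as np

    rng = np.random.default_rng(1)
    n_people, n_snps = 1000, 25
    effect_sizes = rng.normal(0, 0.1, n_snps)            # small per-allele effects
    genotypes = rng.integers(0, 3, (n_people, n_snps))   # 0/1/2 copies of each risk allele
    prs = genotypes @ effect_sizes                        # polygenic risk score
    print("top-decile PRS cutoff:", np.quantile(prs, 0.9))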
Of course they're stupid. Everyone is stupid. That's why we have a "scientific method" and a formal discipline of logic to overcome fallacious reasoning and cognitive biases. If people weren't stupid we wouldn't need any of these disciplines to check our mistakes.
And yes, what you describe does happen all of the time. We literally just had a thread on HN about the failure of the amyloid hypothesis in Alzheimer's and the decades of work wasted on it. Many researchers are still trying to push it as a legitimate therapeutic target despite every clinical trial to date failing spectacularly. As Planck said, science advances one funeral at a time.
Which isn't to say that small effect sizes aren't legitimate research targets either, but if you're after a small effect size, the rigour should be scaled proportionally.
> A commonly cited example of this problem is the Physicians Health Study of aspirin to prevent myocardial infarction (MI).4 In more than 22 000 subjects over an average of 5 years, aspirin was associated with a reduction in MI (although not in overall cardiovascular mortality) that was highly statistically significant: P < .00001. The study was terminated early due to the conclusive evidence, and aspirin was recommended for general prevention. However, the effect size was very small: a risk difference of 0.77% with r2 = .001—an extremely small effect size. As a result of that study, many people were advised to take aspirin who would not experience benefit yet were also at risk for adverse effects. Further studies found even smaller effects, and the recommendation to use aspirin has since been modified.
Long-term aspirin use has its own risks, like GI bleeds, and the MI benefits clearly don't outweigh those risks.
> There was a 44 percent reduction in the risk of myocardial infarction (relative risk, 0.56; 95 percent confidence interval, 0.45 to 0.70; P<0.00001) in the aspirin group (254.8 per 100,000 per year as compared with 439.7 in the placebo group).
I agree if you said from the start you meant general incentives, especially in pharma development, but that is by and large a different conversation.
But I definitely agree it’d be nice to go back and show something is true to p=.0001 or whatever. Overwhelmingly solid evidence is truly a wonderful thing, and as you say, it’s really the only way to build a solid foundation.
When you engineer stuff, it needs to work 99.99-99.999% of the time or more. Otherwise you’re severely limited in how far your machine can go (in terms of complexity, levels of abstraction and organization) before it spends most of its time in a broken state.
I’ve been thinking about this while playing Factorio: so much of our discussion and mental modeling of automation works under the assumption of perfect reliability. If you had SLIGHTLY below 100% reliability in Factorio, the game would be a terrible grind limited to small factories. Likewise with mathematical proofs or computer transistors or self driving cars or any other kind of automation. The reliability needs to be insanely good. You need to add a bunch of nines to whatever you’re making.
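The arithmetic behind "add a bunch of nines", with made-up numbers: per-step reliability compounds multiplicatively, so a long chain of steps needs each step to be far better than merely "pretty reliable".

    steps = 1000  # e.g. machines in a production chain that all need to work
    for per_step in (0.99, 0.999, 0.9999):
        print(f"{per_step} per step -> {per_step ** steps:.3g} for the whole chain")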
A counterpoint to this is when you’re in an emergency and inaction means people die. In that case, you need to accept some uncertainty early on.
I'd argue you do have <100% reliability in Factorio, and much of the game is in increasing the 9s.
Biters can wreak havoc on your base. Miners contaminate your belts with the wrong types of ore if you weren't paying enough attention near overlapping fields. Misplaced inserters may mis-feed your assemblers, reducing efficiency or leaving outright nonfunctional buildings. Misclicks can cripple large swaths of your previously working factory, ruining plenty of speedruns if they go uncaught. For later-game megabase situations, you must deal with limited lifetimes as mining locations dry up, requiring you to overhaul existing systems with new routes of resources into them. As inputs are split and redirected, existing manufacturing can choke and sputter when it ends up starved of resources. Letting your power plants starve of fuel can result in a small crisis! Electric miners mining coal, refineries turning oil into solid fuel, electric inserters fueling the boilers, water pumps providing the water to said boilers - these things all take power, and jump-starting them after a power outage takes time you might not have if you're under active attack and your laser turrets are all offline as well.
But you have means of remediating much of this unreliability: emergency fuel and water stockpiles, configuring priorities so that fuel for power is prioritized ahead of your fancy new iron smelting setup, programmable alerts for when input stockpiles run low, ammo turrets that work without power, burner inserters on your power production's critical path that bootstrap themselves after an outage, roboports that replace biter-attacked defenses.
Your first smelting setup in Factorio will likely be a hand-fed burner miner and furnace, taking at most 50 coal. This will run out of power in minutes. Then you might use inserters to add a coal buffer. Then a belt of coal, so you don't need to constantly refill the coal buffer. Then a rail station, so you don't need to constantly hand-route entirely new coal and ore mining patches. Then you'll use blueprints and bots to automate much of constructing your new inputs. If you're really crazy, you'll experiment with automating the usage of those blueprints to build self-expanding bases...
The point is, it would be significantly more complex if things frequently failed even when "operating properly". And this happens at all levels of abstraction in a factory.
My printer might jam if I feed paper crooked or poorly. My assemblers might jam if I feed incorrect components through misclicks, misplaced miners, or filled outputs.
My printer might fail from the entropy of wear and tear. My assemblers might fail from the entropy of biters attracted by generated pollution.
My printer might stall from running out of paper or a filled output tray. My assemblers might stall from running out of inputs or a filled output belt or chest.
Why is the printer arguably unreliable, but the assembler "100% reliable"?
Failures of my printer are not caused by magic fairies sprinkling dice-rolling pixie dust on my toner cartridge. Failures have physical causes. That Factorio's assembler failures have modeled causes as well, instead of an arbitrary and magic dice roll, does not detract from those failure modes being reliability issues.
That my printer fails far less frequently than my Factorio assemblers points to my printer being more reliable than my Factorio assemblers. Your point that reliability could be even worse misses my point, which is merely that Factorio not only avoids the fiction of "100%" or "perfect reliability" - but that perhaps Factorio already models reliability as worse than "real life" in some respects.
I don't think it would be particularly bad for inserters to insert at slightly different speeds from each other, or to occasionally destroy the item they were supposed to insert. Same with components occasionally breaking on their own.
The original sin of the medical and social sciences is failing to recognize a distinction between exploratory research and confirmatory research and behave accordingly.
You only know whether it works when the study has been completed. You also only know whether the drug has (potentially) disastrous consequences when the study has been completed. Thus, I am not completely sure whether your claim holds.
The anti-aging serum could work (i.e. make you younger), but have strong negative side effects.
And no, it's not reasonable to assume I meant "work" as in, "have an anti-aging serum that has strong negative side effects."
So I'm making a guess here that you play with few monsters or non-aggressive monsters?
Aggressively building turret walls, defensive train lines, and so on very quickly pays dividends here. Particularly if you claim as much territory as you can each time you expand instead of simply defending what you've built out.
If done this way building/improving defenses and managing enemies becomes a task you maintain every so often and doesn't spill over into the reliability of your base.
That's not necessarily true in social sciences. When you're working with large survey datasets, many variables are significantly related. That doesn't mean these relationships are meaningful or causal, they could be due to underlying common causes, etc. (Maybe social sciences weren't included in "real science" - but there's where a lot of stats discussions focus)
The root problem here is that people tend to dichotomise what are fundamentally continuous hypothesis spaces. The correct question is not "is drug A better than drug B?", it's "how much better or worse is drug A compared to drug B?". And this is an error you can make in both Bayesian and frequentist lands, though culturally the Bayesians have a tendency to work directly with the underlying, continuous hypothesis space.
That said, there are sometimes external reasons why you have to dichotomise your hypothesis space. E.g. ethical reasons in medicine, since otherwise you can easily end up concluding that you should give half your patients drug A and the other half drug B, to minimise volatility of outcomes (this situation would occur when you're very uncertain which drug is better).
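For what it's worth, here's a minimal sketch of working in the continuous hypothesis space: rather than a yes/no verdict, you estimate the whole "how much better is A than B" quantity along with its uncertainty. The data is simulated; with roughly normal noise and a flat prior, the posterior for the mean difference is approximately normal around the observed difference.

    import numpy as np

    rng = np.random.default_rng(2)
    outcomes_a = rng.normal(1.2, 1.0, 200)  # simulated outcomes on drug A
    outcomes_b = rng.normal(1.0, 1.0, 200)  # simulated outcomes on drug B
    diff = outcomes_a.mean() - outcomes_b.mean()
    se = np.sqrt(outcomes_a.var(ddof=1) / 200 + outcomes_b.var(ddof=1) / 200)
    print(f"A - B is about {diff:.2f} +/- {1.96 * se:.2f}")  # an estimate, not a verdict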
With a high p value, you can say with some degree of certainty that your test was unable to detect any effect, whether that was due to the lack of an effect or because your test wasn't capable of measuring it.
With a low p value, you don't actually really know if you detected something interesting. It could be due to a flawed test, biases, non-causal correlations, faulty p-hacky stats, etc.
So why do we consider the latter more worthwhile when it seems to say less?
But that's comparing apples to oranges. Setting a reasonable prior is akin to frequentists interpreting the effect size (including its confidence interval) in light of deep domain knowledge. To produce a good analysis using either Bayesian or frequentist methodology (or to criticise such an analysis), you have to have deep domain knowledge. There's no getting around that, and arguably the use of p-values often lets you get away with shoddy domain knowledge.
> and Bayesianism has no way to exclude noise results at all.
This statement doesn't make any sense. Bayesian methodology has plenty of mechanisms for working with and controlling noisy data (obviously, since it's one of the two key paradigms in statistics, which as a field fundamentally deals with noisy data). The precise error rates and uncertainties that are calculated are usually different from what you would use in a frequentist analysis, but most people consider this a benefit of Bayesian analysis.
The whole problem we're facing is that it requires too much domain knowledge and detailed analysis to dismiss results that are actually just noise. The whole point of p-values is that they give you a way to do that without needing that complex analysis with deep domain knowledge - they're not a replacement for doing in-depth analysis, they're a way to cull the worst of the chaff before you do, the statistical-analysis equivalent of FizzBuzz. Bayesianism has no substitute for that (you can't say anything until you've defined your prior, which requires deep domain knowledge), and as such makes the problem much worse.
Well, you can use a non-informative prior. And that's the correct choice when you genuinely don't have a better option. But you should always be able to justify that, and that in turn requires deep domain knowledge....which leads me to....
> The whole problem we're facing is that it requires too much domain knowledge and detailed analysis to dismiss results that are actually just noise.
....this is in no way a "problem" that needs fixing, by allowing shortcuts that can easily be hacked. Rather, it's a factual statement about the difficulty of drawing correct conclusions, in low Signal-to-Noise-Ratio domains. Whether you use p-values or not, and whether you use Bayesian methodology or not, you cannot get around the need to understand the data you're working with. Bad p-values are worse than none, since you have no knowledge of what error rate they actually achieve in the long-run.
> Bayesianism has no substitute for that
Yes it does. It's called Bayes factors. But as I said above, I completely disagree with your view of what a p-value is for.
At which point you've just found a more cumbersome way to do frequentist statistics. Frequentist tools aren't inconsistent with Bayes' law (they can't be, since both are valid theorems) - indeed one could say that the whole project of frequentist statistics consists of building a well-understood suite of pre-baked priors and computations that are appropriate to situations that are commonly encountered.
> ....this is in no way a "problem" that needs fixing, by allowing shortcuts that can easily be hacked. Rather, it's a factual statement about the difficulty of drawing correct conclusions, in low Signal-to-Noise-Ratio domains. Whether you use p-values or not, and whether you use Bayesian methodology or not, you cannot get around the need to understand the data you're working with.
Well, the fact is there are too many small-sample studies being produced for all or even most of them to be critically analysed by people with deep understanding. And maybe the right fix for the problem is to give the right incentives for that kind of critical analysis (e.g. by allowing that kind of analysis to count as research for the purposes of journal publications and PhD theses just as much as "the original study" does, given that a study without that kind of critical analysis cannot truly be said to represent advancing human knowledge). But if you just tell people to do Bayesian analysis instead of frequentist analysis then that's not going to magically create deep understanding - rather people will try to replace shallow frequentist analysis with shallow Bayesian analysis, and shallow Bayesian analysis is a lot less effective and more hackable.
> Yes it does. It's called Bayes factors.
But you still need a prior to compute a Bayes factor.
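Exactly, and it matters. A quick sketch for a normal mean (all numbers made up) shows the Bayes factor moving around, and even flipping direction, as you vary the prior scale you put on the alternative:

    import numpy as np
    from scipy.stats import norm

    ybar, sigma, n = 0.3, 1.0, 50     # observed mean, known sd, sample size
    se = sigma / np.sqrt(n)
    for tau in (0.1, 0.5, 2.0):       # prior sd on the effect under H1
        m0 = norm.pdf(ybar, 0, se)                        # marginal likelihood under H0: mu = 0
        m1 = norm.pdf(ybar, 0, np.sqrt(tau**2 + se**2))   # marginal under H1: mu ~ N(0, tau^2)
        print(f"prior sd {tau}: BF10 = {m1 / m0:.2f}")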
Hmm, in one way, yes...but on the other hand, Bayesian posteriors are a lot more intuitive to interpret, for most people. So I think you trade one form of convenience for another. But as you sort of hint at, the results should usually be fairly similar, whether you're doing frequentist or Bayesian analysis. So in most cases, I doubt it matters that much. Where it does matter, is when you have grounds for strong priors, that you want to take advantage of. In such cases you can improve your chances of being correct in the "here and now", if you do a Bayesian analysis. Whereas a frequentist analysis is only concerned with the asymptotic error rates. (but of course frequentist vs Bayesian is also a ladder, rather than a black and white distinction)
> Well, the fact is there are too many small-sample studies being produced for all or even most of them to be critically analysed by people with deep understanding.
And this I totally agree with. If there's one thing I dislike about academia, it's the tendency to fund low-powered studies that get nowhere. Better to go all in, with sufficient support from experienced people, in fewer and bigger studies.
I completely agree with this - but it's exactly this dynamic that I think, at least in the current academic environment, does more harm than good. Effectively it normalizes publishing a result that's not strong enough to swamp the prior, but where you have some detailed situational argument for why a different prior should be used here. We already get every social science paper arguing that they should be allowed to use a 1-tailed t-test rather than 2-tailed because surely there's no possibility that their intervention would do more harm than good, and you need to get into the details of the paper to see why that's nonsense; letting them pick their own prior multiplies that kind of thing many times over.
I'm a big fan of sensitivity analysis in this context. Don't just pick one prior and call it a day, but show the effect of having liberal vs conservative priors, and discuss that in light of the domain knowledge. That gives the next researcher a much better foundation than a single prior, or a p-value, ever could.
Unfortunately, if it was a non-trivial paper to begin with, it now just turned into a whole book.
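It doesn't have to turn into a whole book, though. Even a closed-form conjugate sketch gets the sensitivity idea across (illustrative numbers, normal-normal model so the posterior update is a one-liner):

    import numpy as np

    ybar, se = 0.3, 0.15  # observed effect and its standard error (illustrative)
    for label, prior_mean, prior_sd in [("conservative", 0.0, 0.05),
                                        ("liberal", 0.0, 0.5)]:
        post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
        post_mean = post_var * (prior_mean / prior_sd**2 + ybar / se**2)
        print(f"{label} prior: posterior mean {post_mean:.2f}, sd {np.sqrt(post_var):.2f}")

The conservative prior shrinks the estimate heavily, the liberal one barely moves it; showing both side by side is the whole point of the exercise.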
Now what’s the count? Wait, what’s the likelihood it misclassified a ball? How accurate are those estimates, and the estimates of those estimates ...
For a real-world example, someone using Bayesian reasoning when counting cards should consider the possibility that the deck doesn’t have the correct cards, and the possibility that the deck’s cards have been changed over the course of the game.
"Here’s the experiment and here’s the data" is concrete; it may be bogus, but it’s information. Updating probabilities based on recursive estimates of probabilities is largely restating your assumptions. Black swans can really throw a wrench into things.
Plenty of downvotes and comments, but nothing addressing the point of the argument might suggest something.
This is called modelling error. Both Bayesian and frequentist approaches suffer from modelling error. That's what TFA talks about when mentioning the normality assumptions behind the paper's GLM. Moreover, if errors are additive, certain distributions combine together easily algebraically meaning it's easy to "marginalize" over them as a single error term. In most GLMs, there's a normally distributed error term meant to marginalize over multiple i.i.d normally distributed error terms.
> Plenty of downvotes and comments, but nothing addressing the point of the argument might suggest something.
I don't understand the point of your argument. Please clarify it.
> "Here’s the experiment and here’s the data" is concrete; it may be bogus, but it’s information. Updating probabilities based on recursive estimates of probabilities is largely restating your assumptions.
What does this mean, concretely? Run me through an example of the problem you're bringing up. Are you saying that posterior-predictive distributions are "bogus" because they're based on prior distributions? Why? They're just based on the application of Bayes Law.
> Black swans can really throw a wrench into things
A "black swan" as Taleb states is a tail event, and this sort of analysis is definitely performed (see: https://en.wikipedia.org/wiki/Extreme_value_theory). In the case of Bayesian stats, you're specifically calculating the entire posterior distribution of the data. Tail events are visible in the tails of the posterior predictive distribution (and thus calculable) and should be able to tell you what the consequences are for a misprediction.
My point is this: you can’t combine them using Bayesian statistics while adjusting for the possibility of research fraud; it’s simply not in the data.
They’re great for well-understood domains, less so for research. Frequentist models don’t work, but they also don’t even try.
PS: Math errors don’t really fall into modeling error.
In re the arguey-person you were responding to: frequentist modeling is just as bad or worse for these sorts of situations.
Your investigation isn’t limited to the data provided by them; it’s going to look for more information beyond the paper. This isn’t a failure of frequentist models, because they evaluate the study and its output separately.
If you use a model that doesn't ask you to think about this likelihood at all, you will get the same result as if you had used bayes and consciously chose to approximate the likelihood of misclassification as zero.
You may get slightly better results if you have a reasonable estimate of that probability, but you will get no worse if you just tell Bayes zero.
It feels like you're criticizing the model for asking hard questions.
I feel like explicitly not knowing an answer is always a small step ahead of not considering the question.
As much as people complain about frequentist approaches, examining the experiment independently from the output of the experiment effectively limits contamination.
What's missing in my mind is admitting that results were negative. I'm reading up on financial literacy, and many studies end with some metrics being "great" at p < 5%, but then some other metrics are also "great" at p < 10%, without the author ever explaining what they would have classified as bad. They're just reported without explanation of what significance they would expect (in their field).
I agree with what you're saying, but I don't understand this phrase.
I do have a few favorites. "COVID tests give you COVID, so I won't go get tested" is certainly up there. I can't say I give two figs about your opinion on the Earth's topology, but this one is a public health problem, that's crippling hospitals around the country.
Either way it’s dangerous.
We have found most of them, and all the easy ones. Today the interesting things are near the noise floor. 3000 years ago atoms were well below the noise floor, now we know a lot about them - most of it seems useless in daily life yet a large part of the things we use daily depend on our knowledge of the atom.
Science needs to keep separating things from the noise floor. Some of them become important once we understand it.
Bear in mind that my criteria are two-dimensional, and I'll accept either. By all means, go back and establish your 3% effect to a p-value of 0.0001. Or 0.000000001. That makes that 3% much more interesting and useful.
It'll especially be interesting and valuable when you fail to do so.
But we do not, generally, do that. We just keep piling up small effects with small p-values and thinking we're getting somewhere.
Further, if there is a branch of some "science" that we've exhausted so thoroughly that we can't find anything that isn't a 3%/p=0.047 effect anymore... pack it in, we're done here. Move on.
However, part of the reason I so blithely say that is that I suspect if we did in fact raise the standards as I propose here, it would realign incentives such that more sciences would start finding more useful results. I suspect, for instance, that a great deal of the soft sciences probably could find some much more significant results if they studied larger groups of people. Or spent more time creating theories that aren't about whether priming people with some sensitive word makes them 3% more racist for the next twelve minutes, or some other thing that even if true really isn't that interesting or useful as a building block for future work.
A salt crystal (a lattice of NaCl) is nothing like a pure gold nugget (a clump of Au atoms).
That difference is a massive effect.
So to begin with, we have this sort of massive effect which requires an explanation, which is where atoms then come in.
Maybe the right language here is not that we need an effect rather than statistical significance, but that we need a clear, unmistakable phenomenon. There has to be a phenomenon, which is then explained by research. Research cannot be inventing the phenomenon by whiffing at the faint fumes of statistical significance.
The noise floor is not static. A major theoretical advance spurs an advance in instrumentation, which then supports more science. The hypothesis space is usually much larger than the data space, making the bottleneck theory, not data. The "end of progress" has been lamented again and again since before Galileo, only to be upended by a paradigm shifting theory that paved the way for lots of new science. Many of these theories were developed long after the data and instruments were available, and were produced with relatively simple data: Young's double slit experiment, Mendelian genetics, the photoelectric effect, Brownian motion, most of classical mechanics, quantum teleportation, BOLD MRI, etc.
This has been proposed, albeit for a threshold of p < 0.005.
Here's Andy Gelman and others arguing otherwise. They also got like 800 scientists to sign on to the general idea of no longer using statistical significance at all.
Understanding that effect size is as important as significance can manifest by requiring effect size or variance explained to be reported every time the result of a statistical test is presented, e.g. rather than simply "a significant increase was observed (p = 0.01)", and also by making that kind of parsing the standard in scientific journalism.
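A small sketch of what that reporting convention looks like in practice (simulated data: a tiny true effect and a big sample, so the test comes out "significant" while the standardized effect stays small):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(3)
    a = rng.normal(0.05, 1, 20_000)  # tiny true effect, large sample
    b = rng.normal(0.00, 1, 20_000)
    t, p = ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    cohens_d = (a.mean() - b.mean()) / pooled_sd
    print(f"p = {p:.2e}, Cohen's d = {cohens_d:.3f}")  # report both, not just p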
As an aside, could you also please make medicine a real science, so I can finally scientifically demonstrate that my boss is wrong?
Gwern's page "Everything Is Correlated" is worth reading: https://www.gwern.net/Everything
Ernest Rutherford is famously quoted proclaiming “If your experiment needs statistics, you ought to have done a better experiment.”
“Of course, there is an existential problem arguing for large effect sizes. If most effect sizes are small or zero, then most interventions are useless. And this forces us scientists to confront our cosmic impotence, which remains a humbling and frustrating experience.”
That is not to say that hypercapitalism is the problem here. I think any competitive system even under socialism would have the exact same problem. Basically there are too many voices, and the ones winning are often cheating with bad statistics.
If you're working with very large datasets generated from e.g. a huge number of interactions between users and your system, whether as a correlation after the fact, or as an A/B experiment, getting a statistically significant result is easy. Getting a meaningful improvement is rarer, and gets harder after a system has received a fair amount of work.
But then people who work in these big-data contexts can read about a result outside their field (e.g. nutrition, psychology, whatever), where n=200 undergrads or something, and p=0.03 (yay!) and there's some pretty modest effect, and be taken in by whatever claim is being made.
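The scale effect is stark if you just write it down: hold a trivial effect fixed (1% of a standard deviation here, purely illustrative) and watch the p-value collapse as n grows.

    from scipy.stats import norm

    effect_in_sd_units = 0.01
    for n in (10_000, 100_000, 1_000_000, 10_000_000):
        z = effect_in_sd_units / (2 / n) ** 0.5  # z statistic for a two-sample comparison
        print(f"n = {n:>10,}: p ~ {2 * norm.sf(z):.1e}")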
Either that or stop rewarding such bad behavior. Science jobs are highly competitive, so why not exclude people with weak statistics? Maybe because weak statistics leads to more spurious exciting publications which makes the researcher and institution look better?
This is sounding like a great startup idea for a new scientific journal, actually.
Of course this is somewhat a necessary consequence of having academic freedom.
1. Who will pay for them?
2. How do we make staff scientist roles attractive to people who could also get tenure-track faculty positions or do ML/data science in the industry?
3. How do we ensure that a staff scientist position is not a career dead end if the funding dries up after a decade or two?
The standard academic incentives (long-term stability provided by tenure, freedom to work on whatever you find interesting, recognition among other experts in the field) don't really apply to support roles.
Also, most published research is inconsequential, so it really does not matter other than the money spent (and that is not only related to findings but also to keeping people employed etc.). If confidence in results is truly an objective, we might need to link it directly to personal income or loss of income, i.e. force bets on it.
For example, smoking was finally proved to cause lung cancer because the effect size was so large that the argument that 'correlation does not imply causation' became absurd: it would have required the existence of a genetic or other common cause Z that both causes people to smoke and causes them to develop cancer with correlations at least as large as between smoking and lung cancer, but there just isn't anything correlated that strongly. It would imply that almost everyone who smokes heavily does so because of Z.
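One way to put a number on that argument (a hedged sketch; the relative risk below is illustrative, not a quoted figure) is the E-value of VanderWeele & Ding, which says how strongly an unmeasured confounder Z would have to be associated with both smoking and lung cancer to fully explain the observed relative risk:

    import math

    rr_observed = 10.0  # illustrative relative risk for heavy smoking vs none
    e_value = rr_observed + math.sqrt(rr_observed * (rr_observed - 1))
    print(f"a confounder would need RR >= {e_value:.1f} with both smoking and cancer")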
Ok, but by how much?
You get approximately the same outcome if:
(a) masks are 100% effective but only 10% wear them, and
(b) masks are 10% effective and 100% wear them.
Is this study showing (a) or (b)?
Let us assume (b) masks only help by 10% and R0 is 2 without masks. If exponential transmission is occurring then in ~11.5 days you have the same number infected with masks as in 10 days without masks.
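Back-of-the-envelope for that (my own sketch; the exact figure depends on how you model the growth): if cases multiply by R each generation, the time to reach the same cumulative count scales by log(R_no_mask)/log(R_mask), which gives roughly the figure above.

    import math

    r_no_mask = 2.0
    r_mask = 2.0 * 0.9        # "masks only help by 10%"
    days_no_mask = 10
    scale = math.log(r_no_mask) / math.log(r_mask)
    print(f"~{days_no_mask * scale:.1f} days with masks to match {days_no_mask} days without")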
Either way the study has ended up with a 10% figure, and that figure gets misunderstood or intentionally misrepresented. If you want to argue for the effectiveness of masks against those that don’t wish to wear them, then personally I think it is a terrible study to argue with because 10% sounds shitty.
Actual numbers depend on a heap of other things, but just assume those figures are right for the sake of making things easy to understand.
Disclaimer: I wear a mask during Level 2 lockdown in the South Island of New Zealand, and mask wearing has no partisan meaning here AFAIK.
I wear a mask all the time and am happy to but I agree this study, while solid in some respects, is not exactly overwhelming in making a compelling argument for masks.
Low effect sizes are often a code smell for scientific incrementalism/stagnation.
Dave Freedman's Statistical Models and Shoe Leather is a good read on why such formulaic application of statistical modeling is bound to fail.
But this doesn't necessarily follow, does it? If there really were a 1.1-fold reduction in risk due to mask-wearing it could still be beneficial to encourage it. The salient issue (taking up most of the piece) seems to be not the size of the effect but rather the statistical methodology the authors employed to measure that size. The p-value isn't meaningful in the face of an incorrect model -- why isn't the answer a better model rather than just giving up?
Small effects are everywhere. Sure, it's harder to disentangle them, but they're still often worth knowing.
That's understating it. The study doesn't measure the reduction in risk due to mask-wearing, but rather the reduction simply from encouraging mask-wearing (which only increases actual mask wearing by a limited amount). If the study's results hold up statistically, then they're really impressive. With the caveat, of course, that they apply to older variants with lower viral loads than Delta - it's likely Delta gets past masks more easily simply due to its viral load.
> The salient issue (taking up most of the piece) seems to be not the size of the effect but rather the statistical methodology the authors employed to measure that size. The p-value isn't meaningful in the face of an incorrect model -- why isn't the answer a better model rather than just giving up?
Exactly. The irony of this article is that this is an example where effect size is actually not the issue - it's potential issues with statistical significance due to imperfect modeling, and an inability for other researchers to rerun an analysis on statistical significance, due to not publishing the raw data.
The article itself makes some better points, e.g.
> I worry that because of statistical ambiguity, there’s not much that can be deduced at all.
, which would seem like a reasonable interpretation of the study that the article discusses.
However, the title alone seems to assert a general claim about statistical interpretation that'd seem potentially harmful to the community. Specifically, it'd seem pretty bad for someone to see the title and internalize a notion of effect-size being more important than statistical significance.
If you bought just ten tickets you would have a p value below 0.0000001
And that makes sense, because a p value of 0.0000001 says the probability of getting a sample this far from the null hypothesis is less than 1 in 10 million by random chance... which is what happened when you got the extremely unlikely but highly profitable answer.
edit: post was edited making this seem out of context...
That's because masks act on R0, not seroprevalence. After acting on R0: if R0 is >1, exponential growth; if <1, exponential decay. So no effect, unless it is the thing that pushes one from >1 to <1.
> The intervention increased proper mask-wearing from 13.3% in control villages (N=806,547 observations) to 42.3% in treatment villages (N=797,715 observations)
See the extract below from the NEJM:
Seasonal Malaria Vaccination with or without Seasonal Malaria Chemoprevention
"The hazard ratio for the protective efficacy of RTS,S/AS01E as compared with chemoprevention was 0.92 (95% confidence interval [CI], 0.84 to 1.01), which excluded the prespecified noninferiority margin of 1.20.
The protective efficacy of the combination as compared with chemoprevention alone was 62.8% (95% CI, 58.4 to 66.8) against clinical malaria, 70.5% (95% CI, 41.9 to 85.0) against hospital admission with severe malaria according to the World Health Organization definition, and 72.9% (95% CI, 2.9 to 92.4) against death from malaria.
The protective efficacy of the combination as compared with the vaccine alone against these outcomes was 59.6% (95% CI, 54.7 to 64.0), 70.6% (95% CI, 42.3 to 85.0), and 75.3% (95% CI, 12.5 to 93.0), respectively."