This counter-paper reads to me like spitefulness couched in scientific terms. Wagenmakers is, of course, on the very comfortable ground of defending the status quo from someone who is in all probability wrong.
But to be fair, we should take things on their merits. Contrasting the two papers, Bem's paper sticks very closely to the standard form and 'rules of engagement' (if you will) of published academic research. This counter-paper, on the other hand, has only the surface appearance of proper scientific method. For example, in making one of its main points it refers to a hypothetical casino where Bem could've made an 'infinite' amount of money, and to the one-million-dollar Randi prize, neither of which reaches the standard of proper experimental scientific evidence. It does this to justify one of its main points: that because 'extraordinary claims require extraordinary evidence', we should be able to set the prior expectation of H1 - in their words 'for illustrative purposes' - to .00000000000000000001, which, they then demonstrate, makes the results non-significant.
But where does the 0.00000000000000000001 come from? It could just as easily be twice or half that figure. It's not falsifiable, and therefore not justifiable as an extra bit of arithmetic that Bem's paper must pass.
To put this in terms this audience will understand, that's kind of like saying: well no, because you are using Java and 'everybody knows' Java is slow, I think we should multiply your benchmark figures by, oh, let's say, one half. And then we see that Java throughput is quite poor, as expected. In fact, we should multiply all Java benchmarks by some number like a half, but I'm not going to be specific about it because it's really just an arbitrary number I made up. So, reading past the reference to Bayes and some nice formulas, that's just arithmetic in my book.
Note, I'm not saying that there are no flaws in the Bem paper. Everybody can see that it's very likely there'll be something wrong with it (and the file-drawer problem, to which the above paper also makes extensive reference, is a likely though not conclusive contender), but I think it's only reasonable to hold the criticisms to the same standard as what they are criticizing. Perhaps that way you'll be more likely to find the actual truth of the matter.
Contrasting the two papers, Bem's paper sticks very closely to the standard form and 'rules of engagement' (if you will) of published academic research.
I think you're holding this response to a much higher burden of proof than it needs to meet to be a proper refutation of Bem's claims. You're right that the response doesn't appear to use the "proper scientific method". But that's because it doesn't use it at all, and it doesn't need to. There's no hypothesis to test and no experiment to run in order to point out flaws in a paper that does claim to be the result of the scientific method.
Just reading it now -- seems like a very thorough takedown of the whole thing, in fairly non-technical language.
It makes the very good point that this paper runs lots of statistical tests, and then bases big claims on the small minority that showed a significant effect. This is in no way restricted to psychology; drug companies, for example, do this all the time. It's cheating, whether you realize it or not.
Statistical significance only tells you that a result is unlikely to be a fluke -- not that it definitely isn't a fluke -- but the more tests you do, the sooner you'll see a fluke on average.
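To make that concrete, here's a minimal simulation sketch (my own toy example, not from the paper; it assumes p-values are uniform under the null hypothesis, which is what a correctly calibrated test gives you):

    # Run many independent significance tests on pure noise and count how often
    # at least one of them comes out "significant" at p < 0.05.
    import random

    def fraction_with_false_positive(num_tests, alpha=0.05, reps=10_000):
        """Fraction of repetitions in which at least one of num_tests
        null-only tests produces p < alpha purely by chance."""
        hits = 0
        for _ in range(reps):
            # Under the null hypothesis, each p-value is uniform on [0, 1].
            if any(random.random() < alpha for _ in range(num_tests)):
                hits += 1
        return hits / reps

    for n in (1, 5, 10, 20, 50):
        print(n, round(fraction_with_false_positive(n), 2))
    # Roughly 0.05, 0.23, 0.40, 0.64, 0.92 -- the more tests, the sooner a fluke.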
the more tests you do, the sooner you'll see a fluke on average
In other words, if you toss a coin 1,000 times, then it's hideously unlikely that you'll see a run of 100 consecutive heads. But if you toss the coin 100,000,000 times, you shouldn't be too surprised to see that 100-toss run buried in there somewhere, even though the odds of getting 100 in a row are so small.
If you toss 100,000,000 ≈ 2^27 coins, you should only expect a longest run of around 27 heads. To have a good chance of getting 100 in a row, you need roughly 2^100 tosses -- about 10^22 times more.
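A quick way to sanity-check that (my own sketch; the rule of thumb is that the longest run of heads in n fair tosses is typically around log2(n)):

    import math, random

    def longest_head_run(n):
        best = run = 0
        for _ in range(n):
            run = run + 1 if random.random() < 0.5 else 0
            best = max(best, run)
        return best

    n = 10**6  # 10**8 works too, it just takes a while
    print("log2(n) =", round(math.log2(n), 1), " longest run seen =", longest_head_run(n))
    # For a decent chance of a 100-heads run you need on the order of 2**100
    # (about 1.3e30) tosses, i.e. roughly 10**22 times more than 10**8.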
And the problem is MUCH worse than described above: let's say you test 1000 wrong hypotheses with p=0.05; 50 of those will be accepted as true, even though all are wrong. If you test 980 wrong hypotheses and 20 right ones, more than half of those that pass the p=0.05 "golden" significance test will in fact be wrong.
Now, when you see a medical journal with 20 articles using p=0.05, which do you think is more probable - that 19 are right and one is wrong, or 19 are wrong and one is right? The latter has a much higher likelihood.
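The arithmetic behind that, as a tiny sketch (the power of 1.0 for the true hypotheses is an optimistic simplification of mine):

    def wrong_fraction_of_significant_results(n_false, n_true, alpha=0.05, power=1.0):
        false_positives = n_false * alpha   # expected wrong hypotheses that still pass
        true_positives = n_true * power     # right hypotheses that pass (optimistically, all)
        return false_positives / (false_positives + true_positives)

    print(wrong_fraction_of_significant_results(1000, 0))   # 1.0: every "significant" result is wrong
    print(wrong_fraction_of_significant_results(980, 20))   # ~0.71: more than half are wrong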
Clinical researchers too. Because lives are at stake.
The whole field of systematic reviews and meta-analyses has developed around the need to aggregate results from multiple studies of the same disease or treatment, because you can't just trust one isolated result -- it's probably wrong.
Statisticians working in EBM have developed techniques for detecting the 'file-drawer problem' of unpublished negative studies, and correcting for multiple tests (data-dredging). Other fields have a lot to learn...
Clinical researchers working for non-profits / universities do, occasionally. I suspect it has become popular recently not because lives are at stake, but because it lets you publish something meaningful without having to run complex, error prone and lengthy experiments.
Regardless of the true reason, these are never carried out before a new drug or treatment is approved (because there are usually only one or two studies supporting said treatment, both positive).
And if you have pointers to techniques developed for/by EBM practitioners, I would be grateful. Being a Bayesian guy myself and having spent some time reading Lancet, NEJM and BMJ papers, I'm so far unimpressed, to say the least.
Ugh, reminds me of the undergrad psych research I participated in. When your original hypothesis doesn't turn out, just run correlations on your data until you find something to write about. Publish or perish, right?
You've got it. The other half of the problem is that there's a chance that when you start flipping coins the first 100 flips will all turn up heads. Now, does this mean that the universe is bent and has begun treating Mr. Lincoln's head differently from his backside? Does this mean your testing apparatus is biased? Is this an inherent property of coins? If you stop flipping at 100, it'd be very tempting to conclude that this is the case.
The only way to find out is to do enough flips to eliminate the chances of your final result being influenced by statistical flukes. Measuring small differences, like trying to answer "does a coin preferentially land on one side vs the other?" usually takes hundreds of thousands of tests to guarantee you're seeing objective data, rather than seeing patterns in noise.
Actually, you ought to have selected the number of tests, n, beforehand, rather than see the fluke and then, after the fact, "continue testing" until it goes away.
The very moment you peek, your data is tainted for any future testing.
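To see how badly peeking hurts, here's a quick simulation sketch (my own toy numbers, using a normal approximation to the binomial rather than an exact test): flip a fair coin, check significance every 20 flips, and stop the moment it looks "significant".

    import math, random

    def peeking_experiment(max_flips=500, check_every=20, z_crit=1.96):
        heads = 0
        for i in range(1, max_flips + 1):
            heads += random.random() < 0.5
            if i % check_every == 0:
                z = (heads - i / 2) / math.sqrt(i / 4)   # normal approx to the binomial
                if abs(z) > z_crit:                      # roughly "p < 0.05", two-sided
                    return True                          # stop early, declare an effect
        return False

    reps = 5000
    rate = sum(peeking_experiment() for _ in range(reps)) / reps
    print(f"false-positive rate with peeking: {rate:.2f}")   # well above the nominal 0.05

Even though every simulated coin is fair, the stop-when-significant rule "finds" an effect far more often than 5% of the time.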
So far the best line in this paper is:
"Returning to Laplace’s Principle, we should obviously assign our prior belief in precognition
a number very close to zero, perhaps slightly larger than the probability of, say,
goldfish being able to talk"
You touch upon a very important point. For some reason we are very bad at Bayesian reasoning. Sometimes it leads to unsubstantiated opinions http://news.ycombinator.com/item?id=1909576 but it could be very damaging if faulty reasoning is used by a jury or a judge...."the person can be guilty or not guilty so the probabilities are 50:50".
I would be a lot more relieved if people with power over other people's lives grokked Bayes' rule and priors.
Please explain the proper way to form priors, then. 50:50 is widely used, as is "0 for obviously wrong stuff". The author here suggested "something sufficiently close to 0", which to me is indistinguishable from the second one. Should an accused's guilt prior be based on the jury's own guilt, the number of crimes they've heard about recently, or the judge's conviction rate? Or maybe the accused's socio-economic class?
Bayes's rule doesn't help with the point that suggestive evidence is not convincing evidence. It just points out that prior beliefs are part of the equation, but will hopefully pale in comparison to actual data. In fact, I was taught to set hyperparameters so weak as to be practically uninformative, to ensure that they do. No one does that outside of an experiment.
Let's say I believe (I don't) the height of pygmies is normally distributed, where the mean is also normally distributed with mean 130cm and standard deviation 10cm, and the standard deviation is inverse gamma distributed with shape 7cm and scale 1cm. Assuming the height is actually normally distributed with mean 160cm and sd 15cm (it isn't), how many pygmies must I measure to admit that P(height>160cm)>20%? I'm not sure I can even do the math.
Here P=50% for the unknowable accurate model and P=0.13% for the prior model. How does the situation change when my prior is "sufficiently close to 0"?
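For what it's worth, the prior in that example can be checked by brute force rather than by hand. A Monte Carlo sketch (my own code, using the numbers from the comment above) reproduces the 0.13%:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 1_000_000
    mu = rng.normal(130, 10, n)                                        # prior on the mean
    sigma = stats.invgamma.rvs(a=7, scale=1, size=n, random_state=1)   # prior on the sd
    height = rng.normal(mu, sigma)                                     # prior predictive draws

    print(f"prior P(height > 160cm) ~ {np.mean(height > 160):.4f}")              # ~0.0013
    print(f"'true' P(height > 160cm) = {1 - stats.norm.cdf(160, 160, 15):.2f}")  # 0.50

How many observations it takes to move the posterior predictive above 20% is exactly the kind of question this sort of simulation can answer without doing the algebra.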
Selecting proper priors is quite a contentious issue, mainly because there does not seem to be one perfect answer, though there are a few guidelines one can follow. One of them is what you pointed out: building up a tower of hyperparameters. Hyperparameters have the same feeling as "turtles all the way down", but aren't so bad if you have sufficient observations; one can prove that any bias from the priors will disappear in the limit. But for one-off decisions that is not very useful.
In the legal-case example, maybe some clarity may be had by considering what the prior means. One answer is: say you have to bet a million dollars on whether the person is guilty or not, without knowing anything about the person; how would you distribute your million dollars between the two events? Yes, it is subjective and personal, but it is hardly ever going to be 50:50. One can push the $1,000,000 analogy further. One can fix a cost for a mistake: what's the cost of a wrong conviction, and what's the cost of setting a guilty man free? Then the final decision can be based on minimizing the financial risk, weighted by the likelihoods.
One may bring socio-economic status into forming the priors, but one may not consider any information source that is specific to the accused.
Replying again as I missed out a vital piece. Bayesian reasoning is an online process, so after every decision one has to update the prior. The next time one uses the reasoning engine, one should work with the most recent prior. An alternative but equivalent way of stating the same thing is that one should look at the entire past to form the valid prior for that instant.
Let's take the example. There is a one-to-one correspondence between fictitious counts and priors. One way of encoding a 50:50 prior is to construct a possibly fictitious but representative past of (say) 2000 samples, split into 1000 guilty and 1000 not-guilty. After each prediction, and assuming that the truth gets known, one updates the counts appropriately, so that the next time we use a different prior.
Our initial prior may be wrong but it will approach the correct one asymptotically. But how fast it approaches the true prior depends on how wrong our initial prior was.
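A minimal sketch of that fictitious-counts scheme (my own illustration, essentially a Beta-Bernoulli update with the 1000/1000 pseudo-counts described above):

    class GuiltPrior:
        def __init__(self, guilty=1000, not_guilty=1000):
            # pseudo-counts encoding the initial 50:50 prior
            self.guilty = guilty
            self.not_guilty = not_guilty

        def p_guilty(self):
            return self.guilty / (self.guilty + self.not_guilty)

        def update(self, was_guilty):
            # once the truth of a case is known, fold it into the counts
            if was_guilty:
                self.guilty += 1
            else:
                self.not_guilty += 1

    prior = GuiltPrior()
    print(prior.p_guilty())                        # 0.5 before seeing any real cases
    for outcome in [False, False, True, False]:    # hypothetical resolved cases
        prior.update(outcome)
    print(round(prior.p_guilty(), 4))              # drifts (slowly) toward the observed rate

With 2000 pseudo-counts the drift is slow, which is the point about convergence speed depending on how wrong the initial prior was.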
That was a refreshing read, thanks for posting it. It took me a moment to understand the important difference between exploratory and confirmatory experiments - that you can find patterns in anything if you look hard enough, but that doesn't necessarily mean anything until you repeat the process but look for the specific pattern you noticed previously. Reminded me of the Bible Code theory from the 90's (http://en.wikipedia.org/wiki/Bible_code)
>It took me a moment to understand the important difference between exploratory and confirmatory experiments - that you can find patterns in anything if you look hard enough, but that doesn't necessarily mean anything until you repeat the process but look for the specific pattern you noticed previously
It's when the hypothesis predicts a pattern that you haven't noticed yet and the pattern is confirmed by experiment, that's when you know you have something [a theory].
"Extraordinary claims require extraordinary evidence" -- and I wouldn't count these small studies as extraordinary.
Unfortunately, publishing these kinds of claims prematurely helps the more gullible among us to fall for ridiculous claims from psychics and others who would take advantage of them. (The authors of "The Secret", I'm looking at you.)
Commenters on NPR's website (not exactly the dumbest audience online) have already shown this problem; "All of you criticizing this need to open up your minds" and "The future, as well as the past, influence our dreams."
Unfortunately, publishing these kinds of claims prematurely helps the more gullible among us to fall for ridiculous claims from psychics and others who would take advantage of them.
True. But on the other hand, publishing ridiculous claims and incorrect results is a necessary part of science.
When we publish only results we know to be correct, because they agree with mainstream beliefs, we introduce a bias into the scientific process. In reality, if you publish 20 experiments with p=0.05 [1], about 1 of them should be incorrect. If fewer than 1 in 20 of your papers turn out to be wrong (assuming p=0.05 is the gold standard), you are not doing science.
You can see a perfect illustration of this when people tried to reproduce Millikan's oil drop experiment. I'll quote Feynman: Millikan measured the charge on an electron...got an answer which we now know not to be quite right...It's interesting to look at the history of measurements of the charge of an electron, after Millikan. If you plot them as a function of time, you find that one is a little bit bigger than Millikan's, and the next one's a little bit bigger than that, and the next one's a little bit bigger than that, until finally they settle down to a number which is higher.
Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of - this history - because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong - and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that...
This is why I'm an advocate of accepting/rejecting scientific papers based solely on methodology, with referees being given no information about the conclusions and with authors being forbidden from post-hoc tweaks. You do your experiment, and if you disagree with Millikan/conclude that ESP exists, so be it. Everyone is allowed to be wrong 5% of the time.
[1] I'm wearing my frequentist hat for the purposes of this post. Even if you are a Bayesian, you should still publish, however.
If you're going to use highly subjective frequentist statistics at all, p < 0.001 should be the minimum gold standard for extraordinary claims. If the phenomenon is real, and not bad statistics, it only requires two and a half times as many subjects to get p < 0.001 instead of p < 0.05. Physicists, who don't want to have to put up with this crap, use p < 0.0001. p < 0.05 is asking for trouble.
A complication is that if the effect were real, all our ideas of prior vs. posterior probability would need re-thinking. The hypothesis is that humans can be influenced by posterior events. That includes the experimenters.
Ok, then let's fund psychology like we fund physics. I would love to run 1000+ patient studies to test psychotherapies, and in fact we'd be able to answer some really interesting questions if we did, but there is currently no way of doing this.
I repeat, you do not need 1000 times as many subjects to get results that are 1000 times as significant! If 40 subjects gets you results with p < 0.05, then 100 subjects should get you results with p < 0.001. Doing half as many experiments and having nearly all the published results being real effects, instead of most of them failing to replicate when tested, sounds like a great tradeoff to me.
And I suspect the ultimate reason it's not done this way... is that scientists in certain fields would publish a lot fewer papers, not slightly fewer but a lot fewer, if all the effects they were studying had to be real.
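Back-of-the-envelope check of the 40 -> ~100 claim above (a crude sketch of mine that treats it as a simple two-sided z-test and ignores power, so the exact ratio will differ a bit from the original reasoning):

    from scipy.stats import norm

    z_05 = norm.ppf(1 - 0.05 / 2)     # ~1.96
    z_001 = norm.ppf(1 - 0.001 / 2)   # ~3.29
    ratio = (z_001 / z_05) ** 2       # ~2.8, since required n scales with z^2
    print(f"need ~{ratio:.1f}x the subjects: 40 -> ~{40 * ratio:.0f}")   # ~113, same ballpark as 100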
Yes - thanks. The current norm for a 'suitably powered' trial of a psychotherapy is about 300. We've just got a trial funded for that number (admittedly in a challenging patient population) which will cost about £2.5m in research and treatment costs. We would love to run 1000 patients and start looking at therapist-client interactions, individual differences in treatment suitability but that's out of the question.
That's a cheap shot. Our trial will publish a detailed protocol and analysis plan, as do most large, publicly funded trials. Small-scale experimental work is a different matter. I personally agree that all experiments which could end up in peer reviewed journals should be registered before participants are recruited.
This would be simple to do by submitting ethics applications and an analysis plan to a trusted third party which would only release them once the author is in a position to publish, or at a pre-agreed cutoff (perhaps 2 years), whichever is the shorter (to avoid scooping). Perhaps I should set something up...
Having moved from physics to biology, I am amazed with the difference in what the consensus of 'significant' is. Some of the difference is due to necessity, but not all.
When some people find that their model doesn't quite fit, they make a more accurate model. Others make a less specific model. It's the difference between model parametrization and model selection.
So when we get a dubious result, we can either say "no result" or "possible result". The choice tends to depend on how the finding affects future research. Biology is more exploratory than confirmatory, so they go that way.
People occasionally mention p-value calibration and note, sadly, the damage caused by this reckless practice that allows false results to slip through the airtight, 300'-tall walls of scientific publication. But there is value in being wrong. It's a part of science.
In a way, it's the MD's White Coat syndrome applied to PhDs. In public opinion, something that is scientific and written in a journal is necessarily correct, rather than the rigorously considered opinion it really is. Both the paper-reading public and the authors of some of those papers tend to believe this.
And to cover it from a Bayesian point of view, it's pretty vital to keep the culture such that the risk of publishing something incorrect doesn't too strongly dominate the decision to publish. You should be confident talking about your beliefs long before they distribute like deltas.
> In reality, if you publish 20 experiments with p=0.05 [1], 1 of them should be incorrect.
In reality it doesn't turn out this way, because the results that get written up and published tend to be biased in favour of novelty and of demonstrating a relationship rather than the absence of one. How many similar experiments could have been terminated, never submitted or not published because they failed to show anything notable? This is one of the reasons... 'Why Most Published Research Findings Are False'
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/
That meta-study applied to medical studies, and I think this genre would probably fare even worse when it came to long-term replicability.
And to go the other way, it doesn't happen like that because the rule of thumb isn't p=0.05, it's p<=0.05 - and p can be quite small indeed, if you run out of ideas before running out of data (such as might happen in a novel area).
This is why I'm an advocate of accepting/rejecting scientific papers based solely on methodology, with referees being given no information about the conclusions and with authors being forbidden from post-hoc tweaks.
A million times yes. Also: no publication without the experiment's methodology and criteria for success having been registered prior to the experiment's commencement.
agreed - I replied to a comment above with this suggestion. It would be nice if grant bodies started requiring this for all funded research, and kept (public) track of researchers with a bulging file drawer.
Good point. I shouldn't have blamed the researcher -- I read the paper and it seems straightforward enough, with a number of controls in place. (For example, running a second experiment using only random number generators that showed no such results.)
Instead, I should have focused on science journalists, who should be extra diligent when reporting these sorts of stories to point out the possibility that this is a false positive.
I agree, but we need to be really clear about what the claims are.
It could be that there is a reproducible 1% "mystery" effect that works from future to past, but only in experiments like this. In which case the claim wouldn't be extraordinary; it'd just be something we can't understand.
Remember that he's still in the data gathering stage. If -- and it's a big if -- there is any kind of reproducible pattern that doesn't match known laws, that doesn't mean there is a claim. There is simply data that doesn't fit our current models.
That's why people who push frontiers have to be very, very careful about differentiating the data from the claims.
Lots of guys make lots of money with bogus TV shows and books on stuff like this, and it's a shame: many times there is something unusual in the data, but the claims jump far ahead of any reality. Fear of this effect has probably silenced a lot of little tiny pieces of data that wouldn't make sense -- it's simply too much trouble to have to keep explaining yourself. This might be one of the factors explaining Feynman's story of the Millikan oil drop experiment.
> It could be that there is a reproducible 1% "mystery" effect that works from future to past, but only in experiments like this. In which case the claim wouldn't be extraordinary, it'd just be something we can't understand.
Erm, no, that would be pretty damn extraordinary. We know of nothing else in the universe that acts like this.
Erm, no, we actually know a lot of things like this, with 'this' being 'something we can't explain' (talk of sensing or affecting the future is premature speculation about the explanation, and irrelevant). To name one: the speed of the Voyager spacecraft. There are a zillion unexplained effects like that one and this one.
I don't have any interest in the subject of ESP, but you are completely incorrect. Statistical significance isn't hard to calculate, and the number of data points here is fairly large, so there is a measurable effect by reasonable standards. The actual statistics and calculations are right there in the paper.
The idea that the effect could be the result of a programming error or a small amount of light leaking through/around the screen is completely plausible. Or it could be dumb luck as you suggest, but it's extraordinarily unlikely.
> All of you criticizing this need to open up your minds
This type of attitude is infuriating - if a claim can survive the crucible of peer review, then we can be much, much more certain that it is true and correct. If humans were to possess a limited form of precognition, that would be awesome - but before we can claim to possess a thing, we must be sure that it is real.
As a fellow skeptic, I would advise we avoid outright rejection. After all, part of the mantra is to question our own world view, right?
Disease existed long before we figured out causes and targeted treatments; people possessed it, and people applied treatments with varying degrees of success and failure, including death. One of the more interesting inaccuracies of medical history is that of humors (random site - http://www.gallowglass.org/jadwiga/herbs/WomenMed.html). How we treated disease changed over time and is still changing as we learn new things.
My point? It is important to continue to apply rigid scientific study to all manners of phenomena, not only to validate its existence but also to figure out how to repeat or avoid said phenomena, depending on the need, the positives and the negatives of said phenomena. However, we should not turn a blind eye towards what people think they experience just because we have not yet come up with the right tool for measuring or the right study for identifying. There is always some reason behind the claim (even if the reason is "snake oil salesman").
Whatever is behind precognition (to take your example), people claim to experience it and always have made those claims. There is a certain burden of proof required, sure. How do you convince someone born deaf that there is a thing like sound that is experienced the way the hearing experience it? In the case of precognition, it tends to be self-validating (sometimes self-fulfilling), and yet it is still useful for precogs or those who believe in them, whether it is illusion or real, whether we have proven it concretely or not.
That bears a quick repeat... The information is somehow useful. These people who are shouting "open your mind" find their precognitive information useful; in their minds, challenges to this useful information are silly. In the name of understanding, the real focus should be on figuring out how that information is obtained. Is it psychic phenomena, a ghost whispering in the ear, great subconscious brain processing, or something else?
So if the response to an outright simple rejection is "open your minds", I think it is warranted. If the response is to indicate disagreement, however, I always thought that to be a useless response, as useless as the simple rejection.
Oh big deal, who cares if some people have their faith in the paranormal strengthened. Something like one in two people or more believe in what scientists would consider paranormal already.
Good scientists know there is much yet to be discovered.
This is trivial software to write. Why doesn't someone quickly whip up a web app that does the picture thing as described in the experiment? Then we can personally test whether we have ESP senses or not.
Specs: Two buttons - ESP mode or non-ESP mode. In non-ESP mode, 60 random pictures are shown and we are to guess. Then it gives the correct one. In ESP mode, add some porn. If the results are different, then we have ESP. Use Javascript for the randomisation algorithm so that we can be sure there is no server trickery being done.
If we repeated the experiment here, publicly, we'd all know whether we were in the control or experiment group just about immediately. The results wouldn't be valid, would they?
For the "Who's got the porn" experiment I don't think test takers knowing which group they were in would make much difference. My understanding is that both groups would be told they were being tested for ESP. However, one group is told hot stimulating pictures are mixed in with regular pictures, and the other group is told there are only regular pictures. According to the article, the group with no added incentive (from erotic pictures) to choose correctly had accuracy which would be expected, at about 50%. However, the group with the stimulating pictures was able to beat the 50% threshold. Presumably, the brain of those test takers had more incentive to use all resources available to be correct, including any extrasensory ones...
As such, test subjects knowing they are in the group with the erotic pictures should still be able to beat the 50% threshold.
Edit: Actually, I'm reading through the actual experiment and it appears all 100 sessions used both erotic and nonerotic pictures of varying arousal value. Also, both the position of the picture and the picture itself were not actually chosen by the computer until after the test taker made the choice, although they were told differently, making it a test for a future event. So, yes, I think we would already be compromised for trying to recreate the test.
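For anyone who wants to see what one trial of that protocol boils down to, here's a rough sketch (my reading of the write-up, not Bem's actual code): the subject picks a curtain first, and only then does the program randomly decide where the picture goes.

    import random

    def run_trial(subject_guess):
        """subject_guess is 'left' or 'right'; returns True on a hit."""
        # The target position is chosen AFTER the guess, so under ordinary
        # causality the long-run hit rate can only be 50%.
        target = random.choice(['left', 'right'])
        return subject_guess == target

    # A simulated subject who guesses at random over 36 trials:
    hits = sum(run_trial(random.choice(['left', 'right'])) for _ in range(36))
    print(f"{hits}/36 hits ({hits / 36:.1%})")   # hovers around 50%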
I agree that it shouldn't make any difference, logically, but then your control group still isn't truly a control group. Your results could be doubted on the basis of things like the participants knowing too much about the experiment and trying to guess patterns in the RNG.
Oh, yes, any results we came up with running the experiment here should be properly doubted from the start. Not only would participants know too much about the experiment, but over the Internet participants wouldn't be directly observable which certainly throws out any real legitimacy. I was only thinking in terms of possibly recreating enough of the test to experience some of the same results, regardless of how those results might be perceived, but I think too much information about the test is now known.
If it's pseudorandom, that means there's some form of pattern in the numbers it generates. It's impossible (or at least unreasonably hard) to know whether the person doing the test might subconsciously guess or estimate the pattern. That invalidates the whole test.
Javascript's random() isn't a cryptographically strong PRNG. Depending on how any PRNG is seeded, guessing its output may not be very hard at all - especially if you've seen it once before.
E.g., I once had a poker-playing game on an Amstrad that used a PRNG with a very predictable seeding strategy. I could amaze my friends by knowing exactly what cards I would be dealt.
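Here's a contrived sketch of that failure mode (not the Amstrad's actual generator, just an illustration of coarse seeding): if the PRNG is seeded from something like the current minute, anyone who knows the scheme can reproduce the "random" deal.

    import random
    import time

    def deal_hand(seed):
        rng = random.Random(seed)          # the deal is fully determined by the seed
        deck = list(range(52))
        rng.shuffle(deck)
        return deck[:5]

    coarse_seed = int(time.time()) // 60   # seeding off the current minute
    print(deal_hand(coarse_seed))          # the "dealer"
    print(deal_hand(coarse_seed))          # the "psychic" reproduces the exact same hand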
It seems pretty impossible. Unless you're using a horribly flawed PRNG, patterns in the data should require analysis much deeper than human intuition could manage. This analysis would also probably require much more information than you would get from this experiment.
Extraordinary claims require extraordinary evidence. The classic 5% (or even 0.1%) statistical threshold is sometimes not enough. See here for an easy-to-understand example of why that is the case:
http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_...
In other news, capital punishment has been instituted for science journalists publishing articles that contain a question in the title that can be succinctly answered with "No."
Can someone please, please explain this to me? I've never understood why the oft stated line "extraordinary claims require extraordinary evidence" is anything other than a clever saying. Why should it be that things that follow your intuition require any less rigor to prove than those that do not, and vice versa? Presumably there should be no subjectivity to cold hard science; evidence is evidence, and a certain quantity of evidence should reflect fact equally well regardless of how unusual that fact is.
edit: just to note, nowhere in constructing a statistical test is it required that the creator decide how "extraordinary" the null hypothesis is.
It's simply the way Bayesian statistics work: if the prior probability of something happening is very low, then for me to flip from thinking "didn't happen" to "did" it will take some new information that is very powerful.
If you think that's illogical, I'd ask you to consider why a teacher is more likely to accept the excuse "my dog ate my homework" than "aliens kidnapped me and stole it". You seem to be arguing that given that the evidences are equal (a mere statement from a kid), the teacher should properly consider both occurrences to be equally likely.
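The mechanics in a few lines (a sketch with made-up numbers, just to show how the prior and the evidence combine in Bayes' rule):

    def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
        # posterior odds = prior odds * likelihood ratio
        odds = (prior / (1 - prior)) * (p_evidence_given_h / p_evidence_given_not_h)
        return odds / (1 + odds)

    # "dog ate my homework": modest prior, weak evidence (the kid's say-so)
    print(posterior(prior=0.05, p_evidence_given_h=0.9, p_evidence_given_not_h=0.3))   # ~0.14
    # "aliens stole it": minuscule prior, exact same weak evidence
    print(posterior(prior=1e-9, p_evidence_given_h=0.9, p_evidence_given_not_h=0.3))   # ~3e-9

Same evidence, wildly different posteriors, purely because the priors differ.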
This is true, but I'd add a caution: just because something seems outlandish or improbable doesn't mean it actually has a low prior probability. Human intuition on what's weird and what's not is not a reliable oracle of prior probability. If you're going to give the prior an actual number, you better base it on actual facts.
In your example, based on existing data, it is indeed fair - some dogs do sometimes eat homework, whereas there are no verified accounts of aliens stealing it. So that's a legitimate adjustment of priors. Particularly if you actually have data on the incidence of paper-hungry dogs.
But in science and philosophy, there's lots of important questions for which we can't legitimately calculate priors, and "it would be too weird" is not at all relevant when determining their values.
But we do have reasonable priors on parapsychology from its wasteland of unreplicated, flawed studies, with no convincing results despite decades of effort.
I believe what the saying means is this: If a theory X is accepted as true, then that implies that some amount of reasonable evidence exists in support of X. If we are to prove ~X, then we must show not only evidence in favor of ~X but also explain why the evidence in favor of ~X is stronger than the evidence in favor of X (e.g. using better instruments results in more precise measurements). The new evidence must be more "extraordinary" in the sense that it must be strong enough to overturn the evidence in support of the "ordinary claim".
The line "extraordinary claims require extraordinary evidence" is just more poetic than the paragraph above.
Why, you ask? Because it won't be believed otherwise. Causal proofs of a statistical nature aren't math proofs. They aren't, generally at least, proving fundamental relations that resist all dispute. Rather, they are simply persuasive.
To be persuaded of something you already strongly believe is far easier than something you don't believe. And really the key word is persuade. It's not that people can prove the sun will come up tomorrow, but they can persuade you that it will.
There are some good replies to this already, but I'd like to add another, essentially my own. As a professor friend of mine once explained, for some reason people seem to want to accept extraordinary explanations over ordinary ones (just look at many commonly held unfounded beliefs). This might be some innate primordial mechanism of the human brain. I don't know, but if there is any truth to it then we must carefully guard against our own bias to make sure we are not unwittingly seeing the results we want to see, hence the need for extraordinary evidence to back up extraordinary claims.
Since we're talking about extraordinary claims, let's examine a claim that is definitely not true, but very interesting. If we run a study with threshold p=0.05, there is a 1 in 20 chance that we will erroneously report the claim to be true.
Now, let's say ten different scientists are interested in this claim, and they're all going to run their own experiments. The chance that all ten will run an experiment with each reporting "false" is under 60%.[0] Over 40% of the time, at least one scientist will falsely conclude the existence of the phenomenon that definitely does not exist. This is an effect of running multiple independently-considered experiments without aggregating the results.
That's the Bayesian problem that people mention. Another problem entirely comes from which results will tend to get published.
Now let's consider the effect of publishing bias. Let's assume that only 20% of the scientists will attempt to publish their results regardless of the outcome, but they will always try to publish if the (false) phenomenon is shown to exist. This effect alone results in 21% of submissions being incorrect,[1] even though an incorrect result only has 5% likelihood.
Let's additionally assume that a journal will publish a false-but-interesting result 50% of the time, and the true-but-ho-hum result only 10% of the time. The final effect is that 50% of published results for this extraordinary-but-false phenomenon incorrectly report the phenomenon to be true.
Tweak the numbers all you want, but the effects of running multiple independently-considered trials, along with biased publishing, means that we are surprisingly likely to publish false conclusions.
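For anyone who wants to check the numbers, here's the arithmetic spelled out (same made-up rates as above, per 100 studies of the false phenomenon):

    scientists = 10
    alpha = 0.05

    p_all_report_false = (1 - alpha) ** scientists
    print(f"P(all ten report 'false') = {p_all_report_false:.2f}")           # ~0.60
    print(f"P(at least one false positive) = {1 - p_all_report_false:.2f}")  # ~0.40

    false_positives = 100 * alpha            # 5 studies "find" the effect
    null_results = 100 * (1 - alpha)         # 95 correctly find nothing
    submitted_wrong = false_positives * 1.0  # positive results always submitted
    submitted_right = null_results * 0.2     # only 20% of null results submitted
    print(f"wrong fraction of submissions = "
          f"{submitted_wrong / (submitted_wrong + submitted_right):.2f}")    # ~0.21

    published_wrong = submitted_wrong * 0.5  # journals like interesting results
    published_right = submitted_right * 0.1  # and shelve the boring ones
    print(f"wrong fraction of publications = "
          f"{published_wrong / (published_wrong + published_right):.2f}")    # ~0.57, i.e. roughly half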
53% in an experiment that has 36 trials? Really? 50% is 18/36, but 19/36 is 52.7%, and 19/35 is 54.3%. Depending on the experimental design, it may just so happen that near the end of the 20 minutes the subject tends to stop on one of the last erotic pictures they're likely to guess and lets the time run out.
Hypothesis #2:
Depending on how the computer's random number generator was seeded (and they might have a relatively short repeating sequence), subjects may have, however unconsciously, "learned" to predict the randomness, something they would have insufficient motivation to do in the other set of pictures. [We can test for this by seeing if they were getting better at it over the course of a session.]
Krulwich's NPR Science pieces have some of the best verbal delivery, storytelling form, and production values - it's worth listening to the audio version of them.
Highly recommended if you're thinking about making a podcast.
Has anyone investigated Dean Radin's work? It demonstrates similar statistical effects in a huge number of experiments. He has put forth quantum entanglement as the explanation.
My wife (a psychologist and a Christian) defended the (mostly psychologically oriented) experiments as posed in the paper and the method behind them, whereas several atheist/strictly-causality-believing coworkers and also a conservative Christian with a strong anti-psychology bias dismissed the idea of spooky action from the future outright. The stormy e-mail exchange raged on (and I did not contribute to the discussion). The conservative guy accused me of abandoning my wife in the argument. I believe she's well capable of handling herself.
In any case I wrote this (bad) poem in response:
My act is mostly mute and unseen,
I wear no costume, I don’t vent my spleen.
Spending most time behind the stage,
Conceiving a plot I prepare the cage.
I have few resources, can’t sponsor M-M-A,
Must find some other way to while away the day.
I step out briefly to address the crowd,
I hope today they’ll surely be wowed.
My mind’s been active, reading Hacker News,
What’s this I see? Some interesting views,
on whether the future can affect our present,
I’m sure this will stoke plenty of dissent.
I have my materials for a good time today,
Setting the stage is just an e-mail away.
My fingers fly fast, the idea’s not hokey,
My actors will soon be addressing the spooky.
I press ‘Send’ and my time on stage is done,
I’ve set the parameters, now it’s time for fun.
The actors appear to have done my bidding,
I just hope it doesn’t end in too much bleeding.
Sure, I’ll show up from time to time,
The audience gets tired of hearing everyone whine.
They need to see larger schemes at play,
The actor’s philosophies won’t save the day.
Arguments, screeds, reasoning galore,
It’s exciting for a time, not yet a bore.
I’ll step back just about now,
It’s time for some more.
This audience of one will now sit back,
Got a few more bugzillas to whack.
I won’t make it to peer-reviewed journals,
But empirically it’s great to see what sprouts from a kernel.
From the article : "The sequencing of the pictures on these trials was randomly determined by a randomizing algorithm … and their left/right target positions were determined by an Araneus Alea I hardware-based random number generator."
At the very least they were using the Araneus Alea, which is a hardware random number generator, so the numbers were not predictable. It's possible that the "randomizing algorithm" did something dumb and made the sequence not random, but I doubt it.
I think it's more likely that the study was done so many times that it eventually gave significant results than that the sequence was not random. Or maybe prescience is real to some degree, or the study is a statistical glitch.
However the replication package they provide has the compiled program without the source code, and that is a red flag to me.
A valid point. As scientists rely more on tools created by others, how can they be certain they're measuring the subjects and not the tools themselves? The article repeatedly refers to "a computer" in these experiments as if it's some divine arbiter; I think most programmers here know differently.
Judging from the porn study, you'll get better results if you write a script to screen-scrape the winning numbers after they're posted online, then display them to you alongside some hot pron amidst several sets of randomly chosen lottery numbers alongside SFW pictures.
This violates our common sense, but it doesn't necessarily violate physics. We've known for a long time from quantum mechanics that time may not fit with our preconceived notions of what we want to think it is. Entropy only means that time moves forward. The rest, such as that we can't know the future, we've just assumed.
Here is some very good discussion of (failed) attempts to replicate the study, and at least one possible methodological flaw that could invalidate some of the results:
"The real lesson? This is the level of methodological scrutiny every paper should receive, and not just the ones you think are crazy: the ones you like and rely on for your own work should get a good working over like this too (especially these ones; and I'm as guilty on this as everyone else)."
Pretty sure this experiment will meet "cosmic habituation" (http://nyr.kr/fkzAaQ) very soon. Pretty sure Bem knows it. His time would be better spent studying why the truth "wears off" (as The New Yorker put it). What happened to science!
I am embarrassed to ask, but can someone explain the word flashing test a different way? For some reason something is not clicking for me the way it is written on NPR.
If you're convinced computers react to your moods, you should consider reading Zen and the Art of Motorcycle Maintenance. There's an entire chapter dedicated to the concept. It's philosophy, not science, but interesting and possibly wise nonetheless.
People have been getting results like this for a very long time. The Soviets took these phenomena very seriously and were trying to establish physical mechanisms.
Figuring out what time is may be what causes civilizations to go extinct. Once you figure out how to probe the earlier states of the universe you find everything vanishes, along with the evidence that the civilization ever existed. This may be why our visible universe is not teeming with chit-chat.
Wow, the word retyping test seems like a total scam. Of course you're better able to recall words after a test in which you were already better able to recall those words. It's called memory. Correlation, not causation, as the maxim goes.
I have not read the original article, so maybe I am mistaken, but the post submitted here seems to suggest that there was only one group, not two. There was one group, and they retyped only half or so of the words and not the other half, and were found to remember the retyped words better.
I do not know if that makes a difference, but I think it might. The retyped words might, for whatever reason, have been easier to remember than the non-retyped words.
Unless the original article says the experiment was carried out in the way you suggest, I do not think there was much control of variables.