Noted study in psychology fails to replicate, crumbles with evidence of fraud (columbia.edu)
205 points by luu on Aug 22, 2021 | 102 comments




I have several of Ariely's books. I enjoy reading psychology, but I stopped reading pop psychology books and now go for academic books, textbooks, or individual articles.

The number of times I have read about a spectacular, counter-intuitive study that "changes everything," only to later find out it was fraudulent or failed to replicate, makes me really doubt all the knowledge I have from reading about psychology.

The reproducibility crisis is real, and it sucks for all the people who read individual books and don't know about it. I feel the people who write those books should hold themselves to higher standards and stop writing narrative books based on individual, non-replicated studies.

You can read about the reproducibility crisis here: https://www.theatlantic.com/science/archive/2018/11/psycholo...


I think you need to draw a line between "not replicable" and fraudulent.

This is about the latter. This is the kind of corruption that spreads. It's a very bad thing.

It is related to the former problem of "not replicable," which is about application of the "scientific method." On this, my worldview has evolved a lot over the last decade. I think we've been applying overly Popperian methods, and formalisations of the scientific method where they don't work well.

Say you are studying the effects of a breakfast program, a 2 hr recess, or a push-up regimen on high schoolers. The study is well constructed, faithfully conducted, etc. You find interesting, potentially actionable ways of reducing obesity, improving psych health or something.

Do/Should we even expect such a study to replicate in 1990s Tokyo, 2000s Miami, School A, School B, etc.?

Does that mean this study should not be conducted? Does it mean we can't generate knowledge and understanding of things this way? Is there any daylight between "anecdote" and a result that qualifies for publication in Physics?

A lot of articles published in many fields feel like they are being hammered into a Popperian mold they don't fit into. Everyone knows it, but no one is willing to say "this method doesn't work here. we're actually doing a different thing." Instead, they sort of eye roll the article into the frame and let readers unscramble it for themselves.


The replication crisis came to a head a few years ago when it was revealed that two-thirds of social psychology findings failed to replicate -- this is not just empiricism taken too far, this is the mass production of false knowledge due to poor training at best or incentive misalignment at worst.


True, and a lot of it is fraud or not far from it.

However, a big chunk of this statistic is not actually fraud. It's just false empiricism. If you look at a lot of the actual papers, I think you'll find that a lot of them just aren't trying to do empiricism... at least not strict empiricism. They're forced into that structure by a convention.

Either (1) "Not Science" becomes OK and extends to economics, some areas of ecology, and such. (2) The Scientific Method gets renamed "Method A," and the concept of scientific pursuit is more flexible, depending on what you are studying or (3) We force these poor bastards doing economics, psychology and such to pretend that they're doing a method that they're not... or publish in the prestigious "Unscientific Economics & Psychology" journal.


It is not even about malice. It is a straightforward consequence of how hypothesis testing is conducted. Greenland et al. have a nice section on this:

> Despite its shortcomings for interpreting current data, power can be useful for designing studies and for understanding why replication of "statistical significance" will often fail even under ideal conditions. Studies are often designed or claimed to have 80 % power against a key alternative when using a 0.05 significance level, although in execution often have less power due to unanticipated problems such as low subject recruitment. Thus, if the alternative is correct and the actual power of two studies is 80 %, the chance that the studies will both show P ≤ 0.05 will at best be only 0.80(0.80) = 64 %; furthermore, the chance that one study shows P ≤ 0.05 and the other does not (and thus will be misinterpreted as showing conflicting results) is 2(0.80)0.20 = 32 % or about 1 chance in 3. Similar calculations taking account of typical problems suggest that one could anticipate a "replication crisis" even if there were no publication or reporting bias, simply because current design and testing conventions treat individual study results as dichotomous outputs of "significant"/"nonsignificant" or "reject"/"accept."
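
A quick sketch of that arithmetic, purely illustrative and assuming two independent studies that each genuinely have 80% power against the true alternative:

    # Illustrative only: the joint probabilities for two independent studies,
    # each with the nominal 80% power described in the quote above.
    power = 0.80

    both_significant = power * power            # 0.80 * 0.80 = 0.64
    only_one = 2 * power * (1 - power)          # 2 * 0.80 * 0.20 = 0.32
    neither = (1 - power) ** 2                  # 0.20 * 0.20 = 0.04

    print(f"both studies reach p <= 0.05:  {both_significant:.0%}")  # 64%
    print(f"'conflicting' results:         {only_one:.0%}")          # 32%
    print(f"neither reaches p <= 0.05:     {neither:.0%}")           # 4%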


I think the current situation calls for a more rigorous application of Popperian standards, not less.

Nothing about "this study applies to this specific situation" is counter to Popper. However, if it fails to replicate in other situations, it tends to suggest at the very least "might be too specific to this situation to have practical application".

This happens in software as well. If my system works on my specific laptop but anyone with a slightly different version of the programming language, imported libraries, or operating system is unlikely to get it to work, then it's just not good software. The solution is not (usually) to put something in the README that says "will only work with python 3.6.7 on Ubuntu 18.04"; the solution is to change the software so it is more generally applicable. And it's on me, the programmer, to make that happen. If the scientific phenomenon is specific to 1990s Tokyo but not 2000s Miami, then some major variable is being left out, and the science requires finding out what that is.

I agree with your basic point that there is a big difference between non-replicable and fraud, though.


>> Nothing about "this study applies to this specific situation" is counter to Popper. However, if it fails to replicate in other situations, it tends to suggest at the very least "might be too specific to this situation to have practical application".

Maybe not technically, but it does imply that "this specific situation" should be defined specifically, hence controls. I agree that broad applicability and replicability are strong indicators of truth value. That said, there are pursuits of rational knowledge outside of Popperian ideals. That's not represented in the way Science, as an institution, works.


If the results are not replicable, how can they be actionable?


Let me continue the example...

We found that students participating in the "2hr active recess program" have meaningfully better health outcomes, psychological health outcomes, and better grades. It was tested faithfully in 25 schools over 10 years.

It expands to other places. Different countries. Different decades. Different people run it in different ways. The kids are different. In some cases, the kids are already so active that the extra activity is detrimental. In some places, it's really hot. Kids tend to read, or do more organized indoor activities.

Basically, the world is not a laboratory. You're not predicting the trajectory of a projectile, where unknowables can be quantified as a function. It's all 3 body problems all the time.

Yet, we as humans do manage to make sense of the world. We run schools, companies. We learn more about how to do those things. Formal studies are an important part of learning about how to do those things. Scientific mindset and methodologies are useful to this. It's a lot better than just philosophizing the whole thing. The type of knowledge we gain, OTOH, isn't like F=MA. It never will be.

The value of the above study does come from it containing knowledge. It's partial knowledge of a general problem. It is not complete knowledge of a particular problem, like you would get in a lab. Neither lab experiments nor studies in the wild tend to replicate in one another's world.

No matter how good the study, how many times it replicated, or how well it worked throughout the 80s and 90s... there's no guarantee that your program is working well or a guaranteed explanation for why it isn't.


> Different countries. Different decades. Different people run it in different ways. The kids are different. In some cases, the kids are already so active that the extra activity is detrimental. In some places, it's really hot. Kids tend to read, or do more organized indoor activities.

Elucidating confounding variables and controlling for them is the entire fucking job of a scientist.

To your example, if they publish a paper that explicitly highlights that it worked well for sedentary kids with typical diets of X, Y, and Z in region Q, then its failing to be found in other environments isn't a problem.

It’s a problem when these clowns do a test on one specific group of 20 people and then generalize to, “doing X leads to Y in humans”.

The answer is absolutely not loosening up the constraints. It’s to keep lambasting people who generalize before anything has been reproduced widely.

Popper’s science is fine. Morons who don’t realize what they measured are the problem.


I suspect it’s the journalism step that introduces the error you’re objecting to. Proper scientists* will report their findings and methods. Journalists will tend to write the “Doing X leads to Y in humans” headlines.

Many times the first or first few teams won’t know what the confounding variables are and subsequent research is needed to find them. They should still report their initial “hmm, that’s novel” findings even if it later turns out that it’s explained by confounding variables.

*-No true Scotsman


>> Elucidating confounding variables and controlling for them is the entire fucking job of a scientist.

No. This is methodology. The job of a scientist is to advance scientific knowledge. This is as vague as it needs to be.

Popper didn't do science. He did philosophy. He also did a fine job of formalising methods, and abstracting the methods scientists were using into a Scientific Method. This scientific method is a great achievement and a wonderful, multipurpose probe. That does not mean it fits in all orifices. The job of a scientist is not to continue hammering until the probe fits.


How do you advance scientific knowledge without using a Scientific Method?


I suppose any method used to advance scientific knowledge is a scientific method, by definition. That's beside the point, though.


>>>> Yet, we as humans do manage to make sense of the world. We run schools, companies. We learn more about how to do those things. Formal studies are an important part of learning about how to do those things. Scientific mindset and methodologies are useful to this. It's a lot better than just philosophizing the whole thing. The type of knowledge we gain, OTOH, isn't like F=MA. It never will be.

I think it has to be held out as a possibility that there is actually no science behind how we run schools and companies. One way this could happen is if non-scientific ideas simply overwhelm the ones that are grounded in good science.

The fact that it isn't like F=MA also means it will never advance beyond producing collections of amusing factoids that are unrelated to one another and potentially false.


Think about the replication crisis as actually a generalization crisis. The study said what it said for the exact parameters given. The question is what were all the parameters and which ones are the final result actually sensitive to?

It’s actionable by trying it in a new circumstance (set of parameters) and observing whether it fails - over and over. Failure doesn’t mean the claim is wrong/untrue, it means you’ve changed a parameter that happens to be important: now find it!


Yes. Find it. And when you do, you'll have something reproducible. The purpose of science is to advance the species, and to produce knowledge of the form: given this, when that happens, these other things occur (or occur with a certain frequency).

I agree that this is a generalization problem. I just don't believe that this particular way of describing it is helpful. If you'll allow me an apocryphal example, when Newton saw the apple fall, if he had generalized to "all things from apple trees fall at the same rate", he would have been correct for a while, then ludicrously incorrect once the leaves started blowing off in Autumn. If at that point he had tried to figure out exactly what kind of tree produces this gravity thing, or which parts of trees, we wouldn't be any further along than we were before.

Generalizations work along multiple axes. With some degree of imprecision, every experiment can be a success (or failure). This is why we have the scientific method in the first place.

Now, perhaps you're making the argument that this is somehow, for lack of a better term, half-science. It's not right, but it's not wrong either. It's truthy. The logical follow-up would be: ok, where is the second half? In this and many other cases there is no second half. Papers were published, people received awards, wrote books, became famous, and lots of other folks used their work. So even if somehow this is better than nothing, in practice I'm not seeing it.


To be clear, in this case there appears to have been fraud ie actual falsification of data. That’s very different from the broader replication crisis, which is more of the shape of unknown parameters. The two get blurry in some places due to eg publication incentives pushing people’s interpretation of legitimate data into semi-fraudulence. But there are still two ends (at least) on the spectrum of non-replicable science.

I think framing it in this lens is helpful because it points toward a solution: capturing more of the parameters more explicitly and varying them more deliberately. If we just say, “oh x y z studies aren’t replicable, field [x y z] is bullshit,” we are ending our inquiry prematurely. We are ending it at the falsification of Newton’s initial theory and saying that because leaves fall at a different rate, that gravity doesn’t exist.


> Think about the replication crisis as actually a generalization crisis. The study said what it said for the exact parameters given. The question is what were all the parameters and which ones are the final result actually sensitive to?

If you can't control parameters so that the results you are claiming are reproducible, aren't you just publishing noise?


How would you know if parameters are controlled until other people try the same thing with other parameters?

Should a journal paper never be published until the field has established all the fundamental properties? How could it discover those fundamental properties without the publications?


Generalizing further, it touches on the Demarcation Problem. What is the difference between scientific hypothesis revision and pseudoscientific goalpost moving?

Surely there is one, but it's maddeningly hard to pin down. Especially when the experiments are hard to perform and control properly, which applies both to psychology and astronomy.


This kind of demarcation usually hints that you are dealing with a subject that is not as concrete as you had supposed. "Science" doesn't really have a concrete definition, a pseudoscience just means stuff that is pretending to be science.

Remember that scientists existed before philosophy of science, including formalisations of The Scientific Method. Darwin wasn't following the scientific method, the method followed him.


I disagree.

You're not going to find the parameter which makes a high school nutrition program deterministic. It shouldn't be the goal.


That’s a strawman. Reproducibility does not require determinism.

If you can’t find the parameters that make it reproducible, what value does it have in the first place?


Which means you can’t generalize to other high schools. If the scientific result doesn’t generalize, is it still science? How can we use the science to improve HS nutrition?


Or it means there was fraud, which is why they are ultimately not able to be separated.


Another issue is books written by journalists blowing up ideas they find interesting from research and then pulling things out of their ass.

Look at emotional intelligence (EQ) as an example. One of the early books on the subject was written by a journalist, built on a shaky foundation, and then a bunch more sugar was added. It might as well be a self-help movement, the kind that spawns coaches, authors, courses, and certifications.


It is wrong to believe that just because a result cannot be replicated, the original result is wrong or fraudulent. It can be totally normal, and is to be expected, that a replication study fails to yield the same result just out of chance, etc.

Sometimes things are even more complicated than you already believe them to be. And it may take more than just a replication study to say a study is wrong or fraudulent.


It is wrong if it doesn’t replicate; it might not be fraudulent. But it is not right.

If it doesn’t replicate, it should be relegated to “less than observational” level evidence. Why would you believe that the assumed intervention had the claimed effect, if it doesn’t have it when you try again?


In the present case, replication didn't fail because of statistics but because the experiment was impossible to reproduce.

If it fails to replicate certain statistical outcomes, things are more complicated.


It means the original result is an anomaly that doesn't fit into a scientific understanding. Maybe there was an error in the experimental setup, or some unknown factor was at play, or the situation was unique. Whatever the case, nothing scientific can be determined if it can't be replicated. It's akin to the Wow! signal from SETI.


An important finding is buried in the linked blog post. It turns out that Ariely was hyping the fraudulent findings on NPR in February 2020. At the time, perhaps he didn't know that they were fraudulent. But he -did- know that there was great reason to doubt them. We can be confident of this point because, more than half a year earlier, he had submitted for publication a new article [1] that (a) failed to replicate the original findings, and (b) uncovered a massive problem with baseline measurements in the original study, suggesting that the randomization in that study might never even have taken place. Indeed, the whole point of the new article was to make these critical points about the older article!

Even if Ariely had no role in the creation of the fake data, it seems hard to defend his behavior on NPR: hyping a study that he knew to be extremely problematic.

[1] https://www.pnas.org/content/117/13/7103.short


Let's not give NPR a pass here. They routinely present junk science as fact and never suffer any repercussions.


As a general heuristic, social sciences research that tells a nice narrative and gets hyped to hell in TED talks/pop sci books/etc. is at best bullshit.


In private I have heard of science done that is suspected to be fraudulent. The results are so intriguing and promising as to send multiple grad students across multiple groups barreling into replicating and expanding the results. What happened? Years of time wasted with no publications. PIs and journals aren't interested in publishing negative results. The person who lied in their research is reaping career benefits of such a valuable contribution to research. Academia takes something as beautiful as the closed loop of science and opens it up only to benefit charlatans and punish the best and brightest.


In other fields where there is a chance for fraud, the honor system is not used. In academia, the attitude is just 'the honor system is good enough, WINK' as it relates to data. In other fields where there are opportunities for embezzlement or other abuses of trust everything is monitored, only some employees are authorized to do certain transactions, and many of those employees are bound by law and oath to not do things like embezzle or falsify records.

This doesn't always work, but having the capacity to disbar lawyers for, say, embezzling trust accounts, or to revoke a CPA's license and put them in prison for falsifying records, certainly has a deterrent effect on similar crimes of trust.

How much worse would white collar crime be if everything was on the 'honor system' and there was no prison time for stealing from customers/clients/shareholders? That's what we have in academia: crime won, so now syndicates of crooks run the system. With other abuses of trust the material consequences are more readily apparent: someone's bank account is empty that should not be. With this abuse of trust, the damage is more to the integrity of the purportedly rational knowledge base. It's more of a 'reign of error' than a reign of terror.


I think the damage is much more than just the integrity of the knowledge base. There are people who spend years focusing their (very expensive!) academic pursuits in reliance on these studies. They spend years of their lives researching and trying to replicate in vain. The damage to our collective knowledge is costly and unacceptable but there are also real people who are real victims of this fraud. It’s not just some theoretical cost borne by society at large.


> With this abuse of trust, the damage is more to the integrity of the purportedly rational knowledge base

It corrodes trust in our scientific institutions. When that pop something talk by an acclaimed psychologist turns out to be not only bullshit but outright fraud, it’s not a huge leap to project that mistrust onto e.g. the medical establishment or climate scientists.


Everything is a cost/benefit tradeoff. We devote more resources to securing your bank account against fraud because we (as a society) think that it’s worth spending the money — on the theory that recovering your money after it’s stolen is very hard and because we think it’s acceptable to cause legitimate transactions to undergo some inconvenience. We devote fewer resources to verifying every social sciences experiment because society doesn’t want to spend the money to reduce a presumably already-low fraud rate and because the worst case outcome is usually articles like this one. And yet despite these decisions, white collar financial fraud occurs all the time and plenty of lawyers stretch their profession to (or beyond) the ethical breaking point — with devastating impacts on real people.


The fact that the "negative" results don't get published is the worst part. Failing to replicate a strong effect should be a result in and of itself! Failing to find support for a plausible theory is also a result. The fact that this kind of publication bias still exists doesn't make any sense to me. And it seems to be one of the primary ills of scientific publishing and academia.


Publishing negative results is important to solving this issue.


How about instead of applying to journals with a finished study, people would apply with a proposal of a study and then publish the results independent of how spectacular/boring they turn out?

Journals would then have to select studies based on their design and based on the question and scientists could just do science.


This is a good idea, and is known as "preregistration". https://en.wikipedia.org/wiki/Preregistration_(science) Hopefully it will become more common practice.


Well, it is preregistration plus a commitment by a journal to accept and publish the resulting paper (no matter the outcome of the study). That apparently is "Registered reports" (see same Wikipedia article), an idea I hadn't heard before.

Edited to add (from Wikipedia):

> Over 200 journals offer a registered reports option (Centre for Open Science, 2019),[33] and the number of journals that are adopting registered reports is approximately doubling each year (Chambers et al., 2019).[34]

> Nature Human Behaviour has adopted the registered report format, as it “shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them”.[36]

> European Journal of Personality defines this format: “In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes.”[37]


I agree. The grad students agree. Now how do you convince a group obsessed with career advancement and holding all of the cards (PIs) to agree?


Enough of these "falsified science" stories come out and people will become interested in stories debunking bad science. I subscribe to a podcast called "everything hertz" for this reason.

https://everythinghertz.com/

https://twitter.com/jamesheathers

https://twitter.com/dsquintana


By convincing the people deciding about their career advancement that negative results are worth something too. (no, that isn't easy, since it's the same people to a good degree, which is a problem with solving many problems in academia)


Give groups larger grants for publishing negative results too? More money seems to be workable enough in the rest of the economy. Obviously it would be best if those at the top make the change and let it cascade down. But any private source of funds could also experiment with changing the grant model.


Incentivizing publishing negative results is a challenge.


This has potential consequences I don't like. Once negative results are incentivized the same way as positive results, cheaters could have a perverse incentive to post fraudulent failures to replicate, too.

I think I agree that we need much more transparency and scientific rigor, first. Post the data whenever possible, incentivize people for creating good study protocols that make it harder for any single person to fake the data without it being obvious. Once we have enough confidence in that, let's do preregistration, sure. And then negative results should be as incentivized as positive ones, absolutely.

I worry about the order in which we want to do these things. If people are faking data, we need to make this specific step harder to hide from reviewers.

If you incentivize people to fail to replicate each other without solving the lag between fraudulent data and retractation, you will get fraudulent negative results.

That could cause a lot of confusion. If the replication crisis itself starts to be fraudulent, scientific credibility is further damaged, negative results will be less credible, and fraudsters will profit from the confusion.


"If negative results are published like other information, people could have a incentive to replicate post fraudulent failures"?

Um, that sounds worse. (OT:) To add a simplified (stripped-down) example, which I hope is practicable:

You work at a company. Orders are coming in; wares are marked 'reserved' until paid, then shipped. From a worker's perspective, you check the emails with orders ('the customer wishes'), send 'bills', and ship the wares depending on whether or not you've been paid. Viewed by the company manager, your position may (a) be valid or (b) come across as 'overvalued' (...)

Staying with this example: you talk to other 'employees' during breaks, and 'there was a team meeting on weekends', so you talked about work, and there was a (special) kind of 'more transparency' that you wouldn't have without peer review. ^^

Regards, hope this helps... (-:


I'm not sure I understand your point, but if you're saying that posting negative results is good, then yes, I agree!

I'm only trying to make a point about the order in which we should shift incentives vs increase fraud-resistance requirements for experimental protocols.

I think the latter is more important, and should be done first.


There needs to be a disincentive to not publishing.

The journal that publishes failed science would be the biggest and most important scientific journal IMO.


There's a few negative results journals now - https://www.negative-results.org/ for example.

The only concern I have around a disincentive to not publishing is that in some fields, the peer review process can be quite flawed and effectively hold back worthwhile publications for various reasons - sometimes there's too much of an expectation of specific ingrained methodologies, or an expectation of certain "doctrines" to be adhered to.

There can also be some significant gaps between disciplines, where neither discipline wants to publish the work, as they both feel it belongs in the other discipline, which effectively leaves only the more generic (and usually less prestigious) "any inter-disciplinary research" journals.

That becomes a problem if institutions are focused on publication metrics and venues as a priority, as it means the institutional incentives drive people towards or away from these kinds of work. Ideally we need negative results published in the same venues as positive results (in my view), so those doing this valuable work don't get "relegated" to the negative results journals, which will no doubt get less attention at tenure review time, or from grant funders looking at publications etc.


> The results are so intriguing and promising as to send multiple grad students across multiple groups barreling into replicating and expanding the results.

One important task of the advisor is to filter bad projects. Does the paper pass the smell test? Are the authors reliable, or have they published "dubious" results before?

[I was recently discussing a paper in a virtual meeting. My conclusion is that Zoom does not have enough negative reactions.]

Also, before assigning an experiment to a graduate student, try to have something like a plan:

1) This will surely work and is interesting enough for the dissertation and to be published.

2) This part is dubious, because it's related to too recent a research paper, or there may be some hidden technical problems that are difficult to see.

2') An underspecified alternative to 2, just in case.

3) Extra task for bonus points and perhaps a Nobel prize.


Peer review, as we know it today, is a fairly new phenomenon and most of our best science was done before it entered the academy. It is a pox on innovation.


As a business academic, I can absolutely confirm that popular press attention corrupts us just as much as power corrupts political leaders. I can’t speak outside of b-schools, but suspect it’s no different.

If anyone is privy to a platform that would allow me to share proprietary data such that skeptics can run aggregated statistical analyses whilst limiting the ability to surface results with insufficient samples (to prevent scraping the proprietary data—if not proprietary, then there is no excuse for not fully sharing), please comment!

If not, YC hopefuls here is your chance to ACTUALLY make the world a better place!!! Happy to use my platform to recruit as many social scientists as you need to secure funding for a POC—seriously, ping me!

Must haves:

* Easily palatable settings for data obfuscation (non-tech corporate GC must be able to grok the level of obfuscation)

* Trivial sliders to change the level of obfuscation (e.g. min number of observations to run a "test")

* Messaging system to facilitate requests and permission granting between authors and skeptics

* Enterprise-level data retention and documentation, such that changes to the "official" data set are reflected publicly

* Cloud based R, Python, Stata functionality (else the solution would have to run locally in such a way that the granular data cannot be accessed directly on the local machine—and this would likely be a non-starter for corporate GCs). Ideally, one could reuse their local license for proprietary software (e.g. Stata); if not, then said vendors will delight in an additional "attach" license to charge researchers even more (see below, there are beaucoup bucks that will gladly be spent on this)

Business model: the researcher's academic institution would (gladly) pay. (If you get pushback, call the university's academic HR department, find out who is in charge (i.e. the associate provost of research or HR), and email them asking whether they are prepared for someone to shine a spotlight on their own "Andrew Bird and Stephen Karolyi"-like scholars.) They will call you in 5 min.

GitHub will acquire you post-POC upon enlisting 2-3 major universities (again, I know what the higher-ed procurement cycle is like; contracts will be signed in record time). If GitHub fails, Google Scholar will acquire you.


Hmm, I wonder if one could use differential privacy to make mathematically certain that no individual data point could leak. Although that may not be a strong enough guarantee


Yes, I've seen it on websites (e.g. GMAC), but the operators available are trivial (e.g. mean, median)… the system will not process the request if the sample is too small (I think the minimum sample size is actually a function of how much you pay)


If top scientists are willing to outright manufacture data, how many would be willing to intentionally follow unsound logic? As in dishonestly claiming some (legitimate) data implies something that it doesn't or intentionally avoiding alternative explanations. You'd have to assume there's even more of that and all other kinds of soft fraud going on.

Ideally planning data production, execution of data production and analysis would all be done by different people and published separately. At the very least you'd want total transparency at each stage.


These are top social scientists and, unfortunately, their field is a total dumpster fire at the moment. See https://www.nature.com/articles/s41562-018-0399-z


If you're implying the replication crisis is limited to social sciences, that's too narrow. You'd expect to see these problems in any field that primarily relies on hard to gather data that is far removed from the underlying phenomenon. Many sciences that aren't strictly "soft" have "soft" fields. Say, Biology->drug research.

Also not to argue semantics but when you hear the word "scientist" in common parlance it almost exclusively refers to people doing that kind of research. The "sciences" that don't have these problems tend to not be too big on using the word "science".


Cosmology is the first field which comes to mind; it fits your criteria, but is far harder (as a science) than psychology and far more reliable. Psychologists can't define "mind" and change their entire profession every few decades, but "sun" has not changed much as a phenomenon for a long time.

The reason why the social sciences (social psychology, criminology, economics, advertising/marketing) are hit hardest by the replication crisis is because they were not actually sciences! They are ways for our society to police its members' behavior (relationships, crime, money, attention) and they pretend to be science in order to gain prestige.

There's also, finally, the matter of scale and relevance. When a chemist reports that a rare chemical reaction doesn't perform in the lab as expected, then this is a failed replication; however, chemistry labs regularly replicate entire textbooks' worth of material for entire classes of undergraduates. Meanwhile, some of the most important claims in psychology, like priming, are failing to replicate.


You don't even need the graphs; miles driven per year is a sum. Sums converge to Gaussians, yet here the deviation was half of the mean, which suggests they used a uniform random variable generator for miles.


> yet here the deviation was half of mean which means they used a uniform variable

What am I missing? The mean of a normal distribution can be any value. A normal with mean 2 and deviation 1 has deviation equal to half of the mean.

Is this related to the fact that miles driven can't be negative (unlike a sample from a genuine normal)?


I think this is an overzealous statement of the central limit theorem. A log-normal distribution would probably be a better model in this case.


Interesting point. So the original paper (2012) didn't give a table or anything, but said: "Customers who signed at the beginning on average revealed higher use (M = 26,098.4, SD = 12,253.4) than those who signed at the end [M = 23,670.6, SD = 12,621.4; F(1, 13,485) = 128.63, P < 0.001]."

Was M = 26,098.4, SD = 12,253.4 enough to infer a uniform distribution?


> Was M = 26,098.4, SD = 12,253.4 enough to infer a uniform distribution?

No, unfortunately it is not. It's perfectly plausible to have a normal distribution with mean = 26k and sd = 12k. Those numbers do look kinda weird, but it's nothing you could verify. I am not sure what the F(*) means here, maybe an F statistic? But that seems odd if you are comparing two normally distributed samples; you might expect a t statistic.

To verify the distribution you would need a histogram, or you could get fancy with a Q-Q plot. You could also try a statistical test for normality, but those tend to reject for trivial deviations at large sample sizes, so the visual plots are your best bet.
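
For what it's worth, here is a minimal sketch of that kind of visual check, using simulated numbers rather than the paper's data:

    # Illustrative only, with simulated data (not the paper's): how a histogram
    # and a normal Q-Q plot separate a genuinely normal sample from a uniform
    # sample that has a similar mean and standard deviation.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 13_000
    normal_like = rng.normal(loc=26_000, scale=12_000, size=n)
    uniform_like = rng.uniform(low=0, high=52_000, size=n)  # mean ~26k, sd ~15k

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for row, (name, sample) in enumerate([("normal", normal_like),
                                          ("uniform", uniform_like)]):
        axes[row, 0].hist(sample, bins=60)
        axes[row, 0].set_title(f"{name} sample: histogram")
        stats.probplot(sample, dist="norm", plot=axes[row, 1])
        axes[row, 1].set_title(f"{name} sample: normal Q-Q plot")
    plt.tight_layout()
    plt.show()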


Taleb pointed this out on twitter iirc


Fake news in academia. They need to do it because their clickbait does not drive ad revenue; rather, it drives the awarding of grant money. Does the "educational institution" where the researcher works then have to pay back the grant money? After all, everything in the lab is actually owned by the university, the grant is administered by the university, and the university takes a hefty "management fee".


The 2012 article states:

> Specifically, we ran analyses examining whether the two conditions differed in the number of instances wherein reported odometer mileages ended with 0, 5, 00, 50, 000, or 500. Numbers that end with these digits indicate a higher likelihood that customers simply estimated their mileage. We detected no statistically significant differences between our two conditions in the instances in which these endings appeared (pooled measure: treatment, 19.9% vs. control, 20.8%; χ2 = 2.5, P = 0.12).

1. What does "pooled measure" mean here? Wouldn't a mileage ending in 500 also end in 0 and 00?

2. If the numbers are uniformly randomly generated, shouldn’t the odds of ending in 0 be 10%, not 20%? Or is this an equal mix of fraudulent and real data in both arms?

Just curious because this is the one of the only bits of data for the fraudulent study in the 2012 article.


Hmm? Academic fakery to support some financial unreality isn't exactly new. Remember how some economists figured they wouldn't include important chunks of the first world in an Excel calculation, in order to justify the grinding austerity favoured by right-wingers in the early 2010s? [0] They kept the data, they just decided not to include it in the calculations. Amazingly, actually including the data shows that the effects of a high debt-to-GDP on growth are negligible at best, rather than overwhelmingly bad.

Did they get anything more than a slap on the wrist? Are they still selling books based on the concept?

The breathless commentary regarding how perfidious psychology is threatens to sprain the muscles that govern the rolling of my eyes.

[0] https://en.wikipedia.org/wiki/Growth_in_a_Time_of_Debt


I love polymarket. Prediction markets on topics like this are about the only way to discern truth from afar.


Wouldn't Polymarket be subject to incomplete information, and thus to bias? Polymarket would just reflect the market sentiment given a fairly robust analysis of the available information—admittedly a pretty useful idea.

The other issue is that a very niche bet/topic might only have a small sample of analysts placing bets. This would mean you are subject to increased risk of bias from a single bad analyst. Or simply large variance due to small sample size.


I am wondering how many such non-replicated studies leak into the decisions and recommendations of federal organizations like the CDC or ECDC or WHO. It creeps me out to think that some organizations adopted some of the ideas outlined in this person's book.


What if humans were generally so unique that there were no useful generalizations in psychology? (E.g. every attempt at an experiment like this was just noise.)

How would that affect the field?


It would probably focus on measuring the degree to which individuals differ, assigning explanatory variables, and figuring out the theoretical maximum (as brain chemistry sets a pretty hard limit). But that doesn't really matter, because we know that people can very easily be generalized - there is an entire industry devoted to it: advertising. Many years ago I had to design a process to pre-stage inventory based on expected demand - using historical data from a fairly inelastic customer base. It was spooky how accurately a Poisson distribution could predict long tailed events - but only after segmenting the customer base into something like 5 groups. This was for millions of people - 5 groups.
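
As a rough illustration, the mechanics of that kind of pre-staging are simple enough to sketch; the segment names and demand rates below are invented, not the actual data:

    # Hypothetical sketch of Poisson-based pre-staging: the segment names and
    # demand rates below are invented for illustration, not the actual data.
    from scipy.stats import poisson

    daily_demand_rate = {   # lambda = historical mean units per day
        "segment_1": 3.2,
        "segment_2": 11.5,
        "segment_3": 45.0,
        "segment_4": 120.0,
        "segment_5": 310.0,
    }

    service_level = 0.99  # cover a day's demand with 99% probability
    for segment, lam in daily_demand_rate.items():
        stock = int(poisson.ppf(service_level, mu=lam))  # smallest k with P(D <= k) >= 0.99
        print(f"{segment}: pre-stage {stock} units (mean demand {lam}/day)")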


Makes sense for consumer behavior since there are a finite set of purchase choices. Things like ethics and behavior choices are fuzzy though…


> behavior choices

I really wish this weren't the case, but "choice" isn't a thing - as far as anyone can tell. All the complexity on display in the world appears to be a result of feedback loops, it really doesn't take much to generate a continuous low interval stream of pseudo random feedback - predictable after enough observation. People who regularly purchase chunky peanut butter and cat litter have a significantly higher probability of driving a VW bug. That isn't the result of limited alternatives.

https://en.wikipedia.org/wiki/Linear-feedback_shift_register


My bet would be that a surprisingly large number of psychology studies are completely irreproducible. I'm even skeptical of a lot of fMRI experiments. Meta-analyses certainly can give a bit better insight, but have their own problems.


[flagged]


Your assessment of psychology is flawed in several ways. The main concern of psychological research isn't to help someone. It's to understand how/why organisms think and behave under certain conditions. Psychological research can be conducted using the scientific method like any other science field. To say "Psychology", as a whole, is similar to snake oil, ignores thousands of perfectly sound findings. Fraud is certainly not unique to the field of psychology. Nor are the pressures to publish positive findings.

Your final thoughts about 'life outcomes from genetics' are incoherent.


I think this is over-sold, but psychology as a field is absolutely closer to philosophy with a veneer of science put on top than the field would have you believe from the outside, and psychology as a job is absolutely not science based day-to-day.


It's worse than that; Pirsig reminds us that psychiatrists are literally thought police. From this perspective, psychology is a mechanism for justifying psychiatric abuses. After all, if psychology can show that minds exist, that there are normal and abnormal minds, that there are correct and incorrect modes of thought, then psychiatry has only to show that certain incorrect modes of thought form clustered disorders and syndromes. The resulting system can punish people for expressing anything which society doesn't like, from sexuality to politics to a desire not to be overworked.


I'm pretty sure this was discussed in an earlier HN post.

However, I want to point out that "not replicating" doesn't necessarily equate to bad science (or, more accurately, bad practice on the part of the principal investigators with regard to accepted best practices in their field).

There are a bunch of studies in the social sciences that "failed to replicate" simply because a p value landed slightly above the significance threshold, even though the coefficient direction/magnitude etc. agreed with the earlier study.

That said, I'm not defending - at all - the paper discussed in the article.


There is a bigger accusation of fraud here, not only a failure to replicate.

Edit: Also [0] was on HN 5 days ago (discussion here: [1]). This is another blogpost that discusses [0]

[0] https://datacolada.org/98 [1] https://news.ycombinator.com/item?id=28210642


And Buzzfeed: https://news.ycombinator.com/item?id=28257860 (yesterday, 86 comments)


> because a p value landed slightly above the significance threshold

If some area is getting too many "statistically significant" flukes, they must choose a lower threshold. For example, in particle physics they use 5 sigma, which is an insanely low threshold, because otherwise they would have to announce a fake particle every month and issue a retraction a few months later.

> even though the coefficient direction/magnitude &c... agreed with the earlier study

This is a problem due to reporting and publication bias. Last month there was a discussion about the relation of lead and crime: https://news.ycombinator.com/item?id=28016921 . Everybody knows that lead is bad. In that meta-analysis, most studies show a very low effect, and only a few have a strong effect. The problem is that a possible explanation is that flukes showing a strong bad effect are published, but flukes showing that lead is good for people are silently discarded.

Another field with a lot of reporting and publication bias is drugs against covid-19. For example, look at the graph near the bottom of https://news.ycombinator.com/item?id=27852130 . How many of these studies are not even statistically significant?


Failure to replicate is the definition of bad science. Science is based on objective, repeatable experiments. If the experiments can't even be reproduced how can they be hypothesized and validated?


Being possible to replicate isn't the same as every experiment getting identical results. The point of doing statistics is to separate signal from noise, but it also means ignoring results where the signal wasn't strong enough because your sample size was too small, etc.

Suppose you're testing a weight loss drug, and in the first experiment subjects lost 5 pounds while in the second they only lost 3. On the surface the first study replicates fine, except the second study has a weaker signal, so it might not pass the statistical threshold chosen before the study. This is a fairly common issue, as you should expect random noise to amplify the signal in about half of all experiments and in more than half of all published research.
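
A tiny simulation of that scenario, with assumed numbers (a true 4 lb effect, 12 lb standard deviation, 50 subjects per arm), just to show how the same real effect can land on either side of p = 0.05:

    # Illustrative simulation, not real data: the same true effect, measured
    # twice with noisy samples, will often land on opposite sides of p = 0.05.
    # Rerun with different seeds to see both outcomes.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_effect, sd, n = 4.0, 12.0, 50  # assumed effect (lb), noise, group size

    for study in (1, 2):
        treated = rng.normal(-true_effect, sd, n)  # weight change on the drug
        control = rng.normal(0.0, sd, n)           # weight change on placebo
        t_stat, p_value = stats.ttest_ind(treated, control)
        observed = control.mean() - treated.mean()
        print(f"study {study}: observed effect {observed:.1f} lb, p = {p_value:.3f}")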


This is so not true on many levels. Science is not about being right every time. Positive experimental findings can turn out to be wrong even if the study was excellent. This is called a type 1 error, and indeed such errors are expected to happen with 5% probability whenever the null hypothesis is true, even assuming every study is conducted correctly and people don't cheat.

Also, I don’t think you understand that there is a difference between “replicate” and “reproduce” as those terms are typically used in social and biological science. Reproduction refers to the methods and is simply conducting the experiment again, regardless of the outcome. An experiment should be reproducible, of course—if it isn’t, the paper was poorly written (the experiment might still be ok though).

Replication refers to the results: do we get the same answers as the original experiment? Bad science is one way an experiment might not replicate, but other reasons are: (1) just a regular type 1 error in the original study, (2) a type 2 error in the new one, or (3) an overlooked methodological difference or mistake somewhere. No. 3 doesn't mean bad science; it could be just normal scientists making normal mistakes, as they sometimes do.

Also, you don’t “hypothesize” an experiment. You come up with a hypothesis then test it with an experiment. (Sometimes scientists HARK but that’s a different discussion …)

In this case, it sounds like there truly was some cheating going on of the worst kind (data fabrication).


If we report p values around 0.05 wouldn’t you expect about 1 in 20 studies failing to replicate without being bad science?


Not quite. That's the common misinterpretation of p-values as probabilities that a study is correct. The p-value isn't the probability of the hypothesis being wrong given the data. It is instead the probability of seeing data at least this extreme given that the hypothesis is wrong.

In order to determine what proportion of studies would fail to reproduce, you would also need to know the number of correct hypotheses in the first place. If 100 correct hypotheses and 900 incorrect hypotheses are tested (and power against the correct ones is, say, 95%), your sample of "p < 0.05" results would contain about 95 correct hypotheses and 45 incorrect ones, so roughly 1/3 of the significant results would fail to replicate.
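
Spelling that out, with power against the correct hypotheses taken as 95% for the sake of the example:

    # Illustrative only: the numbers from the comment above, with power against
    # the correct hypotheses assumed to be 95%.
    n_true, n_false = 100, 900   # correct vs. incorrect hypotheses tested
    alpha, power = 0.05, 0.95

    true_positives = n_true * power      # 95 genuine findings
    false_positives = n_false * alpha    # 45 flukes
    significant = true_positives + false_positives

    print(f"'significant' results: {significant:.0f}")                                    # 140
    print(f"fraction expected to fail replication: {false_positives / significant:.0%}")  # ~32%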

This is why there is such a large push for prepublishing of analysis methods, hypotheses, and experimental design, and later publishing of negative results. The initial distribution of hypotheses is important to know how to interpret the resulting body of work.


In the social sciences? If they're reproducing the earlier results with the original data and code, the results should be exactly the same.

If they're replicating with a follow-on study of some kind, it depends on any number of factors, since it's almost impossible to do a follow-on survey-based or lab experiment or use a similar or newer dataset while accounting for and controlling exactly the same set of conditions (this is one of the issues that I've heard used to distinguish between hard and soft sciences).

E.g., if paper A has an observation count of 200 (insert your preferred number), a replication study could hand out exactly the same survey to another 200 respondents and get similar results that, nonetheless, fail to exactly replicate the earlier findings (or, over a small number of samples, fail to conform to a normal distribution).

This doesn't mean that the original study was "bad science" or that the social sciences in general are "bad", only that we have to temper our expectations and maintain a healthy dose of skepticism.


What's the purpose of this? It will just give ammo to anti-science, Trump-like people.

It's harmful. This kind of stuff should be handled as quietly as possible, and definitely not in public.

Science is already under attack, we need not undermine it from the inside too.


Anti-science people tend to be conspiracy-minded and believe that the establishment is a homogeneous group that is trying to mislead the public.

Creating an actual conspiracy to mislead the public is not a good way to fight this perception.


I disagree. Constant re-evaluation of evidence and improvement of working theories are essential to science. Science needs transparency and honest critique.

Acknowledging falsified results is not an attack on science, it is how science works. It shows that science is not a mysterious cult of knowledge conspiring to hide information and deceive others, but rather the collective effort of imperfect humans who want to understand the universe and make things better for us all.


Not sure if serious, but I disagree. I think scientists playing games with public perception is a big part of what undermined their credibility in the first place. Also, this article talks about Ariely loudly promoting his own work when he should already have known it had failed to replicate. If scientific studies are being used to drive public discourse, then debunking them should be public discourse too. The very strength of science is its ability to be self-critical. The mechanics of that shouldn't be hidden away.


Really? Covering up a messy truth because you perceive it as not politically expedient strikes me as the exact opposite of science. That road leads to worse knowledge, not better.


I mean, if anti-science Trump people use this to be skeptical of results in psychology and sociology, how wrong would they be? These fields are genuinely corrupt and generate unreliable results. We already know the solution: preregistration and registered reports. That may not actually be enough because there are other issues (e.g. almost all studies that attempt to "control for x" or similar are subtly flawed and should be replaced with proper RCTs), but it would at least be a huge, obvious, and inexpensive step in the right direction, and the fact that it hasn't been implemented yet shows that the field is immature and broadly shouldn't be trusted.

(Compare to medical research, which has embraced preregistration because the FDA is actually serious about safety and efficacy and had the good sense to require it. But the FDA is anchored in the real world: if they get it wrong, people may die and it's on their hands. If psychologists get a result wrong, they may generate a bombastic paper for themselves, do a TED talk, go speak on NPR, then write a pop-sci book, and generally boost their career. So the incentive is for the FDA to be right and for the psychologist to be wrong, and it's not surprising that that's what we see.)


What you just said goes against everything science stands for, otherwise it just becomes a belief system, and you're no better than what you warn against.


> What's the purpose of this? It will just give ammo to anti-science, Trump-like people.

It's bizarre how today, in the USA, for some people it's all about "the other camp can't have a win, at any cost," and not about what is right. Fake science hurts science the most, and covering it up IS what is going to give ammo to whoever you do not like; calling a fraud a fraud doesn't do that.



