Psychology’s Replication Crisis Is Real (theatlantic.com)
490 points by zwieback 3 months ago | 276 comments

You would think that computer science can't have replication failures, but it can. And I'm talking about my own field: machine learning. There is so much hype that I suspect that people are pushing papers and trying to actively hide the irrelevance of the methods and algorithms they develop.

Artificial intelligence faces reproducibility crisis http://science.sciencemag.org/content/359/6377/725

Reproducibility in Machine Learning-Based Studies: An Example of Text Mining https://openreview.net/pdf?id=By4l2PbQ-

Missing data hinder replication of artificial intelligence studies http://www.sciencemag.org/news/2018/02/missing-data-hinder-r...

> In a survey of 400 artificial intelligence papers presented at major conferences, just 6% included code for the papers' algorithms. Some 30% included test data, whereas 54% included pseudocode, a limited summary of an algorithm.

> 6% included code for the papers' algorithms

This is completely ridiculous.

In pure mathematics, everything you need is right there in the paper. If instead you're in the physical sciences, you obviously can't include your lab in the paper. For software, it's perfectly possible to include the lab in the paper, so to speak, and there's no excuse for depriving the rest of us of access.

Not interested in releasing the source? Ok, just don't brand your project as 'computer science'.

Including full source-code (and ideally data-sets, unless there's good reason this can't be done) should be a basic requirement for serious publications. It's disappointing that this isn't (yet?) the norm.


Agreed, at least pseudo-code can be included.

I have to strongly disagree here. The source code itself - and not some proxy for it - must be made available.

Pseudocode is just another description of the approach the authors took. It doesn't allow another researcher to carefully recreate the exact experiment that the authors ran.

Software is famously hard to get right. This can be exacerbated by certain research problems where you don't know the result to expect (modelling, say). If you're going to publish results, the source you used should be available for inspection, for similar reasons to why mathematicians have to publish the proof, not just the conclusion plus a promise.

The upsides here are many: protecting science against software bugs, easier replication of experiments, better protection against academic fraud, and helping other researchers extend the work.

(I realise this last point may be at odds with toxic academic careerism. All the more reason for the publication process to insist on it.)

Most papers include pseudo-code when it helps clarify things. This tends to be a very high-level description of the algorithm though, and is not really the same as including actual source.

I was talking to a friend who moved from the Google Brain team to the DeepMind team, and he said point blank that no one was going to reproduce their work and that they were not going to reproduce other groups' work. Everyone has their own research agenda, and the hardware and compute time needed to process the data are available to only a tiny number of research groups around the world.

> the hardware and compute time needed to process the data are available to only a tiny number of research groups

You'd think this would make them more interested in replication, since otherwise it's uncertain whether they're actually doing anything of value.

Science and research are full of all kinds of false starts; you don't increase your speed by blinding yourself.

Falsifiability is the very definition of science.

I do research in psychology and also a little in machine learning (classification using text samples). I've actually been working with the Many Labs data (the project that's the focus of the article).

My impressions of machine learning match up with yours. There are so many parameters, and so many opportunities to capitalize on chance, combined with [necessarily] huge datasets that preclude replication, that it's hard to avoid. I have also been surprised at how much tweaking of parameters there is; I'm used to working in traditional statistics, where more is derived from theoretical principles, as opposed to trying a bunch of values to see what "works." This all lends itself to overfitting. I strongly believe that a lot of the adversarial-input work is basically capitalizing on this overfitting.
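That tuning dynamic is easy to simulate. The sketch below is a toy illustration of my own (not any particular paper's setup): "predictions" are pure coin flips for every hyperparameter configuration, yet picking the best configuration on a small validation set still yields accuracy well above chance.

```python
import random

random.seed(0)

def best_of_many_random_configs(n_configs=200, n_val=100):
    """Labels and predictions are pure coin flips, so no config has
    any real skill. Selecting the best of many configs on a small
    validation set still produces accuracy well above 50%."""
    labels = [random.randint(0, 1) for _ in range(n_val)]
    best = 0.0
    for _ in range(n_configs):
        preds = [random.randint(0, 1) for _ in range(n_val)]
        acc = sum(p == y for p, y in zip(preds, labels)) / n_val
        best = max(best, acc)
    return best

# Typically lands around 0.60 or higher, despite zero signal in the data.
print(best_of_many_random_configs())
```

The more configurations you try against the same held-out set, the larger this purely chance-driven "improvement" becomes.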

To be fair, I don't think people are necessarily being nefarious, I think people across all fields just don't appreciate the dangers of overfitting.

The one upside of this Many Labs work, mentioned in the paper, is that it tends to show that the common criticism of "how can you generalize from this sample of undergrads" is not so much of a problem. Not that it's not an issue at all, but if you're studying some basic cognitive process it's probably not going to matter that much if you use undergrads versus some perfectly representative sample of the population. People have shown this in different ways before but it's useful to know more about. Obviously with some things sociogeographic variation will matter more though.

One thing that might not be totally obvious from this article and others is that although some effects clearly replicate, and others do not, there are some effects that seem to be in a grey area, where the effects are probably real but much smaller than originally reported. The distribution of effect size estimates is more continuous, with shades of grey, than these news reports would have you think. Whether an effect is tiny versus zero might not matter practically, but at some level it is important to be mindful of.

> One thing that might not be totally obvious from this article and others is that although some effects clearly replicate, and others do not, there are some effects that seem to be in a grey area, where the effects are probably real but much smaller than originally reported.

The two prior replication studies that I know of - the first several years ago, which afaik raised the issue in the public mind; the second maybe in the last year - both replicated almost all of the effects. The problem was that some effects were weaker than in the original studies. Are you saying that this most recent attempt at replication shows a large number of studies with no significant effects at all?

Real effects might be weak enough that they only make it through the first-time (non-replicative) publication process when chance makes the effect seem stronger than it actually is.

> Real effects might be weak enough that they only make it through the first-time (non-replicative) publication process when chance makes the effect seem stronger than it actually is.

That's hypothetical. What actually happened was what I wrote in the GP: The effects were replicated almost every time, except sometimes they were weaker.

You miss my point. The strength of a novel main finding must be sufficient to warrant the opportunity cost of trying to get it published. A weak but real effect will be more or less strongly indicated in an experiment, just by the vagaries of the sample, but weak results typically will not get published. It is thus not surprising to me that findings might frequently be replicable but weaker than initially found.

In my mind, 50% replication is not that horrible. Even decent-seeming sample sizes might have only an 80% chance of replicating at a certain effect size, and the effect size originally reported could be (by chance or for other reasons) overstated, leading to insufficient power. More certainty ends up meaning vast increases in sample size. Another issue that should be tested (I have not read the paper, but I doubt it was) is not just whether the result is statistically significant, but whether it statistically differs from the original published result. If the effect is weaker, but not statistically different from the published result, I would hesitate to say the effect is not real. Then there's the fact that some of the psychology findings investigated were rather dubious to begin with, and the ones that even non-specialists could identify as likely to be true replicated.
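The power arithmetic here can be made concrete with a rough normal-approximation sketch (my own assumptions: a two-sample design, two-sided alpha of .05, standardized effect sizes): a replication powered at 80% for the originally reported effect has far lower power if the true effect is half as large, which by itself pushes replication rates down toward 50% or below.

```python
import math

def replication_power(true_d, n_per_group):
    """Approximate power of a two-sample test (normal approximation)
    to detect a true standardized effect `true_d` at two-sided .05."""
    se = math.sqrt(2.0 / n_per_group)   # approximate SE of the estimated d
    z = true_d / se - 1.96              # distance past the critical value
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Original study reports d = 0.5; a replication sized for that uses n = 64/group.
print(round(replication_power(0.5, 64), 2))   # about 0.81 if the effect really is 0.5
print(round(replication_power(0.25, 64), 2))  # about 0.29 if the true effect is half as big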

So, a result that is statistically no more accurate than flipping a coin is acceptable to you? Large cultural shifts have been pushed on a large scale based on studies that may have no more probability of truth than flipping a coin and this is OK to build a society on? I couldn't disagree more.

Your characterization of what I said is overly simplistic and... wrong. I am saying that even with a good sample, replication studies have at most about an 80% chance of replicating a true effect for studies that reported a "moderately sized" effect size. A lot of other things can account for the other 30%, including picking the more dubious of the findings out there (which seem to be mainly social psychology, not cognitive psychology, studies). Looking at your other comments on HN, you seem to love throwing bombs and trolling.

That isn't actually even close to what you said (very contrasting statements). 50% is quite horrible (I was not the one who down voted you, but I see others agree). BTW, mentioning your opinion of someone's "other comments" doesn't help your case either. It makes you appear desperate to cause a distraction or detraction.

Why don't the peer reviewers demand higher standards?


One of the foundational problems with the soft/social sciences is that many of its practitioners have such weak analytical thinking skills (Note below) and the educational requirements for Psychology majors typically include just a few semesters of statistics courses designed to be understandable for them.

Peer reviewers in Psychology generally don't expect authors to go much beyond the basic statistical tricks students are taught in undergrad. From prior discussions with professors in these fields, it's my impression that most are unable to understand just how flimsy such methods are.

Personally, my prediction is that this won't get better until we have more AI-guided analysis software packages that make robust analysis accessible to those in soft-science fields. That is, something to replace reliance on stuff like R^2 and p-values.


Note: Just to provide some numerical evidence for the assertion that Psychology practitioners tend to be weak with their analytical skills, here's [a PDF of GRE scores](https://www.ets.org/s/gre/pdf/gre_guide_table4.pdf). GRE test-takers planning to study "Psychology" had a quantitative reasoning (QR) mean of 149. This is pretty bottom-of-the-barrel; even people planning to go to grad school for "Arts - Performance & Studio" did a little better, with an average of 151.

"One of the foundational problems with the soft/social sciences is that many of its practitioners have such weak analytical thinking skills (Note below)"

Andrew Gelman argued that a fundamental problem with peer review is that it is done by peers. Peers whose background and skills are similar to those of the authors, and thus not likely to catch things that were missed.

[When does peer review make no damn sense?](https://andrewgelman.com/2016/02/01/peer-review-make-no-damn...)

I'm sure something could be gained via education. I don't think you necessarily have to be good at math to understand the concepts. But a lot of undergrad courses focus on hypothesis testing and p-values (despite the American Statistical Association's warnings about how those methods are misused), and encourage memorization of steps to do simple math over understanding what any of it means. I think the ad-hoc approach of many machine-learning intros would do better. Maybe programming isn't any easier for a psychology student, but simply hammering the ideas, with short demos of pitfalls, may help.
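One such short demo of a pitfall, using only the standard library (the sample size and test choice are my own toy setup): with the conventional alpha of 0.05, running many experiments where the null hypothesis is true still produces a steady stream of "significant" results.

```python
import math
import random
import statistics

random.seed(1)

def false_positive_rate(n_tests=1000, n=30):
    """Run many two-group 'experiments' where the null is true (both
    groups drawn from the same distribution) and count how often a
    naive two-sided z-test on the means comes out 'significant'."""
    hits = 0
    for _ in range(n_tests):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
        z = (statistics.mean(a) - statistics.mean(b)) / se
        if abs(z) > 1.96:
            hits += 1
    return hits / n_tests

# Roughly 0.05: about one spurious 'discovery' per twenty null experiments.
print(false_positive_rate())
```

Seeing this once is worth more than memorizing the steps of a t-test: a lab that runs twenty analyses and publishes the significant one is sampling from exactly this stream.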

For "small n" problems like in psychology, psych students are much better off with a statistics background than machine learning. So what I'm advocating is a change in how these classes are taught.

Of course, I've known plenty of students who program via copying and pasting code, and modifying it until it runs as necessary. The equivalent of memorizing math steps. So it will take more to solve the problem.

I'll admit that might have something to do with it, although psychology GREs are misleading, because a lot, probably the vast majority, of those test-takers are hoping to go into clinical psych, especially to practice therapy. This leads to issues in a number of ways: (1) many of those applicants really don't understand what psychology is about, (2) most (90% in research-oriented programs) are rejected, and (3) among what's left, it has to be said that they probably don't need to be the most quantitative to be good therapists, if they're not doing research.

Psychology is a strange science in that it's a mixture of people with very unquantitative backgrounds, and those who deal with very complex math and statistics. What many don't realize is that meta-analysis itself really was developed as a method in psychology (even if it technically has its origins earlier with Pearson). This registered replication work is an extension of that, again being done by psychologists. It's probably safe to say that more empirical and statistical research on the scientific process itself has been done by psychologists (along with statisticians and many public health researchers) than any other discipline.

In any event, replication problems happen in other domains as well. This has been documented empirically. It might be worse in the biomedical domain than, say, physics or chemistry, but it's not limited to psychologists. What I see in the neurosciences per se is just as bad, if not worse (because it's ignored more).

I think ignorance of issues pertaining to overfitting, etc. definitely contributes, but I also think that ignorance is pretty widespread, and the problems can be sort of pernicious in that they don't always operate intuitively.

The GRE is nonsense anyway so I wouldn't really take it as meaning much of anything. Just look at how awful the subject tests are to see why nobody should ever give GRE scores more than a passing glance.

I think the fact we're even discussing a "replication" crisis points to some small movement in academic research to address this (some journals are changing / have changed policies as a result).

There are probably different causes at different times. Some of the effects targeted by replication tests are relatively old, from when people were less aware of some of these phenomena. To be fair, some of this stuff is unintuitive: for example, some of the major journals would require internal replications, over several samples, the rationale being that if someone shows an effect in 5 samples with slightly different designs, it's probably "real." It's not like all of these things were just based on single samples (although some of them certainly were). Of course, now people are aware that 5 small samples do not a large-sample replication make.

My intuition is that another cause is that academia is flooded with researchers collecting a lot of data, under a lot of pressure to produce positive findings to attract money. Hype, TED talks, grants, hypercompetition, and so forth. Academia now is horribly incentivized to be popular and bring in money (from peers, it's important to note), rather than to be correct. Add in a complex subject, like human behavior, and it's a recipe for disaster. Academia is also full of conventions that attain the strength of power structures; when these are baked in, it's hard to change them, because the people who attained power under those conventions are in charge (nothing conspiratorial, really; people just have their biases and blind spots).

> Why don't the peer reviewers demand higher standards?

I'm not sure peer review can change a field's direction. Say the top journal in a field used to publish the top 10% of papers. If only 1% of papers meet some higher standard, then if the journal chose to enforce this, they would essentially cease to exist.

And the people who would otherwise have gotten hired/promoted/tenured based on publications there will now win these tournaments based on their publications in the second-best journal. Their contests are with others in the same field.

I guess the more hopeful scenario is that, besides the 1 in 10 top papers that are actually solid, there are (somewhere) 9 other solid papers with less flashy results -- presumably carefully proving things that everyone already thinks, not the counter-intuitive things that get you a TED talk. Here it's more complicated: if the journal chose to publish these instead, it seems to me that hiring/promotion/tenure committees might simply start to view it as a less prestigious venue.

Yes. As a practitioner it's very frustrating.

I think the situation is pretty analogous to other areas with replication crises, like psychology and nutrition: local variables dominate. Just as with our own bodies, there's a lot that just depends on the domain or the specifics of the problem being solved...

This has been my experience in a related field (information retrieval). There are trends and best practices, but the market and impossible-to-replicate, overhyped academic research reinforce each other.

It's for this reason that I like publishing straight to my blog: http://statwonk.com/weibull.html Let the code / math do the work.

I intentionally make it easy to pick up, modify and inspect. The benefit to me is precisely being alerted if it isn't replicable.

So a replication crisis in artificial neuropsychology.

You may be joking, but I'm tired of how, every time neural networks come up, this unfortunate conflation of different aspects of AI occurs: one based in accurate biological modeling, the other based in tractable statistical methods.

While it is true that both are inspired by neural activation, and that both can produce a complex yet coherent mapping from input to output, this is like saying that birds and aircraft are similar because both fly and have wings.

Yet both birds and aircraft can stall. (Granted, birds can get out of stalls much more easily than aircraft can.) While many analogies between natural neural systems and various ML arrangements are false, not all of them are.

Indeed, they have similar characteristics in certain respects, and sometimes ML is inspired by neurobiology. I suppose my exasperation is primarily directed at non-engineers/researchers hyping AI; to abuse the analogy, sometimes these futurist writers seem to think that just because both can fly, aircraft can one day lay eggs too.

The analogy is like "both a human and a baseball launcher can throw a ball, so they must be the same." People really need to let go of the terminology around machine learning that leads people to really believing that it's close to actual cognition.

But what is cognition? People should reevaluate their views on how it works.

The brain is very complex: a very big ball of highly specialized circuitry (and the software running on it, managing its various aspects and parts). The similarities are very striking, but the differences are just as drastic in the number of layers and other size-related parameters of our various brain components.

Yes, AlphaGo Zero is not going to learn to cook and sing and dance, but it beats our gaming component in a lot of areas. Of course a synapse is not really a memristor or a ReLU, but at the same time it's not that far off either. Similarly, a biological neuron is not just a simple backpropagating integrator like a perceptron, but it works very similarly.

And just as we don't really know how all the representations work in an ANN, we don't exactly know how the biological aspects of memory/learning/vision work in our brains; yet we have made enormous progress on both. See all the demos/visualizations of how various layers encode in deep nets, and look at the data about the brain's visual cortex and the V1 circuit, or at the gene-spliced mouse studies where memory encoding is studied (and sometimes only one neuron encodes a memory/face/concept - just as with deep nets).

Why isn't more research and research materials open source in the AI world? I don't really understand.

If one doesn't have your training dataset or your code, how could they possibly replicate your results?

If you have a good paper with an important result, providing code and data is not necessary.

Providing the code to replicate is good form. It shows good faith and confidence. Exact replication (exactly replicating the study) is just the starting point to check that the code works and no obvious mistakes were made.

replication / reproducibility / hyperparameter sensitivity

If the research yields something really important and the method is well documented, usually it can be checked easily without having the data and the code. Things like dropout, batch normalization, and residual learning work across multiple different datasets and hyperparameters. You can reproduce the results without faithfully replicating the experiment.

If the claimed result vanishes unless you have the exact data, code, or hyperparameters, the research can't be said to be meaningfully reproducible in the scientific sense. Hyperparameter sensitivity is the ML equivalent of p-hacking.
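A toy illustration of that sensitivity (simulated scores, not a real model; the noise level is my own assumption): when run-to-run noise from seeds, data order, and initialization is comparable to the claimed improvement, reporting each method's best run manufactures a "win" between two methods of identical true skill.

```python
import random
import statistics

random.seed(42)

def run(method_skill, seed_noise=0.02):
    """One training run: observed score = true skill + run-to-run noise
    (initialization, data order, nondeterministic kernels, etc.)."""
    return method_skill + random.gauss(0, seed_noise)

# Two methods with identical true skill (0.90), five seeds each.
baseline = [run(0.90) for _ in range(5)]
ours     = [run(0.90) for _ in range(5)]

print(f"baseline: best={max(baseline):.3f} mean={statistics.mean(baseline):.3f}")
print(f"ours:     best={max(ours):.3f} mean={statistics.mean(ours):.3f}")
# Reporting each method's best seed fabricates a difference out of pure noise;
# reporting means with spreads over several seeds does not.
```

This is why results tables that report a single run per configuration, with hyperparameters tuned separately per method, are so hard to trust.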

As someone who's often implementing models from ML papers: the issue is that English-language descriptions of methods are often sorely lacking. Even good authors simply forget to mention small critical details of model design or training that wouldn't remain ambiguous if they simply released their code. Part of the effort of reproducing work is certainly figuring out which aspects of the model design are critical or incidental - but this is all greatly aided by not having to guess at what was actually done from a heap of English and LaTeX, generally written feverishly over the few days before a deadline!

> Part of the effort of reproducing work is certainly figuring out which aspects of the model design are critical or incidental

Isn't this the work of writing up? A paper is a claim that you have discovered something, and implicitly that other details are standard / unimportant. If it turns out that some hidden assumption was in fact doing all the work, then the claim you made was false.

I'm all in favor of sharing working code. But working code which magically does something... amounts to an anomaly awaiting an explanation. Or an advertisement.

Alas, training times are quite long... At any given time, there's probably a couple best-in-class architectures out there for the problem you're interested in, and one or two dozen interesting bells and whistles one can add as decoration, each with a paper that makes pretty reasonable arguments and has some stats demonstrating modest gains.

The right thing to do in this circumstance is an ablation study - throw together your best-possible model and then test different subsets of features sitting between your model and the 'basic' prior work. For large datasets, though, each of these models might take a very long time to train (especially if you don't work at a place with a stupid number of GPUs available).

So, lacking resources, you get your new best-ever accuracy number with your 'everything' model, and do an extensive write-up about how awesome the new bell and/or whistle that you added to the pile is... (The problem is compounded by a need to publish quick, lest someone else describe your bell/whistle first.)

Another big problem is that adding a bell/whistle to the base model often means adding more parameters to the model. There's decent evidence coming out of the AutoML world that number of parameters matters a hell of a lot more than how you arrange them. (It's real real easy to convince yourself that your clever new idea is more important than the shitpile of new parameters you've added to the model, after all.) So a really solid ablative study probably needs to scale the number of parameters in a reasonable way as you add/remove features... And there may not be obvious ways to do that smoothly.
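As a concrete sketch of the parameter-count confound (the layer sizes are hypothetical, chosen only for illustration): counting the parameters of a dense network before and after adding a "whistle" layer shows how much budget the addition quietly consumes, which is exactly what a parameter-matched ablation has to control for.

```python
def mlp_params(sizes):
    """Parameter count of a dense MLP with the given layer sizes
    (weight matrices plus bias vectors)."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

base       = mlp_params([784, 256, 10])       # baseline model
with_extra = mlp_params([784, 256, 64, 10])   # baseline + a new 'whistle' layer

# The whistle adds ~17k parameters; to credit the idea rather than the
# capacity, the ablation should shrink the whistle model (or grow the
# baseline) until the counts match.
print(base, with_extra)   # 203530 218058
```

Matching counts exactly is often impossible for layered architectures, which is part of why clean ablations are rarer than they should be.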

And this is closely related to the replication study in psych: it's real real easy to do kinda sloppy work with big words attached, and convince yourself and all your peers that you're a genius.

I think a big database for reporting and searching for results with various architecture+dataset combinations would be much more useful than pushing more papers to the arxiv in many cases. (though, really, whynotboth.gif) Let me do some searches to see if a particular bell/whistle actually adds value across the god-knows-how-many-times someone's used it to train up imagenet from scratch...

Then it sounds like people are doing something quite far from science. That's not a criticism, there is lots of knowledge which isn't scientific... for example if you want to know how to run a restaurant, you need to talk to lots of people who do, and try to learn what they know about the art. Maybe you even go to trade shows and listen to talks. But if you follow what they said and you fail, it's not really a replication crisis.

It's a mix; some groups are more disciplined than others (and have better resources for ablation studies).

To be sure, some sciences are more rigorous than others. Various natural sciences are reliant on transient observations, which might or might not be 'reproducible' in any sense... And still progress is made, though the results might contain more prejudices than one would encounter in other areas.

It's also worth noting that some areas within machine learning are extremely difficult to test rigorously (e.g., generative processes), but are still totally worth pursuing. So be careful with cries of 'it's not even science, man!'

Right, I genuinely don't mean that as an insult; perhaps my cooking example was flip. There is lots of genuine progress made by other means -- does anyone think that Toyota's process inventions were done by double-blind trials on a large sample of factories?

And of course some guys a few doors down are proving theorems. And lots of other things.

It is slightly strange though that we try to vet all of these with the same peer review mechanism. This is one major source of the differing opinions in this thread about how this ought to work, I think.

How many papers' important results are simply bugs?

Numerical code is already bug-prone due to subtle and hard-to-test errors; research code that's not code-reviewed, or even necessarily tested for correctness, can easily generate important results erroneously.

It's also counter-productive not to publish the underlying source code for these papers, as it adds a barrier to other researchers applying the algorithm in new situations. I'd be interested in seeing if those 6% of papers which include the code get more citations than the population of papers which do not include code.

> important results are simply bugs

Probably none. If the paper is important and collects citations, the algorithm is in use. Computer science != working code.

Code is required when you produce something where the scientific importance is less clear. There is a need to provide more evidence. Many papers are just "Hey, I made some tweaks and it works in this particular case." Those papers should have working code.

Most papers leverage a results table which compares the newly proposed approach with existing approaches. This section baselines the results of the new approach against prior work and helps determine whether a new result is actually important or yet another way to achieve the same results as previous work; e.g. the tables on page 7 of this paper https://www.semanticscholar.org/paper/Automatic-Acquisition-...

These tables are generated using real implementations that may or may not be correct, and should be subject to review when the paper is published.

Reproducible builds make it much easier for others to check for things like hyperparameter sensitivity. Right?


So the argument is that running buggy code taints your replication. So ideally experiments replicate everything, meaning part of replication is implementing the code and collecting a dataset. The exact code and dataset aren't supposed to be required for the results: you should be able to replicate the results by substituting your own code and data. When you have a rat-maze experiment, it isn't expected that the paper include a 3D-printer blueprint of the maze and the genomes of the rats involved.

Or, more succinctly, code & data are left as an exercise for the reader

But if more journals archived artifacts (code + data) and there were a bigger push to keep these around, then when a replication fails, they could at least go back to the exact code and say, "huh... they implemented this differently; I think this is the problem with their algo" or "oh, we forgot to account for this"... ideally after you've written your own, looking only at the methodology and without looking at the code for the original.

Having the journals archive artifacts seems like a good idea, but the librarians hate it. The last thing they want is to give the for-profit journals another thing to monetize against the academy. What makes more sense is to have the universities and the library systems archive that stuff.

Why not use a public artifact repository? E.g. GitHub, Bitbucket, or a new academic collaboration.

It is common to publish the layouts and dimensions of mazes in such studies. The mazes[1] themselves often reflect the nature of the study (e.g. a simple fork to test decision-making, or a complex maze with dead ends to test navigation and/or memory).

1. http://www.ratbehavior.org/RatsAndMazes.htm

I see. That does make sense, like making an MIT version of GPL reference code.

Having more researchers available to audit code seems like it would help prevent flaws from slipping through, too, preventing false conclusions.

Thank you for explaining a bit more.

Amusingly, this is often taken to the other extreme. Just being able to run the exact same code on the exact same data doesn't really tell me much about whether it replicates anywhere else.

That is, I agree with you that the code and the data should, ideally, be available. I lose confidence when people just rerun the same code on the same data. The slides a while back about why someone didn't like notebooks resonated well with me. Something like: "Shift-enter through the lesson. Is this learning?"

Are the 30% of papers including training data actually hosting the data set somewhere for download, or are they referencing a public data set elsewhere? Both approaches are acceptable in my book.

If a study uses a private data set, or one that the researcher controls and only gives out to approved partners, that study should be discounted.

I understand corporate labs cannot give away their data in many cases, but corporate research carries less authority than academic research anyways. Academic research should always make the data set publicly available.

I think if you're going to do a replication study, you should collect a different data set about the same subject, and use the same method, then see if you get the same results.

Yes, simply compiling the provided code and running it against provided data set does not really count as replication.

Writing your own code and creating your own dataset is the simplest way to rule out the situations where there is something fishy in either the original code or data. And the paper itself should contain enough details to make it possible to recreate the experiment this way.

I think compiling the provided code and running it against the provided data set does do something: you know the code reports on the data. If you get a different result with the provided code and data, then there's something different between your environment and the original researcher's, perhaps the rounding mode, or some assumption [1]. Once that's straightened out, then you can use new data and see if you can replicate the result with the provided code and new data.

Think of the provided data as a sanity test.

[1] I recently fixed a bug wherein I was inadvertently relying upon Linux-specific behavior that failed when tested under Solaris.
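On the footnote's point: even identical source and data can give different results when the environment changes the accumulation order of floating-point sums (a compiler reassociating operations, a parallel reduction, a different BLAS build). A minimal Python illustration of that order sensitivity:

```python
# Floating-point addition is not associative: a small term added to a
# huge one is rounded away, but two small terms combined first survive.
vals = [1e16, 1.0, 1.0]

left_to_right = sum(vals)        # each 1.0 is absorbed into 1e16
right_to_left = sum(vals[::-1])  # 1.0 + 1.0 = 2.0, and 1e16 + 2.0 is exact

print(left_to_right == right_to_left)  # False
```

Same code, same data, different answer, depending only on evaluation order. That's why pinning down the environment first is a prerequisite for the sanity test to mean anything.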

Research is often funded by firms that want the study to serve as a “scientific” basis for the efficacy of a product or service. Experiments that cannot be replicated are meaningless. That standard should not be compromised.

There's a famous quote from a physicist about how people used to replicate each other's experiments, but now they just share their Fortran models, so they can all agree on the same bugs.

Yea, I have always found academic machine learning papers to have very little value. The helpful people create a GitHub repo and write a blog post; they don't publish a paper.

Because you are not the target audience for these papers. ML researchers (or any researchers, for that matter) don't have time to go through a long-form blog post or skim through a GitHub repo. There are tens of interesting results that come out every day. A paper with an abstract and sections is the most efficient way to convey the overall idea.

Blog posts are nice for reaching outside of the research community.

Code should always be linked for people that want to reproduce, but they don't replace a paper.

>"ML researchers (or any researcher for that matter) don't have time to go through a long form blogpost or skim through a GitHub repo."

I don't see why this would take any more time than properly studying a paper, in fact it should be quicker since the information is presented in the proper format (not translated from code to math/prose).

Because you don't "properly" study all papers that you read.

For most, you just want to know the general ideas and understand the novel techniques that were applied. You might decide to do a deep dive in a very select few. At this point code becomes useful for reproducibility.

Also, once you are experienced in the field, you don't relearn everything for every new paper. Most papers just propose incremental, minor changes. The only things I care about are what these changes are (conceptually, not as a code comparison) and data about the effect of these changes. I don't want to have to read your code to find out where you have applied a certain new hyper-parameter to your gradient descent.

Stop thinking that the entire research population is publishing all results as papers because they don't know better. It's indeed the most efficient way to share knowledge for experts in a given field.

> since the information is presented in the proper format (not translated from code to math/prose)

Do you really think researchers just write some random code until they come up with good results and then try to retrofit that with some equations?

You have got it completely backward.

>"Because you don't "properly" study all papers that you read.

For most, you just want to know the general ideas and understand the novel techniques that were applied. You might decide to do a deep dive in a very select few. At this point code becomes useful for reproducibility."

I have much experience reading research literature (but not ML specifically). The vast majority of the time I just skim a paper looking for certain info.

What I am reading from you is that basically the only useful part of the ML paper is the abstract. If the repo/blog had a good abstract there would be no point to the paper.

>"Do you really think researchers just write some random code until they come up with good results and then try to retrofit that with some equations?"

This sounds like a strawman, but yea I do think the code is the "actual" method while the math is some idealized cleaned-up translation.

Engineering researcher, but not-ML.

The "code is everything" approach presumes that communication is computational by default. I'm not sure researchers agree on that. This is particularly important for a field that aspires to artificial intelligence. Language is the best bet we have at the moment.

Secondly, there are social aspects. I am becoming more well read in my field, and there are times when genuine "rediscovery" occurs. Many PhD students, depending on their research group, do not come up with groundbreaking work right off the bat. It takes them a few years. In the publish-or-perish economy, there are venues to show your paper. If their genuine work is rejected, it may stop them from progressing in their career toward great work. It is like expecting an undergrad to come up with a full master's thesis for a course project. It happens, but not often.

That being said, if I were to read the code, I would be rereading many code repetitions every year, whereas reading similar abstracts is less time consuming.

Your blog-post comment touches a bigger issue: how to tell who conducts legitimate research. The best we have so far is assessment by those at the top of their field. They're journal editors. They have dedicated their lives to their field and have read papers from decades of research work (whether or not the field is scientifically paramount is irrelevant; not all can learn the same thing, and the education system is there to _educate_ the population at large, along with generating new knowledge and advancing new researchers). For the sake of completeness I'd add this, at least in my case and perhaps many others: as one becomes more experienced, one can better assess one's earlier grasp of the field, or one's earlier misunderstandings.

Hence it seems reasonable to publish, and have abstracts.

> What I am reading from you is that basically the only useful part of the ML paper is the abstract. If the repo/blog had a good abstract there would be no point to the paper.

The whole point of a blog post is vulgarization (which is definitely needed as well): explaining an idea without assuming that the reader has a lot of prior knowledge in the domain. That goes in the opposite direction of what researchers look for: the most efficient way to understand what is being proposed. Most ML paper abstracts would be pretty horrible on a blog post in terms of vulgarization.

Plus, papers are all very well centralized on arXiv. How am I supposed to find blog posts? Should I just hope that what I am interested in will blow up on Twitter or /r/MachineLearning and I will notice it? This sounds extremely sub-optimal.

> This sounds like a strawman, but yea I do think the code is the "actual" method while the math is some idealized cleaned-up translation.

Do you also think that biologist should stop writing papers and just upload video of their actual experiments?

I work in an ML research lab with 30 or so researchers, and research never begins with code snippets. It always begins on the whiteboard. Even between us, where we have the code and data readily available to share with each other, when someone presents their work, they will never show code, but always equations or model diagrams. It's just an order of magnitude easier to visualize.

Take this paper for example: https://arxiv.org/pdf/1805.11604v3.pdf how would you even begin to communicate something like that through code?

Self-publishing code repositories or notebooks is clearly the superior approach, but unfortunately researchers are given less institutional credit if they are not published in an academic journal. The best approach is to get the benefits of peer review and then publish all your research artifacts alongside the paper. I am not sure if the academic journals prohibit this kind of thing though.

I'm pretty certain PLOS allows and even encourages this: https://www.plos.org/open-access

Elsevier is by far the worst publisher regarding open access though: https://twitter.com/protohedgehog/status/1028819653982736389...

I’ve raised the issue of pseudocode and missing data multiple times in my master’s work. My contention is that public funding should require public access to such information for external verification. I’ve been shut down on every front. People say that the private work needs to remain private in order to ensure future research monetization from grants and IP sales.

(A small note on your interchangeable use of reproducibility and replicability in research because they mean slightly different things.

reproducible: independently achieving the same result using existing data;

replicable: independently achieving the same result using new data.)

At a recent Boulder deep learning talk, a distinguished scientist asked the speaker 'how can I use your result?' There are no easy ways to publish and distribute such results / models.

Is there a video of this talk?

Richard Feynman had a speech warning about this problem back in 1974. Complete with a story about how a student was excited about doing some replication experiments on rats and was told "no, you cannot do that, because the experiment has already been done and you would be wasting time".

Others have also been worried about replication problems since the 1960s. Hopefully some of this worry sticks this time and we can get a better understanding of what is really true in these fields. Physics likes to have very well defined uncertainties on everything it observes and doesn't like to say something is "true" unless the fact is established at 5 to 7 sigma. That seems like a good margin. In psychology, 2 sigma is the standard for publishing a result.
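For a sense of scale, those thresholds correspond to wildly different p-values. A quick sketch using only the Python standard library (note that particle physics typically quotes the one-sided tail for its 5-sigma discovery convention; the two-sided values below are close enough to make the point):

```python
import math

def two_sided_p(sigma):
    """Two-sided tail probability of a standard normal beyond `sigma`."""
    return math.erfc(sigma / math.sqrt(2))

for s in (2, 3, 5, 7):
    print(f"{s} sigma -> p ~ {two_sided_p(s):.1e}")
# 2 sigma -> p ~ 4.6e-02   (the psychology publishing threshold)
# 5 sigma -> p ~ 5.7e-07   (a physics "discovery")
```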


P-values are not good measures of reliability, because they are absolutely not the same thing as the chance that a result is wrong. This confusion, together with ignorance of the prior probability of hypotheses, is the reason why replication rates are so bad.

In physics you can get away with not caring about the distinction, because of how accurate the measurements and precise the theories are. That doesn't fly in most other sciences, though.
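The prior-probability point can be made concrete with a back-of-the-envelope calculation (the numbers below are illustrative assumptions, not measurements of any field):

```python
# Illustrative assumptions: 10% of tested hypotheses are actually true,
# tests run at alpha = 0.05, with statistical power = 0.8.
prior_true = 0.10
alpha = 0.05   # chance a false hypothesis still yields p < 0.05
power = 0.80   # chance a true hypothesis yields p < 0.05

true_hits = prior_true * power           # 0.08 of all studies
false_hits = (1 - prior_true) * alpha    # 0.045 of all studies

share_wrong = false_hits / (true_hits + false_hits)
print(f"{share_wrong:.0%} of significant results are false")  # 36%
```

So even with nothing fishy going on, more than a third of "p < 0.05" findings would fail to replicate under these assumptions; the 5% error rate a p-value seems to promise only holds when most tested hypotheses are true to begin with.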

> P-values are not good measures for reliability, because they absolutely are not the same thing as the chance that a result is wrong

As an undergrad, I even had tenured professors try to tell me that a p-value is the chance that a result is wrong. Most researchers in psych or bio sciences have a weak understanding of statistics, usually taking a single statistics class in undergrad, then a single one in grad.

Even in hard sciences, there are papers that everyone in the niche knows is false, but fear to speak out about or publish a paper refuting it for fear of their career being damaged.

My wife is a biochemist, and during her PhD they were working on an area of research with a handful of labs publishing about it. One lab in particular was known for a decent amount of questionable publications, but the PI was a big deal and no one would officially question anything. So they would whisper amongst each other and just ignore that paper (and all the additional papers built on top of it) because they all knew it was bullshit.

Wow, that's a clear sign science has become too political. Proving other scientists wrong used to be how careers were made, not something to fear.


Brouwer's career suffered as a consequence of his disagreements with Hilbert.

EDIT (with links to the political aspect): Letters written more about the politics of publishing than the math itself are referenced in a book called "The War of the Frogs and the Mice" (PDF) http://citeseerx.ist.psu.edu/viewdoc/download?doi= (see bottom of pg 9 as displayed for the letter Brouwer wrote to Hilbert's wife)

>In the later 1920s, Brouwer became involved in a public and demeaning controversy with Hilbert over editorial policy at Mathematische Annalen,

It doesn't read like the political fights were about math. Although Hilbert disagreed with Brouwer, it sounds like the falling out was precipitated by arguing over something subjective.

Science has always been political, within itself and in relation to the rest of society.


Your link is not a story about politics, it's a story about psychological bias.

Political has multiple meanings in English, and I think this use is legitimate.

This could be just anchoring bias. But I think it's much more likely that people after Millikan did not want to look bad in front of their professional peers for fear of social and professional consequences. Which I think is something we can reasonably call political.

But I think it's more likely that the bias was internal, in which case it could not be reasonably called political. Without arguing further we can agree that the parent's claim was unfounded, because the linked article did not demonstrate that the motivation was social.

(The reason I know we can agree is that unfounded is not the same as false, it just means it's a non-statement.)

We cannot agree on that. Feynman's talk is mainly about social phenomena, so it's entirely reasonable to assume that this anecdote is also meant to illustrate a social point: http://calteches.library.caltech.edu/51/2/CargoCult.htm

Just because Feynman diagnosed the gradual charge drift as internal pressure doesn't make it internal pressure. The effect would be the same between fearing one was wrong and fearing one would be condemned and punished as wrong - changing the data to be closer to the official line.

Authority is always political.

It's not a story about authority, either, if by authority you mean something narrow enough to always be political.

If you don't think Millikan's authority as a scientist was a factor then ¯\_(ツ)_/¯

That's a broad enough definition of authority that all authority no longer originates from politics. When scientists are behaving correctly authority is awarded on scientific merit irrespective of politics.

> When scientists are behaving correctly authority is awarded on scientific merit irrespective of politics.

I suppose then one can conclude that scientists have behaved incorrectly since Aristotle's time.

Obviously, hence the word "too".

All the labs working on this spoke to each other, collaborated occasionally, and saw each other at conferences. So making a big stink about this wouldn’t have made life easier. Plus peer reviewers for submitted papers are “hidden” but because the field is so niche they are likely people you know, and they know who you are when you publish (reviews aren’t double-blind), so it could hurt your career if the PI became angry at you and started bad-mouthing you to to the other labs.

I understand, but science isn't supposed to make life easier, it's supposed to progress mankind's knowledge and that necessarily entails eliminating bad information from the body of knowledge. Avoiding challenging bad papers because someone you know wrote it and you're worried they might not like you for it; well, yuk.

You've missed the point: getting along with your peers is the most important thing. What is truth?

No I haven't...

> getting along with your peers is the most important thing

No it isn't, hence the yuk. Anyone who thinks getting along with your peers is more important than getting the science right, isn't someone I trust doing science. The most important thing is getting the science right, full stop, end of story.

The problem appears to be that anyone who can't "get along with their peers", or otherwise play the social game, doesn't get close enough to attempt any kind of science/serious project/study in the first place.

So you would argue that Galileo Galilei was a bad scientist? He was imprisoned for disagreeing with the pope. I suppose you'd have to argue that a good scientist in the middle ages had to be in good standing with the church and monarchy?

Galileo only needed basic tools (like feathers, balls, an inclined plane, and a handheld telescope) to pursue science.

If he'd needed access to impossibly expensive equipment available only in a few palaces around Europe, politics might have stopped him before he ever got started.

You can't predict what the next great discovery will be, or where it'll come from. Generally, however, they're on the theoretical side, not the experimental side, so your point doesn't really stand. Those big machines rarely do more than confirm existing theories, and those theories can very well come from one person who is anti-social.

Science is about being right; not about getting along with your peers.

What he actually did was closer to calling the pope a moron.

No, something like this can spiral: they can write bad reviews about your work, tanking your publication rate, which makes it harder to find funding etc...

Though it's worth pointing out that "political" is ambiguous, and that this is a case of politics in the "office politics" sense, not in the "use science to justify policy" sense.

It turns out, people are always political, because they're people.

You want to know where this is the worst? Climate science. Everyone has an axe to grind on one side or the other, and moderate opinions are flamed into oblivion. We can't even talk about the science because it becomes a discussion about destroying the planet.

Isn’t the moderate position in climate science (regarding anthropogenic climate change) the mainstream consensus position? There are no two "extreme views" with the truth somewhere in the middle. There’s the scientific position, overwhelmingly supported by evidence, and then there’s the pseudoscientific position.

Yes, I suppose I've conflated two things. The first is the strident shrieking of politicians and those who echo what politicians say, for which I would argue there are two extreme sides. And neither of these sides actually has constructive discussion, in my opinion.

And then there are the different sides of the scientific discussion, which produce the conclusions that the politicians warp and shriek about. Some of this is good science, and a lot of it is bad science. But whatever the politicians get to blather on about is what most people are aware of, regardless of whether it's good or bad.

And your comment points to part of the problem. It's science, so if we believe that we know everything, we're wrong. If we don't know everything, then that means we need to have the latitude to argue about things. The argument is obviously not "is the earth warming?" but "how and why is the earth warming?". If your immediate response is "well, it's obviously humans producing CO2", then you have fallen into the trap that I'm trying to explain. Yes, humans produce CO2, and CO2 is a greenhouse gas. But the story is more complicated than that, and science needs room to wonder. The problem is the political shrieking dampens the curiosity and wonder that science is about.

But if Earth in ancient prehistory was dominated by CO2 that was eventually sequestered by plants under the ground, and that today our oil reserves are running low, then wouldn't it make simple sense that human use of fossil fuels must have a significant impact on atmospheric CO2 levels?

In that case, it's pretty settled that there is a significant human impact on atmospheric CO2 levels; the room to wander is in, say, the exact distribution of CO2 released. I mean, physicists don't wonder about and revise basic kinematics, since it's pretty much settled that Newtonian kinematics is a very good approximation of reality.

Well, it does make sense yes, which is why that view has become accepted.

There are some facts that seem to give pause for thought though, like it apparently being much warmer than today in medieval times. This can be hard to see because temperature records don't go back that far, but we know that the Romans appeared to once grow wine near Scotland, something totally impossible today (in fact it only became possible to grow wine in the south of England very recently, a "new" thing blamed on global warming). It's unclear how the Romans could have done that unless grapes were very different or the weather was very different.

>today our oil reserves are running low

This argument is not quite up to snuff because, as it turns out, only a tiny fraction of the oil ever can be made to come out. The truth is that even after the wells have been abandoned most of the oil remains stuck in the rocks that it started in. I found this out while wondering whether or not it would be possible to burn all of the oxygen in the air (it's not, sadly ;) )

Also the oil reserves are not really running low, if they were that would be a lucky answer to carbon control.

Of course a fair point, but I think it's worth considering that basic physics was conceived quite a bit longer ago than the idea of CO2 as a greenhouse gas began to spread. So physics has had quite a bit more time to settle and for people to prove it incorrect (which they haven't), while the reasoning for cause(s) of planetary warming is really still in its infancy, however sure many people think they are.

Most of the impression of a mainstream consensus is formed when some kids polled a ton of scientists what they thought of global warming, but then restricted their published number to a small subset of those, climate scientists. Plenty of other scientists are like, hey, these guys are full of crap.

Scientists who aren't climatologists have uninformed opinions about the climatology, aka the climate; they shouldn't be listened to just as you don't poll brain surgeons about the intricacies of heart surgery. Being a scientist doesn't make you qualified to speak on the climate, being a climatologist is what makes you qualified to speak on the climate. You don't ask climatologists about physics, and you don't ask physicists about the climate. If you don't have a PhD in the field, you don't belong in the big discussions where consequences matter.

So even if your story is true, those kids did exactly the right thing.

Regardless, the consensus that's bandied about comes from the IPCC meta-study of all published peer-reviewed articles from climatologists, not from a few kids.

I believe the mainstream take on climate change - but it’s worth noting that you can’t even become a climatologist today if you disagree with the mainstream view of climate change.

So it shouldn’t exactly surprise us that most of them agree.

> being a climatologist is what makes you qualified to speak on the climate

> If you don't have a PhD in the field, you don't belong in the big discussions where consequences matter.


I think what we need is accountability - there should be serious negative consequences for being wrong proportional to how strongly you stand behind an idea.

Were all the climatologists who were confident we were facing global cooling driven from the industry? Stripped of their academic standing?

Then why would I trust the industry?

That's uhh, not true. At any rate, shouldn't we be listening to the climatologists on this matter, because their specialty is, you know, climate?

Yes it is true. The often cited statistic comes from https://agupubs.onlinelibrary.wiley.com/doi/epdf/10.1029/200...

Counting climatologists is a bad idea because they shut critics and skeptics out of the field. We know that climate models are generally wrong. Trying saying that while having a career as a climatologist.

You're really veering off into conspiracy theory territory here. It's clear you haven't the faintest how climate models or any geophysics works. Don't mistake a facility for computers for expertise in all science. I actually have a PhD in a geophysics field, I know what I am talking about here.

1. That study is 10 years old at this point. The evidence for climate change has become vastly more undeniable since then.

2. Majorities of scientists from all categories believed in climate change, even back then, according to this study.

I'm not disputing global warming. I am disputing the impression of a 97% consensus. And this is a great example of what I'm talking about, you're trying to call me a conspiracy theorist, and making up stuff about what you think I know.

No, you aren't just disputing the consensus. You are claiming that climatologists "shut critics and skeptics" out of the field, so we shouldn't listen to them. Yes, they do shut out pseudoscience, as do other legitimate sciences. That is different from disagreement on scientific facts. Climatologists disagree about very fundamental things all the time.

There are people on the record saying certain kinds of results simply shouldn’t be published.

Also, anybody who views climate prediction as intrinsically suspect pseudoscience isn’t going to want to enter the field either.

So now you are saying that climatology is literally pseudoscience.

That isn't what was said though. This entire conversation chain is painful to read, I would encourage you to read up on the principle of charity. It's a dialectic tactic that seems to be fast fading in the world of click bait and outrage, and I think everyone is suffering for it.

Putting on my conspiracy helmet and tightening it all the way, how can you prove that that's not the same as polling the Department of Constructive Mathematics about the law of the excluded middle? Every field self-selects for people who believe that whatever the field studies is actually possible to figure out, all the way from physics to Paranormal Studies. Someone who didn't believe that climate models could predict anything wouldn't make any.

Appealing to the authority of climatologists is not going to convince anybody because everyone who thinks they are authoritative already agrees that global warming is caused by humans...

There is a definitive difference. Global warming has been confirmed across multiple different experiments that have been repeated... this study covers obscure results that are unlikely to be repeated.

I believe this applies mostly to public science. Private science is anchored by a market-based mechanism that naturally expunges fraud.

Is this a joke? Profit motive is far more likely to put the cart before the horse: pushing for evidence to support outcomes.

I think they were referring to internal research that says 'if we do X we can sell more' which has naturally aligned incentives.

This as opposed to releasing research along the lines of 'cigarettes are fine for you! smoke more!'

> 'if we do X we can sell more', which has naturally aligned incentives.

Even then there are agency problems which mean that once a company gets over a certain size there is no natural alignment.

I know plenty of managers tweaking A/B tests until they get the results they want.
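The usual tweak is "optional stopping": recompute the p-value after every batch of users and stop the test as soon as it dips below 0.05. A small simulation (illustrative; the conversion rate, batch size, and number of peeks are made-up parameters, not anyone's real pipeline) shows how much this inflates the false-positive rate even when both variants are identical:

```python
import math
import random

def z_p_value(hits_a, hits_b, n):
    """Two-sided p-value of a two-proportion z-test with n users per arm."""
    p_pool = (hits_a + hits_b) / (2 * n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        return 1.0
    z = (hits_a / n - hits_b / n) / se
    return math.erfc(abs(z) / math.sqrt(2))

def peeking_test(rng, peeks=10, batch=100, rate=0.1):
    """Both arms convert at the same true rate, so any 'significant'
    stop is a false positive. Returns True if any peek has p < 0.05."""
    hits_a = hits_b = n = 0
    for _ in range(peeks):
        hits_a += sum(rng.random() < rate for _ in range(batch))
        hits_b += sum(rng.random() < rate for _ in range(batch))
        n += batch
        if z_p_value(hits_a, hits_b, n) < 0.05:
            return True
    return False

rng = random.Random(0)
trials = 2000
fp_rate = sum(peeking_test(rng) for _ in range(trials)) / trials
print(fp_rate)  # well above the nominal 0.05
```

A single look at the end would be wrong only about 5% of the time; ten peeks push that several times higher. This is the same optional-stopping pathology behind many replication failures, and proper sequential tests exist precisely to correct for it.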

Easy fix: hire a statistician. :-)

This is a quote from a preeminent scientist: https://twitter.com/bmbaczkowski/status/1069891514275385344?...

"in academia overfitting can get you a nature paper, in industry get you fired"

Not so much. When you pay someone to get a result, that’s what they do.

Max Planck:

“A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.”

Or more succinctly, science advances one funeral at a time.

Except this isn't just "one funeral at a time." This is genocide.

True and funny. Looks like you got downvoted into oblivion. Someone being a bit too sensitive about a genocide metaphor I suspect.

A version of this happens in software too. See machine learning. There’s definitely valid field with real benefits. But also lots of BS. Nobody wants to call out their peers because everyone benefits from the hype.

> Nobody wants to call out their peers because everyone benefits from the hype.

As a counterpoint, it seems that Minsky and Papert's work was partly driven by their desire to stop the funding of larger perceptrons, in which they did not believe. The winter that followed their work may or may not have delayed progress in AI by years or decades.


The theory of deep learning was developed about 50 years ago, but useful applications weren't possible until recently because the hardware was too slow.

I'd be interested to learn about results in the "hard sciences" that I work in (mathematics, logic, program verification, type systems): which papers "everyone in the niche knows are false" but "fears to speak out about".

Sure, mistakes happen, my own PhD contained a major blunder (discovered several years after submission -- at least somebody read it ...), but I'm extremely open about it.

In my view, the hard sciences aren't really directly comparable. We doubtlessly have our own problems, but not necessarily the same problems as the health and behavioral sciences.

I think that the hard sciences aren't necessarily so dependent on papers derived from isolated results, or even replication of those results, because there is more of a "web" of knowledge. An idea is hopefully supported by an experiment, but possibly also by theory and multiple different kinds of experiments, sufficient to be usable by non-experts and laypeople. For instance the basic theory of electromagnetism is good enough for virtually any kind of engineering, and it would take some monumental discovery to prove it false, even if a portion of the supporting work fails upon replication.

Another thing is, perhaps because of the precision of physics, we get proven wrong every day, and it ceases to be an embarrassment. The road is littered with the wreckage of failed theories and experiments.

FWIW, your examples aren’t hard sciences per se. The term hard science is referring to natural world subjects like physics, chemistry, biology and astronomy. It’s not a difficulty or value judgement, it’s just a term with a specific categorical meaning.

I wonder sometimes whether math isn't a natural science by accident. If 2+2=4 wasn't applicable to the real world, I don't believe we would have defined math the same way

"I will begin by saying that there is probably less difference between the positions of a mathematician and a physicist than is generally supposed, and that the most important seems to me to be this, that the mathematician is in much more direct contact with reality." -- G.H. Hardy (1922)

Mathematics isn't a science because it doesn't provide us with empirical answers. That's precisely the difference between physics and mathematics: the former tells us what is, while the latter only tells us what must be, given certain specific assumptions.

Or more succinctly: mathematics doesn't give us truths, it gives us consequences. The statement, "2 + 2 = 4" is not an empirical fact intrinsic to the universe. It's a consequence of our definition of natural numbers. It's considered a useful definition because (among other things) of its application as a foundation for engineering and the sciences. But many other useful areas of mathematics have essentially no basis in reality.

> Mathematics isn't a science because it doesn't provide us with empirical answers.

Then physics isn't a science either, since only parts of the field fit this criterion. I think you will realize that the distinction between physics and math isn't as clear-cut as you imagine; it is not uncommon for physicists to publish in math journals and vice versa.

The distinction is pretty clear cut. The two disciplines are philosophically different by category, not degree. Just because much of theoretical physics is applied math and physicists can publish in math journals doesn't mean the two occupy the same domain. One is ostensibly guided by empiricism (even if it's not always possible), the other is not.

That’s an open (and interesting) question, I think, whether there are absolute physical truths in math, and whether our basic math rules can be defined any other way. But certainly there are huge swaths of math that don’t correspond to the physical world, and there are parts of math that are decided rather than natural consequences.

Ultimately though, the math that is natural science is just called “physics”, right? ;) The reason math can’t be a hard science is because we can’t prove abstract math by physical experiment. That’s all “hard science” means - things you can show by physical experiment. We can prove math theorems by showing they’re consistent under all the other math theorems, but generally not by conducting physical tests.

Not all mathematics formalises physics. Nowadays, much math formalises other maths, e.g. set theory, category theory etc.

Right, exactly. Physics already is the subset of math that is a hard science, the rest of math isn’t.


I suspect that to the ancient Greeks, math and natural science were not yet separate disciplines. "Geometry" means something like measuring the world. They had to be aware of the relationship between their math and the ability to function as a minimally technological society, surveying land, navigating, and so forth. As I understand it, they believed that math was telling them something deep about reality.

I think turning math into a purely abstract game came later, and we owe our fun to the discipline's more humble origins.

I think it would be more accurate to say that they were far more separate disciplines in antiquity, and far more rarely mixed. See for example Euclid vs. the contents of this work of Aristotle: https://en.wikipedia.org/wiki/Physics_(Aristotle)

Now you're getting into philosophy. As a materialist, I do believe everything comes from the natural world. I take the view that if every time we combined two objects with two other objects we ended up with something other than four objects, we would not be saying "2+2=4".

There are of course other viewpoints that math has a Platonic metaphysical reality in and of itself, independent of the natural world.

>Sure, mistakes happen, my own PhD contained a major blunder (discovered several years after submission -- at least somebody read it ...), but I'm extremely open about it.

So what was it?

That sounds grave. I believe in most cases the science is not questionable per se; it just barely clears the significance threshold needed to be considered technically valid. The problem is the extrapolations made from that, and the false media attention.

what your wife is describing is what is commonly called "progress one death at a time"

Sometimes I wonder if researchers are too conflict-averse temperamentally to effectively police themselves. I remember grad school being suffused with below-the-surface grudges, innuendo, and shunning. Those aren't healthy ways to resolve conflicts, but they seemed like the path of least resistance for people who fear the raising of voices.

Is that really temperament? I always thought of that as learned behavior necessary to survive in an academic context, where one's fate is for years in the hands of various authorities who can cause you a great deal of harm.

Not all "hard" sciences. I'm pretty sure there are no papers in physics that are well known to be wrong but people fear speaking out, though I'm open to being proven wrong.

There is an interesting anecdote about Millikan's oil drop experiment, which measured the elementary electric charge: his measured value was somewhat off. Feynman talks about it in his book:

"We have learned a lot from experience about how to handle some of the ways we fool ourselves. One example: Millikan measured the charge on an electron by an experiment with falling oil drops, and got an answer which we now know not to be quite right. It's a little bit off because he had the incorrect value for the viscosity of air. It's interesting to look at the history of measurements of the charge of an electron, after Millikan. If you plot them as a function of time, you find that one is a little bit bigger than Millikan's, and the next one's a little bit bigger than that, and the next one's a little bit bigger than that, until finally they settle down to a number which is higher.

Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of—this history—because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that."

Richard Feynman was laughed at when presenting new ideas in physics when he was an up-and-comer. The human ego tends to intervene when careers are on the line.

Laughing at true new ideas is actually the opposite behavior flaw of whispering about false ideas that are known to be false but are associated with too much social power.

It’s the same family: public judgement of a claim influenced by the person making it, rather than its contents.

Point being: Feynman’s ideas should have been judged on their merit, rather than Richard’s career. Same for the papers published by the institute mentioned above.

They eventually were judged on merit, which is a testament to the system at the expense of the characters that were involved at the time.

To be fair, Feynman wasn't above proof by intimidation, either.

I adore Feynman's work, but he was human and had human failings, too.

There are definitely papers in physics that are well-known to be wrong. I know of results published in high-profile journals where the authors have stated (in person, but not officially in a published update) that even they cannot reproduce their own results.

i agree that this is a problem in the biosciences as well.

the dead ringer for these groups is a PI who has a reputation for taking on extremely ambitious projects with all sorts of funding sources. typically, the PI is vindictive and controlling in my experience.

but yeah. you can see them from a mile away. they tend to run sweatshop-style labs where everyone is very hard pressed for results. so "results" are produced, whether or not they are real.

Why don't they publish a paper anonymously? Hopefully that would at least work for cases where "everyone" knows it is bullshit.

Isn't that fear something which the "Journal of Controversial Ideas" aims to solve?

Your story sounds like what I have seen, but biology is not a "hard science".

Biochemistry and creating pharmaceuticals is absolutely a hard science

Not at all. In the hard sciences you can run reproducible experiments and have mathematical models used to make precise predictions and come up with "natural laws". Biologists were moving in that direction up to 1940 or so but that is not what is going on now.

Here is the biggest funder of biomedical research in the world admitting that $27 billion out of $30 billion per year is wasted on unreproducible research: https://nihrecord.nih.gov/newsletters/2016/07_01_2016/story3...

And I would say that is optimistic because it does not consider misinterpreted results (interpreting your data correctly is even harder than figuring out how to perform and report reproducible experiments)...

If biochemistry is not a hard science then the only hard sciences are math and physics

I'd say both organic and inorganic chemistry are hard sciences since they make precise predictions about what will happen. But not biochemistry, and definitely not cell biology or molecular biology.

I also wouldn't consider math a science since it doesn't require comparing your results to data.

Biology is absolutely a hard science. The soft sciences are things like economics, sociology and psychology. Look up "hard science" on wikipedia. Look it up in a dictionary. Look it up in any reference work you like.

It may very well be that you have your own peculiar definition of "hard science", or that you personally disagree that biology or biochemistry should be classified that way. That's fine, you're allowed to have your own opinions. But you having your own crackpot opinions doesn't make it fact in the real world.

>"Are you honestly saying that medicine (a branch of biology) is not a "hard science"? That epidemiology, or genetics, or oncology isn't?"

These fields in particular are not even close to being hard, especially oncology which appears to (somehow) be even worse off than psychology.[1:3]

I bet less than 1% of people with a master's degree or higher in a bio field have used calculus for anything in the last 5 years. While of course the definition of "hard science" is arbitrary, here are some decent overviews of the differences between physics and bio:




[1] https://www.nature.com/articles/483531a

[2] https://www.the-scientist.com/news-opinion/effort-to-reprodu...


I am using this definition:

>"Precise definitions vary,[4] but features often cited as characteristic of hard science include producing testable predictions, performing controlled experiments, relying on quantifiable data and mathematical models, a high degree of accuracy and objectivity, higher levels of consensus, faster progression of the field, greater explanatory success, cumulativeness, replicability, and generally applying a purer form of the scientific method." https://en.wikipedia.org/wiki/Hard_and_soft_science

That same page does group biology in with physics and chemistry, but with no justification. From personal experience biology is much closer to a social science than it is to physics. I mean, most people who study biology-related stuff "hate math".

Are you honestly saying that medicine (a branch of biology) is not a "hard science"? That epidemiology, or genetics, or oncology isn't? Are you saying that when Crick and Watson deduced the structure of DNA based on the X-ray diffraction work done by Rosalind Franklin, they were not doing "hard science"?

Hell, go back further: consider Mendel's discovery of the rules of genetic inheritance (and thus founding the field of genetics) by painstakingly planting, replanting, combing and recombining pea plants. It's hard to imagine a more pure and rigorous application of the scientific method, in any field.

I'm sorry, but your position is beyond absurd. Even the page you link disagrees with you. It sounds very much like you once met a biologist who said that they didn't very much like math, and since then you've carried a personal prejudice against biologists. You admit as much in your comment.

Just move on. Someone made a lazy and uninformed claim and then, instead of admitting fault or walking away, they struggled to justify their own ignorance.

Ironically in the same vein as the OP article!

Interesting you concluded my opinion was lazy and uninformed when I was the one sharing links and references. That behavior usually indicates the opposite. Indeed, that heuristic would be correct in this case. Not that it matters to the point, but I have a PhD in a biomedical field and spent many years working in it before I quit because I got fed up with how "soft" it was.

The sad part is it will never improve to become a "hard science" if people think it already is one. I mean really think about it. Biology is the study of how complex systems change over time, yet learning the primary tool we have to study/analyze change (calculus) is not a prerequisite. Doesn't that seem strange for a "hard science"?

This comment is insulting and rude. There are other ways to inform someone that you think he is wrong while being civil and mature.

That being said I get where the commenter is coming from. Much of biology is a qualitative science. Darwin's evolution was originally elucidated through descriptions not hard numbers and statistics. Taxonomy in zoology is also largely qualitative.

That comment was only insulting if you believe that biology is not a hard science; an idea which is itself insulting, dismissive and farcical.

But, if a person does share that opinion, I can see how that person might be inclined to ignore tone, and focus on supporting that opinion.

This is utterly incorrect. You can disagree with people without being rude.

You cannot call an idea itself insulting unless you yourself are a biased person. Ideas in themselves are just statements with no malevolence intended. Only people can express malevolence.

I don't agree with him. But I see where he's coming from. A lot of biology is just observing behavior and documenting observations. It is very different from employing the scientific method in an experiment. It is also very different from trying to quantify everything. The line between hard and soft is blurry.

Sometimes reality can make people feel bad; that doesn't mean we should try to ignore it. Let's take the cancer research example. Using this definition:

>"Precise definitions vary,[4] but features often cited as characteristic of hard science include producing testable predictions, performing controlled experiments, relying on quantifiable data and mathematical models, a high degree of accuracy and objectivity, higher levels of consensus, faster progression of the field, greater explanatory success, cumulativeness, replicability, and generally applying a purer form of the scientific method." https://en.wikipedia.org/wiki/Hard_and_soft_science

1) Producing testable predictions

--- The predictions are almost all vague, of the form "x will be correlated to some degree with y".

2) Performing controlled experiments

--- Maybe, but I very often see that proper blinding procedures were not used.

3) Relying on quantifiable data and mathematical models

--- Mathematical models are rarely used to make precise testable predictions.

--- There was some progress on this front from Armitage and Doll way back in 1954 but interest has largely fizzled out.

4) A high degree of accuracy and objectivity

--- As mentioned in #1 and #3, making a precise prediction at all is rare. Accuracy doesn't matter much if you predict vague stuff.

5) Higher levels of consensus

--- Not sure this should be here. "Hard" vs "soft" is a matter of the procedures used, not the results.

6) Faster progression of the field

--- Not sure this should be here. "Hard" vs "soft" is a matter of the procedures used, not the results.

7) Greater explanatory success

--- Not sure this should be here. "Hard" vs "soft" is a matter of the procedures used, not the results.

8) Cumulativeness

--- The more research that has been done on cancer the more complex things have gotten, to the point where now they say "cancer is many diseases". This is the opposite of what happens when you gain a cumulative understanding and figure out "natural laws" that make things easier to understand.

9) Replicability

--- Seems to be well less than 50% of studies can be repeated (see my links here: https://news.ycombinator.com/item?id=18578282)

Good advice.

>"Precise definitions vary,[4] but features often cited as characteristic of hard science include producing testable predictions, performing controlled experiments, relying on quantifiable data and mathematical models, a high degree of accuracy and objectivity, higher levels of consensus, faster progression of the field, greater explanatory success, cumulativeness, replicability, and generally applying a purer form of the scientific method."

This describes some of the mathematical economics I've read.

Can you give an example of a testable prediction?

That there is a general tendency for the rate of profit across society to fall with time, subject to certain countervailing measures.

I would have preferred a reference to the paper making the prediction, but that sounds very vague so whatever.

Uh, chemistry is only predictable from first principles for diatomic molecules, or very simple extended structures like perfect bulk graphene. Beyond that, chemistry is all about approximations.

There is a spectrum of precision regarding the predictions you can make. At the high end is a point prediction, at the low end is "x is correlated with y somehow". In between are intervals of various sizes.

>In the hard sciences you can run reproducible experiments and have mathematical models used to make precise predictions and come up with "natural laws".

Spend time with condensed matter physicists. What you describe is not what they do. A lot of results are not reproducible.

I've been getting the impression that many areas of physics have been getting "softer", but have no direct experience. Even Feynman mentioned that physicists were discouraging replication way back in the "cargo cult science" talk.

>Even Feynman mentioned that physicists were discouraging replication way back in the "cargo cult science" talk.

If replication is the benchmark, almost no one does it (physicist or otherwise). In all my time in academia, I didn't find a single person attempting to replicate anything. You don't get funding, tenure or papers from it.

I was referring to everything else in your comment.

I've seen a couple direct replications in biomed, NIH even specifically funded some a few years back under the “Facilities of Research-Spinal Cord Injury” project. AFAIK, pretty much nothing replicated:





But yes it is extremely rare. It is very frustrating because it is hard to come up with a precise model when you only half-trust the data to begin with, and when the replications are (rarely) run it appears that half-trusting the reported results is optimistic...

I know of one major team in my field. I tried to passively discuss it, was graduating and I was so naive. Not a good idea! At the time I had a few papers in review.

All my work got rejected (everything was being accepted before, and the rejected batch was a full level higher in quality). So yes, it did set back my career by two and a half years, if I have to approximate. I am probably never submitting to certain publications again, to avoid having them as reviewers. I am gradually moving away from their field entirely. I was so short on publications then that I couldn't even move to nearby fields, given the vengeful rejections in my own. These days I wish I had done a second PhD in a different field, just to get away from them! I'd have had a much easier career path.

This is by no means an exaggeration, I am actually downplaying what happened.

I've started a crowdfunding campaign that aims to tackle the replication crisis:

Check out


We're raising funds for Professor Chris Chambers at Cardiff University. Chambers is a leading proponent of a new and better way of doing and publishing research, called ‘Registered Reports’, where scientific papers are peer-reviewed before the results are known.

This might:

-make science more theory-driven, open and transparent

-find methodological weaknesses prior to publication

-get more papers published that fail to confirm the original hypothesis

-increase the credibility of non-randomized natural experiments using observational data

The funds will free up his time to focus on accelerating the widespread adoption of Registered Reports by leading scientific journals.

This ‘meta-research’ project might be exceptionally high-impact because we can cause a paradigm shift in scientific culture, where Registered Reports become the gold standard for hypothesis-driven science.

It's somewhat heart-warming to read the comments here about machine learning. I did my PhD in machine learning from 2007 to 2012, and the main reason I left research was because of the widespread fraud.

Most papers reported an improved performance over some other methods in very specific data sets, but source code was almost always not provided. Once, I dug so deeply into a very highly cited paper that I understood not only that the results were faked, but precisely the tricks that were used to fake them.

I believe scientific fraud arises primarily from two causes:

- Publish or perish. Everyone's desperate to publish. Some Principal Investigators have a new paper roughly every other week!

- Careerism. For some highly ambitious people, publishing papers comes before everything else, even if that means committing fraud. This happens even with highly successful researchers, who have the occasional brilliant, highly cited paper, but who also publish a lot of incremental, dubious work.

P.S. Mildly off-topic, but I love the Ethereum research community at https://ethresear.ch/ , precisely because it is so open and transparent! I wish an equivalent community existed for machine learning.

One thing I love about Ethereum is that it is self-funded, open, and basically separated from mainstream academia. They created their own money, convinced everyone that it had value, and then used it to self-fund their own fundamental research. It's an incredible alternative to academic research.

I suspect Ethereum itself may provide a feasible basis for supporting open, transparent research.

> the main reason I left research was because of the widespread fraud.

I'm super curious about this. I had a similar experience in another field.

Is this something you witnessed going down in person, or did you just develop a strong feeling given a lot of clues?

> Is this something you witnessed going down in person, or did you just develop a strong feeling given a lot of clues?

The latter.

Okay thanks. I was unaware that that was happening. Sounds like a frustrating experience.

There's no replication crisis. The replication failures are the only thing that is properly done.

We have a pseudoscience crisis.

And as others pointed out, it is not limited to psychology.

There is no easy way to tell whether something is real or pseudoscience. A lot of things that we consider real today were perceived as pseudoscience in their time. Science is largely a gigantic collection of dead ends that look silly in hindsight, and finding out what's good science and what's not is an integral part of the scientific process. The idea that there can be a science of just Newtons, Einsteins, and Feynmans is naive.

The problem is that replication studies are almost nonexistent for most discoveries. Science is to some extent a business, and very few people with a budget have any interest in allocating resources to replicating the work someone else did, for little to no gain.

If it's not reproducible it's not science.

If you p-hacked your way into a statistically significant result, it is not science.

You don't have to be Einstein. Even Einstein wasn't Einstein most of the time. But the results you do get must be reproducible.

The replication failure in psychology is inevitable. The whole field boils down to taking an incredibly complex system (human beings) ... and measuring a few variables on them across a few people.

Most of the "statistical" justifications stem from assuming distributions are normal and samples are uniformly collected (both false).

This is why I take the work of clinical psychologists like Jung and Freud so much more seriously than the current approaches.

It just makes a lot more sense to observe something complex descriptively, rather than start with a yes/no hypothesis and then test it with a few dozen people.

This 'study of reproducibility' found the opposite: different sampling generally failed to produce different results. These contrived lab situations are remarkably relevant to the larger population, which has quite limited variation.

The issue was sloppy science, not a poor approach.

This interpretation of the reproducibility crisis is not supported by this massive international study, at least according to the last paragraphs of the original article (which quote someone directly addressing your concern).

The results they found were that the studies that could not be successfully reproduced generally failed across all tested cultures, while the ones that could be successfully reproduced generally worked across all tested cultures. So it seems that there are some results in statistical psychology that are robust, but that researchers have not been sufficiently reliable at identifying which results those are.

I'm not an expert on experimental psychology, but I'm fairly certain that the principles of how experiments are conducted (controlling for other factors, choosing sufficiently large samples, etc.) are pretty sound. An often-mentioned problem in psychology research is "p-hacking", where researchers manipulate the data in a way that validates their hypotheses statistically, which seems like a more plausible reason for a reproducibility crisis than the reasons you mentioned.
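A back-of-the-envelope simulation (pure Python, with invented sample sizes and numbers of outcomes) illustrates the p-hacking mechanism: if you measure many outcomes under the null hypothesis and report whichever one clears p < 0.05, the effective false-positive rate is far above 5%.

```python
# Toy simulation (invented numbers): why p-hacking inflates false positives.
# Both groups are drawn from the SAME distribution, so any "effect" is noise.
import random
from math import erf, sqrt

def p_value(n=30):
    """Two-sided two-sample z-test p-value for two null samples (sigma = 1 known)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(a) / n - sum(b) / n) / sqrt(2 / n)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(0)
studies = 500

# Honest protocol: one pre-chosen outcome measure per study.
honest_rate = sum(p_value() < 0.05 for _ in range(studies)) / studies

# "Hacked" protocol: measure 20 outcomes, report whichever looks best.
hacked_rate = sum(min(p_value() for _ in range(20)) < 0.05
                  for _ in range(studies)) / studies

print(f"honest false-positive rate:   {honest_rate:.2f}")  # near 0.05
print(f"p-hacked false-positive rate: {hacked_rate:.2f}")  # near 1 - 0.95**20 ~ 0.64
```

The specific numbers (30 subjects, 20 outcomes) are arbitrary; the qualitative conclusion, that the rate climbs toward 1 - 0.95^k with k looks at the data, is not.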

Also, the fact that there are some problems in the field shouldn't lead you to dismiss science entirely in favor of purely subjective speculation about how the mind works; such speculation certainly has value, but I don't think it warrants dismissing actual science.

But they're just not -- that's exactly my point. If you test a bunch of Harvard undergrads (the most common subject type in these studies) -- that's not a uniform sample. In any way whatsoever.

Furthermore, even if you get a statistically significant result (a low p-value) ... your experimental design can introduce so much noise.

Even when you do find these statistically significant correlations -- they don't really ever say much about "why" ... because they can't. So much of that is the study design.

Let's say I run a study on how well people navigate a maze under the influence of alcohol vs. not. You can probably show, with statistical significance, that they do worse while under the influence of alcohol... but it reveals very little about why.

So much of that is dependent on the particular maze. In fact I bet I could design a maze specifically to prove whatever conclusion I wanted. That is the whole problem in psychology. There is very little focus given to actually characterizing cognition in any meaningful way.

I like the clinical psychologists because they attempt to do exactly that, even though they have less "scientific" findings.

Humans are complicated, but we're also fairly predictable in some respects. Hungry people tend to go get food. The field is about the predictable aspects. We're not impossible to study.

Freud believed all sorts of obviously wrong ideas, so the idea that he was somehow immune to this seems a bit odd: https://en.wikipedia.org/wiki/Emma_Eckstein#Surgery

> Hungry people tend to go get food

Sure, we tend to want to eat once we're hungry, but when we go just one step above hunger, to the desire (or non-desire) for sexual reproduction, things get really, really complicated really, really fast. For example, just the other day I was reading about a 92-year-old guy who was choosing daily sex over letting his body recover physically, while on the other hand we've got asexual 20-year-olds in places like Japan who don't really think about sex at all. There's no way for a lab-made psychology experiment to make sense of our libido (or lack of it).

That's not disagreement with "The field is about the predictable aspects.". You're just listing something you think can't be predicted.

But sex is one of the fundamental aspects of the human existence. If psychology cannot say anything meaningful about our sex life then it cannot say anything meaningful about us at all. Yes, it can probably guess when we should purchase or sell some stocks, like people following guys like Kahneman do, but their studies do almost nothing to "illuminate" us on what we humans really are on the inside.

Except, there are actually replicated studies related to sex/attraction.

I agree - descriptive studies also matter. I think in the last century, we have overdone statistics. It is time to counterbalance with a more descriptive approach as well.

"But skeptics have argued that the misleadingly named “crisis” has more mundane explanations. ... Third, people vary, and two groups of scientists might end up with very different results if they do the same experiment on two different groups of volunteers."

Isn't that the definition of a failure to replicate?

Yes, but the difference is subtle. The theory might still be true after all, but only for a particular subset.

"We've tested the theory that 'swans are white', and it turns out the swans in our test were all black". (The conclusion isn't that swans aren't white, it's that some swans are, while others are black.)

(add.: the article mentions the 'WEIRD' category. Obviously the applicability/set range of a theory matters. If it only holds for "people who, between 2 and 5 pm on a Sunday, are walking around Reading in a bowler hat", well, that's nice, but not very useful.)

But the issue is still major, because the results are nearly always announced with broad applicability. I.e. the headline was "The simple act of smiling can make you happier!" not "The simple act of smiling can make you happier if you are an 18-22 year old college student living in Virginia." Plus, a lot of these explanations seem really statistically dubious. The whole point of applying statistics is to justify generalizing from your sample to the population at large. If you did a very poor job picking that sample and extrapolated to a larger population in a way that wasn't warranted, that's a real problem in and of itself.
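A hedged toy model (all numbers invented, loosely echoing the smiling-study headline above) makes the unwarranted-extrapolation problem concrete: an effect that is real in the sampled subgroup, but absent in most of the population, still yields a perfectly "significant" result, which then gets announced as if it applied to everyone.

```python
# Toy model (invented numbers): a real effect in the sampled subgroup,
# no effect at all in the other 80% of the population.
import random
from math import erf, sqrt

def two_sided_p(z):
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(1)
n = 400
SUBGROUP_EFFECT = 0.5   # hypothetical: e.g. only college students respond
SUBGROUP_SHARE = 0.2    # hypothetical: they are 20% of the population

# The study is run ONLY on the subgroup: treated vs control, z-test.
treated = [random.gauss(SUBGROUP_EFFECT, 1) for _ in range(n)]
control = [random.gauss(0.0, 1) for _ in range(n)]
z = (sum(treated) / n - sum(control) / n) / sqrt(2 / n)
p = two_sided_p(z)

# True population-wide average effect: diluted by the 80% of non-responders.
population_effect = SUBGROUP_SHARE * SUBGROUP_EFFECT  # 0.1, not 0.5

print(f"p-value within the sampled subgroup: {p:.4f}")
print(f"effect the headline announces:       {SUBGROUP_EFFECT}")
print(f"actual population-average effect:    {population_effect}")
```

The p-value is doing its job within the sampled subgroup; the error is in the leap from "significant among these subjects" to "true of everyone".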

Absolutely. And it's sadly also correlated to the overall poor quality of science news reporting in our societies. One of my favorite PhD comics: http://phdcomics.com/comics/archive.php?comicid=1174

That problem has nothing to do with science, though.

If you are getting your "science" from newspaper articles, then it's not actually science. You're getting some over-worked, underpaid journalist's attempt at getting a catchy article.

This is common enough that it seems psychology as a whole is more art than science.

It's like alchemy - they noticed some patterns and have some success exploiting them, but the abstractions and explanations they provide are mostly bullshit, so the predictive power is basically 0.

The problem is the Universe of which it speaks can not be reduced to a handful of beautiful and concise mathematical formulae. A group of humans (or even a single human) is more chaotic and complex in their behavior than a group of atoms ever could be. All Carbon-14 atoms are the same. No two humans, however, are.

A group of humans is a group of atoms.

Even a simple double pendulum is chaotic, yet we can describe its movement quantitatively.

For me the failure of psychology are very similar to the failures of pre-enlightenment physics, chemistry, biology. Wrong abstractions, storytelling instead of formal models, unfalsifiable claims.

My girlfriend was seeing a therapist, and the methods used were basically clairvoyance: telling the patient "draw a tree", then interpreting the kind of tree as an indication of inherent characteristics of the patient. When she told me that, I drew a tree as well, and it turned out my tree was basically the same (the most common kind in our climate, in the most traditional drawing possible). I'm nothing like the description she got from the therapist, and neither is she (but in a different way). Of course, she accepts the interpretation of that picture the same way people find matching details in their horoscopes.

Psychology sounds to me a lot like bloodletting or voodoo programming, and scientific community should distance itself from people claiming its discoveries to be scientific, and its therapies to be scientifically-proven.

I would call it a failure of interpretation. At one extreme, one can assume that your sample is unique and the results inapplicable to anyone else. At the other extreme is the assumption that your sample is sufficiently representative that your results apply to everyone. Obviously, in most instances it is somewhere between these, and that it "failed" is the basis for new inquiry.

There's no such thing as a replication crisis in science, that's just a conflict-averse way of saying that some "scientists" aren't actually doing science. It's not limited to Psychology and this type of conflict-averse approach is a big part of the problem. Other scientists, the "good ones," if there are indeed any good ones left, need to realize that what's at stake is the trust of their profession as a whole. If you're not willing to expose false research in your field, then you're no longer working in science.

Science (academic work) is a career for many people, rather than a vocation. Rewards include high status, stable income, opportunities to travel, long vacations, etc.

You can say it's not science but most "scientists" are just normal people with everyday priorities, rather than Einstein-like people who go on doing science work while working as patent clerks.

I don't see how people can doubt this. Positive results get published. Even without p-hacking, that means a significant number of experiments published at the p < 0.05 level are among the 1 in 20 that are significant purely by chance. Combine that with p-hacking and small samples with small-magnitude effects... well, replication is going to be a problem. Part of the solution here might be more impetus and incentive to publish negative results.
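To make the arithmetic concrete, here's a rough, purely illustrative simulation of a literature where only positive results get published. The base rate of true effects, the power, and the number of experiments are made-up assumptions, not figures from any real field:

```python
import random

random.seed(0)

def simulate(n_experiments=10_000, true_effect_rate=0.1, power=0.8, alpha=0.05):
    """Simulate a literature where only 'significant' results get published.

    true_effect_rate: fraction of tested hypotheses that are actually true.
    power: probability a real effect reaches p < alpha.
    alpha: false-positive rate for null effects.
    Returns the fraction of *published* (positive) results that are false.
    """
    true_pos = false_pos = 0
    for _ in range(n_experiments):
        if random.random() < true_effect_rate:   # a real effect exists...
            if random.random() < power:          # ...and the study detects it
                true_pos += 1
        else:                                    # null effect...
            if random.random() < alpha:          # ...but p < 0.05 by chance
                false_pos += 1
    return false_pos / (true_pos + false_pos)

print(f"False-discovery rate among published results: {simulate():.0%}")
```

Under these (invented) assumptions, something like a third of the published positives are chance findings, even with zero p-hacking, purely because the nulls vastly outnumber the real effects.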

Agreed. I think all fields should require pre-registration, where studies are announced before they are run, to help counter the positive-results bias. Still not a 100% fix (I could imagine a lot of studies that look like they'll show no effect quietly moving into the "oops, never mind, I knocked over the test tube" variety), but certainly better than what we have now.

I've always thought that p < 0.05 is a ridiculously low bar, missing at least one zero after the dot.

It should be more than enough if the experiment is replicated enough times.

Yes, and with larger samples. Too many small sample experiments which, at most, should be considered a "pilot" study, and when results are positive even with p < 0.01 they should still only be considered "suggestive" until replicated, preferably with larger samples.

That's equivalent to having a lower p threshold in the first place though.
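The equivalence is easy to see if you assume the replications are independent and the effect is a true null: passing k tests at significance level alpha happens by chance with probability alpha ** k. A quick sketch:

```python
# Chance that a true-null effect passes k independent replications,
# each at significance level alpha, is alpha ** k.
alpha = 0.05
for k in (1, 2, 3):
    print(f"{k} independent test(s) at p < {alpha}: "
          f"chance-level pass rate = {alpha ** k:.6f}")
```

So demanding one clean replication already pushes the chance-level rate down to 0.0025, which is the "extra zero" (and then some) the parent comment is asking for.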

It's positive news mostly - they took 15,000 volunteers and ran 28 "big" studies across them, working with original teams to get the nuances right. So they had larger samples than originals and ran experiments multiple times across multiple samples.

Where a study failed in one it mostly failed everywhere, and where it worked it mostly worked in all the experiments.

So, yes, priming someone with the number 32 won't get them to bet it in the casino, but it does mean that social science can be experimental - you can build and run lab-scale social science - a positive!

This is a really optimistic take. I agree with you, ultimately, this is positive because it shows that it's possible for psychology to be done and have widespread meaningful results.

In the short term though this seems extremely negative as it indicates that the psychology field and researchers we have today are producing substandard results. Two parts of this article increase my negative feelings. First, they mention that these 28 studies were selected because they were well known and influential. If well known and influential studies only have a 50% reproducibility rate - what about the average study?

Another point I find very troubling is that people betting online were able to predict, with high accuracy, which studies were accurate and which weren't. This tells me that the problem is an "Emperor's New Clothes" type of problem and not just a challenge with the fundamental difficulty of the subject since the subject is so yielding to the public's intuition. Instead of a problem with the fundamental difficulty, the problem seems to be with the psychology of the researchers - being unable to diagnose and prevent methodological flaws in their research. This class of failure is much less excusable to me.

"If well known and influential studies only have a 50% reproducibility rate - what about the average study?"

Actually, they might be better off. In the related article* they mention: "Beyond statistical issues, it strikes me that several of the studies that didn’t replicate have another quality in common: newsworthiness."

It could be that many well known works are well known because their findings are counterintuitive, and therefore more interesting. However, this could also imply that the prior probability is lower - hence a p = 0.05 result is more likely to be due to chance than for a study finding something more in line with the prior understanding of things.

* https://www.theatlantic.com/science/archive/2018/08/scientis...

The article that goes deeper into the betting markets says they suspect many of the traders were people invested in the reproducibility crisis and related research, that is, more knowledgeable than the general public would be.

That doesn't take away from your point, but the problem here may be more refined than "the public" vs "psychologists". It may be more like "some psychologists" vs "journal editors and other psychologists".

What I wrote was influenced by the fact that I had taken a "Psychology Replication Quiz" which I'll link to below. The quiz presents you with the thesis of a study and asks whether you think it will replicate. When I took it the first time I got ten questions correct out of ten. I took the quiz again a moment before writing this comment and got nine out of ten (I think I was hurrying this time). The average person taking the quiz gets seven out of ten correct.

The depressing part about this quiz is how easy it is. You just have to think for a moment if what they are describing matches what you know about human nature. Some statements just feel very intuitive and likely, and those replicate. Other statements seem quite wild and abstract, and those don't replicate.

I haven't seen a list of the 28 papers they looked at here, but I'd bet it is similar. Some of the findings are incredible (and not reproducible) while others are intuitive and reproducible. I'd encourage you to take the quiz below and see if you feel the same way about the results as I do.

Psychology Replication Quiz - https://80000hours.org/psychology-replication-quiz/

For me the unrepeatable experiments suggest something even more interesting - that unless the originals were flawed or fraudulent, something about the people being tested has changed.

We are all more aware of con-man movies and fake news, or have read pop-sci books, and now know what to look out for.

OK, having looked deeper into this, it seems some of the replicated studies are... counterintuitive.

I wish there was more of this optimism. Yes, it sends a lot of conceptual ideas back to the drawing board, but that just means there are all kinds of new experiments to be composed and performed.

A number of these experiments did replicate, and having such a huge data set across multiple regions/cultures is really, really important.

Say in 120 years some of those experiments in the positively replicated set fail. It won't be as easy to just attribute that to sloppy findings. It could mean the culture of human beings has changed, as well as our underlying psychological tendencies. If it were just two or three studies, you could dismiss them and say they might have been wrong, but with this data set, you could say with higher confidence that it's more likely something about our society has changed.

I also read it as a qualified positive. (1) Sound and unsound studies can be done in any field. (2) In this examination of the field of psychology, half the observed studies were found to be sound. (3) The implication that half the work in psychology may be unsound is concerning, and suggests a large cohort with problems in rigor, ethics, careerism, etc. (4) The sound work being done in psychology is as sound as the work in other sciences, and it's the unsound work that's the problem. (5) As an interesting side note, unsoundness can be intuited to a statistically significant degree.

I'm not sure that there's necessarily a problem with psychology itself, but just that the number of participants in studies is consistently too small. Just like how physicists increasingly have to spend more money to get more accurate results (LIGO, LHC, redefinition of the kg, etc.), psychology needs to realize that 1000+ study participants is necessary to get valid results. Unlike physics, though, where it's easy to show that results are bad when the equipment is bad, the nature of psychology makes it easy to publish spurious results without enough support. When the psych community starts to demand higher N values, you expect fewer, lengthier, more costly, but more accurate results.
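As a back-of-the-envelope check on the "1000+ participants" figure, here is a sketch using the standard normal-approximation sample-size formula for a two-group comparison; the function name and the effect sizes tested are my own illustrative choices:

```python
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided, two-sample test,
    via the normal approximation: n = 2 * ((z_{alpha/2} + z_beta) / d)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile corresponding to desired power
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

for d in (0.8, 0.5, 0.2):  # Cohen's "large", "medium", "small" effects
    print(f"d = {d}: ~{n_per_group(d):.0f} participants per group")
```

For a "small" effect (d = 0.2) at 80% power this gives roughly 390 participants per group, i.e. close to 800 total, so for the small effects typical of social psychology the 1000+ figure is the right order of magnitude.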

Thanks. It's usually more informative to read the original source.

50% replication success rate.

I suspect the authors of the article are being a bit too generous in calling the original work 'sloppy'. In their view, the alternative would be 'no experiments, no conclusions' -- and that would be worse.

I disagree with that; I think 'being lied to' is worse than 'not being told'.

What if the original work was purposely faked just to get a grant from some group (political or commercial)?

If that's the case, then the field of psychology is composed of a big number of 'full-stack con artists'.

As someone who is both a developer and a psychology student: your average coder knows about as much psychology as your average psychologist knows about coding. That is to say, not a whole lot.

I am saddened but not surprised by psychology's replication crisis. But really it applies to all of academia; psychology is only in the spotlight because some of its branches have less than stellar methodology in the first place. Social psychology gets criticized a lot, partly because it's one of the only branches rigorous enough to even be tested!

Obviously I'm not in favour of this. But it's a very big machine that needs fixing, and I don't think it starts with tweaking the details; drastic changes are required if we are to gain back the lost credibility. To me, this applies both to social and hard sciences. Since this is a general problem with academia, either your field has already been called out or it has yet to be. I don't believe the current system offers an alternative (yet), though I'm cautiously optimistic about preregistration.

> * A mention of the marshmallow test was removed from an early paragraph, since the circumstances there differ from those of other failed replications.

I tried digging into the study webpage, but holy cow, it's not organized in an easily digestible fashion.

So does anyone know what happened with the marshmallow test replication?

Reading your article, I wouldn't necessarily interpret it as "failed":

> Instead, it suggests that the capacity to hold out for a second marshmallow is shaped in large part by a child’s social and economic background—and, in turn, that that background, not the ability to delay gratification, is what’s behind kids’ long-term success.

I.e. it would still seem to say that the capability to delay gratification is tied to economic success, except that's really a confounding variable. It's also very possible that the ability to delay gratification that one gets from being more affluent is also a factor in future success.

Both of those, though, spell failure for the original experiment. It claimed that the predictor was delayed gratification. That is about as meaningful as saying the predictor of future success is how successful you are in 10 years. :)

I'm guilty of not reading the article, but I find pleasure in poking apart statistical analysis.

There are so many things to consider, like average sugar intake of affluent/non-affluent families at the time of the first test and now.

If both classes consume candy, perhaps the affluent have access to tastier candy (maybe important because the wealth distribution in the early 90s was more equal), so marshmallows are not tasty to them. (Note: I haven't had a marshmallow in a long time, but they are nowhere near as tasty as candy bought off the checkout shelves.)

If present affluent families are educated about sugar (which was less so the case in the early 90s), they would probably feed their children less junk food, so the desire of wanting more of it is not as hard wired in their neurons, or it would taste overly sweet to them.

Maybe less affluent families give their children sweets all the time because it shuts them up.

Maybe the recipe of marshmallows has changed between then and now (see point 1).

That much is obvious. But your link is to a different replication study. I want to know why there is an asterisk at the bottom of this article.

I think it's the Stanford marshmallow experiment in delayed gratification: children could take a single marshmallow immediately or wait a certain amount of time and get an additional reward. The results were tracked longitudinally for the participants, with positive life outcomes linked to those who were able to delay gratification as children.

I thought it was later found that the delayed gratification simply correlated with parental wealth and that is why it linked to positive outcomes later in life.

It would be interesting to try and replicate the experiment with other treats. As a child I sort of liked marshmallows, but not a lot. So if I had been an experimental subject, I might have taken the first marshmallow not due to a lack of willpower but just because I didn't care much about marshmallows.
