Artificial intelligence faces reproducibility crisis
Reproducibility in Machine Learning-Based Studies:
An Example of Text Mining
Missing data hinder replication of artificial intelligence studies
>In a survey of 400 artificial intelligence papers presented at major conferences, just 6% included code for the papers' algorithms. Some 30% included test data, whereas 54% included pseudocode, a limited summary of an algorithm.
This is completely ridiculous.
In pure mathematics, everything you need is right there in the paper. If instead you're in the physical sciences, you obviously can't include your lab in the paper. For software, it's perfectly possible to include the lab in the paper, so to speak, and there's no excuse for depriving the rest of us of access.
Not interested in releasing the source? Ok, just don't brand your project as 'computer science'.
Including full source-code (and ideally data-sets, unless there's good reason this can't be done) should be a basic requirement for serious publications. It's disappointing that this isn't (yet?) the norm.
Pseudocode is just another description of the approach the authors took. It doesn't allow another researcher to carefully recreate the exact experiment that the authors ran.
Software is famously hard to get right. This can be exacerbated by certain research problems where you don't know the result to expect (modelling, say). If you're going to publish results, the source you used should be available for inspection, for similar reasons to why mathematicians have to publish the proof, not just the conclusion plus a promise.
The upsides here are many: protecting science against software bugs, easier replication of experiments, better protection against academic fraud, and helping other researchers extend the work.
(I realise this last point may be at odds with toxic academic careerism. All the more reason for the publication process to insist on it.)
You'd think this would make them more interested in replication, since without it, whether they are actually doing something of value remains uncertain.
Science and research are full of all kinds of false starts; you don't increase your speed by blinding yourself.
Falsification is the very definition of science.
My impressions of machine learning match up with yours. There are so many parameters, and hence so many opportunities to capitalize on chance, combined with [necessarily] huge datasets that preclude replication, that it's hard to avoid. I have also been surprised at how much tweaking of parameters there is; I'm used to working in traditional statistics, where more is derived from theoretical principles, as opposed to trying a bunch of values to see what "works." This all lends itself to overfitting. I strongly believe that a lot of the adversarial input work is basically capitalizing on this overfitting.
To be fair, I don't think people are necessarily being nefarious, I think people across all fields just don't appreciate the dangers of overfitting.
The one upside of this Many Labs work, mentioned in the paper, is that it tends to show that the common criticism of "how can you generalize from this sample of undergrads" is not so much of a problem. Not that it's not an issue at all, but if you're studying some basic cognitive process it's probably not going to matter that much if you use undergrads versus some perfectly representative sample of the population. People have shown this in different ways before but it's useful to know more about. Obviously with some things sociogeographic variation will matter more though.
One thing that might not be totally obvious from this article and others is that although some effects clearly replicate, and others do not, there are some effects that seem to be in a grey area, where the effects are probably real but much smaller than originally reported. The distribution of effect size estimates is more continuous, with shades of grey, than these news reports would have you think. Whether an effect is tiny versus zero might not matter much in practice, but at some level it is important to be mindful of.
The two prior replication studies that I know of - the first was several years ago and raised the issue in the public mind afaik, the second maybe in the last year - both replicated almost all of the effects. The problems were that some effects were weaker than in the original studies. Are you saying that this most recent attempt at replication shows a large number of studies with no significant effects at all?
That's hypothetical. What actually happened was what I wrote in the GP: The effects were replicated almost every time, except sometimes they were weaker.
One of the foundational problems with the soft/social sciences is that many of their practitioners have weak analytical thinking skills (see note below), and the educational requirements for Psychology majors typically include just a few semesters of statistics courses designed to be understandable for them.
Peer reviewers in Psychology generally don't expect authors to go much beyond the basic statistical tricks students are taught in undergrad. From prior discussions with professors in these fields, it's my impression that most are unable to understand just how flimsy such methods are.
Personally, my prediction is that this won't get better until we have more AI-guided analysis software packages that make more robust analysis accessible to those in soft-science fields. That is, something to replace reliance on stuff like R^2 and p-values.
Note: Just to provide some numerical evidence for the assertion that Psychology practitioners tend to be weak with their analytical skills, here's [a PDF of GRE scores](https://www.ets.org/s/gre/pdf/gre_guide_table4.pdf). GRE test-takers planning to study "Psychology" had quantitative reasoning (QR) mean of 149. This is pretty bottom-of-the-barrel; even people planning to go for grad school to do "Arts - Performance & Studio" did a little better, with an average of 151.
Andrew Gelman argued that a fundamental problem with peer review is that it is done by peers.
Peers whose background and skills are similar to those of the authors, and thus not likely to catch things that were missed.
[When does peer review make no damn sense?](https://andrewgelman.com/2016/02/01/peer-review-make-no-damn...)
I'm sure something could be gained via education. I don't think you necessarily have to be good at math to understand the concepts.
But a lot of undergrad courses focus on hypothesis testing and p-values (despite those methods being condemned by the American Statistical Association), and encourage memorization of steps to do simple math, over understanding what any of it means.
I think the ad-hoc approach of many machine-learning intros would do better. Maybe programming isn't any easier for a psychology student, but simply hammering the ideas, with short demos of pitfalls, may help.
For "small n" problems like in psychology, psych students are much better off with a statistics background than machine learning. So what I'm advocating is a change in how these classes are taught.
Of course, I've known plenty of students who program by copying and pasting code, then modifying it as necessary until it runs. The equivalent of memorizing math steps.
So it will take more to solve the problem.
Psychology is a strange science in that it's a mixture of people with very unquantitative backgrounds, and those who deal with very complex math and statistics. What many don't realize is that meta-analysis itself really was developed as a method in psychology (even if it technically has its origins earlier with Pearson). This registered replication work is an extension of that, again being done by psychologists. It's probably safe to say that more empirical and statistical research on the scientific process itself has been done by psychologists (along with statisticians and many public health researchers) than any other discipline.
In any event, replication problems happen in other domains as well. This has been documented empirically. It might be worse in the biomedical domain than, say, physics or chemistry, but it's not limited to psychologists. What I see in the neurosciences per se is just as bad, if not worse (because it's ignored more).
I think ignorance of issues pertaining to overfitting, etc. definitely contributes, but I also think that ignorance is pretty widespread, and the problems can be sort of pernicious in that they don't always operate intuitively.
There are probably different causes at different times. Some of these effects that are the target of replication tests are relatively old, from when people were less aware of some of these phenomena. To be fair, some of this stuff is unintuitive: for example, some of the major journals would require internal replications, over several samples, the rationale being that if someone shows an effect in 5 samples with slightly different designs, it's probably "real." It's not like some of these things were just based on single samples (although some of them certainly were). Of course, now people are aware that 5 small samples does not large-sample replication make.
My intuition is that another cause is that academia is flooded with researchers collecting a lot of data, under a lot of pressure to produce positive findings to attract money. Hype, TED talks, grants, hypercompetition, and so forth. Academia is now horribly incentivized to be popular and bring in money (from peers, it's important to note), rather than to be correct. Add in a complex subject, like human behavior, and it's a recipe for disaster. Academia is also full of conventions that attain the strength of power structures; when these are baked in, it's hard to change them, because you have people who attained power under those conventions in charge (nothing conspiratorial really, just people having their biases and blind spots).
I'm not sure peer review can change a field's direction. Say the top journal in a field used to publish the top 10% of papers. If only 1% of papers meet some higher standard, then if the journal chose to enforce this, they would essentially cease to exist.
And the people who would otherwise have got hired/promoted/tenured based on publications there, will now win these tournaments based on their publications in the second-best journal. Their contests are with others in the same field.
I guess the more hopeful scenario is that, besides the 1 of 10 top papers that are actually solid, there are (somewhere) 9 other solid papers with less-flashy results -- presumably carefully proving things that everyone thinks, not counter-intuitive things which get you a TED talk. Here it's more complicated: if the journal chose to publish these instead, it seems to me that hiring/promotion/tenure committees may simply start to view it as a less-prestigious venue.
I think the situation is pretty analogous to other areas with replication crises, like psychology and nutrition: local variables dominate. Just as with our own bodies, there's a lot that just depends on the domain or the specifics of the problem being solved...
This has been my experience in a related field (information retrieval). There are trends and best practices, but the market and impossible-to-replicate, overhyped academic research reinforce each other.
I intentionally make it easy to pick up, modify and inspect. The benefit to me is precisely being alerted if it isn't replicable.
Whereas it is true that both are inspired by neural activation and that both can produce a complex, yet coherent mapping from input to output, this is like saying that birds and aircraft are similar because both fly and have wings.
The brain is very complex: a very big ball of highly specialized circuitry (and the software running on it, managing its various aspects and parts). The similarities are very striking, but the differences are just as drastic in the number of layers and other size-related parameters of our various brain components.
Yes, AlphaGoZero is not going to learn to cook and sing and dance, but it beats our gaming component in a lot of areas. Of course a synapse is not really a memristor or a ReLU, but at the same time not that far off either. Similarly, a biological neuron is not just a simple backpropagating integrator like a perceptron, but it works very similarly.
And just as we don't really know how all the representations work in an ANN, we don't exactly know how the biological aspects of memory/learning/seeing work in our brains, yet we have made enormous progress on both. See all the demos/visualizations on how various layers encode in deep nets, and look at the data about the brain's visual cortex and the V1 circuit, or the gene-spliced mouse studies where memory encoding is studied (and sometimes only one neuron encodes for a memory/face/concept - just as with deep nets).
If one doesn't have your training dataset or your code, how could they possibly replicate your results?
Providing the code to replicate is good form. It shows good faith and confidence. Exact replication (exactly replicating the study) is just the starting point to check that the code works and no obvious mistakes were made.
replication / reproducibility / hyperparameter sensitivity
If the research yields something really important and the method is well documented, usually it can be easily checked without having the data and the code. Things like dropout, batch normalization, residual learning, .. work over multiple different datasets and hyperparameters. You can reproduce the results without faithfully replicating the experiment.
If the claimed result vanishes unless you have the exact data, code, or hyperparameters, the research can't be said to be meaningfully reproducible in the scientific sense. Hyperparameter sensitivity is the ML equivalent of p-hacking.
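To make the analogy concrete, here's a tiny simulation (entirely made-up numbers, nothing from any real paper): if every hyperparameter configuration is secretly identical and you report only the best score over many of them on the same finite test set, the "best" number is inflated purely by noise.

```python
import numpy as np

rng = np.random.default_rng(0)

n_test = 200       # size of a hypothetical test set
true_acc = 0.70    # every configuration is secretly identical
n_configs = 50     # number of hyperparameter settings tried

# Each "configuration" is scored on the same finite test set,
# so the scores differ only through sampling noise.
scores = rng.binomial(n_test, true_acc, size=n_configs) / n_test

print(f"true accuracy:       {true_acc:.3f}")
print(f"mean observed score: {scores.mean():.3f}")
print(f"best observed score: {scores.max():.3f}")  # the number that tends to get reported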
Isn't this the work of writing up? A paper is a claim that you have discovered something, and implicitly that other details are standard / unimportant. If it turns out that some hidden assumption was in fact doing all the work, then the claim you made was false.
I'm all in favor of sharing working code. But working code which magically does something... amounts to an anomaly awaiting an explanation. Or an advertisement.
The right thing to do in this circumstance is an ablation study - throw together your best-possible model and then test different subsets of features sitting between your model and the 'basic' prior work. For large datasets, though, each of these models might take a very long time to train (especially if you don't work at a place with a stupid number of GPUs available).
So, lacking resources, you get your new best-ever accuracy number with your 'everything' model, and do an extensive write-up about how awesome the new bell and/or whistle that you added to the pile is... (The problem is compounded by a need to publish quick, lest someone else describe your bell/whistle first.)
Another big problem is that adding a bell/whistle to the base model often means adding more parameters to the model. There's decent evidence coming out of the AutoML world that number of parameters matters a hell of a lot more than how you arrange them. (It's real real easy to convince yourself that your clever new idea is more important than the shitpile of new parameters you've added to the model, after all.) So a really solid ablative study probably needs to scale the number of parameters in a reasonable way as you add/remove features... And there may not be obvious ways to do that smoothly.
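For what it's worth, here's roughly what that ablation-plus-parameter-matching loop looks like in skeleton form. To be clear, `build_model`, `scale_width_to`, and `train_and_eval` are hypothetical placeholders, not any real library; the point is just the bookkeeping: one variant per feature, all scaled to a common parameter budget, each run over several seeds.

```python
# Skeleton of an ablation study with rough parameter matching.
# NOTE: build_model, scale_width_to, and train_and_eval are hypothetical
# placeholders for whatever your codebase provides -- a sketch of the
# structure, not a runnable implementation.

VARIANTS = {
    "baseline":           {"new_block": False, "new_loss": False},
    "baseline+new_block": {"new_block": True,  "new_loss": False},
    "baseline+new_loss":  {"new_block": False, "new_loss": True},
    "full":               {"new_block": True,  "new_loss": True},
}

TARGET_PARAMS = 10_000_000  # keep every variant near the same size

results = {}
for name, flags in VARIANTS.items():
    model = build_model(**flags)                  # hypothetical constructor
    model = scale_width_to(model, TARGET_PARAMS)  # hypothetical: widen/narrow layers to match the budget
    results[name] = train_and_eval(model, seeds=(0, 1, 2))  # hypothetical: returns (mean, std) over seeds

for name, (mean_acc, std_acc) in results.items():
    print(f"{name:22s} {mean_acc:.3f} +/- {std_acc:.3f}")
```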
And this is closely related to the replication study in psych: it's real real easy to do kinda sloppy work with big words attached, and convince yourself and all your peers that you're a genius.
I think a big database for reporting and searching for results with various architecture+dataset combinations would be much more useful than pushing more papers to the arxiv in many cases. (though, really, whynotboth.gif) Let me do some searches to see if a particular bell/whistle actually adds value across the god-knows-how-many-times someone's used it to train up imagenet from scratch...
To be sure, some sciences are more rigorous than others. Various natural sciences are reliant on transient observations, which might or might not be 'reproducible' in any sense... And still progress is made, though the results might contain more prejudices than one would encounter in other areas.
It's also worth noting that some areas within machine learning are extremely difficult to test rigorously (e.g., generative processes), but are still totally worth pursuing. So be careful with cries of 'it's not even science, man!'
And of course some guys a few doors down are proving theorems. And lots of other things.
It is slightly strange though that we try to vet all of these with the same peer review mechanism. This is one major source of the differing opinions in this thread about how this ought to work, I think.
Numerical code is already bug-prone due to subtle, hard-to-test errors; research code that isn't code-reviewed or even necessarily tested for correctness can easily generate important results erroneously.
It's also counter-productive not to publish the underlying source code for these papers, as it adds a barrier to other researchers applying the algorithm in new situations. I'd be interested in seeing if those 6% of papers which include the code get more citations than the population of papers which do not include code.
Probably none. If the paper is important and collects citations, the algorithm is in use. Computer science != working code.
Code is required when you produce something where the scientific importance is less clear. There is a need to provide more evidence. Many papers are just "Hey, I made some tweaks and it works in this particular case." Those papers should have working code.
These tables are generated using real implementations that may or may not be correct, and should be subject to review when the paper is published.
Or, more succinctly, code & data are left as an exercise for the reader
Having more researchers available to audit code seems like it would also help prevent flaws from slipping through and leading to false conclusions.
Thank you for explaining a bit more.
That is, I agree with you that the code and the data should, ideally, be available. I lose confidence when people just rerun the same code on the same data. The slides from a while back about why someone didn't like notebooks resonate with me. Something like "Shift-enter through the lesson. Is this learning?"
If a study uses a private data set, or one that the researcher controls and only gives out to approved partners, that study should be discounted.
I understand corporate labs cannot give away their data in many cases, but corporate research carries less authority than academic research anyways. Academic research should always make the data set publicly available.
Writing your own code and creating your own dataset is the simplest way to rule out the situations where there is something fishy in either the original code or data. And the paper itself should contain enough details to make it possible to recreate the experiment this way.
Think of the provided data as a sanity test.
 I recently fixed a bug wherein I was inadvertently relying upon Linux-specific behavior that failed when tested under Solaris.
Blog posts are nice for reaching outside of the research community.
Code should always be linked for people who want to reproduce, but it doesn't replace a paper.
I don't see why this would take any more time than properly studying a paper; in fact, it should be quicker, since the information is presented in the proper format (not translated from code to math/prose).
For most, you just want to know the general ideas and understand the novel techniques that were applied. You might decide to do a deep dive in a very select few. At this point code becomes useful for reproducibility.
Also, once you are experienced in the field, you don't relearn everything for every new paper. Most papers just propose incremental minor changes.
The only things I care about are what these changes are (conceptually, not a code comparison) and data about the effect of these changes. I don't want to have to read your code to find out where you have applied a certain new hyper-parameter to your gradient descent.
Stop thinking that the entire research population is publishing all results as papers because they don't know better. It's indeed the most efficient way to share knowledge for experts in a given field.
> since the information is presented in the proper format (not translated from code to math/prose)
Do you really think researchers just write some random code until they come up with good results and then try to retrofit that with some equations?
You have got it completely backward.
"For most, you just want to know the general ideas and understand the novel techniques that were applied. You might decide to do a deep dive in a very select few. At this point code becomes useful for reproducibility."
I have much experience reading research literature (but not ML specifically). The vast majority of the time I just skim a paper looking for certain info.
What I am reading from you is that basically the only useful part of the ML paper is the abstract. If the repo/blog had a good abstract there would be no point to the paper.
>"Do you really think researchers just write some random code until they come up with good results and then try to retrofit that with some equations?"
This sounds like a strawman, but yea I do think the code is the "actual" method while the math is some idealized cleaned-up translation.
"Code is everything" approach presumes that communication is computational by default. I'm not sure if researchers agree on that. This is particularly important for a field that aspire for artificial intelligence. Language is the best bet we have at the moment.
Secondly, there are social aspects. I am becoming more well read in my field, and there are times when genuine "rediscovery" occurs. Many PhD students, depending on their research group, do not come up with groundbreaking work right off the bat. It takes them a few years. In the publish-or-perish economy, there are venues to show your paper. If their genuine work is rejected, it may stop them from progressing in their career to the point of producing great work. It is like expecting an undergrad to come up with a full master's thesis for a course project. It happens, but not often.
That being said, if I am to read the code, I will be rereading many code repetitions every year, whereas reading similar abstracts is less time-consuming.
Your blog-post comment touches on a bigger issue: how to tell who conducts legitimate research. The best we have so far is for those at the top of their field to provide assessment. They're journal editors. They have dedicated their lives to their field, and have read papers from decades of research work (whether or not the field is scientifically paramount is irrelevant; not all can learn the same thing, and the education system is there to _educate_ the population at large, along with generating new knowledge and advancing new researchers). For the sake of completeness I'd add this: at least in my case, and perhaps for many others, as one becomes more experienced, one can better assess one's earlier grasp of the field, or one's earlier misunderstandings.
Hence it seems reasonable to publish, and have abstracts.
The whole point of a blog post is popularization (which is definitely needed as well): explaining an idea without assuming that the reader has a lot of prior knowledge in the domain. That goes in the opposite direction of what researchers look for: the most efficient way to understand what is being proposed.
Most ML paper abstracts would be pretty horrible as blog posts in terms of popularization.
Plus, papers are all very well centralized on arXiv. How am I supposed to find blog posts? Should I just hope that whatever I am interested in will blow up on Twitter or /r/machinelearning and I will notice it? This sounds extremely sub-optimal.
> This sounds like a strawman, but yea I do think the code is the "actual" method while the math is some idealized cleaned-up translation.
Do you also think that biologists should stop writing papers and just upload videos of their actual experiments?
I work in an ML research lab with 30 or so researchers, research never begins with some code snippets. It always begins on the whiteboard.
Even between us, where we have the code and data readily available to share with each other, when someone presents their work, they will never show code, but always equations or model diagrams. It's just an order of magnitude easier to visualize.
Take this paper for example: https://arxiv.org/pdf/1805.11604v3.pdf how would you even begin to communicate something like that through code?
Elsevier is by far the worst publisher regarding open access though: https://twitter.com/protohedgehog/status/1028819653982736389...
(reproducible: independently achieving the same result using existing data;
replicable: independently achieving the same result using new data.)
Others have also been worried about replication problems from the 1960s on. Hopefully some of this worry sticks this time and we can get a better understanding of what is really true in these fields. Physicists like to have very well defined uncertainties on everything they observe and don't like to say something is "true" unless the finding holds at 5 to 7 sigma. That seems like a good margin. In psychology, 2 sigma is the standard for publishing a result.
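For reference, here's how those sigma levels translate into two-sided p-values under a normal model (quick scipy check):

```python
from scipy.stats import norm

# Two-sided p-value corresponding to an n-sigma deviation under a normal model.
for sigma in (2, 3, 5, 7):
    p = 2 * norm.sf(sigma)
    print(f"{sigma} sigma -> p ~ {p:.1e}")
```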
In physics you can get away with not caring about the distinction, because of how accurate the measurements and precise the theories are. That doesn't fly in most other sciences, though.
As an undergrad, I even had tenured professors try to tell me that a p-value is the chance that a result is wrong. Most researchers in psych or bio sciences have a weak understanding of statistics, usually taking a single statistics class in undergrad, then a single one in grad.
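For anyone who wants to see why that interpretation is wrong, a quick made-up simulation: when the null hypothesis is true by construction, p-values come out roughly uniform, so about 5% of them land below 0.05 no matter what. The p-value is a statement about the data assuming the null, not the probability that the finding is wrong.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Both groups are drawn from the same distribution, so the null is true by construction.
pvals = np.array([
    ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(10_000)
])

# Roughly uniform under the null: ~5% of p-values fall below 0.05
# even though there is nothing to find.
print(f"fraction of p < 0.05: {np.mean(pvals < 0.05):.3f}")
```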
My wife is a biochemist, and during her PhD they were working on an area of research with a handful of labs publishing about it. One lab in particular was known for a decent amount of questionable publications, but the PI was a big deal and no one would officially question anything. So they would whisper amongst each other and just ignore that paper (and all the additional papers built on top of it) because they all knew it was bullshit.
Brouwer's career suffered as a consequence of his disagreements with Hilbert.
EDIT (with links to the political aspect):
Letters written more about the politics of publishing than the math itself are referenced in a book called "The War of the Frogs and the Mice"
(PDF) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.224... (see bottom of pg 9 as displayed for the letter Brouwer wrote to Hilbert's wife)
It doesn't read like the political fights were about math. Although Hilbert disagreed with Brouwer, it sounds like the falling out was precipitated by arguing over something subjective.
This could be just anchoring bias. But I think it's much more likely that people after Millikan did not want to look bad in front of their professional peers for fear of social and professional consequences. Which I think is something we can reasonably call political.
(The reason I know we can agree is that unfounded is not the same as false, it just means it's a non-statement.)
I suppose then one can conclude that scientists have behaved incorrectly since Aristotle's time.
> getting along with your peers is the most important thing
No it isn't, hence the yuk. Anyone who thinks getting along with your peers is more important than getting the science right, isn't someone I trust doing science. The most important thing is getting the science right, full stop, end of story.
If he'd needed access to impossibly expensive equipment available only in a few places around Europe, politics might have stopped him before he ever got started.
Science is about being right; not about getting along with your peers.
And then, there are the different sides of scientific discussion which produce the conclusions that the politicians warp and shriek about. Some of this is good science, and a lot of it is bad science. But whatever the politicians get to blather on about is what most people are aware of, regardless of whether it's good or bad.
And your comment points to part of the problem. It's science, so if we believe that we know everything, we're wrong. If we don't know everything, then that means we need to have the latitude to argue about things. The argument is obviously not "is the earth warming?" but "how and why is the earth warming?". If your immediate response is "well, it's obviously humans producing CO2", then you have fallen into the trap that I'm trying to explain. Yes, humans produce CO2, and CO2 is a greenhouse gas. But the story is more complicated than that, and science needs room to wonder. The problem is the political shrieking dampens the curiosity and wonder that science is about.
In that case, it's pretty settled that there is a significant human impact on atmospheric CO2 levels; the room for wondering is in, say, the exact distribution of CO2 released. I mean, physicists don't wonder about and revise basic kinematics, since it's pretty much settled that Newtonian kinematics is a very good approximation of reality.
There are some facts that seem to give pause for thought though, like it apparently being much warmer than today in medieval times. This can be hard to see because temperature records don't go back that far, but we know that the Romans appeared to once grow wine near Scotland, something totally impossible today (in fact it only became possible to grow wine in the south of England very recently, a "new" thing blamed on global warming). It's unclear how the Romans could have done that unless grapes were very different or the weather was very different.
This argument is not quite up to snuff because, as it turns out, only a tiny fraction of the oil ever can be made to come out. The truth is that even after the wells have been abandoned most of the oil remains stuck in the rocks that it started in. I found this out while wondering whether or not it would be possible to burn all of the oxygen in the air (it's not, sadly ;) )
Also, the oil reserves are not really running low; if they were, that would be a lucky answer to carbon control.
So even if your story is true, those kids did exactly the right thing.
Regardless, the consensus that's bandied about comes from the IPCC meta-study of all published peer-reviewed articles from climatologists, not from a few kids.
So it shouldn’t exactly surprise us that most of them agree.
> If you don't have a PhD in the field, you don't belong in the big discussions where consequences matter.
I think what we need is accountability - there should be serious negative consequences for being wrong proportional to how strongly you stand behind an idea.
Were all the climatologists who were confident we were facing global cooling driven from the industry? Stripped of their academic standing?
Then why would I trust the industry?
Counting climatologists is a bad idea because they shut critics and skeptics out of the field. We know that climate models are generally wrong. Try saying that while having a career as a climatologist.
1. That study is 10 years old at this point. The evidence for climate change has become vastly more undeniable since then.
2. Majorities of scientists from all categories believed in climate change, even back then, according to this study.
Also, anybody who views climate prediction as intrinsically suspect pseudoscience isn’t going to want to enter the field either.
Appealing to the authority of climatologists is not going to convince anybody because everyone who thinks they are authoritative already agrees that global warming is caused by humans...
This as opposed to releasing research along the lines of 'cigarettes are fine for you! smoke more!'
Even then there are agency problems which mean that once a company gets over a certain size there is no natural alignment.
I know plenty of managers tweaking A/B tests until they get the results they want.
"in academia overfitting can get you a nature paper, in industry get you fired"
“A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.”
Or more succinctly, science advances one funeral at a time.
As a counterpoint, it seems that Minsky and Papert's work was partly driven by their desire to stop the funding of larger perceptrons, in which they did not believe. The winter that followed their work may or may not have delayed progress in AI by years or decades.
Sure, mistakes happen, my own PhD contained a major blunder (discovered several years after submission -- at least somebody read it ...), but I'm extremely open about it.
I think that the hard sciences aren't necessarily so dependent on papers derived from isolated results, or even replication of those results, because there is more of a "web" of knowledge. An idea is hopefully supported by an experiment, but possibly also by theory and multiple different kinds of experiments, sufficient to be usable by non-experts and laypeople. For instance the basic theory of electromagnetism is good enough for virtually any kind of engineering, and it would take some monumental discovery to prove it false, even if a portion of the supporting work fails upon replication.
Another thing is, perhaps because of the precision of physics, we get proven wrong every day, and it ceases to be an embarrassment. The road is littered with the wreckage of failed theories and experiments.
Or more succinctly: mathematics doesn't give us truths, it gives us consequences. The statement, "2 + 2 = 4" is not an empirical fact intrinsic to the universe. It's a consequence of our definition of natural numbers. It's considered a useful definition because (among other things) of its application as a foundation for engineering and the sciences. But many other useful areas of mathematics have essentially no basis in reality.
Then physics isn't a science either, since only parts of the field fit these criteria. I think you will realize that the distinction between physics and math isn't as clear cut as you imagine; it is not uncommon for physicists to publish in math journals and vice versa.
Ultimately though, the math that is natural science is just called “physics”, right? ;) The reason math can’t be a hard science is because we can’t prove abstract math by physical experiment. That’s all “hard science” means - things you can show by physical experiment. We can prove math theorems by showing they’re consistent under all the other math theorems, but generally not by conducting physical tests.
I think turning math into a purely abstract game came later, and we owe our fun to the discipline's more humble origins.
There are of course other viewpoints that math has a Platonic metaphysical reality in and of itself, independent of the natural world.
So what was it?
what your wife is describing is what is commonly called "progress one death at a time"
"We have learned a lot from experience about how to handle some of the ways we fool ourselves. One example: Millikan measured the charge on an electron by an experiment with falling oil drops, and got an answer which we now know not to be quite right. It's a little bit off because he had the incorrect value for the viscosity of air. It's interesting to look at the history of measurements of the charge of an electron, after Millikan. If you plot them as a function of time, you find that one is a little bit bigger than Millikan's, and the next one's a little bit bigger than that, and the next one's a little bit bigger than that, until finally they settle down to a number which is higher.
Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of—this history—because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that."
Point being: Feynman’s ideas should have been judged on their merit, rather than Richard’s career. Same for the papers published by the institute mentioned above.
I adore Feynman's work, but he was human and had human failings, too.
the dead ringer for these groups is a PI who has a reputation for taking on extremely ambitious projects with all sorts of funding sources. typically, the PI is vindictive and controlling in my experience.
but yeah. you can see them from a mile away. they tend to run sweatshop-style labs where everyone is very hard pressed for results. so "results" are produced, whether or not they are real.
Here is the biggest funder of biomedical research in the world admitting that $27 billion out of $30 billion per year is wasted on unreproducible research:
And I would say that is optimistic because it does not consider misinterpreted results (interpreting your data correctly is even harder than figuring out how to perform and report reproducible experiments)...
I also wouldn't consider math a science since it doesn't require comparing your results to data.
It may very well be that you have your own peculiar definition of "hard science", or that you personally disagree that biology or biochemistry should be classified that way. That's fine, you're allowed to have your own opinions. But you having your own crackpot opinions doesn't make it fact in the real world.
These fields in particular are not even close to being hard, especially oncology which appears to (somehow) be even worse off than psychology.[1:3]
I bet less than 1% of people with masters or greater in a bio field have used calculus for anything in the last 5 years. While of course the definition of "hard science" is arbitrary, here are some decent overviews of the differences between physics and bio:
>"Precise definitions vary, but features often cited as characteristic of hard science include producing testable predictions, performing controlled experiments, relying on quantifiable data and mathematical models, a high degree of accuracy and objectivity, higher levels of consensus, faster progression of the field, greater explanatory success, cumulativeness, replicability, and generally applying a purer form of the scientific method."
That same page does group biology in with physics and chemistry, but with no justification. From personal experience biology is much closer to a social science than it is to physics. I mean, most people who study biology-related stuff "hate math".
Hell, go back further: consider Mendel's discovery of the rules of genetic inheritance (and thus founding the field of genetics) by painstakingly planting, replanting, combing and recombining pea plants. It's hard to imagine a more pure and rigorous application of the scientific method, in any field.
I'm sorry, but your position is beyond absurd. Even the page you link disagrees with you. It sounds very much like you once met a biologist who said that they didn't very much like math, and since then you've carried a personal prejudice against biologists. You admit as much in your comment.
Ironically in the same vein as the OP article!
The sad part is it will never improve to become a "hard science" if people think it already is one. I mean really think about it. Biology is the study of how complex systems change over time, yet learning the primary tool we have to study/analyze change (calculus) is not a prerequisite. Doesn't that seem strange for a "hard science"?
That being said I get where the commenter is coming from. Much of biology is a qualitative science. Darwin's evolution was originally elucidated through descriptions not hard numbers and statistics. Taxonomy in zoology is also largely qualitative.
But, if a person does share that opinion, I can see how that person might be inclined to ignore tone, and focus on supporting that opinion.
You cannot call an idea itself insulting unless you yourself are a biased person. Ideas in themselves are just statements with no malevolence intended. Only people can express malevolence.
I don't agree with him. But I see where he's coming from. A lot of biology is just observing behavior and documenting observations. It is very different from employing the scientific method in an experiment. It is also very different from trying to quantify everything. The line between hard and soft is blurry.
>"Precise definitions vary, but features often cited as characteristic of hard science include producing testable predictions, performing controlled experiments, relying on quantifiable data and mathematical models, a high degree of accuracy and objectivity, higher levels of consensus, faster progression of the field, greater explanatory success, cumulativeness, replicability, and generally applying a purer form of the scientific method." https://en.wikipedia.org/wiki/Hard_and_soft_science
1) Producing testable predictions
--- The predictions are almost all vague, of the form "x will be correlated to some degree with y".
2) Performing controlled experiments
--- Maybe, but I very often see that proper blinding procedures were not used.
3) Relying on quantifiable data and mathematical models
--- Mathematical models are rarely used to make precise testable predictions.
--- There was some progress on this front from Armitage and Doll way back in 1954 but interest has largely fizzled out.
4) A high degree of accuracy and objectivity
--- As mentioned in #1 and #3, making a precise prediction at all is rare. Accuracy doesn't matter much if you predict vague stuff.
5) Higher levels of consensus
--- Not sure this should be here. "Hard" vs "soft" is a matter of the procedures used, not the results.
6) Faster progression of the field
7) Greater explanatory success
--- The more research that has been done on cancer the more complex things have gotten, to the point where now they say "cancer is many diseases". This is the opposite of what happens when you gain a cumulative understanding and figure out "natural laws" that make things easier to understand.
8) Cumulativeness and replicability
--- It seems that well less than 50% of studies can be repeated (see my links here: https://news.ycombinator.com/item?id=18578282)
This describes some of the mathematical economics I've read.
Spend time with condensed matter physicists. What you describe is not what they do. A lot of results are not reproducible.
If replication is the benchmark, almost no one does it (physicist or otherwise). In all my time in academia, I didn't find a single person attempting to replicate anything. You don't get funding, tenure or papers from it.
I was referring to everything else in your comment.
But yes it is extremely rare. It is very frustrating because it is hard to come up with a precise model when you only half-trust the data to begin with, and when the replications are (rarely) run it appears that half-trusting the reported results is optimistic...
All my work got rejected (everything was being accepted before, and the rejected batch was a full level higher in quality). So yes, it did set back my career by two and a half years, if I have to approximate. I am probably never submitting to certain publications again, to avoid having them as reviewers. I am gradually moving away from their field entirely. I was so short on publications then that I couldn't even move to nearby fields, having racked up vengeful rejections in my own. These days I wish I had done a second PhD in a different field, just to get away from them! I'd have had a much easier career path.
This is by no means an exaggeration, I am actually downplaying what happened.
We're raising funds for Professor Chris Chambers at Cardiff University. Chambers is a leading proponent of a new and better way of doing and publishing research, called ‘Registered Reports’, where scientific papers are peer-reviewed before the results are known.
-make science more theory-driven, open and transparent
-find methodological weaknesses prior to publication
-get more papers published that fail to confirm the original hypothesis
-increase the credibility of non-randomized natural experiments using observational data
The funds will free up his time to focus on accelerating the widespread adoption of Registered Reports by leading scientific journals.
This ‘meta-research’ project might be exceptionally high-impact because we can cause a paradigm shift in scientific culture, where Registered Reports become the gold standard for hypothesis-driven science.
Most papers reported an improved performance over some other methods in very specific data sets, but source code was almost always not provided. Once, I dug so deeply into a very highly cited paper that I understood not only that the results were faked, but precisely the tricks that were used to fake them.
I believe scientific fraud arises primarily from two causes:
- Publish or perish. Everyone's desperate to publish. Some Principal Investigators have a new paper roughly every other week!
- Careerism. For some highly ambitious people, publishing papers comes before everything else, even if that means committing fraud. This happens even with highly successful researchers, who have the occasional brilliant, highly cited paper, but who also publish a lot of incremental, dubious work.
P.S. Mildly off-topic, but I love the Ethereum research community at https://ethresear.ch/ , precisely because it is so open and transparent! I wish an equivalent community existed for machine learning.
I'm super curious about this. I had a similar experience in another field.
Is this something you witnessed going down in person, or did you just develop a strong feeling given a lot of clues?
We have a pseudoscience crisis.
And as others pointed out, it is not limited to psychology.
If you p-hacked your way into a statistically significant result, it is not science.
You don't have to be Einstein. Even Einstein wasn't Einstein most of the time. But the results you do get must be reproducible.
Most of the "statistical" justifications stem from assuming distributions are normal and samples are uniformly collected (both false).
This is why I take the works of clinical psychologists like Jung and Freud so much more seriously than the current approaches.
It just makes a lot more sense to observe something complex descriptively, rather than start with a yes/no hypothesis and then test it with a few dozen people.
The issue was sloppy science, not a poor approach.
The results they found were that the studies that could not be successfully reproduced generally failed across all tested cultures, while the ones that could be successfully reproduced generally worked across all tested cultures. So it seems that there are some results in statistical psychology that are robust, but that researchers have not been sufficiently reliable at identifying which results those are.
Also, just because there are some problems in the field, you shouldn't dismiss science completely in favor of purely subjective speculation about how the mind works. That kind of speculation certainly has value too, but I don't think it's warranted to dismiss actual science in favor of it.
Furthermore, even if you get a statistically significant result (a low p-value) ... your experimental design can introduce so much noise.
Even when you do find these statistically significant correlations -- they don't really ever say much about "why" ... because they can't. So much of that is the study design.
Let's say I run a study on how well people navigate a maze under the influence of alcohol vs. not. You can probably get a significant result showing that they do worse while under the influence of alcohol... but it reveals very little about why.
So much of that is dependent on the particular maze. In fact I bet I could design a maze specifically to prove whatever conclusion I wanted. That is the whole problem in psychology. There is very little focus given to actually characterizing cognition in any meaningful way.
I like the clinical psychologists because they attempt to do exactly that, even though they have less "scientific" findings.
Freud believed all sorts of obviously wrong ideas, so the idea that he was somehow immune to this seems a bit odd: https://en.wikipedia.org/wiki/Emma_Eckstein#Surgery
Sure, we tend to want to eat once we're hungry, but when we go just a step above hunger, meaning the desire or non-desire for sexual reproduction, things get really, really complicated really, really fast. For example, just the other day I was reading about a 92-year-old guy who was choosing daily sex over allowing his body to get physically better, while on the other hand we've got asexual 20-year-olds from places like Japan who don't really think about sex at all. There's no way for a lab-made psychology experiment to make sense of our libido (or lack thereof).
Isn't that the definition of a failure to replicate?
"We've tested the theory that 'swans are white', and it turns out the swans in our test were all black". (The conclusion isn't that swans aren't white, it's that some swans are, while others are black.)
(Addendum: the article mentions the 'WEIRD' category. Obviously the applicability/range of a theory matters. If it only holds for "people who, between 2 and 5 pm on a Sunday, are walking around Reading in a bowler hat", well, that's nice, but not very useful.)
If you are getting your "science" from newspaper articles, then it's not actually science. You're getting some over-worked, underpaid journalist's attempt at getting a catchy article.
It's like alchemy - they noticed some patterns and have some success exploiting them, but the abstractions and explanations they provide are mostly bullshit, so the predictive power is basically 0.
Even a simple double pendulum is chaotic, yet we can describe its movement quantitatively.
For me the failures of psychology are very similar to the failures of pre-Enlightenment physics, chemistry, and biology: wrong abstractions, storytelling instead of formal models, unfalsifiable claims.
My girlfriend was seeing a therapist, and the methods used were basically clairvoyance. Telling the patient "draw a tree", then interpreting the kind of tree as an indication of inherent characteristics of the patient. When she told me that, I drew a tree as well, and it turned out my tree was basically the same (the most common kind in our climate, in the most traditional drawing possible). I'm nothing like the description she got from the therapist and neither is she (but in a different way). Of course, she accepts the interpretation of that picture the same way people find matching details in their horoscopes.
Psychology sounds to me a lot like bloodletting or voodoo programming, and the scientific community should distance itself from people claiming its discoveries to be scientific and its therapies to be scientifically proven.
You can say it's not science but most "scientists" are just normal people with everyday priorities, rather than Einstein-like people who go on doing science work while working as patent clerks.
Where a study failed in one it mostly failed everywhere, and where it worked it mostly worked in all the experiments.
So, yes, priming someone with the number 32 won't get them to bet it in the casino, but it does mean that social science can be experimental - you can build and run lab-scale social science - a positive!
In the short term though this seems extremely negative as it indicates that the psychology field and researchers we have today are producing substandard results. Two parts of this article increase my negative feelings. First, they mention that these 28 studies were selected because they were well known and influential. If well known and influential studies only have a 50% reproducibility rate - what about the average study?
Another point I find very troubling is that people betting online were able to predict, with high accuracy, which studies were accurate and which weren't. This tells me that the problem is an "Emperor's New Clothes" type of problem and not just a challenge stemming from the fundamental difficulty of the subject, since the subject yields so readily to the public's intuition. Rather than fundamental difficulty, the problem seems to be the psychology of the researchers - being unable to diagnose and prevent methodological flaws in their own research. This class of failure is much less excusable to me.
Actually, they might be better off. In the related article* they mention: "Beyond statistical issues, it strikes me that several of the studies that didn’t replicate have another quality in common: newsworthiness."
It could be that many well-known works are well known because their findings are counterintuitive, and therefore more interesting. However, this also implies that the prior probability is lower - hence a p = 0.05 result is more likely to be due to chance than it is for a study finding something more in line with prior understanding.
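To put rough numbers on that (purely illustrative priors, power, and alpha, not taken from the article): treat "counterintuitive" as a low prior probability that the hypothesis is true, and compute the chance that a p < 0.05 finding is real.

```python
# Post-study probability that a "significant" finding is true, for a given
# prior probability that the hypothesis is correct, statistical power, and alpha.
# The numbers below are illustrative only.
def prob_finding_is_real(prior, power=0.8, alpha=0.05):
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.02):
    print(f"prior {prior:4.2f} -> P(real | p < 0.05) = {prob_finding_is_real(prior):.2f}")
```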
That doesn't take away from your point, but the problem here may be more refined than "the public" vs "psychologists". It may be more like "some psychologists" vs "journal editors and other psychologists".
The depressing part about this quiz is how easy it is. You just have to think for a moment if what they are describing matches what you know about human nature. Some statements just feel very intuitive and likely, and those replicate. Other statements seem quite wild and abstract, and those don't replicate.
I haven't seen a list of the 28 papers they looked at here, but I'd bet it is similar. Some of the findings are incredible (and not reproducible) while others are intuitive and reproducible. I'd encourage you to take the quiz below and see if you feel the same way about the results as I do.
Psychology Replication Quiz - https://80000hours.org/psychology-replication-quiz/
We are all more aware of con-man movies and fake news, or have read pop-sci books, and now know what to look out for.
A number of these experiments did replicate, and having such a huge replicated set across multiple regions/cultures is really, really important.
Say that in 120 years some of those experiments in the positively replicated set fail. It won't be as easy to just attribute that to sloppy findings. It could mean the culture of human beings has changed, as well as our underlying psychological tendencies. If it were just two or three studies, you could dismiss them and say they might have been wrong, but with this data set, you could say with higher confidence that it's more likely something about our society has changed.
I suspect the authors of the article are being a bit too generous in calling the original work 'sloppy'.
And that the alternative would be 'no experiments with conclusions' -- and that, in their view, is worse.
I disagree with that; I think 'being lied to' is worse than 'not being told'.
What if the original work was purposely faked just to get a grant from some group (political or commercial)?
If that's the case, then the field of psychology is composed of a big number of 'full-stack con artists'.
I am saddened but not surprised by psychology's replication crisis. But really it applies to all of academia; psychology is only in the spotlight because some of its branches have less-than-stellar methodology in the first place. Social psychology gets criticized a lot, partly because it's one of the only branches rigorous enough to even be tested!
Obviously I'm not in favour of this. But it's a very big machine that needs fixing, and I don't think it starts with tweaking the details; drastic changes are required if we are to gain back the lost credibility. To me, this applies both to social and hard sciences. Since this is a general problem with academia, either your field has already been called out or it has yet to be. I don't believe the current system offers an alternative (yet), though I'm cautiously optimistic about preregistration.
I tried digging into the study webpage, but holy cow, it's not organized in an easily digestible fashion.
So does anyone know what happened with the marshmallow test replication?
> Instead, it suggests that the capacity to hold out for a second marshmallow is shaped in large part by a child’s social and economic background—and, in turn, that that background, not the ability to delay gratification, is what’s behind kids’ long-term success.
I.e. it would still seem to say that the capability to delay gratification is tied to economic success, except that's really a confounding variable. It's also very possible that the ability to delay gratification that one gets from being more affluent is also a factor in future success.
There are so many things to consider, like average sugar intake of affluent/non-affluent families at the time of the first test and now.
If both classes consume candy, perhaps the affluent have access to tastier candy (maybe important because the wealth distribution in the early 90s was more equal), so marshmallows are not tasty to them. (Note: I haven't had a marshmallow in a long time, but they are nowhere near as tasty as candy bought off the checkout shelves.)
If present-day affluent families are educated about sugar (which was less the case in the early 90s), they probably feed their children less junk food, so the desire for more of it is not as hard-wired into their neurons, or it would taste overly sweet to them.
Maybe less affluent families give their children sweets all the time because it shuts them up.
Maybe the recipe of marshmallows has changed between then and now (see point 1).