For example, Amgen reported that, of the 53 landmark cancer papers they reviewed, 47 could not be replicated. I would have assumed that most of them didn't involve 'machine learning'.
In the past people were manually fishing for results in available datasets. Now they have algorithms to do it for them.
In medicine a popular way to use ML is to improve diagnosis. Now there's already a problem in medicine that the benefits of early diagnosis are overrated and the downsides (overtreatment etc.) usually ignored. You get more of that.
And TBH computer scientists aren't exactly at the forefront when it comes to scientific quality standards. (E.g. practically no one is doing preregistration in CS, which in other fields is considered a prime tool to counter bad scientific practices.)
It's not an interesting question for many scientists, who prefer focusing on technical solutions over political ones.
Even obvious wins that almost everyone can agree on, like getting publishing out of the hands of for-profit entities that add no value, are taking forever, because cultural, social and political institutions are hard to move.
Consider the simple task of peak fitting to determine the result for some data. You're probably using a commercial tool to identify the peak position, calculate the baseline, and come up with parameters for your model.
But if there's an error, and at least when I was in grad school the tools often would get stuck in weird local minima that take experience to recognize, it could easily just never be noticed. If your baseline is way off, good luck calculating your peak areas reproducibly...
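To put a rough number on the baseline point, here is a minimal Python sketch. Everything in it is made up for illustration (a synthetic Gaussian peak of area 1.0, a constant baseline of 0.05, and the helper names); it is not what any particular commercial tool does. It integrates the same peak twice, once with the correct baseline and once with a baseline that is off by 0.01:

```python
import math

def gaussian_peak(x, area=1.0, center=0.0, sigma=1.0):
    """Synthetic Gaussian peak with a known area (illustrative only)."""
    return area / (sigma * math.sqrt(2 * math.pi)) * math.exp(-0.5 * ((x - center) / sigma) ** 2)

def integrated_area(xs, ys, baseline):
    """Trapezoid-rule area of (signal - assumed constant baseline)."""
    total = 0.0
    for i in range(1, len(xs)):
        left = ys[i - 1] - baseline
        right = ys[i] - baseline
        total += 0.5 * (left + right) * (xs[i] - xs[i - 1])
    return total

def demo():
    # Integration window from -10 to +10 in steps of 0.01 (width 20).
    xs = [-10 + 0.01 * i for i in range(2001)]
    true_baseline = 0.05
    ys = [true_baseline + gaussian_peak(x) for x in xs]
    area_ok = integrated_area(xs, ys, baseline=0.05)   # correct baseline
    area_off = integrated_area(xs, ys, baseline=0.04)  # baseline off by 0.01
    return area_ok, area_off
```

Over the 20-unit window, a baseline error of only 0.01 adds about 0.2 to a peak whose true area is 1.0, so the reported area is off by roughly 20% even though the fit would look fine at a glance.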
Data analysis is hard, and it's easy to trust algorithms to be at least more reproducible than doing it more manually. On the plus side, if you provide your dataset and code, others can at least redo the analysis! Really excited to see more Jupyter notebooks used for publications in the future.
> COMPare: Qualitative analysis of researchers’ responses to critical correspondence on a cohort of 58 misreported trials
> Discrepancies between pre-specified and reported outcomes are an important and prevalent source of bias in clinical trials. COMPare (Centre for Evidence-Based Medicine Outcome Monitoring Project) monitored all trials in five leading journals for correct outcome reporting, submitted correction letters on all misreported trials in real time, and then monitored responses from editors and trialists. From the trialists’ responses, we aimed to answer two related questions. First, what can trialists’ responses to corrections on their own misreported trials tell us about trialists’ knowledge of correct outcome reporting? Second, what can a cohort of responses to a standardised correction letter tell us about how researchers respond to systematic critical post-publication peer review?
> Trialists frequently expressed views that contradicted the CONSORT (Consolidated Standards of Reporting Trials) guidelines or made inaccurate statements about correct outcome reporting. Common themes were: stating that pre-specification after trial commencement is acceptable; incorrect statements about registries; incorrect statements around the handling of multiple time points; and failure to recognise the need to report changes to pre-specified outcomes in the trial report. We identified additional themes in the approaches taken by researchers when responding to critical correspondence, including the following: ad hominem criticism; arguing that trialists should be trusted, rather than follow guidelines for trial reporting; appealing to the existence of a novel category of outcomes whose results need not necessarily be reported; incorrect statements by researchers about their own paper; and statements undermining transparency infrastructure, such as trial registers.
But as bad as this is: What the COMPare project is doing here is documenting the flaws of a process to counter bad scientific practice. The reality in most fields (including pretty much all of CS and ML) is that no such process exists at all, because no one even tries to fix these issues.
So you have medicine where people try to fix these issues (and are - admittedly - not very good at it) versus other fields that don't even try.
Not even the people who write the software understand what patterns are being found. All they can do is point to the results that seem good at a glance. But while we can train software to beat humans within a constrained dataset with careful checking, these techniques cannot find new insights. The software does not understand what the data it’s processing represents and thus it doesn’t recognize the abstractions and limits of the data. The patterns it finds are in the low resolution data, not in the reality that the data is a poor copy of. But science needs to be analyzing the real world and to do that you must comprehend the errors in your data and what they mean. We are nowhere close to making software that can do that.
> Not even the people who write the software understand what patterns are being found. All they can do is point to the results that seem good at a glance.
The field has come a long way since 2012, and the whole “it’s magic, we don’t understand why it works or what it learns” is no longer true.
Therefore we know why a hypothesis produces wrong results, but sometimes it is not possible to mend the model due to outliers, rare events, lack of data and/or the randomness that surrounds our world. Just as you point out that statistical analysis is done by people who don’t really understand the math, these false results can be due to scientists using ML without understanding its advantages and limitations.
Of course the problem is compounded if we also fail to propose a causal mechanism and don't try to validate it, something I see data scientists doing all too often, since we seldom employ anything like Design of Experiments practices and the data we're working with is very rarely created by us.
IMHO, the replication crisis is a reminder to scientists that to confirm a hypothesis you need to pass more than one test, and a reminder to data scientists that we must test using more than one model.
This is one of the big reasons that the more modeling, variables, and filtering a study involves, the more you should discount it. It's too easy to prove something when there's nothing actually there.
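A quick way to convince yourself of this is to go fishing in pure noise. The Python sketch below is only an illustration under made-up assumptions (50 samples, 500 candidate variables, everything drawn from a standard normal, and helper names of my own choosing). It keeps the variable that correlates best with a random outcome, then checks how strong the correlation is when fresh data is drawn:

```python
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def noise(rng, n):
    """n independent standard-normal draws."""
    return [rng.gauss(0, 1) for _ in range(n)]

def fishing_trip(n_samples=50, n_variables=500, n_replications=200, seed=0):
    rng = random.Random(seed)
    outcome = noise(rng, n_samples)
    # "Filtering": keep only the variable that correlates best with the outcome.
    best_in_sample = max(
        abs(pearson(noise(rng, n_samples), outcome)) for _ in range(n_variables)
    )
    # "Replication": the variable is noise, so fresh data is just new draws.
    fresh = [
        abs(pearson(noise(rng, n_samples), noise(rng, n_samples)))
        for _ in range(n_replications)
    ]
    return best_in_sample, statistics.fmean(fresh)
```

With these settings the best in-sample correlation is typically somewhere above 0.4 in absolute value, while the average correlation on fresh draws stays close to what chance alone gives for 50 samples (around 0.1). Nothing was there, but the search found it anyway.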
An even bigger risk here is that you can engage in the above process and spot check against other data sets to see if it can be validated elsewhere. And you can find correlations that are predictive, yet are in no way whatsoever causal. If we took a sample with enough data on all individuals in the US you'd be able to find some correlation that people who have an E as the second letter in their name, a last name of five characters in length, and went to a high school whose third letter is 'A' have a 23% higher earned income average than those outside the group. And it predicts going forward.
You'd be mapping onto something that obviously has nothing to do with these variables in and of themselves. Perhaps the real issue would be it's simply a very obscure proxy to a certain group of individuals in a certain subset of educational institutions. But the problem is that this is only obviously spurious (even if predictive) because these sort of variables clearly cannot have any sort of a causal relationship. When instead you only look at a selection of variables that, in practically any combination, could be made to seem meaningful through some explanation or another - you open the door to completely 'fake' science that provides results and even predictivity, but has absolutely nothing to do with what's being claimed. So people might try to maximize towards the correlations (which are/were predictive) only to find nothing more happens than if people started actively making sure the second letter of their children's name was an E and legally changed their last name to 5 letter ones.
As a pop culture example, something similar happened with video game reviews. Video game publishers noticed that there was a rather strong correlation between positive game reviews and high sales. So they started working to raise average game scores through any means possible, eventually including 'incentivizing' game reviewers to provide higher scores. As a result game reviews began to mean next to nothing, and the strength of the correlation rapidly faded. Because obviously the correlation was never about high review scores, but about making the sort of games that organically received high review scores. Though in this case we already see "obviousness" fading, since there was some argument to be made that the high review scores were what was driving sales in and of themselves - though that was clearly not the case.
True, but an ML routine can try an approximately infinitely greater number of models. If I'm using x,y and z to predict w, and I've tried all the linear terms, all the squared terms, all the interactions, all the log terms, and I start throwing in other things, my readers will, rightly, raise an eyebrow. Maybe there's some discontinuity to exploit, but if so, I'll explain it -- a policy change for people age 65 or older, say, or a market that exists in one state and not an adjacent one.
The ML, by contrast, can invent the most absurdly jagged multivariate functions imaginable, and we typically* don't even know that it's doing so, let alone why.
*as others have written here, we can actually investigate the how (not the why) by inspecting the algorithm -- but the number of papers that do is much smaller than the number that don't.
And every other comment was like: “What did he expect, he had much less than the usual two papers a year.”
So it seems that even here on HN, the mindset of quantity over quality still persists.
In the long term, it seems like journals that publish lots of falsified papers should be punished, and journals that don't (e.g. because of a judge-upon-pre-registration policy) should crowd them out.
Though this will inevitably lead to the problem that grant boards face, that it'll be a lot harder to differentiate between proposed experiments. And it'll be even harder to do boring stuff. (So if we assume all submitted plans are sound, they have to publish them all. Though then we'll have journals based on how strict they are with experiment design requirements, 1 sigma, 2 sigma, 5 sigma, etc.)
From their 2019 website:
"Do you have insightful and exciting work sitting in a drawer somewhere because it never quite panned out? Are you willing to share your failed approaches so that others can learn from them without having to re-travel the same road? Are you tired of reading papers that pretend the incremental result they happened to achieve was well-motivated and was their goal all along?
We are! That’s why we are founding a new conference: a place for papers that describe instructive failures or not-yet-successes, as they may prefer to be called."
PLOS e.g. specifically stated multiple times that they'll publish what meets their quality standards regardless of a positive or negative outcome.
The problem is: Even if you publish failed research it won't get cited as much. And people still use citation metrics to evaluate "quality" of science.
Just having journals that publish your "failed" research is a good start, but it's not changing the incentive structure.
The greatest thinkers in history had no pressure outside their own urge to solve/explore. Surely, not everyone is Galileo, but today science needs that kind of freedom.
There is an infinite set of configurations that could fail, and it's not sufficiently useful to know they failed without understanding why they failed. And analysing failures in a useful way is an extremely difficult and fundamental problem. On the other hand, a successful experiment is an extremely rare event and hence interesting by itself.
Getting data is always expensive in any field so it's much easier to analyse data that already exists as you don't need a large grant application.
Furthermore, I think the use of certain ML techniques may be akin to resume-driven development particularly for PhD students given that the career prospects in Data Science in industry (and using the HR buzzwords like AI, ML, Deep Learning etc.) are much better than the thin pickings that remain in academia in many fields.
I did the same thing as a physics student, but with electronics and programming, which furnished me with a marketable resume.
> If we had an additional dataset would we see the same scientific discovery or principle on the same dataset?
The same holds true for traditional science based on traditional statistics. It just seems that traditional datasets are under less scrutiny of reproducibility and are taken more easily at face value.
The first means it is possible to get results that don't generalize (even if they survive cross validation). The second means it is a lot harder to detect use of correlations that cannot possibly be causal.
One article: https://slate.com/technology/2013/05/weird-psychology-social...
Could you not interpret this, in the widest sense, as overfitting interpretations to a 'weird' dataset, with the results that do not generalize, even though the stats (in this case, likely t-test) say everything is fine? In which case overfitting isn't an ML-only problem?
The example you cite has an issue earlier in the chain. Here, we are dealing with a biased sample from some distribution. The results from that won't generalize to the unbiased distribution, no matter how great your model is.
By contrast, traditionally in science you're coming in with a hypothesis ahead of time about what variables predict what target. The goal is to come up with a model that is consistent with your hypothesis (and possibly some existing theory), and which can be applied generally, and which should need no tuning. For example, the very simple model for Beer's Law-- absorbance vs concentration. That is a law that will apply in every other circumstance, but if modern ML methods had been applied, the scientist might have chosen the model with a slightly better score but which includes nonsense variables in addition to concentration.
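Here is a small Python sketch of that Beer's Law scenario, on synthetic data only. The true slope of 2.0 (standing in for the molar absorptivity times path length), the noise level, the "junk" variable, and the function names are all assumptions made for the illustration. It fits the one-variable law and a two-variable model that also includes a column of pure noise, then compares their training error:

```python
import random

def fit_beer(conc, absorb):
    """Least-squares slope k for A = k * c (line through the origin)."""
    return sum(c * a for c, a in zip(conc, absorb)) / sum(c * c for c in conc)

def fit_with_nonsense(conc, junk, absorb):
    """Least-squares (k, b) for A = k * c + b * junk via the 2x2 normal equations."""
    scc = sum(c * c for c in conc)
    sjj = sum(j * j for j in junk)
    scj = sum(c * j for c, j in zip(conc, junk))
    sca = sum(c * a for c, a in zip(conc, absorb))
    sja = sum(j * a for j, a in zip(junk, absorb))
    det = scc * sjj - scj * scj
    return (sca * sjj - sja * scj) / det, (sja * scc - sca * scj) / det

def compare(seed=0, true_slope=2.0, noise_sd=0.02):
    rng = random.Random(seed)
    conc = [0.1 * i for i in range(1, 11)]   # concentrations 0.1 .. 1.0
    junk = [rng.gauss(0, 1) for _ in conc]   # variable unrelated to absorbance
    absorb = [true_slope * c + rng.gauss(0, noise_sd) for c in conc]
    k1 = fit_beer(conc, absorb)
    k2, b2 = fit_with_nonsense(conc, junk, absorb)
    sse_law = sum((a - k1 * c) ** 2 for c, a in zip(conc, absorb))
    sse_junk = sum((a - k2 * c - b2 * j) ** 2 for c, j, a in zip(conc, junk, absorb))
    return k1, sse_law, sse_junk
```

Because the one-variable law is a special case of the two-variable model (set b to zero), the bigger model's training error can never be worse, so a procedure that only looks at the training score will always be tempted by the nonsense variable.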
All that to say, it seems to me the problem stems from scientists' lack of hypotheses at the outset of a project, and/or the understandable desire to get the best bang for their buck out of an experiment by measuring dozens of variables at once and hoping the magic of ML can find a hypothesis for them.
Hope that made sense.
Unfortunately everyone thinks he can use it for finding "new stuff" and so in my field they "predict material properties", etc. using ML fed with data where every review about the physics tells you that the algorithms they use for extracting that data are domain-specific and might yield results that differ by orders of magnitude. But nobody cares; take some SW off the net, which claims to be able to extract what you want, run it, train your ML, publish your results.
The tools are usually packaged for use by people who don't have the math background to understand the underpinnings. In the results of DOEs, I've seen everything that is now generalized as the "replication crisis."
Everything I've learned about ML so far (granted not a huge amount) invokes DOE -- fitting data sets to arbitrary functions whose form is more flexible than a Taylor polynomial but otherwise cut from the same cloth.
I've seen exactly the same problem as with ML, but 30 years ago: It can help you optimize a process that you don't understand, turning it into a better process that you also don't understand. But it can't tell you how something works. DOE seems to be a microcosm of ML, with all of the pitfalls such as overfitting and underfitting.
That is an idealistic view of what science should be; it's not what happens in the real world. HARKing ("Hypothesizing After the Results are Known") was a thing before ML was cool. But ML is amplifying that: it's a more effective tool to perform bad science.
Obviously if this were done sloppily it would be a huge problem and could produce a ton of false positives. But that's not actually what happens. The idea that ML practitioners just fit crazy complicated models to data and blindly believe whatever the model fits seems to be a common stereotype but is completely inaccurate.
We are acutely aware that powerful models can overfit all too easily, and we spend perhaps the majority of our time understanding and fighting this exact phenomenon. Because we tend to work with models for which few closed-form analytic theorems exist, we tend to do this empirically but no less rigorously. In fact, we tend to be more scientific and rely on fewer assumptions than classical statistics.
The dominant paradigm is empirical risk minimization, sometimes called structural risk minimization[SRM], especially when complexity is being penalized. The idea is to acknowledge that models are always fit to one particular sample from the population but that the goal is to generalize to the full population. We can never truly evaluate a model on a whole population, but we can form an empirical estimate for how well our model will do by taking a new sample from the population (not used for fitting/training) and evaluating model performance on this new sample. Computational learning theories such as VC Theory[VC] and Probably Approximately Correct Learning[PAC] provide theorems that give bounds on how tight these empirical estimates are. For example, VC Theory and Hoeffding's Inequality[HI] can give us an upper bound on how large the gap between "true" performance and this empirical estimate is for a binary classifier in terms of the number of observations used to measure performance and the "VC Dimension" (roughly the number of parameters) of the model.
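As a rough illustration of the kind of bound meant here, the Python sketch below uses only the simplest case: Hoeffding's inequality for a single, already-fixed classifier scored on held-out data, plus a union bound when a finite number of candidates are compared on the same data. It deliberately leaves out the VC-dimension term, and the function names are my own, so treat it as a simplification rather than the full theory:

```python
import math

def hoeffding_gap(n_test, delta=0.05):
    """Half-width eps such that, for ONE fixed classifier,
    |held-out error - true error| <= eps with probability >= 1 - delta.
    Follows from P(|gap| > eps) <= 2 * exp(-2 * n * eps**2)."""
    return math.sqrt(math.log(2 / delta) / (2 * n_test))

def hoeffding_gap_selected(n_val, n_models, delta=0.05):
    """Same bound when the best of n_models candidates is picked on the
    same held-out data (union bound over the candidates)."""
    return math.sqrt(math.log(2 * n_models / delta) / (2 * n_val))
```

With 1,000 held-out observations and delta = 0.05, `hoeffding_gap` gives roughly 0.043, i.e. the measured accuracy is within about 4 percentage points of the true accuracy with 95% confidence. Picking the best of several candidates on the same data widens the bound, which is one reason a separate final test set is used.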
A typical SRM workflow would be to divide a data set up into "training," "validation," and "test" sets, fit a set of candidate models to the training set, estimate their performance from the validation set, select the best based on validation set performance[MS], then evaluate the final model performance from the test set. This procedure can be used on arbitrary models to demonstrate the validity of fit models. For example, a model which is just randomly picking 5 genes based on noise in the training set is extremely unlikely to perform better than chance on the final test set.
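The gene example can be sketched in a few lines of Python. This is a toy under stated assumptions: the "genes" are random bits, the labels are random bits, the candidate models are majority votes over the top 1, 3 or 5 genes, and the split sizes and function names are invented for the illustration:

```python
import random

def make_noise_data(n, n_genes, rng):
    """Random binary 'genes' and random binary labels: no signal at all."""
    rows = [[rng.getrandbits(1) for _ in range(n_genes)] for _ in range(n)]
    labels = [rng.getrandbits(1) for _ in range(n)]
    return rows, labels

def top_genes(rows, labels, k):
    """Indices of the k genes that agree most often with the label."""
    n_genes = len(rows[0])
    agree = [sum(r[g] == y for r, y in zip(rows, labels)) for g in range(n_genes)]
    return sorted(range(n_genes), key=agree.__getitem__, reverse=True)[:k]

def accuracy(genes, rows, labels):
    """Accuracy of a majority vote over the selected genes (k is odd)."""
    hits = 0
    for r, y in zip(rows, labels):
        vote = int(2 * sum(r[g] for g in genes) > len(genes))
        hits += vote == y
    return hits / len(labels)

def run_workflow(seed=0, n_genes=500, n_train=60, n_val=60, n_test=1000):
    rng = random.Random(seed)
    rows, labels = make_noise_data(n_train + n_val + n_test, n_genes, rng)
    tr = slice(0, n_train)
    va = slice(n_train, n_train + n_val)
    te = slice(n_train + n_val, None)
    # Fit candidate models on the training set only.
    candidates = {k: top_genes(rows[tr], labels[tr], k) for k in (1, 3, 5)}
    # Select the candidate with the best validation accuracy.
    best_k = max(candidates, key=lambda k: accuracy(candidates[k], rows[va], labels[va]))
    genes = candidates[best_k]
    # Report all three numbers; only the test one is an honest estimate.
    return (accuracy(genes, rows[tr], labels[tr]),
            accuracy(genes, rows[va], labels[va]),
            accuracy(genes, rows[te], labels[te]))
```

The training accuracy looks impressive because the genes were chosen to fit that sample, but on the untouched test set the same model sits at roughly 50%, which is exactly what the split is designed to reveal.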
Not every machine learning practitioner is familiar with VC Theory or PAC, but almost everyone uses the practical tools[CV] and language[BV] that arose from SRM. If you're following Andrew Ng's or Max Kuhn's advice[NG][MK] on "best practices" you are in fact benefiting from VC Theory although you may never have heard of it.
So that's my answer to the question of validity: ML researchers use different techniques, but those techniques have equally good theoretical foundations, make very few assumptions, and are very robust in practice. If researchers aren't using these techniques, or are abusing them, it's not because ML is unsatisfactory or broken, but because of the same perverse incentives we see everywhere in academia.
There's another criticism floating around that ML models are "black boxes", useful only for prediction and totally opaque. This is only true because non-linear things are harder to understand, and to the extent to which it is true, it is equally true of classical models. A linear model with lots of quadratic and interaction terms, or a model on stratified bands, or a hierarchical model, can be just as hard to interpret. A properly regularized ML model only fits a crazy non-linear boundary when the data themselves require it. A classical model fit to the same data will either have to exhibit the same non-linearity or will be badly wrong. A lot of research papers are wrong because someone fit a straight line to curved data!
I also think the "totally opaque black box" meme is overstated. We can often understand even very complex models to some degree with a little effort. A basic technique is to run k-means with high k, say, 100, to select a number of "representative" examples from your training set and look at the model's predictions for each. It's also incredibly instructive just to look at a sample of 100 examples the model got wrong. One way to understand a non-linear response surface is by focusing in on different regions where the behavior is locally linear and trying perturbations[LIME]. There are also ML models which do fit easy to understand models[MARS]. It's also usually possible to visualize the low level features[DFV].
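To make the perturbation idea concrete, here is a minimal Python sketch. Note that it is a plain finite-difference probe, a much simpler stand-in for LIME (which fits a weighted linear surrogate to many random perturbations), and `black_box` is a hypothetical function standing in for a fitted model:

```python
import math

def black_box(x):
    """Hypothetical stand-in for a fitted non-linear model."""
    return x[0] * x[1] + math.sin(x[2])

def local_slopes(model, x, eps=1e-3):
    """Nudge one input at a time and report how the prediction responds.
    Approximates the locally linear behavior of the model around x."""
    base = model(x)
    slopes = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        slopes.append((model(bumped) - base) / eps)
    return slopes
```

Around the point `[2.0, 3.0, 0.0]` the slopes come out close to 3, 2 and 1, so in that region the model behaves like a simple linear function of its inputs, even though globally it is not linear at all.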
ML does the complete opposite. Data first, and then the model is discovered using data. It's pretty easy to see why it would lead to non-reproducible models.
I'm not hammering the ML-keyword above my work (and thus am getting considerably less academic attention), but it's nice to hear from people who made it in academia that they support my theory. 50% of the people are just showoffs throwing buzzwords and positivity around while they produce a load of sh...
So if I can actually write decent code, have solid understanding of software development principles, have studied math and statistics from the ground up to an advanced level, am familiar with relevant research - then do I really want to call myself a data scientist (or ML-something) just because it might improve my job/salary prospects, or do I want to stay away from it because everyone who takes a one-week course on Udemy calls themselves a data scientist without being able to back it up with actual skills?
Better get on this train because 10 years after undergraduate I can say it seems that’s all there is out there. Once you make it then maybe you don’t have to be full of sh but I don’t see people making it on merit unless they’re Albert Einstein.
Also when looking at experiments (did that too in a lab course, where I got data from an existing apparatus): "Interpolating" data with a spline. "No, your result is not good, I get different result". "Maybe you should get more data then". "No, data's good, we need different result, also colormap is bad, use same like Matlab". "Well, the matlab colormap is colorful but not true to reality". "No, I see interesting things in plot with Matlab colormap". Stopped arguing, went back home, used jet instead of viridis and smoothed the spline. Got an A+. What a great day!
And while I might be not the brightest guy around and thus might not be able to just run around spitting out solutions for hard problems I have certain standards on integrity. Basically faking results is something I won't do to create (optional, published) work (which a paper is for undergrad work and also was for a PhD until not too long ago here...).
Besides that, most of the people do a PhD for getting better paid jobs, not academic tenure (where such things could backfire).
I think she knows what she is talking about.
Nearly all statisticians realize the need for more inferential thinking in modern ML. E.g. http://magazine.amstat.org/blog/2016/03/01/jordan16/
We still don't do that well in high dimension, low sample size regimes that make up the majority of life science research.
On the other hand AI is fantasy BS hype boosting off the fact that ML sounds similar to AI to people who aren't aware Machine Learning is stats.
Maybe AI one day but today it is utterly ridiculous. No really. Every single article should mention both of those things at least in passing. Downvote away all you AI hype Surfers but you know it's true.
AI is real and is a legit field of study, of which ML is currently very popular, so some articles conflate these two. But it is far from bullshit. If you want to learn more read Artificial Intelligence: A Modern Approach.
AI in a pitch is exactly as much BS as interstellar travel. Definitely go ahead and study all the failed attempts. Norvig wrote the most popular text nearly a quarter of a century ago. It didn't exist then either. There is no going to Alpha Centauri, there is no HAL 9000. There is no intelligence that is artificial.
Machine Learning exists and is real and is statistical inference done by computer. There is no point where you can look at it and say "Here is the boundary between statistics and ML." Try it. Glorified curve fitting has some fantastic applications and killer demos. Along comes the hype train exactly as you'd expect. Put it on the blockchain or has the hype for that died now?
Shallow BS detection, worth doing if only for one's own sanity.
"In precision medicine, it's important to find groups of patients that have genomically similar profiles so you can develop drug therapies that are targeted to the specific genome for their disease," Allen said. "People have applied machine learning to genomic data from clinical cohorts to find groups, or clusters, of patients with similar genomic profiles.
"But there are cases where discoveries aren't reproducible; the clusters discovered in one study are completely different than the clusters found in another," she said. "Why? Because most machine-learning techniques today always say, 'I found a group.' Sometimes, it would be far more useful if they said, 'I think some of these are really grouped together, but I'm uncertain about these others.'"
Allen will discuss uncertainty and reproducibility of ML techniques for data-driven discoveries at a 10 a.m. press briefing today, and she will discuss case studies and research aimed at addressing uncertainty and reproducibility in the 3:30 p.m. general session, "Machine Learning and Statistics: Applications in Genomics and Computer Vision." Both sessions are at the Marriott Wardman Park Hotel.
& the context of the AAAS session
> ... developing the next generation of machine learning and statistical techniques that can ... also report how uncertain their results are and their likely reproducibility.
So she's actually using machine learning to assess systematic uncertainties, i.e. to get better, more reproducible research. Of course, like all forms of automation, people tend to sensationalize progress as a crisis because it makes it too easy to shoot yourself in the foot.
But doing things "the old fashioned way" isn't any better. Early particle physics experiments would get armies of undergrads to classify photographs of collisions in bubble chambers. These results took thousands of researcher-hours to compile, which might seem all fine and dandy, until you realize that there may have been a systematic bias in your classification. Now what do you do?
Thanks to machine learning, there are a lot of things we can do: we can try to remove the bias and retrain the algorithm, or we can train with extreme examples of bias and use that to quote a systematic uncertainty. We can try a multitude of approaches to estimate uncertainties and rerun our entire analysis in a few hours. Good luck doing that with an army of undergrads.
I postulate that out of 12 billion random events it would be remarkable if a booster didn't extract 100 or so items that looked similar to a Higgs detection.
Well, let's give it 20 years and a new generation of PIs who aren't invested in this and have grad students who are keen to find something different in the data.
But ohh.. all the data has been thrown away... oh!
As for throwing all the data away, the article you link to actually does a good job of explaining how this is done: we look at every collision with thousands of sensors before deciding whether to keep it. At this stage there is absolutely no machine learning anyway (just physics knowledge), so be careful blaming machine learning for any missed discoveries.
That's a ridiculously disingenuous summary of the article, did you read past the Forbes headline?
ML or more generally mathematics do not cause anything. People who misuse mathematics are to blame here. Some fields are simply using tools they don't understand and this predates ML advances by decades. Thinking of stats use in psychology and medicine for instance.
This trend of presenting ML as some kind of magic powder is ridiculous. I blame hyped presentations by influential ML scientists for this.
Ps: I have no experience with anything regarding ML.
However, even if you're able to do that, the problem isn't solved. You have to answer: level of confidence for what? The uncertainty / confidence that you get assumes your model is right. No model can tell you whether it is a true reflection of reality. I had written more about this on my twitter: https://twitter.com/paraschopra/status/1075033048767520768
ML is actually a field with very high standards for replication, in part because empirical results are currently the focus. If certain methods don't generalize to other datasets, then all bets are off: you are dealing with data that violates the IID assumption. No statistics, bean counting, or ML is going to help you get significant results.
Unfortunately, if you're an engineer/physicist/chemist/biologist/social scientist, your background in statistics is neither fresh nor deep. So your professor comes to you: can you do something with ML, it's such a hot topic? (Your boss also has no background in statistics, nor do his peers, who review your stuff.) You say yes, because "no, I don't know about it" won't be good for you. Then you go to some Google- or blockchain-sponsored tutorial where some self-taught CS bachelor is telling you how to use ML with Python and TensorFlow. You might wonder about some things, but in the end you need to get things done, and you feed your data (which is often garbage, but verifying that it's not is not hot) into an algorithm you don't understand. Then you find some other guys doing this, cite them and publish. Zero scientific value generated, but a great step for your academic career nonetheless.
Replication might not turn out to be the big problem. The lack of progress towards a unifying theory might be a more important long term issue.
If you research ML, you can publish in ML journals, there are several. If your research is about applying ML to domain problems, are you then an ML researcher or a domain researcher?
I think there's actually a very simple explanation for this which lots and lots of people hate, so they're sort of in denial about it. Academia is entirely government funded and has little or no accountability to the outside world. Academic incentives are a closed loop in which the same sorts of people who are producing papers are also reviewing them, publishing them, allocating funding, assessing each other's merits etc. It's a giant exercise in marking your own homework.
Just looked at in purely economic terms, academia is a massive planned economy. The central planners (grant bodies) decide that what matters is volume and novelty of results, so that's what they get, even though the resulting stream of papers is useless to the people actually trying to apply science in the real world ... biotech firms here but the same problem crops up in many fields. It's exactly what we'd expect to see given historical precedent and the way the system works.
There's another huge elephant in the room here beyond the replication crisis ("to what extent are the outputs wrong") which is the question of to what extent are the outputs even relevant to begin with? Whenever I sift through academic output I'm constantly amazed at the vast quantity of obviously useless research directions and papers that appear to be written for their cleverness rather than utility. The papers don't have to be wrong to be useless, they can just solve non-problems or make absurd tradeoffs that would never fly in any kind of applied science.
I read a lot of CS papers and I've noticed over time that the best and most impactful papers are almost always the ones coming out of corporate research teams. I think this is because corporate funded research has some kind of ultimate accountability and connection to reality that comes from senior executives asking hard questions about applicability. For instance in the realm of PL research academia pumps out new programming languages all the time, but they rarely get any traction and the ideas they explore are frequently ignored by the industrial developers of mainstream languages because they're completely impractical. This problem is usually handwaved away by asserting that the ideas aren't bad ideas, they're just incredibly futuristic and 30 years from now we'll definitely be using them - but this kind of reasoning is unfalsifiable on any kind of sensible timescale so it's the same as saying, "I shouldn't be held accountable within the span of my own career for how I spend tax and student money".
As time goes by I am getting more and more sympathetic to the idea of just drastically cutting academic funding and balancing the books by drastically reducing corporation tax. The amount of total research would fall significantly because corporations wouldn't invest all the newly available money in research, or even most of it, but it's unclear to me that this would be a bad thing - if 75% of research studies coming out of academic biotech are wrong then it stands to reason that if standards were improved significantly, funding could be reduced by (say) 50% and still get a similar quantity of accurate papers out the other end. It's possible the science crisis is really just reflecting massive oversupply of scientists, massive undersupply of accountability and in general research should be a much smaller social effort than it presently is.
To play the devil's advocate: scientists in industry/corporations do not come out of nowhere - they come from academia. Will the academics not move to countries where academic research is better funded? The students will follow. Corporations will set up their research labs in those countries near the universities to poach the best talent. Suddenly, your country is at a disadvantage.
Machine learning makes it easier to test a great many hypotheses, but even going fully "by hand" it is very easy to deviate from what the statistical framework of hypothesis testing would demand. There are now some discussions about counter-measures, e.g. about preregistration of studies:
You can see this as another chapter in the long debate about the correct way to test scientific hypotheses:
All experiments have a limit it seems
Testing a hypothesis against a pre-existing dataset is a valid thing to do, and it is also almost trivially simple (and completely free) for someone with a reasonable computational background. There are researchers who spend a decent portion of their careers performing these analyses. This is all well and good, since we want people to spend time analyzing the highly complex data that modern science produces, but we run into problems with statistics.
Suppose an analyst can test a hundred hypotheses per month (this is probably a low estimate). Each analysis (simplifying slightly!) ends with a significance test, returning a p-value: the probability of seeing a result at least this strong by chance alone if the hypothesis were false. If p < 0.01, the researcher writes up the analysis and sends it off to a journal for publication, since a false hypothesis would produce a result like this only one time in a hundred. But you see the problem: even if we assume that this researcher tests no valid hypotheses at all over the course of a year, we would expect them to send out one paper per month, and each of these papers would look entirely valid, with no methodological flaws for reviewers to complain about.
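To make the arithmetic concrete, here is a minimal simulation sketch of that scenario. Everything in it is an assumption for illustration: the function names are made up, each "hypothesis" compares two groups of 50 values drawn from the same distribution (so every hypothesis is false by construction), and the test is a simple two-sided z-test with known unit variance. It runs on the Python standard library only.

```python
import random
from statistics import NormalDist, mean


def null_p_value(rng, n=50):
    """Two-sided z-test p-value for two samples drawn from the SAME distribution."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Variance is known to be 1, so the difference in means has std sqrt(2/n).
    z = (mean(a) - mean(b)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positives_per_month(months=120, tests_per_month=100, alpha=0.01, seed=0):
    """Count how many purely-null analyses clear the threshold each month."""
    rng = random.Random(seed)
    counts = []
    for _ in range(months):
        hits = sum(null_p_value(rng) < alpha for _ in range(tests_per_month))
        counts.append(hits)
    return counts


counts = false_positives_per_month()
print("average 'publishable' null results per month:", mean(counts))
```

Over many simulated months the average should hover around `tests_per_month * alpha`, i.e. roughly one spurious "finding" per month, even though no real effect exists anywhere in the data.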
In reality, of course, researchers sometimes test true hypotheses, and the ratio of true to false computational-analysis papers would depend on the ratio of "true hypotheses that analysis successfully catches" to "false hypotheses that squeak by under the p-value threshold" (i.e., the True Positive rate vs the False Positive rate). It's hard to guess what this ratio would be, but if AAAS is calling things a "crisis," it's clearly lower than we would like.
But there's a further problem, since the obvious solution (lower the p-value threshold for publication) would lower both the False Positive rate and the True Positive rate. The p-value that gets assigned to the results of an analysis of a true hypothesis is limited by the statistical power (essentially, size and quality) of the dataset being looked at; lower the p-value threshold too much, and analysts simply won't be able to make a sufficiently convincing case for any given true hypothesis. It's not a given that there is a p-value threshold for which the True Positive/False Positive ratio is much better than it is now.
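A rough sketch of that trade-off, under assumptions that are mine and not from the thread: a two-sided one-sample z-test, a standardized effect size of 0.2 for true hypotheses, 100 observations per dataset, and 1% of tested hypotheses being true. The function names are hypothetical. It prints, for several thresholds, the power (True Positive rate) and the share of threshold-clearing results that come from true hypotheses.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(alpha, effect_size, n):
    """Approximate power of a two-sided one-sample z-test with unit variance."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * n ** 0.5
    # Probability of landing in either rejection tail when the effect is real.
    return (1.0 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


def share_of_true_findings(alpha, effect_size, n, prior_true):
    """Fraction of 'significant' results that come from true hypotheses."""
    true_pos = prior_true * power(alpha, effect_size, n)
    false_pos = (1.0 - prior_true) * alpha
    return true_pos / (true_pos + false_pos)


for alpha in (0.05, 0.01, 0.001, 0.0001):
    pw = power(alpha, effect_size=0.2, n=100)
    share = share_of_true_findings(alpha, 0.2, 100, prior_true=0.01)
    print(f"alpha={alpha:<7} power={pw:.3f} share of true findings={share:.3f}")
```

In this toy setup, tightening the threshold cuts power along with the false positive rate, so fewer true hypotheses ever clear the bar. Whether the resulting mix of papers is actually better depends heavily on the assumed effect size, sample size, and fraction of true hypotheses, none of which are known in practice.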
"More data!" is the other commonly proposed solution, since we can safely lower the p-value threshold if we have the data to back up true hypotheses. But even if we can up the experimental throughput so much that we can produce True Positives at p < 0.0001, that simply means that computational researchers can explore more complicated hypotheses, until they're testing thousands or millions of hypotheses per month- and then we have the same problem. In a race between "bench work" and "human creativity plus computer science," I know which I'd bet on.