AAAS: Machine learning 'causing science crisis' (bbc.co.uk)
136 points by adzicg 30 days ago | 115 comments



Is machine learning really to blame for the reproducibility crisis? I'm not in academia, but it seemed to me that the problem was entirely present without machine learning being involved.

For example, Amgen reported that of the 53 landmark cancer papers they reviewed, 47 could not be replicated [1]. I would have assumed that most of them didn't involve 'machine learning'.

[1] https://www.reuters.com/article/us-science-cancer/in-cancer-...


The problem was there before, but there are reasons why Machine Learning is amplifying bad practices.

In the past people were manually fishing for results in available datasets. Now they have algorithms to do it for them.

In medicine a popular way to use ML is to improve diagnosis. Now there's already a problem in medicine that the benefits of early diagnosis are overrated and the downsides (overtreatment etc.) usually ignored. You get more of that.

And TBH computer scientists aren't exactly at the forefront when it comes to scientific quality standards. (E.g. practically no one is doing preregistration in CS, which in other fields is considered a prime tool to counter bad scientific practices.)


It seems what ML is really doing is exposing weaknesses in our scientific processes. The appropriate response here is to fix the processes, instead of blaming the latest fad and imploring people to "try harder". If the root cause isn't fixed, the next fad after ML will cause the same thing again. What feasible systemic changes can we make so that scientists can't get away with publishing sloppy results?

It's not an interesting question for many scientists, who prefer focusing on technical solutions over political ones.


I think it is a very interesting question for a lot of scientists, but also an extremely hard one to answer. And an even harder one to implement.

Even obvious wins that almost everyone can agree on, like getting publishing out of the hands of for-profit entities that add no value, are taking forever, because cultural, social and political institutions are hard to move.


If we broaden Machine Learning to apply to data fitting tools, I can see how it would apply broadly and with no ill intent to produce errors.

Consider the simple task of peak fitting to determine the result for some data. You're probably using a commercial tool to identify the peak position, calculate the baseline, and come up with parameters for your model.

But if there's an error, and at least when I was in grad school the tools often would get stuck in weird local minima that take experience to recognize, it could easily just never be noticed. If your baseline is way off, good luck calculating your peak areas reproducibly...

Data analysis is hard, and it's easy to trust algorithms to be at least more reproducible than doing it more manually. On the plus side, if you provide your dataset and code, others can at least redo the analysis! Really excited to see more Jupyter notebooks used for publications in the future.


Medicine may be better than ML but there’s not much in the difference.

> COMPare: Qualitative analysis of researchers’ responses to critical correspondence on a cohort of 58 misreported trials

> Background

> Discrepancies between pre-specified and reported outcomes are an important and prevalent source of bias in clinical trials. COMPare (Centre for Evidence-Based Medicine Outcome Monitoring Project) monitored all trials in five leading journals for correct outcome reporting, submitted correction letters on all misreported trials in real time, and then monitored responses from editors and trialists. From the trialists’ responses, we aimed to answer two related questions. First, what can trialists’ responses to corrections on their own misreported trials tell us about trialists’ knowledge of correct outcome reporting? Second, what can a cohort of responses to a standardised correction letter tell us about how researchers respond to systematic critical post-publication peer review?

> Results

> Trialists frequently expressed views that contradicted the CONSORT (Consolidated Standards of Reporting Trials) guidelines or made inaccurate statements about correct outcome reporting. Common themes were: stating that pre-specification after trial commencement is acceptable; incorrect statements about registries; incorrect statements around the handling of multiple time points; and failure to recognise the need to report changes to pre-specified outcomes in the trial report. We identified additional themes in the approaches taken by researchers when responding to critical correspondence, including the following: ad hominem criticism; arguing that trialists should be trusted, rather than follow guidelines for trial reporting; appealing to the existence of a novel category of outcomes whose results need not necessarily be reported; incorrect statements by researchers about their own paper; and statements undermining transparency infrastructure, such as trial registers.

https://trialsjournal.biomedcentral.com/articles/10.1186/s13...


I am just in the process of digging into this paper and covering it in an article, so I'm quite familiar with it.

But as bad as this is: What the COMPare project is doing here is documenting the flaws of a process to counter bad scientific practice. The reality in most fields (including pretty much all of CS and ML) is that no such process exists at all, because no one even tries to fix these issues.

So you have medicine where people try to fix these issues (and are - admittedly - not very good at it) versus other fields that don't even try.


This is definitely true. A common "blueprint" for articles in applied CS is that first they propose a "novel" algorithm. This algorithm may be very similar to an existing algorithm and in many cases is identical to one. Then they benchmark the algorithm and show that it performs better on some metrics than existing solutions. These benchmarks are often quite poor, and if you vary them a little, the purported performance increases vanish.


Are you planning on touching on autoML? Do you think that could help?


It's quite a stretch to say that CS does not have reproducibility! You probably want to define that a bit.


Can you give a citation about ML being used this way and producing overtreatment? I'm aware of experimental results with e.g. Watson that were first overstated and then rejected, but that seemed like a situation where an institution experimented with a poor use of ML and successfully rejected it, not where they falsely accepted an ML result.


No but it’s making it worse because it’s giving false confidence in results and amplifying failures in experiment design. It’s also been held out as a fix for the reproducibility crisis but just as lots of statistical analysis has been done by people who don’t really understand the math but are just cargo culting other experiments, machine learning is taking that ignorance-of-your-toolset risk to the next level.

Not even the people who write the software understand what patterns are being found. All they can do is point to the results that seem good at a glance. But while we can train software to beat humans within a constrained dataset with careful checking, these techniques cannot find new insights. The software does not understand what the data it’s processing represents and thus it doesn’t recognize the abstractions and limits of the data. The patterns it finds are in the low resolution data, not in reality that data is a poor copy of. But science needs to be analyzing the real world and to do that you must comprehend the errors in your data and what they mean. We are nowhere close to making software that can do that.


> Not even the people who write the software understand what patterns are being found. All they can do is point to the results that seem good at a glance.

This isn’t really true. For example, we can pass in an image to a convolutional net and see which filters are activated; this can give us a clear indication if it’s edge detectors that are activating or textures or specific shapes (e.g. a dog would activate edge detectors, textures that look like fur, and shapes that resemble a dog’s face). We can also train models to disentangle their representations and make specific variables stand for specific things (e.g. for a net trained on handwriting, values in one variable can represent the slant of writing, another one the letter, another the thickness, etc.). There is also a ton of work being done in training causal models. We also have decent ways now of visualizing high dimensional loss surfaces.

The field has come a long way since 2012, and the whole “it’s magic, we don’t understand why it works or what it learns” is no longer true.


Some of the points are true, specifically the last part, but you are wrong that the people who write this software don't understand what patterns are being found. In ML and deep models we can clearly see why the hypothesis made a decision, using libraries and tools such as eli5, TensorBoard and others. Deep learning models are in general harder to debug, but it is still possible.

Therefore we know why a hypothesis produces wrong results, but sometimes it is not possible to mend the model due to outliers, rare events, lack of data and/or the randomness that surrounds our world. Just as you point out that statistical analysis is done by people who don’t really understand the math, these false results can be due to scientists using ML without understanding its advantages and limitations.


The problem with ML is the same one that brought about the reproducibility crisis (RC) — believing that simply exceeding one predefined threshold for some probabilistic metric (like correlation or p value) is 'good enough'.

Of course the problem is compounded if we also fail to propose a causal mechanism and don't try to validate it, something I see data scientists doing all too often, since we seldom employ anything like Design of Experiments practices and the data we're working with is very rarely created by us.

IMHO, the RC is a reminder to scientists that to confirm a hypothesis you need to pass more than one test, and a reminder to data scientists that we must test using more than one model.


Machine learning trivializes p-hacking. Take a database of random datums. Pick e.g. 3 input datums at random and map them against one manually chosen output datum. Run the machine learning system and observe the error rate. If it decreases below some value 'p' you now have a [most likely completely spurious] correlation. Spin up an explanation for it - the more sensationalized the better. Claim that the process was done in the reverse order, claim it's science, publish -- you now have a ground breaking hypothesis that was validated by experimentation.
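
To make the mechanism above concrete, here is a minimal Python sketch. It is a simplified stand-in for the procedure described, not a real ML pipeline: it searches pure-noise columns for the one best correlated with a pure-noise target, then re-measures that "winner" on fresh rows. The function names, sizes, and seed are illustrative assumptions.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spurious_search(n_rows=30, n_features=2000, n_fresh=3000, seed=0):
    """Return (best in-sample r, r of the same noise process on fresh rows)."""
    rng = random.Random(seed)
    # Target and every candidate feature are independent Gaussian noise.
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best_r = 0.0
    for _ in range(n_features):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson(column, target)
        if abs(r) > abs(best_r):
            best_r = r
    # "Replication": the winning column is still noise, so new rows of it and
    # of the target are simply new independent draws.
    fresh_target = [rng.gauss(0, 1) for _ in range(n_fresh)]
    fresh_feature = [rng.gauss(0, 1) for _ in range(n_fresh)]
    fresh_r = pearson(fresh_feature, fresh_target)
    return best_r, fresh_r


if __name__ == "__main__":
    best_r, fresh_r = spurious_search()
    print(f"best in-sample r = {best_r:.2f}, r on fresh rows = {fresh_r:.2f}")
```

With enough candidate columns and few rows, the best in-sample correlation tends to look impressive while the fresh-row correlation sits near zero, which is the whole trick.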

This is one of the big reasons that the more modeling, variables, and filtering there are in a study, the more you should discount it. It's too easy to prove something when there's nothing actually there.

An even bigger risk here is that you can engage in the above process and spot check against other data sets to see if it can be validated elsewhere. And you can find correlations that are predictive, yet are in no way whatsoever causal. If we took a sample with enough data on all individuals in the US you'd be able to find some correlation that people who have an E as the second letter in their name, a last name of five characters in length, and went to a high school whose third letter is 'A' have a 23% higher earned income average than those outside the group. And it predicts going forward.

You'd be mapping onto something that obviously has nothing to do with these variables in and of themselves. Perhaps the real issue would be it's simply a very obscure proxy to a certain group of individuals in a certain subset of educational institutions. But the problem is that this is only obviously spurious (even if predictive) because these sort of variables clearly cannot have any sort of a causal relationship. When instead you only look at a selection of variables that, in practically any combination, could be made to seem meaningful through some explanation or another - you open the door to completely 'fake' science that provides results and even predictivity, but has absolutely nothing to do with what's being claimed. So people might try to maximize towards the correlations (which are/were predictive) only to find nothing more happens than if people started actively making sure the second letter of their children's name was an E and legally changed their last name to 5 letter ones.

---

As a pop culture example, something similar happened with video game reviews. Video game publishers noticed that there was a rather strong correlation between positive game reviews and high sales. So they started working to raise average game scores through any means possible, eventually including 'incentivizing' game reviewers to provide higher scores. As a result game reviews began to mean next to nothing, and the strength of the correlation rapidly faded. Because obviously the correlation was never about high review scores, but about making the sort of games that organically received high review scores. Though in this case we already see "obviousness" fading, since there was some argument to be made that the high review scores were what was driving sales in and of themselves - though that was clearly not the case.


No one uses or needs ML for overfitting 4 variables. You can do that with regular statistics just fine. And how you interpret ML results is just as fraught with error as any statistical argument—just because the technique gives you some result doesn’t mean it’s explanatory, that is science 101.


> No one uses or needs ML for overfitting 4 variables. You can do that with regular statistics just fine.

True, but an ML routine can try an approximately infinitely greater number of models. If I'm using x,y and z to predict w, and I've tried all the linear terms, all the squared terms, all the interactions, all the log terms, and I start throwing in other things, my readers will, rightly, raise an eyebrow. Maybe there's some discontinuity to exploit, but if so, I'll explain it -- a policy change for people age 65 or older, say, or a market that exists in one state and not an adjacent one.

The ML, by contrast, can invent the most absurdly jagged multivariate functions imaginable, and we typically* don't even know that it's doing so, let alone why.

*as others have written here, we can actually investigate the how (not the why) by inspecting the algorithm -- but the number of papers that do is much smaller than the number that don't.


The mandatory xkcd significance link https://xkcd.com/882/


Fails to touch on the perverse incentives in academia, "publish or perish" etc. Torturing a dataset to find a p value that a journal will like (or equivalent stat measure) is better for your career than not publishing a paper that will be discredited in time. You have no incentive at all to decide "my results are unconvincing at this point, I'm not going to submit them" and every reason to write them up as a useful contribution to human understanding even if you kind of know, deep down, it really isn't. Especially if you're not senior...


Some days ago there was a post here on HN about a post doc who failed to get tenure.

And every other comment was like: “What did he expect, he had much less than the usual two papers a year.”

So it seems that even here on HN, the mindset of quantity over quality still persists.


No...that's not what it seems. I remember reading all the comments in that HN thread on Friday. The top comments were dispassionately explaining the reality of academic career trajectories. That's a basic reporting of facts, not a normative claim that quantity is better than quality. They did not themselves say you need to publish more often to be a good professor; rather they discussed the incentives that lead this to be the practical reality of the field.


It seemed obvious to me that those comments weren’t extolling any merit of the “two papers per year” heuristic, but instead were just cynically resigned to the fact that that’s just the reality of it (good or bad) and so someone pursuing tenure has to be planning on dealing with being evaluated that way (even if it’s a bad practice generally).


The problem really is: there is no alternative (that I know of). Having a number of papers in well-known journals is everything we have to gauge the quality of people who are looking for a lifetime position. It’s sad but true.


No, the number of papers is an okay metric; the problem is that journals don't like to publish honest negative results.


There are a few journals that will promise to publish your results based entirely on your pre-registered plan -- e.g. whether or not you find the correlation you were looking for.

In the long term, it seems like journals that publish lots of falsified papers should be punished, and journals that don't (e.g. because of a judge-upon-pre-registration policy) should crowd them out.


Pre-registration is great, and I think it even helps researchers/scientists/post-docs to submit something (it doesn't have to be perfect, after all, it's just an experiment design, and they don't have to worry about massaging the data to have promising results to report) and then stick to the topic, and then carry out the experiment as well as they can, and gather as much high quality & fidelity data as they can, and do the analysis according to the plan, and then report it. No extra pressure to think about how to frame what you "found".

Though this will inevitably lead to the problem that grant boards face, that it'll be a lot harder to differentiate between proposed experiments. And it'll be even harder to do boring stuff. (So if we assume all submitted plans are sound, they have to publish them all. Though then we'll have journals based on how strict they are with experiment design requirements, 1 sigma, 2 sigma, 5 sigma, etc.)


That author’s problem was that they had been a postdoc for 23 years. That duration alone raises fatal red flags.


It's possible you're right, but that's not what people were saying IIRC.


That's exactly what they were saying. Go read the thread again. They were being prescriptive of academic careers, not descriptive of the author's inherent merit as a researcher.


And to recurse a layer, I’m being descriptive of the prescriptivists that I know to exist and that I easily believe did that guy in.


Those were not normative statements.


That's possible of course.


We need to find resources to fund "Failures in Science" journals that exclusively seek to publish interesting research that went nowhere. Personally I'd find these far more interesting to study. "Here's some background. Here's a pretty logical, plausible hypothesis we came up with and how, here's our experiment, here's our results, here's our thoughts as to why we were wildly wrong."


There is a new conference in cryptology that promotes exactly this: CFAIL.

From their 2019 website[0]: "Do you have insightful and exciting work sitting in a drawer somewhere because it never quite panned out? Are you willing to share your failed approaches so that others can learn from them without having to re-travel the same road? Are you tired of reading papers that pretend the incremental result they happened to achieve was well-motivated and was their goal all along?

We are! That’s why we are founding a new conference: a place for papers that describe instructive failures or not-yet-successes, as they may prefer to be called."

[0] https://www.cfail2019.com/


We have that. People don't use it.

PLOS e.g. specifically stated multiple times that they'll publish what meets their quality standards regardless of a positive or negative outcome.

The problem is: Even if you publish failed research it won't get cited as much. And people still use citation metrics to evaluate "quality" of science.

Just having journals that publish your "failed" research is a good start, but it's not changing the incentive structure.


Agreed. This isn't a one-action problem to solve. But we need to lay the foundation for alternative incentives to be possible.


Instead of allowing failures (which would also be good, btw), I would increase the time available to publish something worthwhile.

The greatest thinkers in history had no pressure outside their own urge to solve/explore. Surely not everyone is Galileo, but today science needs that kind of freedom.


The FADS workshop[0] at VLDB was a nice attempt at this although I think the work ended up focusing on known historical failures as opposed to failed work that was not otherwise published.

[0] https://fads.ws/


The problem with a failed experiment, at least in machine learning, is that it's not always clear what caused the failure. Was it not enough training, was there some small "trick" the model could have used, were the hyper parameters off etc.

There is an infinite set of configurations that could fail, and it's not sufficiently useful to know they failed without understanding why they failed. And analysing failures in a useful way is an extremely difficult and fundamental problem. On the other hand, a successful experiment is an extremely rare event and hence interesting by itself.


This is the unfortunate truth. Furthermore, those who take more time to find a general, robust and theoretically sound result (beyond just the p-optimized publishable result) get filtered out of the tenure positions as not having a high impact factor, despite their papers in fact having a higher impact.


You may have meant "bullshitable result".


They aren’t mutually exclusive


This is the real issue.

Getting data is always expensive in any field so it's much easier to analyse data that already exists as you don't need a large grant application.

Furthermore, I think the use of certain ML techniques may be akin to resume-driven development particularly for PhD students given that the career prospects in Data Science in industry (and using the HR buzzwords like AI, ML, Deep Learning etc.) are much better than the thin pickings that remain in academia in many fields.


I don't blame the students too much. If you get into a PhD program, it's very hard to change fields or even specialties without starting over. A simpler tactic is to weave a field of interest into your existing research project, then market yourself in that field when you graduate.

I did the same thing as a physics student, but with electronics and programming, which furnished me with a marketable resume.


Publish or perish is gone. The NIH is no longer the funding source of choice, now it is pharma or private donors. These people want large impressive datasets, not publications. You can rent out data to pharma without having to publish anything. These don't even have to be useful datasets, mere size is enough to make them valuable to greedy prospectors with lots of cash. Then they can plumb them for bunk results ad infinitum.


Seems like we need general solutions to overfitting, not just in machine learning.


Didn’t we used to value negative outcomes?


ML is not causing a reproducibility crisis, it just exposes one that is already there.

> If we had an additional dataset would we see the same scientific discovery or principle on the same dataset?

The same holds true for traditional science based on traditional statistics. It just seems that traditional datasets are under less scrutiny of reproducibility and are taken more easily at face value.


A specific issue with machine learning is overfitting and non-interpretability.

The first means it is possible to get results that don't generalize (even if they survive cross validation). The second means it is a lot harder to detect use of correlations that cannot possibly be causal.


One of the issues with problems of reproducibility in the social sciences/psychology is that early studies usually choose WEIRD (white/educated from a rich/industrialised/democratic country) subjects, which are often uni students who are very different from the rest of the world.

One article: https://slate.com/technology/2013/05/weird-psychology-social...

Could you not interpret this, in the widest sense, as overfitting interpretations to a 'weird' dataset, with the results that do not generalize, even though the stats (in this case, likely t-test) say everything is fine? In which case overfitting isn't a ML-only problem?


Overfitting is about reading too much into a properly sampled data set from any distribution. It comes about when your model has so many parameters it can be made to fit anything. Regularization keeps it in check, but it remains an issue.
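
As a rough illustration of "so many parameters it can be made to fit anything", the sketch below fits the same noisy samples of a straight line two ways: a two-parameter least-squares line, and the unique polynomial with one coefficient per data point (evaluated via Lagrange interpolation). It does not demonstrate regularization itself. The true function `2x`, the noise level, the point count, and all names are assumptions made for the example.

```python
import random


def lagrange_predict(xs, ys, x):
    """Evaluate the degree len(xs)-1 polynomial through every (xs, ys) point at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def compare_fits(n_points=14, noise_sd=0.1, seed=1):
    rng = random.Random(seed)
    xs = [i / (n_points - 1) for i in range(n_points)]        # training inputs
    ys = [2.0 * x + rng.gauss(0, noise_sd) for x in xs]       # noisy line
    test_xs = [(a + b) / 2 for a, b in zip(xs, xs[1:])]       # midpoints
    test_ys = [2.0 * x for x in test_xs]                      # noise-free truth
    slope, intercept = fit_line(xs, ys)
    return {
        "poly_train_mse": mse([lagrange_predict(xs, ys, x) for x in xs], ys),
        "poly_test_mse": mse([lagrange_predict(xs, ys, x) for x in test_xs], test_ys),
        "line_train_mse": mse([slope * x + intercept for x in xs], ys),
        "line_test_mse": mse([slope * x + intercept for x in test_xs], test_ys),
    }


if __name__ == "__main__":
    for name, value in compare_fits().items():
        print(f"{name}: {value:.4g}")
```

The many-parameter model reproduces the training points exactly, noise included, and pays for it between the points; the two-parameter line has a worse training score and a better held-out one.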

The example you cite has an issue earlier in the chain. Here, we are dealing with a biased sample from some distribution. The results from that won't generalize to the unbiased distribution, no matter how great your model is.
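
A toy simulation of that distinction, under made-up numbers: a hypothetical population in which 5% are students whose measured trait averages one unit higher than everyone else's. Estimating the population mean from students only goes wrong before any model is involved, while a random sample of the same size does fine. Every quantity and name here is an assumption for illustration.

```python
import random


def sampling_bias_demo(n_population=100_000, n_sample=2_000, seed=2):
    rng = random.Random(seed)
    population = []
    for _ in range(n_population):
        is_student = rng.random() < 0.05          # assumed 5% student share
        group_mean = 1.0 if is_student else 0.0   # assumed group difference
        population.append((is_student, rng.gauss(group_mean, 1.0)))

    population_mean = sum(value for _, value in population) / n_population

    # Convenience sample: only students are ever measured.
    students = [value for is_student, value in population if is_student]
    student_sample = students[:n_sample]

    # Properly drawn sample of the same size from the whole population.
    random_sample = [value for _, value in rng.sample(population, n_sample)]

    return {
        "population_mean": population_mean,
        "student_only_mean": sum(student_sample) / len(student_sample),
        "random_sample_mean": sum(random_sample) / len(random_sample),
    }


if __name__ == "__main__":
    for name, value in sampling_bias_demo().items():
        print(f"{name}: {value:.3f}")
```

No amount of regularization or model selection applied to the student-only sample can recover the population value, because the information is missing from the data rather than misread by the model.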


Curious (possibly naive) question: isn't there a fundamental difference between the goals behind creating models with ML vs the "old-fashioned" way? That is, in modern ML applications, you're creating a model with dozens/hundreds of potential variables, without a hypothesis of how they relate or contribute to the target (other than that they might, hence your including them in the modeling process). You're using the model for predictions more than for explainability (though there is work ongoing into improving explainability, but it seems kind of post hoc to me). And there's an expectation that you will retrain, or at least tune, the model as its predictive accuracy decays over time.

By contrast, traditionally in science you're coming in with a hypothesis ahead of time about what variables predict what target. The goal is to come up with a model that is consistent with your hypothesis (and possibly some existing theory), and which can be applied generally, and which should need no tuning. For example, the very simple model for Beer's Law-- absorbance vs concentration. That is a law that will apply in every other circumstance, but if modern ML methods had been applied, the scientist might have chosen the model with a slightly better score but which includes nonsense variables in addition to concentration.

All that to say, it seems to me the problem stems from scientists' lack of hypotheses at the outset of a project, and/or the understandable desire to get the best bang for their buck out of an experiment by measuring dozens of variables at once and hoping the magic of ML can find a hypothesis for them.

Hope that made sense.


I think you got the point. A lot of people don't seem to realize that ML might be great for finding patterns but will never yield scientific knowledge in the cause-reaction sense.

Unfortunately everyone thinks he can use it for finding "new stuff" and so in my field they "predict material properties", etc. using ML fed with data where every review about the physics tells you that the algorithms they use for extracting that data are domain-specific and might yield results that differ by orders of magnitude. But nobody cares; take some SW off the net, which claims to be able to extract what you want, run it, train your ML, publish your results.


What method would you use to yield scientific knowledge in the sense of cause-reaction? Many important processes really do have large numbers of causal factors that interact non-linearly. If we want to try to learn about that, some statistical method that deals with many parameters will be needed. Such models are generally referred to as "Machine Learning". Their generalisation or causal inference properties are particular to each implementation and identification strategy, but you can't just say "ML will never yield scientific knowledge".


A tool called Mathematics, which can exactly describe these interactions. And if those processes have a lot of variables, an ML model might certainly be useful, but it will never be generally applicable! This probably also contributes to "scientific knowledge" but it's not the same as scientific facts (or whatever you call universally transferable results).


You're building your mathematical model based on the knowledge you have, which is from the data you have, and there is still the same risk that your theory won't generalize to new observations.


For decades, folks in the area of industrial quality control have used methods called "design of experiments" (DOE), that could be loosely described as fitting minimalistic experimental data sets to arbitrary functions (typically low order multivariate Taylor polynomials).

The tools are usually packaged for use by people who don't have the math background to understand the underpinnings. In the results of DOE's, I've seen everything that is now generalized as the "replication crisis."

Everything I've learned about ML so far (granted not a huge amount) invokes DOE -- fitting data sets to arbitrary functions whose form is more flexible than a Taylor polynomial but otherwise cut from the same cloth.

I've seen exactly the same problem as with ML, but 30 years ago: It can help you optimize a process that you don't understand, turning it into a better process that you also don't understand. But it can't tell you how something works. DOE seems to be a microcosm of ML, with all of the pitfalls such as overfitting and underfitting.


> By contrast, traditionally in science you're coming in with a hypothesis ahead of time about what variables predict what target.

That is an idealistic view of what science should be; it's not what happens in the real world. HARKing ("Hypothesizing after the results are known") was a thing before ML was cool. But ML is amplifying that; it's a more effective tool to perform bad science.


I would guess both you and the author of the article have in mind something like gene expression[GE] in bioinformatics. Thousands of markers, but only hundreds of examples, and the researchers are using some automated feature selection approach[LB] to pick genes that predict some disease.

[GE]: https://en.wikipedia.org/wiki/Machine_learning_in_bioinforma...

[LB]: https://www.quora.com/How-is-Lasso-method-used-in-bioinforma...

Obviously if this were done sloppily it would be a huge problem and could produce a ton of false positives. But that's not actually what happens. The idea that ML practitioners just fit crazy complicated models to data and blindly believe whatever the model fits seems to be a common stereotype but is completely inaccurate. We are acutely aware that powerful models can overfit all too easily and spend perhaps the majority of our time understanding and fighting this exact phenomenon. Because we tend to work with models for which few closed-form analytic theorems exist, we tend to do this empirically but no less rigorously. In fact, we tend to be more scientific and rely on fewer assumptions than classical statistics.

The dominant paradigm is empirical risk minimization, sometimes called structural risk minimization[SRM], especially when complexity is being penalized. The idea is to acknowledge that models are always fit to one particular sample from the population but that the goal is to generalize to the full population. We can never truly evaluate a model on a whole population, but we can form an empirical estimate for how well our model will do by taking a new sample from the population (not used for fitting/training) and evaluating model performance on this new sample. Computational learning theories such as VC Theory[VC] and Probably Approximately Correct Learning[PAC] provide theorems that give bounds on how tight these empirical estimates are. For example, VC Theory and Hoeffding's Inequality[HI] can give us an upper bound on how large the gap between "true" performance and this empirical estimate is for a binary classifier in terms of the number of observations used to measure performance and the "VC Dimension" (roughly the number of parameters) of the model.
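
For the simplest case only, one already-fixed binary classifier scored on n fresh i.i.d. examples, Hoeffding's inequality gives P(|empirical error - true error| >= eps) <= 2·exp(-2·n·eps²). The sketch below just rearranges that formula; it deliberately leaves out the VC-dimension term needed when the model was chosen from a whole class using the same data. The function names and the default `delta` are assumptions.

```python
import math


def hoeffding_epsilon(n, delta=0.05):
    """Half-width eps such that, with probability >= 1 - delta, the held-out
    error of one fixed classifier is within eps of its true error."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def samples_needed(epsilon, delta=0.05):
    """Smallest held-out set size for which the bound guarantees gap <= epsilon."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(f"n = {n:>6}: gap <= {hoeffding_epsilon(n):.4f} (95% confidence)")
    print("held-out examples for a 1-point gap:", samples_needed(0.01))
```

The bound shrinks only with the square root of n, so halving the uncertainty costs four times as many held-out examples.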

A typical SRM workflow would be to divide a data set up into "training," "validation," and "test" sets, fit a set of candidate models to the training set, estimate their performance from the validation set, select the best based on validation set performance[MS], then evaluate the final model performance from the test set. This procedure can be used on arbitrary models to demonstrate the validity of fit models. For example, a model which is just randomly picking 5 genes based on noise in the training set is extremely unlikely to perform better than chance on the final test set.
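A minimal sketch of that workflow on pure noise, to show the last claim in action. Everything here is made up for illustration: the "markers" are random +/-1 values, the labels are random, and the "model" is a majority vote over the k markers that happened to correlate best with the labels in the training set. It uses only the Python standard library.

```python
import random

def make_noise_data(n, p, rng):
    """Random +/-1 'markers' and random +/-1 labels: there is nothing to learn."""
    X = [[rng.choice((-1, 1)) for _ in range(p)] for _ in range(n)]
    y = [rng.choice((-1, 1)) for _ in range(n)]
    return X, y

def fit_top_k(X, y, k):
    """Pick the k markers most correlated with y on this sample; keep their signs."""
    p = len(X[0])
    scores = [sum(row[j] * label for row, label in zip(X, y)) for j in range(p)]
    top = sorted(range(p), key=lambda j: abs(scores[j]), reverse=True)[:k]
    return [(j, 1 if scores[j] >= 0 else -1) for j in top]

def accuracy(model, X, y):
    """Majority vote of the selected, sign-adjusted markers (k is odd, so no ties)."""
    correct = 0
    for row, label in zip(X, y):
        vote = sum(sign * row[j] for j, sign in model)
        correct += (1 if vote > 0 else -1) == label
    return correct / len(y)

rng = random.Random(0)
X, y = make_noise_data(1300, 1000, rng)
X_train, y_train = X[:100], y[:100]        # used for fitting
X_val, y_val = X[100:300], y[100:300]      # used for choosing k
X_test, y_test = X[300:], y[300:]          # touched once, at the very end

# Fit each candidate on the training set, select on the validation set.
candidates = {k: fit_top_k(X_train, y_train, k) for k in (1, 5, 25)}
best_k = max(candidates, key=lambda k: accuracy(candidates[k], X_val, y_val))
best = candidates[best_k]

train_acc = accuracy(best, X_train, y_train)
test_acc = accuracy(best, X_test, y_test)
print(f"k={best_k} train={train_acc:.2f} test={test_acc:.2f}")
```

The training accuracy looks impressive because the markers were picked for fitting that exact sample; the test accuracy should hover around 0.5, which is the signal that the "discovery" is noise.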

[SRM]: http://www.svms.org/srm/

[VC]: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_th...

[PAC]: https://en.wikipedia.org/wiki/Probably_approximately_correct...

[HI]: https://people.cs.umass.edu/~domke/courses/sml2010/10theory....

[MS]: https://en.wikipedia.org/wiki/Model_selection

Not every machine learning practitioner is familiar with VC Theory or PAC, but almost everyone uses the practical tools[CV] and language[BV] that arose from SRM. If you're following Andrew Ng's or Max Kuhn's advice[NG][MK] on "best practices" you are in fact benefiting from VC Theory although you may never have heard of it.

[CV]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)

[BV]: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

[NG]: https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599

[MK]: http://appliedpredictivemodeling.com/

So that's my answer to the question of validity: ML researchers use different techniques, but those techniques have equally good theoretical foundations, make very few assumptions, and are very robust in practice. If researchers aren't using these techniques, or are abusing them, it's not because ML is unsatisfactory or broken, but because of the same perverse incentives we see everywhere in academia.

There's another criticism floating around that ML models are "black boxes", useful only for prediction and totally opaque. This is only true because non-linear things are harder to understand, and to the extent to which it is true, it is equally true of classical models. A linear model with lots of quadratic and interaction terms, or a model on stratified bands, or a hierarchical model, can be just as hard to interpret. A properly regularized ML model only fits a crazy non-linear boundary when the data themselves require it. A classical model fit to the same data will either have to exhibit the same non-linearity or will be badly wrong. A lot of research papers are wrong because someone fit a straight line to curved data!

I also think the "total opaque black box" meme is overstated. We can often understand even very complex models to some degree with a little effort. A basic technique is to run k-means with high k, say, 100, to select a number of "representative" examples from your training set and look at the model's predictions for each. It's also incredibly instructive just to look at a sample of 100 examples the model got wrong. One way to understand a non-linear response surface is by focusing in on different regions where the behavior is locally linear and trying perturbations[LIME]. There are also ML models which do fit easy to understand models[MARS]. It's also usually possible to visualize the low level features[DFV].
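A rough sketch of the "try perturbations in a locally linear region" idea. To be clear, this is plain finite differencing and not the LIME algorithm itself; `black_box` is a hypothetical stand-in for a trained model's prediction function, and the feature names are invented.

```python
def black_box(x):
    """Hypothetical stand-in for a trained non-linear model's prediction function."""
    age, dose = x
    return 0.02 * age + (dose - 3.0) ** 2

def local_slopes(predict, x, eps=1e-4):
    """Central finite differences: change in prediction per unit change in each input near x."""
    slopes = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        slopes.append((predict(up) - predict(down)) / (2 * eps))
    return slopes

# Same model, two different regions of input space.
low_dose = local_slopes(black_box, [50.0, 1.0])
high_dose = local_slopes(black_box, [50.0, 5.0])
print(low_dose, high_dose)
```

Near a dose of 1 the model says "more dose lowers the prediction"; near a dose of 5 it says the opposite. Neither local story is the whole model, but each one is easy to read, which is the point of looking region by region.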

[LIME]: https://www.oreilly.com/learning/introduction-to-local-inter...

[MARS]: https://en.wikipedia.org/wiki/Multivariate_adaptive_regressi...

[DFV]: https://distill.pub/2017/feature-visualization/


Science works because it posits models first, and then data is sought to confirm or disconfirm it. The benefit of having a model first is that it is much more likely to be general (and hence reproducible).

ML does the complete opposite. Data first, and then the model is discovered using data. It's pretty easy to see why it would lead to non-reproducible models.


From this perspective, where is the line between ML and automated p-hacking?


An undergrad to his supervisor in our office talking about publishing a paper: I've fixed the data, now the plots look ok. I (undergrad too) am sitting there thinking - well, you are using ML as a regression blackbox to plot a line, I can do that too w/o ML if I'm fixing the data. Supervisor: ok, that's really great. Me cringing...

I'm not hammering the ML-keyword above my work (and thus am getting considerably less academic attention), but it's nice to hear from people who made it in academia that they support my theory. 50% of the people are just showoffs throwing buzzwords and positivity around while they produce a load of sh...


Yup. As someone who is in a junior position in the field - I'm torn between, on one hand, riding the wave so to speak and taking advantage of all the buzzwords that I can put on my resume (which is fine with me because I can back them up), and on the other hand avoiding association with a certain type of person/career path that might turn out to be just hot air in a couple of years.

So if I can actually write decent code, have solid understanding of software development principles, have studied math and statistics from the ground up to an advanced level, am familiar with relevant research - then do I really want to call myself a data scientist (or ML-something) just because it might improve my job/salary prospects, or do I want to stay away from it because everyone who takes a one-week course on Udemy calls themselves a data scientist without being able to back it up with actual skills?


Having been subject to a lot of buzzword-blarers over the last term I'd say that you just tell everybody that you are working on the foundations of what all the other people are doing if they ask what makes you special. If you can back it up, why not ride the wave. I would refrain from putting the sticker on everything though. I also made a turnaround and now spin everything I do which takes time (basically automating my research and data-vis) as building blocks for (gradschool) ML-based research.


That supervisor is someone who "made it in academia" so it might be good to not sneer and cringe at them and your peers.


I know. But basically at this point I'd say that a large portion of the academic output is crap even in "hard" science.

Also when looking at experiments (did that too in a lab course, where I got data from an existing apparatus): "Interpolating" data with a spline. "No, your result is not good, I get different result". "Maybe you should get more data then". "No, data's good, we need different result, also colormap is bad, use same like Matlab". "Well, the matlab colormap is colorful but not true to reality". "No, I see interesting things in plot with Matlab colormap". Stopped arguing, went back home, used jet instead of viridis and smoothed the spline. Got an A+. What a great day!

And while I might be not the brightest guy around and thus might not be able to just run around spitting out solutions for hard problems I have certain standards on integrity. Basically faking results is something I won't do to create (optional, published) work (which a paper is for undergrad work and also was for a PhD until not too long ago here...).


Luckily the academic system will typically weed out unethical behaviour (e.g. faking results).


If you think the current process does, you either are in a very good environment or are plainly lying to yourself. I could just add some bits here and there, get excellent results and publish. Probably nobody will notice as it won't be a "big issue" and seems all fine with existing experimental data.


If you've faked your data to get these results then it could mean the end of your career, hence must be a very uncommon practice. This is different from cherrypicking data, or putting a positive spin on your findings. But again, it's expected now that data and code are open, so you won't get away with much these days in high-impact conferences/journals.


Not talking about high impact journals (where the majority of papers do not end up, when everyone is expected to publish twice a year...). Also I think cherrypicking your ML training data to achieve a certain result is akin to faking results?! And if you think that all data and code are open today you should certainly take a better look around you. Half of our lab is running on proprietary (simulation) software, good luck with auditing the code there (and evaluating the data because most of the people don't really document their workflow...). And both these examples take/took place at upper-mid-tier European universities which consistently rank in the global upper 100/European upper 20 citation rankings...

Besides that, most of the people do a PhD for getting better paid jobs, not academic tenure (where such things could backfire).


> 50% of the people are just showoffs throwing buzzwords and positivity around while they produce a load of sh...

Better get on this train because 10 years after undergraduate I can say it seems that’s all there is out there. Once you make it then maybe you don’t have to be full of sh but I don’t see people making it on merit unless they’re Albert Einstein.


regression is not a black box.


no, but using ML for what is basically regression and then telling everyone who asks: "I don't know how it works, it's advanced ML" is "using it like a black box"


That was a terrible article. I didn't see even one concrete example of their complaint. Blaming the reproducibility crisis on machine learning methods is just a cheap dodge.


My impression from the article was that the doctor stating those opinions has no idea how ML works and how to apply it properly, leading to statements like that. "ML gap" is real I guess...


She has a PhD in Statistics from Stanford. The title of her thesis was "Transposable Regularized Covariance Models with Applications to High-Dimensional Data". (http://www.stat.rice.edu/~gallen/)

I think she knows what she is talking about.


Thanks for pointing this out. The professor cited most certainly knows the best of ML. I'm guessing the author of this article simply attended the AAAS session where Allen gave a talk on recent work on addressing inferential challenges with modern ML and wrote this piece that doesn't do her work justice. See https://aaas.confex.com/aaas/2019/meetingapp.cgi/Session/215... & list of recent papers https://arxiv.org/search/?query=genevera+allen&searchtype=al...

Nearly all statisticians realize the need for more inferential thinking in modern ML. E.g. http://magazine.amstat.org/blog/2016/03/01/jordan16/ We still don't do that well in high dimension, low sample size regimes that make up the majority of life science research.


Smart and beautiful! :)


At least (s)he's realizing that (s)he has no idea. Most of the non-math/non-statistics (probably even some of the CS) people don't do that and just apply some ML-algorithm out of a tutorial on their data.


Good read. It's also refreshing to see a mainstream article that talks about ML without once mentioning 'AI'.


ML is statistics with a different name using a computer, that should always be mentioned in articles for the general public.

On the other hand AI is fantasy BS hype boosting off the fact that ML sounds similar to AI to people who aren't aware Machine Learning is stats.

Maybe AI one day but today it is utterly ridiculous. No really. Every single article should mention both of those things at least in passing. Downvote away all you AI hype Surfers but you know it's true.


I'd say currently ML is heuristics done by a computer. The rigour of statistics isn't quite present in ML yet. At least, not in the basic courses yet.


You can equally do the thing named statistics without rigour. In fact that describes most of it. Sadly. See replication crisis, p-hacking, garden of forking data etc. etc. etc. I'd say the overwhelming majority of university stats courses have no rigour at all and that may not be a bad thing in and of itself?


ML is different from statistics. If you want to learn more read Breiman's Statistical Modeling: Two cultures.

AI is real and is a legit field of study, of which ML is currently very popular, so some articles conflate these two. But it is far from bullshit. If you want to learn more read Artificial Intelligence: A modern approach.


"statistical modelling:" <- 1 discipline "Two cultures" Maybe re-read it? Or read Geoff Hinton, or Wasserman, or Judea Pearl or Hastie & Tibshirani or Murphy on the subject..?

AI in a pitch is exactly as much BS as interstellar travel. Definitely go ahead and study all the failed attempts. Norvig wrote the most popular text nearly a quarter of a century ago. It didn't exist then either. There is no going to Alpha Centauri, there is no HAL 9000. There is no intelligence that is artificial.

Machine Learning exists and is real and is statistical inference done by computer. There is no point where you can look at it and say "Here is the boundary between statistics and ML." Try it. Glorified curve fitting has some fantastic applications and killer demos. Along comes the hype train exactly as you'd expect. Put it on the blockchain or has the hype for that died now?

https://www.technologyreview.com/s/612437/what-is-machine-le...

Shallow BS detection, worth doing if only for one's own sanity.


How does this article manage not to mention a single actual example of ML-related misconceptions?? I'm sure they exist, but there is literally nothing here except some assertions and a plug for a vaguely remedial research line.


Here is the missing context from press release:

``` "In precision medicine, it's important to find groups of patients that have genomically similar profiles so you can develop drug therapies that are targeted to the specific genome for their disease," Allen said. "People have applied machine learning to genomic data from clinical cohorts to find groups, or clusters, of patients with similar genomic profiles.

"But there are cases where discoveries aren't reproducible; the clusters discovered in one study are completely different than the clusters found in another," she said. "Why? Because most machine-learning techniques today always say, 'I found a group.' Sometimes, it would be far more useful if they said, 'I think some of these are really grouped together, but I'm uncertain about these others.'"

Allen will discuss uncertainty and reproducibility of ML techniques for data-driven discoveries at a 10 a.m. press briefing today, and she will discuss case studies and research aimed at addressing uncertainty and reproducibility in the 3:30 p.m. general session, "Machine Learning and Statistics: Applications in Genomics and Computer Vision." Both sessions are at the Marriott Wardman Park Hotel. ``` https://eurekalert.org/pub_releases/2019-02/ru-cwt021119.php

& the context of the AAAS session https://aaas.confex.com/aaas/2019/meetingapp.cgi/Session/215...
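The "always says 'I found a group'" complaint in the quote is easy to reproduce with a toy. This is only an illustration of that failure mode, not of Allen's methods: a bare-bones one-dimensional k-means (standard library only, names are mine) run on uniform noise that has no groups by construction.

```python
import random

def kmeans_1d(points, k, rng, iters=50):
    """Plain Lloyd's algorithm in one dimension; returns a label per point and the centers."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster happens to empty out
                centers[c] = sum(members) / len(members)
    return labels, centers

rng = random.Random(0)
# Uniform noise on [0, 1): by construction there are no real groups here.
noise = [rng.random() for _ in range(300)]
labels, centers = kmeans_1d(noise, 3, rng)
sizes = [labels.count(c) for c in range(3)]
print(sizes, [round(c, 2) for c in centers])
```

Ask for three clusters and you get three clusters, each with plenty of members, and nothing in the output hints that the structure is not real. Any statement about uncertainty has to come from something extra, such as rerunning on resampled data and checking whether the groups persist.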


This is a misleading title. The researcher they quote is

> ... developing the next generation of machine learning and statistical techniques that can ... also report how uncertain their results are and their likely reproducibility.

So she's actually using machine learning to assess systematic uncertainties, i.e. to get better, more reproducible research. Of course, like all forms of automation, people tend to sensationalize progress as a crisis because it makes it too easy to shoot yourself in the foot.

But doing things "the old fashioned way" isn't any better. Early particle physics experiments would get armies of undergrads to classify photographs of collisions in bubble chambers. These results took thousands of researcher-hours to compile, which might seem all fine and dandy, until you realize that there may have been a systematic bias in your classification. Now what do you do?

Thanks to machine learning, there are a lot of things we can do: we can try to remove the bias and retrain the algorithm, or we can train with extreme examples of bias and use that to quote a systematic uncertainty. We can try a multitude of approaches to estimate uncertainties and rerun our entire analysis in a few hours. Good luck doing that with an army of undergrads.


Case in point: LHC Higgs results - how many detections vs how many events? How were the detections determined... The answer is with a large booster [1]

I postulate that out of 12 billion random events it would be remarkable if a booster didn't extract 100 or so items that looked similar to a Higgs detection.

Well, let's give it 20 years and a new generation of PIs who aren't invested in this and have grad students who are keen to find something different in the data.

But ohh.. all the data has been thrown aways... oh! [2]

[1] https://indico.cern.ch/event/705941/contributions/2897000/at...

[2] https://www.forbes.com/sites/startswithabang/2018/09/13/has-...


90% of the work in LHC physics is estimating the amplitude of backgrounds that look almost exactly like your signal process. Coming up with the "large booster" for classification is only a small part of it. So yes, machine learning is used, but no, we don't use it blindly like you imply.

As for throwing all the data away, the article you link to actually does a good job of explaining how this is done: we look at every collision with thousands of sensors before deciding whether to keep it. At this stage there is absolutely no machine learning anyway (just physics knowledge), so be careful blaming machine learning for any missed discoveries.


" all the data has been thrown aways... oh"

That's a ridiculously disingenuous summary of the article, did you read past the Forbes headline?


You clearly have no idea what you are talking about. Did you try reading any of the papers we wrote? There are many analyses and multiple channels.


Overfitting is a well-known problem in the ML community. There are methods to avoid this: cross validation, train-test splits, etc. There are also models that give you an estimate of the standard deviation of a prediction. What is the point? We don't need new algorithms, we just have to apply existing methods properly.
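For anyone who hasn't seen it, a bare-bones sketch of k-fold cross-validation that reports both a mean score and its spread across folds. The data and the nearest-centroid "model" are toy stand-ins invented for the example; only the standard library is used.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(xs, ys):
    """Toy model: remember the mean feature value of each class."""
    return {c: statistics.mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}

def predict(centroids, x):
    """Predict the class whose centroid is nearest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def cross_val_accuracy(xs, ys, k, rng):
    """Fit on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, rng):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return statistics.mean(scores), statistics.stdev(scores)

rng = random.Random(0)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]   # synthetic: class 1 is shifted by 2
mean_acc, sd_acc = cross_val_accuracy(xs, ys, k=5, rng=rng)
print(f"accuracy {mean_acc:.2f} +/- {sd_acc:.2f} across folds")
```

The spread across folds is a crude but honest indication of how much the headline number depends on which examples happened to be held out.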


The title makes it sound as if the AAAS made this statement; it's a single researcher who is making this claim.


> Machine learning 'causing science crisis'

ML or more generally mathematics do not cause anything. People who misuse mathematics are to blame here. Some fields are simply using tools they don't understand and this predates ML advances by decades. Thinking of stats use in psychology and medicine for instance.

This trend of presenting ML as some kind of magic powder is ridiculous. I blame hyped presentations by influential ML scientists for this.


I wonder: don't machine learning frameworks' results come with a level of confidence?

Ps: I have no experience with anything regarding ML.


Because it's difficult. Bayesian methods don't scale well to the number of parameters that most modern ML demands. Doing inference for large models is intractable.

However, even if you're able to do that, the problem isn't solved. You have to answer level of confidence for what? The uncertainty / confidence that you get assumes your model is right. No model can tell you whether it is a true reflection of reality. I had written more about this on my twitter: https://twitter.com/paraschopra/status/1075033048767520768


Yes, you can create confidence estimates for both large neural networks and gradient boosting (see for instance the thesis of Yarin Gal). This covers the majority of commercial and academic applications.

ML is actually a field with very high standards for replication, in part because empirical results are currently the focus. If certain methods don't generalize to other datasets, then all bets are off: you are dealing with data that violates the IID assumption. No statistics, bean counting, or ML is going to help you get significant results.


They should, it shouldn't be difficult. I think the issue is more that ML gives more tools that a researcher could employ to get any arbitrary result if they knew what they are looking for.


Yeah, as most people here wrote already, it's statistics. And your models will yield those statistics...

Unfortunately if you're an engineer/physicist/chemist/biologist/social-scientist your background in statistics is neither fresh nor deep. So your professor comes to you: can you do something with ML, it's such a hot topic (your boss has also no background in statistics nor do his peers (who review your stuff...)) you say: yes (because a no I don't know about it won't be good for you). Then you go to some Google or blockchain sponsored-tutorial where some self-taught-Indian-CS-Bachelor is telling you how to use ML with Python and TensorFlow. You might wonder about some things but in the end you need to get things done and feed your data (which is often garbage, but verifying that it's not is not hot) into an algorithm you don't understand. Then you find some other guys, doing this, cite them and publish. 0 scientific value generated, but a great step for your academic career nonetheless.


I can see there being issues with reproducibility, i.e. getting the exact same results, but has there ever been a time when science was more replicable? Data/techniques/findings/papers are under more scrutiny than ever. No positive results will be taken as sacrosanct in CS anymore. This is a complete 180 from 10+ years ago.


In my view, rather than talking about time periods, it might be preferable to consider different fields of science wrt replicability. In fact it doesn't even make sense to put all of "science" in one basket. The medical and behavioral sciences get the most attention these days, but are not comparable to physics, chemistry, geology, astronomy, etc. My field (physics) doubtlessly has its own problems, yet has produced theories of astounding generality and accuracy in spite of potential flaws in the individual studies that led to the success of those theories.

Replication might not turn out to be the big problem. The lack of progress towards a unifying theory might be a more important long term issue.


I've often imagined how different Newtonian physics would be if we had gone the ML route from the beginning.


Hopefully machine learning helps with confidence and making predictions out of experiments as opposed to the limited capability of "understanding" from the way things are done now (e.g. experiments with slightly higher p-values are ignored, or ones with smaller values might have hidden biases, etc).


The other day someone lamented that you can't get published as an honest ML researcher, because other scientists are rendering whole professions obsolete all the time...


> you can't get published as an honest ML researcher

If you research ML, you can publish in ML journals, there are several. If your research is about applying ML to domain problems, are you then an ML researcher or a domain researcher?


The point was that it is difficult to get noticed with down-to-earth work when the whole field seems to be aiming for the stars.


It's not like teaching to the test works better for humans.


As other comments observe, the replication crisis predates the use of ML, so the causes are clearly deeper.

I think there's actually a very simple explanation for this which lots and lots of people hate, so they're sort of in denial about it. Academia is entirely government funded and has little or no accountability to the outside world. Academic incentives are a closed loop in which the same sorts of people who are producing papers are also reviewing them, publishing them, allocating funding, assessing each other's merits etc. It's a giant exercise in marking your own homework.

Just looked at in purely economic terms, academia is a massive planned economy. The central planners (grant bodies) decide that what matters is volume and novelty of results, so that's what they get, even though the resulting stream of papers is useless to the people actually trying to apply science in the real world ... biotech firms here but the same problem crops up in many fields. It's exactly what we'd expect to see given historical precedent and the way the system works.

There's another huge elephant in the room here beyond the replication crisis ("to what extent are the outputs wrong") which is the question of to what extent are the outputs even relevant to begin with? Whenever I sift through academic output I'm constantly amazed at the vast quantity of obviously useless research directions and papers that appear to be written for their cleverness rather than utility. The papers don't have to be wrong to be useless, they can just solve non-problems or make absurd tradeoffs that would never fly in any kind of applied science.

I read a lot of CS papers and I've noticed over time that the best and most impactful papers are almost always the ones coming out of corporate research teams. I think this is because corporate funded research has some kind of ultimate accountability and connection to reality that comes from senior executives asking hard questions about applicability. For instance in the realm of PL research academia pumps out new programming languages all the time, but they rarely get any traction and the ideas they explore are frequently ignored by the industrial developers of mainstream languages because they're completely impractical. This problem is usually handwaved away by asserting that the ideas aren't bad ideas, they're just incredibly futuristic and 30 years from now we'll definitely be using them - but this kind of reasoning is unfalsifiable on any kind of sensible timescale so it's the same as saying, "I shouldn't be held accountable within the span of my own career for how I spend tax and student money".

As time goes by I am getting more and more sympathetic to the idea of just drastically cutting academic funding and balancing the books by drastically reducing corporation tax. The amount of total research would fall significantly because corporations wouldn't invest all the newly available money in research, or even most of it, but it's unclear to me that this would be a bad thing - if 75% of research studies coming out of academic biotech are wrong then it stands to reason that if standards were improved significantly, funding could be reduced by (say) 50% and still get a similar quantity of accurate papers out the other end. It's possible the science crisis is really just reflecting massive oversupply of scientists, massive undersupply of accountability and in general research should be a much smaller social effort than it presently is.


Somehow, it seems like during the Cold War (50s, 60s), there were fewer scientists and the quality of the output was higher. Not sure if that is the case, or survivorship bias. But if it is the case, what was different about the system back then?

To play the devil's advocate: scientists in industry/corporations do not come out of nowhere - they come from academia. Will the academics not move to countries where academic research is better funded? The students will follow. Corporations will set up their research labs in those countries near the universities to poach the best talent. Suddenly, your country is at a disadvantage.


Maybe better lower corporation taxes and concurrently raise the tax on earnings, because most of the big, old corps are not really driven by people wanting to expand their knowledge but by people wanting to squeeze out a little more cash (might be different if the turf is contested)...


A dishonest scientist can mine a dataset for statistically significant hypotheses and for a long time no institutional protection against it was in place:

https://en.wikipedia.org/wiki/Data_dredging

https://www.xkcd.com/882/

Machine learning makes it easier to test a great many hypotheses, but even going fully "by hand" it is very easy to deviate from what the statistical framework of hypothesis testing would demand. There are now some discussions about counter-measures, e.g. about preregistration of studies:

http://www.sciencemag.org/news/2018/09/more-and-more-scienti...

You can see this as another chapter in the long debate about the correct way to test scientific hypotheses:

https://en.wikipedia.org/wiki/Statistical_hypothesis_testing...


As your number of samples increases, the chance that a hidden variable that explains the phenomenon but correlates with the thing you're testing also increases.

All experiments have a limit it seems


The issue talked about here is distinct from the larger "reproducibility crisis"; the latter is a result of shoddily designed (or simply fraudulent) experimental work, whereas the issue here is the aggregate effects of the huge amount of computational work that is being done- even when that work is being done correctly and honestly.

Testing a hypothesis against a pre-existing dataset is a valid thing to do, and it is also almost trivially simple (and completely free) for someone with a reasonable computational background. There are researchers who spend a decent portion of their careers performing these analyses. This is all well and good- we want people to spend time analyzing the highly complex data that modern science produces- but we run into problems with statistics.

Suppose an analyst can test a hundred hypotheses per month (this is probably a low estimate.) Each analysis (simplifying slightly!) ends with a significance test, returning a p-value: the probability of seeing a result at least this extreme if the hypothesis were false. If p < 0.01, the researcher writes up the analysis and sends it off to a journal for publication, since a false hypothesis would produce a result like this only one time in a hundred. But you see the problem: even if we assume that this researcher tests no valid hypotheses at all over the course of a year, we would expect them to send out one paper per month (100 tests × 0.01), and each of these papers would be methodologically valid, with no flaws for reviewers to complain about.
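The arithmetic is easy to check with a quick simulation. This sketch assumes every tested hypothesis is false and that the significance test is well calibrated, so each p-value is just a uniform random number; the function name and the 1000-year repetition are mine.

```python
import random

def false_positives(n_tests, alpha, rng):
    """Count 'significant' results when every tested hypothesis is actually false.

    Assumes a well-calibrated test, so each p-value is uniform on [0, 1).
    """
    return sum(rng.random() < alpha for _ in range(n_tests))

rng = random.Random(0)
expected_per_month = 100 * 0.01

# One simulated year, month by month.
per_month = [false_positives(100, 0.01, rng) for _ in range(12)]

# Average number of spurious papers per year over many simulated years.
years = [sum(false_positives(100, 0.01, rng) for _ in range(12)) for _ in range(1000)]
avg_per_year = sum(years) / len(years)
print(per_month, round(avg_per_year, 1))
```

Some months produce nothing and some produce two or three, but the long-run average sits at about twelve publishable-looking results per year from an analyst who never tested a single true hypothesis.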

In reality, of course, researchers sometimes test true hypotheses, and the rate of true to false computational-analysis papers would depend on the ratio of "true hypotheses that analysis successfully catches" to "false hypotheses that squeak by under the p-value threshold" (i.e., the True Positive rate vs the False Positive rate.) It's hard to guess what this ratio would be, but if AAAS is calling things a "crisis," it's clearly lower than we would like.

But there's a further problem, since the obvious solution- lower the p-value threshold for publication- would lower both the False Positive rate and the True Positive rate. The p-value that gets assigned to the results of an analysis of a true hypothesis is limited by the statistical power (essentially, size and quality) of the dataset being looked at; lower the p-value threshold too much, and analysts simply won't be able to make a sufficiently convincing case for any given true hypothesis. It's not a given that there is a p-value threshold for which the True Positive/False Positive ratio is much better than it is now.
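The trade-off can be written down under some simplifying assumptions. In my notation (not from the article): $\pi$ is the fraction of tested hypotheses that are actually true, $\alpha$ is the p-value threshold, and $1 - \beta$ is the power, i.e. the chance that a true hypothesis clears the threshold on the available data.

```latex
\frac{\text{true positives}}{\text{false positives}}
  = \frac{(1 - \beta)\,\pi}{\alpha\,(1 - \pi)},
\qquad
\text{fraction of published results that are true}
  = \frac{(1 - \beta)\,\pi}{(1 - \beta)\,\pi + \alpha\,(1 - \pi)} .
```

As a made-up example, with $\pi = 0.01$, power $0.8$ and $\alpha = 0.01$ the ratio is about $0.81$, so fewer than half of the significant results are real. Lowering $\alpha$ shrinks the denominator, but on a fixed dataset it also shrinks $1 - \beta$ in the numerator, which is why a stricter threshold is not guaranteed to improve the ratio by much.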

"More data!" is the other commonly proposed solution, since we can safely lower the p-value threshold if we have the data to back up true hypotheses. But even if we can up the experimental throughput so much that we can produce True Positives at p < 0.0001, that simply means that computational researchers can explore more complicated hypotheses, until they're testing thousands or millions of hypotheses per month- and then we have the same problem. In a race between "bench work" and "human creativity plus computer science," I know which I'd bet on.



