This seems like an extension of the replication crisis. In many fields, most published research is already bogus. The idea that peer review is enough to ensure that research is valid is perhaps not scaling well.
It would be great to have things like open data sharing. At least in astronomy, which I'm somewhat familiar with, it doesn't seem like we're that close. Most scientists cannot even reproduce their own results. It's common to use things like manual Jupyter notebooks, unlabeled CSVs, and a bunch of disorganized data files, in a one-off process that a scientist manually summarizes to produce a paper.
To me, in an ideal world each paper would sit in a GitHub repository, with an integration test that verifies the code actually produces the results used in the paper. That isn't really what academics prioritize, but perhaps things will move in this direction as more people realize that we have a replication crisis, and also as scientists tend to have more software engineering skills over time.
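Concretely, the integration test could be as small as the sketch below. Everything here is hypothetical (the `analysis.run_pipeline` entry point, the data file, and the numbers standing in for what the paper reports); the point is only that CI fails the moment the code and the paper drift apart.

```python
# Minimal sketch of a "does the code still produce the paper's numbers" test.
# Module name, data path, and expected values are hypothetical.
import pytest

from analysis import run_pipeline  # hypothetical entry point of the paper's repo


def test_paper_results_reproduce():
    # Re-run the full pipeline on the archived raw data shipped with the repo.
    results = run_pipeline("data/raw/observations.csv")

    # Values as reported in the paper (placeholder numbers for illustration).
    assert results["best_fit_slope"] == pytest.approx(1.87, rel=1e-3)
    assert results["n_sources"] == 412
```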
The current scientific system has long been known to have serious problems of incorrect results and conflicts of interest. This article seems like an attempt to pin the crisis on an AI scapegoat.
From 2015, the Editor of The Lancet:
The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness. As one participant put it, “poor methods get results”. The Academy of Medical Sciences, Medical Research Council, and Biotechnology and Biological Sciences Research Council have now put their reputational weight behind an investigation into these questionable research practices. The apparent endemicity of bad research behaviour is alarming. In their quest for telling a compelling story, scientists too often sculpt data to fit their preferred theory of the world. Or they retrofit hypotheses to fit their data. Journal editors deserve their fair share of criticism too. We aid and abet the worst behaviours. Our acquiescence to the impact factor fuels an unhealthy competition to win a place in a select few journals. Our love of “significance” pollutes the literature with many a statistical fairy-tale. We reject important confirmations. Journals are not the only miscreants. Universities are in a perpetual struggle for money and talent, endpoints that foster reductive metrics, such as high-impact publication. National assessment procedures, such as the Research Excellence Framework, incentivise bad practices. And individual scientists, including their most senior leaders, do little to alter a research culture that occasionally veers close to misconduct.
I would note there is a difference between untrue and fraudulent.
"Afflicted by studies with small sample sizes, tiny effects..."
Neither of these is inherently the result of malfeasance or fraud, or even of poor statistical practice.
Sample size is a compromise between a number of things - statistical power, yes, but also trying to minimize the number of human or animal subjects involved, and to be frank, budget (as I've noted elsewhere, the NIH R01 non-modular budget hasn't changed since the 90's). Before there's a body of work done, statistical power is often speculative. What do we think the effect estimate will be? If we're lucky, maybe we have a mathematical model suggesting at least something. In that arena, it's likely that we may undershoot the needed sample size - though I'll note underpowered findings are null findings, and far less likely to get published (and likely hopeless in say...The Lancet). There are also some questions that we may still want answers to where the sample size is inherently small. There are a finite number of Ebola outbreaks, or of veterinary clinics in the U.S. (both real examples).
Similarly, tiny effects are hard to estimate, but that doesn't mean they're bad to estimate. Something that increased the risk of death in American citizens by 1% for example, would have a relative risk of 1.01, which is as small as many medical and epidemiology journals are apt to report. Yet this would impact thousands of people. Measuring that may be very hard, and very noisy, producing a number of wrong answers, but it's not self-evidently a bad idea. Especially if we don't know if the effect is tiny ahead of time.
How do you see this as scapegoating? The headline specifically says "intensifies", the article very clearly positions AI-fabricated data as an extension of existing problems, and I don't see anything in the article downplaying those existing problems (the entire closing section is about how the summit was on issues broader than AI).
Nature's reporting on the problem of paper mills is surprisingly high quality and honest, given that these reports directly attack the credibility of Nature itself (and many other journals).
Does the argument attack Nature and other journals, or does it set up a position that makes journals even more important for providing filters, verification, and similar services?
What we're facing is an explosion of Brandolini's law on a scale people are not prepared to deal with, and when all financial incentives promote this direction and financial incentives rule the world, I'm not sure how we get around the issue in environments that support free speech.
I'm not proposing we curtail free speech, but we have serious issues to deal with as a society in terms of the believability and sheer volume of false information. To some degree this has always been an issue, but my concern is that there's a critical tipping point in a free-speech environment where there's so much BS out there that people completely stop believing any information, no matter how reputable or valid it is.
It could make journals even more important, but in practice the reason paper mills exist is because journals aren't doing much (visible?) QA on papers, so spamming them with fake claims and auto-generated papers is a viable business model.
Journals unfortunately aren't stepping up to meet the challenge. I was writing articles about this problem several years ago and things haven't noticeably improved. This thread is full of people pointing out the obvious solution - take claim auditing seriously and start by paying people to do professional peer reviews. The journals' actual solutions tend to look like spam filters for paper submission queues. It's enough to be able to say they're doing something, but not enough to actually make a big difference, especially post-ChatGPT.
I'm not so pessimistic, I think people learn to discriminate between sources. A lot of people right now are learning to generalize to "experts aren't" but it's a rather more nuanced understanding than the media like to make out when you dig in, for instance people understand that "expert" in this context usually means public sector funded academic or civil servant, and not e.g. a roughneck on an oil rig, some UI programmer at Apple, the guy who fixed their car last weekend. They learn that the first form of expertise is the type where making false claims can be beneficial for the people making them, whereas if you lie a lot about oil on an oil rig eventually something will explode.
I think we'll eventually get to a place where incentives are better aligned, for instance where data collection and aggregation is fully divorced from the people analyzing it. A lot of the distortion in science comes from the fact that academics need to collect data but aren't rewarded much for doing it, so once a dataset is collected it gets kept secret and that allows for a lot of dishonest game playing. It also means people are strongly incentivized to see things in the data that aren't really there. If data collection and analysis was fully divorced, those issues would go away (you'd get other problems of course but would they be worse?).
> "The idea that peer review is enough to ensure that research is valid is perhaps not scaling well."
As others have said, this is not what peer review has ever been for, at all. It only checks for gross omissions and violations of form and syntax that are obvious to other scientists who are in adjacent fields (not even necessarily the same one). It's a relic from the times when publications were not target metrics. Peer reviewers never replicate the work. If you are a grad student tapped for peer review and you spend the time to replicate the work, you are ruining your career and your advisor will also be mad.
I have had reviewers and editors dig thoroughly into work, including code, and it has never been met with anything other than appreciation. I've sent reviews in detailing what further test or figure might be needed to make a point more persuasive, or noting inconsistencies, and I encourage my students to do the same.
> I have had reviewers and editors dig thoroughly into work, including code, and it has never been met with anything other than appreciation.
Yes obviously that kind of free and unexpectedly diligent labor would be appreciated.
> I've sent reviews in detailing what further test or figure might be needed to make a point more persuasive, or noting inconsistencies, and I encourage my students to do the same.
Yes these are the kinds of things peer reviewers do. Notably not replicating the work.
"Yes obviously that kind of free and unexpectedly diligent labor would be appreciated."
Then perhaps characterizing doing work - including often doing partial replication, independently deriving results, checking code to make sure it's both complete and consistent, etc. - as "ruining your career and your advisor will also be mad" is unfair.
"I've sent reviews in detailing what further test or figure might be needed to make a point more persuasive, or noting inconsistencies, and I encourage my students to do the same."
Some of this is replication. "I cannot get from X to Y in your paper without the addition of Z" is, at its essence, a statement that a result cannot be replicated given what has been provided.
> "Then perhaps characterizing doing work - including often doing partial replication, independently deriving results, checking code to make sure it's both complete and consistent, etc. as "ruining your career and your advisor will also be made" is unfair."
You are mischaracterizing my argument in multiple ways. What I said was "Peer reviewers never replicate the work. If you are a grad student tapped for peer review and you spend the time to replicate the work, you are ruining your career and your advisor will also be mad." which I stand by. I don't think it's unfair. Maybe your reinterpretation is unfair, but I can't speak to that because you were the one who wrote it not me.
Then in response to you saying "I have had reviewers and editors dig thoroughly into work, including code, and it has never been met with anything other than appreciation." I said "Yes obviously that kind of free and unexpectedly diligent labor would be appreciated." which I also stand by.
In response to you saying "I've sent reviews in detailing what further test or figure might be needed to make a point more persuasive, or noting inconsistencies, and I encourage my students to do the same." I said "Yes these are the kinds of things peer reviewers do. Notably not replicating the work." which I also stand by. You are saying that "Some of this is replication." Well it's obviously not. Asking for further tests and figures and noticing inconsistencies are exactly the kind of things that peer reviewers usually do, and it's not replicating the work.
(PhD student in STEM here) I think most people have the wrong idea about what peer review is. My advisor teaches us to treat it as a first check, but it is not a guarantee of correct results.
Most of the time I spend on research is actually trying to understand the literature and reproducing their results. If I can't do it, it probably means I don't understand enough about the work I'm reading, but there is also the small chance that the published analysis is wrong, which already happened to me.
Peer review cannot catch if they performed the procedure correctly (e.g. in a lab), but it is there to check things like:
- Validity of the experimental setup
- Validity of the statistical analysis methodology
- Validity of the conclusions
Take a look at the items in this comment:[1]
> Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses
All these can be caught with peer review.
> together with an obsession for pursuing fashionable trends of dubious importance
In contrast, this problem is intensified with peer review.
> In their quest for telling a compelling story, scientists too often sculpt data to fit their preferred theory of the world. Or they retrofit hypotheses to fit their data.
Not the role of peer review to catch these, but making your data and analysis scripts open will allow them to be caught by anyone who wishes to. The problem we have with current publishing is that I could use this technique to get faulty results, publish in a prestigious journal, get 100 citations on my paper, and one of those citations will be from the person who looked at my data and saw obvious biases in my selection (which cannot be caught by a mere peer review). That one citation pointing out how wrong I am is lost in the noise.
We need a way to highlight that citation - merely making a simple graph of connections via references doesn't get us there.
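As a rough sketch of what "highlighting that citation" could look like, imagine citations carrying a stance label so a refutation isn't just one more edge in the graph. The paper IDs, stance names, and structure below are invented for illustration; this is not an existing standard.

```python
# Toy citation graph with typed edges, so refuting citations can be surfaced
# instead of drowning in the count. All identifiers here are made up.
import networkx as nx

G = nx.DiGraph()
G.add_edge("smith2021", "doe2019", stance="supports")
G.add_edge("lee2022", "doe2019", stance="refutes", note="selection bias in raw data")


def refutations(graph, paper):
    """Return citing papers that dispute `paper`, with their stated reasons."""
    return [
        (citer, attrs.get("note", ""))
        for citer, cited, attrs in graph.in_edges(paper, data=True)
        if attrs.get("stance") == "refutes"
    ]


print(refutations(G, "doe2019"))  # [('lee2022', 'selection bias in raw data')]
```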
Most people have the wrong idea about what publication is, too. For almost all papers, for almost all readers, it doesn't matter if results are correct. It's just part of the game of academia. If you need to build something based on a paper, and you need it to work correctly (so, outside of public policy or macroeconomics), you need to do the work yourself.
But if you just want to cite the work and write your own paper on top of it, go ahead.
It's never going to be a guarantee, but it's also implemented very shittily at present. There's a difference between what peer review is and what it should be. It makes sense to look at it realistically in day to day practice, but in discussion of systemic problems it should be viewed in a different light.
It might help to expand on "bogus". Bogus has a few levels, going from "not good, but possible from a well-intentioned author trying to do the right thing" (low-level bogus) to "outright deception" (high-level bogus). Small sample sizes, statistical errors, and flawed (but honest) experiment design are all, I suggest, low-level bogus. Faking data and plagiarism are high-level bogus.
I think peer review is capable of, eventually, mitigating low-level bogus. The quantitative standards in fields where low-level bogus is a problem (e.g., but definitely not only, medicine) are rising. Peer review is not a scalable solution to high-level bogus. Figuring out high-level bogus seems to be almost a full-time job [1]. You cannot expect this level of effort from researchers, especially if they are reviewing for free; I would even argue that it's easier to fake data than to figure out it's fake. It also requires more expertise to assess quality research than to write a low-level bogus paper and submit it. There's a mismatch here. There are not enough expert reviewers to handle all the low-level bogus papers.
The solution therefore seems to require some kind of reputational component. There needs to be a cost to engaging in high-level bogus. But this is a hard problem. Do you ban any lead author of a paper with demonstrated high-level bogus? Publicize their names? Ban any author of a paper with demonstrated high-level bogus? Throttle the submissions any one person can make to a conference/journal at a time? I don't know. But the current model will have to change.
It's also possible to be functionally bogus while doing everything 100% by the book. If you control things super well you can inadvertently make your result so narrow that it's basically meaningless, at least with the way the scientific system functions today. If people worked together in a more concrete way to build on prior results this effect might not be so bad.
One of my favorite examples is a mouse study that did not replicate between two genetically identical mouse populations raised in the same conditions run by the same lab. The difference was the supplier the two sets of mice originally came from, and the researchers were able to pin down the cause as differences in the gut microbiome between the two (and in fact one particular bacterial strain). That is an example of great research, but the vast majority of studies will never catch something like this before they publish because they will only use one mouse supplier as they keep things controlled while minimizing costs.
Because designed replication studies are fairly rare and people often do not officially publish when they find things that don't replicate in biology, we are approaching interpretation/downstream use of these highly controlled studies in an extremely inefficient way.
But that isn't really the fault of individual researchers. Technically they're applying the scientific method correctly to their niche problem. It's the definition of the problem coupled with how we combine results across groups that causes the inefficiencies. As problems of interest increase in complexity we can't define them on the scale of individual labs anymore, and for some reason we've addressed it by breaking them up into subproblems in this homogeneous way. Then we just assume that piecing together the little results will work great...
Anyways, I agree there is also straight up fraud and blatantly bad practices on the level of individual papers, it's definitely a continuum. Sometimes such bad results slip into the mainstream of science or even have a huge impact on subsequent funding directions like with the fabricated amyloid beta paper. But I do suspect that for the most part the blatantly bad work stays on the fringes, and the largest negative impact on scientific productivity actually comes from a level of abstraction up.
The way academic research is set up is not currently conducive to effective research. Research must intrinsically be allowed to fail. If you incentivize success, then what happens when someone is 2-4 years in and their only skill is apparently worthless?
Without Tenure or the ability for a scientist to accumulate their own savings, there is a strong incentive for most scientists to charitably interpret results/papers to maintain relevance.
Imagine a world where all the raw data sits in a public repository next to a script to process the data which was reviewed as part of the publication process, which would accompany the paper and could be quickly and easily replicated with new data. Anyone could produce the graphs shown in the paper simply by downloading both and running them together. What a wonderful thing that would be. And you could imagine the government funding the storing and serving of all the data. What a wonderful daydream of an idea.
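To make the daydream concrete, the accompanying script could be as simple as the sketch below: fetch the archived raw data, rebuild a figure, done. The URL, filenames, and column names are all made up for illustration.

```python
# Minimal sketch of a one-command reproduction script for a paper's figure.
# The archive URL, file names, and columns are hypothetical.
import urllib.request

import matplotlib.pyplot as plt
import pandas as pd

DATA_URL = "https://example.org/paper-1234/raw/measurements.csv"  # placeholder archive


def main():
    urllib.request.urlretrieve(DATA_URL, "measurements.csv")
    df = pd.read_csv("measurements.csv")

    # Regenerate the paper's scatter plot from the archived raw data.
    fig, ax = plt.subplots()
    ax.scatter(df["exposure"], df["response"], s=8)
    ax.set_xlabel("exposure")
    ax.set_ylabel("response")
    fig.savefig("figure2.png", dpi=200)


if __name__ == "__main__":
    main()
```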
I hate to be sincere, but the reality is that the data is our product. If we open source our data too much, we won't have anything left to publish, because others can scrape it and publish it while expending nowhere near the resources we must to produce it. (That is literally how the "AI" of the current hype bubble works today, lol - why would I want that online?)
The incentives are definitely bad, and that's where the actual fix should be.
If that happened it would indirectly change the incentives anyway, because everyone would start being required to release data, so bad practices that are currently incentivized would become impossible.
However I think it would be better to directly incentivize data release rather than require it, at least in biomedical sciences. Because of patient privacy issues there is no way raw data release can be required across the board. And I certainly do not trust the NIH to come up with a coherent set of rules for when it is versus isn't allowed, which would mean loopholes and more corruption.
Most projects are funded by a patchwork of different sources, not all of which would agree to the same terms of release of the information, but all of which are required to sufficiently fund its creation. Not to mention maintaining ongoing storage and accessibility. Single-source government grants for scientific research are the exception, not the rule.
I don't think this is the main reason people don't open source data. If your data is high quality and others are using it to find interesting stuff then you are going to get citations and authorship on additional publications with minimal work, which sounds like a great ROI. The reason data doesn't get open sourced is because researchers are not actually confident in the data and don't want their other works to be exposed.
"If your data is high quality and others are using it to find interesting stuff then you are going to get citations and authorship on additional publications with minimal work..."
This has not been borne out in my experience, or the experience of others. Data products are chronically undervalued and undercited in science, and do not come with guarantees of authorship unless you put up barriers to accessing them without it.
Open data is, at this point, a decision made for either ideological reasons or as a condition of funding and/or publication, not a decision that is itself usually "worth" the investment of curating a public data set.
This is the limit to what people do. In my field, even source code sometimes isn't free software. However, if you ask a researcher, they will give you a copy you can freely modify, because the act of asking means they know who you are as a person, and can later ask for citations when you publish. Data is similarly treated. Just putting it out there without any barriers guarantees that other parties will use it without citing you.
Even this would require somehow verifying the raw data. It's plausible a bad actor could "reverse engineer" their data from a pre-determined conclusion.
But yes, overall more openness is good. Still, the cost of losing trust in society is very high (as you then need to verify everything).
> It's plausible a bad actor could "reverse engineer" their data from a pre-determined conclusion.
I've already heard of someone planning a product ( initially targeted at lazy^H^H^H^Hbusy high schoolers and undergrads) that will use AI to reverse discover citations that fit a predetermined narrative in a research paper. Write whatever the hell you want, and the AI will do its best to backsolve a pile of citations that support your unsourced claims and arguments. The founder, and I use that term very generously, expects the academic community to net support this because it will boost citation counts for the vast majority of low citation, low visibility works.
Funding the storing and serving of all of that data doesn't sound like a difficult problem to me. That has gotten SO cheap over the past couple of decades. There are plenty of well funded institutions that can support that kind of resource.
You'd be surprised. One headache there is what did you tell study participants you would do with the data?
Did you say you'd keep it forever? Did you say 5 years? Who's in charge of making sure that this centralized repository isn't inappropriately holding and distributing data?
Funding organizations also have different requirements.
Then a script is only useful if paired with a set of libraries of a particular version, a specific compiler/interpreter, and an OS, and there may also be specialized hardware involved. Some of the languages used in science, like SAS, Stata, SPSS, and Matlab, aren't free and open source, so you can't always just bundle them.
And even data storage isn't trivial. For a recent small conference abstract I processed ~150GB of data. Hundreds (thousands? Tens of thousands?) of other papers have looked at that same data. You would really want some way to deduplicate that storage, but that introduces some additional complexity.
I do like this vision but I think it would be a major undertaking that would require a lot of well funded institutions coming together rather than any one in particular doing it on their own.
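On the deduplication point specifically, one plausible shape is content-addressed storage: files are stored once under a hash of their contents, and papers reference the hash rather than an ad-hoc filename. A rough sketch, with a hypothetical store path:

```python
# Sketch of content-addressed storage so the same large dataset referenced by
# many papers is physically stored once. The store location is hypothetical.
import hashlib
import shutil
from pathlib import Path

STORE = Path("/srv/data-store")  # placeholder shared object store


def file_digest(path: Path) -> str:
    """Hash a (possibly very large) file in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def ingest(path: str) -> str:
    """Copy a data file into the store only if its content isn't already there."""
    src = Path(path)
    digest = file_digest(src)
    target = STORE / digest[:2] / digest
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)
    return digest  # papers then cite data by digest rather than by filename
```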
"Your data is open and available in perpetuity for whatever use" is in deep conflict with how we think about human subjects data, often for very good reason.
One thing that was really jarring for me moving from the startup/consulting/advertising space to the research space was that human subjects data really gets deleted.
It's not just the deleted_at column being set, there's no backup, it's really gone. Every copy, forever.
I appreciate the ethics of it, and part of my reason for working in this area is because of these ethics, but even 5 years in there is so much reluctance to press that delete button.
I can only really speak to astronomy, but the biggest problem is generally not hosting the raw data. The problem is maintaining the process of analyzing the data.
When you are doing science, you often do not just have a single standardized data format. It's not like taking a picture with a camera, where the jpg format is a standard and the metadata is a standard. If you are, say, storing data from a radio telescope, the metadata is more like a snapshot of a production database. Over time you might track additional data, like how much the telescope slewed recently, how much radio interference was nearby according to some new algorithm, and so on.
Your data formats are constantly changing. So your analysis scripts are constantly changing as well. But this sort of thing doesn't just maintain itself. You very often need to add code like, how do you handle versions of the data where column X is missing. A research project might spend a year gathering data, and change the data format a tiny bit ten times over that year. If you publish something a few times a year, there eventually are a huge number of data versions and script versions that old publications rely on.
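The kind of defensive code this implies looks something like the sketch below: a loader that tolerates older data versions where a column did not exist yet. The column names and fallbacks are invented for illustration.

```python
# Sketch of a schema-version-tolerant loader for evolving observation dumps.
# Column names and default values are hypothetical.
import pandas as pd


def load_observations(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Older dumps predate the RFI-flagging algorithm; mark them as "unknown".
    if "rfi_fraction" not in df.columns:
        df["rfi_fraction"] = float("nan")
    # Slew rate was only recorded from a later schema version onward.
    if "slew_rate_deg_s" not in df.columns:
        df["slew_rate_deg_s"] = 0.0
    return df
```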
It isn't impossible to maintain data like this. You can have code that regularly runs integration tests and reruns past analyses. But most research doesn't operate to this level of "software engineering quality". One-off Jupyter notebooks, code that the developer only got working on their local machine, and so on.
I think we could do better, but it would involve hiring more software engineers and building engineering teams to support scientific research. It is not as simple as allocating more budget towards hosting large files.
Some journals in fact implement this idea (e.g. having the raw data underlying each figure one click away). That is however not the crux of the reproducibility crisis; it would be great if the problem were just "I can't make an exact copy of Supplementary Figure 5D", but it is rather "I can't confirm that protein X causes dementia using orthogonal techniques". There is no easy code fix for that problem.
The unfortunate truth is that academic lineage matters. In a world with decreasing SNR from paper content, the SNR from finding the paper on a reliable researcher's homepage massively increases (because sadly, people are added to papers without their permission in an effort to boost their own signal).
> Most scientists cannot even reproduce their own results. It's common to use things like manual Jupyter notebooks, unlabeled CSVs, and a bunch of disorganized data files, in a one-off process that a scientist manually summarizes to produce a paper.
The only way to get that under control is if universities had a career track of data scientists and programmers, basically a shared resource pool of specialists that all researchers could use. But most US universities seem to prefer to "invest" their endowments into professional athlete teams.
No offense, but astronomy is ripe for being rife with this sort of thing because the validity of astronomical research doesn't matter for the "real world" and can't really easily be tested. I feel like it could be better, but since you guys do not have any external pressures beyond yourselves (as academics), you can still publish however you feel like.
A lot of the more applied fields cannot survive scrutiny. Either you have to produce something that leads to a product or something a PM will scrutinize (especially if you work for a national lab or the like, or for pharma, and so on). There, a lot of published research might be bogus, but it is nowhere near the majority.
Astronomy is fantastic for preventing this type of fraud.
The data is incredibly open. Anyone can download datasets from telescopes, or build their own. Papers built on invalid data can be refuted by simply pointing telescopes at the location the paper is lying about. There are very few possible discoveries or theories in an astronomy journal that would immediately impact the wider culture.
Compare astronomy to sociology, in which questionnaire results can be fabricated and are essentially unconfirmable without using other unconfirmable questionnaires. Compare astronomy to economics, in which articles can be used to bolster political positions.
My instinct about numbers (i.e. incl. statistics) is that they should be reproducible: on-demand, with results and workings easily subjected to interrogation.
In reality in many fields - my experience is in finance, but also in science - calculations are scattered across different languages and systems. This adds friction to any process about reproducing, understanding, analysing numbers.
For the HN crew, it's an under-development(!) functional language with properties that permit flexible designs that can scale. It's for numbers, and if you share the model alongside your numbers, people can check them against the model and see the workings.
It goes well with visual number tools (for developers and non-developers alike), one of which I will release soon; in the future, watch for a browser extension for the workings behind numbers you are reading.
Importantly, it won't address the raw-data part of the problem, but where the numbers that follow from that data are concerned, it might get closer to that instinct.
I love the idea. But do be aware that the high-quality software engineering something like this would require has a cost: be prepared for basic science costs to scale appropriately (or for productivity to be reduced). More to the point, not all science can be usefully verified by unit tests. Even machine-verified mathematical proofs are entirely dependent on the definitions being correct, and that requires expert human analysis.
Given the non-modular budget for an NIH R01 has not been updated since Clinton was president, there is absolutely no preparation for basic science costs to scale appropriately with inflation, let alone adding something that is a lot of work for relatively minimal per-paper return.
> Kahn says that, although there will undoubtedly be positive uses of AI to support researchers writing papers, it will still be necessary to distinguish between legitimate papers written with AI and those that have been completely fabricated.
Unfortunately this assumes that the only thing which can go wrong is lack of reproducibility. Not so. I read a lot of public health papers during COVID and a staggering quantity (IMHO nearly all of them) should not have been published; many of them would have been reproducible despite that.
Other things that can and do regularly appear in reproducible, peer reviewed papers:
• Nonsensical methodologies
• Logical fallacies
• Misrepresentation of their data
• Incorrectly implemented software
• Source datasets that are cherry-picked
One might think that things like incorrectly implemented software would fall under the umbrella of irreproducible research, but that won't work. Some fields don't really recognize a distinction between model implementation and specification. The model is the implementation and if a description was once published it's quite possibly either too vague to implement, out of date, wrong, or all three. IIRC Prof Neil Ferguson's team actually rejected an attempted replication of one of their epidemic models on the basis that only they were qualified to use it! Pseudo-science like this goes unremarked in universities, only outsiders seem to care.
Sometimes you get an honest academic who knows what they're supposed to do and actually does it, but there's no observable benefit to them from doing so because the ones who don't bother don't seem to suffer any consequences.
tl;dr The biggest problem with the replication crisis is the framing of it as being about replication. What the world actually faces is a misleading research crisis. You can drive the replicability rate to 100% and you'll still find whole fields consisting of logical fallacies and misinformation.
I think you'll be demonstrated to be wrong in short order, over the next 1-2 years.
My called shot: we'll have fully automated scientific research with useful and valuable conclusions and little to no human intervention. I think it will take some time, but I can tell you right now it's on its way to happening.
Instead of fighting against "paper mills", let’s fight against journals. There are strong arguments against the need for peer-review for example [1]. Science does not get worse when there are more bad papers, science gets better when there are more good papers. Most papers in AI aren’t even reviewed by peers or editor and guess what: we have lots of progress happening. I’m not saying that is because the lack of review, I’m saying that reviews are not necessary for progress. Furthermore, was the iPhone good because it was reviewed by a board of "independent" reviewers? No it wasn’t. Let’s just ignore Nature. Good papers are good because they have good arguments and if they are not good, then time will tell. Papers are not good simply because Nature (TM) put a stamp on it.
Re: "Science does not get worse when there are more bad papers"...
I don't think that this is true at all. Weeding through bad papers is, at a minimum, an opportunity cost, as is a good paper built on top of a bad one.
Also, there is a societal cost in that bad research can get picked up and believed by people, like the anti-vax crowd. Or, bad research can be used to push an agenda, like anti-climate change.
The solution to this is reputation. The flaw is that we're using the reputation of for-profit journals rather than the research institutions.
You shouldn't have confidence in a paper because it was published in Nature, you should have confidence in it because it was published by Harvard or the University of California or Google Research, who puts the name of their institution on it and thereby stakes their reputation. For the preeminent researchers in a field, their own names may be enough for people in that field to trust the result.
You can still have a paper published by nobody, and if the results are interesting, researchers at known institutions can try to replicate it. Until then it's nothing. But if they do, the nobody gets cited and credited with being the first and takes a step to building their own reputation, and the known institution gets credited with knowing when to spend resources following up on something significant.
The research institutes are also for profit bodies, just much more of a closed shop, because there's only so many people they can employ at any one time (and that's before we even get into tenure, or the fact that all you're really doing is shifting the peer-review burden onto people that report into the person who wrote the paper).
You can have papers published by Google Research or by nobody now: turns out academics like the journals' attempts to curate interesting stuff more than only ever reading stuff on their bookmark list or that's mailed to them in a desperate attempt to get some engagement. And for all that they might not like the long delays and low chance of success associated with submitting to Nature, they like the idea of trying to persuade an academic at an elite institution to go to the trouble of replicating their research as the only means to build their reputation even less...
> The research institutes are also for profit bodies
Their business model isn't to put the research behind a paywall, which is the relevant distinction.
> turns out academics like the journals' attempts to curate interesting stuff more than only ever reading stuff on their bookmark list or that's mailed to them in a desperate attempt to get some engagement.
This is what conferences are for, and then you have the conference organizers curating the talks but everyone gets the list of talks and links to all the freely available papers even if they can't afford to attend the conference.
A research institution can produce bad researchers (Harvard, UC, and Google included), so we can't go with this suggestion either, instead we'd need reputation for each individual researcher in the world.
how many such reputations can you keep track of and use to gauge trust?
and how is reputation gauged if peers don't review work?
> A research institution can produce bad researchers (Harvard, UC, and Google included), so we can't go with this suggestion either
Why not? The same is true of an individual researcher. But if they do, it damages their reputation, so they have an incentive to not. The same as the institution.
because it can produce bad researchers, and thus cannot be trusted to indicate good vs. bad researchers
thus, it is a bad proxy for trust of individual researchers
>The same is true of an individual researcher
individual researchers are not research institutions, so the same literally cannot be true: one employs the other, the inverse is not true
> if they do, it damages their reputation, so they have an incentive to not
empirically speaking, they objectively do, and their reputation is not damaged, and thus above proposition about reputation and incentives does not appear to be true enough to stop it from happening
recall the topic is how to gauge individual researcher reputation in the first place. Either we do it on an individual basis or a group/heuristic basis, and of the latter, research publications are a better proxy than what school one went to, but the former is better than both
You are correct. Many people want research to work like a social media site where everyone can contribute, believing this is an egalitarian--and therefore better--solution. In reality, it would slow research to a crawl, as noise and conflicted interests dominate the conversation and drown out more rigorous or valuable information.
> noise and conflicted interests dominate the conversation and drown out more rigorous or valuable information.
That's actually a great point. Who says this isn't already the case? There isn't much evidence to suggest that things have gotten better after the "Why Most Published Research Findings Are False" paper in 2005. I think people should be extremely skeptical of anything they read, regardless of a peer review stamp.
>I think people should be extremely skeptical of anything they read
This is a mistake that many HN readers make; they think that if one is equipped with some above-average level of intelligence, one can discern the validity of new research. But this is wrong. It usually takes many years of study before one can begin to clearly understand what is even being said, let alone whether it has any veracity.
People of above-average intelligence often resent this fact, because it suggests that their smart opinion isn't as valuable as the opinion of an expert, but that is simply a sad fact of life. If they haven't put in the work in that field, then they don't know what they're talking about in that field, and they are incapable of applying skepticism in that field correctly. This is precisely and unfortunately why we are forced to place our trust in experts.
In practice that's not the case. Maybe for some fields it is, but there are definitely plenty of fields where minimal expertise is required to do basic sanity checks of a paper. And there are many generic 'tells' that are indicative of issues across a wide range of fields, e.g. if a paper is reporting mostly P = 0.049 results, well ...
As an example of an issue that doesn't require much expertise to spot, I was reading a health paper a couple of days ago that reported on an intervention in a specific population. The paper said words to the effect of "[the intervention] was effective, especially for men". The associated chart not only had (somehow?!) got a legend that didn't match the actual chart lines, but, whilst the line for men did indeed go down, the line for women was very clearly flat! This should have been reported as "no effect in women but an effect in men" but wasn't. When the wording doesn't match the data being reported, that's a good sign in any field that the researchers know they're treading on thin ice. That particular claim was a correlation/causation fallacy anyway, which is very common in health. They didn't have any proof it was their intervention causing the reduction, and there were a bunch of reasons to suspect it wasn't. But the intervention was long-term and high-effort, so it's not a surprise they wanted to find something.
This is a great example of what I'm talking about. A p-value of 0.049 is not a "tell"; that is an HN meme. You can have p-values above 5 percent and the paper can still convey novel and valid research, depending on the nature of the study. The 5-percent value is a common cutoff, but it is a largely arbitrary one. Sometimes there is justification for a higher one, and sometimes it should be lower.
To gauge what sort of p-value suggests a useful model requires familiarity with the area and the broader context of the research.
That's the point - it's an arbitrary cutoff, so the high frequency with which results cluster around that point is indicative of a problem.
If a field accepts P=0.05 as significant it means 1 in 20 results can be false positives just by random chance. Now think about how many papers get published, and many of them report more than one thing. The right threshold should really be an order of magnitude lower. It's not OK for scientists to report FPs at that rate, and that's a big part of the reason for declining confidence in science.
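To be precise about the narrow sense in which the 5% figure holds, here's a toy simulation: when there is genuinely no effect, a p < 0.05 threshold still flags roughly 5% of experiments. It deliberately says nothing about how likely any particular published "significant" result is to be false, which depends on base rates and power.

```python
# Toy simulation: Type I error rate of a p < 0.05 threshold when the null is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 10_000, 30

false_positives = 0
for _ in range(n_experiments):
    # Two groups drawn from the same distribution: no real effect exists.
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(false_positives / n_experiments)  # comes out near 0.05
```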
>If a field accepts P=0.05 as significant it means 1 in 20 results can be false positives just by random chance.
That is not the correct interpretation of a p-value. See the ASA's statement on p-values.[0]
>Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.
Moreover, while it's fine to suspect that p-hacking took place with p-values just under 5 percent, it is merely a suspicion and nothing more. Throwing out p-values you don't like without further evidence is a different sort of violation of the methodology; it is like p-hacking in the other direction.
>The right threshold should really be an order of magnitude lower.
This is false. Again, see the ASA's statement.[0]
>Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making. A conclusion does not immediately become “true” on one side of the divide and “false” on the other. Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis. Pragmatic considerations often require binary, “yes-no” decisions, but this does not mean that p-values alone can ensure that a decision is correct or incorrect. The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.
In short, these things often require a strong statistical literacy to interpret, which most people complaining about p-hacking do not possess.
Yes, obviously the whole notion of a hard threshold is a bit nonsensical to begin with, but as the ASA statement says "Pragmatic considerations often require binary, yes-no decisions". There doesn't seem any way around that. People face decisions like, shall we continue to fund this investigation? Yes/no. Should we recommend lifestyle changes to the public? Yes/no. It doesn't make sense to try and map a P value into a budget, for example. At some point you need a threshold (and likewise for effect size and other things). Of course at some level there is fuzzyness, which is why I said if a paper is mostly reporting 0.049 values then ... and I didn't specify what, exactly because the conclusion should be something like "fuzzily suspicious and should look closer".
So it's fine for the ASA to complain about statistical significance leading to "distortion of the scientific process", but their proposed alternative can be boiled down to doing more peer review, as most of the things they tell researchers to look at are things researchers will never conclude in the negative about their own results, like study design (because if they did they wouldn't have got to the point of calculating a P value to begin with).
Finally, I disagree when you say "That is false". Scientists aren't going to stop using statistical significance thresholds because they do need to make binary decisions at some point, and dropping the threshold they currently use by 10x would immediately yield major improvements in replicability and robustness.
Anyway, put P-hacking to one side if you dislike that discussion, it's fine. It's not actually the thing that bothers me the most when I read papers. A big gap between the prose summaries and what the data [analysis] actually shows is a much more common source of distortion, IMO.
>dropping the threshold they currently use by 10x would immediately yield major improvements in replicability and robustness
This is precisely the sort of misunderstanding that surrounds p-values, and it is what the statement aims to correct. Lowering the standard p-value threshold does not imply an improvement in these factors. The problems of p-hacking and creating false positives lie in the disclosure of the methodology, not the threshold of p-value. That is the whole point.
This will be my last comment in this chain. I'm just trying to clear up some persistent misconceptions that I see on the internet.
In many cases results can easily be independently verified. This is why it works for AI. If you publish a result, it should come with code anybody can run. If the code doesn't exist or doesn't do what you say it does, you're a fool and everyone can ignore you. If it does, you don't need anyone's stamp of approval to prove it.
But that doesn't work with medical trials or things of that nature where independently verifying the claims is expensive.
> Also, there is a societal cost in that bad research can get picked up and believed by people, like the anti-vax crowd. Or, bad research can be used to push an agenda, like anti-climate change.
Responsible media would act as a gatekeeper. The problem is, most media utterly gutted scientific journalists for more profit, so they completely lack the basis to evaluate and supply context on research. Others, particularly boulevard media, willfully ignore any kind of ethics for clicks.
On top of that comes a general media illiteracy and media distrust that makes it even harder to combat because there's an awful lot of media that thrives on intentionally pushing crap to people.
the gatekeeper job belongs to the publication, rather than people reporting on what the publication chose to publish, especially when the journalists reporting on it aren't experts in the field (which we obviously can't expect them to be for everything they report on)
How does one come across these bad papers in the first place, such that they must be weeded through? Maybe we need a different method of curation such that they stay below the noise floor.
No, science definitely gets worse when there are more bad papers.
There's a popular argument about freedom of speech that comes down to "the solution to bad speech is more good speech, not limiting bad speech," but that only works (assuming it's true at all) when there's some way for a listener to discern truth with some research or effort.
> "Good papers are good because they have good arguments"
No they are not. Good papers are good when they collected data correctly and then presented the results fairly. Arguments about what that data means hardly enters into it. You can't argue whether the data is correct or not without gathering it yourself, and you can't gather it yourself.
Yeah, the problem with that "more speech" argument is the imbalance... lies are way easier and faster to generate than truth, and can be crafted perfectly to play on what the audience wants to hear. The truth has to be the truth, so it can't be crafted in the same way.
> There's a popular argument about freedom of speech that comes down to "the solution to bad speech is more good speech, not limiting bad speech," but that only works (assuming it's true at all) when there's some way for a listener to discern truth with some research or effort.
Yes exactly. If it is too hard for a group of scientists to figure out a flaw, then it is also too hard for two peer reviewers to figure out a flaw. Only time can detect all errors in science, assuming that nonconformist papers are allowed into the conversation.
> Good papers are good when they collected data correctly and then presented the results fairly.
This is only true for empirical papers. Good data is a part of a good argument in the case of empirical papers.
> If it is too hard for a group of scientists to figure out a flaw, then it is also too hard for two peer reviewers to figure out a flaw.
The issue isn't that flawed papers are necessarily hard to spot, the issue is that indiscriminate publishing means that anyone studying a topic has to wade through a lot of flawed ones before they find anything halfway useful. Time and attention are finite. Peer review and journals caring about reputation, in theory, caps the number of readers of rubbish papers at 2 (and disincentivises writing unpublishable crap to submit in the first place... although this is now eroded since generating rubbish papers is now effort-free)
> You can't argue whether the data is correct or not without gathering it yourself, and you can't gather it yourself.
You actually can argue about the data being "correct".
The way to do this is for studies to be much more radically open with their raw data sources.
In AI, that would mean actually releasing the code you used, so people can try it themselves.
Or, in more sociology stuff, where you are doing, I don't know, interviews with humans, you could release the physical videos of your interviews.
And then for science stuff, show pictures/videos of the science you are doing.
I'm sure there would still be ways to hack this. But significantly more code and data transparency would do a huge amount for allowing other people to replicate or verify your work.
This line of thinking is dangerous and ignores very real problems as the other comments point out.
Instead of targeting peer review, I recommend approaching this problem as one of incentives. Under the current system, what is the incentive for a reviewer to read and critique a paper? None. If they were paid in cash, there would be a clear incentive. The publishers do not want this overhead and the associated legal requirements, so they seek out "volunteers" and "compensate" them with rubbish like book copies and "reputation". The fault is with publishers, not the peer-review process.
At the moment, the current model allows parasites like Elsevier to derive the maximum benefit while not paying most folks involved in creating the actual value - the reviewers, editors and authors.
Citizen journalism tried this, failed and is now hollowing out established journalism. We need proper funding and proper incentives for journalism just as we need proper incentives for academic research.
The incentives of paid peer review would not be easy to get right. Especially if you want the compensation to be high enough that it would actually be an incentive to the reviewer.
Many universities limit the number of hours their faculty can commit to paid external activities. Paid peer review would be one of those activities, and it would have to compete against other activities. Such as consulting, which can be very lucrative in some fields.
So maybe you have to pay $10k peer review fee when you submit a paper. And then it gets rejected after the first round of reviews, because you aimed too high or the editor and the reviewers just didn't like the topic. You resubmit to another journal and pay another $10k. After a couple of additional rounds of reviews ($5k each), the journal seems to be interested in publishing the paper. But reviewer 2 wants you to cite some of their papers that are not really that relevant. Do you agree, or do you argue against and risk another round of reviews (another $5k)? Or maybe you get the feeling that reviewer 3 is stalling the process with superficial requirements, as they get easy money from the reviews.
Except that most academics can't afford that. American academics are massively overpaid by global standards, and peer review would naturally be outsourced to developing countries, where the expectations of pay are more reasonable. Many of the academics there are reputable, after all. Unfortunately the institutional culture is often more problematic, and corruption also tends to be more prevalent. Do we really want to let those institutions shape the practices of science worldwide?
One of the unfortunate features of capitalism is that being a middleman is more profitable than doing the actual work. Instead of a system where you pay $x for reviews, we could end up with one where you pay $1.1x to a middleman, who then pays $0.8x for the reviews. The middlemen get even richer than in the current system, because there is more money in publishing than there used to be.
This makes a good point but is largely a strawman. Review fees do not need to be this high. At $25 an hour, most engineering papers take about 2-5 hours to review. That's $125 per reviewer, or $500 for three reviewers and one editor, plus another $500 for publication fees. That's $1,000 in total, and the number of reviewers can be fewer.
At the moment, greed drives open access publication fees. Journals charge thousands of dollars because they can. There is no logical justification of the cost. There are several journals that charge hundreds of euros for review although they do not pay reviewers.
Note also that not all review needs to be done by academics. Folks in industry can contribute equally, but their employer pays for their time, so that boils down to altruism or business priorities. A reasonable payment for time creates an incentive for the world beyond academia, and this is a net positive. The payment does not need to be prohibitive, just enough that the reviewer has an incentive to do a good job and isn't pressured to rush. The problem thus is not limited to the journals or the peer review process; it is the incentives.
I was thinking more about something like $200 to $500/hour.
A small payment like $200 can be more of a disincentive than an incentive. You have to deal with bureaucracy to get paid and more bureaucracy to pay taxes for it. And if the money comes from a foreign source, amount of the compliance bureaucracy can be absurd. Getting paid is simply not worth it, if the payment is too small.
>> Most papers in AI aren’t even reviewed by peers or editor and guess what: we have lots of progress happening.
If by progress you mean a constant creep of SOTA on meaningless benchmarks, then yeah, we have "progress", but that progress is a strange kind of progress that is measured only on its own, self-selected criteria, and that does nothing to advance the total sum of knowledge. The main beneficiaries of this "progress" are large technology corporations who have now taken over research in AI and are turning it to profit. There has definitely been progress in money-making schemes and personal aggrandisement schemes of charlatans and mountebanks, in AI, aplenty.
If by progress you mean scientific progress, then, no, there has been none of that in recent years. It is questionable whether there has ever been any kind of scientific progress associated with AI. While early pioneers of AI, like John McCarthy and Claude Shannon, wished to establish a scientific programme of research into human and machine intelligence, with the purpose of understanding the former by developing the latter, this programme was soon set aside, and activity concentrated instead on what McCarthy used to call the "look ma, no hands" disease of AI:
>> Much work in AI has the "look ma, no hands" disease. Someone programs a computer to do something no computer has done before and writes a paper pointing out that the computer did it. The paper is not directed to the identification and study of intellectual mechanisms and often contains no coherent account of how the program works at all.
The cases you mentioned have clear and immediate commercial incentives that push them toward quality. Not quite so for basic research that might (optimistically) be decades away from commercialization and will probably be commercialized from someone very very far away in the value chain from the original researcher.
FWIW I'm also really critical of peer review/journals, just don't think this is a great line of analogy.
I've found that most people who are deeply critical of peer review as a process (and not just of how shady modern journals are) have never engaged in serious scientific work. They have no idea how lost in the weeds researchers can get, and how useful it is to have a somewhat uninformed third party check whether they can parse your prose before deciding the work is understandable. Commercialization really only applies to a very small subset of scientific discoveries.
Let's be real, there are a lot of papers in AI right now because it's been hot, not just recently but for years. Still, no one is reading random papers from insert-unknown-group-here. People read papers from already established scientists who have pedigree. And, how did they become established? Because they actually published before arxiv.
In that sense, peer review should be an equalizer: without it, the only way to be read and cited is to have the holy academic lineage, which is exactly what would happen if we abolished peer review as it exists in the current academic world.
I forget which episode, but Andrew Huberman discussed somewhere how the incentives behind getting published in science journals and the peer review process often lead to poor studies.
I understand how journals made sense during a time when distribution was a hard problem. Now, with the internet, I don't get the reason for them to exist in the form they do in 2023.
A journal is a way of ensuring that the paper you are about to read meets a minimum standard of quality. My anecdote: the second-worst paper I've ever read was an ArXiv paper, presented by a young researcher who had no idea how bad it was. Had I not torn it apart (which was a very unpleasant experience for everyone involved), it would have been further propagated by other young researchers. And the worst paper? Submitted to a conference, rejected outright during review, and I still wonder if someone was trying to sneak in a machine-generated article for fun.
No one has the time to keep up with all the research coming out, not to mention all the fake research published in bad faith. Life is not restricted to Elsevier and ArXiv, and there are sane alternatives out there.
Prediction: as a result of fake research and its ilk (fake credentials, fake trade-journal pubs, fake reviews, etc.), we will observe a renewed interest in referrals and meatspace networking. It will become correspondingly difficult for outsiders to enter professional circles.
On the other hand, maybe we'll abandon peer review and return to a pre-1950s style? Everyone in ML just uses arxiv because the field is moving so fast. All work is realistically peer reviewed once it is out there. We treat conferences (more important than journals in ML) as very noisy signals. Unfortunately, we still use those as metrics for completing degrees or hiring people.
The way I would try to make this better is to post to OpenReview rather than arxiv, or to integrate the two. That way, discussions about the work can happen in the open and in public.
I think peer review exists for reasons beyond building or affirming the reputation of a scholar. In particular, it’s regarded as a filter for eliminating the very worst-quality research. And judging from the manuscripts that came across my desk when I was still in academia, I’d say it’s performing this particular duty reasonably well. As such, the connection between AI fakery and peer-review isn’t all that clear cut to me.
I might expect the peer-review pipeline to become inundated with prima facie credible manuscripts, which would overwhelm the reviewer’s ability to process publications even more than is already the case. At this point, I’d expect scholarly reputation to become even more important for getting published, and I’d expect to see an increase in signaling and other out-of-band communications (e.g. talking about your draft at a conference). Critically, I’d expect this to generally avoid AI generated “studies”, despite the obvious drawbacks.
And moreover, I actually have serious reservations about the project of purging the scientific landscape of informal processes. While there are some obvious and notorious drawbacks to the way things currently work, I am also wary of top-down, "authoritarian high-modernist" (to borrow the turn of phrase from James C. Scott's Seeing Like a State) projects in all their forms, especially in science. The production of scientific knowledge is not pure techne; it cannot be reduced to a well-behaved formula. A large part of it is a craft requiring a significant share of informal processes in order to be truly creative. But that is a separate topic, and indeed, one might argue that doing away with formal peer review altogether would be a step in the direction of informality.
So at the end of the day, I’m not convinced by any predictions that relate the effects of AI on peer-review. I think it could go either way :)
I mostly agree with your assessment of peer review, but I feel the system has gone too far: it's decades of metric hacking going unresolved. My research area is ML, and there it is definitely far out of hand; when I was in physics I'd have been closer to your position. But right now we have measurements concluding that "reviewers are good at identifying bad papers, but bad at identifying good papers." I think we either need to accept more works (as long as they are valid and useful) and/or reduce the value of a top-tier venue (which right now is top-or-nothing). I think papers should get multiple rounds, as our goal is to improve works, and submissions should always be rolling. This can help because, as you say, peer review should be about eliminating the worst, not identifying the best (far too difficult, as hundreds of years of evidence shows).
I think one important thing venues could do is host the data and code/tools for works. Hard-drive space is rather cheap now (especially with tax benefits and donations) and this can only result in high value to the community. The other thing is having formats like OpenReview, where works can continuously be discussed in the open, with authors and others able to defend, criticize, and question them. But I think there should be some filter, even if a low one, and rules for quality control.
For the most part, I do actually think getting rid of venue-based review would be a step in the right direction. I use these words because I like to encourage the idea that open publication still leads to peer reviewing. In ML a lot is done with arxiv + twitter; I just think we should better formalize this. It allows for a lot of freedom and for high speed. It has problems, but I don't see them as any worse than the current system's (I see an overall decrease in problems). Good science requires risks and a lot of creativity. The modern world's advantage is that we have far more researchers, and Monte Carlo sampling in parallel helps the optimization; you want to encourage tail-end samples too, to escape local minima. The history of science is a history of upsets, and you don't get upsets by doing what everyone else is doing. I agree that top-down hierarchies discourage such thinking and are a net negative for science and for advancing human knowledge. Creativity is critical.
For predictions, I agree. I lean towards predicting disruption, but I'm not even certain of that. And disruption can go many different ways. I'm glad we're starting the discussions, but I think they need to be deeper and more honest.
Elimination of venue-based reviews is an interesting proposition. I’ll have to think on that.
The contrast between ML and physics is equally interesting, and seems to corroborate the experience of another poster, who expresses doubt about removing peer-review from long-haul studies in biological sciences. I’m still struggling to put a finger on why exactly this difference might exist, though.
The value of venues I see as facilitating networking and collaboration. Maybe these could be centered around invited speakers and presenters?
For the difference between ML and physics/other sciences, there are a few points that I see.
- First, ML submits to conferences rather than journals. This typically means you have one shot at acceptance, and possibly one chance at rebuttal. Other fields use journals, where there is often a push to accept and the reviewer's job is to make the paper better rather than to filter absolutely. In the conference model, no one other than the authors actually advocates for the paper to be published (no incentive to accept). Worse, it is a zero-sum game, so if reviewers are also authors of other submissions (common), their optimal strategy is to reject other players' works (a large incentive to reject).
- Second, ML is extremely hype-driven. The field moves insanely fast. This leads to incremental improvements (technically true of all science), but it muddies the waters: it is easy to reject for low novelty, yet if you don't release your work fast you get scooped and rejected for, again, low novelty. Unfortunately, we don't define concurrent work well (I've had works rejected because we did not cite papers that were publicly released on arxiv around the same time as ours or, worse, after we submitted to the conference!).
- Third, there are a large number of people submitting works (nearly 10k per conference), the field is wide (reducing domain expertise), and it is highly empirical (few learn the theory and foundations, relying instead on benchmark results for evaluation). The size brings a lot of noise, and the empirical nature reduces rigor (noise in evaluation). The large number of works also results in low reviewer quality as chairs scramble to assign reviewers (in my last review, 3 of 4 reviewers self-reported 3/5 confidence scores, meaning they did not work in this domain). So we just have way more sources of noise than many other fields. Consistency experiments support this as well, finding that reviewers are good at rejecting bad works but bad at identifying good works, which suggests that reviewers over-reject (the default strategy).
I also appreciate your opinions. I certainly don't have all the answers. As I see it, we got to decide as a community. The only way to do that is to hear complaints, proposals, and challenges. We're all in this together so to me it is most important that we openly discuss these. Thank you for your thoughts and comments.
I don't think ML is a good example, because it's an outlier within an outlier.
CS is already a weird field, because it leans so heavily on conference papers that see a single round of reviews. Journal papers with multiple rounds of reviews are more common in other STEM fields. Because of this difference, peer review in CS is more focused on accepting/rejecting the paper than on improving it.
And then the CS model breaks down in ML, which is far too popular for its own good. There are too many people working on the same topics, writing too many papers and submitting them to too few conferences, which have too many people attending. CS conferences are supposed to be more like community meetings than formal conferences. Informal discussions are the main point, but that doesn't scale beyond a few hundred attendees.
Honestly, I agree with every word here. I'd rather see conferences as networking events than quality filters. Conferences should invite published works. The conference system just adds noise, and like you are suggesting, the single round results in a zero sum game and no advocates for the paper to be accepted/improved. Thus reviewers optimize their strategy: reject everything.
While I still have problems with peer review and still advocate for open publication, I do think many of the specific issues in ML could be resolved by focusing on journals rather than conferences.
It's easy to do this with anything in computer science as others can implement or recreate whatever is being discussed easily. Not so with a 6 month experiment with several groups of humans, or even a 2 week one with rats.
This is not always true. ML has reproducibility issues despite the status quo being open models. Even if we exclude proprietary datasets there are still issues. Compute is one, but we can even ignore that[0]. The Lottery Ticket[1,2] plays a big role, in that you can just get lucky. This really should have resulted in treating benchmarks as weaker indicators, but see other comments about reviewing incentives. Another issue is that it is status quo to optimize hyperparameters on test data, which results in information leakage. While this won't affect the results of running a checkpoint, it does create issues in reproducing the work, and it adds noise. Generative papers also have a large issue in not showing random, uncurated samples, which introduces huge biases; we can argue this is a reproducibility issue too, since you can't reproduce the results shown in the works. Anyone who has played with things like Stable Diffusion (or any generative model) will be familiar with this.
There are more nuanced issues and things that require intimate domain knowledge to fully understand, but I wanted to push back at this comment because I see a lot of people just brush off reproducibility concerns by pointing to GitHub. While it helps, it definitely doesn't make reproducibility "easy" or concerns nonexistent.
[0] Sometimes we can't though as scaling has more effects than obvious. e.g. GANs can't scale batch on a per GPU level but scaling by multiple GPUs/nodes does give an advantage in quality (not just training times). This is non-obvious and many things can play a role.
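To make the test-set tuning point above concrete for readers outside ML, here is a minimal sketch contrasting honest hyperparameter selection on a validation split with the leaky practice of selecting on the test set; the dataset, model, and hyperparameter grid are placeholders, not anything from a real paper:

```python
# Sketch: where model selection happens determines whether the reported test score is honest.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

grid = [0.01, 0.1, 1.0, 10.0]  # candidate regularisation strengths

def fit(c):
    return LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)

# Honest: pick the hyperparameter on the validation split, touch the test set exactly once.
best_c = max(grid, key=lambda c: fit(c).score(X_val, y_val))
honest_score = fit(best_c).score(X_test, y_test)

# Leaky: pick whichever hyperparameter maximises the test score itself.
# The reported number is now optimistically biased -- the information leakage described above.
leaky_score = max(fit(c).score(X_test, y_test) for c in grid)
```

The leaky number tends to look better precisely because the test set was consulted during selection, which is part of why results tuned this way are hard to reproduce on fresh data.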
That doesn't really seem relevant to the discussion. AI obviously makes it 1000x easier to generate realistic-looking fake papers, so it "intensifies" the issue as this article explains. It is newsworthy and discussion-worthy, and as you said AI doesn't have agency so we don't need to worry about hurting its feelings by referencing it in a headline.
The only reason it's a problem is that papers became the measure - more papers = more prestige. The measure will simply change because now it won't have a strong signal.
That is not the only reason, and it's not the most important one. Papers are useful in the real world: people use them to learn things and reference them as evidence. If you're trying to learn about a topic and you have to wade through 90% bullshit AI papers when searching for it, and you aren't experienced enough to immediately tell which papers are real, that's a problem. If malicious actors are able to spread disinformation that cites very real-looking research, or pitch journalists with very real-looking papers, that's a problem.
No, papers are not always useful. That isn't an inherent property. They can be useful. It depends on the content of the paper.
If you can't tell whether a paper is real or bullshit it doesn't matter if an AI or a human created it. Journalists will, heaven forbid, have to do real journalism instead of blindly trusting some piece of paper they found somewhere.
> If you can't tell whether a paper is real or bullshit it doesn't matter if an AI or a human created it.
That's true, but now it's incredibly easy to create a human-quality bullshit paper. That wasn't true a year ago. It doesn't matter who created it, but it does matter how easy it was to create.
> Journalists will, heaven forbid, have to do real journalism instead of blindly trusting some piece of paper they found somewhere.
That's true, but now it's incredibly easy to generate a real-looking piece of paper and send it to a journalist, prompting them to spend a day doing real journalism to verify it. That wasn't true a year ago.
> That's true, but now it's incredibly easy to create a human-quality bullshit paper.
No, it's incredibly easy to create human-quality syntax. Not meaning. The source of the value was never the paper or the syntax - it was the meaning of the paper. Whether a human or an AI creates bullshit matters not.
Legitimate journals (or groups of journals) might need to institute a throttling mechanism to limit the rate at which individual authors can submit papers. They might also need some kind of reputation tracking system so that reviewers can flag articles as low quality, and then future submissions by the same author will be treated as lower priority. Open access journals can raise author fees.
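A minimal sketch of what that throttling plus reputation tracking might look like on the journal side; the yearly limit, penalty size, and field names are illustrative assumptions, not anything a real publisher uses:

```python
# Sketch: per-author submission throttling plus a simple reviewer-flag reputation score.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_SUBMISSIONS_PER_YEAR = 6   # assumed throttle
FLAG_PENALTY = 0.2             # reputation lost per "low quality" flag from a reviewer

@dataclass
class AuthorRecord:
    submissions: list = field(default_factory=list)  # submission timestamps
    reputation: float = 1.0                          # 1.0 = clean record

    def can_submit(self, now: datetime) -> bool:
        recent = [t for t in self.submissions if now - t < timedelta(days=365)]
        return len(recent) < MAX_SUBMISSIONS_PER_YEAR

    def record_submission(self, now: datetime) -> None:
        self.submissions.append(now)

    def flag_low_quality(self) -> None:
        self.reputation = max(0.0, self.reputation - FLAG_PENALTY)

def triage_order(authors: dict[str, AuthorRecord]) -> list[str]:
    # Low-reputation authors go to the back of the editorial queue rather than being banned outright.
    return sorted(authors, key=lambda name: authors[name].reputation, reverse=True)
```

The point is not the specific numbers but that both mechanisms are cheap to run and degrade gracefully: a flagged author is deprioritised, not blacklisted.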
Punish the people using it, disallow and shame people who are found to use it. The usual stuff when you want people to stop doing things. (Not legally ban the whole population from making use of it even in cases where it's productive.)
Illegal guns kill people. Something like 80-90% of guns involved in homicide are illegally obtained/possessed. This is relevant because like banning guns, “banning” AI shouldn’t be the focus. It seems like an AI detection arms race is inevitable.
Illegal guns start as legal guns. With both AI and guns you need to be willing to think about the problem at every step in the chain, not just at the end.
To be pedantic, there are very few actually illegal guns. Mostly we are talking about crimes committed by felons who, because of prior convictions, are no longer legally allowed access to firearms.
That same gun, if possessed by a non-felon, would be legal.
I don't know if this is or isn't important to the analogy, but I do think it applies.
Likely, yes. One of the practical difficulties in the US, however, is that the pool has historically been replenished by the trafficking of guns. So yes, reducing the pool of available guns is prima facie sensible, but it takes a fair bit more than declaring guns illegal. Further complicating the issue is the fact that guns have a very long service life. I have an FB Vis from 1942 that still works flawlessly, despite having seen years of service in actual war. With ordinary care and the replacement of high-wear parts (the extractor, mostly, and maybe the barrel), I’d expect it to run at least another 100k rounds.
The proposition to reduce gun violence by reducing the number of guns in circulation has to account for the guns already in circulation, and for the propensity of black markets to introduce new guns into circulation. Otherwise, the only thing achieved is the disarming of the groups of people with legitimate uses for firearms.
It’s not as simple as “guns are the problem”, nor is it as simple as “more guns are the solution”, I’m afraid.
Wait a moment. If 80-90% of guns used to kill people are illegally obtained, that means the laws successfully generate a regime by which guns are obtained much more safely. Now all that remains is to enforce the laws and crack down on illegal gun sales (although in USA that isn’t all since old guns already purchased illegally will continue to work for decades)
It’s like when people were saying that 80-90% of people dying in hospitals were unvaccinated. So vaccination helped right?
From my perspective as an outsider, I have always been amazed by the use of papers in academic research as a means of communicating findings to the wider world. I find it problematic that these papers are often formatted in a way that makes them highly unreadable, with two columns and compressed text. In my opinion, adopting more modern methods of publishing research could greatly enhance the overall quality of research by making papers more accessible and increasing the likelihood of them being read.
Imagine a scenario where there is a standardized format for academic papers, where the conclusion is explicitly derived from specific data and accompanied by confidence intervals. This standardized schema would enable easier referencing of other papers and easy incorporation of additional data through features like autocomplete. Implementing such a system could potentially reverse the trend of academic papers that use excessive and unnecessary language to appear more intellectually rigorous, even when the actual information being conveyed is limited.
By embracing these changes, we could create a more transparent and efficient research environment that promotes clearer communication and enhances the impact of academic findings.
Whatever you think academic research is, it's not always like that, and it's usually easy to find counterexamples. That's why the attempts to standardize the processes usually fail.
There is not necessarily any data behind the paper. Even if there is data, the conclusions may not be about the data. Even if the conclusions are about the data, the paper may not use quantitative methods. Even if it uses quantitative methods, confidence intervals may not make sense. Even if confidence intervals do make sense, adding new data might not. And so on.
Then you don't know how to read papers. First read the abstract, then the conclusion/discussion, then the methodology to see if the study was conducted reasonably. Papers are made to be read by scientists; dumbing them down to be understandable by laymen would waste time for little gain. That is the job of science communicators and journalists, and even they often do a pretty poor job of it.
There was an interesting claim I heard from Eric Weinstein regarding peer review that he characterized as a cancerous infiltration into all of science from some medical area? Aha, found it in my notes...
"""
1:32:28 Eric: but let me jump in--peer review is a cancer from outer space. It came from the biomedical community, it invaded science. The old system, because I have to say this because many people who are now professional scientists have an idea that peer review has always been in our literature and it absolutely [mff __ ] has not.
Bret: right
Eric: Okay. It used to be that the editor of a journal took responsibility for the quality of the journal which is why we had things like nature crop up in the first place because they had courageous, knowledgeable, forward-thinking editors. And so I just want to be very clear because there is a mind virus out there that says peer review is the sine qua non of scientific excellence, yada, yada, yada, bs, bs, bs. And if you don't believe me go back and learn that this is a recent invasive problem in the sciences.
Bret: recent invasive problem that has no justification for existing in light of the fact
Eric: Well not only does it have no justification for existing... When Watson and Crick did the double helix and this is the cleanest example we have. The paper was agreed should not be sent out for review because anyone who was competent would understand immediately what its implications were. There are reasons that great work cannot be peer-reviewed. Furthermore you have entire fields that are existing now with electronic archives that are not peer reviewed. Peer review is not peer review, it sounds like peer review it is peer injunction. It is the ability for your peers to keep the world from learning about your work.
Bret: keep the world from learning about your work.
Eric: because peer review is what happens, real peer review is what happens after you've passed the bs thing called "peer review".
"""
https://youtu.be/JLb5hZLw44s?t=5608
This is a fascinating bit of history to know about, if true. I wonder if anyone has ever documented the evolution of its adoption across fields. It would also be interesting to see the HN community react to this characterization. That wild interview with his brother was the first time I had ever heard of it.
Well it's certainly in keeping with the view he has of himself as an under appreciated scientific genius, however I don't think it makes for a very compelling critique of peer-review. Frankly, all his boohoo-hooing about being shut out of the in-group at Harvard probably has more to do with him being an insufferable narcissist, rather than any attempt by the establishment to prevent heterodox views in physics from reaching the wider scientific community.
There's really only one solution to this. Luckily, it is the same solution as for a lot of other things. Unluckily, it is the same solution that no one has sought to implement for decades. Reviewing takes a lot of nuance, care, and time. You're not going to get this when all the incentives currently push reviewers to reject works. You can only get a group to work on ethics alone when that group is small and accountable. Incentivize reviewers so that producing high-quality reviews is worthwhile work. Incentivize the chairs to review the reviewers and demand high-quality reviews. An expert paying close attention to a work makes the paper mill's job exponentially more difficult. It also makes the review process actually useful in the first place, since it ensures authors get real feedback.
1) You must therefore pay the reviewers. Probably a good starting point.
2) It is, in fact, already hard to publish in top journals, especially if you're not established. Given that you also want the pool of reviewers to shrink, you're making it even harder to publish. Yes, the issue is paper mills, got it, but in doing this you're also making it harder for sincere scientists who aren't part of a mill.
3) The flood of work falling to the smaller, more accountable group will require culling, which the top journals will of course do based on the biases they already have: pedigree. That will further calcify, or even intensify, the existing problems in science and research.
1) I mostly agree. There are at least other rewards that can also be offered, such as conference discounts. There's also the token system (pay tokens to submit works, receive tokens for reviewing). But some incentive structure needs to happen, agreed.
> especially if you're not established
This is commonly stated, but I don't think many people internalize what it means. I don't think a meritocracy can exist -- this indicator supports that -- since it shows there are strong factors influencing acceptance beyond the quality of the work.
2) I do think we need to be very careful about how we sell the prestige of publications/venues. As fields get more popular and our metrics rely more on them (publish or perish), this not only encourages paper mills but cheating in general. As two simple examples, look at how ML publishes works that use proprietary datasets/models (which reveals the authors' lab and thus violates ethics), or how it is status quo to tune hyperparameters on test-data results (information leakage). You're at a huge disadvantage if you don't cheat. We have a similar problem in schools, and this is why students cheat. If cheating is hard to catch and bad actors aren't punished (punishing is risky because of high false positives), then you actively reward cheaters. This needs to be a serious conversation, and I don't think we are having it (and it spans several domains, even outside academia).
3) There's a coupling effect here though, that is pretty destructive and can have bigger social ramifications. As the noise in venue publication increases, the trust in the venue decreases. But I think it increases first, as people are metric chasing. But I suspect it'll be like a rubber band, and snap back hard. The larger ramification is social trust around science, where only the large venues matter and not the small unknown journals. We already have a growing distrust as the "just asking questions" anti-science strategy has been growing, and we need to be pretty careful and think beyond a local (spatially and temporally) window.
This is No.2 in the list of existential AI risks [1]
> A deluge of AI-generated misinformation and persuasive content could make society less-equipped to handle important challenges of our time.
One way to think about it is as significant chunks of information exchange turning into a Market for Lemons. The information asymmetry between the producer of AI junk and the receiver of said junk means that the receiver cannot distinguish between a high-quality message (a "peach") and a zero- (or negative-) value "lemon". Receivers are then only willing to pay a fixed price for any message, equal to the average value of a "peach" and a "lemon". Given the zero marginal cost of producing junk, in the limit receivers will be willing to pay exactly zero. Information exchange is completely discredited.
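To make the limit argument concrete, here is a toy version of that dynamic; the peach and lemon values and the junk fractions are made-up numbers:

```python
# Toy "market for lemons" for information: an unverifiable message is worth its expected value,
# which collapses toward zero as zero-marginal-cost junk floods the channel.
PEACH_VALUE = 1.0   # value of a genuine, high-quality message (assumed)
LEMON_VALUE = 0.0   # value of AI junk (assumed)

def willingness_to_pay(junk_fraction: float) -> float:
    """Expected value of a message the receiver cannot verify."""
    return (1 - junk_fraction) * PEACH_VALUE + junk_fraction * LEMON_VALUE

for junk in (0.1, 0.5, 0.9, 0.99):
    print(f"junk share {junk:.0%}: receiver pays at most {willingness_to_pay(junk):.2f}")
# Since producing junk costs roughly nothing, the junk share drifts toward 1 and the price toward 0.
```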
But is this really an "existential risk" or an opportunity to think deeply about human relations, trust and the meaning of exchange?
Maybe the transactional, "a fool is born every second", buyer beware, caveat emptor society we have built was never fit-for-purpose in the first place?
For 'lay' people here on HN, I just want to make a quick point:
Peer review does not mean that the reviewer is re-doing experiments or re-running analyses. They are only reviewing to see if the paper merits inclusion in the journal. Often this means telling the authors to do more experiments or check other things. But, to be clear, the peer reviewer does not re-do things and check whether they are 'right'.
Although there is a type of peer review that includes redoing experiments and analysis: artifact evaluation. All of the top conferences in my field (real-time embedded systems) include this as an opt-in option, and papers get a special badge if they also pass artifact evaluation. I strongly believe that other fields in computer science would benefit by including and normalizing this process.
Besides the reproducibility benefits, artifact evaluation forces documentation of the experiments and process; I've found this enormously useful when on-boarding new students to an existing project.
I'm nearly certain that yours is the only field that does anything like re-experimentation, then. I'm in biotech-y fields and it's a totally different beast out here, man.
The Jan Hendrik Schon scandal of two decades ago and the fallout from it point to how to limit the spread of fraudulent research into the accepted literature:
The basic issue is that all raw data must be preserved and made available for scrutiny by other researchers after publication, as must experimental materials. Why?
> "The committee requested copies of the raw data, but found that Schön had kept no laboratory notebooks. His raw data files had been erased from his computer. According to Schön, the files were erased because his computer had limited hard drive space. In addition, all of his experimental samples had been discarded or damaged beyond repair."
The opposition to this standard is strongest in the corporatized patent-centric research sectors, which is most of applied science in the USA and China, etc. Ambitious academics in non-commercial sectors don't really like it either as it means competitors can jump-start their research by having access to their raw data and experimental protocols. Regardless, implementing standard practices with respect to laboratory notebooks, raw data, and experimental materials in any institution receiving federal research money makes a lot of sense - along with regular audits, with failure leading to a cutoff in funding.
This problem is much broader than just the paper-mill outfits the article focuses on; some highly public and contentious related issues are public access to the raw data, research records and database sequences from the Wuhan Institute of Virology from c.2016-2019, the raw clinical trial data from Pfizer/Moderna/J&J Covid vaccine trials in 2020, and so on.
I think the main problem is not in itself incentivizing papers or using papers as a metric. The problem is that they're in many ways the only thing incentivized and the only metric given serious consideration. And the general expectations for/form of a paper are so homogenized across a given field. This makes it very feasible to game the system in a particular way that is bad for science.
It's also a serious misallocation of incentives purely from a system functioning perspective. At the very least you need to incentivize peer review alongside incentivizing papers. Currently there is hardly any incentive at all to help carefully review papers, so we have no checks and balances in place to deal with the natural result of incentivizing the writing of papers. If there were greater heterogeneity in scientific roles it wouldn't be so bad for some to have heavy paper incentives -- because there'd be others with heavy incentives to do a good job highlighting the best papers and finding flaws in the bad ones.
It feels like AI is exposing a Kessler Syndrome in other areas: a small amount of junk isn't necessarily a problem, but if you scale that problem up, it fundamentally changes the thing for everyone, permanently. I guess the jury is still out on whether it's net good or bad, but it feels like it's going to force us to confront some fundamental issues we have in society, which have perhaps always been there but are now unavoidable, and which will demand quick societal change. That almost never goes over well.
The problem is letting this be a step that directs our attention in the first place:
> platform X published this so it must be good
It's a root-of-trust scheme, and those create high-value targets that eventually fall to corruption. Better would be:
> human Y cited this, it's about Z, and you've configured Y to be trusted in domain Z
Webs of trust require maintenance, which isn't convenient, but if roots of trust continue to degrade in trustworthiness, then that maintenance will eventually be a price worth paying.
The height of trust rot is Science publishing Woo Suk Hwang's 2005 stem cell research. One of the top scientific journals published what appeared to be a Nobel-track line of research, one that would have been a breakthrough in medical treatment. Instead it tainted the whole line of research.
The research results were fraudulent, claiming a much higher success rate at generating a stem cell line than what was achieved. They lied about the number of stem cell lines generated, the number of oocytes used to generate the stem cell lines, and the number of donors the oocytes came from.
As if the bad data weren't enough, they lied to their donors, lied about their donors, and miscredited authors.
I wasn't aware of this kind of thing in 2005. How do you think the trustworthiness of papers has been trending since then, generally speaking? It sounds like that was a bit of an outlier.
I don't think a single worse example has occurred since, even though there have been other high profile instances of falsifying data and other unethical conduct.
I think that after the replication crisis the scientific community is more aware of the limitations and faults of the current system. This has impacted different fields in different ways, to varying degrees of improvement. Now, I think a lab like Hwang's would be met with more skepticism from both readers and the top-tier journals.
The biggest cause for skepticism now would be Hwang's claimed success rate with the techniques. If other labs couldn't reach similar levels of success then the most generous assumption would be that the techniques weren't described well enough in the articles. Compare this to CRISPR gene editing, which is a slightly more modern advancement in genetics that is valued because of how easily other labs can incorporate it.
In the short term, the solution may be reputation, distributed by keychain.
When author C submits a paper and they're unknown to the journal, the journal consults the keychain to see who has vouched for C as a credible researcher. If authors A and B have vouched for them, and A and B are in similarly good standing, C is regarded as being in good standing too, and the editorial process can move forward as usual.
If C is later found to have maliciously faked data, it isn't just C who gets dinged on the keychain, so do A and B by extension, providing an enforcement mechanism.
This is, in effect, how it has worked for centuries. Editor D calls/writes to A and B to ask, "hey, there's this new author C with a provocative paper that's in your subject area, but I've never heard of them. Are they the real deal?" If A and B vouch for C, but C turns out to be a fraud, Editor D will take any future consultation with A and B with a large grain of salt.
It is possible that AI will soon be able to generate entirely-credible looking papers. What AI will struggle to do, until it starts funding research of its own, is to generate novel research that reflects whatever is actually true of nature. A reputation system is the last bulwark against entropy, if no automated tools can sort wheat from chaff.
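A rough sketch of the keychain mechanism described above, with the penalty propagating one hop back to the vouchers; the data structures and penalty sizes are my own illustrative assumptions:

```python
# Sketch: authors vouch for authors; proven fraud dings the fraudster and, one hop out, their vouchers.
from collections import defaultdict

FRAUD_PENALTY = 1.0     # fraudulent author's standing drops to zero (assumed)
VOUCHER_PENALTY = 0.25  # one-hop penalty for everyone who vouched for them (assumed)

standing: dict[str, float] = defaultdict(lambda: 1.0)   # 1.0 = good standing
vouched_by: dict[str, set[str]] = defaultdict(set)      # author -> set of vouchers

def vouch(voucher: str, author: str) -> None:
    vouched_by[author].add(voucher)

def is_credible(author: str, threshold: float = 0.5) -> bool:
    # An unknown author is credible if someone still in good standing has vouched for them.
    return any(standing[v] >= threshold for v in vouched_by[author])

def report_fraud(author: str) -> None:
    standing[author] = max(0.0, standing[author] - FRAUD_PENALTY)
    for v in vouched_by[author]:
        standing[v] = max(0.0, standing[v] - VOUCHER_PENALTY)

# Example: A and B vouch for newcomer C; C is later caught faking data.
vouch("A", "C"); vouch("B", "C")
assert is_credible("C")
report_fraud("C")
print(standing["A"], standing["B"], standing["C"])   # 0.75 0.75 0.0
```

How far the penalty should propagate, and whether it should decay over time, is exactly the kind of policy question the editors' informal version has always answered case by case.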
The easiest solutions won't be appreciated -- a chain of trust from older professors/academics who co-author (or otherwise sign off that it's legit), metrics based on historical output from university + lab, and I'm sure some will start looking at names.
Feels like AI makes the fight against paper mills easier. A group could submit AI-generated papers to different places and create a public index based on how successful they are at getting their nonsense papers accepted.
"Later, after Sokal revealed the hoax in Lingua Franca, Social Text's editors wrote that they had requested editorial changes that Sokal refused to make, and had had concerns about the quality of the writing: 'We requested him (a) to excise a good deal of the philosophical speculation and (b) to excise most of his footnotes.' Still, despite calling Sokal a 'difficult, uncooperative author", and noting that such writers were 'well known to journal editors', based on Sokal's credentials Social Text published the article in the May 1996 Spring/Summer 'Science Wars' issue"
AI makes the fight against spam harder, since it is harder to detect fakes. Such a group would only get more successful over time. The "AI detection" tools get worse over time and are already overhyped anyway, as we learned from the people running that sci-fi submission site.
That’s exactly the point. What you previously assumed review processes could catch is no longer true. AI can now infiltrate pretty much anything, especially if there is a Generative Adversarial Network (which “AI detection tools” can be used to train at scale).
And if you say it “deserves to get published” then taken to its logical conclusion, human generated content in all fields and interactions will soon be dwarfed by AI content and interactions.
The issue is that the AI can be switched out after it’s infiltrated and taken over, or it can be gradually used to shift public opinion or organize any sort of coordinated attack. Heck, a reputational attack is easy to pull off at scale within 6 months via AutoGPT already, and it takes 1 button press.
First they'll separate us, then they'll herd us into echo chambers, and eventually our protests won't be heard by anyone anymore amid all the AI glut.
The upside of this downside is that it increases the value of papers that can be replicated easily. In fact, once AI can replicate papers, that'll be helpful for verifying a lot of research.
Is this AI to be provided with a group of poorly-remunerated postgrads and postdocs and a laboratory full of relevant equipment in which they may conduct experiments under its direction?
(Full disclosure: have a couple of very minor published papers obtained under the non-AI version of this exact system...)
This seems like a bleak dystopian fiction concept actually. AIs being in control of human experimentation on the basis of well meaning but poorly interpreted ethical guidelines (with, of course, fully hallucinated loopholes).
Maybe add in that this has become one of the few sources of income for people as more and more jobs have been lost to the machines so many are compelled to participate.
"Take all the research done in [field] and compare the papers written by N authors, and look into their citations and compare the outcomes of what was published and find where the summation of the reads of their papers as shared citations and summarize the research where they agree, disagree and compare with international organizations and their papers coming from specifically the leads in this field from countries X,Y,Z"
Create a paper based on this information and coalesce all this data into a new outcome. Do not Lie or make up data. list where you think you are lacking in data - or based on searching, which datasets or companies should be included in the list. Create a table for all the citations sourced, with links and a comment why it is or is not included in your findings"
EDIT: Someone down voted this, and I am learning 'Prompt Engineering' - can someone ELI5 why this is a bad prompt? Seriously, can someone explain what sucks about the question?
Is it a stupid premise or is the crafting of the prompt lame?
I had hoped this was from the angle of “AI does better review than the journals” and “AI is being used to weed out a lot of papers that are just plain wrong”.
Fair enough, but this was a major 'event' in both tech and academic circles...
Funny that Nature publishes an article about fake papers and fails to mention the one model specifically trained exclusively on white papers to write (fake?) white papers...
Because I was subscribed to OpenAI until yesterday, and my experiments with it were pretty promising.
I took the whole corpus of Magic the Gathering rules ( https://magic.wizards.com/en/rules ) as a text file and fed it into GPT-4, which parsed it in a few seconds. I was then able to send it cards from Gatherer (the MtG card database) and ask it pointed questions about multiple-card interactions.
I also compared its answers to rulings DCI judges have made, and it matched 100%. It quite impressed me.
I was thinking the next step would be to give it the rules text plus a JSON of every card, and then ask for all combos. But I'd run out of response length before it could finish.
> I took the whole corpus of Magic the Gathering rules ( https://magic.wizards.com/en/rules ) as a text file and fed it into GPT-4, which parsed it in a few seconds.
You mean the TXT file? ChatGPT (GPT-4) on the website literally can't ingest it: the file has many, many more tokens than the model's context window can take.
So this particular example proved what your parent comment pointed out. You think AI can do something that it can't.
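For anyone curious how far over the limit that file actually is, a quick way to check is OpenAI's tiktoken tokenizer; the file name below is a placeholder, and 8,192 / 32,768 were GPT-4's context sizes at the time:

```python
# Sketch: count the tokens in the downloaded comprehensive-rules text file
# and compare against GPT-4's context windows.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
with open("MagicCompRules.txt", encoding="utf-8") as f:   # placeholder path
    rules_text = f.read()

n_tokens = len(enc.encode(rules_text))
print(f"{n_tokens} tokens")
print("fits 8k context: ", n_tokens <= 8192)
print("fits 32k context:", n_tokens <= 32768)
```

If the count comes out far above both limits, the web ChatGPT was necessarily working from a truncated or summarised view of the rules rather than the full text.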
Out of curiosity, did you try the same experiment without specifically training it on the MtG rules? Could it have all of that, including card interaction decisions, from its training data (sourced from the whole Internet)?
I only tried it after giving the URL of the rules. There's multiple rules documents, and I wanted to make sure to use the current rules.
It might have given good results without explicit rules provided. Or it could have spouted garbage.
I did, however, ask very pointed questions about timing and layers. The ones with existing DCI judge writings matched 100% (it could be overfitting to those documents). And the ones not written about also appeared to be completely accurate, since it cited the rules it used to come to its decision.
However the larger problem is that GPT4 has been degrading quite a bit recently. It also precipitated my decision to unsubscribe. And I'm not the only one to notice this https://news.ycombinator.com/item?id=36134249