When I went to the APS March Meeting earlier this year, I talked with the editor of a scientific journal and asked them if they were worried about LLM generated papers. They said actually their main worry wasn't LLM-generated papers, it was LLM-generated reviews.
LLMs are much better at plausibly summarizing content than they are at doing long sequences of reasoning, so they're much better at generating believable reviews than believable papers. Plus reviews are pretty tedious to do, giving an incentive to half-ass it with an LLM. Plus reviews are usually not shared publicly, taking away some of the potential embarrassment.
We already got an LLM-generated meta-review that was very clearly just a summarization of the reviews. There were some pretty egregious cases of borderline hallucinated remarks. This was ACL Rolling Review, so basically the most prestigious NLP venue, and the editors told us to suck it up. Very disappointing, and I genuinely worry about the state of science and how this will affect people who rely on scientometric criteria.
This is a problem in general, but the unmitigated disaster that is ARR (ACL Rolling Review) doesn't help.
On the one hand, if you submit to a conference, you are forced to "volunteer" for that cycle. Which is a good idea from a "justice" point of view, but it's also a sure way of generating unmotivated reviewers. Not only because a person might be unmotivated in general, but because the (rather short) reviewing period may coincide with your vacation (this happened to many people with EMNLP, whose reviewing period was in the summer) and you're not given any alternative but to "volunteer" and deal with it.
On the other hand, even regular reviewers aren't treated too well. Lately they implemented a minimum max load of 4 (which can push people towards choosing uncomfortable loads; in fact, that seems to be the purpose), and loads aren't even respected (IIRC there have been emails to the tune of "some people set a max load, but we got a lot of submissions, so you may get more submissions than your load, lololol").
While I don't condone using LLMs for reviewing and I would never do such a thing, I am not too surprised that these things happen given that ARR makes the already often thankless job of reviewing even more annoying.
To be honest, lately I have gotten better-quality reviews from the supposedly second-tier conferences that haven't joined ARR (e.g. this year's LREC-COLING) than from ARR itself. Although the sample size is very small, of course.
Most conferences have been flooded with submissions, and ACL is no exception.
A consequence of that is that there are not enough qualified reviewers available to review these manuscripts.
Conference organizers might be keen to accept many or most of those who offer to volunteer, but clearly there is now a large pool of people who have never done this before and were never taught how to do it. Add some time pressure, and people will try out some tool, just because it exists.
GPT-generated docs have a particular tone that you can detect if you've played a bit with ChatGPT and if you have a feel for language. Such reviews should be kicked out. I would be interested to view this review (anonymized if you like - by taking out bits that reveal too narrowly what it's about).
The "rolling" model of ARR is a pain, though, because instead of slaving for a month you feel like slaving (conducting scientific peer review free of charge = slave labor) all year round.
Last month, I got contacted by a book editor to review a scientific book for $100. I told her I'm not going to read 350 pages to write two pages' worth of book review; to do this properly one would need two days, and I quoted my consulting day rate. On top of that, this email came in the vacation month of August. Of course, said person was never heard from again.
We had what we strongly suspect is an LLM-written review for NeurIPS. It was kind of subtle if you weren't looking carefully, and I can see that an AC might miss it. The suggestions for improvement weren't _wrong_, but the GPT response picked up on some extremely specific things in the paper that were mostly irrelevant (other reviewers actually pointed out the odd typo and small corrections or improvements where we'd made statements).
Pretty hard to combat. We just rebutted as if it were a real review - maybe it was - and hope that the chairs see it. Speaking to other folks, opinions are split over whether this sort of review should be flagged. I know some people who tried to query a review and it didn't help.
There were other small cues - the English was perfect, while other reviewers made small slips indicative of non-native speakers. One was simply the discrepancy between the tone of the review (generally very positive) and the middle-of-the-road rating and confidence. The structure of the review was very "The authors do X, Y, Z. This is important because A, B, C." and the reviewer didn't bother to fill out any of the other review sections (they just wrote single-word answers to all of them).
The kicker was actually putting our paper into 4o, asking it to write a review, and seeing the same keywords pop up.
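For anyone who wants to try this on their own draft, it only takes a few lines; a rough sketch, assuming the openai>=1.x Python client, a hypothetical file path, and made-up prompt wording:

```python
# Rough sketch: ask a model to review your own draft as a free sanity check.
# The prompt wording and the file path are assumptions, not anyone's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

paper_text = open("draft.tex").read()  # hypothetical path to your manuscript

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a critical but fair peer reviewer for an ML conference."},
        {"role": "user",
         "content": "Review this paper: list strengths, weaknesses, questions, "
                    "and give a 1-10 score.\n\n" + paper_text},
    ],
)
print(response.choices[0].message.content)
```

Treat the output the way you would treat an unknown reviewer: useful for spotting weak points, useless as a verdict.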
Not defending LLM papers at all, but these people can go to hell. If "scientometrics" was ever a good idea, now that the measure has become the target it for sure isn't anymore. A longer, carefully written, comprehensive paper is rated worse than many short, incremental, hastily written papers.
Well, given that the only thing that matters for tenure reviews is the "service", i.e., roughly a list of conferences the applicant reviewed or performed some sort of service at, this is hardly a surprise.
Right now there is no incentive to do a high-quality review unless the reviewer is intrinsically motivated.
See my other post - we had exactly this for NeurIPS. It is definitely worth seeing what GPT says about your paper if only because it's a free review. The criticisms it gave us weren't wrong per se, they were just weakly backed up and it would still be up to a reviewer to judge how relevant they are or not. Every paper has downsides, but you need domain knowledge to judge if it's a small issue or a killer. Amusingly, our LLM-reviewer gave a much lower score than when we asked GPT to provide a rating (and also significantly lower than the other reviewers).
One example was that GPT took an explicit geographic location from a figure caption and used it as a reference point when suggesting improvements (along the lines of "location X is under-represented on this map"), I assume because it places a high degree of relevance on figures and the abstract when summarising papers. I think you might be able to combat this by writing defensively - in our case we might have avoided that by saying "more information about geographic diversity may be found in X and the supplementary information".
LLMs reviewing LLM generated articles via LLM editors is more or less guaranteed to become a massive thing given the incentive structures/survival pressures of everyone involved.
Researchers get massive CVs, reviewers and editors get off easy, admins get to show great output numbers from their institutions, and of course the publishers continue making money hand over fist.
It might follow to say that current LLMs aren't trained to generate papers, BUT they also don't really need to reason.
They just need to mimic the appearance of reason, follow the same pattern of progression. Ingesting enough of what amounts to executed templates will teach it to generate its own results as if output from the same template.
The outputs aren't really the same, they simply seem plausible at first glance.
For example, I recently experimented with using ChatGPT to translate a Wikipedia article, on the grounds that it might maintain all the formatting and that Transformer models are also used by Google Translate.
As it was an experiment, I did actually check the results before submitting the translated article.
The first roughly three quarters were fine. The final quarter was completely invented but plausible, including the references.
LLMs are very useful tools. I'll gladly use them to help with various tasks, and they can (with low reliability, but it has happened) even manage a whole project, but right now they should be treated with caution and not left unsupervised; the Peter principle, being promoted beyond one's competence, still applies even though they're not human employees.
Because the results aren't the same? I use AI every day for software development and a number of other topics. It's very easy to recognize the points where the illusion breaks and how it breaks clearly indicates to me that there's no actual reasoning in the response I've gotten. It often feels like I'm doing the reasoning for the AI and not the other way around.
From what I’ve seen, the results are not the same. In the latter scenario, there’s a risk of encountering a non sequitur all of a sudden, and the citations may be nonexistent. There’s also no guarantee that what you’re stating is factually correct when your logic is unbounded by reality.
I can see how LLMs could contribute to raising the standard in that field. For example, surveying related research. Also, maybe in the not-too-distant future, reproducing (some of) the results.
Writing consists of iterated re-writing (to me, anyways), i.e. better and better ways to express content 1. correctly, 2. clearly and 3. space-economically.
By writing it down (yourself) you understand what claims each piece of related work discussed has made (and can realistically make - as there sometimes are inflationary lists of claims in papers), and this helps you formulate your own claim as it relates to them (new task, novel method for a known task, like an older method but works better, nearly as good as a past method but runs faster, etc.).
If you outsource it to a machine you no longer see it through, and the result will be poor unless you are a very bad writer.
I can, however, see a role for LLMs in an electronic "learn how to write better" tutoring system.
Pretty much yes. Critical analysis is a necessary skill that needs practice. It's also necessary to be aware of the intricacies of work in one's own topic area, defined narrowly, to clearly communicate how one's own methods are similar/different to others' methods.
Hmm, there may be a bug in the authors' Python script that searches Google Scholar for the phrases "as of my last knowledge update" or "I don't have access to real-time data". You can see the code in Appendix B.
The bug happens if the 'bib' key doesn't exist in the API response. That leads to the urls array having more rows than the paper_data array, so the columns could become mismatched in the final data frame. It seems they made a third array called flag, which could be used to detect and remove the bad results, but it's not used anywhere in the posted code.
It's not clear to me how this would affect their analysis; it does seem like something they would catch when manually reviewing the papers. But perhaps the bibliographic data wasn't reviewed and was only used to calculate the summary stats etc.
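For illustration, a hypothetical reconstruction of the mismatch being described - the list names and record layout come from the comment above and from typical Scholarly output, not from the actual Appendix B code:

```python
# Hypothetical reconstruction of the described mismatch, not the authors' code.
search_results = [                              # stand-in for the Scholarly output
    {"bib": {"title": "Paper A"}, "pub_url": "http://example.org/a"},
    {"pub_url": "http://example.org/b"},        # 'bib' missing -> triggers the bug
]

paper_data, urls, flag = [], [], []
for result in search_results:
    if "bib" in result:
        paper_data.append(result["bib"])        # appended only when 'bib' exists...
        flag.append(True)
    else:
        flag.append(False)                      # ...recorded, but never used downstream
    urls.append(result.get("pub_url", ""))      # always appended

print(len(paper_data), len(urls))               # 1 vs 2: rows no longer line up

# One possible fix: drop (or placeholder) both entries together so the final
# data frame stays aligned.
rows = [(r["bib"], r.get("pub_url", "")) for r in search_results if "bib" in r]
```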
That sounds important enough to contact the authors. Best case, they fixed it up manually; worst case, lots of papers are publicly accused of being made up and the whole farming/fish-focused summary they produced is completely wrong.
Hi there! My name is Kristofer, one of the authors of this research note. I also wrote the script. We were notified via email about this comment. Please see below for our response. Thank you for your interest in our research! (I'm removing the sender's name to respect their privacy)
"""
Dear XXXX,
My name is Kristofer, I'm one of the co-authors of the GPT paper. I also wrote the script for the data collection. Jutta forwarded your email regarding the possible bug.
First of all, let me apologise for the late response. Apparently your email made its way to the spam folder, which of course is regrettable.
I would also like to thank you for reaching out to us. We are pleased to see the interest of the HN community in transparent and reliable research.
We looked at the comment and the concern around the bug. We’d like to point out that the original commenter was right in saying “it does seem like something they would catch when manually reviewing the papers”. We in fact reviewed the output manually and carefully for any potential errors. In other words, we opened and searched for the query string manually, which also helped determine whether the use of LLMs was declared in some form or other. This is of course a sensitive topic and we took great care to be thorough.
Nevertheless, we once more did a manual review of the code and the data, in light of this potential bug, and we’re glad to say no row-column mismatch is present. You can find the data here: https://doi.org/10.7910/DVN/WUVD8X
Please don't hesitate to reach out if you have any more questions.
"""
As a tangent to the paper topic itself, what should be the standard procedure for publishing data-gathering code like this? They don't specify which versions of the libraries or APIs were used, and since updates occur over time and APIs change, code rot is inevitable. It will eventually be impossible to figure out exactly what this code did.
With meticulous version records it should at least be possible to ascertain what the code did by reconstructing that exact version (assuming stored back versions exist).
In my opinion, archive the data that was actually gathered and the code's intermediate & final outputs. Write the code clearly enough that what it did can be understood by reading it alone, since with pervasive software churn it won't be runnable as-is forever. As a bonus, this approach works even when some steps are manual processes.
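A minimal sketch of what recording that provenance could look like, assuming Python tooling; the file name and package list are made up for the example, not anything the paper actually did:

```python
# Record the exact environment alongside the scraped data so a later reader
# can reconstruct what the collection code actually ran against.
import json
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

packages = ["scholarly", "requests", "pandas"]  # example dependencies
provenance = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {},
}
for name in packages:
    try:
        provenance["packages"][name] = version(name)
    except PackageNotFoundError:
        provenance["packages"][name] = "not installed"

# Archive this file next to the scraped data and the intermediate outputs.
with open("run_provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```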
GPT might make fabricating scientific papers easier, but let's not forget how many humans fabricated scientific research in recent years - they did a great job without AI!
For any who haven't seen/heard, this makes for some entertaining and eye-opening viewing!
I think it’s important to remember that while the tidal wave of spam just starting to crest courtesy of the less scrupulous LLM vendors is uh, necessary to address, this century’s war on epistemology was well underway already in the grand traditions of periodic wars on the idea that facts are even aspirationally, directionally worthwhile. The phrase “alternative facts” hit the mainstream in 2016 and the idea that resistance is futile on broad-spectrum digital weaponized bytes was muscular then (that was around the time I was starting to feel ill for being a key architect of it).
Now technology is a human artifact and always ends up resembling its creators or financiers or both: I’d have nice fonts on my computer in 2024 most likely either way, but it’s directly because of Jobs they were available in 1984 to a household budget.
If someone other than Altman had been at the helm, or if some insight other than "this thing can lie in a newly scalable way" had been the escape-velocity moment for LLMs, then we'd still have test sets and metrics and just plain science going on in the Commanding Heights of the S&P 500. But these people are a symptom of our apathy around any noble instinct: if we had stood firm on our values, no effective-altruism cult-leader type would even make the press.
Indeed. I used to think that when it hybridized with Objectivism that was the nastiest malware around but god damn if Amodei and co haven’t rootkitted society to a new level.
Difficulty and scale matter where it comes to fabrication.
Academia is a lot about barriers, which while sometimes unpleasant and malfunctioning nevertheless serve a purpose (unfortunately, it is impossible to evaluate everything fully on per-case basis, so humans need shortcuts to filter out noise and determine quicker if it is worth spending attention on). One of the barriers is in the form of the paper itself. The fall of this barrier (notably through often unauthorised use of others’ IP) would likely bring about not sudden idyllic meritocracy but increased noise and/or strengthening of other barriers.
Sure, but that takes time; AI has the potential to generate "real-sounding" papers in under a second. At least the fake papers before were rate-limited.
This kind of fabricated result is not a problem for practitioners in the relevant fields, who can easily distinguish between false and real work.
If there are instances where the ability to make such distinctions is lost, it is most likely to be so because the content lacks novelty, i.e. it simply regurgitates known and established facts. In which case it is a pointless effort, even if it might inflate the supposed author's list of publications.
As to the integrity of researchers, this is a known issue. The temptation to fabricate data existed long before the latest innovations in AI, and is very easy to do in most fields, particularly in medicine or biosciences which constitute the bulk of irreproducible research. Policing this kind of behavior is not altered by GPT or similar.
The bigger problem, however, is when non-experts attempt to become informed and are unable to distinguish between plausible and implausible sources of information. This is already a problem even without AI, consider the debates over the origins of SARS-CoV2, for example. The solution to this is the cultivation and funding of sources of expertise, e.g. in Universities and similar.
Non-experts actually attempting to become informed (instead of just feeling like they're informed) can easily tell the difference too. The people being fooled are the ones who want to be fooled. They're looking for something to support their pre-existing belief. And for those people, they'll always find something they can convince themselves supports their belief, so I don't think it matters what false information is floating around.
It seems to be kind of a new thing for laymen to be reading scientific papers. 20 years ago, they just weren't accessible. You had to physically go to a local university library and work out how to use the arcane search tools, which wouldn't really find what you wanted anyway. And even then, you couldn't take it home and half the time you couldn't even photocopy it because you needed a student ID card to use the photocopier.
For a paper that includes both a broad discussion of the scholarly issues raised by LLMs and wide-ranging policy recommendations, I wish the authors had taken a more nuanced approach to data collection than just searching for “as of my last knowledge update” and/or “I don’t have access to real-time data” and weeding out the false positives manually. LLMs can be used in scholarly writing in many ways that will not be caught with such a coarse sieve. Some are obviously illegitimate, such as having an LLM write an entire paper with fabricated data. But there are other ways that are not so clearly unacceptable.
For example, the authors’ statement that “[GPT’s] undeclared use—beyond proofreading—has potentially far-reaching implications for both science and society” suggests that, for them, using LLMs for “proofreading” is okay. But “proofreading” is understood in various ways. For some people, it would include only correcting spelling and grammatical mistakes. For others, especially for people who are not native speakers of English, it can also include changing the wording and even rewriting entire sentences and paragraphs to make the meaning clearer. To what extent can one use an LLM for such revision without declaring that one has done so?
Last time we discussed this, someone basically searched for phrases such as "certainly I can do X for you" and assumed that meant GPT was used. HN noticed that many of the accused papers actually predated OpenAI.
> Two main risks arise... First, the abundance of fabricated “studies” seeping into all areas of the research infrastructure... A second risk lies in the increased possibility that convincingly scientific-looking content was in fact deceitfully created with AI tools...
A third risk: ChatGPT has no understanding of "truth" in the sense of facts reported by established, trusted sources. I'm doing a research project related to use of data lakes and tried using ChatGPT to search for original sources. It's a shitshow of fabricated links and pedestrian summaries of marketing materials.
The existence of LLMs makes Google search even more relevant for cross-checking, rather than less relevant, for deep research. Daniel Dennett said we should have all levels of search available to everyone, i.e. from basic string matching to semantic matching. [0]
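A toy illustration of that spectrum, assuming the sentence-transformers library and a small off-the-shelf embedding model; the example documents are made up:

```python
# From basic string matching to semantic matching over the same tiny corpus.
from sentence_transformers import SentenceTransformer, util

docs = [
    "We trained a transformer on protein sequences.",
    "A neural network was fit to amino-acid chains.",
]
query = "deep learning for proteins"

# Level 1: basic string matching misses the paraphrases entirely.
print([d for d in docs if query.lower() in d.lower()])  # -> []

# Level 2: semantic matching via embeddings ranks both documents as related.
model = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(model.encode(query), model.encode(docs))
print(scores)  # cosine similarities, one per document
```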
> tried using ChatGPT to search for original sources
That's a bad idea, do not do that. Regardless of the knowledge contained in ChatGPT, it's a completely wrong tool/tech - like using a jackhammer as a screwdriver. If you want original sources, then services like https://perplexity.ai can do it. It's not even an issue with ChatGPT as such, it was never intended for that - that's why they're trying to create search as well: https://openai.com/index/searchgpt-prototype/
It's silly that there's a stigma attached to AI-generated images in cases where it's perfectly reasonable to use them. People seem to appreciate things more for the fact that they were created by spending time out of another human's life than for what the thing actually is.
It would be silly if they were indistinguishable from human-created images, but they aren't, exhibiting the typical AI artifacts and weirdness, and thereby signal a lack of care/caring.
"lack of care" - that's the part about spending time out of another human's life. It's not the poor quality that's the problem but the lack of human effort. Oil paintings are full of visible brush strokes which are an artifact but people love them. For most applications of art - advertising, background decorations, news article pictures, etc. there really is no need to show that humans spent effort on it.
The human effort idea is even a bit morally objectionable. You can feel that you're worth more than others because more of the lives of others were consumed to create your possessions. It's a zero sum game where poor people can never afford high-care art because their time is worth less than the artist's.
> create a picture of scrabble pieces strewn on a table, with a closeup of a line of scrabble letters spelling "CHATGPT" on top of them. photographic, realistic quality, maintain realism and believability
The biggest problem with the default Flux model is that it generates images with that strong AI look, probably caused by the distillation of the CFG. You should try some LoRAs for this, and also prompt the model to generate the rack that holds the letters.
Good point. I have a ComfyUI setup for it, but it's super basic right now - just the diffusion model / CLIP loader / VAE. Another thing you've probably noticed is that 99% of images from Flux tend to have that classic narrow depth-of-field look. I've seen people occasionally be able to get around it with pretty amusing prompt tokens like "instagram photo, selfie, gopro", etc., though.
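For reference, roughly the same idea expressed in diffusers terms rather than a ComfyUI graph - the model ID is the public Flux dev checkpoint, while the LoRA path and the settings are placeholders, not the setup described above:

```python
# Sketch: Flux checkpoint plus a style LoRA, with prompt tokens aimed at the
# "AI look" and the narrow depth-of-field default discussed above.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("path/to/realism_lora.safetensors")  # hypothetical LoRA file

image = pipe(
    'scrabble tiles strewn on a wooden table, a line of tiles spelling "CHATGPT", '
    "amateur photo, wide depth of field, everything in focus",
    guidance_scale=3.5,
    num_inference_steps=28,
).images[0]
image.save("scrabble.png")
```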
The number markings on the Scrabble pieces are nonsensical, the wooden ground looks like plastic, there are strange artifacts like the white smudge on the edge of the “E” tile in the front, and so on.
AI-generated images are clearly identifiable as such, and it just gets annoying to continually see those desultory fabrications.
I wonder how many of the GPT-generated papers are actually made by people whose native language is not English and who want to improve their English. That would explain various "as of my last knowledge update" still left intact in the papers, if the authors don't fully understand what it means.
I'm guessing that we don't want people to write papers in a language where they don't understand "as of my last knowledge update", as a lot of the terminology in their paper probably involves more advanced language than that.
Would be better in those cases for people to write their paper in their native language and let readers translate it for themselves.
It's not a black-and-white problem. Some people may have a good ability to read a language but not write/speak it (I'm that way with Spanish), so it will vary from case to case whether reader- or author-side translation works best; it could be good to include both versions in any given paper and fix both problems.
How about people stop responding to titles for a change. This isn’t about papers that merely used ChatGPT and got caught by some cutting edge detection techniques, it’s about papers that blatantly include ChatGPT boilerplates like
> “as of my last knowledge update” and/or “I don’t have access to real-time data”
which suggests no human (don’t even need to be a researcher) read every sentence of these damn “papers”. That’s a pretty low bar to clear, if you can’t even bother to read generated crap before including it in your paper, your academic integrity is negative and not a word from you can carry any weight.
True. I am seeing ChatGPT used by my colleagues (mostly non-native English speakers) day to day, and it mostly improves their writing (except for those words that pop up a bit too often [0], like "utilize" [1]). So not all bad.
I am also hearing that a lot of reviewers and readers use it, though. So we often joke that PhD students (in CS) nowadays only write bullet points from their research: prose is generated from the bullet points, and that prose is then used to generate bullet points again.
How do you know there is no proper proofreading? There is no way to tell, is there? Just because content was generated by an LLM doesn't in itself mean that it wasn't proofread.
> We searched and scraped Google Scholar using the Python library Scholarly (Cholewiak et al., 2023) for papers that included specific phrases known to be common responses from ChatGPT and similar applications with the same underlying model (GPT3.5 or GPT4): “as of my last knowledge update” and/or “I don’t have access to real-time data” (see Appendix A).
If no one bothered to even spot and remove these, you can be pretty sure that no human ever read the whole paper before publication.
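For context, the search itself is easy to reproduce in spirit; a rough sketch assuming the Scholarly library's public API and Google Scholar's OR syntax, not the authors' actual Appendix A code:

```python
# Look up papers containing the verbatim chatbot boilerplate phrases.
from scholarly import scholarly

query = '"as of my last knowledge update" OR "I don\'t have access to real-time data"'

for i, pub in enumerate(scholarly.search_pubs(query)):
    bib = pub.get("bib", {})
    print(bib.get("title", "<no title>"), "|", pub.get("pub_url", ""))
    if i >= 20:  # keep the example small; the paper collected far more hits
        break
```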
IMO, at this point, AI is very much needed as a pre-reviewer to weed out papers that haven't even been proofread, at both the journal and the preprint level, preventing them from getting an audience.
The problem is not that a paper has fabricated content generated by ChatGPT; the problem is that there are many such papers, and they are polluting scholarship to the point that the base of evidence used in policy-making could be poisoned into uselessness.
Firstly, "fabricated content" is a meaningless phrase. For the sake of argument, I use Github Copilot for "fabricating" every line of code. Does this make my code polluted? No, because I review every line of code, editing what's necessary, and more. It's the same way with scholarship. It doesn't say anything in itself.
Perhaps "unreviewed scholarship" would be a more concerning claim, but I don't yet see the evidence for it being a major concern.
Colour me surprised. An IT-related search will generally end up with loads of returns that lead to AI-generated wankery.
For example, suppose you wish to back up switch configs or dump a file or whatever, and TFTP is so easy and simple to set up. You'll tear it down later or firewall it or whatever.
All good until you try to use the --create flag, which should allow you to upload to the server. That flag is not valid for tftp-hpa; it is valid on tftpd (another TFTP daemon).
That's a hallucination. Hallucinations are fucking annoying and increasingly prevalent. In Windows land the humans hallucinate - C:\ SFC /SCANNOW does not fix anything except for something really madly self-imposed.
It says to put the --create option in /etc/default/tftpd-hpa. tftpd-hpa does support --create (at least on Ubuntu). The client program tftp-hpa (no d) doesn't support --create, but that's not what the instructions are talking about.
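For reference, on Ubuntu the file in question typically looks like the stock defaults below; adding --create to TFTP_OPTIONS is the part that enables uploads (and the TFTP directory also has to be writable by the tftp user):

```
# /etc/default/tftpd-hpa
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS=":69"
TFTP_OPTIONS="--secure --create"
```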
It's funny you mention this because yesterday I had it write me a shell script to set up a TFTP server from scratch. I had it walk me through the process first, then said "ok now make that into a script." And it did and it works.
The article shows no evidence of fabrication, fraud or misinformation, while making accusations of all three. All it shows is that ChatGPT was used, which is wildly escalated into "evidence manipulation" (ironically, without evidence).
Much more work is needed to show that this means anything.
If the text wasn't even read to check for obvious GPT boilerplate markers, then we can't expect that anything else in these papers was checked. That means everything else - the numbers, interpretation, conclusions - was potentially never checked.
The authors use fraud in a specific sense here: "using ChatGPT fraudulently or undeclared" where they proved that the produced text was included without proper review. They also never accused those papers of misinformation, so they don't need to show evidence of that.
Honestly what we need to do is establish much stronger credentialing schemes. The "only a good guy with an AI can stop a bad guy with an AI" approach of trying to filter out bad content is just a hopeless arms race and unproductive.
In a sense we need to go back two steps and websites need to be much stronger curators of knowledge again, and we need some reliable ways to sign and attribute real authorship to publications. So that when someone publishes a fake paper there is always a human being who signed it and can be held accountable. There's a practically unlimited number of automated systems, but only a limited number of people trying to benefit from it.
In the same way https went from being rare to being the norm because the assumption that things are default-authentic doesn't hold, the same just needs to happen to publishing. If you have a functioning reputation system and you can put a price on fake information, 99% of it is dis-incentivized.
Is this not already a thing? You can look up purported papers by DOI, and whatever journal it came from supposedly had it reviewed and should know who sent it to them.
(And if that doesn't work, how is what you're suggesting meaningfully different?)
It's not at all a thing. Here's a recent study looking at citation fraud on Google Scholar, including professional citation-boosting services, some using fake identities. It's a widespread practice. https://arxiv.org/abs/2402.04607
Having a machine-verifiable, cryptographic identity system that renders these kinds of things transparent - basically the equivalent of a ledger, but used for identity instead of get-rich schemes - would probably make verification enforceable.
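A minimal sketch of what signed authorship could look like, assuming the Python cryptography package; the key handling and the identity registry are hand-waved, this is not a proposed standard:

```python
# An author signs the hash of a manuscript with a long-lived identity key,
# and anyone can verify that the named key endorsed this exact file.
import hashlib

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

author_key = Ed25519PrivateKey.generate()      # in practice: a registered identity key
manuscript = b"...full text or PDF bytes of the paper..."
digest = hashlib.sha256(manuscript).digest()

signature = author_key.sign(digest)            # attached to the publication record

# Verification by a journal, indexer, or reader:
public_key = author_key.public_key()           # in practice: fetched from an identity registry
public_key.verify(signature, digest)           # raises InvalidSignature if either side was altered
```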