Presumably when you're a native English speaker and have a broader interest the difficulty goes down a bit.
I like this project very much and would like to see some overall scores, and it might not hurt to allow for a verified result link to detect bragging rather than actual results (not that anybody on HN would ever brag about their score ;) ).
Overall: I'm not worried that generated papers will swamp the publications any day soon, but for spam/click farms this must be a godsend, and it will surely make it harder for search engines to distinguish real content from generated content.
Fortunately, SEO spam is currently nowhere near as coherent as this, and often features some phrases that are a dead giveaway ("Are you looking for X? You've come to the right place!" or a strangely-thesaurised version thereof), but I am also worried about this new generation of manufactured deception.
I found that people would also refuse the test and would believe whatever the output of the model was due to my choice of subject.
Others that did a similar exercise and tried to verify their results using reddit had a great deal of people who would be able to spot fakes quite easily.
The biggest issue would be someone using a system to deliberately fool a targeted set of people which is easy given how ad networks are run.
Now that's a challenge.
Also, if you train GPT on the whole corpus of Nature / Science / whatever articles up to, say, 2005, could you feed it leading text about discoveries after 2005 and see if it hypothesizes the justification for those discoveries in the same way that the authors did?
GPT can write "about" something from a prompt. This is not much different from me interpreting data that I'm analyzing. I'm constantly generating stories and checking them, until one story survives it all. How do I generate stories!? Seriously. I'm sure I have a GPT module in my left frontal cortex. I use it all the time when I think about actions I take, and it's what I try to ignore when I meditate. Its ongoing narrative is what feeds back into how I feel about things, which affects how I interact with things and what things I interact with ... not necessarily as a goal-driven decision process, more as a feedback-driven randomized selection. Isn't this kind of the basis of Cognitive Behavioral Therapy, meditation, etc.? See [1,2]. If you stick GPT and sentiment analysis into a room, will they produce a rumination feedback loop like a depressed person?
Anyway, if you can tell a coherent story to justify a result (once presented with a result), one that is convincing enough for people to believe and internalize the result in their future studies, how is that different from understanding that result and teaching it to others? The act of teaching is itself story generation. Mental models are just story-driven hacks that allow people to generalize results in an insanely complex system.
1. Jonathan Haidt, The Happiness Hypothesis
2. Buddhism and Modern Psychology (Coursera course)
Faking an entire 10 page paper with figures and citations is much harder. I'm sure it'll happen next week, but until then I can still say that's where real understanding is demonstrated.
So, basically it's achieved undergraduate level skills.
Under some definition of "understanding". GPT understands how to link words and concepts in a broadly correct manner. As long as the training data is valid, it's very plausible that it could connect some concepts that it genuinely and correctly understands are compatible, and do so in a way that humans had not considered.
It can't do research or verify truth, but I've seen several examples of it coming up with an idea that as far as I can tell had never been explored, but made perfect sense. It understood that those concepts fit together because it saw a chain connecting them through god-knows how many input texts, yet a human wouldn't have ever thought of it. That's still valuable.
As to how far that understanding can be developed... I'm not sure. It's hard to believe that algorithmically generated text would ever be able to somehow ensure that it produces true text, but then again ten years ago I would have scoffed at the idea of it getting as far as it already has.
To what extent have our brains already decided what to say while we still perceive ourselves as 'thinking about the wording'?
I was really impressed with gpt-2 but seeing this really gave me a feel for how much of a lack of understanding it has.
This challenge would be more interesting if there were "Neither is fake" and "Both are fake" buttons (and obviously, the test randomly showed two fake or two real articles in the mix).
A lot of the communication I have with folks is subtly flawed in logic or grammar, but that doesn’t make me think I’m working with a bunch of androids.
It’s natural and often even necessary to try to figure out an author’s intent when their writing doesn’t fully make sense.
The chicken genome (the genome of a chicken that is the subject of much chicken-related activity) is now compared to its chicken chicken-to-pecking age: from a genome sequence of chicken egg, only approximately 70% of the chicken genome sequences match the chicken egg genome, which suggests that the chicken may have beenancreatic.
(Related: https://www.youtube.com/watch?v=yL_-1d9OSdk )
4/4 on hard. Never read a Nature paper before.
Scores under 5 on what amounts to a coin flip don't strike me as so remarkable, especially when coupled with an incentivised reporting bias as we see here. ("I got a high score! Proud to share!" vs. "I got a low score, or an even score, and look at all the people reporting high scores, think I might keep it to myself.")
Being as it is, at this juncture, I think the AI may still have a chance to be strong with this one.
Also, were the AI to do well consistently, I'd think it might say more about the external unfamiliarity with, and the internal prevalence of, field-specific scientific jargon, than any AI's or human's innate intelligence.
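To put a rough number on the coin-flip point, here is a minimal sketch of the chance of a high score from pure guessing. It assumes each question is a two-way choice with a 50% guess rate, and the figure of 100 silent players is a made-up illustration, not something measured from this thread.

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """Probability of at least k correct out of n when guessing with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def expected_perfect_scores(players, n, p=0.5):
    """Expected number of players who get all n right by luck alone."""
    return players * p**n

# Assumption: two articles per question, so a blind guess is right half the time.
print(p_at_least(4, 4))                 # chance of 4/4 by guessing
print(p_at_least(4, 5))                 # chance of 4 or more out of 5 by guessing
print(expected_perfect_scores(100, 4))  # hypothetical 100 players, 4 questions each
```

Under those assumptions a 4/4 happens about 6% of the time by luck, so a handful of perfect scores among a hundred players who only post when they do well would not say much.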
>If I can't figure out what a paper is supposed to be talking about, it's fake.
Depends on the field...
The medical and biotech ones are much harder.
Three highly pathogenic β-coronaviruses have crossed the animal-to-human species barrier in the past two decades: SARS-CoV, MERS-CoV and SARS-CoV-2. To evaluate the possibility of identifying antibodies with broad neutralizing activity, we isolated a monoclonal antibody, termed B4, that cross-reacts with eight β-coronavirus spike glycoproteins, including all five human-infecting β-coronaviruses.
That's close enough for this to be a success if the purpose was to persuade or fool laymen.
> A new era of hyperaridididididemia revealed by single-cell RNA-seq
So some are better than others. :-)
Which is absolutely a real thing, except that the exact quantum properties in fact didn't commute, while they claimed they did commute, and said that for some reason simultaneous measurement required a third state anyway.
I don't know how I would have been able to distinguish that from completely reasonable methods for quantum error correction without knowing ahead of time which quantum states commute and which don't... pretty cool.
If I were skimming or half asleep I definitely wouldn't have caught a lot of these on hard. Abstracts are always so poorly written, and usually try too hard to sound complicated by using big words when small ones would do just fine!
Hard mode is good enough that I'd like to see some sort of distance metric to the nearest real story, to be sure the model isn't accidentally copying truth.
Yeah, I ran into at least one example which basically regurgitated a real paper. The "fake" article was:
Efficient organic light-emitting diodes from delayed fluorescence
A class of metal-free organic electroluminescent molecules is designed in which both singlet and triplet excitons contribute to light emission, leading to an intrinsic fluorescence efficiency greater than 90 per cent and an external electroluminescence efficiency comparable to that achieved in high-efficiency phosphorescence-based organic light-emitting diodes.
And the real paper it closely matches:
Highly efficient organic light-emitting diodes from delayed fluorescence
Here we report a class of metal-free organic electroluminescent molecules in which the energy gap between the singlet and triplet excited states is minimized by design (ref. 4), thereby promoting highly efficient spin up-conversion from non-radiative triplet states to radiative singlet states while maintaining high radiative decay rates, of more than 10^6 decays per second. In other words, these molecules harness both singlet and triplet excitons for light emission through fluorescence decay channels, leading to an intrinsic fluorescence efficiency in excess of 90 per cent and a very high external electroluminescence efficiency, of more than 19 per cent, which is comparable to that achieved in high-efficiency phosphorescence-based OLEDs.
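For what it's worth, the "distance to the nearest real story" check suggested above could be sketched roughly like this. It uses Jaccard overlap of word trigrams, which is just one simple choice of metric and not necessarily what the site's author would use. The function names and the tiny corpus are hypothetical, and the corpus reuses titles quoted in this thread.

```python
import re

def word_ngrams(text, n=3):
    """Set of lowercase word n-grams; punctuation and hyphens act as separators."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of the two texts' n-gram sets (0.0 if either set is empty)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def nearest_real(generated, corpus, n=3):
    """Return (similarity, text) for the corpus entry closest to the generated text."""
    return max(((jaccard(generated, real, n), real) for real in corpus),
               key=lambda pair: pair[0])

# Hypothetical mini-corpus built from titles quoted in this thread.
real_titles = [
    "Highly efficient organic light-emitting diodes from delayed fluorescence",
    "Generating Fake Cyber Threat Intelligence Using Transformer-Based Models",
]
fake_title = "Efficient organic light-emitting diodes from delayed fluorescence"

score, match = nearest_real(fake_title, real_titles)
print(round(score, 3), "->", match)
```

A score near 1 would flag a "fake" that is mostly a copy of a real entry, while genuinely novel text should score close to 0 against everything. A real version would need the full training corpus and probably something faster than a linear scan.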
Knowing how they're generated, a sequence of sentences that makes sense is likely copied almost verbatim from an article written by a human. Without understanding the concepts, the algorithm may simply repeat words that go well together - and what goes together better than sentences that were written together in the first place?
What the GPT model is really good at is identifying when a sentence makes sense in the current context. Given that it has half of the internet as its learning corpus, it could easily be simply returning a piece of text that we do not know about. The real achievement thus is finding ideas that are actually appropriate in relation to their input text.
With these GPT models, I don't get the appeal of creating fake text that at best can pass as real to someone who doesn't understand the topic and context. What's the use case? Generating more believable spam for social media? Anything else? Because there's no real knowledge representation or information extraction going on here.
I don't know if this translates to technical writing, but it's possible someone might complete a prompt on some specific topic(s) and then use that as a point to start from, especially if they're knowledgeable enough on the topic to correct the output. It's nice to be able to skip a lot of boilerplate words (how many words in this comment are actually the meat of this idea, and how many words are just there to tie all those morsels together?)
I also built a more polished version to add to the Notebook.ai document editor (so writers can get some continuation prompts whenever they get a bit of writer's block), but the pricing made it unfeasible to actually release. Notebook.ai is also open source though, and you can see the GPT-3 functionality in the unmerged PR here: https://github.com/indentlabs/notebook/pull/739
One use-case is as a creative writing assistant. Typical ways to beat writer's block are to do something else (read, walk, talk, dream). At some point, either consciously or subconsciously, the hope is that these activities will elicit inspiration in the form of an experience or idea that will connect with the central vision of the author's work, allowing the writing to continue. So too it could be with prompts generated from these models, just another way of prompting and filtering ideas from the sensory soup of reality.
(While the "dark" version of reality has bedroom "writers" pooping out entire forests of auto-generated pulp upon ever-jaded readers, being able to instantly mashup the entire literary works of mankind into small contextual prompts would be another tool in the belt for more measured and experienced authors.)
There are other more practical use-cases for these models. There's work being done on auto-generating working or near-working software components from human language descriptions. Personally I'd love to just write functional tests and let the "AI" keep at it until all the tests pass. So seeing the models improve over time is a sign that this may not be an impossible feat.
From a less utilitarian perspective, I'd love computer assistants to have a bit of "personality". "Hey Jeeves, tell me the story about the druid who tapdanced on the moon." and just let the word salad play out in the background. Yeah it's a toy, but it would jazz up the place a bit, add a bit of sass even.
I do think we're a ways off the first computer-generated science paper being successfully peer reviewed and contributing something new to human understanding of nature. I'd have scoffed at the idea ten years ago, but now I'm sure it's only a matter of time.
Screenshot including the spam post, in case it's removed:
This is going to suck.
I've reviewed articles that were completely made up and the other reviewer didn't even detect that. Nor did the editor.
I've contacted editors about utterly wrong papers, criticized the article on PubPeer, and the article is still published... because retracting it would harm their reputation. That's one of the maddening aspects of academic publishing.
Trying hard right now, will report results after.
Edit: Yeah, it's the same deal. Length becomes less of a giveaway but its errors become more obvious.
We have known for a while that language models can generate superficially good-looking text. The real question is whether they can get to actually understand what is being said. As humans don't understand either, the exercise is sadly moot.
edit: don't understand why i'm getting downvoted. is my comment not relevant to a post about the plausibility of abstracts generated by ML models?
Only one was convincing enough to be truly challenging. I got it right because 1) I had domain expertise, so the proposed mechanism looked fishy, and 2) the date of the paper made no sense relative to when that sort of discovery would be made (2009 is too early).
Suggested tweak - train it against papers written by people with an Erdos number < 3 (or Feynman contributors, etc.), so that the topics and fake topics are more closely related in style and content. Maybe even feed it some of their professional letters as well. That would produce some very hard to decipher fakes.
Another great corpus for complex writing is public law books. Have it compare real laws from the training set with fake laws. I bet it would be very difficult to figure out the fake laws.
Training one of these on an entire corpus of one author (Roger Ebert, Justice Ginsburg, Joyce, anyone with a large enough body of work), and having people spot the fake paragraphs from the real ones would be very, very difficult. An entire text, however, would likely be discernible.
It is getting really, really close to being able to fool any layman, though. Impressive work!
On the other hand, an interesting possibility with well-designed text-mining and AI models would be for them to generate valid hypotheses that hadn't been contemplated earlier, based on the massive corpus of scientific publications. The model may be able to find possible correlations or interesting ideas by combining sources from different fields that would normally be ignored by the over-specialised research community. However, in that case the model wouldn't be valuable for providing answers; rather, its value would be in providing questions.
Priyanka Ranade, Aritran Piplai, Sudip Mittal, Anupam Joshi, and Tim Finin, Generating Fake Cyber Threat Intelligence Using Transformer-Based Models, Int. Joint Conf. on Neural Networks, IEEE, 2021. https://ebiq.org/p/969
The incomprehensibility comes from the fact that abstracts (and particularly NPG abstracts) are trying to do many things at once--and all in 200 words. In theory, the abstract should describe why your work is of broad general interest (so Nature's editors will publish it), while explaining the specific scientific question and answer(!) to a specialist audience of often-picky, sometimes-hostile peer reviewers, and conforming to a fairly specific style that doesn't reference the rest of the paper.
It's tough to do well, and even more so for non-native English speakers.
I kept seeing certain types of grammatical error, such as constructs like "... and foo, despite foo, so..." or "with foo, but not foo..." where foo is the exact same word or phrase appearing twice in a sentence.
I also kept seeing sentences with two clauses that should have agreed in number or tense but did not.
It really does like to repeat itself.
"This study presents the phylogenetic characterization of the beak and beak of beak whales; it is suggested that the beak and beak-toed beaks share common cranial bones, providing support for the idea that beaks are a new species of eutriconodont mammal."
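The repetition tell described above is easy enough to check mechanically. Here is a rough sketch that counts repeated content words in a single sentence. The stopword list is a small hand-picked assumption, and the threshold of two occurrences is arbitrary; real abstracts legitimately repeat key terms, so this is a hint and not a detector.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, not a standard one.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "but", "by", "for", "from", "in",
    "is", "it", "not", "of", "on", "or", "so", "that", "the", "this", "to",
    "was", "were", "with",
}

def repeated_words(sentence, min_count=2):
    """Map each non-stopword appearing at least min_count times to its count."""
    words = re.findall(r"[a-z]+", sentence.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w: c for w, c in counts.items() if c >= min_count}

sample = (
    "This study presents the phylogenetic characterization of the beak and "
    "beak of beak whales; it is suggested that the beak and beak-toed beaks "
    "share common cranial bones, providing support for the idea that beaks "
    "are a new species of eutriconodont mammal."
)

print(repeated_words(sample))
```

On the beak whale sentence this flags "beak" five times and "beaks" twice, which matches the impression you get from reading it.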
[I know nothing - I'm pretty ignorant about practical ML]
Some of my favorites: "A new onset of primeval black magic in magic-ring crystals"
"The genetic network for moderate religiosity in one thousand bespectacled twins"
"Thermal vestige of the '70s and '00s disco ball trend"
I'd been having trouble with ones which had a reasonable logical flow, but didn't communicate a complete idea.
Of course, pretty small N so YMMV
I would be curious to try again with GPT-3.