
Plagiarism detectors are a crutch, and a problem - pseudolus
https://www.nature.com/articles/d41586-019-00893-5
======
anamexis
Once in undergrad, I got an email from my professor one day instructing me to
meet him in his office with all of my notes and materials for a research paper
I had recently turned in.

When I arrived, his first words to me were "I think we both know why you're here."
I didn't.

He explained that he had run my paper through a plagiarism detector, and it
had come back as a 100% match with a paper submitted to a university in another
state. I was rather dumbfounded because I had researched and written the paper
entirely on my own.

We quickly came to an impasse, because he had faith in this rather damning
score from the plagiarism detector, and I obviously was quite sure I had in
fact written the paper myself.

As I left the office and began gearing up for a battle with the Office of
Academic Integrity, I got another email from him profusely apologizing and
telling me he had run the paper through the same plagiarism detector again and
it had come back clean.

I'm really glad I didn't have to fight that fight, but I got awful close. My
theory is that there was some problem during the upload, and it just checked
an empty document against another empty document and found they matched
exactly, but I'll never know.

~~~
umvi
One time in a take home programming interview problem I was asked to create my
own quicksort implementation in python. I had already done this previously in
answer to a StackOverflow question, so I just submitted a copy of that
implementation.

Later I was told I was disqualified because I had cheated and copied my answer
from StackOverflow. I tried to explain that the user I copied it from was
indeed me and that I could prove it by logging into SO in front of them but I
never heard back.

~~~
zaccus
FFS how many different implementations of quicksort in Python can there
possibly be? How would they expect you to make yours unique? Different
variable names or something?

I'll never understand the concept of plagiarizing long-established solutions
to problems.

~~~
jplayer01
It's ridiculous. If you're going to require a take home programming test, at
least make it something that isn't solved immediately just by opening SO. Ask
a cookie cutter question, get a cookie cutter answer.

~~~
ghaff
Or Wikipedia. (OK, you'd have to turn the pseudocode into Python. I'm not even
primarily a developer and that would take me about 10 minutes.) Or wherever.
There are apparently a few slightly different implementation details to
Quicksort but how unique can a Quicksort be?

------
btrettel
I once reviewed a journal article that had at least a few pages of plagiarized
text in it. I noticed only because I thought some sentences sounded like I
wrote them, and as it turned out, I did write them. Not surprisingly, I found
the paper even worse from a technical standpoint, so I gave it a rather
negative review. I tried to be as helpful as possible, even suggesting what
they should do instead of what they did. It appears the paper was rejected.

Fast forward 6 to 12 months. I found what seems to be a descendant of the
paper published in another journal. While the vast majority of the plagiarism
I identified was removed, they added some parts _plagiarized from my review_.
I contacted the journal, who replied that plagiarism detection software didn't
find anything. I replied that it wouldn't because a confidential review would
not be in their system. Plus, I gave them a list of plagiarized parts; all
they had to do was check against the paper. I told them that I did not spend
much time checking for plagiarism, so for all I know there's a lot more
plagiarism in the paper. The journal representative never did anything about
the paper, unfortunately. I now refuse to publish in that journal and
discourage others from publishing there as well.

~~~
gruez
>The journal representative never did anything about the paper, unfortunately.

I wonder if you could submit DMCA requests for this.

~~~
btrettel
I thought about that, but was too busy. I might try that later this year,
though.

~~~
btrettel
A followup: I am in contact with lawyers at my university now...

------
caconym_
Are students really comparing their work against the scores generated by these
tools to game the scores lower because they're worried about being punished
for false positives?

That is, pardon my French, fucked up.

Beyond that it's egregious that any academic institution would allow the
success of students/researchers/whatever to hinge on scores generated by a
black box algorithm that hasn't been rigorously proven as very close to 100%
accurate. There should always be a human in the loop, and if we don't have
enough humans for that then I think we have a problem that goes beyond
automated plagiarism checkers.

~~~
dahfizz
In my experience, that's not happening.

First of all, if your work is submitted through these detection systems, you
almost never have the opportunity to see your 'score' and then resubmit.
That's just asking for people to cheat the system.

These plagiarism checkers are also not really smart enough to differentiate
proper citation from actual plagiarism. Their function is to alert the person
grading the papers to sections that might be problematic. It will highlight
"plagiarized" sections, but the professor will ignore it if it's properly
quoted and attributed or if it's just a short phrase, etc.

~~~
StavrosK
What stops me from renting the software and running my own paper through it to
see what it says?

~~~
umvi
They usually only grant licenses to certain organizations. Same reason you
can't simply buy the solutions manual to any given textbook.

~~~
paulmd
Quite often you can buy the solutions manual to a given textbook. They are on
half.com or available in PDF format a lot of the time.

They won't show you the work, so copying the answer is not sufficient to get
credit in practically any course.

"Academic integrity" has ridiculous rules (e.g. self-plagiarism is not really a
thing that anyone should care about) and I'm sure some people would be pissed
about that, but if you're doing the work I personally don't see anything wrong
with getting immediate feedback on whether you've done it correctly or not.
Particularly given that many textbooks do like one or two simple examples and
then jump to complex problems in the assignments... being able to check a
couple of problems to be sure you understand the process is extremely helpful
as a learning aid.

If you want a real-world assessment of whether someone has mastered a concept,
schedule an in-class quiz or a test.

~~~
cortesoft
> self-plagiarism is not really a thing that anyone should care about

Sure it is. The entire methodology of schooling is to learn through exercise;
if you are asked to write a paper, it isn't because the teacher wants the
information in the paper. They want you to learn through the act of writing
the paper.

If you just turn in a paper you have written for another class, you are not
doing that exercise.

Turning in a paper that you wrote for a different class and getting upset that
the school doesn't approve is like thinking that sending a picture of you
lifting weights 3 years ago is going to make you stronger.

~~~
hopler
All of this nonsense comes from one root evil: giving publicly viewable grades
for fundamentally unoriginal work. Grades and homework should be for the
student's private benefit. The only public marks should be proctored exams for
demonstrating skills, and whatever original work someone creates that has
value standing on its own outside the artificial school environment.

~~~
cortesoft
A proctored exam can only really measure a few types of skills, and doesn't
work very well for demonstrating skills learned in many classes.

Take, for example, a class that is about learning to write effectively, which
is clearly a valuable skill that adults need no matter what they do in life.

Part of that skill is being able to construct, from scratch, an effective
piece of writing in a length of time that is longer than a proctored exam
session. You can only measure that skill by having them do that, and if they
turn in work they have written prior, you aren't learning anything about their
ability to write something effectively on demand.

Research classes are similar; you can't demonstrate that you know how to do
research well in a 3 hour exam. Programming classes, too; you can't measure
the ability to create a large project in a 3 hour exam.

~~~
hopler
I think a defense (oral examination) of the paper can cover that scenario.

------
ksaj
When teachers rely on the output of these detectors without checking the
source material, they are effectively doing what they are accusing their
students of doing with Wikipedia, etc.

The end result may be totally convincing, but nobody really knows until they
actually look at what they _think_ they are citing. The teachers should
operate at least at the lowest level they impose on their students -- Check
your work. Don't blindly copy someone else's results.

------
nordicway
This is really nothing new. "Ethical problems will never be resolved by a
software tool" [1]

The author uses a straw man argument that plagiarism detection software
promises to "replace the professor" when in fact, it is just a tool.

Besides, there are highly sophisticated ways of cheating that even a professor
would not catch reliably. I wrote my undergrad thesis about plagiarism and it
was interesting to learn that there are students who are highly motivated to
"beat the system" instead of doing the task properly.

[1]
[https://www.researchgate.net/publication/248606848_Support_f...](https://www.researchgate.net/publication/248606848_Support_for_checking_plagiarism_in_e-learning)

------
kelnage
As someone who teaches degree level content to apprentices, I definitely think
these tools have value. I know that until very recently my current institution
has not bothered with such a tool and I saw with my own eyes that quite
obvious (to me) plagiarism was rampant and that the already overburdened
assignment markers were not catching these instances. By moving to an
automated tool, our students were immediately put on notice that such
behaviour would not be acceptable.

That said, I do still hear worrying comments from my colleagues (“what number
is too much?” is I’m afraid a common question) - but I can clearly see the
change in student behaviour - so I don’t think this is such a bad thing. Yes,
markers do still need to watch for plagiarism - and there are of course other
types of plagiarism that these tools can’t usually deal with (ghost-written
essays being a common talking point these days), but just because a tool may
be misused or may be misunderstood, it certainly doesn’t necessarily mean it
is all bad.

~~~
ghaff
I don't have any real experience with these tools but for writing (as opposed
to some types of code, especially well-defined algorithms), the odds that two
pieces of writing are going to be highly similar to each other by chance are
pretty tiny.

Sure, the same quotes from another source may be used here and there (and
hopefully cited). And there may be some rote definition or other short text
lifted from Wikipedia, etc. that maybe doesn't rise to the level of requiring
a footnote. But if any sections of text longer than a line or two aren't
substantially different, something is going on.

------
TanyaGal
Plagiarism detectors are more akin to linters than code reviewers. Arguably
helpful but not sufficient.

~~~
wwarner
Agree and would add that there is no reason that the algorithms need to be
secret. Well, there _might_ be some value in keeping the algorithm from the
writers, but definitely none in keeping it from the editor/teacher. Also, the
algorithm can report on the likelihood of its own error, which can help
teachers to interpret the result.

~~~
Zigurd
Coders live in a bit of a utopia. Imagine your code quality being scored by a
proprietary black box. That would be intolerable and contrary to the spirit of
producing a quality product and developing better coders.

Keeping the algorithms secret for plagiarism detectors is about as dubious. If
they are not good enough to reject gaming they are not robust enough to
deliver reliable scores. If you don't trust students to see that gaming a
plagiarism detector is more effort than honest work, you've admitted the wrong
student, and/or you are using a duff plagiarism detector.

~~~
paulmd
At the very tail end of when I was going to college, classes were starting to
move to "online assignments" where you type your answer into an online tool
which matches it as a string against the expected answer, and it's exactly as
intolerable as you imagine.

Teachers like them because they don't have to grade papers, textbook
publishers like them because the online code prevents reselling the textbook
(or at least, the new customer will need to purchase a key anyway).

~~~
hopler
This seems to complete the charade that a college degree is just a taxi
medallion, an inherently worthless object that only has value because powers
conspire to require it as a roadblock to something important, and then extract
money from people who need to get past the roadblock.

------
sov
Honestly, I'm a little disappointed with this article. Plagiarism detectors
themselves are neither a crutch nor a problem--they're simply a tool. People's
_use_ of them as the grand arbiter on plagiarism as though they were stone
tablets delivered to them from Mount Sinai is the problem.

Despite the title, I thought that's what the author was talking about for the
majority of this article, but then they mention things like...

> Only if a text is somehow off, and online searching does not help, should
> software systems be consulted.

Sorry, what? The only reason text would sound weird is if it's been
specifically mangled to defeat plagiarism software (which makes plagiarism
software already useful). There's practically no way you, as a student marker,
busy professor, or someone reviewing hundreds of academic proposals has the
time to slowly wade through each paper you get by googling choice sentences
manually--something the plagiarism checker software can do just fine. I'm
more-so confused, by the pair of statements that 1) "Software cannot determine
plagiarism [...] That decision must be taken by a person" but also 2) "Only if
a text is somehow off, and online searching does not help, should software
systems be consulted". Isn't the author's whole point that plagiarism software
false flags all the time? Isn't this, then, just "hey this sentence sounds
funny, time to fail this student on some plagiarism."

If their argument, however, is that you should use plagiarism software for
curatable results, then this seems like the _opposite_ order of what should be
done. Why waste all your time fruitlessly finessing Google if the software
will straight up just find the OG source for you? You're not going to have
read every single piece of writing conceivably connected to the essay/etc.
you're reviewing (unless it's a field you know much about and also so narrow
that you'd never consider using plagiarism software in the first place), so
you're bound to be missing actual plagiarism all the time.

------
slaymaker1907
I think the idea that they work is more important than these systems actually
working. The hope is that these systems scare students enough that they don’t
try to plagiarize in the first place.

I don’t think these systems are nearly as egregious as the 5 paragraph essay
scoring system they used when I was in 9th grade, around 2009-2010. I thought
it was fine back then, but now I wonder how that thing had even the appearance
of working some of the time.

~~~
Jach
Heh, in my English class in 9th grade (2005-2006) for one essay our teacher
had us try one of those new automatic essay scoring programs. The scores were
all over the place, for mine it claimed I didn't even address the essay
subject at all. Anyway, the teacher was reassured in his skepticism and
continued with the practice of manual grading and feedback.

I'd rather be the victim of a computer's bogus score than the victim of a
computer's bogus accusation of wrong-doing. On the other hand it's important
for everyone to experience the latter (by computer or not) at least once, and
at least with plagiarism even if it's true you're not going to go to jail...

------
jcrawfordor
The university where I'm studying for my MS uses such a system and,
interestingly, has it integrated into their course management system such that
I can actually see the reports on my own assignments. I've noticed that on a
number of lengthy papers I've turned in it marks nothing at all, which is
conspicuous considering that I tend to quote from other literature very
heavily.

I'm not sure if the plagiarism service is just not very effective or if the
fact that I turn in mostly LaTeX-generated PDFs is causing a problem with
their text extraction. I actually suspect the latter, because a couple of
times I've turned in MS Word files it's marked one or two random phrases as
suspect.

------
saternius
As the founder of a startup that uses AI to paraphrase text
([https://quillbot.com](https://quillbot.com)), I find that this article is
hitting a very valid point. As fluency-enhancing tools become more prevalent
and higher quality, submissions will become more uniform, and thus there will
be more unintentional overlap. Plagiarism detection software will become
increasingly less reliable.

~~~
craftinator
Sounds like good territory for implementing a GAN. You end up with one network
that aggregates and summarizes, and another that penalizes for lack of
originality. Though I suppose there's a good chance the GAN would drift away
from human patterns as it trains.

------
cmroanirgo
I've worked a bit in this space, porting Dr Bloomfield's WCopyfind [0] algo to
js [1] for a small non profit (& other proprietary solutions). One of the
harder forms of plagiarism over copy-paste issues is that of ideas. It's easy
to cross reference phrases in text A to text B, but what if it's reworded in
such a way that it's conceptually the same? In some areas of research this is
a huge ongoing problem and although there are some small inroads being made,
it's still a huge problem. It's also compounded by the issue of language
translation. eg. text A (in English) was reworded in text B (in Chinese)

[0]
[https://plagiarism.bloomfieldmedia.com/software/wcopyfind/](https://plagiarism.bloomfieldmedia.com/software/wcopyfind/)

[1] [https://github.com/cmroanirgo/pl-copyfind/](https://github.com/cmroanirgo/pl-copyfind/)
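The copy-paste case -- cross-referencing phrases in text A against text B -- is the tractable half of the problem described above. A toy sketch in the spirit of WCopyfind (not its actual algorithm; the 6-word window here is an arbitrary choice) is shingled word n-grams:

```python
# Toy phrase-level matcher: shingle each text into word n-grams and
# report the overlap. Real tools add normalization, fuzzy matching,
# and tuned window sizes; this only catches near-verbatim copying.
def shingles(text, n=6):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(a, b, n=6):
    return shingles(a, n) & shingles(b, n)

def overlap_score(a, b, n=6):
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))
```

Nothing like this catches conceptual rewording or cross-language paraphrase; the shingles only match verbatim word runs, which is exactly the limitation the comment points at.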

------
paulcole
Michael Crichton (somewhat famously) retyped and submitted a George Orwell
essay for an undergrad literature course at Harvard. Orwell got a B- and
Crichton decided lit wasn't for him and ended up in anthropology/pre-med.

Today he'd just be expelled!

------
loosetypes
The CS department at my university does the same with code submissions using a
tool called Moss[0].

Is this the norm?

[0]
[https://theory.stanford.edu/~aiken/moss/](https://theory.stanford.edu/~aiken/moss/)
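Moss is reportedly based on winnowing document fingerprints (Schleimer, Wilkerson & Aiken). A toy sketch of the core idea -- with illustrative k/window sizes and none of Moss's real token normalization -- hashes every k-gram and keeps the minimum hash per sliding window:

```python
import hashlib

# Toy winnowing fingerprinter: two submissions that share many selected
# fingerprints share long literal substrings. The k and window values
# are illustrative; real Moss also normalizes code (identifiers,
# whitespace) before fingerprinting.
def kgram_hashes(text, k=5):
    return [int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16)
            for i in range(len(text) - k + 1)]

def winnow(text, k=5, window=4):
    hashes = kgram_hashes(text, k)
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}

def similarity(a, b):
    fa, fb = winnow(a), winnow(b)
    return len(fa & fb) / max(1, min(len(fa), len(fb)))
```

Keeping only window minima shrinks the fingerprint set while still guaranteeing that any sufficiently long shared substring leaves at least one shared fingerprint.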

~~~
fokinsean
Yup, my university used Moss and caught a ring of cheaters with it. Funnily
enough at my first job I actually worked on a product that was essentially
Moss with a slicker UI, and it would also check online sources such as Github.
It worked fairly well, essentially we highlight areas of code that look
suspect and it is up to the professor to determine whether it is copied or
not.

Haven't worked there in almost a year so I'm not sure if the product took off
or not.

------
kemiller2002
What is sad is that this could be a great tool for students. It would be
really nice to have something that checks your paper and gives you warnings
about missing citations etc. With the ones I've seen, you can only check your
work after you submit it for grading. I'm not saying that people don't
deliberately plagiarize, but it's also easy to make a mistake when you're
under pressure.

------
pard68
I used to work in a helpdesk at a university. I don't know how many times I
had to explain how to interpret the plagiarism scores they would get from the
plagiarism program we had. Many times they wanted to trust the simple number
in front of them and not do the leg work needed to interpret that score.

------
dekhn
This is funny because I've talked to multiple computer scientists who teach
classes where all the programming projects are gated by a plagiarism detector
and they all insist it has 100% TP and TN rate, 0% FP and FN rate.

------
hartator
Some of these tools use our company SerpApi.com as a data backend.

To be fair, they seem to be doing a good job and they are good at detecting
copy/paste from online sources. But, of course, results are to be taken with a
grain of salt.

------
m463
I think it is all just a data grab. They HAVE to identify students on their
computers, then link with lots and lots of other companies which tie
everything together. Look at the turnitin privacy policy.

------
tsumnia
Plagiarism detectors for code, like MOSS, are a major issue. Take a simple
getter/setter method. MOSS will mark these as duplicates unless you explicitly
go in and say "no, it's okay for THIS to be the same". This adds a degree of
bias to what is justifiably "expected" and "unique" for student submissions.
As someone who uses MOSS, I regularly look at any flags and "eyeball" them,
and in only 2 instances did it result in a legitimate integrity violation. Not
to mention that I have to hunt for all possible solutions across the internet
and add them to the pile to ensure students didn't "copy it from a website".

I think current CS is doing itself a disservice because we want students to
collaborate, but discourage them from even looking at each other's code (that
is literally how students have spoken to me about it). I don't know if English
or History students refuse to look over each other's work, but I doubt it.

Soloway described programmers as having "recurring basic patterns" [2]. If you
accept his theories, we literally have tiny little rolodexes in our minds and
when we encounter a problem we go "hang on, I've done this before" and pull
out some code we've worked on in the past. We literally have tiny snippets of
code we try to reuse if we can.

Instead, for well known algorithms, just give them the code! My research [1]
requires them to at least do it once, but either way - what good is it to have
a student reinvent how A* search works? What about inserting into a Linked
List? There are only a finite number of ways to implement these things.
Instead of discouraging StackOverflow, and letting students try to sneak that
type of work into their code, control it through the use of additional
examples. They are less pressed to copy and paste code from a 3rd party if
you've literally given it to them.

We want students to understand how to take something like a programming
language and use it to solve a problem, but then we confound the entire
process by not training them on the language. Even among SIGCSE, people can't
agree on what CS1 should be, and somehow everyone wants it to be "more than
just learning to program" like it's a bad thing.

Back to plagiarism detectors for code: I think our current track discourages
students from being open and helping each other, because they fear that if
they describe how they solved the exact same problem, there goes college. The
result is encouraging the loner stereotype instead of reducing it. Hell, pair
programming has to be forced on us because we can't "work with other people".
I recognize some simply don't like it; however, my point is we've trained
ourselves to not work together for so long that we designed a coding paradigm
to have "controlled helping someone code".

[1]
[https://research.csc.ncsu.edu/arglab/projects/exercises.html](https://research.csc.ncsu.edu/arglab/projects/exercises.html)

[2]
[https://www.sciencedirect.com/science/article/pii/B978093461...](https://www.sciencedirect.com/science/article/pii/B9780934613125500422)

~~~
Jach
There are just too many competing tugs. Something has to give. The problem of
plagiarism goes away entirely if you make the course grade consist entirely of
the final in-person exam. That doesn't mean you have to give up assigning
homework and scoring it, but for mysterious reasons, probably related in part
to modern students and teachers not wanting to do anything unless forced to by
the system, colleges have moved away from the "final is everything" model.

~~~
tsumnia
This is something that I emphasized in my CS1 courses (I haven't taught them
in a few years). Homework was 50% divided into 10% for Typing Exercises (my
research), 10% for Lecture Exercises (MCQ, TF, Fill in the Blanks), and 30%
Programming. Exams accounted for the other 50% (Midterm 20%, Final 30%). The
big thing is that 20% of their grade was simply drilling. Explicit cheating
wouldn't help with those exercises, and they were worth just enough to impact
your grade if you skipped them.

Exams were in class, 2-3 hours long, and included a coding-without-Internet
section. This approach does become more difficult with a 75 minute course
(which is what NC State has), simply due to time constraints, so I can agree
with you there.

However, your point about "Something has to give" is one of the reasons I
argue that we expect too much out of CS1. In other technical skills, like
cooking, music, martial arts, dance, etc., the introduction is much more about
accumulating the technical skills necessary. Music theory isn't as prominent
in an Intro to Singing class; however, in Intro to CS, I don't think we do
that. As my course structure should indicate, I think intro courses should be
more drilling to build the neuromuscular memory because then at higher levels
we can do more critical thinking with the trust that students "know how to
execute the plan". I do think that something like Problem Solving should be
given its own separate course where its skills can be trained and honed in
parallel with, but separate from, Programming.

------
0xdeadbeefbabe
Is this what they meant when they told me software would eat the world?

------
diminoten
Why not both? If the _real_ problem is that academics aren't reading the
papers, couldn't the solution here be to use both the automated tools _and_ to
pay better attention to what they're consuming? Maybe improve the UX of the
plagiarism tools as well.

"Throw it all out" doesn't seem like a real solution to me, more of a flashy
headline.

~~~
setr
Do you mean not reading the plagiarism checker’s output, or the paper itself?
Because the latter isn’t sufficient; you’d have to have read anything the
student might’ve (presumably found by googling), and pattern match the
plagiarized text. If it’s not blatant/poorly inserted, there’s not really any
tell.

