
Plagiarized news sites are using Cyrillic characters to avoid detection - mschenk
https://hoax-alert.leadstories.com/3469009-forget-russian-bots-fake-native-americans-are-using-russian-characters-to-avoid-fake-news-and-plagia.html
======
iluxonchik
I doubt websites hosted in Eastern Europe care about copyright legal threats.
Even if they contact the hosting provider directly I doubt any action will be
taken. Eastern Europe has plenty of cheap, shady hosting providers where you
can host pretty much anything that you want. Unless the website is making a
lot of money, nobody is going to spend significant resources to take those
websites down.

Let me speculate on why they might be doing it. Google will de-rank pages that
have content that's identical to others (e.g. identical paragraphs of text).
Maybe Facebook is doing something similar?

Let's say one of your friends shares an article from BuzzFeed.com, then
another friend shares an exact clone of this article from FakeBuzzFeed.com.
Now, Facebook might not want to show two articles with the same title from two
different websites on your timeline. And considering that BuzzFeed.com is a
website with a higher ranking than FakeBuzzFeed.com, it will probably choose
to display only the first one. If you do the Cyrillic trick to the article in
FakeBuzzFeed.com, Facebook will think it's something completely different and
present it to you, thus getting you a higher reach.

The same applies to the advertising part: if you're constantly submitting page
ads with exactly the same titles as the ones that real users are sharing, it
might get you banned.

~~~
RustGirl
I think you're right. Also news stories tend to have a currency. By the time a
copyright owner complains, the news is old news. They don't want some
automatic process to mark their content as "spammy" and not feature their
links or content if someone pushes it.

------
Cynddl
A bit of extrapolation here. In short, a few websites dedicated to making easy
money on Facebook by copying articles have started to use Unicode to obfuscate
the title. They automatically replace Latin characters with similar-looking
letters.

This makes their titles harder to detect for Facebook, fact-checking
websites, or DMCA/copyright bots. Nothing related to Russia here.

~~~
mfoy_
Or Native Americans.

------
AndrewNCarr
Here is a project that maintains a list of homoglyphs and has some Java and
Javascript code for detecting them.

[https://github.com/codebox/homoglyph](https://github.com/codebox/homoglyph)

The list itself in sorted text format, each line a list of similar glyphs:

[https://github.com/codebox/homoglyph/blob/master/raw_data/ch...](https://github.com/codebox/homoglyph/blob/master/raw_data/chars.txt)

------
frits1993
This reminds me of a project back in 2014, where a school-mate and I created
an "uncopyable" font using the same idea.

I put the site back online at
[http://nopy.progresso-ict.nl/](http://nopy.progresso-ict.nl/) ($10 PayPal
money has already been given away years ago)

------
haneefmubarak
I think at some point, a sort of visual normalization that converts
similar-looking Unicode to a single canonical string (e.g. converting certain
letters from Cyrillic and other scripts that also appear in Latin
to just Latin) is just going to be necessary as a security precaution.

Given the whole "fake news" thing over the past couple of years, I expect that
the first step will be taken by one of Google/Twitter/Facebook/etc, but I hope
that they (or someone else) release a library (or worst case, an online API)
that allows this sort of normalization for security verification. I get that
having it open would make it easier for people to find loopholes by brute-
force testing, but these sorts of loopholes could also be patched rather
quickly as they came up, providing benefit to everyone (especially from a
security perspective).

EDIT: Perhaps this could start out as a series of matches generated using ML
classification? I don't know much about ML - does anyone who does think this
is a realistic starting point?
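
A minimal sketch of what such a visual normalization could look like, assuming a tiny hand-picked mapping rather than ML. The table and the function names `fold_homoglyphs` and `same_after_folding` are made up for illustration; a real library would need a far larger list of lookalikes:

```python
# Hypothetical, hand-picked table of Cyrillic letters that resemble Latin ones.
# This is an illustrative subset, not a complete confusables list.
CYRILLIC_TO_LATIN = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x", "\u0443": "y", "\u0456": "i",
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
}

# str.translate needs a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(CYRILLIC_TO_LATIN)


def fold_homoglyphs(text: str) -> str:
    """Replace each known lookalike with its Latin counterpart."""
    return text.translate(_TABLE)


def same_after_folding(a: str, b: str) -> bool:
    """Compare two strings after folding lookalikes and ignoring case."""
    return fold_homoglyphs(a).casefold() == fold_homoglyphs(b).casefold()


# "google" with Cyrillic o and e swapped in
spoofed = "g\u043e\u043egl\u0435"
print(spoofed == "google")                    # False: different code points
print(same_after_folding(spoofed, "google"))  # True once folded
```

Note that folding like this would mangle genuine Cyrillic text, so it probably only makes sense as a comparison key, not as something you store or display.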

~~~
sedev
Unicode normalization/equivalence is half of what you want and UCAPI is
probably the other half.

[https://en.wikipedia.org/wiki/Unicode_equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence)
[https://www.casaba.com/products/UCAPI/](https://www.casaba.com/products/UCAPI/)
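
To illustrate why normalization alone is only half of it, here is a small sketch using Python's standard `unicodedata` module. NFKC folds compatibility forms such as fullwidth letters, but it leaves Cyrillic lookalikes untouched, since they are distinct letters and not compatibility variants. The sample strings and the `nfkc` helper are made up for the example:

```python
import unicodedata

# "google" written with fullwidth Latin letters (compatibility characters)
fullwidth = "\uff47\uff4f\uff4f\uff47\uff4c\uff45"
# "google" with Cyrillic o (U+043E) and e (U+0435) swapped in
cyrillic_mix = "g\u043e\u043egl\u0435"


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


print(nfkc(fullwidth) == "google")     # True: fullwidth forms are folded
print(nfkc(cyrillic_mix) == "google")  # False: lookalikes survive NFKC
```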

~~~
haneefmubarak
Those are interesting, especially the second one. I've read a little here and
there about Unicode normalization before, but UCAPI does look like what I
really wanted. However, seeing as UCAPI isn't free or even "listed price
plans", I get the strong feeling that this will not see much pickup (at least
until someone makes a free one).

~~~
BoorishBears
There are plenty of free libraries that allow you to detect and compare
strings with confusables

------
beager
Should be easy enough for networks to detect and remove these, by identifying
content where the characters in words routinely fall outside the character
set of the language the content is written in.

That, or some sort of fuzzy CV hashing, which is cool, but more intensive.
That would also mitigate zero-width characters and invisible modifiers.
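
A rough sketch of the first idea, assuming the script of a letter can be guessed from the first word of its Unicode character name (an approximation, not a proper script lookup). The helper names are hypothetical, and invisible characters are caught via the `Cf` (format) general category:

```python
import unicodedata

KNOWN_SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def script_of(ch: str) -> str:
    """Guess a letter's script from its Unicode name; 'OTHER' if unknown."""
    if not ch.isalpha():
        return "OTHER"
    name = unicodedata.name(ch, "")
    for script in KNOWN_SCRIPTS:
        if name.startswith(script):
            return script
    return "OTHER"


def is_mixed_script(word: str) -> bool:
    """True if a single word mixes letters from more than one known script."""
    scripts = {script_of(ch) for ch in word} - {"OTHER"}
    return len(scripts) > 1


def has_invisible(word: str) -> bool:
    """True if the word contains format characters such as zero-width space."""
    return any(unicodedata.category(ch) == "Cf" for ch in word)


def suspicious_words(text: str) -> list:
    """Return the whitespace-separated words that look obfuscated."""
    return [w for w in text.split() if is_mixed_script(w) or has_invisible(w)]


# Cyrillic e inside "Breaking", a zero-width space inside "news",
# and one genuinely Cyrillic word that should not be flagged.
title = "Br\u0435aking n\u200bews from \u041c\u043e\u0441\u043a\u0432\u0430"
print(suspicious_words(title))
```

Checking per word rather than per document keeps legitimately multilingual text from being flagged just for containing two scripts.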

------
smsm42
This is an old trick, successfully used for a while in domain names (does
gооglе.com look suspicious to you? what if it had a valid SSL certificate?)
but hopefully all browsers and registrars have smarted up by now.

Another version of this trick has been popular in Russia with corrupt
government workers: by law, a lot of government purchase/service contracts
should be subject to public calls for bids, usually placed on a website which
you can search. However, if you write what you need replacing some of the
Cyrillic characters with Latin ones, an honest supplier that is looking for a
government contract will never find your entry. However, a corrupt one that
you have
arranged with beforehand would, and will be the sole bidder on this contract,
with a price that you have arranged before (which of course includes a juicy
cut for the corrupt government official) and nobody is the wiser, all
requirements of the law are fulfilled, who could be blamed that there's only
one bidder?

------
filleokus
In Sweden (and probably other places), a service called URKUND[0] ("deed" in
Swedish) is used for automatic detection of plagiarism for school work.

I have always wondered to what extent they identify stuff like this, and other
potential trickery with UTF-8 or removing text layers from PDF files.

0: [http://www.urkund.com/en/](http://www.urkund.com/en/)

~~~
netsharc
A neat trick would be to render the PDF into an image file, and then OCR that
image file, and do the detection using the text file generated by the OCR
process.

Just like VW's defeat device, the cheater would then need to create software
that outputs something different when it sees it is being rendered not to a
display, but to an image file...

~~~
pbhjpbhj
So, to avoid detection, one needs to use a font that OCR finds hard to read.
Would rubbish keming do the trick? You could abide by imposed font
requirements but change the kerning/leading.

~~~
stordoff
Presumably if you have the original file, you could reset the formatting to
known-readable state before the OCR step.

Edit: You could also just use the text, but red-flag submissions where X% of
the words used fall outside of the submitted language.
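
A minimal sketch of that red-flag check, assuming "outside of the submitted language" is approximated by "contains a letter outside the expected script" (a dictionary lookup would be another reading). The function names and the 5% default threshold are made up for illustration:

```python
import unicodedata


def outside_script_ratio(text: str, expected: str = "LATIN") -> float:
    """Fraction of words containing a letter outside the expected script."""
    # Only consider tokens that contain at least one letter.
    words = [w for w in text.split() if any(ch.isalpha() for ch in w)]
    if not words:
        return 0.0

    def off_script(word: str) -> bool:
        return any(
            ch.isalpha() and not unicodedata.name(ch, "").startswith(expected)
            for ch in word
        )

    return sum(off_script(w) for w in words) / len(words)


def red_flag(text: str, threshold: float = 0.05) -> bool:
    """Flag a submission when too many words fall outside the script."""
    return outside_script_ratio(text) > threshold


# Two of the four words contain a Cyrillic lookalike (i and o).
sample = "The qu\u0456ck brown f\u043ex"
print(outside_script_ratio(sample))
print(red_flag(sample))
```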

------
goptimize
Maarten Schenk (resident expert on fake news) creates click-bait titles

~~~
dmix
It's not even fake news and even plagiarism is a weak description. It's
blogspam / content farming for cheap advertising dollars by Eastern Europeans.
This has been going on for ages, much like SEO gaming. It's just using another
hacky trick, and using Cyrillic characters is hardly a new trick either.

------
BanzaiTokyo
Substitution of characters o/a/e (that are similar in Latin and Cyrillic
alphabets) has been used for years to pass automatic plagiarism detectors.

------
mfoy_
>The site is part of a growing list of fake Native American pages run out of
places like Macedonia, Kosovo or Vietnam.

So the headline is a little misleading... It's just that there are a growing
number of websites that simply plagiarize content to get views / ad revenue.
Because their titles are obfuscated to prevent detection of the plagiarism,
they have to target specific niche groups to drive views. So it's not some
weird "fake Native American" scheme / scam / ploy... it's just that _this_
site in particular seems to focus on "Native American topics".

So it's not "Fake Native Americans Are Using Russian Characters to Avoid
Plagiarism Detectors", it's "Fake News Sites Plagiarize Articles by Using
Cyrillic Character Replacement to Avoid Detection", subtitle: "One such site
targets Native Americans!"

~~~
dmix
This isn't even fake news. It's just blogspam for advertising dollars. Nothing
fake about it... the articles they copy are real content.

This is just jumping on trendy words like Russians/bots/fake news for click
bait.

------
kozak
Let me nitpick a bit: these characters are not Russian, they are Cyrillic.
There are some Cyrillic characters that are distinctly Russian (i.e. used only
in the Russian language), but these characters can't impersonate Latin letters
because they are too different from them.

[https://en.wikipedia.org/wiki/Cyrillic_script](https://en.wikipedia.org/wiki/Cyrillic_script)

~~~
moufestaphio
Yeah that bothered me as well.

Would you write: "Site uses American characters to publish fake news." ?

~~~
dwighttk
you gotta loop the Russians in to drive page views

