
The day I found Saowen.com had stolen my content - boyter
https://nickmchardy.com/2018/12/the-day-i-found-saowen.com-had-stolen-my-content.html
======
Tor3
The irony is that (I just checked) saowen.com has stolen your "The day I found
Saowen.com had stolen my content" content as well! It's right up there on
their front page.

~~~
ar-jan
I tried to document this on archive.org but got "This url is not available on
the live web or can not be archived."

~~~
H4CK3RM4N
Try archive.is

~~~
ar-jan
I did, but it gave garbled content.

------
paraditedc
It's on Saowen.com now as well:

[https://www.saowen.com/a/3b4deadad676021a51cbcee5d5fc1ec4269...](https://www.saowen.com/a/3b4deadad676021a51cbcee5d5fc1ec4269045ca1c13c0323eaa1c0da982295f)

I am vehemently against Saowen.com stealing content, but in practice this is
quite similar to how Outline works (to remove CSS or bypass a paywall). I wonder
why people (especially on HN) don't have an issue with those:

[https://outline.com/DnVGCw](https://outline.com/DnVGCw)

[https://www.outline.com/dmca.html](https://www.outline.com/dmca.html)

Edit: Replace "Not to defend the site" with "I am vehemently against
Saowen.com stealing content"

~~~
chmod775
Outline makes articles _more_ readable instead of making them worse by adding
a metric ton of ads and trackers.

Outline is better from a moral standpoint and closer to, say (but not quite),
the Internet Archive.

~~~
forgotmypw2
only if you have js enabled, otherwise it's a blank page

~~~
9935c101ab17a66
I think what outline does would be much harder to implement without JS. As it
stands, it seems to parse the URL, fetch the page with a specific UA /
some other magic, then automatically format it, without having to store
anything on disk and without having to generate static html. Wouldn't this be
much, much harder without using JS?

~~~
askmike
> without having to store anything on disk and without having to generate
> static html. Wouldn't this be much, much harder without using JS?

While it's possible to do without storing anything on disk (strange
requirement imo - but you can keep it in memory on the server if you really
want), it's a lot more practical to do this work in the backend. And that's
also what outline is doing:

outline's frontend will do an AJAX request to outline's servers to actually
fetch the article from a blog and serve that back[1]. So they could do this
easily without frontend javascript. But I think the UX would suffer on many
levels. Having a frontend in javascript allows outline to do better caching,
better user experience, etc. The only downside to using JS is that it doesn't
work for people still in 1995 and for people who disable their javascript.
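
A minimal sketch of that backend-only approach, in Python and not Outline's
actual code: the server parses the fetched page in memory, keeps only the
paragraph text, and returns plain HTML that needs no frontend JavaScript. The
names `ParagraphExtractor` and `render_plain_page` are made up for
illustration, and fetching the page itself is left out.

```python
import html
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> tags, ignoring <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._skip_depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace left over from the original markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and not self._skip_depth:
            self._buffer.append(data)


def render_plain_page(source_html: str) -> str:
    """Return a script-free HTML page built entirely in memory."""
    parser = ParagraphExtractor()
    parser.feed(source_html)
    parser.close()
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in parser.paragraphs)
    return f"<!doctype html>\n<article>\n{body}\n</article>"
```

Nothing here touches the disk, and the result is ordinary server-rendered
HTML, so it would also show up for people with JavaScript disabled.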

[1]: example fetch call:
[https://outlineapi.com/v3/get_article?id=6ps979](https://outlineapi.com/v3/get_article?id=6ps979)

edit: fixed typos

------
mvanga
I've had the same issue with saowen.com copying articles off my blog
([https://sighack.com](https://sighack.com)) onto their website and I'm glad
OP called them out on it publicly.

They even have an index page with a list of all my articles, updated as I
post:
[https://www.saowen.com/source/site/sighack_com](https://www.saowen.com/source/site/sighack_com)

Interestingly, they always seem to backdate the articles by a few days from my
actual date of posting! I also noticed they link to my website, but with a
nofollow attribute.

Completely nuts...

~~~
thecatspaw
Backdating is most likely for SEO, in the hopes that Google recognizes them as
the original, and your blog as the copy

~~~
mvanga
Yeah that's likely the reason. It's just funny to me that such a simple hack
can confuse Google into not being able to reliably pinpoint the original
source.

------
cm2187
It's kind of ironic (and sad) that dereferencing from google is as good as
removing the offending content from the internet.

------
edent
This happens to all of my blog posts - on dozens of "aggregator" sites. Some,
like Outline, are happy to offer an opt-out. But the rest are just SEO farms -
lifting content and passing it off as their own.

I occasionally report them. If they're hosted in the EU, I'm usually
successful in getting them taken down. US hosting companies just don't care.
Chinese & Russian companies don't even answer emails.

It's rather frustrating.

~~~
superflyguy
I'd be tempted to include blogs about/links to Tiananmen Square protests etc
then report the copies to the Chinese authorities. Possibly do something
similar for Russia about homosexuality or whatever.

~~~
dhimes
Exactly my thought. Some really nasty base64 encoded images and stuff, if I
could identify them by IP or some other way.

------
spydum
If they are mirroring your content, figure out a signature of their crawler
and start serving "special" content to them. Perhaps a bit of JS which does a
window.location check and redirects if the host is not your own (chances are
they might do some poor hostname search and replace, so you'd have to obfuscate).
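
A rough sketch of the "signature" part on the server side, in Python. The
networks below are documentation-only address ranges and the User-Agent
fragment is invented; the real crawler's addresses and headers are unknown
here, so every constant is a placeholder you would have to replace after
looking at your own access logs.

```python
import ipaddress

# Placeholder signature data (assumptions, not the real crawler's values).
SUSPECT_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
SUSPECT_AGENT_FRAGMENTS = ("examplescraper",)


def looks_like_scraper(client_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a suspect network or User-Agent."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        address = None  # unparseable address: fall back to the UA check only
    if address is not None and any(address in net for net in SUSPECT_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(fragment in agent for fragment in SUSPECT_AGENT_FRAGMENTS)


def choose_body(client_ip: str, user_agent: str, real_body: str, decoy_body: str) -> str:
    """Serve the decoy to suspected scrapers and the real page to everyone else."""
    return decoy_body if looks_like_scraper(client_ip, user_agent) else real_body
```

Both signals are easy for a scraper to change, so this only helps for as long
as they don't notice.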

~~~
dhimes
Can you id them by ip?

------
th0br0
Blurring out the ticket id isn't that useful when the TXT record still
exists...

------
maaaats
I wonder why Google is so bad at picking up that these pages are clickfarms,
and not instead return the original when searching.

In my native language (Norwegian), I have lately had issues with searching for
stuff, opening the page linked by Google, and quickly realising it's not even
legible Norwegian, just auto-generated content (Google Translate of some
product review I guess?). Absolutely useless content, no idea why it ranks so
high. How can they not manage to filter out stuff like this?

~~~
peteretep
> I wonder why Google is so bad at picking up that these pages are clickfarms,
> and not instead return the original when searching.

They have a monopoly on search, why do they care?

~~~
dublo7
Good point. So if Bing started getting better at detecting duplicate articles,
keeping track of duplicates per domain and ranking them lower, that would be
a big improvement over Google. Doesn't sound hard to do if you've got search
engine size resources.
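
For what it's worth, a rough sketch of the duplicate-tracking idea, assuming
word-shingle overlap (Jaccard similarity) as the duplicate test. That
technique, the 0.8 threshold and all function names are assumptions for
illustration, not how any real search engine is known to do it.

```python
from collections import defaultdict


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def count_duplicates_per_domain(originals, candidates, threshold=0.8):
    """Count, per domain, the pages that closely match any original text.

    originals: list of article texts known to be the source.
    candidates: list of (domain, text) pairs found elsewhere.
    """
    original_sets = [shingles(text) for text in originals]
    counts = defaultdict(int)
    for domain, text in candidates:
        candidate_set = shingles(text)
        if any(jaccard(candidate_set, o) >= threshold for o in original_sets):
            counts[domain] += 1
    return dict(counts)
```

A domain with a high count could then be ranked lower. Deciding which copy is
the original is the harder part and is not covered here.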

------
potatofarmer45
Saowen.com is clearly a mainland Chinese website. Assuming the copying is
automatic, all you have to do is post an entry on the evils of communism with
pictures of Winnie the Pooh interspersed within it. Add in a few photos of the
Dalai Lama and a statement like "Xi Jinping is the biggest moron in history".

Then you report it to the Chinese authorities and the penalties for Saowen
will be much much higher than a Google search takedown!

------
djaychela
I think that's a useful walk-through for anyone coming across this that
wouldn't know what to do to try to get the listing removed; with the
appropriate 3 pieces of evidence it looks as if it worked well.

I wonder if it's possible to automate the process - i.e. to alert the content
creators that saowen.com is stealing from and help them complete the process?

~~~
Browun
I assume each page has the markup tag
> <strong class="fn" itemprop="author">nickmchardy.com</strong>

You can then either attempt an email to info@{}, filling in the domain from
the author tag, or scrape that site for email addresses on there.
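
A small Python sketch of the first option, assuming the copied pages really do
carry the author tag quoted above with a bare domain as its text. The class and
function names are invented, and whether an info@ mailbox exists on any given
domain is a guess.

```python
from html.parser import HTMLParser


class AuthorTagParser(HTMLParser):
    """Collect the text of elements marked itemprop="author"."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._capturing = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("itemprop") == "author":
            self._capturing = True
            self._buffer = []

    def handle_endtag(self, tag):
        # Assumes the author element holds plain text with no nested tags,
        # as in the quoted markup, so the first end tag closes it.
        if self._capturing:
            text = "".join(self._buffer).strip()
            if text:
                self.authors.append(text)
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)


def guess_contact_addresses(page_html: str) -> list:
    """Build a hypothetical info@<domain> address for each author tag found."""
    parser = AuthorTagParser()
    parser.feed(page_html)
    parser.close()
    return [f"info@{domain}" for domain in parser.authors]
```

For the markup quoted above this yields a single candidate address on
nickmchardy.com, which would still need checking before anything is sent.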

Assuming that most of these are blogs, such as the case here, hopefully there
wouldn't be too many addresses on each domain. So hopefully relatively easy to
do ... ?

Would be interested in pursuing this though

------
dazc
Good idea to disable the default feed option in WordPress to avoid this kind
of stuff. It won't stop a determined plagiarist but should eliminate a lot of
the automated stuff?

Looks like the author has done that now.

------
new_here
Is there not some collective action that can be taken to have Google penalise
Saowen.com and other sites that engage in this kind of plagiarism?

------
maaaats
> _This article hot links images hosted at nickmchardy.com_

Yeahhh, I would quickly have changed the contents of those images..

~~~
deytempo
Not just that, you could redirect requests from their URL to images of
whatever you want wherever you want via htaccess
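
The comment is about htaccess rules; the same Referer check can be sketched in
Python to show the logic. The placeholder path and the www variant of the host
are assumptions, and the function name is made up.

```python
from urllib.parse import urlsplit

# The blog's own hosts; the www variant is assumed.
OWN_HOSTS = {"nickmchardy.com", "www.nickmchardy.com"}
# Hypothetical path of the image served to hotlinkers.
PLACEHOLDER_PATH = "/images/hotlink-placeholder.png"


def image_path_for_request(requested_path: str, referer) -> str:
    """Pick which image to serve based on the Referer header."""
    if not referer:
        # Direct visits and clients that strip the header get the real image.
        return requested_path
    host = (urlsplit(referer).hostname or "").lower()
    if host in OWN_HOSTS:
        return requested_path
    return PLACEHOLDER_PATH
```

Requests with no Referer are let through on purpose, otherwise readers whose
browsers hide the header would see the placeholder too.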

~~~
jsjohnst
Just be careful doing that when it’s a tabloid trash blogger who is linking to
your photos. Some fights are best left un-fought.

------
hartator
It’s SEO 101, but avoid mentioning the thief's domain name and linking to it.

------
cauldron
These Chinese aggregators are all the same, just like Toutiao from Bytedance
(that TikTok company), not long ago the world's most valuable startup.

Together with Tencent's WeChat and other me-too sites they created the
"self-media" cottage industry, which basically lends credibility to laymen and
lets them publish all sorts of lowbrow and sensationalistic content to earn ad
money.

If you check [https://www.bilibili.com/](https://www.bilibili.com/), a very
popular Chinese video site (I'd say also a stomping ground for anime
pedophiles which the site was built upon. US listed btw.), you can find
pirated US TV shows and basically every popular YouTube video, reprocessed,
edited and watermarked as their "original" content, raking in money for their
uploaders. Toutiao again is just the same model.

Part of the reason why Bytedance is so valuable is that Chinese just love these
things. They are their de facto "news" source: the vast majority of people
didn't receive higher education and don't want to read serious articles (which
Chinese state-controlled media also lack).

Nobody cares if it's original, they just want entertaining, explosive and
easy-to-consume content.

~~~
ar-jan
Time to add a map of Taiwan as independent state, Senkaku Islands as Japanese,
etc., as an aside to any and all articles published? ;)

~~~
cauldron
Copying and plagiarism are so prevalent and pervasive that these "self-media"
platforms have to introduce an "original" tag.

They will copy your content and edit it as a matter of course anyway, so add
whatever you like and it won't deter them.

Change some words and the paragraph order, add funny pics and snippets, alter
key parts to make it much more explosive and eyecatching, voila. They even
offer part-time on-line jobs for this reprocessing.

------
aaaaaaaaaab
Chinese stealing IP. Color me surprised!

~~~
andybak
Just for a bit of historical perspective:

[https://www.ipwatchdog.com/2017/07/05/americas-industrial-re...](https://www.ipwatchdog.com/2017/07/05/americas-industrial-revolution-based-trade-secret-theft/id=85377/)

[https://www.pri.org/stories/2014-02-18/us-complains-other-na...](https://www.pri.org/stories/2014-02-18/us-complains-other-nations-are-stealing-us-technology-america-has-history)

[https://www.realclearmarkets.com/articles/2018/07/30/ip_thef...](https://www.realclearmarkets.com/articles/2018/07/30/ip_theft_is_what_once_helped_make_america_great_103367.html)

~~~
leereeves
Are you suggesting that because people 200 years ago did something, we have to
tolerate the same thing now? That's just silly.

~~~
andybak
Of course I'm not. That's why I used the precise phrase I used. Key words:
"historical" and "perspective".

~~~
leereeves
Then how is this "historical perspective" relevant to the current discussion?

