The day I found Saowen.com had stolen my content (nickmchardy.com)
69 points by boyter 4 months ago | 56 comments



The irony is that (I just checked) saowen.com has stolen your "The day I found Saowen.com had stolen my content" content as well! It's right up there at their front page.


That's absolutely hilarious!


And making money out of all the page views that hacker news is bringing them!


I hope this doesn't lead to an infinite loop ;)


I tried to document this on archive.org but got "This url is not available on the live web or can not be archived."


Try archive.is


I did, but it gave garbled content.


It's on Saowen.com now as well:

https://www.saowen.com/a/3b4deadad676021a51cbcee5d5fc1ec4269...

I am vehemently against Saowen.com stealing content, but in practice this is quite similar to how outline works (removing CSS or bypassing paywalls). I wonder why people (especially on HN) don't take issue with those:

https://outline.com/DnVGCw

https://www.outline.com/dmca.html

Edit: Replace "Not to defend the site" with "I am vehemently against Saowen.com stealing content"


Outline makes articles more readable instead of making them worse by adding a metric ton of ads and trackers.

Outline is better from a moral standpoint and closer to, say, the Internet Archive (but not quite).


My bad. I totally missed the ads and trackers part (because I use uBlock Origin).


only if you have js enabled, otherwise it's a blank page


I think what outline does would be much harder to implement without JS. As it stands, it seems to parse the URL, fetch the page with a specific UA / some other magic, then automatically format it, without having to store anything on disk and without having to generate static html. Wouldn't this be much, much harder without using JS?


> without having to store anything on disk and without having to generate static html. Wouldn't this be much, much harder without using JS?

While it's possible to do without storing anything on disk (strange requirement imo - but you can keep it in memory on the server if you really want), it's a lot more practical to do this work in the backend. And that's also what outline is doing:

outline's frontend will do an AJAX request to outline's servers to actually fetch the article from a blog and serve that back[1]. So they could do this easily without frontend javascript. But I think the UX would suffer on many levels. Having a frontend in javascript allows outline to do better caching, better user experience, etc. The only downside to using JS is that it doesn't work for people still in 1995 and for people who disable their javascript.

[1]: example fetch call: https://outlineapi.com/v3/get_article?id=6ps979

edit: fixed typos
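
A minimal sketch of the backend approach described above, assuming the page HTML has already been fetched into a string. The `TextExtractor` class and `extract_readable_text` helper are hypothetical names, and this is not a claim about how outline actually works; it only illustrates that reducing a page to readable text can happen in memory on the server, with no disk storage and no frontend JS.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style blocks.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_readable_text(html: str) -> str:
    """Return the page's text content, one chunk per line, held in memory only."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A real service would also need to fetch the page, pick out the article body, and keep images and links; this sketch leaves all of that out.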


>but this is practically quite similar to how outline works

Not at all. Outline doesn't essentially present the content as their own.


Neither does Saowen.com. Above the title you can see:

> 2018-12-20 nickmchardy.com

The difference is that the link in Saowen.com directs to https://www.saowen.com/source/site/nickmchardy_com, whereas the one on outline goes to the source.


I fundamentally disagree that this is effectively attributing the content.


The more they are talked about the higher google and other search engines will rank them


I get the idea of back links. But I doubt our small discussion on HN would have much impact given it already copied thousands of posts at this point.


I've had the same issue with saowen.com copying articles off my blog (https://sighack.com) onto their website and I'm glad OP called them out on it publicly.

They even have an index page with a list of all my articles, updated as I post: https://www.saowen.com/source/site/sighack_com

Interestingly, they always seem to backdate the articles by a few days from my actual date of posting! I also noticed they link to my website, but with a nofollow attribute.

Completely nuts...


Backdating is most likely for SEO, in the hopes that Google recognizes them as the original, and your blog as the copy.


Yeah that's likely the reason. It's just funny to me that such a simple hack can confuse Google into not being able to reliably pinpoint the original source.


Finally a problem that blockchain technology can solve?


It's kind of ironic (and sad) that dereferencing from google is as good as removing the offending content from the internet.


This happens to all of my blog posts - on dozens of "aggregator" sites. Some, like Outline, are happy to offer an opt-out. But the rest are just SEO farms - lifting content and passing it off as their own.

I occasionally report them. If they're hosted in the EU, I'm usually successful in getting them taken down. US hosting companies just don't care. Chinese & Russian companies don't even answer emails.

It's rather frustrating.


I'd be tempted to include blogs about/links to Tiananmen Square protests etc then report the copies to the Chinese authorities. Possibly do something similar for Russia about homosexuality or whatever.


Exactly my thought. Some really nasty base64 encoded images and stuff, if I could identify them by IP or some other way.


If they are mirroring your content, figure out a signature of their crawler, and start serving "special" content to them.. Perhaps a bit of JS which does a window.location check and redirects if it's not your own (chances are they might do some poor hostname search and replace, so you'd have to obfuscate).
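
A rough sketch of the "crawler signature" idea, assuming the signature is a source IP range and/or a User-Agent fragment. The network and the agent string below are placeholders (a documentation-only IP range and a made-up name), not anything known about Saowen's crawler; real values would have to come from your own access logs.

```python
import ipaddress

# Hypothetical signature values, for illustration only.
SUSPECT_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # reserved documentation range
SUSPECT_AGENT_FRAGMENTS = ["examplescraper"]  # made-up User-Agent fragment


def looks_like_scraper(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a known IP range or User-Agent fragment."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in SUSPECT_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(fragment in agent for fragment in SUSPECT_AGENT_FRAGMENTS)


def choose_body(remote_ip: str, user_agent: str, real_body: str, decoy_body: str) -> str:
    """Serve the decoy to suspected scrapers and the real page to everyone else."""
    if looks_like_scraper(remote_ip, user_agent):
        return decoy_body
    return real_body
```

Both signals are easy for a scraper to change, so this is at best a speed bump.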


Can you id them by ip?


Blurring out the ticket id isn't that useful when the TXT record still exists...


I wonder why Google is so bad at picking up that these pages are clickfarms, and not instead return the original when searching.

In my native language (Norwegian), I have lately had issues with searching for stuff, opening the page linked by Google, and quickly realising it's not even legible Norwegian, just auto-generated content (Google Translate of some product review, I guess?). Absolutely useless content, no idea why it ranks so high. How can they not manage to filter out stuff like this?


> I wonder why Google is so bad at picking up that these pages are clickfarms, and not instead return the original when searching.

They have a monopoly on search, why do they care?


Good point. So if Bing got better at detecting duplicate articles, kept track of duplicates per domain, and ranked them lower, that would be a big improvement over Google. Doesn't sound hard to do if you've got search-engine-size resources.
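
For a sense of what "detecting duplicate articles" could look like at toy scale, here is a sketch using word shingles and Jaccard similarity. This is a textbook near-duplicate measure chosen for illustration, not a description of what Google or Bing actually do; the function names and the shingle size `k=3` are assumptions.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word sequences (shingles) in the text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A copied article with a few words changed still scores close to 1.0 against the original, while unrelated articles score near 0.0. Comparing every pair of pages this way does not scale to a web index; that is the part that needs the search-engine-size resources.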


Saowen.com is clearly a mainland Chinese website. Assuming the copying is automatic, all you have to do is post an entry on the evils of communism with pictures of Winnie the Pooh interspersed within it. Add in a few photos of the Dalai Lama and a statement like "Xi Jinping is the biggest moron in history".

Then you report it to the Chinese authorities and the penalties for Saowen will be much much higher than a Google search takedown!


I think that's a useful walk-through for anyone coming across this that wouldn't know what to do to try to get the listing removed; with the appropriate 3 pieces of evidence it looks as if it worked well.

I wonder if it's possible to automate the process - i.e. to alert the content creators that saowen.com is stealing from and help them complete the process?


I assume each page has the markup tag:

> <strong class="fn" itemprop="author">nickmchardy.com</strong>

You can then either try emailing info@ at the domain in the author tag, or scrape that site for email addresses.

Assuming that most of these are blogs, as is the case here, hopefully there wouldn't be too many addresses on each domain. So hopefully relatively easy to do ... ?

Would be interested in pursuing this though
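
A sketch of the first half of that idea, assuming the copied pages really do carry the author domain in an element with `itemprop="author"` as quoted above. The `info@` mailbox is only a guess that may not exist on a given domain, and `AuthorExtractor` and `guess_contact_addresses` are hypothetical names.

```python
from html.parser import HTMLParser


class AuthorExtractor(HTMLParser):
    """Collect the text of elements marked itemprop="author"."""

    def __init__(self):
        super().__init__()
        self._in_author = False
        self._author_tag = None
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if not self._in_author and dict(attrs).get("itemprop") == "author":
            self._in_author = True
            self._author_tag = tag
            self.authors.append("")

    def handle_data(self, data):
        if self._in_author:
            self.authors[-1] += data

    def handle_endtag(self, tag):
        if self._in_author and tag == self._author_tag:
            self._in_author = False


def guess_contact_addresses(html: str) -> list:
    """Return sorted, de-duplicated info@<domain> guesses for each author domain found."""
    parser = AuthorExtractor()
    parser.feed(html)
    parser.close()
    # Only keep author strings that look like a domain name.
    domains = {a.strip().lower() for a in parser.authors if "." in a.strip()}
    return sorted(f"info@{domain}" for domain in domains)
```

Scraping the original site for published addresses, the second option mentioned, is not covered here.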


Good idea to disable the default feed option in wordpress to avoid this kind of stuff. It won't stop a determined plagiarist but should eliminate a lot of the automated stuff?

Looks like the author has done that now.


Is there not some collective action that can be taken to have Google penalise Saowen.com and other sites that engage in this kind of plagiarism?


> This article hot links images hosted at nickmchardy.com

Yeahhh, I would quickly have changed the contents of those images..


Not just that, you could redirect requests from their URL to images of whatever you want wherever you want via htaccess
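
The same check an htaccess rule would make can be sketched in Python: look at the `Referer` header and serve a different image when the request comes from a page on someone else's site. The allowed host list below is an assumption for illustration, and the function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hosts allowed to embed the images (assumed values for this example).
ALLOWED_HOSTS = {"nickmchardy.com", "www.nickmchardy.com"}


def is_hotlink(referer: Optional[str]) -> bool:
    """Return True if the Referer names a host other than our own.

    A missing or unparseable Referer is not treated as a hotlink, since
    direct visits and privacy tools often send none.
    """
    if not referer:
        return False
    host = urlsplit(referer).hostname
    return host is not None and host not in ALLOWED_HOSTS


def image_to_serve(referer: Optional[str], requested: str, placeholder: str) -> str:
    """Pick the placeholder image for hotlinked requests, else the requested one."""
    return placeholder if is_hotlink(referer) else requested
```

The `Referer` header is optional and trivially spoofed, so this only catches sites that embed the images directly in their pages.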


Just be careful doing that when it’s a tabloid trash blogger who is linking to your photos. Some fights are best left un-fought.


It’s SEO 101, but avoid mentioning the thief's domain name and linking to it.


These Chinese aggregators are all the same, just like Toutiao from Bytedance (that TikTok company), not long ago the world's most valuable startup.

Together with Tencent's WeChat and other me-too sites they created the "self-media" cottage industry, which basically lends credibility to laymen and lets them publish all sorts of lowbrow and sensationalistic content to earn ad money.

If you check https://www.bilibili.com/, a very popular Chinese video site (I'd say also a stomping ground for anime pedophiles which the site was built upon. US listed btw.), you can find pirated US TV shows and basically every popular YouTube video, reprocessed, edited and watermarked as their "original" content, raking in money for their uploaders. Toutiao again is just the same model.

Part of the reason why Bytedance is so valuable is that Chinese just love these things, their de facto "news" source. The vast majority of people didn't receive higher education and don't want to read serious articles (which Chinese state-controlled media also lack).

Nobody cares if it's original, they just want entertaining, explosive and easy-to-consume content.


Time to add a map of Taiwan as independent state, Senkaku Islands as Japanese, etc., as an aside to any and all articles published? ;)


Copying and plagiarism are so prevalent and pervasive that these "self-media" platforms have to introduce an "original" tag.

They will copy your content and edit it as the due process anyway, so add whatever you like and it won't deter them.

Change some words and the paragraph order, add funny pics and snippets, alter key parts to make it much more explosive and eyecatching, voila. They even offer part-time on-line jobs for this reprocessing.


> the vast majority of people didn't receive higher education and don't want to read serious articles

> Nobody cares if it's original, they just want entertaining, explosive and easy-to-consume content.

This is hardly unique to the Chinese.


Bilibili certainly has copyright issues, but it's the same issue as with YouTube and millions of other websites with user-uploaded pirated content. I wouldn't say Bilibili is much worse than any other website.


Is this whataboutism?

Funny, I've been using YouTube for years, and the only pirated content I've encountered was a few TV documentaries and some unwatchable movies, and they seem to be deleted often.

Never saw Youtube promote and recommend any pirated content to me, unlike Bilibili.

Bilibili's front page is almost normal and all good, their shady stuff seems buried deep, my ipad is full of pirated Youtube videos with hundreds of thousands of views.


> Is this whataboutism?

1. The topic on this thread is copyright, not bilibili or YouTube per se.

2. You brought up YouTube yourself.

I agree with the rest of your statements. YouTube is better than other websites in protecting copyright.


Chinese stealing IP. Color me surprised!


Please don't post unsubstantive comments here, and especially not nationalistic flamebait.



Are you suggesting that because people 200 years ago did something, we have to tolerate the same thing now? That's just silly.


Of course I'm not. That's why I used the precise phrase I used. Key words: "historical" and "perspective".


Then how is this "historical perspective" relevant to the current discussion?


I mean, interesting, but after reading the first link I found this pretty good rebuttal:

So it is odd to hear today that America stole secrets from Arkwright when Arkwright himself was claiming that he had disclosed the machines in his patent specifications to a degree sufficient to make and use them.

Comment #5: Edward Heller July 6, 2017 9:43 am


Aren't patents all about disclosure? They were meant to prevent secrecy in return for a limited monopoly.



