Combating blog article theft by delaying RSS feeds (ctrl.blog)
57 points by d2wa 11 months ago | 38 comments

So, people feel some sympathy for the author of this submission because they can see work is being taken without being paid for.

> Defending copyrights as an individual isn’t an easy matter.

But the same people seem to fucking hate Article 11, which would forbid these sites from copying the entire article and would give the author of the material real options to take action.

I don't think Article 11 or 13 are particularly good, but this submission does a good job of showing why some people think they're needed.

Existing copyright law already forbids you from copying and redistributing the creative works of others. More laws won’t stop anyone from violating the law. The problem with content scrapers is that it’s so easy to hide your identity, so easy to scrape content, and so damned easy to be approved into advertising programs.

Existing copyright law is, as the author says, hard to use. Making those laws easier to use for blatant commercial violation is a good thing.

Fines should follow the money if they can't resolve on the party directly responsible. Advertisers can buy insurance and insurance companies can vet domains.

How do they help? What the author is describing was illegal before and will be illegal after.

It means you can sue Google for damages, so that Google's responsibility becomes to prevent copyright thefts. That way, you don't have to run around trying to chase down a phantom no-name dude on the internet.

And if you can keep the copyright-theft'd works off of Google, Bing, and other big name search engines... then you've basically won in this day and age.


Yeah, it's an extreme measure. But it's pretty clear-cut how Article 13 helps in this case. Link Tax (Article 11) may also play a role, but it's less obvious IMO.

> It means you can sue Google for damages, so that Google's responsibility becomes to prevent copyright thefts.

Nope. You cannot. Google is not a content-sharing platform, which is what Article 13 is about.

You can do the same as before: Send a takedown request to Google. Again, literally nothing changes here.

This isn't something that I've seriously looked into myself, but I thought that in the past, Google had a way to tell them about your new articles, to avoid this problem. Does that no longer exist/work?

You can “ping” them about new articles (which I still do), but if other websites ping them within a few seconds/minutes then you basically have a race condition to see who is indexed first. Notably, this is a bigger problem with Bing and Yandex, who don’t have the capacity to index the web as quickly as Google does.

For WordPress blogs, pings are sent immediately to ping servers. Google could reasonably crawl a new article within one second of the author clicking the publish button. So if you have automated pinging built into your blogging workflow, then there should be no problem. At least until high-frequency traders get involved in this market :)
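
For reference, the ping in question is typically an XML-RPC call named weblogUpdates.ping that carries the blog's name and URL. Here is a minimal sketch that only builds the request body with Python's standard library and does not send it. The blog name and URL are placeholders, and which ping servers a given blog notifies is left out because it depends on the setup.

```python
import xmlrpc.client


def build_ping_payload(site_name: str, site_url: str) -> str:
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    # weblogUpdates.ping takes two string parameters: site name and site URL.
    return xmlrpc.client.dumps(
        (site_name, site_url), methodname="weblogUpdates.ping"
    )


# Placeholder values, not taken from the thread.
payload = build_ping_payload("Example Blog", "https://blog.example/")
```

The payload would then be POSTed to whichever ping endpoint the blog is configured to use, so the race described above starts the moment that request goes out.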

Is this just not a problem in practice, or is there some technical barrier I'm not thinking of here? I'm imagining a system where you say "hey Google, I'm about to post content X" that you do right before it goes live on your site. No race condition.

Only Yandex has something where you can inform them of upcoming articles/links before you publish. (As mentioned in the article.) https://yandex.com/support/webmaster/authored-texts/owners.h...

If you ping Bing or Google before you publish, they’ll get a 404 and will take that as a sign that there is no content there. They also will wait longer before trying to reindex a page that previously returned a 404.

Make a visible but unlisted URL, ping search engines, wait 5 minutes, list / link the URL from the home page of your site, publish the URL to RSS. Solves the 404 problem, doesn't it?

I experimented with this too. However, I found that just delaying the RSS feed produced the desired results.
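
Delaying the feed can be as simple as filtering out posts that are younger than some cutoff when the feed is generated. A rough sketch of that idea, not the author's actual implementation: the `Post` type, the `feed_entries` helper, and the six-hour delay are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Post:
    title: str
    url: str
    published: datetime  # timezone-aware publication time


# Hypothetical delay; the right value depends on how fast search engines index the site.
FEED_DELAY = timedelta(hours=6)


def feed_entries(posts, now, delay=FEED_DELAY):
    """Return posts old enough to appear in the feed, newest first."""
    cutoff = now - delay
    visible = [p for p in posts if p.published <= cutoff]
    return sorted(visible, key=lambda p: p.published, reverse=True)


now = datetime(2019, 1, 2, 12, 0, tzinfo=timezone.utc)
posts = [
    Post("Old post", "https://blog.example/old", now - timedelta(days=2)),
    Post("Fresh post", "https://blog.example/fresh", now - timedelta(minutes=30)),
]
titles = [p.title for p in feed_entries(posts, now)]
```

The post is live on the site the whole time; only the feed lags, which gives search engines a window to index the original before feed-driven scrapers see it.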

It seems to me that Yandex is doing the right thing.

Why does Google not do the same? It seems to me that it's their responsibility as a search engine to give authors tools to identify their work.

You can get them to crawl a new page in webmaster tools (or whatever name it has this week) and they will index it within a few seconds. That only works if it's an established site, though; it doesn't work so well on new sites.

Of course, this doesn't prove your page has the original content though.

I'm not sure even delaying things works, unfortunately. At least if someone has strong ranking already, they can hijack your content[0]

[0] at least according to https://dejanseo.com.au/hijack/ (which according to the author is still a problem today, see https://news.ycombinator.com/item?id=17827589)

Maybe a more subversive method would be to publish the page with gibberish initially, then republish the real content a day later?

I’ve actually done something like this in the past. I was able to identify the IP addresses of a few content scraping websites, and then made my RSS feed return a huge number of gibberish pages when requested from these IPs. I’m not sure whether they eventually ran out of database space or what happened, but these sites did go down after a few days and stayed down. https://www.ctrl.blog/entry/defacing-content-scrapers

(I didn’t attack these servers, by the by. They came to my servers and gobbled up all the auto-generated junk I served them all on their own.)

However, in recent years it has become more and more difficult to identify the right IP address, as people are hosting everything behind Cloudflare or doing the actual scraping from a short-term leased server with a unique IP.
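
The IP-based trick described above could look roughly like the sketch below. This is an illustration under assumptions, not the code from the linked post: the network list uses a reserved documentation range as a placeholder, and `is_scraper`, `gibberish_entry`, and `feed_body` are hypothetical helper names.

```python
import random
from ipaddress import ip_address, ip_network

# Placeholder: a reserved documentation range standing in for known scraper IPs.
SCRAPER_NETWORKS = [ip_network("203.0.113.0/24")]

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing"]


def is_scraper(remote_addr: str) -> bool:
    """True if the requesting address falls inside a known scraper network."""
    addr = ip_address(remote_addr)
    return any(addr in net for net in SCRAPER_NETWORKS)


def gibberish_entry(rng: random.Random, n_words: int = 40) -> str:
    """Build one junk feed entry out of randomly chosen words."""
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def feed_body(remote_addr, real_entries, rng, junk_count=1000):
    """Serve junk entries to known scrapers and the real feed to everyone else."""
    if is_scraper(remote_addr):
        return [gibberish_entry(rng) for _ in range(junk_count)]
    return list(real_entries)


junk = feed_body("203.0.113.7", ["real entry"], random.Random(0), junk_count=5)
normal = feed_body("198.51.100.1", ["real entry"], random.Random(0))
```

As noted, this only works while the scraper's address is stable and identifiable; behind Cloudflare or on short-lived servers the allow/deny decision has nothing reliable to key on.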

Actually, shortly after I learned that Google switched to neural networks for search results, I noticed an old style of spam making a huge comeback: either Markov chain or neural network generated text, taken from related websites (I suspect they look at the top 10 websites for certain valuable terms).

This gibberish actually outranks legit content which refers to my content, sometimes even my own articles, especially when it is turned into a PDF.

Seems like it is easy to block ~250k webpages like:


but I guess Google keeps them there to keep the spammers in the dark? I hope so, else their new ranking signals allow for easier spam.

> Seems like it is easy to block ~250k webpages like

I think it's unfair to look at these results and say "but it's so easy to block these". Google's time is best spent on solutions which will reliably and automatically block these, without going through fairly manual steps.

I did notice a decrease in quality in gmail's spam filter though. Increase in false positives and false negatives lately. I guess it's unlikely to be related...

To me it seems all the work of the same spammer(s). In such a case, do some manual intelligence and wrap it up. It won't scale to all forms of spam, but if a simple regex can uncover 250k+ results in 10 minutes, a manual spam fighter can still block millions of pages a day (and warn the webhost, remove these flakey ads from their networks, etc.).

No doubt the recent machine learning hype has given spammers more advanced tools to avoid detection.

False positives are far more problematic than false negatives...

If you remove from index... sure. But for that URL that I posted, do you think there is even a single false positive in there?

That would probably work, but it would affect legitimate users. They would see the gibberish, too.

You're right, no plan is perfect. I guess it might work if most readers are not going to the page as soon as it's first published though?

The problem is a matter of audience and missing communications IMO.

On the old Usenet we communicated DIRECTLY with each other, so everyone had a reputation; we knew the "original sources" and rewarded them accordingly. Today's world is disconnected: on one side only content producers, on the other side only consumers, and in the middle, at most, uncomfortable platforms that are limited and that limit user-to-user communication. That makes it harder.

IMVHO the cure is going back to the communication era we lost; no other system has proved to be effective. Take a look at audio/video piracy as a good living example.

No, copying and distributing articles, even when illegal, is not theft.


First, DNT makes my browser more unique, which means I will be easier to track.

Second, I have no interest in seeing ads or having my data sent to third parties.

Never heard of this blog before. I really like the looks of the content. My only method of subscribing to anything is RSS. I'm definitely not adding it to my list after reading this. I wasn't going to make them any money by unblocking trackers but that Amazon tip jar[1] thing looks really interesting, I've never seen it before, I likely would have used it.


So a blog with content you like is verboten because your RSS feed wouldn't update instantly? That seems like an odd hill to die on.

It's a horrible band-aid that hurts the functionality and reputation of RSS and shouldn't be encouraged by supporting the site setting a bad example. If they want to play the pointless cat-and-mouse game with the "thieves", fine, but crippling RSS in the process is not cool. They will just scrape the site instead and the RSS feed will be worse off for no gain.

Crippling? I'm as information addicted as the next guy, but unless I'm missing something fundamental about RSS, the only effect is that you as a user will notice new posts a few hours late. That seems more than reasonable to me.

If it turns out not to work, they can go back.

And if it turns out to work, they will encourage others to follow, hurting an already neglected great technology.

I didn't look around at the rest of this blog, but I don't see anything here that's particularly time sensitive. I would have been just as interested in delaying RSS today or tomorrow.

That's not the point; it's making RSS a second-class citizen. That's bad enough on its own. It's doubly bad in this case because they went from treating it as special. If it had just been made equal, I wouldn't object. This is clearly a problem best solved by the search engines. If the world fixes it with this technique, there will be no push to fix it the right way and RSS will be worse off than it already is.
