
Combating blog article theft by delaying RSS feeds - d2wa
https://www.ctrl.blog/entry/delaying-feed-updates
======
DanBC
So, people feel some sympathy for the author of this submission because they
can see work is being taken without being paid for.

> Defending copyrights as an individual isn’t an easy matter.

But the same people seem to fucking hate Article 11, which would forbid these
sites from copying the entire article and would give the author of the
material real options to take action.

I don't think Article 11 or 13 are particularly good, but this submission does
a good job of showing why some people think they're needed.

~~~
d2wa
Existing copyright law already forbids you from copying and redistributing the
creative works of others. More laws won’t stop anyone from violating the law.
The problem with content scrapers is that it’s so easy to hide your identity,
so easy to scrape content, and so damned easy to get approved into advertising
programs.

~~~
DanBC
Existing copyright law is, as the author says, hard to use. Making those laws
easier to use against blatant commercial violation is a good thing.

------
hyperpape
This isn't something that I've seriously looked into myself, but I thought
that in the past, Google had a way to tell them about your new articles, to
avoid this problem. Does that no longer exist/work?

~~~
d2wa
You can “ping” them about new articles (which I still do), but if other
websites ping them within a few seconds/minutes, then you basically have a
race condition over whose copy is indexed first. Notably, this is a bigger
problem with Bing and Yandex, who don’t have the capacity to index the web as
quickly as Google does.
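
(For reference, the ping itself is just an HTTP GET against each engine's
endpoint. A minimal sketch in Python, assuming the classic sitemap-ping URLs;
both Google and Bing have since deprecated these in favor of other mechanisms,
and the sitemap address is only an example:)

    # Minimal sketch: "ping" search engines that a sitemap has new URLs.
    # Uses the classic (now deprecated) sitemap-ping endpoints.
    from urllib.parse import quote
    from urllib.request import urlopen

    SITEMAP = "https://www.ctrl.blog/sitemap.xml"  # example sitemap URL

    PING_ENDPOINTS = [
        "https://www.google.com/ping?sitemap={}",
        "https://www.bing.com/ping?sitemap={}",
    ]

    for endpoint in PING_ENDPOINTS:
        url = endpoint.format(quote(SITEMAP, safe=""))
        with urlopen(url) as response:
            print(url, "->", response.status)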

~~~
hyperpape
Is this just not a problem in practice, or is there some technical barrier I'm
not thinking of here? I'm imagining a system where you say "hey Google, I'm
about to post content X" that you do right before it goes live on your site.
No race condition.

~~~
d2wa
Only Yandex has something where you can inform them of upcoming articles/links
before you publish. (As mentioned in the article.)
[https://yandex.com/support/webmaster/authored-texts/owners.html](https://yandex.com/support/webmaster/authored-texts/owners.html)

If you ping Bing or Google before you publish, they’ll get a 404 and will take
that as a sign that there is no content there. They also will wait longer
before trying to reindex a page that previously returned a 404.

~~~
DenisM
Make a visible but unlisted URL, ping search engines, wait 5 minutes, list /
link the URL from the home page of your site, publish the URL to RSS. Solves
the 404 problem, doesn't it?
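
Something like this, where publish_unlisted, add_to_homepage, and add_to_feed
are hypothetical stand-ins for whatever hooks the CMS actually provides:

    import time
    from urllib.parse import quote
    from urllib.request import urlopen

    # Hypothetical CMS hooks; stand-ins for whatever your platform offers.
    from my_cms import publish_unlisted, add_to_homepage, add_to_feed

    SITEMAP = "https://example.com/sitemap.xml"  # example sitemap URL

    def publish_with_head_start(article, delay_seconds=300):
        # 1. Make the URL live (HTTP 200) but don't link to it anywhere yet.
        publish_unlisted(article)
        # 2. Ping search engines while the page is reachable, so the crawl
        #    doesn't hit a 404, but before scrapers can discover the page.
        urlopen("https://www.bing.com/ping?sitemap=" + quote(SITEMAP, safe=""))
        # 3. Give the crawlers their five-minute head start.
        time.sleep(delay_seconds)
        # 4. Now announce it: home page link and RSS entry.
        add_to_homepage(article)
        add_to_feed(article)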

~~~
d2wa
I experimented with this too. However, I found that just delaying the RSS feed
produced the desired results.
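
The delayed feed itself can be as simple as filtering entries by age when the
feed is generated. A rough sketch, assuming entries carry a timezone-aware
published timestamp; the delay length here is arbitrary, not the one I use:

    from datetime import datetime, timedelta, timezone

    FEED_DELAY = timedelta(hours=6)  # arbitrary; pick whatever lead time works

    def delayed_entries(entries):
        # Only include entries older than the cutoff, so feed scrapers see
        # new posts hours after search engines have already crawled them.
        cutoff = datetime.now(timezone.utc) - FEED_DELAY
        return [e for e in entries if e["published"] <= cutoff]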

------
gingerlime
I'm not sure even delaying things works, unfortunately. At least if someone
already has strong ranking, they can hijack your content[0]

[0] at least according to
[https://dejanseo.com.au/hijack/](https://dejanseo.com.au/hijack/) (which
according to the author is still a problem today, see
[https://news.ycombinator.com/item?id=17827589](https://news.ycombinator.com/item?id=17827589))

~~~
dazc
Maybe a more subversive method would be to publish the page with gibberish
initially, then republish the real content a day later?

~~~
contem
Actually, shortly after I learned that Google switched to neural networks for
search results, I noticed an old style of spam making a huge comeback: either
Markov chain or neural network generated text, drawn from related websites (I
suspect they look at the top 10 websites for certain valuable terms).

This gibberish actually outranks legitimate content that refers to my content,
sometimes even my own articles, especially when it is turned into a PDF.

Seems like it is easy to block ~250k webpages like:

[https://www.google.com/search?q=inurl:?yhjhuyfib](https://www.google.com/search?q=inurl:?yhjhuyfib)

but I guess Google keeps them there to keep the spammers in the dark? I hope
so; otherwise their new ranking signals allow for easier spam.

~~~
scrollaway
> _Seems like it is easy to block ~250k webpages like_

I think it's unfair to look at these results and say "but it's so easy to
block these". Google's time is best spent on solutions which will reliably and
automatically block these, without going through fairly manual steps.

I did notice a decrease in quality in gmail's spam filter though. Increase in
false positives and false negatives lately. I guess it's unlikely to be
related...

~~~
contem
To me it all seems like the work of the same spammer(s). In such a case, apply
some manual intelligence and wrap it up. It won't scale to all forms of spam,
but if a simple regex can uncover 250k+ results in 10 minutes, a manual spam
fighter can still block millions of pages a day (and warn the webhosts, remove
these flaky ads from their networks, etc.).
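
For illustration, one crude heuristic of the sort that would catch the pattern
above: flag URLs whose query string is a long, vowel-poor alphabetic token like
"?yhjhuyfib". (This exact check is an assumption for the sake of example, not
anything Google is known to use.)

    import re

    # Illustrative gibberish check: query string is one long alphabetic
    # token with few vowels, like "?yhjhuyfib". An assumed heuristic,
    # not anything Google is known to use.
    GIBBERISH_QUERY = re.compile(r"\?([a-z]{8,})$")

    def looks_like_gibberish(url):
        match = GIBBERISH_QUERY.search(url)
        if not match:
            return False
        token = match.group(1)
        vowel_ratio = sum(token.count(v) for v in "aeiou") / len(token)
        return vowel_ratio < 0.3  # few vowels suggests a random token

    print(looks_like_gibberish("https://example.com/page?yhjhuyfib"))  # True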

No doubt the recent machine learning hype has given spammers more advanced
tools to avoid detection.

~~~
scrollaway
False positives are far more problematic than false negatives...

~~~
contem
If you remove from index... sure. But for that URL that I posted, do you think
there is even a single false positive in there?

------
xte
The problem is a matter of audience and missing communication, IMO.

On old Usenet we communicated DIRECTLY with each other, so everyone had a
reputation; we knew the "original sources" and valued them accordingly. In
today's disconnected world, with only content producers on one side, only
consumers on the other, and at best awkward platforms in the middle that limit
user-to-user communication, that's much harder.

IMVHO the medicine is going back to the era of communication we lost; no other
system has proved effective. Look at audio/video piracy as a good living
example.

------
jwilk
No, copying and distributing articles, even when illegal, is not theft.

------
kgwxd
Never heard of this blog before. I really like the looks of the content. My
only method of subscribing to anything is RSS, and I'm definitely not adding it
to my list after reading this. I wasn't going to make them any money by
unblocking trackers, but that Amazon tip jar[1] thing looks really interesting;
I've never seen it before, and I likely would have used it.

[1][https://www.ctrl.blog/tip-jar](https://www.ctrl.blog/tip-jar)

~~~
gbrown
So a blog with content you like is verboten because your RSS feed wouldn't
update instantly? That seems like an odd hill to die on.

~~~
kgwxd
It's a horrible band-aid that hurts the functionality and reputation of RSS,
and it shouldn't be encouraged by supporting a site that sets a bad example. If
they want to play the pointless cat-and-mouse game with the "thieves", fine,
but crippling RSS in the process is not cool. They will just scrape the site
instead, and the RSS feed will be worse off for no gain.

~~~
gbrown
Crippling? I'm as information addicted as the next guy, but unless I'm missing
something fundamental about RSS, the only effect is that you as a user will
notice new posts a few hours late. That seems more than reasonable to me.

If it turns out not to work, they can go back.

~~~
kgwxd
And if it turns out to work, they will encourage others to follow, hurting an
already neglected great technology.

