
Ask HN: Dealing with a competitor who is scraping my content and ranking higher - rakjosh
One of my competitors has been scraping my site and providing the service to their users without paying anything, and now they have surpassed me in Google's rankings.

I've created a website that gets around 5K unique hits every day. It's a free service for users, but I have to pay a monthly fee to a third-party service provider.

Because my site is free for users and doesn't require them to register, it's been very hard to keep up with this guy. If I change certain things, they counter it immediately and make it work. And since they use several proxies to send their requests, it's virtually impossible to block them based on IP.

Please suggest anything I'm missing that can be done.
======
SyneRyder
If you haven't already, try adding some "trap streets" to your data. Map
makers occasionally include streets that don't exist, so if a competitor's
map includes them too, it's clear that the competitor copied it:

[https://en.wikipedia.org/wiki/Trap_street](https://en.wikipedia.org/wiki/Trap_street)

I did that with an online marketing dictionary I wrote years ago: some of the
definitions included strange usage examples that contained the names of
several of my friends. When a competitor scraped us, instead of shutting them
down, the boss negotiated a data licensing arrangement with the scraper
instead, so we ended up getting a revenue stream & backlinks out of the
incident.
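
For a data-driven site, the same idea is easy to automate. Here's a minimal
sketch in Python; the record schema and field names are made up for
illustration, not taken from anyone's actual site:

    # "Trap street" sketch: mix a few fabricated records into the real data.
    # If a canary ever shows up on a competitor's site, it is strong evidence
    # that they copied from you, since the canary exists nowhere else.

    import random

    REAL_ENTRIES = [
        {"term": "churn rate", "definition": "The rate at which customers leave."},
        {"term": "conversion", "definition": "A visitor completing a desired action."},
    ]

    # Fabricated entries no legitimate independent source would contain.
    CANARY_ENTRIES = [
        {"term": "segmented reachthrough", "definition": "A made-up term seeded as a copy trap."},
    ]

    def dataset_for_response():
        """Return the dataset with canaries shuffled in among real entries."""
        combined = REAL_ENTRIES + CANARY_ENTRIES
        random.shuffle(combined)
        return combined

Keep a private log of which entries are the canaries, so you can later prove
they originated with you.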

If that fails, and talking to them directly doesn't work, then the DMCA is
often effective. I've made DMCA requests against websites that distributed
cracks of my software & they often disappeared in a couple of days.

~~~
partycoder
Similar to trap streets are "phantom settlements", aka "paper towns", which
are fake towns rather than streets.

Now, this idea is not limited to maps: Google used trap search results to
catch Microsoft using Internet Explorer to scrape Google search results:
[https://googleblog.blogspot.com/2011/02/microsofts-bing-uses...](https://googleblog.blogspot.com/2011/02/microsofts-bing-uses-google-search.html)

~~~
m0nty
I was told all these phenomena (on maps) are referred to as "cartographers'
follies", so the other names are interesting to hear.

In the "real world", it was commonplace for compilers of mailing lists to
include several "phantom names" in their lists. If those names received a
mail, the list-holder would send an invoice to the person who sent it. Simple,
elegant, very difficult to bypass way to protect your knowledge-based
business.

------
slededit
This is pretty much what the DMCA and other copyright mechanisms were made
for. Send your request to Google and have them delisted.

[https://support.google.com/legal/answer/3110420?hl=en](https://support.google.com/legal/answer/3110420?hl=en)

~~~
graeme
Does anything stop the scraper from using the DMCA on the original site?

~~~
jordan801
Anything receiving 5K visits a day is likely crawled at least once a day
unless it's told not to. In this case, Google could check its crawls and
determine which of the two competitors is the original source of the content.

~~~
jklein11
If the site that is copying is also being crawled once per day, isn't it
possible that it would be crawled first and appear to Google to be the
original source?

~~~
setr
presumably site A got popular _before_ site B started scraping it; else there
wouldn't be much point in scraping A. There is ofc the scenario where A was a
generally unknown resource, and B realized it had the potential to be valuable
if you just SEO'd it right.. but there's still probably evidence _somewhere_
that A cropped up first (how else did B stumble onto it?)

------
mullen
I worked at a company that had this problem, and while I was not involved with
the solution, I did sit near the guy who was and had extensive talks with him
about how to resolve the issue (or at least solve it well enough):

* Require accounts, with good robot account-creation detection. This is actually not hard, since there is a lot of canned code for it.

* Subscribe to Tor and proxy IP collection services and then block all of the Tor and proxy addresses. The Tor IP list is free from the Tor Project, and proxy IPs can be found through various services.

* Put in crawler-detection code. Most scrapers are really simple and easy to detect. I think Apache even has a module to detect simple scrapers and auto-block them.

* Feed bad data to scrapers once they are detected (see the sketch after this list). This was the thing that made scrapers go away: once they know you can detect them and feed them bad data, they know they are screwed and will give up. I did get a kick out of looking at the scraper's website and seeing "Penis won the 5K Penis run". Of course, you should really feed the scrapers bad data that looks believable.

* DMCA their website with their provider. The company I worked for did have several scraper sites deactivated.

* Ask Google to delist them. The company I worked for did have several scraper sites removed from Google.

* There are a number of services that will help you with all of this, but your website only gets 5K unique users a day, which is probably not enough to be worth the cost.
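
A minimal sketch combining a few of the bullets above, assuming a Flask app
(the framework, the blocklist source, and the fake data are all placeholders,
not the original setup):

    # Detect likely scrapers via a proxy/Tor IP blocklist plus a crude
    # header check, and feed them believable-but-wrong data rather than
    # blocking them outright.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # In practice, load this from the Tor Project's exit list and a
    # commercial proxy-IP feed, refreshed on a schedule.
    BLOCKLISTED_IPS = {"203.0.113.7", "198.51.100.23"}

    def looks_like_scraper(req):
        if req.remote_addr in BLOCKLISTED_IPS:
            return True
        # Real browsers virtually always send these headers.
        return not req.headers.get("User-Agent") or not req.headers.get("Accept-Language")

    @app.route("/data")
    def data():
        if looks_like_scraper(request):
            # Same shape as the real data, poisoned values.
            return jsonify([{"name": "Example entry", "value": 41}])
        return jsonify([{"name": "Example entry", "value": 42}])

Since the OP's scraper rotates proxies, no single check will catch
everything; combine this with the fingerprinting and honeypot ideas
elsewhere in this thread.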

~~~
DEADBEEFC0FFEE
Banning Tor and proxies is pretty unfair to legitimate users who use those
services.

~~~
tehlike
While I agree, the OP isn't running a charity either, and needs to keep their
own interests as a higher priority to be able to keep providing the service.

------
laurieg
If you are able to detect them, serve them garbage data instead of blocking
them. Mix in some false entries etc. If you completely block them they will
work hard to work around the block, but if you give them some bad data
randomly it will take them longer to notice.

Also, you can use the fake data as evidence they scraped you.

~~~
coding123
Also try to find a pattern in the request headers they commonly send. Try
to create a special hash from the request headers, then find a way to embed
that hash in the data you serve, so you can identify them better.
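
A rough sketch of that, assuming Flask (the hidden span is just one option
for where to embed the hash):

    # Hash the request headers into a client fingerprint, log it, and embed
    # it invisibly in the served page. If the hash later appears in the
    # scraper's copy, you know which requests produced it, across IPs.

    import hashlib

    from flask import Flask, request

    app = Flask(__name__)

    def header_fingerprint(req):
        """Hash header names (in the order sent) plus a few stable values."""
        parts = [name.lower() for name, _ in req.headers]
        parts.append(req.headers.get("User-Agent", ""))
        parts.append(req.headers.get("Accept-Language", ""))
        return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]

    @app.route("/")
    def index():
        fp = header_fingerprint(request)
        app.logger.info("fingerprint=%s ip=%s", fp, request.remote_addr)
        # A copy-paste scraper will tend to carry the marker along.
        return f"<html><body><p>Real content.</p><span hidden>{fp}</span></body></html>"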

------
lubujackson
Write a few things. Register them with the Copyright Office. Then post them
and wait for them to be copied. Now you can sue for the maximum statutory
damages and strong-arm them into taking it all down or facing obvious legal
fees and penalties of over $100K.

~~~
setr
Can you sue for monetary damages if you yourself were not making any money off
the content? If anything, the other site could be called a mirror, saving our
OP money on hosting costs by diverting traffic.. (although maybe not in the
case of scraping?)

~~~
icebraining
[https://en.wikipedia.org/wiki/Statutory_damages_for_copyrigh...](https://en.wikipedia.org/wiki/Statutory_damages_for_copyright_infringement)

------
8bitsrule
(Note: I am not a lawyer.) If you're a US citizen, any content that you've
created yourself is -automatically- copyrighted. (That's also true in all
countries where Berne Convention standards apply.) See this link for further
basics: [https://smallbusiness.findlaw.com/intellectual-property/what...](https://smallbusiness.findlaw.com/intellectual-property/what-is-copyright.html/)

If they're scraping content you created, they have already broken the
copyright law. (They may not realize that. Make sure they do.)

Since 1989, in Berne countries you don't even have to post a copyright notice
for them to be in violation. So your first step would be to notify them that
they've already broken the law, and invite them to cut it out.

Your second step would be to contact a lawyer. If the other site is _not_ in a
Berne country, that might complicate things. On the other hand, they may
simply not be aware that they've broken the law.

~~~
stef25
What if I created my content by slightly altering other content? Like writing
a news article from a Reuters press release, how much difference does there
need to be?

Seems like a pretty hard line to draw.

~~~
mikekchar
If it's based on the other content, then it is a derived work and copyright
infringement.

Keep in mind that copyright is for _creative works_. Facts can not be
copyrighted. In a news article, it is the expression of the facts that covers
the copyright, not the facts themselves. So you can definitely write an
article stating the same facts as a Reuters press release (and even base it
on that press release). You just can't base your prose on their prose. You
have to completely rewrite it as a human -- it has to be an artefact, not an
automated process.

However, there is another class of copyright for collections. A collection of
facts can be copyrighted. You can't copy the same collection of the facts, nor
a substantial portion of the collection. You can take individual facts out of
the collection and include them in a creative work, but you can't just grab
facts one at a time and create a new collection -- that would be a derived
work.

Like all laws, copyright law is subject to interpretation by humans (lawyers
and judges). While you might think that you are satisfying the law, they may
disagree. To avoid that circumstance it is best to stay in an area where it is
obvious to everyone that you are within the law. If you don't mind being sued,
then you can try to push the boundaries. It is your choice.

------
torrence4
This is a good article I've followed in the past. I don't know how helpful it
is; you might have already gone through it.

[https://github.com/JonasCz/How-To-Prevent-Scraping](https://github.com/JonasCz/How-To-Prevent-Scraping)

~~~
rakjosh
That was definitely a very good resource. I've gone through that and tried to
implement as many things as I could.

------
smilesnd
Learn to accept it. The only way to stop scrapers is to make it a financial
burden for them to scrape your website. There are too many IP addresses,
proxies, VPNs, and botnets for you to try to block them all. You can write
code to try to tell legit traffic from scraper traffic, but then they will
just figure out how to bypass it through trial and error. You can try to take
legal action, but that will cost you money up front and might take longer
than you think. You can try to get their service terminated, but most hosts
won't do it without a court order. You can make your HTML/CSS hell to read:
randomize and auto-generate all your class and ID tags, and put your elements
in random order so the people writing the scraping code don't have a default
template they can update in minutes. Use JavaScript to actually send the
data, which you can then use to identify people, and have a call-home
function so they cannot hide behind proxies, VPNs, and such.

These are all just things that will slow them down, but they won't stop them.
Making it cost them more money than they make from using your data is the
only way to truly stop them.

------
timrichard
> If I change certain things, they counter it immediately and make it work.

If the content is markup based, are your countermeasures about changing the
IDs, classes, or overall tag structure of the markup you serve? I was
wondering if you could have several variations of the above, and serve your
content via a random one each time that would be visually indistinguishable to
a human viewer. The person maintaining the scraper would have to have seen and
adapted to all of them to get all your new content reliably. Not an impossible
hurdle, but they might try easier targets if too many barriers are in the way.

~~~
winkeltripel
Easily bypassed: just retry on failure until the scraper gets syntax it likes.

~~~
fencepost
The other end retrying until it gets what it wants will dramatically change
its usage pattern in ways that may be easy to detect unless they have an
enormous store of IPs to connect from.

There are enough suggestions in here to provide a bunch of useful options, and
while the site itself may not be making money, the experience dealing with
this may be very useful on a resume or for building a client base with similar
issues.

Possible approach: look for abnormal usage patterns to ID opponent systems.
Randomize format and possibly other steps to assist that. Build that
randomization in marginally effective ways that are easy to improve later.
Build a way to feed bad/poison/"test" data to specific source IPs. At a time
chosen to maximize impact, start feeding poison data to the suspect IPs using
the marginally effective randomization, while feeding regular data to most
visitors but with much improved randomization. Basically make your opponent's
site visibly unreliable.

If you feel particularly vicious and know something about the opponent's
infrastructure, make the poison vicious e.g. feed SQL injections. Be aware
that this may have costs - you'd likely be fine on a legal basis ("I'm not
responsible for their crappy sanitizing of inputs they shouldn't have had
anyway") but you might still incur costs (lawyer if sued).

Edit: also, anyone going to serious measures to continue scraping after you
act against it may also be inclined to DDoS your site if you actually fully
block them.

------
PopeDotNinja
Can you figure out how the person is scraping your site? If so, start putting
in content that only shows up for that scraper, such as...

"this content was unabashedly stolen from <your site URL>"

...or maybe...

"my favorite movie is Mac & Me, check it out!
[https://youtu.be/vNjACYfQlbI"](https://youtu.be/vNjACYfQlbI")

~~~
kazinator
"this content was unabashedly stolen from <your site URL> *by <IP address of
scraper> on <date time>".

Unless the content must be 100% static, you can embed the caller's IP address
in it.
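
A minimal sketch, assuming a Flask handler (any server-side templating works
the same way); in practice you would hide the marker somewhere less obvious
than an HTML comment:

    # Stamp each response with the requester's IP and a timestamp, so any
    # scraped copy records its own provenance.

    from datetime import datetime, timezone

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/article")
    def article():
        stamp = datetime.now(timezone.utc).isoformat()
        marker = f"<!-- served to {request.remote_addr} at {stamp} -->"
        return f"<html><body><p>Your content here.</p>{marker}</body></html>"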

------
Rjevski
Add sensible rate-limits (modelled after an average user's expected usage) and
present a captcha after that. Make it per-IP.

Sure, the crawler can change IPs, but I doubt they have unlimited addresses
and eventually they will run out.
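
As a sketch of what that might look like (Flask assumed; the window and limit
below are placeholders to be modelled on your real traffic):

    # Per-IP rate limit: past the threshold, serve a challenge page instead
    # of content. Actual captcha verification is left out of this sketch.

    import time
    from collections import defaultdict, deque

    from flask import Flask, request

    app = Flask(__name__)

    WINDOW = 3600                 # one hour
    LIMIT = 120                   # generous compared to human usage
    hits = defaultdict(deque)     # ip -> request timestamps

    @app.route("/content")
    def content():
        ip = request.remote_addr
        now = time.time()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()
        if len(q) > LIMIT:
            return "<html><body>Please solve the captcha to continue.</body></html>", 429
        return "<html><body>Real content.</body></html>"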

Another option (if you can reliably detect the scraper) is to poison their
data by sending them bad (but valid-looking) data.

~~~
winkeltripel
This will be more effort to implement than it will be to route around: between
free VPNs and Tor, everyone has effectively infinite IPs.

------
toddkazakov
If you're able to detect when your content is being scraped, you can send them
the same text but with "confusable" [1] characters (i.e., UTF-8 characters
that look similar to ASCII characters).

[1]
[https://unicode.org/cldr/utility/confusables.jsp?a=test&r=No...](https://unicode.org/cldr/utility/confusables.jsp?a=test&r=None)
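
A small sketch of that substitution (the mapping below is a hand-picked
subset of the confusables table linked above):

    # Swap selected ASCII letters for visually identical Unicode characters
    # in text served to flagged scrapers. Rendered text looks the same, but
    # the copied version no longer matches search queries byte-for-byte.

    CONFUSABLES = {
        "a": "\u0430",  # Cyrillic small a
        "c": "\u0441",  # Cyrillic small es
        "e": "\u0435",  # Cyrillic small ie
        "o": "\u043e",  # Cyrillic small o
        "p": "\u0440",  # Cyrillic small er
    }

    def poison_text(text):
        """Replace selected ASCII letters with lookalike Unicode letters."""
        return "".join(CONFUSABLES.get(ch, ch) for ch in text)

    print(poison_text("scraped content"))  # looks identical, differs in bytes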

------
shiado
Find a way to contact the owner of the site and bait them with something like
an acquisition offer; you might be able to use the email in the WHOIS data if
there is one. The point is to get the owner to click a link you provide;
behind this link you want to log IP/headers and maybe do some JS
fingerprinting, etc. It is unlikely they will open this link over VPN/Tor. If
you feel like it, check your server logs to see if they have accessed your
site from that IP, and when. Then send a C&D containing all this information
and tell them that you have legally contacted their ISP for their subscriber
information, and that not only will they be sued under the DMCA for possessing
your data, but they will also get hit with the CFAA for circumventing
protections against access and will enjoy a five year sentence in a federal
pound me in the ass prison.

~~~
robin_reala
I don’t disagree with the content of your post and I know this is a US
cultural trope at this point, but trivialising rape benefits precisely no-one.

~~~
yakshaving_jgt
"Federal pound me in the ass prison" is a quote from the movie Office Space.

If you already knew this, then I guess the applicable movie quote this time is
"Lighten up, Francis" from Stripes.

------
brightball
Everyone else here has a lot of good suggestions, but the biggest thing I
would recommend is NOT actively blocking them. As you already know, if you
detect them and make things harder...they adjust.

Start with successfully detecting them and go from there. Log them. Track
origins. Compile patterns.

Then, when you decide exactly what you want your move to be, use all that data
to be as effective as possible.

------
maxpupmax
Ideas:

1. Create content that can't be scraped. I'm not sure exactly what your
"content" is in this case, but images can be watermarked, text can be given
lots of references to your own brand and service, etc.

2. Submit legal requests to Google to remove the content. Enough violations
can get their domain blacklisted. I've done this successfully in the past
against competitors using my trademark without permission, to get it removed
from Google Ads.
[https://support.google.com/legal/answer/3110420](https://support.google.com/legal/answer/3110420)

3. Talk to the hosting provider, if applicable. If someone is repeatedly
breaking copyright law using their platform, they may have some incentive to
stop providing hosting.

~~~
amelius
You can also apply steganography to your data, i.e., embed a copyright message
that can't be detected or easily removed.

[https://en.wikipedia.org/wiki/Steganography](https://en.wikipedia.org/wiki/Steganography)
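
For plain text, one simple (if easily stripped) steganographic trick is
zero-width characters; a sketch:

    # Encode a marker string as zero-width characters spliced into the text:
    # invisible when rendered, recoverable from a scraped copy as evidence.
    # Note that Unicode normalization or aggressive sanitizing removes it.

    ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner

    def embed(text, marker):
        bits = "".join(f"{ord(c):08b}" for c in marker)
        hidden = "".join(ZW0 if b == "0" else ZW1 for b in bits)
        return text[:1] + hidden + text[1:]  # tuck it after the first char

    def extract(text):
        bits = "".join("0" if c == ZW0 else "1" for c in text if c in (ZW0, ZW1))
        return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

    stamped = embed("Hello world", "(c) example.com")
    assert stamped != "Hello world" and extract(stamped) == "(c) example.com"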

------
mcv
Instead of blocking them by IP, send false information to their IP. Make it
subtle so they don't immediately notice, and they'll end up hosting tons of
unreliable information.

Maybe even automate it: once lots of requests start coming from the same
source, a sign that somebody is scraping your site, start messing with the
data you send them in later requests.

This is all assuming there's no legal objection to providing false data on
this particular topic. There might be, depending on what it's about.

------
Eridrus
Google ReCaptcha is a pretty low friction way for (most) users to get your
content.

You can do this while serving your real content to Google.

~~~
SyneRyder
The spammers & scrapers seem to have found their way around ReCaptcha. I made
a simple spam filter for my website comment form that logs every submission,
and a lot of the spam being sent (maybe 20%?) has a valid Google Recaptcha.
I'm thinking of removing Recaptcha because simple keyword based filters have
been more effective than ReCaptcha has been.

~~~
Jaruzel
ReCaptcha and the original two-word based one are just Google using your
visitors as 'mechanical turks': by solving ReCaptcha et al, they are
actually helping Google apply identification tags to images in Street View,
and previously the Google Books collection.

An afternoon's coding would get you an independent self-hosted captcha system
that would be just as effective (if not more so, as it's proprietary) and
probably less annoying to your visitors.
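
For instance, a minimal self-hosted challenge along those lines (a sketch
only: a single HMAC-signed arithmetic question, with obvious weaknesses like
token replay that a real version would need to address):

    # Serve a simple arithmetic question plus a signed token; verify the
    # submitted answer against the token on the server.

    import hashlib
    import hmac
    import random

    SECRET = b"change-me"  # placeholder; keep the real key out of source

    def make_challenge():
        a, b = random.randint(1, 9), random.randint(1, 9)
        token = hmac.new(SECRET, str(a + b).encode(), hashlib.sha256).hexdigest()
        return f"What is {a} + {b}?", token

    def check_answer(user_answer, token):
        expected = hmac.new(SECRET, user_answer.strip().encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, token)

    # Render `question` in the form, put `token` in a hidden field, and run
    # check_answer() against the submitted value on the server.
    question, token = make_challenge()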

~~~
orf
> and probably less annoying to your visitors.

As someone who has used both, absolutely 100% no in every way.

Google's captcha is so superior to what you would write; please don't tell
people not to use it.

~~~
Jaruzel
How so? It's really not that hard. I've done it, and it cut spam coming
through a web form by almost 100%.

~~~
danielbarla
Both (custom and ReCaptcha) are essentially security by obscurity, in the
sense that a skilled attacker will eventually bypass either. The problem is
not that difficult. The advantage Google has is being able to move the
goalposts, potentially on a daily basis, while you presumably have better
things to do. On the other hand, they are a juicier target, so they will draw
more effort. YMMV.

Google tends to strike a good balance between convenience and irritation.

------
jlengrand
Something I haven't seen mentioned yet, but did you try to simply get in
contact with them to discuss the issue?

Since your website is free, it might be worth combining forces to serve your
users better. I know it's hard to swallow, but in the end what matters is
that what you do is useful to people, isn't it? And if your users are moving
away, it's probably because some of what they do is right?

------
ninjakeyboard
Are you using server-side rendering? There are ways to make it harder, but
ultimately you don't want to sacrifice the user experience. Why don't you
require registration provided by a third party like Facebook or Google? It's
one-click sign-on, and you can keep your access requirements extremely low.

------
alexmorenodev
"If you're reading this, you should probably read its article on its original
source (your link). This entry is scrapped from site (your link)."

------
bluetop
Have you tried filing a spam report with Google?

[https://support.google.com/webmasters/answer/93713](https://support.google.com/webmasters/answer/93713)

------
Witesse
When you say "my content", are you referring to content that you created, like
a blog or online resource or are you referring to content your users created?
Also is the information they're copying factual in nature (like a map or
results of a calculation)?

If it's factual in nature: It's not protected by copyright.

If it's content your users created: You are not the copyright holder and
cannot file a DMCA takedown notice. Your best bet is really going to be making
yourself unscrapable.

If it's content you created: File a DMCA takedown notice.

~~~
Doctor_Fegg
> If it's factual in nature: It's not protected by copyright

Compilations of facts can however be protected in the EU under the Database
Directive.

Also, Terms of Use etc.

------
rakjosh
This is my website if anyone was wondering:
[https://freephonenum.com/](https://freephonenum.com/)

My competitor was scraping all the phone numbers and SMS messages received on
my site and displaying them on their site. It took me several months to figure
that out. I was seeing abnormal bandwidth usage for JSON (my API endpoint) but
never realized that someone was constantly pinging my APIs.

[edit] Now I've removed the API and everything is served as HTML (so it's no
longer easy for someone to scrape my API).

------
stef25
How can a "service" be scraped, unless the service is just making content
available?

Or are you referring to the copy that promotes the service?

Scraping is very easy if your content is rendered on the server. If it's
rendered on the client it's a little more complex, I guess your competitor
would have to use headless browsers.

Technologically, you're potentially up against services like Crawlera which
will be pretty hard to beat.

The best solution is probably filing copyright claims through DMCA which would
remove them from Google rankings.

------
ivolimmen
Usually scraping means the scraper will go through all the links you provide.
If you add invisible links on your site that point to rubbish content, he will
automatically pick those up as well.
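
A sketch of that honeypot-link idea, assuming Flask (remember to disallow the
trap URL in robots.txt so legitimate crawlers like Googlebot don't get
flagged):

    # A link no human sees points at a trap URL; anything that fetches it
    # gets its IP flagged for later blocking or data-poisoning.

    from flask import Flask, request

    app = Flask(__name__)
    flagged_ips = set()

    @app.route("/")
    def index():
        # Invisible to users, but present in the markup for scrapers.
        return ('<html><body><p>Real content.</p>'
                '<a href="/honeypot" style="display:none">more</a>'
                '</body></html>')

    @app.route("/honeypot")
    def honeypot():
        flagged_ips.add(request.remote_addr)
        return "nothing to see here"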

------
bryanrasmussen
[https://github.com/JonasCz/How-To-Prevent-Scraping](https://github.com/JonasCz/How-To-Prevent-Scraping)

------
linsomniac
We use Distil Networks for bot and scraping limiting. I wouldn't exactly say I
recommend it, as I'm skeptical about its usefulness, but it might be worth
trying to see if it solves the problem. At least then they are battling a
third party with some expertise and additional data, rather than you having to
come up with countermeasures from whole cloth.

------
jwally
What about a bunch of dummy links that aren't visible in the UI (via funky CSS
selectors), mixed in with the real links, that serve junk content and take
forever to load?

If you can randomise how those selectors are applied, it might make your site
less fun to scrape.

------
cpv
"Ask HN: How to deal with GDPR / cookie notices in the context of a crawler?"

[https://news.ycombinator.com/item?id=17795750](https://news.ycombinator.com/item?id=17795750)

Might give an idea.

------
sharemywin
Curious how this turns out. Let us know what you tried and if it worked.

------
bastawhiz
Find a way to subtly detect them (user agent? Sneaky headers?) and serve them
janky content. Serve them the same copy with copious swears. Have a bit of fun
while you go through legal channels.

------
babaganoosh89
Can you implement a captcha somewhere? Or figure out a way to obfuscate your
HTML, if that's what he's scraping: something that scrambles the CSS class
names on each build.
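
A build-step sketch of the class-scrambling idea (the file contents and the
class list here are placeholders; a real build would parse the markup rather
than rewrite it with regexes, which can also hit matching words in visible
text):

    # Rewrite class names in HTML and CSS to per-build random tokens, so a
    # scraper's hard-coded selectors break on every deploy.

    import re
    import secrets

    def scramble_classes(html, css, class_names):
        mapping = {name: "c" + secrets.token_hex(4) for name in class_names}
        for old, new in mapping.items():
            pattern = re.compile(rf"(?<![\w-]){re.escape(old)}(?![\w-])")
            html = pattern.sub(new, html)
            css = pattern.sub(new, css)
        return html, css

    html = '<div class="entry"><span class="phone">555-0100</span></div>'
    css = ".entry { margin: 1em } .phone { font-weight: bold }"
    new_html, new_css = scramble_classes(html, css, ["entry", "phone"])
    print(new_html)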

~~~
notyourwork
It really depends on how complex or simple the page is. If it's a small site
with little content, you can usually traverse from the parent body while
ignoring classes/IDs. However, for more complex sites, changing class/ID
values on each build can certainly make this a lot more challenging.

Ideally, also change the child-element counts, to prevent things like
$('body').children().children().eq(3).html().

------
is_true
If you can detect them, try to serve them wrong data.

~~~
chrischen
Yeah, I was thinking a "hellban" type tactic might wreak more havoc: just
subtle enough to de-rank them.

------
wolco
Loads of great suggestions. The DMCA request is the tool you want. That will
cause a downrank if you keep doing it. Problem solved.

------
hkai
I thought Cloudflare's "under attack" mode can prevent scraping bots. Correct
me if I'm wrong.

~~~
ishanjain28
It can be bypassed, and they don't make it very difficult to bypass.

------
dsfyu404ed
How advanced is their scraping? Can you just serve them crap based on some
simple combination of identifiers?

------
crb002
Encode a timestamp. If they are running at the same time of day feed them a
bag of dicks.

------
onemoresoop
You can add typos to a certain set of items and then look for a match on their
service.

------
HodKonem
1. Cooperate with them
2. Buy them
3. Abuse them
4. DDoS, hack, or discredit them
5. Make your content hard-linked to your platform (like product placement, watermarks, or an authentic style and features).

If this is not enough, feel free to ask; I will think more.

------
siquick
How about offering them a paid-for API with your content?

------
HodKonem
Cases:

1. Cooperate with them
2. DDoS them
3. Buy them
4. Post comments on each publication (like a common user)
5. Remake the content to link it with your platform (like product-placement techniques)

If this is not enough, feel free to ask me.

------
rajadigopula
Google is smart enough to penalize plagiarizing sites. If the content appeared
on your site first, it is highly unlikely they get ranked higher. Work on your
own SEO strategy.

~~~
pacuna
I was thinking the same thing. I believe with the latest changes in the
algorithm this should be even harder.

------
paulie_a
You can't do much, they will beat you every time.

~~~
onemoresoop
It requires some work, but they can be beaten.

------
paulcnichols
Capitalism is ruthless.

------
habobobo
Just use Imperva....

