

Ask HN: How do you guys stop scrapers from mirroring your site? - mmaunder

I'm about to put a ton of really useful, unique and valuable (in the CPC and usefulness sense) content online. Scrapers are very quickly going to want it all. How do you guys protect your content? [keeping in mind that suing someone offshore is difficult]

My current approach is going to be to limit all non-USA crawlers and crawlers that don't identify themselves as Google, Bing or someone else I care about. I'm planning on using nginx.conf and the MaxMind country database to do this.

By limit I mean limiting each IP to viewing a maximum of 50 to 100 unique documents per day.

Any suggestions would be much appreciated.
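A minimal nginx sketch of the plan described above, assuming the legacy GeoIP module is compiled in and a MaxMind country database is on disk (paths, zone names and limits here are illustrative; note nginx can throttle request rate per IP, but counting "unique documents per day" would need application-level logic):

```nginx
# Load the MaxMind country database (path is a placeholder).
geoip_country /usr/share/GeoIP/GeoIP.dat;

# Flag any client whose IP doesn't geolocate to the US.
map $geoip_country_code $non_us {
    default 1;
    US      0;
}

# Rough per-IP throttle: ~1 request/second, small burst allowance.
limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

server {
    listen 80;

    location / {
        if ($non_us) { return 403; }     # crude non-US block
        limit_req zone=perip burst=20;
        # root / proxy_pass etc. goes here
    }
}
```

A whitelist for Googlebot/Bingbot would still be needed on top of this, since the `403` above would otherwise hit any crawler arriving from a non-US address.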
======
autoreverse
Set the noarchive robots meta tag so scrapers can't pull your pages out of the Google cache.

Check out bad-behavior: <http://bad-behavior.ioerror.us/>

Otherwise:

Use a honey-pot to catch browsers that don't obey robots.txt. Send yourself an
email for every new IP address so you can catch any false positives.

Redirect requests with no user agent specified to your honey-pot URL.

Same for known bad/useless user agents: wget, curl, Bing, etc.

Validate supposedly good crawler agents via reverse IP lookup; cache bad IP
addresses; redirect to honey-pot URL.

Filter remainder based on request rate over a given period (use a DB to cache
requests). More than 1 request per second for 20 seconds = bot, more than 200
requests per day = bot : redirect to honey-pot URL

I've been doing this for a site with around 150K URLs for the past 4 years.
I have about 600 URLs blocked and around 35 user agents.

~~~
peelle
These are good ways to stop your basic scraper.

One of my jobs is to scrape websites. The websites we scrape, we have
permission to scrape. Sadly, many companies we work with don't have full-time
developers, or they have rules set up to stop others from scraping their
site, which makes things hard on us.

Other than Bad Behavior, I have circumvented all of the above suggestions. I
don't know what Bad Behavior is and as far as I know I have not encountered
it.

We falsify user agents to say they are browsers/Google/whatever's needed. We
script lynx or write programs to type in a login/pass and change the page. We
write programs to follow the do_postback (with their vars) stuff that ASP.NET
creates for multi-page tables. We add in random time intervals to ease the
load on some companies' sites, and to circumvent blocks on others. We
download and parse PDF, DOC, XLS, even some images. We have gone as far as
scraping Flash apps using screenshots and OCR.

The list goes on. My point is that nothing is completely safe.
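In miniature, the user-agent spoofing and randomized intervals the commenter describes might look like this (a hypothetical sketch; the UA string, URLs and delay bounds are all placeholders):

```python
import random
import time
import urllib.request

# Claim to be an ordinary desktop browser (an illustrative UA string).
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0 Safari/537.36")

def fetch(url):
    """Fetch one page while presenting a browser user agent."""
    req = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def crawl(urls, min_wait=2.0, max_wait=8.0):
    """Yield (url, body) pairs, pausing a random interval between hits."""
    for url in urls:
        yield url, fetch(url)
        time.sleep(random.uniform(min_wait, max_wait))
```

The random sleep both eases load on the target and defeats naive "exactly N requests per second" detectors like the one sketched earlier in the thread.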

------
bobds
You will quickly find out that the scrapers are coming from USA-based proxies,
they pretend to be Google or Bing, they use an intermediary (say, Google
Cache)... and so on, until you are tired of fighting them.

It's a losing game, unless you don't mind making it hard for actual users to
view your content.

~~~
whatevers2009
I agree with bobds. It's pointless to worry about scrapers. You're better off
focusing on doing your own thing. Large sites get scraped and cloned all the
time. At the end of the day, it's better to be focused on what you do rather
than the clones.

------
gexla
Limit all non-USA crawlers? The USA traffic is probably the worst offenders
for scraping content.

Personally, I wouldn't even bother with this. As long as other sites aren't
outranking you with your own content then I don't see a problem. If they are
outranking you with your own content, then you need to evaluate your SEO
strategy.

~~~
redstripe
Google will apparently remove sites that copy copyrighted content:
[http://www.google.com/support/websearch/bin/answer.py?hl=en&...](http://www.google.com/support/websearch/bin/answer.py?hl=en&answer=58)

So he will definitely outrank them if he owns the content.

~~~
silvestrov
You could add some unique words/sentences to the pages, which makes googling
for mirrors easy (i.e. automatable), so sending copyright notices to Google
can be almost fully automated.
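A sketch of how such sentinels could be generated automatically (the hashing scheme, wordlist and secret here are all invented for illustration):

```python
import hashlib

# A small wordlist for composing innocuous-looking sentinel phrases.
WORDS = ["amber", "falcon", "quartz", "meadow", "cobalt", "lantern",
         "walnut", "harbor", "tundra", "velvet", "magnet", "cipher"]

def sentinel(page_url, secret="change-me"):
    """Derive a deterministic phrase for a page from a private secret.

    Embed the phrase in the page; later, searching the web for the exact
    phrase finds verbatim mirrors of that specific page.
    """
    digest = hashlib.sha256((secret + page_url).encode()).digest()
    picked = [WORDS[b % len(WORDS)] for b in digest[:4]]
    return " ".join(picked)
```

Because the phrase is derived rather than stored, the takedown script only needs the secret and the list of page URLs to regenerate every sentinel.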

------
pwg
If you do not want your content downloaded, then don't put your content
online.

The web, and computers in general, work by copying data. Trying to prevent
that is like trying to "make water not wet" (Bruce Schneier).

~~~
jhamburger
But honestly, the web is considered 'public domain' and you should be happy
they didn't just lift your whole article and put someone else's name on it.

------
guan
You could do a reverse lookup on crawler IP addresses to make sure that they
match the User Agent. Google PTR records will end in google.com or
googlebot.com. Maybe compare with the list at <http://chceme.info/ips/>
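A sketch of that double lookup in Python (the googlebot.com/google.com suffix check follows Google's documented verification procedure; error handling is minimal):

```python
import socket

def plausible_crawler_host(host):
    """Does a PTR hostname belong to Google's crawler domains?"""
    return host.endswith((".googlebot.com", ".google.com"))

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then confirm the
    hostname forward-resolves back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # PTR lookup
    except socket.herror:
        return False
    if not plausible_crawler_host(host):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward lookup
    except socket.gaierror:
        return False
    return ip in addrs
```

The forward confirmation matters: anyone controlling reverse DNS for their own IP block can set a PTR record ending in googlebot.com, but they can't make Google's forward zone answer with their IP.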

You probably can’t prevent this altogether, but you can probably do a lot to
raise the cost of this type of crawling. The exception might be if there are
people out to copy your content specifically, who will actually tailor their
crawlers to your countermeasures.

~~~
deno
And regardless of how good your defences are going to be, there's always
Mechanical Turk.

------
regularfry
Can you partition it? Have one portion of the data available to all and
sundry, then have another more valuable part hidden behind a captcha'd login?

I know this won't prevent anyone else from being able to match your Google
ranking on the free content, but it _should_ mean that you can maintain your
position as the "default place to go" for it, which should come with attached
PageRank goodness. It also means you can institute per-account rate limits,
rather than IP-based ones, which might help.

------
DjDarkman
If you have a good lawyer, sue them for copyright infringement. Otherwise
ignore it; the current crawlers can do pretty much anything an average
visitor can, and some even use real browser engines.

> By limit I mean limiting each IP to viewing a maximum of 50 to 100 unique
> documents per day.

That won't do you any good; it's ridiculously simple to obtain unique IP
addresses and proxies, not to mention that your database may blow up with
that much data. :)

------
nolite
If they want to get it, they'll get it. There are plenty of services
available that will sell you lists of open proxy servers by the hundreds. If
they really want your content, they'll just pop up each request on a
different IP. The only way to really make it tough to get is to take the
content out of the HTML... which... you probably don't want to do.

------
ig1
Use absolute URLs; a lot of scrapers don't put much effort into rewriting
URLs, so you can lead users back to the real site. Plus, Google will take
links coming from the scraped site to your site as an indication that you're
the original source, and as a bonus you'll get PR for it as well.

------
wslh
i) JavaScript: very few crawlers can read your JavaScript code. Look at the
source HTML of Google. It's pure JavaScript! ii) Accept crawling based on
IPs. iii) Use captchas. iv) Use cookies and measure the crawling speed. v)
Consider that if someone wants to copy your content, they can just look at
the cache of some search engine; they don't need to crawl you.

~~~
gexla
This is basically cloaking (showing your content one way to regular readers
and a different way to search engines), which could get you slapped by
Google. Not worth the risk, IMO. However, maybe there is a way to do this
without cloaking? Not sure; I have never looked into it.

~~~
evgen
Just support Google's AJAX crawling standards and disallow the
escaped_fragment requests from crawlers that are not on your whitelist.

------
bartonfink
Isn't this a problem for a CAPTCHA?

~~~
jackowayed
No.

If the content is only accessible after you fill out a CAPTCHA, Google won't
be able to index your content, and you won't have any visitors for scrapers to
steal from you.

I suppose you could provide the content without a CAPTCHA if the user agent
matches the crawlers of Google, Bing, etc. But it's still a bad idea because a
lot (most?) of your users will bounce on the CAPTCHA, killing your ad revenue.

~~~
damncabbage
... And it's very easy to just impersonate the GoogleBot by using its
User-Agent.

~~~
sursani
Ditto... there is no point trying to prevent certain bots, etc. from
scraping your content. It's a never-ending battle. Just like all the others
said before me, if it's your content, you should be fine.

