
Preventing Site Scraping - jumpbug
http://www.techjunkie.com/preventing-site-scraping/
======
mikeash
How to prevent site scraping:

1) Don't have any data worth scraping.

2) Charge for access.

3) Provide APIs so people don't need to scrape your site.

Trying to essentially DRM your web site so that it's human-readable and not
machine-readable is not only inherently impossible to do effectively (like any
DRM), but is also solving the wrong problem.

~~~
jack-r-abbit
#3 is only an option if you are trying to prevent scraping just to reduce the
bandwidth consumed by rapid, repeated page requests. If you are trying to
prevent someone from just coming in and scooping up all your data, then
providing an API is worse than just allowing the scraper to scrape.

There is no bulletproof way to _stop_ it. So you make it as painful for the
scraper as possible. I like the randomized classes/ids and the extraneous
random invisible table cells and divs.
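
A minimal sketch of what serving that kind of randomized markup might look
like (all names here are made up):

    # Hypothetical server-side helper: per-request random class names plus
    # invisible decoy cells, so scrapers can't rely on stable selectors or
    # cell positions.
    import random
    import string

    def rand_cls():
        return ''.join(random.choices(string.ascii_lowercase, k=8))

    def render_row(values):
        cells = ['<td class="%s">%s</td>' % (rand_cls(), v) for v in values]
        for _ in range(random.randint(1, 3)):   # sprinkle in decoys
            decoy = '<td class="%s" style="display:none">x</td>' % rand_cls()
            cells.insert(random.randrange(len(cells) + 1), decoy)
        return '<tr>' + ''.join(cells) + '</tr>'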

~~~
rhizome
In other words, "you can't put ads on an API."

~~~
jack-r-abbit
That is true. I'm sure there are sites that don't provide an API for that
reason. Ad revenue on the site might be what allows them to provide the data
for free. But that wasn't what I was talking about.

------
vasco
Not one of the methods listed here would deter a decent scraper. Moreover,
you would screw with either your users or your SEO if you made any of these
techniques more aggressive.

If your database has really great content, you won't lose users just because
some kid has a copy of your website online. Stack Overflow has been scraped
to death, and nobody goes to the other sites to check out answers.

~~~
josephcooney
I don't think there is even any need to scrape stackoverflow. They make all
their data available for free.

~~~
pinchyfingers
Right, but other sites scrape that free info and republish it with the goal of
drawing traffic. Apparently, Stack Overflow has an API. I don't know the
functionality of their API, but it doesn't matter: via API or web scraping,
the content of the SO database is all over the web, yet SO is thriving.

~~~
josephcooney
You said "other sites scrape that free". I agree that info is all over the web
but it isn't scraped, and I don't think you should keep calling it scraping.

------
ChuckMcM
Try running a search engine. :-) Needless to say, we constantly get folks who
are trying to create or enhance databases out of our index. We even have an
error page that suggests they contact business development in the unlikely
event they don't "get" the fact that our index is part of our economic value.

One of the humorous things we found is that scrapers can eat error pages very
very quickly. Some of our first scrapers were scripts that looked for a page,
then the next page, then the next page. We set up nginx so that it could
return an error really cheaply and quickly, and once an IP crossed the
threshold, blam! we start sending them the error page. What happened next was
something over 20,000 hits per second from the IP as the page processing loop
became effectively a no-op in their code.
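
A toy version of that threshold trick (the limit and the serve_normally call
are placeholders; the real setup described above lived in nginx):

    # Count hits per IP and, past a threshold, return a pre-built error
    # body without touching the backend at all.
    from collections import Counter

    HIT_LIMIT = 1000                    # hypothetical per-window threshold
    CHEAP_ERROR = b"403 Forbidden\n"    # static bytes: no templating, no DB
    hits = Counter()

    def handle(ip, request):
        hits[ip] += 1
        if hits[ip] > HIT_LIMIT:
            return 403, CHEAP_ERROR     # effectively free to serve
        return serve_normally(request)  # placeholder for the real app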

We thought about sending them SERPs to things like the FBI or Interpol or
something so they would go charge off in that direction, but it's not our way.
We settled on telling our router to dump them in the bit bucket.

------
pygy_
Ajaxification can be defeated if you scrape using a headless browser like
PhantomJS [+]. Actually, all the markup/visual techniques you propose can also
be defeated using Phantom: dump the page as a PNG and OCR it.

Honeypots assume that the scraper is an idiot... and even in that case, if
he's dedicated, he'll come back later and be more careful.

The only potentially effective solutions are those that degrade usability for
everyone, like truncating the content for logged-out users. And even then,
with PhantomJS and some subtlety/patience in order not to trigger flood
detection, an attacker could probably get away with it.

[+] <http://phantomjs.org/>
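
For the curious, the whole pipeline fits in a few lines of Python (Selenium's
PhantomJS driver and pytesseract are real libraries; the URL is a
placeholder):

    # Render the page in a headless browser, dump it to PNG, OCR the PNG.
    from selenium import webdriver
    from PIL import Image
    import pytesseract

    driver = webdriver.PhantomJS()      # executes JS like a normal browser
    driver.get('http://example.com/some-protected-page')
    driver.save_screenshot('page.png')  # the full rendered page as an image
    driver.quit()

    print(pytesseract.image_to_string(Image.open('page.png')))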

------
ceejayoz
> By loading the paginated data through javascript without a page reload, this
> significantly complicates the job for a lot of scrapers out there. Google
> only recently itself started parsing javascript on page. There is little
> disadvantage to reloading the data like this.

Well, unless you're visually impaired and using a screen reader... and it
doesn't really complicate things for any halfway dedicated scraper, as your
AJAX pagination requests probably follow the same predictable pattern as the
non-AJAX ones would've.

~~~
joshu
AJAXification could make it much easier. GET -> json import.
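
i.e. something like this (endpoint and field names invented for
illustration):

    # The "AJAXified" site hands you structured data directly; no HTML
    # parsing needed at all.
    import requests

    for page in range(1, 50):
        r = requests.get('http://example.com/api/items',
                         params={'page': page})
        for item in r.json()['items']:
            print(item['name'], item['price'])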

~~~
jamesaguilar
This is exactly what I used for my project to scrape Path of Exile item data.

~~~
joshu
Ha! I did the same for KoL once upon a time.

------
epoxyhockey
I love the first comment on that post: _It will be more of a pain than usual,
but I will get the data. I always get the data._

\- As mentioned: AJAXifying the data makes it easier to grab.

\- Convert text to images? I'll OCR it. <http://code.google.com/p/tesseract-
ocr/wiki/ReadMe>

\- Honeypot a random link? I don't scrape every link on the page, only links
that have _my_ data.

\- Randomize the output? And drive your real users crazy?

I have found that the best deterrent to drive-by scraping is to not put CSS
ids on everything. Apart from that, you'll need to put the data behind a
paywall.

------
cletus
A lot of the people commenting on these techniques being fallible are missing
the point: the idea isn't to make scraping impossible (despite the misleading
title), it's to make it hard(er).

A determined scraper will defeat these techniques but most scrapers aren't
determined, sufficiently skilled or so inclined to spend the time.

I've been curious about a variation of the honeypot scheme using something
like Varnish. If you catch a scraper with a honeypot, how easy would it be to
give them a version of your site that is cached and doesn't update very often?

~~~
ChuckMcM
C'mon cletus give us a bit of credit. Are you telling us that a company with
the World's Smartest Engineers(tm) doesn't already do exactly this with their
custom front end machines? :-) It's one of the more entertaining new hire
classes.

You are correct that perfection is not achievable, and you don't even want to
get so close that you get very many false positives. But honeypots cost
bandwidth, and for folks who pay for bandwidth as part of their infrastructure
charge, it's a burden they are loath to bear. Better to simply toss the
packets into the ether whence they came than to bother waking up their EC2
instance.

~~~
robryan
They sure love to use GWT with an indecipherable exchange format though; I
have tried to scrape a few things in AdWords before. I'm sure it is possible,
but there was enough of a deterrent for me not to bother.

------
barbazfoo12
4\. Provide a compressed archive of the data the scrapers want and make it
available.

No one should have to scrape in the first place.

It's not 1993 anymore. Sites want Google and others to have their data. Turns
out that allowing scraping produced something everyone agrees is valuable: a
decent search engine. Sites are being designed to be scraped by a search
engine bot. This is silly when you think about it. Just give them the data
already.

There is too much unnecessary scraping going on. We could save a whole lot of
energy by moving more toward a data dump standard.

Plenty of examples to follow. Wikimedia, StackExchange, Public Resource,
Amazon's AWS suggestions for free data sources, etc.

~~~
FuzzyDunlop
One might argue that indexing from a data-dump will lead to search results
that are only as up to date as the last dump.

In StackExchange's case, most of these are now a week or more old.

Maybe it's a good idea, but I'm not sure how many would want to dump their
data on a daily basis to keep Google updated, when Google can quite easily
crawl their sites as and when it needs to.

~~~
barbazfoo12
Have you considered rsync? Dropbox uses it, so lots of people who don't even
know what rsync is are now using it. We could all be using it for much more
than just Dropbox. And if you have ever used gzip on HTML, you know how well
it compresses; the savings are quite substantial. Do you think most browsers
are normally requesting compressed HTML?
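
For reference, a quick way to see how much gzip saves on a given page (the
URL is a placeholder; the exact ratio varies):

    # Fetch a page and compare raw vs. gzip-compressed sizes; HTML
    # typically shrinks by a factor of 3-5.
    import gzip
    import urllib.request

    html = urllib.request.urlopen('http://example.com/').read()
    print(len(html), '->', len(gzip.compress(html)))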

------
Cyndre
Here is my approach for finding scrapers.

They are already supplying fake data to see if they are being scraped.

Using this fake data they can find all the sites that are using their scraped
data. Congrats, we now know who is scraping you, with a simple Google search.

Now comes the fun part. Instead of supplying the same fake data to everyone,
supply unique fake data to every IP address that comes to the site. Keep track
of which IP got which data.

Build your own scrapers specifically for the sites that are stealing your
content and scrape them looking for your unique fake data.

Once you find the unique fake data, tie it back to the IP address stored
earlier and you have your scraper.
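
A minimal sketch of that per-IP watermarking, assuming a secret salt and
invented record fields:

    # Derive a stable, unique marker from each visitor's IP and embed it
    # in a bogus record; store marker -> IP so hits can be traced later.
    import hashlib

    SALT = 'secret-salt'                # hypothetical; keep it private

    def fake_record_for(ip):
        marker = hashlib.sha1((SALT + ip).encode()).hexdigest()[:12]
        return {'name': 'Acme Widgets %s' % marker, 'phone': '555-0100'}

    def record_marker(ip, db):
        db[fake_record_for(ip)['name']] = ip   # lookup table for step two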

This can all be automated at this point, to auto-ban the crawler that keeps
stealing your data. But that wouldn't be fun, and it would be very obvious.
Instead, randomize the data in some way so it's completely useless.

Sit back and enjoy.

------
showerst
In general I think that getting into an arms race with scrapers is not
something that you will win, but if you have a dedicated account for each user
you can at least take some action.

If this data is actually valuable, they should put it behind some sort of
registration. Then they can swap out the planted data for each user to
something that links back to the unique account, without wrecking things for
users with accessibility needs or unusual setups.

------
goodside
I have yet to see any anti-scraping method that can protect against a full
instance of Chrome automated with Sikuli. It's obviously very expensive to
run, since you either need dedicated boxes or VMs, but it always works. In my
experience the most consistent parts of any web application are the text and
widgets that ultimately render on the screen, so you easily make up for the
runtime costs with reduced maintenance. You could in theory make a site that
randomly changes button labels or positions, but to the extent you annoy
scrapers you're also going to annoy your actual users.

------
dustywusty
As pointed out by others, many of the suggestions here break core fundamentals
of the web, and are generally horrible ideas. It's unsurprising to see
suggestions in the comments such as, "add a CAPTCHA", which is nearly as bad
of an idea. If you're willing to write bad code and damage user experience to
prevent people from retrieving publicly accessible data, perhaps you should
rethink your operation a bit.

------
basseq
Generally speaking, if you're in the business of collecting data, but you have
a competitive incentive not to share and disseminate that data as broadly as
possible, you're in the wrong business. This article seems to address a
problem of business model more than anything else. And if you're using
technology to solve a problem in your business model...

------
chrisacky
Let me start by saying that I am a sadochistic scraper (yeah, I just made up
that word), but I will get your database if I want it. The same goes for other
scrapers, who I am sure are even more persistent than I am.

You don't have to read any further, but you should realise that...

* _People will get your data if they want it_ *

The only way you can try to prevent it is to have a whitelist of scrapers and
to blacklist user agents who are hitting you faster than you deem possible.
You should also paywall the information if it is that valuable to you. Or work
on your business model so that you can provide it for free... so that reuse
doesn't affect you.

\---------------------------------

I thought I would provide an account of the three reasons why I scrape data.

There are lots of different types of data that I scrape for and it falls into
a few different categories. I'll keep it all vague so I can explain in as much
detail as possible.

[1] User information (to generate leads for my own services)...

This can be useful for a few reasons. But often it's to find people who might
find my service useful.... So many sites reveal their users information. Don't
do this unless you have good reason to do so. If I'm just looking for contact
information of users, I'll run something like httrack and then parse the
mirrored site for patterns. (I'm that paranoid that check out how I write my
email address in my user profile on this site).

[2] Economically valuable data that I can repurpose....

A lot of the data that I scrape I won't use directly on sites. I'm not going
to cross legal boundaries... and I certainly don't want to be slapped with a
copyright notice (I might scrape content, but I'm not going to willfully break
the law). But, for example, there is a certain _very popular_ website that
collects business information and displays it on their network of websites.
They also display this information in Google Maps as markers. One of my most
successful scrapes of all time was to pretend to be a user and constantly
request different locations from their "private API". It took over a month to
stay under the radar, but I got the data. I got banned regularly, but would
just spin up a new server with a new IP. I'm not going to use this data
anywhere on my sites; it's their database that they have built up. But I can
use this data to make my service better for my users.

[3] Content...

Back in the day... I used to just scrape content. I don't do this any more,
since I'm actually working on what will hopefully be a very successful
startup... but I used to scrape articles/content written by people. I created
my own content management system that would publish entire websites for
specific terms. This used to work fantastically when the search engines
weren't that smart; I would guess it would fail awfully now. But I would quite
easily be able to generate a few hundred uniques per website. (This would be
_considerable_ when multiplied out to lots of websites!)

Anyway, content was useful to me because I would spin it into new content
using a very basic Markov chain. I'd have thousands of websites up and
running, all on different .info domains (bought for 88 cents each), running
advertisements on them. The domains would eventually get banned from Google
and you'd throw the domain away. You'd make enough beyond 88 cents through
affiliate systems, Commission Junction, and the like that this didn't matter,
and you were doing it on such a large scale that it was quite prosperous.
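
For the curious, the spinner really can be that basic; a toy bigram version:

    # Build a word-bigram chain from source text, then walk it to emit
    # "new" text that fools naive duplicate detection.
    import random
    from collections import defaultdict

    def build_chain(text):
        words = text.split()
        chain = defaultdict(list)
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
        return chain

    def spin(chain, length=50):
        word = random.choice(list(chain))
        out = [word]
        for _ in range(length - 1):
            if word not in chain:
                break
            word = random.choice(chain[word])
            out.append(word)
        return ' '.join(out)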

\------------------------------------

I honestly couldn't really offer you any advice on how to prevent scraping.
The best you can do is slow us down.

And the best way to do that is to figure out who is hitting your pages in such
a methodical manner and rate-limit them. If you are smart enough, you might
also try to "hellban" us by serving up totally false data. I really would have
laughed if, the time I scraped 5 million longitudes and latitudes over a
period of a few months, I had noticed at the end of the process that all of
the lats were wrong.

Resistance is futile. You will be assimilated. </geek>

~~~
Silhouette
_I honestly couldn't really offer you any advice on how to prevent scraping.
The best you can do is slow us down.

And the best way to do that is to figure out who is hitting your pages in such
a methodical manner and rate-limit them. If you are smart enough, you might
also try to "hellban" us by serving up totally false data._

Well, no, there are other ways too.

For example, any site behind a paywall probably has your identity, and unless
you live in a faraway place with impotent copyright laws -- and there aren't
that many of them any more -- there are often staggeringly disproportionate
damages for infringement available through the courts these days, certainly
enough to justify retaining legal representation to bring a suit in any major
jurisdiction. Given a server log showing a pattern of systematic downloading
that could only be done by an automated scraper in violation of a site's ToS,
and given a credit card in your name linked to the account and an IP address
linked to your residence where the downloads went, I imagine it's going to be
a fairly short and extremely expensive lawsuit if you upset the wrong site
owner.

~~~
lusr
Not all valuable scrapeable data is copyrightable. I also know of a number of
sites I've scraped that don't even bother attempting to restrict access to
their data through T&Cs, even though it's the basis for their site (not that
they'd have much of a legal basis for enforcing that anyway). Ultimately, if
you're in the business of selling raw data with no value added, the problem is
your business model, not scrapers.

~~~
Silhouette
_Not all valuable scrapeable data is copyrightable._

Sure, but a lot of it is, and even the bits that aren't may be protected by
other laws such as database rights depending on your jurisdiction. I think
anyone who maintains that you can't stop scrapers as a general principle is
possibly a little unwise.

------
389401a
There's nothing worse than spending lots of hard work scraping sites to build
your search engine and then having bad guys scrape your search engine in turn.

Maybe it's some sort of karma. If you scrape, then you will get scraped.

------
_delirium
I don't know what kind of site this is, so it's hard to say if it applies, but
do note that several of these can significantly harm usability for legitimate
users as well. For example, someone might be copy/pasting a segment of data
into Excel to do some analysis for a paper, fully intending to credit you as
the source; if you insert fake cells, or render the data to an image, you make
their life a lot more difficult.

The first suggestion (AJAX-ifying pagination) can be done without a major
usability hit if you give the user permalinks with hash fragments, though, so
example.com/foo/2 becomes example.com/foo#2.

------
soulclap
I am currently working on a project that involves some scraping as well. The
most annoying things I came across so far are:

\- Totally broken markup (I fixed this by either using Tidy first or just
using a Regex instead of a 'smart' HTML/XML parser)

\- Sites that need Javascript even on 'deep links' (I fixed this by using
PhantomJS and saving the HTML instead of just using curl)

\- Inconsistency, by far the most annoying: different classes, different
formatting, different elements for things that should more or less be
identical (basically fixing this whenever I come across a problem but
sometimes it's just too much hassle and well, ask yourself if you really need
to get every single 'item' from your target)

One more thing: RSS is your friend. And often you can find a suitable RSS link
(that's not linked anywhere on the site) by just trying some URLs.

PS: No, I am not doing anything evil. If this project ever goes live/public,
I'll hit all the targeted sites up and ask for permission. Not causing any
significant traffic either.

------
kysol
The best techniques to stop scraping, are ones not discussed in public.

------
garethsprice
Anything that can be displayed on a screen can be scraped.

An approach I used to prevent scraping in the past is to start rate limiting
anything that hits over N pageviews in an hour, where N is a value around what
a high-use user could manually consume. Start with a small delay and increment
it with each pageview (excess hits * 100ms), then send HTTP 509 (Bandwidth
Limit Exceeded) for anything that is clearly hammering the server (or start
returning junk data if you're feeling vengeful).
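
A rough sketch of that scheme as Flask middleware (per-IP counters in memory;
a real deployment would reset them hourly and share them via something like
Redis):

    import time
    from collections import Counter
    from flask import Flask, Response, request

    app = Flask(__name__)
    LIMIT = 500          # ~ what a heavy human user might view per hour
    hits = Counter()     # assumed to be cleared by an hourly job

    @app.before_request
    def throttle():
        ip = request.remote_addr
        hits[ip] += 1
        excess = hits[ip] - LIMIT
        if excess <= 0:
            return                          # normal traffic: no delay
        if excess > LIMIT:                  # clearly hammering the server
            return Response(status=509)     # Bandwidth Limit Exceeded
        time.sleep(excess * 0.1)            # excess hits * 100ms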

Added bonus is that the crawler will appear to function correctly during
testing until they try to do a full production run and run into the
(previously undetectable) rate limiting.

This project did not require search indexing so we didn't care about legit
searchbots, but you could exclude known Google/Bing crawlers and log IPs of
anything that hits the limit for manual whitelisting (or blacklisting of
repeat offenders).

~~~
eps
Not a 509, but a Deny rule in the firewall config. Works miracles.

------
x3sphere
More trouble than it's worth. Plus, none of these solutions actually prevent
site scraping... if the person is dedicated enough, they'll find a way. The
time spent on implementing any of these approaches would be much better spent
on site optimization, features, etc.

------
rehack
Sorry to say, these techniques will only deter _Enterprisey folks_, who won't
have to scrape anyway.

For example, you say "randomize template output". Scrapers use a mixture of
techniques. Say the HTML-path approach does not work (despite supporting
wildcards in the form of body/table[1]/tr[*]/...); then you fall back to just
matching some pattern, which could be the title of your data or anything else.

I have scraped content coming in Flash as well. Basically, how can you stop
anybody from understanding either a) the data exchanged between the browser
and the server, or (in case it's encrypted/encoded) b) the HTML once it's
rendered?

The only way of doing it is to have your own custom browser and prevent its
source code from getting leaked.

PS: I scrape, and our clients (whom we scrape) know that.

------
DigitalSea
I've written a lot of scrapers myself and let me start off by saying there is
no such thing as a site that can't be scraped. You can add in honey-pots all
day long and at the end of the day once I've discovered my bot has been
detected, I'll find a way around it. If the content is worth scraping and the
site owner doesn't have an API (free or pay to use), then people will find a
way to get the data regardless of what you do.

A well-intentioned article that puts forth a few great ideas for amateurs, but
at the end of the day it's wasted time and effort that could have been better
spent, oh I don't know, developing an API for your users instead.

If I want data I'll get the data, I always win.

------
jwdunne
A less aggressive approach I've encountered is to insert links to other pages
of your website with full URLs (<http://www.example.com/page.html> over just
/page.html). Usually, a scraper will copy the links too. This should then make
it obvious that the content's been scraped.

This could become a nightmare to maintain if you don't automate it. It'd be
trivial to automate on a CMS; I know WordPress has loads of plugins for
exactly this. I don't think I've come across something that can do this for
static websites though, which make up the brunt of the websites I maintain.
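
For static sites, a small post-processing pass over the generated HTML would
do it; a sketch with BeautifulSoup (the domain is a placeholder):

    # Rewrite every relative href in a page to an absolute URL.
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    BASE = 'http://www.example.com/'

    def absolutize(html, page_path):
        soup = BeautifulSoup(html, 'html.parser')
        page_url = urljoin(BASE, page_path)
        for a in soup.find_all('a', href=True):
            a['href'] = urljoin(page_url, a['href'])
        return str(soup)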

~~~
epoxyhockey
wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

------
netvarun
IMHO, preventing site scraping is really, really hard. There are a couple of
startups that offer products that claim to stop site scraping.[1]

1\. Ajaxified paginated data ->

By looking at XHR requests with Firebug, one can easily reverse engineer the
AJAX call and extract data directly from the JSON; in fact, you are making the
scraper's life easier, as you will probably be storing the information in a
structured manner in the JSON string. Or one could use a more sophisticated
tool like mitmproxy to study how requests are made.

If you somehow managed to implement a highly obfuscated method of AJAX
requests (re: Microsoft ASP.NET), there is always Selenium to get through
them.

2\. Randomize template output ->

You are going to annoy users if you display a different template altogether.
If it's just randomizing div and class ids, one can write clever XPath
expressions or CSS selectors to circumvent it. Or, worst case, there are
always the ever-reliable regexes.

3\. HoneyPot ->

Scrapers only crawl pages that they are specifically looking for. A good
scraper only runs through pages he wants to scrape. Nothing more. This is
probably the least effective strategy.

4\. Write data to images on the fly ->

Use an OCR api to decode them!

5\. Alternatives ->

Putting in a login screen is also not effective: not only will it annoy users,
it can easily be circumvented by using Selenium or passing the cookie/session
information to the scraping script.

Blacklisting IPs is not going to be a very effective strategy either. With
tons of free proxies, Tor, and cloud-based services (especially PiCloud[2],
which offers a scraping-optimized instance!), IP blocking can easily be
circumvented.

The best strategy would be to display corrupted content or start throwing
CAPTCHAs if you sense a large number of requests coming from a particular IP.

But once again, you may want to do some machine learning on server logs, based
on the various IPs and the specific URLs visited, and build a model that can
predict whether a particular user is a bot or a human before you start
throwing fake data or CAPTCHAs, just to be on the safe side so that you don't
annoy anyone.
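
A sketch of what that model could look like (the features and the labeled
sample are entirely made up; a real version would be trained on your own
logs):

    # Featurize each IP's traffic and fit a simple classifier.
    from sklearn.linear_model import LogisticRegression

    # one row per IP: [requests/min, fraction of distinct URLs, 4xx ratio]
    X = [[0.5, 0.9, 0.01],    # human-ish
         [40.0, 0.2, 0.30],   # bot-ish
         [1.2, 0.8, 0.02],
         [55.0, 0.1, 0.25]]
    y = [0, 1, 0, 1]          # 1 = bot, labeled from past incidents

    model = LogisticRegression().fit(X, y)
    print(model.predict([[30.0, 0.15, 0.20]]))   # likely [1]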

[1] <http://spider.io> and <http://blockscraping>

[2] <http://www.picloud.com/pricing/> (s1 instance plan)

------
TazeTSchnitzel
Some ideas:

\- Use Unicode RTL override, then put in numbers/text backwards (this is also
fun for messing with web forums, but that's another story...)

\- Inject zero-width characters such as ZWSP U+200B, WJ U+2060, ZWNJ U+200C,
and ZWJ U+200D (see the sketch after this list)

\- Use intentionally broken HTML in places, to catch out some parsers
(browsers won't care, some scraping parsers will)

\- Occasionally don't maintain element order in HTML, use CSS tricks to make
it appear in the correct places

\- Add useless dummy data hidden in places using complex CSS rules

\- Require JS-generated auth token using an obscure, obfuscated algorithm to
download the data, which expires immediately
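
A tiny sketch of the zero-width injection idea from that list:

    # Scatter zero-width spaces through the text: it renders identically
    # but breaks naive substring matching on the scraped copy.
    import random

    ZWSP = '\u200b'

    def poison(text, rate=0.2):
        return ''.join(c + (ZWSP if random.random() < rate else '')
                       for c in text)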

------
simonster
Google Scholar's solution is to show a captcha when a given IP has made too
many requests in a given time period, although a scraper can easily throttle
to avoid this.

You could require a captcha every n page views, or you could render the text
of the page as a distorted image, which would defeat the OCR approach others
have suggested here. These would make scraping difficult, but they would also
mean throwing UX out the window, and they could still be defeated with
Mechanical Turk.

------
wensheng
Great! I learned a few scraping tricks from this article. In the past, all I
did was use "time.sleep(3)" to pace my scraping so it stayed off the scrapee's
radar screen.

------
akrymski
A friend of mine has been working on anti-scraping for a while. The question
is: is there a market for an anti-scraping service? Would you pay for such a
service, and how much?

Some examples out there: <http://blog.cloudflare.com/introducing-
scrapeshield-discover-defend-dete#!/>

<http://www.blockscript.com/>

------
ericcholis
I've run into a few sites that "prevent" scraping. I just jump on
<http://developer.yahoo.com/yql/> and scrape away. The surface level defenses
(ip blocking) are usually sufficient, but if there is a real developer behind
the scraper, they will get to your content.

Too much prevention could be counter-productive, as you may inadvertently deny
the friendly spiders.

------
ajitk
In addition to the suggestions mentioned, I will add some of my own. Make the
different layers difficult to understand and use (which will affect real users
too!):

1\. Script delivery. Use dynamically loaded modules.

2\. Content delivery. Use websockets to deliver data. Does any developer tool
show content going through websockets?

3\. Content rendering. Use canvas to render content on the screen. Use fonts
that make OCR difficult. Handwriting? :)

4\. Use of browser plug-ins like Flash and Java.

5\. Content delivered over video.

6\. Quota-based content delivery for requests originating from the same
source, using multiple signals like cookies and IP addresses to pinpoint the
source. Progressive

However, I would not recommend doing most of these. As mikeash aptly noted in
comparing it to DRM, the usability of the website will be negatively affected.

As with software, we can make it hard to reverse engineer but cannot prevent a
determined person from doing so. There is no way to stop a determined scraper.
They will scrape your data using the appropriate tools: real browsers, OCR, or
real people (MTurk?).

Edit: formatting.

------
paulsutter
Great ideas. Love the honeypot.

One more thought: once you discover a bad guy, if you signal this immediately
you are just speeding up his learning cycle. Don't just cut him off; tarpit
him (gradually slow down responses) or return bad data.

------
irfan
I agree with the argument * _People will get your data if they want it_ *.
Reading the discussion here and on the original post, I could make a scraper
that would be difficult to detect and block ;-)

------
tlevine
You could just print the data in a book instead of putting it on a website.

Oh whoops, never mind (<http://www.diybookscanner.org/>).

------
weixiyen
Honeypots are pretty worthless. The article counters itself.

------
blrblr
I can write a scraper for any text-based website within an hour. As far as I
know, site scraping can't be prevented. You can make it harder, though.

------
level09
The only way to prevent scraping is to shut down access to your website. With
modern libraries like PhantomJS and Selenium, anyone can write a scraper that
executes JavaScript and reads a website pretty much like any human user.

