
Ask HN: What do people use to prevent crawlers? - jongi_ct
What do people use to prevent robots and crawlers from accessing their websites?
======
wrath
I've built crawlers that retrieve billions of web pages every month. We had a
whole team working on modifying the crawlers to keep up with website changes, to
reverse engineer AJAX requests, and to solve complex problems like CAPTCHAs.
Bottom line: if someone wants to crawl your website, they will.

What you can do, however, is make it hard enough that the vast majority of
developers can't do it (e.g., my tech crawled billions of pages, but there was a
whole team dedicated to keeping it going). If you have money to spend, Distil
Networks and Incapsula have good solutions. They block PhantomJS and browsers
driven by Selenium, and they rate limit the bots.

What I found really effective, and which some websites do, is tarpitting bots.
That is, slowly increase the number of seconds it takes to return the HTTP
response, so that after a certain number of requests to your site it takes 30+
seconds for the bot to get the HTML back. The downside is that your web servers
need to accept many more concurrent connections, but the benefit is that you'll
throttle the bots to an acceptable level.
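For illustration, here's a minimal sketch of the tarpit idea as Flask middleware; the thresholds, delay curve, and in-memory counter are made-up assumptions, not the commenter's actual algorithm:

```python
# Minimal tarpit sketch (Flask). Thresholds and the delay curve are
# illustrative only; a real deployment would use a shared store like Redis.
import time
from collections import defaultdict

from flask import Flask, request

app = Flask(__name__)
hits = defaultdict(int)  # naive in-memory per-IP request counter

@app.before_request
def tarpit():
    ip = request.remote_addr
    hits[ip] += 1
    if hits[ip] > 100:                           # rough "human" budget
        delay = min((hits[ip] - 100) * 0.5, 30)  # grow slowly, cap at 30s
        time.sleep(delay)                        # holds a worker open, hence
                                                 # the extra-connections caveat

@app.route("/")
def index():
    return "hello"
```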

I currently run a website that gets crawled a lot, deadheat.ca. I've written a
simple algorithm that tarpits bots. I also throw up a captcha every now and then
when I see an IP address hitting too often over a span of a few minutes. The
website is not super popular and, in my case, it's pretty simple to
differentiate between a human and a bot.

Hope this helps...

~~~
ne0free
How do you bypass Google reCAPTCHA?

~~~
laughfactory
I do it using rotating proxies, stripping cookies between requests, randomly
varying the delay between requests, randomly selecting a valid user-agent
string, etc. It's a pain in the butt. And to scrape more than I do, faster
than I do, would be pretty freaking expensive in terms of time and money.
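A rough sketch of that setup with Python's requests library; the proxy pool and user-agent strings below are placeholders, not working values:

```python
# Rough sketch: rotating proxies, fresh cookies per request, random delays,
# and a randomly chosen user-agent string. Proxy URLs and UAs are placeholders.
import random
import time

import requests

PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]   # placeholder pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",      # truncated examples
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ...",
]

def fetch(url):
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # A new Session per request means no cookies carry over between requests.
    with requests.Session() as s:
        resp = s.get(url, headers=headers,
                     proxies={"http": proxy, "https": proxy}, timeout=30)
    time.sleep(random.uniform(2, 10))  # randomly vary the delay
    return resp.text
```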

Note that Google is pretty aggressive about captcha-ing "suspicious" activity
and/or throttling responses to suspicious requests. You can easily trigger a
captcha with your own manual searching. Just search for something, go to page
10, and repeat maybe 5-20 times and you'll see a captcha challenge.

If Google gets more serious about blocking me, then I'll use ML to overcome
their ML (which should be doable because they're always worried about keeping
Search consumer-friendly).

~~~
futhey
If you do go the ML route, I recommend TensorFlow + Google Cloud (both for the
cost/performance and the irony).

------
Rjevski
Don't prevent them. The same data you let humans access for free should be
accessible to bots. If you only want to give out a "reasonable" amount of
data, which humans wouldn't usually exceed but bots would, then define a rate
limit that wouldn't inconvenience humans and apply it to everyone, bot or not.
That way you're discriminating based on the amount of data consumed rather than
on whether the client is a bot. It will thwart people who simply pay humans to
scrape the data (which would happen if you magically found a way to block
bots), while not inconveniencing people who use a bot to make their job easier
and scrape a reasonable amount of data.
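A minimal sketch of such a uniform limit: a fixed-window counter keyed by client IP, with the same rule for everyone. The window and limit here are arbitrary example numbers:

```python
# Sketch of a uniform per-client rate limit (fixed window), applied to every
# visitor rather than trying to tell bots from humans. Numbers are arbitrary.
import time
from collections import defaultdict

WINDOW = 60          # seconds
LIMIT = 120          # requests per window, generous for a human
buckets = defaultdict(lambda: [0.0, 0])   # ip -> [window_start, count]

def allowed(ip: str) -> bool:
    now = time.time()
    start, count = buckets[ip]
    if now - start > WINDOW:
        buckets[ip] = [now, 1]
        return True
    if count < LIMIT:
        buckets[ip][1] += 1
        return True
    return False     # respond with HTTP 429, bot or human alike
```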

~~~
alecco
They kill your bandwidth. For a client's catalog site we discovered that
crawlers were responsible for more than half of the bandwidth costs.

~~~
Rjevski
Don't humans kill your bandwidth too? In fact a properly written bot would use
less bandwidth as it doesn't care about CSS or images.

My solution would solve the issue in a fair way by having a "reasonable usage"
limit applied to everyone, bot or not. This also means it can't be defeated by
someone paying humans to do the dirty work to bypass bot restrictions.

------
monodeldiablo
"How do I stop all these dinner guests from eating this lovely pie I set out
on the table?"

I remember working hard on a project for a year, then releasing the data and
visualizations online. I was very proud. It was very cool. Almost immediately,
we saw grad students and research assistants across the globe scraping our
site. I started brainstorming clever ways to fend off the scrapers with a
colleague when my boss interrupted.

Him: "WTF are you doing?"

Me: "We're trying to figure out how to prevent people from scraping our data."

Him: "WTF do you want to do that for?"

Me: "Uh... to prevent them from stealing our data."

Him: "But we put it on the public Web..."

Me: "Yeah, but that data took thousands of compute hours to grind out. They're
getting a valuable product for free!"

Him: "So then pull it from Web."

Me: "But then we won't get any sales from people who see that we published
this new and exciting-- Oh. I see what you mean."

Him: "Yeah, just get a list of the top 20 IP addresses, figure out who's
scraping, and hand it off to our sales guys. Scraping ain't free, and our
prices aren't high. This is a sales tool, and it's working. Now get back to
building shit to make our customers lives easier, not shittier."

Sure enough, most of the scrapers chose to pay rather than babysit web
crawlers once we pointed out that our price was lower than their time cost. If
your data is valuable enough to scrape, it's valuable enough to sell.

The only technological way to prevent someone from crawling your website is to
not put it on a publicly facing property in the first place. If you're concerned
about DoS or bandwidth charges, throttle all users. Otherwise, any attempt to
restrict bots is just pissing into the wind, IMHO.

Spend your energies on generating real value. Don't engage in an arms race
you're destined to lose.

~~~
huffmsa
Sidebar: if only Twitter would realize this and turn the firehose back on and
charge $x per month for API access.

They'd be profitable in a month.

~~~
danpalmer
Twitter do charge for the firehose, and I hear it’s a reasonable amount of
revenue. A person can’t go and buy it on a credit card, but there are
authorised sellers, enterprise sales people, etc.

Bear in mind that it’s a large technical feat to be able to ingest the
firehose effectively, so it’s not really suitable for consumers.

~~~
huffmsa
I mean sell it piecemeal. Add a butt ton of endpoints. Make it easy to use,
easy to integrate. Sell it on credit card.

~~~
grogenaut
What do you mean by piecemeal? Like a random subset? The parent already answered
why they don't sell the whole thing.

------
bmetz
My favorite thing was to identify bots and, instead of blocking them, switch to
a slightly scrambled data set to make the scrape useless while still looking
good to the developer who stole it. It was a ton of fun as a side project. I'd
also suggest you add some innocent fake data to your real site and then set up
Google Alerts for all of the above to catch traffic. About 50% of sites would
respond positively to an email when you showed them they were hosting fake
data. About 90% would take my data down if that was followed up with a
stronger C&D. One key is to catch them fast, while they're still a little
nervous about showing off their stolen data online.

~~~
Moru
This is what we used to do. Then we'd send a large zip file with screenshots and
other data to the lawyers to handle the contact. Shortly after, the scraping
usually stopped. Contacting them to sell access wasn't an option, because it was
competitors taking the data.

~~~
grogenaut
I did some scraping for a lawyer back in around '01, from other lawyers' sites.
He got a C&D and told me to turn it off (we were done anyway).

Funny part was that the lawyer on the other side wanted us to return all of the
content on disk. Not show what we had copied, but literally return it. My
lawyer laughed about it. The other lawyer was smart/savvy enough to be
effectively using the internet in '01 but didn't really understand the tech.

Other funny part is that if he had generalized his site beyond law, he would
have had a major business these days.

~~~
bdcravens
> Funny part was the lawyer on the other side wanted us to return all of the
> content on disk.

I actually had a client that asked me to record a screen capture of me deleting
their files from my computer (and this was actually a developer making the
request).

------
ThePhysicist
I assume you are concerned about crawlers that do not respect the robots.txt
file (which is the polite way to restrict them from indexing your site, but
does not provide any actual protection if crawlers choose to ignore the file).
Cloudflare has a tool for doing this (now part of their core service):

[https://blog.cloudflare.com/introducing-scrapeshield-discover-defend-dete/](https://blog.cloudflare.com/introducing-scrapeshield-discover-defend-dete/)

There's a nice Github repo with some advice on blocking scrapers:

[https://github.com/JonasCz/How-To-Prevent-Scraping](https://github.com/JonasCz/How-To-Prevent-Scraping)

Finally, you could use a plugin in your web server to display a CAPTCHA to
visitors from IP addresses that send a lot of requests to your site.

There are many more strategies available (up to creating fake websites or
content to lead crawlers astray), but the CAPTCHA solution is the most robust
one. It will not protect you against crawlers that use a large pool of source
IPs to access your site, though.

~~~
bdcravens
> the CAPTCHA solution is the most robust one

The going rate for CAPTCHA solving is about a tenth of a US cent.

------
bad_user
The other day I made a Chrome extension for scraping a protected website.
It worked wonderfully, as it simulated a normal user session, bypassing the
JavaScript protections the website has. You can also run such scripts with a
headless browser for full automation, PhantomJS being an obvious choice.

You really can't protect against this unless you start making the experience
of regular visitors much worse.

~~~
inglor
Sure you can protect against this - there are several companies that use
machine learning to spot small differences between Selenium and real users
(mouse delays, etc.).

For example, it might detect that a mouse click is dispatched at exact
intervals (and block it). To which you'd think "I'll just add Math.random() *
2000", which it'll easily detect as well.

It's _definitely_ doable, but it's not as trivial as recording a Selenium
macro. (Not to mention these tools look for Selenium presence and extensions
anyway.)

~~~
bad_user
The issue here isn't that it can't be done; the problem is one of false
positives.

By doing this you must accept that a certain percentage of legitimate users,
which can be quite significant, will be blocked. In case you're wondering,
yes, this does happen with solutions such as Cloudflare.

And at that point, either your website isn't popular, in which case you can't
afford to lose users, or it is very popular, in which case you'll piss off
enough users to receive bad reviews.

Basically you can't afford to do this unless you're Facebook or Google, and
then you have to wonder why Facebook and Google do not deploy such protections.

So going back to my main point: of course it's possible, but the experience
for users gets significantly worse, such that you won't want to do it.

~~~
inglor
I've found Cloudflare's solution to be very inadequate compared to Distil or
Incapsula.

Both Distil and Incapsula _did_ have a small number of false positives
(showing a captcha to users). We did have to write some code to overcome that.

------
mattm
One thing I've thought about, but never had the chance to put into practice,
would be to randomize CSS classes and IDs. Most web scraping relies on these
to identify the content it is looking for.

Imagine if they changed every day. It would make things a lot more difficult.

There would be disadvantages for actual users with this method (caching
wouldn't work very well, for example), but maybe this alternative version of
the site could be displayed only to bots.

The crawler could get smart about it and use only XPaths like "the 6th div on
the page", so in the daily update you could also throw in some random, useless,
empty divs and spans in various locations.

It's a lot of work to set up, but I think you would make scraping almost
impossible.
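A small sketch of the daily-rotation idea in Python; the secret, hash scheme, and naming convention are made up for illustration:

```python
# Sketch: derive CSS class names from a secret plus the current date so they
# change every day. The secret and naming scheme here are illustrative only.
import datetime
import hashlib

SECRET = "change-me"

def daily_class(logical_name: str) -> str:
    day = datetime.date.today().isoformat()
    digest = hashlib.sha256(f"{SECRET}:{day}:{logical_name}".encode()).hexdigest()
    return "c" + digest[:8]   # short, date-dependent token

# Templates would call daily_class("product-title") instead of using a fixed
# class, and the matching CSS would be generated with the same function.
```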

~~~
bdcravens
Depends on your content. If the content is dependable but the DOM isn't, you
can get pretty far with something like XPath's contains(). Calling .text on an
element in many parsers will happily return all the child content. Worst case
is you call .text on <body>.
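A quick sketch of both tricks with lxml; the HTML fragment is made up:

```python
# Sketch with lxml: match on the text itself rather than brittle class names,
# and fall back to pulling all text from <body>. The HTML here is made up.
from lxml import html

doc = html.fromstring("""
<body>
  <div class="x9f2"><span>Price:</span> <b>$19.99</b></div>
</body>
""")

# contains() keys on content, so randomized class names don't matter.
price_node = doc.xpath('//div[contains(., "Price:")]')[0]
print(price_node.text_content())          # all child text of that div

# Worst case: grab everything under <body> and post-process.
print(doc.xpath("//body")[0].text_content())
```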

------
thinbeige
You can't prevent good crawlers. Captchas might help, and so does what Amazon
does: erratic, unpredictable changes to the HTML structure.

------
huffmsa
If you're getting a lot of crawler traffic, your site probably has information
a lot of people find useful, so you should consider finding a way to monetize
it.

Otherwise, your best bet (the hardest to get around, in my experience) is
monitoring for actual user I/O. For example, if someone starts typing in an
input field, a real human has to have clicked on it beforehand, and most bots
won't.

Or if a user clicks next-page without the selector being visible, or without
scrolling the page at all - that's not natural behavior.

Think like a human.

~~~
FroshKiller
You will create accessibility issues for users if you do this. The bias you'd
encode in this idea of "human" behavior doesn't consider assistive software at
all.

I don't click text inputs when my form-filling plugin enters my personal
information on a payment screen. And even if I did, you wouldn't know it if I
had JavaScript disabled.

------
calafrax
There are a variety of methods that can be deployed:

1) Request fingerprinting: browser request headers have particular patterns
that depend on the user agent. Matching user-agent strings against a database
of request-header fingerprints lets you filter out anyone who is not using a
real browser and hasn't taken the time to correctly spoof the headers. This
will filter out low-skill, low-effort scrapers and raise costs for the rest.

2) Put JavaScript in the page that tracks mouse movement and pings back. This
forces scrapers to simulate mouse movement in a JS execution environment or
reverse engineer your ping-back script. This is a very high hurdle, and it
forces much more computationally intensive scraping as well as a much more
sophisticated engineering effort.

3) Do access-pattern detection. Require valid Referer headers, don't allow API
access without page access, check that the page's assets are loaded, etc.

4) Use the MaxMind database and treat as suspicious any access not from a
consumer ISP. Block access from AWS, GCP, Azure, and other cloud services
offering cheap IP rental.
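A sketch of point 4 using the geoip2 library against a GeoLite2 ASN database; the database path and the list of hosting keywords are assumptions:

```python
# Sketch of point 4: look up the requesting IP's ASN organization in a MaxMind
# GeoLite2 ASN database and treat datacenter ranges as suspicious. The keyword
# list below is illustrative, not exhaustive.
import geoip2.database
import geoip2.errors

HOSTING_KEYWORDS = ("amazon", "google", "microsoft", "digitalocean", "ovh")

reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")   # path is an assumption

def looks_like_datacenter(ip: str) -> bool:
    try:
        org = (reader.asn(ip).autonomous_system_organization or "").lower()
    except geoip2.errors.AddressNotFoundError:
        return False
    return any(keyword in org for keyword in HOSTING_KEYWORDS)
```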

------
okket
A password prompt/captcha? If you do not want to get crawled, do not make it
public?

------
inglor
We've used Incapsula (cheap and works, but awful support and service) and
Distil (works and has great support, but steep pricing).

Both worked, and both held up well against plain HTTP downloads and Selenium
(and the common techniques). Neither worked against someone dedicated enough -
there are the usual tricks for bypassing them (which we used, to test our own
stuff).

We also developed something in-house, but that never helps.

~~~
inglor
Also - we've served fake data and honeypots (invisible to a real user) a lot
of times in order to detect which sites "reuse" our data, and sue them :)

------
twobyfour
For well-behaved bots, robots.txt.

For ill-behaved ones, it depends on why you're trying to block them. Rate
throttling, IP blocking, requiring login, or just gating all access to the
site with HTTP Basic Auth can all work.
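For reference, a minimal robots.txt (served at the site root) that asks all well-behaved bots to stay out; anything stronger requires one of the other measures:

```
# Only polite crawlers will honor this.
User-agent: *
Disallow: /
```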

------
laurieg
Domain specific, but if you detect a bot you can start giving it false
information.

For example, on a dictionary site: if someone crawls your site after
triggering your "this is a bot" check, serve bad data in every 20th request.
Misspell a word, mislabel a noun as a verb, give an incorrect definition.

If you combine this with throttling, the value of scraping your site is
greatly reduced. Also, most people won't build a super advanced crawler
if they never get a "Permission denied, please stop crawling" message.
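A sketch of that poisoning rule in Python; how a client gets flagged, and the specific corruption (swapping two adjacent characters), are placeholders:

```python
# Sketch: once a client is flagged as a bot, corrupt every 20th response
# slightly instead of blocking it. The flagging and the corruption rule are
# placeholders for illustration.
import random
from collections import defaultdict

flagged_hits = defaultdict(int)

def poison(definition: str) -> str:
    # Swap two adjacent characters somewhere in the text: subtle, but enough
    # to make a scraped dictionary unreliable.
    if len(definition) < 2:
        return definition
    i = random.randrange(len(definition) - 1)
    chars = list(definition)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def serve_definition(ip: str, is_flagged_bot: bool, definition: str) -> str:
    if is_flagged_bot:
        flagged_hits[ip] += 1
        if flagged_hits[ip] % 20 == 0:
            return poison(definition)
    return definition
```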

~~~
laughfactory
This is very true. My scraping efforts have become vastly more sophisticated
after running into explicit attempts to block me. Now I've got all kinds of
bells and whistles, and validate the data returned.

------
timbowhite
I wrote a plugin for node.js/express that performs basic bot detection and
bans bots by IP address until they pay you some Bitcoin:

project:
[https://github.com/timbowhite/botbouncer](https://github.com/timbowhite/botbouncer)

simple demo: [http://botbouncer.xyz/](http://botbouncer.xyz/)

I ran it for a while on some medium-traffic websites that were being heavily
scraped. It blocked thousands of IP addresses, but IIRC it only received one
Bitcoin payment.

------
danielbeeke
The stack I am using
([https://github.com/omega8cc/boa/](https://github.com/omega8cc/boa/)) uses CSF:
[https://www.configserver.com/cp/csf.html](https://www.configserver.com/cp/csf.html)

This is for Drupal sites. It has a strong firewall (CSF) and a lot of crawler
detection in the nginx configuration. It checks the load, and under high load
it blocks the crawlers.

------
hollander
If it's about content, SVG and convert all text to curves. /s

~~~
alexcnwy
Can just screencap the rendered page and use OCR.

If someone wants to scrape your website badly enough, they'll find a way.

~~~
pbhjpbhj
Use a really bad font with terrible kerning and very similar 1lI and oO0,
etc. /s

~~~
majewsky
Spellcheckers and error-correcting OCR are a thing. In fact, spellcheckers are
so well understood that they're a frequent assignment for CS undergraduates.

------
danpalmer
I encourage developers thinking of doing this to first check that they aren't
required to keep their website scraper-friendly.

The company I work for does a large amount of scraping of partner websites,
with whom we have contracts that allow us to do it and that someone in their
company signed off on, but we still get blocked and throttled by tech teams who
think they are helping by blocking bots. If we can't scrape a site, we just
turn off the partner, and that means lost business for them.

------
peterkelly
[https://en.wikipedia.org/wiki/Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)

~~~
AmrMostafa
robots.txt is a "sign" that you leave at the door for well-behaving robots,
but it doesn't actually make any practical difference when a robot isn't
implemented to honor it

------
digitalzombie
You can use Cloudflare, but it's a small roadblock. I can still crawl through
that.

You can also do front-end rendering; it's a slightly bigger roadblock, but you
can use PhantomJS or something similar to crawl that.

IIRC there is a PHP framework that mutates your front-end code, but I'm not
sure if it changes it enough to stop a generalized XPath...

Also, I used to work for a company that employed people full time for crawling.
Their system would even notify the team if a crawler stopped working, so they
could update the crawler...

~~~
viraptor
Why do you think frontend rendering is harder? Every time I see it, I'm happy
because there's a nice API it relies on - I can grab clean, structured data
from it rather than trying to extract it from the text.

------
jeremyliew
I don't understand why you'd want to stop crawlers. If you didn't want people
to see your content, it probably shouldn't be on the public web.

------
lazyjones
\- permanently block Tor exit nodes and relays (some relays are hidden exit
nodes)

\- permanently block known anonymizer service IP addresses

\- permanently block known server IP address ranges, such as AWS

\- temporarily (short intervals, 5-15 mins) block IP addresses with typical
scraping access patterns (more than 1-2 hits/sec over 30+ secs)

\- add captchas

All of these will cost you a small fraction of legitimate users and are only
worth it if scraping puts a strain on your server or kills your business
model...
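As a sketch of the first item: the Tor Project publishes a bulk exit list that can be loaded into a blocklist. The URL below is the commonly used endpoint, but verify it (and refresh the set regularly) before relying on it:

```python
# Sketch: fetch the Tor Project's bulk exit list and check incoming IPs
# against it. Refresh the set periodically; blocking mechanics are left out.
import requests

TOR_EXIT_LIST = "https://check.torproject.org/torbulkexitlist"  # verify first

def load_tor_exits() -> set[str]:
    resp = requests.get(TOR_EXIT_LIST, timeout=30)
    resp.raise_for_status()
    return {line.strip() for line in resp.text.splitlines() if line.strip()}

tor_exits = load_tor_exits()

def is_tor_exit(ip: str) -> bool:
    return ip in tor_exits
```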

------
ge0
Can you give a little more context to the question? There are various ways but
they would be dependent on the reasons in the first place.

------
ghola2k5
The best answer is probably an API.

------
crispytx
I think websites, in general, tend to get a lot of bot traffic. My website
doesn't have anything valuable to scrape, but I still get 100 hits from bot
traffic every day.

------
Pica_soO
Add keywords that are likely to get the crawling company involved in a lawsuit
(like the names of people who sued Google to be removed from search results).

------
debacle
For most crawlers, robots.txt will work. For people actually trying, nothing
but an IP block and vigilance will help.

------
wcummings
Scraping is my birthright, you'll never stop me.

------
whatnotests
One cannot stop a determined thief.

One technique that bothers me quite a bit is constant random changes in class
names or DOM structure, which can make it more difficult. Not impossible but
more difficult.

------
Kenji
robots.txt

then Zip bombs.

~~~
d33
Why was this downvoted?

------
owebmaster
They used to use Flash, for better or worse.

------
z92
I run a cron job every 5 minutes that parses the httpd access log. If there's
an IP with an abnormally large number of requests, it blocks it.

Most crawlers will make hundreds of requests in five minutes, while legitimate
visitors will usually be below 100.
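A sketch of that cron job in Python; the log path, the threshold, and the use of iptables are assumptions, and a real version would only count entries from the last five minutes:

```python
# Sketch of a 5-minute cron job: count requests per IP in the access log and
# block anything over a threshold. Log path, threshold, and iptables usage are
# assumptions; a real version would filter to the last five minutes of entries.
import subprocess
from collections import Counter

LOG = "/var/log/httpd/access_log"
THRESHOLD = 300          # requests per window

counts = Counter()
with open(LOG) as f:
    for line in f:
        if not line.strip():
            continue
        ip = line.split()[0]   # common/combined log format starts with the IP
        counts[ip] += 1

for ip, n in counts.items():
    if n > THRESHOLD:
        subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"],
                       check=False)
```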

~~~
lazyjones
This is ineffective and dangerous where ISPs allow switching dynamic IP
addresses with no delay. Large German ISPs do this, so your abusive scraper
will just continue with a new IP address, while some legitimate user who is
unlucky enough to get the abuser's old IP address is blocked.

