
Ask HN: Should you block web scrapers? - jorgecurio
Let's leave the ethics debate aside and ask this question: is it realistic to invest time and money into blocking scrapers?

I see Distil Networks is popular, but what is their pricing? Try to scrape Crunchbase and see what happens. It's actually very good: I tested with PhantomJS and Selenium + Firefox, and it's able to detect and successfully block the traffic.

I'm willing to invest time and money into preventing competitors from launching based on scraping my data that I worked hard for.

Is this something that I can implement? It sounds complicated, in two parts:

1) Building fingerprints on user agent and other information. What do you include, and where do you start?

2) Distinguishing malicious requests on a time series while avoiding false positives (some people compulsively click the next button twice if it doesn't load immediately).

What are some surefire ways that will give me a high degree of confidence that a scraper will fail?
======
softwaredev__
You should go after the low-hanging fruit first. Check if the request is
coming from a trusted browser (Chrome, Firefox, Safari, etc.). If not, block
it. I'm sure that'll cut down scraping by a lot. And then you can go after
some behavioral stuff. You can do things like measure the milliseconds between
requests. If it's too fast for a human to have clicked that link after the
initial page load, then you know it's a bot.
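
Something like this, roughly (a sketch in Python; the browser token list and
the 300 ms threshold are made-up values you'd have to tune):

    import time

    # Crude allowlist of User-Agent substrings; anything else gets flagged.
    TRUSTED_UA_TOKENS = ("Chrome", "Firefox", "Safari")

    # Assumed threshold: a human is unlikely to click through to the next
    # page within 300 ms of their previous request.
    MIN_HUMAN_INTERVAL = 0.3

    last_seen = {}  # ip -> timestamp of that IP's previous request

    def looks_like_bot(ip: str, user_agent: str) -> bool:
        now = time.monotonic()
        prev = last_seen.get(ip)
        last_seen[ip] = now

        # Check 1: missing or untrusted User-Agent string.
        if not user_agent or not any(t in user_agent for t in TRUSTED_UA_TOKENS):
            return True

        # Check 2: requests arriving faster than a human could click.
        if prev is not None and (now - prev) < MIN_HUMAN_INTERVAL:
            return True

        return False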

~~~
ldjb
Please don't do that. Some of us use rather obscure web browsers, and unless
you keep the whitelist updated, users of newer browsers will be locked out of
the website.

Besides, if your website blocks me, I'm not going to switch browser — I'll
just go elsewhere.

~~~
softwaredev__
Just curious, what obscure browsers do you use?

~~~
ldjb
Some examples are Lynx, Iceweasel, Chromium, Steam's in-game browser and the
Nintendo 3DS browser. Okay, none of them are _that_ obscure but they probably
aren't the first things that spring to mind when you think of browsers.

Depending on your method of blocking browsers, it might automatically lump
Iceweasel together with Firefox and Chromium together with Chrome. But if this
is something you really want to do, you have to be very careful about these
things.

~~~
jorgecurio
So it seems like the only sure-fire way of detecting bad bots is via IP
address, not by relying on the User-Agent.

Would it be at all beneficial if there were a way to detect whether a given IP
address is bad right from the get-go? Possibly by searching through millions
of blacklisted IP addresses with a Bloom filter and a REST API? Do you think a
long roster of abusive IP address databases would help?
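
Roughly what I'm picturing (a toy Bloom filter in Python, just to illustrate
the membership check; the bit-array size and hash count are arbitrary, and a
real service would confirm hits against the full list):

    import hashlib

    class IPBloomFilter:
        # False positives are possible (a clean IP may look blacklisted);
        # false negatives are not, so a hit would be confirmed against
        # the authoritative blacklist, e.g. via the REST API.

        def __init__(self, size_bits=10_000_000, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, ip):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{ip}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, ip):
            for pos in self._positions(ip):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, ip):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(ip))

    # Load millions of blacklisted IPs once, then check each request cheaply.
    bf = IPBloomFilter()
    bf.add("203.0.113.42")                    # example blacklisted address
    print(bf.might_contain("203.0.113.42"))   # True
    print(bf.might_contain("198.51.100.7"))   # almost certainly False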

~~~
ldjb
IP address blacklists can be effective, and certainly make a lot more sense
than blocking anything that doesn't appear on a whitelist.

Depending on how you are presenting the data, you might also find it helpful
to impose a rate limit (accept only a certain number of requests from a
particular IP address within a given timeframe).
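
A minimal sliding-window sketch in Python (the 60-second window and
100-request budget are arbitrary numbers you'd tune, and this keeps state in
memory per process):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60     # assumed timeframe
    MAX_REQUESTS = 100      # assumed per-IP budget within the window

    request_log = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow_request(ip: str) -> bool:
        now = time.monotonic()
        window = request_log[ip]

        # Drop timestamps that have fallen outside the window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()

        if len(window) >= MAX_REQUESTS:
            return False  # over budget: reject or serve a 429

        window.append(now)
        return True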

~~~
jorgecurio
If there were a public API you could use to cross-reference IPs hitting your
website that you suspect of being bots, would you use it? Would you pay for
that kind of API?

Of course, I would start off by (ironically) scraping proxies from public
lists, as those are often used. Not sure what else determined scrapers use to
get data, but it's safe to say that 98% of scrapers are trying to pay as
little money as possible and often complain when a scraper gets
blocked...fuck these people seriously

~~~
ldjb
I personally don't have any use for such an API, but I'm sure others do. This
might be something that already exists; I'm not too sure.

------
ldjb
There is no way to completely block scrapers. The most you can do is make life
difficult for those writing them. You also have to ensure you're not blocking
genuine users.

Instead of investing time and money trying to prevent competitors, why not use
that time and money to innovate on the service you provide so that there will
be no point in setting up a competitor?

~~~
jorgecurio
Of course it's not possible to 100% block ALL scrapers, much like the debate
surrounding obfuscating JavaScript.

The point is to drive the cost of developing a scraper up to where its ROI
begins to rapidly deteriorate. I'm trying to identify techniques and methods
which will:

- tell real users apart from bots

- increase the costs for the bot

