

Ask HN: How to prevent unwanted scraping - alhenaadams

The number 1 post right now is about how to use scraping to essentially get a handle on undocumented APIs.  That's all well and good, but here are my questions for all of HN: how do we prevent our sites from being scraped in this way?  What can't you get around, and what are the potential uses for an 'unscrapeable' site, in your opinion?  Is the push to obfuscate with JavaScript a side effect of modern web app architecture, or the intent behind designs that exhibit such behavior?
======
karterk
As someone who has done a lot of scraping in the past - you just need to
change the CSS class names or re-design your pages once in a while :) This
breaks a lot of automated bots that extract semantic meaning from a webpage
using HTML + regexp parsers.
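To illustrate why this works, here is a minimal sketch of the kind of brittle regexp-based scraper the comment describes (the class names and HTML snippets are hypothetical): the extractor matches a hard-coded class name, so a simple rename silently breaks it.

```python
import re

# A typical brittle scraper: extract values by matching a hard-coded
# CSS class name. "price" and "p_x7f3" are made-up example names.
PRICE_RE = re.compile(r'<span class="price">([^<]+)</span>')

old_html = '<span class="price">$9.99</span>'
new_html = '<span class="p_x7f3">$9.99</span>'  # after a class rename

print(PRICE_RE.findall(old_html))  # finds the price
print(PRICE_RE.findall(new_html))  # finds nothing: the scraper is broken
```

Scrapers that key off the DOM structure rather than class names are harder to break this way, which is why periodic re-designs (not just renames) help.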

I would suggest staying away from using JS as it affects genuine users as well
(e.g. those who use screen readers).

------
dalke
It's very hard, if not impossible, to prevent screen scraping. In the worst
case scenario, the person scraping uses a Firefox instance running on a real
display, driven by a system like Sikuli that moves the mouse the same way a
human would.

No, I take that back. The worst case scenario is hiring a team of people in
some low-wage country to manually go through the site to extract the
information.

How do you prevent those cases? I think the most you can do is throttle based
on a mixture of login account and request IP address.
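The throttling idea above can be sketched as a sliding-window rate limiter keyed on the (account, IP) pair. This is a minimal stdlib-only illustration; the window size and request limit are arbitrary example values, not recommendations.

```python
import time
from collections import defaultdict, deque

# Hypothetical limits for illustration only.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

# Per-(account, ip) timestamps of recent requests.
_hits = defaultdict(deque)

def allow_request(account, ip, now=None):
    """Return True if this (account, ip) pair is still under the limit."""
    now = time.monotonic() if now is None else now
    q = _hits[(account, ip)]
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False  # throttled
    q.append(now)
    return True
```

In practice you would put this behind your web framework's middleware layer and back it with something shared like Redis, since an in-process dict does not survive restarts or scale across servers.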

That said, the first step is to develop a threat model. You need to get an
idea of why would people want to scrape your site, the incentive for them to
do so, and the effect on your site and business if your data is scraped.

------
csense
If some of the data you care about is user-generated, you might want to try
the GitHub model: people can get a free account, but all the information they
generate on the site will be public. Keeping your information private requires
a paid subscription.

------
moocow01
Use Flash or render your content as images. Neither is 100% locked
down, but either will give anyone writing a scraper a run for their money.

Outside of preventing scraping, both of these ideas are likely to be seen as
stupid.

------
ressaid1
Check out services like www.distil.it or blockscraping.com

------
taligent
There is nothing you can do to prevent scraping especially with tools like
PhantomJS which use exactly the same engine as in your browser.

The ONLY way, as suggested above, is to throttle based on IP address and the
X-Forwarded-For header.
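Picking the IP to throttle on is itself a judgment call when requests come through proxies. A minimal sketch (the function name and trust assumptions are my own, and note a client can forge this header, so only trust it when it was set by your own proxy):

```python
def client_ip(remote_addr, xff_header):
    """Choose an IP for throttling from X-Forwarded-For, else the peer address.

    X-Forwarded-For looks like "client, proxy1, proxy2"; the leftmost
    entry is the original client, but it is only trustworthy when the
    header was appended by infrastructure you control.
    """
    if xff_header:
        return xff_header.split(",")[0].strip()
    return remote_addr

print(client_ip("10.0.0.5", "203.0.113.7, 10.0.0.5"))  # 203.0.113.7
print(client_ip("10.0.0.5", None))                     # 10.0.0.5
```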

