

Ask HN: how do comparison websites prevent scraping? - Zuzz

I'm developing a web app not too dissimilar from a price/flights comparison one (nothing to do with travel though) and part of it is collecting lots of info not currently available in a single db (that's where the value really is).

But then I wonder: how do I prevent people from getting all that data off my website (via some sort of scraping, intelligent or not)? I realise it's not worth it to try and cover every single attempt, but surely there are ways to make it at least very difficult - apart from presenting everything as an image or via Flash!

Thanks for any input

Zuz
======
paulhauggis
I've gotten pretty good at data scraping websites over the years, and no matter
how the data is arranged on the page, I've been able to reliably retrieve it.

You might be able to prevent it if you had ids and class names that randomly
changed on a daily basis or really malformed html, but I wouldn't bother.

Another option is to limit concurrent connections to your server based on IP
(Some of the financial websites do this). This would make it so it took much
longer to get your data. But again, if someone really wants to get your data,
they will.

I don't worry about it. Just make sure your website/app is better than your
competition's. Even if someone copies all of your data, they won't have the web
presence. It will just be a cheap imitation.

~~~
Zuzz
Thanks, I certainly don't want to worry too much but, at the same time, the
data is a big component of the value proposition in this case and likely what
cost us the most (more than the actual development of the service, for example).

So much so that we are likely to make more money by giving access to the data
via API to other companies than by allowing "normal" people to search through
the data and get paid a referral fee every time they click on something.

But why should we give up the latter as an additional revenue stream? I'd rather
try and make it more difficult for people to scrape it (and I don't want to
ask people to accept T&Cs before making a search).

Thanks for the reply Paul

~~~
AznHisoka
Impose rate limits per IP and per class C subnet, detect non-JavaScript bots
(excluding search engines), and block most major web hosts/cloud hosts (it's
very doubtful a legit user is browsing from an Amazon Cloud IP).
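
The cloud-host blocking part can be done with the standard `ipaddress` module. The CIDR blocks below are documentation-reserved placeholders, not real hosting ranges; in practice you'd load published provider ranges (AWS, for instance, publishes its IP ranges as a JSON file):

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges) standing in for
# real cloud-provider ranges, which you'd fetch from the providers.
BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(ip_str):
    """Return True if the client IP falls inside any blocked hosting range."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in BLOCKED_RANGES)
```

A request from a blocked range can then be refused, CAPTCHA'd, or served degraded data before it ever reaches the real listings.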

As someone who bypassed paying fees for APIs by just scraping, I think you do
have a legit concern. But don't be concerned about regular users scraping it,
be concerned about potential buyers scraping it. Regular users scraping it
can't do much damage.

~~~
Zuzz
thanks a lot, all good suggestions

------
chris_dcosta
Once the data is served there's not much you can do, but it depends how your
site consumes the data. Straight HTML is pretty easy to grab; Ajax calls make
it more painful (not necessarily difficult); HTML5 data storage is a little
more obscure but still available. My personal favourite is global JavaScript
objects filled at runtime and accessed only when necessary. Data can be
obfuscated and links constructed on the fly.
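
The obfuscation idea could be sketched like this on the server side: XOR the payload with a per-session key and base64 it, with client-side script reversing it at runtime. To be clear, this is a speed bump for dumb scrapers, not encryption - anyone reading the page's JavaScript can reverse it (the function names and key here are invented for illustration):

```python
import base64

def obfuscate(payload: str, key: bytes) -> str:
    """XOR the payload with a per-session key and base64-encode it.
    Trivially reversible; only defeats scrapers that parse raw HTML."""
    data = payload.encode("utf-8")
    mixed = bytes(b ^ key[i % len(key)] for i, b in enumerate(data))
    return base64.b64encode(mixed).decode("ascii")

def deobfuscate(blob: str, key: bytes) -> str:
    """Inverse of obfuscate(); the client-side decoder would mirror this."""
    mixed = base64.b64decode(blob)
    data = bytes(b ^ key[i % len(key)] for i, b in enumerate(mixed))
    return data.decode("utf-8")
```

The server would embed the obfuscated blob in a global JS object and the key somewhere less obvious, so a plain HTML parser sees only gibberish.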

------
ianpri
You could try adding dummy/deliberately incorrect data and then do a search
for this via Google to find sites which are scraping you - business listings
sites use a similar method. Doesn't stop them scraping but lets you find them
to start harassing their hosts.

~~~
Zuzz
indeed that's a common approach to spotting who copied your data in bulk, but
not to preventing it (or at least making it more difficult).

Thanks

------
hokua
One idea is to put sensitive data into an image (for example the prices), and
only serve the image to the user. While images themselves can be scraped, it
is much harder than text.

