
Ask HN: How common is illegal web scraping? - ng-user
With web crawlers being so prevalent today, and the only things really stopping them being a Terms & Services page or a robots.txt file, how often are those broken or, plainly put, ignored?

Asking out of curiosity and general naïveté on the topic. If a scraper doesn't make a ridiculous number of requests and won't trigger any flags on the server, what's stopping people from building large DBs full of illegally scraped data? What can we do to prevent it? Surely it must be going on.
======
tenken
Web scraping is an "arms race". This weekend a client, who had found a Python
scraping library for their site on GitHub, asked me how to avoid scraping.

Either remove your content from the internet or put up with it like we do
spam.

The client linked this:
[http://stackoverflow.com/questions/3161548/how-do-i-prevent-site-scraping](http://stackoverflow.com/questions/3161548/how-do-i-prevent-site-scraping)

But the problem with all these approaches is the arms race. These solutions
take developer time, and they can affect end users. The army of scrapers can
undo your efforts in short order, making a lot of these approaches an exercise
in futility.

~~~
ng-user
Thanks for the feedback and the helpful link, I appreciate the comment!

------
gesman
Illegal?

Someone's TOS is not a law but merely a wishful suggestion to others, who
usually won't bother to read it anyway.

"You are not allowed ..." is the most laughable statement in TOS.

A TOS may, however, remind you that laws exist.

Laws allow or forbid; a TOS does not.

Laws do exist to protect copyrights and trademarks but scraping per se is not
illegal.

~~~
shakna
I'd say the law disagrees with you there.

* Facebook vs Power.com [0]

* AP vs Meltwater [1], where the ruling decided fair use did not apply to World Wide Web content unless explicitly written in the TOS.

There have been a ton of other smaller cases on this.

US law currently says that scraping is trespass.

That being said, I find this utterly preposterous. How can one know that one
may only access the data in a certain form, and not manipulate or publish it
in certain ways, when one must first make a request to see that information?

[0]
[https://www.techdirt.com/articles/20090605/2228205147.shtml](https://www.techdirt.com/articles/20090605/2228205147.shtml)

[1]
[https://www.scribd.com/document/131847330/Meltwater-AP-Ruling](https://www.scribd.com/document/131847330/Meltwater-AP-Ruling)

------
kluck
It is happening. You cannot stop it, only slow it down to a "human" pace.
Any scraper that drives a browser (headless or not) at a human pace cannot be
detected. Every human action on a webpage can be simulated, so any data
accessible to a browser (behind a login or not) can be retrieved. Certain
captchas are also hard to crack, but that is about all you can do: rate
limiting and captchas.
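
For illustration, the rate-limiting half of that could be sketched roughly as below. This is a minimal per-client sliding-window limiter, not anything from a particular framework: the class name `SlidingWindowLimiter`, its parameters, and the idea of keying on an IP address or session ID are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client key.

    Hypothetical helper for illustration; the key could be an IP address,
    a session ID, or an API token, depending on the site.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # client key -> timestamps of recent requests

    def allow(self, client_key):
        now = self.clock()
        recent = self.hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # over the limit: answer with HTTP 429 or a captcha
        recent.append(now)
        return True
```

A request handler would call `limiter.allow(client_ip)` before doing any work. As the comment says, this only caps the pace: a scraper that stays under the limit looks like any other visitor.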

------
wayn3
You can put up ToS all you want. I'm not agreeing to your terms by visiting
your website (I'm not even visiting the site, my scraper is. Can a scraper
sign a contract? I doubt it). Maybe law is f'd up like that in the US. Is it?
Fine, I'll buy servers in Russia. Or some banana republic. Come sue me in the
wonderful state of Nevis and see how that goes.

You can do a lot to prevent it. The best way to prevent it is to not have
valuable data. The more valuable your data, the more effort we will spend on
cracking your countermeasures and we will always win because this is our core
business, while to you it's just a cost center.

LinkedIn is one of the most notorious sites for trying to prevent scraping and
they certainly have the funds. Yet they can't do shit about it and you'd think
that they have it easy because they hide everything behind a paywall. Yet they
can't prevent it from happening. Not even close. If they can't do it, you
probably can't either.

And that's the really boring part. Do you know how many blank APIs there are
on the web? People do their node.js and their frontend SPA bs and then they
just dangle an API that is open for the world to scan. I could make a business
out of scanning for exposed user data and feeding it to a lawyer doing
class-action lawsuits all day. Would be easy. Like really easy. So-called
"engineers" need to learn how cryptography works. Or just reject unauthorized
requests. It's really not that difficult.
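
As a rough idea of what "reject unauthorized requests" can look like, here is a minimal sketch of a bearer-token check using only the Python standard library. The function names, the in-memory token set, and the example token are all hypothetical; a real service would load issued tokens from its own store and plug the check into whatever web framework it uses.

```python
import hashlib
import hmac


def token_digest(token: str) -> str:
    """Store and compare SHA-256 digests rather than raw tokens."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


# Hypothetical token store, hard-coded for the example only.
ISSUED_TOKEN_DIGESTS = {token_digest("example-token-for-alice")}


def handle_request(headers: dict) -> tuple:
    """Return (status, body); anything without a valid bearer token gets a 401."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return 401, "missing or malformed Authorization header"
    candidate = token_digest(token.strip())
    # compare_digest avoids leaking how much of the value matched via timing.
    if not any(hmac.compare_digest(candidate, known) for known in ISSUED_TOKEN_DIGESTS):
        return 401, "unknown token"
    return 200, "user data would be returned here"
```

The point is only that the default answer is a refusal: `handle_request({})` returns a 401, and data is served only after the token check passes.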

To call out one prominent example: the Tinder API has been exposed since ca.
2012, and in all that time they haven't given enough of a shit to secure it.
You can still build a Tinder 3rd party app using their API.

