
Web Scraping and Crawling Are Perfectly Legal, Right? - carlmungz
https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/
======
wolco
I don't fully understand the terms of service clause or how that woman won a
lawsuit. If that were common, wouldn't I be able to put up a similar notice
and start collecting payments? I would think everyone would do that, causing
a decline in visiting random sites.

~~~
bbernard
The way it works was not immediately clear to me either.

If you go to her website: [http://profane-justice.org/profane-
justice.org/](http://profane-justice.org/profane-justice.org/) (yep, you have
to enter this weird URL)

At the bottom, it's stated that if you copy anything, you enter into a
contract, and that for each page copied, you owe her money. This is how it
works.

I don't think that such clauses are common on websites, probably because:

1. People don't know that they can do that.

2. People don't have the patience, energy, and financial resources to sue everyone who copies things from their personal websites.
So yes, you could put something like this on your website. But let's be
honest, this isn't good for the "free" and "open" internet that we all want.

------
bbernard
Hey, thanks for posting my article to HN. Glad you found it helpful! :)

~~~
carlmungz
No worries. I'm into aggregation and scraping and your article had some good
stuff in it. I've been scraping a site for a project of mine once every three
minutes and I thought that was a bit much. Didn't realise it could be as low
as once every 15 seconds.

~~~
bbernard
The longer the delay between your requests, the better.

If you generate a load greater than what a human would, it might become
problematic. A human wouldn't keep poking a website every 10-15 seconds
forever.

The 10-15 seconds is more for web crawlers. Eventually, a crawler will run out
of pages to crawl on a website, so it will stop sending requests to it.

Personally, I would stick with the 3-minute delay. But it depends on what
type of website you're scraping :)

