

Ask HN: What is the current legal position on scraping? - rebootthesystem

I was looking around for a comprehensive reference on the current legalities surrounding scraping. Not sure I have a complete picture at this point. Perhaps HN'ers know of a definitive resource on this topic?

To be clear, I am talking about scraping data publicly available to anyone without having to log on to a website or become a member.

EDIT:

I just came across this article on the subject and found it interesting. Not sure how definitive or up to date it is (posted in 2013).

http://www.bna.com/legal-issues-raised-by-the-use-of-web-crawling-and-scraping-tools-for-analytics-purposes
======
cblock811
Having done a lot of web scraping, I would say there are a few rules:

1) Follow any Terms of Service or Terms of Use.

2) Respect the site's robots.txt.

3) When in doubt... don't do it.

What is the purpose of your scraping? I found that when I was doing analysis,
just asking people prevented any issues. Half the time they just gave me
the information I wanted.
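
For the robots.txt rule, a minimal sketch of checking a URL before fetching it, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are made up for illustration; a real scraper would download the file from the target site instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        print(url, is_allowed(parser, "example-bot", url))
```

This only covers robots.txt. It says nothing about what a site's Terms of Service allow, which still has to be read separately.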

~~~
rebootthesystem
The purpose is to enhance data available through APIs. Every single bit of
it is available publicly without login. This is not to repost articles or data
but rather to provide better actionable information.

If you look at the article I linked to, the common thread seems to be abusive
use of scraped data for direct financial gain. It seems the TOS is irrelevant
unless the scraped data was behind a login and, even then, it depends on the
TOS wording.

Clearly it is a minefield under certain circumstances. Definitely need to do
more research.

~~~
rebootthesystem
Also, there seems to be a huge difference between scraping 100,000 pages per
hour vs. one page every few seconds. The kind of scraping I am considering
would be the latter and, even at that, it would only happen a few days per
month.
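
To put rough numbers on that difference: 100,000 pages per hour is about 28 requests per second, while one page every 3 seconds is 1,200 pages per hour. Below is a minimal sketch of a throttle that enforces the slower pace. The 3-second interval is an assumption standing in for "every few seconds", and the `Throttle` class and `pages_per_hour` helper are hypothetical names, not part of any library.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first

    def wait(self):
        """Sleep until the interval has elapsed; return the seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval_s - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


def pages_per_hour(min_interval_s):
    """Upper bound on request rate for a given minimum interval."""
    return 3600.0 / min_interval_s


if __name__ == "__main__":
    throttle = Throttle(3.0)   # assumed interval: one page every 3 seconds
    for page in range(3):
        throttle.wait()        # call before each fetch
        print("would fetch page", page)
```

Calling `wait()` before every request keeps the scraper at or below the chosen rate no matter how fast the pages come back.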

------
MalcolmDiggs
I'm not a lawyer, but a rule of thumb is: Obey the TOS and the robots.txt
under all circumstances.

And in general, be careful. There seems to be precedent for a TOS violation
being considered a crime these days:

[http://www.nydailynews.com/news/crime/wiseguys-tickets-
charg...](http://www.nydailynews.com/news/crime/wiseguys-tickets-charged-
hacking-ticketmaster-livenation-illegally-grab-best-seats-article-1.171013)

[http://marketingland.com/twitter-reaches-spam-lawsuit-
settle...](http://marketingland.com/twitter-reaches-spam-lawsuit-settlement-
with-tweet-adder-45890)

And you should definitely talk to a lawyer.

~~~
erroneousfunk
The ticket scam case hasn't been decided yet -- it's unclear if it's actually
going to set a precedent or not.

The TweetAttacks case is more than just "they didn't follow the TOS." Just
having a TOS on your site saying "no scrapers" means very little, legally,
unless other conditions are met.

1\. TweetAttacks specifically agreed to the TOS while creating accounts (in
many instances of scraping the TOS does not have to be agreed to)

2\. THEN they didn't follow the TOS

3\. They were notified by Twitter to stop operations, and did not

4\. Twitter spent significant amounts of money to compensate for damages that
they specifically (not just spammers in general) caused (if these occurred
after they were given a cease and desist, Twitter can sue under other laws,
although I'm not clear if this actually occurred)

5\. TweetAdder lied on its website and misled users into thinking that it was
complying with Twitter rules, deceiving consumers.

In addition, the case was settled, so it's also not a precedent.

And robots.txt? It means even less than the TOS. It's an unofficial standard
and an often unlinked file that means nothing legally.

~~~
MalcolmDiggs
Those are all really good points. I think you're right on the money.

I had always assumed, though, that by simply visiting a site (whether by
browser or scraper), you were implicitly accepting the Terms of Use (and would
therefore be breaking them if the TOS disallowed scraping). But it sounds like
(from what you said) there's a big difference between visiting a site and
actually registering/accepting-the-terms.

If that's the case, then I think that changes my opinion quite a bit.

------
minimaxir
I've done a lot of data scraping for my statistical-analysis blog posts, both
via API and HTML parsing. (E.g.
[http://minimaxir.com/2015/01/linkbait/](http://minimaxir.com/2015/01/linkbait/))

I have not received any complaints; in fact, I've received compliments and
promotion from the parent sources. I would not recommend breaking any API
limits or selling the data for profit, though.

------
csharpallday
While it may be against the terms of service and is still probably not okay to
do, facts are fair game. Think public-domain things: phone numbers, science,
even sports stats.

------
btbuildem
It really depends on how litigious the scrapees are feeling.

[http://arstechnica.com/tech-policy/2015/06/3taps-to-pay-
crai...](http://arstechnica.com/tech-policy/2015/06/3taps-to-pay-
craigslist-1-million-to-end-lengthy-lawsuit-will-shut-down/)

~~~
erroneousfunk
It depends on what the scrapers are using the data for. If 3taps had simply
aggregated the data and created some neat visualizations showing relative
housing prices across the country, that wouldn't be copyright infringement.
However, 3taps used Craigslist's data to create a service with the same goal
as Craigslist. This is copyright infringement.

------
otterley
You'll want to hire a lawyer ASAP if you're planning to do this as the basis
for a business. The legal landscape is a patchwork/minefield, depending on a
lot of different variables.

------
bjourne
Specify jurisdiction. Laws vary around the world.

------
smt88
There is no definitive resource. In the US, it's still a gray area. My
suggestion is not to build a business around it.

~~~
mdaniel
I think you mean don't build a business headquartered in the USA, but there
are certainly folks who are trying:

[http://cloudscrape.com](http://cloudscrape.com)

[https://import.io](https://import.io)

[https://www.kimonolabs.com](https://www.kimonolabs.com)

[https://morph.io](https://morph.io)

[http://scrapinghub.com](http://scrapinghub.com)

~~~
smt88
Those are all tools for scraping. There's some fairly strong precedent that
prevents such tools from being held liable for the actions of their users.

I was warning specifically against a business that requires scraped data to
survive.

------
insomniac2
Scraping is not legal, and it damages SEO: any site the scraped content is
added to does poorly in search engine results.

Copying or scraping public-domain information, or information licensed for
copying or redistribution, is not illegal, but it's still not great for SEO.
Many websites have a copyright policy in the small print at the bottom. You can
also search for information that is public domain or under Creative Commons
licenses. I don't really see the point of scraping. It's easy to link to
sources.

