Ask HN: What is the current legal position on scraping?

cblock811 · on June 30, 2015

Having done a lot of web scraping, I would say there are a few rules:

1) Follow any Terms of Service or Terms of Use.

2) Respect the site's robots.txt.

3) When in doubt... don't do it.
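For the robots.txt rule, here is a minimal sketch of what "respecting" it could look like in practice, using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the `example-bot` user agent, and the sample rules are all made up for illustration. A real scraper would load the site's actual file (for example with `set_url()` and `read()`) instead of an inline string.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper: it parses robots.txt content that has already been
    downloaded, so the check itself makes no network request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed example rules, not taken from any real site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "example-bot", "https://example.com/articles/1"))
print(is_allowed(EXAMPLE_ROBOTS, "example-bot", "https://example.com/private/x"))
```

This only answers "does the file ask me not to fetch this path"; it says nothing about the Terms of Service or the legal questions discussed in this thread.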

What is the purpose of your scraping? I found that if I was doing analysis, just asking people will prevent any issues. Half the time they just gave me the information I wanted.

rebootthesystem · on July 1, 2015

The purpose is to enhance data available through APIs. Every single bit of it is available publicly without login. This is not to repost articles or data but rather to provide better actionable information.

If you look at the article I linked to, the common thread seems to be abusive use of scraped data for direct financial gain. It seems the TOS is irrelevant unless the data scraped was behind a login and, even then, it depends on the TOS wording.

Clearly it is a minefield under certain circumstances. Definitely need to do more research.

rebootthesystem · on July 1, 2015

Also, there seems to be a huge difference between scraping 100,000 pages per hour vs. one page every few seconds. The kind of scraping I am considering would be the latter and, even at that, it would only happen a few days per month.
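To put numbers on that difference: 100,000 pages per hour is roughly 28 requests per second, while one page every 3 seconds is 1,200 pages per hour. Below is a rough sketch of the slower approach as a throttled loop. The 3-second default is only an assumed reading of "every few seconds", and `throttled` and `fetch` are hypothetical names; `fetch` stands in for whatever function actually downloads a page.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def throttled(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    min_interval_s: float = 3.0,  # assumed value for "one page every few seconds"
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Call fetch(url) for each URL, waiting so that consecutive requests
    start at least min_interval_s seconds apart."""
    last_start: Optional[float] = None
    for url in urls:
        if last_start is not None:
            remaining = min_interval_s - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        yield fetch(url)
```

The `clock` and `sleep` parameters exist only so the pacing can be checked without real waiting; in normal use the defaults are enough, e.g. `for page in throttled(urls, fetch): ...`.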

MalcolmDiggs · on June 30, 2015

I'm not a lawyer, but a rule of thumb is: Obey the TOS and the robots.txt under all circumstances.

And in general, be careful. There seems to be precedent for a TOS violation being considered a crime these days:

http://www.nydailynews.com/news/crime/wiseguys-tickets-charg...

http://marketingland.com/twitter-reaches-spam-lawsuit-settle...

And you should definitely talk to a lawyer.

erroneousfunk · on June 30, 2015

The ticket scam case hasn't been decided yet -- it's unclear if it's actually going to set a precedent or not.

The TweetAttacks case is more than just "they didn't follow the TOS." Just having a TOS on your site saying "no scrapers" means very little, legally, unless other conditions are met.

1. TweetAttacks specifically agreed to the TOS while creating accounts (in many instances of scraping the TOS does not have to be agreed to)

2. THEN they didn't follow the TOS

3. They were notified by Twitter to stop operations, and did not

4. Twitter spent significant amounts of money to compensate for damages that they specifically (not just spammers in general) caused (if these occurred after they were given a cease and desist, Twitter can sue under other laws, although I'm not clear if this actually occurred)

5. TweetAdder lied on its website and misled users into thinking that it was complying with Twitter rules, deceiving consumers.

In addition, the case was settled, so it's also not a precedent.

And robots.txt? It means even less than the TOS. It's an unofficial standard and an often unlinked file that means nothing legally.

MalcolmDiggs · on June 30, 2015

Those are all really good points. I think you're right on the money.

I had always assumed, though, that by simply visiting a site (whether by browser or scraper), you were implicitly accepting the Terms of Use (and would therefore be breaking them if the TOS disallowed scraping). But it sounds like, from what you said, there's a big difference between visiting a site and actually registering and accepting the terms.

If that's the case, then I think that changes my opinion quite a bit.

minimaxir · on June 30, 2015

I've done a lot of data scraping for my statistical-analysis blog posts, both via API and HTML parsing. (E.g. http://minimaxir.com/2015/01/linkbait/)

I have not received any complaints; in fact, I've received compliments and promotion from the parent sources. I would not recommend breaking any API limits or selling the data for profit, though.

csharpallday · on June 30, 2015

While it may be against the terms of service, and it's still probably not okay to do, facts are fair game. Think public domain things: phone numbers, science, even sports stats.

btbuildem · on June 30, 2015

It really depends how litigious the scrapees are feeling..

http://arstechnica.com/tech-policy/2015/06/3taps-to-pay-crai...

erroneousfunk · on June 30, 2015

It depends on what the scrapers are using the data for. If 3taps had simply aggregated the data and created some neat visualizations showing relative housing prices across the country, that wouldn't be copyright infringement. However, 3taps used Craigslist's data to create a service with the same goal as Craigslist. This is copyright infringement.

otterley · on July 1, 2015

You'll want to hire a lawyer ASAP if you're planning to do this as the basis for a business. The legal landscape is a patchwork/minefield, depending on a lot of different variables.

bjourne · on June 30, 2015

Specify jurisdiction. Laws vary around the world.

smt88 · on June 30, 2015

There are no definitive resources. In the US, it's still a gray area. My suggestion is not to build a business around it.

mdaniel · on June 30, 2015

I think you mean don't build a business headquartered in the USA, but there are certainly folks who are trying:

http://cloudscrape.com

https://import.io

https://www.kimonolabs.com

https://morph.io

http://scrapinghub.com

smt88 · on June 30, 2015

Those are all tools for scraping. There's some fairly strong precedent that prevents such tools from being held liable for the actions of their users.

I was warning specifically against a business that requires scraped data to survive.

insomniac2 · on June 30, 2015

Scraping is not legal, and it damages SEO: any site the scraped content is added to does poorly in search engine results.

Copying or scraping public domain information, or information licensed for copying or redistribution, is not illegal, but it is still not great for SEO. Many websites have a copyright policy in the small print at the bottom. You can also search for information that is in the public domain or under Creative Commons licenses. I don't really see the point of scraping. It's easy to link to sources.