
The current state of the art is hiQ vs LinkedIn:


Basically: if it's publicly visible, you can scrape it.

Caveat: the case is still making its way to the Supreme Court.

Edit: There's also Sandvig v. Sessions, which establishes that scraping publicly available data isn't a computer crime:


Edit2: Two extra common sense caveats:

- Don't hammer the site you're scraping, which is to say don't make it look like you're doing a denial of service attack.

- Don't sell or publish the data wholesale, as is -- that's basically guaranteed to attract copyright infringement lawsuits. Consume it, transform it, use it as training data, etc. instead.

As someone who used to run a heavily trafficked and heavily scraped site, some tips from an operator:

- Make sure your scraper has both a reasonable delay (one request per second or slower) and a proper backoff. If you start getting errors, back off. We never cared about scrapers until we noticed them, and we only noticed them if they hit us too hard, we told them to back off, and then they didn't.

- Look deep for an API. A ton of people would scrape reddit without realizing we had an (admittedly poorly marketed) API for doing just that.

- Respect robots.txt. That was another way to get noticed quickly -- hitting forbidden URLs. If you hit a forbidden URL too often, you'd start getting 500 errors, and if you didn't back off, you'd get banned from using the site. It was an easy way to tell if someone was not a well-behaved scraper.
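For anyone who wants the robots.txt and delay tips in code form, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the helper names are all made up for illustration; a real scraper would fetch the site's actual robots.txt and may need more careful handling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download the site's own file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# "One request per second or slower", per the tip above.
MIN_DELAY_SECONDS = 1.0


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


def delay_for(parser: RobotFileParser, user_agent: str) -> float:
    """Seconds to wait between requests: the site's Crawl-delay, but never under the minimum."""
    crawl_delay = parser.crawl_delay(user_agent)
    return max(MIN_DELAY_SECONDS, float(crawl_delay or 0))
```

The idea is to call `allowed(...)` before every request and to sleep for `delay_for(...)` seconds between requests, so you never touch the forbidden URLs in the first place.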

> we only noticed them if they hit us too hard, we told them to back off, and then they didn't

Response code 429 is your friend: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#429

If you're writing a scraper, you should handle any HTTP error code and back off.
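A rough sketch of what "back off on any error" could look like with only the standard library. The function names, the doubling schedule, and the 1 second base / 300 second cap are my own assumptions, not anything a particular site requires.

```python
import time
import urllib.error
import urllib.request


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (0-based), doubling up to a cap."""
    return min(cap, base * (2 ** attempt))


def http_get(url: str) -> bytes:
    """Plain GET; urlopen raises HTTPError (a URLError subclass) on 4xx/5xx responses."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_with_backoff(url, max_attempts=5, fetch=http_get, sleep=time.sleep):
    """Try `fetch(url)`, waiting longer after each failure; re-raise after the last attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except urllib.error.URLError:
            # Covers HTTP error codes as well as connection failures.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

The `fetch` and `sleep` parameters are only there so the retry logic can be exercised without touching the network. In practice you would probably also add some random jitter to the delay.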

And if you want to get really pedantic, 429 didn't exist when we did this. It wasn't approved until April 2012 and the first patches for it didn't show up until around 2014. We could have monkey patched if we really wanted to, but we didn't really want to.

You should back off whatever the error. This is on the client to implement. HTTP libraries generally don't make the client wait on a 429 by themselves, so I don't feel like it would help get misbehaving bots to slow down.
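Since the waiting is on the client, here is a small sketch of how a client could work out how long a 429 (or 503) is asking it to wait. A `Retry-After` header can carry either a number of seconds or an HTTP date; the function name and the choice to return `None` for a missing or unparseable value are assumptions of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value to seconds, or None if absent or unparseable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max(0.0, (when - now).total_seconds())
```

When this returns `None`, the client would fall back to its own backoff schedule instead of retrying immediately.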

#3 respect robots.txt

This is a polite thing to do, but I don't think that there is any legal precedent for it being an actual requirement. Notably, both Apple and The Wayback Machine publicly disregard robots.txt files [1]. I would be very curious to read any court ruling that determined a robots.txt file needs to be respected.

[1] - https://intoli.com/blog/analyzing-one-million-robots-txt-fil...

It depends on the intention. You should respect robots.txt for search indexing, for example, but not necessarily for something like archiving or creating alternative page layouts (e.g., outline/reader view).

Wayback machine does look at robots.txt - https://help.archive.org/hc/en-us/articles/360004651732-Usin...

They look at them, but they don't follow them strictly [1]. They make judgement calls on what they should do rather than treating robots.txt files as a legal contract.

[1] - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

It's a pity that robots.txt doesn't let you specify what the crawler can do with the resources it's allowed to fetch. I think that if we had such a feature (or something similar, like a "License" header) standardized early enough, a few issues regarding crawling and search engines would be moot, or at least easier to solve automatically.

True, but then all the commercial websites would use it to ban scraping.

If we're talking about being polite, then #4 respect the TOS. Especially requests per minute.

It’s the TOS itself that is legally tenuous, so your best bet is to completely ignore it. There’s no picking and choosing parts of it. Ignore all of it or implicitly accept all of it.

”which establishes that scraping publicly available data isn't a computer crime”

…in the USA (possibly even “in DC”). Also, ”isn't a computer crime” doesn’t imply ”isn't a crime”. Copyright law likely still applies, for example.

This being the top comment, it must be noted that HiQ v. LinkedIn is very much the exception to the well-established rule.

I'm not a lawyer but I did receive a C&D from a Fortune 100 that ultimately shut my project down. I was not selling or exposing any data directly -- it was purely consumed on the back end.

I was not hammering their site, but aggregating and caching requests such that people who used my project ultimately had orders-of-magnitude lower impact than they would've had otherwise.

The data we were sampling was fundamentally non-copyrightable in the US per Feist v. Rural Telecom; just a compendium of places, dates, and times (in the EU, raw data without substantial creative components is copyrightable), but because it was on their servers, and because we had to extract it from an HTML page that constituted a creative work, the CFAA and the Copyright Act were against us.

I talked to many different lawyers, including lawyers who had successfully defended companies from scraping-related lawsuits, and they all told me, unanimously, that it was hopeless. The law and the legal precedent are 100% in favor of the site being scraped. Essentially, it may not be illegal until they tell you to stop, but after that, it's unquestionably illegal. There is no public right-of-way on the internet.

My case is by no means unusual; it happens to several small companies on a daily basis, and it's a critical component in the ability of BigTechCos to maintain their walled gardens and effectively use legal mechanisms to route around the web's inherent distributed properties. All this "decentralized internet" stuff misses the point that the decentralization is not a technical problem, but a legal and social one.

Eric Goldman's blog [0] is a great resource that has consistently followed law related to scrapers for several years. He discusses hiQ v. LinkedIn at [1].


The applicable federal statutes, which are primarily the CFAA and the Copyright Act, don't leave much wiggle room at all on this topic, and neither does the overwhelming majority of case law. Precedents established in the 80s like MAI v. Peak have been consistently misapplied to screen scraping.

There are two particularly onerous prongs of the law here: first, the CFAA's "authorized access" stipulations, and second, interpretations of the Copyright Act that hold RAM copies of data are sufficiently tangible to be potentially-infringing.

The CFAA makes it both a crime and a tort to ever access a server in a manner that "exceeds authorized access" -- essentially, as soon as the company indicates that they don't want you to talk to them, if you talk to them again, you're dead meat (craigslist v. 3taps among others).

Most companies include boilerplate in their Terms of Service that says the site cannot be accessed by any automated means and generally successfully argue that you were thereby on notice regarding the extent of your authorized access as soon as you did anything that constitutes enactment of that contract, which generally means accessing anything beyond the front page of the site ("clickwrap" or "linkwrap"), and almost certainly means anything that involves logging in, submitting forms, etc.

Re: the Copyright Act -- until it's modified to clarify that RAM copies are not independent infringements and to enshrine the rights of users to extract their own copyrighted content from another's copyrighted wrapper, it's going to be a potential infringement every time your software downloads someone's page. The real-world analog of the "RAM Copy doctrine", as it's called, would be that every time your eye reflects the image of a copyrighted work into your brain, you've made a new infringing copy. When it gets to court, that's what scrapers deal with -- and they almost always lose.

On the API front you may be able to argue that a simple JSON structure isn't sufficiently creative to qualify for copyright protection, but that would be blazing a new trail (and still leaves the CFAA to worry about). In almost all cases, with something as complex as the JavaScript and the HTML that you get from $ANYWEBSITE.com, just loading it on an unapproved device is probably an infringement. That each digital load/transform is a potential infringement is how you hear about millions of infringements in file sharing cases, etc., because they're claiming each time you copied that data from your hard drive into your RAM, it was a new independent infringing copy.

Seriously, sit down and read the law, and then read the dozens of cases where this has been litigated previously. HiQ v. LinkedIn is a very limited anomaly in this pantheon, still very early in the cycle, and NO ONE should be taking it as a guiding star, at least not until it hits the Supreme Court and they come down reversing all the old precedent around this.

If you are going to build a business that depends on scraping, ONLY do so with the backing of mega-well-funded VCs, etc., who are able and willing to take on the powerful lobbies, and who are funding your company at least as much for its potential to break legal precedent as for its commercial viability.

Final note: expect no help from FAANG et al on this. Without the CFAA, their walled gardens are dead in the water. It is a critical tool used by MegaCos to retain their digital monopolies. "Network effect" means something, but it's only strangling the web to death because there are $1000/hr law firms enforcing it behind the scenes. Without that, we'd have had automatic multiplexed Twitter/G+/FB streams a long time ago. They shut down aggregators because they need to control the direct interface to the user -- if they're relegated to a backend data provider by someone with a better user experience, they're very vulnerable. This realization is what motivated Craigslist's rapid reversal on scraper-friendliness and sank 3taps, and it has been the death of many potentially innovative early-stage companies.


tl;dr The long and short of it is that until Congress passes revisions to the CFAA and the Copyright Act and/or until the Supreme Court comes down with a wide-ranging ironclad reversal of the last 30 years of case law on this topic, it's going to be perilous for anyone whose business depends on scraping.

And all this is at the federal level -- many states have enacted similar statutes so they can get in on the "put hackers in jail" action, and these battles will have to be fought at the state level too.

[0] https://blog.ericgoldman.org/ [1] https://blog.ericgoldman.org/archives/2017/08/linkedin-enjoi...

> I talked to many different lawyers, including lawyers who had successfully defended companies from scraping-related lawsuits, and they all told me, unanimously, that it was hopeless.

How long ago was this? It seems like the courts have shifted their position on this over time and only very recently (as in the last year) have they started to take a more permissive stance on scraping.

The paper linked elsewhere in this thread does a great job of summarizing the trend: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625

My experience was mid-2015. Hopeful signs have indeed become more frequent, but law moves at an absolutely glacial pace. Things are not going to change substantially for a few more years at the bare minimum.

We're in a good spot socially right now, as the tech behemoths are no longer perceived as plucky upstarts and quirky computer whizzes, but instead as creepy 1984-ish overlords. So I think the stage is set for upheaval -- maybe even some Congressional action if someone can tie this to the "deplatforming" thing that has Republicans fired up -- but we're a ways out yet, especially if we're just going to be crossing our fingers for a favorable SCOTUS ruling.

Compare the Aereo case at [0] for what is perhaps a counter-intuitive philosophical divide: the conservative side of the Court dissented from the majority in holding that Aereo should've been in the clear.

[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....

>we'd have automatic multiplexed Twitter/G+/FB streams a long time ago.

Perhaps aggregation apps should have the client do the scraping, rather than being entirely dependent on server side scraping?

Regardless, the state of copyright and IP law in the US is abysmal. We can't trust these companies (FAANG) to keep their own press releases online for a decade, so how can we let them monopolize ideas (which they fail to fully flesh out) and content? They have been shown to be inept stewards of their own content :c

> Perhaps aggregation apps should have the client do the scraping, rather than being entirely dependent on server side scraping?

Unfortunately, this is where the RAM Copy Doctrine gets us into trouble. Not only is it illegal to "exceed authorized access" to a networked computer, but current precedent also considers loading any copyrighted work into RAM potentially infringing, i.e., if the rightsholder says you're not allowed to use their copyrighted work in that way, you have to present a viable fair use defense.

afaik, no one has brought suit against things like client-side adblockers and browser extensions that modify a page, but if they did, they'd be likely to prevail under current precedent.

We really need true legal protection for users to select their own user agents and to be free to access information willfully transmitted to them in the way they like, especially in the case of something like Facebook/Twitter, where the site itself is just a wrapper around other peoples' copyrighted content.

That will only happen if someone can convince enough Congresscritters to carve out an exception in the actual law, rather than relying on long-outmoded pre-internet judicial interpretations.

Power Ventures scoped down to extract only your own data out of Facebook and they still ended up owing $3M in damages.

See Ticketmaster v RMG at https://en.wikipedia.org/wiki/Ticketmaster,_LLC_v._RMG_Techn.... , where the argument that alternative user agents should be allowed was shot down. I discussed it at some length here: https://news.ycombinator.com/item?id=12352450

Thanks for your detailed comment. This is super informative.

How about free API data? Many come with terms that say you may not analyse or process it further.

Thanks. I was unaware of Sandvig v. Sessions.

Does this cover photos, and using those photos?

I've added a second edit which hopefully answers that question.

A timely reminder that the "new and improved, cool, friendly, loves open source, a different company" Microsoft is - beyond the slick rebranding PR - still quite happy to throw its massive weight around, abusing the law, intent on rewriting the fundamentals of the open internet and access to information to everyone's disadvantage but their own.
