
This being the top comment, it must be noted that hiQ v. LinkedIn is very much the exception to the well-established rule.

I'm not a lawyer but I did receive a C&D from a Fortune 100 that ultimately shut my project down. I was not selling or exposing any data directly -- it was purely consumed on the back end.

I was not hammering their site, but aggregating and caching requests such that people who used my project ultimately had orders-of-magnitude lower impact than they would've had otherwise.
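
For illustration only, here is a minimal sketch of that kind of caching layer. It is not the project's actual code: the class name `CachingFetcher`, the injected `fetch` callable, and the one-hour TTL are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class CachingFetcher:
    """Serve repeated requests from a local cache so the upstream site
    sees at most one request per URL per TTL window.

    Hypothetical sketch: `fetch` is whatever function actually retrieves
    the page, injected so the cache logic can be exercised offline.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,  # assumed TTL, not from the original
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # url -> (time the entry was stored, cached body)
        self._cache: Dict[str, Tuple[float, str]] = {}
        self.upstream_calls = 0  # how many requests actually hit the site

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._cache.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            # Still fresh: answer from the cache, no upstream traffic.
            return entry[1]
        # Missing or expired: make exactly one upstream request and store it.
        body = self._fetch(url)
        self.upstream_calls += 1
        self._cache[url] = (now, body)
        return body
```

With something like this in front of the site, a thousand users asking for the same page within the TTL window produce one upstream request instead of a thousand.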

The data we were sampling was fundamentally non-copyrightable in the US per Feist v. Rural Telephone; it was just a compendium of places, dates, and times (in the EU, raw data without substantial creative components is copyrightable). But because it was on their servers, and because we had to extract it from an HTML page that constituted a creative work, the CFAA and the Copyright Act were against us.

I talked to many different lawyers, including lawyers who had successfully defended companies from scraping-related lawsuits, and they all told me, unanimously, that it was hopeless. The law and the legal precedent are 100% in favor of the site being scraped. Essentially, it may not be illegal until they tell you to stop, but after that, it's unquestionably illegal. There is no public right-of-way on the internet.

My case is by no means unusual; it happens to several small companies on a daily basis, and it's a critical component in the ability of BigTechCos to maintain their walled gardens and effectively use legal mechanisms to route around the web's inherent distributed properties. All this "decentralized internet" stuff misses the point that the decentralization is not a technical problem, but a legal and social one.

Eric Goldman's blog [0] is a great resource that has consistently followed law related to scrapers for several years. He discusses hiQ v. LinkedIn at [1].


The applicable federal statutes, which are primarily the CFAA and the Copyright Act, don't leave much wiggle room at all on this topic, and neither does the overwhelming majority of case law. Precedents established in the early 90s like MAI v. Peak have been consistently misapplied to screen scraping.

There are two particularly onerous prongs of the law here: first, the CFAA's "authorized access" stipulations, and second, interpretations of the Copyright Act that hold RAM copies of data are sufficiently tangible to be potentially-infringing.

The CFAA makes it both a crime and a tort to ever access a server in a manner that "exceeds authorized access" -- essentially, as soon as the company indicates that they don't want you to talk to them, if you talk to them again, you're dead meat (craigslist v. 3taps among others).

Most companies include boilerplate in their Terms of Service that says the site cannot be accessed by any automated means and generally successfully argue that you were thereby on notice regarding the extent of your authorized access as soon as you did anything that constitutes enactment of that contract, which generally means accessing anything beyond the front page of the site ("clickwrap" or "linkwrap"), and almost certainly means anything that involves logging in, submitting forms, etc.

Re: the Copyright Act -- until it's modified to clarify that RAM copies are not independent infringements and to enshrine the rights of users to extract their own copyrighted content from another's copyrighted wrapper, it's going to be a potential infringement every time your software downloads someone's page. The real-world analog of the "RAM Copy doctrine", as it's called, would be that every time your eye reflects the image of a copyrighted work into your brain, you've made a new infringing copy. When it gets to court, that's what scrapers deal with -- and they almost always lose.

On the API front you may be able to argue that a simple JSON structure isn't sufficiently creative to qualify for copyright protection, but that would be blazing a new trail (and still leaves the CFAA to worry about). In almost all cases, with something as complex as the JavaScript and HTML that you get from $ANYWEBSITE.com, just loading it on an unapproved device is probably an infringement. The theory that each digital load/transform is a potential infringement is how you end up hearing about millions of infringements in file sharing cases, etc.: the plaintiffs are claiming that each time you copied that data from your hard drive into your RAM, it was a new independent infringing copy.

Seriously, sit down and read the law, and then read the dozens of cases where this has been litigated previously. HiQ v. LinkedIn is a very limited anomaly in this pantheon, still very early in the cycle, and NO ONE should be taking it as a guiding star, at least not until it hits the Supreme Court and they come down reversing all the old precedent around this.

If you are going to build a business that depends on scraping, ONLY do so with the backing of mega-well-funded VCs, etc., who are able and willing to take on the powerful lobbies, and who are funding your company at least as much for its potential to break legal precedent as for its commercial viability.

Final note: expect no help from FAANG et al on this. Without the CFAA, their walled gardens are dead in the water. It is a critical tool used by MegaCos to retain their digital monopolies. "Network effect" means something, but it's only strangling the web to death because there are $1000/hr law firms enforcing it behind the scenes. Without that, we'd have had automatic multiplexed Twitter/G+/FB streams a long time ago. They shut down aggregators because they need to control the direct interface to the user -- if they're relegated to a backend data provider by someone with a better user experience, they're very vulnerable. This realization is what motivated Craigslist's rapid reversal on scraper-friendliness and sank 3taps, and it has been the death of many potentially innovative early-stage companies.


tl;dr The long and short of it is that until Congress passes revisions to the CFAA and the Copyright Act and/or until the Supreme Court comes down with a wide-ranging ironclad reversal of the last 30 years of case law on this topic, it's going to be perilous for anyone whose business depends on scraping.

And all this is at the federal level -- many states have enacted similar statutes so they can get in on the "put hackers in jail" action, and these battles will have to be fought at the state level too.

[0] https://blog.ericgoldman.org/

[1] https://blog.ericgoldman.org/archives/2017/08/linkedin-enjoi...

> I talked to many different lawyers, including lawyers who had successfully defended companies from scraping-related lawsuits, and they all told me, unanimously, that it was hopeless.

How long ago was this? It seems like the courts have shifted their position on this over time and only very recently (as in the last year) have they started to take a more permissive stance on scraping.

The paper linked elsewhere in this thread does a great job of summarizing the trend: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625

My experience was mid-2015. Hopeful signs have indeed become more frequent, but law moves at an absolutely glacial pace. Things are not going to change substantially for a few more years at the bare minimum.

We're in a good spot socially right now, as the tech behemoths are no longer perceived as plucky upstarts and quirky computer whizzes, but instead as creepy 1984-ish overlords. So I think the stage is set for upheaval -- maybe even some Congressional action if someone can tie this to the "deplatforming" thing that has Republicans fired up -- but we're a ways out yet, especially if we're just going to be crossing our fingers for a favorable SCOTUS ruling.

Compare the Aereo case at [0] for what is perhaps a counter-intuitive philosophical divide: the conservative side of the Court dissented from the majority, arguing that Aereo should've been in the clear.

[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....

> we'd have automatic multiplexed Twitter/G+/FB streams a long time ago.

Perhaps aggregation apps should have the client do the scraping, rather than being entirely dependent on server side scraping?

Regardless, the state of copyright and IP law in the US is abysmal. We can't trust these companies (FAANG) to keep their own press releases online for a decade; how can we let them monopolize ideas (which they fail to fully flesh out) and content? They have been shown to be inept stewards of their own content :c

> Perhaps aggregation apps should have the client do the scraping, rather than being entirely dependent on server side scraping?

Unfortunately, this is where the RAM Copy Doctrine gets us into trouble. Not only is it illegal to "exceed authorized access" to a networked computer, but current precedent also considers loading any copyrighted work into RAM potentially infringing, i.e., if the rightsholder says you're not allowed to use their copyrighted work in that way, you have to present a viable fair use defense.

afaik, no one has brought suit against things like client-side adblockers and browser extensions that modify a page, but if they did, they'd be likely to prevail under current precedent.

We really need true legal protection for users to select their own user agents and to be free to access information willfully transmitted to them in the way they like, especially in the case of something like Facebook/Twitter, where the site itself is just a wrapper around other peoples' copyrighted content.

That will only happen if someone can convince enough Congresscritters to carve out an exception in the actual law, rather than relying on long-outmoded pre-internet judicial interpretations.

Power Ventures scoped down to extract only your own data out of Facebook and they still ended up owing $3M in damages.

See Ticketmaster v. RMG at https://en.wikipedia.org/wiki/Ticketmaster,_LLC_v._RMG_Techn.... , where the argument that alternative user agents should be allowed was shot down. I discussed this at some length here: https://news.ycombinator.com/item?id=12352450

Thanks for your detailed comment. This is super informative
