Hacker News
9th Circuit holds that scraping a public website does not violate the CFAA [pdf] (uscourts.gov)
1136 points by donohoe 10 days ago | 275 comments





This action does more than that. The court left the preliminary injunction against LinkedIn in place: "The district court granted hiQ’s motion. It ordered LinkedIn to withdraw its cease-and-desist letter, to remove any existing technical barriers to hiQ’s access to public profiles, and to refrain from putting in place any legal or technical measures with the effect of blocking hiQ’s access to public profiles."

So LinkedIn is prohibited from blocking hiQ's access by technical means. That's a strong holding. If this case is eventually decided in favor of hiQ, scrapers can no longer be blocked. Throttled a little, maybe, but no more than other users doing a comparable query rate.


Not allowing the CFAA to be (ab)used to attempt to make scraping illegal makes sense.

However, how is it reasonable to force a web site to serve its contents to a third-party company, without being allowed to make a decision whether to serve it or not? Serving the web site costs money, and the scraper surely isn't going to generate ad income...


Ugh, yeah, the more I think about this ruling, the less I like it.

It's actually pretty insane to force a site to serve content. I think both parties are in the wrong here - HiQ for assuming they're entitled to receive a response from LinkedIn's webservers, and LinkedIn for abusing the CFAA to try to deny service rather than figure out a technical solution to their business problem.

In my view:

* The data is public, and free of copyright. If you're a scraper and can get it, you haven't done anything wrong.

* The servers serving the data are still under LinkedIn's control, and they have no obligation or public duty to always serve that content. They could just as well block you based on your IP or other characteristics. If they want to discriminate and try to only let Google's scrapers access the data - what's wrong with that? Scraper brand is not a protected class. Tough taters if your business model "depends" on your ability to successfully make requests to another uninvolved company's webservers.

If I were the judge, I'd throw this out and let LinkedIn/HiQ duke it out themselves - they deserve each other.


I would argue that under the spirit of net neutrality you either serve your site to everyone equally (the public-facing part) or to no one.

Hosting costs money, servers cost money... but maybe create a public-facing API that is way cheaper and easier to use than scraping your website? I see the ruling in a positive light: it might promote more open and structured access to public-facing data.


> under spirit of net neutrality you either serve your site to everyone equally(the public facing part) or to no one

Huh? Net neutrality isn't about the server or client... it's about the network operator in between them.


I suspect Xelbair is making a more expensive definition of net neutrality, taking as a basis the one that says it's about network operators only.

I think you wanted to say more expansive? But it's definitely also more expensive. :D

Yes, I meant expansive. Oops.

That was the case, hence the reference to the "spirit" of net neutrality.

Public-facing internet sites, in my opinion, should be treated the same way as public space: anyone should be free to read, and to write down in their notepad whatever is there, the same as anyone else.

Scraping a public-facing website is, in my opinion, a huge waste of resources. It would be cheaper (in total) to build an API that serves the data than to build a good scraper.


Net neutrality is more about nondiscrimination in routing content from a provider to a user, rather than forcing content providers to serve everyone regardless of conduct. It's entirely reasonable for a site to discriminate who they wish to allow to access their data (whether technically their copyright or data they caretake).

That being said, if you provide data to the public, you don't get to invoke the CFAA to plug the holes your content discrimination code doesn't fill.


Why should you be forced to serve content to people who won't look at your ads?

Like disabled users with screen-readers?

I suppose we can give them a pass if they solve a bunch of captchas.

Anyone is free to put up a paywall and deny access to people who don't pay.

But LinkedIn is apparently happy to let Googlebot and bingbot scrape public profiles. If they want to do that, they can't argue that their policy is to block bots that don't click on ads. Treating Googlebot differently from other visitors is probably a violation of Google's policies, too. They can't have their cake and eat it too.


From reading the opinion, I think the argument goes something like this:

> First, LinkedIn does not contest hiQ’s evidence that contracts exist between hiQ and some customers, including eBay, Capital One, and GoDaddy

> Second, hiQ will likely be able to establish that LinkedIn knew of hiQ’s scraping activity and products for some time. LinkedIn began sending representatives to hiQ’s Elevate conferences in October 2015

> Third, LinkedIn’s threats to invoke the CFAA and implementation of technical measures selectively to ban hiQ bots could well constitute “intentional acts designed to induce a breach or disruption” of hiQ’s contractual relationships with third parties.

> Fourth, the contractual relationships between hiQ and third parties have been disrupted and “now hang[] in the balance.” Without access to LinkedIn data, hiQ will likely be unable to deliver its services to its existing customers as promised.

> Last, hiQ is harmed by the disruption to its existing contracts and interference with its pending contracts. Without the revenue from sale of its products, hiQ will likely go out of business.

> LinkedIn does not specifically challenge hiQ’s ability to make out any of these elements of a tortious interference claim. Instead, LinkedIn maintains that it has a “legitimate business purpose” defense to any such claim. ... That contention is an affirmative justification defense for which LinkedIn bears the burden of proof.

So the real situation is that you can't start blocking access you knew about, in a way that would interfere with third-party contracts, without a legitimate business reason to do so. The burden of proving the legitimacy of that business reason is on you.

edit: TLDR;

> "A party may not ... under the guise of competition ... induce the breach of a competitor’s contract in order to secure an economic advantage."


That’s quite ... crazy.

Be restaurant. Be on Deliveroo. Be getting low margins because of high fees.

So basically you can't decide not to use Deliveroo any more to improve margins ("secure an economic advantage"). I mean, you can cancel Deliveroo, but only as long as you're not "inducing a breach of their contract". So it's only a matter of time before Deliveroo writes a contract saying "we're obligated to deliver food for you from said restaurant".


Choosing not to use a middleman any more so that you can secure higher margins sounds like about the clearest example of a "legitimate business reason" imaginable. The purpose of the act is to immediately increase your margins, not to hurt Deliveroo because you don't want their competition.

That's very different from the case in question, where LinkedIn's motive for cutting off hiQ's access is to inflict damage on hiQ because they are a potential competitor.


I would imagine that if you contract with Deliveroo, they have some terms that say that you need to give notice when cancelling?

I don't know Deliveroo, but I think a better analogy would be if you suddenly, even though it is not causing you trouble, denied access to someone picking up food that you didn't contract with, with the full knowledge that the someone would be in big trouble with their customers.


IANAL, but I think you're misunderstanding "without a legitimate business reason to do so"

"Be Restaurant" blocking Deliveroo because they can't continue operating with the loss of revenue due to high fees is a legitimate business reason. "Be Restaurant" blocking Deliveroo 2: Electric Boogaloo because I don't like their owner, but continuing to allow Deliveroo access would be, presumably, disallowed.

Also there's nothing stopping "Be Restaurant" from offering an exclusive delivery contract to Deliveroo and forcing Deliveroo 2 out, or requiring a minimum fee for all delivery services, Deliveroo and Deliveroo 2 included.

Of course, I think this is all in a very different area from a restaurant; we're talking about a service provided on the internet. I believe LinkedIn has many, many other recourses here, but, as I see it, the courts are just telling them, this aint it chief.


So, if you want to block someone from your service, you need to be able to prove that it is for a legitimate business purpose.

Moreover it seems, 'this harms a competitor of ours' is not considered a legitimate business purpose, but anti-competitive behavior.


Why does there need to be a legitimate business purpose? What about freedom of speech? It's my website and I'll publish what I want to.

Eh, I think you got this backwards. If you really want to talk about this in terms of freedom of speech, LinkedIn is in the act of censoring?

Edit: What I mean is that freedom of speech is not the same as freedom of censoring.


> What I mean is that freedom of speech is not the same as freedom of censoring.

This is at least not quite true of First Amendment law. The concept of "compelled speech" exists in US law, and is considered an unconstitutional violation of the First Amendment. Exactly what falls into that category (and whether the right of domain owners to censor user-provided content as they see fit is protected), I'm not sure, but freedom of speech in the US certainly does at least sometimes include the right not to speak.


Yes, the court was right to block LinkedIn's abuse of the CFAA. But the court was wrong to say that LinkedIn must show hiQ the same website as LinkedIn shows everyone else.

The ruling doesn't seem to say that they can't throttle access, at least.

The data are certainly not free of copyright. A profile can contain a user's picture, or even a small essay describing the user's job or life, for which LinkedIn is not the copyright holder. Moreover, these are personal data, and I'm not sure the scraper has the original user's permission to collect them. In Europe, the scraper may face issues related to GDPR.

Facts can't be copyrighted, so such things as whether or not a person worked for a certain company, or went to a certain school, are unprotected, and with this ruling can be scraped, at least in the U.S. Others things common on LinkedIn, as you rightly point out, are protected--but by copyright law, not the CFAA. So a scraper acting in good faith would have to be careful about what they used if they wanted to respect copyright, but it's a separate issue from this ruling.

http://www.dmlp.org/legal-guide/works-not-covered-copyright


This is exactly right. Copyright protects creative expression, not pure fact. Famously, phone books (remember those?) are basically not copyrightable except for the ads, because they're just lists of data. Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991).

I never said that facts can be copyrighted; I said that most of the things people put in their profiles can be. I was responding to the claim made above that the data were not under copyright. If you just scrape name, company, and position, that's fine, but I highly doubt that they just do that. This lawsuit could have tons of side effects.

I think what hiQ does is to predict whether a particular employee is about to quit.

So the interesting question to me is whether you can lawfully make predictions based on published information if that information is under copyright.

In Europe the answer is probably no, because the assumption is that in order to analyse data you have to copy it first.

To me, this interpretation of the term "copying" makes very little sense. So I wonder what US law makes of it.


Europe has database rights, which has a fair dealing exemption for data analysis.

I'm not sure what "database rights" refers to specifically, but the whole matter is actually rather complicated, because the EU copyright directive has a lot of optional exceptions that member states may or may not adopt.

Most of these exceptions only apply to non-commercial use though. So they wouldn't apply in a case like hiQ.

UK specific exceptions are explained here:

https://www.gov.uk/guidance/exceptions-to-copyright

Unfortunately, both Labour and the Tories have taken a relatively hard line in the EU copyright negotiations, so it seems unlikely that things will be relaxed very much after Brexit.


Database rights are a copyright-like intellectual property regime for databases.

"Facts can't be copyrighted, so such things as whether or not a person worked for a certain company, or went to a certain school, are unprotected"

There's an infinite number of ways to describe a job history, without any single standard, so I don't think it makes any sense to say that a profile or resume is not copyrightable.


Isn't the issue one of being selective about who can view the content? If I, random Joe User, view the publicly available content, you have no issue. But if someone scrapes that data, then you'd want to charge them. Unless I click on the ad, the act of using your bandwidth doesn't change based on who the viewer is. You'd want to apply fees based on the future use of the data rather than on your actual costs.

I'd assume if you weren't signing up, you'd probably look at like 10 profiles tops. A scraper is more than likely going to run through anything and everything it can grab links to (provided it doesn't leverage a very specific filtering mechanism for selecting profiles to scrape).

I could see the hit from a scraper being heavier than that of a typical user. There's also the potential that a user is going to click an ad for any number of reasons, there isn't that likelihood the scraper will.

I'm not anti-scraping by any means, but I get the concerns.


Presumably you'd be allowed to limit a scraper to a standard user bandwidth, and a standard user access - X links per day, Y bandwidth.
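A per-user quota like that is easy to sketch. Below is a minimal token-bucket version in Python; the daily limit is a made-up number, and keying clients by bare IP is an oversimplification (NAT, proxies), so treat it as an illustration only:

```python
import time
from collections import defaultdict

LINKS_PER_DAY = 1000                    # hypothetical "X links per day"
REFILL_RATE = LINKS_PER_DAY / 86400.0   # tokens regained per second

class QuotaTracker:
    """Token bucket: every client, human or scraper, gets the same budget."""

    def __init__(self, now=time.time):
        self.now = now                  # injectable clock, handy for testing
        self.tokens = defaultdict(lambda: float(LINKS_PER_DAY))
        self.last_seen = {}

    def allow(self, client_ip):
        t = self.now()
        elapsed = t - self.last_seen.get(client_ip, t)
        self.last_seen[client_ip] = t
        # Refill gradually, capped at the daily budget, then spend one token.
        self.tokens[client_ip] = min(float(LINKS_PER_DAY),
                                     self.tokens[client_ip] + elapsed * REFILL_RATE)
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False
```

The point is that a scraper crawling at a normal human rate never hits the limit, and nobody is singled out by identity.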

Surely the action is "if you display stuff in public you can't segment the public".

You're not obliged to have public access.

Is there perhaps a factor here of users having an expectation that their profile is publicly accessible; so companies hosting that profile shouldn't be able to choose _secretly_ "who" can access it?


You're inconsistent, and so are the courts and most comments here. Either you favour such conflicts to be decided by technological might, or by the clearly expressed will of the content publisher to have binding effect.

If you consider scrapers to have some sort of right to access any public website, any technological barriers inflict exactly the same harm as an injunction, assuming it is effective. IF you allow technical blocking, it would be preferable to allow blocking-by-clearly-stated-wish, because it would save everyone the costs of the arms race. It would also make both parties' success somewhat independent of the resources they can invest into outgunning their opponents.


> However, how is it reasonable to force a web site to serve its contents to a third-party company, without being allowed to make a decision whether to serve it or not?

Your statement makes absolutely no sense. That's not how the internet works. If you serve something publicly, you don't get to cherry-pick who sees it.

Not only does it make no sense technically, it's also a huge anti-competitive case.


It makes sense and it is how the internet works. Servers cherry pick who sees their content all the time. Scrapers are often blocked, as are entire IP address ranges. Things like Selenium server scrapers can be (approximately) detected and often are denied access.

I’m not sure about being anti-competitive. Serving a website is an action in which you open up your resources for others to access. My friend runs an open source stock market tracking website for free. He started getting hit with scrapers from big hedge funds and fintech companies a couple of months back. This costs him around $50-100 a month to serve all of these scrapers.


If he gives them a stable, fast API with a subscription fee, and the scrapers are truly from hedge funds, he’s going to make a lot more than $100/mo.

He should open up a Patreon, tip jar, something to get that funded.

Could also delay results, offer reduced temporal precision and other things to differentiate use cases.


He and I both have similar free open source websites with donate buttons. They are rarely clicked. Ad revenue over a month for me has been ~$400 while donations over two years have totaled $20. There are about 80,000 unique visitors per month.

It is nice to think donation platforms can fund high traffic open source projects, but this is simply not the case.

In any regard, I fear the potential of this ruling limiting developers’ ability to protect their servers and making us all roll over to the big players with their hefty scrapers taking all of our data for resale.


How long are you allowed to delay results? I mean, not serving results at all is just delaying them forever, but that's out. Can I delay serving results longer than Chromium's default timeout?

Probably up to the point where a judge says 'this is blocking not delaying'.

I don’t see what legal or technical argument you’re making.

Technically, of course you can identify IP ranges owned by certain entities and restrict their access. That’s trivial, so what do you mean when you say the internet doesn’t work like that?

Legally, there’s plenty of region locked content for copyright and censorship reasons. A distributor might region lock because they don’t have distribution rights in particular regions. Are you saying distributors can’t publish free content at all because they can’t choose who sees it but would be breaking copyright law to publish to everyone? Or a site might region lock because certain content is censored in particular countries. Can you not publish anti-regime articles because a totalitarian country is on the Internet?

The entire world isn’t and shouldn’t be held hostage to the most restrictive laws that exist in the world. And the answer isn’t blocking on the requesting end because that’s technically much harder and blocks much, much more content. So what am I missing?

Edit: Forgot to include the other end of the spectrum. If I, as an individual, host my own site on my own hardware with my own connection that I pay the bandwidth for, can I deny a suspected bot network?


Of course you get to choose. You can reject requests based on their user agent, their IP address, the owner or likely geographic location of the IP address, and many other possibilities.
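For illustration, a minimal sketch of that kind of filtering in Python; the blocklists are hypothetical (the IP range is a documentation-reserved one), and a real server would do this inside its HTTP stack:

```python
import ipaddress

# Hypothetical blocklists; real ones would come from logs, WAF feeds, etc.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_AGENT_SUBSTRINGS = ["python-requests", "scrapy", "curl"]

def should_serve(client_ip: str, user_agent: str) -> bool:
    """Reject requests from blocked IP ranges or self-identified scrapers."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return False
    ua = user_agent.lower()
    return not any(s in ua for s in BLOCKED_AGENT_SUBSTRINGS)

print(should_serve("203.0.113.7", "Mozilla/5.0"))            # False: blocked range
print(should_serve("198.51.100.1", "python-requests/2.22"))  # False: scraper UA
print(should_serve("198.51.100.1", "Mozilla/5.0"))           # True
```

Both signals are trivially spoofed, of course, which is why this only ever works as a speed bump.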

What are these possibilities? You only get the IP and whatever client-side information the client is _willingly_ sending you. So if a script/user/bot/etc. says it's Firefox at 1.2.3.4, all you know is that it's a request from 1.2.3.4 that claims to be Firefox. You can ask it to run JavaScript code, but that's beyond classic web interaction, and then again you need to trust the client.

This interaction can't be made trustless, so every client can only be served based on its IP or some convoluted, hacky exchange that is a cat-and-mouse game at best.
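The point about the client choosing what it sends is easy to demonstrate with Python's standard library: the User-Agent is just a header the sender sets to whatever it likes (the URL below is a placeholder, and nothing is actually fetched):

```python
import urllib.request

# Build a request claiming to be a mainstream browser. The server sees only
# this self-reported string, not what the client actually is.
req = urllib.request.Request(
    "https://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))  # the header the server would receive
```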


LinkedIn’s public facing content is exactly that: public. This ruling merely says accessing public content isn’t hacking and so LinkedIn cannot use the CFAA as discriminatory weapon to limit access to that public facing content.

If LinkedIn wants to block access they need to do so by another means that isn’t described as hacking.


Does their robots.txt say don't crawl this part of the site? If it does, this ruling is catastrophic. If it doesn't then there is hope.

> If it does, this ruling is catastrophic

I know it is a generally considered bad form to ask, but did you read much of the ruling? I feel like a lot of people on this thread are just going off of Animats' comment and haven't spent much time looking at the opinion.

I didn't read the whole thing, but skimmed through it and read what seemed to be the relevant parts of the argument. (Including the bit that talks about LinkedIn's robots.txt)

The ruling doesn't really support your claim of catastrophe and doesn't claim to pass any sort of final judgement.

The judge makes a specific point about not reading too much into him upholding the injunction saying:

>> These appeals generally provide “little guidance” because “of the limited scope of our review of the law” and “because the fully developed factual record may be materially different from that initially before the district court.”


Compliance with robots.txt is and always has been voluntary, and many crawlers have long ignored it, including Archive.org.

Do any scrapers actually pay attention to robots.txt?

It's not forcing anything. Don't make a page public then? If a page is public then it is fair game.

This second part is pretty stupid. However, now that we're at this point, LinkedIn still has the ability to decide which of its information is public and which is not. By making all of its information private, it can take back control.

I think the title is wrong: they are holding that LinkedIn cannot block hiQ specifically from viewing public data.

Which seems fair: it's public or it's not. You can't pick and choose who it's public for and who is a second-class citizen.


it's not the scraper's fault that their business model incorrectly assumed profitability through ads in a way that did not foresee compliance with future anti-scraper-discrimination laws.

it's a good point you bring up, and may contribute to the death of ads.


Does this prevent Google from returning captchas if you use a robot to scrape the search result pages, as they currently do?

I mean, it should. That is also a hugely anti-competitive action that just isn't pursued by anyone yet: Google makes money off scraping and denies scrapers the ability to scrape it - that's all sorts of messed up.

The problem is that someone would have to sue google first and no one will do that unless there's big business incentive and big business can already scrape the shit out of google.

This is the weird thing about web-scraping. Big companies can get around protections quite easily - it's the small scripts and average users that get hurt by them. No one is going to tell you this because people would stop buying Cloudflare's "anti-bot 99% effective anti-ddos money saving package" which is complete bullshit.


Google respects robots.txt, so it might be hard to prove that they are accessing websites without their implied consent.

Also, their own robots.txt contains "Disallow: /search". So, there is arguably no inconsistency, either.

But, what does this new ruling mean for robots.txt?
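That Disallow: /search line can be checked the way a compliant crawler would, with Python's standard-library parser; the rules are inlined here for illustration rather than fetched from Google:

```python
import urllib.robotparser

# A toy robots.txt mirroring the relevant part of Google's:
rules = """\
User-agent: *
Disallow: /search
Allow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("mybot", "/search"))        # False: disallowed for all agents
print(rp.can_fetch("mybot", "/profile/jane"))  # True: everything else is allowed
```

Honoring the answer is entirely voluntary, though; the parser only tells you what the site asks.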


I think the OP is getting at the idea that the nature of the relationship is kind of imbalanced. Consider that basically most of their website is off limits: https://www.google.com/robots.txt

You can put your website off limits to Google by having a robots.txt too. What's the problem? People willingly want Google to index them so they appear in search results and Google can send them traffic.

Google only scrapes sites that allow it by their robots.txt file so I don’t think their policy is as hypocritical as you are making it sound.

This is true, but only technically. Google won't actively scrape anything disallowed in robots.txt, but those resources can still be indexed if found in the many other ways Google aggregates data, all of which is automated.

Robots.txt isn't something that bars access to information. It's just a notice that the administrator does not want large amounts of queries against certain resources.


>Robots.txt isn't something that bars access to information. It's just a notice that the administrator does not want large amounts of queries against certain resources.

Many times, robots.txt files are implemented with the intent of barring access to information.

This works by relying on scrapers respecting the file, but it's no different than a no-loitering sign which itself cannot actively stop someone who is loitering.

Google doesn't have a Robots.txt disallowing search because it can't handle a large amount of queries against a resource...


They still scrape and index sites blocked by robots.txt, but they often don’t display those sites in their SERPs (but sometimes they still do)

Never seen it in any logs in 20 years.

Do you have a source for that claim?


Looks like I'm wrong. They will rank pages blocked by robots.txt, but won't crawl them.

https://www.google.ca/amp/s/www.searchenginejournal.com/goog...


Google has an effective monopoly on search engine market - you _can't realistically_ block google from scraping your website. They have this power and they are abusing it.

Also, robots.txt is bullshit: if a person can access a public website, why shouldn't an automated script? Technically speaking, it's the same thing.


It really isn't.

Imagine you have a piece of information that your neighbors in town come to you for every once in a while. They come over every now and again, ask you for it, maybe even bring you cookies for the trouble, and you provide it.

Then there's Ted.

Ted is insatiable. He hounds you at every minute of your day, constantly asking you the same question over and over. You've done everything you can. You tried to reason with Ted. You tried to contact whoever it is that brought Ted to your neighborhood. You even got so desperate, you moved a few houses down to escape the incessant hounding. That only worked for a little while though, and Ted found you again.

So you tried to stop answering the door; no use, he pokes his head in every time you're in the garage. You demanded people identify themselves first. Oh well, it's changed little. Now he just names himself gibberish names before hounding you for the same things over and over again.

This would not by any stretch of the imagination be acceptable behavior between two people. The main factor in determination for a court injunction would likely be physical trespass, or public nuisance; but no digital equivalent exists currently other than the CFAA, in the sense that in as much as one can prove that the access to the system is inconsistent with the intent or legal terms of providing a service, one may seek relief.

The problem is, LinkedIn has failed to make a convincing argument in the eyes of the appellate court that hiQ's Ted is violating the CFAA while LinkedIn has proactively engaged in activity disrupting hiQ's ability to do business; business which was consistent with the service granted to unknown members of the public at large.

In the Court's eyes, from the sounds of it, it appears LinkedIn is doing the greater harm.

What it looks like to me is this is setting up a common law framework that is going to cause website/service providers to have to choose from the get-go what their relationship to the Net is.

Are you just an operator providing a service over the Net to a limited, select clientele bound by very specific terms of service? Then you may have a leg to stand on in fending off malignant Teds, but your exposure and onboarding will need to have concomitant friction to make the case to the Court that these Teds were never meant to be serviced in the first place.

Or, are you providing a public content portal, meant to make things accessible to everyone, with minimal terms? In which case, no legal Ted relief for you!

Just because it is your "system" and it isn't connected to your nervous system does not mean it isn't capable of being harmed, or of inflicting harm on someone else through careless muckery.

The one thing that disturbs me most is how the Court has disregarded the chilling effect that interpreting a duty to maintaining visibility may incur. A First Amendment challenge may end up being the inevitable result of this legal proceeding.


All true.

Can we also be allowed to view people's profiles without being forced to sign in to LinkedIn? They lost my trust many years ago with their shady practises and 'dark patterns', so I don't want to share any of my data with them.

I however sometimes want to look up people. Or would this be a case of wanting to have my cake and eat it?


There will be a 3rd party service for this once the scraping is legally allowed.

"If this case is eventually decided in favor of hiQ, scrapers can no longer be blocked."

If the case is decided in favour of hiQ, then, absent an injunction, what would prevent a website from blocking a scraper? Maybe the website could still block unless and until the scraper gets her lawyers to file an injunction.

Another interpretation is that if hiQ wins, then in the 9th Circuit's jurisdiction, websites serving public information they neither own nor exclusively license may no longer try to use the CFAA and/or copyright law to threaten scrapers.


Scraper bots have genders now?

the scraper = the person doing the scraping

Scrapers generally, sure. Not sure about scrapers on other social-media sites like Facebook, though.

The question being: if just having access to the network isn't enough to grant you access to the data of a specific profile, but instead you have to aggregate samples from a bunch of people in the network in order to see "through their eyes" to the data on the profiles of their friends and friends-of-friends, is that allowed?

Because, if even that was allowed, that'd surely open a different kind of floodgate.


If you have to login to access the data, that data isn't publicly available, and unlikely to be protected by this decision.

Page 31 of the filing specifically differentiates this case from one regarding Facebook:

"While Power Ventures was gathering user data that was protected by Facebook’s username and password authentication system, the data hiQ was scraping was available to anyone with a web browser."


The ol' DMCA trivial encryption switcharoo. I'd expect a lot more sites and their data to require a login now, to the detriment of the data being publicly indexed.

Wait, doesn't LinkedIn have pretty strong authentication safeguards for viewing user profiles? If you google someone and click their LinkedIn profile without being logged in to LinkedIn, you're always directed to sign up or log in.

Twitter is the only example I can think of among the large social media sites that doesn't require you to be logged in to see profiles


It's not authentication safeguards, it's dark UX patterns driving you to signup. Whether or not you'll see someone's profile and how much of it you'll see depends on how they fingerprint you, where you navigate from, and on the phase of the Moon. LinkedIn is notorious for that.

My memory might be misleading me here, but I think I remember years ago in one case I could see less info about a profile when logged in to my account than I could see while logged out and navigating from a Google search...


Nope. You can see public profiles without being logged in. Certainly the case here in the UK; not sure about elsewhere.

Even twitter requires you to be logged in to view "replies" on a user's profile.

Leaving the injunction in place is insane and a huge oversight. It amounts to making web pages carriers that cannot select who they serve.

It should have said only that there is nothing judicially wrong with scraping but also not limited the rights of a service.


It's limited to public pages. They can still discriminate whom they serve, with logins or something, but they can't limit your ability to access their page in a way that you prefer.

All the same. If I as a web host want to fuck with people, or just with someone, by randomly dropping connections, this says I can't (well, it says so for LinkedIn, but it results in the same being applied to others).

This is insane. It really should have just said that "it's legal to scrape" but it shouldn't have said "and you can't stop that".


Of course they can. The question is whether they can do so legally and, if not, based on what laws.

And I presume it would also apply to Cloudflare's captchas.

Would it hold up if profiles were only visible to logged-in users and part of the sign-up EULA was an agreement not to scrape profiles with automated tools?

Obviously not. Although if you had an extension in your browser that scraped every LinkedIn profile, then it's not automated. If everybody in your office had the extension installed, and you maybe also had a couple of interns who did nothing but run LinkedIn searches and hit search results all day long, and maybe hired some people through Mechanical Turk to do the same thing...

Well that's not automated, and slightly more costly but it gets you pretty close to the same effect.


.

In the opinion they specifically discuss DOS attacks: "Internet companies and the public do have a substantial interest in thwarting denial-of-service attacks and blocking abusive users, identity thieves, and other ill-intentioned actors. But we do not view the district court’s injunction as opening the door to such malicious activity"

Hmm, not a lawyer, but I think if you commit one crime in the process of doing something else that's legal, you're still guilty. But LinkedIn couldn't block your spider if it came back and throttled at a reasonable rate.

Considering the kind of private scraping and selling tactics LinkedIn has been chronically guilty of (and not just the ordinary "growth hack" stuff: "LinkedIn violated data protection by using 18M email addresses of non-members to buy targeted ads on Facebook" [1]), it's satisfying to see LinkedIn lose this.

[1] https://techcrunch.com/2018/11/24/linkedin-ireland-data-prot...


I feel like this is a really common theme I've seen several times. Something like "Music Lyric site X sues Google for embedding their lyrics in the results directly" which is funny because site X got the lyrics by scraping them from other sites.

Plus Google only exists from scraping content, but I believe their TOS includes "don't scrape our content".

I find it really funny that the scrapers are battling scrapers - like guys you only exist because you do THE EXACT SAME THING


Regardless, there is legitimate value in the collection, cleaning, interlinking, and presentation of existing data. How that is interpreted by the law is one thing but merely because the data came from a variety of other public/private sources doesn't mean it derived all of its value externally.

For sure, but they shouldn't be hypocritical about it. If they don't consider themselves content parasites, they shouldn't consider people scraping their site to be content parasites, either. (Some sites really are just parasites, though.)

There's nothing hypocritical about it. Googlebot respects robots.txt configured on pages it scrapes. Google in turn expects that their own robots.txt will be respected. What's the issue?

https://www.google.com/robots.txt


Can I politely point out that the conversation is not about respecting robots.txt.

If you want to talk about this in terms of robots.txt, Google is thriving on the fact that other companies don't block their content in robots.txt, but at the same time Google blocks all of its content in its robots.txt.


> If you want to talk about this in terms of robots.txt, Google is thriving on the fact that other companies don't block their content in robots.txt, but at the same time Google blocks all of its content in its robots.txt.

It seems like you're stating this as though to cast some sort of moral aspersion. I don't get it. If other companies don't want Googlebot to scrape them they just have to say so. Most companies want Googlebot to scrape their content. Google doesn't want other people's scrapers to scrape Google's content. Nobody involved in any of this has done anything unreasonable or morally objectionable.


> Plus Google only exists from scraping content, but I believe their TOS includes "don't scrape our content".

Yes. This is EXTREMELY frustrating.

Of all companies to prevent scraping, Google is the most ironic.

Especially since their goal is to organize the world's information, it shocks me that there's no way to get access to this organized information from machine to machine.


I don't really think it's inappropriate or ironic. I can easily imagine naive scrapers essentially DDoSing Google.

Perhaps this issue will be recognised in some of the antitrust investigations.

If I am not mistaken, they no longer claim "organize the world's information" as their goal.


But it still is:

> https://about.google/

"Our mission is to organize the world’s information and make it universally accessible and useful."


Appears I am mistaken. Cheers.

Yes, we want a Google Search API [at a decent price].

Well, setting up so-called Barriers to Entry[0] is econ 101.

[0] https://en.wikipedia.org/wiki/Barriers_to_entry


Creating barriers to entry is an antisocial tactic that harms consumers and society at large.

It is the responsibility of moral consumers to avoid spending their money with companies that use these regressive tactics.


I think it’s important to distinguish types of barriers to entry. Some are “real” while others are “artificial”. For example, a real barrier to entry would be institutional knowledge about an industry while an artificial one would be an arbitrary TOS clause.

And disallowing scraping or making it difficult while refraining from providing an API for the same data is the arbitrary kind. The default state of the web is that it's trivially scrapable - you have to go out of your way to make it harder.

[flagged]


HN is not an appropriate place to project your moral insecurities in this manner.

HN is a place where you feel free to make personal attacks.

Google respects robots.txt, so it's arguably not the same as scraping a website without their consent.

Most sites don't have their main data/functionality in the Disallow section though.

Sure, but they could if they wanted to and that's their own business.

Google respects robots.txt files from webmasters who don't want Google to scrape their content

> LinkedIn has taken steps to protect the data on its website from what it perceives as misuse or misappropriation. The instructions in LinkedIn’s “robots.txt” file—a text file used by website owners to communicate with search engine crawlers and other web robots—prohibit access to LinkedIn servers via automated bots, except that certain entities, like the Google search engine, have express permission from LinkedIn for bot access.
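That kind of carve-out is easy to express in robots.txt and easy to check programmatically. A minimal sketch (the directives and the crawler names below are illustrative, not LinkedIn's actual file), using Python's standard-library robots.txt parser:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the carve-out described above:
# one named crawler is expressly permitted, all other bots are disallowed.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/in/some-profile"))    # True
print(rp.can_fetch("hiq-crawler", "https://example.com/in/some-profile"))  # False
```

Note that robots.txt is purely advisory: the parser tells a well-behaved crawler what the site owner wants, but nothing in the protocol technically prevents a client from ignoring it, which is part of what makes cases like this one contentious.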

Not a big fan of weev, but this sure seems like he got screwed if he was just enumerating public web pages and went to jail for it.[1]

[1] https://en.wikipedia.org/wiki/Weev#AT&T_data_breach


Even if the case was tried today, 9th Cir. isn't binding on other regions of the US, and there's a bit of a split, as detailed in the opinion[1]:

> In recognizing that the CFAA is best understood as an anti-intrusion statute and not as a “misappropriation statute,” we rejected the contract-based interpretation of the CFAA’s “without authorization” provision adopted by some of our sister circuits. Compare Facebook, Inc. v. Power Ventures, Inc., 844 F.3d 1058, 1067 (9th Cir. 2016), cert. denied, 138 S. Ct. 313 (2017) (“[A] violation of the terms of use of a website—without more— cannot establish liability under the CFAA.”); Nosal I, 676 F.3d at 862 (“We remain unpersuaded by the decisions of our sister circuits that interpret the CFAA broadly to cover violations of corporate computer use restrictions or violations of a duty of loyalty.”), with EF Cultural Travel BV v. Explorica, Inc., 274 F.3d 577, 583–84 (1st Cir. 2001) (holding that violations of a confidentiality agreement or other contractual restraints could give rise to a claim for unauthorized access under the CFAA); United States v. Rodriguez, 628 F.3d 1258, 1263 (11th Cir. 2010) (holding that a defendant “exceeds authorized access” when violating policies governing authorized use of databases).

weev was tried in an area under the 3rd Cir. jurisdiction. Somewhat interestingly, his conviction was thrown out in 2014 on venue grounds (e.g. being tried in NJ), without addressing the statutory question.[2]

[1]: pp. 27-28 [2]: https://en.wikipedia.org/wiki/Weev?oldid=912921723#cite_ref-...


Is there an AWS region in the District governed by this case, so you can just do all your web scraping from instances in that region?

Not saying the court made the right call, but in that case the big issue for the court was that the pages were clearly not intended for the public and the defendant knew it.

I believe the salient issue is whether or not there were effective access controls, not whether or not a page could be reasonably interpreted as intended to be non-public.

There is no requirement to have "effective access controls." What matters is what a reasonable person would believe about whether they were allowed to access the data. The access controls are relevant only insofar as that they convey a message that access is not permitted. The effectiveness of the access control is utterly irrelevant.

That is false. What effective access controls do, legally speaking, is help determine if a person could reasonably conclude that the information was non-public.

For example, if a door is locked, but easily defeated, there is an implied assumption that what lies behind it is only available to someone with the key. Another example is an unlocked door with a sign that states "No access without authorization". Or an unmarked and unlocked door that is on private property in a place where it would be very unlikely for someone to reasonably conclude that the access was intended to be public.


The real issue here is somewhere between both you and GP. What is required to trigger the CFAA?

Does accessing a page the site owner doesn't want you to access violate the CFAA, or do you need to hack through access controls?


As a real-world analogue: you can indeed be guilty of trespassing on someone's property even if you don't have to jump over any fences or pick any locks to get there. In some places, they don't even have to have a "no trespassing" sign. Simply being present on someone else's property without an invitation from them is illegal, and no, an open door does not count as an invitation.

> an open door does not count as an invitation.

But if you have someone living there who lets anyone in if they ask, it would be pretty hard to argue they are trespassing.

A user agent must ask for every page with an http request. If the server responds with 200 OK, it’s pretty hard to argue that it isn’t letting (or even inviting) you in.
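The mechanics being described here are easy to demonstrate. A self-contained sketch (using a throwaway local server in place of any real site, so the URL and content are made up) in which the server answers a GET with 200 OK:

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # The server decides how to answer: it could send 403 or drop the
        # connection, but here it consents with 200 OK and a body.
        body = b"<html>public profile</html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/profile/alice"
with urllib.request.urlopen(url) as resp:
    status, page = resp.status, resp.read()
server.shutdown()

print(status)  # 200
```

Every page view, scripted or not, is exactly this exchange; the only question is what the server chooses to send back.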


This doesn't cover you if you lied to get the invitation. If a robots.txt file denies some but not all user agents, setting your user agent to indicate that your request originates from a source it does not originate from is clearly a circumvention of an access control.

You generally can’t be charged with trespass unless you refuse to leave when told to do so.

An open door to a home is different, but unfenced property is 100% not trespass until you refuse to leave.


Trespass in criminal law usually requires notice that trespassing is prohibited, but this is usually satisfied by posting a "no trespassing" sign in a prominent area.

For example, my state's law (emphasis added):

> Whoever, without right enters or remains in or upon the dwelling house, buildings, boats or improved or enclosed land, wharf, or pier of another, or enters or remains in a school bus, as defined in section 1 of chapter 90, after having been forbidden so to do by the person who has lawful control of said premises, whether directly or by notice posted thereon, ...

The federal version of trespass (which I think applies to Indian reservations) considers merely fencing off the area to be sufficient notice that trespassing is prohibited.


That's not true at all. If you are aware that the property you are accessing is not meant for your use, you can be charged with trespassing regardless of if you have specifically been asked to leave or not.

It's even possible to be guilty of trespass if you weren't aware that you weren't allowed on the land. This is negligent trespassing.


Negligence only applies in situations where a reasonable person should have known. You're only able to be charged with trespassing whilst being unaware if you were so ridiculously unaware of your surroundings that any reasonable person in the same situation _would_ have known that they were trespassing.

If a public park blends into somebody's private lawn, you can't be charged with trespassing for stepping over the line.


This is a bad comparison, because when scraping a site you don't cross any borders; you just send and receive information. You can compare this to a phone call or to talking to someone.

A website or server is property, just like land is. Accessing it is no different than accessing any other piece of property. Opening a website is, for all intents and purposes, the same as crossing a border.

To take it a step further, the information on said website is also personal property, and accessing the information without permission is also trespassing. Specifically, this is called trespass to chattels [1] (trespass is most usually legally defined as trespass to person, trespass to land, and trespass to chattels). Even more specifically, this is the actual part of the CFAA that's being debated: computer trespass gets its roots from trespass to chattels. [2]

1: https://en.wikipedia.org/wiki/Trespass_to_chattels

2: https://en.wikipedia.org/wiki/Trespass_to_chattels#In_the_el...


This analogy is faulty and congress really needs to clarify what they meant the CFAA to protect against.

Opening a website is making a request, technically speaking. That is not equivalent to breaking into someone's home and taking information. The equivalent would be the head of the household telling you not to stand outside and ask someone inside to hand you something from the house. You haven't trespassed; you're asking someone in the house to do something for you. It's on them whether they do what you request or not.


A website isn't a person and the law doesn't expect them to act as such. It's a tool. Making a 'request' to a web server is more like turning the knob on a door: maybe the owner installed a lock, or maybe it just opens without there being a lock. But even if there isn't a lock, the law doesn't absolve you of trespassing against the door's owner just because the door itself didn't have the sentience to refuse your request.

But it's a door handle that is MEANT to be turned by the public at large. It's like putting a big "Order Inside" sign above the door to a restaurant and being surprised when people try to gain entry.

You also never entered the server. The server got your request and served something back to you. You did not go inside the house and read the contents of a book on the shelf; it was read aloud to you while you were still outside the house.

I'm not saying that websites shouldn't have recourse against people taking all the contents of their sites, just that the CFAA is the wrong tool.


>But it's a door handle that is MEANT to be turned by the public at large.

No it isn't. That's the crux of the case.

>It's like putting a big "Order Inside" sign above the door to a restaurant and being surprised when people try to gain entry.

It's like putting a big "order inside" sign above the door to a restaurant, and then also having a separate door in the back of the restaurant that clearly is used only by employees to go to the back office, and not being happy when non-employees keep trying to walk into the back office claiming "well there's a sign outside...".

>You also never entered the server. The server got your request and served something back you to. You did not go inside the house and read the contents of a book on the shelf, it was read aloud to you while you are still outside the house.

According to the courts, you did 'go inside the house' because the electronic signals that you sent to the server as part of the request are enough to constitute the 'physical contact' part of trespassing.

Again, trespassing isn't just about you physically having your body on someone else's property. It also can be your interaction with someone else's property (which can be land, or a door, or web servers) through the use of tools or intermediaries.


> not being happy when non-employees keep trying to walk into the back office claiming "well there's a sign outside..."

If you don't bother to put up an "Employees Only" sign on the door, you are going to have a hard time getting a trespassing charge to stick...


> You also never entered the server. The server got your request and served something back to you. You did not go inside the house and read the contents of a book on the shelf, it was read aloud to you while you were still outside the house.

By this reasoning it's impossible ever to hack anything. Even breaking password controls or cryptography is still just sending the server a request and getting something served back.


Lol really?

I'm not "on" your site when I browse there. I asked your server to send me some data and it did so.

It's the real-life equivalent of social engineering. It's so far not illegal for me to ask you things and for you to disclose them to me, even if you weren't supposed to. I'm allowed to lie to you, even to persuade you to tell me things.


You didn't "ask my server". You used a tool to extract data from my server.

It's more akin to you standing just outside my property border and using a fishing pole to pull fish from a pond that is inside my property border. You're still trespassing even if your two feet aren't physically on my land.

The common legal argument (see the second link in my above comment) is that accessing a web server actually does constitute being "on" the server because you are sending signals to my server in order to interact with it, and this satisfies the "physical contact" part of trespassing.

From Wikipedia:

> The courts that imported this common law doctrine into the digital world reasoned that electrical signals traveling across networks and through proprietary servers may constitute the contact necessary to support a trespass claim.

>It's the real-life equivalent of social engineering. It's so far not illegal for me to ask you things and for you to disclose them to me. I'm allowed to lie to you even to persuade you to tell me things.

This absolutely would be illegal, and I'm not sure why you think otherwise. Misrepresenting yourself in order to get me to reveal private information to you is fraud, and it is illegal in pretty much every jurisdiction I can think of.


> You didn't "ask my server". You used a tool to extract data from my server.

The tool asked the server. The server replied.

> It's more akin to you standing just outside my property border and using a fishing pole to pull fish

Bullshit. Using HTTP to access public information is akin to standing outside your business and writing down the phone number in the banner. Or even reading the "No trespassing" sign.

As long as you're not violating copyright, NDAs or EULAs (and that's debatable) there should be nothing wrong with reading information that you were authorized to view.


>there should be nothing wrong with reading information that you were authorized to view.

You aren't authorized to view it. That's the entire point.

And the lack of access control does not implicitly give you authorization to view it.


When it comes to physical properties there's a huge difference between reading a banner posted in a street and entering the property to read some secret data: you have to be in different locations. That's why your analogy is completely faulty.

When it comes to PUBLIC data in a website there's no difference. How would I know I'm authorized, implicitly or explicitly, to access a website, say www.google.com? Should I phone the domain owner before accessing?

Just because you meant for something to be off limits but failed to inform anyone doesn't automatically make it off limits. "Trespassing" in a website is analogous to hacking it, using stolen credentials, using exploits and things like that.

Unless some law passes that says that someone remotely accessing a folder called /secrets/, or /inside-the-property/ or something like that is trespassing, it won't be the case.


>When it comes to physical properties there's a huge difference between reading a banner posted in a street and entering the property to read some secret data: you have to be in different locations. That's why your analogy is completely faulty.

At no point is accessing a web server similar in any matter to reading words off of a banner posted in a street. You cannot use a faulty analogy of your own to describe why my analogy is faulty.

>When it comes to PUBLIC data in a website there's no difference.

Yes there is. Even for data that is public and meant to be accessed by the public, you still must access the web server. It is much more similar to walking into a publicly accessible restaurant and reading their menu; it is not similar to reading a banner on the outside of the restaurant.

>How would I know I'm authorized, implicitly or explicitly, to access a website, say www.google.com? Should I phone the domain owner before accessing?

A reasonable person knows that www.google.com is meant for public use. It is common knowledge and from whatever avenue you heard about Google, you probably gathered from context that www.google.com is somewhere you are allowed to go.

This is absolutely not the case if you randomly guess a URL like 'mycompany.intranet.io/financials/employeelist.xls'. And it certainly is not the case when you are explicitly told (such as in a robots.txt) that you are not allowed.

>Just because you meant for something to be off limits but failed to inform anyone doesn't automatically make it off limits.

It does, though. The owner of property is under no responsibility to inform the public that their property isn't meant for use. It is up to each individual person to determine if they are allowed to use it or not. This is typically done by context clues and societal expectations: it would be absurd for a random member of the public to walk through someone's open front door and claim "well I was never explicitly told to not come into your house...". The person should know, based on social conventions that you don't just walk into someone else's house, that it's not allowed. This is the same for websites. There is some leeway given, such as if you saw a sign for "Open House" and simply walked into the wrong house. But it is still possible to commit an act of trespassing even if you didn't explicitly intend to: this is called negligent trespassing.

>"Trespassing" in a website is analogous to hacking it, using stolen credentials, using exploits and things like that.

No, it's not. Did you even click on the link I provided earlier regarding trespassing?

>Unless some law passes that says that someone remotely accessing a folder called /secrets/, or /inside-the-property/ or something like that is trespassing, it won't be the case.

That law already exists. It's called the CFAA, and the debate around it is what is being discussed in this post.


The "don't walk into someone else's house" rule applies to ALL houses everywhere. You are explicitly forbidden to enter a house unless explicitly authorized.

When it comes to website, there are billions of domains in the planet, each one has multiple internal URLs, ranging from tens to several million. You can't expect everyone to have common knowledge about every domain and link. It is beyond ridiculous to compare the two.


> You can't expect everyone to have common knowledge about every domain and link. It is beyond ridiculous to compare the two.

It's true that there's a presumption that sites that are accessible by the public are open for access by the public. But a lack of technical restriction is not an invitation. If a reasonable person would conclude that your access is not welcome, then your access is also illegal. This is the crux of why so much of security research is on precarious legal footing. If you find an unsecured MongoDB database with a name like "customer_data" and you download the contents, you are 100% breaking the law.


A better analogy: Accessing a website is like calling up a business and asking whichever employee answers for information.

> And the lack of access control does not implicitly give you authorization to view it.

I know you're trying really hard to sway opinion on HN for some reason, but I'm just going to reinforce the entire point of this thread and, assuming we're staying within the context of publicly accessible information: the Ninth Circuit Court strongly disagrees with you.

Common law torts, such as trespass to chattels, may apply. But it's not a criminal offense.


I don't know why you think this has anything to do with opinion. I'm relaying information that is available in the Wikipedia link that I provided in an earlier comment.

>but I'm just going to reinforce the entire point of this thread

That isn't the entire point of this thread, nor is it the point of the PDF posted in the OP.

>Common law torts, such as trespass to chattels, may apply. But it's not a criminal offense.

Nobody has said anything about it being a criminal offense. The relation to trespassing is literally the entire point of this thread.


> You didn't "ask my server". You used a tool to extract data from my server.

You're always using a "tool" to "extract" data from a web server, unless you're manually operating a telnet session. A web browser is such a tool, an incredibly complex and automated one. cURL is such a tool too, and so is cURL wrapped in a bash script. None of them goes outside of what's allowed by the HTTP protocol[0]. And the most core assumptions of the Internet and the HTTP protocol combine into a simple rule: if it's a publicly routable server answering HTTP requests, you can issue requests and receive whatever it sends. If a server wants to discriminate, it should set up an auth scheme.

--

[0] - protocol family at this point.
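To make the "it's all the same protocol" point concrete, here is what any HTTP/1.1 client, whether a browser, cURL, or a script, actually puts on the wire (a hand-written sketch; the host, path, and user-agent string are made up):

```python
# The raw bytes a client sends for one page view. A browser and curl emit
# the same shape, differing mainly in headers such as User-Agent.
request = (
    "GET /in/some-profile HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: tiny-client/0.1\r\n"
    "Accept: text/html\r\n"
    "Connection: close\r\n"
    "\r\n"  # blank line: end of headers, end of request
).encode("ascii")

print(request.split(b"\r\n")[0])  # b'GET /in/some-profile HTTP/1.1'
```

Nothing in these bytes identifies the kind of tool that produced them, which is why discriminating between clients has to rely on heuristics like request rate, IP ranges, or the self-reported User-Agent header.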


> "You didn't "ask my server".

Yes, you did.

> You used a tool to extract data from my server.

No, that's not how the technology works.


Using fishing bait is just a "request" for a fish to bite my line so I can pull it in. It's up to the fish to respond to the 'request', right? So does that absolve me of a crime if I go fishing in someone else's pond and pull out all of their fish? Cause the fish are the ones that responded, right, so it's not my fault?

No, of course not. The technical details of how an HTTP request works are not what is relevant here. Don't be obtuse.


Personal attack noted.

No, it’s more akin to standing on the boundary, reading your posters using binoculars.

um, seems in this case the court specified that it is NOT on them - if the info is public, the website/house-people are required to return it and create no obstacles to doing so.

>A website or server is property, just like land is. Accessing it is no different than accessing any other piece of property

What if I placed a sign on my lawn which said "Please, step on the grass!"? Would it still be trespassing?

You laid out a lot of opinions there as if they were facts. They are not. These issues are complex and are still being debated at levels higher than the HN comment section.


I don't understand your comment.

>What if I placed a sign on my lawn which said "Please, step on the grass!"? Would it still be trespassing?

No. Of course not. What exactly is your question?

>You laid out a lot of opinions there as if they were facts.

I didn't lay out any opinions. I relayed information that is available from Wikipedia and other sources and rephrased it into an HN comment. None of it is opinion. If you take issue with what my comment says, you can take it up with the courts that made the decisions that gave the information I posted.


>No. Of course not. What exactly is your question?

My point was that it's hardly as clear-cut as a piece of land, and you know it. You posted a link to Wikipedia's Trespass to Chattels article, which I think is funny because it exactly proves my point. From your link:

>...several companies have successfully used the tort to block certain people, usually competitors, from accessing their servers. Though courts initially endorsed a broad application of this legal theory in the electronic context, more recently other jurists have narrowed its scope. As trespass to chattels is extended further to computer networks, some fear that plaintiffs are using this cause of action to quash fair competition and to deter the exercise of free speech; consequently, critics call for the limitation of the tort to instances where the plaintiff can demonstrate actual damages.

It is not at all clear that what we're discussing here is a clear violation. It's very debatable and the law itself was never envisioned to apply to scraping websites (because they didn't exist yet!) It also goes on to say (in the US)

>One who commits a trespass to a chattel is subject to liability to the possessor of the chattel if, but only if,

>(a) he dispossesses the other of the chattel, or

>(b) the chattel is impaired as to its condition, quality, or value, or

>(c) the possessor is deprived of the use of the chattel for a substantial time, or

>(d) bodily harm is caused to the possessor, or harm is caused to some person or thing in which the possessor has a legally protected interest.

The only clause there which even begins to help your case is the 'value' part of clause b and, again, that's very debatable.

> you can take it up with the courts that made the decisions that gave the information I posted.

Decisions made by court A get overturned by court B all of the time. We'll see where it lands, but we're not there yet (again, my point!)


Those are apples-to-oranges comparisons. A phone call: you don't have to answer the call, nor say anything once you know (or don't know) who the caller is or what their intention is, and you can stop whenever you want. Is it a robo-call? You hang up. And similarly with talking to someone in person: if they say something, you're free to just not respond; and if they persist, it's harassment.

The main reason I see businesses being concerned about being required to serve pages to scrapers (even at a reasonable download rate) is that there's still a cost associated with it, and more so the more scrapers try to access the data and regularly re-access it for updates. Similarly, if it is the users of a platform who have input the data and update it, and they only want it presented on that platform (for whatever reasons), then what rights do they have?

Is the answer then adding another acknowledgement message like "this site uses cookies", perhaps with a required response before moving forward, so users acknowledge that "scraping isn't allowed" - akin to "no trespassing" signs on properties? That seems awfully ridiculous: it puts the onus on 100% of users (the overwhelming majority being non-scrapers), adding friction and slowing down billions of internet surfers. Of course, browsers could then act as a layer that auto-responds to that or pre-agrees to the rules - perhaps by reading through a site's TOS and pre-approving what you agree to. And the trend otherwise leads to closed platforms, so the data isn't considered public; I won't argue whether that is good or bad for the general internet, but how much value is there in a person having access to that data without having to be a user?

Or the much simpler thing is we could put the onus on businesses who are scraping or will use scraped data to not cause this mass friction.
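The "reasonable rate of download" idea above is where throttling rather than blocking comes in. As a hedged illustration (nothing specific to LinkedIn or this case; the class and parameters are made up for the example), a per-client token-bucket limiter lets a scraper through at a bounded rate instead of serving it nothing:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests/second, with bursts up to `capacity`.

    The `clock` parameter is injectable so the behavior can be tested
    deterministically; in production the default monotonic clock is fine.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if one request may proceed now, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per client IP (or API key) and return HTTP 429 when `allow()` is False, which caps the serving cost of heavy scrapers without blocking them outright.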


And does robots.txt count as an access control?

What about a humans.txt that says "please don't scrape this site"?


According to the ruling, a cease and desist letter directly demanding that they not scrape the site didn't count as access control, so one would assume that humans.txt wouldn't either. It needs to be a technical prevention like a password, access token, etc.

Sounds more like an access suggestion than control to me
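Right: robots.txt is only consulted by clients that choose to honor it. Python even ships a parser in the stdlib for well-behaved scrapers; a minimal sketch (the rules here are made up for illustration):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Normally you would point this at a live file with rp.set_url(...) and
# rp.read(); parsing inline keeps the example self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The "control" is entirely client-side: nothing stops a scraper from
# simply ignoring the answer.
print(rp.can_fetch("my-bot", "https://example.com/profile/alice"))   # allowed
print(rp.can_fetch("my-bot", "https://example.com/private/notes"))   # disallowed
```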

Ahh, so if a company leaks data it's the viewer's fault, not the company's?

Yes. That's the general rule--negligence of a victim does not negate the culpability of the criminal. "It was easy to commit the crime" is not a defense. If you find yourself with access to something you think you're not supposed to have access to, you're supposed to do the right thing.

I think the issue is not "negligence of the victim" so much as, "took no steps to make the information private". If I don't lock my door and someone goes into my house, opens my filing cabinet, and copies my financial info then they're still a criminal. They took information from a place that was unambiguously meant to be private. But if I staple a copy of my financial info to a telephone pole, surely those who read it are not committing the crime of reading private information.

A URL that is not authenticated seems more like the latter than the former. The web is public unless people take steps to make it private. Criminalizing accessing unprotected URLs is like arresting people for reading the financial info I left stapled to the telephone pole. A while back there was a "world's most exclusive chat room" website that hosted IRC channels gated by Twitter follower count. People learned that they could just manually increment a URL parameter to access higher room numbers regardless of their Twitter follower count. Were those people committing a crime?


To take the case at issue, when your immediate reaction to discovering an unprotected URL is to scrape it, discuss on an IRC channel how you're going to monetize it, and then go to the media to announce your security vulnerability discovery, you are going to find it difficult to make the argument that you believed you had authorized access.

By that same line of reasoning, one could argue that changing your url parameter in that twitter chatroom website is a privilege escalation attack that allows users to access protected information.

Absence of authentication means all access is authorized, otherwise just typing in random urls is a crime.


The case referenced by the top-level comment of this chain (the one about 'weev') is a case where someone was prosecuted and imprisoned specifically because changing URL parameters was seen as an attack allowing access to protected information.

You can indeed argue that. Typing random URLs can indeed be a crime.

Nobody leaked any data here. These were public profiles that were "controlled" by a robots.txt file.

The judge appears to question whether robots.txt is sufficient to prevent scraping, or if a proper authorization step would be required.

The best real-world analogy I can come up with... I post a No Trespassing sign on my garden, but don't fence/gate the property. Is it ok to access the property and take my tomatoes? After all, the sign is just a suggestion... had I really wanted to prevent access, I'd install a fence.


It's more like a store putting up a no shoes no shirt no service sign and then trying to sue for trespass when a beachgoer comes in to shop anyway. LinkedIn is a business with publicly accessible assets they want to be frequented, but they want to control how you do that. However they are finding the laws regulating the rights people have in respect to frequenting places open to the public apply.

Yep. It's more like having a public store and only letting some people into it, like only men, no women. Because they clearly allowed the Google bot.

More like only letting humans into it, and only one type of robot, the googlebot. Then this company is ignoring the posted rules and sending armies of robots into the store to photograph every square inch of the business.

Well I'm sorry, but if I opt to wrap a cURL call in a bash for loop, I'm still a human that tries to access the same resources, only with a different user agent.

There is legitimate individual interest in both scraping and non-browser HTTP sessions.


I think this is the best analogy I've seen on this thread.

Where I live, in order for it to be trespass it has to be an enclosed space. If you don't fence your property, then it is not trespass.

Taking your tomatoes is theft, sign or no. That has nothing to do with this case.

A friend of mine from grad school was very involved in legal issues related to CFAA stuff. According to him, weev really got screwed because he failed "the punk test", which discouraged lawyers from wanting to use him as a test case.

Curious question: what is "the punk test"?

“Would this person’s attitude make them more unpleasant to work with than others I could be representing in my already limited time”, or something to that effect.

Someone who spends their free time hacking university printers to distribute white supremacist propaganda and is a proud member of the “Gay Nigger Association of America” trolling group would likely not pass.


"Will this person be negatively perceived by a jury of their peers or, especially, a panel of 3 federal judges? And will this person say dumb shit that torpedoes my case for reasons other than its merits?"

"Is this person going to say idiotic nonsense that tanks the case because the judge or jury thinks they are just awful?"

Weev may have been a good test case if he wasn't a white supremacist.


>except that certain entities, like the Google search engine, have express permission from LinkedIn for bot access

How does this work technically? I just tried crawling a friend's profile using curl with my user agent set to Google's bot, and it was still blocked.


Google describes how to verify Googlebot here: https://support.google.com/webmasters/answer/80553?hl=en

Most other search engine crawlers provide similar methods.
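Per that Google page, the check is a reverse-DNS lookup on the requesting IP followed by a confirming forward lookup, so a spoofed user agent alone never passes. A rough Python sketch (the resolver functions are parameters so the logic can be exercised without network access; the accepted domain suffixes follow Google's published guidance):

```python
import socket

def is_verified_googlebot(ip,
                          gethostbyaddr=socket.gethostbyaddr,
                          gethostbyname=socket.gethostbyname):
    """Two-step verification: reverse-DNS the IP, check the domain,
    then forward-resolve the name and confirm it maps back to the IP."""
    try:
        hostname = gethostbyaddr(ip)[0]
    except OSError:
        return False
    # Google documents googlebot.com / google.com as the crawler domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return gethostbyname(hostname) == ip
    except OSError:
        return False
```

Combined with a user-agent check, this is roughly how a site can whitelist Googlebot while refusing other clients that merely claim to be it.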


IP range whitelisting? MASSL?

EDITED: corrected auto-correct.


hiQ asked the court for a preliminary injunction to stop Linkedin from denying them access, won it, and this is the result of Linkedin's appeal of that injunction. This is not the end of the case.

The title is wrong. The 9th Circuit just ruled that hiQ has a decent enough argument to move forward. The question of whether them scraping a public site can violate the CFAA is not settled.

> We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required

> The data hiQ seeks to access is not owned by LinkedIn and has not been demarcated by LinkedIn as private using such an authorization system. HiQ has therefore raised serious questions about whether LinkedIn may invoke the CFAA to preempt hiQ’s possibly meritorious tortious interference claim.

Note the tone of the language used in the ruling. The judge makes it pretty clear that nothing is final here.


I think you've mischaracterized the state of things. In the underlying case, LinkedIn asserted that HiQ violated the CFAA and HiQ said LinkedIn tortiously interfered with its business. The trial court said LinkedIn couldn't assert the CFAA. LinkedIn appealed, asking the appellate court to overturn the trial court and also to hold that the tortious interference claim is preempted by the CFAA. The appellate court said no, we agree with the trial court and there's no preemption, so now HiQ can go back to the trial court and proceed to trial with its tortious interference claim.

LinkedIn tried to use the CFAA as an argument against the preliminary injunction HiQ was seeking at the start of the trial (which would force LinkedIn to continue to provide access to the profiles). They claimed that HiQ was likely to fail under the CFAA and so do not deserve the injunction to be granted. When the preliminary injunction was granted, LinkedIn appealed. This is the ruling on that appeal:

> It is likely that when a computer network generally permits public access to its data, a user’s accessing that publicly available data will not constitute access without authorization under the CFAA. The data hiQ seeks to access is not owned by LinkedIn and has not been demarcated by LinkedIn as private using such an authorization system. HiQ has therefore raised serious questions about whether LinkedIn may invoke the CFAA to preempt hiQ’s possibly meritorious tortious interference claim.

So yes, HiQ and LinkedIn need to go back and finish the trial, but the language used is in no way a ruling on whether or not the CFAA preempts state law, just that even if there is preemption, hiQ still has a decent argument.


Are you saying the trial court never ruled on the preemption claim?

The court rules that hiQ has a good enough argument against the preemption claim that the preemption claim cannot be used to block the injunction:

>> We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required


I disagree about the tone, it seems to suggest to me that the judge believes there is a strong case here for hiQ.

I agree, but there's a difference between "hiQ has a strong case" and "Here is the final ruling on the hiQ case". I was trying to point out that the case is not over or ruled on at all. Just the preliminary injunction.

Oh is that all, just a PI against their fundamental argument.

"Just the preliminary injunction."

But I thought that was a few years ago!

How many more over-rulings or appeals do we freaking need? I really hope this is the final ruling.


This ruling isn't on the actual case. The ruling is about the injunction being upheld while the case is tried.

Funnily enough the judge has a similar concern:

>> I write separately to express my concern that “in some cases, parties appeal orders granting or denying motions for preliminary injunctions in order to ascertain the views of the appellate court on the merits of the litigation.”


AP seems to be saying differently. https://apnews.com/1e1cacd92df74f48846e8bce5237b97d

I think I would trust the opinion itself over a random AP reporter.

You misunderstand basic law terminology.

A preliminary injunction is considered very strong. So it's not that "nothing is final here", it's actually almost pretty much final unless something comes out of left field.


The injunction was to stop LinkedIn from blocking access while the case is ongoing, not to stop them from arguing that hiQ violated the CFAA. The trial court could hear the arguments and say "hiQ is wrong, they did violate the CFAA". Maybe that's not likely, but it also is not yet decided.

So what exactly did I misunderstand and why do you think this is final?


I think you missed that this injunction is the case?

You are saying "the injunction was to stop LinkedIn from blocking access while [the injunction request] is ongoing".

If the court didn't think hiq had a strong case they would not have granted the initial injunction, then reaffirmed it on this appeal.


The 9th circuit uses a sliding-scale version of the preliminary injunction test. Because hiQ has more at stake, all hiQ needs is a serious question in this case, not a likelihood of success on the merits.

It still might be the case that hiQ has less than a 50% chance of winning in the eyes of the appeals court.


A good decision was reached, but it's a little worrying that the emphasis in the ruling was mostly about a weighing of business interests rather than affirming a right to access public information. If HiQ's business model had not been jeopardized by LinkedIn's business desire to block them, I fear this court could have easily gone the other way. I'd really love to see a ruling that solidifies the right of someone to access publicly available data without fear of repercussions. If this case makes it to SCOTUS, I would hope the ruling is predicated on that rather than business harm.

Key paragraphs from the ruling:

> In short, even if some users retain some privacy interests in their information notwithstanding their decision to make their profiles public, we cannot, on the record before us, conclude that those interests—or more specifically, LinkedIn’s interest in preventing hiQ from scraping those profiles—are significant enough to outweigh hiQ’s interest in continuing its business, which depends on accessing, analyzing, and communicating information derived from public LinkedIn profiles.

> Nor do the other harms asserted by LinkedIn tip the balance of harms with regard to preliminary relief. LinkedIn invokes an interest in preventing “free riders” from using profiles posted on its platform. But LinkedIn has no protected property interest in the data contributed by its users, as the users retain ownership over their profiles. And as to the publicly available profiles, the users quite evidently intend them to be accessed by others, including for commercial purposes—for example, by employers seeking to hire individuals with certain credentials. Of course, LinkedIn could satisfy its “free rider” concern by eliminating the public access option, albeit at a cost to the preferences of many users and, possibly, to its own bottom line.


> If HiQ's business model had not been jeopardized

I think this is more about validating hiQ's legal standing in the case.


This case is so ridiculous on multiple fronts that although this procedural ruling (injunction) seems technically correct (to allow the case to proceed to actual court), it could just as well have been thrown out with no difference in or ultimate harm to the parties.

First, LinkedIn makes the claim that its users have a right to privacy against scraping by such a 3rd party. That's laughable. As the court saw, their whole business model is made on people sharing their profiles broadly and mostly to the public.

Secondly, HiQ claims that LinkedIn's efforts to stop it from using the data are tortious interference. That's bold -- suppose someone is taking your assets (you believe illegally) and selling them to others -- can you imagine the gall that the person taking your assets can sue you for interfering with their subsequent sale of your assets?

Finally, that LinkedIn resorted to using the computer fraud and anti-terrorism statutes to make their argument is ridiculous.

So much craziness to go around. I would've just tossed the case, but I guess there is the whole bit about due process... Maybe HiQ will fail anyway at the next substantive trial, but what a waste of time.


> suppose someone is taking your assets

Except that, in the digital sense, it's only copied. They now have it, but you didn't lose your assets or money besides the <$0.001 it costs to serve each web page.

> So much craziness to go around.

I agree - I haven't read through the entire thing, but it looks like, instead of saying "you can't scrape", they could implicitly give a license to users for personal and business use, but disallow reselling of the data (of course carefully worded to still allow the likes of recruiters to do so). It's like trying to argue that the DMCA says you can't create a torrent file of some movie.


Would that ruling mean that sites could no longer refuse to show content based on how they're accessed? For example, sites that won't load if the browser is in headless mode, or sites that depend on javascript as a way of blocking wget/curl.

I have a scraper for a site that used to offer an API for their publicly available site but removed the API with no warning. The info is still available to the general public, but only through their website. I created a scraper for the public page, but shortly after, they switched to loading some public information through JavaScript so my HTML scraper couldn't see it anymore. I ended up having to write an application around Selenium to load the JavaScript and import this public information. I'm just waiting for them to start randomizing the CSS classes to make scraping even harder. The content is static: even as data changes on the server, it does not refresh on the page unless you reload the page.
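For what it's worth, the Selenium workaround described above usually boils down to loading the page in a headless browser and waiting for the JS-rendered element to appear. A hedged sketch (the URL and CSS selector would be site-specific; Selenium is imported lazily so the snippet still loads where it isn't installed):

```python
def scrape_rendered(url, css_selector, timeout=10):
    """Load `url` in headless Chrome, wait until `css_selector` matches,
    and return the text of every matching element."""
    # Imported here so the module loads even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JS has actually rendered the element we want.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector)))
        return [el.text
                for el in driver.find_elements(By.CSS_SELECTOR, css_selector)]
    finally:
        driver.quit()
```

The explicit wait is the important part: scraping immediately after `get()` returns is exactly what breaks once content loads asynchronously.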

There is no reason why your page should refuse to load plain text without Javascript enabled.


> There is no reason why your page should refuse to load plain text without Javascript enabled.

On a technical level, sure. SPAs should pre-render data before sending it to the client.

The problem is that's a ton of extra work when the client will have to fetch data anyway - so it's difficult to justify the time to management.

EDIT: If their page fetches the data with JS, you might actually have an easier time figuring out what their API looks like instead of scraping the rendered page. You might find there's more data available than is rendered, too.


I once showed someone how to use dev tools to see an API request a web app makes and how to recreate it with your own code. He’s since done all sorts of “life hack” automations. Love teaching people this stuff!
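That dev-tools trick generalizes: once you've seen the request a page makes in the network tab, you can often replay it directly and get clean JSON instead of scraping rendered HTML. A generic stdlib sketch (the example URL and headers are placeholders, not any real site's API):

```python
import json
import urllib.request

def fetch_json(url, headers=None):
    """Replay an API request observed in the browser's network tab."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Hypothetical usage, copying the URL and headers dev tools showed:
# fetch_json("https://example.com/api/items?page=1",
#            headers={"Accept": "application/json"})
```

In practice you'd also copy over whatever auth headers or cookies the browser sent, but the shape of the trick is just this: reproduce the request, skip the rendering.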

> There is no reason why your page should refuse to load plain text without Javascript enabled.

Sure there is. You prefer writing javascript and you want to serve your site through a CDN.

You might not think that's a good reason, but that's certainly a reason.


Until the ADA comes along and demands you create a site accessible to the blind.

I've often wondered when the laws would start to be applied, and I think it's coming.


> It is a common misconception that people with disabilities don't have or 'do' JavaScript, and thus, that it's acceptable to have inaccessible scripted interfaces, so long as it is accessible with JavaScript disabled. A 2012 survey by WebAIM of screen reader users found that 98.6% of respondents had JavaScript enabled. [0]

and that was 7 years ago. There may be certain complicated interactions that are a challenge for screen readers but simply because a page relies on JavaScript for rendering doesn't automatically mean it is inaccessible to screen readers.

[0] https://webaim.org/techniques/javascript/#reliance


I have a website that's a full page map. I care about accessibility - is there any way I can make this meaningfully accessible to the blind?

Look at WCAG 2 (the Web Content Accessibility Guidelines); they specify tags and elements common screen readers will understand to help make your site accessible.

This is a really good resource: https://accessibility.18f.gov/

A lot of frameworks now have some accessibility built in if you add the correct attributes.


I suspect you're being glib, but you could look at https://wiki.openstreetmap.org/wiki/OSM_for_the_blind

I'm confused. Why would a blind person be any less likely to use JavaScript?

To test for a blind person's ability to render your website, a good method is a CLI browser. Neither the blind person's device nor a CLI browser will render JavaScript.

I’m no expert in this space, but if free, common, and easily accessible tools render your site readable and usable, I’m not sure how refusing to use those tools would be a claim under ADA.

There are plenty of real ways that sites are unusable by screen readers, using Javascript to download dynamic content shouldn’t be one of them.


Why on earth would you think that? The screen readers tie into modern browsers like Safari or Chrome.

What does a CDN have to do with it?

Because if your application bundle is a fixed asset, like a JS SPA that fetches its data from an API, then you can distribute your entire application via an inexpensive CDN.

As soon as your application bundle is rendered on your servers dynamically, then only part of your site can be delivered via CDN.

Basically, going all-JS gives you an app model where your server-side code doesn't even know or care about HTML or the web. It just pushes JSON or whatever around and is largely client-independent.

Great model when you need to support iOS, Android, web, desktop.


Using a CDN like they're talking about likely means your HTML is static and only served from the CDN.

For many SPAs the only actual HTML is a header, a container div, and a call to the app's JS. Ignoring the header, there might only be, say, half a dozen lines total.


Chrome headless does that really well.

> google-chrome --headless --run-all-compositor-stages-before-draw --virtual-time-budget=25000 --print-to-pdf='foo.pdf' URL

Edit: Plus

> pdftotext -raw foo.pdf


> There is no reason why your page should refuse to load plain text without Javascript enabled.

Of course there is! It’s my site and I can do what I want with it.


Captchas would be another technology that might fall under that "technical barriers" terminology. I don't do much scraping, but I think most of us would enjoy never having to "identify the traffic lights in these photos".

I would very much hope so. These sites are a perversion of the idea of the Web, and only get away with it because fundamentals of the Web were created with friendly, cooperative users in mind - not with commercial users of today - so there's no enforcement of good behaviour.

We don’t know yet. It would depend on the specific legal question that is decided in the case. Courts usually try hard to constrain the law to as few questions as they need to in order to resolve a dispute.

That’s awesome news. Thanks also to the EFF for all the work they are doing to ensure fair use is still a thing. We’ll (https://serpapi.com) be donating next year.

This is actually bad; wouldn't it be better if sites were allowed to block crawlers? I don't see the legal basis for forbidding sites to ban scrapers. Is there a law that a site must serve pages to anyone?

There is no legal basis for "forbidding to ban scrapers".

The question is whether there is any legal basis for banning scrapers, i.e., for blocking hiQ. In other words, if hiQ keeps scraping, are they violating anyone's rights and/or breaking the law by doing that?

As long as that remains a legitimate, open question, then hiQ can argue they should be allowed to keep scraping without incurring civil or criminal liability. That is the purpose of the injunction. There could be no legal basis for blocking hiQ. Until that question is resolved, hiQ can keep on scraping.


Parent was downvoted but I think they have a point. This sounds overbroad, to an extent that I'd worry will get the whole ruling tossed out by SCOTUS.

On the other hand, if the ruling stands, it sounds like it will finally be possible to do useful things with Craigslist.


Because they are exposing their website to the general public. If you don't want the public to have access and be able to scrape it, don't make it public facing.

Yes - for instance if your e-commerce site banned people from visiting based on whether their zip code made it more likely they were of a certain racial group, you would be running afoul of the law.

You can redefine what is 'public', or require login to see that information anyway.

If anything, the ruling could just push websites to hide information even deeper.


Why? If your data is publicly accessible, what is the difference between using scripts and someone hiring a ton of people in a third-world country to copy & paste your content?

Considering the kind of private scraping and selling tactics LinkedIn has been chronically guilty of (and not just the ordinary "growth hack" stuff: "LinkedIn violated data protection by using 18M email addresses of non-members to buy targeted ads on Facebook" [1]), it's satisfying to see LinkedIn lose this.

[1] https://techcrunch.com/2018/11/24/linkedin-ireland-data-prot...


So the champion of the public internet turns out to be a company that scrapes your social media, MLs it and sells the results to your HR dept?

I'm reminded of Dave Chappelle's Halle Berry routine...


In short, even if some users retain some privacy interests in their information notwithstanding their decision to make their profiles public, we cannot, on the record before us, conclude that those interests—or more specifically, LinkedIn’s interest in preventing hiQ from scraping those profiles—are significant enough to outweigh hiQ’s interest in continuing its business, which depends on accessing, analyzing, and communicating information derived from public LinkedIn profiles.

Reasonable. If a platform helps you make an individual's information public, then why should it matter to the platform how the market uses that public information?


On how this case relates to the CFAA:

We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required. Put differently, the CFAA contemplates the existence of three kinds of computer information: (1) information for which access is open to the general public and permission is not required, (2) information for which authorization is required and has been given, and (3) information for which authorization is required but has not been given (or, in the case of the prohibition on exceeding authorized access, has not been given for the part of the system accessed).

Public LinkedIn profiles, available to anyone with an Internet connection, fall into the first category. With regard to such information, the “breaking and entering” analogue invoked so frequently during congressional consideration has no application, and the concept of “without authorization” is inapt.


Volokh take is an interesting read [1]

I am curious how quickly most pages will get put behind authorization. With the wording of this ruling you could pretty much go snap up any blog site (say, Medium) and more. I wonder what kind of services would come out of that, having the data in a format where it can be more easily parsed/analyzed.

so every ecommerce site is fair game? I assume most are already being scraped but I cannot imagine having to be in an environment where many of your connections are not people

[1] https://reason.com/2019/09/09/scraping-a-public-website-does...


Hmm - I think a key in the ruling here was that LinkedIn maintains no copyright claim on these pages. Users on LinkedIn retain ownership of their profile data. Compare that to a blog and maybe copyright could come into play? Not a lawyer just thinking out loud...

RIP Aaron Swartz

Even if LinkedIn loses and scrapers can no longer be blocked, they have already switched to putting all profiles behind an authwall, or at least it's very hard not to get an authwall. So could hiQ even carry on if they won anyway?

I'm not very familiar with either LinkedIn or hiQ, but what would be the problem with logging in before scraping?

The reason the pages are public to begin with is that Google will only scrape public pages for search indexing. LinkedIn wants to provide the pages ONLY to google, so they tried telling hiQ to stop scraping without any physical blockers (so as to not impede google's scraper).

If LinkedIn loses this case they (and others) might try to get Google to change their policy (either use auth or some whitelisted IP addresses or something).


Best hope that hiQ prevails. The slope slips very fast without LinkedIn's defeat. If LinkedIn prevails, the "EULA" has the force of criminal law and not just an agreement that lacks the meeting of the minds.

This is a weird case, as it turns the question of scraping on its head. Normally you'd think "am I allowed to scrape?", but instead the question becomes "am I allowed to prevent scraping?".

Anyways, I disagree with the court's judgment here. The users have consented that their data be used in accordance with LinkedIn's privacy policy. Even if it is publicly posted does not mean that the user has relinquished control over their personal information for another company to do with as they wish.


eBay had better lawyers than LinkedIn:

https://casetext.com/case/ebay-v-bidders-edge

I'm glad this court ruled it wasn't a violation of CFAA. But using trespass to prevent it seems reasonable. A private business should be allowed to restrict certain kinds of use of its resources (servers, bandwidth, etc), especially if it is beyond typical use. But if the load is typical and doesn't actually harm LinkedIn, it seems less reasonable to restrict them. If LinkedIn doesn't want automated access to their data because it is too much of a load on their servers, then they should be required to ban ALL automated access, including Google's bots. Of course they want Google's bots because that sends them traffic.

Another reason I think it was stupid for LinkedIn to use the CFAA is that it sets them up to be a protected computer system, with protected information. If that is the case, it seems they could be liable for disclosing the information to someone a user didn't want, like a stalker. It's rather dumb: LI is claiming they host protected information, but it is only protected against someone that might compete with them.


> If LinkedIn doesn't want automated access to their data because it is too much of a load on their servers, then they should be required to ban ALL automated access, including Google's bots. Of course they want Google's bots because that sends them traffic.

By that logic, I should have access to LinkedIn premium features for free. Why should LinkedIn give more data to people who pay?

