Hacker News
9th Circuit holds that scraping a public website does not violate the CFAA [pdf] (uscourts.gov)
1136 points by donohoe 10 days ago | 275 comments





This action does more than that. The court left the preliminary injunction against LinkedIn in place: "The district court granted hiQ’s motion. It ordered LinkedIn to withdraw its cease-and-desist letter, to remove any existing technical barriers to hiQ’s access to public profiles, and to refrain from putting in place any legal or technical measures with the effect of blocking hiQ’s access to public profiles."

So LinkedIn is prohibited from blocking hiQ's access by technical means. That's a strong holding. If this case is eventually decided in favor of hiQ, scrapers can no longer be blocked. Throttled a little, maybe, but no more than other users doing a comparable query rate.


Not allowing the CFAA to be (ab)used to attempt to make scraping illegal makes sense.

However, how is it reasonable to force a web site to serve its contents to a third-party company, without being allowed to make a decision whether to serve it or not? Serving the web site costs money, and the scraper surely isn't going to generate ad income...


Ugh, yeah, the more I think about this ruling, the less I like it.

It's actually pretty insane to force a site to serve content. I think both parties are in the wrong here - HiQ for assuming they're entitled to receive a response from LinkedIn's webservers, and LinkedIn for abusing the CFAA to try to deny service rather than figure out a technical solution to their business problem.

In my view:

* The data is public, and free of copyright. If you're a scraper and can get it, you haven't done anything wrong.

* The servers serving the data are still under LinkedIn's control, and they have no obligation or public duty to always serve that content. They could just as well block you based on your IP or other characteristics. If they want to discriminate and try to only let Google's scrapers access the data - what's wrong with that? Scraper brand is not a protected class. Tough taters if your business model "depends" on your ability to successfully make requests to another uninvolved company's webservers.

If I were the judge, I'd throw this out and let LinkedIn/HiQ duke it out themselves - they deserve each other.


I would argue that under the spirit of net neutrality you either serve your site to everyone equally (the public-facing part) or to no one.

Hosting costs money, servers cost money... but maybe create a public-facing API that is way cheaper and easier to use than scraping your website? I see the ruling in a positive light: it might promote more open and structured access to public-facing data.


> under spirit of net neutrality you either serve your site to everyone equally(the public facing part) or to no one

Huh? Net neutrality isn't about the server or client... it's about the network operator in between them.


I suspect Xelbair is making a more expensive definition of net neutrality, taking as a basis the one that says it's about network operators only.

I think you wanted to say more expansive? But it's definitely also more expensive. :D

Yes, I meant expansive. Oops.

That was the case, hence the reference to the "spirit" of net neutrality.

Public-facing internet sites, in my opinion, should be treated the same way as public space: anyone should be free to read, and to write down in their notepad whatever is there, the same as anyone else.

Scraping a public-facing website is, in my opinion, a huge waste of resources. It would be cheaper (in total) to build an API that serves the data than to build a good scraper.


Net neutrality is more about nondiscrimination in routing content from a provider to a user, rather than forcing content providers to serve everyone regardless of conduct. It's entirely reasonable for a site to discriminate who they wish to allow to access their data (whether technically their copyright or data they caretake).

That being said, if you provide data to the public, you don't get to invoke the CFAA to plug the holes your content discrimination code doesn't fill.


Why should you be forced to serve content to people who won't look at your ads?

Like disabled users with screen-readers?

I suppose we can give them a pass if they solve a bunch of captchas.

Anyone is free to put up a paywall and deny access to people who don't pay.

But LinkedIn is apparently happy to let Googlebot and bingbot scrape public profiles. If they want to do that, they can't argue that their policy is to block bots that don't click on ads. Treating Googlebot differently from other visitors is probably a violation of Google's policies, too. They can't have their cake and eat it too.


From reading the opinion, I think the argument goes something like this:

> First, LinkedIn does not contest hiQ’s evidence that contracts exist between hiQ and some customers, including eBay, Capital One, and GoDaddy

> Second, hiQ will likely be able to establish that LinkedIn knew of hiQ’s scraping activity and products for some time. LinkedIn began sending representatives to hiQ’s Elevate conferences in October 2015

> Third, LinkedIn’s threats to invoke the CFAA and implementation of technical measures selectively to ban hiQ bots could well constitute “intentional acts designed to induce a breach or disruption” of hiQ’s contractual relationships with third parties.

> Fourth, the contractual relationships between hiQ and third parties have been disrupted and “now hang[] in the balance.” Without access to LinkedIn data, hiQ will likely be unable to deliver its services to its existing customers as promised.

> Last, hiQ is harmed by the disruption to its existing contracts and interference with its pending contracts. Without the revenue from sale of its products, hiQ will likely go out of business.

> LinkedIn does not specifically challenge hiQ’s ability to make out any of these elements of a tortious interference claim. Instead, LinkedIn maintains that it has a “legitimate business purpose” defense to any such claim. ... That contention is an affirmative justification defense for which LinkedIn bears the burden of proof.

So the real situation is that you can't start blocking access you knew about, in a way that would interfere with third-party contracts, without a legitimate business reason to do so. The burden of proving the legitimacy of that business reason is on you.

edit: TLDR;

> "A party may not ... under the guise of competition ... induce the breach of a competitor’s contract in order to secure an economic advantage."


That’s quite ... crazy.

Be restaurant. Be on Deliveroo. Be getting low margins because of high fees.

So basically you can't decide not to use Deliveroo any more to improve margins ("secure an economic advantage"). I mean, you can cancel Deliveroo, but only as long as you're not "inducing a breach of their contract". So it's only a matter of time before Deliveroo writes a contract saying "we're obligated to deliver food for you from said restaurant".


Choosing not to use a middleman any more so that you can secure higher margins sounds like about the clearest example of a "legitimate business reason" imaginable. The purpose of the act is to immediately increase your margins, not to hurt Deliveroo because you don't want their competition.

That's very different from the case in question, where LinkedIn's motive for cutting off hiQ's access is to inflict damage on hiQ because they are a potential competitor.


I would imagine that if you contract with Deliveroo, they have some terms that say that you need to give notice when cancelling?

I don't know Deliveroo, but I think a better analogy would be if you suddenly, even though it is not causing you trouble, denied access to someone picking up food that you didn't contract with, with the full knowledge that the someone would be in big trouble with their customers.


IANAL, but I think you're misunderstanding "without a legitimate business reason to do so"

"Be Restaurant" blocking Deliveroo because they can't continue operating with the loss of revenue due to high fees is a legitimate business reason. "Be Restaurant" blocking Deliveroo 2: Electric Boogaloo because I don't like their owner, but continuing to allow Deliveroo access would be, presumably, disallowed.

Also there's nothing stopping "Be Restaurant" from offering an exclusive delivery contract to Deliveroo and forcing Deliveroo 2 out, or requiring a minimum fee for all delivery services, Deliveroo and Deliveroo 2 included.

Of course, I think this is all in a very different area from a restaurant; we're talking about a service provided on the internet. I believe LinkedIn has many, many other recourses here, but, as I see it, the courts are just telling them, this aint it chief.


So, if you want to block someone from your service, you need to be able to prove that it is for a legitimate business purpose.

Moreover it seems, 'this harms a competitor of ours' is not considered a legitimate business purpose, but anti-competitive behavior.


Why does there need to be a legitimate business purpose? What about freedom of speech? It's my website and I'll publish what I want to.

Eh, I think you got this backwards. If you really want to talk about this in terms of freedom of speech, LinkedIn is in the act of censoring?

Edit: What I mean is that freedom of speech is not the same as freedom of censoring.


> What I mean is that freedom of speech is not the same as freedom of censoring.

This is at least not quite true of First Amendment law. The concept of "compelled speech" exists in US law, and is considered an unconstitutional violation of the First Amendment. Exactly what falls into that category (and whether the right of domain owners to censor user-provided content as they see fit is protected), I'm not sure, but freedom of speech in the US certainly does at least sometimes include the right not to speak.


Yes, the court was right to block LinkedIn's abuse of the CFAA. But the court was wrong to say that LinkedIn must show hiQ the same website as LinkedIn shows everyone else.

The ruling doesn't seem to say that they can't throttle access, at least.

The data are certainly not free of copyright. A profile can contain a user's picture, or even a small essay describing the user's job or life, for which LinkedIn is not the copyright holder. Moreover, these are personal data, and I'm not sure the scraper has the original user's permission to collect them. In Europe, the scraper may face issues related to GDPR.

Facts can't be copyrighted, so such things as whether or not a person worked for a certain company, or went to a certain school, are unprotected, and with this ruling can be scraped, at least in the U.S. Others things common on LinkedIn, as you rightly point out, are protected--but by copyright law, not the CFAA. So a scraper acting in good faith would have to be careful about what they used if they wanted to respect copyright, but it's a separate issue from this ruling.

http://www.dmlp.org/legal-guide/works-not-covered-copyright


This is exactly right. Copyright protects creative expression, not pure fact. Famously, phone books (remember those?) are basically not copyrightable except for the ads, because they're just lists of data. Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991).

I never said that facts can be copyrighted; I said that most of the things people put in their profiles can be. I was responding to the claim made above that the data were not under copyright. If you just scrape name, company, and position, that's fine, but I highly doubt that they just do that. This lawsuit could have tons of side effects.

I think what hiQ does is to predict whether a particular employee is about to quit.

So the interesting question to me is whether you can lawfully make predictions based on published information if that information is under copyright.

In Europe the answer is probably no, because the assumption is that in order to analyse data you have to copy it first.

To me, this interpretation of the term "copying" makes very little sense. So I wonder what US law makes of it.


Europe has database rights, which has a fair dealing exemption for data analysis.

I'm not sure what "database rights" refers to specifically, but the whole matter is actually rather complicated, because the EU copyright directive has a lot of optional exceptions that member states may or may not adopt.

Most of these exceptions only apply to non-commercial use though. So they wouldn't apply in a case like hiQ.

UK specific exceptions are explained here:

https://www.gov.uk/guidance/exceptions-to-copyright

Unfortunately, both Labour and the Tories have taken a relatively hard line in the EU copyright negotiations, so it seems unlikely that things will be relaxed very much after Brexit.


Database rights are a copyright-like intellectual property regime for databases.

"Facts can't be copyrighted, so such things as whether or not a person worked for a certain company, or went to a certain school, are unprotected"

There's an infinite number of ways to describe a job history, without any single standard, so I don't think it makes any sense to say that a profile or resume is not copyrightable.


Isn't the issue one of being selective about who can view the content? If I, random Joe User, view the publicly available content, you have no issue. But if someone scrapes that data, then you'd want to charge them. Unless I click on the ad, the act of using your bandwidth doesn't change based on who the viewer is. You'd want to apply fees based on the future use of the data rather than on your actual costs.

I'd assume if you weren't signing up, you'd probably look at like 10 profiles tops. A scraper is more than likely going to run through anything and everything it can grab links to (provided it doesn't leverage a very specific filtering mechanism for selecting profiles to scrape).

I could see the hit from a scraper being heavier than that of a typical user. There's also the potential that a user is going to click an ad for any number of reasons, there isn't that likelihood the scraper will.

I'm not anti-scraping by any means, but I get the concerns.


Presumably you'd be allowed to limit a scraper to a standard user bandwidth, and a standard user access - X links per day, Y bandwidth.
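A per-user quota like that is easy to sketch. Below is a minimal token-bucket version in Python; the daily limit is a made-up number, and keying clients by bare IP is an oversimplification (NAT, proxies), so treat it as an illustration only:

```python
import time
from collections import defaultdict

LINKS_PER_DAY = 1000                    # hypothetical "X links per day"
REFILL_RATE = LINKS_PER_DAY / 86400.0   # tokens regained per second

class QuotaTracker:
    """Token bucket: every client, human or scraper, gets the same budget."""

    def __init__(self, now=time.time):
        self.now = now                  # injectable clock, handy for testing
        self.tokens = defaultdict(lambda: float(LINKS_PER_DAY))
        self.last_seen = {}

    def allow(self, client_ip):
        t = self.now()
        elapsed = t - self.last_seen.get(client_ip, t)
        self.last_seen[client_ip] = t
        # Refill gradually, capped at the daily budget, then spend one token.
        self.tokens[client_ip] = min(float(LINKS_PER_DAY),
                                     self.tokens[client_ip] + elapsed * REFILL_RATE)
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False
```

The point is that a scraper crawling at a normal human rate never hits the limit, and nobody is singled out by identity.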

Surely the action is "if you display stuff in public you can't segment the public".

You're not obliged to have public access.

Is there perhaps a factor here of users having an expectation that their profile is publicly accessible; so companies hosting that profile shouldn't be able to choose _secretly_ "who" can access it?


You're inconsistent, and so are the courts and most comments here. Either you favour such conflicts to be decided by technological might, or by the clearly expressed will of the content publisher to have binding effect.

If you consider scrapers to have some sort of right to access any public website, any technological barriers inflict exactly the same harm as an injunction, assuming it is effective. IF you allow technical blocking, it would be preferable to allow blocking-by-clearly-stated-wish, because it would save everyone the costs of the arms race. It would also make both parties' success somewhat independent of the resources they can invest into outgunning their opponents.


> However, how is it reasonable to force a web site to serve its contents to a third-party company, without being allowed to make a decision whether to serve it or not?

Your statement makes absolutely no sense. That's not how the internet works. If you serve something publicly, you don't get to cherry-pick who sees it.

Not only does it make no sense technically, it's also a huge anti-competitive case.


It makes sense and it is how the internet works. Servers cherry pick who sees their content all the time. Scrapers are often blocked, as are entire IP address ranges. Things like Selenium server scrapers can be (approximately) detected and often are denied access.

I’m not sure about being anti-competitive. Serving a website is an action in which you open up your resources for others to access. My friend runs an open source stock market tracking website for free. He started getting hit with scrapers from big hedge funds and fintech companies a couple of months back. This costs him around $50-100 a month to serve all of these scrapers.


If he gives them a stable, fast API with a subscription fee, and the scrapers are truly from hedge funds, he’s going to make a lot more than $100/mo.

He should open up a Patreon, tip jar, something to get that funded.

Could also delay results, offer reduced temporal precision and other things to differentiate use cases.


He and I both have similar free open source websites with donate buttons. They are rarely clicked. Ad revenue over a month for me has been ~$400 while donations over two years have totaled $20. There are about 80,000 unique visitors per month.

It is nice to think donation platforms can fund high traffic open source projects, but this is simply not the case.

In any regard, I fear the potential of this ruling limiting developers’ ability to protect their servers and making us all roll over to the big players with their hefty scrapers taking all of our data for resale.


How long are you allowed to delay results? I mean, not serving results at all is just delaying them forever, but that's out. Can I delay serving results longer than Chromium's default timeout?

Probably up to the point where a judge says 'this is blocking not delaying'.

I don’t see what legal or technical argument you’re making.

Technically, of course you can identify IP ranges owned by certain entities and restrict their access. That’s trivial, so what do you mean when you say the internet doesn’t work like that?

Legally, there’s plenty of region locked content for copyright and censorship reasons. A distributor might region lock because they don’t have distribution rights in particular regions. Are you saying distributors can’t publish free content at all because they can’t choose who sees it but would be breaking copyright law to publish to everyone? Or a site might region lock because certain content is censored in particular countries. Can you not publish anti-regime articles because a totalitarian country is on the Internet?

The entire world isn’t and shouldn’t be held hostage to the most restrictive laws that exist in the world. And the answer isn’t blocking on the requesting end because that’s technically much harder and blocks much, much more content. So what am I missing?

Edit: Forgot to include the other end of the spectrum. If I, as an individual, host my own site on my own hardware with my own connection that I pay the bandwidth for, can I deny a suspected bot network?


Of course you get to choose. You can reject requests based on their user agent, their IP address, the owner or likely geographic location of the IP address, and many other possibilities.
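For illustration, a minimal sketch of that kind of filtering in Python; the blocklists are hypothetical (the IP range is a documentation-reserved one), and a real server would do this inside its HTTP stack:

```python
import ipaddress

# Hypothetical blocklists; real ones would come from logs, WAF feeds, etc.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_AGENT_SUBSTRINGS = ["python-requests", "scrapy", "curl"]

def should_serve(client_ip: str, user_agent: str) -> bool:
    """Reject requests from blocked IP ranges or self-identified scrapers."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return False
    ua = user_agent.lower()
    return not any(s in ua for s in BLOCKED_AGENT_SUBSTRINGS)

print(should_serve("203.0.113.7", "Mozilla/5.0"))            # False: blocked range
print(should_serve("198.51.100.1", "python-requests/2.22"))  # False: scraper UA
print(should_serve("198.51.100.1", "Mozilla/5.0"))           # True
```

Both signals are trivially spoofed, of course, which is why this only ever works as a speed bump.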

What are these possibilities? You only get the IP and whatever client-side information the client is _willingly_ sending you. So if a script/user/bot/etc. says it's Firefox at 1.2.3.4, all you know is that it's a request from 1.2.3.4 that claims to be Firefox. You can ask it to run JavaScript code, but that's beyond classic web interaction, and then again you need to trust the client.

This interaction can't be made trustless, so every client can only be served based on its IP or some convoluted, hacky exchange that is a cat-and-mouse game at best.
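The point about the client choosing what it sends is easy to demonstrate with Python's standard library: the User-Agent is just a header the sender sets to whatever it likes (the URL below is a placeholder, and nothing is actually fetched):

```python
import urllib.request

# Build a request claiming to be a mainstream browser. The server sees only
# this self-reported string, not what the client actually is.
req = urllib.request.Request(
    "https://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))  # the header the server would receive
```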


LinkedIn’s public facing content is exactly that: public. This ruling merely says accessing public content isn’t hacking and so LinkedIn cannot use the CFAA as discriminatory weapon to limit access to that public facing content.

If LinkedIn wants to block access they need to do so by another means that isn’t described as hacking.


Does their robots.txt say don't crawl this part of the site? If it does, this ruling is catastrophic. If it doesn't then there is hope.

> If it does, this ruling is catastrophic

I know it is a generally considered bad form to ask, but did you read much of the ruling? I feel like a lot of people on this thread are just going off of Animats' comment and haven't spent much time looking at the opinion.

I didn't read the whole thing, but skimmed through it and read what seemed to be the relevant parts of the argument. (Including the bit that talks about LinkedIn's robots.txt)

The ruling doesn't really support your claim of catastrophe and doesn't claim to pass any sort of final judgement.

The judge makes a specific point about not reading too much into him upholding the injunction saying:

>> These appeals generally provide “little guidance” because “of the limited scope of our review of the law” and “because the fully developed factual record may be materially different from that initially before the district court.”


Compliance with robots.txt is and always has been voluntary, and many crawlers have long ignored it, including Archive.org.

Do any scrapers actually pay attention to robots.txt?

It's not forcing anything. Don't make a page public then? If a page is public then it is fair game.

This second part is pretty stupid. However, now that we're at this point, LinkedIn still has the ability to decide which of its information is public and which is not. By making all of its information private, it can take back control.

I think the title is wrong: they are holding that LinkedIn cannot block hiQ specifically from viewing public data.

Which seems fair: it's public or it's not. You can't pick and choose who it's public for and who is a second-class citizen.


it's not the scraper's fault that their business model incorrectly assumed profitability through ads in a way that did not foresee compliance with future anti-scraper-discrimination laws.

it's a good point you bring up, and may contribute to the death of ads.


Does this prevent Google from returning captchas if you use a robot to scrape the search result pages, as they currently do?

I mean, it should. That is also a hugely anti-competitive action that just isn't pursued by anyone yet: Google makes money off scraping and denies scrapers the ability to scrape it - that's all sorts of messed up.

The problem is that someone would have to sue google first and no one will do that unless there's big business incentive and big business can already scrape the shit out of google.

This is the weird thing about web-scraping. Big companies can get around protections quite easily - it's the small scripts and average users that get hurt by them. No one is going to tell you this because people would stop buying Cloudflare's "anti-bot 99% effective anti-ddos money saving package" which is complete bullshit.


Google respects robots.txt, so it might be hard to prove that they are accessing websites without their implied consent.

Also, their own robots.txt contains "Disallow: /search". So, there is arguably no inconsistency, either.

But, what does this new ruling mean for robots.txt?
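That Disallow: /search line can be checked the way a compliant crawler would, with Python's standard-library parser; the rules are inlined here for illustration rather than fetched from Google:

```python
import urllib.robotparser

# A toy robots.txt mirroring the relevant part of Google's:
rules = """\
User-agent: *
Disallow: /search
Allow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("mybot", "/search"))        # False: disallowed for all agents
print(rp.can_fetch("mybot", "/profile/jane"))  # True: everything else is allowed
```

Honoring the answer is entirely voluntary, though; the parser only tells you what the site asks.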


I think the OP is getting at the idea that the nature of the relationship is kind of imbalanced. Consider that basically most of their website is off limits: https://www.google.com/robots.txt

You can put your website off limits to Google by having a robots.txt too. What's the problem? People willingly want Google to index them so they appear in search results and Google can send them traffic.

Google only scrapes sites that allow it by their robots.txt file so I don’t think their policy is as hypocritical as you are making it sound.

This is true, but only technically. Google won't actively scrape anything disallowed in robots.txt, but those resources can still be indexed if found in the many other ways Google aggregates data, all of which is automated.

Robots.txt isn't something that bars access to information. It's just a notice that the administrator does not want large amounts of queries against certain resources.


>Robots.txt isn't something that bars access to information. It's just a notice that the administrator does not want large amounts of queries against certain resources.

Many times, robots.txt files are implemented with the intent of barring access to information.

This works by relying on scrapers respecting the file, but it's no different than a no-loitering sign which itself cannot actively stop someone who is loitering.

Google doesn't have a Robots.txt disallowing search because it can't handle a large amount of queries against a resource...


They still scrape and index sites blocked by robots.txt, but they often don’t display those sites in their SERPs (but sometimes they still do)

Never seen it in any logs in 20 years.

Do you have a source for that claim?


Looks like I'm wrong. They will rank pages blocked by robots.txt, but won't crawl them.

https://www.google.ca/amp/s/www.searchenginejournal.com/goog...


Google has an effective monopoly on search engine market - you _can't realistically_ block google from scraping your website. They have this power and they are abusing it.

Also, robots.txt is bullshit: if a person can access a public website, why shouldn't an automated script? Technically speaking, it's the same thing.


It really isn't.

Imagine you have a piece of information that your neighbors in town come to you for every once in a while. They come over every now and again, ask you for it, maybe even bring you cookies for the trouble, and you provide it.

Then there's Ted.

Ted is insatiable. He hounds you at every minute of your day, constantly asking you the same question over and over. You've done everything you can. You tried to reason with Ted. You tried to contact whoever it is that brought Ted to your neighborhood. You even got so desperate, you moved a few houses down to escape the incessant hounding. That only worked for a little while though, and Ted found you again.

So you tried to stop answering the door; no use, he pokes his head in every time you're in the garage. You demanded people identify themselves first. Oh well, it's changed little. Now he just names himself gibberish names before hounding you for the same things over and over again.

This would not by any stretch of the imagination be acceptable behavior between two people. The main factor in determination for a court injunction would likely be physical trespass, or public nuisance; but no digital equivalent exists currently other than the CFAA, in the sense that in as much as one can prove that the access to the system is inconsistent with the intent or legal terms of providing a service, one may seek relief.

The problem is, LinkedIn has failed to make a convincing argument in the eyes of the appellate court that hiQ's Ted is violating the CFAA while LinkedIn has proactively engaged in activity disrupting hiQ's ability to do business; business which was consistent with the service granted to unknown members of the public at large.

In the Court's eyes, from the sounds of it, it appears LinkedIn is doing the greater harm.

What it looks like to me is this is setting up a common law framework that is going to cause website/service providers to have to choose from the get-go what their relationship to the Net is.

Are you just an operator providing a service over the Net to a limited, select clientele bound by very specific terms of service? Then you may have a leg to stand on in fending off malignant Teds, but your exposure and onboarding will need to have concomitant friction to make the case to the Court that these Teds were never meant to be serviced in the first place.

Or, are you providing a public content portal, meant to make things accessible to everyone, with minimal terms? In which case, no legal Ted relief for you!

Just because it is your "system" and it isn't connected to your nervous system does not mean it isn't capable of being harmed, or of inflicting harm on someone else through careless muckery.

The one thing that disturbs me most is how the Court has disregarded the chilling effect that interpreting a duty to maintaining visibility may incur. A First Amendment challenge may end up being the inevitable result of this legal proceeding.


All true.

Can we also be allowed to view people's profiles without being forced to sign in to LinkedIn? They lost my trust many years ago with their shady practises and 'dark patterns', so I don't want to share any of my data with them.

I however sometimes want to look up people. Or would this be a case of wanting to have my cake and eat it?


There will be a 3rd party service for this once the scraping is legally allowed.

"If this case is eventually decided in favor of hiQ, scrapers can no longer be blocked."

If the case is decided in favour of hiQ, then, absent an injunction, what would prevent a website from blocking a scraper? Maybe the website could still block unless and until the scraper gets her lawyers to file an injunction.

Another interpretation is that if hiQ wins, then in the 9th Circuit's jurisdiction, websites serving public information they neither own nor exclusively license may no longer try to use the CFAA and/or copyright law to threaten scrapers.


Scraper bots have genders now?

the scraper = the person doing the scraping

Scrapers generally, sure. Not sure about scrapers on other social-media sites like Facebook, though.

The question being: if just having access to the network isn't enough to grant you access to the data of a specific profile, but instead you have to aggregate samples from a bunch of people in the network in order to see "through their eyes" to the data on the profiles of their friends and friends-of-friends, is that allowed?

Because, if even that was allowed, that'd surely open a different kind of floodgate.


If you have to login to access the data, that data isn't publicly available, and unlikely to be protected by this decision.

Page 31 of the filing specifically differentiates this case from one regarding Facebook:

"While Power Ventures was gathering user data that was protected by Facebook’s username and password authentication system, the data hiQ was scraping was available to anyone with a web browser."


The ol' DMCA trivial encryption switcharoo. I'd expect a lot more sites and their data to require a login now, to the detriment of the data being publicly indexed.

Wait, doesn't LinkedIn have pretty strong authentication safeguards for viewing user profiles? If you google someone and click their LinkedIn profile without being logged in to LinkedIn, you're always directed to sign up or log in.

Twitter is the only example I can think of among the large social media sites that doesn't require you to be logged in to see profiles


It's not authentication safeguards, it's dark UX patterns driving you to signup. Whether or not you'll see someone's profile and how much of it you'll see depends on how they fingerprint you, where you navigate from, and on the phase of the Moon. LinkedIn is notorious for that.

My memory might be misleading me here, but I think I remember years ago in one case I could see less info about a profile when logged in to my account than I could see while logged out and navigating from a Google search...


Nope. You can see public profiles without being logged in. Certainly the case here in the UK; not sure about elsewhere.

Even twitter requires you to be logged in to view "replies" on a user's profile.

Leaving the injunction in place is insane and a huge oversight. It amounts to making web pages carriers that cannot select who they serve.

It should have said only that there is nothing judicially wrong with scraping but also not limited the rights of a service.


It's limited to public pages. They can still discriminate whom they serve, with logins or something, but they can't limit your ability to access their page in a way that you prefer.

All the same. If I as a web host want to fuck with people, or just with someone, by randomly dropping connections, this says I can't (well, it says so for LinkedIn, but it results in the same being applied to others).

This is insane. It really should have just said that "it's legal to scrape" but it shouldn't have said "and you can't stop that".


Of course they can. The question is whether they can do so legally and, if not, based on what laws.

And I presume it would also apply to Cloudflare's captchas.

Would it hold up if profiles were only visible to logged-in users and part of the sign-up EULA was an agreement not to scrape profiles with automated tools?

Obviously not. Although if you had an extension in your browser that scraped every LinkedIn profile, then it's not automated. If everybody in your office had the extension installed, and you maybe also had a couple of interns who did nothing but run LinkedIn searches and hit search results all day long, and maybe hired some people through Mechanical Turk to do the same thing...

Well that's not automated, and slightly more costly but it gets you pretty close to the same effect.


.

In the opinion they specifically discuss DOS attacks: "Internet companies and the public do have a substantial interest in thwarting denial-of-service attacks and blocking abusive users, identity thieves, and other ill-intentioned actors. But we do not view the district court’s injunction as opening the door to such malicious activity"

Hmm, not a lawyer, but I think if you commit one crime in the process of doing something else that's legal, you're still guilty. But LinkedIn couldn't block your spider if it came back and throttled at a reasonable rate.

Considering the kind of private scraping and selling tactics LinkedIn has been chronically guilty of (and not just the ordinary "growth hack" stuff: "LinkedIn violated data protection by using 18M email addresses of non-members to buy targeted ads on Facebook" [1]), it's satisfying to see LinkedIn lose this.

[1] https://techcrunch.com/2018/11/24/linkedin-ireland-data-prot...


I feel like this is a really common theme I've seen several times. Something like "Music Lyric site X sues Google for embedding their lyrics in the results directly" which is funny because site X got the lyrics by scraping them from other sites.

Plus Google only exists from scraping content, but I believe their TOS includes "don't scrape our content".

I find it really funny that the scrapers are battling scrapers - like guys you only exist because you do THE EXACT SAME THING


Regardless, there is legitimate value in the collection, cleaning, interlinking, and presentation of existing data. How that is interpreted by the law is one thing but merely because the data came from a variety of other public/private sources doesn't mean it derived all of its value externally.

For sure, but they shouldn't be hypocritical about it. If they don't consider themselves content parasites, they shouldn't consider people scraping their site to be content parasites, either. (Some sites really are just parasites, though.)

There's nothing hypocritical about it. Googlebot respects robots.txt configured on pages it scrapes. Google in turn expects that their own robots.txt will be respected. What's the issue?

https://www.google.com/robots.txt


Can I politely point out that the conversation is not about respecting robots.txt.

If you want to talk about this in terms of robots.txt, Google is thriving on the fact that other companies don't block their content in robots.txt, but at the same time Google blocks all of its content in its robots.txt.


> If you want to talk about this in terms of robots.txt, Google is thriving on the fact that other companies don't block their content in robots.txt, but at the same time Google blocks all of its content in its robots.txt.

It seems like you're stating this as though to cast some sort of moral aspersion. I don't get it. If other companies don't want Googlebot to scrape them they just have to say so. Most companies want Googlebot to scrape their content. Google doesn't want other people's scrapers to scrape Google's content. Nobody involved in any of this has done anything unreasonable or morally objectionable.


> Plus Google only exists from scraping content, but I believe their TOS includes "don't scrape our content".

Yes. This is EXTREMELY frustrating.

Of all companies to prevent scraping, Google is the most ironic.

Especially since their goal is to organize the world's information, it shocks me that there's no way to get access to this organized information from machine to machine.


I don't really think it's inappropriate or ironic. I can easily imagine naive scrapers essentially DDoSing Google.

Perhaps this issue will be recognised in some of the antitrust investigations.

If I am not mistaken, they no longer claim "organize the world's information" as their goal.


But it still is:

> https://about.google/

"Our mission is to organize the world’s information and make it universally accessible and useful."


Appears I am mistaken. Cheers.

Yes, we want a Google Search API [at a decent price].

Well, setting up so-called Barriers to Entry[0] is econ 101.

[0] https://en.wikipedia.org/wiki/Barriers_to_entry


Creating barriers to entry is an antisocial tactic that harms consumers and society at large.

It is the responsibility of moral consumers to avoid spending their money with companies that use these regressive tactics.


I think it’s important to distinguish types of barriers to entry. Some are “real” while others are “artificial”. For example, a real barrier to entry would be institutional knowledge about an industry while an artificial one would be an arbitrary TOS clause.

And disallowing scraping or making it difficult while refraining from providing an API for the same data is the arbitrary kind. The default state of the web is that it's trivially scrapable - you have to go out of your way to make it harder.

[flagged]


HN is not an appropriate place to project your moral insecurities in this manner.

HN is a place where you feel free to make personal attacks.

Google respects robots.txt, so it's arguably not the same as scraping a website without their consent.

Most sites don't have their main data/functionality in the Disallow section though.

Sure, but they could if they wanted to and that's their own business.

Google respects robots.txt files from webmasters who don't want Google to scrape their content

> LinkedIn has taken steps to protect the data on its website from what it perceives as misuse or misappropriation. The instructions in LinkedIn’s “robots.txt” file—a text file used by website owners to communicate with search engine crawlers and other web robots—prohibit access to LinkedIn servers via automated bots, except that certain entities, like the Google search engine, have express permission from LinkedIn for bot access.
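That kind of carve-out is easy to express in robots.txt and easy to check programmatically. A minimal sketch (the directives and the crawler names below are illustrative, not LinkedIn's actual file), using Python's standard-library robots.txt parser:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the carve-out described above:
# one named crawler is expressly permitted, all other bots are disallowed.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/in/some-profile"))    # True
print(rp.can_fetch("hiq-crawler", "https://example.com/in/some-profile"))  # False
```

Note that robots.txt is purely advisory: the parser tells a well-behaved crawler what the site owner wants, but nothing in the protocol technically prevents a client from ignoring it, which is part of what makes cases like this one contentious.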

Not a big fan of weev, but this sure seems like he got screwed if he was just enumerating public web pages and went to jail for it.[1]

[1] https://en.wikipedia.org/wiki/Weev#AT&T_data_breach


Even if the case was tried today, 9th Cir. isn't binding on other regions of the US, and there's a bit of a split, as detailed in the opinion[1]:

> In recognizing that the CFAA is best understood as an anti-intrusion statute and not as a “misappropriation statute,” we rejected the contract-based interpretation of the CFAA’s “without authorization” provision adopted by some of our sister circuits. Compare Facebook, Inc. v. Power Ventures, Inc., 844 F.3d 1058, 1067 (9th Cir. 2016), cert. denied, 138 S. Ct. 313 (2017) (“[A] violation of the terms of use of a website—without more— cannot establish liability under the CFAA.”); Nosal I, 676 F.3d at 862 (“We remain unpersuaded by the decisions of our sister circuits that interpret the CFAA broadly to cover violations of corporate computer use restrictions or violations of a duty of loyalty.”), with EF Cultural Travel BV v. Explorica, Inc., 274 F.3d 577, 583–84 (1st Cir. 2001) (holding that violations of a confidentiality agreement or other contractual restraints could give rise to a claim for unauthorized access under the CFAA); United States v. Rodriguez, 628 F.3d 1258, 1263 (11th Cir. 2010) (holding that a defendant “exceeds authorized access” when violating policies governing authorized use of databases).

weev was tried in an area under the 3rd Cir. jurisdiction. Somewhat interestingly, his conviction was thrown out in 2014 on venue grounds (e.g. being tried in NJ), without addressing the statutory question.[2]

[1]: pp. 27-28 [2]: https://en.wikipedia.org/wiki/Weev?oldid=912921723#cite_ref-...


Is there an AWS region in the District governed by this case, so you can just do all your web scraping from instances in that region?

Not saying the court made the right call, but in that case the big issue for the court was that the pages were clearly not intended for the public and the defendant knew it.

I believe the salient issue is whether or not there were effective access controls, not whether or not a page could be reasonably interpreted as intended to be non-public.

There is no requirement to have "effective access controls." What matters is what a reasonable person would believe about whether they were allowed to access the data. The access controls are relevant only insofar as that they convey a message that access is not permitted. The effectiveness of the access control is utterly irrelevant.

That is false. What effective access controls do, legally speaking, is help determine if a person could reasonably conclude that the information was non-public.

For example, if a door is locked, but easily defeated, there is an implied assumption that what lies behind it is only available to someone with the key. Another example is an unlocked door with a sign that states "No access without authorization". Or an unmarked and unlocked door that is on private property in a place where it would be very unlikely for someone to reasonably conclude that the access was intended to be public.


The real issue here is somewhere between both you and GP. What is required to trigger the CFAA?

Does accessing a page the site owner doesn't want you to access violate the CFAA, or do you need to hack through access controls?


As a real-world analogue: you can indeed be guilty of trespassing on someone's property even if you don't have to jump over any fences or pick any locks to get there. In some places, they don't even have to have a "no trespassing" sign. Simply being present on someone else's property without an invitation from them is illegal, and no, an open door does not count as an invitation.

> an open door does not count as an invitation.

But if you have someone living there who lets anyone in if they ask, it would be pretty hard to argue they are trespassing.

A user agent must ask for every page with an http request. If the server responds with 200 OK, it’s pretty hard to argue that it isn’t letting (or even inviting) you in.
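The mechanics being described here are easy to demonstrate. A self-contained sketch (using a throwaway local server in place of any real site, so the URL and content are made up) in which the server answers a GET with 200 OK:

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # The server decides how to answer: it could send 403 or drop the
        # connection, but here it consents with 200 OK and a body.
        body = b"<html>public profile</html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/profile/alice"
with urllib.request.urlopen(url) as resp:
    status, page = resp.status, resp.read()
server.shutdown()

print(status)  # 200
```

Every page view, scripted or not, is exactly this exchange; the only question is what the server chooses to send back.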


This doesn't cover you if you lied to get the invitation. If a robots.txt file denies some but not all user agents, setting your user agent to indicate that your request originates from a source it does not originate from is clearly a circumvention of an access control.

You generally can’t be charged with trespass unless you refuse to leave when told to do so.

An open door to a home is different, but unfenced property is 100% not trespass until you refuse to leave.


Trespass in criminal law usually requires notice that trespassing is prohibited, but this is usually satisfied by posting a "no trespassing" sign in a prominent area.

For example, my state's law (emphasis added):

> Whoever, without right enters or remains in or upon the dwelling house, buildings, boats or improved or enclosed land, wharf, or pier of another, or enters or remains in a school bus, as defined in section 1 of chapter 90, after having been forbidden so to do by the person who has lawful control of said premises, whether directly or by notice posted thereon, ...

The federal version of trespass (which I think applies to Indian reservations) considers merely fencing off the area to be sufficient notice that trespassing is prohibited.


That's not true at all. If you are aware that the property you are accessing is not meant for your use, you can be charged with trespassing regardless of if you have specifically been asked to leave or not.

It's even possible to be guilty of trespass if you weren't aware that you weren't allowed on the land. This is negligent trespassing.


Negligence only applies in situations where a reasonable person should have known. You're only able to be charged with trespassing whilst being unaware if you were so ridiculously unaware of your surroundings that any reasonable person in the same situation _would_ have known that they were trespassing.

If a public park blends into somebody's private lawn, you can't be charged with trespassing for stepping over the line.


This is a bad comparison, because when scraping a site you don't cross any borders; you just send and receive information. You can compare this to a phone call or to talking to someone.

A website or server is property, just like land is. Accessing it is no different than accessing any other piece of property. Opening a website is, for all intents and purposes, the same as crossing a border.

To take it a step further, the information on said website is also personal property, and accessing the information without permission is also trespassing. Specifically, this is called trespass to chattels [1] (trespass is most usually legally defined as trespass to person, trespass to land, and trespass to chattels). Even more specifically, this is the actual part of the CFAA that's being debated: computer trespass gets its roots from trespass to chattels. [2]

1: https://en.wikipedia.org/wiki/Trespass_to_chattels

2: https://en.wikipedia.org/wiki/Trespass_to_chattels#In_the_el...


This analogy is faulty and congress really needs to clarify what they meant the CFAA to protect against.

Opening a website is making a request, technically speaking. That is not equivalent to breaking into someone's home and taking information. The equivalent would be the head of the household telling you not to stand outside and ask someone inside to hand you something from the house. You haven't trespassed; you're asking someone in the house to do something for you. It's on them whether they do what you request or not.


A website isn't a person and the law doesn't expect them to act as such. It's a tool. Making a 'request' to a web server is more like turning the knob on a door: maybe the owner installed a lock, or maybe it just opens without there being a lock. But even if there isn't a lock, the law doesn't absolve you of trespassing against the door's owner just because the door itself didn't have the sentience to refuse your request.

But it's a door handle that is MEANT to be turned by the public at large. It's like putting a big "Order Inside" sign above the door to a restaurant and being surprised when people try to gain entry.

You also never entered the server. The server got your request and served something back to you. You did not go inside the house and read the contents of a book on the shelf; it was read aloud to you while you were still outside the house.

I'm not saying that websites shouldn't have recourse against people taking all the contents of their sites, just that the CFAA is the wrong tool.


>But it's a door handle that is MEANT to be turned by the public at large.

No it isn't. That's the crux of the case.

>It's like putting a big "Order Inside" sign above the door to a restaurant and being surprised when people try to gain entry.

It's like putting a big "order inside" sign above the door to a restaurant, and then also having a separate door in the back of the restaurant that clearly is used only by employees to go to the back office, and not being happy when non-employees keep trying to walk into the back office claiming "well there's a sign outside...".

>You also never entered the server. The server got your request and served something back you to. You did not go inside the house and read the contents of a book on the shelf, it was read aloud to you while you are still outside the house.

According to the courts, you did 'go inside the house' because the electronic signals that you sent to the server as part of the request are enough to constitute the 'physical contact' part of trespassing.

Again, trespassing isn't just about you physically having your body on someone else's property. It also can be your interaction with someone else's property (which can be land, or a door, or web servers) through the use of tools or intermediaries.


> not being happy when non-employees keep trying to walk into the back office claiming "well there's a sign outside..."

If you don't bother to put up an "Employees Only" sign on the door, you are going to have a hard time getting a trespassing charge to stick...


> You also never entered the server. The server got your request and served something back to you. You did not go inside the house and read the contents of a book on the shelf, it was read aloud to you while you were still outside the house.

By this reasoning it's impossible ever to hack anything. Even breaking password controls or cryptography is still just sending the server a request and getting something served back.


Lol really?

I'm not "on" your site when I browse there. I asked your server to send me some data and it did so.

It's the real-life equivalent of social engineering. It's so far not illegal for me to ask you things and for you to disclose them to me, even if you weren't supposed to. I'm allowed to lie to you, even to persuade you to tell me things.


You didn't "ask my server". You used a tool to extract data from my server.

It's more akin to you standing just outside my property border and using a fishing pole to pull fish from a pond that is inside my property border. You're still trespassing even if your two feet aren't physically on my land.

The common legal argument (see the second link in my above comment) is that accessing a web server actually does constitute being "on" the server because you are sending signals to my server in order to interact with it, and this satisfies the "physical contact" part of trespassing.

From Wikipedia:

> The courts that imported this common law doctrine into the digital world reasoned that electrical signals traveling across networks and through proprietary servers may constitute the contact necessary to support a trespass claim.

>It's the real-life equivalent of social engineering. It's so far not illegal for me to ask you things and for you to disclose them to me. I'm allowed to lie to you even to persuade you to tell me things.

This absolutely would be illegal, and I'm not sure why you think otherwise. Misrepresenting yourself in order to get me to reveal private information to you is fraud, and it is illegal in pretty much every jurisdiction I can think of.


> You didn't "ask my server". You used a tool to extract data from my server.

The tool asked the server. The server replied.

> It's more akin to you standing just outside my property border and using a fishing pole to pull fish

Bullshit. Using HTTP to access public information is akin to standing outside your business and writing down the phone number in the banner. Or even reading the "No trespassing" sign.

As long as you're not violating copyright, NDAs or EULAs (and that's debatable) there should be nothing wrong with reading information that you were authorized to view.


>there should be nothing wrong with reading information that you were authorized to view.

You aren't authorized to view it. That's the entire point.

And the lack of access control does not implicitly give you authorization to view it.


When it comes to physical properties there's a huge difference between reading a banner posted in a street and entering the property to read some secret data: you have to be in different locations. That's why your analogy is completely faulty.

When it comes to PUBLIC data in a website there's no difference. How would I know I'm authorized, implicitly or explicitly, to access a website, say www.google.com? Should I phone the domain owner before accessing?

Just because you meant for something to be off limits but failed to inform anyone doesn't automatically make it off limits. "Trespassing" in a website is analogous to hacking it, using stolen credentials, using exploits and things like that.

Unless some law passes that says that someone remotely accessing a folder called /secrets/, or /inside-the-property/ or something like that is trespassing, it won't be the case.


>When it comes to physical properties there's a huge difference between reading a banner posted in a street and entering the property to read some secret data: you have to be in different locations. That's why your analogy is completely faulty.

At no point is accessing a web server similar in any matter to reading words off of a banner posted in a street. You cannot use a faulty analogy of your own to describe why my analogy is faulty.

>When it comes to PUBLIC data in a website there's no difference.

Yes there is. Even for data that is public and meant to be accessed by the public, you still must access the web server. It is much more similar to walking into a publicly accessible restaurant and reading their menu; it is not similar to reading a banner on the outside of the restaurant.

>How would I know I'm authorized, implicitly or explicitly, to access a website, say www.google.com? Should I phone the domain owner before accessing?

A reasonable person knows that www.google.com is meant for public use. It is common knowledge and from whatever avenue you heard about Google, you probably gathered from context that www.google.com is somewhere you are allowed to go.

This is absolutely not the case if you randomly guess a URL like 'mycompany.intranet.io/financials/employeelist.xls'. And it certainly is not the case when you are explicitly told (such as in a robots.txt) that you are not allowed.

>Just because you meant for something to be off limits but failed to inform anyone doesn't automatically make it off limits.

It does, though. The owner of property is under no responsibility to inform the public that their property isn't meant for use. It is up to each individual person to determine if they are allowed to use it or not. This is typically done by context clues and societal expectations: it would be absurd for a random member of the public to walk through someone's open front door and claim "well I was never explicitly told to not come into your house...". The person should know, based on social conventions that you don't just walk into someone else's house, that it's not allowed. This is the same for websites. There is some leeway given, such as if you saw a sign for "Open House" and simply walked into the wrong house. But it is still possible to commit an act of trespassing even if you didn't explicitly intend to: this is called negligent trespassing.

>"Trespassing" in a website is analogous to hacking it, using stolen credentials, using exploits and things like that.

No, it's not. Did you even click on the link I provided earlier regarding trespassing?

>Unless some law passes that says that someone remotely accessing a folder called /secrets/, or /inside-the-property/ or something like that is trespassing, it won't be the case.

That law already exists. It's called the CFAA, and the debate around it is what is being discussed in this post.


The "don't walk into someone else's house" rule applies to ALL houses everywhere. You are explicitly forbidden to enter a house unless explicitly authorized.

When it comes to website, there are billions of domains in the planet, each one has multiple internal URLs, ranging from tens to several million. You can't expect everyone to have common knowledge about every domain and link. It is beyond ridiculous to compare the two.


> You can't expect everyone to have common knowledge about every domain and link. It is beyond ridiculous to compare the two.

It's true that there's a presumption that sites that are accessible by the public are open for access by the public. But a lack of technical restriction is not an invitation. If a reasonable person would conclude that your access is not welcome, then your access is also illegal. This is the crux of why so much of security research is on precarious legal footing. If you find an unsecured MongoDB database with a name like "customer_data" and you download the contents, you are 100% breaking the law.


A better analogy: Accessing a website is like calling up a business and asking whichever employee answers for information.

> And the lack of access control does not implicitly give you authorization to view it.

I know you're trying really hard to sway opinion on HN for some reason, but I'm just going to reinforce the entire point of this thread and, assuming we're staying within the context of publicly accessible information: the Ninth Circuit Court strongly disagrees with you.

Common law torts, such as trespass to chattels, may apply. But it's not a criminal offense.


I don't know why you think this has anything to do with opinion. I'm relaying information that is available in the Wikipedia link that I provided in an earlier comment.

>but I'm just going to reinforce the entire point of this thread

That isn't the entire point of this thread, nor is it the point of the PDF posted in the OP.

>Common law torts, such as trespass to chattels, may apply. But it's not a criminal offense.

Nobody has said anything about it being a criminal offense. The relation to trespassing is literally the entire point of this thread.


> You didn't "ask my server". You used a tool to extract data from my server.

You're always using a "tool" to "extract" data from a web server, unless you're manually operating a telnet session. A web browser is such a tool, an incredibly complex and automated one. cURL is such a tool too, and so is cURL wrapped in a bash script. None of them goes outside of what's allowed by the HTTP protocol[0]. And the most core assumptions of the Internet and the HTTP protocol combine into a simple rule: if it's a publicly routable server answering HTTP requests, you can issue requests and receive whatever it sends. If a server wants to discriminate, it should set up an auth scheme.

--

[0] - protocol family at this point.
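To make the "it's all the same protocol" point concrete, here is what any HTTP/1.1 client, whether a browser, cURL, or a script, actually puts on the wire (a hand-written sketch; the host, path, and user-agent string are made up):

```python
# The raw bytes a client sends for one page view. A browser and curl emit
# the same shape, differing mainly in headers such as User-Agent.
request = (
    "GET /in/some-profile HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: tiny-client/0.1\r\n"
    "Accept: text/html\r\n"
    "Connection: close\r\n"
    "\r\n"  # blank line: end of headers, end of request
).encode("ascii")

print(request.split(b"\r\n")[0])  # b'GET /in/some-profile HTTP/1.1'
```

Nothing in these bytes identifies the kind of tool that produced them, which is why discriminating between clients has to rely on heuristics like request rate, IP ranges, or the self-reported User-Agent header.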


> "You didn't "ask my server".

Yes, you did.

> You used a tool to extract data from my server.

No, that's not how the technology works.


Using fishing bait is just a "request" for a fish to bite my line so I can pull it in. It's up to the fish to respond to the 'request', right? So does that absolve me of a crime if I go fishing in someone else's pond and pull out all of their fish? Cause the fish are the ones that responded, right, so it's not my fault?

No, of course not. The technical details of how an HTTP request works are not what is relevant here. Don't be obtuse.


Personal attack noted.

No, it’s more akin to standing on the boundary, reading your posters using binoculars.

um, seems in this case the court specified that it is NOT on them - if the info is public, the website/house-people are required to return it and create no obstacles to doing so.

>A website or server is property, just like land is. Accessing it is no different than accessing any other piece of property

What if I placed a sign on my lawn which said "Please, step on the grass!"? Would it still be trespassing?

You laid out a lot of opinions there as if they were facts. They are not. These issues are complex and are still being debated at levels higher than the HN comment section.


I don't understand your comment.

>What if I placed a sign on my lawn which said "Please, step on the grass!"? Would it still be trespassing?

No. Of course not. What exactly is your question?

>You laid out a lot of opinions there as if they were facts.

I didn't lay out any opinions. I relayed information that is available from Wikipedia and other sources and rephrased it into an HN comment. None of it is opinion. If you take issue with what my comment says, you can take it up with the courts that made the decisions that gave the information I posted.


>No. Of course not. What exactly is your question?

My point was that it's hardly as clear-cut as a piece of land, and you know it. You posted a link to Wikipedia's Trespass to Chattels article, which I think is funny because it exactly proves my point. From your link:

>...several companies have successfully used the tort to block certain people, usually competitors, from accessing their servers. Though courts initially endorsed a broad application of this legal theory in the electronic context, more recently other jurists have narrowed its scope. As trespass to chattels is extended further to computer networks, some fear that plaintiffs are using this cause of action to quash fair competition and to deter the exercise of free speech; consequently, critics call for the limitation of the tort to instances where the plaintiff can demonstrate actual damages.

It is not at all clear that what we're discussing here is a clear violation. It's very debatable and the law itself was never envisioned to apply to scraping websites (because they didn't exist yet!) It also goes on to say (in the US)

>One who commits a trespass to a chattel is subject to liability to the possessor of the chattel if, but only if,

>(a) he dispossesses the other of the chattel, or

>(b) the chattel is impaired as to its condition, quality, or value, or

>(c) the possessor is deprived of the use of the chattel for a substantial time, or

>(d) bodily harm is caused to the possessor, or harm is caused to some person or thing in which the possessor has a legally protected interest.

The only clause there which even begins to help your case is the 'value' part of clause b and, again, that's very debatable.

> you can take it up with the courts that made the decisions that gave the information I posted.

Decisions made by court A get overturned by court B all of the time. We'll see where it lands, but we're not there yet (again, my point!)


Those are apples-to-oranges comparisons. A phone call: you don't have to answer the call, nor say anything once you know (or don't know) who the caller is or what their intention is, and you can stop whenever you want. Is it a robo-call? You hang up. And similarly with talking to someone in person: if they say something, you're free to just not respond; and if they persist, it's harassment.

The main reason I see businesses being concerned about being required to serve pages to scrapers (even at a reasonable download rate) is that there's still a cost associated with it, and more so the more scrapers try to access the data and regularly re-access it for updates. Similarly, if it is the users of a platform who have input the data and update it, and they only want it presented on that platform (for whatever reasons), then what rights do they have?

Is the answer then adding another acknowledgement message like "this site uses cookies", perhaps with a required response before moving forward, so users acknowledge that "scraping isn't allowed" - akin to "no trespassing" signs on properties? That seems awfully ridiculous: it puts the onus on 100% of users (the overwhelming majority being non-scrapers), adding friction and slowing down billions of internet surfers. Of course, browsers could then act as a layer that auto-responds to that or pre-agrees to the rules - perhaps by reading through a site's TOS and pre-approving what you agree to. And the trend otherwise leads to closed platforms, so the data isn't considered public; I won't argue whether that is good or bad for the general internet, but how much value is there in a person having access to that data without having to be a user?

Or the much simpler thing is we could put the onus on businesses who are scraping or will use scraped data to not cause this mass friction.
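The "reasonable rate of download" idea above is where throttling rather than blocking comes in. As a hedged illustration (nothing specific to LinkedIn or this case; the class and parameters are made up for the example), a per-client token-bucket limiter lets a scraper through at a bounded rate instead of serving it nothing:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests/second, with bursts up to `capacity`.

    The `clock` parameter is injectable so the behavior can be tested
    deterministically; in production the default monotonic clock is fine.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if one request may proceed now, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per client IP (or API key) and return HTTP 429 when `allow()` is False, which caps the serving cost of heavy scrapers without blocking them outright.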


And does robots.txt count as an access control?

What about a humans.txt that says "please don't scrape this site"?


According to the ruling, a cease and desist letter directly demanding that they not scrape the site didn't count as access control, so one would assume that humans.txt wouldn't either. It needs to be a technical prevention like a password, access token, etc.

Sounds more like an access suggestion than control to me
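Right: robots.txt is only consulted by clients that choose to honor it. Python even ships a parser in the stdlib for well-behaved scrapers; a minimal sketch (the rules here are made up for illustration):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Normally you would point this at a live file with rp.set_url(...) and
# rp.read(); parsing inline keeps the example self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The "control" is entirely client-side: nothing stops a scraper from
# simply ignoring the answer.
print(rp.can_fetch("my-bot", "https://example.com/profile/alice"))   # allowed
print(rp.can_fetch("my-bot", "https://example.com/private/notes"))   # disallowed
```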

Ahh, so if a company leaks data it's the viewer's fault, not the company's?

Yes. That's the general rule--negligence of a victim does not negate the culpability of the criminal. "It was easy to commit the crime" is not a defense. If you find yourself with access to something you think you're not supposed to have access to, you're supposed to do the right thing.

I think the issue is not "negligence of the victim" so much as, "took no steps to make the information private". If I don't lock my door and someone goes into my house, opens my filing cabinet, and copies my financial info then they're still a criminal. They took information from a place that was unambiguously meant to be private. But if I staple a copy of my financial info to a telephone pole, surely those who read it are not committing the crime of reading private information.

A URL that is not authenticated seems more like the latter than the former. The web is public unless people take steps to make it private. Criminalizing accessing unprotected URLs is like arresting people for reading the financial info I left stapled to the telephone pole. A while back there was a "world's most exclusive chat room" website that hosted IRC channels gated by Twitter follower count. People learned that they could just manually increment a URL parameter to access higher room numbers regardless of their Twitter follower count. Were those people committing a crime?


To take the case at issue, when your immediate reaction to discovering an unprotected URL is to scrape it, discuss on an IRC channel how you're going to monetize it, and then go to the media to announce your security vulnerability discovery, you are going to find it difficult to make the argument that you believed you had authorized access.

By that same line of reasoning, one could argue that changing your url parameter in that twitter chatroom website is a privilege escalation attack that allows users to access protected information.

Absence of authentication means all access is authorized, otherwise just typing in random urls is a crime.


The case referenced by the top-level comment of this chain (the one about 'weev') is a case where someone was prosecuted and imprisoned specifically because changing URL parameters was seen as an attack allowing access to protected information.

You can indeed argue that. Typing random URLs can indeed be a crime.

Nobody leaked any data here. These were public profiles that were "controlled" by a robots.txt file.

The judge appears to question whether robots.txt is sufficient to prevent scraping, or if a proper authorization step would be required.

The best real-world analogy I can come up with... I post a No Trespassing sign on my garden, but don't fence/gate the property. Is it ok to access the property and take my tomatoes? After all, the sign is just a suggestion... had I really wanted to prevent access, I'd install a fence.


It's more like a store putting up a no shoes no shirt no service sign and then trying to sue for trespass when a beachgoer comes in to shop anyway. LinkedIn is a business with publicly accessible assets they want to be frequented, but they want to control how you do that. However they are finding the laws regulating the rights people have in respect to frequenting places open to the public apply.

Yep. It's more like having a public store and only letting some people into it, like only men, no women. Because they clearly allowed the Google bot.

More like only letting humans into it, and only one type of robot, the googlebot. Then this company is ignoring the posted rules and sending armies of robots into the store to photograph every square inch of the business.

Well I'm sorry, but if I opt to wrap a cURL call in a bash for loop, I'm still a human that tries to access the same resources, only with a different user agent.

There is legitimate individual interest in both scraping and non-browser HTTP sessions.


I think this is the best analogy I've seen on this thread.

Where I live, in order for it to be trespass it has to be an enclosed space. If you don't fence your property, then it is not trespass.

Taking your tomatoes is theft, sign or no. That has nothing to do with this case.

A friend of mine from grad school was very involved in legal issues related to CFAA stuff. According to him, weev really got screwed because he failed "the punk test", which discouraged lawyers from wanting to use him as a test case.

Curious question: what is "the punk test"?

“Would this person’s attitude make them more unpleasant to work with than others I could be representing in my already limited time”, or something to that effect.

Someone who spends their free time hacking university printers to distribute white supremacist propaganda and is a proud member of the “Gay Nigger Association of America” trolling group would likely not pass.


"Will this person be negatively perceived by a jury of their peers or, especially, a panel of 3 federal judges? And will this person say dumb shit that torpedoes my case for reasons other than its merits?"

"Is this person going to say idiotic nonsense that tanks the case because the judge or jury thinks they are just awful?"

Weev may have been a good test case if he wasn't a white supremacist.


>except that certain entities, like the Google search engine, have express permission from LinkedIn for bot access

How does this work technically? I just tried crawling a friend's profile using curl with my user agent set to Google's bot, and it was still blocked.


Google describes how to verify Googlebot here: https://support.google.com/webmasters/answer/80553?hl=en

Most other search engine crawlers provide similar methods.
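Per that Google page, the check is a reverse-DNS lookup on the requesting IP followed by a confirming forward lookup, so a spoofed user agent alone never passes. A rough Python sketch (the resolver functions are parameters so the logic can be exercised without network access; the accepted domain suffixes follow Google's published guidance):

```python
import socket

def is_verified_googlebot(ip,
                          gethostbyaddr=socket.gethostbyaddr,
                          gethostbyname=socket.gethostbyname):
    """Two-step verification: reverse-DNS the IP, check the domain,
    then forward-resolve the name and confirm it maps back to the IP."""
    try:
        hostname = gethostbyaddr(ip)[0]
    except OSError:
        return False
    # Google documents googlebot.com / google.com as the crawler domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return gethostbyname(hostname) == ip
    except OSError:
        return False
```

Combined with a user-agent check, this is roughly how a site can whitelist Googlebot while refusing other clients that merely claim to be it.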


IP range whitelisting? MASSL?

EDITED: corrected auto-correct.


hiQ asked the court for a preliminary injunction to stop Linkedin from denying them access, won it, and this is the result of Linkedin's appeal of that injunction. This is not the end of the case.

The title is wrong. The 9th Circuit just ruled that hiQ has a decent enough argument to move forward. The question of whether them scraping a public site can violate the CFAA is not settled.

> We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required

> The data hiQ seeks to access is not owned by LinkedIn and has not been demarcated by LinkedIn as private using such an authorization system. HiQ has therefore raised serious questions about whether LinkedIn may invoke the CFAA to preempt hiQ’s possibly meritorious tortious interference claim.

Note the tone of the language used in the ruling. The judge makes it pretty clear that nothing is final here.


I think you've mischaracterized the state of things. In the underlying case, LinkedIn asserted that HiQ violated the CFAA and HiQ said LinkedIn tortiously interfered with its business. The trial court said LinkedIn couldn't assert the CFAA. LinkedIn appealed, asking the appellate court to overturn the trial court and also to hold that the tortious interference claim is preempted by the CFAA. The appellate court said no, we agree with the trial court and there's no preemption, so now HiQ can go back to the trial court and proceed to trial with its tortious interference claim.

LinkedIn tried to use the CFAA as an argument against the preliminary injunction HiQ was seeking at the start of the trial (which would force LinkedIn to continue to provide access to the profiles). They claimed that HiQ was likely to fail under the CFAA and so do not deserve the injunction to be granted. When the preliminary injunction was granted, LinkedIn appealed. This is the ruling on that appeal:

> It is likely that when a computer network generally permits public access to its data, a user’s accessing that publicly available data will not constitute access without authorization under the CFAA. The data hiQ seeks to access is not owned by LinkedIn and has not been demarcated by LinkedIn as private using such an authorization system. HiQ has therefore raised serious questions about whether LinkedIn may invoke the CFAA to preempt hiQ’s possibly meritorious tortious interference claim.

So yes, HiQ and LinkedIn need to go back and finish the trial, but the language used is in no way a ruling on whether or not the CFAA preempts state law, just that even if there is preemption, hiQ still has a decent argument.


Are you saying the trial court never ruled on the preemption claim?

The court rules that hiQ has a good enough argument against the preemption claim that the preemption claim cannot be used to block the injunction:

>> We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required


I disagree about the tone, it seems to suggest to me that the judge believes there is a strong case here for hiQ.

I agree, but there's a difference between "hiQ has a strong case" and "Here is the final ruling on the hiQ case". I was trying to point out that the case is not over or ruled on at all. Just the preliminary injunction.

Oh is that all, just a PI against their fundamental argument.

"Just the preliminary injunction."

But I thought that was a few years ago!

How many more over-rulings or appeals do we freaking need? I really hope this is the final ruling.


This ruling isn't on the actual case. The ruling is about the injunction being upheld while the case is tried.

Funnily enough the judge has a similar concern:

>> I write separately to express my concern that “in some cases, parties appeal orders granting or denying motions for preliminary injunctions in order to ascertain the views of the appellate court on the merits of the litigation.”


AP seems to be saying differently. https://apnews.com/1e1cacd92df74f48846e8bce5237b97d

I think I would trust the opinion itself over a random AP reporter.

You misunderstand basic law terminology.

A preliminary injunction is considered very strong. So it's not that "nothing is final here", it's actually almost pretty much final unless something comes out of left field.


The injunction was to stop LinkedIn from blocking access while the case is ongoing, not to stop them from arguing that hiQ violated the CFAA. The trial court could hear the arguments and say "hiQ is wrong, they did violate the CFAA". Maybe that's not likely, but it also is not yet decided.

So what exactly did I misunderstand and why do you think this is final?


I think you missed that this injunction is the case?

You are saying "the injunction was to stop LinkedIn from blocking access while [the injunction request] is ongoing".

If the court didn't think hiq had a strong case they would not have granted the initial injunction, then reaffirmed it on this appeal.


The 9th circuit uses a sliding-scale version of the preliminary injunction test. Because hiQ has more at stake, all hiQ needs is a serious question in this case, not a likelihood of success on the merits.

It still might be the case that hiQ has less than a 50% chance of winning in the eyes of the appeals court.


A good decision was reached, but it's a little worrying that the emphasis in the ruling was mostly about a weighing of business interests rather than affirming a right to access public information. If HiQ's business model had not been jeopardized by LinkedIn's business desire to block them, I fear this court could have easily gone the other way. I'd really love to see a ruling that solidifies the right of someone to access publicly available data without fear of repercussions. If this case makes it to SCOTUS, I would hope the ruling is predicated on that rather than business harm.

Key paragraphs from the ruling:

> In short, even if some users retain some privacy interests in their information notwithstanding their decision to make their profiles public, we cannot, on the record before us, conclude that those interests—or more specifically, LinkedIn’s interest in preventing hiQ from scraping those profiles—are significant enough to outweigh hiQ’s interest in continuing its business, which depends on accessing, analyzing, and communicating information derived from public LinkedIn profiles.

> Nor do the other harms asserted by LinkedIn tip the balance of harms with regard to preliminary relief. LinkedIn invokes an interest in preventing “free riders” from using profiles posted on its platform. But LinkedIn has no protected property interest in the data contributed by its users, as the users retain ownership over their profiles. And as to the publicly available profiles, the users quite evidently intend them to be accessed by others, including for commercial purposes—for example, by employers seeking to hire individuals with certain credentials. Of course, LinkedIn could satisfy its “free rider” concern by eliminating the public access option, albeit at a cost to the preferences of many users and, possibly, to its own bottom line.


> If HiQ's business model had not been jeopardized

I think this is more about validating hiQ's legal standing in the case.


This case is so ridiculous on multiple fronts that although this procedural ruling (injunction) seems technically correct (to allow the case to proceed to actual court), it could just as well have been thrown out with no difference in or ultimate harm to the parties.

First, LinkedIn makes the claim that its users have a right to privacy against scraping by such a 3rd party. That's laughable. As the court saw, their whole business model is made on people sharing their profiles broadly and mostly to the public.

Secondly, HiQ claims that LinkedIn's efforts to stop it from using the data are tortious interference. That's bold -- suppose someone is taking your assets (you believe illegally) and selling them to others -- can you imagine the gall that the person taking your assets can sue you for interfering with their subsequent sale of your assets?

Finally, that LinkedIn resorted to using the computer fraud and anti-terrorism statutes to make their argument is ridiculous.

So much craziness to go around. I would've just tossed the case, but I guess there is the whole bit about due process... Maybe HiQ will fail anyway at the next substantive trial, but what a waste of time.


> suppose someone is taking your assets

Except that, in the digital sense, it's only copied. They now have it, but you didn't lose your assets or money besides the <$0.001 it costs to serve each web page.

> So much craziness to go around.

I agree - I haven't read through the entire thing, but it looks like, instead of saying "you can't scrape", they could implicitly give a license to users for personal and business use, but disallow reselling of the data (of course carefully worded to still allow the likes of recruiters to do so). It's like trying to argue that the DMCA says you can't create a torrent file of some movie.


Would that ruling mean that sites could no longer refuse to show content based on how they're accessed? For example, sites that won't load if the browser is in headless mode, or sites that depend on javascript as a way of blocking wget/curl.

I have a scraper for a site that used to offer an API for their publicly available site but removed the API with no warning. The info is still available to the general public, but only through their website. I created a scraper for the public page, but shortly after, they switched to loading some public information through JavaScript so my HTML scraper couldn't see it anymore. I ended up having to write an application around Selenium to load the JavaScript and import this public information. I'm just waiting for them to start randomizing the CSS classes to make scraping even harder. The content is static: even as data changes on the server, it does not refresh on the page unless you reload the page.
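For what it's worth, the Selenium workaround described above usually boils down to loading the page in a headless browser and waiting for the JS-rendered element to appear. A hedged sketch (the URL and CSS selector would be site-specific; Selenium is imported lazily so the snippet still loads where it isn't installed):

```python
def scrape_rendered(url, css_selector, timeout=10):
    """Load `url` in headless Chrome, wait until `css_selector` matches,
    and return the text of every matching element."""
    # Imported here so the module loads even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JS has actually rendered the element we want.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector)))
        return [el.text
                for el in driver.find_elements(By.CSS_SELECTOR, css_selector)]
    finally:
        driver.quit()
```

The explicit wait is the important part: scraping immediately after `get()` returns is exactly what breaks once content loads asynchronously.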

There is no reason why your page should refuse to load plain text without Javascript enabled.


> There is no reason why your page should refuse to load plain text without Javascript enabled.

On a technical level, sure. SPAs should pre-render data before sending it to the client.

The problem is that's a ton of extra work when the client will have to fetch data anyway - so it's difficult to justify the time to management.

EDIT: If their page fetches the data with JS, you might actually have an easier time figuring out what their API looks like instead of scraping the rendered page. You might find there's more data available than is rendered, too.


I once showed someone how to use dev tools to see an API request a web app makes and how to recreate it with your own code. He’s since done all sorts of “life hack” automations. Love teaching people this stuff!
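That dev-tools trick generalizes: once you've seen the request a page makes in the network tab, you can often replay it directly and get clean JSON instead of scraping rendered HTML. A generic stdlib sketch (the example URL and headers are placeholders, not any real site's API):

```python
import json
import urllib.request

def fetch_json(url, headers=None):
    """Replay an API request observed in the browser's network tab."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Hypothetical usage, copying the URL and headers dev tools showed:
# fetch_json("https://example.com/api/items?page=1",
#            headers={"Accept": "application/json"})
```

In practice you'd also copy over whatever auth headers or cookies the browser sent, but the shape of the trick is just this: reproduce the request, skip the rendering.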

> There is no reason why your page should refuse to load plain text without Javascript enabled.

Sure there is. You prefer writing javascript and you want to serve your site through a CDN.

You might not think that's a good reason, but that's certainly a reason.


Until the ADA comes along and demands you create a site accessible to the blind.

I've often wondered when the laws would start to be applied, and I think it's coming.


> It is a common misconception that people with disabilities don't have or 'do' JavaScript, and thus, that it's acceptable to have inaccessible scripted interfaces, so long as it is accessible with JavaScript disabled. A 2012 survey by WebAIM of screen reader users found that 98.6% of respondents had JavaScript enabled. [0]

and that was 7 years ago. There may be certain complicated interactions that are a challenge for screen readers but simply because a page relies on JavaScript for rendering doesn't automatically mean it is inaccessible to screen readers.

[0] https://webaim.org/techniques/javascript/#reliance


I have a website that's a full page map. I care about accessibility - is there any way I can make this meaningfully accessible to the blind?

Look at WCAG 2 (the Web Content Accessibility Guidelines); they specify tags and elements common screen readers will understand to help make your site accessible.

This is a really good resource: https://accessibility.18f.gov/

A lot of frameworks now have some accessibility built in if you add the correct attributes.


I suspect you're being glib, but you could look at https://wiki.openstreetmap.org/wiki/OSM_for_the_blind

I'm confused. Why would a blind person be any less likely to use JavaScript?

To test for a blind person's ability to render your website, a good method is a CLI browser. Neither the blind person's device nor a CLI browser will render JavaScript.

I’m no expert in this space, but if free, common, and easily accessible tools render your site readable and usable, I’m not sure how refusing to use those tools would be a claim under ADA.

There are plenty of real ways that sites are unusable by screen readers, using Javascript to download dynamic content shouldn’t be one of them.


Why on earth would you think that? The screen readers tie into modern browsers like Safari or Chrome.

What does a CDN have to do with it?

Because if your application bundle is a fixed asset, like a JS SPA that fetches its data from an API, then you can distribute your entire application via an inexpensive CDN.

As soon as your application bundle is rendered on your servers dynamically, then only part of your site can be delivered via CDN.

Basically, going all-JS gives you an app model where your server-side code doesn't even know or care about HTML or the web. It just pushes JSON or whatever around and is largely client-independent.

Great model when you need to support iOS, Android, web, desktop.


Using a CDN like they're talking about likely means your HTML is static and only served from the CDN.

For many SPAs the only actual HTML is a header, a container div, and a call to the app's JS. Ignoring the header, there might only be, say, half a dozen lines total.


Chrome headless does that really well.

> google-chrome --headless --run-all-compositor-stages-before-draw --virtual-time-budget=25000 --print-to-pdf='foo.pdf' URL

Edit: Plus

> pdftotext -raw foo.pdf


> There is no reason why your page should refuse to load plain text without Javascript enabled.

Of course there is! It’s my site and I can do what I want with it.


Captchas would be another technology that might fall under that "technical barriers" terminology. I don't do much scraping, but I think most of us would enjoy never having to "identify the traffic lights in these photos".

I would very much hope so. These sites are a perversion of the idea of the Web, and only get away with it because fundamentals of the Web were created with friendly, cooperative users in mind - not with commercial users of today - so there's no enforcement of good behaviour.

We don’t know yet. It would depend on the specific legal question that is decided in the case. Courts usually try hard to constrain the law to as few questions as they need to in order to resolve a dispute.

That’s awesome news. Thanks also to the EFF for all the work they are doing to ensure fair use is still a thing. We’ll (https://serpapi.com) be donating next year.

This is actually bad; wouldn't it be better if sites were allowed to block crawlers? I don't see the legal basis for forbidding sites to ban scrapers. Is there a law that a site must serve pages to anyone?

There is no legal basis for "forbidding to ban scrapers".

The question is whether there is any legal basis for banning scrapers, i.e., for blocking hiQ. In other words, if hiQ keeps scraping, are they violating anyone's rights and/or breaking the law by doing that?

As long as that remains a legitimate, open question, then hiQ can argue they should be allowed to keep scraping without incurring civil or criminal liability. That is the purpose of the injunction. There could be no legal basis for blocking hiQ. Until that question is resolved, hiQ can keep on scraping.


Parent was downvoted but I think they have a point. This sounds overbroad, to an extent that I'd worry will get the whole ruling tossed out by SCOTUS.

On the other hand, if the ruling stands, it sounds like it will finally be possible to do useful things with Craigslist.


Because they are exposing their website to the general public. If you don't want the public to have access and be able to scrape it, don't make it public facing.

Yes - for instance if your e-commerce site banned people from visiting based on whether their zip code made it more likely they were of a certain racial group, you would be running afoul of the law.

You can redefine what is 'public', or require login to see that information anyway.

If anything, the ruling could just push websites to hide information even deeper.


Why? If your data is publicly accessible, what is the difference between using scripts and someone hiring a ton of people in a third-world country to copy & paste your content?

Considering the kind of private scraping and selling tactics LinkedIn has been chronically guilty of (and not just the ordinary "growth hack" stuff: "LinkedIn violated data protection by using 18M email addresses of non-members to buy targeted ads on Facebook" [1]), it's satisfying to see LinkedIn lose this.

[1] https://techcrunch.com/2018/11/24/linkedin-ireland-data-prot...


So the champion of the public internet turns out to be a company that scrapes your social media, MLs it and sells the results to your HR dept?

I'm reminded of Dave Chappelle's Halle Berry routine...


In short, even if some users retain some privacy interests in their information notwithstanding their decision to make their profiles public, we cannot, on the record before us, conclude that those interests—or more specifically, LinkedIn’s interest in preventing hiQ from scraping those profiles—are significant enough to outweigh hiQ’s interest in continuing its business, which depends on accessing, analyzing, and communicating information derived from public LinkedIn profiles.

Reasonable. If a platform helps you make an individual's information public, then why should it matter to the platform how the market uses that public information?


On how this case relates to the CFAA:

We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required. Put differently, the CFAA contemplates the existence of three kinds of computer information: (1) information for which access is open to the general public and permission is not required, (2) information for which authorization is required and has been given, and (3) information for which authorization is required but has not been given (or, in the case of the prohibition on exceeding authorized access, has not been given for the part of the system accessed).

Public LinkedIn profiles, available to anyone with an Internet connection, fall into the first category. With regard to such information, the “breaking and entering” analogue invoked so frequently during congressional consideration has no application, and the concept of “without authorization” is inapt.


Volokh take is an interesting read [1]

I am curious how quickly most pages will get put behind authorization. With the wording of this ruling you could pretty much go snap up any blog site (say, Medium) and more. I wonder what kind of services would come out of that, having the data in a format where it can be more easily parsed/analyzed.

so every ecommerce site is fair game? I assume most are already being scraped but I cannot imagine having to be in an environment where many of your connections are not people

[1] https://reason.com/2019/09/09/scraping-a-public-website-does...


Hmm - I think a key in the ruling here was that LinkedIn maintains no copyright claim on these pages. Users on LinkedIn retain ownership of their profile data. Compare that to a blog and maybe copyright could come into play? Not a lawyer just thinking out loud...

RIP Aaron Swartz

Even if LinkedIn loses and scrapers can no longer be blocked, they have already switched to putting all profiles behind an authwall, or at least it's very hard not to get an authwall. So could hiQ even carry on if they won anyway?

I'm not very familiar with either LinkedIn or hiQ, but what would be the problem with logging in before scraping?

The reason the pages are public to begin with is that Google will only scrape public pages for search indexing. LinkedIn wants to provide the pages ONLY to google, so they tried telling hiQ to stop scraping without any physical blockers (so as to not impede google's scraper).

If LinkedIn loses this case they (and others) might try to get Google to change their policy (either use auth or some whitelisted IP addresses or something).


Best hope that hiQ prevails. The slope slips very fast without LinkedIn's defeat. If LinkedIn prevails, the "EULA" has the force of criminal law and not just an agreement that lacks the meeting of the minds.

This is a weird case, as it turns the question of scraping on its head. Normally you'd think "am I allowed to scrape?", but instead the question becomes "am I allowed to prevent scraping?".

Anyways, I disagree with the court's judgment here. The users have consented that their data be used in accordance with LinkedIn's privacy policy. Even if it is publicly posted does not mean that the user has relinquished control over their personal information for another company to do with as they wish.


eBay had better lawyers than LinkedIn:

https://casetext.com/case/ebay-v-bidders-edge

I'm glad this court ruled it wasn't a violation of CFAA. But using trespass to prevent it seems reasonable. A private business should be allowed to restrict certain kinds of use of its resources (servers, bandwidth, etc), especially if it is beyond typical use. But if the load is typical and doesn't actually harm LinkedIn, it seems less reasonable to restrict them. If LinkedIn doesn't want automated access to their data because it is too much of a load on their servers, then they should be required to ban ALL automated access, including Google's bots. Of course they want Google's bots because that sends them traffic.

Another reason I think it was stupid for LinkedIn to use the CFAA is that it sets them up to be a protected computer system, with protected information. If that is the case, it seems they could be liable for disclosing the information to someone a user didn't want, like a stalker. It's rather dumb: LI is claiming they host protected information, but it is only protected against someone that might compete with them.


> If LinkedIn doesn't want automated access to their data because it is too much of a load on their servers, then they should be required to ban ALL automated access, including Google's bots. Of course they want Google's bots because that sends them traffic.

By that logic, I should have access to LinkedIn premium features for free. Why should LinkedIn give more data to people who pay?

