So LinkedIn is prohibited from blocking hiQ's access by technical means. That's a strong holding. If this case is eventually decided in favor of hiQ, scrapers can no longer be blocked. Throttled a little, maybe, but no more than other users doing a comparable query rate.
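For what it's worth, "throttled no more than other users doing a comparable query rate" is just ordinary uniform rate limiting. A minimal token-bucket sketch (the `allow_request` helper and the RATE/BURST numbers are mine, purely illustrative):

```python
# Uniform per-client throttling: every client, scraper or human,
# gets the same token bucket. Illustrative values only.
import time
from collections import defaultdict

RATE = 5.0    # tokens refilled per second
BURST = 20.0  # maximum bucket size

buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    """Return True if this client is within its query-rate budget."""
    b = buckets[client_ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
    b["ts"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False
```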
However, how is it reasonable to force a web site to serve its contents to a third-party company without being allowed to decide whether to serve it or not? Serving the web site costs money, and the scraper surely isn't going to generate ad income...
It's actually pretty insane to force a site to serve content. I think both parties are in the wrong here - HiQ for assuming they're entitled to receive a response from LinkedIn's webservers, and LinkedIn for abusing the CFAA to try to deny service rather than figure out a technical solution to their business problem.
In my view:
* The data is public, and free of copyright. If you're a scraper and can get it, you haven't done anything wrong.
* The servers serving the data are still under LinkedIn's control, and they have no obligation or public duty to always serve that content. They could just as well block you based on your IP or other characteristics. If they want to discriminate and try to only let Google's scrapers access the data - what's wrong with that? Scraper brand is not a protected class. Tough taters if your business model "depends" on your ability to successfully make requests to another uninvolved company's webservers.
If I were the judge, I'd throw this out and let LinkedIn/HiQ duke it out themselves - they deserve each other.
Hosting costs money, servers cost money... but maybe create a public-facing API that is way cheaper and easier to use than scraping your website? I see this ruling in a positive light: it might promote more open and structured access to public-facing data.
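To make the suggestion concrete, here's a hedged sketch of what such an endpoint could look like, using Flask; the route, the `PROFILES` store, and the field names are all made up for illustration.

```python
# A minimal public, structured endpoint: cheaper to serve than a full
# page render, and far easier to consume than scraped HTML.
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Hypothetical data store, standing in for whatever backs the site.
PROFILES = {
    "alice": {"name": "Alice", "headline": "Data Engineer"},
}

@app.route("/api/profiles/<profile_id>")
def get_profile(profile_id):
    profile = PROFILES.get(profile_id)
    if profile is None:
        abort(404)
    return jsonify(profile)

if __name__ == "__main__":
    app.run()
```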
Huh? Net neutrality isn't about the server or client... it's about the network operator in between them.
Public facing internet sites, in my opinion, should be treated in same way as public space - anyone should be free to read, and write down in their notepad whatever is there, in the same way as anyone else.
Scraping a public-facing website is, in my opinion, a huge waste of resources. It would be cheaper (in total) to build an API that can serve the data than to build a good scraper.
That being said, if you provide data to the public, you don't get to invoke the CFAA to plug the holes your content discrimination code doesn't fill.
But LinkedIn is apparently happy to let Googlebot and Bingbot scrape public profiles. If they want to do that, they can't argue that their policy is to block bots that don't click on ads. Discriminating between Googlebot and other visitors is probably a violation of Google's policies, too. They can't have their cake and eat it too.
> First, LinkedIn does not contest hiQ’s evidence that contracts exist between hiQ and some customers, including eBay, Capital One, and GoDaddy
> Second, hiQ will likely be able to establish that LinkedIn knew of hiQ’s scraping activity and products for some time. LinkedIn began sending representatives to hiQ’s Elevate conferences in October 2015
> Third, LinkedIn’s threats to invoke the CFAA and implementation of technical measures selectively to ban hiQ bots could well constitute “intentional acts designed to induce a breach or disruption” of hiQ’s contractual relationships with third parties.
> Fourth, the contractual relationships between hiQ and third parties have been disrupted and “now hang in the balance.” Without access to LinkedIn data, hiQ will likely be unable to deliver its services to its existing customers as promised.
> Last, hiQ is harmed by the disruption to its existing contracts and interference with its pending contracts. Without the revenue from sale of its products, hiQ will likely go out of business.
> LinkedIn does not specifically challenge hiQ’s ability to make out any of these elements of a tortious interference claim. Instead, LinkedIn maintains that it has a “legitimate business purpose” defense to any such claim. ... That contention is an affirmative justification defense for which LinkedIn bears the burden of proof.
So the real situation is that you can't go out and start blocking access you knew about in a way that would interfere with third-party contracts without a legitimate business reason to do so. The burden of proving the legitimacy of that business reason is on you.
> "A party may not ... under the guise of competition ... induce the breach of a competitor’s contract in order to secure an economic advantage."
Be restaurant. Be on Deliveroo. Be getting low margins because of high fees.
So basically you can’t decide not to use Deliveroo any more to improve margins (“secure an economic advantage”). I mean, you can cancel Deliveroo, but only as long as you’re not “inducing a breach of their contract”. So it's only a matter of time before Deliveroo writes a contract saying “we’re obligated to deliver food for you from said restaurant”.
That's very different from the case in question, where LinkedIn's motive for cutting off hiQ's access is to inflict damage on hiQ because they are a potential competitor.
I don't know Deliveroo, but I think a better analogy would be if you suddenly, even though it is not causing you trouble, denied access to someone picking up food that you didn't contract with, with the full knowledge that the someone would be in big trouble with their customers.
"Be Restaurant" blocking Deliveroo because they can't continue operating with the loss of revenue due to high fees is a legitimate business reason. "Be Restaurant" blocking Deliveroo 2: Electric Boogaloo because I don't like their owner, but continuing to allow Deliveroo access would be, presumably, disallowed.
Also there's nothing stopping "Be Restaurant" from offering an exclusive delivery contract to Deliveroo and forcing Deliveroo 2 out, or requiring a minimum fee for all delivery services, Deliveroo and Deliveroo 2 included.
Of course, I think this is all in a very different area from a restaurant; we're talking about a service provided on the internet. I believe LinkedIn has many, many other recourses here, but, as I see it, the courts are just telling them: this ain't it, chief.
Moreover it seems, 'this harms a competitor of ours' is not considered a legitimate business purpose, but anti-competitive behavior.
Edit: What I mean is that freedom of speech is not the same as freedom of censoring.
This is at least not quite true of First Amendment law. The concept of "compelled speech" exists in US law, and is considered an unconstitutional violation of the First Amendment. Exactly what falls into that category (and whether the right of domain owners to censor user-provided content as they see fit is protected), I'm not sure, but freedom of speech in the US certainly does at least sometimes include the right not to speak.
So the interesting question to me is whether you can lawfully make predictions based on published information if that information is under copyright.
In Europe the answer is probably no, because the assumption is that in order to analyse data you have to copy it first.
To me, this interpretation of the term "copying" makes very little sense. So I wonder what US law makes of it.
Most of these exceptions only apply to non-commercial use though. So they wouldn't apply in a case like hiQ.
UK-specific exceptions are explained here:
Unfortunately, both Labour and the Tories have taken a relatively hard line in the EU copyright negotiations, so it seems unlikely that things will be relaxed very much after Brexit.
There's an infinite number of ways to describe a job history, without any single standard, so I don't think it makes any sense to say that a profile or resume is not copyrightable.
I could see the hit from a scraper being heavier than that of a typical user. There's also the potential that a user will click an ad for any number of reasons; there's no such likelihood with a scraper.
I'm not anti-scraping by any means, but I get the concerns.
You're not obliged to have public access.
Is there perhaps a factor here of users having an expectation that their profile is publicly accessible; so companies hosting that profile shouldn't be able to choose _secretly_ "who" can access it?
If you consider scrapers to have some sort of right to access any public website, any technological barrier inflicts exactly the same harm as an injunction, assuming it is effective. If you allow technical blocking, it would be preferable to allow blocking-by-clearly-stated-wish, because it would save everyone the costs of the arms race. It would also make both parties' success somewhat independent of the resources they can invest into outgunning their opponents.
Your statement makes absolutely no sense. That's not how the internet works. If you serve something publicly, you don't get to cherry-pick who sees it.
Not only does it make no sense technically, it's also hugely anti-competitive.
I’m not sure about being anti-competitive. Serving a website is an action in which you open up your resources for others to access. My friend runs an open source stock market tracking website for free. He started getting hit with scrapers from big hedge funds and fintech companies a couple of months back. This costs him around $50-100 a month to serve all of these scrapers.
Could also delay results, offer reduced temporal precision, and do other things to differentiate use cases.
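A sketch of what that could look like, assuming stock-quote-style data; the 15-minute delay and 1-minute rounding are arbitrary example values:

```python
# Serve free/anonymous clients delayed, coarsened data so real-time
# access stays a differentiated (paid or rate-limited) use case.
import time

DELAY_SECONDS = 15 * 60  # only serve points at least 15 minutes old
ROUND_SECONDS = 60       # round timestamps down to the minute

def degrade_for_free_tier(points):
    """points: iterable of (unix_timestamp, value) pairs."""
    cutoff = time.time() - DELAY_SECONDS
    return [
        (int(ts // ROUND_SECONDS) * ROUND_SECONDS, value)
        for ts, value in points
        if ts <= cutoff
    ]
```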
It is nice to think donation platforms can fund high traffic open source projects, but this is simply not the case.
In any regard, I fear the potential of this ruling limiting developers’ ability to protect their servers and making us all roll over to the big players with their hefty scrapers taking all of our data for resale.
Technically, of course you can identify IP ranges owned by certain entities and restrict their access. That’s trivial, so what do you mean when you say the internet doesn’t work like that?
Legally, there’s plenty of region locked content for copyright and censorship reasons. A distributor might region lock because they don’t have distribution rights in particular regions. Are you saying distributors can’t publish free content at all because they can’t choose who sees it but would be breaking copyright law to publish to everyone? Or a site might region lock because certain content is censored in particular countries. Can you not publish anti-regime articles because a totalitarian country is on the Internet?
The entire world isn’t and shouldn’t be held hostage to the most restrictive laws that exist in the world. And the answer isn’t blocking on the requesting end because that’s technically much harder and blocks much, much more content. So what am I missing?
Edit: Forgot to include the other end of the spectrum. If I, as an individual, host my own site on my own hardware with my own connection that I pay the bandwidth for, can I deny a suspected bot network?
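On the "identify IP ranges owned by certain entities and restrict their access" point above: at the application layer it really is trivial. A sketch with Python's standard library; the CIDR blocks are documentation ranges, not any real company's:

```python
# Deny requests from address ranges you've attributed to an entity.
from ipaddress import ip_address, ip_network

BLOCKED_RANGES = [
    ip_network("203.0.113.0/24"),   # TEST-NET-3, stand-in for a scraper's ranges
    ip_network("198.51.100.0/24"),  # TEST-NET-2
]

def is_blocked(client_ip: str) -> bool:
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)

print(is_blocked("203.0.113.7"))  # True
print(is_blocked("192.0.2.1"))    # False
```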
This interaction can't be made trustless, so every client can only be served based on its IP or some convoluted, hacky exchange that is a cat-and-mouse game at best.
If LinkedIn wants to block access they need to do so by another means that isn’t described as hacking.
I know it is a generally considered bad form to ask, but did you read much of the ruling? I feel like a lot of people on this thread are just going off of Animats' comment and haven't spent much time looking at the opinion.
I didn't read the whole thing, but skimmed through it and read what seemed to be the relevant parts of the argument. (Including the bit that talks about LinkedIn's robots.txt)
The ruling doesn't really support your claim of catastrophe and doesn't claim to pass any sort of final judgement.
The judge makes a specific point about not reading too much into him upholding the injunction saying:
>> These appeals generally provide “little guidance” because “of the limited scope of our review of the law” and “because the fully developed factual record may be materially different from that initially before the district court.”
Which seems fair: it's public or it's not. You can't pick and choose who it's public for and who is a second-class citizen.
It's a good point you bring up, and it may contribute to the death of ads.
The problem is that someone would have to sue Google first, and no one will do that unless there's a big business incentive - and big business can already scrape the shit out of Google.
This is the weird thing about web-scraping. Big companies can get around protections quite easily - it's the small scripts and average users that get hurt by them. No one is going to tell you this because people would stop buying Cloudflare's "anti-bot 99% effective anti-ddos money saving package" which is complete bullshit.
Also, their own robots.txt contains "Disallow: /search". So, there is arguably no inconsistency, either.
But, what does this new ruling mean for robots.txt?
Robots.txt isn't something that bars access to information. It's just a notice that the administrator does not want large amounts of queries against certain resources.
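You can see the advisory nature directly: Python's standard library will happily tell a polite crawler what the file asks, but nothing enforces the answer. (The /search disallow is per the comment above about Google's own robots.txt.)

```python
# robots.txt is a request, not an access control. A well-behaved
# crawler checks it; a rude one just never calls can_fetch at all.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()

# Google's robots.txt disallows /search (as noted above), so a polite
# generic crawler gets told "no" -- but that's all that happens.
print(rp.can_fetch("MyBot", "https://www.google.com/search?q=test"))
```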
Many times, robots.txt is implemented with the intent of barring access to information.
This works by relying on scrapers respecting the file, but it's no different than a no-loitering sign which itself cannot actively stop someone who is loitering.
Google doesn't have a Robots.txt disallowing search because it can't handle a large amount of queries against a resource...
Do you have a source for that claim?
Also, robots.txt is bullshit: if a person can access a public website, why shouldn't an automated script? Technically speaking, it's the same thing.
Imagine you have a piece of information that your neighbors in town come to you for every once in a while. They come over every now and again, ask you for it, maybe even bring you cookies for the trouble, and you provide it.
Then there's Ted.
Ted is insatiable. He hounds you at every minute of your day, constantly asking you the same question over and over. You've done everything you can. You tried to reason with Ted. You tried to contact whoever it is that brought Ted to your neighborhood. You even got so desperate, you moved a few houses down to escape the incessant hounding. That only worked for a little while though, and Ted found you again.
So you tried to stop answering the door; no use, he pokes his head in every time you're in the garage. You demanded people identify themselves first. Oh well, it's changed little. Now he just names himself gibberish names before hounding you for the same things over and over again.
This would not by any stretch of the imagination be acceptable behavior between two people. The main factor in determination for a court injunction would likely be physical trespass, or public nuisance; but no digital equivalent exists currently other than the CFAA, in the sense that in as much as one can prove that the access to the system is inconsistent with the intent or legal terms of providing a service, one may seek relief.
The problem is, LinkedIn has failed to make a convincing argument in the eyes of the appellate court that hiQ's Ted is violating the CFAA while LinkedIn has proactively engaged in activity disrupting hiQ's ability to do business; business which was consistent with the service granted to unknown members of the public at large.
In the Court's eyes, from the sounds of it, it appears LinkedIn is doing the greater harm.
What it looks like to me is this is setting up a common law framework that is going to cause website/service providers to have to choose from the get-go what their relationship to the Net is.
Are you just providing a service over the Net to a limited, select clientele bound by very specific terms of service? Then you may have a leg to stand on in fending off malignant Teds, but your exposure and onboarding will need to have concomitant friction to make the case to the Court that these Teds were never meant to be serviced in the first place.
Or, are you providing a public content portal, meant to make things accessible to everyone, with minimal terms? In which case, no legal Ted relief for you!
Just because it is your "system" and it isn't connected to your nervous system does not mean it isn't capable of being harmed, or of inflicting harm on someone else through careless muckery.
The one thing that disturbs me most is how the Court has disregarded the chilling effect that interpreting a duty to maintaining visibility may incur. A First Amendment challenge may end up being the inevitable result of this legal proceeding.
I however sometimes want to look up people. Or would this be a case of wanting to have my cake and eat it?
If the case is decided in favour of hiQ, then, absent an injunction, what would prevent a website from blocking a scraper? Maybe the website could still block unless and until the scraper gets her lawyers to file for an injunction.
Another interpretation is that if hiQ wins, then in the 9th Circuit's jurisdiction, websites serving public information they neither own nor exclusively license may no longer try to use the CFAA and/or copyright law to threaten scrapers.
The question being: if just having access to the network isn't enough to grant you access to the data of a specific profile, but instead you have to aggregate samples from a bunch of people in the network in order to see "through their eyes" to the data on the profiles of their friends and friends-of-friends, is that allowed?
Because, if even that was allowed, that'd surely open a different kind of floodgate.
Page 31 of the filing specifically differentiates this case from one regarding Facebook:
"While Power Ventures was gathering user data that was protected by Facebook’s username and password authentication system, the data hiQ was scraping was available to anyone with a web browser."
Twitter is the only example I can think of among the large social media sites that doesn't require you to be logged in to see profiles.
My memory might be misleading me here, but I think I remember years ago in one case I could see less info about a profile when logged in to my account than I could see while logged out and navigating from a Google search...
It should have said only that there is nothing judicially wrong with scraping, without also limiting the rights of a service.
This is insane. It really should have just said that "it's legal to scrape" but it shouldn't have said "and you can't stop that".
Well that's not automated, and slightly more costly but it gets you pretty close to the same effect.
Plus Google only exists from scraping content, but I believe their TOS includes "don't scrape our content".
I find it really funny that the scrapers are battling scrapers - like guys you only exist because you do THE EXACT SAME THING
If you want to talk about this in terms of robots.txt, Google is thriving on the fact that other companies don't block their content in robots.txt, but at the same time Google blocks all of its content in its robots.txt.
It seems like you're stating this as though to cast some sort of moral aspersion. I don't get it. If other companies don't want Googlebot to scrape them they just have to say so. Most companies want Googlebot to scrape their content. Google doesn't want other people's scrapers to scrape Google's content. Nobody involved in any of this has done anything unreasonable or morally objectionable.
Yes. This is EXTREMELY frustrating.
Of all companies to prevent scraping, Google is the most ironic.
Especially since their goal is to organize the world's information, it shocks me that there's no way to get access to this organized information from machine to machine.
If I am not mistaken, they no longer claim "organize the world's information" as their goal.
"Our mission is to organize the world’s information and make it universally accessible and useful."
It is the responsibility of moral consumers to avoid spending their money with companies that use these regressive tactics.
Not a big fan of weev, but this sure seems like he got screwed if he was just enumerating public web pages and went to jail for it.
weev was tried in an area under the 3rd Cir. jurisdiction. Somewhat interestingly, his conviction was thrown out in 2014 on venue grounds (e.g. being tried in NJ), without addressing the statutory question.
For example, if a door is locked, but easily defeated, there is an implied assumption that what lies behind it is only available to someone with the key. Another example is an unlocked door with a sign that states "No access without authorization". Or an unmarked and unlocked door that is on private property in a place where it would be very unlikely for someone to reasonably conclude that the access was intended to be public.
Does accessing a page the site owner doesn't want you to access violate the CFAA, or do you need to hack through access controls?
But if you have a someone living there who lets anyone in if they ask, it would be pretty hard to argue they are trespassing.
A user agent must ask for every page with an http request. If the server responds with 200 OK, it’s pretty hard to argue that it isn’t letting (or even inviting) you in.
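That's easy to see mechanically. A sketch with Python's standard library; example.com stands in for any public site:

```python
# Every page view is a request the server chooses to answer. The
# status code is the server's answer to "may I have this?".
import urllib.request
from urllib.error import HTTPError

req = urllib.request.Request("https://example.com/", method="GET")
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status)  # 200: the server chose to serve the page
        body = resp.read()
except HTTPError as err:
    print(err.code)         # e.g. 403: the server chose to refuse
```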
An open door to a home is different, but unfenced property is 100% not trespass until you refuse to leave.
For example, my state's law (emphasis added):
> Whoever, without right enters or remains in or upon the dwelling house, buildings, boats or improved or enclosed land, wharf, or pier of another, or enters or remains in a school bus, as defined in section 1 of chapter 90, after having been forbidden so to do by the person who has lawful control of said premises, whether directly or by notice posted thereon, ...
The federal version of trespass (which I think applies to Indian reservations) considers merely fencing off the area to be sufficient notice that trespassing is prohibited.
It's even possible to be guilty of trespass even if you weren't aware that you weren't allowed on the land. This is negligent trespassing.
If a public park blends into somebody's private lawn, you can't be charged with trespassing for stepping over the line.
To take it a step further, the information on said website is also personal property, and accessing the information without permission is also trespassing. Specifically, this is called trespass to chattels  (trespass is most usually legally defined as trespass to person, trespass to land, and trespass to chattels). Even more specifically, this is the actual part of the CFAA that's being debated: computer trespass gets its roots from trespass to chattels. 
Opening a website is making a request, technically speaking. That is not equivalent to breaking into someone's home and taking information. The equivalent would be standing outside the house and asking someone inside to give you something from it, even after the head of the household told you not to. You haven't trespassed; you're asking someone in the house to do something for you. It's on them whether they do what you request or not.
You also never entered the server. The server got your request and served something back to you. You did not go inside the house and read the contents of a book on the shelf; it was read aloud to you while you were still outside the house.
I'm not saying that websites shouldn't have recourse against people taking all the contents of their sites, just that the CFAA is the wrong tool.
No it isn't. That's the crux of the case.
>It's like putting a big "Order Inside" sign above the door to a restaurant and being surprised when people try to gain entry.
It's like putting a big "order inside" sign above the door to a restaurant, and then also having a separate door in the back of the restaurant that clearly is used only by employees to go to the back office, and not being happy when non-employees keep trying to walk into the back office claiming "well there's a sign outside...".
>You also never entered the server. The server got your request and served something back to you. You did not go inside the house and read the contents of a book on the shelf; it was read aloud to you while you were still outside the house.
According to the courts, you did 'go inside the house' because the electronic signals that you sent to the server as part of the request are enough to constitute the 'physical contact' part of trespassing.
Again, trespassing isn't just about you physically having your body on someone else's property. It also can be your interaction with someone else's property (which can be land, or a door, or web servers) through the use of tools or intermediaries.
If you don't bother to put up an "Employees Only" sign on the door, you are going to have a hard time getting a trespassing charge to stick...
By this reasoning it's impossible ever to hack anything. Even breaking password controls or cryptography is still just sending the server a request and getting something served back.
I'm not "on" your site when I browse there. I asked your server to send me some data and it did so.
It's the real-life equivalent of social engineering. It's so far not illegal for me to ask you things and for you to disclose them to me, even if you weren't supposed to. I'm allowed to lie to you, even to persuade you to tell me things.
It's more akin to you standing just outside my property border and using a fishing pole to pull fish from a pond that is inside my property border. You're still trespassing even if your two feet aren't physically on my land.
The common legal argument (see the second link in my above comment) is that accessing a web server actually does constitute being "on" the server because you are sending signals to my server in order to interact with it, and this satisfies the "physical contact" part of trespassing.
> The courts that imported this common law doctrine into the digital world reasoned that electrical signals traveling across networks and through proprietary servers may constitute the contact necessary to support a trespass claim.
>It's the real-life equivalent of social engineering. It's so far not illegal for me to ask you things and for you to disclose them to me, even if you weren't supposed to. I'm allowed to lie to you, even to persuade you to tell me things.
This absolutely would be illegal and I'm not sure why you think otherwise. Misrepresenting yourself in order for me to reveal to you private information is fraud and is illegal in pretty much every jurisdiction I can think of.
The tool asked the server. The server replied.
> It's more akin to you standing just outside my property border and using a fishing pole to pull fish
Bullshit. Using HTTP to access public information is akin to standing outside your business and writing down the phone number in the banner. Or even reading the "No trespassing" sign.
As long as you're not violating copyright, NDAs or EULAs (and that's debatable) there should be nothing wrong with reading information that you were authorized to view.
You aren't authorized to view it. That's the entire point.
And the lack of access control does not implicitly give you authorization to view it.
When it comes to PUBLIC data on a website, there's no difference. How would I know I'm authorized, implicitly or explicitly, to access a website, say www.google.com? Should I phone the domain owner before accessing?
Just because you meant for something to be off limits but failed to inform anyone doesn't automatically make it off limits. "Trespassing" in a website is analogous to hacking it, using stolen credentials, using exploits and things like that.
Unless some law passes that says that someone remotely accessing a folder called /secrets/, or /inside-the-property/ or something like that is trespassing, it won't be the case.
At no point is accessing a web server similar in any matter to reading words off of a banner posted in a street. You cannot use a faulty analogy of your own to describe why my analogy is faulty.
>When it comes to PUBLIC data in a website there's no difference.
Yes there is. Even for data that is public and meant to be accessed to the public, you still must access the web server. It is much more similar to walking into a publicly accessible restaurant and reading their menu, it is not similar to reading a banner on the outside of the restaurant.
>How would I know I'm authorized, implicitly or explicitly, to access a website, say www.google.com? Should I phone the domain owner before accessing?
A reasonable person knows that www.google.com is meant for public use. It is common knowledge and from whatever avenue you heard about Google, you probably gathered from context that www.google.com is somewhere you are allowed to go.
This is absolutely not the case if you randomly guess a URL like 'mycompany.intranet.io/financials/employeelist.xls'. And it certainly is not the case when you are explicitly told (such as in a robots.txt) that you are not allowed.
>Just because you meant for something to be off limits but failed to inform anyone doesn't automatically make it off limits.
It does, though. The owner of property is under no responsibility to inform the public that their property isn't meant for use. It is up to each individual person to determine if they are allowed to use it or not. This is typically done by context clues and societal expectations: it would be absurd for a random member of the public to walk through someone's open front door and claim "well I was never explicitly told to not come into your house...". The person should know, based on social conventions that you don't just walk into someone else's house, that it's not allowed. This is the same for websites. There is some leeway given, such as if you saw a sign for "Open House" and simply walked into the wrong house. But it is still possible to commit an act of trespassing even if you didn't explicitly intend to: this is called negligent trespassing.
>"Trespassing" in a website is analogous to hacking it, using stolen credentials, using exploits and things like that.
No, it's not. Did you even click on the link I provided earlier regarding trespassing?
>Unless some law passes that says that someone remotely accessing a folder called /secrets/, or /inside-the-property/ or something like that is trespassing, it won't be the case.
That law already exists. It's called the CFAA, and the debate around it is what is being discussed in this post.
When it comes to websites, there are billions of domains on the planet, and each one has multiple internal URLs, ranging from tens to several million. You can't expect everyone to have common knowledge of every domain and link. It is beyond ridiculous to compare the two.
It's true that there's a presumption that sites that are accessible by the public are open for access by the public. But a lack of technical restriction is not an invitation. If a reasonable person would conclude that your access is not welcome, then your access is also illegal. This is the crux of why so much security research is on precarious legal footing. If you find an unsecured MongoDB database with a name like "customer_data" and you download the contents, you are 100% breaking the law.
I know you're trying really hard to sway opinion on HN for some reason, but I'm just going to reinforce the entire point of this thread and, assuming we're staying within the context of publicly accessible information: the Ninth Circuit Court strongly disagrees with you.
Common law torts, such as trespass to chattels, may apply. But it's not a criminal offense.
>but I'm just going to reinforce the entire point of this thread
That isn't the entire point of this thread, nor is it the point of the PDF posted in the OP.
>Common law torts, such as trespass to chattels, may apply. But it's not a criminal offense.
Nobody has said anything about it being a criminal offense. The relation to trespassing is literally the entire point of this thread.
You're always using a "tool" to "extract" data from a web server, unless you're manually operating a telnet session. A web browser is such a tool, an incredibly complex and automated one. cURL is such a tool too, and so is cURL wrapped in a bash script. None of them goes outside what's allowed by the HTTP protocol (protocol family, at this point). And the most core assumptions of the Internet and HTTP combine into a simple rule: if it's a publicly routable server answering HTTP requests, you can issue requests and receive whatever it sends. If a server wants to discriminate, it should set up an auth scheme.
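To underline the point: every "tool" ends up emitting the same kind of bytes. A hand-rolled GET over a raw socket (example.com as a stand-in) is indistinguishable in kind from what a browser or cURL sends:

```python
# The server sees only the HTTP request, not the tool that made it.
import socket

host = "example.com"
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    "User-Agent: hand-rolled-client/0.1\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

print(response.split(b"\r\n", 1)[0].decode())  # e.g. "HTTP/1.1 200 OK"
```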
Yes, you did.
> You used a tool to extract data from my server.
No, that's not how the technology works.
No, of course not. The technical details of how an HTTP request works are not what is relevant here. Don't be obtuse.
What if I placed a sign on my lawn which said "Please, step on the grass!"? Would it still be trespassing?
You laid out a lot of opinions there as if they were facts. They are not. These issues are complex and are still being debated at levels higher than the HN comment section.
>What if I placed a sign on my lawn which said "Please, step on the grass!"? Would it still be trespassing?
No. Of course not. What exactly is your question?
>You laid out a lot of opinions there as if they were facts.
I didn't lay out any opinions. I relayed information that is available from Wikipedia and other sources and rephrased it into an HN comment. None of it is opinion. If you take issue with what my comment says, you can take it up with the courts that made the decisions that gave the information I posted.
My point was that it's hardly as clear-cut as a piece of land, and you know it. You posted a link to Wikipedia's trespass to chattels article, which I think is funny because it exactly proves my point. From your link:
>...several companies have successfully used the tort to block certain people, usually competitors, from accessing their servers. Though courts initially endorsed a broad application of this legal theory in the electronic context, more recently other jurists have narrowed its scope. As trespass to chattels is extended further to computer networks, some fear that plaintiffs are using this cause of action to quash fair competition and to deter the exercise of free speech; consequently, critics call for the limitation of the tort to instances where the plaintiff can demonstrate actual damages.
It is not at all clear that what we're discussing here is a clear violation. It's very debatable, and the law itself was never envisioned to apply to scraping websites (because they didn't exist yet!). It also goes on to say (in the US):
>One who commits a trespass to a chattel is subject to liability to the possessor of the chattel if, but only if,
>(a) he dispossesses the other of the chattel, or
>(b) the chattel is impaired as to its condition, quality, or value, or
>(c) the possessor is deprived of the use of the chattel for a substantial time, or
>(d) bodily harm is caused to the possessor, or harm is caused to some person or thing in which the possessor has a legally protected interest.
The only clause there which even begins to help your case is the 'value' part of clause b and, again, that's very debatable.
> you can take it up with the courts that made the decisions that gave the information I posted.
Decisions made by court A get overturned by court B all of the time. We'll see where it lands, but we're not there yet (again, my point!)
The main reason I see businesses being concerned about being required to serve pages to scrapers (even at a reasonable rate of download) is that there's still a cost associated with it, and more so the more scrapers access the data and regularly re-access it for updates. Similarly, if it is the users of a platform who have input the data and who update it, and they only want it presented on that platform (for whatever reasons), then what rights do they have?
Or the much simpler thing is we could put the onus on businesses who are scraping or will use scraped data to not cause this mass friction.
What about a humans.txt that says "please don't scrape this site"?
A URL that is not authenticated seems more like the latter than the former. The web is public unless people take steps to make it private. Criminalizing access to unprotected URLs is like arresting people for reading the financial info I left stapled to a telephone pole. A while back there was a "world's most exclusive chat room" website that hosted IRC channels gated by Twitter follower counts. People learned that they could just manually increment a URL parameter to access higher room numbers regardless of their follower count. Were those people committing a crime?
Absence of authentication means all access is authorized, otherwise just typing in random urls is a crime.
The judge appears to question whether robots.txt is sufficient to prevent scraping, or if a proper authorization step would be required.
The best real-world analogy I can come up with... I post a No Trespassing sign on my garden, but don't fence/gate the property. Is it ok to access the property and take my tomatoes? After all, the sign is just a suggestion... had I really wanted to prevent access, I'd install a fence.
There is legitimate individual interest in both scraping and non-browser HTTP sessions.
Someone who spends their free time hacking university printers to distribute white supremacist propaganda and is a proud member of the “Gay Nigger Association of America” trolling group would likely not pass.
Weev may have been a good test case if he wasn't a white supremacist.
How does this work technically? I just tried crawling a friend's profile using curl with my user agent set to Google's bot, and it was still blocked.
Most other search engine crawlers provide similar methods.
EDITED: corrected auto-correct.
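Spoofing the User-Agent fails because sites don't have to trust it: Google documents verifying Googlebot by reverse DNS plus a forward confirmation, and sites can implement that check themselves. A sketch (the helper name is mine):

```python
# Verify a claimed Googlebot by IP, per Google's documented method:
# reverse DNS must land in googlebot.com/google.com, and that name
# must resolve forward to the same IP. The UA string is ignored.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        _, _, addresses = socket.gethostbyname_ex(hostname)  # forward confirm
        return ip in addresses
    except (socket.herror, socket.gaierror):
        return False
```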
The title is wrong. The 9th Circuit just ruled that hiQ has a decent enough argument to move forward. The question of whether them scraping a public site can violate the CFAA is not settled.
> We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required.
> The data hiQ seeks to access is not owned by LinkedIn and has not been demarcated by LinkedIn as private using such an authorization system. HiQ has therefore raised serious questions about whether LinkedIn may invoke the CFAA to preempt hiQ’s possibly meritorious tortious interference claim.
Note the tone of the language used in the ruling. The judge makes it pretty clear that nothing is final here.
> It is likely that when a computer network generally permits public access to its data, a user’s accessing that publicly available data will not constitute access without authorization under the CFAA. The data hiQ seeks to access is not owned by LinkedIn and has not been demarcated by LinkedIn as private using such an authorization system. HiQ has therefore raised serious questions about whether LinkedIn may invoke the CFAA to preempt hiQ’s possibly meritorious tortious interference claim.
So yes, HiQ and LinkedIn need to go back and finish the trial, but the language used is in no way ruling on whether or not the CFAA preempts state law, just that even if there is pre-emption that hiQ still has a decent argument.
>> We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required
But I thought that was a few years ago!
How many more overrulings or appeals do we freaking need? I really hope this is the final ruling.
Funnily enough the judge has a similar concern:
>> I write separately to express my concern that “in some cases, parties appeal orders granting or denying motions for preliminary injunctions in order to ascertain the views of the appellate court on the merits of the litigation.”
A preliminary injunction is considered very strong. So it's not that "nothing is final here"; it's pretty much final unless something comes out of left field.
So what exactly did I misunderstand and why do you think this is final?
You are saying "the injunction was to stop LinkedIn from blocking access while [the injunction request] is ongoing".
If the court didn't think hiQ had a strong case, they would not have granted the initial injunction and then reaffirmed it on this appeal.
It still might be case that hiQ has less than a 50% chance of winning in the eyes of the appeals court.
Key paragraphs from the ruling:
> In short, even if some users retain some privacy interests in their information notwithstanding their decision to make their profiles public, we cannot, on the record before us, conclude that those interests—or more specifically, LinkedIn’s interest in preventing hiQ from scraping those profiles—are significant enough to outweigh hiQ’s interest in continuing its business, which depends on accessing, analyzing, and communicating information derived from public LinkedIn profiles.
> Nor do the other harms asserted by LinkedIn tip the balance of harms with regard to preliminary relief. LinkedIn invokes an interest in preventing “free riders” from using profiles posted on its platform. But LinkedIn has no protected property interest in the data contributed by its users, as the users retain ownership over their profiles. And as to the publicly available profiles, the users quite evidently intend them to be accessed by others, including for commercial purposes—for example, by employers seeking to hire individuals with certain credentials. Of course, LinkedIn could satisfy its “free rider” concern by eliminating the public access option, albeit at a cost to the preferences of many users and, possibly, to its own bottom line.
I think this is more about validating hiQ's legal standing in the case.
First, LinkedIn makes the claim that its users have a right to privacy against scraping by such a third party. That's laughable. As the court saw, LinkedIn's whole business model is built on people sharing their profiles broadly, mostly with the public.
Secondly, HiQ claims that LinkedIn's efforts to stop it from using the data are tortious interference. That's bold -- suppose someone is taking your assets (you believe illegally) and selling them to others. Can you imagine the gall: the person taking your assets can sue you for interfering with their subsequent sale of those assets?
Finally, that LinkedIn resorted to using the computer fraud and anti-terrorism statutes to make their argument is ridiculous.
So much craziness to go around. I would've just tossed the case, but I guess there is the whole bit about due process... Maybe HiQ will fail anyway at the next substantive trial, but what a waste of time.
Except that, in the digital sense, it's only copied. They now have it, but you didn't lose your assets or money besides the <$0.001 it costs to serve each web page.
> So much craziness to go around.
I agree - I haven't read through the entire thing, but it looks like, instead of saying "you can't scrape", they could implicitly grant users a license for personal and business use while disallowing resale of the data (carefully worded, of course, to still allow the likes of recruiters to do so). It's like trying to argue that the DMCA says you can't create a torrent file of some movie.
On a technical level, sure. SPAs should pre-render data before sending it to the client.
The problem is that's a ton of extra work when the client will have to fetch data anyway - so it's difficult to justify the time to management.
EDIT: If their page fetches the data with JS, you might actually have an easier time figuring out what their API looks like than scraping the rendered page. You might find there's more data available than is rendered, too.
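A sketch of that approach; the endpoint, query parameters, and field names below are hypothetical, the kind of thing you'd copy out of the browser's network tab:

```python
# Hit the JSON API behind an SPA instead of parsing its rendered HTML.
import json
import urllib.request

url = "https://example.com/api/v1/listings?page=1&per_page=50"  # hypothetical
req = urllib.request.Request(url, headers={"Accept": "application/json"})

with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# APIs often return fields the page never renders.
for item in data.get("results", []):
    print(item.get("id"), item.get("title"))
```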
You might not think that's a good reason, but that's certainly a reason.
I've often wondered when the laws would start to be applied and I think its coming
This is a really good resource: https://accessibility.18f.gov/
A lot of frameworks now have some accessibility built in if you add the correct attributes.
As soon as your application bundle is rendered dynamically on your servers, only part of your site can be delivered via a CDN.
Basically, going all-JS gives you an app model where your server-side code doesn't even know or care about HTML or the web or whatever. It just pushes JSON (or whatever) around and is largely client-independent.
Great model when you need to support iOS, Android, web, desktop.
For many SPAs, the only actual HTML is a header, a container div, and a call to the app's JS. Ignoring the header, there might only be, say, half a dozen lines total.
> google-chrome --headless --run-all-compositor-stages-before-draw --virtual-time-budget=25000 --print-to-pdf='foo.pdf' URL
> pdftotext -raw foo.pdf
Of course there is! It’s my site and I can do what I want with it.
The question is whether there is any legal basis for banning scrapers, i.e., for blocking hiQ. In other words, if hiQ keeps scraping, are they violating anyone's rights and/or breaking the law by doing that?
As long as that remains a legitimate, open question, hiQ can argue they should be allowed to keep scraping without incurring civil or criminal liability. That is the purpose of the injunction: there may turn out to be no legal basis for blocking hiQ, and until the question is resolved, hiQ can keep on scraping.
On the other hand, if the ruling stands, it sounds like it will finally be possible to do useful things with Craigslist.
If anything, the ruling could just push websites to hide information even deeper.
I'm reminded of Dave Chappelle's Halle Berry routine...
Reasonable. If a platform helps an individual make their information public, why should it matter to the platform how the market uses that public information?
> We therefore conclude that hiQ has raised a serious question as to whether the reference to access “without authorization” limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required. Put differently, the CFAA contemplates the existence of three kinds of computer information: (1) information for which access is open to the general public and permission is not required, (2) information for which authorization is required and has been given, and (3) information for which authorization is required but has not been given (or, in the case of the prohibition on exceeding authorized access, has not been given for the part of the system accessed).
> Public LinkedIn profiles, available to anyone with an Internet connection, fall into the first category. With regard to such information, the “breaking and entering” analogue invoked so frequently during congressional consideration has no application, and the concept of “without authorization” is inapt.
I am curious how quickly most pages will get put behind authorization. With the wording of this ruling, you could pretty much go snap up any blog site (say, Medium) and more. I wonder what kind of services would come out of that, having the data in a format that can be more easily parsed/analyzed.
So every e-commerce site is fair game? I assume most are already being scraped, but I cannot imagine having to be in an environment where many of your connections are not people.
If LinkedIn loses this case they (and others) might try to get Google to change their policy (either use auth or some whitelisted IP addresses or something).