Congrats! Web scraping is legal! (US precedent) (parsers.me)
1057 points by ehurynovich 24 days ago | 388 comments



Linkedin is taking this to the Supreme Court: https://www.law360.com/articles/1237505/linkedin-will-go-to-...

No ultimate decision was ever made, and no, this doesn't make web scraping 100% legal. Wake me up when there's a new announcement, because anyone interested in this already knows this old news.


This is a really big deal. Currently (IMHO) the US Supreme Court is a wholly-owned subsidiary of multinational corporations due to the shenanigans that happened with Obama, McConnell and Garland, so will likely side with LinkedIn since it's the larger corporation:

https://www.npr.org/2018/06/29/624467256/what-happened-with-...

I feel like siding with LinkedIn here would open up the web to extortion though, like troll companies that would send cease and desist letters to all scrapers (even search engines). I think it could be argued that letting one company scrape when another is denied is discrimination.

Then again, I don't know how conservative and republican-leaning courts decide corporate law. Maybe in this case since so much money is at stake, they might worry that banning scraping would infringe on something like free speech and ruffle the feathers of some of the wealthier contributors in their base. Especially on the media side since I imagine they use bots in one form or another to find newsworthy stories.

IANAL (obviously!), I just find it entertaining/dismaying to ponder these things in these times.


The thing is, this is still all about a preliminary injunction. Even if the injunction is found to be without merit, that still doesn't provide a final answer as to whether LinkedIn can successfully sue HiQ under the CFAA to force HiQ to stop scraping.


Bad look for LinkedIn: if you want the information to be visible publicly, don’t fault others for learning that information.

Gate the information behind a login, then sue the scraper for violating the TOS rather than for the scraping itself; that I can understand.


Doesn't this ruling say the login gate doesn't matter?


The link you gave says (sic):

> Sign up now for free access to this content > [...] > Email (NOTE: Free email domains not supported)

I think this sort of antinomy[1] sounds especially ironic in this thread. It's practically a legal dependency-injection pattern: you call something free, then administer antinomies, catch-22s, etc.

[1] https://en.wikipedia.org/wiki/Antinomy


5-4 overturn?


"HiQ only takes information from public LinkedIn profiles. By definition, any member of the public has the right to access this information. Most importantly, the appeals court also upheld a lower court ruling that prohibits LinkedIn from interfering with hiQ’s web scraping of its site."

Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?

"In this case, hiQ argued that LinkedIn’s technical measures to block web scraping interfere with hiQ’s contracts with its own customers who rely on this data. In legal jargon, this is called” malicious interference with a contract”, which is prohibited by American law."

This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?

Seems to imply that every business is somehow beholden to every contract signed by anyone.


> Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?

ToS are subservient to the law; you can (probably) terminate the service account of a user who breaks your ToS, but if the user does not have a service account (as seems to be the case for HiQ, which doesn't appear to have used accounts for this), then your ToS does not apply, since you've technically not entered a binding legal contract with them.

> This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?

IANAL, but I believe that'd fall on intent, and intent is often difficult to prove at a personal level, but not necessarily at a company level. If your intent for putting up barriers that happen to impact scraping, whatever they may be, was indeed to knowingly prevent scraping from a particular company, then you may be liable under this decision. This is the only part of the decision I'm torn on, since it's a bit messy to really prove such things. I'd be much more comfortable with allowing companies to take whatever measures they feel necessary to prevent scraping, and also allowing scrapers to legally circumvent those measures without threat of prosecution, assuming they didn't actually hack into anything.


> but if the user does not have a service account (as is the case for HiQ, it doesn't seem they were using accounts for it), then your ToS does not apply, since you've technically not entered a binding legal contract with them.

Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.

I have interpreted the LinkedIn ruling to mean that scraping public data is no longer criminal activity but it still leaves you open to civil lawsuits for violating the ToS of the website you are scraping.


> Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.

How would that even work? If I browse to any random public page of your website, it's served to me before you've even transmitted the terms of service. How could I be bound by those terms of service when I haven't even seen them?


As an engineer, I agree with what you are saying, but I think normal people and the courts disagree.

I think these sorts of contracts are called Adhesion Contracts (https://www.investopedia.com/terms/a/adhesion-contract.asp) and we interact with them all the time. For example, if you valet your car, the valet will hand you a piece of paper with a number printed on it to retrieve your car. On that paper you will find an adhesion contract that is valid and real (although not as powerful as the types of contracts that you sign)


This does not work, at least for software licensing, based on precedents around shrink-wrap contracts, so it likewise would not work for licensing the use of data.

A paper handed to you by the valet is not an immediate contract, since you can decline to agree to it and the service simply does not happen.

You cannot do that with a publicly visible website, unless you show the ToS and require agreement before first use. If you allow a non-transferable license, then said data cannot be used by a search engine. If it's transferable, you have just pushed the problem toward scraping by a different bot. (Well, you could have a direct agreement with a few major search engines.)

Caveat emptor: not a lawyer.


IANAL, but it seems like ToS could still govern your use of the data which you viewed. Sure, it seems like you couldn't claim any violation based on visiting a random page. But if the ToS is clearly identified on the page and you do something with the data that violates them, perhaps the owner of the site has a case.


> perhaps the owner of the site has a case.

Except it sounds like the owner doesn't. If the information on the page is made public, the owner of the page can't place terms on what is done with the data downstream. They'd have to implement some real binding system such as authentication, where the CFAA would apply. (IANAL)


Correct, but all of that is void if the data presented is any sort of protected information (copyright, IP, etc.). You can't, for example, scrape Yahoo Finance for pricing and dividend history and republish on your own stock tools website. They have a license to redistribute that data and publish on their own website. Similar story for copyrighted text and things of that nature.


That would require at least showing that ToS on first use. A link on a page is insufficient.

And said ToS would have to force copyright reassignment rather than a general licence, making LinkedIn culpable for any unlawful content published by users of its site.


I am a lawyer, and there isn't really an easy answer to these questions.

TOS are a lot like EULAs. If they look like contracts of adhesion, then they're going to get more scrutiny and skepticism. A TOS that you claim applies even to every single random visitor to your site, where they do not in fact affirmatively agree to the terms, is potentially going to look more like a contract of adhesion. That's a lot harder to enforce.

If they are used more for CYA so that you can ban undesirable accounts from your website which people explicitly agreed to when they signed up for it, or so that you can just up and alter your entire business model without having to give all of your customers refunds, then they're easier to defend.

Just my general opinion, of course. Every jurisdiction is different.


Also not a lawyer, but you cannot force me to accept your terms of service. Contract law requires both parties agree to enter it.

When you create an account, etc., you are agreeing to those terms. If I browse a public webpage that just has a terms of service link on the bottom of it, I've not agreed to anything.


> Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.

Typically you'll see TOS say something along the lines of "by continuing to access this site you agree..." or "if you do not agree with these terms you may not access this site..."

Whether that's enough to create a binding contract depends on the jurisdiction and who you ask.


It can also depend on the terms themselves. I can put "by using this site you agree to bake me a chocolate cake" on my website all day, but that doesn't mean I will be able to force you to bake me a chocolate cake.


Terms of Service is a form of contractual agreement, which requires there be an offer and subsequent agreement by the parties.

I don't think criminal law was ever part of this.


From the article, the LinkedIn decision was that scraping data does not violate the Computer Fraud and Abuse Act. Violating that act was considered to be criminal activity. (https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act)


But the claim of a violation was only a claim as part of a civil trial. The law has both civil and criminal elements to it, and this is about the tort part of the law.

LinkedIn made threats accusing hiq of criminal behavior, but that doesn't mean there's any criminal precedent being set here, as far as I can tell. And no one was criminally charged.

Separately, part of the ruling states that for the purposes of authorization, defying a cease and desist letter does not constitute illegal access, which might have some criminal implications. They imply some sort of technical authorization system must be bypassed, which didn't happen, since the data is "public."

(Which doesn't square well, imho, with existing meatspace law. If a public serving business banned someone from their store, the door being unlocked isn't an excuse to ignore that ban and trespass. But I digress.)

With the overlapping areas of law, it's admittedly beyond my understanding. But the law is generally viewed, like dmca, as being overreaching, if not at least partly unconstitutional.


The CFAA is overreaching, and used often as a catch all. 'Reply All' has a good episode which explores this. This is actually what was used against Aaron Swartz when he was charged for downloading academic journals from MIT, and why his charges were unjustly severe.

Reply All - #43 The Law That Sticks https://gimletmedia.com/shows/reply-all/rnhoxb


It doesn't completely answer your question, but what Nathan is pointing out is that private contracts cannot negate common law.


There's a long, long history (probably hundreds, if not thousands, of years old) of selling aggregated or processed publicly-available information.

I'm not particularly thrilled with it, but enough people think of it as a service valuable enough to pay for, even if they know they could get it themselves for free.

LinkedIn users (as opposed to the company) might actually like what HiQ is doing, as it may help their own prospects.


> but enough people think of it as a service valuable enough to pay for, even if they know they could get it themselves for free.

It's not free; it takes time to collect the data. Buying it makes a lot of sense as long as you pay less than what your own time is worth to you...


It is true in the current situation, though I would prefer that we ensure free data must be free. In that case buyers of data would be incentivized to pressure providers of free data to improve the data quality.


The data does remain free, as long as LinkedIn still provides it for free.

The data without the noise is what you're paying for. The service of winnowing out what you care about from what you don't care about.

Considering how big of an effort it is, and that the source from which it came is still available, why should the cleaned data be free? If I collect fallen trees from public land and chop them into usable firewood, should my bundles of firewood also be free? Or if I collect solar power with my own solar cells, should I have to give you the electricity for free?


I think this is especially relevant when it comes to things that fall under disclosure & transparency requirements - a lot of information that is legally required to be made available isn't legally required to be convenient. So, as a patient, you may have the absolute right[1] to a free copy of the charge master[2] of a hospital you're admitted to, but it could be required that you pick it up in person or that it is only supplied in microfiche form... so a company that's aggregated this and is reselling it can deliver real value.

1. This specific example is BS but plausible - I just wanted something more specific than the vagaries around things like FOIAs or shareholder reports which both have specific facts that can be rendered useless unless you have the context.

2. Basically, a list of how much procedures cost.


Absolutely spot-on.

I'm thinking of processed GIS data. If you have ever tried using the various formats that are supplied by government sites, you know what a huge pain it is.

I'm happy to pay a reasonable price for an interpreted and bowdlerized version.


I actually have! I had to import a huge file of all of the culverts around storm drains in a state, and each culvert was multiple pieces of geometry, none of them grouped together in any logical way. It was just a huge list of rectangles that looked like culverts when viewed visually but no way to identify them as being one culvert without heuristics on how close each rectangle was to others. Massively long process that should not have been so.


What do you mean free data must be free?

The data is free, but the aggregated formatted data has been worked on and processed, are you saying the resulting aggregated data should also be free? That isn't going to happen, why would anyone do that work for free?

Or are you afraid LinkedIn and others will make everything private? That's completely up to LinkedIn or individual LinkedIn users, what they want to make private vs public. Maybe more data would be made private if they don't want it scraped. I don't think that's inherently a good or bad thing.


I'm trying to puzzle out how this works in practice. So if LinkedIn has truly public data (no login required to view) then it can be scraped no problem.

But if it's only accessible with a login, then it falls under TOS and they can be blocked?


> Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?

This is just a preliminary injunction. This wasn't an actual ruling on the case. This just says that until there is a ruling they can't stop the scraping to make sure the company isn't put under while waiting for an actual ruling.


You don’t understand what a preliminary injunction is then.

It’s a very, very strong indication that they will win. Courts don’t issue preliminary injunctions unless it’s extremely likely the side who won the preliminary injunction will win.


Huh, I thought in the USA they also issued them to avoid the eventual judgement being rendered irrelevant. So, where the case is not clear cut, the injunction could prevent one party acting to 'kill' the other (and so avoid judgement) in the meantime?

Could you cite something on this that indicates this (my understanding here) is wrong?


It only requires a “substantial” likelihood that side will win (not an “extreme” one), which basically means there’s a substantive dispute. The more difficult criterion is a substantial likelihood that irreparable harm will occur if the injunction isn’t granted (irreparable harm is supposed to be a pretty extreme thing — it means you can’t fix it with any amount of money).


Please read the linked decision before putting words in judges' mouths:

https://parsers.me/appeal-from-the-united-states-district-co...


The issuance of an injunction is in no way related to how the future court battle will result.


> This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?

LinkedIn has long wanted to have their cake and eat it too - they advertised that data as being publicly accessible and allow google to index specific user pages but then attempt to restrict other bots from crawling it.

If you have private data behind a login there isn't an issue here - if you have public data but want some people to login before viewing it (or not be able to view it) then that's where this ruling comes up. So, this mostly hits sneaky SEO folks and dark UX patterns that rely on tempting someone with accessible data and then pulling the rug out from them at the last minute.

If your website places data outside of authentication then everyone should be able to see that data... I'm curious to see the specifics around

> Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?

though - DoS attacks are clearly illegal, but with this precedent there's going to be a lot of back and forth to see where the line between DoS and scraping falls... and I think that makes this precedent a lot weaker than the headline would have you believe. A company can still threaten to drag you through a lot of litigation by accusing you of malicious page requests, it'll take a few cases to define where that line needs to fall.


This reminds me about Twitter, when I click to see a thread for a tweet it asks me to login, but if I open the link in a new tab it loads the thread just fine.


LinkedIn wants their data to be scraped by bots, so they have to keep it public; otherwise you wouldn't find people's profiles from Google. They just don't want bots from their competitors like hiQ to scrape it.


To me, this is crucial. If it's public and available for google, it's public and available for everyone. If you want content to be private, then make it private and accept that you won't get search engine traffic. Otherwise, don't be surprised when your publicly accessible content is accessed by gasp the public.


> hiQ argued

That does not mean that the court agreed.

The judges said that CFAA doesn't apply.

In other words, the judges said that LinkedIn couldn't use the US legal system to force HiQ to stop. Judges didn't say that LinkedIn was barred from using technical measures.

The court did allow a preliminary injunction against LinkedIn, due to the possibility of "monopolies" (to be determined in Court later), pending resolution of that latter question.

LinkedIn might still win their claim to their right to block scrapers via technical means.


Something about malicious interference with a contract? That might prevent technical measures?


They want to imply that, but they are wrong.

LinkedIn can't prevent HiQ from attempting to scrape their site through force of law.

LinkedIn can rate limit requests, make their site hard to scrape, change their format, whatever. LinkedIn is in no way responsible for how HiQ fulfills its contract to its customers. HiQ is attempting to say that if I sign a contract to provide you with a Tesla, then it would be illegal for Tesla to stop me from just taking one from them to give to you. If that sounds stupid, that's because it is.


> Seems to imply that every business is somehow beholden to every contract signed by anyone.

The implied contract is that if you publish something, it's public and you have no right to dictate what software people use to consume it.


You also have no obligation to them.


The court document says "... refrain from putting in place any legal or technical measures with the effect of blocking hiQ's access to public profiles." on page 11. I wonder if they mean targeted measures specifically blocking hiQ but allowing others such as Google.


https://www.eff.org/cases/hiq-v-linkedin

LinkedIn aint the victim here...


This is the part I disagree with:

> hiQ also asked the court to prohibit LinkedIn from blocking its access to public profiles while the court considered the merits of its request. hiQ won a preliminary injunction against LinkedIn in district court, and LinkedIn appealed.

Whether LinkedIn is the good guy or bad guy here doesn't matter when the decision creates a precedent for the rest of us.

Surely a healthier precedent is that we can respond arbitrarily to requests and have no obligation to the requester. So what if I want to randomize the html structure on every request or block requests from Tor because 100% of them are abuse? Can someone take me to court on the grounds that either is effectively "blocking" their scraping syndicate? Why not?

I feel like once CFAA is off the table (which I do agree with), the cat and mouse game is a fair middle ground. Keep web scraping a sport!


Here is my attempt to draw an analogy.

There is a large banner next to the highway that shows some weather information which, if properly organized (let's say into a monthly almanac), people would pay money for. The banner owner does not make money this way - he asks you to go to his website and sign up for an account. But you drive the highway (internet) every day, look at the banner, write down the weather updates, and then offer them for sale on your website. The owner gets angry and sues you. The court decides you are free to drive down the highway and free to put your eyeballs on the weather banner, especially given that the banner is available to everyone (LinkedIn profiles are available to view without needing an account), and you are free to use the information you obtained for free - without interfering with said banner - in the form of a monthly almanac that you sell. At the end of the day, the banner owner does not own the weather information that someone else put up there (for example, a meteorologist).

Personally I think it's a healthy decision. Otherwise it would be similar to being prejudiced about who should be allowed to enter and browse a street store that, by law, is open to everyone.


https://en.wikipedia.org/wiki/Tortious_interference

This would mostly mean that you cannot start interfering with webscraping you previously allowed merely because you learned that they're making money with the scraped data.


It seems absurd if the 'interference' only directly affects their own property. Like, if my neighbors start monetizing livestreaming my backyard, suddenly I can't put up a fence? Except worse because in actuality, this third-party contract is costing them money through server load and bandwidth.


Your analogy doesn't hold. Your backyard is private property. The data that LinkedIn publishes is intended for the public. That's why Google can index the pages and give you results from LinkedIn.


> Your analogy doesn't hold.

It does, in the US. You're likely making an inconsistent comparison.

Property ownership has nothing to do with visual access. You cannot legally be barred from casually (involuntarily) perceiving something. It's reasonable to put up physical barriers to reduce what is casually perceived. It's a very good analogy.


However it doesn't hold - as your neighbor I can't bar you from putting up a fence because it'll intrude on my view of your property... granted people try to do that _all the time_ but I think it's commonly understood that putting up a fence for privacy is allowed.

It's also not a great analogy for this case because another party is given continued easy access to view my backyard while the first party is denied - and the analogy breaks down here because, as a neighbor, I have no inherent right to view your private life at least as much as any of your other neighbors.


> it's commonly understood that putting up a fence for privacy is allowed.

Try building that fence into the stratosphere. A regulatory body will prevent that.

> I have no inherent right to view your private life at least as much as any of your other neighbors.

That's a different analogy, not a violation of the first.

It's not necessary for every part of the analogy to hold, being an analogy.


It's trivial to fix that - the exterior of GP's house then. That's available for public viewing; is intended for it, but is private property. If you monetise livestreaming it and describe it in your ToS, GP can't repaint the front door, or get new windows?

Or perhaps slightly less contrived:

If I publish a monthly lowlights reel of my favourite sports team as a podcast discussion on where they can improve in all their lost games, and then they suddenly go on a winning streak for >1month so my USP is gone and I have nothing to talk about..?


Those examples don't fit because they are contracts not made in good faith. They aren't things you can control.

In this case, it was ruled that the public data is available. It was a good-faith contract on the part of HiQ to assume they could collect public data from a public website.

It would not be a good faith contract to assume you could control the paint colors on a property you don't own.

It seems to me that the interference ruling was wholly independent of deciding whether what hiQ was doing is legal.


Does that mean that if a grocery store offers free samples, I can go in every day and take all the samples, and the grocery store is not allowed to selectively deny me access?


It means that if they're offering free samples and refuse to offer you the same service they're offering to other customers they might be in hot water - which is consistent with what a lot of folks consider ethical. Offering an item for free to some folks and not to others is a form of discrimination - it's usually not a particularly troubling form of discrimination but in this case Google is allowed to walk up and take all the samples and the grocery store manager just smiles and nods - but when you (hiQ in this example) try and get one you're hit with an injunction and barred from entry.


Can someone taking down an open source project, like the leftpad debacle, be sued for tort?


I mean, anyone can be sued for anything. I can file a lawsuit with basically zero legitimacy to it. It'll probably get thrown out, but you were still sued.

If the question is could someone win, potentially. The argument would basically have to be that the removal of that open source project is akin to other cases of negligent interference.

If this is a specific concern, consult a lawyer - 'cause I'm not one.


Doubtful. If LinkedIn had completely taken down their site, or put it all behind a login, then this case would not have turned out the same.

Maybe if leftpad somehow tried to block only some users from using their publicly available plugin.


User data is not their property.


Your backyard can be a walled garden--this is about the public front of the property.


Exactly; your backyard is of course yours. But you are not at liberty to use it to damage others. There are lots of rules about this. For example, opening a brothel on your own land is definitely not legal, without even considering how it affects the neighborhood.


When is it "damage", and when is it "declining to contribute" ?


If I decide to change the class names, or HTML structure of the page, is that no longer allowed?

How far does this go?


Are you doing it just to spite scrapers, i.e. with "malicious intent"? If you have some other reason, you won't be guilty of intentional tortious interference.


> "Most importantly, the appeals court also upheld a lower court ruling that prohibits LinkedIn from interfering with hiQ’s web scraping of its site."

How would this affect Cloudflare's "checking your browser" anti-DDoS protection screen, meant to block bot requests from accessing sites?


So if I (like many others) have cloudflare web scraping protection turned on, that is now against American law?


There's no reason to believe that.


It will be interesting to see how this will impact the bot detection market like perimeterX, etc.


> If LinkedIn wanted to force users to sign in to view profile info

Do they not already do this? Every link I've ever seen for LinkedIn has redirected me to sign up page rather than showing me the content.


They want search engines to index their profiles and provide organic search results links to their site, but then those same sites will require you to sign in when clicking a link to another public profile. You can search for that 2nd profile in Google and then view it without signing in, but not by clicking internal links. I've experienced this with Quora, LinkedIn, Instagram, FB and others. They want to have their cake and eat it too.


As a user of LinkedIn, I can pick which portions of my profile information I would like to be publicly available. This is not by default, so most people do not have it public. You can try seeing my profile without logging in. :-)


They block you out after the first few profiles you view. Try a private browser and you can still see them.


Your second point is interesting. I suspect the contract between hiQ and some company is that hiQ provides info on public profiles, and if LinkedIn removes all public profiles by requiring a login the contract would become moot. Just the same if I was to change my profile settings from public to private, hiQ wouldn't be in breach of their contract (nor would I).


Scraping should either be legal or not. The fact that you have a contract to sell content you assumed it was legal to scrape should not matter. Too bad if you lose money.


>Are those [ToS] legally void now?

They were pretty much legally void even before this precedent was established. They are only valid when they don't violate any existing U.S. law. Any authority assumed beyond that is completely false.

>can the court force me to revert that change?

No.


I wonder if it has anything to do with the fact that the data is actually owned by LinkedIn users, and they expressed that they want their data to be publicly available?


Unlikely. The license to LinkedIn retains ownership, but the user's retention of information ownership doesn't compel LinkedIn to affirmatively do things with that data (i.e. LinkedIn isn't forced to vend the data to a given consumer if the user says so).

The license further goes on to clarify that LinkedIn will vend public data to search engines, but the definition of "search engine" is almost certainly assumed (by LinkedIn, at least) to be up to them.


It's because previously public information became private. They can't do that apparently.


I'm wondering if robots.txt might then get you sued for blocking scrapers / bots?


A robots.txt file doesn’t block scrapers, it’s the equivalent of a no trespassing sign. I don’t think putting up signs will suddenly become illegal?


Robots.txt doesn't interfere with anything. It's a suggestion.
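
To make the "suggestion" part concrete, here's a minimal sketch (Python standard library, with example.com as a placeholder): the only "enforcement" of robots.txt is the scraper voluntarily checking it before fetching.

    # Minimal sketch: robots.txt is advisory. A well-behaved bot checks it
    # voluntarily; nothing here technically blocks a request that ignores it.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()  # fetch and parse the robots.txt file

    # The scraper decides whether to honor the answer; the site can't enforce it.
    if rp.can_fetch("MyScraperBot", "https://example.com/some/public/page"):
        print("robots.txt permits this fetch")
    else:
        print("robots.txt asks us not to fetch this path")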


The toxicity towards web-scraping is really what makes me lose hope in the current web. People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it - it's a complete and utter paradox.

This precedent doesn't really mean much, but it's definitely a step in the right direction.


The issue here for some, if not many, is a matter of scale. It is one thing if an end-user, whom I am trying to service, comes to my site and gets my publicly available data. Maybe I monetize with ads, maybe not. It doesn't matter, that is the audience I am trying to service, regardless of size.

But when you scrape it my load goes up dramatically. A load I have to pay for.

It is analogous to the privacy debates going on, with one side saying "hey, don't track everywhere I go and tag me with facial recognition" and the other side saying "hey, you are in public and people can see you." The issue is not complete privacy, but one of scale. And of intent.

I believe society is soon going to have to come to grips with the scale of things and legislate what are acceptable scales of action, as it seems to be becoming a large issue in a growing number of areas.


So you throttle your users. We have HTTP status codes for "too many requests" (429), and all scraper software comes with a delay setting by default. Everybody who does scraping is supposed to know that it's rude to blast a thousand requests per second.
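
As a rough illustration of that etiquette (assuming the third-party requests library and placeholder URLs, not anyone's actual endpoints), a polite scraper spaces out its requests and backs off when it sees a 429:

    # Rough sketch of a "polite" scraper: space out requests and back off
    # when the server answers 429 Too Many Requests.
    # Assumes the third-party `requests` library; URLs are placeholders.
    import time
    import requests

    URLS = [f"https://example.com/profiles/{i}" for i in range(100)]
    DELAY_SECONDS = 2  # default-style delay between requests

    for url in URLS:
        resp = requests.get(url, headers={"User-Agent": "MyScraperBot/1.0"})
        if resp.status_code == 429:
            # Honor Retry-After if it's given in seconds; otherwise wait a minute.
            retry_after = resp.headers.get("Retry-After", "60")
            time.sleep(int(retry_after) if retry_after.isdigit() else 60)
            resp = requests.get(url, headers={"User-Agent": "MyScraperBot/1.0"})
        # ... parse resp.text here ...
        time.sleep(DELAY_SECONDS)  # don't blast a thousand requests per second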


This ruling has left open a big question of how much you need to spend to support scrapers and where the line between scraping and a DoS attack lies - and that's going to be a weird line. If my site is producing a big report off of data that changes quarterly then re-downloading that report every 20 minutes is possibly excessive and might wander into the realm of an attack - while if we looked at the same frequency with twitter it seems a lot more reasonable - maybe even a bit on the slow side.


Provide an API for public data to reduce the costs associated with rendering a full blown page, and deliver just the information needed.
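
For what it's worth, the kind of endpoint being suggested could look something like this hypothetical Flask sketch (made-up route and field names, nothing to do with any site's real API) - serving a small JSON payload instead of rendering a full page per request:

    # Hypothetical sketch (Flask), not any site's actual API: serve just the
    # public fields as a small JSON payload instead of a full HTML page.
    from flask import Flask, abort, jsonify

    app = Flask(__name__)

    # Placeholder in-memory data; a real site would query its datastore.
    PUBLIC_PROFILES = {
        "jdoe": {"name": "Jane Doe", "headline": "Data Engineer", "location": "Austin, TX"},
    }

    @app.route("/api/public-profiles/<username>")
    def public_profile(username):
        profile = PUBLIC_PROFILES.get(username)
        if profile is None:
            abort(404)
        return jsonify(profile)  # no page rendering, ads, or extra assets

    if __name__ == "__main__":
        app.run()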


Entirely feasible. Also reasonable for you to pay me for the service as it is taking my development efforts to meet your business model. The advantage to you is you have a defined interface that I won't prevent.


I guess you missed the comment I was replying to: it may cost you more money, in bandwidth and per page resources, to not provide an API than it does for you to provide one.

So no, I won’t pay you for the privilege of you saving money.


No, I'll happily just scrape your site instead. But if you'd rather not have that happen, provide an API.


Who pays for that API and the bandwidth? What’s in it for the data provider? On LinkedIn, viewing the data now shows ads or at least prompts the viewer to join the network. With scraping and free API access, how exactly does LinkedIn benefit from their work of hosting the data?


My guess is hiQ (and others) would happily pay for an API over the data they're scraping right now.


Unless the costs exceed their current operational costs. Don't forget the time spent redeveloping on the new API, which includes validating everything is there, testing and cleaning up and removing the old (working) code.


Why buy the cow when you get the milk for free?


This isn't a great analogy here - getting the data delivered via API is simply more useful than having to re-assemble that data out of fragments parsed off of different web calls.

Could I suggest:

"Why buy the cheese when you get the milk for free?"


Look, if you build a product that relies on providing free information to the public, then you don't get to select a segment of that public and charge them for it. You can't hang a billboard on a highway, but then get upset when some people look at it the wrong way.

Now, if you want to have a walled garden and charge for entry to some and let others in free then that is fine.


The contention was not about load, it was about using the data.


> People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it - it's a complete and utter paradox.

That's a complete misconception. Of course you can manufacture inconsistent ideologies if you combine ideas from different people, but I think you'd have a difficult time finding one person who believes what you just described.

What I want is, put simply, organizational transparency, personal privacy. I believe humans have a right to privacy, but I don't believe organizations have rights, period, and I believe radical transparency within an organization prevents organizations from trampling the rights of individuals.

Organizations in this case include corporations, governments, and nonprofits.


I'm interested in hearing your take on "organizational transparency". Like please push the concept / idea to its 'full' realization and tell me that picture, even if it implies a little bit of "sci-fi"¹.

Digging this because I think that domain / paradigm will see unparalleled evolution in the next few decades.

[1]: I mean, don't stop at current law / values / behaviors; like people from the 1940s wouldn't have dared speak about their idea of the 1970s because they'd think their belief "impossible". No flying cars though (Clarke-tech), because that's not a decision of the individual.


I don't think that looking too far ahead is useful: this is just a matter of pragmatics. Revolutionary change in a peaceful society happens via a long sequence of small, incremental changes, and that's a good thing, because you get to see how each of the changes plays out. I think the best sci-fi persuades you that it's looking at the distant future when in fact it's only using the future as a foil to provide deep insight into the present.

The short-term, the small, incremental changes I'd like to see are:

1. Reversal of the default privacy setting of government docs. Instead of documents being default-private and citizens having to make FOIA requests to make those documents public, documents should be default-public, and government workers should have to apply through an adversarial system (similar to courts) to classify documents, proving to a court why the document needs to be classified.

2. Classified documents should have a short (1 year max) timeframe after which they are declassified, or government workers should have to reapply to justify why the documents need to remain classified.

3. Political party documents should be public, without any provision for classifying them.

4. Tax-exempt organization documents should be public, without any provision for classifying them.

5. IPO'ed organization documents should be public, without any provision for classifying them.

6. Body cams on all police and military while on duty (when they are acting on behalf of an organization). 1 and 2 would apply to the footage from these cams as well.

7. Exceptions to 1-6 should be made for the personally-identifiable information of people who are not in the organization.

8. Organizations should be required to maintain a list of all the personally-identifiable information they have on a person (including employees), and provide that data to that person on demand by that person or their legal guardian, as well as a list of all people with whom that data has been shared, and be required to delete that information upon request by that person or their legal guardian.

9. Research which receives public funding should be forced to publish its results publicly.

10. All software which receives public funding should be forced to publish its source publicly.

11. Government documents should be published in open formats suitable for computer analysis (e.g. CSV, text, or some XML format - no PDFs).


Can't agree more with your first paragraph.

1, 2, 6, 11 should be no-brainers if people were educated IMHO — but this is 1920 relative to electricity or cars; still a long way to go before the mainstream masses get it (which very much includes political figures). I would think 2030-2040 for the emergence of ethical consensus and concern (the kind that pervades political parties and social classes).

That is assuming the needle doesn't move too much farther in the authoritarian direction until then (the 20-year trend is really not looking that way currently).

3, 4, 5, 9, 10 are/would be met by strong opposition from interest groups, I'm sure you see that too. Everything I know about 3 tells me it's never going to happen with current parties / politicians. It's at least 1 generation away and I'm not sure the concept itself isn't utopia. 9 and 10 as well, I think it largely depends on the cultural paradigm (and this world's in 2020 is really not aligned with that, nor does it trend or even look that way). 4, 5 likewise, complex topics, lots and lots and lots of gatekeepers and lobbyists.

My take on these is they're very costly in terms of political capital; and they are largely debatable (politically, legally, philosophically, etc., you'll find passionate captains on both sides); thus there are 'better' (more consensual, with direct net positive effect) lower hanging fruits imho.

7 and 8 are hard problems, notably because of scale and the need for automation — it's part of a much bigger domain, automation of compliance and building "trustable" systems etc.; the kind that bridge or plane engineers must build, and probably software engineers too, but you know we're far from that if you read this forum.

I'd say 1 2 6 11 and 7 8 on the way to scale/automation already paint a whole different regime and degree of maturity for a 21st century State. I'd like to think we're now ~1 generation away from enactment of such norms.


What if the organization is one person in an LLC? Do they get rights? If so then a big company can hire a bunch of little LLCs to act as rights-having proxies for any task that requires them.


I'm going to assume you're asking in good faith and try to address the confusion here.

The human does get rights, the organization doesn't.

In some cases, believing that humans have rights and believing that organizations have rights might lead one to the same action. In those cases, I'd take the action. I wouldn't want to violate a human's rights out of some vindictive dislike of organizations: that's not the point. The point is that I'd take that action because I believe in human rights, not because I believe that the organization has rights.

And with organizational transparency: the entire point of organizational transparency is to protect human rights. In cases where organizational transparency would trample human rights to privacy, I would go with human rights every time. Violating the privacy of humans to achieve organizational transparency would defeat the entire purpose.


Let's say that individual humans have the right to keep secrets. Let's also say that they have the right to keep secrets with their associates, and to tell them to whom they please. Now, doesn't that make it legal for a group of people to keep secrets about you? What about selling them? I just don't see what doing away with the legal fiction of corporate personhood would do about Facebook.


It may not be your intent, but you're using some very vague, inapplicable terminology to make some screwed up behavior sound normal.

If you can tell secrets to who you please and sell them on the internet, they aren't secrets. Somewhere in the middle of what you're saying, the secrets stopped being secrets, but you kept using the word as if it still applied.

Facebook isn't a group of associates trading anecdotes about their friends: the server guy has never met Mark Zuckerberg, and they are not "associates" in any meaningful way. They're not friends, or even really allies: Facebook certainly has shown inconsistent concern for the well-being of its workers. So let's also drop the "associates" terminology: these aren't "associates", they're employers and employees. Employees aren't acting as individual humans on their own behalf, they're acting on behalf of an organization.

Putting aside the rights conversation for a second, let me ask you a question: if you tell your friend a secret in confidence, and they turn around and sell it to anyone on the internet who will pay a low fee, that would be pretty screwed up, no? We don't even have to talk about rights here: this is just screwed up behavior, regardless of the rights conversation.


"Now, doesn't that make it legal for a group of people to keep secrets about you?"

A group of people sure, but corporations are not people.


Corporations are groups of people.


Corporations are a power of attorney document. They are golems that sometimes act on behalf of people. They are not people.


Are contracts allowed in your worldview? Contracts must be signed by individuals, but when they act as representatives of a company, they are legally binding for that company. If they only have individual rights, then all individuals who didn't physically sign a contract cannot be held to it.


Correct.

At a small scale, it makes sense for people to appoint another person whom they trust to represent them in negotiation. But in a lot of cases, that's not how representatives are chosen. Particularly in corporations, the leadership of a corporation was not chosen by the employees to represent them, and in fact often doesn't even have the best interest of the employees in mind. A lot of the largest problems in our society arise from this fact.

Consider the case of a company that agrees to sell to a larger corporation, under the condition that they lay off half their workforce in advance of the sale. Surely we can agree that the laid-off workers were not fairly represented by the person signing the contract.

One could argue that the workers agreed to give up some of their rights in their employment contract, but I'd argue that they did so under duress: their option is sign the contract and work for the company, or starve and let their families starve. Sure, they can go work for another company, but other companies will require them to similarly sign away their rights.

This shouldn't be taken as a recommendation to blithely break contract law. Corporations don't have rights, but they do have power, and it would be unwise to behave as if they can't make your life miserable if you decide to cross them.


> The human does get rights, the organization doesn't.

Organisations are (usually) legal persons, too; they just have fewer responsibilities, get fewer rights in exchange.


That's how things are, not how they should be.


No rights, full liability would be a bad deal too.


A bad deal for whom?

It's a bad deal for corporations, but I do not care. Lack of liability is the cause of a ton of problems in our society.

Just to pick two stories of corporate sociopathy: Probably the reason people at State Farm are unconcerned about forging signatures[1] is that they know that the worst case scenario is that State Farm loses some business and maybe gets a fine: they are unlikely to go to jail for forgery or to have fines exacted from their personal bank accounts. Similarly, when Practice Fusion literally killed people[2], their execs had little to fear: nobody went to jail, nobody was fined: shareholders who had no visibility into the decision paid the fines.

When banks tanked the economy with irresponsible lending most were bailed out and gave their workers bonuses, while the people who were unable to pay their mortgages were ignored.

A little more liability for destructive behavior would be great for most people.

[1] https://news.ycombinator.com/item?id=22177812

[2] https://news.ycombinator.com/item?id=22165946


> they are unlikely to go to jail for forgery

Forgery is criminal regardless of whether a private or official document is concerned. Even in a military setting, forgery of business-related documents is illegal.

> A little more liability for destructive behavior would be great for most people.

Why not full rights, full liability? Replace imprisonment and death with temporary and permanent suspension of the company (including re-establishment of a successor organisation out of a subset of stakeholders), respectively; and voila.


That's an optimistic theory, but read the link I posted and decide for yourself if you think anybody is going to end up in jail for it.

It seems like you're just picking out one aspect of each of my posts to disagree with. Do you have any disagreement with my overall point?


The person and the LLC are legally separate entities, although an LLC is often passed through for tax purposes.


Making organization membership public would trample on personal privacy quite effectively in some respects, such as with disease support groups or PACs; medical privacy is taken seriously, but is there such a thing as political affiliation being private? Is it a violation of someone's privacy to reveal they give to the ACLU?


This is where pseudonymous identities make a lot of sense.

In a world where organizations are radically transparent and individuals have radical privacy, before you join the disease support group or donate to the ACLU, you already know the organization's records about that transaction will be public information.

You can restrict your activities to organizations that do not keep personally identifiable information. Or you could join the support group with a pseudonymous identity or donate cryptocurrency to the ACLU.


The entire point of organizational transparency is to prevent organizations from trampling the rights of individuals, so in cases where organizational transparency would trample the rights of individuals, the rights of individuals supersedes the need for organizational transparency.


So, should it have become public knowledge that Brendan Eich donated $1,000 to support Proposition 8?


I'm gonna need some context there. To begin with you can't just donate to support a bill. What you probably mean is that he donated to an organization or politician who supported Prop 8, so say that.

The important detail is whether the funds were his personal funds, or whether they belonged to his company: i.e. was he acting on his own behalf, or on behalf of an organization?

It's an inane question, though, because the inane point you're trying to make is that individual privacy might protect homophobes. That's true, but it would also protect gays and allies who donated to oppose proposition 8. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason gay marriage is legal today is that people donated to people like Harvey Milk, at a time when donating to the campaign of an openly gay politician was risking your job and social standing.

Human rights still apply to humans who do bad things. If you are willing to give up human rights to fight bad people trying to do bad things, then those rights won't be there to protect good people trying to do good things, either.


So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?


Since you're just openly ignoring the post you're "responding to", I'll just copy-paste my response to what you have just said, with some minor changes:

> So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?

Yes, but it would also protect people who donated to a virtuous cause. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason virtuous causes have had any success at all is that people donated to support them, at a time when donating to those virtuous causes was risking your job and social standing.

Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.


> Since you're just openly ignoring the post you're "responding to", I'll just copy-paste my response to what you have just said, with some minor changes:

I'm attempting to clarify my question by removing irrelevant details, which it looked like you got hung up on last time.

> Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.

So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".


> So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".

No. Please try to respond to what I actually say instead of making stuff up; this is a straw man argument.

There are plenty of other ways we could find out about organizations supporting odious causes and boycott those organizations, without violating the privacy of their members. In fact the point of "organizational transparency" is to make it hard to hide when organizations do bad things.

In addition to accusing me of saying things I didn't say, you're ignoring what I actually did say. Are you willing to make it impossible for people to privately donate to virtuous causes as a means of social change, when donating to support those causes publicly so would be a risk to their careers and reputations? I'm not going to continue this conversation further if you won't respond to this point.


> I don't believe organizations have rights, period

So a group of people, joining together in a common cause, don’t have rights as members of that group?

You are contradicting yourself. Organizations are simply groups of people with a shared cause. To deny rights to the organization, you necessarily have to deny personal rights. Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. “have no rights”, the seller is allowed to sit in the room during their meeting. However this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals, since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in since, in this scenario, their organization doesn’t have rights.

It’s absurd.


> So a group of people, joining together in a common cause, don’t have rights as members of that group?

Emphasis added. No, they don't. They have rights as individual people.

> [...] this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals [...]

Yes, that. "Sam" and "John" would have cause for civil (and probably criminal) action against the interloper. "John and Sam, Inc" has no say in the matter.

It might be useful to grant John and Sam, Inc the privilege to own property, but even that isn't actually a right except insofar as Sam and John have a right not to have the value of their assets (i.e. 50% ownership of John and Sam, Inc, functioning as a proxy for ownership of various comics) actively sabotaged/vandalized.


> Organizations are simply groups of people with a shared cause.

No, they're not. As soon as you have two people in an organization, you've got two different causes. The shared elements of those causes allow them to collaborate, but each of them has slightly different views of what they're working toward. And even when you have a very small, well-specified goal, every individual in the organization has different levels of investment, and other values and boundaries they're not willing to cross to achieve that goal. And the larger organizations get, the wider the variety of disparate goals that can occur within the organization, because individuals in the organization may not even interact directly with one another.

Example: John and Sam work for Facebook. John and Sam want to make money to feed themselves and their families, and don't want to have PTSD. But Mark Zuckerberg wants John and Sam to look at an endless stream of horrific PTSD-inducing images so that he can maintain the reputation of the Facebook platform and get incredibly rich. Where is the shared cause here, exactly?

> To deny rights to the organization, you necessarily have to deny personal rights.

Just because an organization doesn't have rights doesn't mean we have to go out of our way to take away their rights.

> Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. "have no rights", the seller is allowed to sit in the room during their meeting. However, this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals, since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in, because, in this scenario, their organization doesn't have rights.

It sounds like you've figured out that John and Sam, as individuals, each have a right to have a private conversation. John and Sam, Inc. doesn't have rights, but that doesn't suddenly remove John and Sam's individual rights.

The entire point of organizational transparency is to protect the rights of individuals, so obviously if organizational transparency would violate the rights of individuals, the individual rights to privacy supersede the need for organizational transparency.

It's telling that your example of an organization has two people in it. At a small scale, organizations tend to protect the rights of the individuals in the organization fairly well. It's at larger scales that the non-rights of an organization come in conflict more often with the rights of individuals.


You see this inconsistency in individual people all the time. Like how almost everyone wants to be found easily on LinkedIn, Twitter, etc, so they add a public profile photo. Then they freak out because someone starts scraping these photos to build a facial recognition model.


I think that comes from ignorance of the implications of uploading their photo, not from an inconsistency in what they want.


One of my clients is involved in property tax collection and reporting. Property tax records are public info, and their website allows looking up the records for any property without a login. However, the data behind this website is the _source_ of the public records, and not the public records themselves (which would be local government databases).

For years now we've been in an arms race with someone using a botnet to scrape all of the account information for a particular county. My client doesn't care so much about the data; it's the server load that's a problem. Normal activity for this site is a few dozen account searches per minute, but when the botnet gets through our blockade it sends hundreds of search requests per second, overwhelming the site. The operator of the botnet has NEVER tried to contact my client to ask for an efficient api to access the data, which they'd probably provide for a minimal fee.

Data hosting isn't free, even if the data is.


Wouldn't the solution be to offer a streamlined download (maybe even as a torrent if you're worried about bandwidth) of all the data then?


I work on a fully open data repository. The website has the API linked in 3 places, so when I find inappropriate scraping I block it with "HTTP 420 ... see <API link> or contact <email>".

Some people probably switch to using the API, but no-one has ever contacted us. They either give up, or run their scraper on a different computer -- I've seen the same scraper move between university computers, departments, then (in the evening) to a consumer broadband IP.
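
For the curious, a minimal sketch of what that kind of block might look like, assuming a Python/Flask app; the heuristic, the API URL, and the contact address are placeholders rather than this poster's actual setup:

  from flask import Flask, request, make_response

  app = Flask(__name__)

  API_DOCS_URL = "https://example.org/api"      # hypothetical placeholder
  CONTACT_EMAIL = "data-admin@example.org"      # hypothetical placeholder

  def looks_like_scraper(req) -> bool:
      # Placeholder heuristic; a real deployment would look at request
      # rate, path patterns, missing browser headers, etc.
      ua = req.headers.get("User-Agent", "").lower()
      return "python-requests" in ua or "curl" in ua

  @app.before_request
  def point_scrapers_at_api():
      # Returning a response from before_request short-circuits the request.
      if looks_like_scraper(request):
          body = ("Automated access detected. Please use the bulk API at "
                  f"{API_DOCS_URL} or contact {CONTACT_EMAIL}.")
          return make_response(body, 420)  # non-standard code, as in the comment
      return None  # fall through to the normal view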


I really don't understand why anyone would bother writing and using a web scraper when an API exists. Does the API not provide all the same data/functions as the website? Scrapers are a big PITA compared to just using an API: they're much harder to write to be reliable, and they can break at any time, whenever the site makes even the smallest change. APIs avoid all that mess, and make performance far better too (on both sides), since you're only downloading the data you want, not a ton of Javascript and HTML that you don't.


APIs are often not as complete as the web interface, since the customer sees the web interface and normally the customer is what drives the revenue model of the company.

If pages are driven via an API, then the API is preferable, but publicly facing websites are often a mix of server-side HTML generation and API enrichment, for caching if nothing else.


In that case it seems that the webmasters complaining about scraping need to make sure their APIs actually provide access to all the same data, if they want people to use the APIs instead of scraping.


If the scraper contacted the client, said what they need the data for, and (probably) paid for api access, then my client would probably go for it.

My client is under no obligation to make access to this data easier. It's not really their data either; the information is property addresses, owner names and addresses, and tax assessments and payments. My client wouldn't want to make it easier for scammers to get that data. So they're not going to do anything unless they know the scraper is legit. If that's the case, the api would require authentication, and any fees would be for the server load, not the data.


probably, but as GP said, "The operator of the botnet has NEVER tried to contact my client to ask for an efficient api to access the data"

Some people just don't care for the commons


For what purpose? That’s like suggesting that if people keep jumping your fence and trampling your roses because it’s a shortcut to a public park (in this case, the county records office) that already has public access roads, that you should be obliged to build a sidewalk through your garden, at your own expense, when the real answer should be that the public road should be improved.


If your goal is to get your roses to stop being trampled, it's probably easier to install a few pavers than to spend years petitioning to get a road built.

The ideal answer and the efficient answer are not usually the same.


Yeah, the real problem with scraping is that it's often done very haphazardly and bluntly. Sometimes it's very difficult to tell the difference between a scraper and someone trying to DOS your site.


I'm in a similar job. We block people from scraping if they break a threshold, but we also refer them to the reporting system, which can get all of the information that they are collecting in a variety of formats.

I wonder if something like this would be allowed: if all the public information was available in a well-collated format, then can scrapers be blocked? I imagine that will eventually be fought in court as well.


Maybe you could contact the scraper? Just post magnet links on the site that let them get a nicely formatted dump of what they want.


We did figure out who the scraper probably is, but only after several years. For a long time they used an untraceable botnet, but after blocking that they eventually switched to a corporate network we traced to a data aggregation company. But we don't know for sure who's doing the scraping; it could be the company, a rogue employee, or a botnet that got loose on their network.


At work we have all of our data available publicly as easy-to-parse XML files, but no matter what we do the bot owners refuse to use it. They'd rather hammer our search engine with sequential searches instead.
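
To illustrate how little effort the bulk files would take compared to scripted searching, here's a rough Python sketch; the URL and element names are made up, since the actual dump format isn't specified above:

  import urllib.request
  import xml.etree.ElementTree as ET

  DUMP_URL = "https://example.org/exports/records.xml"  # hypothetical URL

  # One bulk download replaces thousands of sequential search requests.
  with urllib.request.urlopen(DUMP_URL) as resp:
      tree = ET.parse(resp)

  for record in tree.getroot().iter("record"):  # element names assumed
      print(record.findtext("id"), record.findtext("title"))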


Would rate limiting be a viable solution?


We've done that, but it's tough to rate-limit a botnet because of the IP address spread. Also, their crappy scraper software doesn't even bother to check if requests are successful; it spews them just as fast no matter how our site responds.


No. The botnets work through multiple regions on multiple cloud providers - that's how they achieve such high throughput. For any single IP address, the load is reasonable, but for the whole botnet it's absurd.

Currently bot traffic accounts for 2/3 of my load, meaning that the cost of providing my service is 3x what it would be without these persistent bots.


Can you just put it behind a CDN that protects you?


Just put an option to download the raw CSVs buried somewhere on the site. Someone who is putting in the effort to run scraping bots will find that link, and save your server the load.


> People want their data to be public

People don't want their data to be public. People want other people's data to be public. One's own data everyone thinks should be private and tightly controlled. This applies to people and businesses equally.


In this case, LinkedIn users kind of do want their “public profiles” to be public. They’re online CVs; by definition, if you make one, your goal is to get it into the hands of anyone who asks for it!

LinkedIn, likewise, has built its business model on an implicit contract with its users that it’s going to show their CV to anyone who asks for it.

I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV. A CV (individually, rather than in aggregate) is ultimately useful for only one thing: marketing the CV’s author’s skills. Why wouldn’t I want my marketable skills scraped into some private “talent matchmaking” agency’s databases, such that someone could find me—and hire me—when I show up as a result of some fancy OLAP query they paid that agency to run on their scraped data? It’s more roundabout than them just finding my CV on LinkedIn, but I’m still glad they found it!


>I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV.

LinkedIn is really really clear that:

- They won't share your information with 3rd parties

- You're not allowed to use information on LinkedIn for commercial purposes without their permission

- Other users can view your personal data

So, why would I expect random third party companies to be able to scrape and sell my personal information?

My personal information is there for the individual use of others, and for authorised use by recruiters (who are vetted/managed by LinkedIn).

I've chased down the convention spam mail I get using my GDPR rights, and surprise surprise, they got my details by scraping LinkedIn. That is absolutely not expected nor acceptable use of my data...


There’s a difference between “my information” and “the public webpage that I went through a publishing workflow to create from a curated selection of my information.”

Let me put it this way: if I have a Wordpress blog, I’d certainly be miffed if Wordpress let bots see my drafts... but I’d also be miffed if Wordpress didn’t let bots (Google, for one!) see the published blog itself. It’s a blog; a public website! Anyone or anything with the URL is supposed to be able to retrieve the page! It’s not “my information” any more†; it’s been broadcast!

† You might want to mentally analogize to copyright, but I don’t think it’s the right model for the intuition people have here. Instead, try mentally analogizing to confidentiality. When a classified document is published in the public sphere (e.g. as evidence in a trial, as testimony before congress, etc.), this forcibly declassifies it. No matter how much the originator of the document might want to still keep it a secret, the legal protections of confidentiality don’t apply to it any more: it’s out there now. Anyone who reads it could plausibly have just read the public-sphere copy, so there’s no longer any way to charge people who have knowledge of the previously-classified information with any crime.


Well, my name, my job title, my employer, my job history.

These are all my information, and selling them to marketing companies is definitely not archiving.

Would you be OK with a company scraping your blog and selling it?


> Would you be OK with a company scraping your blog and selling it?

Selling it how? If they put my blog posts in a book and try to sell that book, that’s copyright infringement. If they put my blog posts in an ML model corpus to train a translation service, and they then charge pay-per-use access to the resulting service... I don’t think I’d care, nor do I think there’s anything morally or legally wrong with that. If they scrape my name and phone number and generate a Yellow-Pages-like index from them? That’s explicitly allowed by law; and heck, that’s why I embedded the information onto my site in vCard microformat in the first place!

To put my philosophy succinctly: if web.archive.org can scrape your data without you having an explicit relationship with them granting them that right, then bad.evil.com can too. You can allow both (= publicizing your information), or neither (= protecting your information), but you can’t allow one but not the other. “Third parties you don’t have a relationship with, who access your data through the public sphere without entering into a specific licensing arrangement with you” are legally one big amorphous blob. You can’t make a law that splits that blob up, because it’s an opaque blob; in the ACL system that is contract law, all entities you don’t have contracts with are just one entity—“the public.” If you want some specific entities to have access to your information, that’s what protecting your data (= setting an ACL “the public = disallow”) and then explicitly licensing it out by entering into contracts (= setting an ACL “entity X = allow”) is for.


Why can't I have terms on my website that say how you can use my information?

Examples where this is allowed:

- Images/media (Creative commons)

- Code (Open source licenses)

You say it isn't allowed for:

- Personal data

Unless I'm misunderstanding your philosophy (which seems to say copyright is OK, but public information must be public to all): You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my name, job title and employer as a marketing bundle?

Edit: An aside, it's really confusing that you seem to be editing your previous replies minutes after I responded. I thought HN only let users edit during the "no replies" period?


>Why can't I have terms on my website that say how you can use my information?

>- Personal data

So there's a couple of things in play here. You can't (generally) copyright facts - "Cthalupa is a Rocket Surgeon for the Space Force since 2001", if true, would not be something that I could get a copyright on.

https://www.newmediarights.org/business_models/artist/are_fa...

The second thing is that terms have to be agreed upon by both parties. If you give me information without us coming to an agreement on terms, I can't be bound by them. If you just put a link to a TOS on your website and don't require people agree to it before giving them access to data on your website, we did not enter into a contractual agreement.


> Why can't I have terms on my website that say how you can use my information?

Neither Creative Commons nor copyleft (nor copyright in general!) can assert anything about private use. IP rights are commercial rights; they affect sellers of your IP. They don’t affect end-consumers of your IP.

Note that even the GPL can’t force someone to publish the source of their GPLed-library-containing program, if they never publish the program itself, but only build it for their own private use.

Why? Because, by broadcasting the code of your GPLed library, you granted people an implicit use-right to it! Not a redistribution right; not a derivative-works right; but a use right. (If this wasn’t true, then people would be breaking the law by reading “common” newspapers in a cafe, or by listening to the radio, since they never entered into any explicit contract with the distributor/broadcaster.)

How does software licensing work, then? Mostly by 1. companies installing software on computers for their employees to use being considered IP redistributors; and 2. attachment of copyright through sampling when asset samples [e.g. brushes/textures in Photoshop] are distributed through the program. Other than that, there’s really no law forcing end-users to pay for software licenses. This is why e.g. WinRAR would never have been able to sue anybody. They published their shareware binary (without gating it behind a contractual relationship, like Adobe’s Creative Cloud installer); so now you have a use-right to it!

> You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my phone number?

Copyright exists because your ability to make money from your own creative works hinges on your ability to exclusively license those works. If a publisher can get a redistribution license to your manuscript for free from a third party, why would they buy it from you?

You having exclusive access to your phone number does not make you money; others having access to your phone number does not deprive you of money you could have made by keeping that information private. Thus, there’s no advantage to introducing IP law into this domain (the domain of facts.)

There was a recent court case about someone creating a subway map by copying the raw data from existing subway maps, where the comments went deeper into this.


Right, sure, I hadn't really thought about how copyright isn't really enforceable against individuals. That's very interesting.

However, I'm really not sure how this is relevant to your moral stance on commercial use of "public" personal information.

Why do you believe it's reasonable to prevent unauthorized commercial exploitation of creative works, but not to prevent unauthorized commercial exploitation of personal information?

The former simply affects the small percentage of people who sell their works.

The latter affects the vast majority of the population who receive targeted spam, have their information collated and sold for profiling, are victims of identity fraud when those databases are inevitably leaked, etc.

For what it's worth, as I mentioned in my first comment, the GDPR absolutely gives me rights to control how my personal information is used. And the GDPR has a near total exemption for individual use.

What benefits do you see of commercial use against the wishes of the person that published it that outweigh the risks? (making money isn't a benefit)


I think you got the wrong idea if you were thinking I was saying copyright isn’t enforceable “against individuals.”

My example of copyleft was specifically about the thing the Affero GPL tries to avoid (to unknown success): the possibility of someone using GPLed libraries to set up a commercial web service. Because they never release the binary, but only have people interact with it over the Internet, there’s no derivative work being made available in the commercial domain. So copyright doesn’t apply. Even though you’re a company making money off GPLed libraries!


> If they put my blog posts in a book and try to sell that book, that’s copyright infringement

Actually even if they don’t sell it, but give it away, that’s still copyright infringement.


I have a linkedin so that I can point people at it. I also want human recruiters who have actually read the thing to send me relevant jobs. If my profile ended up affecting my credit report, I'd be pissed. I expect you would be too.

People put data places for specific purposes (to show recruiters) and want the ability to limit use to that purpose. How that's accomplished is just a technicality most people don't care about.


Not so sure about that. Messages on LinkedIn are mediated in a single place, and you have a measure of control over how your profile shows up in searches. If your CV is scraped, you could end up anywhere and now you're getting recruiter spam from all over when you're not interested.


> If your CV is scraped, you could end up anywhere and now you're getting recruiter spam from all over when you're not interested.

I would point out that this is still possible (even probable!) without any bots being involved at all. Back before CVs were online, humans working for recruitment agencies would “scrape” information from local, physical job boards by hand into their company’s databases (where “database” here could just mean a filing cabinet.)

IMHO, the real solution to that is a spam filter (or an “agent”, in the old world.) Just because a lot of people want to talk to you, and most of them aren’t very interesting, doesn’t mean they need to be prevented from accessing you—they just need to be prioritized by interesting-ness, which is something you can do yourself, or hire a service to do for you.


I think the GP in this context means, e.g. LinkedIn wants their information to be public in the cases where it benefits them as a business. But then they want it to not be public when it doesn't benefit them. There is no such thing as "public information, except ..." - information is public, or it's not. If none of LinkedIn's data was public, they would have a much harder time getting people to sign up, and having as many users signed up as possible is part of their business model.

From a copyright perspective (since that's what LinkedIn's lawyers claimed): imagine if a newspaper sued another newspaper, saying that - not just the content of its paper - the information in the newspaper was copyrighted and could not be accessed by "unauthorized" third party companies. Either you print it, or you don't!


I want to be able to use LinkedIn to network with colleagues and people in my industry. If someone wants to scrape my profile to make a report on industry trends, I’m fine with it. What I don’t want is hiQ vacuuming up my data so they can snitch to my employer if they think I’m job hunting.

How is this a paradox? Tech — the web in particular — is supposed to be an equalizing force, but HiQ is clearly trying to give my employer more power over me. We are an industry that prides itself on solving difficult problems — how is our response here to just throw up our hands and say “it’s all or nothing”?


Wow, I just looked up what hiQ does and have to say it's pretty scummy in my opinion. Why do people create stuff like this? Don't they know it will likely come back to bite them one day?

For reference:

"There is more information about your employees outside the walls of your organization than inside it. hiQ curates and leverages this public data to drive employee-positive actions.

Our machine learning-based SaaS platform provides flight risks and skill footprints of enterprise organizations, allowing HR teams to make better, more reliable people decisions."


The thing is, GDPR has theoretically solved this in the EU. The UK's ICO is about to publish guidance prohibiting scraping public user information for marketing (where the user would not expect it to be used for that).

It's a really easy solution, because companies need to prove how they got your data when asked.

When you track the source of the mailing list you're getting spam from and they say "We scraped it from LinkedIn", they get fined.


We haven't solved that, in the same way we haven't "solved" encryption by giving it a magical good-people-only door, despite spook tantrums. There fundamentally isn't a possible mechanism, and really wanting one doesn't change that.

It is a result of equality - not of outcome, but of rules. "Open for everyone except those whose applications you don't like" isn't open. On a technical level, trying to prevent it is like the "evil bit" as a solution to malware.


Of course there are possible mechanisms. There are heuristics to detect bots. The whole reason for this lawsuit is that LinkedIn blocked hiQ from scraping their website.

I'm also not necessarily talking about a technical defense against unwanted scraping. Write a law that makes it illegal to do something like "scraping personally identifiable information and storing or presenting it non-anonymized", and prosecute companies that break it. I'm sure there are loopholes in that particular example, but the point is we can absolutely add shades of gray here.

> Open for everyone but those whose applications you don't like isn't open.

Openness should be a means, not an end. If we make something "not open" but it prevents 95% of undesirable uses and only 5% of desirable ones, is that not a tradeoff worth discussing?


>Tech — the web in particular — is supposed to be an equalizing force

Don't take the Google PR so seriously, the tech industry wants to make money like every other industry.


"public" is not the right concept here I think. E.g. imagine a composer conducting a public airing of some work of music (e.g. on some festival). That you were able to hear the music in public doesn't mean the {composer,artists,...} give up their copyrights.


> public doesn't mean the {composer,artists,...} give up their copyrights.

Copyright law protects the original work of the composer and artists in your example.

User profile data on LinkedIn is not LinkedIn's original work.

Additionally user profiles are mostly made up of facts, which are not copyrightable.


I think copyright as you mention here is the right concept, or at least a lot closer. In particular, the limits on copyright. If someone is reciting a list of facts in public, they can’t expect people not to record those facts, because copyright doesn’t apply to that. Reciting the list in public using computers shouldn’t change that.


I want web scraping to be legal—but, is it really contradictory to say "I want this data to be accessible to real humans only"?

Any person can post on Hacker News. However, if someone made a bot to post to Hacker News, I think most of us would be pretty upset.


What you might mean is:

- I want my data to be publicly available

- I don't want my data to be processed/distributed/sold without my permission

E.g. individual use is fine, profit making is not.

Which is my expectation with LinkedIn. I want people to see my profile, I don't want them to sell it as marketing leads!


Agree with this, and I would also add:

- I want to be in control of my data and be able to change the setting.

- I want to be able to delete my data.

If scraping on LinkedIn is banned (and LinkedIn is enforcing it), then I do have control of my data, since I can change the setting and it will no longer be public (it's not perfect, since some might have already scraped it, but the extent would be much smaller). Also, if I decide to delete my data, LinkedIn can do that for data it controls, but not for scraped data.


Scraping information is not the same as posting. There are a number of bots that scrape Hacker News and people here generally consider them pretty cool.


Isn't that a different paradigm, though? A posting bot set loose on a forum/platform will (normally) visibly degrade service in a much more visible and impactful way than a scraping bot. And in either case, writing (and running!) a bot that posts on HN is not illegal behaviour in itself.


I dunno.

I'm on Hackernews to see interesting articles and read interesting conversation. If a bot can post interesting articles and make interesting conversation, I'm not sure I care that it's a bot. And if a human can't do those things, I'm not sure I care whether or not they're 'real'.

https://xkcd.com/810/

That focus on "we don't care if you self-promote, we don't care why you're here, we just want you to be a good citizen" is part of why I like HN.

It's not completely black and white, but in general I believe that users online have the Right to Delegate[0]. That right should only be legally taken away if there's a really, unbelievably compelling social justification for doing so. I am pretty skeptical that banning web scraping has that kind of justification.

[0]: https://anewdigitalmanifesto.com/#right-to-delegate


Isn’t this different from read access? Apart from server resources, downloading content doesn’t affect a web site as much as posting on it.


I don't really care if the comments are from bots, per se. I care if they are quality comments or not. Whether or not the comments are from bots is just a proxy for whether or not they are actually good.


Makes me wonder, how many users on HN are actually very convincing bots?


Humans cannot directly access websites - a machine is always involved. I know perfectly well what you mean but the distinction on a deeper level is fundamentally imaginary.

The best is some sort of heuristic like captchas and even they can be outsourced so that the human doing them isn't actually viewing the content.

The thing about a bot that bothers people is the behavior anyway. A human acting like a bot would get people just as upset.


Yeah, I have nothing at all against scraping per se; it's more about the huge bot traffic the commercial scrapers generate, which would ALSO be fine, except that 1) it can be hard to tell scrapers from malicious DDoS bots sometimes, and 2) the person being scraped literally pays for that scraping traffic.


So, apply DOS/DDOS mitigations? You need to have those anyway, so you might as well use them for this too.


Yeah, but where do you draw the line between 'oh sorry we dropped your request because of rate limiting' (or whatever mitigation strategy) and 'oh, we dropped your request because u scraping us bro' legally? IANAL, but this lawsuit seems to indicate that putting barriers in front of scraping attempts is a no-no.


"You issued more than $X requests per $Y seconds. We don't care whether you're scraping or not; try again later." for any values of X and Y.


There's plenty of grey here. For example, scrapers that try to check people in for flights to get better seats; some even tried to charge for that. That creates problems where some customers benefit at the expense of others, plus high load on a "locking type" piece of code, etc. Similar for ticket sales for concerts, and probably other spaces.

There are also companies that provide added value by compiling and correlating "public info" in a useful way that creates value. If Google let me scrape their search and remove ads, it would be popular, but is it "legal"? Or maybe Google Maps?


I would think, and of course could be wrong, it would be as legal as Google scraping all of the web sites that they do in order to create their search engine in the first place. In particular, Google provides cached versions of web pages. That's pretty hardcore scraping.


I put "legal" in quotes because I assume it's not criminal. It would, though, likely result in some pretty quick civil action.


Another problem with web scraping is a B2B website offering services for B2C companies to better reach consumers. It can be a tricky thing to do without basically giving your clients list to all your competitors.


Sure. My data is still my data, and if I publish it on my platform for free, that still shouldn't automatically give you the right to copy the data and provide on your platform.

It's basically the same as a TV broadcasting a film for free, and then going after you legally if you recorded that film and uploaded it to your website.


This does not legalize theft; it says sites cannot respond to suspected scrapers differently than they respond to non-suspected scrapers. You can still rate limit, as long as you do it universally for all site users.

Copyright law is unchanged. If someone scrapes your blog and then re-uses your posts on their own blog, you still have a possible copyright infringement claim.


This is a bad analogy, as scraped but copyrighted works are still protected by copyright. What's in question is whether you should have exclusive rights to information you have shared but did not copyright. It seems people also think this non-copyrighted data should be protected even if it's munged and added to an original work by the scraper.


It's not the same thing at all. The film is broadcast without re-transmission rights.


How is that different? Where on my website did I give anyone "re-transmission" or "re-publishing" rights?


It's different for broadcast films and tv https://en.wikipedia.org/wiki/Retransmission_consent


Billions are being made collecting data and content for free and publishing it along with targeted advertisements. Web scrapers can collect that aggregated data and redistribute it or create competing services. This is terrifying to certain tech giants and threatens their moats and lock-in/network advantages. They will lobby hard against it with PR campaigns citing everything from security, privacy, and copyright to all sorts of other exaggerated bogeymen. At the end of the day it's mostly about preserving their monopolies, though.


I can understand why some do not want scrapers - increased traffic (with practically zero benefits to the owners) is one obvious reason.

(Some people will then say "But why not just offer APIs", but that's a lot of extra work and maintenance).

It's like with Instagram and other social media platforms. The content creators put in the hard work, while the leeches steal content for their own benefit, giving zero credit to the original creators.


There’s a very effective way out: don’t put your data on the public web.


[decided to delete because I misunderstood the context]


Let's not conflate a) a person's personal data, and b) a business's dataset. The GP and the article are clearly referring to the latter. Preventing web scraping won't protect users from businesses collecting their data.


Instagram is not the content creator tho


What I'm trying to say is: For every popular content creator on IG, there are tens and hundreds of (more or less automatic) content curators that do nothing more than scrape content with lots of likes, and re-post on their own channels. Then when they get sufficient followers, they make money through paid product placements, account flipping, pay-to-play sharing, and what not. More often than not, there's no linking to the original pages / creators.


I think you mean contradiction, not paradox.


Not really. Both are correct, and paradox fits better in the OP's context:

paradox: a statement or proposition that seems self-contradictory or absurd but in reality expresses a possible truth.


A paradox is something that initially looks like a contradiction, but after closer inspection, is not.


Probably he or she was blacklisted on Facebook.


People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it.

By "people" do you mean businesses or actual people? Because I don't think people want everything to be public; many in fact use various networks to avoid oversharing, and even then many people don't want their old bosses or exes looking at their profiles. There just don't exist tools to limit access that granularly.


I don't think the direction would ever be clear, even if the legality were clearly established. The arms race would intensify, and detecting/blocking/deceiving scrapers might become a lucrative field.

Companies want to provide some information to some people; but providing all information to all people is analogous to allowing customers to make a meal of free food samples, on a recurring basis.


The arms race would intensify, and detecting/blocking/deceiving scrapers might become a lucrative field.

It already is. There are entire companies, like Distil Networks, who exist solely to protect companies from bots/scrapers/etc. Actually, looks like Distil got acquired and are now part of Imperva, but anyway, the idea is the same. This is definitely an existing field.

https://www.imperva.com/products/bot-management/

Disclosure: former Distil employee, but I have no financial stake in this discussion, and have mixed feelings about scraping. Clearly it can be beneficial in some situations, but when I think about having to pay exorbitant prices to scalpers for tickets to an event, because they used a bot to buy up all the tickets, that is less appealing.


> People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it

Can you really not imagine a world where a person accidentally or in poor judgment uploads something private to their own site (their real name, home address, credit card#, or any piece of highly damaging information that could cost them their careers) and wishes for it to be removed? (but can't because many of these scraping sites never respond to takedown requests)

People make mistakes and post things they shouldn't. A mistake someone made many years in the past, and has since made amends for, shouldn't haunt them for the rest of their life.

But it does when we decide that every single line of text ever uttered online must be preserved and easily accessible by anyone for all eternity.

Blocking scrapers is an arms-race escalation because these sites refuse to remove content, and it's used as a tool for character assassination by bad actors. It's a proactive defense.


It sounds like the same old issue: lack of understanding of the fundamentals of the underlying technologies.

Otherwise they would realize that what they demand is contradictory and incoherent, like demanding to be both viewable by all and not viewable. DRM is one fundamental example of this.


People love when information is free, but as soon as monetization appears they see it like taking food from their plate.


They want their data to be public for a specific use; I feel like that's pretty easy to understand. LinkedIn: they want their info to be public -> to get jobs.


The title of this post is misleading. The Court's decision related to HiQ's attempt to obtain a preliminary injunction. It's clearly an initial victory for hiQ in that the Court affirmed granting an injunction based on a significant likelihood that hiQ would ultimately prevail and that HiQ would suffer irreparable damage if the injunction was not granted. However, the Court never actually dealt with the merits of the case and, accordingly, stating that the case has precedential value is misleading. The Court itself noted:

"At this preliminary injunction stage, we do not resolve the companies’ legal dispute definitively, nor do we address all the claims and defenses they have pleaded in the district court. Instead, we focus on whether hiQ has raised serious questions on the merits of the factual and legal issues presented to us, as well as on the other requisites for preliminary relief."


German copyright has the concept of a "Datenbankwerk" (since the 90s).

E.g. the telephone book contains lots of boring facts that are each in themselves not copyrightable. However the collection in itself is copyrightable, as it required substantial effort to create.

It seems odd that US copyright law wouldn't have a similar provision, or that it doesn't apply here?


Also known as "sweat of the brow." The US Supreme Court rejected such protections in the 90s. I'm sure it's complicated, but copyrights are not presumed.

https://en.wikipedia.org/wiki/Sweat_of_the_brow#United_State...


It does, and the result is that phone books and maps get fictional entries inserted in order to prove copying -- because it is perfectly legal to do your own work to amass the same data set.

https://en.wikipedia.org/wiki/Fictitious_entry


Like a paper town, or that time Genius caught Google stealing their lyrics: they wrote a tool that watermarked songs on their site by interchanging straight and curly quotes. It was brilliant.


You might even say it was Genius.


If you scroll down to the "Legal Actions" section on Wikipedia, you'll find that these copyright traps have generally failed to serve their purpose when attempted in courts.


It's an actual carve-out. Database work is explicitly NOT protected - Feist v Rural was actually about a telephone book deemed unprotectable!


There is a concept in US copyright known as "thin" copyright. Collections of uncopyrightable information can in fact be copyrighted, but the arrangement of the information must have some spark or minimal creative energy. Mere "sweat of the brow" is insufficient to confer copyrightability on the work.

Such "thin" copyright tends to mean that there is a strong presumption against infringement. You generally need to demonstrate that the work has been copied virtually in its entirety to find infringement; partial borrowing is insufficient.


IANAL, but I think the US does have something similar. In the US, you can take works from the public domain, and then perform some sort of work, ex: restoration of a film, and then copyright that work. The difference being that someone else can use/release their own version of the same content, but they cannot use the work you did just because the content comes from the public domain.


US law has a very similar approach. See e.g. https://www.bitlaw.com/copyright/database.html

But I don't think that's relevant here, as a) this isn't a copyright case and b) HiQ are not attempting to recreate the entire "compilation" of LinkedIn


I'm pretty sure the US has the same thing, basically. You can't copyright facts, but you can copyright presentation. So someone can't copy your map directly and sell it, because there's an artistic component to it. But they can make their own map with the same data, with a different style to it, and sell that.


> Now many site owners are trying to put technical obstacles to competitors who completely copy their information that is not protected by copyright. For example, ticket prices, product lots, open user profiles, and so on. Some sites consider this information “their own”, and consider web scraping as “theft”. Legally, this is not the case, which is now officially enshrined in the US.

Does this mean we can now scrape e.g. YouTube videos, Amazon reviews, IMDB reviews, Facebook events ... ?


Yes, you can scrape them; no, you cannot republish them. Everything you listed is protected by copyright, and this ruling does not let you infringe those copyrights.

>hiQ argued that LinkedIn’s technical measures to block web scraping interfere with hiQ’s contracts with its own customers who rely on this data. In legal jargon, this is called” malicious interference with a contract”, which is prohibited by American law

Does this mean that Google's random recaptcha check is interference?


I think any ruling that says LinkedIn can't put in protective measures against automated requests is doomed to be overturned, as long as they're not applying them discriminatorily. Captchas, rate limiting, user agent testing, etc. are all common tools to protect against malicious/unintentional denial of service. The question is what LinkedIn was doing, and whether it specifically targeted hiQ while permitting other traffic of the same class.


Why would it be an issue if it is discriminatory? LinkedIn can use its servers any way it likes, unless it has promised its users that their data can be scraped indiscriminately.


Because of the court case. This is just an injunction pending an actual decision.


I'm curious how entities like https://www.omdbapi.com/ can continue their activity, get $$$ and not get shut down.


Yeah, what is the line here? Would it be against the rules to block known user agents, or to throttle traffic?


No, because what one side of a case argues is not the law. What judges decide is the law.


Probably not. Facts aren't copyrightable but creative works are.

So prices on Amazon.com are facts. User reviews are creative so probably copyrighted.

Similarly the videos on YouTube are copyrighted. However the number of views and the number of likes are probably scrapable.


See, that's where I have a problem with this. Isn't data just _data_?

Let's draw some parallels to real life. If I go to a public space like a town square, can't I take pictures, notes, and recordings, then go home and draw my analytics from them? What if I read something in a book I bought; can't I quote it?

The same should apply to web resources, even if they are creative: as long as I don't publish them, I should be able to scrape whatever public resources I want and use them in my analytics, machine learning, or whatever.


This is why I strongly prefer the Dutch term 'auteursrecht' (author's rights) as opposed to copyright. Copyright has this annoying incorrect connotation that it has anything to do with copying when it's really publishing that it should be limiting.

Downloading publicly available data should (by definition of public) not be a violation of someone's rights. However it's easy to see why it wouldn't be desirable for someone to republish creative works as their own, so it's reasonable to give the author control over how their work should be published.

And in the case of price data or similar you would be hard pressed to deem anyone the 'author' of it, hence it would be weird to enforce the author's rights.


>Copyright has this annoying incorrect connotation that it has anything to do with copying when it's really publishing that it should be limiting. //

Copyright does make _copying_ tortious. Broad personal-use exceptions in the USA, for example, make this appear not to be true, but it is the act of copying - even without publication - that is restricted in general.

Ripping a CD in UK, for example is copyright infringement without a general personal use exception (there are exceptions, under Fair Dealing, but whatever you're doing almost certainly doesn't fall into them).

See eg UK CDPA1988, Chapter II, section 16(1)(a); or USC17, Chapter 1, 106(1).


You are discussing the fair use provisions of copyright law.

Not a lawyer, but:

You can do all of that, but:

You cannot scan the book you bought and put it on your website, for sale or even for free - unless its copyright has expired or you are given permission by the copyright holder.

You cannot take a picture of someone's painting in high detail, then sell prints of it - unless its copyright has expired or you are given permission by the copyright holder.


In addition, there are some buildings and landmarks that you can't simply take photos of and then resell

https://www.rd.com/advice/travel/eiffel-tower-illegal-photos... http://www.photographers-resource.co.uk/photography/Legal/Ac...


Your examples are really about wanting greater freedom to copy, rather than about the distinction between data and creative work. Copyright is supposed to encourage people to make creative work, not encourage people to record existing facts. I think this distinction is important because a creative work isn't actually necessary to anyone else - they could create their own, different one if they wanted. But data might have only one correct value, and if that were locked away by copyright, it would limit other people's ability to do things that can't be done with different data.


As far as law is concerned, data is not just data -- bits have colour:

https://ansuz.sooke.bc.ca/entry/23


Additionally, some public areas prohibit photography of architecture because of copyright.

https://www.diyphotography.net/10-famous-landmarks-youre-all...


> Isn't data just _data_?

Think of Law around data as using dependent types. The legal protections depend on the type of the data, and the type depends on the content (among other things). You have to determine the type BEFORE you can tell what the law says about it, since the law only cares about the type. You could probably encode the law nicely with something like Idris, but any "code as law" type governance system without dependent types won't be able to express existing law.


> Isn't data just data?

No. At the risk of just repeating the comment you didn't understand, creative works are not "just data" - they are copyrightable works, and the owner has control over who can use them, not just for profit but for any reason, with few exceptions.

You don't just get to drop someone else's work product into your algorithm without their permission.


There are cases where "dropping into your algorithm" would count as fair use such as a search engine of copyrighted content.


> You don't just get to drop someone else's work product into your algorithm without their permission.

Why not?


Because copyright law exists.


I don't think using data as input to an algorithm necessarily breaks copyright law.

I can read a book and post my impression of it somewhere, right? I can read it and say "it was beautiful" on Twitter.

I can then automate my "taste meter" through machine learning: it reads a given book character by character, spits out what I'd think of it if I actually read it, then posts on Twitter, "it was beautiful".

Did I break copyright law? I don't think so.


You can't take something copyrighted by someone else and re-distribute it without their permission. However, I suspect you can capture it freely if you don't re-distribute it.


I think the fashion industry should exert their right to have their work removed from photographs.


Neat straw man but you're actually proving my point. There are scenarios under which they can't do that (fair use) but there are also many scenarios where they would be entirely within their right to do so.


> User reviews are creative so probably copyrighted.

I wonder if the number of stars is copyrighted. It's not creative, but a fact.


Probably not since each star review is a separate "work" by a separate author. Mechanically combining multiple non-copyrightable things into one doesn't make it copyrightable. If Amazon arranged their users' star reviews into an infographic that would be copyrightable.


Why would a review be a copyrightable creative work, while a LinkedIn resume wouldn't be?


I think perhaps the layout, cover letter, and maybe any flourishing notes are copyrightable, but the actual details of work experience and education are not.


Yeah, I would think the "description" section for each job would be copyrightable, but the simple "title", "company", "year" fields would not be.


There's some huge datasets of Amazon reviews available. Stanford has a big scrape out there, plus there's one from Amazon themselves in the AWS datasets.


Youtube videos are definitely protected by copyright, though.


In theory, right?

See the South Park WWITB issue.

I believe South Park used a video clip from YouTube, and YouTube's Content ID system removed the original video South Park had used, because YouTube considered it a violation of South Park's copyright.


Just because YouTube gets it wrong doesn't mean it's just theory. YouTube is not the only site that has automated content scanning for copyright violations. Getty and other photo sites have gotten this wrong in the same way by sending C&D letters for violations to the actual copyright holders.


I was specifically discussing Youtube.


Shouldn't the copyright belong to the creator, not to YouTube? Basically, YouTube shouldn't be able to sue you; it should be up to the creator to do so.

