Congrats! Web scraping is legal! (US precedent) (parsers.me)
1057 points by ehurynovich 24 days ago | 388 comments



Linkedin is taking this to the Supreme Court: https://www.law360.com/articles/1237505/linkedin-will-go-to-...

No ultimate decision was ever made, and no, this doesn't make web scraping 100% legal. Wake me up when there's a new announcement, because anyone interested in this already knows this old news.


This is a really big deal. Currently (IMHO) the US Supreme Court is a wholly-owned subsidiary of multinational corporations due to the shenanigans that happened with Obama, McConnell and Garland, so will likely side with LinkedIn since it's the larger corporation:

https://www.npr.org/2018/06/29/624467256/what-happened-with-...

I feel like siding with LinkedIn here would open up the web to extortion though, like troll companies that would send cease and desist letters to all scrapers (even search engines). I think it could be argued that letting one company scrape when another is denied is discrimination.

Then again, I don't know how conservative and republican-leaning courts decide corporate law. Maybe in this case since so much money is at stake, they might worry that banning scraping would infringe on something like free speech and ruffle the feathers of some of the wealthier contributors in their base. Especially on the media side since I imagine they use bots in one form or another to find newsworthy stories.

IANAL (obviously!), I just find it entertaining/dismaying to ponder these things in these times.


The thing is, this is still all about a preliminary injunction. Even if the injunction is found to be without merit, that still doesn't provide a final answer as to whether LinkedIn can successfully sue HiQ under the CFAA to force HiQ to stop scraping.


Bad look for LinkedIn: if you want the information to be visible publicly, don’t fault others for learning that information.

Gate the information behind a login, then sue the scraper for violating the TOS rather than for the scraping itself; that I can understand.


Doesn't this ruling say the login gate doesn't matter?


The link you gave says (sic):

> Sign up now for free access to this content > [...] > Email (NOTE: Free email domains not supported)

I think this sort of antinomy[1] sounds especially ironic in this thread. It's practically a legal dependency-injection pattern: you call something free, then administer antinomies, catch-22s, etc.

[1] https://en.wikipedia.org/wiki/Antinomy


5-4 overturn?


"HiQ only takes information from public LinkedIn profiles. By definition, any member of the public has the right to access this information. Most importantly, the appeals court also upheld a lower court ruling that prohibits LinkedIn from interfering with hiQ’s web scraping of its site."

Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?

"In this case, hiQ argued that LinkedIn’s technical measures to block web scraping interfere with hiQ’s contracts with its own customers who rely on this data. In legal jargon, this is called” malicious interference with a contract”, which is prohibited by American law."

This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?

Seems to imply that every business is somehow beholden to every contract signed by anyone.


> Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?

ToS are subservient to the law; you can (probably) terminate the service account of a user who breaks your ToS, but if the user does not have a service account (as seems to be the case for HiQ, which doesn't appear to have used accounts for this), then your ToS does not apply, since you've technically not entered a binding legal contract with them.

> This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?

IANAL, but I believe that'd fall on intent, and intent is often difficult to prove at a personal level, but not necessarily at a company level. If your intent for putting up barriers that happen to impact scraping, whatever they may be, was indeed to knowingly prevent scraping from a particular company, then you may be liable under this decision. This is the only part of the decision I'm torn on, since it's a bit messy to really prove such things. I'd be much more comfortable with allowing companies to take whatever measures they feel necessary to prevent scraping, and also allowing scrapers to legally circumvent those measures without threat of prosecution, assuming they didn't actually hack into anything.


> but if the user does not have a service account (as is the case for HiQ, it doesn't seem they were using accounts for it), then your ToS does not apply, since you've technically not entered a binding legal contract with them.

Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.

I have interpreted the LinkedIn ruling to mean that scraping public data is no longer criminal activity but it still leaves you open to civil lawsuits for violating the ToS of the website you are scraping.


> Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.

How would that even work? If I browse to any random public page of your website, it's served to me before you've even transmitted the terms of service. How could I be bound by those terms of service when I haven't even seen them?


As an engineer, I agree with what you are saying, but I think normal people and the courts disagree.

I think these sorts of contracts are called Adhesion Contracts (https://www.investopedia.com/terms/a/adhesion-contract.asp) and we interact with them all the time. For example, if you valet your car, the valet will hand you a piece of paper with a number printed on it to retrieve your car. On that paper you will find an adhesion contract that is valid and real (although not as powerful as the types of contracts that you sign)


This does not work, at least for software licensing, based on precedents around shrink-wrap contracts, so it likewise would not work for licensing the use of data.

A paper handed to you by the valet is not an immediate contract, since you can decline to agree to it and the service simply does not happen.

You cannot do that with a publicly visible website, unless you show the ToS and require agreement before first use. If you allow a non-transferable license, then said data cannot be used by a search engine. If it's transferable, you have just pushed the problem toward scraping by a different bot. (Well, you could have a direct agreement with a few major search engines.)

Caveat emptor: not a lawyer.


IANAL, but it seems like ToS could still govern your use of the data which you viewed. Sure, it seems like you couldn't claim any violation based on visiting a random page. But if the ToS is clearly identified on the page and you do something with the data that violates them, perhaps the owner of the site has a case.


> perhaps the owner of the site has a case.

Except it sounds like the owner doesn't. If the information on the page is made public, the owner of the page can't place terms on what is done with the data downstream. They'd have to implement some real binding system such as authentication, where the CFAA would apply. (IANAL)


Correct, but all of that is void if the data presented is any sort of protected information (copyright, IP, etc.). You can't, for example, scrape Yahoo Finance for pricing and dividend history and republish on your own stock tools website. They have a license to redistribute that data and publish on their own website. Similar story for copyrighted text and things of that nature.


That would require at least showing that ToS on first use. A link on a page is insufficient.

And said ToS would have to force copyright reassignment rather than a general licence, making LinkedIn culpable for any unlawful content published by users of its site.


I am a lawyer, and there isn't really an easy answer to these questions.

TOS are a lot like EULAs. If they look like contracts of adhesion, then they're going to get more scrutiny and skepticism. A TOS that you claim applies even to every single random visitor to your site, where they do not in fact affirmatively agree to the terms, is potentially going to look more like a contract of adhesion. That's a lot harder to enforce.

If they are used more for CYA so that you can ban undesirable accounts from your website which people explicitly agreed to when they signed up for it, or so that you can just up and alter your entire business model without having to give all of your customers refunds, then they're easier to defend.

Just my general opinion, of course. Every jurisdiction is different.


Also not a lawyer, but you cannot force me to accept your terms of service. Contract law requires both parties agree to enter it.

When you create an account, etc., you are agreeing to those terms. If I browse a public webpage that just has a terms of service link on the bottom of it, I've not agreed to anything.


> Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.

Typically you'll see TOS say something along the lines of "by continuing to access this site you agree..." or "if you do not agree with these terms you may not access this site..."

Whether that's enough to create a binding contract depends on the jurisdiction and who you ask.


It can also depend on the terms themselves. I can put "by using this site you agree to bake me a chocolate cake" on my website all day, but that doesn't mean I will be able to force you to bake me a chocolate cake.


Terms of Service is a form of contractual agreement, which requires there be an offer and subsequent agreement by the parties.

I don't think criminal law was ever part of this.


From the article, the LinkedIn decision was that scraping data does not violate the Computer Fraud and Abuse Act. Violating that act was considered to be criminal activity. (https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act)


But the claim of a violation was only a claim as part of a civil trial. The law has both civil and criminal elements to it, and this is about the tort part of the law.

LinkedIn made threats accusing hiq of criminal behavior, but that doesn't mean there's any criminal precedent being set here, as far as I can tell. And no one was criminally charged.

Separately, part of the ruling states that for the purposes of authorization, defying a cease and desist letter does not constitute illegal access, which might have some criminal implications. They imply some sort of technical authorization system must be bypassed, which didn't happen, since the data is "public."

(Which doesn't square well, imho, with existing meatspace law. If a public serving business banned someone from their store, the door being unlocked isn't an excuse to ignore that ban and trespass. But I digress.)

With the overlapping areas of law, it's admittedly beyond my understanding. But the law is generally viewed, like dmca, as being overreaching, if not at least partly unconstitutional.


The CFAA is overreaching, and used often as a catch all. 'Reply All' has a good episode which explores this. This is actually what was used against Aaron Swartz when he was charged for downloading academic journals from MIT, and why his charges were unjustly severe.

Reply All - #43 The Law That Sticks https://gimletmedia.com/shows/reply-all/rnhoxb


It doesn't completely answer your question, but what Nathan is pointing out is that private contracts cannot negate common law.


There's a long, long history (probably hundreds, if not thousands, of years old) of selling aggregated or processed publicly-available information.

I'm not particularly thrilled with it, but enough people think of it as a service valuable enough to pay for, even if they know they could get it themselves for free.

LinkedIn users (as opposed to the company) might actually like what HiQ is doing, as it may help their own prospects.


> but enough people think of it as a service valuable enough to pay for, even if they know they could get it themselves for free.

It's not free; it takes time to collect the data. Buying it makes a lot of sense as long as you pay less than what your own time is worth to you...


It is true in the current situation, though I would prefer that we ensure free data must be free. In that case buyers of data would be incentivized to pressure providers of free data to improve the data quality.


The data does remain free, as long as LinkedIn still provides it for free.

The data without the noise is what you're paying for. The service of winnowing out what you care about from what you don't care about.

Considering how big of an effort it is, and that the source from which it came is still available, why should the cleaned data be free? If I collect fallen trees from public land and chop them into usable firewood, should my bundles of firewood also be free? Or if I collect solar power with my own solar cells, should I have to give you the electricity for free?


I think this is especially relevant when it comes to things that fall under disclosure & transparency requirements - a lot of information that is legally required to be made available isn't legally required to be convenient. So, as a patient, you may have the absolute right[1] to a free copy of the charge master[2] of a hospital you're admitted to, but it could be required that you pick it up in person or that it is only supplied in microfiche form... so a company that's aggregated this and is reselling it can deliver real value.

1. This specific example is BS but plausible - I just wanted something more specific than the vagaries around things like FOIAs or shareholder reports which both have specific facts that can be rendered useless unless you have the context.

2. Basically, a list of how much procedures cost.


Absolutely spot-on.

I'm thinking of processed GIS data. If you have ever tried using the various formats that are supplied by government sites, you know what a huge pain it is.

I'm happy to pay a reasonable price for an interpreted and bowdlerized version.


I actually have! I had to import a huge file of all of the culverts around storm drains in a state, and each culvert was multiple pieces of geometry, none of them grouped together in any logical way. It was just a huge list of rectangles that looked like culverts when viewed visually but no way to identify them as being one culvert without heuristics on how close each rectangle was to others. Massively long process that should not have been so.


What do you mean free data must be free?

The data is free, but the aggregated formatted data has been worked on and processed, are you saying the resulting aggregated data should also be free? That isn't going to happen, why would anyone do that work for free?

Or are you afraid LinkedIn and others will make everything private? That's completely up to LinkedIn or individual LinkedIn users, what they want to make private vs public. Maybe more data would be made private if they don't want it scraped. I don't think that's inherently a good or bad thing.


I'm trying to puzzle out how this works in practice. So if LinkedIn has truly public data (no login required to view) then it can be scraped no problem.

But if it's only accessible with a login, then it falls under TOS and they can be blocked?


> Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?

This is just a preliminary injunction. This wasn't an actual ruling on the case. This just says that until there is a ruling they can't stop the scraping to make sure the company isn't put under while waiting for an actual ruling.


You don’t understand what a preliminary injunction is then.

It’s a very, very strong indication that they will win. Courts don’t issue preliminary injunctions unless it’s extremely likely the side who won the preliminary injunction will win.


Huh, I thought in the USA they also issued them to avoid the eventual judgement being rendered irrelevant. So, where the case is not clear cut, the injunction could prevent one party acting to 'kill' the other (and so avoid judgement) in the meantime?

Could you cite something on this that indicates this (my understanding here) is wrong?


It only requires a “substantial” likelihood that side will win (not an “extreme” one), which basically means there’s a substantive dispute. The more difficult criterion is a substantial likelihood that irreparable harm will occur if the injunction isn’t granted (irreparable harm is supposed to be a pretty extreme thing — it means you can’t fix it with any amount of money).


Please read the linked decision before putting words in judges' mouths:

https://parsers.me/appeal-from-the-united-states-district-co...


The issuance of an injunction is in no way related to how the future court battle will result.


> This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?

LinkedIn has long wanted to have their cake and eat it too - they advertised that data as being publicly accessible and allow google to index specific user pages but then attempt to restrict other bots from crawling it.

If you have private data behind a login there isn't an issue here - if you have public data but want some people to login before viewing it (or not be able to view it) then that's where this ruling comes up. So, this mostly hits sneaky SEO folks and dark UX patterns that rely on tempting someone with accessible data and then pulling the rug out from them at the last minute.

If your website places data outside of authentication then everyone should be able to see that data... I'm curious to see the specifics around

> Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?

though - DoS attacks are clearly illegal, but with this precedent there's going to be a lot of back and forth to see where the line between DoS and scraping falls... and I think that makes this precedent a lot weaker than the headline would have you believe. A company can still threaten to drag you through a lot of litigation by accusing you of malicious page requests, it'll take a few cases to define where that line needs to fall.


This reminds me about Twitter, when I click to see a thread for a tweet it asks me to login, but if I open the link in a new tab it loads the thread just fine.


LinkedIn wants their data to be scraped by bots, so they have to keep it public; otherwise you wouldn't find people's profiles from Google. They just don't want bots from their competitors like hiQ to scrape it.


To me, this is crucial. If it's public and available for google, it's public and available for everyone. If you want content to be private, then make it private and accept that you won't get search engine traffic. Otherwise, don't be surprised when your publicly accessible content is accessed by gasp the public.


> hiQ argued

That does not mean that the court agreed.

The judges said that CFAA doesn't apply.

In other words, the judges said that LinkedIn couldn't use the US legal system to force HiQ to stop. Judges didn't say that LinkedIn was barred from using technical measures.

The court did allow a preliminary injunction against LinkedIn, due to the possibility of "monopolies" (to be determined in Court later), pending resolution of that latter question.

LinkedIn might still win their claim to their right to block scrapers via technical means.


Something about malicious interference with a contract? That might prevent technical measures?


They want to imply that, but they are wrong.

LinkedIn can't prevent HiQ from attempting to scrape their site through force of law.

LinkedIn can rate limit requests, make their site hard to scrape, change their format, whatever. LinkedIn is in no way responsible for how HiQ fulfills its contract to its customers. HiQ is attempting to say that if I sign a contract to provide you with a Tesla, then it would be illegal for Tesla to stop me from just taking one from them to give to you. If that sounds stupid, that's because it is.


> Seems to imply that every business is somehow beholden to every contract signed by anyone.

The implied contract is that if you publish something, it's public and you have no right to dictate what software people use to consume it.


You also have no obligation to them.


The court document says "... refrain from putting in place any legal or technical measures with the effect of blocking hiQ's access to public profiles." on page 11. I wonder if they mean targeted measures specifically blocking hiQ but allowing others such as Google.


https://www.eff.org/cases/hiq-v-linkedin

LinkedIn aint the victim here...


This is the part I disagree with:

> hiQ also asked the court to prohibit LinkedIn from blocking its access to public profiles while the court considered the merits of its request. hiQ won a preliminary injunction against LinkedIn in district court, and LinkedIn appealed.

Whether LinkedIn is the good guy or bad guy here doesn't matter when the decision creates a precedent for the rest of us.

Surely a healthier precedent is that we can respond arbitrarily to requests and have no obligation to the requester. So what if I want to randomize the html structure on every request or block requests from Tor because 100% of them are abuse? Can someone take me to court on the grounds that either is effectively "blocking" their scraping syndicate? Why not?

I feel like once CFAA is off the table (which I do agree with), the cat and mouse game is a fair middle ground. Keep web scraping a sport!


Here is my attempt to draw an analogy.

There is a large banner next to the highway that shows some weather information which, if properly organized (let's say into a monthly almanac), people would pay money for. The banner owner does not make money this way - he asks you to go to his website and sign up for an account. But you drive the highway (internet) every day, look at the banner, write down the weather updates, and then offer them for sale on your website. The owner gets angry and sues you. The court decides you are free to drive down the highway and free to put your eyeballs on the weather banner, especially given that the banner is available to everyone (LinkedIn profiles are available to view without needing an account), and you are free to use the information you obtained for free - without interfering with said banner - in the form of a monthly almanac that you sell. At the end of the day, the banner owner does not own the weather information that someone else put up there (for example, a meteorologist).

Personally I think it's a healthy decision. Otherwise it would be similar to being prejudiced about who should be allowed to enter and browse a street store that, by law, is open to everyone.


https://en.wikipedia.org/wiki/Tortious_interference

This would mostly mean that you cannot start interfering with webscraping you previously allowed merely because you learned that they're making money with the scraped data.


It seems absurd if the 'interference' only directly affects their own property. Like, if my neighbors start monetizing livestreaming my backyard, suddenly I can't put up a fence? Except worse because in actuality, this third-party contract is costing them money through server load and bandwidth.


Your analogy doesn't hold. Your backyard is private property. The data that LinkedIn publishes is intended for the public. That's why Google can index the pages and give you results from LinkedIn.


> Your analogy doesn't hold.

It does, in the US. You're likely making an inconsistent comparison.

Property ownership has nothing to do with visual access. You cannot legally be barred from casually (involuntarily) perceiving something. It's reasonable to put up physical barriers to reduce what is casually perceived. It's a very good analogy.


However it doesn't hold - as your neighbor I can't bar you from putting up a fence because it'll intrude on my view of your property... granted people try to do that _all the time_ but I think it's commonly understood that putting up a fence for privacy is allowed.

It's also not a great analogy for this case because another party is given continued easy access to view my backyard while the first party is denied - and the analogy breaks down here because, as a neighbor, I have no inherent right to view your private life at least as much as any of your other neighbors.


> it's commonly understood that putting up a fence for privacy is allowed.

Try building that fence into the stratosphere. A regulatory body will prevent that.

> I have no inherent right to view your private life at least as much as any of your other neighbors.

That's a different analogy, not a violation of the first.

It's not necessary for every part of the analogy to hold, being an analogy.


It's trivial to fix that - the exterior of GP's house then. That's available for public viewing; is intended for it, but is private property. If you monetise livestreaming it and describe it in your ToS, GP can't repaint the front door, or get new windows?

Or perhaps slightly less contrived:

If I publish a monthly lowlights reel of my favourite sports team as a podcast discussion on where they can improve in all their lost games, and then they suddenly go on a winning streak for >1month so my USP is gone and I have nothing to talk about..?


Those examples don't fit because they are contracts not made in good faith. They aren't things you can control.

In this case, it was ruled that the public data is available. It was a good-faith contract on the part of HiQ to assume they could collect public data from a public website.

It would not be a good faith contract to assume you could control the paint colors on a property you don't own.

It seems to me that the interference ruling was wholly independent of deciding whether what hiQ was doing is legal.


Does that mean that if a grocery store offers free samples, I can go in every day and take all the samples, and the grocery store is not allowed to selectively deny me access?


It means that if they're offering free samples and refuse to offer you the same service they're offering to other customers they might be in hot water - which is consistent with what a lot of folks consider ethical. Offering an item for free to some folks and not to others is a form of discrimination - it's usually not a particularly troubling form of discrimination but in this case Google is allowed to walk up and take all the samples and the grocery store manager just smiles and nods - but when you (hiQ in this example) try and get one you're hit with an injunction and barred from entry.


Can someone taking down an open source project, like the leftpad debacle, be sued for tort?


I mean, anyone can be sued for anything. I can file a lawsuit with basically zero legitimacy to it. It'll probably get thrown out, but you were still sued.

If the question is could someone win, potentially. The argument would basically have to be that the removal of that open source project is akin to other cases of negligent interference.

If this is a specific concern, consult a lawyer - 'cause I'm not one.


Doubtful. If LinkedIn had completely taken down their site, or put it all behind a login, then this case would not have turned out the same.

Maybe if leftpad somehow tried to block only some users from using their publicly available plugin.


User data is not their property.


Your backyard can be a walled garden--this is about the public front of the property.


Exactly; your backyard is of course yours. But you are not at liberty to use it to damage others. There are lots of rules about this. For example, opening a brothel on your own land is definitely not legal, without even considering how it affects the neighborhood.


When is it "damage", and when is it "declining to contribute" ?


If I decide to change the class names, or HTML structure of the page, is that no longer allowed?

How far does this go?


Are you doing it just to spite scrapers, i.e. with "malicious intent"? If you have some other reason, you won't be guilty of intentional tortious interference.


> "Most importantly, the appeals court also upheld a lower court ruling that prohibits LinkedIn from interfering with hiQ’s web scraping of its site."

How would this affect Cloudflare's "checking your browser" anti-DDoS protection screen, meant to block bot requests from accessing sites?


So if I (like many others) have cloudflare web scraping protection turned on, that is now against American law?


There's no reason to believe that.


It will be interesting to see how this will impact the bot detection market like perimeterX, etc.


> If LinkedIn wanted to force users to sign in to view profile info

Do they not already do this? Every link I've ever seen for LinkedIn has redirected me to sign up page rather than showing me the content.


They want search engines to index their profiles and provide organic search results links to their site, but then those same sites will require you to sign in when clicking a link to another public profile. You can search for that 2nd profile in Google and then view it without signing in, but not by clicking internal links. I've experienced this with Quora, LinkedIn, Instagram, FB and others. They want to have their cake and eat it too.


As a user of LinkedIn, I can pick which portions of my profile information I would like to be publicly available. This is not by default, so most people do not have it public. You can try seeing my profile without logging in. :-)


They block you out after the first few profiles you view. Try a private browser and you can still see them.


Your second point is interesting. I suspect the contract between hiQ and some company is that hiQ provides info on public profiles, and if LinkedIn removes all public profiles by requiring a login the contract would become moot. Just the same if I was to change my profile settings from public to private, hiQ wouldn't be in breach of their contract (nor would I).


Scraping should either be legal or not. The fact that you have a contract to sell content you assumed it was legal to scrape should not matter. Too bad if you lose money.


>Are those [ToS] legally void now?

They were pretty much legally void even before this precedent was established. They are only valid when they don't violate any existing U.S. law. Any authority assumed beyond that is completely false.

>can the court force me to revert that change?

No.


I wonder if it has anything to do with the fact that the data is actually owned by LinkedIn users, and they expressed that they want their data to be publicly available?


Unlikely. The license to LinkedIn retains ownership, but the user's retention of information ownership doesn't compel LinkedIn to affirmatively do things with that data (i.e. LinkedIn isn't forced to vend the data to a given consumer if the user says so).

The license further goes on to clarify that LinkedIn will vend public data to search engines, but the definition of "search engine" is almost certainly assumed (by LinkedIn, at least) to be up to them.


It's because previously public information became private. They can't do that apparently.


I'm wondering if robots.txt might then get you sued for blocking scrapers / bots?


A robots.txt file doesn’t block scrapers, it’s the equivalent of a no trespassing sign. I don’t think putting up signs will suddenly become illegal?


Robots.txt doesn't interfere with anything. It's a suggestion.
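
To make the "suggestion" part concrete, here's a minimal sketch (Python standard library, with example.com as a placeholder): the only "enforcement" of robots.txt is the scraper voluntarily checking it before fetching.

    # Minimal sketch: robots.txt is advisory. A well-behaved bot checks it
    # voluntarily; nothing here technically blocks a request that ignores it.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()  # fetch and parse the robots.txt file

    # The scraper decides whether to honor the answer; the site can't enforce it.
    if rp.can_fetch("MyScraperBot", "https://example.com/some/public/page"):
        print("robots.txt permits this fetch")
    else:
        print("robots.txt asks us not to fetch this path")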


The toxicity towards web-scraping is really what makes me lose hope in the current web. People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it - it's a complete and utter paradox.

This precedent doesn't really mean much, but it's definitely a step in the right direction.


The issue here for some, if not many, is a matter of scale. It is one thing if an end-user, whom I am trying to service, comes to my site and gets my publicly available data. Maybe I monetize with ads, maybe not. It doesn't matter, that is the audience I am trying to service, regardless of size.

But when you scrape it my load goes up dramatically. A load I have to pay for.

It is analogous to the privacy debates going on, with one side saying "hey, don't track everywhere I go and tag me with facial recognition" and the other side saying "hey, you are in public and people can see you." The issue is not complete privacy, but one of scale. And of intent.

I believe society is soon going to have to come to grips with the scale of things and legislate what are acceptable scales of action, as it seems to be becoming a large issue in a growing number of areas.


So you throttle your users. We have HTTP status codes for "too many requests" (429), and all scraper software comes with a delay setting by default. Everybody who does scraping is supposed to know that it's rude to blast a thousand requests per second.
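
As a rough illustration of that etiquette (assuming the third-party requests library and placeholder URLs, not anyone's actual endpoints), a polite scraper spaces out its requests and backs off when it sees a 429:

    # Rough sketch of a "polite" scraper: space out requests and back off
    # when the server answers 429 Too Many Requests.
    # Assumes the third-party `requests` library; URLs are placeholders.
    import time
    import requests

    URLS = [f"https://example.com/profiles/{i}" for i in range(100)]
    DELAY_SECONDS = 2  # default-style delay between requests

    for url in URLS:
        resp = requests.get(url, headers={"User-Agent": "MyScraperBot/1.0"})
        if resp.status_code == 429:
            # Honor Retry-After if it's given in seconds; otherwise wait a minute.
            retry_after = resp.headers.get("Retry-After", "60")
            time.sleep(int(retry_after) if retry_after.isdigit() else 60)
            resp = requests.get(url, headers={"User-Agent": "MyScraperBot/1.0"})
        # ... parse resp.text here ...
        time.sleep(DELAY_SECONDS)  # don't blast a thousand requests per second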


This ruling has left open a big question of how much you need to spend to support scrapers and where the line between scraping and a DoS attack lies - and that's going to be a weird line. If my site is producing a big report off of data that changes quarterly then re-downloading that report every 20 minutes is possibly excessive and might wander into the realm of an attack - while if we looked at the same frequency with twitter it seems a lot more reasonable - maybe even a bit on the slow side.


Provide an API for public data to reduce the costs associated with rendering a full blown page, and deliver just the information needed.
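
For what it's worth, the kind of endpoint being suggested could look something like this hypothetical Flask sketch (made-up route and field names, nothing to do with any site's real API) - serving a small JSON payload instead of rendering a full page per request:

    # Hypothetical sketch (Flask), not any site's actual API: serve just the
    # public fields as a small JSON payload instead of a full HTML page.
    from flask import Flask, abort, jsonify

    app = Flask(__name__)

    # Placeholder in-memory data; a real site would query its datastore.
    PUBLIC_PROFILES = {
        "jdoe": {"name": "Jane Doe", "headline": "Data Engineer", "location": "Austin, TX"},
    }

    @app.route("/api/public-profiles/<username>")
    def public_profile(username):
        profile = PUBLIC_PROFILES.get(username)
        if profile is None:
            abort(404)
        return jsonify(profile)  # no page rendering, ads, or extra assets

    if __name__ == "__main__":
        app.run()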


Entirely feasible. Also reasonable for you to pay me for the service as it is taking my development efforts to meet your business model. The advantage to you is you have a defined interface that I won't prevent.


I guess you missed the comment I was replying to: it may cost you more money, in bandwidth and per page resources, to not provide an API than it does for you to provide one.

So no, I won’t pay you for the privilege of you saving money.


No, I'll happily just scrape your site instead. But if you'd rather not have that happen, provide an API.


Who pays for that API and the bandwidth? What’s in it for the data provider? On LinkedIn, viewing the data now shows ads or at least prompts the viewer to join the network. With scraping and free API access, how exactly does LinkedIn benefit from their work of hosting the data?


My guess is hiQ (and others) would happily pay for an API over the data they're scraping right now.


Unless the costs exceed their current operational costs. Don't forget the time spent redeveloping on the new API, which includes validating everything is there, testing and cleaning up and removing the old (working) code.


Why buy the cow when you get the milk for free?


This isn't a great analogy here - getting the data delivered via API is simply more useful than having to re-assemble that data out of fragments parsed off of different web calls.

Could I suggest:

"Why buy the cheese when you get the milk for free?"


Look, if you build a product that relies on providing free information to the public, then you don't get to select a segment of that public and charge them for it. You can't hang a billboard on a highway, but then get upset when some people look at it the wrong way.

Now, if you want to have a walled garden and charge for entry to some and let others in free then that is fine.


The contention was not about load, it was about using the data.


> People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it - it's a complete and utter paradox.

That's a complete misconception. Of course you can manufacture inconsistent ideologies if you combine ideas from different people, but I think you'd have a difficult time finding one person who believes what you just described.

What I want is, put simply, organizational transparency, personal privacy. I believe humans have a right to privacy, but I don't believe organizations have rights, period, and I believe radical transparency within an organization prevents organizations from trampling the rights of individuals.

Organizations in this case include corporations, governments, and nonprofits.


I'm interested in hearing your take on "organizational transparency". Like please push the concept / idea to its 'full' realization and tell me that picture, even if it implies a little bit of "sci-fi"¹.

Digging this because I think that domain / paradigm will see unparalleled evolution in the next few decades.

[1]: I mean, don't stop at current law / values / behaviors; like people from the 1940s wouldn't have dared speak about their idea of the 1970s because they'd think their belief "impossible". No flying cars though (Clarke-tech), because that's not a decision of the individual.


I don't think that looking too far ahead is useful: this is just a matter of pragmatics. Revolutionary change in a peaceful society happens via a long sequence of small, incremental changes, and that's a good thing, because you get to see how each of the changes plays out. I think the best sci-fi persuades you that it's looking at the distant future when in fact it's only using the future as a foil to provide deep insight into the present.

The short-term, the small, incremental changes I'd like to see are:

1. Reversal of the default privacy setting of government docs. Instead of documents being default-private and citizens having to make FOIA requests to make those documents public, documents should be default-public, and government workers should have to apply through an adversarial system (similar to courts) to classify documents, proving to a court why the document needs to be classified.

2. Classified documents should have a short (1 year max) timeframe after which they are declassified, or government workers should have to reapply to justify why the documents need to remain classified.

3. Political party documents should be public, without any provision for classifying them.

4. Tax-exempt organization documents should be public, without any provision for classifying them.

5. IPO'ed organization documents should be public, without any provision for classifying them.

6. Body cams on all police and military while on duty (when they are acting on behalf of an organization). 1 and 2 would apply to the footage from these cams as well.

7. Exceptions to 1-6 should be made for the personally-identifiable information of people who are not in the organization.

8. Organizations should be required to maintain a list of all the personally-identifiable information they have on a person (including employees), and provide that data to that person on demand by that person or their legal guardian, as well as a list of all people with whom that data has been shared, and be required to delete that information upon request by that person or their legal guardian.

9. Research which receives public funding should be forced to publish its results publicly.

10. All software which receives public funding should be forced to publish its source publicly.

11. Government documents should be published in open formats suitable for computer analysis (e.g. CSV, text, or some XML format - no PDFs).


Can't agree more with your first paragraph.

1, 2, 6, 11 should be no-brainers if people were educated IMHO — but this is 1920 relative to electricity or cars; still a long way to go before the mainstream masses get it (which very much includes political figures). I would think 2030-2040 for the emergence of ethical consensus and concern (the kind that pervades political parties and social classes).

That is assuming the needle doesn't move too much farther in the authoritarian direction until then (the 20-year trend is really not looking that way currently).

3, 4, 5, 9, 10 are/would be met by strong opposition from interest groups, I'm sure you see that too. Everything I know about 3 tells me it's never going to happen with current parties / politicians. It's at least 1 generation away and I'm not sure the concept itself isn't utopia. 9 and 10 as well, I think it largely depends on the cultural paradigm (and this world's in 2020 is really not aligned with that, nor does it trend or even look that way). 4, 5 likewise, complex topics, lots and lots and lots of gatekeepers and lobbyists.

My take on these is they're very costly in terms of political capital; and they are largely debatable (politically, legally, philosophically, etc., you'll find passionate captains on both sides); thus there are 'better' (more consensual, with direct net positive effect) lower hanging fruits imho.

7 and 8 are hard problems, notably because of scale and the need for automation — it's part of a much bigger domain, automation of compliance and building "trustable" systems etc.; the kind that bridge or plane engineers must build, and probably software engineers too, but you know we're far from that if you read this forum.

I'd say 1 2 6 11 and 7 8 on the way to scale/automation already paint a whole different regime and degree of maturity for a 21st century State. I'd like to think we're now ~1 generation away from enactment of such norms.


What if the organization is one person in an LLC? Do they get rights? If so then a big company can hire a bunch of little LLCs to act as rights-having proxies for any task that requires them.


I'm going to assume you're asking in good faith and try to address the confusion here.

The human does get rights, the organization doesn't.

In some cases, believing that humans have rights and believing that organizations have rights might lead one to the same action. In those cases, I'd take the action. I wouldn't want to violate a human's rights out of some vindictive dislike of organizations: that's not the point. The point is that I'd take that action because I believe in human rights, not because I believe that the organization has rights.

And with organizational transparency: the entire point of organizational transparency is to protect human rights. In cases where organizational transparency would trample human rights to privacy, I would go with human rights every time. Violating the privacy of humans to achieve organizational transparency would defeat the entire purpose.


Let's say that individual humans have the right to keep secrets. Let's also say that they have the right to keep secrets with their associates, and to tell them to whom they please. Now, doesn't that make it legal for a group of people to keep secrets about you? What about selling them? I just don't see what doing away with the legal fiction of corporate personhood would do about Facebook.


It may not be your intent, but you're using some very vague, inapplicable terminology to make some screwed up behavior sound normal.

If you can tell secrets to who you please and sell them on the internet, they aren't secrets. Somewhere in the middle of what you're saying, the secrets stopped being secrets, but you kept using the word as if it still applied.

Facebook isn't a group of associates trading anecdotes about their friends: the server guy has never met Mark Zuckerberg, and they are not "associates" in any meaningful way. They're not friends, or even really allies: Facebook certainly has shown inconsistent concern for the well-being of its workers. So let's also drop the "associates" terminology: these aren't "associates", they're employers and employees. Employees aren't acting as individual humans on their own behalf, they're acting on behalf of an organization.

Putting aside the rights conversation for a second, let me ask you a question: if you tell your friend a secret in confidence, and they turn around and sell it to anyone on the internet who will pay a low fee, that would be pretty screwed up, no? We don't even have to talk about rights here: this is just screwed up behavior, regardless of the rights conversation.


"Now, doesn't that make it legal for a group of people to keep secrets about you?"

A group of people sure, but corporations are not people.


Corporations are groups of people.


Corporations are a power of attorney document. They are golems that sometimes act on behalf of people. They are not people.


Are contracts allowed in your worldview? Contracts must be signed by individuals, but when they act as representatives of a company, they are legally binding for that company. If they only have individual rights, then all individuals who didn't physically sign a contract cannot be held to it.


Correct.

At a small scale, it makes sense for people to appoint another person whom they trust to represent them in negotiation. But in a lot of cases, that's not how representatives are chosen. Particularly in corporations, the leadership of a corporation was not chosen by the employees to represent them, and in fact often doesn't even have the best interest of the employees in mind. A lot of the largest problems in our society arise from this fact.

Consider the case of a company that agrees to sell to a larger corporation, under the condition that they lay off half their workforce in advance of the sale. Surely we can agree that the laid-off workers were not fairly represented by the person signing the contract.

One could argue that the workers agreed to give up some of their rights in their employment contract, but I'd argue that they did so under duress: their option is sign the contract and work for the company, or starve and let their families starve. Sure, they can go work for another company, but other companies will require them to similarly sign away their rights.

This shouldn't be taken as a recommendation to blithely break contract law. Corporations don't have rights, but they do have power, and it would be unwise to behave as if they can't make your life miserable if you decide to cross them.


> The human does get rights, the organization doesn't.

Organisations are (usually) legal persons, too; they just have fewer responsibilities, get fewer rights in exchange.


That's how things are, not how they should be.


No rights, full liability would be a bad deal too.


A bad deal for whom?

It's a bad deal for corporations, but I do not care. Lack of liability is the cause of a ton of problems in our society.

Just to pick two stories of corporate sociopathy: Probably the reason people at State Farm are unconcerned about forging signatures[1] is that they know that the worst case scenario is that State Farm loses some business and maybe gets a fine: they are unlikely to go to jail for forgery or to have fines exacted from their personal bank accounts. Similarly, when Practice Fusion literally killed people[2], their execs had little to fear: nobody went to jail, nobody was fined: shareholders who had no visibility into the decision paid the fines.

When banks tanked the economy with irresponsible lending most were bailed out and gave their workers bonuses, while the people who were unable to pay their mortgages were ignored.

A little more liability for destructive behavior would be great for most people.

[1] https://news.ycombinator.com/item?id=22177812

[2] https://news.ycombinator.com/item?id=22165946


> they are unlikely to go to jail for forgery

Forgery is criminal regardless of whether a private or official document is concerned. Even in a military setting, forgery of business-related documents is illegal.

> A little more liability for destructive behavior would be great for most people.

Why not full rights, full liability? Replace imprisonment and death with temporary and permanent suspension of the company (including re-establishment of a successor organisation out of a subset of stakeholders), respectively; and voila.


That's an optimistic theory, but read the link I posted and decide for yourself if you think anybody is going to end up in jail for it.

It seems like you're just picking out one aspect of each of my posts to disagree with. Do you have any disagreement with my overall point?


The person and the LLC are legally separate entities, although an LLC is often passed through for tax purposes.


Making organization membership public would trample on personal privacy quite effectively in some respects, such as with disease support groups or PACs; medical privacy is taken seriously, but is there such a thing as political affiliation being private? Is it a violation of someone's privacy to reveal they give to the ACLU?


This is where pseudonymous identities make a lot of sense.

In a world where organizations are radically transparent and individuals have radical privacy, before you join the disease support group or donate to the ACLU, you already know the organization's records about that transaction will be public information.

You can restrict your activities to organizations that do not keep personally identifiable information. Or you could join the support group with a pseudonymous identity or donate cryptocurrency to the ACLU.


The entire point of organizational transparency is to prevent organizations from trampling the rights of individuals, so in cases where organizational transparency would trample the rights of individuals, the rights of individuals supersedes the need for organizational transparency.


So, should it have become public knowledge that Brendan Eich donated $1,000 to support Proposition 8?


I'm gonna need some context there. To begin with you can't just donate to support a bill. What you probably mean is that he donated to an organization or politician who supported Prop 8, so say that.

The important detail is whether the funds were his personal funds, or whether they belonged to his company: i.e. was he acting on his own behalf, or on behalf of an organization?

It's an inane question, though, because the inane point you're trying to make is that individual privacy might protect homophobes. That's true, but it would also protect gays and allies who donated to oppose proposition 8. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason gay marriage is legal today is that people donated to people like Harvey Milk, at a time when donating to the campaign of an openly gay politician was risking your job and social standing.

Human rights still apply to humans who do bad things. If you are willing to give up human rights to fight bad people trying to do bad things, then those rights won't be there to protect good people trying to do good things, either.


So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?


Since you're just openly ignoring the post you're "responding to", I'll just copy-paste my response to what you have just said, with some minor changes:

> So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?

Yes, but it would also protect people who donated to a virtuous cause. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason virtuous causes have had any success at all is that people donated to support them, at a time when donating to those virtuous causes was risking your job and social standing.

Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.


> Since you're just openly ignoring the post you're "responding to", I'll just copy-paste my response to what you have just said, with some minor changes:

I'm attempting to clarify my question by removing irrelevant details, which it looked like you got hung up on last time.

> Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.

So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".


> So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".

No. Please try to respond to what I actually say instead of making stuff up; this is a straw man argument.

There are plenty of other ways we could find out about organizations supporting odious causes and boycott those organizations, without violating the privacy of their members. In fact the point of "organizational transparency" is to make it hard to hide when organizations do bad things.

In addition to accusing me of saying things I didn't say, you're ignoring what I actually did say. Are you willing to make it impossible for people to privately donate to virtuous causes as a means of social change, when donating to support those causes publicly so would be a risk to their careers and reputations? I'm not going to continue this conversation further if you won't respond to this point.


> I don't believe organizations have rights, period

So a group of people, joining together in a common cause, don’t have rights as members of that group?

You are contradicting yourself. Organizations are simply groups of people with a shared cause. To deny rights to the organization, you necessarily have to deny personal rights. Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. “have no rights”, the seller is allowed to sit in the room during their meeting. However this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals, since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in since, in this scenario, their organization doesn’t have rights.

It’s absurd.


> So a group of people, joining together in a common cause, don’t have rights as members of that group?

Emphasis added. No, they don't. They have rights as individual people.

> [...] this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals [...]

Yes, that. "Sam" and "John" would have cause for civil (and probably criminal) action against the interloper. "John and Sam, Inc" has no say in the matter.

It might be useful to grant John and Sam, Inc the privilege to own property, but even that isn't actually a right except insofar as Sam and John have a right not to have the value of their assets (i.e. 50% ownership of John and Sam, Inc, functioning as a proxy for ownership of various comics) actively sabotaged/vandalized.


> Organizations are simply groups of people with a shared cause.

No, they're not. As soon as you have two people in an organization, you've got two different causes. The shared elements of those causes allow them to collaborate, but each of them has slightly different views of what they're working toward. And even when you have a very small, well-specified goal, every individual in the organization has different levels of investment, and other values and boundaries they're not willing to cross to achieve that goal. And the larger organizations get, the wider the variety of disparate goals that can occur within the organization, because individuals in the organization may not even interact directly with one another.

Example: John and Sam work for Facebook. John and Sam want to make money to feed themselves and their families, and don't want to have PTSD. But Mark Zuckerberg wants John and Sam to look at an endless stream of horrific PTSD-inducing images so that he can maintain the reputation of the Facebook platform and get incredibly rich. Where is the shared cause here, exactly?

> To deny rights to the organization, you necessarily have to deny personal rights.

Just because an organization doesn't have rights doesn't mean we have to go out of our way to take away their rights.

> Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. "have no rights", the seller is allowed to sit in the room during their meeting. However, this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals, since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in, because, in this scenario, their organization doesn't have rights.

It sounds like you've figured out that John and Sam, as individuals, each have a right to have a private conversation. John and Sam, Inc. doesn't have rights, but that doesn't suddenly remove John and Sam's individual rights.

The entire point of organizational transparency is to protect the rights of individuals, so obviously if organizational transparency would violate the rights of individuals, the individual rights to privacy supersede the need for organizational transparency.

It's telling that your example of an organization has two people in it. At a small scale, organizations tend to protect the rights of the individuals in the organization fairly well. It's at larger scales that the non-rights of an organization come in conflict more often with the rights of individuals.


You see this inconsistency in individual people all the time. Like how almost everyone wants to be found easily on LinkedIn, Twitter, etc, so they add a public profile photo. Then they freak out because someone starts scraping these photos to build a facial recognition model.


I think that comes from ignorance of the implications of uploading their photo, not from an inconsistency in what they want.


One of my clients is involved in property tax collection and reporting. Property tax records are public info, and their website allows looking up the records for any property without a login. However, the data behind this website is the _source_ of the public records, and not the public records themselves (which would be local government databases).

For years now we've been in an arms race with someone using a botnet to scrape all of the account information for a particular county. My client doesn't care so much about the data; it's the server load that's a problem. Normal activity for this site is a few dozen account searches per minute, but when the botnet gets through our blockade it sends hundreds of search requests per second, overwhelming the site. The operator of the botnet has NEVER tried to contact my client to ask for an efficient api to access the data, which they'd probably provide for a minimal fee.

Data hosting isn't free, even if the data is.


Wouldn't the solution be to offer a streamlined download (maybe even as a torrent if you're worried about bandwidth) of all the data then?


I work on a fully open data repository. The website has the API linked in 3 places, so when I find inappropriate scraping I block it with "HTTP 420 ... see <API link> or contact <email>".

Some people probably switch to using the API, but no-one has ever contacted us. They either give up, or run their scraper on a different computer -- I've seen the same scraper move between university computers, departments, then (in the evening) to a consumer broadband IP.
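
For the curious, a minimal sketch of what that kind of block might look like, assuming a Python/Flask app; the heuristic, the API URL, and the contact address are placeholders rather than this poster's actual setup:

  from flask import Flask, request, make_response

  app = Flask(__name__)

  API_DOCS_URL = "https://example.org/api"      # hypothetical placeholder
  CONTACT_EMAIL = "data-admin@example.org"      # hypothetical placeholder

  def looks_like_scraper(req) -> bool:
      # Placeholder heuristic; a real deployment would look at request
      # rate, path patterns, missing browser headers, etc.
      ua = req.headers.get("User-Agent", "").lower()
      return "python-requests" in ua or "curl" in ua

  @app.before_request
  def point_scrapers_at_api():
      # Returning a response from before_request short-circuits the request.
      if looks_like_scraper(request):
          body = ("Automated access detected. Please use the bulk API at "
                  f"{API_DOCS_URL} or contact {CONTACT_EMAIL}.")
          return make_response(body, 420)  # non-standard code, as in the comment
      return None  # fall through to the normal view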


I really don't understand why anyone would bother writing and using a web scraper when an API exists. Does the API not provide all the same data/functions as the website? Scrapers are a big PITA compared to just using an API: they're much harder to write to be reliable, and they can break at any time, whenever the site makes even the smallest change. APIs avoid all that mess, and make performance far better too (on both sides), since you're only downloading the data you want, not a ton of Javascript and HTML that you don't.


APIs are often not as complete as the web interface, since the customer sees the web interface and normally the customer is what drives the revenue model of the company.

If pages are driven via an API, then the API is preferable, but publicly facing websites are often a mix of server-side HTML generation and API enrichment, for caching if nothing else.


In that case it seems that the webmasters complaining about scraping need to make sure their APIs actually provide access to all the same data, if they want people to use the APIs instead of scraping.


If the scraper contacted the client, said what they need the data for, and (probably) paid for api access, then my client would probably go for it.

My client is under no obligation to make access to this data easier. It's not really their data either; the information is property addresses, owner names and addresses, and tax assessments and payments. My client wouldn't want to make it easier for scammers to get that data. So they're not going to do anything unless they know the scraper is legit. If that's the case, the api would require authentication, and any fees would be for the server load, not the data.


probably, but as GP said, "The operator of the botnet has NEVER tried to contact my client to ask for an efficient api to access the data"

Some people just don't care for the commons


For what purpose? That’s like suggesting that if people keep jumping your fence and trampling your roses because it’s a shortcut to a public park (in this case, the county records office) that already has public access roads, that you should be obliged to build a sidewalk through your garden, at your own expense, when the real answer should be that the public road should be improved.


If your goal is to get your roses to stop being trampled, it's probably easier to install a few pavers than to spend years petitioning to get a road built.

The ideal answer and the efficient answer are not usually the same.


Yeah, the real problem with scraping is that it's often done very haphazardly and bluntly. Sometimes it's very difficult to tell the difference between a scraper and someone trying to DOS your site.


I'm in a similar job. We block people from scraping if they break a threshold, but we also refer them to the reporting system, which can get all of the information that they are collecting in a variety of formats.

I wonder if something like this would be allowed: if all the public information was available in a well-collated format, then can scrapers be blocked? I imagine that will eventually be fought in court as well.


Maybe you could contact the scraper? Just post magnet links on the site that let them get a nicely formatted dump of what they want.


We did figure out who the scraper probably is, but only after several years. For a long time they used an untraceable botnet, but after blocking that they eventually switched to a corporate network we traced to a data aggregation company. But we don't know for sure who's doing the scraping; it could be the company, a rogue employee, or a botnet that got loose on their network.


At work we have all of our data available publicly as easy-to-parse XML files, but no matter what we do the bot owners refuse to use it. They'd rather hammer our search engine with sequential searches instead.
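
To illustrate how little effort the bulk files would take compared to scripted searching, here's a rough Python sketch; the URL and element names are made up, since the actual dump format isn't specified above:

  import urllib.request
  import xml.etree.ElementTree as ET

  DUMP_URL = "https://example.org/exports/records.xml"  # hypothetical URL

  # One bulk download replaces thousands of sequential search requests.
  with urllib.request.urlopen(DUMP_URL) as resp:
      tree = ET.parse(resp)

  for record in tree.getroot().iter("record"):  # element names assumed
      print(record.findtext("id"), record.findtext("title"))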


Would rate limiting be a viable solution?


We've done that, but it's tough to rate-limit a botnet because of the IP address spread. Also, their crappy scraper software doesn't even bother to check if requests are successful; it spews them just as fast no matter how our site responds.


No. The botnets work through multiple regions on multiple cloud providers - that's how they achieve such high throughput. For any single IP address, the load is reasonable, but for the whole botnet it's absurd.

Currently bot traffic accounts for 2/3 of my load, meaning that the cost of providing my service is 3x what it would be without these persistent bots.


Can you just put it behind a CDN that protects you?


Just put an option to download the raw CSVs buried somewhere on the site. Someone who is putting in the effort to run scraping bots will find that link, and save your server the load.


> People want their data to be public

People don't want their data to be public. People want other people's data to be public. One's own data everyone thinks should be private and tightly controlled. This applies to people and businesses equally.


In this case, LinkedIn users kind of do want their “public profiles” to be public. They’re online CVs; by definition, if you make one, your goal is to get it into the hands of anyone who asks for it!

LinkedIn, likewise, has built its business model on an implicit contract with its users that it’s going to show their CV to anyone who asks for it.

I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV. A CV (individually, rather than in aggregate) is ultimately useful for only one thing: marketing the CV’s author’s skills. Why wouldn’t I want my marketable skills scraped into some private “talent matchmaking” agency’s databases, such that someone could find me—and hire me—when I show up as a result of some fancy OLAP query they paid that agency to run on their scraped data? It’s more roundabout than them just finding my CV on LinkedIn, but I’m still glad they found it!


>I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV.

LinkedIn is really really clear that:

- They won't share your information with 3rd parties

- You're not allowed to use information on LinkedIn for commercial purposes without their permission

- Other users can view your personal data

So, why would I expect random third party companies to be able to scrape and sell my personal information?

My personal information is there for the individual use of others, and for authorised use by recruiters (who are vetted/managed by LinkedIn).

I've chased down the convention spam mail I get using my GDPR rights, and surprise surprise, they got my details by scraping LinkedIn. That is absolutely not expected nor acceptable use of my data...


There’s a difference between “my information” and “the public webpage that I went through a publishing workflow to create from a curated selection of my information.”

Let me put it this way: if I have a Wordpress blog, I’d certainly be miffed if Wordpress let bots see my drafts... but I’d also be miffed if Wordpress didn’t let bots (Google, for one!) see the published blog itself. It’s a blog; a public website! Anyone or anything with the URL is supposed to be able to retrieve the page! It’s not “my information” any more†; it’s been broadcast!

† You might want to mentally analogize to copyright, but I don’t think it’s the right model for the intuition people have here. Instead, try mentally analogizing to confidentiality. When a classified document is published in the public sphere (e.g. as evidence in a trial, as testimony before congress, etc.), this forcibly declassifies it. No matter how much the originator of the document might want to still keep it a secret, the legal protections of confidentiality don’t apply to it any more: it’s out there now. Anyone who reads it could plausibly have just read the public-sphere copy, so there’s no longer any way to charge people who have knowledge of the previously-classified information with any crime.


Well, my name, my job title, my employer, my job history.

These are all my information, and selling them to marketing companies is definitely not archiving.

Would you be OK with a company scraping your blog and selling it?


> Would you be OK with a company scraping your blog and selling it?

Selling it how? If they put my blog posts in a book and try to sell that book, that’s copyright infringement. If they put my blog posts in an ML model corpus to train a translation service, and they then charge pay-per-use access to the resulting service... I don’t think I’d care, nor do I think there’s anything morally or legally wrong with that. If they scrape my name and phone number and generate a Yellow-Pages-like index from them? That’s explicitly allowed by law; and heck, that’s why I embedded the information onto my site in vCard microformat in the first place!

To put my philosophy succinctly: if web.archive.org can scrape your data without you having an explicit relationship with them granting them that right, then bad.evil.com can too. You can allow both (= publicizing your information), or neither (= protecting your information), but you can’t allow one but not the other. “Third parties you don’t have a relationship with, who access your data through the public sphere without entering into a specific licensing arrangement with you” are legally one big amorphous blob. You can’t make a law that splits that blob up, because it’s an opaque blob; in the ACL system that is contract law, all entities you don’t have contracts with are just one entity—“the public.” If you want some specific entities to have access to your information, that’s what protecting your data (= setting an ACL “the public = disallow”) and then explicitly licensing it out by entering into contracts (= setting an ACL “entity X = allow”) is for.


Why can't I have terms on my website that say how you can use my information?

Examples where this is allowed:

- Images/media (Creative commons)

- Code (Open source licenses)

You say it isn't allowed for:

- Personal data

Unless I'm misunderstanding your philosophy (which seems to say copyright is OK, but public information must be public to all): You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my name, job title and employer as a marketing bundle?

Edit: An aside, it's really confusing that you seem to be editing your previous replies minutes after I responded. I thought HN only let users edit during the "no replies" period?


>Why can't I have terms on my website that say how you can use my information?

>- Personal data

So there's a couple of things in play here. You can't (generally) copyright facts - "Cthalupa is a Rocket Surgeon for the Space Force since 2001", if true, would not be something that I could get a copyright on.

https://www.newmediarights.org/business_models/artist/are_fa...

The second thing is that terms have to be agreed upon by both parties. If you give me information without us coming to an agreement on terms, I can't be bound by them. If you just put a link to a TOS on your website and don't require people agree to it before giving them access to data on your website, we did not enter into a contractual agreement.


> Why can't I have terms on my website that say how you can use my information?

Neither Creative Commons nor copyleft (nor copyright in general!) can assert anything about private use. IP rights are commercial rights; they affect sellers of your IP. They don’t affect end-consumers of your IP.

Note that even the GPL can’t force someone to publish the source of their GPLed-library-containing program, if they never publish the program itself, but only build it for their own private use.

Why? Because, by broadcasting the code of your GPLed library, you granted people an implicit use-right to it! Not a redistribution right; not a derivative-works right; but a use right. (If this wasn’t true, then people would be breaking the law by reading “common” newspapers in a cafe, or by listening to the radio, since they never entered into any explicit contract with the distributor/broadcaster.)

How does software licensing work, then? Mostly by 1. companies installing software on computers for their employees to use being considered IP redistributors; and 2. attachment of copyright through sampling when asset samples [e.g. brushes/textures in Photoshop] are distributed through the program. Other than that, there’s really no law forcing end-users to pay for software licenses. This is why e.g. WinRAR would never have been able to sue anybody. They published their shareware binary (without gating it behind a contractual relationship, like Adobe’s Creative Cloud installer); so now you have a use-right to it!

> You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my phone number?

Copyright exists because your ability to make money from your own creative works hinges on your ability to exclusively license those works. If a publisher can get a redistribution license to your manuscript for free from a third party, why would they buy it from you?

You having exclusive access to your phone number does not make you money; others having access to your phone number does not deprive you of money you could have made by keeping that information private. Thus, there’s no advantage to introducing IP law into this domain (the domain of facts.)

There was a recent court case about someone creating a subway map by copying the raw data from existing subway maps, where the comments went deeper into this.


Right, sure, I hadn't really thought about how copyright isn't really enforceable against individuals. That's very interesting.

However, I'm really not sure how this is relevant to your moral stance on commercial use of "public" personal information.

Why do you believe it's reasonable to prevent unauthorized commercial exploitation of creative works, but not to prevent unauthorized commercial exploitation of personal information?

The former simply affects the small percentage of people who sell their works.

The latter affects the vast majority of the population who receive targeted spam, have their information collated and sold for profiling, are victims of identity fraud when those databases are inevitably leaked, etc.

For what it's worth, as I mentioned in my first comment, the GDPR absolutely gives me rights to control how my personal information is used. And the GDPR has a near total exemption for individual use.

What benefits do you see of commercial use against the wishes of the person that published it that outweigh the risks? (making money isn't a benefit)


I think you got the wrong idea if you were thinking I was saying copyright isn’t enforceable “against individuals.”

My example of copyleft was specifically about the thing the Affero GPL tries to avoid (to unknown success): the possibility of someone using GPLed libraries to set up a commercial web service. Because they never release the binary, but only have people interact with it over the Internet, there’s no derivative work being made available in the commercial domain. So copyright doesn’t apply. Even though you’re a company making money off GPLed libraries!


> If they put my blog posts in a book and try to sell that book, that’s copyright infringement

Actually even if they don’t sell it, but give it away, that’s still copyright infringement.


I have a linkedin so that I can point people at it. I also want human recruiters who have actually read the thing to send me relevant jobs. If my profile ended up affecting my credit report, I'd be pissed. I expect you would be too.

People put data places for specific purposes (to show recruiters) and want the ability to limit use to that purpose. How that's accomplished is just a technicality most people don't care about.


Not so sure about that. Messages on LinkedIn are mediated in a single place, and you have a measure of control over how your profile shows up in searches. If your CV is scraped, you could end up anywhere and now you're getting recruiter spam from all over when you're not interested.


> If your CV is scraped, you could end up anywhere and now you're getting recruiter spam from all over when you're not interested.

I would point out that this is still possible (even probable!) without any bots being involved at all. Back before CVs were online, humans working for recruitment agencies would “scrape” information from local, physical job boards by hand into their company’s databases (where “database” here could just mean a filing cabinet.)

IMHO, the real solution to that is a spam filter (or an “agent”, in the old world.) Just because a lot of people want to talk to you, and most of them aren’t very interesting, doesn’t mean they need to be prevented from accessing you—they just need to be prioritized by interesting-ness, which is something you can do yourself, or hire a service to do for you.


I think the GP in this context means, e.g. LinkedIn wants their information to be public in the cases where it benefits them as a business. But then they want it to not be public when it doesn't benefit them. There is no such thing as "public information, except ..." - information is public, or it's not. If none of LinkedIn's data was public, they would have a much harder time getting people to sign up, and having as many users signed up as possible is part of their business model.

From a copyright perspective (since that's what LinkedIn's lawyers claimed): imagine if a newspaper sued another newspaper, saying that - not just the content of its paper - the information in the newspaper was copyrighted and could not be accessed by "unauthorized" third party companies. Either you print it, or you don't!


I want to be able to use LinkedIn to network with colleagues and people in my industry. If someone wants to scrape my profile to make a report on industry trends, I’m fine with it. What I don’t want is hiQ vacuuming up my data so they can snitch to my employer if they think I’m job hunting.

How is this a paradox? Tech — the web in particular — is supposed to be an equalizing force, but HiQ is clearly trying to give my employer more power over me. We are an industry that prides itself on solving difficult problems — how is our response here to just throw up our hands and say “it’s all or nothing”?


Wow, I just looked up what hiQ does and have to say it's pretty scummy in my opinion. Why do people create stuff like this? Don't they know it will likely come back to bite them one day?

For reference:

"There is more information about your employees outside the walls of your organization than inside it. hiQ curates and leverages this public data to drive employee-positive actions.

Our machine learning-based SaaS platform provides flight risks and skill footprints of enterprise organizations, allowing HR teams to make better, more reliable people decisions."


The thing is, GDPR has theoretically solved this in the EU. The UK's ICO is about to publish guidance prohibiting scraping public user information for marketing (where the user would not expect it to be used for that).

It's a really easy solution, because companies need to prove how they got your data when asked.

When you track the source of the mailing list you're getting spam from and they say "We scraped it from LinkedIn", they get fined.


We haven't solved that, in the same way we haven't "solved" encryption by giving it a magical good-people-only door, despite spook tantrums. There fundamentally isn't a possible mechanism, and really wanting one doesn't change that.

It is a result of equality - not of outcome, but of rules. "Open for everyone except those whose applications you don't like" isn't open. On a technical level, trying to prevent it is like the "evil bit" as a solution to malware.


Of course there are possible mechanisms. There are heuristics to detect bots. The whole reason for this lawsuit is that LinkedIn blocked hiQ from scraping their website.

I'm also not necessarily talking about a technical defense against unwanted scraping. Write a law that makes it illegal to do something like "scraping personally identifiable information and storing or presenting it non-anonymized", and prosecute companies that break it. I'm sure there are loopholes in that particular example, but the point is we can absolutely add shades of gray here.

> Open for everyone but those whose applications you don't like isn't open.

Openness should be a means, not an end. If we make something "not open" but it prevents 95% of undesirable uses and only 5% of desirable ones, is that not a tradeoff worth discussing?


>Tech — the web in particular — is supposed to be an equalizing force

Don't take the Google PR so seriously, the tech industry wants to make money like every other industry.


"public" is not the right concept here I think. E.g. imagine a composer conducting a public airing of some work of music (e.g. on some festival). That you were able to hear the music in public doesn't mean the {composer,artists,...} give up their copyrights.


> public doesn't mean the {composer,artists,...} give up their copyrights.

Copyright law protects the original work of the composer and artists in your example.

User profile data on LinkedIn is not LinkedIn's original work.

Additionally user profiles are mostly made up of facts, which are not copyrightable.


I think copyright as you mention here is the right concept, or at least a lot closer. In particular, the limits on copyright. If someone is reciting a list of facts in public, they can’t expect people not to record those facts, because copyright doesn’t apply to that. Reciting the list in public using computers shouldn’t change that.


I want web scraping to be legal—but, is it really contradictory to say "I want this data to be accessible to real humans only"?

Any person can post on Hacker News. However, if someone made a bot to post to Hacker News, I think most of us would be pretty upset.


What you might mean is:

- I want my data to be publicly available

- I don't want my data to be processed/distributed/sold without my permission

E.g. individual use is fine, profit making is not.

Which is my expectation with LinkedIn. I want people to see my profile, I don't want them to sell it as marketing leads!


Agree with this, and I would also add:

- I want to be in control of my data and be able to change the setting.

- I want to be able to delete my data.

If scraping on LinkedIn is banned (and LinkedIn is enforcing it), then I do have control of my data, since I can change the setting and it will no longer be public (it's not perfect, since some might have already scraped it, but the extent would be much smaller). Also, if I decide to delete my data, LinkedIn can do that for data it controls, but not for scraped data.


Scraping information is not the same as posting. There are a number of bots that scrape Hacker News and people here generally consider them pretty cool.


Isn't that a different paradigm, though? A posting bot set loose on a forum/platform will (normally) visibly degrade service in a much more visible and impactful way than a scraping bot. And in either case, writing (and running!) a bot that posts on HN is not illegal behaviour in itself.


I dunno.

I'm on Hackernews to see interesting articles and read interesting conversation. If a bot can post interesting articles and make interesting conversation, I'm not sure I care that it's a bot. And if a human can't do those things, I'm not sure I care whether or not they're 'real'.

https://xkcd.com/810/

That focus on "we don't care if you self-promote, we don't care why you're here, we just want you to be a good citizen" is part of why I like HN.

It's not completely black and white, but in general I believe that users online have the Right to Delegate[0]. That right should only be legally taken away if there's a really, unbelievably compelling social justification for doing so. I am pretty skeptical that banning web scraping has that kind of justification.

[0]: https://anewdigitalmanifesto.com/#right-to-delegate


Isn’t this different from read access? Apart from server resources, downloading content doesn’t affect a web site as much as posting on it.


I don't really care if the comments are from bots, per se. I care if they are quality comments or not. Whether or not the comments are from bots is just a proxy for whether or not they are actually good.


Makes me wonder, how many users on HN are actually very convincing bots?


Humans cannot directly access websites - a machine is always involved. I know perfectly well what you mean but the distinction on a deeper level is fundamentally imaginary.

The best is some sort of heuristic like captchas and even they can be outsourced so that the human doing them isn't actually viewing the content.

The thing about a bot that bothers people is the behavior anyway. A human acting like a bot would get people just as upset.


Yeah, I have nothing at all against scraping per se; it's more about the huge bot traffic the commercial scrapers generate, which would ALSO be fine, except that 1) it can be hard to tell scrapers from malicious DDoS bots sometimes, and 2) the person being scraped literally pays for that scraping traffic.


So, apply DOS/DDOS mitigations? You need to have those anyway, so you might as well use them for this too.


Yeah, but where do you draw the line between 'oh sorry we dropped your request because of rate limiting' (or whatever mitigation strategy) and 'oh, we dropped your request because u scraping us bro' legally? IANAL, but this lawsuit seems to indicate that putting barriers in front of scraping attempts is a no-no.


"You issued more than $X requests per $Y seconds. We don't care whether you're scraping or not; try again later." for any values of X and Y.


There's plenty of grey here. For example, scrapers that try to check people in for flights to get better seats; some even tried to charge for that. That creates problems where some customers benefit at the expense of others, plus high load on a "locking type" piece of code, etc. Similar for ticket sales for concerts, and probably other spaces.

There are also companies that provide added value by compiling and correlating "public info" in a useful way that creates value. If Google let me scrape their search and remove ads, it would be popular, but is it "legal"? Or maybe Google Maps?


I would think, and of course could be wrong, it would be as legal as Google scraping all of the web sites that they do in order to create their search engine in the first place. In particular, Google provides cached versions of web pages. That's pretty hardcore scraping.


I put "legal" in quotes because I assume it's not criminal. It would, though, likely result in some pretty quick civil action.


Another problem with web scraping is a B2B website offering services for B2C companies to better reach consumers. It can be a tricky thing to do without basically giving your clients list to all your competitors.


Sure. My data is still my data, and if I publish it on my platform for free, that still shouldn't automatically give you the right to copy the data and provide on your platform.

It's basically the same as a TV broadcasting a film for free, and then going after you legally if you recorded that film and uploaded it to your website.


This does not legalize theft; it says sites cannot respond to suspected scrapers differently than they respond to non-suspected scrapers. You can still rate limit, as long as you do it universally for all site users.

Copyright law is unchanged. If someone scrapes your blog and then re-uses your posts on their own blog, you still have a possible copyright infringement claim.


This is a bad analogy, as scraped but copyrighted works are still protected by copyright. What's in question is whether you should have exclusive rights to information you have shared but did not copyright. It seems people also think this non-copyrighted data should be protected even if it's munged and added to an original work by the scraper.


It's not the same thing at all. The film is broadcast without re-transmission rights.


How is that different? Where on my website did I give anyone "re-transmission" or "re-publishing" rights?


It's different for broadcast films and tv https://en.wikipedia.org/wiki/Retransmission_consent


Billions are being made collecting data and content for free and publishing it along with targeted advertisements. Web scrapers can collect that aggregated data and redistribute it or create competing services. This is terrifying to certain tech giants and threatens their moats and lock-in/network advantages. They will lobby hard against it with PR campaigns citing everything from security, privacy, and copyright to all sorts of other exaggerated bogeymen. At the end of the day it's mostly about preserving their monopolies, though.


I can understand why some do not want scrapers - increased traffic (with practically zero benefits to the owners) is one obvious reason.

(Some people will then say "But why not just offer APIs", but that's a lot of extra work and maintenance).

It's like with Instagram and other social media platforms. The content creators put in the hard work, while the leeches steal content for their own benefit, giving zero credit to the original creators.


There’s a very effective way out: don’t put your data on the public web.


[decided to delete because I misunderstood the context]


Let's not conflate a) a person's personal data, and b) a business's dataset. The GP and the article are clearly referring to the latter. Preventing web scraping won't protect users from businesses collecting their data.


Instagram is not the content creator tho


What I'm trying to say is: For every popular content creator on IG, there are tens and hundreds of (more or less automatic) content curators that do nothing more than scrape content with lots of likes, and re-post on their own channels. Then when they get sufficient followers, they make money through paid product placements, account flipping, pay-to-play sharing, and what not. More often than not, there's no linking to the original pages / creators.


I think you mean contradiction, not paradox.


Not really. Both are correct, and paradox fits better in the OP's context:

paradox: a statement or proposition that seems self-contradictory or absurd but in reality expresses a possible truth.


A paradox is something that initially looks like a contradiction, but after closer inspection, is not.


Probably he or she was blacklisted on Facebook.


People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it.

By "people" do you mean businesses or actual people? Because I don't think people want everything to be public; many in fact use various networks to avoid oversharing, and even then many people don't want their old bosses or exes looking at their profiles. There just don't exist tools to limit access that granularly.


I don't think the direction would ever be clear, even if the legality were clearly established. The arms race would intensify, and detecting/blocking/deceiving scrapers might become a lucrative field.

Companies want to provide some information to some people; but providing all information to all people is analogous to allowing customers to make a meal of free food samples, on a recurring basis.


The arms race would intensify, and detecting/blocking/deceiving scrapers might become a lucrative field.

It already is. There are entire companies, like Distil Networks, who exist solely to protect companies from bots/scrapers/etc. Actually, looks like Distil got acquired and are now part of Imperva, but anyway, the idea is the same. This is definitely an existing field.

https://www.imperva.com/products/bot-management/

Disclosure: former Distil employee, but I have no financial stake in this discussion, and have mixed feelings about scraping. Clearly it can be beneficial in some situations, but when I think about having to pay exorbitant prices to scalpers for tickets to an event, because they used a bot to buy up all the tickets, that is less appealing.


> People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it

Can you really not imagine a world where a person accidentally or in poor judgment uploads something private to their own site (their real name, home address, credit card#, or any piece of highly damaging information that could cost them their careers) and wishes for it to be removed? (but can't because many of these scraping sites never respond to takedown requests)

People make mistakes and post things they shouldn't. A mistake someone made many years in the past, and has since made amends for, shouldn't haunt them for the rest of their life.

But it does when we decide that every single line of text ever uttered online must be preserved and easily accessible by anyone for all eternity.

Blocking scrapers is an arms-race escalation because these sites refuse to remove content, and it's used as a tool for character assassination by bad actors. It's a proactive defense.


It sounds like the same old issue: lack of understanding of the fundamentals of the underlying technologies.

Otherwise they would realize that what they demand is contradictory and incoherent, like demanding to be both viewable by all and not viewable. DRM is one fundamental example of this.


People love when information is free, but as soon as monetization appears they see it like taking food from their plate.


They want their data to be public for a specific use; I feel like that's pretty easy to understand. LinkedIn: they want their info to be public -> to get jobs.


The title of this post is misleading. The Court's decision related to HiQ's attempt to obtain a preliminary injunction. It's clearly an initial victory for hiQ in that the Court affirmed granting an injunction based on a significant likelihood that hiQ would ultimately prevail and that HiQ would suffer irreparable damage if the injunction was not granted. However, the Court never actually dealt with the merits of the case and, accordingly, stating that the case has precedential value is misleading. The Court itself noted:

"At this preliminary injunction stage, we do not resolve the companies’ legal dispute definitively, nor do we address all the claims and defenses they have pleaded in the district court. Instead, we focus on whether hiQ has raised serious questions on the merits of the factual and legal issues presented to us, as well as on the other requisites for preliminary relief."


German copyright has the concept of a "Datenbankwerk" (since the 90s).

E.g. the telephone book contains lots of boring facts that are each in themselves not copyrightable. However the collection in itself is copyrightable, as it required substantial effort to create.

It seems odd that US copyright law wouldn't have a similar provision, or that it doesn't apply here?


Also known as "sweat of the brow." The US Supreme Court rejected such protections in the 90s. I'm sure it's complicated, but copyrights are not presumed.

https://en.wikipedia.org/wiki/Sweat_of_the_brow#United_State...


It does, and the result is that phone books and maps get fictional entries inserted in order to prove copying -- because it is perfectly legal to do your own work to amass the same data set.

https://en.wikipedia.org/wiki/Fictitious_entry


Like a paper town, or that time Genius caught Google stealing their lyrics: they wrote a tool that watermarked songs on their site by interchanging straight and curly quotes. It was brilliant.


You might even say it was Genius.


If you scroll down to the "Legal Actions" section on Wikipedia, you'll find that these copyright traps have generally failed to serve their purpose when attempted in courts.


It's an actual carve-out. Database work is explicitly NOT protected - Feist v Rural was actually about a telephone book deemed unprotectable!


There is a concept in US copyright known as "thin" copyright. Collections of uncopyrightable information can in fact be copyrighted, but the arrangement of the information must have some spark or minimal creative energy. Mere "sweat of the brow" is insufficient to confer copyrightability on the work.

Such "thin" copyright tends to mean that there is a strong presumption against infringement. You generally need to demonstrate that the work has been copied virtually in its entirety to find infringement; partial borrowing is insufficient.


IANAL, but I think the US does have something similar. In the US, you can take works from the public domain, and then perform some sort of work, ex: restoration of a film, and then copyright that work. The difference being that someone else can use/release their own version of the same content, but they cannot use the work you did just because the content comes from the public domain.


US law has a very similar approach. See e.g. https://www.bitlaw.com/copyright/database.html

But I don't think that's relevant here, as a) this isn't a copyright case and b) HiQ are not attempting to recreate the entire "compilation" of LinkedIn


I'm pretty sure the US has the same thing, basically. You can't copyright facts, but you can copyright presentation. So someone can't copy your map directly and sell it, because there's an artistic component to it. But they can make their own map with the same data, with a different style to it, and sell that.


> Now many site owners are trying to put technical obstacles to competitors who completely copy their information that is not protected by copyright. For example, ticket prices, product lots, open user profiles, and so on. Some sites consider this information “their own”, and consider web scraping as “theft”. Legally, this is not the case, which is now officially enshrined in the US.

Does this mean we can now scrape e.g. YouTube videos, Amazon reviews, IMDB reviews, Facebook events ... ?


Yes, you can scrape them; no, you cannot republish them. Everything you listed is protected by copyright, and this ruling does not let you infringe those copyrights.

>hiQ argued that LinkedIn’s technical measures to block web scraping interfere with hiQ’s contracts with its own customers who rely on this data. In legal jargon, this is called” malicious interference with a contract”, which is prohibited by American law

Does this mean that Google's random recaptcha check is interference?


I think any ruling that says LinkedIn can't put in protective measures against automated requests is doomed to be overturned, as long as they're not applying them discriminatorily. Captchas, rate limiting, user agent testing, etc. are all common tools to protect against malicious/unintentional denial of service. The question is what LinkedIn was doing, and whether it specifically targeted hiQ while permitting other traffic of the same class.


Why would it be an issue if it is discriminatory? LinkedIn can use its servers any way it likes, unless it has promised its users that their data can be scraped indiscriminately.


Because of the court case. This is just an injunction pending an actual decision.


I'm curious how entities like https://www.omdbapi.com/ can continue their activity, get $$$ and not get shut down.


Yeah, what is the line here? Would it be against the rules to block known user agents, or to throttle traffic?


No, because what one side of a case argues is not the law. What judges decide is the law.


Probably not. Facts aren't copyrightable but creative works are.

So prices on Amazon.com are facts. User reviews are creative so probably copyrighted.

Similarly the videos on YouTube are copyrighted. However the number of views and the number of likes are probably scrapable.


See, that's where I have a problem with this. Isn't data just _data_?

Let's draw some parallels to real life. If I go to a public space like a town square, can't I take pictures, notes, and recordings, then go home and draw my analytics from them? What if I read something in a book I bought; can't I quote it?

The same should apply to web resources, even if they are creative: as long as I don't publish them, I should be able to scrape whatever public resources I want and use them in my analytics, machine learning, or whatever.


This is why I strongly prefer the Dutch term 'auteursrecht' (author's rights) as opposed to copyright. Copyright has this annoying incorrect connotation that it has anything to do with copying when it's really publishing that it should be limiting.

Downloading publicly available data should (by definition of public) not be a violation of someone's rights. However it's easy to see why it wouldn't be desirable for someone to republish creative works as their own, so it's reasonable to give the author control over how their work should be published.

And in the case of price data or similar you would be hard pressed to deem anyone the 'author' of it, hence it would be weird to enforce the author's rights.


>Copyright has this annoying incorrect connotation that it has anything to do with copying when it's really publishing that it should be limiting. //

Copyright does make _copying_ tortious. Broad personal-use exceptions in the USA, for example, make this appear not to be true, but it is the act of copying - even without publication - that is restricted in general.

Ripping a CD in UK, for example is copyright infringement without a general personal use exception (there are exceptions, under Fair Dealing, but whatever you're doing almost certainly doesn't fall into them).

See eg UK CDPA1988, Chapter II, section 16(1)(a); or USC17, Chapter 1, 106(1).


You are discussing the fair use provisions of copyright law.

Not a lawyer, but:

You can do all of that, but:

You cannot scan the book you bought and put it on your website, for sale or even for free - unless its copyright has expired or you are given permission by the copyright holder.

You cannot take a picture of someone's painting in high detail, then sell prints of it - unless its copyright has expired or you are given permission by the copyright holder.


In addition, there are some buildings and landmarks that you can't simply take photos of and then resell

https://www.rd.com/advice/travel/eiffel-tower-illegal-photos... http://www.photographers-resource.co.uk/photography/Legal/Ac...


Your examples are really about wanting greater freedom to copy, rather than about the distinction between data and creative work. Copyright is supposed to encourage people to make creative work, not encourage people to record existing facts. I think this distinction is important because a creative work isn't actually necessary to anyone else - they could create their own, different one if they wanted. But data might have only one correct value, and if that were locked away by copyright, it would limit other people's ability to do things that can't be done with different data.


As far as law is concerned, data is not just data -- bits have colour:

https://ansuz.sooke.bc.ca/entry/23


Additionally, some public areas prohibit photography of architecture because of copyright.

https://www.diyphotography.net/10-famous-landmarks-youre-all...


> Isn't data just _data_?

Think of Law around data as using dependent types. The legal protections depend on the type of the data, and the type depends on the content (among other things). You have to determine the type BEFORE you can tell what the law says about it, since the law only cares about the type. You could probably encode the law nicely with something like Idris, but any "code as law" type governance system without dependent types won't be able to express existing law.


> Isn't data just data?

No. At the risk of just repeating the comment you didn't understand, creative works are not "just data" - they are copyrightable works, and the owner has control over who can use them, not just for profit but for any reason, with few exceptions.

You don't just get to drop someone else's work product into your algorithm without their permission.


There are cases where "dropping into your algorithm" would count as fair use such as a search engine of copyrighted content.


> You don't just get to drop someone else's work product into your algorithm without their permission.

Why not?


Because copyright law exists.


I don't think using data as input to an algorithm necessarily breaks copyright law.

I can read a book and post my impression of it somewhere, right? I can read it and say "it was beautiful" on Twitter.

I can then automate my "taste meter" through machine learning: it reads a given book character by character, spits out what I'd think of it if I actually read it, then posts on Twitter, "it was beautiful".

Did I break copyright law? I don't think so.


You can't take something copyrighted by someone else and re-distribute it without their permission. However, I suspect you can capture it freely if you don't re-distribute it.


I think the fashion industry should exert their right to have their work removed from photographs.


Neat straw man but you're actually proving my point. There are scenarios under which they can't do that (fair use) but there are also many scenarios where they would be entirely within their right to do so.


> User reviews are creative so probably copyrighted.

I wonder if the number of stars is copyrighted. It's not creative, but a fact.


Probably not since each star review is a separate "work" by a separate author. Mechanically combining multiple non-copyrightable things into one doesn't make it copyrightable. If Amazon arranged their users' star reviews into an infographic that would be copyrightable.


Why would a review be a copyrightable creative work, while a LinkedIn resume wouldn't be?


I think perhaps the layout, cover letter, and maybe any flourishing notes are copyrightable, but the actual details of work experience and education are not.


Yeah, I would think the "description" section for each job would be copyrightable, but the simple "title", "company", "year" fields would not be.


There's some huge datasets of Amazon reviews available. Stanford has a big scrape out there, plus there's one from Amazon themselves in the AWS datasets.


Youtube videos are definitely protected by copyright, though.


In theory, right?

See the South Park WWITB issue.

I believe South Park used a video clip from YouTube, and YouTube's Content ID system removed the original video South Park had used, because YouTube considered it a violation of South Park's copyright.


Just because YouTube gets it wrong doesn't mean it's just theory. YouTube is not the only site that has automated content scanning for copyright violations. Getty and other photo sites have gotten this wrong in the same way by sending C&D letters for violations to the actual copyright holders.


I was specifically discussing Youtube.


Shouldn't the copyright belong to the creator, not to YouTube? Basically, YouTube shouldn't be able to sue you; it should be up to the creator to do so.

