No ultimate decision was ever made, and no, this doesn't make web scraping 100% legal. Wake me up when there's a new announcement, because anyone interested in this already knows this old news.
I feel like siding with LinkedIn here would open up the web to extortion though, like troll companies that would send cease and desist letters to all scrapers (even search engines). I think it could be argued that letting one company scrape when another is denied is discrimination.
Then again, I don't know how conservative and republican-leaning courts decide corporate law. Maybe in this case since so much money is at stake, they might worry that banning scraping would infringe on something like free speech and ruffle the feathers of some of the wealthier contributors in their base. Especially on the media side since I imagine they use bots in one form or another to find newsworthy stories.
IANAL (obviously!), I just find it entertaining/dismaying to ponder these things in these times.
Gate the information behind a login, then sue the scraper for violating the ToS rather than for the scraping itself - that I can understand.
> Sign up now for free access to this content
> Email (NOTE: Free email domains not supported)
I think this sort of antinomy sounds ironic, especially in this thread. It's practically a legal dependency-injection pattern: you call something free, then administer antinomies, catch-22s, and so on.
Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?
"In this case, hiQ argued that LinkedIn's technical measures to block web scraping interfere with hiQ's contracts with its own customers who rely on this data. In legal jargon, this is called "malicious interference with a contract", which is prohibited by American law."
This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?
Seems to imply that every business is somehow beholden to every contract signed by anyone.
ToS are subservient to the law; you can (probably) terminate the service account of a user who breaks your ToS, but if the user does not have a service account (as appears to be the case for hiQ; it doesn't seem they were using accounts for the scraping), then your ToS does not apply, since you've technically not entered a binding legal contract with them.
> This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?
IANAL, but I believe that'd turn on intent, and intent is often difficult to prove at a personal level, though not necessarily at a company level. If your intent in putting up barriers that happen to impact scraping, whatever they may be, was indeed to knowingly prevent scraping by a particular company, then you may be liable under this decision. This is the only part of the decision I'm torn on, since it's a bit messy to really prove such things. I'd be much more comfortable with allowing companies to take whatever measures they feel necessary to prevent scraping, while also allowing scrapers to legally circumvent those measures without threat of prosecution, assuming they didn't actually hack into anything.
Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.
I have interpreted the LinkedIn ruling to mean that scraping public data is no longer criminal activity but it still leaves you open to civil lawsuits for violating the ToS of the website you are scraping.
How would that even work? If I browse to any random public page of your website, it's served to me before you've even transmitted the terms of service. How could I be bound by those terms of service when I haven't even seen them?
I think these sorts of contracts are called Adhesion Contracts (https://www.investopedia.com/terms/a/adhesion-contract.asp) and we interact with them all the time. For example, if you valet your car, the valet will hand you a piece of paper with a number printed on it to retrieve your car. On that paper you will find an adhesion contract that is valid and real (although not as powerful as the types of contracts that you sign).
The paper the valet hands you is not an immediate contract: you can decline to agree, and the service simply doesn't happen.
You cannot do that with a publicly visible website, unless you show the ToS and require agreement before first use.
If you allow only a non-transferable license, then said data cannot be used by a search engine. If it's transferable, you've just pushed the problem down to scraping a different bot.
(Well, you could have a direct agreement with a few major search engines.)
Caveat emptor: not a lawyer.
Except it sounds like the owner doesn't. If the information is on the page made public, the owner of the page can't place terms on what is done with the data downstream. They'd have to implement some real binding system such as authentication where CFAA would apply. (IANAL)
And said ToS would have to force copyright reassignment rather than a general licence, making LinkedIn culpable for any unlawful content published by users of its site.
TOS are a lot like EULAs. If they look like contracts of adhesion, then they're going to get more scrutiny and skepticism. The TOS that you claim applies even to every single random visitor to your site where they do not in fact affirmatively agree to the terms is potentially going to look more like a contract of adhesion. That's a lot harder to enforce.
If they are used more for CYA so that you can ban undesirable accounts from your website which people explicitly agreed to when they signed up for it, or so that you can just up and alter your entire business model without having to give all of your customers refunds, then they're easier to defend.
Just my general opinion, of course. Every jurisdiction is different.
When you create an account, etc., you are agreeing to those terms. If I browse a public webpage that just has a terms of service link on the bottom of it, I've not agreed to anything.
Typically you'll see TOS say something along the lines of "by continuing to access this site you agree..." or "if you do not agree with these terms you may not access this site..."
Whether that's enough to create a binding contract depends on the jurisdiction and who you ask.
I don't think criminal law was ever part of this.
LinkedIn made threats accusing hiq of criminal behavior, but that doesn't mean there's any criminal precedent being set here, as far as I can tell. And no one was criminally charged.
Separately, part of the ruling states that for the purposes of authorization, defying a cease and desist letter does not constitute illegal access, which might have some criminal implications. They imply some sort of technical authorization system must be bypassed, which didn't happen, since the data is "public."
(Which doesn't square well, imho, with existing meatspace law. If a public serving business banned someone from their store, the door being unlocked isn't an excuse to ignore that ban and trespass. But I digress.)
With the overlapping areas of law, it's admittedly beyond my understanding. But the law (the CFAA) is generally viewed, like the DMCA, as overreaching, if not at least partly unconstitutional.
Reply All - #43 The Law That Sticks
I'm not particularly thrilled with it, but enough people think of it as a valuable enough service to pay for, even if they know they could get it themselves for free.
LinkedIn users (as opposed to the company) might actually like what HiQ is doing, as it may help their own prospects.
It's not free; it takes time to collect data. Buying it makes a lot of sense as long as you pay less than your own time is worth to you...
The data without the noise is what you're paying for. The service of winnowing out what you care about from what you don't care about.
Considering how big an effort it is, and that the source from which it came is still available, why should the cleaned data be free? If I collect fallen trees from public land and chop them into usable firewood, should my bundles of firewood also be free? Or if I collect solar power with my own solar cells, should I have to give you the electricity for free?
1. This specific example is BS but plausible - I just wanted something more specific than the vagaries around things like FOIAs or shareholder reports, both of which have specific facts that can be rendered useless unless you have the context.
2. Basically, list of how much procedures cost.
I'm thinking of processed GIS data. If you have ever tried using the various formats that are supplied by government sites, you know what a huge pain it is.
I'm happy to pay a reasonable price for an interpreted and bowdlerized version.
The data is free, but the aggregated, formatted data has been worked on and processed; are you saying the resulting aggregated data should also be free? That isn't going to happen - why would anyone do that work for free?
Or are you afraid LinkedIn and others will make everything private? It's completely up to LinkedIn, or to individual LinkedIn users, what they want to make private vs. public. Maybe more data would be made private if they don't want it scraped. I don't think that's inherently a good or bad thing.
But if it's only accessible with a login, then it falls under TOS and they can be blocked?
This is just a preliminary injunction. This wasn't an actual ruling on the case. This just says that until there is a ruling they can't stop the scraping to make sure the company isn't put under while waiting for an actual ruling.
It’s a very, very strong indication that they will win. Courts don’t issue preliminary injunctions unless it’s extremely likely the side who won the preliminary injunction will win.
Could you cite something on this that indicates this (my understanding here) is wrong?
LinkedIn has long wanted to have their cake and eat it too - they advertised that data as being publicly accessible and allow Google to index specific user pages, but then attempt to restrict other bots from crawling it.
If you have private data behind a login there isn't an issue here - if you have public data but want some people to login before viewing it (or not be able to view it) then that's where this ruling comes up. So, this mostly hits sneaky SEO folks and dark UX patterns that rely on tempting someone with accessible data and then pulling the rug out from them at the last minute.
If your website places data outside of authentication then everyone should be able to see that data... I'm curious to see the specifics around
> Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?
though - DoS attacks are clearly illegal, but with this precedent there's going to be a lot of back and forth to see where the line between DoS and scraping falls... and I think that makes this precedent a lot weaker than the headline would have you believe. A company can still threaten to drag you through a lot of litigation by accusing you of malicious page requests, it'll take a few cases to define where that line needs to fall.
That does not mean that the court agreed.
The judges said that CFAA doesn't apply.
In other words, the judges said that LinkedIn couldn't use the US legal system to force HiQ to stop. Judges didn't say that LinkedIn was barred from using technical measures.
The court did allow a preliminary injunction against LinkedIn, due to the possibility of "monopolies" (to be determined in Court later), pending resolution of that latter question.
LinkedIn might still win their claim to their right to block scrapers via technical means.
LinkedIn can't prevent HiQ from attempting to scrape their site through force of law.
LinkedIn can rate limit requests, make their site hard to scrape, change their format, whatever. LinkedIn is in no way responsible for how HiQ fulfills its contract to its customers. HiQ is attempting to say that if I sign a contract to provide you with a Tesla, then it would be illegal for Tesla to stop me from just taking one from them to give to you. If that sounds stupid, that's because it is.
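As a sketch of the kind of rate limiting mentioned above - purely illustrative, not anything from the case; the `TokenBucket` class and its parameters are my own invention - a minimal per-client token bucket might look like:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Minimal in-memory, per-client token-bucket rate limiter (sketch)."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # maximum tokens a client can accumulate
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)

    def allow(self, client_ip: str) -> bool:
        """Return True if this request is within the client's budget."""
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(
            self.burst, self.tokens[client_ip] + elapsed * self.rate
        )
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False
```

A server would call `allow()` once per incoming request and return HTTP 429 when it comes back `False`; a real deployment would keep the buckets in a shared store rather than process memory.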
Implied contract is that if you publish something, it's public and you have no right to dictate what software people use to consume it.
LinkedIn ain't the victim here...
> hiQ also asked the court to prohibit LinkedIn from blocking its access to public profiles while the court considered the merits of its request. hiQ won a preliminary injunction against LinkedIn in district court, and LinkedIn appealed.
Whether LinkedIn is the good guy or bad guy here doesn't matter when the decision creates precedence for the rest of us.
Surely a healthier precedent is that we can respond arbitrarily to requests and have no obligation to the requester. So what if I want to randomize the html structure on every request or block requests from Tor because 100% of them are abuse? Can someone take me to court on the grounds that either is effectively "blocking" their scraping syndicate? Why not?
I feel like once CFAA is off the table (which I do agree with), the cat and mouse game is a fair middle ground. Keep web scraping a sport!
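As a sketch of the "randomize the html structure on every request" countermeasure mentioned above (purely illustrative; `randomize_classes` and all names are hypothetical):

```python
import secrets

def randomize_classes(html: str, class_names: list) -> str:
    """Swap known CSS class names for per-request random aliases so that
    scrapers cannot rely on stable selectors. Sketch only: a real version
    would also rewrite the served stylesheet with the same mapping."""
    mapping = {name: "c-" + secrets.token_hex(4) for name in class_names}
    for original, alias in mapping.items():
        html = html.replace(f'class="{original}"', f'class="{alias}"')
    return html
```

Each response then carries different class names while the visible content stays identical, which breaks naive selector-based scrapers without affecting human visitors.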
There is a large banner next to the highway that shows weather information which, if properly organized (say, into a monthly almanac), people would pay money for. The banner owner does not make money this way - he asks you to go to his website and sign up for an account. But you drive the highway (the internet) every day, look at the banner, write down the weather updates, and then offer them for sale on your website. The owner gets angry and sues you. The court decides you are free to drive by on the highway and free to put your eyeballs on the weather banner, especially given that the banner is available to everyone (LinkedIn profiles are viewable without needing an account), and you are free to use the information you obtained for free, without interference, in the form of a monthly almanac that you sell. At the end of the day, the banner owner does not own the weather information that someone else put there (for example, a meteorologist).
Personally, I think it's a healthy decision. Otherwise it would be similar to discriminating over who should be allowed to enter and browse a street store that is, by law, open to everyone.
This would mostly mean that you cannot start interfering with webscraping you previously allowed merely because you learned that they're making money with the scraped data.
It does, in the US. You're likely making an inconsistent comparison.
Property ownership has nothing to do with visual access.
You cannot legally be barred from casually (involuntarily) perceiving something. It's reasonable to put up physical barriers to reduce what is casually perceived. It's a very good analogy.
It's also not a great analogy for this case because another party is given continued easy access to view my backyard while the first party is denied - and the analogy breaks down here because, as a neighbor, I have no inherent right to view your private life at least as much as any of your other neighbors.
Try building that fence into the stratosphere. A regulatory body will prevent that.
> I have no inherent right to view your private life at least as much as any of your other neighbors.
That's a different analogy, not a violation of the first.
It's not necessary for every part of the analogy to hold, being an analogy.
Or perhaps slightly less contrived:
If I publish a monthly lowlights reel of my favourite sports team as a podcast discussion on where they can improve in all their lost games, and then they suddenly go on a winning streak for >1month so my USP is gone and I have nothing to talk about..?
In this case, it was ruled that the public data is available. It was a good-faith assumption on the part of hiQ that they could collect public data from a public website.
It would not be a good-faith assumption that you could control the paint colors on a property you don't own.
It seems to me that the interference ruling was wholly independent of deciding whether what hiQ was doing is legal.
If the question is could someone win, potentially. The argument would basically have to be that the removal of that open source project is akin to other cases of negligent interference.
If this is a specific concern, consult a lawyer - 'cause I'm not one.
Maybe if leftpad somehow tried to block only some users from using their publicly available plugin.
How far does this go?
How would this affect Cloudflare's "checking your browser" anti-DDoS protection screen, meant to block bot requests from accessing sites?
Do they not already do this? Every link I've ever seen for LinkedIn has redirected me to sign up page rather than showing me the content.
They were pretty much legally void even before this precedent was established. They are only valid when they don't violate any existing U.S. law. Any authority assumed beyond that is completely false.
> can the court force me to revert that change?
The license further goes on to clarify that LinkedIn will vend public data to search engines, but the definition of "search engine" is almost certainly assumed (by LinkedIn, at least) to be up to them.
This precedent doesn't really mean much, but it's definitely a step in the right direction.
But when you scrape it my load goes up dramatically. A load I have to pay for.
It is analogous to the privacy debates going on, with one side saying "hey, don't track everywhere I go and tag me with facial recognition" and the other side saying "hey, you are in public and people can see you." The issue is not complete privacy, but one of scale. And of intent.
I believe society is soon going to have to come to grips with the scale of things and legislate what are acceptable scales of action, as it seems to be becoming a large issue in a growing number of areas.
So no, I won’t pay you for the privilege of you saving money.
Could I suggest:
"Why buy the cheese when you get the milk for free?"
Now, if you want to have a walled garden and charge for entry to some and let others in free then that is fine.
That's a complete misconception. Of course you can manufacture inconsistent ideologies if you combine ideas from different people, but I think you'd have a difficult time finding one person who believes what you just described.
What I want is, put simply, organizational transparency, personal privacy. I believe humans have a right to privacy, but I don't believe organizations have rights, period, and I believe radical transparency within an organization prevents organizations from trampling the rights of individuals.
Organizations in this case include corporations, governments, and nonprofits.
Digging this because I think that domain / paradigm will see unparalleled evolution in the next few decades.
I mean, don't stop at current law / values / behaviors; like people from the 1940s wouldn't have dared speak about their idea of the 1970s because they'd think their belief "impossible". No flying cars though (Clarke-tech), because that's not a decision of the individual.
The short-term, the small, incremental changes I'd like to see are:
1. Reversal of the default privacy setting of government docs. Instead of documents being default-private and citizens having to make FOIA requests to make those documents public, documents should be default-public, and government workers should have to apply through an adversarial system (similar to courts) to classify documents, proving to a court why the document needs to be classified.
2. Classified documents should have a short (1 year max) timeframe after which they are declassified, or government workers should have to reapply to justify why the documents need to remain classified.
3. Political party documents should be public, without any provision for classifying them.
4. Tax-exempt organization documents should be public, without any provision for classifying them.
5. IPO'ed organization documents should be public, without any provision for classifying them.
6. Body cams on all police and military while on duty (when they are acting on behalf of an organization). 1 and 2 would apply to the footage from these cams as well.
7. Exceptions to 1-6 should be made for the personally-identifiable information of people who are not in the organization.
8. Organizations should be required to maintain a list of all the personally-identifiable information they have on a person (including employees), and provide that data to that person on demand by that person or their legal guardian, as well as a list of all people with whom that data has been shared, and be required to delete that information upon request by that person or their legal guardian.
9. Research which receives public funding should be forced to publish its results publicly.
10. All software which receives public funding should be forced to publish its source publicly.
11. Government documents should be published in open formats suitable for computer analysis (e.g. CSV, text, or some XML format - no PDFs).
1, 2, 6, 11 should be no-brainers if people were educated IMHO - but this is 1920 relative to electricity or cars; still a long way to go before the mainstream masses get it (which very much includes political figures). I would think 2030-2040 for the emergence of ethical consensus and concern (the kind that pervades political parties and social classes).
That is assuming the needle doesn't move too much farther in the authoritarian direction until then (the 20-year trend is really not looking that way currently).
3, 4, 5, 9, 10 are/would be met by strong opposition from interest groups, I'm sure you see that too. Everything I know about 3 tells me it's never going to happen with current parties / politicians. It's at least 1 generation away and I'm not sure the concept itself isn't utopia. 9 and 10 as well, I think it largely depends on the cultural paradigm (and this world's in 2020 is really not aligned with that, nor does it trend or even look that way). 4, 5 likewise, complex topics, lots and lots and lots of gatekeepers and lobbyists.
My take on these is they're very costly in terms of political capital; and they are largely debatable (politically, legally, philosophically, etc., you'll find passionate captains on both sides); thus there are 'better' (more consensual, with direct net positive effect) lower hanging fruits imho.
7 and 8 are hard problems, notably because of scale and the need for automation — it's part of a much bigger domain, automation of compliance and building "trustable" systems etc.; the kind that bridge or plane engineers must build, and probably software engineers too, but you know we're far from that if you read this forum.
I'd say 1 2 6 11 and 7 8 on the way to scale/automation already paint a whole different regime and degree of maturity for a 21st century State. I'd like to think we're now ~1 generation away from enactment of such norms.
The human does get rights, the organization doesn't.
In some cases, believing that humans have rights and believing that organizations have rights might lead one to the same action. In those cases, I'd take the action. I wouldn't want to violate a human's rights out of some vindictive dislike of organizations: that's not the point. The point is that I'd take that action because I believe in human rights, not because I believe that the organization has rights.
And with organizational transparency: the entire point of organizational transparency is to protect human rights. In cases where organizational transparency would trample human rights to privacy, I would go with human rights every time. Violating the privacy of humans to achieve organizational transparency would defeat the entire purpose.
If you can tell secrets to who you please and sell them on the internet, they aren't secrets. Somewhere in the middle of what you're saying, the secrets stopped being secrets, but you kept using the word as if it still applied.
Facebook isn't a group of associates trading anecdotes about their friends: the server guy has never met Mark Zuckerberg, and they are not "associates" in any meaningful way. They're not friends, or even really allies: Facebook certainly has shown inconsistent concern for the well-being of its workers. So let's also drop the "associates" terminology: these aren't "associates", they're employers and employees. Employees aren't acting as individual humans on their own behalf, they're acting on behalf of an organization.
Putting aside the rights conversation for a second, let me ask you a question: if you tell your friend a secret in confidence, and they turn around and sell it to anyone on the internet who will pay a low fee, that would be pretty screwed up, no? We don't even have to talk about rights here: this is just screwed up behavior, regardless of the rights conversation.
A group of people sure, but corporations are not people.
At a small scale, it makes sense for people to appoint another person whom they trust to represent them in negotiation. But in a lot of cases, that's not how representatives are chosen. Particularly in corporations, the leadership of a corporation was not chosen by the employees to represent them, and in fact often doesn't even have the best interest of the employees in mind. A lot of the largest problems in our society arise from this fact.
Consider the case of a company that agrees to sell to a larger corporation, under the condition that they lay off half their workforce in advance of the sale. Surely we can agree that the laid-off workers were not fairly represented by the person signing the contract.
One could argue that the workers agreed to give up some of their rights in their employment contract, but I'd argue that they did so under duress: their option is sign the contract and work for the company, or starve and let their families starve. Sure, they can go work for another company, but other companies will require them to similarly sign away their rights.
This shouldn't be taken as a recommendation to blithely break contract law. Corporations don't have rights, but they do have power, and it would be unwise to behave as if they can't make your life miserable if you decide to cross them.
Organisations are (usually) legal persons, too; they just have fewer responsibilities, get fewer rights in exchange.
It's a bad deal for corporations, but I do not care. Lack of liability is the cause of a ton of problems in our society.
Just to pick two stories of corporate sociopathy: Probably the reason people at State Farm are unconcerned about forging signatures is that they know that the worst case scenario is that State Farm loses some business and maybe gets a fine: they are unlikely to go to jail for forgery or to have fines exacted from their personal bank accounts. Similarly, when Practice Fusion literally killed people, their execs had little to fear: nobody went to jail, nobody was fined: shareholders who had no visibility into the decision paid the fines.
When banks tanked the economy with irresponsible lending most were bailed out and gave their workers bonuses, while the people who were unable to pay their mortgages were ignored.
A little more liability for destructive behavior would be great for most people.
Forgery is criminal regardless of whether a private or official document is concerned. Even in a military setting, forgery of business-related documents is illegal.
> A little more liability for destructive behavior would be great for most people.
Why not full rights, full liability? Replace imprisonment and death by temporary and permanent suspension of company (including re-establishment of a sequel organisation out of a subset of stakeholders) respectively; and voila.
It seems like you're just picking out one aspect of each of my posts to disagree with. Do you have any disagreement with my overall point?
In a world where organizations are radically transparent and individuals have radical privacy, before you join the disease support group or donate to the ACLU, you already know the organization's records about that transaction will be public information.
You can restrict your activities to organizations that do not keep personally identifiable information. Or you could join the support group with a pseudonymous identity or donate cryptocurrency to the ACLU.
The important detail is whether the funds were his personal funds, or whether they belonged to his company: i.e. was he acting on his own behalf, or on behalf of an organization?
It's an inane question, though, because the inane point you're trying to make is that individual privacy might protect homophobes. That's true, but it would also protect gays and allies who donated to oppose proposition 8. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason gay marriage is legal today is that people donated to people like Harvey Milk, at a time when donating to the campaign of an openly gay candidate was risking your job and social standing.
Human rights still apply to humans who do bad things. If you are willing to give up human rights to fight bad people trying to do bad things, then those rights won't be there to protect good people trying to do good things, either.
> So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?
Yes, but it would also protect people who donated to a virtuous cause. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason virtuous causes have had any success at all is that people donated to support them, at a time when donating to those virtuous causes was risking your job and social standing.
Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.
I'm attempting to clarify my question by removing irrelevant details, which it looked like you got hung up on last time.
> Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.
So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".
No. Please try to respond to what I actually say instead of making stuff up; this is a straw man argument.
There are plenty of other ways we could find out about organizations supporting odious causes and boycott those organizations, without violating the privacy of their members. In fact, the point of "organizational transparency" is to make it hard to hide when organizations do bad things.
In addition to accusing me of saying things I didn't say, you're ignoring what I actually did say. Are you willing to make it impossible for people to privately donate to virtuous causes as a means of social change, when donating to support those causes publicly would be a risk to their careers and reputations? I'm not going to continue this conversation if you won't respond to this point.
So a group of people, joining together in a common cause, don’t have rights as members of that group?
You are contradicting yourself. Organizations are simply groups of people with a shared cause. To deny rights to the organization, you necessarily have to deny personal rights. Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. “have no rights”, the seller is allowed to sit in the room during their meeting. However, this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals, since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in, since, in this scenario, their organization doesn’t have rights.
Emphasis added. No, they don't. They have rights as individual people.
> [...] this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals [...]
Yes, that. "Sam" and "John" would have cause for civil (and probably criminal) action against the interloper. "John and Sam, Inc" has no say in the matter.
It might be useful to grant John and Sam, Inc the privilege to own property, but even that isn't actually a right, except insofar as Sam and John have a right not to have the value of their assets (i.e. 50% ownership of John and Sam, Inc — functioning as a proxy for ownership of various comics) actively sabotaged/vandalized.
No, they're not. As soon as you have two people in an organization, you've got two different causes. The shared elements of those causes allow them to collaborate, but each of them has slightly different views of what they're working toward. And even when you have a very small, well-specified goal, every individual in the organization has different levels of investment, and other values and boundaries they're not willing to cross to achieve that goal. And the larger organizations get, the wider the variety of disparate goals that can occur within the organization, because individuals in the organization may not even interact directly with one another.
Example: John and Sam work for Facebook. John and Sam want to make money to feed themselves and their families, and don't want to have PTSD. But Mark Zuckerberg wants John and Sam to look at an endless stream of horrific PTSD-inducing images so that he can maintain the reputation of the Facebook platform and get incredibly rich. Where is the shared cause here, exactly?
> To deny rights to the organization, you necessarily have to deny personal rights.
Just because an organization doesn't have rights doesn't mean we have to go out of our way to take away their rights.
> Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. “have no rights”, the seller is allowed to sit in the room during their meeting. However this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in since, in this scenario, their organization doesn’t have rights.
It sounds like you've figured out that John and Sam, as individuals, each have a right to have a private conversation. John and Sam, Inc. doesn't have rights, but that doesn't suddenly remove John and Sam's individual rights.
The entire point of organizational transparency is to protect the rights of individuals, so obviously if organizational transparency would violate the rights of individuals, the individual rights to privacy supersede the need for organizational transparency.
It's telling that your example of an organization has two people in it. At a small scale, organizations tend to protect the rights of the individuals in the organization fairly well. It's at larger scales that the non-rights of an organization come in conflict more often with the rights of individuals.
For years now we've been in an arms race with someone using a botnet to scrape all of the account information for a particular county. My client doesn't care so much about the data; it's the server load that's a problem. Normal activity for this site is a few dozen account searches per minute, but when the botnet gets through our blockade it sends hundreds of search requests per second, overwhelming the site. The operator of the botnet has NEVER tried to contact my client to ask for an efficient API to access the data, which they'd probably provide for a minimal fee.
Data hosting isn't free, even if the data is.
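The "blockade" described above is usually some form of per-client rate limiting. A minimal sketch of a token-bucket limiter — the class, function, and parameter names here are my own illustration, not anything from this thread:

```python
import time

class TokenBucket:
    """Per-client token bucket: allows short bursts but caps the sustained request rate."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def check_request(client_ip, rate=1.0, capacity=5):
    """Return True if this request should be served, False if throttled."""
    bucket = buckets.setdefault(client_ip, TokenBucket(rate, capacity))
    return bucket.allow()
```

In practice this sort of limiting is usually done at the proxy layer (e.g. nginx's limit_req) rather than in application code, and a distributed botnet defeats naive per-IP buckets — which is what makes it an arms race.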
Some people probably switch to using the API, but no-one has ever contacted us. They either give up, or run their scraper on a different computer -- I've seen the same scraper move between university computers, departments, then (in the evening) to a consumer broadband IP.
If pages are driven via an API, then the API is preferable, but publicly facing websites are often a mix of server-side HTML generation and API enrichment, for caching if nothing else.
My client is under no obligation to make access to this data easier. It's not really their data either; the information is property addresses, owner names and addresses, and tax assessments and payments. My client wouldn't want to make it easier for scammers to get that data. So they're not going to do anything unless they know the scraper is legit. If that's the case, the API would require authentication, and any fees would be for the server load, not the data.
Some people just don't care for the commons
The ideal answer and the efficient answer are not usually the same.
I wonder if something like this would be allowed: if all the public information was available in a well-collated format, then can scrapers be blocked? I imagine that will eventually be fought in court as well.
Currently bot traffic accounts for 2/3 of my load, meaning that the cost of providing my service is 3x what it would be without these persistent bots.
People don't want their data to be public. People want other people's data to be public. One's own data everyone thinks should be private and tightly controlled. This applies to people and businesses equally.
LinkedIn, likewise, has built its business model on an implicit contract with its users that it’s going to show their CV to anyone who asks for it.
I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV. A CV (individually, rather than in aggregate) is ultimately useful for only one thing: marketing the CV’s author’s skills. Why wouldn’t I want my marketable skills scraped into some private “talent matchmaking” agency’s databases, such that someone could find me—and hire me—when I show up as a result of some fancy OLAP query they paid that agency to run on their scraped data? It’s more roundabout than them just finding my CV on LinkedIn, but I’m still glad they found it!
LinkedIn is really really clear that:
- They won't share you information with 3rd parties
- You're not allowed to use information on LinkedIn for commercial purposes without their permission
- Other users can view your personal data
So, why would I expect random third party companies to be able to scrape and sell my personal information?
My personal information is there for the individual use of others, and for authorised use by recruiters (who are vetted/managed by LinkedIn).
I've chased down the convention spam mail I get using my GDPR rights, and surprise surprise, they got my details by scraping LinkedIn. That is absolutely not expected nor acceptable use of my data...
Let me put it this way: if I have a Wordpress blog, I’d certainly be miffed if Wordpress let bots see my drafts... but I’d also be miffed if Wordpress didn’t let bots (Google, for one!) see the published blog itself. It’s a blog; a public website! Anyone or anything with the URL is supposed to be able to retrieve the page! It’s not “my information” any more†; it’s been broadcast!
† You might want to mentally analogize to copyright, but I don’t think it’s the right model for the intuition people have here. Instead, try mentally analogizing to confidentiality. When a classified document is published in the public sphere (e.g. as evidence in a trial, as testimony before congress, etc.), this forcibly declassifies it. No matter how much the originator of the document might want to still keep it a secret, the legal protections of confidentiality don’t apply to it any more: it’s out there now. Anyone who reads it could plausibly have just read the public-sphere copy, so there’s no longer any way to charge people who have knowledge of the previously-classified information with any crime.
These are all my information, and selling them to marketing companies is definitely not archiving.
Would you be OK with a company scraping your blog and selling it?
Selling it how? If they put my blog posts in a book and try to sell that book, that’s copyright infringement. If they put my blog posts in an ML model corpus to train a translation service, and they then charge pay-per-use access to the resulting service... I don’t think I’d care, nor do I think there’s anything morally or legally wrong with that. If they scrape my name and phone number and generate a Yellow-Pages-like index from them? That’s explicitly allowed by law; and heck, that’s why I embedded the information onto my site in vCard microformat in the first place!
To put my philosophy succinctly: if web.archive.org can scrape your data without you having an explicit relationship with them granting them that right, then bad.evil.com can too. You can allow both (= publicizing your information), or neither (= protecting your information), but you can’t allow one but not the other. “Third parties you don’t have a relationship with, who access your data through the public sphere without entering into a specific licensing arrangement with you” are legally one big amorphous blob. You can’t make a law that splits that blob up, because it’s an opaque blob; in the ACL system that is contract law, all entities you don’t have contracts with are just one entity—“the public.” If you want some specific entities to have access to your information, that’s what protecting your data (= setting an ACL “the public = disallow”) and then explicitly licensing it out by entering into contracts (= setting an ACL “entity X = allow”) is for.
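The ACL framing in the comment above can be made literal. A toy sketch of the "every entity without a contract is just 'the public'" rule — the entity names below are invented for illustration:

```python
def can_access(acl, entity):
    """Contract law as an ACL: any entity without an explicit
    contract entry falls back to the single 'public' entry."""
    return acl.get(entity, acl["public"])

# Publicized data: 'the public' is allowed, so every scraper is too.
open_acl = {"public": True}

# Protected data: the public is disallowed, one explicit license granted.
gated_acl = {"public": False, "licensed-partner.example": True}
```

The point of the analogy is that without a contract there is no per-entity key to look up: archive.org and a bad actor both resolve to the same "public" entry, so they necessarily get the same answer.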
Examples where this is allowed:
- Images/media (Creative commons)
- Code (Open source licenses)
You say it isn't allowed for:
- Personal data
Unless I'm misunderstanding your philosophy (which seems to say copyright is OK, but public information must be public to all): You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my name, job title and employer as a marketing bundle?
Edit: An aside, it's really confusing that you seem to be editing your previous replies minutes after I responded. I thought HN only let users edit during the "no replies" period?
>- Personal data
So there's a couple of things in play here. You can't (generally) copyright facts - "Cthalupa is a Rocket Surgeon for the Space Force since 2001", if true, would not be something that I could get a copyright on.
The second thing is that terms have to be agreed upon by both parties. If you give me information without us coming to an agreement on terms, I can't be bound by them. If you just put a link to a TOS on your website and don't require people agree to it before giving them access to data on your website, we did not enter into a contractual agreement.
Neither Creative Commons nor copyleft (nor copyright in general!) can assert anything about private use. IP rights are commercial rights; they affect sellers of your IP. They don’t affect end-consumers of your IP.
Note that even the GPL can’t force someone to publish the source of their GPLed-library-containing program, if they never publish the program itself, but only build it for their own private use.
Why? Because, by broadcasting the code of your GPLed library, you granted people an implicit use-right to it! Not a redistribution right; not a derivative-works right; but a use right. (If this wasn’t true, then people would be breaking the law by reading “common” newspapers in a cafe, or by listening to the radio, since they never entered into any explicit contract with the distributor/broadcaster.)
How does software licensing work, then? Mostly by 1. companies installing software on computers for their employees to use being considered IP redistributors; and 2. attachment of copyright through sampling when asset samples [e.g. brushes/textures in Photoshop] are distributed through the program. Other than that, there’s really no law forcing end-users to pay for software licenses. This is why e.g. WinRAR would never have been able to sue anybody. They published their shareware binary (without gating it behind a contractual relationship, like Adobe’s Creative Cloud installer); so now you have a use-right to it!
> You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my phone number?
Copyright exists because your ability to make money from your own creative works hinges on your ability to exclusively license those works. If a publisher can get a redistribution license to your manuscript for free from a third party, why would they buy it from you?
You having exclusive access to your phone number does not make you money; others having access to your phone number does not deprive you of money you could have made by keeping that information private. Thus, there’s no advantage to introducing IP law into this domain (the domain of facts.)
There was a recent court case about someone creating a subway map by copying the raw data from existing subway maps, where the comments went deeper into this.
However, I'm really not sure how this is relevant to your moral stance on commercial use of "public" personal information.
Why do you believe it's reasonable to prevent unauthorized commercial exploitation of creative works, but not to prevent unauthorized commercial exploitation of personal information.
The former simply affects the small percentage of people who sell their works.
The latter affects the vast majority of the population who receive targeted spam, have their information collated and sold for profiling, are victims of identity fraud when those databases are inevitably leaked, etc.
For what it's worth, as I mentioned in my first comment, the GDPR absolutely gives me rights to control how my personal information is used. And the GDPR has a near total exemption for individual use.
What benefits do you see of commercial use against the wishes of the person that published it that outweigh the risks? (making money isn't a benefit)
My example of copyleft was specifically about the thing the Affero GPL tries to avoid (to unknown success): the possibility of someone using GPLed libraries to set up a commercial web service. Because they never release the binary, but only have people interact with it over the Internet, there’s no derivative work being made available in the commercial domain. So copyright doesn’t apply. Even though you’re a company making money off GPLed libraries!
Actually even if they don’t sell it, but give it away, that’s still copyright infringement.
People put data places for specific purposes (to show recruiters) and want the ability to limit use to that purpose. How that's accomplished is just a technicality most people don't care about.
I would point out that this is still possible (even probable!) without any bots being involved at all. Back before CVs were online, humans working for recruitment agencies would “scrape” information from local, physical job boards by hand into their company’s databases (where “database” here could just mean a filing cabinet.)
IMHO, the real solution to that is a spam filter (or an “agent”, in the old world.) Just because a lot of people want to talk to you, and most of them aren’t very interesting, doesn’t mean they need to be prevented from accessing you—they just need to be prioritized by interesting-ness, which is something you can do yourself, or hire a service to do for you.
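The prioritize-don't-block idea can be sketched as a toy scoring filter. The features and weights here are entirely made up for illustration:

```python
def interestingness(message, keywords=("job", "offer", "open source")):
    """Toy score: bonus for known senders plus keyword hits. All weights are invented."""
    score = 0
    if message.get("known_sender"):
        score += 10
    body = message.get("body", "").lower()
    score += sum(2 for kw in keywords if kw in body)
    return score

def prioritize(inbox):
    """Sort messages most-interesting-first rather than blocking any of them."""
    return sorted(inbox, key=interestingness, reverse=True)
```

Nothing is ever rejected; uninteresting contacts just sink to the bottom, which is the distinction the comment draws between filtering and access control.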
From a copyright perspective (since that's what LinkedIn's lawyers claimed): imagine if a newspaper sued another newspaper, claiming that not just the content of its paper but the information in it was copyrighted and could not be accessed by "unauthorized" third-party companies. Either you print it, or you don't!
How is this a paradox? Tech — the web in particular — is supposed to be an equalizing force, but HiQ is clearly trying to give my employer more power over me. We are an industry that prides itself on solving difficult problems — how is our response here to just throw up our hands and say “it’s all or nothing”?
"There is more information about your employees outside the walls of your organization than inside it. hiQ curates and leverages this public data to drive employee-positive actions.
Our machine learning-based SaaS platform provides flight risks and skill footprints of enterprise organizations, allowing HR teams to make better, more reliable people decisions."
It's a really easy solution, because companies need to prove how they got your data when asked.
When you track the source of the mailing list you're getting spam from and they say "We scraped it from LinkedIn", they get fined.
It is a result of equality: not equality of outcome, but equality of rules. Open to everyone except those whose applications you don't like isn't open. On a technical level, trying to prevent it is like the "evil bit" as a solution to malware.
I'm also not necessarily talking about a technical defense against unwanted scraping. Write a law that makes it illegal to do something like "scraping personally identifiable information and storing or presenting it non-anonymized", and prosecute companies that break it. I'm sure there are loopholes in that particular example, but the point is we can absolutely add shades of gray here.
> Open for everyone but those whose applications you don't like isn't open.
Openness should be a means, not an end. If we make something "not open" but it prevents 95% of undesirable uses and only 5% of desirable ones, is that not a tradeoff worth discussing?
Don't take the Google PR so seriously, the tech industry wants to make money like every other industry.
Copyright law protects the original work of the composer and artists in your example.
User profile data on Linkedin is not Linkedin original work.
Additionally user profiles are mostly made up of facts, which are not copyrightable.
Any person can post on Hacker News. However, if someone made a bot to post to Hacker News, I think most of us would be pretty upset.
- I want my data to be publicly available
- I don't want my data to be processed/distributed/sold without my permission
E.g. individual use is fine, profit making is not.
Which is my expectation with LinkedIn. I want people to see my profile, I don't want them to sell it as marketing leads!
If scraping on LinkedIn is banned (and LinkedIn is enforcing it), then I do have control of my data, since I can change the setting and it will no longer be public. (It's not perfect, since some might already have scraped it, but the extent would be much smaller.) Also, if I decide to delete my data, LinkedIn can do that for data it controls, but not for scraped data.
I'm on Hackernews to see interesting articles and read interesting conversation. If a bot can post interesting articles and make interesting conversation, I'm not sure I care that it's a bot. And if a human can't do those things, I'm not sure I care whether or not they're 'real'.
That focus on "we don't care if you self-promote, we don't care why you're here, we just want you to be a good citizen" is part of why I like HN.
It's not completely black and white, but in general I believe that users online have the Right to Delegate. That right should only be legally taken away if there's a really, unbelievably compelling social justification for doing so. I am pretty skeptical that banning web scraping has that kind of justification.
The best you can do is some sort of heuristic, like captchas, and even those can be outsourced so that the human solving them isn't actually viewing the content.
The thing about a bot that bothers people is the behavior anyway. A human acting like a bot would get people just as upset.
There are also companies that provide added value by compiling and correlating "public info" in a useful way that creates value. If Google let me scrape their search and remove ads, it would be popular, but is it "legal"? Or maybe Google Maps?
It's basically the same as a TV broadcasting a film for free, and then going after you legally if you recorded that film and uploaded it to your website.
Copyright law is unchanged. If someone scrapes your blog and then re-uses your posts on their own blog, you still have a possible copyright infringement claim.
(Some people will then say "But why not just offer APIs", but that's a lot of extra work and maintenance).
It's like with instagram and other social media platforms. The content creators put in the hard work, while the leeches are stealing content for their own benefit, giving zero credits to the original content creators.
paradox: a statement or proposition that seems self-contradictory or absurd but in reality expresses a possible truth.
By "people" do you mean businesses or actual people? Because I don't think people want everything to be public; many in fact use various networks to avoid oversharing, and even then many people don't want their old bosses or exes looking at their profiles. There just don't exist tools to limit access that granularly.
Companies want to provide some information to some people; but providing all information to all people is analogous to allowing customers to make a meal of free food samples, on a recurring basis.
It already is. There are entire companies, like Distil Networks, who exist solely to protect companies from bots/scrapers/etc. Actually, looks like Distil got acquired and are now part of Imperva, but anyway, the idea is the same. This is definitely an existing field.
Disclosure: former Distil employee, but I have no financial stake in this discussion, and have mixed feelings about scraping. Clearly it can be beneficial in some situations, but when I think about having to pay exorbitant prices to scalpers for tickets to an event, because they used a bot to buy up all the tickets, that is less appealing.
Can you really not imagine a world where a person accidentally or in poor judgment uploads something private to their own site (their real name, home address, credit card#, or any piece of highly damaging information that could cost them their careers) and wishes for it to be removed? (but can't because many of these scraping sites never respond to takedown requests)
People make mistakes and post things they shouldn't. A mistake someone made many years in the past, and has made amends for, shouldn't haunt them for the rest of their life.
But it does when we decide that every single line of text ever uttered online must be preserved and easily accessible by anyone for all eternity.
Blocking scrapers is an arms-race escalation because these sites refuse to remove content, and it's used as a tool for character assassination by bad actors. It's a proactive defense.
Otherwise they would realize that what they demand is contradictory and incoherent, like demanding to be both viewable by all and not viewable. DRM is one fundamental example of this.
> "At this preliminary injunction stage, we do not resolve the companies’ legal dispute definitively, nor do we address all the claims and defenses they have pleaded in the district court. Instead, we focus on whether hiQ has raised serious questions on the merits of the factual and legal issues presented to us, as well as on the other requisites for [...]"
E.g. the telephone book contains lots of boring facts that are each in themselves not copyrightable. However the collection in itself is copyrightable, as it required substantial effort to create.
It seems odd that US copyright law wouldn't have a similar provision, or that it doesn't apply here?
Such "thin" copyright tends to mean that there is a strong presumption against infringement. You generally need to demonstrate that the work has been copied virtually in its entirety to find infringement; partial borrowing is insufficient.
But I don't think that's relevant here, as a) this isn't a copyright case and b) HiQ are not attempting to recreate the entire "compilation" of LinkedIn
Does this mean we can now scrape e.g. YouTube videos, Amazon reviews, IMDB reviews, Facebook events ... ?
> hiQ argued that LinkedIn’s technical measures to block web scraping interfere with hiQ’s contracts with its own customers who rely on this data. In legal jargon, this is called “malicious interference with a contract”, which is prohibited by American law
Does this mean that Google's random recaptcha check is interference?
So prices on Amazon.com are facts. User reviews are creative so probably copyrighted.
Similarly the videos on YouTube are copyrighted. However the number of views and the number of likes are probably scrapable.
Let's draw some parallels to real life. If I go to a public space like a town square, can't I take pictures, notes, and recordings, then go home and draw my analytics from them? What if I read something in a book I bought, can't I quote it?
Same thing should be with web resources even if they are creative - as long as I don't publish them I should be able to scrape whatever public resources I want and use them in my analytics, machine learning or whatever.
Downloading publicly available data should (by definition of public) not be a violation of someone's rights. However it's easy to see why it wouldn't be desirable for someone to republish creative works as their own, so it's reasonable to give the author control over how their work should be published.
And in the case of price data or similar you would be hard pressed to deem anyone the 'author' of it, hence it would be weird to enforce the author's rights.
Copyright does make _copying_ tortious. Broad personal-use exceptions in the USA, for example, make this appear not to be true, but it is the act of copying - even without publication - that is restricted in general.
Ripping a CD in the UK, for example, is copyright infringement, as there is no general personal-use exception (there are exceptions under Fair Dealing, but whatever you're doing almost certainly doesn't fall into them).
See e.g. UK CDPA 1988, Chapter II, section 16(1)(a); or 17 USC, Chapter 1, § 106(1).
Not a lawyer, but:
You can do all of that, but:
You cannot scan the book you bought and put it on your website, for sale or even for free, unless its copyright has expired or you are given permission by the copyright holder.
You cannot take a picture of someone's painting in high detail and then sell prints of it, unless its copyright has expired or you are given permission by the copyright holder.
Think of law around data as using dependent types. The legal protections depend on the type of the data, and the type depends on the content (among other things). You have to determine the type BEFORE you can tell what the law says about it, since the law only cares about the type. You could probably encode the law nicely with something like Idris, but any "code as law" style of governance system without dependent types won't be able to express existing law.
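Python has no dependent types, but the classify-first-then-apply-law shape can at least be caricatured. Everything below is invented for illustration; these are not real legal categories or real rules:

```python
from dataclasses import dataclass

@dataclass
class Datum:
    content: str
    creative: bool   # creative expression vs. bare fact
    personal: bool   # identifies a natural person

def classify(d):
    """Determine the 'type' of the data before any rule can apply."""
    if d.creative:
        return "copyrighted_work"
    if d.personal:
        return "personal_data"
    return "bare_fact"

# The legal consequence depends entirely on the classification.
RULES = {
    "copyrighted_work": "redistribution requires a license",
    "personal_data": "processing requires a lawful basis",
    "bare_fact": "no copyright; generally free to reuse",
}

def rule_for(d):
    return RULES[classify(d)]
```

A real dependently-typed encoding (in Idris, as the comment suggests) could make the rule lookup total over proofs about the content itself, rather than over a runtime tag as sketched here.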
No. At the risk of just repeating the comment you didn't understand: creative works are not "just data". They are copyrightable works, and the owner controls who can use them, not just for profit but for any reason, with few exceptions.
You don't just get to drop someone else's work product into your algorithm without their permission.
I can read a book to post my impression on it somewhere right? I can read it and say "it was beautiful" on twitter.
I can then automate my "taste meter" through machine learning, it reads a given book character by character, and spits out what I'd think of it if I actually read it. Then posts it on twitter, says "it was beautiful".
Did I break copyright law? I don't think so.
I wonder if the number of stars is copyrighted. It's not creative, but a fact.
See the South Park WWITB issue.
I believe South Park used a video clip from YouTube, and YouTube's ContentID system removed the video South Park had used, because YouTube considered it a violation of South Park's copyright.