IMHO, LinkedIn doesn't have a right to stop scraping after the fact, but they have the right to take technical steps to stop scrapers from accessing their site.
That being said, I hate LinkedIn as a company and I fully support anyone trying to mess with them. They are not a social network; they are a sleazy website that convinces people to willingly provide personal information, which it then turns around and sells at ridiculously high prices. Even if you are legitimately using LinkedIn as an end-user, it's easy to get blocked for using it too much and to be forced to pay just to interact with people on the site.
As far as I can tell, no one has made that argument, so I'm not sure why you feel the need to rebut it.
I think it all pretty much boils down to this quote from the article:
> LinkedIn's position disturbs Orin Kerr, a legal scholar at George Washington University. "You can't publish to the world and then say 'no, you can't look at it,'" Kerr told Ars.
The title is "It’s illegal to scrape our website without permission". So that argument is implied in the headline, at least.
As for Orin Kerr, I'm sure he'd agree that there are private parts of the internet (my payment information being an obvious example). Just because something is deployed to the internet doesn't mean it is "published to the world" as he claims.
He's not just bloviating; before you disagree with him, reviewing his arguments is worthwhile. (I disagree with him in part; in part I agree with his reasoning but don't like the outcomes. In any case, he's a pretty accomplished lawyer and I'm not any kind of lawyer, so there's that.)
If a company paid a million people to call into an information service hotline and each requested one fact from it—and then the company recorded and compiled the answers into its own database to start its own information service—would that be illegal?
I sure hope not. From there it doesn't seem too far off to expect lawsuits along the lines of "He was really only ever showing up for work to learn some skills, and now he's using those skills to run his own business!" If you compile difficult-to-find but freely available information into a more digestible format, I see that as virtually always a net positive.
Everyone that scrapes LinkedIn (or anywhere else) either knows that they are doing it against LinkedIn's wishes or doesn't care.
I think you just defined a social network, unfortunately.
So it's fine to mess with them, even illegally, just because you don't like them?
* convinces people to willingly provide personal information
Convincing is not forcing, and in fact, you say "willingly" yourself. Any business convinces you to willingly give them money or other value.
* sell at ridiculously high prices
With respect to what? Prices are determined by the market. We are not talking about a life-saving medicine or health care, where there could be some debate.
We are talking about a company selling information, which has value and which it acquires through an infrastructure that takes a lot of money to run. Value that customers are willing to pay for.
* Even if you are legitimately using LinkedIn as an end-user, it’s easy to get blocked
For what definition of legitimately? Yours? Since it's their business, they can define what is a legitimate free use and what can be a paid one.
* They are not a social network
So based on what I pointed out, they are not a social network only because they are not free? Do all social networks have to be free and hand your data to advertisers in order to be social networks?
The previous commenter never mentioned messing with them illegally, and disagrees with the analysis that this scraping is illegal. He said he hopes the courts do not rule that it is. Pretty disingenuous to start a reply like that...
Many of your other points are not really giving the previous commenter any charity at all.
>For what definition of legitimately? Yours? Since it's their business, they can define what is a legitimate free use and what can be a paid one.
He was literally trying to provide an example of where their scraping protections can appear overzealous to casual users, not arguing whether those users are legitimate or not.
What is the premise of your argument? To me, it seems you are simply trying to defend LinkedIn's business practices and legal pursuits, rather than discussing anything about the legality of scraping or the specifics of LinkedIn's anti-scraping implementations.
If it's not illegal, then scummy behaviour against a scummy company doesn't exactly set my moral compass off. You reap what you sow.
Very good. That way the whole world will be blind and toothless. --Tevye, Fiddler on the Roof
If you allow yourself to act as badly as others act, you don't have much of a moral compass.
The world can only improve when we hold ourselves to a higher standard than we see, as it's too easy to rationalize our own behavior, and harshly judge others' actions.
The defense against a wolf is (generally) not eating the wolf.
The defense against invasion of privacy is not more privacy invasion.
The issue I see with this is twofold.
First, any issue that comes down to "moral compass" is inherently dangerous. We can find many examples of the simple fact that what is Good to one person is Evil to another. In this case, I think LinkedIn shareholders would not appreciate calls to mess with the site, nor would people who are having trouble job-hunting appreciate the site going down repeatedly due to DDoS or whatever.
The second is that these kinds of calls to action (linkedin sucks, fuck with it) smell like vigilantism to me, and while Batman is my favorite hero (I really only lift because I kinda sorta wanna be batman), vigilantism doesn't contribute to a stable society. Rule of law works better than the chaos of multiple agents enforcing their own moral code as law.
EDIT: I'm happy to be downvoted if I'm saying something stupid, but while doing so I would very much appreciate a quick comment as to why I'm wrong so I can improve my knowledge.
Most people seem to think this law (if it is held up in court) is wrong and should be changed.
You might say: oh well, we have a democratic right to change or influence our laws. But a Princeton study found no correlation between the policy preferences of the majority of the population and enacted policy: https://scholar.princeton.edu/sites/default/files/mgilens/fi...
From memory, the only thing that reliably leads populations to revolt against their government is sufficiently high food prices. Outside of that, revolts almost never happen.
What I'm trying to say is that A) unjust, monopolistic, or excessive laws are probably more the norm than the exception, because B) the idea that democracy gives people some say in lawmaking might be a myth, and C) most people don't do anything about it because they only act when the very basics of their livelihood are threatened.
Your view, that some unjust or excessive laws are preferable to total chaos, seems to carry the assumption that laws are naturally benign and/or made to serve some purpose for society, and that we therefore should not challenge them without good reason. If the opposite is true, and most laws or a high enough number of them are not just, then the fact that the vast majority of people disagree with them is only natural.
This is a fairly long-winded way of saying that most people would say you're being downvoted because there are plenty of terrible laws that we should not acquiesce to silently.
What a shitty thing to do.
This probably happened a lot more a few years ago. Is 2FA perhaps making this harder these days?
I wonder if Google, Yahoo, MS etc have done anything like watch for requests from LinkedIn with correct credentials, block them, and reset the user's password and give them a warning that they just gave their account password to a third party and this is a Very Bad Idea.
Literally yesterday I got, for the first time for that user, one of the spams sent "on behalf of" a user who clearly hadn't given out my e-mail address, so I guess they still are up to these no-good deeds.
I think that they used to have some FAQ entry explaining why worrying about this is silly and nothing bad could happen, but I can't find it any more (probably because it's nonsense). However, just because they should be shamed for this whenever possible, here's a Slate article on their overall security: http://www.slate.com/articles/technology/safety_net/2015/02/... .
Agree with you that giving your account password to any third party is madness, even more so actually soliciting it.
Every service has "terms of service", which set the conditions under which you are allowed to access the service and what you may do with it once granted access. For example, if you start pouring toxic waste into your sewer, you will find that the city will both disconnect you from the service and fine you for violating the terms of service you nominally agreed to when being hooked up.
In LinkedIn's case, they allow you to access their service, with HTTP, to render a page in a browser for viewing of that page. Full stop. Any other use of the data you acquire over HTTP, or any other method of acquiring said data over HTTP is disallowed by the terms of service.
Not only does LinkedIn have a legal right to stop scraping after the fact, they have literally centuries of common law in support of that position.
As for their ability to control what you do with the information: there might be a limited license on the data granted from users to LinkedIn that is not transferrable, so maybe you couldn't build a service that redistributed that information, but I don't see why obtaining and holding it would be illegal.
As for the analogies to power and telephone and such, those are built on property owned by a local government and there are usually other extra laws related to them: it isn't due to some common law position that you can't mess with their stuff. Here, I am not a lawyer, but I am a government official with a particular interest in sewage; here is a link to the sewer use ordinances form our local sanitation district: pay particular attention to 2.03.
Every single one of them concluded that based on how the law was written and how the web worked, there is no legal way to scrape a web site without its explicit permission to do so.
That won't stop people from trying, of course, and the ways people tried to sneak around scraping blocks were a source of constant entertainment for the ops team at Blekko (it can get very creative). But it isn't legal; you can and will get banned from all access for it, and if you use the results in another product or offering you will be found liable for damages.
Google scrapes several of my sites and I've never given Google explicit permission to do so.
The implicit contract is that you let them scrape because you want to show up in their search results, which will send you traffic. If you don't care about Google traffic then add a "Disallow: /" rule to your robots.txt and get back the bandwidth you were giving them.
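To make the robots.txt mechanics concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The rules and the example.com URL are illustrative, not any site's actual file:

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt that denies all crawlers everywhere
# (illustrative rules, not a real site's file).
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved crawler asks before fetching anything.
print(parser.can_fetch("MyCrawler/1.0", "https://example.com/in/some-profile"))  # False
```

Note that nothing enforces this check; robots.txt only works because crawlers voluntarily consult it, which is exactly what the legal argument in this thread turns on.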
Only for definitions of explicit I must be unfamiliar with.
If the presence of a robots.txt makes one's intent for a given resource explicit one way or the other, the lack of one (and the lack of some communication in some other channel) must mean there is no explicit permission.
> it is very clear to me that a search engine is operating on the legal equivalent of thin ice,
For a search engine it is super clear: robots.txt is all. If a site says yes explicitly, great. If it says no explicitly, that has to be honored. If it says nothing, then it's up to the search engine to decide which way to interpret it, but if the site owner complains because you picked wrong, you have to honor their wishes (which may include destroying any cached data as well).
PadMapper, Perfect10, and the newspapers generated a ton of cases based on 'scraping a web site and using the data.' There are also about a dozen comparative shopping sites that have been dinged for the exact same issues. (look vs Amazon or vs Walmart).
Whether the CFAA, the DMCA, tort law, contract law, or something else applies is constantly being discussed :-). I'm just the messenger here. I haven't found a single case holding that the point of view of the scraper of someone else's web site should prevail. The argument that it should be allowed 'to help new businesses get off the ground' is like saying Apple should pay out some of its cash hoard as grants to startups trying to break into some business. I have yet to read anything sympathetic to that point of view.
- Reminds me of Craigslist vs PadMapper. In that scenario I side with CL—it was right to block PM. PM and others should not be allowed to build a new UI on top of CL, because CL was the one that put in years of effort nurturing its listings and its network, building brand equity, and taking on the associated risks and costs.
- As others have highlighted, the data is publicly accessible and there is no agreement the scraper/crawler is bound by. The agreement is between the LinkedIn user and LinkedIn. The scraper is connected to the Internet, crawling it freely as it wants. It's not reproducing the data anywhere, so copyright should not be an issue.
- What if a scraper didn't scrape LinkedIn but just the Google or Archive.org cached versions and read those instead? It would not be pressuring LinkedIn server resources in this case.
- What if all of my employees allow me to scrape their LinkedIn data? Can I scrape all of their info? Can LinkedIn stop me from doing that (In the case of Facebook vs Power Ventures, the answer is that LinkedIn would be able to prevent this behaviour).
- Who owns the data? Medium.com doesn't own the posts. LinkedIn doesn't own the CVs.
I'm just saying the legal system doesn't see it that way, they have said so in many cases, and so far everyone who has used your argument or variations of it in court has failed to prevail.
With HTTP and LinkedIn, there is no such step. There's no pre-connection agreement. LinkedIn could present such an agreement on first connection, but they do not.
LinkedIn does two things to protect itself. First, it disallows access in its robots.txt file; while not a binding agreement per se, that is the default mechanism accepted by the community for identifying a priori whether automated access is permitted. Second, when it detects an access pattern that violates its terms of service, it actively blocks the access and proactively notifies the source of the violation.
The sad truth is that web scraping has been around since the very beginnings of the Web back in 1993, and this question has been litigated in every way you might choose to argue it; the body of case law is enough to fill at least two volumes in the reference section of the library.
There is no legal or ethical basis for scraping the web without permission. And if it isn't explicitly allowed by a site the presumption is that it is disallowed (no 'open door' exception).
If you're talking about making anonymous requests to their service, they only allow a few of those before they stop showing you profiles. If you circumvent that protection, it's a bit more like hooking a cable up to a power line (illegal) or dumping your commercial waste in the sewer (illegal).
But if you say 'I am NOT a bot'—by spoofing a browser's user-agent string when you are in fact a bot—then you are requesting access under a pretense in order to circumvent their terms of service. It kinda feels morally wrong, and illegal.
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Snackmaster Pro/666.0.666
What do you do?
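For concreteness: the User-Agent is just a self-reported request header, so a client can claim to be anything at all. A minimal stdlib sketch (the URL is a placeholder and is never fetched; the UA string is the joke one from above):

```python
import urllib.request

# The User-Agent is just a header the client chooses; nothing verifies it.
spoofed_ua = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Snackmaster Pro/666.0.666")

req = urllib.request.Request(
    "https://example.com/page",  # placeholder URL, never actually requested here
    headers={"User-Agent": spoofed_ua},
)

# The server would see whatever string we put there.
print(req.get_header("User-agent"))
```

That the header is trivially forgeable is the crux of the dispute: any block keyed on the user agent is a statement of intent, not a technical barrier.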
I also tell my browser to lie about what it is sometimes, due to sites that are malfunctioning, but whose owners choose to document the errors instead of fixing them with "Use Chrome" (or IE, or whatever) checks.
Is that 'kinda' illegal or morally wrong (two very different things)?
If by client you mean a robot, then you are pretending to be a browser and you are accessing the service without permission.
Let me ask you a question: say your client was hitting my service with that user agent 100 times a second, crawling through URLs sequentially. Let's say I added it to my robots.txt deny list and started blocking that user agent. Would you change the user agent and continue?
If someone creates a site that says 'Access to this site is for 640x480 browsers only; any other use is forbidden', then I think it's pretty clear that it's a stupid site, but also that faking your screen resolution is accessing the site without consent. There is no slope: someone (LinkedIn) putting explicit terms on their website is pretty clear.
What if I send a null UA? Or use it as an opportunity to share my favorite quote?
What if the behavior of my software doesn't attack like a robot, does keep the request volume reasonable (use whatever you think is reasonable here) but also doesn't do what you might expect a human clicking around to do?
The point is, the scraper would have to hide their intentions and identity, which removes any claim that they are being 'honest' in their intentions and not trying to circumvent the service provider's efforts to prevent scraping.
# Notice: The use of robots or other automated means to access LinkedIn without
# the express permission of LinkedIn is strictly prohibited.
"You agree that you will not ... develop, support or use software, devices, scripts, robots, or any other means or processes (including crawlers, browser plugins and add-ons, or any other technology or manual work) to scrape the Services."
That said, this should be a breach of contract issue. It's an overreach to invoke federal fraud law.
Most news sites publish to the world. But scraping a news site's content and monetizing it yourself is not OK. Legally, it violates intellectual property law. But laws aside, I assume most people would agree that if someone spent the time researching and writing an article, they, and nobody else, should have the right to monetize it.
In this case, IP law may not apply, but the concept is the same. I don't love LinkedIn myself. But they spent the time building a platform for collecting that info. I don't see why it should be OK for other people to scrape and monetize it.
E.g., if it returns an image, that doesn't imply I can use the image anywhere I want.
Without going into too much detail, a lot of hedge funds have teams constantly searching for kernels of data that can contribute some kind of signal for market movements. This data can come in the form of satellite imagery for oil tankers or manufacturing centers, but it can also come from the very creative use of scraped and aggregated data. It's typically very difficult to identify, collect and analyze on a technical level (as 'chollida1 has lamented in the past: normalization, labeling/bucketing and analysis of disparate data across different formats, sources and processing timeframes is a pernicious problem at this scale). From a compliance standpoint there are also generally strict requirements governing legality of use.
Depending on the specific data, you might be capable of predicting earnings or broader market movements with a <5% margin of error each quarter for years at a time (I've personally seen and worked on projects with <1%, but that's the exception, not the norm). That tactic is usually found at discretionary funds; at quantitative funds the uses are much more abstract and cross-pollinated so as not to target single equities, but rather holistic trends. Regardless, every fund is using data in some way these days; it's just a matter of how sophisticated, creative, and abstract they get in their analysis of it.
hiQ Labs doesn't collect data for this specific purpose, but it is absolutely related. In the past I have stayed away from crawling LinkedIn and Yelp precisely because they are very litigious (regardless of the eventual outcome and legality). Now that there's another relatively high profile case out in the open like this, I'm interested in seeing how it proceeds and what the ramifications will be for companies that collect data across a wide range of uses. As Grimmelman mentioned in the article, this can impact a lot of types of businesses, not just those in the same space as hiQ. Outside of finance I am familiar with many tech companies which (openly or otherwise), kickstarted what are now widely known enterprises through cleverly crawling or scraping massive amounts of data.
Given that this is predominantly a web development community, it always surprises me how little creativity there is in the articles on investing. Neural networks and machine learning sound cool, but the reality is that almost none of the readership would be able to make any money off them.
Simply tracking how many sales or users exist in databases by watching sequential IDs should be the go-to method for any web developer trying to get an edge. I would have expected HN to have articles where people get creative with that, e.g. using measures of entropy on usernames to get rough subscriber numbers.
Even plain scraping of prices is often full of great insight that gets ignored. If a grocery store drops its prices in profitable categories against its competitors, that could be a signal of an incoming price war for an entire sector. There's a lot more information in that than in social media feeds and all the other sorts of sexy data that get coverage in the media.
Seems like a great application of the German tank problem  that was mentioned on HN the other day.
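The sequential-ID trick maps directly onto the classic frequentist German tank estimator: given the largest ID observed and the number of samples, estimate the highest ID assigned so far. A minimal sketch, with made-up IDs purely for illustration:

```python
def estimate_total(ids):
    """Frequentist German-tank estimate of the highest assigned ID:
    N ~ m + m/k - 1, where m is the largest ID seen and k the sample size."""
    m, k = max(ids), len(ids)
    return m + m / k - 1

# Suppose a few signups over time exposed these sequential user IDs
# (made-up numbers for illustration):
sampled_ids = [10_218, 11_520, 14_870, 16_004]
print(round(estimate_total(sampled_ids)))  # 20004
```

Sampling the same endpoint at two different dates and differencing the two estimates gives a growth rate, which is usually the number an investor actually wants.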
If you look at the possible dollar returns the equity market is going to make from a stock, and at research spend as a percentage of that, you'll quickly see it doesn't pay for much.
You can tell simply by looking at the broker research - that's probably the extent that analysts take things.
The big stocks obviously have a lot of it happening. (eBay listings, airline pricing etc is obviously touted a lot)
But once you go down to the mid caps, you enter a void where there isn't much heavy data-focused research done, and it's very possible to have a better gauge of the business than any other investor on the planet once you pull this data out.
Google comes to my mind. I'm only partly joking, as it seems to me that the line between search engine web crawling and other forms of web scraping is very thin.
Are there "societies of scrapers"?
Inside, are certain sites more worthwhile - and which ones (eg reddit, eBay, trade union websites, whatever)
How about scraper brokers? Do they exist?
Are there scammy scrapers? Make up BS and sell as scraped data?
How big is this?
Not that I'm aware of, no.
> Inside, are certain sites more worthwhile - and which ones (eg reddit, eBay, trade union websites, whatever)
Yes, absolutely. For many purposes websites that sell their own data are less useful (less signal exclusivity). Specific sources of data will be much more valuable depending on what the data is about.
> How about scraper brokers? Do they exist?
Yes. You're not getting access without an NDA in addition to paying quite a lot.
> Are there scammy scrapers? Make up BS and sell as scraped data?
That depends on how easy it is to verify the data. For most of what you'd term "alternative data" you'll know if it's real in 2 - 12 weeks, and it's not sustainable to sell crap.
But a lot of parties scrape dodgy financial time-series data (ticks and quotes on equities or options) and sell it, priced as though it were tick data when it's barely accurate OHLC. They mostly sell this sort of data to amateurs who don't realize tick data is expensive for a reason.
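The gap between tick data and OHLC can be shown in a few lines: a bar keeps only four numbers out of an arbitrarily long stream of trades, so OHLC resold as "tick data" has discarded almost everything. A sketch with made-up prices:

```python
def to_ohlc(ticks):
    """Collapse a time-ordered sequence of trade prices into a single
    open/high/low/close bar -- the only four numbers an OHLC feed keeps."""
    return {
        "open": ticks[0],
        "high": max(ticks),
        "low": min(ticks),
        "close": ticks[-1],
    }

# A handful of intraday ticks (made-up prices) collapse to four values;
# the order and timing of everything in between is lost.
ticks = [101.2, 101.5, 100.9, 102.3, 101.8, 101.1]
print(to_ohlc(ticks))  # {'open': 101.2, 'high': 102.3, 'low': 100.9, 'close': 101.1}
```

Going the other direction, from bars back to ticks, is impossible, which is why genuine tick feeds command the prices they do.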
> How big is this?
Very big. Most hedge funds ingest a lot of data whether they curate it internally or source it from elsewhere.
If hedge funds floated drones above oil tankers, I'd guess they'd be accused of corporate espionage / spying / invasion of privacy?
Ok, so oil tankers are big and "in the clear". What if $TANKERCORP floats big parachute balloons above its tankers to imply "looking past these is unauthorized viewing"?
Then if a HedgeFund gets a clever angle on a satellite photo.. is that the equivalent of breaking a lock, or violating CFAA?
There are two other notes in response to your question:
1. Drones are different from satellites, and are more susceptible to regulation in the way you're positing because they can be prevented from flying above specific areas. However, most of the same problems with countering them apply, because drones can record better three dimensional footage. In your specific example, if a tanker disguised itself overhead, it would still be legal to have a drone monitor the tanker from the sides, as long as doing so didn't break any law set by the FAA.
2. Drones are actively used these days for things like monitoring production facilities ("how many cars come out of this factory" for an oversimplified version). If they have to monitor from a distance, so be it, they'll do it. The effective countermeasure here is to have a huge amount of land that can't provide any intelligence, because the drones aren't allowed to fly over it and can't see far enough in to the facility.
There's definitely a productive ethics discussion that can be had here, but the legal precedents don't really allow for combatting these techniques right now. If it's public, it can be collected, ingested and used in an algorithm to determine alpha.
In the Eastern European country where I'm from (a NATO member), the Google Street View car even got to photograph, and publicly put on the Internet, the outside of military bases and air bases with clear "do not take photographs" signs visible on Street View itself. It's funny: my company also used to work in this space (a local business directory with business addresses, photos of said businesses, etc.), and one of my former colleagues got detained for a day for taking photos of businesses in the downtown area of one of the biggest cities in my country. He hadn't seen that there was a military "objective" in his line of view (probably some military HQ or something, not a proper military base with tanks and trucks). Talk about the advantages of being an internet giant like Google..
Later edit: I was talking, for example, about links like these: https://email@example.com,26.0524843,3a,80.8y,2... . That is actually the HQ of NATO's "Multinational Division Southeast", whatever that is (http://www.nato.int/cps/in/natohq/news_125356.htm?selectedLo...). Fact is, if I were to take photos of those buildings as a simple citizen I would be breaking the law; I'm not sure how Google got away with it.
It's not practical or safe to put a huge parachute or balloon over a tanker to block overhead imagery. Any sailor can tell you it just wouldn't work.
If you scrape regularly, then pick up a dozen or more machines around the world, in areas less friendly to US law. Pay with a rechargeable credit card or bitcoin. Then buy servers and set up a Hadoop cluster that handles scan jobs.
The worst-case scenario is that LinkedIn, Yelp, and the others get some of your servers shut down. Wash, rinse, repeat.
EDIT: please note, this was only a thought experiment for bypassing rude and destructive laws like the CFAA, which weaponizes TOSes, EULAs, and other implicit contracts of adhesion (as in, you have to agree just to see the content). Ideally we would be better off both with these laws having a sane scope, and with companies not expecting to control what happens to content they make public.
As for the last one: I built 13 crawlers to continuously compare the prices of all drugstore products online. I'm selling the data to drugstore e-commerce sites that want to know their competitors' prices, when competitors are running promotions, and whether competitors' prices are higher. Really, I just automated the job of a person who was doing this manually every single day, looking through competitors' sites.
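Once the crawlers have produced daily snapshots, the daily manual check described above reduces to diffing two dictionaries. A minimal sketch, with made-up product names and prices:

```python
def price_changes(yesterday, today):
    """Compare two daily price snapshots and report any change as
    (old_price, new_price) -- drops often signal promotions."""
    changes = {}
    for product, new_price in today.items():
        old_price = yesterday.get(product)
        if old_price is not None and new_price != old_price:
            changes[product] = (old_price, new_price)
    return changes

# Made-up snapshots of a competitor's listed prices on two consecutive days.
yesterday = {"shampoo": 5.99, "toothpaste": 2.49, "vitamins": 12.00}
today     = {"shampoo": 4.99, "toothpaste": 2.49, "vitamins": 12.00}

print(price_changes(yesterday, today))  # {'shampoo': (5.99, 4.99)}
```

The hard part in practice is not the diff but keeping the snapshots aligned, since product names, SKUs, and page layouts change between crawls.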
If that's illegal, then you'd have to ask Google Maps to remove every house from its database and keep only the houses whose owners gave permission.
Both statements amount to the same bullshit; they are both wrong:
Google Maps: "The front-door of my house is faced to a public street. That does not mean you can take photos of it on Street View and use a very smart OCR, to read my house number. Not to mention that sometimes you give my house photo to others by Captcha asking the house number, c'mon!"
LinkedIn: "My CV is half-public. That does not mean you can point a crawler at it."
... This is only a cool discussion in North Korea or maybe in China. Not in the rest of the world.
Your very example of Google Maps in Germany - well, turns out Google had to give people the option to opt out: https://europe.googleblog.com/2010/10/how-many-german-househ...
Same goes for web sites - robots.txt exists for a reason. If your crawler ignores that, well, I'd suggest talking to a lawyer.
I hate to be the bearer of bad news, but... maybe. A reading of the Computer Fraud and Abuse Act could make robots.txt legally enforceable. And given the government's very aggressive approach to interpreting the CFAA, under the right circumstances (for example, when robots.txt provides evidence that the scraper knew the scraping was not authorized), that seems like a real possibility.
Among the many other things CFAA criminalizes, it makes it a crime to "intentionally access a computer without authorization or exceed authorized access . . and thereby obtain information from any protected computer;"
A "protected computer" is, among other things, a computer "which is used in or affecting interstate or foreign commerce or communication." That would probably cover just about any server on the Internet.
The interesting questions are whether you're violating the CFAA in a way that will cause the executive branch to exercise its discretion to prosecute, and, more fundamentally, whether the CFAA is even constitutional.
From a moral point of view: robots.txt states an intent, and intentionally ignoring that is not a nice thing.
And often, "not a nice thing" translates into legal action. So, if this matters to you, you should find an expert :)
As a lawyer who gets asked inane questions by friends, a common answer is: "I'm not sure if it's legal to do, but I'm absolutely sure you're an asshole if you do it."
If you don't want it scraped, take it down, or put it behind a login.
If the user provides the login to a scraper, then the scraper has permission.
If I can walk near a pool, should I also be able to run? Is running anything more than faster walking? If I'm allowed to be around the pool walking with my entry ID, should I also be allowed to place my ID on a little motorized car and make it dart around the pool really fast? Should I be able to duplicate my badge, put it on a bunch of little cars and direct them to quickly get all the floaties before anyone else? How about giving them all their own fake IDs? Now all the same questions, except there is a sign that explicitly prohibits all of these examples except for walking.
It seems disingenuous to argue that the automation and rapid increase of a thing should be allowed just because a thing is allowed. That doesn't typically match our intuitive notions of ethics in other parts of society, like driving or walking around a pool. Yes, you can walk around a pool as much as you want, but if you change to running then you have fundamentally altered your behavior through increased capability, not merely done "more of walking" to utilize more of your freedom.
I suppose a natural counterargument to this analogy might be that running around pools is unsafe, and scraping is not unsafe in the same way. But my point here is establishing that a behavior intrinsically changes into a different behavior if you increase the speed at which you're doing it or the capability at which you can do it.
The analogy is starting to break down, but I think it's still instructive for the problem of applying a simple first principles approach.
Likewise, in some high-crime jurisdictions, if you do not lock your car you are liable for it getting stolen or broken into. An unlocked car is too tempting for some people to just walk past without taking it.
I know it might sound crazy but you could make the argument that a massive pool of highly-structured and very valuable data just sitting out in the open is an attractive nuisance and steps should be taken to put it behind a locked gate. Once that requirement has been satisfied, normal trespassing laws apply.
For example, if a woman walks down a dark alley wearing a short skirt and gets raped, it isn't her fault. I mean, can you imagine if we said "well, she was just an attractive nuisance!"? The judge would throw the book at you.
I am saddened by this.
Once they have withdrawn permission, they can call the police and you can be charged with trespass, though that's usually a misdemeanor rather than a felony.
Yes, that will cause potential issues for other people, which is why they tend not to do that, but if you trivially create a thousand more agents, and potentially trigger a degradation of service, how are you different to the people who block junctions at traffic lights?
I'm not keen on inconveniencing people, and "it's not that bad" is a poor argument for doing something that someone has explicitly asked you not to do.
That's why it might be reasonable for laws around this sort of thing to be different in the virtual world.
What does it have to do with ethics? The CFAA does not apply here, and that is what matters.
But if you contest that or have an issue with the specific terminology I used, perhaps you'd prefer this terminology instead: "I don't think this is a good framework for interpreting the CFAA and establishing legal precedent."
The original complaint concerns access to publicly published information. Again, if you do not want people to read publicly published information, do not publish it publicly.
And as a side note - we don't need inappropriate analogies; the web is real, we can discuss the real issue.
I think the issue here is around public data. If you made medical records available to the public, you would be in trouble in many countries, and sharing those may land you in the same trouble under that law as the website that initially posted them. The person shouldn't be charged with unauthorized access, because the materials are public.
If there is a problem with that, see the second line
Even if you go for "forgiveness rather than permission", if you ride it a second time after I've told you I don't want you riding it, then you're in the wrong.
Obviously, there are philosophical arguments to be made about the status of information online, but if the information were provided to LinkedIn, to be used on LinkedIn properties, I as a user would take umbrage with it being taken by a 3rd party, even if it were viewable publicly.
If you have an issue with that, you should be moving your bike somewhere people can't take a photo from the public street. Not have someone creatively interpret a law that says where I am is suddenly not public property, because you asked me to stop using my camera.
Perhaps my analogy wasn't great, but the grey area is around going to LinkedIn's server (whether or not this is "public" or their property they allow you access to is another philosophical question, though in the eyes of the law it appears it's the latter), deliberately extracting value from it, and then getting annoyed when you're asked not to.
Inherently it seems as though it's the old question of whether a server is public, or private but accessible (like those POPs there was a thread on recently).
If it has those. The data on LI (other people's employment histories) is not its own IP.
That would be copyright infringement because photographing a copyrighted work is considered reproducing the work under the law.
Taking a photo of the bike is also a poor analogy. Scrapers don't take one photo, they take photos of all the bikes. And scrapers don't keep the photos for themselves, they sell them for a profit. Also, the original bike isn't parked, it's placed in a gallery (probably with an admission fee? I don't know the business model of linkedin).
Physical analogies for this kind of thing are always flawed, it's just dishonest/misleading to pretend that copying data is ever analogous to taking a physical object (the owner of the original is never deprived of the original when data is copied).
"I don't know the business model of linkedin"
Most of it is selling premium features to recruiters and other businesses. I'm not sure if Hi-Q's service interferes with that or not, but LinkedIn should not be trying to have their cake and eat it by leaving things in public then complaining when the public accesses it in a way they don't like.
Actually, the size of the data and the number of requests is very relevant. More data means more information, means more money. It also means more bandwidth and processing power required to process requests. You're not taking a photo of the bike, you're asking the bike to give you a photo of it.
> it's just dishonest/misleading to pretend that copying data is ever analogous to taking a physical object (the owner of the original is never deprived of the original when data is copied).
Leaving LinkedIn aside, possession of the original data is never the issue with digital piracy. It's a straw man. The hurt occurs when people benefit from the work the original author put into creating that data without proper compensation. Just because you can clone my gizmo (which I spent years working on) without taking the original one doesn't mean you're not hurting me. That gizmo could give me an advantage you wouldn't otherwise have. I put hours of work into something that doesn't put food on the table, because you can clone my work, but I can't clone my food.
There's a reason an empty CD costs 50c but a music album costs $10. You're not paying for the physical medium. You're paying for the IP. And yes, digital distributions are cheaper because of this, but that doesn't make them free.
> Most of it is selling premium features to recruiters and other businesses.
I'd say it's pretty obviously interfering with their business model.
> LinkedIn should not be trying to have their cake and eat it by leaving things in public then complaining when the public accesses it in a way they don't like.
LinkedIn could ban IPs that make unreasonable number of requests in a short amount of time.
"The hurt occurs when people benefit from the work the original author put into creating that data without proper compensation"
Not necessarily. If I'm paying for print of some imaginative artwork that was created using the picture of the bike, that doesn't mean the bike owner lost anything, even if he spent time building the bike with his own hands. Similarly, if the only reason why people paid Hi-Q was for the extra work that they put in, LinkedIn didn't lose money because people would not have bought their product without that extra work.
There is certainly an argument that Hi-Q should have licensed the content first, but it's public data. If they want to make licence deals, they shouldn't put the data in view of the public street and then whine when people document what's in public.
"It's a straw man."
No, the straw man is pretending that a copy is the same as theft. Theft is theft because someone is depriving you of the original, not because you imagine you might have had more sales if the copy didn't exist. There's a reason why there are different words for different things, and pretending that a copy is the same as taking a physical object is a lie. Period.
"I place hours of working into something that doesn't put food on the table because you can clone my work, but I can't clone my food."
But, you put the price up too high, so I opted not to buy it. Maybe borrow the CD from a friend, or listen to something else. Or, you decided I couldn't buy it in the format or region I wanted. There are real issues, but pretending that a copy = a lost sale is utter bull that's been debunked time and time again, yet is regularly repeated by people trying to inject emotional arguments instead of facts.
"I'd say it's pretty obviously interfering with their business model"
Then perhaps they should address the business model or not put their content out there in public unprotected if it's that valuable to their income.
"LinkedIn could ban IPs that make unreasonable number of requests in a short amount of time."
Yes they could. Which would not have to involve the courts in any way. Or, they could protect the content in some other way that (for example) requires a log in and adherence to T&Cs, with which they could easily kick violators off their site for non-compliance.
The issue is that LinkedIn are trying to have it both ways - gathering the benefits of public content while blocking others who use the now-public content in ways that are usually acceptable for public content to be used. Sorry, not acceptable, you pick one - take the content away from the public street or accept that some people will use what has been shown to the public.
With this, I agree 100%.
> No, the straw man is pretending that a copy is the same as theft. Theft is theft because someone is depriving you of the original, not because you imagine you might have had more sales if the copy didn't exist. There's a reason why there are different words for different things, and pretending that a copy is the same as taking a physical object it a lie. Period.
That's just pedantry. The debate isn't between "copy" and "theft", it's between "theft" and "copyright infringement".
> But, you put the price up too high, so I opted not to buy it. Maybe borrow the CD from a friend, or listen to something else. Or, you decided I couldn't buy it in the format or region I wanted. There are real issues, but pretending that a copy = a lost sale is utter bull that's been debunked time and time again, yet is regularly repeated by people trying to inject emotional arguments instead of facts.
This is wrong on so many levels, I'm not sure there's any point in continuing this debate. Are you accusing me of using emotional blackmail instead of facts because I point out that "you can clone my work, but I can't clone my food"?
I'm not using myself as an example because I want pity. I'm doing it because it's easier in writing, and because I'm a software developer.
My work takes hours and hours of time and effort (not counting the hours I spent in school). If it's ok for everyone to clone my work, I won't make any money from it. We still live in a society where goods and services are exchanged for money. I exchanged my hours of work for no money, but I can't exchange no money for basic living necessities such as food. There are no feelings involved here. In the current economy, work going in and no food coming out is not a viable business model. And if nobody paid for digital content, there would be a lot less digital content.
> pretending that a copy = a lost sale is utter bull that's been debunked time and time again
This is another straw man. Whether or not an illegal copy is or isn't a lost sale is irrelevant. You don't have the right to make that copy in the first place. If everyone made illegal copies, there would be no sales. So then why should only some be entitled to illegal copies? There isn't a distinction between people who can make copies and people who must pay for copies, so either everyone must pay for copies or no one must pay for copies. That's how law and economy work. You can't make exceptions by yourself. Either everyone is allowed, or no one is allowed. And for digital content that is for sale, no one is allowed illegal copies. If laws are made that allow poor people to receive goods for free, these laws must address both digital and physical goods.
It's clearly a value-add and not theft.
You have to keep in mind that an entire generation was brainwashed that personal data isn't that "personal", so Google, FB and the rest can have amazing profits.
Most of these discussions are stained by general unawareness of the privacy and copyright law.
Of course, because of the low value at stake for the data supplier (usually a single person), these cases never reach the courts, which just reinforces the ongoing misconceptions.
If you really want to test this, try copying the content from Google, Facebook, hiQ, or whoever else is big enough to go after you.
But people somehow believe that it's okay for businesses to do what regular persons aren't allowed to.
As someone who has done a lot of scraping in the past (sometimes for good, sometimes not), the number one thing you need to respect as a scraper is that email or phone call you get saying "Stop doing that."
In almost all instances, you're legally fine in the real world until you get some communication to stop and/or blacklisted. After that point, what you are doing becomes a crime.
- LinkedIn is not a public resource, it is a private company that pays for servers.
- LinkedIn might scrape too, but that argument isn't going to hold up in court, and the scraping they do is probably in line with their EULA (protip: never install a social networking app on your phone, ever).
- The analogy to the storefront, taking pictures in public, etc, all break down because scraping LinkedIn requires you to access their resources.
- The analogy to browsing a store is great. If you are in a store, and they ask you to leave, and you don't, that's trespassing. Trespassing isn't legal.
The CFAA isn't a great law. There are a lot of gray areas. But LinkedIn seems within their rights here.
If anyone wants to know how this is going to wind up: https://www.eff.org/deeplinks/2015/06/padmapper-and-3taps-se...
Enter my scraper. It copied data into a local PostgreSQL database that our customers could run reports against. A process that used to take a human 6 hours a day now took 30 error-free seconds. The scraper was even a benefit to the website as we ran it overnight during low-load conditions, and because my software was smarter than a web browser, we could retrieve the same information as a human with about 1/3 the number of web requests. Perfect, right!
Well, no. We got an angry call from the developers complaining that we were the ones making their site slow, even though 1) we measurably didn't, 2) we were lighter on resources than the humans we were replacing, and 3) we scrupulously obeyed their rate limiting requirements and erred on the side of running 10% slower than they had originally requested from their customers.
That particular problem went away when we pointed out that their shiny new website wasn't Section 508 accessibility compliant and could not be made to be without literally throwing out their entire web service and starting over, but our website was, and that if they continued to allow us to serve screen reader compatible pages to our disabled customers then there wouldn't be a need to have the .mil website shut down and a Congressional investigation launched. All parties involved decided this was a reasonable compromise.
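The request-spacing discipline described above (staying 10% slower than whatever rate the host asked for) can be sketched roughly like this. The class and numbers are illustrative, not the actual code from that project:

```python
import time

class PoliteFetcher:
    """Space out requests so we never exceed the host's requested rate;
    we deliberately run 10% slower than asked, erring on the safe side."""

    def __init__(self, requested_interval):
        self.min_interval = requested_interval * 1.1  # 10% extra slack
        self._last = None  # monotonic time of the previous request

    def wait(self):
        """Block until it is polite to issue the next request."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self._last + self.min_interval - now
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Call `wait()` before each HTTP request; combined with an overnight schedule, the scraper's load stays strictly below what the site operators asked for.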
Congress writes bad laws all the time. So LinkedIn might be within their rights, but that doesn't mean they should have those rights.
It's bad for innovation to allow for selective discrimination like this. LinkedIn is perfectly happy to allow Google, Yahoo, Bing, and many, many more companies to scrape their content and use it for personal profit. Giving them the option to sue an upstart for doing exactly the same thing as Google is unfair and oligopolistic.
Letting established tech giants get away with this will slowly erode American dominance in technology.
In the 80s, it was reasonable to assume that connecting to some port on a remote machine owned by another person or company could constitute unauthorized access. But today, billions of people connect to ports on remote machines thousands of times a day for completely legitimate reasons, so it's reasonable to assume that data that can be accessed by just asking nicely over the internet is considered intended for public consumption.
It seems permissive, but I think that's a crucial component. If some company accidentally makes their S3 buckets public, it's completely unfair to say that accessing that information is illegal, especially when they are serving up other information in public S3 buckets which they want people to access.
But it's not. LinkedIn's data is their entire business. They are within their rights to restrict access to it.
This is the classic ant and grasshopper story. If HiQ wants access to the type of data they are scraping from LinkedIn, they can build that data themselves.
"There has grown in the minds of certain groups in this country the idea that just because a man or corporation has made a profit out of the public for a number of years, the government and the courts are charged with guaranteeing such a profit in the future, even in the face of changing circumstances and contrary to public interest. This strange doctrine is supported by neither statute or common law. Neither corporations or individuals have the right to come into court and ask that the clock of history be stopped, or turned back."
> If HiQ wants access to the type of data they are scraping from LinkedIn, they can build that data themselves.
If they don't want their data to be public, then they shouldn't make it public. They could require authorization to view any content on the site and solve this problem instantly.
It's more like complaining about others selling tickets to view said painting from the sidewalk. HiQ repackages and sells data it scrapes from LinkedIn.
It's like taking pictures of the paintings from the street, and reselling those pictures.
If they don't like that... take the painting down. Simple.
Personally I think anything I can see I should be able to take a photo of for sentimental purposes.
As with all rights, it varies with jurisdiction.
edit: didn't see the sibling reply which makes the same point
Between bot-nets, mechanical turks, deep learning, data brokering, lack of globally-enforced privacy laws that require divulging of sourced personal data, etc., I can't see a way for LinkedIn to prevent others from scraping and gaining from their publicly- and user-accessible data. They'll drive it underground, but if the concern is preventing others from grabbing the data at all, versus performance management, they'll still leak like a sieve.
If your business is in the recruiter space, expect to have your API keys revoked and to receive a cease and desist letter as well.
You would need real-looking bot accounts that you'll use to scrape. You'd need a realistically randomized rate limit, sampling from some distribution conditional on the type of the source page. You'd need realistic mouse/keyboard movements. Realistic hours of operations. Can't be scraping at 4AM and 4PM, and all of the hours in-between. Occasional noise operations, such as searching for a job, or getting salary estimates. You'd be geographically constrained. You wouldn't want your bot from Boston to be looking at too many individuals in Houston (regularly). Maybe you'd use a Markov chain to have the bot make decisions? I doubt the blackhats would have good training data for a neural net. You'd need tens of thousands of these bots to cover the linkedin user base in reasonable time (say, once every week on average), and these bots would have to either overlap or seriously underlap on who they cover.
Best use case would scraper-API that you can use to look up batches of specific people, with your bots looking at others only to look realistic.
(Or maybe not? It's a fun question, but I know fuck all about this. Not my area of expertise.)
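A toy illustration of the "realistically randomized rate limit, sampling from some distribution" idea above. The lognormal parameters and the operating hours are pure guesses, not data about real browsing behavior:

```python
import random
from datetime import datetime

def humanlike_delay(rng):
    """Sample a think-time in seconds from a heavy-tailed distribution,
    roughly mimicking human browsing pauses (median here is ~4.5 s)."""
    return rng.lognormvariate(1.5, 0.8)

def within_operating_hours(now):
    """Only 'browse' during plausible local hours (9am to 10pm here)."""
    return 9 <= now.hour < 22
```

This only covers timing; the harder parts listed above (mouse movements, geographic plausibility, noise operations) are where such a scheme would actually live or die.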
Average botnet size is 20,000 compromised PCs. Srizbi is estimated at 450,000. Another vector I'd explore is teaming up with crypto-miners. As I understand it, there are no economic returns tapping into the CPUs any longer, so miners are using only GPUs and ASICs; if this is true, they'll have some spare CPU cycles that they'd probably be willing to rent out to get some marginal returns on the CPUs that have to run and manage the mining chips, running a JVM or some other VM. If we can do that, then we can probably tap 2-3M hosts, many of them rotating in and out per day.
Throw out an army of mechanical turk assignments to get real humans to register fake accounts. They get paid upon submitting an account and password, which your scraping servers verify, then change the password and commandeer. Perhaps have them register the fake account while running under a container or VM on their computer; the container/VM is instrumented to capture all activity. The activity metrics and data are uploaded to a deep learning system, that identifies the patterns that work and the ones that don't, and uses that to guide the developers of what to randomize, and by how much.
Add in a component to randomly invite/follow other fake and real accounts, and generate Markov-chain-generated copypasta. Set aside a portion of the fake accounts to only build up networks of users. Initially restrict the market of customers to those who only want once-a-year-updated data. As the network builds, use the notification of changes to selectively scrape only changed user profiles, and upsell for more up-to-date profiles at that time.
If I was LinkedIn, I'd probably concentrate on infiltrating botnet operators, and shutting them down. It would be one large cat-and-mouse game.
It turns out I was spamming my friends to get on linkedin because of them.
This was in the early days of LinkedIn. That was such a douchey move; even if there was a warning, their UI made it easy to mass-spam your contact list with 1-2 clicks.
Everyone knows that LinkedIn gets paid by recruiters rather than recruitees. But of course, that is perfectly fine. People are happy to be the product if it results in job leads.
1. Self-updating Rolodex for salespeople.
2. Recruitment tool.
It works for both, no real competitor in sight due to the massive network effect.
Some niche networks are doing ok, like Xing in DACH region. Not aware of anything special in China, guess everyone is on WeChat anyhow.
LinkedIn is almost entirely worthless without the occasional good lead from a recruiter. I find it's mostly drowned out by the huge amounts of insignificant skin peddlers who dole out one worthless lead after another, looking for some fresh skin to peddle.
Below the marketing language, this is exactly those 2 points. Guess it is easier if you work in enterprise; this business speak is a different language. Pretty verbose, low information density.
I'm fine with that.
But I'm also fine if they do what the article suggests and require anything non-scrapable to be behind an account prompt, even if everyone with an account can access it.
I don't think it's fair to make LinkedIn foot the bill for someone else's business. They shouldn't have to serve that content to people who aren't actually their users.
So put it behind a password. It's not reasonable to expect to get only the benefits of publicly-searchable data without any of the drawbacks.
If a Google-like competitor started, all Google would have to do to crush them is demand big-name sites formally prohibit the competitor from accessing their content or risk being delisted. And magically, it becomes impossible/illegal to build a duck-duck-go.
On one hand I'm against the cartel behaviour of Google doing something like that, but on the other hand, if Google asks and the other company agrees to block DDG, why shouldn't that be allowed?
And insurance companies only want to serve people who won't claim etc. LI is free to force users to usage agreements, but why should third-parties (taxpayers) foot the bill for enforcing them?
The law by itself is ok, but I suspect lawmakers were referring to accessing a single personal workstation, probably not taking into account a cluster of servers containing public accessible data.
In the US. Not elsewhere. Is it possible that the centre of gravity of innovation will be somewhere, er, less litigious soon?
Reminds me of the City of London's historical approach to most forms of regulation (think: Francis Drake, privateers, the convertibility of lute strings &c).
That is the whole point of 'public.'
Likewise Facebook doesn't exist to help connect people together to create a more personal, connected world. It exists to connect people so that they can get the users to share as much personal information as possible, so they can profit from that information.
It's always important to remember that on social media you are the product. You have to weigh what you benefits it is really providing vs. what you are giving up.
There, I fixed it.
It is important to understand that, legal or not, in the US anyone can sue you for anything. In another comment people were discussing the legality of following or ignoring "robots.txt". I tend to be pragmatic about this stuff. If you fabricate your own legal interpretation and end up being sued by LinkedIn, it could end up costing you $250K.
When facing large corporations law firms often ask for sizable retainers ($100K+) and proof of cash-on-hand to go beyond that, $250K total not being an unusual number. They don't do this out of greed. They do it because litigation at those levels can be very expensive. If you only have $100K you could find yourself burning through all of it quickly. If you don't have more cash to continue you'll lose the lawsuit and the $100K you spent will be burned for no reason at all. In other words, the law firm is protecting you by asking you to have enough cash to litigate.
A few years ago we started to develop a very extensive product based on obtaining data from Amazon. Some of the data is available through their API and other data had to be scraped in various forms. The product is extremely valuable yet the issues pertaining to scraping made me decide to put it on hold. Even if you can make millions a year the prospect facing a monster like Amazon in a lawsuit, as improbable as it might be, is scary enough to go look for other pastures.
Screw both of these companies. One runs a shitty "professional" social network (data collection tool) and the other scoops up their data droppings and makes it their core business. I just can't have sympathy for either side.
> Facebook sued, claiming that its cease-and-desist letter made Power's access unauthorized under the terms of the CFAA. Power disagreed and argued that having permission from Facebook users was good enough—it didn't need separate approval from Facebook itself.
How can it be illegal if users are giving their permission? What happens if I give my permission to an external service to extract my own data?
Not making a claim about scraping, but it's maybe more apt to compare it to walking into a store and writing down a list of everything for sale and what it costs.
A long time ago, companies that were compiling phonebooks also thought that they could use e.g. copyright to prevent others from using that compiled data for commercial gain. They were wrong, because copyright doesn't protect "mere aggregation" of data, as courts have ruled.
In this case, LinkedIn is not arguing on the basis of copyright, so it's a different legal argument. But the essence of the case is the same - they want to have a business model around aggregating other people's data and then providing access to that data, while limiting what people who access it can do with it. They don't have a right to this business model. If technical means of restricting access don't work, or if adopting them means that they drive most of their customers away, tough luck.
I meant there is nothing that LinkedIn 'created' out of that user data. But it's a little more complex: it is somewhat immoral to take users' data hosted on infrastructure that costs LinkedIn a lot to run.
(Note, HN strips out the trailing dot from the URL above.)
I'm going to turn around now, and whatever happens to this site is going to happen. They worked so hard to ask for this.
They have all kinds of blocks in place.
* What if someone had a few oDesk workers to manually enter in the LinkedIn data?
* What if someone wrote a scraper that parses the Google cache version of the public LinkedIn page?
They have no legal obligation to provide accurate information to scrapers.
According to TechCrunch, they sued hundreds of "Anonymous" individuals last year.
How can unnamed people be sued?
"this scraping violated the Computer Fraud and Abuse Act, the controversial 1986 law that makes computer hacking a crime. HiQ sued, asking courts to rule that its activities did not, in fact, violate the CFAA."
If you don't want it read, take it down.
First, the scraper consumes a lot more bandwidth (which some websites might have real trouble with). Second, the aggregated scraped data is a lot more valuable than what any single person receives from reading a few pages.
Websites could (and sometimes do) limit the number of views and the frequency of accesses from the same IP, which allows users to read the page and discourages scrapers without "taking it down".
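A sliding-window per-IP limiter of the kind described is only a few lines. This is a sketch with in-process state; a real deployment would enforce it at the load balancer or in a shared store like Redis rather than in application memory:

```python
import time
from collections import defaultdict, deque

class IpRateLimiter:
    """Allow at most `limit` requests per `window` seconds from each IP."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent hits

    def allow(self, ip, now=None):
        """Return True if this request is within the IP's budget."""
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:
            q.popleft()  # forget requests older than the window
        if len(q) >= self.limit:
            return False  # over budget: serve a 429 or drop the request
        q.append(now)
        return True
```

This throttles naive scrapers without affecting ordinary readers, though determined scrapers rotate through many IPs, which is exactly the cat-and-mouse game discussed elsewhere in this thread.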
Anyone can make a LinkedIn account, ergo anyone can scrape any public profile.
The EFF said this, so they seem to agree that it set a bad precedent, even for those just scraping.
"Under the government's theory, anyone who disregards – or doesn't read – the terms of service on any website could face computer crime charges," said EFF civil liberties director Jennifer Granick in a press release at the time. "Price-comparison services, social network aggregators and users who skim a few years off their ages could all be criminals if the government prevails."
This is false. Web pages expect scrapers to respect robots.txt and users to respect the terms and conditions. From what I understand, the CFAA law was not meant for these violations, and the fight is whether or not this law can be used to win the trial. However, other laws can be put into place to address these situations.
I agree that what you say is the only sensible way to look at things.
If you have permission for one thing it doesn't mean you have permission for something different.