Hacker News
LinkedIn: It’s illegal to scrape our website without permission (arstechnica.com)
374 points by mcone on July 31, 2017 | 287 comments

Unpopular opinion: when you make an HTTP request, you're asking the server to give you information. The server has the right to say no.

IMHO, LinkedIn doesn't have a right to stop scraping after the fact, but they have the right to take technical steps to stop scrapers from accessing their site.

LinkedIn takes plenty of technical precautions to block scraping. I’ve built bots that scrape them in the past; it’s surprisingly difficult, as LinkedIn is very good at determining that you’re a bot and blocking you. So it’s hard to argue that a service scraping LinkedIn is doing it without knowledge that it is going against LinkedIn’s wishes. Whether or not this is illegal is up to the courts to determine, and I really hope they decide it is fine (although I have zero faith that’s how the ruling will come down).

That being said, I hate LinkedIn as a company and I fully support anyone trying to mess with them. They are not a social network, they are a sleazy website that convinces people to willingly provide personal information which they then turn around and sell at ridiculously high prices. Even if you are legitimately using LinkedIn as an end-user, it’s easy to get blocked for using it too much and being forced to pay just to interact with people on the site.

> So it’s hard to argue that a service which is scraping LinkedIn is doing it without knowledge that they are going against LinkedIn’s wishes.

As far as I can tell, no one has made that argument, so I'm not sure why you feel the need to rebut it.

I think it all pretty much boils down to this quote from the article:

> LinkedIn's position disturbs Orin Kerr, a legal scholar at George Washington University. "You can't publish to the world and then say 'no, you can't look at it,'" Kerr told Ars.

> no one has made that argument, so I'm not sure why you feel the need to rebut it.

The title is "It’s illegal to scrape our website without permission". So that argument is implied in the headline, at least.

As for Orin Kerr, I'm sure he'd agree that there are private parts of the internet (my payment information being an obvious example). Just because something is deployed to the internet doesn't mean it is "published to the world" as he claims.

That is not Kerr's argument. A key point (which he has developed in great detail, both as a law professor and in actively defending people such as Weev) is the access controls in place.

He's not just bloviating; before you disagree with him, reviewing his arguments is worthwhile. (I do disagree with him in part, and agree with his reasoning but don't like the outcomes in part, but in any case, he's a pretty accomplished lawyer, and I'm not any kind of lawyer, so there's that.)

"Published to the world" is a bit off; I'd compare it more to recording telephone conversations with corporations' public service numbers. They always tell you that you are being recorded (for "training and quality purposes") so they give up their right to not be recorded in turn.

If a company paid a million people to call into an information service hotline and each requested one fact from it—and then the company recorded and compiled the answers into their own database to start their own information service—is that illegal?

>If a company paid a million people to call into an information service hotline and each requested one fact from it—and then the company recorded and compiled the answers into their own database to start their own information service—is that illegal?

I sure hope not. From there it doesn't seem too far off to lawsuits claiming "He was really only ever showing up for work to learn some skills, and now he's using those skills to run his own business!" If you compile difficult-to-find but freely available information into an easier-to-digest format, I see that as virtually always a net positive.

I don't see anything that connects the title (written by LinkedIn) to the motivation of the people that are scraping it (not LinkedIn).

Everyone that scrapes LinkedIn (or anywhere else) either knows that they are doing it against LinkedIn's wishes or doesn't care.

> a sleazy website that convinces people to willingly provide personal information which they then turn around and sell at ridiculously high prices

I think you just defined a social network, unfortunately.

Don't the other social networks throttle HTTP requests from certain dubious IP addresses? I think they all do this.

* That being said, I hate LinkedIn as a company and I fully support anyone trying to mess with them

So it's fine to mess with them, even illegally, just because you don't like them?

* convinces people to willingly provide personal information

Convincing is not forcing, and in fact, you say "willingly" yourself. Any business convinces you to willingly give them money or other value.

* sell at ridiculously high prices

With respect to what? Prices are determined by the market. We are not talking about a life-saving medicine or health care, where there could be some debate.

We are talking about a company selling information, which has value and which it acquires through an infrastructure that takes a lot of money to run. Value that customers are willing to pay for.

* Even if you are legitimately using LinkedIn as an end-user, it’s easy to get blocked

For what definition of legitimately? Yours? Since it's their business, they can define what is a legitimate free use and what can be a paid one.

* They are not a social network

So based on what I pointed out, they are not a social network only because they are not free? Do all social networks have to be free and give your data to advertisers to be social networks?

>So it's fine to mess with them, even illegally, just because you don't like them?

The previous commenter never mentioned messing with them illegally and disagrees with the analysis that this scraping is illegal. He said he hopes the courts do not rule this is illegal. Pretty disingenuous to start a reply like that...

Many of your other points are not really giving the previous commenter any charity at all.

For example,

>For what definition of legitimately? Yours? Since it's their business, they can define what is a legitimate free use and what can be a paid one.

He was literally trying to provide an example of where their scraping protections can appear to be overzealous to casual users, not arguing whether those users are legitimate or not.

What is the premise of your argument? To me, it seems you are simply trying to defend LinkedIn's business practices and legal pursuits, rather than discussing anything about the legality of scraping or the specifics of LinkedIn's anti-scraping implementations.

> So it's fine to mess with them, even illegally, just because you don't like them?

If it's not illegal, then scummy behaviour against a scummy company doesn't exactly set my moral compass off. You reap what you sow.

An eye for an eye, a tooth for a tooth.

Very good. That way the whole world will be blind and toothless. --Tevye, Fiddler on the Roof

If you allow yourself to act as badly as others act, you don't have much of a moral compass.

The world can only improve when we hold ourselves to a higher standard than we see, as it's too easy to rationalize our own behavior, and harshly judge others' actions.

As a sheep, you can take the moral high ground against a wolf, but you'll still get eaten. Companies generally don't care about holding themselves to higher standards of behaviour, unless it hurts their bottom line. Institutions don't understand shame, right, or wrong. The only thing they understand is power.

As a principled human being, you can choose your actions.

The defense against a wolf is (generally) not eating the wolf.

The defense against invasion of privacy is not more privacy invasion.

>moral compass

The issue I see with this is in two parts.

First, any issue that comes down to a "moral compass" is inherently dangerous. We can find many examples of the simple fact that what is Good to one person is Evil to another. In this case, I think LinkedIn shareholders would not appreciate calls to mess with the site, nor would people having trouble job hunting because the site keeps going down due to DDoS or whatever.

The second is that these kinds of calls to action (linkedin sucks, fuck with it) smell like vigilantism to me, and while Batman is my favorite hero (I really only lift because I kinda sorta wanna be batman), vigilantism doesn't contribute to a stable society. Rule of law works better than the chaos of multiple agents enforcing their own moral code as law.

EDIT: I'm happy to be downvoted if I'm saying something stupid, but while doing so I would very much appreciate a quick comment as to why I'm wrong so I can improve my knowledge.

Everything comes down to a moral compass of some kind. Your comment expects some sort of objective measurement of good and evil, but I don't see any. The law can be, and often is, in the wrong.

Most people seem to think this law (if it is held up in court) is wrong and should be changed.

You might say, oh well, we have a democratic right to change or influence our laws. But a Princeton study has found no correlation between the policy preferences of the majority of the population and enacted policy: https://scholar.princeton.edu/sites/default/files/mgilens/fi...

From memory, the only thing that leads populations to revolt against their government is high enough food prices. Outside of that, revolts almost never happen.

What I'm trying to say is A) unjust, monopolistic, or excessive laws are probably more normal than the opposite because B) the idea that democracy means people have some weight in lawmaking might be a myth and C) most people don't do anything about it because they only act when the very basics of their livelihood are threatened.

Your view, that some unjust or excessive laws are preferable to total chaos, seems to carry the assumption that laws are naturally benign and/or made to serve some purpose for society, and therefore we should not challenge them without good reason. If the opposite is true and most laws, or a high enough number of them, are not just, then the fact that the vast majority of people disagree with them is only natural.

This is a fairly long-winded way of saying that most people would say you're being downvoted because there are plenty of terrible laws that we should not acquiesce to silently.

Thank you for taking the time to reply, this is good to chew on.

Thank you for reading. There are studies that contradict the one I posted, in one way or another, so I'm not saying these things are for sure.

I would replace the word "convinces" with "deceives". If they deceive a person into, for example, giving them access and permission to use that person's contact list however they please, and then proceed to do just that, the person supposedly has nothing to complain about.

I'm lucky my dad uses a different password for his gmail account than his linkedin login (using said gmail address). He was stuck for a while complaining that he couldn't log in. Turned out he was already logged in, and LinkedIn was just presenting him with a "put your gmail password in here so we can raid your contacts" box. It looks just like the login page, so he kept putting in his LinkedIn (not gmail) password and it kept saying "your password is incorrect."

What a shitty thing to do.

All his contacts are lucky too.

Agree. The number of times I've received LinkedIn emails from people I no longer speak to, who I'm confident would have no actual inclination to connect with me, is certainly in the double digits. They've all been conned into giving LinkedIn their email password, and LinkedIn is going crazy as a result.

This probably happened a lot more a few years ago. Is perhaps 2FA making this harder these days?

Are they still doing that? I haven't seen a request from LinkedIn for email access credentials in a while. Still a pretty sketchy thing to do IMO.

I wonder if Google, Yahoo, MS etc have done anything like watch for requests from LinkedIn with correct credentials, block them, and reset the user's password and give them a warning that they just gave their account password to a third party and this is a Very Bad Idea.

> Are they still doing that? I haven't seen a request from LinkedIn for email access credentials in a while.

Literally yesterday I got, for the first time for that user, one of the spams sent "on behalf of" a user who clearly hadn't given out my e-mail address, so I guess they still are up to these no-good deeds.

A whole bunch of other sites ask for your Google, etc credentials instead of using the SSO API. I've even seen some tax software ask for your Bank login. It's really a bad practice but LinkedIn isn't the worst offender.

Speaking of offenders, I just noticed that the seemingly popular Venmo payments app asks for the web login credentials to any bank you try to link it to. Hell no I'm not giving some random app login credentials to any of my bank accounts.

Yes, this is maddening. I had a Craigslister who would only pay for a transaction via Venmo. (She was willing to make the transaction in my presence and wait for it to confirm; it wasn't a scam on her part.) With great reluctance (I needed the transaction at the time), I changed my bank password to something else, signed up for Venmo, got the money, de-enrolled from Venmo, and changed my bank password back.

I think that they used to have some FAQ entry explaining why worrying about this is silly and nothing bad could happen, but I can't find it any more (probably because it's nonsense). However, just because they should be shamed for this whenever possible, here's a Slate article on their overall security: http://www.slate.com/articles/technology/safety_net/2015/02/... .

It's a few years ago now .. but it definitely happened.

Agree with you that giving your account password to any third party is madness, even more so actually soliciting it.

There was a class action suit about this a few years ago. Completely inappropriate. Though I don't think it should have any bearing on whether or not it's OK to scrape their site.

This might be one of the best replies on HN that I have seen. Spot on.

I don't think you've characterized this accurately. When you make an HTTP request to LinkedIn you are accessing their service. There is a long history of this relationship: you plug your house into the sewer line and you connect to the sewer service; you connect to the power pole and you connect to the electricity service; you connect to the telephone pole and you connect to the telephone service.

Every service has "terms of service" which are the conditions that you are allowed to access the service and what you may do with the service once being granted access. For example, if you start pouring toxic waste into your sewer, you will find that the city will both disconnect you from the service and they will fine you for violating the terms of service you nominally agreed to when being hooked up.

In LinkedIn's case, they allow you to access their service, with HTTP, to render a page in a browser for viewing of that page. Full stop. Any other use of the data you acquire over HTTP, or any other method of acquiring said data over HTTP is disallowed by the terms of service.

Not only does LinkedIn have a legal right to stop scraping after the fact, they have literally centuries of common law in support of that position.

(I am not a lawyer.) As far as I understand the legal precedents involved, random terms of service for websites are not effective in this scenario, as the public profiles do not require having any account or other relationship. This actually went to court: because Zappos didn't force users to click through a terms of service to access their service, the terms of service were held invalid.

As for their ability to control what you do with the information: there might be a limited license on the data granted from users to LinkedIn that is not transferrable, so maybe you couldn't build a service that redistributed that information, but I don't see why obtaining and holding it would be illegal.

As for the analogies to power and telephone and such, those are built on property owned by a local government, and there are usually other extra laws related to them: it isn't due to some common law position that you can't mess with their stuff. Here, I am not a lawyer, but I am a government official with a particular interest in sewage; here is a link to the sewer use ordinances from our local sanitation district: pay particular attention to 2.03.


I worked at Google for four years, at an independent search engine for 5 more after that, and at IBM for 18 months after it acquired said search engine. Every one of those organizations spent many thousands of dollars on legal fees over just this question and reviewed tons of case law.

Every single one of them concluded that based on how the law was written and how the web worked, there is no legal way to scrape a web site without its explicit permission to do so.

That won't stop people from trying, of course, and it was a source of constant entertainment in the ops team at Blekko to watch how people tried to sneak around scraping protections (it can get very creative), but: it isn't legal, you can and will get banned from all access for it, and if you use the results in another product or offering you will be found liable for damages.

> there is no legal way to scrape a web site without its explicit permission to do so.

Google scrapes several of my sites and I've never given Google explicit permission to do so.

If your robots.txt file allows crawling (an Allow: / rule) then you did. If you have no robots.txt file then it's an open question. If you put a Disallow: / rule into your robots.txt file, Google will stop scraping your site.

The implicit contract is that you let them scrape because you want to show up in their search results, which will send you traffic. If you don't care about Google traffic then add Disallow: / to your robots.txt and get back the bandwidth you were giving them.
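Checking those rules before crawling is mechanical; Python's standard library even ships a parser. A minimal sketch (the rules shown are illustrative, not any real site's file):

```python
import urllib.robotparser

# Illustrative rules; a real crawler would fetch https://site/robots.txt.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Ask whether a given user agent may fetch a given URL.
print(rp.can_fetch("ExampleBot", "https://example.com/jobs"))       # True
print(rp.can_fetch("ExampleBot", "https://example.com/private/x"))  # False
```

A crawler that skips this check has no "assume allow" argument to fall back on: the site's wishes were machine-readable and it chose not to read them.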

> If you have no robots.txt file then it's an open question.

Only for definitions of explicit I must be unfamiliar with.

If the presence of a robots.txt makes one's intent for a given resource explicit one way or the other, the lack of one (and the lack of some communication in some other channel) must mean there is no explicit permission.

That is correct, for what it was worth IBM's legal team came down on the side of 'assume deny' and Google was (at the time I was there) 'assume allow.'

I think "assume allow" is perfectly reasonable. It's just implicit, not explicit.

To the extent to which that is the case, though, it isn't due to the terms of service; and that is also a case of how you are using the data for later, which is a separate question from the scraping and collection process: it is very clear to me that a search engine is operating on the legal equivalent of thin ice, particularly with details like snippets and synthesis ;P. Whether the CFAA applies (as indicated in this article) is an open question, but that just isn't quite so obvious as "you also can't connect up to the public sewer".

> it is very clear to me that a search engine is operating on the legal equivalent of thin ice,
We may be saying similar things, but metaphorically I think of search engines as operating on 'thick' ice. It has been litigated so much that there is a bevy of case law to refer to at all levels. Eric Goldman's blog used to have a pretty good list of the suits of various kinds, and the search engine blogs covered many of them as well.

For a search engine it is super clear, robots.txt is all. If you say yes explicitly, great. If you say no explicitly, that has to be honored. If you say nothing, then its up to the search engine to decide which way to interpret it, but if the site owner complains because you picked wrong you have to honor their wishes (which may include destroying any cached data as well).

PadMapper, Perfect10, and the newspapers generated a ton of cases based on 'scraping a web site and using the data.' There are also about a dozen comparative shopping sites that have been dinged for the exact same issues. (look vs Amazon or vs Walmart).

Whether CFAA, DMCA, Torte law (contracts), or something else applies is constantly being discussed :-). I'm just the messenger here. I haven't found a single case that has held that the point of view of the scraper of someone else's web site should prevail. The argument that it should be allowed 'to help new businesses get off the ground' is like saying Apple should pay out some of its cash hoard as grants to startups trying to break into some business. I have yet to read anything that was sympathetic to that point of view.

yuummm Torte law. (It's tort, and tort law is generally considered to be distinct from contract law, because in tort rights and duties come from common law whereas in contract law they come from acts of agreement between two parties).

Chocolate Torte is my favorite :-) Thanks for the clarification, in the various articles I've read over the years on this topic they refer to tort law (no doubt because much of the argument references common law and the way in which the relations are argued) and I made the leap to 'contracts' which was incorrect.

The scenario is a bit more nuanced though and creeps into Internet freedom.

- Reminds me of CraigsList vs PadMapper[1]. In that scenario I side with CL -- it was right to block PM. PM or others should not be allowed to build a new UI on top of CL because CL was the one that put in years of effort of nurturing its listings, its network, building brand equity and taking associated risks and costs.

- As others have highlighted, the data is publicly accessible and there is no agreement the scraper/crawler is bound by. The agreement is between the LinkedIn user and LinkedIn. The scraper is connected to the Internet pipe, crawling the Internet freely as it wants. It's not reproducing the data anywhere, so copyright should not be an issue.

- What if a scraper didn't scrape LinkedIn but just the Google or Archive.org cached versions and read those instead? It would not be pressuring LinkedIn server resources in this case.

- What if all of my employees allow me to scrape their LinkedIn data? Can I scrape all of their info? Can LinkedIn stop me from doing that (In the case of Facebook vs Power Ventures, the answer is that LinkedIn would be able to prevent this behaviour).

- Who owns the data? Medium.com doesn't own the posts. LinkedIn doesn't own the CVs.

  [1]: https://news.ycombinator.com/item?id=4286325

Now go read the 3Taps vs Craigslist cases (https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc.) to start. To be clear here, I feel like I understand your argument that facts aren't copyrightable or protectable and that how you got them is not relevant.

I'm just saying the legal system doesn't see it that way, they have said so in many cases, and so far everyone who has used your argument or variations of it in court has failed to prevail.

When you make a connection to the city sewers or to the power company, there is some kind of pre-connection step where the terms are presented and you agree to those terms.

With HTTP and LinkedIn, there is no such step. There's no pre-connection agreement. LinkedIn could present such an agreement on first connection, but they do not.

That argument has been tried in a variety of ways and been shot down in court repeatedly. (there are parallels to tenants not agreeing to the terms of their internet connection where the landlord provided it).

LinkedIn has two things that they do which protect them. First, they specify that they disallow access in their robots.txt file. While not a binding agreement per se, it is the default mechanism accepted by the community for identifying a priori whether or not automated access is allowed. Second, when they detect an access pattern that violates their terms of service, they actively block the access and proactively notify the source of the violation.

The sad truth is that web scraping has been around since the very beginnings of the Web back in 1993 and this question has been litigated in every way that you might choose to argue it, the body of case law is enough to fill at least two volumes in the reference section of the library.

There is no legal or ethical basis for scraping the web without permission. And if it isn't explicitly allowed by a site the presumption is that it is disallowed (no 'open door' exception).

When you make your connection to LinkedIn, what user agent would you provide? One that's blank, one that says "lelandbatey bot", or one that says "Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405"?

You agree to the terms of service when you sign up.

If you're talking about making anonymous requests to their service, they only allow a few of those before they stop showing you profiles. If you circumvent that protection, it's a bit more like hooking a cable up to a power line (illegal) or dumping your commercial waste in the sewer (illegal).

I disagree with your analogy. To me, the key word in "HTTP request" is request. A request is something that can be granted or not.

Perhaps more clearly: any HTTP request that LinkedIn believes is in violation of their terms of service will be denied. It can be hard to know when the first request arrives whether it is someone scraping the site or not, but once it is clear that someone is scraping, they actively deny all future requests. If they could know that an incoming request was going to be a scrape and not a page view, they would preemptively deny it.

But what is the difference between a scrape and a page view? If a human looks at it once, after scraping, does it become a page view? Is Pocket, downloading content on my behalf for me to read later, a scraper? What's the difference between a scraper and an offline browser whose content a human never browses?

True, but in making the request, you will provide information on who is making that request. If you say, "I am a bot!", and they grant you permission, your request is legal.

But if you say, 'I am NOT a bot', like spoofing a browser's user agent string, but you are a bot, then you are requesting access under a pretense, in order to circumvent their terms of service. Kinda feels morally wrong, and illegal.
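In HTTP terms, "saying you are a bot" just amounts to sending a truthful User-Agent header. A sketch using only the standard library (the bot name, contact URL, and profile URL are made up):

```python
import urllib.request

# Identify the client honestly instead of spoofing a browser string.
req = urllib.request.Request(
    "https://example.com/in/some-profile",
    headers={"User-Agent": "ExampleBot/1.0 (+https://example.com/bot-info)"},
)
# The server can now recognize automated access and choose to
# serve, throttle, or deny the request on that basis.
```

Spoofing is the same one-line change in the other direction, which is exactly why the header carries moral rather than technical weight.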

That argument works, insofar as it does, only for more recognizable bots and browsers. If I write a client of some sort that identifies itself as:

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Snackmaster Pro/666.0.666

What do you do?

I also tell my browser to lie about what it is sometimes, due to sites that are malfunctioning, but whose owners choose to document the errors instead of fixing them with "Use Chrome" (or IE, or whatever) checks.

Is that 'kinda' illegal or morally wrong (two very different things)?

If so, that seems like a belief that all sorts of browser defaults are 'kinda' wrong and/or illegal to change. Javascript? Lying about installed fonts/screen dimensions/whatever? Refusing to keep nonsession cookies between sessions? That slope would seem to get pretty slippery...

In your first case, if you are running on Windows NT 6.1 using WebKit in a new browser for humans called 'Snackmaster Pro', then you aren't doing anything wrong.

If by client you mean a robot, then you are pretending to be a browser and you are accessing the service without permission.

Let me ask you a question: say your client was hitting my service with that user agent, 100 times a second, crawling through URLs sequentially. Let's say I added it to my robots.txt deny list and started blocking that user agent. Would you change the user agent and continue?

If someone creates a site that says, 'Access to this site is for 640x480 browsers only, any other use is forbidden', then I think it's pretty clear that it's a stupid site, but also that faking your screen resolution is accessing the site without consent. There is no slope; someone (LinkedIn) putting explicit terms on their website is pretty clear.

Have you ever heard of "headless browsers" (like [headless Chrome](https://github.com/dhamaniasad/HeadlessBrowsers/issues/37))? What are some defining characteristics of browsers that are absent in scraping clients? If I open a browser window while doing the scraping, is that acceptable?

I very rarely use robots, and think I've only been "abusive" (not really abusive, in my book) once.

What if I send a null UA? Or use it as an opportunity to share my favorite quote?

What if the behavior of my software doesn't attack like a robot, does keep the request volume reasonable (use whatever you think is reasonable here) but also doesn't do what you might expect a human clicking around to do?

There isn't a universal 'I'm a bot' setting. There are user agent conventions, but they are hardly standard. Your point works in theory, but it's not something one can just implement and be reasonably confident that they won't be scraped.

No, it won't prevent being scraped. That's not the point I was making.

The point is, the scraper would have to hide their intentions and identity, which removes any claim that they are being 'honest' in their intentions and not trying to circumvent the service provider's efforts to prevent scraping.

The user agent header is not an authorization header. There is explicitly an authorization header.
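To illustrate the distinction: HTTP defines a dedicated Authorization header for access control, while User-Agent only identifies the client software. A sketch (credentials and URL are made up):

```python
import base64
import urllib.request

# Authorization carries credentials; User-Agent merely names the client.
token = base64.b64encode(b"user:secret").decode("ascii")
req = urllib.request.Request(
    "https://example.com/api/profile",
    headers={
        "Authorization": "Basic " + token,   # access control
        "User-Agent": "ExampleClient/1.0",   # identification only
    },
)
```

Treating User-Agent as a gatekeeper conflates identification with authorization, which is part of why UA-based blocking is so easy to evade.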

I understand the appeal of this argument, but there are clearly cases where a computationally valid request and response is illegal despite the fact that the server "chose" to satisfy the request. An obvious example would be any exploit where an attacker can construct a particular request and get access to someone else's private data.

The analogy of pouring waste is not accurate; we are looking at what is done with the service. It would be the equivalent of allowing drinking the water but not cooking with it, or using electricity only for specific devices. The contract is about rate and volume; what I do with the service is irrelevant. By this analogy, on the web I should be allowed to scrape as long as I stay within the bandwidth the website allows.

I mostly agree. The traditional method [1] should be honored. But clearly, there are some bad agents out there that ignore the rules and mess things up for everyone. They are pretty clear about their expectations:

    # Notice: The use of robots or other automated means to access LinkedIn without
    # the express permission of LinkedIn is strictly prohibited.
[1] https://www.linkedin.com/robots.txt

puuuh, somebody should explain the concept of user-agent groups to them (it would heavily simplify their robots.txt)

I don't think that's an unpopular opinion. But thinking they can/should have the right to press charges against you for trying would be, I hope.

I completely agree with this, but if a determined company is scraping the data, they can do various things to make the traffic blend in, it's not always possible for technical means to detect scraping. To which I say "oh well, deal with it or put it behind authentication", but some may disagree.
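Detection by volume is indeed the easiest signal to evade, but it is the common first line. A minimal sketch of the idea, a sliding-window request budget per client (thresholds and IPs are illustrative):

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0   # look-back window
MAX_REQUESTS = 30       # request budget per client within the window

_history = defaultdict(deque)  # client id -> recent request timestamps

def allow_request(client_id, now):
    """Return True if the request fits the budget, False if it looks automated."""
    q = _history[client_id]
    # Forget timestamps that have aged out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False  # over budget: deny (or challenge) the client
    q.append(now)
    return True
```

A scraper that spreads requests across many addresses or paces them under the budget blends right in, which is the point above: volume heuristics alone cannot reliably tell a patient bot from a human.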

They also state this clearly in their terms of service.

"You agree that you will not ... develop, support or use software, devices, scripts, robots, or any other means or processes (including crawlers, browser plugins and add-ons, or any other technology or manual work) to scrape the Services."


That said, this should be a breach of contract issue. It's an overreach to invoke federal fraud law.

Have no doubt this will be unpopular, but I think LinkedIn is right.

Most news sites publish to the world. But scraping a news site's content and monetizing it yourself is not OK. Legally, it violates intellectual property law. But laws aside, I assume most people would agree that if someone spent the time researching and writing an article, they, and nobody else, should have the right to monetize it.

In this case, IP law may not apply, but the concept is the same. I don't love LinkedIn myself. But they spent the time building a platform for collecting that info. I don't see why it should be OK for other people to scrape and monetize it.

I don't think that is an unpopular opinion. It also implies that LinkedIn is wrong: their HTTP server says yes, and they want to retroactively say no.

Isn't the server granting you permission to view the data, rather than giving you the data outright?

E.g., if it returns an image, that doesn't imply I can use the image anywhere I want.

Generally I agree, but with LinkedIn anything of value seems to require logging in, which means their terms of service come into play. This is very different from scraping data that they make open and browsable by all.

Legally, I think scrapers should respect robots.txt

They do have an amazingly large robots.txt in fact.
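For scrapers that want to honor the convention, Python's standard library ships a parser for it. A minimal sketch (the rules and user-agent string here are invented for illustration, not LinkedIn's real file):

```python
# Parse a robots.txt body and check what a given user agent may fetch,
# using only the standard library.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("MyScraper/1.0", "https://example.com/public/page"))   # True
print(parser.can_fetch("MyScraper/1.0", "https://example.com/private/data"))  # False
```

In practice you'd point it at the live file with `parser.set_url(...)` and `parser.read()` before each crawl, since sites change their rules.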


I'm following this story with a lot of interest. I've done (and still do!) a lot of data crawling/scraping. In the past I've worked on so-called "alternative data" collection and analysis for financial forecasting.

Without going into too much detail, a lot of hedge funds have teams constantly searching for kernels of data that can contribute some kind of signal for market movements. This data can come in the form of satellite imagery for oil tankers or manufacturing centers, but it can also come from the very creative use of scraped and aggregated data. It's typically very difficult to identify, collect and analyze on a technical level (as 'chollida1 has lamented in the past: normalization, labeling/bucketing and analysis of disparate data across different formats, sources and processing timeframes is a pernicious problem at this scale). From a compliance standpoint there are also generally strict requirements governing legality of use.

Depending on the specific data, you might be capable of predicting earnings or broader market movements with a <5% margin of error each quarter for years at a time (I've personally seen and worked on projects with <1%, but that's the exception, not the norm). That tactic is usually found at discretionary funds; at quantitative funds the uses are much more abstract and cross-pollinated so as not to target single equities, but rather holistic trends. Regardless, every fund is using data in some way these days; it's just a matter of how sophisticated, creative, and abstract they get in their analysis of it.

hiQ Labs doesn't collect data for this specific purpose, but it is absolutely related. In the past I have stayed away from crawling LinkedIn and Yelp precisely because they are very litigious (regardless of the eventual outcome and legality). Now that there's another relatively high-profile case out in the open like this, I'm interested in seeing how it proceeds and what the ramifications will be for companies that collect data across a wide range of uses. As Grimmelman mentioned in the article, this can impact a lot of types of businesses, not just those in the same space as hiQ. Outside of finance I am familiar with many tech companies which (openly or otherwise) kickstarted what are now widely known enterprises through cleverly crawling or scraping massive amounts of data.

Alternative data doesn't even need to be as sexy as satellite photos; hell, you almost certainly want the data that isn't sexy, the stuff people haven't thought of because it's too boring. Alternative data vendors above all want sales, and even the funds themselves want things to show off to clients. This gives you great opportunities to look at the alternative data they aren't touching.

Given that this is predominantly a web development community, it always surprises me how little creativity there is in the articles on investing. Neural networks and machine learning sound cool, but the reality is almost none of the readership would be able to make any money off them.

Simply tracking how many sales or users exist in databases by watching sequential IDs should be the go-to method for any web developer trying to get an edge. I would have expected HN to have articles where people are getting creative on that, ie trying to use measures of entropy on usernames to get rough subscriber numbers etc.

Even plain scraping of prices etc, is often full of great insight that is ignored. If a grocery store drops their prices in profitable categories against their competitors, that could be the signal about an incoming price war for an entire sector. There's a lot more information in that than social media feeds and all the other sorts of sexy data that get coverage in the media.

"...tracking how many sales or users exist in databases by watching sequential IDs..."

Seems like a great application of the German tank problem [1] that was mentioned on HN the other day.

[1] https://en.wikipedia.org/wiki/German_tank_problem
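For anyone wanting to try it, the minimum-variance unbiased estimator from that article is a one-liner: with k sampled serial numbers and observed maximum m, the population size is estimated as m + m/k - 1. A quick sketch (the sample IDs are made up):

```python
# German tank problem: estimate the largest sequential ID in a population
# from a random sample of observed IDs.
# MVUE: N_hat = m + m/k - 1, where m = max(sample) and k = len(sample).

def estimate_max_id(sample_ids):
    k = len(sample_ids)
    m = max(sample_ids)
    return m + m / k - 1

# e.g. four order IDs observed on successive test purchases:
print(estimate_max_id([19, 40, 42, 60]))  # 74.0
```

Run against IDs collected a week apart, the difference between the two estimates approximates weekly volume.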

The reason hedge funds look at satellite images and oil tankers is because everyone looks at sequential IDs and price changes, so that doesn't give an edge.

That's simply not true - equity analysts can cover anywhere between 5 and 500 stocks, do you really think they have the time or skill set to track all of that? It really is laborious, grueling work.

If you look at the possible returns the equity market is going to make from a stock in dollar terms, and at how much research spend is as a percentage of that, you'll quickly see it doesn't pay for much.

You can tell simply by looking at the broker research - that's probably the extent that analysts take things.

The big stocks obviously have a lot of it happening. (eBay listings, airline pricing etc is obviously touted a lot)

But once you start to go down to the mid caps, you enter a void where there isn't much heavy data-focused research done, and it's very possible you can have a better gauge of the business than any other investor on the planet once you pull this data out.

> Outside of finance I am familiar with many tech companies which (openly or otherwise), kickstarted what are now widely known enterprises through cleverly crawling or scraping massive amounts of data.

Google comes to my mind. I'm only partly joking, as it seems to me that the line between search engine web crawling and other forms of web scraping is very thin.

It's more than thin: just scrape a few more sites, make a public search interface and say you are working on a search engine. In the meantime, you might analyze the data for other (not so) related purposes.

I find scraping fascinating.

Are there "societies of scrapers"?

Inside, are certain sites more worthwhile - and which ones (eg reddit, eBay, trade union websites, whatever)

How about scraper brokers? Do they exist?

Are there scammy scrapers? Make up BS and sell as scraped data?

How big is this?

> Are there "societies of scrapers"?

Not that I'm aware of, no.

> Inside, are certain sites more worthwhile - and which ones (eg reddit, eBay, trade union websites, whatever)

Yes, absolutely. For many purposes websites that sell their own data are less useful (less signal exclusivity). Specific sources of data will be much more valuable depending on what the data is about.

> How about scraper brokers? Do they exist?

Yes. You're not getting access without an NDA in addition to paying quite a lot.

> Are there scammy scrapers? Make up BS and sell as scraped data?

That depends on how easy it is to verify the data. For most of what you'd term "alternative data" you'll know if it's real in 2 - 12 weeks, and it's not sustainable to sell crap.

But a lot of parties scrape dodgy financial timeseries data (ticks and quotes on equities or options) and sell it, priced as though it were tick data when it's barely accurately OHLC. They mostly sell this sort of data to amateurs who don't realize tick data is expensive for a reason.

> How big is this?

Very big. Most hedge funds ingest a lot of data whether they curate it internally or source it from elsewhere.

"satellite imagery for oil tankers" - Interesting.

If hedge funds floated drones above oil tankers, I'd guess they'd be accused of corporate espionage / spying / invasion of privacy?

Ok, so oil tankers are big and "in the clear". What if $TANKERCORP floats big parachute balloons above its tankers to imply "looking past these is unauthorized viewing"?

Then if a HedgeFund gets a clever angle on a satellite photo.. is that the equivalent of breaking a lock, or violating CFAA?

Satellite imagery like this is legal to within a few feet of practical resolution, pretty much anywhere. The effective countermeasure is hiding things from a satellite, not attempting to sue satellite operators for flying overhead. I'm not aware of a specific law against using anything that is literally viewable from the sky, at least in the United States (someone can correct me if I'm wrong, but last I checked Google Maps blurs out some locations or keeps them outdated because the government requests it, not because of a formal law forcing them to do so).

There are two other notes in response to your question:

1. Drones are different from satellites, and are more susceptible to regulation in the way you're positing because they can be prevented from flying above specific areas. However, most of the same problems with countering them apply, because drones can record better three dimensional footage. In your specific example, if a tanker disguised itself overhead, it would still be legal to have a drone monitor the tanker from the sides, as long as doing so didn't break any law set by the FAA.

2. Drones are actively used these days for things like monitoring production facilities ("how many cars come out of this factory" for an oversimplified version). If they have to monitor from a distance, so be it, they'll do it. The effective countermeasure here is to have a huge amount of land that can't provide any intelligence, because the drones aren't allowed to fly over it and can't see far enough in to the facility.

There's definitely a productive ethics discussion that can be had here, but the legal precedents don't really allow for combatting these techniques right now. If it's public, it can be collected, ingested and used in an algorithm to determine alpha.

> someone can correct me if I'm wrong, but last I checked Google Maps blurs out some locations or keeps them outdated because the government requests it, not because of a formal law forcing them to do so

In the Eastern European country I'm from (a NATO member), the Google Street View car even got to photograph, and publicly put on the Internet, the outside of military and air bases with clear "do not take photographs" signs visible on Street View itself. It's funny: my company also used to work in this space (a local business directory with business addresses, photos of said businesses, etc.), and one of my former colleagues got detained for a day for taking photos of businesses in the downtown area of one of the biggest cities in my country. He hadn't seen that there was a military "objective" in his line of view (probably some military HQ or something, not a proper military base with tanks and trucks). Talk about the advantages of being an internet giant like Google.

Later edit: I was talking for example about links like these: https://www.google.ro/maps/@44.4062748,26.0524843,3a,80.8y,2... . That is actually the HQ of NATO's "Multinational Division Southeast", whatever that is (http://www.nato.int/cps/in/natohq/news_125356.htm?selectedLo...). Fact is that if I were to take photos of those buildings as a simple citizen I would be breaking the law, not sure how Google got away with it.

Hedge fund researchers have also chartered private airplanes to fly over oil storage facilities and use infrared cameras to check tank levels. In the USA at least this is completely legal as long as they observe regular FAA flight rules. For the oil market as a whole this is a good thing since it helps price discovery.

It's not practical or safe to put a huge parachute or balloon over a tanker to block overhead imagery. Any sailor can tell you it just wouldn't work.

Well, it's pretty simple to legally dodge these kinds of threats.

If you scrape regularly, then pick up a dozen or more machines around the world, in areas less friendly to US law. Pay with a rechargeable credit card or bitcoin. Then buy servers and set up a Hadoop cluster that handles scan jobs.

The worst-case scenario is that LinkedIn, Yelp, and others get some of your servers shut down. Wash, rinse, repeat.

EDIT: please note, this was only a thought experiment for bypassing rude and destructive laws like the CFAA, which weaponizes ToSes, EULAs, and other implicit contracts of adhesion (as in, you have to agree just to look). Ideally, we would be better off if these laws had a sane scope, and if companies didn't expect control over what happens to content made public.

You're being too technical, I feel. Usually it's done in an even simpler way: you use a third-party provider from one of these countries (or one who aggregates the data for you), who does all the scraping and data cleaning for a reasonable price.

I build a lot of crawler services as well.

For the last one, I built 13 crawlers to continuously compare the prices of all drugstore products online. I sell it to drugstore e-commerce sites that want to know their competitors' prices, when competitors are running promotions, whether competitors' prices are higher... Well, I just automated the job of a person who was doing this manually every single day, looking through competitors' sites.
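The comparison step after such a crawl can be a simple diff of two price snapshots. A rough sketch of the idea (the function name, threshold, and products are mine, not the poster's actual code):

```python
# Flag likely promotions: products whose price dropped by more than
# `threshold` between two scraped snapshots ({product: price} maps).

def find_price_drops(yesterday, today, threshold=0.10):
    drops = {}
    for product, new_price in today.items():
        old_price = yesterday.get(product)
        if old_price is not None and new_price < old_price * (1 - threshold):
            drops[product] = (old_price, new_price)
    return drops

yesterday = {"shampoo": 9.99, "sunscreen": 15.00}
today = {"shampoo": 9.89, "sunscreen": 11.25}
print(find_price_drops(yesterday, today))  # {'sunscreen': (15.0, 11.25)}
```

The threshold filters out routine penny-level fluctuations so only deliberate cuts are reported.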

If this is illegal, then you'd have to ask Google Maps to remove all houses from its database and keep only the houses whose owners gave permission.

Both statements amount to the same bullshit; they are both wrong:

Google Maps: "The front door of my house faces a public street. That does not mean you can take photos of it for Street View and use a very smart OCR to read my house number. Not to mention that sometimes you show my house photo to others in a captcha, asking for the house number, c'mon!"

LinkedIn: "My CV is half-public. That does not mean you can point a crawler at it."

... This is only a cool discussion in North Korea or maybe in China. Not in the rest of the world.

I would suggest occasionally actually informing yourself of the wider world.

Your very example of Google Maps in Germany - well, turns out Google had to give people the option to opt out: https://europe.googleblog.com/2010/10/how-many-german-househ...

Same goes for web sites - robots.txt exists for a reason. If your crawler ignores that, well, I'd suggest talking to a lawyer.

I wasn’t aware that robots.txt had any legal meaning. Am I wrong?

> Am I wrong?

I hate to be the bearer of bad news, but... maybe. A reading of the Computer Fraud and Abuse Act could make robots.txt legally enforceable. And given the government's approach to CFAA cases, a very aggressive interpretation seems like a real possibility under the right circumstances (for example, when robots.txt provides evidence that the scraper knew the scraping was not authorized).

Among the many other things CFAA criminalizes, it makes it a crime to "intentionally access[] a computer without authorization or exceed[] authorized access . . and thereby obtain[] information from any protected computer;"

A "protected computer" is, among other things, a computer "which is used in or affecting interstate or foreign commerce or communication." That would probably cover just about any server on the Internet.


I wouldn't be too fast to jump to the conclusion that robots.txt is legally enforceable. You would need to cite prior case law; without any, a decision on how the law applies is error-prone at best.

The possibility still exists. After reading this (insanely broad) definition I think the chance is not even that low.

Saying something is illegal under the CFAA isn't really a stretch, it's one of the broadest statutes ever written. If you have an employee handbook at your job right now and are using the internet for your own entertainment you may be violating the CFAA.

The interesting question is: are you violating the CFAA in a way that will cause the executive branch to exercise its discretion to prosecute, and more so, is the CFAA even constitutional?

I'd talk to a lawyer. There are presumably some legal arguments that might apply. There definitely have been cases revolving around it, but I have no idea what the outcomes were.

From a moral point of view: robots.txt states an intent, and intentionally ignoring that is not a nice thing.

And often, "not a nice thing" translates into legal action. So, if this matters to you, you should find an expert :)

> From a moral point of view: robots.txt states an intent, and intentionally ignoring that is not a nice thing.

As a lawyer asked inane questions by friends, a common answer is: "I am not sure if it's legal to do, but I'm absolutely sure you're an asshole if you do it."

If you are wrong then there is no sanity left in the world.

If you post it on a public network, it is de facto public.

If you don't want it scraped, take it down, or put it behind a login.

If the user provides the login to a scraper, then the scraper has permission.

That's a pretty literalist, one-size fits all approach to policy. I don't think it's a good framework to use for applying ethics considerations.

If I can walk near a pool, should I also be able to run? Is running anything more than faster walking? If I'm allowed to be around the pool walking with my entry ID, should I also be allowed to place my ID on a little motorized car and make it dart around the pool really fast? Should I be able to duplicate my badge, put it on a bunch of little cars and direct them to quickly get all the floaties before anyone else? How about giving them all their own fake IDs? Now all the same questions, except there is a sign that explicitly prohibits all of these examples except for walking.

It seems disingenuous to argue that the automation and rapid increase of a thing should be allowed just because a thing is allowed. That doesn't typically match our intuitive notions of ethics in other parts of society, like driving or walking around a pool. Yes, you can walk around a pool as much as you want, but if you change to running then you have fundamentally altered your behavior through increased capability, not merely done "more of walking" to utilize more of your freedom.

I suppose a natural counterargument to this analogy might be that running around pools is unsafe, and scraping is not unsafe in the same way. But my point here is establishing that a behavior intrinsically changes into a different behavior if you increase the speed at which you're doing it or the capability at which you can do it.

The pool has the right to kick you out, same as any website. The pool cannot call the police and charge you with a felony for misusing their resources.

Does the pool have any recourse if you proceed to bypass the ban? Do you have to re-enter to pool to bypass it, or does sending in confederates with their own badges to continue your work also bypassing the ban? How about sending in new motorized cars?

The analogy is starting to break down, but I think it's still instructive for the problem of applying a simple first principles approach.

There is a legal concept known as "attractive nuisance"[1]. If I have a pool and neighborhood kids come to play and someone gets hurt, it's my fault. Even if I was away from my house and never gave permission (or explicitly forbade them from swimming), if I don't have proper access controls in place, the courts say it is too tempting for the neighbors to just come over and swim. I need to put up a locking gate to keep them out.

Likewise in some high-crime jurisdictions, if you did not lock your car you are liable for it getting stolen or broken into[2]. An unlocked car is too tempting for some people to just walk past and not take it.

I know it might sound crazy but you could make the argument that a massive pool of highly-structured and very valuable data just sitting out in the open is an attractive nuisance and steps should be taken to put it behind a locked gate. Once that requirement has been satisfied, normal trespassing laws apply.

[1] https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine

[2] http://www.cbc.ca/news/canada/montreal/ndg-resident-question...

Laws like that are ridiculous. You can see that by looking at how the reasoning does not expand to certain areas.

For example, if a woman walks down a dark alley wearing short skirts and gets raped, it isn't her fault. I mean can you imagine if we said "well, she was just an attractive nuisance!" The judge would throw the book at you.

They're usually targeted at children who don't know better and don't have the cognitive skills to understand consequences. Usually... but not always.

"An unlocked car is too tempting for some people to just walk past and not take it."

I am saddened by this.

I don't think it's true, and the source doesn't back it up. Even if it's illegal to leave your car unlocked, that doesn't mean it's your fault if someone steals it. Presumably you'd both be liable.

Bypassing the ban after you have been clearly notified of it is trespass.

That's basically what LinkedIn is arguing here. They have explicitly sent hiQ a cease-and-desist letter.

> The pool cannot call the police and charge you with a felony for misusing their resources.

Once they have withdrawn permission, they can call the police and you can be charged with trespass, though that's usually a misdemeanor rather than a felony.

The website can't really kick you out though, it can only kick your agent out and you can trivially create a thousand more. The website can politely ask you to stop just like the pool, but it can't actually do anything if you ignore it.

Except block your IP address, or your user agent, or the pattern your software makes when it connects.

Yes, that will cause potential issues for other people, which is why they tend not to do that, but if you trivially create a thousand more agents, and potentially trigger a degradation of service, how are you different to the people who block junctions at traffic lights?

I'm not keen on inconveniencing people, and "it's not that bad" is a poor argument for doing something that someone has explicitly asked you not to do.

I think you agree with me. If someone is abusing your website, or even just doing something you've asked them not to do, you can't literally kick them out like you could in person. There's all sorts of techniques for blocking whatever they're doing, but if they're determined and your website is still accessible, they can come back.

That's why it might be reasonable for laws around this sort of thing to be different in the virtual world.

Sure they can, if they know who you are; that implies authorization. Once LinkedIn puts its pages behind a login, it will be able to kick hiQ out.

The way HTTP works is there is a request and response. If LinkedIn does not want to provide a response, they don't have to. If they don't want to accept requests, then maybe the World Wide Web isn't for them.

So was Napster. Turns out, law>technology.

> I don't think it's a good framework to use for applying ethics considerations.

What does it have to do with ethics? The CFAA does not apply here, and that is what matters.

Ostensibly "ethical considerations" are a precursor to a law being ratified, and will also be used when a law is not sufficiently specified as to obviate the decision. If a party is not denying their activity in court, but rather claiming that it doesn't match existing laws, this would come into play.

But if you contest that or have an issue with the specific terminology I used, perhaps you'd prefer this terminology instead: "I don't think this is a good framework for interpreting the CFAA and establishing legal precedent."

I'm interested in this point. I was under the impression that the CFAA would apply here because they've explicitly stated what users are authorized and not authorized to do on their platform. Other than incur shame from the tech community they don't have to actually prevent the unauthorized access or use.

The publisher of the information is ultimately responsible for the disclosure of the information. How it is read is of no consequence, as the information has been provided to be read. Certainly, there are issues about resource usage i.e. heavy readers, such as scrapers, but throttling is a perfectly acceptable approach to overuse of resources for both parties, as access is still available over managed resources.

The nature of the original complaint is authorised access to publicly published information. Again, if you do not want people to read publicly published information, do not publish it publicly.

And as a side note - we don't need inappropriate analogies; the web is real, we can discuss the real issue.

What LinkedIn is doing seems more like a pool threatening to sue people that look at the facility itself, from outside of their property. It's not clear whether viewing a public website is more (reasonably) similar to entering someone's property versus looking at their property from outside of it. To me, it seems much more like looking at it from outside. But, fortunately for LinkedIn, they can prevent people from viewing their site! (They just have to figure out for whom they want to do so.)

I like this analogy. I think maybe the answer would be to include technical limitations, like API limits so scrapers can only work as fast as a human. Or maybe captcha interaction like select all images of a car could be integrated? I think it should be up to the pool operator to make sure people follow the rules they put in place, not call the cops because someone is running.
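A rate limit like that is commonly implemented as a token bucket: each client accrues tokens at a fixed "human-speed" rate, each request spends one, and requests beyond the budget are refused. A minimal sketch (class name and parameters are illustrative):

```python
import time

# Token bucket: allows short bursts up to `burst` requests, then
# throttles to `rate_per_sec` sustained requests per second.
class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=1, burst=2)
print([bucket.allow() for _ in range(4)])  # [True, True, False, False]
```

A server would keep one bucket per client IP or API key, and could reserve captchas for clients that repeatedly exhaust their bucket.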

Yes! Exactly. This is the best, most clear argument for how cases like this should be handled.

What if the user is provided with permission to view medical records - your medical records for example? Would it be ok to scrape and sell those? What about children's school profiles and records?

Selling medical records would fall under a different law; the same goes for sharing credit card numbers you collect.

I think the issue here is around public data. If you made medical records available to the public, you would be in trouble in many countries. Sharing them may land you in the same trouble, under those laws, as the website that initially posted them. But the person shouldn't be charged with unauthorized access, because the materials are public.

See the first line.

If there is a problem with that, see the second line

For shame!

At the risk of bringing out the "you wouldn't download a handbag" brigade - you could make the argument that just because I leave my bike unlocked on the street doesn't give you a right to ride it off.

Even if you go for "forgiveness rather than permission", if you ride it a second time after I've told you I don't want you riding it, then you're in the wrong.

Obviously, there are philosophical arguments to be made about the status of information online, but if the information were provided to LinkedIn, to be used on LinkedIn properties, I as a user would take umbrage with it being taken by a 3rd party, even if it were viewable publicly.

The bike example is a very poor analogy - the data isn't removed from LinkedIn, it's merely copied. It doesn't matter how many times the site is scraped, the data is unchanged and still available. A better analogy would be me taking a photo of the bike while walking past. It shouldn't matter how many times you tell me to stop, if it's on public property you can't really stop me from taking the photo.

If you have an issue with that, you should be moving your bike somewhere people can't take a photo from the public street. Not have someone creatively interpret a law that says where I am is suddenly not public property, because you asked me to stop using my camera.

Not sure why you're being downvoted; the Web was designed to be public. If you want to prevent me from taking pictures of the exterior of your cafeteria from the public street, you'd have to build a wall / put it behind a login. But then don't complain that you're losing customers because they can't see it or can't find your site's information through search engines.

OK, say I left the prototype for a new bike on the pavement, and you came along and 3D-printed an exact replica. Sure, the original still exists, but you just violated trademark/registered-design law because the prototype was there. It's not the same as a photo, because a photo of a bike doesn't give you the same value as the actual bike, whereas scraping the content of a webpage does give you the same value.

Perhaps my analogy wasn't great, but the grey area is around going to LinkedIn's server (whether or not this is "public" or their property they allow you access to is another philosophical question, though in the eyes of the law it appears it's the latter), deliberately extracting value from it, and then getting annoyed when you're asked not to.

Inherently it seems as though it's the old question of whether a server is public, or private but accessible (like those POPs [0] there was a thread on recently).

[0] https://en.wikipedia.org/wiki/Privately_owned_public_space

> you just violated trademark/registered design laws

If it has those. The data on LI (other people's employment histories) is not its own IP.

What if I paint the bike artfully, put a copyright notice on it, and you sell the photos you took?

Standard IANAL.

That would be copyright infringement because photographing a copyrighted work is considered reproducing the work under the law.

> A better analogy would be me taking a photo of the bike while walking past.

Taking a photo of the bike is also a poor analogy. Scrapers don't take one photo, they take photos of all the bikes. And scrapers don't keep the photos for themselves, they sell them for a profit. Also, the original bike isn't parked, it's placed in a gallery (probably with an admission fee? I don't know the business model of linkedin).

The number of photos is irrelevant to the analogy, though, as is what people do with the photos afterwards. If the bikes are visible from the public street, people can take as many pictures of every bike they want, and then make money from them if they want. It doesn't affect the owners' usage of the bike (unlike the original analogy, where the owner loses access, which was what I was trying to correct)

Physical analogies for this kind of thing are always flawed, it's just dishonest/misleading to pretend that copying data is ever analogous to taking a physical object (the owner of the original is never deprived of the original when data is copied).

"I don't know the business model of linkedin"

Most of it is selling premium features to recruiters and other businesses. I'm not sure if hiQ's service interferes with that or not, but LinkedIn should not be trying to have their cake and eat it by leaving things in public and then complaining when the public accesses them in a way they don't like.

> The number of photos is irrelevant to the analogy

Actually, the size of the data and the number of requests are very relevant. More data means more information, which means more money. It also means more bandwidth and processing power required to serve the requests. You're not taking a photo of the bike; you're asking the bike to give you a photo of itself.

> it's just dishonest/misleading to pretend that copying data is ever analogous to taking a physical object (the owner of the original is never deprived of the original when data is copied).

Leaving LinkedIn aside, possession of the original data is never the issue with digital piracy. It's a straw man. The hurt occurs when people benefit from the work the original author put into creating that data without proper compensation. Just because you can clone my gizmo (which I spent years working on) without taking the original one doesn't mean you're not hurting me. That gizmo could give me an advantage you wouldn't otherwise have. I put hours of work into something that doesn't put food on the table because you can clone my work, but I can't clone my food.

There's a reason an empty CD costs 50c but a music album costs $10. You're not paying for the physical medium. You're paying for the IP. And yes, digital distributions are cheaper because of this, but that doesn't make them free.

> Most of it is selling premium features to recruiters and other businesses.

I'd say it's pretty obviously interfering with their business model.

> LinkedIn should not be trying to have their cake and eat it by leaving things in public then complaining when the public accesses it in a way they don't like.

LinkedIn could ban IPs that make an unreasonable number of requests in a short amount of time.

If LinkedIn are being that negatively affected by a single scraper, they should deal with it - block it, only allow a specific number of requests from an IP per day, anything that doesn't involve lawsuits. The problem is them trying to pretend that publicly visible content is really private if they say so, without them trying to protect it in any real way.
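A per-IP limit like the one suggested above is only a few lines of server-side code. This is a minimal sketch of a sliding-window rate limiter; the window size and threshold are invented for illustration, not LinkedIn's actual limits:

```python
import time
from collections import defaultdict, deque

# Hypothetical per-IP sliding-window rate limiter, the kind of
# server-side control suggested above instead of lawsuits.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip, now=None):
    """Return True if this IP is under the limit, False to block."""
    now = time.time() if now is None else now
    q = _hits[ip]
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True
```

In practice this would sit behind a load balancer or reverse proxy, but the logic is the same: no court required.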

"The hurt occurs when people benefit from the work the original author put into creating that data without proper compensation"

Not necessarily. If I'm paying for a print of some imaginative artwork that was created using the picture of the bike, that doesn't mean the bike owner lost anything, even if he spent time building the bike with his own hands. Similarly, if the only reason people paid Hi-Q was the extra work that they put in, LinkedIn didn't lose money, because people would not have bought their product without that extra work.

There is certainly an argument that Hi-Q should have licensed the content first, but it's public data. If they want to make licence deals, don't put the data in view of the public street and then whine when people document what's in public.

"It's a straw man."

No, the straw man is pretending that a copy is the same as theft. Theft is theft because someone is depriving you of the original, not because you imagine you might have had more sales if the copy didn't exist. There's a reason why there are different words for different things, and pretending that a copy is the same as taking a physical object is a lie. Period.

"I place hours of working into something that doesn't put food on the table because you can clone my work, but I can't clone my food."

But, you put the price up too high, so I opted not to buy it. Maybe borrow the CD from a friend, or listen to something else. Or, you decided I couldn't buy it in the format or region I wanted. There are real issues, but pretending that a copy = a lost sale is utter bull that's been debunked time and time again, yet is regularly repeated by people trying to inject emotional arguments instead of facts.

"I'd say it's pretty obviously interfering with their business model"

Then perhaps they should address the business model or not put their content out there in public unprotected if it's that valuable to their income.

"LinkedIn could ban IPs that make unreasonable number of requests in a short amount of time."

Yes they could. Which would not have to involve the courts in any way. Or, they could protect the content in some other way that (for example) requires a log in and adherence to T&Cs, with which they could easily kick violators off their site for non-compliance.

The issue is that LinkedIn are trying to have it both ways: gathering the benefits of public content while blocking others who use the now-public content in ways that are usually acceptable for public content. Sorry, that's not acceptable; pick one: take the content away from the public street, or accept that some people will use what has been shown to the public.

> If LinkedIn are being that negatively affected by a single scraper, they should deal with it - block it, only allow a specific number of requests from an IP per day, anything that doesn't involve lawsuits. The problem is them trying to pretend that publicly visible content is really private if they say so, without them trying to protect it in any real way.

With this, I agree 100%.

> No, the straw man is pretending that a copy is the same as theft. Theft is theft because someone is depriving you of the original, not because you imagine you might have had more sales if the copy didn't exist. There's a reason why there are different words for different things, and pretending that a copy is the same as taking a physical object is a lie. Period.

That's just pedantry. The debate isn't between "copy" and "theft", it's between "theft" and "copyright infringement".

> But, you put the price up too high, so I opted not to buy it. Maybe borrow the CD from a friend, or listen to something else. Or, you decided I couldn't buy it in the format or region I wanted. There are real issues, but pretending that a copy = a lost sale is utter bull that's been debunked time and time again, yet is regularly repeated by people trying to inject emotional arguments instead of facts.

This is wrong on so many levels, I'm not sure there's any point in continuing this debate. Are you accusing me of using emotional blackmail instead of facts because I point out that "you can clone my work, but I can't clone my food"?

I'm not using myself as an example because I want pity. I'm doing it because it's easier in writing, and because I'm a software developer.

My work takes hours upon hours of time and effort (not counting the hours I spent in school). If it's OK for everyone to clone my work, I won't make any money from it. We still live in a society where goods and services are exchanged for money. I exchanged my hours of work for no money, but I can't exchange no money for basic living necessities such as food. There are no feelings involved here. In the current economy, work going in and no food coming out is not a viable business model. And if nobody paid for digital content, there would be a lot less digital content.

> pretending that a copy = a lost sale is utter bull that's been debunked time and time again

This is another straw man. Whether or not an illegal copy is or isn't a lost sale is irrelevant. You don't have the right to make that copy in the first place. If everyone made illegal copies, there would be no sales. So then why should only some be entitled to illegal copies? There isn't a distinction between people who can make copies and people who must pay for copies, so either everyone must pay for copies or no one must pay for copies. That's how law and economy work. You can't make exceptions by yourself. Either everyone is allowed, or no one is allowed. And for digital content that is for sale, no one is allowed illegal copies. If laws are made that allow poor people to receive goods for free, these laws must address both digital and physical goods.

So? How does this impact the bike owner in any way?

The owner makes money from the admission fee clients pay to see the bike. Taking photos and selling albums defeats that purpose.

But the owner is offering the images up for free to the public. hiQ is taking those publicly-available photos and annotating them with, "red bike", "pink bike", "broken bike", "professional bike", etc.

It's clearly a value-add and not theft.

It depends.

You have to keep in mind that an entire generation was brainwashed into believing that personal data isn't that "personal", so Google, FB, and the rest can make amazing profits.

Most of these discussions are tainted by general unawareness of privacy and copyright law.

Of course, because of the low value at stake for the data supplier (usually a single person), these cases never reach the courts, which just reinforces the ongoing misconceptions.

If you really want to test this, try copying the content from Google, Facebook, hiQ, or whoever else is big enough to go after you.

But people somehow believe that it's okay for businesses to do what regular persons aren't allowed to.

That law needs to be abused into the ground. We should all be filing frivolous lawsuits claiming CFAA violations whenever someone we don't like is accessing our sites. All we need is a little disclaimer in 5pt text stating who does not have authorization, like say all members of Congress.

I'm seeing a lot of bad analogies thrown around in this comment thread, mostly based on emotional response and/or a dislike for LinkedIn.

As someone who has done a lot of scraping in the past (sometimes for good, sometimes not), the number one thing you need to respect as a scraper is that email or phone call you get saying "Stop doing that."

In almost all instances, you're legally fine in the real world until you get some communication to stop and/or are blacklisted. After that point, what you are doing becomes a crime.

- LinkedIn is not a public resource; it is a private company that pays for servers.
- LinkedIn might scrape too, but that argument isn't going to hold up in court, and the scraping they do is probably in line with their EULA (protip: never install a social networking app on your phone, ever).
- The analogies to the storefront, taking pictures in public, etc., all break down because scraping LinkedIn requires you to access their resources.
- The analogy to browsing a store is great. If you are in a store, and they ask you to leave, and you don't, that's trespassing. Trespassing isn't legal.

The CFAA isn't a great law. There are a lot of gray areas. But LinkedIn seems within their rights here.

If anyone wants to know how this is going to wind up: https://www.eff.org/deeplinks/2015/06/padmapper-and-3taps-se...

Many years ago, I wrote a scraper for a certain .mil financial website. My company at the time - and its customers - had legitimate accounts on that website, but the site was horrible to use. Page loads took upwards of a minute, and there were no front-page notification systems to let you know if any of the information on the site had changed. Kind of a big deal when it was a window into the government's accounts payable system. For instance, if you didn't notice that an auditor had left a comment on an invoice, and you didn't respond to it within X hours, then the gov't could deny the invoice and you'd have to start over. The end result was our customers paying full-time employees to manually examine every single open invoice in the .mil system every single day, which just increased load on the website and made it even slower.

Enter my scraper. It copied data into a local PostgreSQL database that our customers could run reports against. A process that used to take a human 6 hours a day now took 30 error-free seconds. The scraper was even a benefit to the website, as we ran it overnight during low-load conditions, and because my software was smarter than a web browser, we could retrieve the same information as a human with about 1/3 the number of web requests. Perfect, right?

Well, no. We got an angry call from the developers complaining that we were the ones making their site slow, even though 1) we measurably weren't, 2) we were lighter on resources than the humans we were replacing, and 3) we scrupulously obeyed their rate limiting requirements and erred on the side of running 10% slower than they had originally requested from their customers.
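The pacing discipline described here is easy to get right in code. A minimal sketch of the "obey the requested limit, then err 10% slower" pattern, with placeholder numbers (not the actual .mil site's limits):

```python
import time
import urllib.request

# Sketch of "scrupulously polite" scraping: obey the site's requested
# minimum interval between requests and deliberately run 10% slower.
REQUESTED_MIN_INTERVAL = 2.0   # seconds the site asked for (placeholder)
SAFETY_MARGIN = 1.10           # run 10% slower than asked

def pace(last_ts, now, min_interval=REQUESTED_MIN_INTERVAL * SAFETY_MARGIN):
    """Seconds to sleep before the next request is allowed."""
    return max(0.0, min_interval - (now - last_ts))

def fetch_all(urls):
    """Fetch pages sequentially, never exceeding the padded rate limit."""
    last = float("-inf")
    for url in urls:
        time.sleep(pace(last, time.monotonic()))
        last = time.monotonic()
        with urllib.request.urlopen(url) as resp:
            yield resp.read()
```

Scheduling the whole run for overnight hours, as described above, is then just a matter of when the job is launched.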

That particular problem went away when we pointed out that their shiny new website wasn't Section 508 accessibility compliant and could not be made to be without literally throwing out their entire web service and starting over, but our website was, and that if they continued to allow us to serve screen reader compatible pages to our disabled customers then there wouldn't be a need to have the .mil website shut down and a Congressional investigation launched. All parties involved decided this was a reasonable compromise.

> But LinkedIn seems within their rights here.

Congress writes bad laws all the time. So LinkedIn might be within their rights, but that doesn't mean they should have those rights.

It's bad for innovation to allow for selective discrimination like this. LinkedIn is perfectly happy to allow Google, Yahoo, Bing, and many, many more companies to scrape their content and use it for personal profit. Giving them the option to sue an upstart for doing exactly the same thing as Google is unfair and oligopolistic.

Letting established tech giants get away with this will slowly erode American dominance in technology.

The question is, how would you better define hacking? To make the letter of the law match the spirit? "Exceeding authorised access" is an early attempt that predates modern internet usage. You can't just say "anything the computer lets you do, is legal" because code exploits are just a computer following the (poorly-written) instructions in its code.

I'm not a lawyer, but it should start with a clearly demonstrated attempt at prohibiting access to content. The words "don't use this" are not enough. There need to be active, ongoing safeguards to protect the data, i.e., authorization tokens, credentials, encryption keys, etc.

In the 80s, it was reasonable to assume that connecting to some port on a remote machine owned by another person or company could constitute unauthorized access. But today, billions of people connect to ports on remote machines thousands of times a day for completely legitimate reasons, so it's reasonable to assume that data that can be accessed by just asking nicely over the internet is considered intended for public consumption.

It seems permissive, but I think that's a crucial component. If some company accidentally makes their S3 buckets public, it's completely unfair to say that accessing that information is illegal, especially when they are serving up other information in public S3 buckets which they do want people to access.

> Giving them the option to sue an upstart for doing exactly the same thing as Google is unfair and oligopolistic.

But it's not. LinkedIn's data is their entire business. They are within their rights to restrict access to it.

This is the classic ant and grasshopper story. If HiQ wants access to the type of data they are scraping from LinkedIn, they can build that data themselves.

> But it's not. LinkedIn's data is their entire business.

"There has grown in the minds of certain groups in this country the idea that just because a man or corporation has made a profit out of the public for a number of years, the government and the courts are charged with guaranteeing such a profit in the future, even in the face of changing circumstances and contrary to public interest. This strange doctrine is supported by neither statute or common law. Neither corporations or individuals have the right to come into court and ask that the clock of history be stopped, or turned back."

> If HiQ wants access to the type of data they are scraping from LinkedIn, they can build that data themselves.


They are within their rights to require authorization for their content, yes, but they don't do that. They make it public and allow some companies to scrape that content and resell it for profit (like Google), while restricting other companies from doing the same.

If they don't want their data to be public, then they shouldn't make it public. They could require authorization to view any content on the site and solve this problem instantly.

What leads you to believe that LinkedIn's data is public?

I believe mywittyname is specifically referring to the LinkedIn profile pages that are publicly visible without requiring a login and is not claiming that all LinkedIn's data is public. Thus, the question is why doesn't LinkedIn simply hide all profile data behind user authentication?

Ah, complaining when people look at the painting in your store front gallery on 5th avenue because you have to pay for the upkeep of the sidewalk.

Actually, no.

It's more like complaining about others selling tickets to view said painting from the sidewalk. HiQ repackages and sells data it scrapes from LinkedIn.

Actually, no.

It's like taking pictures of the paintings from the street, and reselling those pictures.

If they don't like that... take the painting down. Simple.

Although I agree with you in principle, I think your analogy is flawed. The nature of a web request is that you're asking in the first place, and the server has to serve you. If you're taking a photo, you're not asking for anything; data transfer has a cost, too.

That is an interesting analogy. It reminds me of my experience visiting the Alamo in San Antonio. Once inside you cannot take photos. I cannot say for sure but I think there are armed guards that made sure cameras were not used.

Personally I think anything I can see I should be able to take a photo of for sentimental purposes.

Hmm, then that becomes a copyright issue I suppose.

The type of data being discussed here (factual data about people) cannot be copyrighted - e.g. the fact that John Doe is a Software Engineer for ACME Inc is not copyrightable.

A better example, I think, is walking into a store and writing down what they price everything at, then selling aggregate pricing data to people. I can't see any reason that would be illegal.

Would likely be illegal if you continued to do so after the store asked you to leave...

Compilations can sometimes be protected.

Well, it's a Database Right, which is a property right rather than copyright.


As with all rights, it varies with jurisdiction.

Sounds like an Intellectual Property violation to me. You don't have to hide your IP in order to protect it: there are laws that deal with this.

What IP is contained within LI?

I wasn't talking about LI, I was talking about the photo scenario from the comment I was replying to.

Of course my original statement was an oversimplification, but I don't think it is like selling tickets; it's more like someone using a picture of the painting in the window to market their own products, since one person selling tickets to view the painting wouldn't do very well given that they don't really 'block' the view for anyone else. I can still walk up right next to their queue and look at the painting. LinkedIn is arguing that no one should be allowed to walk by and take a picture, which is a freedom-of-panorama issue with all that that entails (in the US it would suggest that they don't have a case if they tried to use that argument by analogy).

edit: didn't see the sibling reply which makes the same point

Even if successfully litigated, doesn't this just move the scraping activity to less-obvious means, and to better-funded scrapers less concerned with legalities, making LinkedIn's efforts to clamp down upon scraping even more difficult? Is there a business case for LinkedIn to monetize the scraping by selling access to an API instead?

Between bot-nets, mechanical turks, deep learning, data brokering, lack of globally-enforced privacy laws that require divulging of sourced personal data, etc., I can't see a way for LinkedIn to prevent others from scraping and gaining from their publicly- and user-accessible data. They'll drive it underground, but if the concern is preventing others from grabbing the data at all, versus performance management, they'll still leak like a sieve.

LinkedIn gutted their API a couple of years ago, and the information they took out has been moved to a private API.

If your business is in the recruiter space, expect to have your API keys revoked and to receive a cease and desist letter as well.

This sounds like the setup to an up-spiraling arms race with "dark scrapers". I'm guessing behind-the-scenes, LinkedIn figures they can out-spend the extralegal scrapers, and likely considers their efforts will deliver halo effects to the rest of Microsoft. It would be educational to hear how LinkedIn plans to take down bot-net-based scraping that uses deep learning to identify patterns that successfully mimic human users and bypass their bot detection; could possibly help other white hats who want to battle bots and general malware.

I'm skeptical of the potential of dark scrapers at scale. You'd need to simulate too much human behavior to be unidentifiable, and humans are slow.

You would need real-looking bot accounts to scrape with. You'd need a realistically randomized rate limit, sampled from some distribution conditional on the type of the source page. You'd need realistic mouse/keyboard movements. Realistic hours of operation; you can't be scraping at 4AM and 4PM and all of the hours in between. Occasional noise operations, such as searching for a job or getting salary estimates. You'd be geographically constrained: you wouldn't want your bot from Boston to be looking at too many individuals in Houston (regularly). Maybe you'd use a Markov chain to have the bot make decisions? I doubt the blackhats would have good training data for a neural net. You'd need tens of thousands of these bots to cover the LinkedIn user base in reasonable time (say, once every week on average), and these bots would have to either overlap or seriously underlap on who they cover.

The best use case would be a scraper API that you can use to look up batches of specific people, with your bots looking at others only to appear realistic.

(Or maybe not? It's a fun question, but I know fuck all about this. Not my area of expertise.)
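In the same "fun question" spirit, the randomized-rate-limit idea above can be sketched in a few lines: sample think-time from a lognormal distribution whose parameters depend on the page type, and gate activity to plausible waking hours. Every number here is invented for illustration, not measured human behavior:

```python
import random

# Toy sketch of a "realistically randomized rate limit": per-page-type
# lognormal think-times plus a crude operating-hours gate.
THINK_TIME = {                 # (mu, sigma) of log-seconds per page type
    "profile": (2.5, 0.6),     # median ~12s spent reading a profile
    "search":  (1.6, 0.5),     # median ~5s scanning a results page
    "feed":    (2.0, 0.8),
}

def next_delay(page_type, rng=random):
    """Seconds to wait before the next action on this page type."""
    mu, sigma = THINK_TIME[page_type]
    return rng.lognormvariate(mu, sigma)

def within_operating_hours(hour):
    """Crude 'human' schedule: active roughly 8am to 11pm local time."""
    return 8 <= hour <= 23
```

A lognormal is a common first guess for human dwell times (long right tail, never negative); a serious effort would fit the distribution to real traffic, which is exactly the training-data problem mentioned above.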

Along the lines of "fun question, I'll take a stab at it just for giggles"; this would be far more interesting as an interview question than "estimate how many soccer balls can fit in a 747".

Average botnet size is 20,000 compromised PCs; Srizbi is estimated at 450,000. Another vector I'd explore is teaming up with crypto-miners. As I understand it, there are no economic returns in tapping CPUs any longer, so miners are using only GPUs and ASICs; if this is true, they'll have some spare CPU cycles that they'd probably be willing to rent out to get marginal returns on the CPUs that have to run and manage the mining chips, running a JVM or some other VM. If we can do that, then we can probably tap 2-3M hosts, many of them rotating in and out per day.

Throw out an army of mechanical turk assignments to get real humans to register fake accounts. They get paid upon submitting an account and password, which your scraping servers verify, then change the password and commandeer. Perhaps have them register the fake account while running under a container or VM on their computer; the container/VM is instrumented to capture all activity. The activity metrics and data are uploaded to a deep learning system, that identifies the patterns that work and the ones that don't, and uses that to guide the developers of what to randomize, and by how much.

Add in a component to randomly invite/follow other fake and real accounts, and generate Markov-chain-generated copypasta. Set aside a portion of the fake accounts to only build up networks of users. Initially restrict the market of customers to those who only want once-a-year-updated data. As the network builds, use the notification of changes to selectively scrape only changed user profiles, and upsell for more up-to-date profiles at that time.

If I was LinkedIn, I'd probably concentrate on infiltrating botnet operators, and shutting them down. It would be one large cat-and-mouse game.

If the goal of such an operation were to effectively create an alternative to LinkedIn, along the likes of other "claim your listing" sites, then this could be a worthwhile cause.

Meanwhile LinkedIn was happy to scrape its users' email archives and contact lists. It would appear there's some karma at play here...

Seriously, they made it seem like I was connecting with my contacts who already had LinkedIn accounts.

It turns out I was spamming my friends to get them onto LinkedIn.

This was in the early days of LinkedIn. It was such a douchey move; even with a warning, their UI made it easy to mass-spam your contact list in 1-2 clicks.

That is an interesting point: LinkedIn itself violated the rule against scraping that they are now relying on to prevent scraping.

Scraping would be forbidden via the TOS; theoretically it would be the users giving LinkedIn access (who are actually bound by the terms) who would be liable, not LinkedIn.

If it's illegal to scrape without permission, that makes the behavior of scraping illegal. They are not claiming it's a TOS violation, they're claiming it's a CFAA violation, which is a federal statute and (theoretically) applies equally to everyone.

My point is that someone with permission gave them access.

Having "permission" is not an applicable defense to a federal criminal charge, the law supersedes some random person (i.e. a LinkedIn user) saying "oh yeah, that's okay."

That still wouldn't prevent an entity (person or corporation) that never agreed to the TOS from scraping.

As one of the first LinkedIn users (premium, ads, and a rejected API customer), I don't buy the stories LinkedIn and Reid Hoffman tell (I don't know Reid personally). I think LinkedIn is a scam, where the service is built for the recruiters and job postings while they make you believe they are thinking of you. They never build for users the way OkCupid or FB do; they work for businesses, aside from adding basic introduction messaging.

> LinkedIn is a scam, where the service is done for the recruiters

Everyone knows that LinkedIn gets paid by recruiters rather than recruitees. But of course, that is perfectly fine. People are happy to be the product if it results in job leads.

This is not what they are selling with the premium offering, and this is why they are lying. They are offering a way to improve selling your products and services, but without any technical innovation beyond messaging.

LinkedIn has two main functions:

1. A self-updating Rolodex for salespeople.
2. A recruitment tool.

It works for both, no real competitor in sight due to the massive network effect.

Some niche networks are doing ok, like Xing in DACH region. Not aware of anything special in China, guess everyone is on WeChat anyhow.

Seems like Job websites like Indeed Prime and Google's new job product are serious competitors to LinkedIn.

LinkedIn is almost entirely worthless without the occasional good lead from a recruiter. I find it's mostly drowned out by the huge amounts of insignificant skin peddlers who dole out one worthless lead after another, looking for some fresh skin to peddle.

This is not what they are offering; look at the premium offerings at https://premium.linkedin.com/. Do you know the ROI of using LinkedIn Premium (e.g. InMail) to just contact people, versus the zillion other methods available, and then adding them to your LinkedIn...

Not sure where the confusion is.

Below the marketing language, it is exactly those two points. I guess it is easier if you work in enterprise; this business speak is a different language: pretty verbose, low information density.

I think LI gets paid by ads.

It would be interesting to put a TOS into your scraper via the Pragma or User-Agent field, saying that if you return data to the scraper, you accept the TOS.
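Mechanically that's trivial; the interesting question is whether such a notice carries any legal weight, which is exactly what this case is about. A hypothetical sketch (the scraper name and TOS wording are made up):

```python
import urllib.request

# Declare your own "terms" in the User-Agent header, so the server
# sees them before deciding whether to respond. Purely illustrative.
TOS_NOTE = ("example-scraper/1.0 (+by serving this request you agree "
            "the response may be stored and analyzed)")

def build_request(url):
    """Build a GET request carrying the declared terms."""
    return urllib.request.Request(url, headers={"User-Agent": TOS_NOTE})

def fetch(url):
    with urllib.request.urlopen(build_request(url)) as resp:
        return resp.read()
```

Of course, by the same logic LinkedIn would argue that their own TOS, served the same way, binds the scraper first.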

> One plausible reading of the law—the one LinkedIn is advocating—is that once a website operator asks you to stop accessing its site, you commit a crime if you don't comply.

I'm fine with that.

But I'm also fine if they do what the article suggests and require anything non-scrapable to be behind an account prompt, even if everyone with an account can access it.

I don't think it's fair to make LinkedIn foot the bill for someone else's business. They shouldn't have to serve that content to people who aren't actually their users.

> They shouldn't have to serve that content to people who aren't actually their users.

So put it behind a password. It's not reasonable to expect to get only the benefits of publicly-searchable data without any of the drawbacks.

The policy is also harmful to innovation.

If a Google-like competitor started, all Google would have to do to crush them is demand big-name sites formally prohibit the competitor from accessing their content or risk being delisted. And magically, it becomes impossible/illegal to build a DuckDuckGo.

Isn't that monopolistic behaviour? I suspect that it might violate some laws, though I don't know anything in particular. On the other hand, why couldn't DDG just do what Google does and say "robots.txt prevents us from getting a description for this site"?
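For reference, robots.txt already lets a site single out one crawler while welcoming everyone else; that's what makes the selective-exclusion scenario above plausible. A quick sketch with the standard library (the rules below are invented):

```python
import urllib.robotparser

# Invented rules: block one named crawler, allow everyone else.
RULES = """\
User-agent: DuckDuckBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# The named crawler is shut out; everyone else gets through.
duckduck_blocked = not rp.can_fetch("DuckDuckBot", "https://example.com/profile")
google_allowed = rp.can_fetch("Googlebot", "https://example.com/profile")
```

robots.txt is only a convention, though; the CFAA question is whether ignoring such a notice becomes a federal crime.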

On one hand I'm against the cartel behaviour of Google doing something like that, but on the other hand, if Google asks and the other company agrees to block DDG, why shouldn't that be allowed?

This is the crux of the issue: my impression is that the CFAA doesn't have any language requiring sites to take precautions to prevent unauthorized access.

> They shouldn't have to serve that content to people who aren't actually their users.

And insurance companies only want to serve people who won't claim, etc. LI is free to force users into usage agreements, but why should third parties (taxpayers) foot the bill for enforcing them?

Does this case mean that, if LinkedIn wins, I can write a cease and desist to Facebook, LinkedIn, and Google to stop 'accessing' my PC with cookies and tracking my data?

A few weeks ago I was looking up public profiles on LinkedIn and I noticed what I interpreted to be some network-side fingerprinting of some kind. The first couple profiles came up, but from that point I was only served a sign up page. It didn't matter if I changed browsers or spawned new incognito sessions.

possibly just filtering based on your IP?

That crossed my mind, but it'd mean they'd be willing to lock out large networks behind a NAT. I was wondering if there's something more sophisticated available.

100% they are willing to lock out NAT networks; I have been subject to this more than once while browsing normally.

Also, they are just locking out users that aren't signed in. If a user has a login, they are probably much more likely to err on the side of caution before locking down access in some way.

You can only view a few profiles at a time, and changing to an incognito window will do the same.

"The CFAA makes it a crime to 'access a computer without authorization or exceed authorized access.' Courts have been struggling to figure out what this means ever since Congress passed it more than 30 years ago."

The law by itself is OK, but I suspect lawmakers were referring to accessing a single personal workstation, probably not taking into account a cluster of servers containing publicly accessible data.

"Authorization" should be clarified to mean requiring credentials that formally grant access. In my eyes, public-facing content on a website is explicitly available to anyone. It clearly does not require authorization.

"It's a fight that could determine whether an anti-hacking law can be used to curtail the use of scraping tools across the Web."

In the US. Not elsewhere. Is it possible that the centre of gravity of innovation will be somewhere, er, less litigious soon?

Reminds me of the City of London's historical approach to most forms of regulation (think: Francis Drake, privateers, the convertibility of lute strings &c).

LinkedIn doesn't want YOUR work history easily shareable or consumed by other services. They sell this data to the highest bidders -- typically recruiters.

"HiQ scrapes data about thousands of employees from public LinkedIn profiles, then packages the data for sale to employers worried about their employees quitting".


I'd always been suspicious of linking up with my current employer/coworkers, now I guess it was a valid concern.

When I make my CV public on LinkedIn I expect it to be public. Even for bots.

That is the whole point of 'public.'

LinkedIn's purpose is not to help you as a worker. Its purpose is to scrape together as much personal information about people as possible to make money from it.

Likewise Facebook doesn't exist to help connect people together to create a more personal, connected world. It exists to connect people so that they can get the users to share as much personal information as possible, so they can profit from that information.

It's always important to remember that on social media you are the product. You have to weigh what benefits it is really providing vs. what you are giving up.

As far as I'm concerned, if the information is publicly visible on a site (not behind a login) and as long as the scraping doesn't cause performance issues or generate costs on the site, then it should be perfectly fine to do.

> As far as I'm concerned, if the information is publicly visible on a site then it should be perfectly fine to do.

There, I fixed it.

I don't think it's okay to cause enough stress to the web host to cause delays for other visitors.

Whilst that is your opinion, it is not a fact, nor is it legally valid.

Hypocrites... They should have lost any legal ownership of information freely available on their site long ago, when they decided it was ok to scrape private information from their users' email accounts without authorisation.

I've always believed that if you are scrapable, you should be prepared to be scraped. The internet is still a digital Wild West, and stopping these guys will not stop some underground outfit behind Cloudflare, a Tor hidden service, or a jurisdiction outside your country's laws from doing what they want with what they can get. They can build a site or they can sell the data; stopping these guys just stops a public and honest company. Sure, they can stop the public companies, but the best thing to do is protect yourself. Don't leave the keys to the door out front and then get mad when someone uses them.

It would be very good to have one or a few solid legal decisions relating to scraping. Today, as far as I know, the entire segment is still in limbo.

It is important to understand that, legal or not, in the US anyone can sue you for anything. In another comment people were discussing the legality of following or ignoring "robots.txt". I tend to be pragmatic about this stuff. If you fabricate your own legal interpretation and end up being sued by LinkedIn, it could end up costing you $250K.
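For what it's worth, the mechanical part of honoring robots.txt is trivial; the legal weight of doing so is the open question. A sketch with Python's stdlib (the robots.txt content and URLs here are invented, and a real crawler would fetch the site's actual robots.txt instead):

```python
from urllib import robotparser

# Hypothetical robots.txt content; a real crawler would fetch and
# parse https://example.com/robots.txt instead of a literal string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("mybot", "https://example.com/private/page"))  # False
print(rp.can_fetch("mybot", "https://example.com/public/page"))   # True
```

Whether a court treats that file as permission, a contract, or mere convention is exactly what's unsettled.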

When facing large corporations law firms often ask for sizable retainers ($100K+) and proof of cash-on-hand to go beyond that, $250K total not being an unusual number. They don't do this out of greed. They do it because litigation at those levels can be very expensive. If you only have $100K you could find yourself burning through all of it quickly. If you don't have more cash to continue you'll lose the lawsuit and the $100K you spent will be burned for no reason at all. In other words, the law firm is protecting you by asking you to have enough cash to litigate.

A few years ago we started to develop a very extensive product based on obtaining data from Amazon. Some of the data is available through their API and other data had to be scraped in various forms. The product is extremely valuable, yet the issues pertaining to scraping made me decide to put it on hold. Even if you can make millions a year, the prospect of facing a monster like Amazon in a lawsuit, as improbable as it might be, is scary enough to go look for other pastures.

What if I build a new public search engine and LinkedIn blocks me because I am not Google or Bing? They could stop potential (and much needed) competitors.

It would be interesting to build a "search engine" that returns results that can be aggregated; in the LinkedIn case, aggregated by company.

> HiQ scrapes data about thousands of employees from public LinkedIn profiles, then packages the data for sale to employers worried about their employees quitting

Screw both of these companies. One runs a shitty "professional" social network (data collection tool) and the other scoops up their data droppings and makes it their core business. I just can't have sympathy for either side.

> To expand its user base, Power asked users to provide their Facebook credentials and then—with their permission—sent Power.com invitations to their Facebook friends. Facebook, naturally, didn't appreciate this marketing tactic. They sent Power a cease-and-desist letter and also blocked the IP addresses Power was using to communicate with Facebook's servers.

> Facebook sued, claiming that its cease-and-desist letter made Power's access unauthorized under the terms of the CFAA. Power disagreed and argued that having permission from Facebook users was good enough—it didn't need separate approval from Facebook itself.

How can it be illegal if users are giving their permission? What happens if I give my permission to an external service to extract my own data?

Oh, you don't want your stuff scraped? Well then get the fuck off the public internet. Simple as that.

I'd have to agree with you. Public info is public info.

I can't even view LinkedIn with a web browser, how the hell do robots get in?

Back in the day I wrote my own crawler and layout analyzer to collect news articles for my research (cf. http://www.unixuser.org/~euske/python/webstemmer/ ). I thought in 2005 it was mostly acceptable as long as you didn't hog their bandwidth, but today I feel it would be looked at differently. It is kind of sad to see that more and more of the web is treated as not public. It seems that everyone likes to build their own wall.
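The "don't hog their bandwidth" part can be made explicit in code. A minimal per-host throttle, roughly what a polite 2005-era crawler would do (the class name and default delay are my own invention, not from webstemmer):

```python
import time

class Throttle:
    """Enforce a minimum delay between successive requests to the
    same host -- a polite crawler waits instead of hammering."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.last = {}  # host -> time of last request

    def wait(self, host):
        # First request to a host goes through immediately;
        # subsequent ones sleep until `delay` seconds have passed.
        elapsed = time.monotonic() - self.last.get(host, float("-inf"))
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last[host] = time.monotonic()
```

Calling `throttle.wait(host)` before each fetch keeps the crawler to at most one request per `delay` seconds per host, while leaving different hosts independent.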

A perfect example of corporate hypocrisy. Both Facebook and LinkedIn did illegal things when they were younger but now that they are more established, they don't want anyone to do the same thing to them.

I mean, that's just the standard way non-altruistic entities interact with the law, isn't it? Appeal to it when it benefits them, ignore it when it doesn't.

A web service is not like an ad in a paper or on a public billboard. If a service is being scraped by other companies, whether or not they make money from it, or if the scraping prevents the owner from making money (by filtering out the ads), it is likely to cause extra costs in bandwidth and processing power. Those are not free. So you should have the right to block anybody from your site as you wish!

LinkedIn is free to block the IPs of the scrapers.

Not sure how to feel about this. While scraping data is, in theory, a shady practice, companies like LinkedIn leave the door open for it.

Ironic because LinkedIn scrapes and collects every scrap of contact info they can find. I've got people I sent an email to once in 199? suddenly popping up suggesting we know each other.

How is scraping data shady? Is it also shady to walk into a store and look at things without buying anything?

>Is it also shady to walk into a store and look at things without buying anything?

Not making a claim about scraping, but it's maybe more apt to compare it to walking into a store and writing down a list of everything for sale and what it costs.

I don't see a problem with that. Market transparency is a good thing, as it allows the market to be efficient.

True, the real problem lies with big corporation data collection.

... then drop shipping it online for a profit

It's more like going into the store, taking all the pennies from the free penny jar, then peacing out.

Scraping isn't shady.

... and neither is picking gum from under subway railings

I guess for LinkedIn it could matter most, because their only livelihood is CV and employer data; if you steal a poor man's bread, he is going to be upset. It was dead as a social network long ago, if it ever was one. Now their only asset is the data, so I guess their right to protect their only asset is legit.

Their desire to protect it is legit, but it doesn't necessarily give them a right to do so.

A long time ago, companies that were compiling phonebooks also thought that they could use e.g. copyright to prevent others from using that compiled data for commercial gain. They were wrong, because copyright doesn't protect "mere aggregation" of data, as courts have ruled.

In this case, LinkedIn is not arguing on the basis of copyright, so it's a different legal argument. But the essence of the case is the same: they want to have a business model around aggregating other people's data and then providing access to that data, while limiting what people who access it can do with it. They don't have a right to this business model. If technical means of restricting access don't work, or if adopting them means that they drive most of their customers away, tough luck.

Yes, it's a consensus that user-generated content and facts are not subject to copyright. I am not sure about the legality otherwise, but it should not take long to establish as much.

I meant there is nothing that LinkedIn 'created' out of that user data. But it's a little more complex, and perhaps a little immoral, to take users' data hosted on infrastructure that costs them a lot.

In the US, it's an established legal fact.


(Note, HN strips out the trailing dot from the URL above.)

LinkedIn's email-password con artistry has inspired a whole generation of anti-malware and lawyer plugins, preventing the layman from giving away his data, even on "friendly" sites.

I'm going to turn around now, and whatever happens to this site is going to happen. They worked so hard to ask for this.

If I spammed people as much as LinkedIn I wouldn't be throwing stones. They completely ignore the law.
