
LinkedIn: It’s illegal to scrape our website without permission - mcone
https://arstechnica.com/tech-policy/2017/07/linkedin-its-illegal-to-scrape-our-website-without-permission/
======
5trokerac3
Unpopular opinion: when you make an HTTP request you're asking the server to
give you information. The server has the right to say no.

IMHO, LinkedIn doesn't have a right to stop scraping after the fact, but they
have the right to take technical steps to stop scrapers from accessing their
site.

~~~
ChuckMcM
I don't think you've characterized this accurately. When you make an HTTP
request to LinkedIn you are accessing their _service._ There is a long history
of this relationship, you plug your house into the sewer line and you connect
to the sewer service. You connect to the power pole and connect to the
electricity service. You connect to the telephone pole and connect to the
telephone service.

Every service has "terms of service" which are the conditions that you are
allowed to access the service and what you may do with the service once being
granted access. For example, if you start pouring toxic waste into your sewer,
you will find that the city will both disconnect you from the service and they
will fine you for violating the terms of service you nominally agreed to when
being hooked up.

In LinkedIn's case, they allow you to access their service, with HTTP, to
render a page in a browser for viewing of that page. Full stop. Any other use
of the data you acquire over HTTP, or any other method of acquiring said data
over HTTP is disallowed by the terms of service.

Not only does LinkedIn have a legal right to stop scraping after the fact,
they have literally centuries of common law in support of that position.

~~~
lelandbatey
When you make a connection to the city sewers or to the power company, there
is some kind of pre-connection step where the terms are presented and you
agree to those terms.

With HTTP and LinkedIn, there is no such step. There's no pre-connection
agreement. LinkedIn could present such an agreement on first connection, but
they do not.

~~~
ChuckMcM
That argument has been tried in a variety of ways and been shot down in court
repeatedly. (there are parallels to tenants not agreeing to the terms of their
internet connection where the landlord provided it).

LinkedIn has two things that they do which protect them. First, they specify
that they disallow access in their robots.txt file. While not a binding
agreement per se, it is the default mechanism accepted by the community for
identifying a priori whether or not automated access is allowed. Second, when
they detect an access pattern that violates their terms of service, they
actively block the access and proactively notify the source of the violation.

The sad truth is that web scraping has been around since the very beginnings
of the Web back in 1993 and this question has been litigated in every way that
you might choose to argue it, the body of case law is enough to fill at least
two volumes in the reference section of the library.

There is no legal or ethical basis for scraping the web without permission.
And if it isn't explicitly allowed by a site the presumption is that it is
disallowed (no 'open door' exception).

------
dsacco
I'm following this story with a lot of interest. I've done (and still do!) a
lot of data crawling/scraping. In the past I've worked on so-called
"alternative data" collection and analysis for financial forecasting.

Without going into too much detail, a lot of hedge funds have teams constantly
searching for kernels of data that can contribute some kind of signal for
market movements. This data can come in the form of satellite imagery for oil
tankers or manufacturing centers, but it can also come from the very creative
use of scraped and aggregated data. It's typically very difficult to identify,
collect and analyze on a technical level (as 'chollida1 has lamented in the
past: normalization, labeling/bucketing and analysis of disparate data across
different formats, sources and processing timeframes is a pernicious problem
at this scale). From a compliance standpoint there are also generally strict
requirements governing legality of use.

Depending on the specific data, you might be capable of predicting earnings or
broader market movements with a <5% margin of error each quarter for years at
a time (I've personally seen and worked on projects with <1%, but that's the
exception, not the norm). That tactic is usually found at discretionary funds;
at quantitative funds the uses are much more abstract and cross-pollinated so
as not to target single-equities, but rather holistic trends. Regardless,
every fund is using data in some way these days; it's just a matter of how
sophisticated, creative and abstract they get in their analysis of it.

hiQ Labs doesn't collect data for this specific purpose, but it is absolutely
related. In the past I have stayed away from crawling LinkedIn and Yelp
precisely because they are very litigious (regardless of the eventual outcome
and legality). Now that there's another relatively high profile case out in
the open like this, I'm interested in seeing how it proceeds and what the
ramifications will be for companies that collect data across a wide range of
uses. As Grimmelman mentioned in the article, this can impact a lot of types
of businesses, not just those in the same space as hiQ. Outside of finance I
am familiar with many tech companies which (openly or otherwise), kickstarted
what are now widely known enterprises through cleverly crawling or scraping
massive amounts of data.

~~~
hellbanner
"satellite imagery for oil tankers" - Interesting.

If hedge funds were floating drones above oil tankers, I'd guess they'd be
accused of corporate espionage / spying / invasion of privacy?

Ok, so oil tankers are big and "in the clear". What if $TANKERCORP floats big
parachute balloons above its tankers to imply "looking past these is
unauthorized viewing"?

Then if a HedgeFund gets a clever angle on a satellite photo.. is that the
equivalent of breaking a lock, or violating CFAA?

~~~
dsacco
Satellite imagery like this is legal to within a few feet of practical
resolution, pretty much anywhere. The effective countermeasure is hiding
things from a satellite, not attempting to sue satellite operators for flying
overhead. I'm not aware of a specific law against using anything that is
literally viewable from the sky, at least in the United States (someone can
correct me if I'm wrong, but last I checked Google Maps blurs out some
locations or keeps them outdated because the government requests it, not
because of a formal law forcing them to do so).

There are two other notes in response to your question:

1. Drones are different from satellites, and are more susceptible to
regulation in the way you're positing because they can be prevented from
flying above specific areas. However, most of the same problems with
countering them apply, because drones can record better three dimensional
footage. In your specific example, if a tanker disguised itself overhead, it
would still be legal to have a drone monitor the tanker from the _sides_, as
long as doing so didn't break any law set by the FAA.

2. Drones are actively used these days for things like monitoring production
facilities ("how many cars come out of this factory" for an oversimplified
version). If they have to monitor from a distance, so be it, they'll do it.
The effective countermeasure here is to have a huge amount of land that can't
provide any intelligence, because the drones aren't allowed to fly over it and
can't see far enough in to the facility.

There's definitely a productive ethics discussion that can be had here, but
the legal precedents don't really allow for combatting these techniques right
now. If it's public, it can be collected, ingested and used in an algorithm to
determine alpha.

~~~
paganel
> someone can correct me if I'm wrong, but last I checked Google Maps blurs
> out some locations or keeps them outdated because the government requests
> it, not because of a formal law forcing them to do so

In the Eastern European country I'm from (a NATO member) the Google
StreetView car even got to photograph and publicly put on the Internet the
outside of military and air bases with clear signs of "do not take
photographs" visible on StreetView itself. It's funny, my company also used to
work in this space (local business directory with business addresses, photos
of said business etc) and one of my former colleagues got detained for a day
because of taking photos of businesses in the downtown area of one of the
biggest cities in my country. He hadn't seen that there was a military
"objective" in his line of view (probably some military HQ or something, not a
proper military base with tanks and trucks). Talk about the advantages of
being an internet giant like Google..

Later edit: I was talking for example about links like these:
https://www.google.ro/maps/@44.4062748,26.0524843,3a,80.8y,239.77h,95.92t/data=!3m6!1e1!3m4!1sJmkNyhGqTVOG55p8vhmrig!2e0!7i13312!8i6656?hl=en
. That is actually the HQ of NATO's "Multinational Division Southeast",
whatever that is
(http://www.nato.int/cps/in/natohq/news_125356.htm?selectedLocale=en).
Fact is that if I were to take photos of those buildings as a simple citizen I
would be breaking the law, not sure how Google got away with it.

------
brunock222
I do a lot of crawler services as well.

For the last one, I built 13 crawlers to keep comparing prices of all
drugstore products online. I'm selling it to drugstore e-commerce sites that
want to know their competitors' prices, when they're running promotions, and
whether competitors' prices are higher... Well, I just automated the job of a
person who was doing this manually every single day, looking at competitors'
sites.

If it is illegal, you'd have to ask Google Maps to remove all houses from
its database and keep only the houses whose owners gave permission.

Both of these sentences mean the same bullshit; they are both wrong:

Google Maps: "The front door of my house faces a public street. That
does not mean you can take photos of it on Street View and use a very smart
OCR to read my house number. Not to mention that sometimes you show my house
photo to others in a Captcha asking for the house number, c'mon!"

LinkedIn: "My CV is half-public. That does not mean you can crawl it."

... This is only a cool discussion in North Korea or maybe in China. Not in
the rest of the world.

~~~
groby_b
I would suggest occasionally actually informing yourself of the wider world.

Your very example of Google Maps in Germany - well, turns out Google had to
give people the option to opt out:
https://europe.googleblog.com/2010/10/how-many-german-households-have-opted.html

Same goes for web sites - robots.txt exists for a reason. If your crawler
ignores that, well, I'd suggest talking to a lawyer.
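
For what it's worth, checking is cheap. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the bot names and the URL below are made up for illustration and are not LinkedIn's actual file. It only shows the technical check, and says nothing about what a court would make of ignoring the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed everywhere,
# everyone else is disallowed from the whole site.
ROBOTS_TXT = """\
User-agent: friendlybot
Disallow:

User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/in/someone"  # placeholder URL
    print(may_fetch(ROBOTS_TXT, "friendlybot", url))
    print(may_fetch(ROBOTS_TXT, "otherbot", url))
```

A crawler that wants to behave would run a check like this before every fetch and skip anything that comes back disallowed.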

~~~
briandear
I wasn’t aware that robots.txt had any legal meaning. Am I wrong?

~~~
pdabbadabba
> Am I wrong?

I hate to be the bearer of bad news but...maybe. A reading of the Computer
Fraud and Abuse Act could make robots.txt legally enforceable. And given the
government's approach to CFAA cases, a very aggressive interpretation, under
the right circumstances (for example, when it provides evidence that the
scraper knew that scraping was not authorized), seems like a real possibility.

Among the many other things CFAA criminalizes, it makes it a crime to
"intentionally access[] a computer without authorization or exceed[]
authorized access . . and thereby obtain[] information from any protected
computer;"

A "protected computer" is, among other things, a computer "which is used in or
affecting interstate or foreign commerce or communication." That would
probably cover just about any server on the Internet.

https://www.law.cornell.edu/uscode/text/18/1030

~~~
hackits
I wouldn't be too quick to jump to the conclusion that robots.txt is legally
enforceable. You would need to cite prior case law. Without any case law, any
reading of the statute is error-prone at best.

~~~
solarkraft
The possibility still exists. After reading this (insanely broad) definition I
think the chance is not even that low.

------
jalfresi
If you post it on a public network, it is, de facto, public.

If you don't want it scraped, take it down, or put it behind a login.

If the user provides the login to a scraper, then the scraper has permission.

~~~
dsacco
That's a pretty literalist, one-size-fits-all approach to policy. I don't
think it's a good framework to use for applying ethics considerations.

If I can walk near a pool, should I also be able to run? Is running anything
more than faster walking? If I'm allowed to be around the pool walking with my
entry ID, should I also be allowed to place my ID on a little motorized car
and make it dart around the pool really fast? Should I be able to duplicate my
badge, put it on a bunch of little cars and direct them to quickly get all the
floaties before anyone else? How about giving them all their own fake IDs? Now
all the same questions, except there is a sign that explicitly prohibits all
of these examples except for walking.

It seems disingenuous to argue that the automation and rapid increase of a
thing should be allowed just because a thing is allowed. That doesn't
typically match our intuitive notions of ethics in other parts of society,
like driving or walking around a pool. Yes, you can walk around a pool as much
as you want, but if you change to running then you have fundamentally altered
your behavior through increased capability, not merely done "more of walking"
to utilize more of your freedom.

I suppose a natural counterargument to this analogy might be that running
around pools is unsafe, and scraping is not unsafe in the same way. But my
point here is establishing that a behavior intrinsically changes into a
different behavior if you increase the speed at which you're doing it or the
capability at which you can do it.

~~~
freehunter
The pool has the right to kick you out, same as any website. The pool cannot
call the police and charge you with a felony for misusing their resources.

~~~
dsacco
Does the pool have any recourse if you proceed to bypass the ban? Do you have
to re-enter the pool to bypass it, or does sending in confederates with their
own badges to continue your work also count as bypassing the ban? How about
sending in new motorized cars?

The analogy is starting to break down, but I think it's still instructive for
the problem of applying a simple first principles approach.

~~~
freehunter
There is a legal concept known as "attractive nuisance"[1]. If I have a pool
and neighborhood kids come to play and someone gets hurt, it's my fault. Even
if I was away from my house and never gave permission (or explicitly forbade
them from swimming), if I don't have proper access controls in place, the
courts say it is too tempting for the neighbors to just come over and swim. I
need to put up a locking gate to keep them out.

Likewise in some high-crime jurisdictions, if you did not lock your car you
are liable for it getting stolen or broken into[2]. An unlocked car is too
tempting for some people to just walk past and _not_ take it.

I know it might sound crazy but you could make the argument that a massive
pool of highly-structured and very valuable data just sitting out in the open
is an attractive nuisance and steps should be taken to put it behind a locked
gate. Once that requirement has been satisfied, normal trespassing laws apply.

[1]
https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine

[2]
http://www.cbc.ca/news/canada/montreal/ndg-resident-questions-ticket-for-leaving-car-unlocked-1.2493093

~~~
enraged_camel
Laws like that are ridiculous. You can see that by looking at how the
reasoning does not expand to certain areas.

For example, if a woman walks down a dark alley wearing short skirts and gets
raped, it isn't her fault. I mean can you imagine if we said "well, she was
just an attractive nuisance!" The judge would throw the book at you.

~~~
freehunter
They're usually targeted at children who don't know better and don't have the
cognitive skills to understand consequences. Usually... but not always.

------
snarfy
That law needs to be abused into the ground. We should all be filing frivolous
lawsuits claiming CFAA violations whenever someone we don't like is accessing
our sites. All we need is a little disclaimer in 5pt text stating who does not
have authorization, like say all members of Congress.

------
debacle
I'm seeing a lot of bad analogies thrown around in this comment thread, mostly
based on emotional response and/or a dislike for LinkedIn.

As someone who has done a lot of scraping in the past (sometimes for good,
sometimes not), the number one thing you need to respect as a scraper is that
email or phone call you get saying "Stop doing that."

In almost all instances, you're legally fine in the real world until you get
some communication to stop and/or get blacklisted. After that point, what you are
doing becomes a crime.

- LinkedIn is not a public resource, it is a private company that pays for
  servers.
- LinkedIn might scrape too, but that argument isn't going to hold up in
  court, and the scraping they do is probably in line with their EULA
  (protip: never install a social networking app on your phone, ever).
- The analogies to the storefront, taking pictures in public, etc, all break
  down because scraping LinkedIn requires you to access their resources.
- The analogy to browsing a store is great. If you are in a store, and they
  ask you to leave, and you don't, that's trespassing. Trespassing isn't
  legal.

The CFAA isn't a great law. There are a lot of gray areas. But LinkedIn seems
within their rights here.

If anyone wants to know how this is going to wind up:
https://www.eff.org/deeplinks/2015/06/padmapper-and-3taps-settle-suit-craigslist-over-use-real-estate-facts

~~~
mywittyname
> But LinkedIn seems within their rights here.

Congress writes bad laws all the time. So LinkedIn might be within their
rights, but that doesn't mean they should have those rights.

It's bad for innovation to allow for selective discrimination like this.
LinkedIn is perfectly happy to allow Google, Yahoo, Bing, and many, many more
companies to scrape their content and use it for personal profit. Giving them
the option to sue an upstart for doing exactly the same thing as Google is
unfair and oligopolistic.

Letting established tech giants get away with this will slowly erode American
dominance in technology.

~~~
debacle
> Giving them the option to sue an upstart for doing exactly the same thing as
> Google is unfair and oligopolistic.

But it's not. LinkedIn's data is their entire business. They are within their
rights to restrict access to it.

This is the classic ant and grasshopper story. If HiQ wants access to the type
of data they are scraping from LinkedIn, they can build that data themselves.

~~~
mywittyname
They are within their right to require authorization to their content, yes,
but they don't do that. They make it public and allow some companies to scrape
that content and resell it for profit (like Google), while restricting
other companies from doing the same.

If they don't want their data to be public, then they shouldn't make it
public. They could require authorization to view any content on the site and
solve this problem instantly.

~~~
debacle
What leads you to believe that LinkedIn's data is public?

~~~
int_handler
I believe mywittyname is specifically referring to the LinkedIn profile pages
that are publicly visible without requiring a login and is not claiming that
all LinkedIn's data is public. Thus, the question is why doesn't LinkedIn
simply hide all profile data behind user authentication?

------
hyperion2010
Ah, complaining when people look at the painting in your store front gallery
on 5th avenue because you have to pay for the upkeep of the sidewalk.

~~~
huhtenberg
Actually, no.

It's more like complaining about others selling tickets to view said painting
from the sidewalk. HiQ repackages and sells data it scrapes from LinkedIn.

~~~
williamle8300
Actually, no.

It's like taking pictures of the paintings from the street, and reselling
those pictures.

If they don't like that... take the painting down. Simple.

~~~
ksk
Hmm, then that becomes a copyright issue I suppose.

~~~
openmosix
The type of data being discussed here (factual data about people) cannot be
copyrighted - i.e. the fact John Doe is a Software Engineer for ACME Inc is
not copyrightable.

~~~
lightbyte
A better example I think is walking into a store and writing down what they
price everything as, then selling aggregate pricing data to people. I can't
see any reason that would be illegal.

~~~
bostonpete
Would likely be illegal if you continued to do so after the store asked you to
leave...

------
yourapostasy
Even if successfully litigated, doesn't this just move the scraping activity
to less-obvious means, and to better-funded scrapers less concerned with
legalities, making LinkedIn's efforts to clamp down upon scraping even more
difficult? Is there a business case for LinkedIn to monetize the scraping by
selling access to an API instead?

Between bot-nets, mechanical turks, deep learning, data brokering, lack of
globally-enforced privacy laws that require divulging of sourced personal
data, _etc._, I can't see a way for LinkedIn to prevent others from scraping
and gaining from their publicly- and user-accessible data. They'll drive it
underground, but if the concern is preventing others from grabbing the data at
all, versus performance management, they'll still leak like a sieve.

~~~
whatnotests
LinkedIn gutted their API a couple years ago and now the information they took
out has been moved to a private API.

If your business is in the recruiter space, expect to have your API keys
revoked and to receive a cease and desist letter as well.

~~~
yourapostasy
This sounds like the setup to an up-spiraling arms race with "dark scrapers".
I'm guessing behind-the-scenes, LinkedIn figures they can out-spend the
extralegal scrapers, and likely considers their efforts will deliver halo
effects to the rest of Microsoft. It would be educational to hear how LinkedIn
plans to take down bot-net-based scraping that uses deep learning to identify
patterns that successfully mimic human users and bypass their bot detection;
could possibly help other white hats who want to battle bots and general
malware.

~~~
kharms
I'm skeptical of the potential of dark scrapers at scale. You'd need to
simulate too much human behavior to be unidentifiable, and humans are slow.

You would need real-looking bot accounts that you'll use to scrape. You'd need
a realistically randomized rate limit, sampling from some distribution
conditional on the type of the source page. You'd need realistic
mouse/keyboard movements. Realistic hours of operations. Can't be scraping at
4AM and 4PM, and all of the hours in-between. Occasional noise operations,
such as searching for a job, or getting salary estimates. You'd be
geographically constrained. You wouldn't want your bot from Boston to be
looking at too many individuals in Houston (regularly). Maybe you'd use a
Markov chain to have the bot make decisions? I doubt the blackhats would have
good training data for a neural net. You'd need tens of thousands of these
bots to cover the linkedin user base in reasonable time (say, once every week
on average), and these bots would have to either overlap or seriously underlap
on who they cover.

Best use case would be a scraper-API that you can use to look up batches of
specific people, with your bots looking at others only to look realistic.

(Or maybe not? It's a fun question, but I know fuck all about this. Not my
area of expertise.)
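
A back-of-envelope version of the bot-count estimate, with every input an assumption: roughly 500 million profiles, a one-week refresh cycle, and a guessed number of profile views per bot per day. The function name is made up and the numbers are only there to show how the estimate scales.

```python
import math


def bots_needed(profiles: int, refresh_days: float, views_per_bot_per_day: int) -> int:
    """Accounts required to view every profile once per refresh cycle."""
    # Total profile views the whole fleet has to make each day.
    views_per_day = profiles / refresh_days
    # Round up: a fractional bot still means one more account.
    return math.ceil(views_per_day / views_per_bot_per_day)


if __name__ == "__main__":
    ASSUMED_PROFILES = 500_000_000  # assumption, not a measured figure
    for views in (100, 1500):
        print(views, bots_needed(ASSUMED_PROFILES, 7, views))
```

With these guesses, 100 views per bot per day needs on the order of 700,000 accounts, and getting down to tens of thousands of bots means each one viewing around 1,500 profiles a day, which is hard to pass off as human browsing.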

~~~
yourapostasy
Along the lines of "fun question, I'll take a stab at it just for giggles";
this would be far more interesting as an interview question than "estimate how
many soccer balls can fit in a 747".

Average botnet size is 20,000 compromised PC's. Srizbi is estimated at
450,000. Another vector I'd explore is teaming up with crypto-miners. As I
understand it, there are no economic returns tapping into the CPUs any longer,
so miners are using only GPUs and ASICs; if this is true, they'll have some
spare CPU cycles, that they'd probably be willing to rent out to get some
marginal returns on the CPUs that have to run and manage the mining chips,
running a JVM or some other VM. If we can do that, then we can probably tap
2-3M hosts, many of them rotating in and out per day.

Throw out an army of mechanical turk assignments to get real humans to
register fake accounts. They get paid upon submitting an account and password,
which your scraping servers verify, then change the password and commandeer.
Perhaps have them register the fake account while running under a container or
VM on their computer; the container/VM is instrumented to capture all
activity. The activity metrics and data are uploaded to a deep learning
system, that identifies the patterns that work and the ones that don't, and
uses that to guide the developers of what to randomize, and by how much.

Add in a component to randomly invite/follow other fake and real accounts, and
generate Markov-chain-generated copypasta. Set aside a portion of the fake
accounts to only build up networks of users. Initially restrict the market of
customers to those who only want once-a-year-updated data. As the network
builds, use the notification of changes to selectively scrape only changed
user profiles, and upsell for more up-to-date profiles at that time.

If I was LinkedIn, I'd probably concentrate on infiltrating botnet operators,
and shutting them down. It would be one large cat-and-mouse game.

------
athenot
Meanwhile LinkedIn was happy to scrape its users' email archives and contact
lists. It would appear there's some karma at play here...

~~~
eadz
That is an interesting fact.. LinkedIn broke the ruling against scraping they
are relying on to prevent scraping.

~~~
throwaway91111
Scraping would be forbidden via TOS; theoretically it would be the users
giving access to LinkedIn (who actually are bound by the terms) who would be
liable, not LinkedIn.

~~~
bb611
If it's illegal to scrape without permission, that makes the behavior of
scraping illegal. They are not claiming it's a TOS violation, they're claiming
it's a CFAA violation, which is a federal statute and (theoretically) applies
equally to everyone.

~~~
throwaway91111
My point is that someone with permission gave them access.

~~~
bb611
Having "permission" is not an applicable defense to a federal criminal charge,
the law supersedes some random person (i.e. a LinkedIn user) saying "oh yeah,
that's okay."

------
wslh
LinkedIn and Reid Hoffman have stories (don't know Reid personally) that I
don't buy, speaking as one of the first LinkedIn users and as a premium, ads,
and rejected-API customer. I think LinkedIn is a scam, where the service is
done for the recruiters and job postings while they make you believe they are
thinking of you. They never built OkCupid- or FB-like services for businesses,
beyond adding basic introduction messaging.

~~~
MagnumOpus
> LinkedIn is a scam, where the service is done for the recruiters

Everyone knows that LinkedIn gets paid by recruiters rather than recruitees.
But of course, that is perfectly fine. People are happy to be the product if
it results in job leads.

~~~
wslh
This is not what they are selling with the premium offering and this is why they
are lying. They are offering a way to improve selling your products and
services but without any technical innovation beyond messaging.

~~~
pinaceae
LinkedIn has two main functions:

1. Self-updating rolodex for sales people.
2. Recruitment tool.

It works for both, no real competitor in sight due to the massive network
effect.

Some niche networks are doing ok, like Xing in DACH region. Not aware of
anything special in China, guess everyone is on WeChat anyhow.

~~~
wslh
This is not what they are offering, look at the premium offerings:
https://premium.linkedin.com/ do you know the
ROI of using LinkedIn Premium (e.g. InMail) to just contact people using a
zillion of methods available, and then adding them to your LinkedIn...

~~~
pinaceae
not sure where the confusion is.

below the marketing language this is exactly those 2 points. guess it is
easier if you work in enterprise, this business speak is a different language.
pretty verbose, low information density.

~~~
wslh
Are we reading the same page?
https://business.linkedin.com/sales-solutions?trk=pre_hub_b_lmor_lss#

------
adrr
It would be interesting to put a TOS into your scraper via the Pragma or
User-Agent field that says if you return data to the scraper you accept the
TOS.
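
Mechanically that is just two request headers. A rough sketch with Python's standard `urllib.request`; the client name, the terms URL and the `client-terms` Pragma directive are all invented for illustration, and nothing here suggests a server would read them, let alone be bound by them.

```python
from urllib.request import Request

# Hypothetical URL where the client's own terms would be published.
CLIENT_TERMS_URL = "https://example.com/client-terms"


def build_request(url: str) -> Request:
    """Build (but do not send) a request that advertises the client's terms."""
    headers = {
        # Free-text comment in parentheses after the product token.
        "User-Agent": (
            "example-scraper/0.1 "
            f"(by returning data you accept {CLIENT_TERMS_URL})"
        ),
        # Made-up extension directive; servers will normally ignore it.
        "Pragma": f'client-terms="{CLIENT_TERMS_URL}"',
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    req = build_request("https://example.com/profile")
    for name, value in req.header_items():
        print(f"{name}: {value}")
```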

------
wccrawford
> One plausible reading of the law—the one LinkedIn is advocating—is that once
> a website operator asks you to stop accessing its site, you commit a crime if
> you don't comply.

I'm fine with that.

But I'm also fine if they do like the article suggests and require
anything non-scrapable to be behind an account prompt, even if everyone with an
account can access it.

I don't think it's fair to make LinkedIn foot the bill for someone else's
business. They shouldn't have to serve that content to people who aren't
actually their users.

~~~
mbillie1
> They shouldn't have to serve that content to people who aren't actually
> their users.

So put it behind a password. It's not reasonable to expect to get only the
benefits of publicly-searchable data without any of the drawbacks.

~~~
mywittyname
The policy is also harmful to innovation.

If a Google-like competitor started, all Google would have to do to crush them
is demand big-name sites formally prohibit the competitor from accessing their
content or risk being delisted. And magically, it becomes impossible/illegal
to build a duck-duck-go.

~~~
ue_
Isn't that monopolistic behaviour? I suspect that it might violate some laws,
though I don't know anything in particular. On the other hand, why couldn't
DDG just do what Google does and say "robots.txt prevents us from getting a
description for this site"?

On one hand I'm against the cartel behaviour of Google doing something like
that, but on the other hand, if Google asks and the other company agrees to
block DDG, why shouldn't that be allowed?

------
osullip
Does this case mean, if LinkedIn wins, I can write a cease and desist to
Facebook, LinkedIn and Google to stop 'accessing' my PC with cookies and
tracking my data?

------
TomSawyer
A few weeks ago I was looking up public profiles on LinkedIn and I noticed
what I interpreted to be some network-side fingerprinting of some kind. The
first couple profiles came up, but from that point I was only served a sign up
page. It didn't matter if I changed browsers or spawned new incognito
sessions.

~~~
mgkimsal
possibly just filtering based on your IP?

~~~
TomSawyer
That crossed my mind, but it'd mean they'd be willing to lock out large
networks behind a NAT. I was wondering if there's something more sophisticated
available.

~~~
dylz
100% they are willing to lock out NAT networks, have been subject to this more
than once browsing normally.
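
A rough sketch of why a purely IP-keyed filter behaves that way. This is an assumption about the mechanism, not LinkedIn's actual implementation; the class name, the limits and the addresses are invented.

```python
from collections import defaultdict, deque


class PerIpLimiter:
    """Sliding-window request counter keyed only on the client IP."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps an IP string to the timestamps of its recent allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Return True and record the hit if ip is still under its limit."""
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # e.g. serve the sign-up page instead
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PerIpLimiter(max_requests=2, window_seconds=60.0)
    for second in range(3):
        print(limiter.allow("203.0.113.7", now=float(second)))
```

Every client behind the same NAT shares one key here, so switching browsers or opening an incognito window changes nothing, while a request from a different address goes straight through.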

------
FedericoCapell0
"The CFAA makes it a crime to 'access a computer without authorization or
exceed authorized access.' Courts have been struggling to figure out what this
means ever since Congress passed it more than 30 years ago."

The law by itself is ok, but I suspect lawmakers were referring to accessing a
single personal workstation, probably not taking into account a cluster of
servers containing publicly accessible data.

~~~
mywittyname
"Authorization" should be clarified to mean requiring credentials that
formally grant access. In my eyes, public-facing content on a website is
explicitly available to anyone. It clearly does not require authorization.

------
keithpeter
_" It's a fight that could determine whether an anti-hacking law can be used
to curtail the use of scraping tools across the Web."_

In the US. Not elsewhere. Is it possible that the centre of gravity of
innovation will be somewhere, er, less litigious soon?

Reminds me of the City of London's historical approach to most forms of
regulation (think: Francis Drake, privateers, the convertibility of lute
strings &c).

------
deedubaya
LinkedIn doesn't want YOUR work history easily shareable or consumed by other
services. They sell this data to the highest bidders -- typically recruiters.

------
atomicone
"HiQ scrapes data about thousands of employees from public LinkedIn profiles,
then packages the data for sale to employers worried about their employees
quitting".

cute.

~~~
0x4f3759df
I'd always been suspicious of linking up with my current employer/coworkers,
now I guess it was a valid concern.

------
pmlnr
When I make my CV public on LinkedIn I expect it to be public. Even for bots.

That is the whole point of 'public.'

~~~
funkymike
LinkedIn's purpose is not to help you as a worker. Its purpose is to scrape
together as much personal information about people as possible to make money
from it.

Likewise Facebook doesn't exist to help connect people together to create a
more personal, connected world. It exists to connect people so that they can
get the users to share as much personal information as possible, so they can
profit from that information.

It's always important to remember that on social media you are the product.
You have to weigh what benefits it is really providing vs. what you are
giving up.

------
rocky1138
As far as I'm concerned, if the information is publicly visible on a site (not
behind a login) and as long as the scraping doesn't cause performance issues
or generate costs on the site, then it should be perfectly fine to do.

~~~
pmlnr
> As far as I'm concerned, if the information is publicly visible on a site
> then it should be perfectly fine to do.

There, I fixed it.

~~~
rocky1138
I don't think it's okay to cause enough stress to the web host to cause delays
for other visitors.
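
One common way to avoid stressing a host is to enforce a minimum gap between requests. Below is a minimal sketch of that idea. The class name `PoliteThrottle` and the two-second interval are assumptions for illustration. The clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class PoliteThrottle:
    """Enforce a minimum gap (in seconds) between requests to one host."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


if __name__ == "__main__":
    throttle = PoliteThrottle(min_interval=2.0)
    for page in ("page-1", "page-2", "page-3"):
        slept = throttle.wait()
        # A real crawler would issue its HTTP request here.
        print(f"fetching {page} after sleeping {slept:.1f}s")
```

Calling `wait()` before every request keeps the crawler to at most one request per `min_interval` seconds, whatever the rest of the code does.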

------
tomxor
Hypocrites... They should have lost any legal ownership to information freely
available on their site long ago when they decided it was ok to scrape private
information from their users' email accounts without authorisation.

------
cdevs
I've always believed that if you are scrapable then be prepared to be scraped.
The internet is still a digital Wild West, and stopping these guys will not
stop some underground source behind Cloudflare, a Tor hidden service, or
outside your country's laws from doing what they want with what they can get.
They can build a site or they can sell it. Stopping these guys just stops a
public and honest company. Sure, they can stop the public companies, but the
best thing to do is protect yourself. Don't leave the keys to the door out
front and get mad when someone uses the keys.

------
rebootthesystem
It would be very good to have one or a few solid legal decisions relating to
scraping. Today, as far as I know, the entire segment is still in limbo.

It is important to understand that, legal or not, in the US anyone can sue you
for anything. In another comment people were discussing the legality of
following or ignoring "robots.txt". I tend to be pragmatic about this stuff.
If you fabricate your own legal interpretation and end up being sued by
LinkedIn, it could end up costing you $250K.

When facing large corporations law firms often ask for sizable retainers
($100K+) and proof of cash-on-hand to go beyond that, $250K total not being an
unusual number. They don't do this out of greed. They do it because litigation
at those levels can be very expensive. If you only have $100K you could find
yourself burning through all of it quickly. If you don't have more cash to
continue you'll lose the lawsuit and the $100K you spent will be burned for no
reason at all. In other words, the law firm is protecting you by asking you to
have enough cash to litigate.

A few years ago we started to develop a very extensive product based on
obtaining data from Amazon. Some of the data is available through their API
and other data had to be scraped in various forms. The product is extremely
valuable yet the issues pertaining to scraping made me decide to put it on
hold. Even if you can make millions a year, the prospect of facing a monster
like Amazon in a lawsuit, as improbable as it might be, is scary enough to go
look for other pastures.

------
twsted
What if I build a new public search engine and LinkedIn blocks me because I am
not Google or Bing? They could stop potential (and much needed) competitors.

~~~
bostonvaulter2
It would be interesting to make a "search engine" that returns results that
can be aggregated. In the LinkedIn case that would be over a company.

------
bschwindHN
> HiQ scrapes data about thousands of employees from public LinkedIn profiles,
> then packages the data for sale to employers worried about their employees
> quitting

Screw both of these companies. One runs a shitty "professional" social network
(data collection tool) and the other scoops up their data droppings and makes
it their core business. I just can't have sympathy for either side.

------
toniprada
> To expand its user base, Power asked users to provide their Facebook
> credentials and then—with their permission—sent Power.com invitations to
> their Facebook friends. Facebook, naturally, didn't appreciate this
> marketing tactic. They sent Power a cease-and-desist letter and also blocked
> the IP addresses Power was using to communicate with Facebook's servers.

> Facebook sued, claiming that its cease-and-desist letter made Power's access
> unauthorized under the terms of the CFAA. Power disagreed and argued that
> having permission from Facebook users was good enough—it didn't need
> separate approval from Facebook itself.

How can it be illegal if users are giving their permission? What happens if I
give my permission to an external service to extract my own data?

------
dna_polymerase
Oh you don't want your stuff scraped? Well then get the fuck off the public
internet. Simple as that.

~~~
point78
I'd have to agree with you. Public info is public info.

------
smegel
I can't even view LinkedIn with a web browser, how the hell do robots get in?

------
euske
Back in the day I wrote my own crawler and layout analyzer to collect news
articles for my research (cf.
[http://www.unixuser.org/~euske/python/webstemmer/](http://www.unixuser.org/~euske/python/webstemmer/)
). I thought in 2005 it was mostly acceptable as long as you didn't hog their
bandwidth, but today I feel it would be looked at differently. It is kind of
sad to see that more and more of the web is treated as not public. It seems
that everyone likes to build their wall.

------
mattm
A perfect example of corporate hypocrisy. Both Facebook and LinkedIn did
illegal things when they were younger but now that they are more established,
they don't want anyone to do the same thing to them.

~~~
graphitezepp
I mean that's just the standard way non altruistic entities interact with the
law isn't it? Appeal to it when it benefits them, ignore it when it doesn't.

------
anonymous344
A web service is not like an ad in a paper or a public billboard. If a service
is being scraped by other companies, whether or not they make money from it or
prevent the owner from making money with it (by filtering out the ads), the
scraping is likely to cause extra costs in bandwidth and processing power.
Those are not free. So you should have the right to block anybody from your
site as you wish!

~~~
Fej
LinkedIn is free to block the IPs of the scrapers.

------
pgeorgep
Not sure how to feel about this. While in theory, scraping data is a shady
practice - companies like LinkedIn leave the door open for it.

~~~
lightbyte
How is scraping data shady? Is it also shady to walk into a store and look at
things without buying anything?

~~~
cr1895
> Is it also shady to walk into a store and look at things without buying
> anything?

Not making a claim about scraping, but it's maybe more apt to compare to
walking into a store and writing a list of everything for sale and its cost.

~~~
amelius
I don't see a problem with that. Market transparency is a good thing, as it
allows the market to be efficient.

~~~
pgeorgep
True, the real problem lies with big corporation data collection.

------
ijafri
I guess it matters most for LinkedIn, because their only livelihood is CV and
employer data. If you steal a poor man's bread, he is going to be upset. It
was dead as a social network long ago, if it ever was one; now their only
asset is the data, so I guess their right to protect their only asset is
legit.

~~~
int_19h
Their _desire_ to protect it is legit, but it doesn't necessarily give them a
_right_ to do so.

A long time ago, companies that were compiling phonebooks also thought that
they could use e.g. copyright to prevent others from using that compiled data
for commercial gain. They were wrong, because copyright doesn't protect "mere
aggregation" of data, as courts have ruled.

In this case, LinkedIn is not arguing on the basis of copyright, so it's a
different legal argument. But the essence of the case is the same - they want
to have a business model around aggregating other people's data and then
providing access to that data, while limiting what people who access it can do
with it. They don't have a _right_ to this business model. If technical means
of restricting access don't work, or if adopting them means that they drive
most of their customers away, tough luck.

~~~
ijafri
Yes, the consensus is that user-generated content or facts are not subject to
copyright. I am not sure about the legality, but it should not take long to
prove as much.

I meant there is nothing that LinkedIn 'created' out of that user data. But
it's a little complex, or a little immoral, to steal user data hosted on their
infrastructure that costs them a lot.

~~~
int_19h
In US, it's an established legal fact.

[https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R...](https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._Rural_Telephone_Service_Co).

(Note, HN strips out the trailing dot from the URL above.)

------
honestoHeminway
LinkedIn's email-password con artistry has inspired a whole generation of
Anti-Malware and Lawyer-Plugins, preventing the layman from giving away his
data, even on "friendly" sites.

I'm going to turn around now, and whatever happens to this site is going to
happen. They worked so hard to ask for this.

------
nilved
If I spammed people as much as LinkedIn I wouldn't be throwing stones. They
completely ignore the law.

------
fiatjaf
I once tried to scrape
[http://www.diseasesdatabase.com](http://www.diseasesdatabase.com) by all
means, but failed so hard I can't even tell this story without embarrassment.

They have all kinds of blocks in place.

------
aantix
Here's a few twists that would be interesting to see how they held up in
court...

* What if someone had a few oDesk workers to manually enter in the LinkedIn data?

* What if someone wrote a scraper that parses the Google cache version of the public LinkedIn page?

~~~
notfromhere
The oDesk thing is allowed, I believe. Or at least can't be detected

------
troisx
Once again the 9th circuit makes a ridiculous decision. The fact that they
sided with Facebook in that lawsuit shows how out-of-touch they are with
reality. Hopefully it gets overturned.

------
danschumann
From a non-legal perspective, this is a battle of employers' vs employees'
rights. Do employers have the right to (in a sense) spy on their employees,
watching their every move to see if they want to switch jobs?

------
hellbanner
Related: (also frontpage):
[https://news.ycombinator.com/item?id=14891192](https://news.ycombinator.com/item?id=14891192)

------
emodendroket
Since it would criminalize tons of current behavior and set a terrible
precedent if this prevails, I expect them to be completely successful.

------
a_lifters_life
What a joke, as if LinkedIn is some important website more so than <name a
million others>

------
WalterBright
I seem to recall that the courts ruled that a phone book is copyrightable, but
the data within it is not. Perhaps that is precedent?

------
drefgert
Seems to me LinkedIn is missing the chance to be a source of identity
verification for other services.

------
didibus
With LinkedIn's current policy, are they the owner of my data? What if I
scrape my own profile?

~~~
mehh
Quite likely yes they are the owners of your data.

------
drefgert
Is there ANY data that can be read from LinkedIn programmatically without
their permission?

------
jug
It is illegal to let your browser fetch this text without my permission.

~~~
mehh
I suspect we don't need your permission as under the terms of use of this
site you forfeit your rights to this user generated content, and the rights
now belong to Hacker News.

------
duncan_bayne
Yet another reason to shut down your LinkedIn account and walk away.

------
jlebrech
instead of making it a legal matter they could just start giving bogus
information to the scrapers.

They have no legal obligation to provide accurate information to scrapers.

------
artur_makly
it is incredible how many global companies are _successfully_ scraping LI as
part of their core business. you would be surprised

------
cerf
Sometimes I wish we could start over.

------
jamesmattis
Is scraping public data on a specific platform like LinkedIn illegal? Why?

According to TechCrunch, they sued hundreds of "Anonymous" individuals last
year.

How can unnamed people be sued?

------
Tharkun
"illegal"? Under what law?

~~~
haikuginger
The CFAA, which makes it a crime to access a computer system without proper
permission.

~~~
Tharkun
That's a load of controversial hogwash. If you have permission to view the
page, then you have permission to scrape the page.

Anyone can make a LinkedIn account, ergo anyone can scrape any public profile.

~~~
tyingq
People have gone to jail for it.
[https://www.wired.com/2010/11/wiseguys-plead-guilty/](https://www.wired.com/2010/11/wiseguys-plead-guilty/)

~~~
type0
They were sentenced to prison, but not for scraping; their business was a
fraudulent economic activity. They broke the terms of buying those tickets
by impersonating legitimate customers. I don't see the connection with
scraping.

~~~
tyingq
There were two charges, one specifically CFAA for bypassing the
captcha...scraping.

