
Congrats! Web scraping is legal! (US precedent) - ehurynovich
https://parsers.me/us-court-fully-legalized-website-scraping-and-technically-prohibited-it/
======
AznHisoka
Linkedin is taking this to the Supreme Court:
[https://www.law360.com/articles/1237505/linkedin-will-go-to-...](https://www.law360.com/articles/1237505/linkedin-will-go-to-supreme-court-over-data-scraping)

No ultimate decision was ever made, and no, this doesn't make web scraping
100% legal. Wake me up when there's a new announcement, because anyone
interested in this already knows this is old news.

~~~
zackmorris
This is a really big deal. Currently (IMHO) the US Supreme Court is a wholly-
owned subsidiary of multinational corporations due to the shenanigans that
happened with Obama, McConnell and Garland, so will likely side with LinkedIn
since it's the larger corporation:

[https://www.npr.org/2018/06/29/624467256/what-happened-with-...](https://www.npr.org/2018/06/29/624467256/what-happened-with-merrick-garland-in-2016-and-why-it-matters-now)

I feel like siding with LinkedIn here would open up the web to extortion
though, like troll companies that would send cease and desist letters to all
scrapers (even search engines). I think it could be argued that letting one
company scrape when another is denied is discrimination.

Then again, I don't know how conservative and republican-leaning courts decide
corporate law. Maybe in this case since so much money is at stake, they might
worry that banning scraping would infringe on something like free speech and
ruffle the feathers of some of the wealthier contributors in their base.
Especially on the media side since I imagine they use bots in one form or
another to find newsworthy stories.

IANAL (obviously!), I just find it entertaining/dismaying to ponder these
things in these times.

~~~
shkkmo
The thing is, this is still all about a preliminary injunction. Even if the
injunction is found to be without merit, that still doesn't provide a final
answer as to whether LinkedIn can successfully sue HiQ under the CFAA to force
HiQ to stop scraping.

------
powrtoch
"HiQ only takes information from public LinkedIn profiles. By definition, any
member of the public has the right to access this information. Most
importantly, the appeals court also upheld a lower court ruling that prohibits
LinkedIn from interfering with hiQ’s web scraping of its site."

Surely I'm not reading this correctly. This would seem to suggest that
websites are not legally allowed to prevent bots from crawling their sites.
Lots of sites have ToS preventing such things, are those legally void now? Are
captchas on public pages illegal, even if you request the page 8000 times in a
second?

"In this case, hiQ argued that LinkedIn’s technical measures to block web
scraping interfere with hiQ’s contracts with its own customers who rely on
this data. In legal jargon, this is called “malicious interference with a
contract”, which is prohibited by American law."

This is almost weirder. If LinkedIn wanted to force users to sign in to view
profile info, would they be not allowed to do that because some company had
signed a contract that implicitly assumed access to that data? If someone
writes a web scraper for my site, and I unknowingly change my site in a way
that breaks that scraper, can a court force me to revert the change?

Seems to imply that every business is somehow beholden to every contract
signed by anyone.

~~~
nathanlied
> Lots of sites have ToS preventing such things, are those legally void now?
> Are captchas on public pages illegal, even if you request the page 8000
> times in a second?

ToS are subservient to the law; you can (probably) terminate the service
account of a user who breaks your ToS, but if the user does not have a service
account (as is the case for HiQ; it doesn't seem they were using accounts for
this), then your ToS does not apply, since you've technically not entered into
a binding legal contract with them.

> This is almost weirder. If LinkedIn wanted to force users to sign in to view
> profile info, would they be not allowed to do that because some company had
> signed a contract that implicitly assumed access to that data? If someone
> writes a web scraper for my site, and I unknowingly change my site in a way
> that breaks that scraper, can a court force me to revert the change?

IANAL, but I believe that'd come down to intent, and intent is often difficult
to prove at a personal level, but not necessarily at a company level. If your
intent in putting up barriers that happen to impact scraping, whatever they
may be, was indeed to knowingly prevent scraping by a particular company, then
you may be liable under this decision. This is the only part of the decision
I'm torn on, since it's a bit messy to really prove such things. I'd be much
more comfortable with allowing companies to take whatever measures they feel
necessary to prevent scraping, and also allowing scrapers to legally
circumvent those measures without threat of prosecution, assuming they didn't
actually hack into anything.

~~~
andrewmutz
> but if the user does not have a service account (as is the case for HiQ, it
> doesn't seem they were using accounts for it), then your ToS does not apply,
> since you've technically not entered a binding legal contract with them.

Are you sure about this? I am not a lawyer, but I believe that the Terms of
Service applies to all users, not just those that explicitly set up a user
account.

I have interpreted the LinkedIn ruling to mean that scraping public data is no
longer _criminal_ activity but it still leaves you open to _civil_ lawsuits
for violating the ToS of the website you are scraping.

~~~
rohansingh
> Are you sure about this? I am not a lawyer, but I believe that the Terms of
> Service applies to all users, not just those that explicitly set up a user
> account.

How would that even work? If I browse to any random public page of your
website, it's served to me before you've even transmitted the terms of
service. How could I be bound by those terms of service when I haven't even
seen them?

~~~
andrewmutz
As an engineer, I agree with what you are saying, but I think normal people
and the courts disagree.

I think these sorts of contracts are called Adhesion Contracts
([https://www.investopedia.com/terms/a/adhesion-contract.asp](https://www.investopedia.com/terms/a/adhesion-contract.asp)) and
we interact with them all the time. For example, if you valet your car, the
valet will hand you a piece of paper with a number printed on it to retrieve
your car. On that paper you will find an adhesion contract that is valid and
real (although not as powerful as the types of contracts that you sign).

~~~
AstralStorm
This does not work, at least for software licensing, based on precedents for
shrink-wrap contracts, so it likewise would not work for licensing use of
data.

The paper the valet hands you is not an immediate contract: you can decline to
agree, and the service simply does not happen.

You cannot do that with a publicly visible website, unless you show the ToS
and require agreement before first use. If you only allow a non-transferable
license, then that data cannot be used by a search engine. If it's
transferable, you've just pushed the problem down to scraping a different bot.
(Well, you could have a direct agreement with a few major search engines.)

Caveat emptor: not a lawyer.

------
kabacha
The toxicity towards web scraping is really what makes me lose hope in the
current web. People want their data to be public, and all of the benefits that
come with public data, but then they want to choose who gets to see it. It's a
complete and utter paradox.

This precedent doesn't really mean much, but it's definitely a step in the
right direction.

~~~
DougWebb
One of my clients is involved in property tax collection and reporting.
Property tax records are public info, and their website allows looking up the
records for any property without a login. However, the data behind this
website is the _source_ of the public records, not the public records
themselves (those would be the local government databases).

For years now we've been in an arms race with someone using a botnet to scrape
all of the account information for a particular county. My client doesn't care
so much about the data; it's the server load that's a problem. Normal activity
for this site is a few dozen account searches per minute, but when the botnet
gets through our blockade it sends hundreds of search requests per second,
overwhelming the site. The operator of the botnet has NEVER tried to contact
my client to ask for an efficient api to access the data, which they'd
probably provide for a minimal fee.

Data hosting isn't free, even if the data is.

~~~
vorpalhex
Wouldn't the solution be to offer a streamlined download (maybe even as a
torrent if you're worried about bandwidth) of all the data then?

~~~
Symbiote
I work on a fully open data repository. The website has the API linked in 3
places, so when I find inappropriate scraping I block it with "HTTP 420 ...
see <API link> or contact <email>".

Some people probably switch to using the API, but no-one has ever contacted
us. They either give up, or run their scraper on a different computer -- I've
seen the same scraper move between university computers, departments, then (in
the evening) to a consumer broadband IP.
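The kind of block described above can be sketched roughly like this. Note this is a hypothetical illustration, not the repository's actual code: the threshold, URL, and contact address are invented, and it uses the standard HTTP 429 "Too Many Requests" status rather than the non-standard 420 mentioned in the comment.

```python
# Sketch: answer clients whose request rate looks automated with a message
# pointing at the documented API, instead of silently banning them.
# Threshold, URL and contact address are hypothetical placeholders.

API_DOCS_URL = "https://example.org/api-docs"
CONTACT = "helpdesk@example.org"

def response_for(requests_per_minute: int, threshold: int = 120):
    """Return an (HTTP status, body) pair for a client at the given rate."""
    if requests_per_minute > threshold:
        body = (f"Automated scraping detected. Please use the API "
                f"({API_DOCS_URL}) or contact {CONTACT}.")
        return 429, body  # 429 Too Many Requests
    return 200, "normal page content"
```

The point of returning a message rather than a bare error is exactly what the comment describes: giving the scraper an obvious path to the supported API or a human contact.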

~~~
magduf
I really don't understand why anyone would bother writing and using a web
scraper when an API exists. Does the API not provide all the same
data/functions as the website? Scrapers are a big PITA compared to just using
an API: they're much harder to write to be reliable, and they can break at any
time, whenever the site makes even the smallest change. APIs avoid all that
mess, and make performance far better too (on both sides), since you're only
downloading the data you want, not a ton of Javascript and HTML that you
don't.

~~~
barrkel
APIs are often not as complete as the web interface, since the customer sees
the web interface and normally the customer is what drives the revenue model
of the company.

If pages are driven via an API, then the API is preferable, but publicly
facing websites are often a mix of server-side HTML generation and API
enrichment, for caching if nothing else.

~~~
magduf
In that case it seems that the webmasters complaining about scraping need to
make sure their APIs actually provide access to all the same data, if they
want people to use the APIs instead of scraping.

------
pseudolus
The title of this post is misleading. The Court's decision related to hiQ's
attempt to obtain a preliminary injunction. It's clearly an initial victory
for hiQ in that the Court affirmed granting an injunction based on a
significant likelihood that hiQ would ultimately prevail and that hiQ would
suffer irreparable damage if the injunction was not granted. However, the
Court never actually dealt with the merits of the case and, accordingly,
stating that the case has precedential value is misleading. The Court itself
noted:

"At this preliminary injunction stage, we do not resolve the companies’ legal
dispute definitively, nor do we address all the claims and defenses they have
pleaded in the district court. Instead, we focus on whether hiQ has raised
serious questions on the merits of the factual and legal issues presented to
us, as well as on the other requisites for preliminary relief."

------
Nitramp
German copyright has the concept of a "Datenbankwerk" (since the 90s).

E.g. the telephone book contains lots of boring facts that are each in
themselves not copyrightable. However the collection in itself is
copyrightable, as it required substantial effort to create.

It seems odd that US copyright law wouldn't have a similar provision, or that
it doesn't apply here?

~~~
dsr_
It does, and the result is that phone books and maps get fictional entries
inserted in order to prove copying -- because it is perfectly legal to do your
own work to amass the same data set.

[https://en.wikipedia.org/wiki/Fictitious_entry](https://en.wikipedia.org/wiki/Fictitious_entry)

~~~
AgloeDreams
Like a paper town, or that time Genius caught Google stealing their lyrics:
they wrote a tool that watermarked songs by interchanging straight and curly
apostrophes on the site. It was brilliant.

~~~
EForEndeavour
You might even say it was Genius.

------
amelius
> Now many site owners are trying to put technical obstacles to competitors
> who completely copy their information that is not protected by copyright.
> For example, ticket prices, product lots, open user profiles, and so on.
> Some sites consider this information “their own”, and consider web scraping
> as “theft”. Legally, this is not the case, which is now officially enshrined
> in the US.

Does this mean we can now scrape e.g. YouTube videos, Amazon reviews, IMDB
reviews, Facebook events ... ?

~~~
grepfru_it
Yes, you can scrape them; no, you cannot republish them. Everything you listed
is protected by copyright, and this ruling does not let you infringe on
copyrights.

>hiQ argued that LinkedIn’s technical measures to block web scraping interfere
with hiQ’s contracts with its own customers who rely on this data. In legal
jargon, this is called “malicious interference with a contract”, which is
prohibited by American law

Does this mean that Google's random recaptcha check is interference?

~~~
peeters
I think any ruling that says LinkedIn can't put in protective measures against
automated requests is doomed to be overturned, as long as those measures
aren't applied discriminatorily. Captchas, rate limiting, user-agent testing,
etc. are all common tools to protect against malicious or unintentional
denials of service. The question is what LinkedIn was doing, and whether it
specifically targeted hiQ while permitting others in the same class of
traffic.

~~~
buboard
Why would it be an issue if it is discriminatory? LinkedIn can use its servers
any way they like, unless they've promised their users that their data can be
scraped indiscriminately.

~~~
matttb
Because of the court case. This is just an injunction pending an actual
decision.

------
mrosett
I'm glad scraping isn't criminal. Applying the CFAA here is ridiculous. But
saying LinkedIn can't put technical measures in place to prevent scraping
seems like a huge stretch to me. Why should they have to pay server costs for
persistent scraping, particularly from a company that is actively trying to
harm them?

------
echelon
So many ideas start to come to mind if scraping is legal.

Can we start to scrape Google Search in order to bootstrap building an
alternative to Google Search? Search is a really hard problem (that somebody
should tackle), but if we can leverage what Google has already scraped from
the web and associated with popular search terms, we can use that to help
train and validate our search model.

Can we scrape Reddit, Twitter, or Facebook in order to stand up a competing
service that strips out all the ads? It's hard to bootstrap a social media
website, but if you can import all the content from the existing giants, your
site is no longer a wasteland.

Can we finally scrape and get rid of IMDB? I'd love to put all of their
content on a wiki and be done with it.

~~~
cirenehc
> Can we scrape Reddit, Twitter, or Facebook in order to stand up a competing
> service that strips out all the ads?

How are you going to pay for it? Subscription model doesn't work for
search/social networks.

~~~
echelon
Ad-free Reddit would be sustainable if:

\- Comments are ephemeral, expiring after two weeks (no growing storage costs)

\- "Reddit Gold" helps to offset costs

\- Run Wikipedia-like donation drives yearly

\- Write everything in bare-metal Rust so that CPU is cheap. Likewise, make
intelligent choices about schema and service design for scalability.

\- Don't continue to drive unnecessary feature work (that is usually just to
drive ad engagement and growth).

------
michaelbuckbee
IANAL - but I also ran afoul of LinkedIn's data-scraping blocks (and got a
cease-and-desist letter), and what they said was that it was a violation of
the ToS.

This decision doesn’t seem to affect that?

------
onetimemanytime
>> _Most importantly, the appeals court also upheld a lower court ruling that
prohibits LinkedIn from interfering with hiQ’s web scraping of its site. This
fundamentally changes the balance of power in dealing with such cases in the
future._

This, I don't agree with. I agree that it's not fraud to send a bot to scrape
public info, but the site should have every right to block a bot or person.

~~~
ksangeelee
This doesn't sit well with me either. However, LinkedIn are trying to redefine
a fundamental principle of the web, i.e. easy access to publicly available
information, and they're doing it simply to protect their commercial
interests.

It would be terrible to see the web compromised by a spat between two (in my
opinion) scummy companies, and I think we could do with some hard push-back
against attacks on the web generally.

------
kaveh_h
One question to those who dislike web scraping because they deem it to
infringe copyright law.

Given that Google scrapes LinkedIn public profiles and adds data from them to
its index and its search results, is it then not discrimination that Microsoft
tries to block hiQ and not Google?

------
pseudolus
A link to the actual decision:

[http://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17...](http://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf)

------
mic47
I strongly disagree that disallowing scraping protection on a social network
is a good thing (reasons mostly from this thread:
[https://news.ycombinator.com/item?id=22182144](https://news.ycombinator.com/item?id=22182144))

What I would do on Linkedin side is:

1\. Split the public setting into 2 settings: public for everyone including
scraping, and a second choice that makes it public for LinkedIn users only
(i.e. banning scraping [1]).

2\. Everyone on LinkedIn whose current setting is public would go through a
dialog to choose one, with the default set to ban scraping (which is hard to
criticize, since that option protects privacy more).

3\. If you allow scraping, the form should warn you about the consequences and
make you acknowledge them (which is again good practice, if you want to expose
your information to EVERYONE on the internet).

4\. Now you have users' consent to protect their data from scraping.

Sounds like a dark pattern? Kind of, yes, but for a good cause, and it's hard
to argue against it, since it increases privacy.

[1] The idea here is that I can allow anyone on LinkedIn to see my profile,
whether it is a potential employer or someone who met me at a conference (i.e.
to improve my LinkedIn experience). But I do not want that data to be
harvested by a third party for any reason (especially because they would then
use it to send me spam / advertising / prefill my profile on a different
website / ...).

edit: formatting

~~~
buboard
Scrapers can just create a user account then. They still seem to be covered by
this injunction

~~~
mic47
Legal: It will be hard to argue in court that you have the right to create
fake accounts. That is definitely against the ToS, and at that point LinkedIn
can sue the company instead, so I believe the scraper would not be successful
in court.

Technical: This is actually what you want, as limiting scraping per user is
easier than per IP, since an IP can be shared by users, so your rate limits
have to be higher for an IP address than for an individual user. Additionally,
creating fake accounts is more work, which in the end makes scraping more
expensive.

~~~
buboard
a real account could scrape tons of stuff given time

------
tiffanyh
This is phenomenal timing for Clearview AI, who in the last week were exposed
by the NYTimes for working with law enforcement to identify suspects via their
database of web-scraped images of individuals.

[https://news.ycombinator.com/item?id=22173899](https://news.ycombinator.com/item?id=22173899)

[https://news.ycombinator.com/item?id=22083775](https://news.ycombinator.com/item?id=22083775)

------
francasso
So, if a company sells a product based on scraping Google search results,
would Google trying to block scraping constitute "malicious interference with
a contract"?

~~~
icedchai
This is interesting to think about, considering Google gets its own results
from scraping...

------
andrewseanryan
I ran a company based on web scraping for a couple of years, and I heard
never-ending comments about how what we were doing was illegal. Thank god that
conversation is over.

~~~
foxx-boxx
You are abusing other people’s websites without any intention of buying
something. It’s a lot like stealing.

~~~
ravenstine
So... Google are the world's most successful criminal syndicate?

------
jakelazaroff
Let’s not pretend this is a pure win. There are good uses of web scraping,
like Archive.org trying to preserve the web. But what HiQ is doing is looking
at public LinkedIn profiles and then snitching to employers if they think an
employee is searching for a new job.

It’s easy to blanket say “web scraping is legal, do what you will“. The tricky
part is protecting people’s public data while not giving a huge moat to giant
corporations who control it.

~~~
wtetzner
That's the thing. Web scraping isn't really the problem here. It's what
companies are doing with personal information. If LinkedIn started doing the
same thing as HiQ, it would be just as bad (probably worse), but the legality
of web scraping is irrelevant to that.

~~~
jakelazaroff
That's a good point, and we certainly should write our data protection laws to
prevent LinkedIn from doing the same thing — but there's a crucial difference
between the two.

I've consented to give my data to LinkedIn, and I can withdraw my consent and
data if they start doing something I don't like. On the other hand, hiQ has
vacuumed up my data _without_ my consent, and there's really no way for me to
stop them other than retreating from my public profile. Certainly the legality
of web scraping is relevant to _that_.

~~~
wtetzner
I guess that's the issue. We need laws to control what companies can do with
personal information, even if it happens to be publicly available. I don't
think scraping itself is really the issue. If you used Mechanical Turk to hire
a bunch of people to go look at user profiles and write down information about
them, you'd have the same problem.

------
dragonsh
This case does not make web scraping legal or illegal; it just sets a
precedent that the Computer Fraud and Abuse Act (CFAA, 1986) cannot directly
be applied to web scraping of public data for fair use. Web scraping can be
legal with caveats: as long as you are just scraping public information and do
not resell it without significant transformation and value addition, i.e. the
fair use doctrine applies.

If you use bots to log in with a username and password and then scrape the
information, it's still wrong and an infringement, as the act of logging in
binds you to an implicit contract with the website: by logging in you accept
its terms. The ruling in hiQ vs LinkedIn is quite nuanced, and if you are
crawling and then repackaging that information and selling it somehow, there
is a high likelihood the precedent in this case won't apply.

Technical barriers like rate limiting and captchas are legitimate ways to
guard against not just web crawlers but denial of service (DoS) attacks, so a
lot depends on how the lawyers frame them. So in general, a website can still
continue to block, as Google, Facebook, Amazon and all the big sites are
doing; it's not illegal. I found a much deeper understanding of this case in
the explanations from the EFF. [1] [2]

[1] [https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data)

[2] [https://www.eff.org/document/hiq-v-linkedin-ninth-circuit-de...](https://www.eff.org/document/hiq-v-linkedin-ninth-circuit-decision)

------
fouc
Anyone ever read any stories about the concept of "agents", where everyone had
their own computer agents that did all the work and talked to other agents to
do all the steps: book your flights and tickets, order your food,
collect/collate data for your purposes, etc.?

We need to make the internet "Agent" friendly.. we should stop assuming the
end user (end human?) will ever see any webpage on the internet.

~~~
Jupe
Having an agent proxy for you won't last long. If that agent has connections
to your friends' agents, your place of work, etc., then the protection
provided by the proxy whittles away quickly.

To find you, all anyone would need is a photo of you: match it to an existing
photo tagged as your agent, and your agent's online presence is clearly
identified.

You could rotate to a new agent every time you go online, or randomly while
online, but forget about making purchases, maintaining friendships or building
"cred".

The only sure way to separate you, the person, from your online presence is to
not have an online presence. For the company in question above, the only way
to isolate yourself from their "running to tell Mom" business model is to
simply not put your resume up on LinkedIn.

This ruling is troubling in many ways. HiQ can scrape the web, and now,
apparently, so can every other personal-information broker out there. (Ref:
[https://en.wikipedia.org/wiki/Information_broker](https://en.wikipedia.org/wiki/Information_broker),
which naively suggests this information is only important to advertisers; it
is also of interest to governments, police, employers, even the parents of the
S.O. you intend on marrying, or the S.O. themselves.)

Is there a company that will track your online posts while you are supposed to
be working, and "run and tell Mom" that you were doing something other than
work? As an employer, would you pay for such a service?

I'm not even convinced that air-gapping your person from the internet would
work in this new world. A lack of online presence, while not damning in itself
(yet), could indicate "something to hide."

------
jackfoxy
> The CFAA is adopted to prevent deliberate intrusion on someone else’s
> computer — in particular, computer hacking

 _hacking_ is commonly used to mean different things, and we infer from the
context what the author really meant. Do any lawyers know whether _computer
hacking_ has a legal definition, or whether this decision will lead to a
specific legal definition of the term?

~~~
senorjazz
From what I understand it is very loose, and can mean "you accessed our
systems legitimately but we didn't want you to".

I think it was last year that Facebook went after fake-likes / fake-followers
companies who were logging in via the login page and then liking / following.
Companies in China, New Zealand and I think NY? got threatened with the CFAA
(unsure if it went further), but it made the tech news at the time.

I don't think anyone here would call writing a bot that logged in with real
details and performed an action (the same as a real person could) hacking, but
Facebook were saying that since it broke the ToS, it was unauthorized, thus
hacking laws apply. With their budget, I guess they get to decide what hacking
is.

------
CrazyStat
The EFF has a writeup on this decision from September that I found more
helpful than the linked article:
[https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data)

------
Mikho
So, the whole of Google search is based on scraping publicly accessible
information from public websites. Nothing surprising about the practice being
declared legal. Otherwise they would have had to demand that Google basically
liquidate Google search unless it signed an agreement with every single site
in its index.

------
prirun
IANAL, but IMO LinkedIn goofed by using the CFAA. eBay and Bidder's Edge had a
similar case, and eBay won, preventing scraping of its website under a
trespass theory:

[https://casetext.com/case/ebay-v-bidders-edge](https://casetext.com/case/ebay-v-bidders-edge)

I can see both sides of this, and am not sure which is better. If a site is
not allowed to block traffic, does that mean blocking a DDOS attack is
illegal? What if the company doing the scraping is so aggressive about it that
they are impacting performance?

One fair way (to me) would be to figure out how many pages a typical user
looked at per month, and if a company is scraping the site, they are limited
to accessing it like one customer; they can't launch 100 threads to scarf down
data as fast as possible. This probably wouldn't work for most scrapers'
businesses, but it seems reasonable from the origin web site's POV.
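The "limited to accessing it like one customer" idea amounts to per-client rate limiting, which is commonly implemented as a token bucket. A minimal sketch (the rate and burst numbers are invented for illustration, not taken from any real site):

```python
import time

# Hypothetical token bucket: each client earns tokens at roughly a human
# browsing rate and spends one per page, so a scraper is throttled to about
# one user's footprint no matter how many threads it launches.

class TokenBucket:
    def __init__(self, rate_per_minute: float = 30, burst: int = 10):
        self.rate = rate_per_minute / 60.0   # tokens added per second
        self.capacity = burst                # max tokens saved up
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A burst of 100 rapid requests would exhaust the bucket after the first `burst` pages; the rest get denied until tokens refill at the configured human-scale rate.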

Another possibility would be to require web sites that accept and then
publicize user submissions to give users a choice about whether they want to
make their information public. If they do, the hosting site has to include the
information in a publicly-accessible feed, like a daily compressed download.
This seems reasonable to me, because the hosting site is getting all kinds of
information for free from users. It's not like it would take a lot of effort
to do a streaming JSON dump of the stuff that changed every day, and if
necessary, they could throttle the download rate, but not so low that the data
for a day couldn't be downloaded in a day. Ie, a competitor might be a day
behind, but no more. Of course, the competitor would have to publish a feed
too, allowing other people to lunch off of them just like they are lunching
off other sites.
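The proposed daily feed could be as simple as a gzip-compressed JSON-lines file per day. A minimal sketch, with invented record fields and filenames:

```python
import gzip
import json

# Hypothetical daily "what changed" feed: one gzip-compressed file per day,
# one JSON object per line, so consumers can stream it without loading the
# whole dump into memory.

def write_daily_feed(changed_records, path):
    """Write the day's changed records, one JSON object per line."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in changed_records:
            f.write(json.dumps(record) + "\n")

def read_daily_feed(path):
    """Read a feed file back into a list of records."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Throttling the download, as suggested above, would then just be a cap on the bytes-per-second served for these files, set high enough that a day's file downloads within a day.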

With LinkedIn, all of the information they obtain and publish is submitted by
users, so for LI to say they "own" this information makes no sense. At least
with eBay, the current auction price is not something the user submitted,
assuming it has bids.

------
shkkmo
Ugh, this is yet another piece of poorly written coverage of a court case that
is more complicated than simple snippets and one-liners allow.

This absolutely does not say that web scraping is legal (the ruling
specifically notes other ways that it could be illegal).

It doesn't even say for sure that web scraping isn't illegal under the CFAA.
This is a ruling about a preliminary injunction, and as such all it determines
is that HiQ has a good enough argument and valid enough concerns that a
preliminary injunction is warranted unless and until LinkedIn wins a suit
against HiQ to force them to stop scraping.

Take the time, and read the linked ruling. Don't trust poorly written and
researched articles from companies that stand to gain from a particular
interpretation.

------
ondrae
I have a question about the company in the article, HiQ. Their website says
they identify flight risks. Does that mean if your company has a contract with
them, then they will notify your boss if your LinkedIn looks like you are job
hunting? Haters.

------
nickjj
I guess this means the web is lawfully seen as belonging to the public which
seems both good and bad, because while it might be publicly accessible as a
whole, each individual site is privately owned.

I'm pretty sure that with any brick-and-mortar business, as the owner you are
legally allowed to refuse service to anyone without having to provide a
reason.

That would be the equivalent of being allowed to block bots / unwanted traffic
on your site, except the brick-and-mortar case is far easier to enforce. If a
person won't leave your physical business, you can have them removed by the
police, but to prevent a bot the burden is fully on you as the site owner.

~~~
schwartzworld
> as an owner you are legally allowed to disallow service to anyone without
> having to provide a reason

as a website owner you absolutely have the right to limit who does or doesn't
visit your site, but that's up to you to enforce. Scraping is literally
equivalent to reading a giant public billboard and writing what it says
somewhere else. How could that be illegal?

~~~
nickjj
> Scraping is literally equivalent to reading a giant public billboard and
> writing what it says somewhere else. How could that be illegal?

If you treat the website you're visiting as a privately owned business then
you're trespassing on private property by scraping their site, assuming that
site is trying to prevent that behavior.

If you don't treat the website as a privately owned business, then what do you
classify it as? It can't be considered in the public domain because someone
owns, operates and pays for the resources to make the site work. It is the
site owner's private property IMO. By having the site public, they are
inviting the world to check it out but they should have the right to disallow
service.

Also, if the internet is supposed to be distributed and each site is an
independent node on the system, how is that any different than an
independently owned brick and mortar business operating in some location? In
this case the physical world is "the internet".

It's not a clear-cut thing, and I hate the idea of censorship, but I can't see
this case's outcome becoming the norm. There are too many loopholes. Like, is
hitting your site 50 thousand times a second hoping to get new information
from the public billboard a legal move?

Or, to put it another way: I'm pretty sure that if you were able to mind-control
people and commanded a billion of them to flood a physical business so that it
could not operate and serve its customers, this would quickly be seen as an
unlawful move in the physical world. You would probably get shut down by the
state or government too, for disrupting service to neighboring businesses and
citizens.

~~~
antisthenes
You treat it as a public-facing building storefront, with you standing outside
and taking pictures.

As long as there are big glass windows, the patrons and the information gained
by looking inside carry no expectation of privacy, and are thus freely
accessible to anyone walking by.

What you're not entitled to is the backend workings of the website - the
application/code/databases and credentials. That would be akin to going into a
restaurant, forcing your way into the food prep area, stealing the owner's
keys, and contaminating the food that patrons eat.

Scraping too often would be something akin to setting up a camera tripod right
in the building entrance, hindering the influx of new customers. While not
explicitly illegal, it does hinder the operation of the business, and you will
probably be physically removed at some point (this is banning abusive bots,
etc.)
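To make the "physically removed at some point" analogy concrete: server-side,
that countermeasure is usually just a per-client rate limit. Here's a minimal
sliding-window sketch; the class, window, and threshold are illustrative, not
from any real product:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `max_requests` per client within a sliding time window."""

    def __init__(self, max_requests=10, window_seconds=1.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client id -> timestamps of recent hits

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client]
        # Drop timestamps that have slid out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # too many requests: deny (or ban the "tripod")
        q.append(now)
        return True
```

A server would call `allow(ip)` on each request and return 429 (or block the
client) when it comes back `False`.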

------
inlined
This makes me think of Aaron Swartz. If he had the strength to keep from
harming himself, his whole sentence would have likely been vacated. This is a
reminder to those of us with suicidal thoughts: things can always get better.

------
heyyyouu
Please note that I am personally more pro-scraping than anti. However, this
article's interpretation of this ruling (which is from September 2019) is
overblown and suspect:

1) This is the 9th Circuit only. The 9th Circuit is known for being very
liberal in these kinds of rulings. There is a high likelihood that this ruling
could be interpreted differently in other circuits and when it goes to the
Supreme Court.

2) Saying "it makes it legal" is more of a technicality with this ruling and
NOT a practicality. There is still a LOT open to interpretation with this
ruling (see 1, above), and a lot of what the article says just isn't found in
the case. And again, it doesn't apply everywhere.

3) This is actually a much better summary of the case:
[https://www.natlawreview.com/article/data-scraping-survives-least-now-key-takeaways-9th-circuit-ruling-hiq-vs-linkedin](https://www.natlawreview.com/article/data-scraping-survives-least-now-key-takeaways-9th-circuit-ruling-hiq-vs-linkedin)
In fact, a number of better summaries are out there by people who actually
understand both technology and law, from when it happened (again, back in
September 2019). Here are some others:

[https://www.cooley.com/-/media/cooley/pdf/reprints/2019/2019-10-28-linkedin-data-scraping-case-shows-9th-circ-shift-on-cfaa.ashx?la=en&hash=38E87A1CB3C711D3112D127C687F2C26](https://www.cooley.com/-/media/cooley/pdf/reprints/2019/2019-10-28-linkedin-data-scraping-case-shows-9th-circ-shift-on-cfaa.ashx?la=en&hash=38E87A1CB3C711D3112D127C687F2C26)

[https://www.techdirt.com/articles/20190909/17571342951/big-news-appeals-court-says-cfaa-cant-be-used-to-stop-web-scraping.shtml](https://www.techdirt.com/articles/20190909/17571342951/big-news-appeals-court-says-cfaa-cant-be-used-to-stop-web-scraping.shtml)

Honestly, it comes back to a lot of existing issues, like whether facts can be
copyrighted (e.g., this is why recipes alone can't be), whether logins imply
privacy, etc.

I personally think the ruling is in the right direction but again, I don't
think this source deals with any of the complexity or the reality of what the
ruling does (or still doesn't) mean.

------
Nextgrid
Finally some common sense. I’m also somewhat impressed that it “only” took 3
years from the initial lawsuit to reaching this precedent. I expected this to
go on for a lot more time.

------
fludlight
This only affects the ninth circuit—which includes the tech hubs San
Francisco, Seattle, LA, and Portland. It would only apply to the rest of the
country if the Supreme Court affirmed it. Even then, a well-funded company or
zealous prosecutor could say that it doesn’t apply in your case because of
some technicality. In that case you would need hundreds of thousands or
millions of dollars and a few years to litigate the issue with no guarantee of
the outcome.

~~~
jermaustin1
Is that really how circuit court rulings get applied?

I always understood each ruling on the rungs up the ladder to the supreme
court applied across the land until a final ruling was determined.

~~~
beerandt
Across the land within their circuit, over matters within their jurisdiction.
Elsewhere the ruling is merely advisory in nature.

It gets tricky with nationwide actors though.

Besides some specialized topics like patents and international trade,
nationwide orders and injunctions are the sort-of exception, which are based
on a courts local jurisdictional power over a non-local nationwide actor.

I'm not sure what the generalized name of the principle (beyond injunctions)
is (I've heard it described as a type of jurisdictional overreach), but the
presumption as applied here is that LinkedIn, being in the 9th Circuit's
jurisdiction, must also adhere to the ruling outside the circuit, absent a
contradictory ruling by a different circuit.

(One of the usual requirements for the Supreme Court to even hear a case is
that different circuits have conflicting rulings on a matter.)

But it looks like the Supreme Court is about to seriously rein that nationwide
power in soon, at least for judges issuing orders to departments/actors of the
executive branch.

~~~
dragonwriter
> One of the usual requirements for the Supreme Court to even hear a case is
> that different circuits have conflicting rulings on a matter.

That's not a “usual requirement”, it's one of many factors that can weigh in
favor of the Supreme Court exercising discretionary appellate jurisdiction
(and it's one that weighs very heavily in favor of it, even when no other
favorable factors are present, since federal law meaning the same thing
everywhere is an important principle.)

~~~
beerandt
Well of course that's what I meant by "usual requirement."

How else would one read it? :)

------
mar77i
The article's title gives me cognitive dissonance. Glad it was changed for HN.
To add my own interpretation: "US court appears to legalize X and technically
prohibit it" \-- what is this "it" the court prohibits, and why should I
assume it's the title's subject, which doesn't make sense at all? You don't
"legalize X and technically prohibit X". Was the originally intended title
truncated somehow?

------
Pigo
I always thought scraping was a fun idea, I just couldn't find the right use
case for it. I'm not a sports guy, and the big sites have pretty extensive
APIs. Something music related would be of interest to me, but I can already
get updates on events like concerts (since the monopolies make a fortune
selling tickets). I'm not sure what could be useful.

~~~
TrickyRick
I built [https://awardfares.com](https://awardfares.com) together with a
friend; it scrapes airlines' award seat availability. Airlines' websites are
horrible from a UX perspective, so scraping the data and presenting it in a
better way was a pretty obvious use case.

~~~
iso1631
No British Airways?

A quick look shows a lot of award availability to China for some reason...

~~~
TrickyRick
It's coming in the next couple of weeks! Both my co-founder and I had Star
Alliance frequent flyer programmes, so that's what we focused on first.

------
divbzero
In the past the CFAA has been wielded haphazardly and even maliciously [1]. As
noted by others these cases are far from over, but hopefully we’re shifting
towards a healthier balance between open web and private information.

[1]:
[https://en.wikipedia.org/wiki/Aaron_Swartz](https://en.wikipedia.org/wiki/Aaron_Swartz)

------
kevin2r
I wonder if this means more native apps that are harder to scrape and fewer
websites from companies that don't want their data scraped.

~~~
TrickyRick
Native apps are generally a lot easier to scrape since they rely on an API
which can't be changed willy-nilly without breaking compatibility with older
apps. Also, you can't use captchas etc. on API calls in the same way you can
on websites. And of course the data is neatly formatted to be machine-readable.
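To illustrate the "neatly formatted" point: scraping an app's backend API is
usually just parsing JSON. The response shape and field names below are made
up for illustration; they're not AwardFares' or any airline's actual API:

```python
import json

# A made-up example of what an app's backend might return: structured data,
# no HTML, no layout to reverse-engineer.
SAMPLE_RESPONSE = """
{
  "flights": [
    {"origin": "SFO", "destination": "NRT", "award_seats": 2, "cabin": "J"},
    {"origin": "SFO", "destination": "NRT", "award_seats": 0, "cabin": "F"}
  ]
}
"""

def available_awards(raw):
    """Return only the flights that actually have award seats open."""
    data = json.loads(raw)
    return [f for f in data["flights"] if f["award_seats"] > 0]
```

Compare that one-liner `json.loads` with the selectors, retries, and layout
churn involved in scraping the same data out of the airline's HTML pages.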

------
fortran77
Why do we have this editorialized title, @Dang? Why not the actual, neutral,
title: "US court fully legalized website scraping and technically prohibited
it"

BTW: This title is still not factual, but at least it's more neutral. It seems
to be an injunction, which doesn't really mean "fully legalized" but what do I
know....

------
ummonk
How does this not apply to e.g. Clearview AI, Cambridge Analytica, and other
anti-privacy scraping operations?

------
tdevito
Anyone know the laws around scraping news article meta-data(image, headline,
publisher, url) directly from a publishers website? As long as I'm not
scraping the full article, could I have my users share the scraped meta-data
between each other?
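No idea on the legal side, but mechanically the kind of extraction you're
describing is usually just reading the OpenGraph meta tags publishers already
expose for sharing. A stdlib-only sketch (the sample HTML and field choices
here are illustrative):

```python
from html.parser import HTMLParser

class OGParser(HTMLParser):
    """Collect OpenGraph <meta property="og:..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "")
        if prop.startswith("og:") and "content" in a:
            self.meta[prop] = a["content"]

def extract_og(html):
    parser = OGParser()
    parser.feed(html)
    return parser.meta

SAMPLE = """<html><head>
<meta property="og:title" content="Example Headline">
<meta property="og:image" content="https://example.com/a.jpg">
<meta property="og:site_name" content="Example Publisher">
</head><body>article body...</body></html>"""
```

`extract_og(SAMPLE)` would give you the headline, image, and publisher without
touching the article text; whether redistributing even that is permitted is
exactly the legal question above.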

------
johnward
So, Google blocks me from automating my rank checking. Will this apply to
google now?

------
marcoperaza
Dangerously misleading headline. There is a circuit split on the
interpretation of the CFAA, including on what counts as “unauthorized access”.
If you are sued outside of the Ninth Circuit, you might get very different
results.

~~~
chadash
A circuit split? Is there another circuit with a contradictory ruling?

------
safsexisgoodsex
Reading this while heading in to work on a LinkedIn shuttle brought a smile to
my face. I've always been embarrassed my company took this position given how
much scraping LinkedIn does. Do as I say, not as I do I guess?

~~~
Dayshine
And as a user of LinkedIn this news is terrible.

I share my personal data with LinkedIn under the terms you promise, which
includes not sharing it with third parties for commercial use.

If you can't enforce that then I can't use LinkedIn...

I've already gone through several 6+ month long GDPR chases to find out
marketing companies are scraping my data from LinkedIn without permission.

------
thomasjames
This wouldn't supersede any violations of the ToS, though. I guess it would
mean they can't sue you (maybe?) but they could still kick you off their
service. Could they still go after you for breach of contract?

~~~
TrickyRick
Wasn't the point that they were scraping public profiles, i.e. they never
signed up and thereby never signed any ToS?

~~~
papln
ToS aren't signed. They are terms of service, not license agreements.

ToS merely need to be disclosed, and the site can decline service if you
violate the terms. A license agreement may subject you to penalties for
violating the agreement.

~~~
thomasjames
Thanks for the clarification.

------
etxm
If your site has a substantial number of crawlers, that's an opportunity to
offer an API subscription.

No one wants to “crawl” your data, they just want your data.

No software developer or PM ever said “it’d be great if this broke every
couple of days and we had to scramble to figure out how to fix it” or “I’d
love to jump through a bunch of hoops to read that H1”

Bot detection and crawling is an arms race. Bots will always win. I’ve droned
on about this here in the past. [1]

Anecdotal evidence: Previous co-founder and software architect of a “crawler”
that crawled ~17000 store and deal sites a day (like BestBuy, Gap, RMN). We
circumvented all bot detection and were only busted and c&d’d once. By
CouponCabin of all people.

[1]
[https://news.ycombinator.com/item?id=16182405](https://news.ycombinator.com/item?id=16182405)

------
buboard
I have a feeling this precedent makes Google happy, if not that they helped
hiQ. While scraping should be legal, nobody should be preventing server owners
from using any countermeasures they like

------
JoeAltmaier
Not sure why bots are guaranteed access _as if they are a normal user_? Can I
bar a mechanical bot from entering my open-to-the-public bodega? I think I
might.

Maybe there's a need for more law here.

------
ink404
Wouldn't be surprised if this is getting pushed through (although with a
seemingly favorable outcome) because of gov't projects similar to ClearView.

------
ehurynovich
The topic disappeared from the main page of HN in 1 second, although it had
been at the first place for several hours. Can someone say what happened?

------
raxxorrax
Since this is probably taken for granted by anyone with a technical
understanding, it is nice to see that the legal system is on board with this.

~~~
bkor
Just because something is possible technically doesn't make it OK legally. I
think there are still various issues though; per GDPR, I don't think another
company can just copy that data from LinkedIn. That it's easily visible
doesn't matter for GDPR.

~~~
raxxorrax
But the offender would be LinkedIn if it exposed personal data, even per GDPR.

The cases where people were convicted for fetching certain webpages are
addressed here.

------
JoeAltmaier
Strange. Something like allowing a combine harvester into a U-Pick-It farm.
Because it's open to the public, right?

------
thebirdsbeak
IAAL - this case covers CFAA as interpreted and enforced by 9th Circuit. A
step in the right direction but not a blank cheque!

------
microcolonel
RIP Aaron Swartz.

------
bryanrasmussen
Does this mean Google can't stop me from scraping their results anymore? If I
do it from an US ip address of course.

------
ccvannorman
So, when can I force companies to stop blocking my crawlers for their public
data, and how do I accomplish this? :-D

~~~
papln
Like hiQ, you can file a lawsuit and try to win. You might want to wait to see
if hiQ wins their case.

------
alephnan
Does this mean someone can scrape a restaurant directory website and create
their own website with this same data?

~~~
buboard
i see lots of business ideas popping up in this thread

------
superkuh
Of course it's legal. Anyone holding other ideas has a fundamental
misunderstanding of how the web and public spaces work. If you want something
to be private, then you make it private. If you make it public for the whole
world to see, explicitly going out of your way to send the data to anyone who
requests it, it's all on you. Just because someone isn't using Chrome doesn't
make them unethical.

------
pylua
Perhaps LinkedIn should switch to Flutter. This type of technology will make
it harder to scrape.

~~~
anticensor
No, they should switch to _Windows Azure Forms_ [0].

[0]: Clean-room reverse-engineered server-side replacement for Flutter,
written in WebAssembly, which compiles source files to an .EXE using a .NET-
enabled WebAssembly precompiler.

------
dhruvkar
I've seen a lot of visual scrapers in the last few years.

Do any of them enable scraping behind a login?

------
eli
Doesn’t the ruling only apply in the states covered by California’s circuit
court?

------
min2bro
it doesn't look from this post like web scraping has been affirmed as a legal
activity; it's more about what can be scraped by a bot and what cannot.

------
kevin_thibedeau
Is a DDOS with a bunch of HTTP GET's just scraping?

------
65934
Does this mean Google can take lyrics from Genius now?

------
Mikeb85
Of course it's legal. I don't know how you can make the argument that, if you
have a public-facing website, you can arbitrarily make rules concerning
exactly how that website is viewed.

------
aantix
I hope Google eases up with the captchas.

Their only api as far as I can tell is their “custom search engine“, which
doesn’t appear to match the results of their public search and is ungodly
expensive.

------
tiku
But when will it be illegal to block scraping?

------
alexfromapex
It's the same concept as public vs private. If you want something private
don't blast it publicly.

------
salawat
Hmmm.

So, if I'm reading this right:

If it's not behind a login, it's fair game for scraping. The CFAA only
explicitly applies once you (the service provider) enter into an explicit
relationship of use with someone else.

I.e., if LinkedIn made an account necessary to even browse the currently
"public" data, and set limits as a condition of being a user, that would be
fine, and may give them a leg to stand on for future claims of misuse of their
systems, since the behavior would already be covered by the default terms of
use accepted in creating an account.

In terms of making currently public data more difficult to access in response
to scraping, I foresee business evolving even more in the direction through
which everything gets locked behind an initial contract step to ensure that
people can reserve the right to refuse access.

I'm very wary of the whole malicious-interference-with-a-contract reasoning
that hiQ employed. The fact that you rely on a public data source being
accessible in a way convenient to you, in order to deliver a service to an
unrelated third party, should not obligate the owner of that public data
source to keep delivering the data in that convenient way. That type of thing
would lead to ridiculousness like being able to sue the makers of the
Yellow/White Pages for changing things up.

Further, I still see grounds for remedy for LinkedIn in that they do still
have a right to refuse service to anyone, regardless of outstanding contracts
entered into by that person. The only leg to stand on that I can see for hiQ
is if the court limited the precedent being set to the case where LinkedIn
identified them, notified them and requested that they cease and desist, hiQ
refused (or hiQ offered to enter into an alternate business arrangement and
LinkedIn refused to accommodate), and then LinkedIn implemented the
interfering measure. In that circumstance, I can see a malicious-interference
claim maybe being upheld, but I still have very little sympathy for hiQ
because, whether the pages are public or not, they are imposing a significant
cost on LinkedIn to serve that data if hiQ is systematically enumerating the
entire public LinkedIn dataset.

To me, hiQ is a guy OCR'ing the White Pages/recording a storefront 24/7. The
details of how the Internet operates change that a bit, but nevertheless, the
parallel is there.

It is a fundamental continuation of a trend I've noticed in tech:
practitioners, and the businesses practitioners write code for, for whatever
reason assume that the platform running/servicing the code they write
fundamentally "belongs" to them, and that any data their program can get
access to is fair game.

I don't know where the disconnect is, or if maybe I'm just a freak in that I
treat someone else's hardware as if I were entering their home. Even with the
code I write.

You don't just waltz into someone else's space and start using all the
amenities without asking. It's rude, and grounds to have admittance refused in
the future. You don't do things to the system without asking.

It's like... Imagine most people are blind. You're a salesman walking into
their space talking a good story or providing some service while completely
ransacking their domicile for every shred of info you can possibly take with
you. Filming the inside, the layout, rifling through mail and rolodexes, so on
and so forth, while all they are aware of is explicitly what you tell them is
going on.

As one who writes software, and automates tasks with computers, I feel it to
be our duty to _not_ facilitate that sort of behavior. I just wish I could
figure out a way to spread that ethos and keep the bills paid.

Otherwise I fear we'll lose any benefit of an assumption of integrity we've
managed to build up. Hell, maybe it's too late for that given the way things
are going.

------
chadash
I'm seeing some confusion about how this affects people outside of the West
Coast's 9th Circuit. It certainly _affects_ you if you are in other circuits.
There's no question that this precedent will be brought up in other courts,
even if they aren't bound to uphold the decision.

IANAL, but a bit of a supreme court hobbyist. For those unfamiliar, a short
lesson on how the federal court system works that I think would be useful for
HN readers:

Generally speaking, a case first goes to District Court. There are 94
districts in the country and they tend to consist of small regions, for
example, Northern California. (Note these are _federal_ courts; each state has
its own state system, which operates differently in each state.) So you first
go to one of these courts when you bring a lawsuit.

Now, the lawsuit is decided and one side is unhappy with the verdict. In a lot
of cases, they might say, "OK, I'm unhappy, but I'm also spending a lot of
money on lawyers, so I'll accept the decision and move on." Or they have the
right to appeal the decision. If they choose the latter, then the next layer
of the court system, the Circuit Courts, comes into play.

The Circuit Courts consist of 11 "circuits" plus the DC [0] and Federal
Circuits [1]. When you appeal a case to one of the circuit courts, they _have_
to hear your case. It's your _right_ to appeal. However, if they think the
case doesn't have merit, they can issue what's called a "summary judgement"
where they issue an opinion without a full trial. In other words, if they
think the lower court issued the correct decision and they think it would be a
waste of time to go through a trial to appeal it, they can look over the facts
of the case and make a decision without a trial.

At the next level up, you have the Supreme Court [2]. Unlike at the circuit
court level, you have no right to appeal to the supreme court. Generally
speaking, they get to decide which cases they want to take, so if you think
that the appeals court screwed up and the supreme court decides to not take
your case, there's nothing you can do. Unlike appeals courts, they aren't even
obligated to look over the facts of the case at all if they don't want to.

Instead, what happens is that you _petition_ the Supreme Court to hear your
case. So you lose your case at appeals court and you basically file paperwork
with the SC saying "please please hear my case, here's why i think you
should".

The SC turns down a lot more cases than it hears. So what makes them take a
case on? One is if the case is super super important. Something like a case
against Obamacare or something else of very high national significance. But
typically, the SC takes a lot of boring cases too, and for the most part, this
has to do with circuit splits. A circuit split is when two of the regional
circuit courts issue conflicting rulings. So for example, if every circuit
more or less rules the same way on a given issue, there's nothing for the
supreme court to decide on. The system is working as intended. But if two
circuits disagree, then the SC's job is to resolve the issue so that federal
law is applied uniformly.

So in this case, the ruling was issued in the Ninth Circuit (which covers
California and much of the West). Technically speaking, I can sue someone over
scraping my site in Wisconsin (in the 7th Circuit) and the judge can rule in
my favor (that scraping is illegal) since she isn't bound to follow 9th
Circuit appeals rulings, whereas a district judge in Arizona or Montana (both
in the 9th Circuit) is bound by the appeals court precedent. And then they
appeal the case in Chicago and the appeals court _also_ isn't bound by the
Ninth Circuit ruling. But you can be damn sure that the lawyers for the
defense are going to bring up that Ninth Circuit case as precedent. And
generally speaking, the precedent does matter (circuits don't _want_ to create
circuit splits).

Coming back to this case, does this mean that you have carte blanche to scrape
websites anywhere in the country? No, case law is going to need to evolve more
to get to the point where you can safely think that way. But, this is
definitely an important step in that direction.

[0] Washington DC has its own circuit, even though it's just a city, not a
region. This seems odd at first glance, but in fact, a lot of lawsuits against
the federal government come through this circuit, which is why DC gets its own
circuit while, for example, NYC does not.

[1] The United States Court of Appeals for the Federal Circuit is a special
case. Most of these circuit courts have to do with regions. So if I commit a
federal crime in Florida, it gets tried in a Florida District court and then
it gets appealed in the seat of the 11th circuit, which is in Atlanta.
However, certain cases, based on subject material, don't get appealed in
Atlanta, but go instead to the Federal Circuit. These tend to be things with
national ramifications. Patents are a prime example, where you really really
don't want a patent being enforced in Iowa, but not in Alabama. So for these
special cases, we set up a different appeals system.

[2] There are a few times when cases go straight to the supreme court. For
example, in disputes between two states or cases involving ambassadors and
other public ministers, a case might go straight to the SC and skip the lower
courts. But this is the exception, not the rule.

