
Accessing Publicly Available Information on the Internet Is Not a Crime - DiabloD3
https://www.eff.org/deeplinks/2017/12/eff-court-accessing-publicly-available-information-internet-not-crime
======
us0r
What makes this extra ridiculous is the fact that LinkedIn built its business
on scraping not publicly available information but the private address books
of unsuspecting users.

~~~
jacquesm
And spamming those contacts with requests to join that looked as if they
originated from your business relations when that definitely wasn't the case.

~~~
bhhaskin
It's the main reason I don't have a LinkedIn and never will. They are a scummy
company.

------
laderach
This is incredibly important. If you dig deep into why LinkedIn is behaving
the way it is, it is definitely not an attempt to protect users' privacy.
It's all about maintaining and expanding the ways it can monetize the data
that users provide.

This is the type of thing that we risk losing as the internet matures and
internet companies with vested interests gain more power. Setting this type of
precedent will absolutely curtail innovation and freedom in the future. Think
about it: would Google have been created in an environment that is
overwhelmingly siloed and filled with red tape?

I see parallels to the net neutrality discussion in this.

------
ynniv
Access that does not require authentication should never be a crime. If
LinkedIn wants the courts to intervene, they must require authentication for
their data. If they also want Google to scrape their site, they must require
Googlebot to authenticate itself.

~~~
gwbas1c
> Access that does not require authentication should never be a crime.

Careful, this could legitimize things like accidental denial of service.
Depending on circumstances, even basic scraping could cause problems.

(I need to be vague to avoid violating an NDA.) A major internet site had a
URL that went something like somedomain/group?id=xxxxx. It turns out that a
simple scraper that requested id=1, id=2, id=3, etc., etc., caused a major
problem! This was because rendering these pages required significant
resources, so the most active pages were kept in RAM. Of course, the scraper
tried to read everything.

Of course, no one thought the scraper was malicious in any way!
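A minimal sketch of what the polite version of such an enumerating scraper might look like (the domain, URL pattern, and delay here are made up; the actual scraper's code is not public):

```python
import time
import urllib.error
import urllib.request

BASE = "https://somedomain.example/group?id={}"  # hypothetical URL pattern
DELAY = 2.0  # seconds between requests; tune to what the site can absorb

def polite_scrape(start, stop):
    """Enumerate ids one at a time, pausing between requests and backing
    off hard when the server signals overload."""
    pages = {}
    for i in range(start, stop):
        try:
            with urllib.request.urlopen(BASE.format(i), timeout=10) as resp:
                pages[i] = resp.read()
        except urllib.error.HTTPError as err:
            if err.code in (429, 503):  # server asked us to slow down
                time.sleep(60)
            continue
        time.sleep(DELAY)  # the part the scraper in the story skipped
    return pages
```

Even a fixed two-second pause would have turned that incident into background noise.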

~~~
odorousrex
>A major internet site had a URL that went something like
somedomain/group?id=xxxxx. It turns out that a simple scraper that requested
id=1, id=2, id=3, etc., caused a major problem!

This is a failure on the part of the developers at that "major internet site".
Using a GUID instead of consecutive IDs, a rate limiter, hell, even just a
cache... or all of the above. There are lots of solutions here.

You have to take robot scraping and indexing into consideration, and assume
people will ignore robots.txt. (Certain bots, e.g. msnbot/bingbot, are quite
aggressive!)
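For what the rate-limiter option could look like server-side, here is a rough token-bucket sketch (illustrative only, not anything that site actually ran):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client limiter: roughly `rate` requests/second, bursts up to `burst`."""
    def __init__(self, rate=5.0, burst=10.0, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = defaultdict(lambda: burst)  # each client starts with a full bucket
        self.last_seen = {}

    def allow(self, client_ip):
        """Return True if this request is within budget, else False (send a 429)."""
        now = self.clock()
        elapsed = now - self.last_seen.get(client_ip, now)
        self.last_seen[client_ip] = now
        # Refill tokens for the time that has passed, capped at the burst size.
        self.tokens[client_ip] = min(self.burst,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False
```

A scraper walking id=1, id=2, id=3 back to back exhausts its burst almost immediately and gets 429s instead of taking the site down.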

~~~
dec0dedab0de
No, that is a failure of the developer of the scraper. I am definitely pro
scraping, but you have to be a good neighbor.

~~~
jstarfish
How the hell is the scraper dev supposed to anticipate how poorly written
these particular views are with no backend knowledge? If not an automated
scraper, a thundering herd from content gone viral would trigger the same
result.

~~~
bottled_poe
Scraping is not an intended purpose for most websites. Unless the website
specifically states that this is an intended function, it is not reasonable to
assume so. In fact it may be in violation of the terms and conditions of the
given website.

~~~
nostrademons
If the law assumed that only intended functions are permissible, innovation
would be a crime. By definition, innovation is finding _new and unforeseen_
uses for resources.

~~~
blowski
You both make good points. If you make the law too strict you punish
reasonable uses of the website, like scraping a few publicly available pages
to help users. If you make it too lenient you permit DOS attacks.

It’s not easy to craft a law that will punish bad behaviour without blocking
innovation.

------
feelin_googley
Here is an example of the "good bot"/"bad bot" nonsense in action.

This is an article about the LinkedIn v hiQ case at AdWeek.

    curl --user-agent INSERT_ANYTHING_HERE http://www.adweek.com/digital/rami-essaid-distil-networks-guest-post-linkedin-hiq-labs/

It seems AdWeek can distinguish a "good bot" from a "bad bot" irrespective of
the behavior of the user^W bot, i.e., whether it is one single HTTP request or
10,000 consecutive requests is irrelevant.

How do they do it?

Pattern match against the User-Agent string.

Effective shibboleth.^W engineering.

Clarification: If a _user_ , not a "bot", makes the "wrong" choice of user-
agent string (e.g. in the browser settings), then they will be labeled a "bad
bot", even if their behavior is no different than other users who are not
labeled "bad bots". For example, they make _one_ HTTP GET request just like
any other user. There are databases of "acceptable" user-agent strings
available to anyone. If still unsure about the point I am making, see this
post from several days ago:
[https://www.sigbus.info/software-compatibility-and-our-own-u...](https://www.sigbus.info/software-compatibility-and-our-own-user-agent-problem.html)
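A guess at what this kind of User-Agent shibboleth looks like on the server; the patterns below are illustrative stand-ins, not AdWeek's actual list:

```python
import re

# Hypothetical allow/deny patterns of the kind the comment above describes.
GOOD_BOT_PATTERNS = [
    re.compile(r"Googlebot", re.I),
    re.compile(r"bingbot", re.I),
]
BAD_BOT_PATTERNS = [
    re.compile(r"curl|wget|python-requests", re.I),
]

def classify(user_agent):
    """Label a request on its User-Agent header alone; behavior is never consulted."""
    if any(p.search(user_agent) for p in GOOD_BOT_PATTERNS):
        return "good bot"
    if any(p.search(user_agent) for p in BAD_BOT_PATTERNS):
        return "bad bot"
    return "user"
```

Note that a human making a single request with an unlucky User-Agent string lands in "bad bot", which is exactly the complaint.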

~~~
alexdoma
What would be a better solution? An IP address check to allow only known
Google crawlers, perhaps?

~~~
andreareina
Classify IPs based on their recent behavior[2]. Most bots behave very
differently from the median user, along many different dimensions -- volume of
requests, time between requests, visit length, which links are followed, etc.

And if this means that bots are altered to become indistinguishable from
users, and therefore have a minimal impact on a site's loading? Well, mission
accomplished[1].

[1] [https://xkcd.com/810/](https://xkcd.com/810/)

ETA: [2] Recent behavior (as opposed to all historical behavior) is used so
that someone inheriting a "bad" IP isn't completely screwed over.

~~~
mark-r
That's a superb xkcd that I hadn't seen yet, thanks.

------
ikeboy
>good bots

You mean, bots that obey robots.txt?

[https://www.linkedin.com/robots.txt](https://www.linkedin.com/robots.txt)
very specifically prohibits scraping by any bot besides a small whitelist.

robots.txt compliance is not difficult to build. I'm fine with robots.txt
violations being considered hacking.
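Indeed, Python's standard library ships a robots.txt parser; the rules below are a toy stand-in for LinkedIn's actual file, which whitelists named crawlers and disallows everyone else:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines())

# Whitelisted crawler may fetch; everyone else is disallowed.
print(rp.can_fetch("Googlebot", "https://example.com/in/someone"))    # True
print(rp.can_fetch("my-scraper", "https://example.com/in/someone"))   # False
```

That's the whole compliance check: one call before each fetch.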

~~~
diggan
> robots.txt violations being considered hacking

Hm, I disagree. Either information is public, no matter for whom, or the
information is private, and you should have an ACL for accessing it. I don't
think it's fair to say that information is public if you're a human but
private if you're a machine, or vice versa.

It's not about whether it's difficult to build, but rather the principle of
whether you can allow only humans to read something.

~~~
ikeboy
Why is discriminating against robots unfair? There are valid reasons (for
instance, robots take a lot of resources to serve and don't lead to revenue).

~~~
diggan
Just because it's a robot doesn't mean it takes more resources to load a
page. A robot that loads 1000x more pages than a normal user, sure. But then
rate-limit everyone rather than blocking bots specifically.

And whether bots lead to revenue depends on why the bot is navigating your
page, no? If it's some indexer that links back to your website and it's a
popular index, then you'll maybe end up with more revenue thanks to that bot
than from a normal user.

~~~
ikeboy
Accepting robots + humans takes more resources than only accepting humans.

Your arguments about revenue are website-dependent, and it's the website owner
who is in the best position to decide whether robots are good for them or not
(and plenty of sites don't ban bots in their robots.txt). In this case, the
company that ran the bots is directly competing with LinkedIn's products that
sell aggregated data to employers and such, and LinkedIn clearly decided it's
not going to lead to more revenue for them.

------
paulus_magnus2
How about: if you want me not to scrape it, keep it off my internet??

Actually, I'm considering building an "API-fication" of websites with bindings
for major languages (Java, Python, JS). With luck, websites could participate
by providing & maintaining a parseable API sitemap.

This would open the door to my 2nd project: orchestration à la BPEL on top of
websites. Visual editor, macros, scripting. Call this PIPES 2.0.

~~~
_mhr_
Can you provide some use-cases for why this would be useful in a way that
wouldn't violate most sites' ToS?

~~~
paulus_magnus2
\- a lot of online stores, hotels need to constantly update prices based on
what competitors do.

\- cleaning (big) data. Automatically reconcile data to canonical format /
names using authoritative source (say wikipedia)

Can you understand even the simplest TOS? I'd argue most (all?) are too
restrictive to be enforceable. [https://tosdr.org/](https://tosdr.org/)

------
Analemma_
I mentioned this before in a previous thread on this topic, but I can't
support the EFF on this. This is, at bottom, an argument against control over
one's own data: LinkedIn might be doing sketchy things with your data, but
it's all stuff you voluntarily agreed to in exchange for their service. If any
shady data aggregator can vacuum it up and do whatever, I didn't consent to
that and I'm not getting any benefit from it. The EFF shouldn't be defending
that right.

~~~
bo1024
But the EFF isn't arguing that any shady aggregator should be able to vacuum
up anything. LinkedIn would still have the full right and ability to implement
limits, blocks, or so on to prevent this. LinkedIn could still make it against
their terms of service and pursue a civil suit. It just would stop LinkedIn
from being able to pursue felony hacking prosecutions against people for
accessing a public webpage with a script.

------
guywaffle
Make it fair then! Bots can’t scrape LinkedIn, and LinkedIn can’t sell any
consumer data to third parties.

~~~
JepZ
For real: I really hate corporations 'stealing' data from my phone. For
example, Google likes to introduce new sync options to Android, and every time
they do so it is activated by default. So as soon as the update arrives, their
software syncs my data to their servers without my consent. They probably have
some clause in the EULA, but as a user of their products I really hate that
behavior. A similar case is not being able to disable address book sync before
it syncs for the first time.

Those things should be crimes as the data they fetch is not publicly available
on some web page but exists only on my personal device and they take it
without my consent.

~~~
kbart
Install a firewall (for example, NoRoot Firewall) and whitelist only those
apps/services you want to access the Internet.

------
UncleEntity
How does a website put reasonable limits on access?

I'm not saying what Linkedin is trying to do is right but it seems to me there
needs to be a way to say "Dude, that's not cool." A regular B&M store can
refuse service to disruptive people and trespass people who don't comply, why
not servers?

\--edit--

Pretty much what rayiner is saying; they posted while I was typing.

~~~
jimktrains2
> How does a website put reasonable limits on access?

1) Blocking TCP connections

2) Returning a 4XX error, perhaps even "401 Authorization Required", "402
Payment Required", "403 Forbidden", or "429 Too Many Requests"
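Those refusal mechanisms can be sketched as a toy dispatch (the IP below is a documentation-range placeholder, and this is an illustration, not a real framework):

```python
BANNED_IPS = {"203.0.113.7"}   # hypothetical clients whose permission was revoked
RATE_LIMITED = set()           # clients currently over their request budget

def respond(ip, path, authorized=False):
    """Return (status, reason) using the refusal mechanisms listed above."""
    if ip in BANNED_IPS:
        return 403, "Forbidden"                # permission revoked outright
    if ip in RATE_LIMITED:
        return 429, "Too Many Requests"        # asked to slow down
    if path.startswith("/private") and not authorized:
        return 401, "Authorization Required"   # credentials needed
    return 200, "OK"
```

Each branch both tells the client it was refused and refuses it, in one step.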

> A regular B&M store can refuse service to disruptive people and trespass
> people who don't comply, why not servers?

A brick-and-mortar store has to _tell_ you you're being banned. The mechanisms
I listed above both tell you and lock the door whenever you attempt access.

Edit: In this case, it's more like someone looking in the store window from
the public sidewalk and being asked to stop. Can you really ask someone to
stop looking at you from a public place?

~~~
UncleEntity
> In this case, it's more like someone was looking in the store window from
> the public sidewalk and asked to stop.

I think it's more like calling the store and asking them what their prices are
20 times a minute.

~~~
mcguire
No, it's more like you holding the giraffe while I fill the bathtub with
brightly painted power tools. Because reasoning by analogy sucks.

No one is accusing HiQ of performing a denial of service attack.

~~~
UncleEntity
> ...you holding the giraffe while I fill the bathtub with brightly painted
> power tools.

I'm down for that.

------
Miner49er
weev went to jail for accessing publicly available information from AT&T.
There's not a great precedent here for the EFF, unfortunately.

~~~
icebraining
It was only a jury decision in a lower court; it doesn't mean much in terms of
precedent.

------
k3a
I think scraping for personal use (not honoring robots.txt) should always be
legal unless you are attempting a DoS. You are accessing public information,
the server is returning HTTP 200, and it doesn't matter if you do so using a
browser, phantomjs, or curl with the -A parameter.

A different situation would be scraping a website to do business. The worst is
directly reusing the data; for example, those StackOverflow clones with the
original data don't sound OK to me. I am not sure what to think about bots
doing various derived work like stats and analysis. I think that if they are
part of a business, making money, it shouldn't be legal unless those requests
are permitted by robots.txt.

------
euske
Question: how can this principle coexist with the idea that "surveillance is
bad"? Because surveillance is mostly collecting publicly available
information. Is it bad because it's done by a government? It's possible to set
up a bunch of privately owned cameras in a city and keep filming people. Is it
the association of information that makes it bad, and not mere collection? Is
it okay if it doesn't include any personally identifiable information (but who
knows what one can make out of it)? I don't know what I should think of this.

~~~
Semiapies
This thought process always bewilders me. Whenever it comes up that government
agencies monitor our emails and phone calls, someone, as if on cue, always
pipes up that that's _totally no different_ from people posting on their
Facebook timeline and other absolutely _mind-bogglingly_ bad equivalences.

You, however, go the extra mile here. How about _you_ explain exactly _how_
accessing published information on a public website is like building a network
of cameras to monitor a city?

------
misterhtmlcss
Can data that is supplied with the intention of being publicly accessible,
i.e. public domain, be restricted? If the public were asked, "When you
supplied your picture and your name and then created a public URL to become
fully searchable, was your intention that that information be restricted, or
was your intention that this was information you publicized about yourself to
make it possible for potential employers to find you?", the answer would be,
"Yes, it was 100% my intention to become searchable so that employers would be
able to seek me out". Conversation over.

LinkedIn creates an implied covenant with public consent (mostly) to then
publish and make discoverable their professional profiles.

While LinkedIn 100% should have the right to stop others from embedding
without permission, since it's possible to claim the data structure and
presentation are proprietary to them, this should never extend to the actual
data itself, since this was willingly gifted by the actual owners (Joe Public)
into the public domain.

I think an argument could be made that LinkedIn is being burdened with a
degree of data mining that affects their business and therefore should be able
to charge a minimal fee, e.g. an API firehose to acquire the data in bulk from
providers in a raw data stream.

That seems reasonable, depending on the charges associated with that offer;
this would be the correct compromise, since their data structure is all that
actually separates their service from, say, About.me or any other site of that
type, all of which don't disallow scraping as long as it doesn't present as a
DoS attack (of course).

Anyway my comments are as a marketer and not a programmer or lawyer, but
personally I'm very interested to see this case resolved in a manner that
doesn't suit LinkedIn in the slightest.

------
tptacek
Are they arguing that it's a _crime_ , or that it's a _tort_?

~~~
metallah
I believe the latter (though IANAL)

------
rayiner
There is a difference between public property and private property that is
made available to the public. Just because the cafe on the corner has its door
open and lets you stroll in off the street doesn't mean that the property
owner doesn't retain the right to exclude people. And if the property owner
revokes your permission, then going onto the property again can be a crime
(trespass).[1]

Servers are no different. The Internet isn't an abstraction--it's just pieces
of private property connected together (servers, routers, switches). When you
make an HTTP request, you're accessing a piece of private property. The owner
of that property has every right to decide not to let you do so.

~~~
cybwraith
That's not a great analogy. The store owner can't just get you
arrested/charged with a crime if they don't first tell you that you aren't
allowed. HTTP lacks such a human mechanism. The closest thing I can think of
in the standard is the response code. So your server replying 200 OK should
implicitly be considered permission to access that resource legally until it
stops replying with that code.

~~~
rayiner
But that's exactly what happened here:

> LinkedIn sent hiQ cease and desist letters warning that _any future access_
> of its website, even the public portions, were “without permission and
> without authorization” and thus violations of the CFAA.

The EFF's point about terms of service is a good one, but also irrelevant.
Terms of service don't provide adequate notice that someone's implied license
to access a website has been terminated. But here, hiQ had actual notice
through "human" channels.

~~~
cmiles74
The poster is arguing that if you make a request from LinkedIn's website and
it returns a "200" along with data, then you've accessed that data lawfully
and LinkedIn has agreed to serve it to you; I tend to agree. If they don't
want to provide data to hiQ, they should, well, stop providing data to hiQ.

There are many ways to do this short of claiming that hiQ doesn't have
permission or authorization, an argument that strikes me as wholly without
merit. If the data is publicly available on the internet, then how is
permission or authorization required?

~~~
rblatz
How is that any different than walking up to a store entrance with automatic
doors and a sign that says "Welcome" on it?

~~~
mrguyorama
Those doors get turned off at night, just like a server can ignore an HTTP
request

~~~
rblatz
They can turn the servers off at night too. Some places still choose to do
that. But that is unrelated to the point: if you are told that you are no
longer welcome at a business, you can't come in without it being considered
trespassing. The doors automatically opening for you (200 OK) doesn't matter.
If you wear a disguise (change IP), doesn't matter. You can't go in.

Also, I would agree that absent a specific order to stop accessing publicly
available server resources, there is an implicit permission to do so. So in
the case of weev, I think he did nothing wrong; AT&T were the ones in the
wrong.

------
FilterSweep
> LinkedIn argues that imposing criminal liability for automated access of
> publicly available LinkedIn data would protect the privacy interests of
> LinkedIn users who decide to publish their information publicly, but that’s
> just not true

Protect them from what, _your unlocked front door_? [0][1]

[0] "Hackers selling 117 million LinkedIn passwords"
[http://money.cnn.com/2016/05/19/technology/linkedin-hack/ind...](http://money.cnn.com/2016/05/19/technology/linkedin-hack/index.html)

[1]
[https://en.wikipedia.org/wiki/2012_LinkedIn_hack](https://en.wikipedia.org/wiki/2012_LinkedIn_hack)

I'd also note that these companies are barely (if ever) held liable for life-
compromising hacks on their platforms.

~~~
PatientTrader
> I'd also note that these companies are barely (if ever) held liable for
> life-compromising hacks on their platforms.

You do know it is impossible to stop all cyber attacks? It's always a matter
of when, not if. Zero-day attacks are developed every day, with not even the
best-funded cybersecurity systems able to thwart them. The geniuses are on the
offensive side; if they want in, they will get in.

~~~
gkya
The industry is held to no standards at all. You can keep plain-text passwords
in your databases, do no tests at all, and be incompetent in a million other
ways. I usually get downvotes when I say this, but by now there needs to exist
certain regulation of commercial software and software-based services. It
should be ensured that certain practices are followed in security and ethics
(do you take the basic, well-known precautions against the well-known attacks?
do you respect your users' privacy at least as much as the law requires you
to? do you follow the terms and conditions you declare?).

What we need is a CE mark for software, and it's sad that I can ensure my
cheese comes from a certain town and is produced from the milk of cows eating
a certain diet, but not that Twitter (or any other commercial website) hashes
and salts my password and actually uses basic precautions against CSRF and the
like. These companies should be obliged to get their stuff audited by third
parties, and there should be a way to tell whether they are really approved to
maintain a certain standard in producing their software.

I do understand and share the hacker culture, and appreciate how it's possible
to spin off a start-up website business on the internet, but business is
business. You don't become exempt from regulations when all you do is run a
tiny B&B with 2 rooms. Similarly, as soon as you're a company selling online
services, regulations and standards should kick in, because by now those
online services are no less important than the food business.

You say it's impossible to stop all cyber attacks. Then, as it is impossible
to stop all burglary attempts, should banks just deposit their money in some
apartments, or in some random rooms where all the security is a wooden door?
Fire all the security guards because it's impossible for them to survive all
the guns out there? Companies like LinkedIn are no different from banks
inasmuch as they hold not our money but our personas. They should actually be
more cautious, because while money can be replaced, nobody can have a new
self.

~~~
nathan_long
> It should be ensured that certain practices are followed in security

Let's not legislate specific practices.

Imagine if we had security legislation from 1995 to follow when programming
today. Imagine trying to explain to senators why last year's XSS protection
rules need updating. Imagine Oracle lobbying to get their database enshrined
as the "security-compliant" one.

The law should focus on outcomes: if a site gets hacked and people are harmed,
the site should be penalized.

~~~
gkya
"Security compliance" is about _how_ you use a given database, not _which_ one
you happen to use. You can securely (but inefficiently) store credentials in a
plain text file.

WRT some defences becoming outdated over time: well, it probably would not be
two decades behind, but a couple of years or so at most. Even then, ensuring
_that_ is better than nothing.

People need tools to judge whether they can safely use some product, and
that's why standards exist. Otherwise companies are going to continue to screw
us until they drop the ball.

~~~
nathan_long
> WRT some defences becoming outdated over time: well, it probably would not
> be two decades behind, but a couple of years or so at most. Even then,
> ensuring that is better than nothing.

Not necessarily. What if the law mandates use of, say, an encryption algorithm
that has been cracked? You can't move to a new one without breaking the law.

------
megamindbrian2
So doing QA is a crime now?

Edit: adding context.

I'm doing QA to validate information collected by my recruiting company, both
acting within LinkedIn's terms of service for a paid subscription and
violating their terms of service by improving my own company's process. Like
the article said: LinkedIn wants to participate in an open internet and also
abuse the CFAA.

------
ProdigalXiao
Yes, previously it wasn't a crime for either the providers or the viewers;
however, viewers republishing or exploiting public information without
attribution is plagiarism.

------
askvictor
Let us remember here that Microsoft owns LinkedIn. There's been a lot of love
for Microsoft here recently (I'm among the many who are liking the 'new' MS).
No doubt, this is quite a separate group from those doing
OSS/Linux/Python/Jupyter/etc, but it's worth pausing to think about what a
move like this says about their overarching corporate strategies.

~~~
FireBeyond
Shit like this has been LinkedIn's modus operandi since day one, not to
mention their own questionable ethics. It has little or nothing to do with
them now being a subsidiary of Microsoft.

~~~
askvictor
Microsoft has the choice to change that behaviour now that they own LinkedIn.
It seems that they choose not to.

------
yeukhon
How do you draw a line between accessing disturbing content such as child porn
vs. accessing a leaked document? It seems the former requires some additional
moral judgment. What if the click was accidental and it was an attack? What if
the person only watched but doesn't possess the content?

~~~
nothrabannosir
In a court of law, the same way we do for all grey areas between legal and
illegal in our society. Law is not binary; it's fuzzy and requires manual
intervention. That's ok.

However: That’s not what this article is about. That we don’t have a perfect
solution for whatever weird corner cases (accidentally clicking on child
porn?), should not change this very honest, serious and real issue the eff is
addressing here. It is a distraction. We can hypothesise about edge cases
until the cows come home, but to what end?

I get how a life of working in binary makes us immediately jump to the corner
cases. It’s a curse on any legal discussion on HN. But it’s not relevant, and,
imo, it dilutes the energy.

Edit : that came out harsh so I’d like to clarify: I get, 100%, where this
“looking for the flaws” mentality comes from. It’s what makes a good
programmer. A function that only follows the spec for 75% of its possible
inputs is wrong. A law, not necessarily. We need to be careful not to keep our
engineering hats on when switching to discussing law.

------
mirimir
So why didn't hiQ just operate from some jurisdiction where scraping is legal?
And use VPN services to prevent blocking. I mean, the Internet is global. So
why should US laws matter everywhere?

~~~
us0r
Spamhaus did this, and the spammer who sued them got a default judgment. When
he went to seize their domain, they suddenly cared about US laws.

~~~
mirimir
Well, some domain registries are not under US control. Consider that TPB and
Sci-Hub still have domain names.

And, if push comes to shove, one doesn't really need a domain.

------
mlindner
More bad rulings from the Ninth Circuit.

------
zerostar07
The stupidity of the concept of IP never fails to make itself obvious. Free
information that users enter into linkedin should remain free.

------
spacemanmatt
A Microsoft company attempting to exploit the law for profit? Imagine my
shock.

~~~
dang
Would you please not post unsubstantive comments here?

