Accessing Publicly Available Information on the Internet Is Not a Crime (eff.org)
726 points by DiabloD3 3 months ago | 284 comments

What makes this extra ridiculous is the fact LinkedIn built its business on scraping not publicly available information but private address books of unsuspecting users.

And spamming those contacts with requests to join that looked as if they originated from your business relations when that definitely wasn't the case.

It's the main reason I don't have a LinkedIn and never will. They are a scummy company.

This is incredibly important. If you dig deep into why LinkedIn is behaving the way it is, it is definitely not an attempt to protect users' privacy. It's all about maintaining and expanding the ways it can monetize the data that users provide.

This is the type of thing that we risk losing as the internet matures and internet companies with vested interests gain more power. Setting this type of precedent will absolutely curtail innovation and freedom in the future. Think about it, would Google have been created in an environment that is overwhelmingly siloed and filled with red tape?

I see parallels to the net neutrality discussion in this.

Access that does not require authentication should never be a crime. If LinkedIn wants the courts to intervene, they must require authentication for their data. If they also want Google to scrape their site, they must require Googlebot to authenticate itself.

> Access that does not require authentication should never be a crime.

Careful, this could legitimize things like accidental denial of service. Depending on circumstances, even basic scraping could cause problems.

(I need to be vague to avoid violating an NDA.) A major internet site had a URL that went something like somedomain/group?id=xxxxx. It turns out that a simple scraper, that called id=1, id=2, id=3, ect, ect, caused a major problem! This was because rendering these pages required significant resources; so most active pages were kept in RAM. Of course, the scraper tried to read everything.

Of course, no one thought the scraper was malicious in any way!

>A major internet site had a URL that went something like somedomain/group?id=xxxxx. It turns out that a simple scraper, that called id=1, id=2, id=3, ect, ect, caused a major problem!

This is a failure on the part of the developers at that "major internet site". Using a guid instead of consecutive IDs, a rate limiter, hell even just a cache...or all of the above. There are lots of solutions here.

You have to take robot scraping and indexing into consideration, and assume people will ignore robots.txt. (Certain bots, i.e. msnbot/bingbot are quite aggressive!)
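For the rate limiter mentioned above, here is a minimal token-bucket sketch. The class name, the per-client keying and the injectable clock are my own assumptions for illustration, not anything that site actually ran.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second per client, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.buckets = {}   # client id -> (tokens left, time of last update)

    def allow(self, client):
        now = self.clock()
        tokens, last = self.buckets.get(client, (self.capacity, now))
        # Refill in proportion to the time elapsed, capped at the bucket size.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client] = (tokens - 1, now)
            return True
        self.buckets[client] = (tokens, now)
        return False
```

A sequential id=1, id=2, id=3 scraper would burn through its burst allowance and then get refused (e.g. with a 429) instead of evicting the hot pages from RAM.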

>> This is a failure on the part of the developers at that "major internet site". Using a guid instead of consecutive IDs, a rate limiter, hell even just a cache...or all of the above. There are lots of solutions here.

You are right, but few organizations are sophisticated or wealthy enough to employ all of that. I mean, a couple years ago there was a thing that Google's Docs could be enumerated.

And that's Google, they can obviously afford to get competent people working on that, yet they made a mistake (and who doesn't?).

Fair enough. It still shouldn't become a criminal issue.

>You have to take robot scraping and indexing into consideration, and assume people will ignore robots.txt. (Certain bots, i.e. msnbot/bingbot are quite aggressive!)

Who owns LinkedIn again?

No, that is a failure of the developer of the scraper. I am definitely pro scraping, but you have to be a good neighbor.

How the hell is the scraper dev supposed to anticipate how poorly-written these particular views are with no backend knowledge? If not an automated scraper, a thundering herd from content gone viral would trigger the same result.

Scraping is not an intended purpose for most websites. Unless the website specifically states that this is an intended function, it is not reasonable to assume so. In fact it may be in violation of the terms and conditions of the given website.

If the law assumed that only intended functions are permissible, innovation would be a crime. By definition, innovation is finding new and unforeseen uses for resources.

You both make good points. If you make the law too strict you punish reasonable uses of the website, like scraping a few publicly available pages to help users. If you make it too lenient you permit DOS attacks.

It’s not easy to craft a law that will punish bad behaviour without blocking innovation.

I don't intend people named Steve to access my open site so I can sue all Steves for their felonious behavior?

I've done some scraping work -- one of my rules of thumb is to always assume the worst of their site and try to be as gentle as possible.
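A minimal sketch of what "gentle" can mean in practice: never hit the site more often than once per fixed interval. The function name, the 2-second default and the pluggable `fetch` callable are assumptions for illustration.

```python
import time


def crawl_gently(urls, fetch, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Fetch each URL in turn, waiting at least `min_interval` seconds between requests."""
    results = []
    last = None
    for url in urls:
        if last is not None:
            # Sleep only for whatever part of the interval has not already passed.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        results.append(fetch(url))
    return results
```

Pass in whatever HTTP client you already use as `fetch`; the pacing logic stays the same.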

Oh come on, you're trying to scrape the data out of a black box. You have no idea what their infrastructure is like, and for your purposes, you don't really care.

Of course, some sense is more than welcome, but if my scraper making one request every 2 sec knocks down your server, it's your fault, not mine.

Weev went to jail for exploiting a similar flaw in AT&T's website[0]. They had a page that, when provided an ICC-ID, would return the matching customer's email address. He supplied a range of valid ICC-IDs and scraped the returned addresses. He was eventually convicted[1].

[0]: https://arstechnica.com/gadgets/2010/06/ipad-3g-user-e-mail-...

[1]: https://www.wired.com/2013/03/att-hacker-gets-3-years/

And while Weev totally sucks as a person, IMO, it was wrong for him to be convicted in this case. He was punished for AT&T's negligence.

Although, not purposefully exfiltrating loads of data after you've found a vulnerability is like, ethical reporting 101.

Otherwise you get situations like Uber paying out an enormous "bug bounty" totally-not-in-exchange for having their stolen data destroyed. If that person had simply pointed out that they had credentials published in a public repository, how much would they have been paid? Probably somewhere within an order of magnitude of the program's stated maximum payout.

Punk test. Advocacy groups are way less likely to want to turn your case into a test case if you are a racist asshole.

Are you suggesting that someone should do time for running a script that happens to stumble on one of your bugs?

If the activity caused actual damages and was outside the scope of normal usage? Yes.

You're still culpable if your actions break your neighbor's window, even if it was accidentally while you were opening it.

Unless I'm missing something, you're proposing criminal penalties for tort liabilities.

Yes, if my crappy software costs you money by knocking your site offline by accident, I should make you whole.

I think it has to be something substantially more impactful, clearly intentionally malicious, or in some other way much worse than aggressive timeouts before we start thinking criminal penalties.

either I read it wrong the first time as well, or he edited it, but reading it now it clearly says "while opening it" which is a criminal act, in context.

No, when I responded it read "be held liable." I must've commented while it was being edited.

Say a business publishes a phone number and they typically get X calls per day.

After doing something that pisses a lot of people off, they start getting 1000X calls per day on the same number, almost all complaints.

This causes actual damages (no "normal" customers can get through) and is also clearly outside the scope of "normal" usage.

Do you think the same rules apply?

Yeah, that's not how the comment read when I responded. It said "be held liable," not "do time."

I must've responded while he was editing it, and I didn't catch the change.

I think you could be sued for damages, but that's not the same as a criminal case.

Honestly, we all know the wild west mentality of the internet (yes, it is post national as in 'above the law') and therefore, everyone should assume attacks like that and build defenses against them. Building a service which could be brought to a 'major problem' with simple requests leading to high server loads is just negligent. What would that site do if someone actually wanted to attack it?!?

I am not saying that the trouble with law enforcement on the internet is either a good or a bad thing. Actually, it depends, and in the 'real' world I am pretty happy that law enforcement works quite well where I live. I think the thin line is somewhere around where I start to fear my own government more than the bad guys (while not having any evil intentions or plans at all).

> the wild west mentality of the internet (yes, it is post national as in 'above the law')

No. It's only been ahead of most laws for a while, as all frontiers are while they remain frontiers. But all frontiers eventually close, and laws catch up with them as they do so. That is what we're seeing now, and have been for a decade or more.

Honestly that just means the website sucked and it went down because it sucked. Making it not suck is the solution, persecuting the people who stumbled into your suckiness is not.

> Careful, this could legitimize things like accidental denial of service.

Are you saying that the writers of a bot that causes accidental issues with a site due to poor development standards on that site should spend years in prison with a federal felony conviction?

A while ago I was introduced to a client whose site "was the target of hackers that were deleting all of the content from the CMS". Here's what I discovered:

- the password verification form to access the admin area did the verification check in JavaScript, not on the backend. So if you have JS disabled and click "Submit" on the Admin login form, you're into the admin area.

- the "delete" button in the admin area was implemented as an <a href=...> that simply did a GET request (violating the rule that GET requests should be safe, i.e. free of side effects).
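A minimal sketch of how both problems are usually avoided, assuming a hypothetical plain-function backend (the names, the hard-coded password and the status codes are illustrative only, not the client's actual CMS):

```python
import hmac

ADMIN_PASSWORD = "example-only"  # hypothetical; a real app would store a salted hash instead


def login(submitted_password):
    # The check runs on the server, so disabling JavaScript cannot skip it.
    return hmac.compare_digest(submitted_password.encode(), ADMIN_PASSWORD.encode())


def handle_delete(method, authenticated):
    # Crawlers follow links with GET, so destructive actions must not respond to it.
    if method != "POST":
        return 405  # Method Not Allowed
    if not authenticated:
        return 403  # Forbidden
    return 200      # the item would be deleted here
```

With this shape, a crawler following every link gets a 405 on the delete URLs, and it never reaches the admin area in the first place.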

Looking at the logs, it was pretty clear who the "hacker" was: Google. They'd come, follow all of the links, make their way into the admin site, and follow all of the delete content links.

I consider the work that the original developers did to be grossly negligent, and I certainly don't fault Google for anything.

Accidental denials of service are indeed a common occurrence. By the way, it's "etc" from Latin et cetera - I assume you didn't want to refer to electro-convulsive therapy :-)

As any fule kno, this is how Molesworth writes, ect ect ect.

In general, the law is capable of dealing with this kind of issue - it can look at the intent of the owner of the service.

cf. for example the law on trade secrets. If you take "reasonable steps" to safeguard the secret, and impose NDAs on the people you do grant access, then courts will punish competitors who steal them, even if your security happens to suck.

> Careful, this could legitimize things like accidental denial of service. Depending on circumstances, even basic scraping could cause problems.

I have to "deal" with that problem every day. Misconfigured scrapers are dealt with by apache as are idiots who try to DoS the site (an intelligent attack still needs manual intervention, though).

> Access that does not require authentication should never be a crime.

LinkedIn is not trying to prevent access. They want to prevent information from being scraped, and then used to their detriment.

Here is an example of the "good bot"/"bad bot" nonsense in action.

This is an article about the LinkedIn v hiQ case at AdWeek.

  curl --user-agent INSERT_ANYTHING_HERE http://www.adweek.com/digital/rami-essaid-distil-networks-guest-post-linkedin-hiq-labs/

It seems AdWeek can distinguish a "good bot" from a "bad bot" irrespective of the behavior of the user^W bot, i.e., whether it is one single HTTP request or 10,000 consecutive requests is irrelevant.

How do they do it?

Pattern match against the User-Agent string.

Effective shibboleth.^W engineering.

Clarification: If a user, not a "bot", makes the "wrong" choice of user-agent string (e.g. in the browser settings), then they will be labeled a "bad bot", even if their behavior is no different than other users who are not labeled "bad bots". For example, they make one HTTP GET request just like any other user. There are databases of "acceptable" user-agent strings available to anyone. If still unsure about the point I am making, see this post from several days ago: https://www.sigbus.info/software-compatibility-and-our-own-u...
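To make the point concrete, here is roughly what such a filter could look like. This is a guess at the general technique, not AdWeek's or any vendor's actual code, and the allowlist pattern is made up.

```python
import re

# Hypothetical allowlist; real deployments use much longer lists.
ALLOWED_UA = re.compile(r"Mozilla|Googlebot|bingbot", re.IGNORECASE)


def classify(user_agent):
    """Label a request using nothing but the User-Agent header it chose to send."""
    return "good" if ALLOWED_UA.search(user_agent or "") else "bad bot"
```

Note that the function never looks at request volume or timing: a scraper that sends "Googlebot/2.1" passes, while a person whose browser sends an unusual string does not.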

What would be a better solution, IP address check to allow only known google crawlers perhaps?

Classify IPs based on their recent behavior[2]. Most bots behave very differently from the median user, along many different dimensions -- volume of requests, time between requests, visit length, which links are followed, etc.

And if this means that bots are altered to become indistinguishable from users, and therefore have a minimal impact on a site's loading? Well, mission accomplished[1].

[1] https://xkcd.com/810/

ETA: [2] Recent behavior (as opposed to all historical behavior) is used so that someone inheriting a "bad" IP isn't completely screwed over.
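A minimal sketch of the recent-behavior idea, using request volume in a sliding window as the only signal. The class name, window length and threshold are assumptions; a real classifier would also look at the other dimensions listed above.

```python
from collections import defaultdict, deque


class RecentBehavior:
    """Track request timestamps per IP and flag IPs that exceed a recent-volume threshold."""

    def __init__(self, window_seconds=60.0, max_requests=30):
        self.window = window_seconds
        self.max_requests = max_requests
        self.hits = defaultdict(deque)  # ip -> timestamps inside the window

    def _expire(self, ip, now):
        q = self.hits[ip]
        # Old requests fall out of the window, so an inherited "bad" IP recovers.
        while q and q[0] <= now - self.window:
            q.popleft()
        return q

    def record(self, ip, timestamp):
        self._expire(ip, timestamp).append(timestamp)

    def looks_like_bot(self, ip, now):
        return len(self._expire(ip, now)) > self.max_requests
```

Because only the window is kept, nothing about an IP's history older than `window_seconds` affects the verdict.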

That's a superb xkcd that I hadn't seen yet, thanks.

The real solution is disallowing behaviors, instead of shibboleths.

It's surprising that malicious bots aren't exploiting those things already.

That practically invites them to present a different page to google as to a normal user, the former pure SEO, the latter perhaps pure advertising.

And Google will happily deindex the site as soon as they find out

Raising an interesting question: can a website owner (use the law to) ban google from accessing their website by any mechanism other than their crawler in order that google doesn’t find out?

Sure, that has obviously limited utility, just like the flaw in the "right to be forgotten" (diffing the US internet against the EU internet shows exactly what people wanted forgotten), but shenanigans interest me.

A related story is that windows 9 isn't a thing because software used to check for windows 95 and 98 by matching the name to "windows 9".
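Taking the story at face value, the kind of check it describes would look something like this (the function name is made up for illustration):

```python
def is_windows_9x(os_name):
    # The shortcut the story describes: a prefix match instead of a real version check.
    return os_name.startswith("Windows 9")
```

It matches "Windows 95" and "Windows 98" as intended, but it would also match a product literally named "Windows 9", which is the same shibboleth problem as matching on User-Agent strings.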

>good bots

You mean, bots that obey robots.txt?

https://www.linkedin.com/robots.txt very specifically prohibits scraping by any bot besides a small whitelist.

robots.txt compliance is not difficult to build. I'm fine with robots.txt violations being considered hacking.

> robots.txt violations being considered hacking

Hm, I disagree. Either information is public, no matter for whom. Or the information is private, and you should have an ACL for accessing the information. I don't think it's fair to say that information is public if you're a human but private if you're a machine, or vice versa.

It's not about whether it's difficult to build but rather the principle of whether you can allow only humans to read something.

Why is discriminating against robots unfair? There are valid reasons (for instance, robots take a lot of resources to serve and don't lead to revenue).

Just because it's a robot doesn't mean that it takes more resources to load a page. A robot that loads 1000x more pages than a normal user, sure. But then rate-limit everyone rather than blocking specifically bots.

And that bots don't lead to revenue depends on why the bot is navigating on your page no? If it's some indexer that links back to your website and it's a popular index, then you'll maybe end up with more revenue thanks to that bot than a normal user.

Accepting robots + humans takes more resources than only accepting humans.

Your arguments about revenue are website-dependent and it's the website owner who is in the best position to decide whether robots are good for them or not (and plenty of sites don't ban bots in their robots.txt). In this case, the company that ran the bots is directly competing with LinkedIn's products that sell aggregated data to employers and such, and LinkedIn clearly decided it's not going to lead to more revenue for them.

my browser is a robot that renders your page.

What is exactly the difference between a robot and a person using a browser?

Does an ad-blocking browser count as a bot or as a human? And what is something that concatenates all of your infinite scrolling to represent a paginated view? What is something that changes the structure of your page? What is something that concatenates different pages before displaying?

The real life equivalent of this is "if I leave my door unlocked, should someone be allowed to walk in anyways?"

I would definitely want some intent provisions in, but saying something is accessible therefore free game seems too wide.

> The real life equivalent of this is "if I leave my door unlocked, should someone be allowed to walk in anyways?"

The problem with analogies is that many equally valid analogies can be made, each making a different point. I would argue that the real life equivalent is "Have this free book, but you may not read Chapter 4."

Well I suppose both are possible, and it's on a case by case basis.

If I put on my website terms of service "please don't try to go everywhere" , and then you do... seems like you did _something_.

I don't really get what sort of stuff is enforceable, though.

Or putting up a poster that only some people are allowed to look at (or that the google maps car isn't allowed to photograph).

> ACL for accessing the information

the ACL is the robots.txt. A door with or without a lock doesn't determine whether the place is public or not.

if my bot is actually my cat actuating a switch for it to load a page, does it have to follow robots.txt?

robots.txt is more like a sign that asks certain people not to look at a bunch of other publicly visible signs.

One can't post a sign in public that tells people not to look at other publicly visible signs and expect the government to arrest or fine them for ignoring it.

robot != UA

What if I use curl to pipe web content to my mail so that I can read it in a quirky way? What if I write a Chrome extension to crawl a site? Where does w3m stand?

This is not a question of the tool (UA) but of the intent (mass crawling, indexing, mass-replicating stuff). robots.txt is made as hints for crawlers and the like, not as an ACL that decides whether something is public or not.

robots.txt cannot change whether something is public, because it doesn't apply to humans.

For the most part I agree, but I feel there are grey areas. Things like web browsers (which are not robots) can access the content as though they are from a human. But what about extensions or apps that do things in the background, such as caching the contents of several pages for offline viewing? Is that now considered a bot?

The robotstxt.org site states that a robot "should" obey the rules. "should" is not a legal term that implies compliance. "must" would have been more appropriate to indicate enforcement.

That file includes at least two non-standard syntax extensions[0]. Robots is just a de facto standard and respect of some directives varies[1]. So much for it being 'not difficult' while the task is not even clear because there isn't even a clear standard.

Archive.org also dislikes how robots.txt is being used mainly for search engines and goes against their mission in particular[2]. Are they now hackers for not throwing away information just because someone was overzealous with robots.txt or retired a certain website and uses robots.txt as SEO to let another one take its place in Google search results?

If some big corp wants to cry and bring legal matters into software they should first be accountable themselves for not securing themselves and the data of their clients (see the LinkedIn hack people mentioned elsewhere here and in general the high profile hacks like Equifax, Sony, etc.). Or should software shape up to be like many other areas today are - multi-million corporations are free to play fast and loose and endanger people while small guys get fried over meaningless bullshit and vaguely defined "crimes".

[0] - https://en.wikipedia.org/wiki/Robots_exclusion_standard#Nons...

[1] - https://intoli.com/blog/analyzing-one-million-robots-txt-fil...

[2] - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

It contains

    User-agent: *
    Disallow: /

I am pretty sure none of the standard libraries/ tools that respect robots.txt would continue after being fed that file.
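For what it's worth, Python's standard `urllib.robotparser` does behave that way. A minimal sketch; `ExampleBot`, the helper name and the example.com URL are placeholders, and the second rule set is a made-up illustration of a per-bot whitelist, not LinkedIn's actual file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, user_agent, url):
    """Return whether `user_agent` may fetch `url` under the given robots.txt lines."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return rules.can_fetch(user_agent, url)


BLANKET = ["User-agent: *", "Disallow: /"]

# Hypothetical whitelist: one named bot is let in, everyone else is shut out.
WHITELIST = ["User-agent: Googlebot", "Allow: /", "", "User-agent: *", "Disallow: /"]
```

With `BLANKET`, `allowed(BLANKET, "ExampleBot", "https://example.com/page")` is False for every user agent. With `WHITELIST`, the same URL is allowed for "Googlebot" and refused for "ExampleBot".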

>throwing away information

This is entirely irrelevant. If they receive data from someone they have no obligation to discard it because of the current status of robots.txt. The question would be if they should continue to actively scrape that website.

It seems like they've done that for gov sites, but nobody particularly cares about enforcing gov robots.txt. It would've been interesting if the government sued them, although if they cared they probably would've just told them to stop.

So we have an unclear "standard" that is only a de facto standard (and still varies in more advanced directives between a few big bots) that you're "pretty sure" about but that's seemingly not written down in its entirety anywhere and it'd also be enforced selectively depending on whether or not "someone particularly cares". Truly perfect and foolproof law that would be.

And all this to protect some corp's business model of not letting others collect automatically the public information they provide, while they are free to use outdated or buggy software, store passwords in plaintext, etc. and get away with leaking data of millions of customers that should never be public.

And it'd fail to stop anyone except benign, private and low fund actors because instantly Indian (or other low wage country) services for "scraping by human thus not a bot ignoring robots.txt" would pop up, just like there are captcha solving services that employ humans already, and malicious bots wouldn't care anyway just like they make 0 effort to respect it now and run from servers in some country that isn't friendly towards USA so there is 0 potential for catching the perpetrators.

I disagree that laws that can only be enforced against US companies / people are worthless.

Requiring a human would increase costs and it doesn't seem like a good argument against anything.

But they are in this case. They would not stop any scraped data from popping up for sale in shady places. That can be done by LinkedIn or whoever themselves using some smart way to detect bots and stop them from scraping their website.

The only people a robots.txt law would affect are private users who set up a Python script to scrape a single page for themselves to check for something, things like archive.org, researchers, automated website testers, etc. while anyone nefarious can just rent a shady VPN or use a server in Russia, China, Middle East, etc.

Requiring a human barely increases the cost if that data is so valuable in the first place and would be last resort anyway, far after just running the bots from a shady country, for captcha it's done because it's technically easier/cheaper (although supposedly automated solvers exist too).

But laws that punish outright gross negligence would help protect everyone who uses these American websites (and most of the world does) from data leaks of data that is arguably way more sensitive (emails, unhashed passwords, SS and CC numbers, real names even like in Ashley Madison case, etc.).

LinkedIn used sha1 with no salt as recently as 2012 (when they were hacked) for passwords and over 100 million such username + password combinations got stolen. Not only is sha1 not good enough for passwords but for many common and simple words (yes, yes, they are bad passwords, but people do use them) just googling can "crack" them due to lack of salt. The law should either go both ways or neither.
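To illustrate the difference salt makes, here is a minimal sketch using only the standard library. The function names and the iteration count are my own choices for illustration; this is not a description of what LinkedIn runs now.

```python
import hashlib
import hmac
import os


def unsalted_sha1(password):
    # Same password always gives the same digest, so precomputed lists "crack" it.
    return hashlib.sha1(password.encode()).hexdigest()


def hash_password(password, salt=None, iterations=200_000):
    # A random per-user salt plus a deliberately slow key-derivation function.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest


def verify(password, salt, digest, iterations=200_000):
    candidate = hash_password(password, salt, iterations)[1]
    return hmac.compare_digest(candidate, digest)
```

Two users who both pick "password" get identical `unsalted_sha1` digests, but different salted digests, which is why the unsalted dump was so easy to reverse.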

To suggest such heavy handed laws like considering robots.txt ignorance hacking while multi million corporations with millions of users get away with stuff like that (and I mean true negligence of most basic practices, not some obscure bug in the underlying software or something else that isn't absolutely obvious) over and over and over again that every random free my-first-login-page and my-first-SQL-injection-prevention tutorials advise against is absolutely ridiculous and anti-consumer.

>robots.txt compliance is not difficult to build. I'm fine with robots.txt violations being considered hacking.

I'm not. You can set up a server to serve different versions of robots.txt to different folks. A malicious actor could deliberately feed inputs to a specific crawler that convince it to violate the terms of the robots.txt it serves to everyone else, and then press for criminal charges against the operator of the scraper.

In a sufficiently adversarial relationship, this lets website owners turn any well-behaved site scraper into criminal activity. That's not a power we want to grant.

>I'm fine with robots.txt violations being considered hacking.

Okay. Start with something simple then - how would you define a "bot" and thus subject to your robots.txt rule?

Is my web-browser a bot? What about a proxy? What about a deaf person's screen reader?

If my web-browser pre-fetches links near my mouse pointer, is that a bot? What if it downloads the whole of an article split over, say, ten pages?

I think of robots.txt similar to posting a "No Trespassing" sign. For a private residence, it's almost not even required, yet for something like a shopping mall during opening hours, the default assumption is that anyone is allowed to be there without a specific invitation, until they are expressly asked to leave and not come back.

Trying to nail down the exact line is a tough issue.

I don't know. The pathological case could include a rapidly changing robots.txt. Think about archive.org's policy. If they suddenly find new restrictions on a domain, they hide it in their waybackmachine. Sometimes an old site will go down and be replaced by totally new owners. This breaks some domains of the waybackmachine retroactively.

I think judges are able to ask some questions and tell the difference between an honest mistake and a flagrant disregard for robots.txt, if that were to be the legal standard.

Honoring the robots.txt file is voluntary and ignoring it should in no way be considered hacking. I would go so far as to say that any activity that someone could engage in, simply by loading a URL, should in no way be considered hacking.

Not only does it make it way too easy to prosecute software developers, it really devalues the term "hacking".

Sometimes you can do SQL injections just by loading a URL
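A minimal sketch of how that happens, and of the usual fix, using an in-memory SQLite table. The table, the function names and the somedomain.example URL are invented for illustration.

```python
import sqlite3
from urllib.parse import urlparse, parse_qs


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE groups (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany("INSERT INTO groups VALUES (?, ?)", [(1, "public"), (2, "hidden")])
    return db


def lookup_unsafe(db, url):
    group_id = parse_qs(urlparse(url).query)["id"][0]
    # String concatenation lets the URL rewrite the query itself.
    return db.execute("SELECT name FROM groups WHERE id = " + group_id).fetchall()


def lookup_safe(db, url):
    group_id = parse_qs(urlparse(url).query)["id"][0]
    # A bound parameter is treated as a value, never as SQL.
    return db.execute("SELECT name FROM groups WHERE id = ?", (group_id,)).fetchall()
```

Loading `http://somedomain.example/group?id=1%20OR%201%3D1` returns every row through `lookup_unsafe` and nothing through `lookup_safe`.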

Perhaps that shouldn't be construed as hacking either. If I send a link to someone via email, they shouldn't need to worry about breaking the law if they click it.

I do think that a company that has been the victim of a SQL injection attack will have a better chance in court than, say, LinkedIn in this specific case. At least this theoretical company has made some small effort to protect their data, however inept.

OTOH, if HiQ employed a team of people to surf to LinkedIn and physically type the information into their databases, that would be ok?

> I'm fine with robots.txt violations being considered hacking

Really?? That would mean private corporations, or private citizens, can write laws.

You can put up a "no trespassing sign" on your property (although there's some debate as to how much that actually counts for - a quick search pulls up https://www.washingtonpost.com/news/volokh-conspiracy/wp/201...)

Robots.txt is not a 'no-trespassing' sign. Robots.txt is a 'whites-only' sign.

The information is available to the public, just not for certain classes. This is and should be legally unenforceable.

If something is truly meant to be private it should not be referenced from a public-facing page or it should have access control enabled.

Robots.txt is more like a "No trucks allowed on street" sign. It allows uses that are typically associated with individuals (viewing a web page, being in a car), while disallowing things that are normally associated with business (web scraping, driving a truck).

Except race /skin color is a legally protected class, and robots aren't (and why should they be? They can't enter into contracts, conduct business, etc. So it's perfectly legitimate to exclude them from a site where they cannot use it in the intended manner).

"If you truly didn't want trespassers you should've put up a gate."

Bots being legally protected as a class or not, using robots.txt as the ultimate test of what distinguishes normal traffic from CFAA violations is a very flawed mechanism. It turns your website into a minefield.

As a property owner, a no-trespassing sign won't protect you from the lawsuits that result when a toddler drowns in your pool. You're expected to do more (like putting up that gate).

Equifax's systems are peppered with "no-trespassing" motds at login. They also have a robots.txt file. We expected them to do more.

Same for leaving keys in your ignition, guns unlocked on your nightstand, etc. "Don't touch" signs won't absolve you of responsibility when either gets stolen and used in a spree killing.

So yes, as the owner of any sort of asset, in most contexts it is your responsibility to implement access controls to keep unauthorized traffic out.

> won't protect you from the lawsuits that result when a toddler drowns in your pool

Good analogy. I wonder if operating a poorly secured website that leaks private information could be seen as an 'Attractive Nuisance' [0] and the owners could be prosecuted for that, rather than the hackers!

0. https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine

Maybe it's closer to a No Trespassing sign, written in a language that only certain classes will understand.

In this case it is a "white bots only" sign, as it allows some bots but wants to block the rest.

Even for those who think that robots.txt should be enforceable, allowing some bots but not others makes it difficult for a new player to have the same equitable access to information as the big players.

They can anyway. That's what contracts are.

I was about to agree that robots.txt prohibitions should be considered a form of authorization.

But I think what is being argued is that "if it's publicly available on a URL, it's available for any client to download and use." I think the latter argument holds more water, as since they are making it publicly available it is implicit authorization.

If robots.txt allows Google and Bing but nobody else, it should be ignored. If it blocks everyone, then I agree. We need to make sure that the next Google has a chance to succeed.

Interesting. They say that crawling is prohibited there, actually, and have a blanket 'Disallow' at the end.

    # Notice: The use of robots or other automated means to access LinkedIn without
    # the express permission of LinkedIn is strictly prohibited.
    User-agent: *
    Disallow: /

All the listed bots are only able to access a small subset of pages, the same for each bot apart from one. The 'deepcrawl' bot is privileged, and gets to see the '/profinder' pages, for some reason?

    # Profinder only for deepcrawl
    Allow: /profinder*

Anyone know who operates this bot?

robots.txt has no legal validity.

I mean, it seems to have been cited in the lawsuit. See e.g. https://static1.squarespace.com/static/5803b57737c581885cbd0... and search for it.

I doubt a bot could legally agree to a license put into robots.txt, even if it were able to make sense of it, and a human is never expected to read it.

The purpose of robots.txt is to guide bots away from circular links and such that would result in bogging down the site and causing undue amounts of nonsense traffic.

The purpose of robots.txt is not access control.

EDIT: typo fix

The human is expected to use it to not scrape sites that prohibit it.

Although it appears the court found for HiQ (against LinkedIn): https://regmedia.co.uk/2017/08/14/hiqlinkedintro.pdf

Temporary injunction, not final decision.

Do Not Track compliance is even easier to build. Does the same logic apply?

Yes, this is an extremely good point. If failing to follow robots.txt is a criminal violation of the CFAA, then using any of my computer's resources (cookies, JavaScript, etc.) to track me while I am sending a DNT header is also a criminal violation of the CFAA.
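To show how little work honoring the header takes, here is a rough sketch of a server-side check. The function name `tracking_permitted` and the plain-dict representation of request headers are my own assumptions, not any particular framework's API; the only real-world fact relied on is that a browser with Do Not Track enabled sends `DNT: 1`.

```python
def tracking_permitted(headers):
    """Return False when the request carries a 'DNT: 1' header.

    `headers` is assumed to be a plain dict of header name to value.
    Header names are compared case-insensitively, as HTTP requires.
    """
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") != "1"


# A request that opts out of tracking, and one that says nothing.
print(tracking_permitted({"User-Agent": "ExampleBrowser", "DNT": "1"}))
print(tracking_permitted({"User-Agent": "ExampleBrowser"}))
```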

I would almost be willing to concede making not following robots.txt a violation of CFAA if the trade-off was Mark Zuckerberg being brought up on several billion felony charges every year.

How about: if you want me not to scrape it, keep it off my internet??

Actually I'm considering building "API-fication" of websites with bindings for major languages (Java, Python, JS). With luck websites could participate by providing & maintaining a parseable API-sitemap.

This would open door to my 2nd project: orchestration a-la BPEL on top of websites. visual editor, macros, scripting. Call this PIPES 2.0

Can you provide some use-cases for why this would be useful in a way that wouldn't violate most sites' ToS?

- a lot of online stores, hotels need to constantly update prices based on what competitors do.

- cleaning (big) data. Automatically reconcile data to canonical format / names using authoritative source (say wikipedia)

Can you understand even the simplest TOS? I'd argue most (all?) are too restrictive to be enforceable. https://tosdr.org/

I mentioned this before in a previous thread on this topic, but I can't support the EFF on this. This is, in the end, an argument against control over one's own data: LinkedIn might be doing sketchy things with your data, but it's all stuff you voluntarily agreed to in exchange for their service. If any shady data aggregator can vacuum it up and do whatever, I didn't consent to that and I'm not getting any benefit from it. The EFF shouldn't be defending that right.

But the EFF isn't arguing that any shady aggregator should be able to vacuum up anything. LinkedIn would still have the full right and ability to implement limits, blocks, or so on to prevent this. LinkedIn could still make it against their terms of service and pursue a civil suit. It just would stop LinkedIn from being able to pursue felony hacking prosecutions against people for accessing a public webpage with a script.

Make it fair then! Bots can’t scrape LinkedIn, and LinkedIn can’t sell any consumer data to third parties.

For real: I really hate corporations 'stealing' data from my phone. For example, Google likes to introduce new sync options to Android, and every time they do so the option is activated by default. So as soon as the update arrives, their software syncs my data to their servers without my consent. They probably have some clause in the EULA, but as a user of their products I really hate that behavior. A similar case is not being able to disable address book sync before it syncs for the first time.

Those things should be crimes as the data they fetch is not publicly available on some web page but exists only on my personal device and they take it without my consent.

Install a firewall (for example, NoRoot Firewall) and whitelist only those apps/services you want to access the Internet.

How does a website put reasonable limits on access?

I'm not saying what Linkedin is trying to do is right but it seems to me there needs to be a way to say "Dude, that's not cool." A regular B&M store can refuse service to disruptive people and trespass people who don't comply, why not servers?


Pretty much what rayiner is saying, they posted while I was typing.

> How does a website put reasonable limits on access?

1) Blocking TCP connections

2) Returning a 4XX error, perhaps even "401 Authorization Required", "402 Payment Required", "403 Forbidden", or "429 Too Many Requests"
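As a rough sketch of option 2, a server could pick the response status along these lines. The function `choose_status` and all of its parameters (the ban list, the request counter, the login flag) are hypothetical names for illustration; only the status codes themselves come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus


def choose_status(client_ip, banned_ips, requests_in_window, limit,
                  requires_login, logged_in):
    """Pick an HTTP status that tells the client where it stands."""
    if client_ip in banned_ips:
        # The client has been told to go away entirely.
        return HTTPStatus.FORBIDDEN
    if requests_in_window > limit:
        # The client is welcome, but not at this request rate.
        return HTTPStatus.TOO_MANY_REQUESTS
    if requires_login and not logged_in:
        # The resource sits behind authentication.
        return HTTPStatus.UNAUTHORIZED
    return HTTPStatus.OK


banned = {"203.0.113.7"}  # documentation-range address, purely illustrative
print(int(choose_status("203.0.113.7", banned, 1, 100, False, False)))
print(int(choose_status("198.51.100.2", banned, 500, 100, False, False)))
print(int(choose_status("198.51.100.2", banned, 1, 100, True, False)))
print(int(choose_status("198.51.100.2", banned, 1, 100, False, False)))
```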

> A regular B&M store can refuse service to disruptive people and trespass people who don't comply, why not servers?

A Brick and Mortar store has to _tell_ you you're being banned. The mechanisms I listed above both tell you and lock the door whenever you attempt to access.

Edit: In this case, it's more like someone was looking in the store window from the public sidewalk and was asked to stop. Can you really ask someone to stop looking at you from a public place?

LinkedIn sent HiQ a C&D. They were indeed told that they were banned.

Let's try a thought experiment: you're at a supermarket, and you're abusing coupons to the point where you're holding up the line for everyone. Someone complains to the manager, and the manager escorts you out of the store and tells you you're banned for life (as an aside, I wish this would happen to extreme couponers).

The supermarket also has automatic doors and a self-checkout. They're also pretty understaffed, so there's a good chance you won't run into anyone stocking the shelves as you're shopping. A few days after you've been banned, you waltz in through the automatic doors, grab some items off the nearest shelf, go through the self-checkout, and leave without a single employee getting a good look at your face. At the end of the day, the manager starts fast-forwarding through the day's security camera footage looking for anything odd and notices you've been in the store. They call the police and have you charged with trespassing.

Do they have a case, yes or no?

I say yes.

Because of the minimal amount of LinkedIn resources utilized, and because this does not apply to all robots/extreme couponers, wouldn't this be more like a competitor checking the weekly ads posted on your front window?

Walking into a store is a clear violation of private space. Is looking at their window?

So, if you had to have an account to view any linked in information, and you got a c&d and your account banned, and you sign up for a new account, I think it would be like entering a store you've been banned from. But we're talking about information available from a public space: on your window or without an account.

I also take issue with the CFAA being used here. I'm sure there are other laws more applicable to keeping someone from talking with you.

To recap: I don't think LinkedIn is wrong to ask them to stop, I just don't think they're using the appropriate means of forcing them to.

> In this case, it's more like someone was looking in the store window from the public sidewalk and asked to stop.

I think it's more like calling the store and asking them what their prices are 20 times a minute.

No, it's more like you holding the giraffe while I fill the bathtub with brightly painted power tools. Because reasoning by analogy sucks.

No one is accusing HiQ of performing a denial of service attack.

> ...you holding the giraffe while I fill the bathtub with brightly painted power tools.

I'm down for that.

...and the coffee shop doesn't block your number.

Also, a phone call consumes, as a percentage of available resources, vastly more than an HTTP request.

Disregarding that though, I think you'd need a court order telling someone not to talk to you, and you'd have to take action to prevent them as well, blocking their number and telling them to stop, before that would be granted. If they persisted after being told explicitly and having their number blocked, then yes, I do think legal action would occur and be swift.

I would also assume, presumably, that "you" can be extended to be an automated phone system. (Which is still more limited in capacity than a server would be, but even disregarding that.)

FWIW, I'm not saying that "hiQ Labs" is blameless or acting in good faith. I'm saying that unimpeded access to publicly accessible information requires more than asking someone to stop and that the CFAA isn't the right tool for this.

I'm not an expert in this field, but I doubt the vast majority of people in this thread are. It also becomes interesting because I believe the CFAA has been used in similar situations before, but those were cases where the accessed knowledge could be assumed to be private, even if made public (client details at a phone company, or articles known to be behind a paywall). Not that I agree with its usage there either, but the data accessed there could be assumed, by a reasonable person, to not be public.

So the key thing here is: if something is publicly available, can I ask you to stop looking at it, or do I need a more stringent court order to prevent you from viewing public information?

And in this case, I do think the capacity constraints disregarded above would come into play. I think the courts would look differently at someone calling your clerk 20 times a day vs looking at a menu you post on the window.

> ...I think you'd need a court order telling someone not to talk to you, and you'd have to take action to prevent them as well, blocking their number and tell them to stop before that would be granted.

Like, for example, sending a C&D letter?

This whole hubbub is over them sending a C&D, they just made the mistake of trying to use the CFAA as a means to enforce it -- which, honestly, hiQ is fighting the good fight trying to stop.

A c&d is not a court order. It is a not-so-polite request and warning that further action will be taken.

Edit: If that's your point I agree with you. A C&D followed by some more appropriate measure (than the CFAA) seems like a not-raise-everyone's-backs approach.

That raises a point - would hiQ be liable in a civil suit if the CFAA were not a factor?

Looking in a window from a public place doesn't use any resources of the company being looked-upon.

This legal complaint is not about resources used; it's not a "They criminally DDoSed us".

This lawsuit is an attempt to stop competition by curbing access to data, not about ensuring reasonable use of APIs and rate limits.

They have many options. They can rate limit access by IP address, they can keep information they'd like not to be scraped behind login screens. And so on.
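For what it's worth, per-IP rate limiting is not much code. Below is a minimal sliding-window sketch; the class name `IpRateLimiter`, its parameters, and the choice of an in-memory dictionary are all my own assumptions for illustration, not a description of what LinkedIn or anyone else runs. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Allow at most `max_requests` per `window_seconds` for each IP."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # a real server would answer 429 here
        hits.append(now)
        return True


limiter = IpRateLimiter(max_requests=2, window_seconds=60)
print(limiter.allow("198.51.100.2", now=0))   # first request in the window
print(limiter.allow("198.51.100.2", now=1))   # second request, still allowed
print(limiter.allow("198.51.100.2", now=2))   # third request, over the limit
print(limiter.allow("198.51.100.2", now=61))  # earlier requests have expired
```

A scraper spread across many addresses would slip past this, which is part of why login walls come up as the other option.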

They could add requisite code.

weev went to jail for accessing publicly available information from AT&T. There's not a great precedent here for the EFF, unfortunately.

It was only a jury decision by a lower court, it doesn't mean much in terms of precedent.

I think scraping for personal use (not honoring robots.txt) should always be legal unless you are attempting a DoS. You are accessing public information, the server is returning HTTP 200, and it doesn't matter if you do so using a browser, PhantomJS, or curl with the -A parameter.

A different situation would be scraping a website to do business. The worst case is directly using the data: for example, those StackOverflow clones with the original data don't sound OK to me. I am not sure what to think about bots doing various derived work like stats and analysis. I think that if they are part of a business, making money, it shouldn't be legal unless those requests are permitted by robots.txt.

Question. How can this principle coexist with the idea that "surveillance is bad"? Because surveillance is mostly collecting publicly available information. Is it bad because it's done by a government? It's possible to set up a bunch of privately owned cameras in a city and keep filming people. Is it the association of pieces of information that makes it bad and not mere collection? Is it okay if it doesn't include personally identifiable information (but who knows what one can make out of it)? I don't know what I should think of this.

This thought process always bewilders me. Whenever it comes up that government agencies monitor our emails and phone calls, someone, as if on cue, always pipes up that that's totally no different from people posting on their Facebook timeline and other absolutely mind-bogglingly bad equivalences.

You, however, go the extra mile, here. How about you explain exactly how accessing published information on a public website is like building a network of cameras to monitor a city with?

Surveillance is bad but it is also not hacking.

Boom. Easy to have both opinions.

I would love to limit corporate databases, but not via letting website owners declare arbitrary use to be criminal.

Can data that is supplied with the intention of being publicly accessible, i.e. public domain, be restricted? Suppose the public were asked, "When you supplied your picture, your name, and then created a public URL to become fully searchable, was your intention that that information be restricted, or was your intention that this was information you publicized about yourself to make it possible for potential employers to find you?" Answer: "Yes, it was 100% my intention to become searchable so that employers would be able to seek me out". Conversation is over.

LinkedIn creates an implied covenant with public consent (mostly) to then publish and make discoverable their professional profiles.

While LinkedIn 100% should have the right to stop others from embedding without permission, since it's possible to claim the data structure and presentation is proprietary to them, this should never extend to the actual data itself, since this was willingly gifted by the actual owners (Joe Public) into the public domain.

I think an argument could be made that LinkedIn is being burdened with a degree of data mining that affects their business and therefore should be able to charge a minimal fee, e.g. for an API firehose to acquire the data in bulk from providers in a raw data stream.

That seems reasonable depending on the charges associated with that offer, this would be the correct compromise, since their data structure is all that actually separates their service from say About.me or any other site of that type. All of which don't disallow scraping; as long as it doesn't present as a DOS attack (of course).

Anyway my comments are as a marketer and not a programmer or lawyer, but personally I'm very interested to see this case resolved in a manner that doesn't suit LinkedIn in the slightest.

Are they arguing that it's a crime, or that it's a tort?

I believe the latter (though IANAL)

There is a difference between public property and private property that is made available to the public. Just because the cafe on the corner has its door open and lets you stroll in off the street doesn't mean that the property owner doesn't retain the right to exclude people. And if the property owner revokes your permission, then going onto the property again can be a crime (trespass).[1]

Servers are no different. The Internet isn't an abstraction--it's just pieces of private property connected together (servers, routers, switches). When you make an HTTP request, you're accessing a piece of private property. The owner of that property has every right to decide not to let you do so.

That's not a great analogy. The store owner can't just get you arrested/charged with a crime if they don't tell you that you aren't allowed first. HTTP lacks such a human mechanism. The closest thing I can think of in the standard is the response code. So your server replying 200 OK should implicitly be considered permission to access that resource legally until it stops replying with that code.

But that's exactly what happened here:

> LinkedIn sent hiQ cease and desist letters warning that any future access of its website, even the public portions, were “without permission and without authorization” and thus violations of the CFAA.

The EFF's point about terms of service is a good one, but also irrelevant. Terms of service don't provide adequate notice that someone's implied license to access a website has been terminated. But here, hiQ had actual notice through "human" channels.

The poster is arguing that if you make a request from LinkedIn's website and it returns a "200" along with data, then you've accessed that data lawfully and LinkedIn has agreed to serve it to you; I tend to agree. If they don't want to provide data to hiQ, they should, well, stop providing data to hiQ.

There are many ways to do this short of claiming that hiQ doesn't have permission or authorization, an argument that strikes me as wholly without merit. If the data is publicly available on the internet then how is permission or authorization required?

How is that any different than walking up to a store entrance with automatic doors and a sign that says "Welcome" on it?

Well, for one it's not a physical store nor a physical entrance and there is no sign that says "Welcome". I don't think the analogy is helping to make anything more clear... It's possible it's making things more confusing.

In my opinion, the bottom line is that if LinkedIn doesn't want to serve data to this company, then they should immediately cease doing so using the many well established means available to them.

For LinkedIn to claim that following a URL and downloading the data is somehow "hacking their website" is entirely ludicrous. I understand they had a lawyer tell this company that they didn't want them to visit the URL, but I don't see how that somehow turns lawful web browsing into illegal hacking.

Those doors get turned off at night, just like a server can ignore an HTTP request

They can turn the servers off at night too. Some places still choose to do that. But that is unrelated to the point, if you are told that you are no longer welcome at a business, you can’t come in without it being considered trespassing. The doors automatically opening for you (200 Ok) doesn’t matter. If you wear a disguise (change ip) doesn’t matter. You can’t go in.

Also I would agree that absent a specific order to stop accessing publicly available server resources, there is an explicit permission to do so. So in the case of Weev I think he did nothing wrong; AT&T were the ones in the wrong.

The store owner could have told you that you are not welcome at any time of day. I don't think a generic "Welcome" sign or automatic door would override that.

In the coffee shop example, would this be like trying to sue someone who is banned from your shop from looking in the window at your price list? In this case, it's more like LinkedIn is attempting to get a PFA order, but I think they need to show abuse, not just looking in the window at the menu you posted on the window?

No because that's not how computers work. Computers don't just emit radiation into the aether that anyone can capture. Accessing a website involves making a physical piece of property do something in response to your HTTP request.

If you are notified in writing that you're banned from a coffee shop, but you walk up to the front door and the "server" (pun intended) greets you warmly and allows you to enter, is that "implied consent" that overrides the prior explicit anti-consent, and therefore undermines the legal authority of that ban?

I think almost any judge or jury would find it implausible if you told them you thought the written ban didn't apply anymore because the server still let you into the coffee shop. We intuitively understand that written notice from a property owner carries more weight than the actions of one of their workers. I think the same exact reasoning applies where the "worker" is a computer server.

I agree with you, but I think an even better analogy would be a supermarket with automatic doors.

If someone was walked out of a supermarket and explicitly told that they were banned for life, and they tried to claim that the ban was lifted because the automatic doors opened for them, they'd be laughed out of court.

You could extend that further and say that the supermarket has a self-checkout. You may very well be able to walk through the automatic doors, grab something off the shelf, check it out yourself, and leave without anyone noticing you, but it's still trespassing if you've been banned from the store.

I actually think it would be a violation. There is a clear delineation between the private space of the coffee shop and the public space.

It becomes less clear where that delineation is not clear: a menu posted on a window or an automated phone system. These are both private things intended for at-large public consumption. My impression is that the EFF and hiQ Labs are taking the stance that it's a menu placed in the window, not being let in after being told you can't come in.

So, I can't shine a flashlight in your store window to look at the menu in the middle of the night? I have to send photons to make your "physical piece of property do something".

I don't think anyone who understands how computers work would compare the active process of a server responding to an HTTP request to the entirely passive phenomenon of shining light into a window and capturing the photons that bounce off.

I could just as easily say "I don't think anyone who understands how computers work would compare the active process of a server responding to an HTTP request to a coffee shop".

But to respond directly, the paper and tape had to be bought, printed, &c. Capital was expended to place the paper there. Sure there is not the ongoing cost of maintaining this paper in the window, and if that's where your argument lies, then you should be less condescending about it.

Moreover, we're not talking about the costs associated with access, we're talking about the permission granted to access. As such, ignoring the cost of serving an HTTP request is a valid comparison, because it is not at issue here. LinkedIn's argument is just as strong even if their only argument is they denied permission with no reason given.

Thanks for the ad hominem, by the way. Your childishness and inability to conduct a civil discussion has caused this discussion to end.

But you had to take active steps to cause your physical piece of property to respond to HTTP requests...

I give the example elsewhere, can I be prohibited from shining a light onto a menu you posted in your front window in the middle of the night? I had to take the active step of turning on the flashlight and sending photons onto a "physical piece of property to" bounce off the menu.

It's a bit like shouting in through the doorway "Hey, how much is your coffee?"

If you're on public property and yelling, I would assume the coffee shop owner would need a PFA or some other court order to prevent you from access. I don't think they could have you arrested because they simply asked you to not talk to them and they aren't breaking any other laws. (Though asking you not to talk to shop employees would be necessary before the PFA could be granted as I understand it.)

It's closer to going into the store that had sent you a C&D, then browsing the racks to create a price list.

Which they could probably get enforced. Could they prevent you from looking through the window to get prices or just the sale prices being posted in the window?

I'm not a lawyer, but I highly doubt that they could. It's also not what happened here.

Can you expand on the last part? I don't view a publicly accessible webpage as a protected private space, just as I do not view an ad posted in a (private) window as a protected public space.

Of course. My statement was predicated on the need for active network requests to obtain information. If the bot had passively listened to network traffic from LI, then I would argue for sameness with passively looking through a window.

What if it's dark and I shine a light on your ad in the window? (The issue at hand isn't DOS or resource-based, but permission.)

I agree with your premise. I'm just reaching a different conclusion.

As a permission issue, the bot _may_ have been authorized and authenticated; however, the company was sent a C&D letter that revoked all authorizations. That is why I say that logging in and accessing the resources did not constitute authorization.

If a C&D letter had not been sent, I think I'd agree with you.

You can't prevent me from looking in your window though, at a sign you put up for people to look at none-the-less, with a C&D.

Agreed. That's why I made my earlier comment, that this is closer to your entering a store (not just looking in the window) and examining the merchandise after you already had been sent away for trespassing, revoking all authorization.


> your server replying 200 OK should implicitly be considered permission to access that resource

I do see your point and how you could disagree with my statement above. However, if the store owner forgets you next time and says "Come on in! Oh and here is a take-home menu with all our items and prices" but then calls the police to have you removed, there is a problem.

Now imagine said store owner actually owns several locations possibly even with different public names and doesn't want to serve said customer. They could provide a list of all addresses of stores they run explicitly banning permission. Otherwise, that customer walking into store B would need to be told again they would not be served at time of entry.

Assuming the CFAA C&D from LinkedIn does have legal standing here... If hiQ were using IP addresses and not DNS resolution to crawl, how would they know a particular IP is a LinkedIn resource they aren't allowed to access? Did the C&D provide all addresses they are not permitted to access?

My point is that it's not black and white, and certainly not clear that this should be covered by the CFAA under "hacking".

Edit: You could also make the argument and analogy to a restraining order which places the responsibility for compliance on the banned party. However those don't just happen because one entity sends a letter to another entity, it needs to be explicitly granted via the legal process.

I think the more accurate comparison is that the owner sent you a C&D saying you're banned from the restaurant, and then you try to say "oh, I thought the C&D didn't apply any more because the waitress let me in." Would anyone seriously believe that?

The law applies to people, not computers. The only question is: did Linked In convey its revocation of hiQ's implied license in a way a reasonable person would understand? The computer code is only relevant if a reasonable person would take the HTTP status code to take precedence over the C&D letter.


If all robots clearly identified themselves in their requests, the server could easily block all of them. But if they fake their user agent to look like a browser and ignore robots.txt, that's not a good faith request and they shouldn't be able to plead ignorance.

I don't believe there's a law requiring the honoring of the robots.txt file. People and services honor the file out of a sense of good manners, not a legal requirement.

It doesn't have to be a specific law. It is a rebuttal to a claim of "I had no idea I shouldn't have requested millions of pages from that site".

If you scrape a site that prohibits it in robots.txt, that should be considered notice that they don't want that, for whatever relevant law. (I don't know if this argument would hold up in court, IANAL.)

I think I see what you're saying, but I disagree that the robots.txt file should have any legal ramifications. Web site operators have many tools that they can use to limit traffic or protect data and they should make good use of those tools.

LinkedIn wants to make their data available publicly, except under certain conditions. In my opinion, if they can't find a technical solution, they should stop making the data available publicly.

What is a robot? Why is the User Agent even important? It's not a standardized value. I could send "User-Agent: ikeboy" and it's perfectly valid.
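To make that concrete, here is a small sketch with Python's standard `urllib.request` that builds (but does not send) a request carrying an arbitrary User-Agent. The helper name `build_request` and the example.com URL are assumptions for illustration.

```python
import urllib.request


def build_request(url, user_agent):
    """Build a request whose User-Agent is whatever string the client picks."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Nothing is sent over the network here; we only inspect the request object.
req = build_request("https://www.example.com/", "ikeboy")

# urllib stores header names in capitalized form ("User-agent").
print(req.get_header("User-agent"))  # prints: ikeboy
```

The server only ever sees the string the client chose to send, so filtering on it depends on the client being honest.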

>Just because the cafe on the corner has its door open and lets you stroll in off the street doesn't mean that the property owner doesn't retain the right to exclude people.

While I don't know about the EFF's overall argument, as an absolute statement I don't think you are correct here. In the USA at least, "Public Accommodations" (which your cafe example would be) are in fact subject to regulations that limit their ability to discriminate, require accommodations for the disabled, etc., and these apply regardless of whether it's public or private property. Something that is open to the general public is different in law than purely private property (private clubs and religious institutions are specifically excluded from federal law, but that's it). There are also going to be different expectations of privacy and default access levels.

Physical to digital analogies are often a poor match anyway, but in this case I'm not sure even if we accept one that it fully supports your point. Private property open to the public is not legally the same as purely private limited access property in terms of who it may exclude, when it may exclude, and why (as well as lots of other standards).

Laws against discrimination don't turn private property into quasi public property. They are narrow exceptions to the way in which property owners exercise their right to exclude.

Neither the corner cafe nor Linked In can refuse to serve a request by someone because the person is black. But both the corner cafe and Linked In can refuse to service someone for any non-discriminatory reason, such as say because they're a Michigan fan.

Or, more reasonably, because they are refusing to comply with some expectations for behavior that apply to everyone there.

Are you just trying to play devil's advocate or do you really believe this? With HTTP, you're requesting access and then the server gives you some information. It's up to the server to decide to give you the information. If the server doesn't give you the info, you can try to hack it but you might be breaking the law.

Same with a cafe. You can request access and the cafe can turn you away or serve you. If they turn you away and you refuse to leave, then you are breaking the law (like hacking).

Basically if I request something from you and you give it to me, that's your problem, not mine.

Trespass is not illegal until the owner informs you that you are not wanted. Private information that has accidentally been made public is like an unmarked field. It may be private, it may be public, but until the owner takes specific action it is not illegal to use the field. If the owner decides to take action, that action cannot be retroactively applied, even if there is a record of who used the field.

Regardless, this is not analogous. If LinkedIn is making information public, then they cannot simultaneously say that this information is private for a specific use and expect the courts to intervene.

The difference is Linkedin knows they're scraping the site, asked them to stop and is now trying to force them to stop through the courts (in a really bad way).

Google, Bing, etc are also scraping their site, and I see no cease and desist order there. Make Googlebot authenticate itself, or admit the data is publicly accessible.

The Whataburger I went to for breakfast this morning gives some homeless people free coffee and asks others to leave...

And when hiQ shows up looking homeless and accepts the gift of coffee, they are committing a crime?

No, but when they ask them to leave and they still take a coffee cup they are.

But that's not the point, the point is it's possible to give something for free and also refuse to give it to everyone under any circumstance.

They didn't give me a free cup of coffee and someone could reasonably mistake me for a homeless person based on my (lack of) fashion sense but that doesn't mean I could just reach over the counter and grab a cup because I saw them give one to somebody else when I walked through the door.

When a server sends you a response you aren't taking something, you are being given something. If the server thought you shouldn't have it, it wouldn't give it to you.

How can you say that hiQ isn't allowed to have this, but everyone else is allowed to take as much as they like? All that will happen is hiQ will create a string of shell companies that accesses LinkedIn as their proxies, and you will be wasting the court's time. Step zero is to establish that no one can have access unless authorized, and LinkedIn refuses to do this.

> When a server sends you a response you aren't taking something, you are being given something. If the server thought you shouldn't have it, it wouldn't give it to you.

That's not even a rational argument, ask some hacker sitting in prison how well that one went over.

> How can you say that hiQ isn't allowed to have this, but everyone else is allowed to take as much as they like?

Umm, private property? Terms of service? Take your pick...

> All that will happen is hiQ will create a string of shell companies that accesses LinkedIn as their proxies, and you will be wasting the court's time. Step zero is to establish that no one can have access unless authorized, and LinkedIn refuses to do this.

hiQ isn't fighting the validity of giving access to some people while denying them access to the very same data; they are fighting a misapplication of a totally unrelated law (because it's the right thing to do).

This whole thing isn't about denying them access but "hiQ challenged LinkedIn’s attempt to use the CFAA as a tool to enforce its terms of use in court."

"Hacking" involves subverting authentication systems, which is a type of fraud. When there is no authentication system there can be no "hacking", and the CFAA should not be applicable.

The data itself isn't LinkedIn's property (argued elsewhere), so they don't have control over it after it leaves their servers.

This is wandering... please decide whether you want to argue the article, the case, or hypothetical free coffee.

I think this is a poor analogy. An argument/analogy like yours would allow me to say you trespassed with your eyeballs. That light travels one way or bytes another doesn't affect the spirit that you are looking at something that was made available to look at. Can I outlaw window shopping? My cafe is going to have a sign that says "If you are employed by a competing cafe and you don't close your eyes when walking by, I will attempt to have you jailed". Or I'll wait until they show a price comparison between their coffee and ours, then send a cease and desist to have them not look at my billboard anymore.

> When you make an HTTP request, you're accessing a piece of private property. The owner of that property has every right to decide not to let you do so.

It can do exactly that. It can respond with an error code or start dropping packets entirely. As far as I'm aware, LinkedIn didn't do that.

Any access to LinkedIn's data requires that LinkedIn send it in a response. If LinkedIn is sending it in a response, LinkedIn can't claim that it's not authorized.
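
To make the "respond with an error code" option concrete, here is a minimal sketch using Python's standard library. The blocklist entry, handler name, and port are all made up for illustration, and this is not a claim about how LinkedIn's servers work. Note that a User-Agent header is trivially spoofed, so this only shows the mechanism, not a robust defense.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical blocklist; the substring is invented for this example.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot",)


def status_for(user_agent):
    """Return 403 for a blocked client and 200 for everyone else."""
    agent = (user_agent or "").lower()
    if any(marker in agent for marker in BLOCKED_AGENT_SUBSTRINGS):
        return 403
    return 200


class ProfileHandler(BaseHTTPRequestHandler):
    """Serves a placeholder page, refusing clients on the blocklist."""

    def do_GET(self):
        status = status_for(self.headers.get("User-Agent", ""))
        body = b"public profile\n" if status == 200 else b"forbidden\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ProfileHandler).serve_forever()
```

Calling `serve()` starts the demo; a client whose User-Agent contains "examplebot" gets a 403, and everyone else gets a 200 with the page.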

I am not a lawyer and I do not have citations to back this up, but I suspect that, if you put up a billboard and then send cease and desist letters to people looking at it, or taking pictures of it, or whatever analogy to programs examining public web pages you like, then you would be laughed out of court.

I find this argument to be a poor fit for the actual situation. The person that owns a coffee shop needs to let people physically enter their coffee shop in order to purchase coffee, snacks, etc. LinkedIn has no such requirement, they can easily require people establish and log into registered accounts in order to access their data. As you have said, their servers are their property and they have the ability to block access for anyone that they do not wish to serve.

This is entirely different. LinkedIn wants to make the data available on the public internet... Except sometimes. They can't figure out a technical solution so they are pushing for a legal solution. If you'd like to try to further your coffee shop argument, this seems more like a coffee shop giving away free coffee with a notice letting customers know that there's a limit of three free coffees per person and then being shocked when some customers take four or five. Or all of them.

1. If a business offers me one product for free and I take two, that's theft, plain and simple. I'm not sure what your analogy was meant to prove but I think it actually makes a stronger case for the counterpoint to your argument.

2. LinkedIn has every right to define what the use policy is for information it makes available publicly through its own product. In this case, the policy was violated, and the violator was notified through appropriate channels that they were in violation. They continued to access LinkedIn and violate the policy, which is illegal. The critical distinction is that what they were doing only became illegal when LinkedIn notified them that they were in violation of the policy, no longer welcome on the site, and they continued to do what they were doing anyway.

If you leave things out in an open, public space without any access controls, those things are likely to be taken. A note that says "please don't take this" isn't going to change anything and I find it unlikely that you could pursue anyone legally on the grounds of "but I left a note".

LinkedIn has every right to define their use policy through technical means. If they want to make it publicly available, then they understand part of that public is their competitors. In my opinion, website operators should not get any legal protections for things they can easily do themselves through readily available technical means.

I wholeheartedly disagree that LinkedIn has any right to define the use policy for data it makes publicly available. A wide variety of data is available to the public and you can't simply sue people who use that data in a way that you dislike. If you would like to keep that data private then do so.

> When you make an HTTP request, you're accessing a piece of private property. The owner of that property has every right to decide not to let you do so.

So why don't they do that? If they're responding to a bot's HTTP requests with content, they are choosing to give the bot access.

Sure, but if you're never told to leave the coffee shop or no action is taken to prevent you from entering again, say being told you're banned, and you continue to walk in and use the coffee shop with no one saying anything, has your permission to enter really been revoked, even if the owner thinks, and only thinks, it has been?

FTA: "LinkedIn sent hiQ cease and desist letters warning that any future access of its website, even the public portions, were “without permission and without authorization” and thus violations of the CFAA."

They were formally told to leave the coffee shop and not return.

No, they were told to stop looking in the window. Can you ask someone not to look at the menu you posted in your sidewalk window? I don't think that'd hold up in court.

Now you are just changing your analogies because you didn't read the article.

> Just because the cafe on the corner has its door open...

Hello, is there a cafe here?

Yes, here’s some coffee! Anyone who asks gets some!

Thanks, I acknowledge receipt!

I’ve changed my mind, I shouldn’t have been giving out coffee! What kind of a business is this? The only way my actions make sense is if you’re a thief. Thief! I will now try to ruin your life via the legal system.

Not true. When you make a HTTP REQUEST, you’re not accessing a piece of private property. You are requesting information. Just because it is requested doesn’t mean it has to be served.

Fine. I hereby forbid access by any entity owned, operated, or otherwise controlled by Microsoft Corporation to any internet server or service operated by me. Disregard of this interdiction shall be considered a crime, the digital equivalent of trespassing.

If you can demonstrate that Microsoft is aware of their ban from your property, then you absolutely would have a case.

And this is why we have judges and juries.

Actually, where I live, we don't have juries.

> LinkedIn argues that imposing criminal liability for automated access of publicly available LinkedIn data would protect the privacy interests of LinkedIn users who decide to publish their information publicly, but that’s just not true

Protect them from what, your unlocked front door? [0][1]

[0] "Hackers selling 117 million LinkedIn passwords" http://money.cnn.com/2016/05/19/technology/linkedin-hack/ind...

[1] https://en.wikipedia.org/wiki/2012_LinkedIn_hack

I'd also note that these companies are barely (if ever) held liable for life-compromising hacks on their platforms.

Is it even comparable to an unlocked door, though? To me it seems a lot more like leaving something on the front of your house and trying to prosecute when someone takes a picture of it.

Nothing is removed or destroyed, and nothing was hidden or publicly unavailable.

And, technically, you did essentially request access. An anonymous HTTP request doesn't have to be honored by the web server.

This right here, folks. This is how I would prefer government worked. Imagine putting the liability back on the corporation for confirming access, because the "protocols" it had in place approved it.

I recall seeing in the wild an HTTP User Agent string that included a EULA for the server stating essentially that they, not the client, were on the hook for any BS if they failed to immediately close the connection.

IANAL but, uh, seems legit... ¯\_(ツ)_/¯

Yep, you two are correct. Like ct0 said, companies are currently saying "We are saying 'come in' to bots but we want to pursue them legally as well."


The bot says "GET /blah" and LinkedIn says "200 OK".

Not bot's fault.

Well, there is precedent for that at least in the EU. You are not legally allowed to take a photo of the Eiffel Tower at night, because the arrangement of bulbs is considered a work of art, and thus copyrighted.

You mean, you're not allowed to distribute non-transformed copies of the photo?


The linked Snopes article that they use for this viewpoint is badly worded. Although the headline claim is 'It is illegal to take photographs of the Eiffel Tower at night without explicit permission', nowhere in the text does it describe the act of taking a photograph as being illegal. It is all about publishing your photos and sharing them with others.


Awesome I'm going to stick a couple of LEDs on my T-shirt and then hang out in France. It'll be illegal to stick my photo on FB

In that article, and the Snopes article it links to, it is implied that it is illegal to even take the picture. But then I fail to actually see where it is explicitly stated that it is illegal to take the picture. I can understand the copyright claim on publishing said photos, because that's actually the case for lots of things that can be viewed in public, but not taking the photo itself.

I really wish articles would refrain from potentially untrue clickbait headlines, but oh well.

I'm not a lawyer, I'm just repeating what I heard on the "Today I Found Out" YouTube Channel.

Regarding your last sentence. In Europe that may change with GDPR.

My account was probably in that password dump. LinkedIn has yet to reach out to me, but will still spam me with people that are not on linkedin

Side rant: LinkedIn is a piece of crap in societal concept and implementation. Recently I was so frustrated by removing old connections I just simply deleted my account.

Warning: I am going to be crude at this point: linkedin is an HR circle jerk of pointlessness

If I leave my front door to my personal residence unlocked, and someone comes to the front door, opens it, and walks inside without permission --- is that illegal?

I'm actually not sure.

Yes, it is.

If you don't have a legal right to be on a piece of property, in a given structure, or in a vehicle, you're trespassing.

If you used force to gain access to the property, vehicle or structure, it will often be considered breaking and entering. Typically, these laws use a very loose definition of "force" which includes opening an unlocked door.

If you leave your door ajar, it's just trespassing. If you had to open the door, it's probably B&E even if you didn't break anything to do it.

What about walking up to someone's door and knocking to see who's home? What if there is a picket fence around the yard with a latched gate that you have to open to get to the front door?

In the only jurisdiction where I've actually read the trespass law, it stated that a "legal fence" was sufficient to indicate no trespass, and that a "legal fence" was any number of acceptable structures that were at least 4 feet high.

In that context, a wall of a house being at least four feet high, would carry an implicit "No Trespassing" sign on it, but the picket fence would not. However, if the property had an obvious path to an entryway, then walking up that path to the entryway was not trespass. So walking through a picket fence with a low-latch would not be trespass, unless the pickets were four feet high, or if the latch was locked.

Isn't it only trespassing if you ask the person to leave? E.g. by putting up signs?

If the door is unlocked and nobody says you can't enter, why can't you enter?

To the best of my knowledge, no - your place of residence is considered to be private property and thus there is an implicit "no access without authorization".

I think a better comparison would be comparing LinkedIn to a public property (such as a commercial store) and thus there is an implicit "access allowed until revoked".

I think that realistically, there are strong parallels to this being a customer/company dispute over who has access to the company's store. The door (HTTP protocol) has to be walked through for the customer to see the wares (LinkedIn profiles) and can be guarded by security (some form of authorization).

I think the question being asked is a valid one - should a company have the right to bar access to otherwise public information if the customer is not tampering with your system? If so, to what extent? If undesirable robots shouldn't be turned away what about DDOS traffic? What forms of flow control become legal in this case?

I'm honestly curious what the courts decide and how that may impact other websites that have tried to combat scraping, such as Craigslist.
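
On the flow-control question above: one common technical form is a per-client rate limit. Below is a minimal token-bucket sketch, where the class name and parameters are my own choices for illustration and not anything LinkedIn or Craigslist is known to use. Each client may burst up to `capacity` requests, then is held to `rate` requests per second.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to the time elapsed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per client (by IP or account, say) and answer with an error such as 429 whenever `allow()` returns False. Whether that is the only remedy a site should have is exactly the legal question here.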

Depends on the person (stranger vs close uncle vs not close uncle, etc.), but in general, yes it is. It's also illegal in some states to leave your keys in your running car. It's still illegal for someone to get in and drive off.

It is legal until you inform them they are trespassing and ask them to leave.

Not if it's a personal residence. Entering someone else's property is trespass unless you have license. When private property is open to the public, there is an implied invitation to the public to enter, so you have license to do so unless it's revoked. With a personal residence, however, there is no implied license for strangers to enter (though there might be based on the parties' relationships or prior dealings).

There are some states with specific rules for personal residence, but it's inconsistent.

I don't think this is true. "Trespass is defined by the act of knowingly entering another person’s property without permission."


So it's illegal to, for example, go door to door looking for one that somebody forgot to lock and then spend the night there.

Under UK law, trespassing is a civil not criminal matter and so by some definition it is not illegal.

Still illegal, just not a crime.

No, it's unlawful, not illegal.

Definition of illegal: not according to or authorized by law : unlawful, illicit; also : not sanctioned by official rules (as of a game)


In legal terms, illegal and unlawful are not synonymous. In the UK, trespass is only illegal in certain circumstances: https://cps.gov.uk/legal-guidance/trespass-and-nuisance-land

Beyond that, it's only unlawful.

Where I live it is 100% legal to shoot them with no questions asked.

100% legal (castle doctrine) to shoot them, think about that for a minute, not generally legal to shoot someone engaging in a legal activity.


Also legal to shoot them through the door but probably not such a good plan...

No, you just think it is. The intruder must be there to commit a further crime, usually a violent one.

   An intruder must be making (or have made) an attempt to unlawfully or forcibly enter an occupied 
     residence, business, or vehicle.
   The intruder must be acting unlawfully (the castle doctrine does not allow a right to use force 
     against officers of the law, acting in the course of their legal duties).
   The occupant(s) of the home must reasonably believe the intruder intends to inflict serious bodily 
     harm or death upon an occupant of the home. Some states apply the Castle Doctrine if the occupant(s) of the home 
     reasonably believe the intruder intends to commit a lesser felony such as arson or burglary.
   The occupant(s) of the home must not have provoked or instigated an intrusion; or, provoked/instigated 
     an intruder's threat or use of deadly force.

No, you must reasonably believe they intend to commit a further crime.

Here, unless you give them a reason they're just like "yeah, dude opened the wrong door, heh?"

I am aware of a few southern states where the statute assumes the intruder intends to harm you. UncleEntity is likely correct.

well, 'breaking and entering' in the US requires that something (i.e., the door) actually be broken in the process of entering the house...otherwise that charge doesn't apply.

"Breaking" doesn't mean that, here.


Fun aside: breaking and entering is referred to as such in English Common Law because criminals used to bust through the wattle and daub walls to break in, thus housebreaking, or breaking and entering. [1]

[1] https://books.google.com/books?id=77y2AgAAQBAJ&pg=PA229&lpg=...

The more you know!

> I'd also note that these companies are barely (if ever) held liable for life-compromising hacks on their platforms.

You do know it is impossible to stop all cyber attacks? It's always a matter of when, not if. Zero-day attacks are developed every day, with not even the best-funded cyber security systems able to thwart them. The geniuses are on the offensive side; if they want in, they will get in.

The industry is held to no standards at all. You can keep plain-text passwords in your databases, do no tests at all, and be incompetent in a million other ways. I usually get downvotes when I say this, but by now there needs to exist certain regulation on commercial software and software-based services. It should be ensured that certain practices are followed in security and ethics (do you take the basic, well-known precautions against the well-known attacks? do you respect your users' privacy at least as much as the law requires you to? do you follow the terms and conditions you declare?).

What we need is CE for software, and it's sad that I can ensure my cheese comes from a certain town and is produced from the milk from cows eating according to a certain diet, but not if Twitter (or any other commercial website) hashes and salts my password, and actually uses basic precautions against CSRF or what not. These companies should be obliged to get their stuff audited by third parties, and there should be a way to tell if they are really approved to maintain a certain standard in producing their software.

I do understand and share the hacker culture, and appreciate how it's possible to spin off a start-up website business on the internet, but business is business. You don't become exempt from regulations when all you do is to run a tiny B&B with 2 rooms. Similarly, as soon as you're a company selling online services, regulations and standards should kick in. Because by now those online services are no less important than food business.

You say it's impossible to stop all cyber attacks. Then, as it is impossible to stop all burglary attempts, should banks just deposit their money in some apartments, or in some random rooms where all the security is a wooden door? Fire all the security guards because it's impossible they survive all the guns out there? These companies like LinkedIn are no different from banks insomuch as they deposit not our money, but our personas. They should actually be more cautious because while money can be replaced, nobody can have a new self.

> It should be ensured that certain practices are followed in security

Let's not legislate specific practices.

Imagine if we had security legislation from 1995 to follow when programming today. Imagine trying to explain to senators why last year's XSS protection rules need updating. Imagine Oracle lobbying to get their database enshrined as the "security-compliant" one.

The law should focus on outcomes: if a site gets hacked and people are harmed, the site should be penalized.

"Security compliance" is about how you use a given database, not which one you happen to use. You can securely (but inefficiently) store credentials in a plain text file.

WRT some defences becoming outdated by time, well, it probably would not be two decades behind, but a couple years or so at most. Even then, ensuring that is better than nothing.

People need tools to judge if they can safely use some product, and that's why standards exist. Otherwise companies are going to continue to screw us until they drop the balls.

> WRT some defences becoming outdated by time, well, it probably would not be two decades behind, but a couple years or so at most. Even then, ensuring that is better than nothing.

Not necessarily. What if the law mandates use of, say, an encryption algorithm that has been cracked? You can't move to a new one without breaking the law.

Larger organizations use ISO-27001 and SOC-2 to audit this kind of stuff. But even so, sometimes the devil is in the details and it's possible to comply with the letter of the regulation while still being unprepared for the kinds of attacks that your service attracts.

Thanks, I'll look into them, but are there any compulsory standards anywhere? AFAIK this is entirely optional, i.e. left to the good will of the company.

The EU is right now implementing a directive on how private information must be stored, AFAIK

Oh thanks. I guess you're referring to GDPR. I'll take note to research this in the future and have found some resources after seeing your comment, but I'd fancy some links if anybody has them that elaborate on this topic.

Certain industries are regulated, although the regulations are not consistent. It is not uncommon for jurisdictions to require by law protections on electric grid control equipment. For example, in some places in the US, servers that can ultimately affect a large scale change in power generation equipment (such as switching the configuration of a power plant) must have anti-virus installed on them (NERC-CIP).

>You do know it is impossible to stop all cyber attacks?

This is a fallacious argument, specifically the Nirvana Fallacy. Perfection not being achievable in no way means that there can't be standard best practices that are a minimum requirement, nor that liability cannot still exist. Certain types of cyberattacks are in fact possible to stop perfectly merely by virtue of not holding onto information at all. As a trivial example, there should be no plaintext password leaks (or even easily brute-forced password leaks) at all, ever. Adaptive hashes/key stretching have been a thing since the dawn of security: Robert Morris described CRYPT for Unix password usage in 1978, and bcrypt is from 1999. There has been no reasonable basis at all for plain text or even raw fast hash primitives to be utilized, ever, yet they have been. In no other industry dealing with these kinds of privacy and safety concerns is that sort of practice considered acceptable, nor should it be.

Holding personal private information at all long term should fundamentally be considered a liability situation, because it's not necessary, it's a commercial choice. Can't be hacked if it doesn't exist. If businesses choose to hold it, they should also be taking reasonable steps to protect it, and accept liability for failures. That's the natural balancing flip side to them getting profit from using it. If they're allowed to turn any costs of holding it into externalities that distorts the market.

From my random perusal of the various reports of compromises over the last few years, my impression is not that organisations tend to get hacked using the latest zero-day vulnerability, but rather that organisations get hacked because they have glaring security holes that you could drive a double-decker bus through.

For example, bcrypt has been around for how long now? And don't almost all the reports of hacks report that a database was lifted with usernames and passwords either in plaintext (for the love of all that is holy) or hashed with unsalted SHA1, or similar?
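
For anyone wondering what the baseline looks like in practice, here is a minimal sketch of salted, stretched password storage. It uses PBKDF2 from Python's standard library rather than bcrypt, purely because it needs no third-party package. The function names and the iteration count are my own assumptions for illustration, not a recommendation from any standard.

```python
import hashlib
import hmac
import os

# Assumed work factor for this example; tune it for your own hardware.
ITERATIONS = 200_000


def hash_password(password, salt=None):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a per-user random salt."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)
```

You store the salt and digest, never the password. The random salt means two users with the same password get different digests, and the iteration count makes each guess expensive for someone who has lifted the database, which is exactly what unsalted SHA1 fails to do.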

I wish there was a "web security checklist" where if you ticked all the boxes, you can be pretty sure you have the well-known holes covered. This is why web frameworks are really useful, the decent ones get you way ahead in securing your application from the most common attacks. But if you self-bake, then you have to manage the entire complexity of the web platform.

This doesn't cover everything, but it's a pretty good starting point:


OWASP top 10 is as close as it gets to a checklist: https://www.owasp.org/index.php/Category:OWASP_Top_Ten_Proje...

I think most "hacks" have been the results of social engineering and misconfigurations rather than software/hardware vulnerabilities.

I keep plaintext passwords, but I reverse the string to prevent the hackers.

While I agree, as a CTO I would be terrified if a data breach could hold me personally liable. It'd be like a Director of Security at a bank being liable for their bank being robbed with a tank.

But at the same time there is a line. I would be for holding companies liable if, for instance, the data gets out there and you find it is entirely unencrypted and the passwords are MD5 hashed or plain text. There has to be a baseline.

Mistakes should not be punished as long as there is not also negligence.

The Director of Security at a bank should at least be fired if their bank is robbed by a guy brandishing a banana. I'd speculate that that's the nature of most data breaches: amateur attackers taking advantage of grossly incompetent security.

Well I think bank tellers where I live are instructed to comply with robber's demands for money even if they are not visibly brandishing any weapon

No one said anything about holding the CTO personally liable. The idea is to hold the company liable. This makes sense because the company is in the best position to prevent the bad outcome. If the company is always liable, it can find an optimal balance between the costs of security and the costs of breaches.

If the company is only liable when negligent, it is incentivized to minimize the cost of security to the bare non-negligent minimum. This pushes all the costs onto the people whose data are compromised. These people are not in a position to spend small amounts of money to dramatically lower the expected costs of breaches, so they just end up paying huge costs that cannot be mitigated.

> While I agree, as a CTO I would be terrified if a data breach could hold me personally liable.

Personal liability is going too far, IMO.

> Mistakes should not be punished as long as there is not also negligence.

The problem with this is that you'd have to enshrine, in law, what "negligence" is. Technology changes too fast to put that into law.

"How many people got hurt and how badly?" is a question attorneys can reasonably address. "Was there sufficient input sanitization?" is not.

You didn't really address the point you quoted.

The problem isn't that someone is getting IN; it's that the company throws up their hands and says "tough sht."

Or in a worse case, when Equifax puts up a compromised site to find if you were hacked that requires a significant amount of your SSN and personal details.

(edit: format)

> it's that the company throws up their hands and says "tough sht."

What exactly is your solution to the problem? You are more or less complaining without providing any insights into addressing the issue or without knowledge of the threat landscape.

Spending money on security architecture/engineering/pen testing/etc in concert with government regulation/oversight.

Full disclosure: I work in security architecture/risk management in the financial services industry.

You also can't stop all failures of infrastructure, but outside of computing, anyone calling themselves an engineer is generally required to hold to various ethical and professional standards or have their work signed off by someone who is.

It's not impossible to stop most, though. And hacks like Sony, Equifax, LinkedIn and many others are the result of what should be criminal negligence, i.e. not encrypting sensitive personally identifiable information.

Instead of investing in securing their customer data, these companies pad their bottom line. So yes, they should be held accountable for failing to follow basic industry-standard data protection practices.

It's impossible to build a house that can't be burglarized. Does that mean you shouldn't lock your door when you leave in the morning?

Silly; it's impossible to stop all murders, therefore we shouldn't bother with making it a legal liability.

If the criterion is that it must be possible to stop all instances of an action to make it a legal issue, then we should just shut down all the prisons.

So doing QA is a crime now?

Edit: adding context.

I'm doing QA to validate information collected by my recruiting company, both acting within LinkedIn's terms of service for a paid subscription, and violating their terms of service by improving my own company's process. Like the article said: LinkedIn wants to participate in an open internet and also abuse the CFAA.

Yes, it was not previously a crime for either the providers or the viewers; however, viewers unofficially announcing or utilizing public information is plagiarism.

Let us remember here that Microsoft owns LinkedIn. There's been a lot of love for Microsoft here recently (I'm among the many who are liking the 'new' MS). No doubt, this is quite a separate group to those doing OSS/Linux/Python/Jupyter/etc, but it's worth pausing to think about what a move like this says about their overarching corporate strategies.

Shit like this has been LinkedIn's modus operandi since day one, not to mention their own questionable ethics. It has little or nothing to do with them now being a subsidiary of Microsoft.

Microsoft has the choice to change that behaviour now that they own LinkedIn. It seems that they choose not to.

How do you draw a line between accessing disturbing contents such as child porn vs accessing a leaked document? It seems the former requires some additional moral take - what if the click was accidental and it was an attack? What if the person only watched but doesn’t possess the content?

In a court of law, the way we do for all grey areas between legal and illegal in our society. Law is not binary, it’s fuzzy and requires manual intervention. That’s ok.

However: That’s not what this article is about. That we don’t have a perfect solution for whatever weird corner cases (accidentally clicking on child porn?), should not change this very honest, serious and real issue the eff is addressing here. It is a distraction. We can hypothesise about edge cases until the cows come home, but to what end?

I get how a life of working in binary makes us immediately jump to the corner cases. It’s a curse on any legal discussion on HN. But it’s not relevant, and, imo, it dilutes the energy.

Edit : that came out harsh so I’d like to clarify: I get, 100%, where this “looking for the flaws” mentality comes from. It’s what makes a good programmer. A function that only follows the spec for 75% of its possible inputs is wrong. A law, not necessarily. We need to be careful not to keep our engineering hats on when switching to discussing law.

So why didn't hiQ just operate from some jurisdiction where scraping is legal? And use VPN services to prevent blocking. I mean, the Internet is global. So why should US laws matter everywhere?

Spamhaus did this, and the spammer who sued them got a default judgement. When he went to seize their domain, they suddenly cared about US laws.

Well, some domain registries are not under US control. Consider that TPB and Sci-Hub still have domain names.

And, if push comes to shove, one doesn't really need a domain.

More bad rulings from the Ninth Circuit.

The stupidity of the concept of IP never fails to make itself obvious. Free information that users enter into linkedin should remain free.

A Microsoft company attempting to exploit the law for profit? Imagine my shock.

Would you please not post unsubstantive comments here?
