
Web Scraping and Crawling Are Perfectly Legal, Right? (2017) - heshiebee
https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/
======
modeless
If search engines hadn't started scraping the web back in the 90s, imagine the
outcry if you tried to start a search engine today. You would instantly be
sued into oblivion. Copyright, trademark, TOS violations, etc. Nothing
remotely similar to a modern search engine would ever be allowed to start if
it wasn't grandfathered in by the time people started getting serious about
law enforcement on the internet.

The law is not as clear cut as people imagine it to be. In many cases the text
of the law is less important than precedent and established practice.

~~~
beamatronic
If you tried it today, you would need a paid subscription to get access to a
search engine

~~~
ReinholdNiebuhr
I'd be okay with a paid subscription service if I got privacy and some other
things as well. There's room for both free and pay. DuckDuckGo is OK, but not
Google. It would be nice to have Google, without the bias and without the
privacy issues.

~~~
481092
You'll never have a search engine without bias. A search engine's job is just
that, to bias various data, not soak up any and all data on the internet.
That's why humans or anyone or anything in this universe is biased, because we
can only soak up so much info from so many proverbial directories in a
lifetime.

~~~
nine_k
The web may be biased. But a search engine that reflects that bias is just
doing its job correctly.

One inevitable and useful bias a search engine has is the bias against spam.
It's aligned with the interests of the user.

There can be (and are) biases that are against the users' interests but are
aligned with operator's: promotion of some content, censorship of other
content. This bias I would rather see gone.

------
tomhschmidt
This article is a bit dated. AFAIK the latest in web scraping legality is
LinkedIn vs. HiQ, where HiQ was scraping public LinkedIn profiles. LinkedIn
issued a C&D under the CFAA, but HiQ received an injunction that allowed it to
continue scraping. This was supposed to be heard in the Ninth Circuit court
over a year ago, but I'm not sure what happened.

[https://www.eff.org/cases/hiq-v-linkedin](https://www.eff.org/cases/hiq-v-linkedin)

~~~
staticautomatic
Pretty sure the Ninth published what's now the controlling opinion in the
circuit. YMMV in other circuits.

EDIT: Yep. hiQ Labs, Inc. v. LinkedIn Corp., 273 F. Supp. 3d 1099 (N.D. Cal.
2017)

"In summary, the balance of hardships tips sharply in hiQ’s favor. hiQ has
demonstrated there are serious questions on the merits. In particular, the
Court is doubtful that the Computer Fraud and Abuse Act may be invoked by
LinkedIn to punish hiQ for accessing publicly available data; the broad
interpretation of the CFAA advocated by LinkedIn, if adopted, could profoundly
impact open access to the Internet, a result that Congress could not have
intended when it enacted the CFAA over three decades ago. Furthermore, hiQ has
raised serious questions as to whether LinkedIn, in blocking hiQ’s access to
public data, possibly as a means of limiting competition, violates state law."

~~~
brighter2morrow
>In particular, the Court is doubtful that the Computer Fraud and Abuse Act
may be invoked by LinkedIn to punish hiQ for accessing publicly available
data; the broad interpretation of the CFAA advocated by LinkedIn, if adopted,
could profoundly impact open access to the Internet, a result that Congress
could not have intended when it enacted the CFAA over three decades ago.

I didn't realize original intent could be used in courts. How the heck did
"original intent" lead to federal abortion and federal gay marriage when all
law in letter and in practice had delegated these questions to the states?

~~~
javagram
Courts have a variety of legal theories available. They can pick and choose
from textualism, original meaning, original intent, evolving meanings/living
constitution, stare decisis, or common law jurisprudence (i.e. law made up by
judges) to get the result they want or believe should be the law.

------
calebclark
What happened to the web's early vision of an information superhighway, an
open repository of data that could be linked, shared and remixed?

The efforts (and lawsuits) to lock down and control data on the public web
threaten to stifle innovation. Websites are only valuable because they're
connected to the internet, which for all intents and purposes, is a general
utility. It was funded by the U.S. government, and is now managed by a
consortium of NGOs such as ICANN, W3C, etc.

When you publish a website, you're distributing it to the world. It's similar
to when you publish a book. It has always been the right of those who buy and
read a book to be able to use the facts contained therein. Although readers
are not allowed to steal the creative elements, they have a basic human right
to use the facts however they desire. And the medium doesn't change this truth
-- whether you read the book yourself, ask your friend to read it out loud, or
employ a bot to read it. So it should be with websites.

Data placed on the public Web is accessible by everyone and should be usable
by everyone. It shouldn't matter who parses the page's HTML, whether it's a
person, a web browser, a bot or a robot.

We're starting a movement to ensure that the facts of the world are always
available for innovators to build on top of:

[https://dataliberationfoundation.org/](https://dataliberationfoundation.org/)

~~~
masonic

      It's similar to when you publish a book.
    

When you publish a book, copyright and Fair Use still apply.

~~~
calebclark
100% agree.

------
nshepperd
The idea that you could be sued[1] for breach of contract for violating a ToS
in a case like this is goddamn insane. A contract can only be formed if both
parties agree to it; and you can't be forced to agree to a contract against
your will. Coercing someone into entering a contract is literally illegal.

So just don't accept the ToS. They can't _make_ you accept it, no matter what
"by [breathing] you accept our ToS" verbiage they put in it.

Sure, maybe that would make your access of the website "hacking" ("accessing a
computer without permission", since you rejected the contract which would
grant permission)... but the court rejected that argument in HiQ vs LinkedIn.

[I get that this probably isn't how a typical court would see it, but to me
that reflects pro-corporate bias and corruption more than any kind of sane
application of common law.]

[1]. Edit: originally this read "charged", but "sued" is more accurate. But I
guess what I really mean is "successfully sued". Anyhow.

~~~
jammygit
The legal innovation is this: by being here, you consent to <15 page
contract>.

Imagine that in the real world. I honestly expect it to become commonplace -
by entering our store, you agree to the terms posted in the binder you may ask
to view.

~~~
sam0x17
But see, _you_, a person, are not here. A bot is here, and a bot can't enter a
contract.

~~~
crazygringo
You wrote and are running the bot, and are entirely legally responsible for
its behavior, like it or not.

By your logic, if someone shoots a bullet and it kills someone else, it would
be the bullet's fault and not the person who pulled the trigger...?

~~~
EGreg
By your logic, the fault is of the bullet maker.

If someone uses your open source software to commit a crime, is that your
fault?

------
nkozyra
The short version is you can be sued for pretty much any reason. And as an
individual you're unlikely to have the resources to defend yourself in such a
suit.

In most cases described as "gray area" it comes down to cases where the
stronger party made a compelling argument against what might have been a very
valid reason to scrape. The threat is often wielded as intimidation and
intimidation alone.

~~~
wills_forward
So true. And the United States has twice as many civil cases per capita as
the next most litigious country, the United Kingdom. Tort reform just
isn’t a sexy campaign promise either.

~~~
elliekelly
The US is unusual in that if you bring a civil suit and lose you don’t have to
compensate the other party for their legal fees.

Unusually high medical bills in the US are also probably a contributing
factor. I’d be curious to see how often a civil suit is brought after a car
accident in EU vs US.

~~~
PeterisP
> I’d be curious to see how often a civil suit is brought after a car accident
> in EU vs US.

Depends on the country of course (the UK is probably closer to the USA than to
the rest of the EU with its civil law), but in general that would very rarely result in a
dispute that needs to be resolved in court. The mandatory liability insurance
would cover the usual claims without a civil suit, and unusual claims are
unlikely to succeed and expensive to pursue, so they aren't.

------
CaliforniaKarl
Here's a hypothetical scenario:

You've just made a container image that does something cool, and have
published it on your site, which is running on the cheapest Amazon Lightsail
plan (in Ohio). The final container image is 50 MB in size.

(Yes, it's weird that you wouldn't publish it to Docker Hub or the like.
Suspension of disbelief, please…)

A grad student somewhere decides they like your container. They write a script
that downloads your container to $TMP, and runs it with their input. Over a
week, the student runs their job 50,000 times.

(Running a compute job 50,000 times is not unusual in HPC. Each run has a
different input, and since your cluster would have a job scheduler, breaking
up the work helps make better use of the cluster.)

Because the student downloaded the container to $TMP, a new download has to
happen each time the job is run. 50,000 downloads of a 50 MB container is 2.5
TB.

(The docs say TB and GB, so I'm assuming base-10 instead of base-2 here.)

Your Lightsail plan is $3.50 per month, and includes 1 TB of transfer. You
will have at least 1.5 TB of extra transfer this month. At 9¢ per GB, 1500 GB
of extra transfer is $135.

Please do not dismiss this out-of-hand. The scenario may seem a bit contrived,
but unexpected bandwidth charges do hit people, and can make you really
question your dedication, when having to pay for the access by people you
don't know.
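
The arithmetic in the scenario above can be checked with a short sketch. It
assumes base-10 units (1 TB = 1000 GB, 1 GB = 1000 MB) and takes the plan
numbers from the comment itself (1 TB included, 9¢ per extra GB); the function
name and parameters are made up for illustration and are not any AWS API.

```python
def overage(downloads: int, image_mb: int, included_tb: int, cents_per_gb: int):
    """Return (total_gb, extra_gb, extra_cost_usd) using base-10 units."""
    total_gb = downloads * image_mb / 1000           # MB -> GB
    extra_gb = max(0.0, total_gb - included_tb * 1000)  # GB beyond the allowance
    extra_cost_usd = extra_gb * cents_per_gb / 100   # cents -> dollars
    return total_gb, extra_gb, extra_cost_usd


# Numbers from the scenario: 50,000 downloads of a 50 MB image,
# 1 TB of included transfer, 9 cents per additional GB.
total_gb, extra_gb, cost = overage(50_000, 50, 1, 9)
print(f"{total_gb:.0f} GB total, {extra_gb:.0f} GB over the allowance, ${cost:.2f} extra")
```

With those inputs the sketch reproduces the figures in the comment: 2500 GB
transferred, 1500 GB over the allowance, and $135 in extra charges on a $3.50
plan.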

~~~
codingslave
If you don't want to be charged for the compute power, don't make the service.

~~~
CaliforniaKarl
In this example, the person who created the container image isn’t paying for
the compute power, they’re paying for the bandwidth cost of those users
downloading the thing.

------
elchief
It's funny how a 3-line bash script is considered an automated agent, but the
hundreds of millions of lines of code that Chrome uses to download, parse, and
display HTML are somehow not "automated".

~~~
mantap
Yes, you can write that same bash script as a Chrome extension and nobody will
care. Many people use extensions that make automated requests not intended by
the website operators.

Making automated requests isn't illegal and won't per se get you sued, it's
literally how a web browser works. What gets you sued is _pissing somebody
off_. So don't do that.

~~~
wruza
How about someone who is pissed off by looks in a public place or by ignoring
their claims. Maybe they should not leave their visually protected home and
issue commands there instead?

Seriously, if one expects some privacy, they don’t live in a glass house and
then complain if you see ‘em, no matter what ToS on the entrance claims.

------
hardwaresofton
It seems like the biggest step forward that scraping/crawling could take is to
get some relevant legal precedent -- is it possible to manufacture/make a case
that could finally set the legal precedent for this?

For example on the issue of insta-agree TOS's that you are in no way forced to
read ("by being on this website you agree to our TOS" BS), is it possible to
make cases to establish precedent from that? Let's say I set up a site where
the TOS says everyone that views the site agrees to give me $5 and I sue
someone -- clearly that wouldn't hold up in court (gulp, hopefully) -- and
having it struck down would help clear up this gray area...

------
hartator
Disclaimer: I run SerpApi.com and IANAL.

In the US, the law has kind of settled in favor of scrapers, thanks to the
First Amendment and the fair use exemption. That is, if the data doesn’t
require signing up and you are not DDoSing the website. Europe is another story.

~~~
CalRobert
Do you have more of that story re: Europe? Also, how does this work in
international scraping? (Does a French company have recourse against a US one
via any treaty, for instance, or vice versa?)

------
clashmeifyoucan
Now, what I'm not sure about and don't have the means to consult a lawyer
about is

1) How liable is the author of a scraper who put it on GitHub but never
used it themselves? I ask because I remember seeing a lot of implementations and
tutorials of tools like BeautifulSoup that don't necessarily comply with the
ToS.

2) What if you edit the User-Agent to something that resembles a browser,
provided the scraper is just substituting for browser interaction, e.g. those
search-from-your-terminal applications? Could/would it be construed as
impersonation or something?
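
For readers unsure what "editing the User-Agent" means mechanically, here is a
minimal sketch using Python's standard `urllib.request`. It only builds request
objects and sends nothing. Both header strings, the URL, and the helper name
are hypothetical, and the sketch takes no position on the legal question asked
above.

```python
import urllib.request

# Hypothetical header values, for illustration only.
BROWSER_LIKE_UA = "Mozilla/5.0 (X11; Linux x86_64)"
DESCRIPTIVE_UA = "example-terminal-search/0.1 (+https://example.com/bot-info)"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Construct a request with the given User-Agent; nothing is sent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


for ua in (BROWSER_LIKE_UA, DESCRIPTIVE_UA):
    req = build_request("https://example.com/search?q=test", ua)
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

The header is just a string the client chooses, which is why the question is
about how the choice would be construed rather than whether it is technically
possible.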

------
ospider
Things are a little different here in China. The government just published a
law that if your scraping traffic is more than one third of the total traffic of
a website, it is considered illegal.

~~~
meritt
Whoa. Do you have a link to the legislation or news?

~~~
severine
It doesn't reference that measure explicitly, but you might find this article
interesting.

 _China’s Draft Data Security Measures and How They Compare to the GDPR_ :
[https://www.natlawreview.com/article/china-s-draft-data-secu...](https://www.natlawreview.com/article/china-s-draft-data-security-measures-and-how-they-compare-to-gdpr)

------
jarfil
1\. Create a random web page with some random words.

2\. Add a ToS notice stating that anyone downloading it now owes you $1000.

3\. List every client IP in the logs and sue everyone even if you don't know
who they belong to.

4\. Profit?

~~~
rocky1138
No, thankfully.

[http://www.internetlibrary.com/cases/lib_case456.cfm](http://www.internetlibrary.com/cases/lib_case456.cfm)

~~~
grenoire
How is the court ruling on that different from the LI situation?

------
rayascott
The author attempts to contrast web scraping and web crawling and gets it very
wrong. This is what it actually is: web scraping - I download the content of a
webpage or select webpages, web crawling - I download the content of the
entire internet (or part thereof). How he can say that someone who's crawling
the internet and building a search index with the contents of the downloaded
data isn't also scraping the web, makes no sense.

------
kamfc
It made billions for companies. The rule of thumb is to start off "legal",
build a big audience with huge benefits to the companies you're borrowing the
data from, then quickly shift to legal. If you're not fast enough, you're
dead. Web scraping and crawling has become less acceptable because the market
leaders carved out the niches before you in a very "legal" way, and DO NOT
want you as a competitor.

------
hyfgfh
"But law has apparently nothing to do with fairness. It's based on rules,
interpreted by people." That's why I'm chaotic!

------
mgamache
Try scraping scores, sports results, or stock prices (and re-publishing them).
Pretty sure you will get a "cease and desist" if you get any traffic at all.
That data is owned by corporations even if it is publicly displayed.

~~~
teddyh
That data isn’t “owned”. Mere data such as you describe (sports results or
stock prices) cannot be owned. There is no legal mechanism for it. Copyright
can’t cover mere facts such as those, and patents or trademarks don’t apply,
nor do trade secrets, obviously. However, the corporations involved sure
would _like_ you to think otherwise, and they will fight tooth and nail to
keep it that way.

~~~
mgamache
It's complicated...

[https://pando.com/2014/02/06/who-owns-real-time-sports-data/](https://pando.com/2014/02/06/who-owns-real-time-sports-data/)

------
zaro
I am confused. Where does robots.txt come into play here? I always thought
that if a URL is allowed in robots.txt, it's fair game to scrape.
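
On the mechanics only: robots.txt is a convention that crawlers can consult,
and whether following it settles any legal question is not something a code
sketch can answer. Below is a minimal example using Python's standard
`urllib.robotparser`. The robots.txt content, the bot name, and the URLs are
all made up for illustration, and the rules are parsed from a string so nothing
is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)


for url in ("https://example.com/index.html",
            "https://example.com/private/data.html"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, "example-bot", url) else "disallowed"
    print(url, verdict)
```

Under these made-up rules the first URL is reported as allowed and the second
as disallowed. A real crawler would load the site's actual file, for example
with `set_url()` and `read()` on the same parser class.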

------
maitredusoi
Is life perfectly legal?

------
tingletech
needs a (2017)

------
patagonia
Just a thought. The entire user data / data broker business is just businesses
“scraping” my digital presence. No one is respecting my terms and conditions.
They’re making trillions off that.

~~~
patagonia
Comments accompanying downvotes always appreciated.

